The Functional Significance of Common Polymorphisms in Zinc Finger Transcription Factors

Variants that alter the DNA-binding specificity of transcription factors could affect the specificity for and expression of potentially many target genes, as has been observed in several tumor-derived mutations. Here we examined if such trans expression quantitative trait loci (trans-eQTLs) could similarly result from common genetic variants. We chose to focus on the Cys2-His2 class of zinc finger transcription factors because they are the most abundant superfamily of transcription factors in human and have well-characterized DNA binding interactions. We identified 430 SNPs that cause missense substitutions in the DNA-contacting residues. Fewer common missense SNPs were found at DNA-contacting residues compared with non-DNA-contacting residues (P = 0.00006), consistent with possible functional selection against SNPs at DNA-contacting positions. Functional predictions based on zinc finger transcription factor (ZNF) DNA binding preferences also suggested that many common substitutions could potentially alter binding specificity. However, Hardy-Weinberg Equilibrium analysis and examination of seven orthologs within the primate lineage failed to find evidence of trans-eQTLs associated with the DNA-contacting positions or evidence of a different selection pressure on a contemporary and evolutionary timescales. The overall conclusion was that common SNPs that alter the DNA-contacting residues of these factors are unlikely to produce strong trans-eQTLs, consistent with the observations by others that trans-eQTLs in humans tend to be few and weak. Some rare SNPs might alter specificity and remained rare due to purifying selection. The study also underscores the need for large-scale eQTLs mapping efforts that might provide experimental evidence for SNPs that alter the choice of transcription factor binding sites.


trans-expression
quantitative trait loci zinc finger proteins nonsynonymous SNPs Hardy-Weinberg Equilibrium transcription factors Genetic variation in humans influences many traits, including development and susceptibility to disease (Stranger et al. 2007b;Schadt et al. 2008;Gibbs et al. 2010;Montgomery et al. 2010).
Common single nucleotide polymorphisms (SNPs), those occurring in 1% or more of a population, can be grouped into two broad categories based on their relationship to the gene they affect. SNPs that change the expression of the gene in which they occur, such as by altering a coding exon or promoter binding site, are considered cis expression quantitative trait loci (cis-eQTLs). Those exerting an effect on a different gene are considered trans expression quantitative trait loci (trans-eQTLs). SNPs in nongeneic regions such as enhancers are also considered trans-eQTLs if they alter expression of a gene that is more than 100 kb away (Stranger et al. 2007a). The trans-eQTLs have been observed less frequently than cis-eQTLs in humans and tend to display reduced effects on the regulated gene (Schadt et al. 2008). The mechanisms have also been less well-studied. Such SNPs could alter the sequence of distal enhancer elements, change the expression level of a regulatory RNA or protein, or, in principle, alter the DNAbinding specificity of a transcription factor and thus change its selection of target genes. Because transcription factors often have multiple target genes, SNPs of this latter class might be predicted to alter the expression of many genes. In this study, we examined the hypothesis that common SNPs in transcription factors can lead to changes in the spectrum and expression level of target genes. We chose to focus on the Cys 2 -His 2 (C2H2) class of zinc finger transcription factors (ZNFs) because they are the most abundant superfamily of transcription factors in human (more than 700 members), accounting for nearly half of all annotated transcription factors (Tupler et al. 2001;Vaquerizas et al. 2009). In addition, although the ability to predict their DNA binding preferences is far from perfect, more is known about ZNF DNA recognition than for any other class of human transcription factors. Thus, they represent the best possibility to predict the effect of SNP-dependent alterations in target gene recognition (Wolfe et al. 2000).
The DNA-binding domain of ZNFs contains tandem arrays of zinc finger repeats or "fingers." Arrays of up to 40 fingers have been reported (Wolfe et al. 2000), although typically only three to five fingers are directly involved in DNA binding. Each finger binds three to four base pairs of DNA. Sequence recognition is determined largely by interactions between the DNA bases and four amino acids in the zinc finger a-helix, namely those at positions 21, 2, 3, and 6 (Pavletich and Pabo 1991;Elrod-Erickson et al. 1996). Engineered substitutions of amino acids at these positions cause altered DNA binding specificity (Segal et al. 1999;Dreier et al. 2001), and rare natural mutations have been reported to alter the spectrum of genes targeted by the ZNF. Mutations in the zinc fingers of GFI1 were linked to severe congenital neutropenia (Person et al. 2003). Tumorderived mutations in the DBDs of the tumor suppressor protein p53 (Campomenosi et al. 2001;Inga et al. 2001;Malcikova et al. 2010), thyroid hormone receptor (Chan and Privalsky 2009), and CTCF (Filippova et al. 2002) all resulted in altered DNA recognition and target gene selection.
Here we identified 430 SNPs that cause nonsynonymous substitutions in the four primary DNA-contacting amino acids in 252 ZNFs in the human genome. Evidence of selection against common SNPs at DNA-contacting compared to non-DNA-contacting amino acids supported our hypothesis that common variants in DNAcontacting positions could affect transcription factor function. However, we failed to find evidence of trans-eQTLs associated with any of the DNA-contacting positions in this study. To better understand potential selection pressures on these SNPs, we performed a broad analysis of sequence variation across primate species and within the human population for a subset of seven ZNFs. Most ZNFs appeared to be under negative selection pressure; there was little evidence of positive selection. The analysis also revealed a complex landscape of variation and function, with a few SNPs likely to have high functional significance but with most having little effect. These results therefore add to our understanding and highlight the complexities of genetic variation and trans-eQTLs.

Localization of SNPs within zinc finger domains
The fingerFinder.pl Perl script identifies clusters of three C2H2 zinc finger domains with TGEKP-like linkers between the fingers. The search used HMMER version 2.3.2 (Sonnhammer et al. 1998) and the Pfam profile for C2H2 zinc fingers PF00096.16 (Schuster-Bockler et al. 2004). The HMM profile is embedded within fingerFinder.pl to maintain consistency in case of updates. Protein sequences were obtained from the Ensemble database corresponding to the GRCh37/ hg19 genome assembly. SNPs from dbSNP version 136 were identified at the C2, 22, 21, 1, 2, 3, 5, and 6 positions in each finger using Perl script snpTOzf.pl. Both Perl scripts are available at www.genomecenter. ucdavis.edu/segallab/segallabsoftware. Only SNPs from the 1000 Genomes project (release 20110521) that also contained frequency data were retained for this study.

Hardy-Weinberg Equilibrium analysis
For the 1040 DNA-contacting and non-DNA-contacting SNPs, chromosomal positions were obtained from Ensemble Biomart Variation 72. The VCFtools htscmd command (github.com/samtools/htslib) was used to extract diploid genotypes for all 2188 individuals from the variant call format (vcf) files of 1000 Genomes release 20110521. Individuals were then subsetted by population according to phase1_integrated_calls.20101123.ALL.panel. The allele and genotype frequencies for each SNP were computed with a custom R script. Deviation from Hardy-Weinberg Equilibrium (HWE) was assessed by applying both a chi-square test and a Fisher exact test. Calculations were not performed if any genotype category contained less than five counts. The HWExact command from the GWASExactHW R package (Painter and Washington 2013) was used to compute the Fisher exact test. D was computed by subtracting the expected number (assuming that the SNP is in HWE) from the observed number of heterozygous individuals divided by two. A positive D indicates the number of observed heterozygous individuals is greater than expected, suggesting that heterozygosity at the locus may be beneficial. A negative D suggests that one or both homozygotes were favored.
For analysis of the seven ZNFs, the genotypes of SNPs located in the coding sequences were extracted from the vcf files based on the genomic coordinates from NCBI's Consensus CDS (CCDS) database (release 9; September 7, 2011). ZNF99 did not have a CCDS number; its genomic coordinates were obtained using BLAT to search the UCSC Genome Browser human genome (GRCh37/hg19). Classification of each SNP as missense, silent, or other (downstream variant, frameshift variant, gain/loss of stop codon) was obtained from the NCBI dbSNP database (version 136). The DNA contact status of any SNP not occurring in zinc finger positions 21, 2, 3, and 6 (DNA-contacting) or C2, 22, 1, and 5 (non-DNA-contacting) was designated as unknown.
Selection (dN/dS) analysis of seven zinc fingers in the primate lineage For the human ortholog of the ZNFs, the longest transcript was chosen from the Nucleotide database of the National Center for Biotechnology Information (NCBI). The other primate orthologs of these ZNFs were obtained by performing a TBLASTN, an NCBI translated nucleotide database, with the human protein ortholog sequence against the Nucleotide Collection (nr/nt) database with a filter for primates (taxid:9443). The sequence with the highest blast score was retrieved for each primate. If the ZNF sequences for a particular primate had excessive missing data, then the primate was excluded from the study for that ZNF. Multiple alignments of the ortholog nucleotide and protein sequences were generated using TranslatorX (translator.co.uk) with the default setting, using Muscle as the protein alignment method (Abascal et al. 2010).
Likelihood ratio test was used to compare the model of neutral evolution (M1a) and that of positive selection (M2a). The site-specific model assumes all lineages share the same v (dN/dS ratio) for each codon or amino acid site in the protein, but v can vary among sites. The null model (M1a) assumes no positive selection. A proportion p 0 of amino acids have v 0 , 1 (under negative selection) and the remaining proportion p 1 = 1 2 p 0 have v 1 = 1 (are neutral). The alternative model (M2a) has one more class, having proportion p 2 = 1 2 p 0 2 p 1 of amino acids with v 2 . 1 (positive selection). The x 2 distribution with two degrees of freedom was used for the likelihood ratio test. The species tree, obtained from the Tree of Life website for the primates (tolweb.org/Primates/15963/1999.01.01), was used as the guide tree for the CODEML program in the PAML package (version 4.5) (Yang 2007). The Bayes Empirical Bayes (BEB) method was used to calculate the posterior probability of v falling into each of the three classes: v , 1, v = 1, and v . 1 (Yang et al. 2005).
For DNA-binding specificity predictions, sequence logos and position frequency matrices (PFM) were obtained by entering the reference and SNP individual zinc finger sequences to the ZFModels website (stormo.wustl.edu/ZFModels) (Gupta et al. 2014) using the parameters ZF Protein Sequences, One Finger Model, and Information Content. The difference between the "REF" PFM (R) and the "SNP" PFM (S) was calculated as the Kullback-Leibler distance (D) for all positions (i) in the matrix as D(R||S) = P (R i Ã ln(R i /S i )), as well as the reciprocal D(S||R) = P (S i Ã ln(S i /R i )). The two distances were then added to create the summed Kullback-Leibler distance. Of the 435 REF and SNP fingers analyzed, ZFModels was unable to calculate PFMs for 19, which appear in Supporting Information, Table S1 as NA.

RESULTS
Common missense SNPs affecting the DNA-contacting residues of zinc finger proteins are less abundant than those at non-DNA-contacting residues C2H2 zinc fingers are known to mediate DNA as well as RNA and protein interactions (Mackay and Crossley 1998;Brown 2005). To maximize the likelihood of examining fingers that bind DNA in all potential splice isoforms, the human proteome was searched for clusters of three or more fingers joined by TGEKP-like linkers. In mammals, zinc fingers are typically found in tandem arrays, with approximately 50% of fingers connected by linkers having the sequence TGEKP (Wolfe et al. 2000). Almost every residue in this conserved linker plays an identifiable role in stabilizing the protein-DNA interaction. Although some known DNA-binding ZNFs do not have this linker (e.g., Tramtrack ZF1-2) and some that do not bind DNA do have it (e.g., Gli ZF2-3), an array of two to three fingers joined by a TGEKP-like linker is currently the best predictor of DNA binding (Ryan and Darby 1998;. Only SNPs causing missense mutations at DNA-contacting positions 21, 2, 3, and 6 or non-DNA-contacting positions C2, 22, 1, and 5 were retained for this study. SNPs causing frameshift or splice mutations were omitted to avoid changes in ZNF specificity due to truncations or loss of entire exons. The ability of the zinc fingers to recognize different DNA sequences is due to the diversity of the amino acids that appear in the DNA-contacting positions. Thus, when zinc fingers that recognize different DNA sequences are aligned, these positions appear to have little conservation ( Figure 1A). For comparative purposes, non-DNAcontacting positions were chosen that had similarly low sequence conservation. However, in contrast to the DNA-contacting residues, positions C2, 22, 1, and 5 are thought to have low conservation because they have no functional role in DNA recognition or protein folding (Wolfe et al. 2000). In particular, beta carbons of these residues direct the side chains away from the DNA bases, unlike the DNA-contacting residues that point directly at the bases ( Figure  1B). The search identified 1040 missense SNPs in 398 ZNF proteins (Table S1). Of these, 166 SNPs were "common" [minor allele frequency (MAF) .1%] with respect to the combined 2188 individuals of the 1000 Genomes dataset. Significantly fewer common SNPs were found in the DNA-contacting positions compared with the non-DNA-contacting positions (P = 0.00006) ( Figure 1C). One interpretation is that substitutions of the DNA-contacting amino acids altered the DNA-binding specificity of the protein, leading to deleterious effects and negative selection.
No trans-eQTLs have been reported for missense SNPs at the DNA-contacting positions SNPs that alter DNA-binding specificity might cause the protein to regulate a different spectrum of target genes. The SNP could cause the loss or gain of affinity to a promoter or enhancer of one or more target genes that are distant to the gene encoding the transcription factor, thus producing trans-eQTLs. The NCBI maintains the GTEx (Genotype-Tissue Expression) eQTL Browser (www.ncbi.nlm.nih.gov/gtex) that is currently based on gene expression data from liver, brain, and lymphoblastoid cell lines (Stranger et al. 2007b;Schadt et al. 2008;Gibbs et al. 2010;Montgomery et al. 2010). Only four of the 1040 SNPs were reported to be associated with eQTLs at a significance less than 10 25 (Table 1). Interestingly, all four produced trans-eQTLs, all were common SNPs, and all were in non-DNA-contacting positions of the ZNFs. Three were found to deviate from HWE (described below), which in all cases favored a heterozygous state (in contrast, 85% of deviating SNPs in this study favored a homozygous state). The potential significance of these trans-eQTLs associated with non-DNAcontacting regions is explored in the Discussion. However, no cis or trans-eQTLs were reported for any SNPs at the DNA-contacting positions. Two additional studies examining gene expression in skin, adipose, lymphoblastoid cell lines, and peripheral blood did not report cis or trans-eQTL for any of the 1040 SNPs (Fehrmann et al. 2011;Grundberg et al. 2012).
SNPs deviating from HWE are few and occur in both DNA-contacting and non-DNA-contacting positions One explanation for the failure to observe trans-eQTLs for the altered DNA-contacting positions is a simple lack of data. Very little is known about the biology of most ZNFs, and it is likely that they could exert their regulatory influence in specific cell types or developmental stages that were not examined in the six studies above. An alternative method to investigate if the missense SNPs have functional effects is to determine whether the genotype frequencies deviate from HWE. Deviations from HWE are caused by evolutionary influences such as selection, but also mutation, non-random mating, or recent population admixture. The effects of admixture can be reduced by confining examination to individual populations of the 1000 Genomes project. SNPs that deviate from HWE under these conditions are likely to represent either positive or negative selection. A high MAF may suggest positive selection; a low MAF may suggest negative selection. An excess of heterozygotes can indicate overdominance selection. However, if the MAF is too low, then there may be too few genotypes to accurately calculate HWE. Of the 1040 missense SNPs, only 55 (5%) demonstrated significant deviation from HWE (P , 0.05), and only 13 (1%) deviated from HWE in more than one of the 1000 Genomes populations (Figure 2 and Table S1). Of the 55 deviants, 85% had a negative D value, indicating selection favoring the homozygous state for most SNPs. However, after adjusting for the total number of SNPs, there was no significant difference in the percent of SNPs deviating at DNA-contacting or non-DNA-contacting positions (P = 0.64).
Orthologs in the primate lineage reveal a trend of negative selection Evidence for functional selection based on genetic variation within the human population can be potentially confounded by several factors. Many variants in the human genome may have arisen relatively recently due to rapid population expansion and therefore may not have had sufficient time for selection. Also, selection may be most important in the early stages of species evolution. To increase the power of our test for selection, we examined orthologs that spanned the evolutionary time periods between species in the primate lineage. Orthologs of human CTCF, CTCFL, PRDM10, PRDM9, YY1, ZNF221, and ZNF99 were found (see Materials and Methods) in the translated genomes of Pan troglodytes (chimpanzee), Pan paniscus (bonobo or pygmy chimpanzee), Gorilla gorilla gorilla (gorilla), Pongo pygmaeus abelii (orangutan), Nomascus leucogenys (gibbons), Macaca mulatta (rhesus macaque), and Tarsius syrichta (tarsier). To identify amino acid sites undergoing positive selection, the CODEML program was used to perform a likelihood ratio test between the null model of neutral evolution (M1a) and alternative model of positive selection (M2a). The test statistics follows a chi-square distribution of degree 2. The nonsynonymous-to-synonymous rate ratio, v (dN/dS), measures selective pressure at the protein level. A site undergoing positive selection can be inferred when v is greater than 1. The P values of the likelihood ratio tests are 0.9995, 1.00, 0.904, 0.000, 0.000, 0.000, and 0.000 for CTCF, CTCFL, PRDM10, PRDM9, YY1, ZNF221, and ZNF99, respectively. These are consistent with the red dots above the P = 0.95 lines in Figure 3. The Bayes Empirical Bayes (BEB) method was used to calculate the posterior probability of v falling into the three classes: v , 1, v = 1, and v . 1 (Yang et al. 2005). The differences in selective pressure between this set of proteins were quite striking (Figure 3). However, the general observation was that almost all positions in the zinc fingers showed v , 1, indicating negative selection. Significant (posterior probability .0.95) negative selection was observed for the zinc fingers of CTCF, PRDM10, ZNF99, and YY1 (the last half sites), whereas those of CTCFL, PRDM9, and ZNF221 did not display strong evidence of negative selection. Two notable deviations were several positions in the fingers of PRDM9 and ZNF99 that showed significant evidence for positive selection (Figure 3, red dots above probability 0.95). However, for zinc fingers displaying negative selection, essentially all the residues are likely functionally important, and missense SNPs such as those in this study would likely not be tolerated.
Missense SNPs at DNA-contacting positions are generally not predicted to be deleterious but might change DNA-binding specificity The low number of SNPs deviating from HWE and the lack of difference between DNA-contacting and non-DNA-contacting n positions suggest that the majority of polymorphisms in the DNAcontacting amino acids do not produce an effect that is either beneficial or deleterious to fitness. There are many algorithms to predict if nonsynonymous variants might lead to deleterious effects on protein function. For example, dbNSFP 2.0 is an integrated database of functional annotations from multiple sources for the comprehensive collection of more than 87,361,054 human nonsynonymous SNPs. It compiles scores from prediction algorithms such as SIFT (Kumar et al. 2009), Polyphen2 (Adzhubei et al. 2010), MutationTaster (Schwarz et al. 2010), and FATHMM (Shihab et al. 2013). Because each algorithm used a somewhat different set of criteria to predict if a SNP would be deleterious, we decided to place increasing confidence in SNPs predicted by multiple methods. Interestingly, there was very little agreement among methods regarding which of the 1040 SNPs might be functionally deleterious. Of the 314 SNPs predicted to be deleterious, only one was common to all four methods ( Figure 4A and Table S1). With rare exceptions, common SNPs were not predicted to be deleterious or were predicted to be so by only one method ( Figure 4B). SNPs predicted to be deleterious by two, three, or four methods usually had MAFs ,0.1, in general agreement with the concept that truly deleterious alleles tend to be rare in populations due to negative selection. However, there were no notable differences between DNA-contacting and non-DNA-contacting positions in the number or distribution of predicted deleterious alleles, again suggesting that SNPs in the DNA-contacting positions are not more likely to disrupt the function of the protein than SNPs at any other position. However, a SNP that changes DNA-binding specificity may not necessarily be expected to alter protein function. The more likely expectation would be that the transcription factor would still bind, but to somewhat different DNA targets. Unfortunately, it is currently not possible to accurately predict the target sites for any DNA-binding protein based only on its primary sequence (with the possible exception of transcription activator-like effector proteins, which are not found in the human genome) (Boch et al. 2009;Moscou and Bogdanove 2009). Considerable progress has also been made in predicting the binding behavior of ZNF proteins, based on large datasets of natural and engineered zinc fingers (Gupta et al. 2012;Enuameh et al. 2013;Zhu et al. 2013;Gupta et al. 2014;. A recently described random forest-based predictive model, ZFModels (Gupta et al. 2014), was used to estimate the specificity of individual zinc fingers that harbor a SNP. The summed Kullback-Leibler distance was used to quantify the difference between the predicted DNA binding sites of the reference and SNP forms of the zinc finger. The more dramatic changes in predicted specificity (summed KL .2.5) were found for SNPs with MAFs ,0.1 (Figure 4, C and D). These data suggest that some of the SNPs, particularly rare SNPs, would be expected to alter the DNA-binding specificity of their ZNFs.

DISCUSSION
The role of natural genetic variation in human health and disease has been a major area of focus in the past several years. Most variants act in cis to the genes they affect, often altering promoter regions, coding regions, or splice sites (Stranger et al. 2007a,b). Here we have investigated a potential mechanism for how genome-encoded information could affect gene expression in trans. We hypothesized that natural genetic variation could alter the binding site preferences and activities of ZNF transcription factors and therefore alter their gene regulatory functions. In the context of cancer, some tumor-specific mutations in transcription factors had been reported to alter DNA-binding specificity, providing proof-of-concept (Campomenosi et al. 2001;Inga et al. 2001;Filippova et al. 2002;Chan and Privalsky 2009;Malcikova et al. 2010). Our initial observation in this study that there are significantly less common SNPs at the DNA-contacting positions compared with non-DNA-contacting positions encouraged us that rare SNPs may have functional consequences, and that selective pressure is operating against them.
Functional consequences could include the transcription factor binding either better or worse to its original target sites, or a change in specificity such that some new gene targets are regulated. In most cases, these consequences should be manifest in the change in expression of one or more genes that was dependent on the SNP in the ZNF gene, that is, trans-eQTLs. It is important to note that not all C2H2 zinc finger proteins are transcription factors and thus would not be expected to associate with eQTLs. For example, PRDM9   (Baudat et al. 2013) and CTFC (Ong and Corces 2014) would manifest altered DNA recognition as a change in homologous recombination hotspots and chromatin looping boundaries, respectively. However, despite decades of research of engineered zinc fingers, approximately 90% of naturally occurring ZNFs remain largely unstudied. Little is known about where these proteins bind or which genes they regulate. Databases of transcription factor binding motifs, such as TRANSFAC (Matys et al. 2006) and JASPAR (Mathelier et al. 2014), contain relatively few position weight matrices (PWMs) for ZNFs. For example, of the 252 ZNFs that had nonsynonymous substitutions in the four primary DNA-contacting amino acids in this study, only two were found in the JASPAR 2014 Core dataset of PWMs (jaspar. genereg.net). Furthermore, these databases do not provide information regarding which DNA-contacting amino acids are responsible for the observed PWM pattern. It is therefore not possible to accurately assign a position in the PWM to an amino acid affected by a SNP.
In principle, changes in transcription factor binding or gene expression could be measured directly in vitro or in cell culture with exogenously expressed proteins. However, these experiments are technically challenging for most zinc finger proteins. The mean number of zinc finger repeats in human proteins is 8.5, but some proteins contain 30 or more zinc finger repeats (Emerson and Thomas 2009). Although most KRAB-ZNFs encode all fingers in one exon, many without KRAB, such as CTCFL and PRDM10, have more than seven isoforms that typically splice together different sets of fingers. Engineering the mutant allele is laborious, and purification for in vitro studies is problematic because the two critical cysteine residues on each zinc finger quickly oxidize and lose binding activity. Overexpression of natural ZNFs in cells is often cytotoxic, and nonphysiological concentrations can lead to occupancy of atypical binding sites.
For these reasons, we chose instead to use existing data from largescale studies of natural eQTLs in various cell types. Unfortunately, these studies failed to identify eQTLs for any of the SNPs in DNAcontacting positions. These results could indicate that the SNPs do not result in altered expression of any gene. However, the power of detection might be limited. In humans, trans-eQTLs are far less frequent than cis-eQTLs, and their effect sizes are typically small (Stranger et al. 2007b;Schadt et al. 2008;Gibbs et al. 2010;Montgomery et al. 2010). Although the large number of SNP-containing proteins in our study (398 ZNFs out of a total of approximately 712 in the genome) should have ensured that at least some would be expressed in the cell types available, little is known about the cell types and developmental stages in which these proteins are normally expressed. Also, the effect on gene expression may have been too small to detect, especially given the adjustments for multiple testing required in a genome-wide survey. The results from the other functional prediction experiments in this study could form the basis of a more targeted approach for seeking trans-eQTLs in the future.
Interestingly, the only gene expression effects that were observed were trans-eQTLs at the non-DNA-contacting positions. Of the four ZNFs containing such SNPs, ZNF584 and ZNF132 affected expression of the same gene, SLC27A5. The EMBL-EBI interaction database In-tAct only found data for ZNF177, which also indicated an interaction with SLC27A5. Although it is not completely clear how alterations to non-DNA-contacting residues could alter the expression of a gene in trans, one hypothesis is that these fingers may actually facilitate protein-protein rather than protein-DNA interactions. Furthermore, the three ZNFs show significant deviation from HWE, and in the unusual direction of favoring the heterozygous genotype. It could be that ZNF177, ZNF584, and ZNF132 form a complex with the SLC27A5 gene or gene product to regulate its transcription. It is known that some C2H2 zinc fingers can bind RNA or protein, and that the protein interaction can involve any face of the finger , including the b-turn (position C2), the loop (position -2), and the a-helix (position 5). However, it cannot be ruled out that these non-DNA-contacting positions are influencing the neighboring DNAcontacting positions and exerting their effects by a DNA-recognition mechanism.
Having been unable to demonstratefunctional consequences for the SNPs, we sought evidence for selective pressure operating against the SNPs. HWE analysis revealed that very few missense SNPs were under selective pressure in the human genome, whereas the phylogenetic data suggested that missense SNPs should be under strong negative selection. The latter result was consistent with a previous study that also reported high conservation in all a-helix residues in cow-human-mouse ortholog trios (Emerson and Thomas 2009). One model that would be largely consistent with our seemingly opposing results is that common SNPs that could change DNA-binding specificity were likely selected against during human evolution, so that only SNPs that do not cause significant changes in binding specificity remain common today. This model would also be consistent with the relative paucity of common SNPs at DNA-contacting positions compared with non-DNA-contacting positions, the lack of robust prediction that they are deleterious, and the result that no eQTLs were found. The few common SNPs that deviate from HWE could be relatively new variants that arose at a frequency similar to SNPs in other non-DNA-contacting regions of the protein.
If the SNPs that alter DNA binding have been suppressed by negative selection, what is it about the remaining SNPs that would make them not functional? The majority of the common SNPs in this study were predicted to produce only modest changes in binding specificity (summed Kullback-Leibler ,2.5). This change in an individual finger might be insufficient to alter the overall specificity of a multi-finger protein. It could also be that the individual finger was not used for DNA binding in that particular protein, and thus its substitution had no effect. This model may appear to discount the value of the phylogenetic data indicating that the amino acids from Tarsius to Homo sapiens were generally under negative selection. However, the majority of the data still support the model that many substitutions would not be tolerated, but the ones that persisted into the present are those that could be tolerated.
In summary, we report that common SNPs seem depleted in the DNA-contacting positions of ZNFs, but we find no significant evidence of function or selective pressure for those that remain. These data argue against our initial hypothesis that common SNPs in transcription factors might function as trans-eQTLs in the human genome. However, rare SNPs are more likely to deviate from HWE, to be predicted as deleterious, and to produce high-confidence DNA specificity changes. Algorithms that predict the deleterious nature of nonsynonymous mutations are gaining value in genome interpretation, especially in medicine. Such algorithms should consider this additional category of potentially nondeleterious but altered function. Further elucidation of the functional roles of these variants will be greatly aided by expanded large-scale eQTL mapping efforts such as the GTEx consortium.