Overlapping ETS and CRE Motifs (G/CCGGAAGTGACGTCA) Preferentially Bound by GABPα and CREB Proteins

Previously, we identified 8-bps long DNA sequences (8-mers) that localize in human proximal promoters and grouped them into known transcription factor binding sites (TFBS). We now examine split 8-mers consisting of two 4-mers separated by 1-bp to 30-bps (X4-N1-30-X4) to identify pairs of TFBS that localize in proximal promoters at a precise distance. These include two overlapping TFBS: the ETS⇔ETS motif (C/GCCGGAAGCGGAA) and the ETS⇔CRE motif (C/GCGGAAGTGACGTCAC). The nucleotides in bold are part of both TFBS. Molecular modeling shows that the ETS⇔CRE motif can be bound simultaneously by both the ETS and the B-ZIP domains without protein-protein clashes. The electrophoretic mobility shift assay (EMSA) shows that the ETS protein GABPα and the B-ZIP protein CREB preferentially bind to the ETS⇔CRE motif only when the two TFBS overlap precisely. In contrast, the ETS domain of ETV5 and CREB interfere with each other for binding the ETS⇔CRE. The 11-mer (CGGAAGTGACG), the conserved part of the ETS⇔CRE motif, occurs 226 times in the human genome and 83% are in known regulatory regions. In vivo GABPα and CREB ChIP-seq peaks identified the ETS⇔CRE as the most enriched motif occurring in promoters of genes involved in mRNA processing, cellular catabolic processes, and stress response, suggesting that a specific class of genes is regulated by this composite motif.

ployed to identify biologically relevant transcription factor binding sites (TFBS). The computational methods typically examine DNA sequence enrichment near a biologically defined regulatory region like the transcriptional start site (TSS) (Frith et al. 2002; Ohler et al. 2002; Kel et al. 2003; Bina et al. 2004; FitzGerald et al. 2004; Marino-Ramirez et al. 2004; Matys et al. 2006; Pachkov et al. 2007; Ji et al. 2008; Kharchenko et al. 2008; Portales-Casamar et al. 2010; Oh et al. 2011; Vinson et al. 2011). Examination of related mammals has also identified many DNA motifs in promoters that are conserved, suggesting that they may be TFBS, while the 3′ UTRs have conserved sequences thought to be microRNAs (Xie et al. 2005).
In an earlier study, we identified 8-bps long DNA sequences (8-mers) that are localized in human proximal promoters (FitzGerald et al. 2004) and Drosophila promoters (FitzGerald et al. 2006), and we presented evidence that motifs near the TSS are biologically functional. In human promoters, these sequences were grouped into known TFBS, including SP1, CCAAT, ETS, E-Box, CRE, Box A, NRF1, and TATA. Analyses of promoters with the conservation of DNA sequences among the related mammals greatly enhanced the identification of regulatory motifs (Xie et al. 2005).
To identify additional biologically important DNA sequences in human proximal promoters, we analyzed the distribution of discontinuous 8-mers, also called split 8-mers (Vinson et al. 2011). Each split 8-mer is composed of two 4-mers separated by 1-bp to 30-bps. If each 4-mer represents a part of a TFBS, this calculation would identify pairs of TFBS that co-occur in the same proximal promoter as observed in other mammalian promoters (FitzGerald et al. 2004). Split 8-mer enrichment in promoters declines with increasing distance between the two 4-mers. In contrast, Drosophila contains many split 8-mers in which the 4-mers are separated by 20-bps to 30-bps that localize in promoters (Vinson et al. 2011).
This article examines the split 8-mers that localize in human promoters. We extended our previous work with split 8-mers in human promoters (Vinson et al. 2011) by evaluating whether the split 8-mers that localize in promoters have a preferred distance between the two 4-mers. This analysis identified an ETS motif overlapping with a CRE motif (ETS⇔CRE) that localizes in proximal promoters. DNA binding experiments show that GABPα and CREB preferentially bind the two TFBS when they overlap and produce the ETS⇔CRE motif enriched in proximal promoters.

Dataset generation
From the University of California Santa Cruz Genome Bioinformatics website (http://genome.ucsc.edu/), we obtained the DNA sequence data for RefSeq genes in the Golden Path Human Genome Assembly with annotated TSS, representing sequences from -1,000 bp to +500 bp relative to the TSS. The initial dataset contained 26,431 promoters. The set was further processed to improve the relevance and validity of the analysis using the following criteria. First, for promoters with 100% identical sequences, only one copy was kept (5483 promoters were removed). Second, promoters containing at least 150 bps of unknown nucleotides (N) were removed (8 promoters). Third, promoters with duplicated RefSeq numbers were removed (411 promoters). Fourth, of the remaining 20,529 promoters, 18,451 were determined to have unique sequences, whereas 2078 promoters had duplicated sequences shared among themselves. Among these 2078 promoter sequences, 68 had more than 10 overlapping duplicated regions of at least 250 bps with other promoter sequences and were deleted from the analysis. One thousand five hundred thirty-five (1535) promoter sequences contained closely identical sequences among themselves, and they comprised 701 unique groups (pairs in most cases); only 701 "representative" promoters were kept for the analysis. An additional 475 promoters were kept for the analysis, although they did have some mixed overlapping sequences. This allowed us to retain only 1176 out of these 2078 promoters. Fifth, two thousand four hundred eighty-four (2484) promoters had the start of the coding sequence (translational start site) within 30-bps of the TSS, and these promoters were excluded from the following analysis. Finally, a set of 17,143 promoters (18,451 + 1,176 - 2,484) was obtained and considered for the analysis.
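The filtering arithmetic in this paragraph can be checked with a short sketch. The counts are those quoted in the text; the variable names are illustrative only and do not come from the original pipeline.

```python
# Promoter counts quoted in the text; variable names are illustrative only.
initial = 26_431
removed_identical = 5_483    # 100% identical sequences, one copy kept
removed_unknown_n = 8        # at least 150 bp of unknown nucleotides (N)
removed_dup_refseq = 411     # duplicated RefSeq numbers

remaining = initial - removed_identical - removed_unknown_n - removed_dup_refseq

unique = 18_451              # promoters with unique sequences
duplicated = remaining - unique

heavily_overlapping = 68     # deleted from the analysis
near_identical = 1_535       # collapsed into representative promoters
representatives = 701        # one promoter kept per near-identical group
mixed_overlap_kept = 475     # kept despite mixed overlapping sequences
kept_from_duplicated = representatives + mixed_overlap_kept

close_coding_start = 2_484   # coding start within 30 bp of the TSS
final_set = unique + kept_from_duplicated - close_coding_start

print(remaining, duplicated, kept_from_duplicated, final_set)
# prints: 20529 2078 1176 17143
```

The three subgroups of duplicated promoters (68 + 1535 + 475) also sum to the 2078 duplicated sequences reported above.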

Analysis of split 8-mers distributions
There are $4^8$ discontinuous non-degenerative 8-mers (X4-Nk-X4; N denotes any arbitrary nucleotides and k denotes spacing between two 4-mers), and of these, $j \cdot 4^4$ are palindromes and $(4^8 - j \cdot 4^4)$ are nonpalindromes, where each sequence and its complement is represented and j = 1 if k is even and 0 if odd. Thus, the number of 8-mers can be reduced to $(4^8 - j \cdot 4^4)/2 + j \cdot 4^4 = 4^4 (4^4 + j)/2$. Those 32,896 or 32,768 8-mers were automatically generated by a custom-made program. The promoter set was searched against them, and final distributions were generated. To analyze the data, we divided 1500-bps into 75 bins each containing 20-bps, numbering bin 1 [-1000 bp; -981 bp] to bin 75 [+481 bp; +500 bp]. We determined the number of times the first nucleotide of a studied DNA sequence (or the last of its complement) occurred within each 20-bps bin. To detect and quantify non-uniform distributions (localization) and the probability of non-uniformity of split 8-mers, we determined localization factor (LF) and P-value as described previously (FitzGerald et al. 2004; Vinson et al. 2011).
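The two counts quoted above (32,896 and 32,768) follow directly from the closed form. The sketch below evaluates that formula, taking the convention for j exactly as stated in the text, and cross-checks the continuous case (k = 0) by brute-force enumeration. It is not the custom-made program mentioned above; the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def n_split_8mers(k: int) -> int:
    """Closed form from the text: 4^4 (4^4 + j) / 2, with j = 1 for even k, else 0."""
    j = 1 if k % 2 == 0 else 0
    return 4**4 * (4**4 + j) // 2


def n_continuous_8mers_by_enumeration() -> int:
    """Count continuous 8-mers, merging each sequence with its reverse complement."""
    canonical = set()
    for letters in product("ACGT", repeat=8):
        seq = "".join(letters)
        canonical.add(min(seq, reverse_complement(seq)))
    return len(canonical)


print(n_split_8mers(0), n_split_8mers(1))   # prints: 32896 32768
print(n_continuous_8mers_by_enumeration())  # prints: 32896
```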

Molecular modeling
The molecular model of the ETS protein and the CREB dimer interacting with a single chain DNA with the specific base pair sequence CCGGAAGTGACGTCA was built by using two PDB structures, the ETS-1 protein bound to an ETS site (PDB ID: 1K79) (Garvie et al. 2001) and the CREB dimer bound to the CRE (PDB ID: 1DH3) (Schumacher et al. 2000). The 10 nucleotides (shown underlined) of the E chain of the DNA (TAGTGCCGGAAATGT) of 1K79 were aligned to the 10 nucleotides (shown underlined) in the B chain of the DNA (CCTTGGCTGACGTCAGCCAAG) of 1DH3, using Chimera visualization software (Pettersen et al. 2004). This alignment also results in the nucleotides ATG (shown in bold) of 1K79 aligning with the nucleotides CTG (shown in bold) of 1DH3. The ETS-1 protein and the complementary strand (F chain) of DNA of 1K79 were carried along with the E chain of its DNA during this alignment. From these aligned structures, the first 10 nucleotides (CCTTGGCTGA) and their base pairs in the complementary chain in the 1DH3 structure were deleted. The remaining chains containing the nucleotides TAGTGCCGGAAATGT of 1K79 and the nucleotides CGTCAGCCAAG of 1DH3 were covalently linked to one another using Chimera software to form one long chain of DNA with the sequence TAGTGCCGGAAATGTCGTCAGCCAAG. Similarly, its complementary DNA chain was also built. The 12th and 15th bases in this long chain (shown in bold) were mutated to G and A bases, respectively, and the final complex containing this long DNA and the ETS and CREB proteins was subjected to an energy minimization using the Discovery Studio (Accelrys Software) molecular modeling software.
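The sequence bookkeeping in this procedure (not the structural modeling itself) can be traced with plain string operations. The sketch below joins the two DNA chains as described, applies the two base changes, and confirms that the result contains the ETS⇔CRE sequence. The helper name `mutate` is hypothetical.

```python
ETS_CHAIN_1K79 = "TAGTGCCGGAAATGT"        # E chain of 1K79, as quoted in the text
CRE_CHAIN_1DH3 = "CCTTGGCTGACGTCAGCCAAG"  # B chain of 1DH3, as quoted in the text

# Delete the first 10 nucleotides of the 1DH3 chain and join the remainder
# to the 1K79 chain to form one long chain.
cre_remainder = CRE_CHAIN_1DH3[10:]
joined = ETS_CHAIN_1K79 + cre_remainder


def mutate(seq: str, position: int, base: str) -> str:
    """Replace the base at a 1-based position."""
    i = position - 1
    return seq[:i] + base + seq[i + 1:]


# Mutate the 12th base to G and the 15th base to A.
model_dna = mutate(mutate(joined, 12, "G"), 15, "A")

print(joined)     # prints: TAGTGCCGGAAATGTCGTCAGCCAAG
print(model_dna)  # prints: TAGTGCCGGAAGTGACGTCAGCCAAG
print("CCGGAAGTGACGTCA" in model_dna)  # prints: True
```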

Electrophoretic mobility shift assay (EMSA)
EMSA was performed as described previously (Rishi et al. 2010). GABPα and CREB proteins were in vitro translated using the PURExpress In Vitro Protein Synthesis Kit (New England Biolabs, USA) according to the manufacturer's instructions. The T7 expression plasmids containing the DNA binding domain of GABPα (Badis et al. 2009) or the B-ZIP domain of CREB (Ahn et al. 1998) were used as the template DNA. GABPα has a GST-tag at the N-terminus. The protein concentrations were estimated by Western blot using purified GST-CREB or CREB with known concentrations as concentration standards. In vitro translated proteins were mixed with 7 pM 32P end-labeled double-stranded oligonucleotides containing variants of ETS and CREB binding sites in the gel shift buffer (0.5 mg/ml BSA, 10% glycerol, 2.5 mM DTT, 12.5 mM K2HPO4-KH2PO4, pH 7.4, 0.25 mM EDTA). The final volume of the reaction was adjusted to 20 µl. For regular EMSA, the reactions were incubated at 37° for 20 min, followed by cooling at room temperature for 5 min before loading. For supershift experiments, the reactions were first incubated at 37° for 20 min without antibodies. Antibodies (catalog # sc-186, sc-459, or sc-2027, Santa Cruz Biotechnology, USA) were then added, and the reactions were incubated on ice for 30 min, followed by incubation at room temperature for 15 min before loading. Samples of 10 µl were resolved on 7.5% PAGE at 150 V for 1.5 hr in 1x TBE buffer (25 mM Tris-boric acid, 0.5 mM EDTA). Sequences of oligonucleotides used for EMSA experiments are listed in Table 1. For EMSA using ETV5 and CREB, we used purified proteins containing the DNA binding domain of ETV5 or the B-ZIP domain of CREB.

Motif enrichment using ChIP-seq peaks
For motif analysis, we used 6442 published GABPα ChIP-seq peaks from the human Jurkat cell line (Valouev et al. 2008) and 3998 CREB ChIP-seq peaks from mouse GC1 cells (Martianov et al. 2010). For motif detection, we used MEME (Machanick & Bailey 2011) and the peak-motifs package of the Regulatory Sequence Analysis Tools (RSAT) (Thomas-Chollier et al. 2011). Two thousand eight hundred thirty-four (2834) CREB binding promoters, which were obtained from the ChIP-chip data on human HEK293T cells in three time points (Zhang et al. 2005), were mapped to human (hg18), which successfully resulted in 2384 promoters bound by CREB. For de novo motif prediction, we used 1463 common binding regions of human CREB ChIP-chip and GABPα ChIP-seq data.

PhyloP conservation
Base-by-base PhyloP scores, which are derived from P-values for conservation or acceleration based on an alignment and a model of neutral evolution among the 36 mammalian genomes (Pollard et al. 2010), were downloaded from the UCSC database (http://genome.ucsc.edu/). PhyloP scores for each nucleotide in the motif, including 15-bps upstream and 15-bps downstream of each occurrence in the genome, were averaged for all occurrences of each motif.

Gene Ontology analysis
Gene Ontology (GO) analysis was performed using DAVID (http://david.abcc.ncifcrf.gov/). GO terms with P-values < 0.01 were considered as significantly enriched GO terms. Additionally, Benjamini-Hochberg corrected P-values < 0.01 were considered for the analysis with in vivo ChIP data.

RESULTS
Split 8-mers that localize in human proximal promoters
We aligned human promoters relative to the TSS and determined the distribution of split 8-mers in the promoter region. The split 8-mers consist of two 4-mers separated by 1-bp to 30-bps (X4-N1-30-X4). We considered the promoter region from -1000-bps to +500-bps relative to the TSS and divided the 1500-bp region into 75 bins of 20-bps each. We used a human DNA promoter sequence set obtained from UCSC and removed promoters containing repetitive sequences, resulting in a set of 17,143 promoter sequences (see Materials and Methods). The distribution of each split 8-mer in promoters was determined and a measure of non-uniform distribution termed "localization factor" (LF) was calculated (Vinson et al. 2011). The statistical significance of the non-random distribution of LF was determined by calculating a probability value (P-value) for each split 8-mer.
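The 75-bin layout can be sketched as a small position-to-bin mapping. This sketch assumes that coordinates run from -1000 to -1 and from +1 to +500 with no position 0, which is inferred from the bin boundaries given in Materials and Methods (bin 1 = [-1000; -981], bin 75 = [+481; +500]). The function names and the example hit positions are hypothetical, and the LF statistic itself is not reproduced here.

```python
from collections import Counter


def bin_index(position: int) -> int:
    """Map a TSS-relative position to a 1-based 20-bp bin (1..75).

    Assumes no position 0, so that -1 is immediately followed by +1.
    """
    if position == 0 or not -1000 <= position <= 500:
        raise ValueError(f"position {position} is outside -1000..-1 / +1..+500")
    offset = position + 1000 if position < 0 else position + 999
    return offset // 20 + 1


def bin_counts(positions):
    """Histogram of motif start positions over the 75 bins."""
    return Counter(bin_index(p) for p in positions)


hits = [-1000, -981, -980, -1, 1, 20, 481, 500]  # hypothetical start positions
print([bin_index(p) for p in hits])
# prints: [1, 1, 2, 50, 51, 51, 75, 75]
```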
Many continuous 8-mers (X4-N0-X4) are enriched in proximal promoters (-120-bps to the TSS) (supporting information, Figure S1, A and B, and Table S1) (FitzGerald et al. 2004, 2006; Xie et al. 2005; Vinson et al. 2011). In contrast, fewer split 8-mers with an insert length of 4-bps (X4-N4-X4) localize in proximal promoters (FitzGerald et al. 2004; Vinson et al. 2011) (Figure 1, A and B). As insert length increases, preferential localization of split 8-mers in the proximal promoter decreases for both CG- and non-CG split 8-mers and is much more pronounced for the non-CG 8-mers (Figure 1, C and D).
The most localizing split 8-mer sequences with an insert length of 1-bps and 2-bps both represent the CRE motif (Figure 1D and Table S1), suggesting that the CRE is 10-bps long (GTGACGTCAC). The most localizing sequences with a 3-bps and a 4-bps insert are each a CG-rich 4-mer followed by TATA (CCGG-N3-TATA and GCCG-N4-TATA), sequences previously identified that function in proximal promoters (Lagrange et al. 1998). These split 4-mers are not strand specific, indicating that the CG-rich 4-mer can be either before or after the strand-specific TATAA (FitzGerald et al. 2004). Virtually all the localizing split 8-mers with an insert length of 5-bps or more contain the CG dinucleotide (Figure 1, C and D). The 20 most localizing split 8-mers with insert length of 0-bps, 2-bps, 4-bps, and 5-bps to 30-bps are presented in Table S1.

Split 8-mers that localize in promoters at a unique insert length
The split 8-mers that localize in proximal promoters were grouped into three classes (Table S1): (i) split 8-mers with a short insert length of 1-bps or 2-bps representing a single TFBS (Figure S2, A-D); (ii) split 8-mers that localize in proximal promoters at many insert lengths representing co-localizing TFBS, each represented by a single 4-mer (Figure S2, E-H); and (iii) split 8-mers that localize in proximal promoters at a specific insert length. These include CGGA-N4-ACGT, which represents an ETS motif and a CRE motif, and unidentified sequences; e.g. GGGA-N2-TGTA (Figure S2, I and J).
To identify split 8-mers that localize in proximal promoters at only a precise insert length, the max LF for all split 8-mers with insert lengths from 0-bps to 30-bps (X4-N0-30-X4) was determined and compared with the ratio of max LF to the second highest LF (Figure 2, A and B). A ratio of max LF to the second highest LF close to 1 indicates localization of split 8-mers at various insert lengths, whereas a higher ratio is indicative of split 8-mers that are localized at a precise insert length. Both kinds of sequences are observed for 8-mers with a high LF. To identify the insert length that produces the precisely positioned pairs of 4-mers, we examined each insert length. Continuous 8-mers (X4-N0-X4) have many sequences with a high LF and a large ratio (LF(MAX)/LF(MAX-1)). These sequences are the TFBS previously described that localize in proximal promoters (FitzGerald et al. 2004). The two 4-mers (TGAC and GTCA) that create the CRE (TGACGTCA) motif preferentially localize in promoters when the insert length is 0-bps (Figure 2D and Figure S2, A and B). Similar results were obtained for the ETS motif (Figure S2, C and D). When we examined split 8-mers with an insert length of 2-bps, fewer 8-mers had both a high LF and ratio (Figure 2, E and F). These include GTGA-N2-TCAC, representing the CRE; CGGA-N2-TGAC, representing overlapping ETS and CRE TFBS (ETS⇔CRE) (CGGAAGTGAC); and GGAA-N2-GGAA, representing an ETS motif overlapping with a second ETS motif (ETS⇔ETS) (GGAAGCGGAA) (Table S1 and Figure 2). A systematic analysis of the human promoters using comparative genomics for the detection of regulatory motifs also identified an unannotated motif GGAANCGGAANY (Xie et al. 2005), which is essentially the ETS⇔ETS motif. Insert length of 4-bps produced even fewer sequences that are precisely localized (Figure 2, G and H). Insert length of 5-bps to 30-bps identified many 8-mers with a high LF but a low ratio, indicating that they are co-occurring in promoters at many insert lengths (Figure 2, I and J).
This analysis identified many split 8-mers with distinctive distributions; we focused our analysis on the overlapping ETS and CRE motifs. The distribution of the ETS⇔CRE motif split 8-mer CGGA-N4-ACGT shows localization in proximal promoters (Figure 3A). The split 8-mer CGGA-N0-30-ACGT preferentially localizes in proximal promoters when separated by 4-bps, with the continuous 12-mer CGGAAGTGACGT being the most localizing and abundant (Figure 3, A and B). More modest localization is observed at 20-bps and 22-bps, which has not been evaluated. This sequence contains both the ETS motif (CGGAAGTG) and the CRE motif (GTGACGT). The GTG trinucleotide is common to both the ETS and CRE motifs. These TFBS overlap to produce the ETS⇔CRE motif. The full ETS⇔CRE motif would be the two 16-mers C/GCGGAAGTGACGTCAC that occur five times in the human genome (Table 2). There are more than $4 \times 10^{9}$ 16-mers, and thus, each 16-mer would be expected to occur by chance only about once in a vertebrate genome of approximately $3 \times 10^{9}$ bps.
Table 1 DNA probe sequences for EMSA (binding sites underlined); columns: Probe, Sequence (5′ to 3′)
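The precise-spacing criterion used in this section, the ratio of the maximal LF to the second-highest LF across insert lengths, can be sketched as follows. The LF values in the example are hypothetical and chosen only to illustrate the two regimes (a ratio near 1 versus a large ratio); the function name is not from the original analysis, and the sketch assumes all LF values are positive.

```python
def precision_ratio(lf_by_insert: dict) -> tuple:
    """Return (best insert length, max LF, max LF / second-highest LF).

    lf_by_insert maps an insert length (bp) to the LF of one pair of 4-mers.
    Assumes at least two insert lengths and strictly positive LF values.
    """
    if len(lf_by_insert) < 2:
        raise ValueError("need LF values for at least two insert lengths")
    ranked = sorted(lf_by_insert.items(), key=lambda kv: kv[1], reverse=True)
    (best_k, best_lf), (_, second_lf) = ranked[0], ranked[1]
    return best_k, best_lf, best_lf / second_lf


# Hypothetical LF profiles for two pairs of 4-mers.
precise = {0: 1.1, 2: 1.0, 4: 6.0, 20: 1.5}  # localizes at one insert length
broad = {5: 3.0, 6: 2.9, 7: 2.8}             # localizes at many insert lengths

print(precision_ratio(precise))  # prints: (4, 6.0, 4.0)
print(round(precision_ratio(broad)[2], 2))
```

A pair such as `precise` would be reported as localizing at a specific insert length, whereas `broad` would be treated as two co-occurring TFBS without a preferred spacing.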
Two versions of the ETS motif that localize in proximal promoters differ only in the first nucleotide, the more common CCGGAA and the rarer GCGGAA (Figure S3A) (FitzGerald et al. 2004). DNA binding specificities of the 27 human ETS family members identify three proteins (SPI1, SPIB, and SPIC) that preferentially bind the rarer ETS motif (Kaplan et al. 2010). The rarer GCGGAA ETS motif is enriched compared with the CCGGAA motif in the ETS⇔CRE motif (Figure S3B).

Molecular model of the ETS⇔CRE motif bound by ETS and CREB proteins
To evaluate the potential for simultaneous binding of three proteins (ETS monomer and CREB dimer) to the ETS⇔CRE motif, we built a molecular model using PDB files of the ETS1 protein bound to an ETS site (PDB ID: 1K79) (Garvie et al. 2001) and the CREB dimer bound to the CRE (PDB ID: 1DH3) (Schumacher et al. 2000). The two structures were aligned computationally after superimposing 10 DNA bases on each strand of DNA. The combined structure did not produce protein clashes, suggesting that both proteins could potentially bind the ETS⇔CRE motif simultaneously (Figure 3, C-E). The GTG trinucleotide, which is common to both the ETS and CRE motifs, interacts with both proteins in the model. The ETS domain, a winged helix-turn-helix protein fold, interacts with the major groove using an α-helix to bind the core GGAA of the motif. It also crosses the phosphate backbone and interacts with the minor groove of the GTG trinucleotide (Hollenhorst et al. 2011b). The CREB dimer interacts with the GTG trinucleotide in the major groove and never crosses the DNA backbone.

The ETS protein GABPα and the B-ZIP protein CREB preferentially bind to ETS⇔CRE
EMSA was used to investigate whether ETS and B-ZIP proteins could simultaneously bind the ETS⇔CRE motif (Table 1). In the EMSA experiments, we used the B-ZIP protein CREB to bind the CRE motif and the ETS proteins GABPα or ETV5 to bind the ETS motif (Figure 4). Eight DNA probes were examined. Three DNA probes contained mutations in either or both motifs that abolished protein binding to the expected TFBS (Figure 4A). Five DNA probes examined the spacing between the two motifs; one probe has a deletion of 1-bp and three DNA probes have an insert of 1-bps, 2-bps, or 3-bps between the ETS and CRE motifs. CREB bound well at 10 nM (Ahn et al. 1998), whereas GABPα binding was weaker, being detectable at 200 nM. When GABPα and CREB were mixed, GABPα binding was enhanced only on the DNA probe containing the ETS⇔CRE motif (compare lane 17 with lane 9 of Figure 4A). None of the deletion or insertion probes formed the CREB|GABPα|DNA complex (lanes 18-24, Figure 4A). Supershift experiments demonstrated that both GABPα and CREB proteins were present in the complex formed only on the DNA probe containing the ETS⇔CRE motif (Figure 4A), suggesting that this specific overlap of three base pairs between ETS and CRE motifs is important for binding by both GABPα and CREB. Importantly, the ETV5 member of the ETS family did not form the CREB|ETV5|DNA complex; instead, either the CREB|DNA or the ETV5|DNA complex formed (Figure 4B). A dose-response EMSA showed that binding of one protein precludes the binding of the other. Even when we saturated the probes with higher concentrations of ETV5 or CREB proteins, no CREB|ETV5|DNA complex was observed.
Figure 1 (A and B) LF and probability for split 8-mers with a 4-bp insert (X4-N4-X4). (C and D) For each 8-mer (X4-N0-30-X4), we determine which insert length produced the largest LF and plot that value in the column representing that insert length. (C) LF for the 12,547 continuous 8-mers and 10,951 split 8-mers containing the CG dinucleotide. We plot that -log P-value at the insert length with the highest LF. (D) Same as (C) but for all non-CG containing 8-mers, the 20,349 continuous 8-mers, and 21,945 split 8-mers with insert length from 1-bp to 30-bps.

Motif detection in CREB and GABPa ChIP-seq peaks
We examined published ChIP-seq data sets for GABPα (Valouev et al. 2008) in humans and CREB in mouse (Martianov et al. 2010) to determine whether the ETS⇔CRE motif is enriched in the ChIP-seq peaks. The peak-motifs package (Thomas-Chollier et al. 2011) of RSAT was used for evaluating the enriched motifs in these ChIP-seq regions. Using all CREB peak regions, peak-motifs identified the overlapping ETS⇔CRE motif, which is more enriched than the canonical CRE motif (Figure 4C and Table 3). When we used only the GABPα ChIP-seq peaks for de novo motif detection, we identified the canonical ETS and the ETS⇔ETS motif, but not the ETS⇔CRE motif. However, when we examined the 2953 peaks that contain the canonical ETS motif, we detected the ETS⇔CRE motif as the best-enriched motif (Figure 4C).
An additional analysis used the GABPα ChIP-seq data already described from the human Jurkat cell line and CREB ChIP-chip data from human HEK293T cells (Zhang et al. 2005). One thousand four hundred sixty-three (1463) peaks are common between CREB and GABPα binding sites. De novo motif detection using these regions by peak-motifs detected the ETS⇔CRE motif as the best-enriched motif (Figure 4C). Interestingly, among the other enriched motifs, we observed a palindromic ETS⇔CRE⇔ETS motif, in which the second ETS canonical motif is on the opposite strand (Figure 4C), suggesting the biological significance of coordinated regulation of gene expression by ETS and CREB. The promoters with ETS⇔CRE, obtained from the regions commonly bound by CREB and GABPα, are significantly enriched for the GO terms of proteolysis involved in macromolecule catabolic process, RNA processing, and cellular response to stress (Table 4). However, the MEME-ChIP package (Machanick & Bailey 2011) of the MEME Suite failed to detect the ETS⇔CRE motif as an enriched motif in any data set.

Length of ETS⇔CRE motif
Two strategies were used to evaluate the length of the ETS⇔CRE motif: (i) enrichment in 8000 housekeeping DNase I hypersensitive sites (DHS) (Sabo et al. 2004) and (ii) conservation in mammalian genomes.
We extended the ETS motif 8-mer CGGAAGTG toward the CRE (Figure 5A) and counted the occurrences in the genome and known regulatory regions, including annotated promoters, proximal promoters, CpG islands, housekeeping DHS, and all DHS identified in 45 cell types (Sabo et al. 2004) (Table S2). The housekeeping DHS are defined as the DNase hypersensitive regions that are present in all 45 cell types (Sabo et al. 2004). The ETS 8-mer CGGAAGTG occurs 16,846 times in the genome and 6% of them are in housekeeping DHS. Similar results were observed when the motif is extended to the 9-mer (CGGAAGTGA) and 10-mer (CGGAAGTGAC). A transition occurs with the 11-mer (CGGAAGTGACG), with 60% occurring in housekeeping DHS and 83% occurring in known regulatory regions (Table S2). The 11-mer contains two CG dinucleotides, which are rare outside of regulatory regions.
It is important to note that the 11-mer CGGAAGTGACT can represent the overlapping of an ETS motif and an AP1 motif (TGAC/GTCA) to create the ETS⇔AP1. The ETS⇔AP1 motif may be cooperatively bound by an ETS protein and B-ZIP proteins that bind the AP1 motif. This sequence does not occur in housekeeping DHS, but it is enriched in tissue-specific DHS (Table 2) as observed previously (Hollenhorst et al. 2011b). When the motif is extended to a 12-mer, localization in housekeeping DHS does not increase but the occurrence decreases, indicating that the 11-mer is the core of longer and diverse ETS⇔CRE motifs (Figure 5A).
When the motif is extended from the CRE side toward the ETS motif, we again observe that localization in housekeeping DHS jumps to its maximal value when the motif is extended to the second CG and forms the 11-mer CGGAAGTGACG. This suggests that the 226 ETS⇔CRE 11-mers in the genome contain different versions of the longer ETS⇔CRE 16-mers that may have distinct functions when they are bound by different combinations of ETS and B-ZIP family members.
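The motif-extension counts described in this section can be sketched as a both-strand k-mer count over progressively longer prefixes of the 11-mer. The toy sequence below is hypothetical and stands in for a genome or a set of DHS regions; the function names are illustrative and this is not the authors' counting code.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_both_strands(sequence: str, motif: str) -> int:
    """Count (possibly overlapping) occurrences of motif on either strand."""
    forward = len(re.findall(f"(?={motif})", sequence))
    rc = reverse_complement(motif)
    # A palindromic motif is the same site on both strands; count it once.
    reverse = 0 if rc == motif else len(re.findall(f"(?={rc})", sequence))
    return forward + reverse


ETS_CRE_11MER = "CGGAAGTGACG"

# Hypothetical toy sequence: one full 11-mer, one ETS 8-mer without the CRE
# half, and one 9-mer on the opposite strand.
toy = "AACGGAAGTGACGTTT" + "CCCGGAAGTGCCAA" + "TTCACTTCCGAA"

for length in range(8, 12):
    motif = ETS_CRE_11MER[:length]
    print(length, motif, count_both_strands(toy, motif))
```

Dividing such counts inside a region set (for example, housekeeping DHS) by the genome-wide count gives the kind of fraction reported for each extension in Table S2.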

Conservation of the ETS⇔CRE motif in mammals
The conservation of the ETS⇔CRE motif was examined in 36 mammalian genomes (Pollard et al. 2010). Initially, we examined the PhyloP signature for the ETS (CGGAAGTG) and CRE (TGACGTCA) 8-mers. Both PhyloP signatures show conservation (Figure 5, B and C), except for the CG, which has negative PhyloP values. We presume this simply reflects the chemical deamination of the C in the CG dinucleotide when it is methylated, a well-known hypermutable process that is not directly modeled in PhyloP. In contrast, in the ETS⇔CRE 11-mer (CGGAAGTGACG), all nucleotides, including both CG, are "highly" conserved, having scores four times larger than either the ETS or CRE motifs (Figure 5D). Conservation extends 1-bp beyond the CG on the ETS (5′) side of the motif to either a C or G, which is known to affect DNA binding of ETS family members (Wei et al. 2010). Beyond the CG on the CRE (3′) side of the ETS⇔CRE motif, the 4-bps (TCAC) region, which is the second half of the CRE motif, is not conserved. Provocatively, these nucleotides actually have negative PhyloP values; because this region contains no CG dinucleotide, deamination cannot account for them, which suggests that the sequences bound by the second monomer of the B-ZIP dimer in this context are evolving faster than neutral (Pollard et al. 2010).

1-bp variants of the ETS⇔CRE 11-mer
We examined whether 1-bp variants of the ETS⇔CRE 11-mer are also enriched in housekeeping DHS (Figure 6, A-D). Of the 147 occurrences of the most abundant 1-bp variant (CGGAAGTGGCG), 51 (35%) are in housekeeping DHS. Two additional variants (CGGACGTGACG and CGGAAGTGCCG) are abundant and enriched in housekeeping promoters, suggesting that they may also be functional. The GGA in the core of the ETS motif is critical for the sequence-specific binding (Graves & Petersen 1998) and shows very little variability in housekeeping DHS, suggesting that there are virtually no occurrences of the crippled ETS⇔CRE motif in regulatory regions. In the genome, all 1-bp variants that do not disrupt the CG are less abundant than the ETS⇔CRE 11-mer. In contrast, 1-bp variants that do disrupt either of the two CG are typically more abundant than the ETS⇔CRE, highlighting the profound effect of the CG dinucleotide on the occurrence of a DNA sequence in the genome. A molecular model of the ETS⇔CRE 16-mer bound by ETS and CREB is color-coded to visualize each nucleotide (Figure 6E). Potentially, the abundant 1-bp nucleotide variants of the ETS⇔CRE motif in housekeeping promoters are bound by different combinations of ETS and B-ZIP proteins.
Figure 4 [...] 6 and lanes 7-11) to the ETS⇔CRE motif. Increasing concentrations of ETV5 with fixed concentrations of CREB show that both proteins cannot simultaneously bind to the ETS⇔CRE motif. (C) Enriched motifs generated using the peak-motifs package of Regulatory Sequence Analysis Tools (RSAT). For de novo motif detection, we used all 6442 human GABPα ChIP-seq peaks (Valouev et al. 2008) and all 3998 mouse CREB ChIP-seq peaks (Martianov et al. 2010) as input sequences. In CREB ChIP-seq peaks, the most enriched motif is the canonical CRE, and the ETS⇔CRE motif is among the other significantly enriched motifs. In GABPα ChIP-seq peaks, the ETS motif is the primary enriched motif, and ETS⇔ETS is among the other enriched motifs. De novo motif detection using all 2953 ETS motif-containing regions predicted ETS⇔CRE as the best-enriched motif. De novo motif detection using the 1463 regions commonly bound by CREB and GABPα predicted ETS⇔CRE as the best-enriched motif. ETS⇔CRE⇔ETS is one of the other enriched motifs in these regions. The number of sites below each motif indicates the number of peaks that have at least one predicted motif.
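The set of 1-bp variants examined in this section can be enumerated with a short sketch that also separates variants that keep both CG dinucleotides from those that disrupt one. The function names are hypothetical; the genome-wide and DHS counts reported in the text are not reproduced here.

```python
CORE = "CGGAAGTGACG"  # the ETS⇔CRE 11-mer


def one_bp_variants(seq: str) -> list:
    """Return every sequence that differs from seq at exactly one position."""
    variants = []
    for i, ref in enumerate(seq):
        for alt in "ACGT":
            if alt != ref:
                variants.append(seq[:i] + alt + seq[i + 1:])
    return variants


def keeps_both_cg(variant: str, reference: str = CORE) -> bool:
    """True if every CG dinucleotide of the reference is still CG in the variant."""
    cg_starts = [i for i in range(len(reference) - 1) if reference[i:i + 2] == "CG"]
    return all(variant[i:i + 2] == "CG" for i in cg_starts)


variants = one_bp_variants(CORE)
cg_intact = [v for v in variants if keeps_both_cg(v)]
cg_disrupted = [v for v in variants if not keeps_both_cg(v)]

print(len(variants), len(cg_intact), len(cg_disrupted))  # prints: 33 21 12
print("CGGAAGTGGCG" in cg_intact)                        # prints: True
```

Counting each variant in the genome and in housekeeping DHS would then yield the comparison summarized in Figure 6, A-D.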

Four abundant ETS⇔CRE 13-mers (C/GCGGAAGTGACGT/C)
The abundance of longer versions of the ETS⇔CRE 11-mer in the genome and regulatory regions was evaluated (Figure S3C). We initially focused on 16-mers, the potential length of the ETS⇔CRE motif. Of the 226 11-mers in the genome, 171 different 16-mers occur, and the most abundant 16-mer (CCGGAAGTGACGCGAG) occurs seven times. The canonical motif CCGGAAGTGACGTCAC occurs three times in the genome. The alignment of ETS⇔CRE 11-mers, including surrounding DNA sequences, identified four abundant ETS⇔CRE 13-mers (C/GCGGAAGTGACGT/C) (Figure S3C), representing 70% of all ETS⇔CRE 11-mers (Figure 7, A and B). Each 13-mer correlated with different GO terms, suggesting distinct functions (Table S3). The nucleotide before the CG in the ETS motif is either G or C, and these are known to be bound by different ETS family members (Wei et al. 2010). The nucleotide after the central CG in the CRE is typically a pyrimidine, T or C. These are 5-fold more abundant than the purines G and A (Table 2). The T and C in this position are optimal for binding the B-ZIP proteins CREB and C/EBP, respectively (Johnson 1993). Each of the four ETS⇔CRE 13-mers is expected to be optimally bound by a specific combination of ETS monomers and B-ZIP dimers.
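The degenerate 13-mer notation expands to four concrete sequences. A minimal sketch of that expansion, with illustrative variable names:

```python
from itertools import product

CORE_11MER = "CGGAAGTGACG"
FIRST_BASES = "CG"  # nucleotide before the CG of the ETS motif
LAST_BASES = "TC"   # nucleotide after the central CG of the CRE

# Combine every allowed flanking base with the conserved 11-mer core.
thirteen_mers = [
    first + CORE_11MER + last
    for first, last in product(FIRST_BASES, LAST_BASES)
]

for seq in thirteen_mers:
    print(seq)
# prints:
# CCGGAAGTGACGT
# CCGGAAGTGACGC
# GCGGAAGTGACGT
# GCGGAAGTGACGC
```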
The dinucleotides following the CRE 6-mer TGACGT-N2 in proximal promoters are enriched only for the CA dinucleotide, which produces the canonical CRE 8-mer TGACGTCA (Figure 7C). In contrast, the dinucleotides following the ETS⇔CRE 12-mer CGGAAGTGACGT are also enriched for AN dinucleotides, suggesting that the CRE and the ETS⇔CRE motifs in promoters are bound by different B-ZIP proteins (Figure 7D).

Localization of pairs of CG in DHS
In the ETS⇔CRE motif, the two CG are separated by 7-bps (CG-N7-CG). To identify whether additional pairs of CG preferentially occur in promoters, we counted in the whole genome the occurrence of sequences containing a pair of CG separated by 0-bps to 9-bps (CG-N0-9-CG) and determined what fraction are in housekeeping DHS. The ETS⇔CRE motif stands out among all other sequences containing pairs of CG, being abundant and primarily in promoters (Figure 7, E and F).
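The CG-pair spacing can be tallied with a small sketch. It counts every pair of CG dinucleotides whose gap is 0 to 9 bp on one strand, regardless of any additional CG lying between them; that counting rule is an assumption, since the text does not specify how intervening CG were handled. The function name is hypothetical.

```python
import re
from collections import Counter


def cg_pair_spacings(sequence: str, max_gap: int = 9) -> Counter:
    """Count CG-N_k-CG occurrences for k = 0..max_gap.

    k is the number of nucleotides between the first CG and the second CG.
    """
    cg_starts = [m.start() for m in re.finditer("(?=CG)", sequence)]
    start_set = set(cg_starts)
    counts = Counter()
    for start in cg_starts:
        for gap in range(max_gap + 1):
            if start + 2 + gap in start_set:
                counts[gap] += 1
    return counts


print(cg_pair_spacings("CGGAAGTGACG"))    # prints: Counter({7: 1})
print(cg_pair_spacings("CACGCACACACCG"))  # prints: Counter({7: 1})
```

Both the ETS⇔CRE 11-mer and the CACGCACACACCG 13-mer discussed in the methylation analysis have their two CG separated by 7 bp.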

CG methylation status of the ETS⇔CRE motif in two mouse primary cells
Methylation of the CG dinucleotide in canonical ETS and CRE motifs inhibits binding of both ETS and CREB proteins (Iguchi-Ariga & Schaffner 1989; Umezawa et al. 1997; Rozenberg et al. 2008). An important feature of the ETS⇔CRE motif is the presence of two CG that can be methylated. We used two mouse methylomes, at 100X coverage for newborn mouse dermal fibroblasts and 45X coverage for primary keratinocytes. The four ETS⇔CRE 13-mers have different methylation properties (Table S4, Figure S4, Figure S5, and Figure S6). All 21 occurrences of the GCGGAAGTGACGT 13-mer are unmethylated on both CG dinucleotides in dermal fibroblasts and keratinocytes, suggesting that they are in functional regions of the genome. Of the 45 occurrences of the more abundant 13-mer CCGGAAGTGACGT, 33 are unmethylated in both cell types (Figure S4C, Figure S5A, and Figure S6A).
Not all 13-mers with two CG dinucleotides separated by 7-bps are unmethylated. Only 10% of the occurrences of CACGCACACACCG are unmethylated (Figure S4G, Figures 5E and 6E). Comparing the two methylomes for these motifs shows that unmethylated ETS⇔CRE 13-mers are common and are generally unmethylated in both cell types (Figure S6, A-D), and that these unmethylated ETS⇔CRE sequences are mainly enriched in promoters (Table S4), lending support to the suggestion that every occurrence of an unmethylated version of the ETS⇔CRE motif is biologically important.

DISCUSSION
We determined the distribution in human promoters of split DNA 8-mers consisting of a pair of 4-mers separated by 1-bp to 30-bps. A striking result is that few split 8-mers with an insert length of 5-bps or greater (X4-N5-30-X4) localize in proximal promoters. This is in sharp contrast to Drosophila promoters, in which many split 8-mers with a 20-bp to 30-bp insert length (X4-N20-30-X4) localize in proximal promoters (Vinson et al. 2011). We examined split 8-mers in human promoters and identified pairs of 4-mers that localized at a specific insert length and not others. This article focused on the ETS motif (C/GCGGAAGTG) precisely overlapping with a CRE motif (GTGACGTCAC) to create a composite site, the ETS⇔CRE motif (C/GCGGAAGTGACGTCAC). The trinucleotide GTG is shared by the two TFBS, being the 3′ end of the ETS motif and the 5′ end of the palindromic CRE motif. Molecular modeling using X-ray structures of ETS and B-ZIP proteins binding the ETS⇔CRE motif suggests that the ETS monomer and B-ZIP dimer can bind the overlapping TFBS without any protein-protein clashes. Instead of ETS and B-ZIP proteins competing for binding the ETS⇔CRE motif, the ETS protein GABPα and the B-ZIP protein CREB preferentially bind the ETS⇔CRE motif only when the GTG trinucleotide overlaps. In contrast, the ETS protein ETV5 competes with CREB to bind the ETS⇔CRE motif.
De novo enriched motif detection using the in vivo CREB and GABPα ChIP-seq binding regions identified the ETS⇔CRE motif along with the canonical CRE and ETS motifs, suggesting an in vivo function for the motif. Additionally, the conservation of the ETS⇔CRE motif points to a biological function (Xie et al. 2005; Pollard et al. 2010). The ETS domain has been shown to interact with several different DNA binding proteins to bind sequences containing chimeric aspects of each TFBS (Hollenhorst et al. 2011b). The ETS protein GABPα initially was observed interacting with GABPβ to bind a chimeric sequence (Batchelor et al. 1998). ETS was subsequently shown to interact with additional proteins. The forkhead proteins interact at the 5′ end of the ETS motif (De Val et al. 2008), whereas SRF, PAX, and potentially CREB interact at the 3′ end of the ETS motif (Hollenhorst et al. 2011b). Several of these interactions have been identified by examining tissue-specific enhancer sequences (Hollenhorst et al. 2011b). The cytokine RANTES (regulated upon activation, normal T cell expressed) is induced by LPS through promoter binding of ATF and Jun proteins to a composite site containing non-overlapping ETS and CRE motifs (Boehlk et al. 2000).
Figure 5 (A) Preferential localization in housekeeping DHS compared with the genome for different lengths of ETS⇔CRE sequences. The ETS (CGGAAGTG) and CRE (AAGTGACG) 8-mers were lengthened in the direction of the indicated arrows, and for each bp extension, preferential localization in housekeeping DHS was calculated. A jump in localization of ETS (CGGAAGTG) occurs when the second CG dinucleotide is included, which creates the 11-mer CGGAAGTGACG. The ETS 8-mer CGGAAGTG occurs 16,846 times in the genome and 1073 times in housekeeping DHS, a ratio of $8%. The ratios in housekeeping DHS of 8-mers (CGGAAGTN) with a different final nucleotide are shown as colored dots (G = yellow, A = green, T = red, C = blue). The ETS 9-mer CGGAAGTGA occurs 343 times in housekeeping DHS, with an enrichment in housekeeping DHS similar to that of the 8-mer. When the sequence is extended to the 11-mer CGGAAGTGACG, enrichment in housekeeping DHS jumps to 60%. If the final G in the 11-mer is changed to any of the three other nucleotides, enrichment in housekeeping DHS is only 10%. When the ETS⇔CRE motif is extended to a 12-mer and beyond, enrichment in housekeeping DHS remains constant. When the ETS⇔CRE motif is extended from the CRE side toward the ETS side, a jump in localization in housekeeping DHS occurs when the AAGTGACG 8-mer is extended to the CGGAAGTGACG 11-mer. (B) Conservation or phyloP score in 30 mammals for the CRE 8-mer. (C) phyloP score for the ETS 8-mer. (D) phyloP score for the ETS⇔CRE 11-mer.
ETS and CRE motifs co-occur in proximal promoters (FitzGerald et al. 2004). Cooperative DNA binding by GABPα and CREB to adjacent ETS and CRE sites separated by various distances up to 15-bps has been reported (Sawada et al. 1999). The cooperative binding is mapped to the non-DNA binding region of GABPα, suggesting that cooperativity is via protein-protein interactions. These investigators did not observe that the two motifs needed to be precisely aligned relative to each other for cooperative binding. These results are in sharp contrast to what we observed; the precise overlap produces enhanced GABPα and CREB binding, suggesting that the cooperative binding we observed between the ETS and CREB DNA binding domains is distinct from the cooperative binding observed when full-length proteins are examined. ETS and CRE motifs at a spacing different from that of the observed ETS⇔CRE motif may be preferentially bound by different combinations of ETS and B-ZIP proteins and may have specific functions in regulating gene expression. Oncogenic ETS family members in prostate cancer localize at ETS⇔AP1 motifs that have the same overlap (Hollenhorst et al. 2011a) observed in the ETS⇔CRE motif. The AP1 or TRE 7-mer (TGAC/GTCA) is a 1-bp deletion at the center of the CRE, disrupting the CG dinucleotide. Recently, the ETS and CRE motifs were observed to co-occur in ChIP-seq data sets with a spacing of 1-bp to 2-bp (Whitington et al. 2011), whereas we highlight the ETS⇔CRE motif at a precise spacing with unique biochemical properties.
Overlapping protein binding is observed in the enhanceosome, where the ATF-2/c-Jun heterodimer binds to the same DNA base pairs as the IRF-3 protein. Again, there are no protein-protein interactions (Panne et al. 2004, 2007; Panne 2008); instead, it appears that the cooperative binding of these three polypeptides is via allosteric changes to the DNA. This is similar to what may occur when GABPα and CREB preferentially bind the ETS⇔CRE motif.
Recently, it was suggested that a fundamental difference between prokaryotic and eukaryotic systems is that eukaryotic systems have short TFBS that proteins do not recognize with sufficient specificity to bind to cognate sites exclusively (Wunderlich & Mirny 2009) and need to cooperate with other TFs to displace a nucleosome and become functional (Polach & Widom 1996; Mirny 2010). The overlap of two TFBS, as observed in the ETS⇔CRE motif, creates a long DNA sequence of a kind that is generally rare in mammalian genomes and could thus function like a prokaryotic system in which each occurrence is functional.
An alternative method to create specificity in vertebrate genomes is to have two TFBS that only need to be within 150-bps of each other and function together because they compete with nucleosomes for binding (Polach & Widom 1996; Mirny 2010; Biddie et al. 2011). It appears that both mechanisms operate in mammalian genomes. An advantage of the overlapping TFBS is that it allows for cooperative binding between specific members of each TF family, thus increasing specificity. This is absent in the model of two TFs independently binding to DNA to displace a nucleosome. The nucleosome displacement mechanism allows different TFs to act cooperatively, and it allows selection of which family member is functioning.
Figure 7 (A) Abundance of the four ETS⇔CRE 13-mers (C/GCGGAAGTGACGT/C) and 1-bp variants in housekeeping DHS vs. percentage of occurrences in housekeeping DHS compared with the genome. All N-CG-N7-CG-N 13-mers are shown. The four abundant ETS⇔CRE 13-mers (C/GCGGAAGTGACGT/C) are shown in red. (B) Histogram of occurrences of the ETS⇔CRE 13-mers C/GCGGAAGTGACGT/C and all 1-bp variants in housekeeping DHS. (C) Pie chart representation of the occurrence of the dinucleotides at the end of the CRE motif TGACGTNN, which occurs 2046 times in proximal promoters (−200-bps to +60-bps). (D) Pie chart representation of the occurrence of the dinucleotides at the end of the ETS⇔CRE motif CGGAAGTGACGTNN. (E) Preferential occurrence in promoters compared to the genome for all pairs of CG separated by 0-bps to 9-bps (CG-N0-9-CG). (F) Same as (E), but sequences with an internal CG are excluded. The one sequence that is abundant primarily in promoters is the ETS⇔CRE motif.
We have taken a DNA-centric perspective to evaluate which DNA sequences are important, eschewing the common practice, embodied in the use of position weight matrices (PWM), of averaging two or more DNA sequences to create a logo or hybrid sequence. An inherent issue with the DNA-centric perspective is knowing the length of the DNA sequence. An upper bound on the length of a DNA sequence is reached when it becomes unique in the genome, instead of having thousands of occurrences of which only a subset is functional. Vertebrate genomes are not big enough to accommodate all possible 16-mers. The ETS⇔CRE 16-mer is long enough that random occurrences are not expected. Here, we have taken the approach that different sequences should not be averaged because this could obscure details concerning longer sequences having a distinct function. For example, the ETS⇔CRE 13-mers GCGGAAGTGACGT and CCGGAAGTGACGT enrich for distinct GO terms in addition to having distinct methylation properties. Closer examination of proximal promoters may identify additional examples of pairs of DNA sequences that are constrained relative to each other, as we observed for the ETS⇔CRE motif. The identification of these sequences will be essential as we deconvolute the genome into functional units.