Genome-Wide Patterns of Codon Bias Are Shaped by Natural Selection in the Purple Sea Urchin, Strongylocentrotus purpuratus

Codon usage bias has been documented in a wide diversity of species, but the relative contributions of mutational bias and various forms of natural selection remain unclear. Here, we describe for the first time genome-wide patterns of codon bias at 4623 genes in the purple sea urchin, Strongylocentrotus purpuratus. Preferred codons were identified at 18 amino acids that exclusively used G or C at third positions, which contrasted with the strong AT bias of the genome (overall GC content is 36.9%). The GC content of third positions and coding regions exhibited significant correlations with the magnitude of codon bias. In contrast, the GC content of introns and flanking regions was indistinguishable from the genome-wide background, which suggested a limited contribution of mutational bias to synonymous codon usage. Five distinct clusters of genes were identified that had significantly different synonymous codon usage patterns. A significant correlation was observed between codon bias and mRNA expression supporting translational selection, but this relationship was driven by only one highly biased cluster that represented only 8.6% of all genes. In all five clusters preferred codons were evolutionarily conserved to a similar degree despite differences in their synonymous codon usage distributions and magnitude of codon bias. The third positions of preferred codons in two codon usage groups also paired significantly more often in stems than in loops of mRNA secondary structure predictions, which suggested that codon bias might also affect mRNA stability. Our results suggest that mutational bias has played a minor role in determining codon bias in S. purpuratus and that preferred codon usage may be heterogeneous across different genes and subject to different forms of natural selection.

codon bias translational selection sea urchin antagonistic pleiotropy mRNA stability mutational bias The nonrandom usage of synonymous codons (codon bias) has been observed across all domains of life and varies in magnitude among closely related species and among genes within a genome (Grantham et al. 1980). The identity of the most frequently used synonymous codons within a genome may be similar or variable among loci. This cause of this variation has been attributed to mutational bias, random genetic drift, and natural selection, but the relative contributions of each remain unclear (Sharp and Li 1986;Bulmer 1991). Because synonymous codon substitutions do not change protein sequences, it has long been suggested that such changes are neutral and thus reflect of nonadaptive processes (King and Jukes 1969). Indeed, in broad taxonomic studies examining genome-wide patterns of codon bias researchers generally have found mutation processes may play a major role across broad domains of life (Chen et al. 2004). However, the variation available in the degeneracy of the genetic code and the multistep nature of transcription through translation may provide subtle opportunities on which natural selection may act (Clarke 1970;Sharp et al. 1995), and many researchers have detected subtle forces of selection at play (reviewed by Duret 2002;Hershberg and Petrov 2008;Plotkin and Kudla 2011). The possibility that natural selection could affect the rate of polypeptide synthesis through the matching of transfer RNA abundance and codon usage is not new (Zuckerkandl and Pauling 1965). The in vivo rate and accuracy of translation are affected by the use or disuse of "major" codons in Escherichia coli (Tuller et al. 2010;Navon and Pilpel 2011), and early observations of genome-wide synonymous codon usage in D. melanogaster and C. elegans showed a bias toward a subset of codons (Shields et al. 1988;Stenico et al. 1994). This set of preferred codons corresponds to the most abundant tRNAs in the cell and, by proxy, the number of tRNA gene copies in the genome (Moriyama Duret 2000). Subtle effects on fitness may be realized by a decrease in the cost of protein synthesis (Sharp et al. 1995) or by an increase in growth rate (Andersson and Kurland 1990;Kudla et al. 2009). Accordingly, highly transcribed genes have been found to be the most biased loci in both Drosophila and yeast (Duret and Mouchiroud 1999). Furthermore, in Drosophila the magnitude of codon bias is accentuated at genes expressed during the high growth rate larval stage (Powell et al. 1993;Vicario et al. 2008).
Early studies on the nature of the synonymous codon usage in yeast identified two relatively homogeneous groups of genes with different patterns of synonymous codon preference ). These different patterns proved to be biologically significant because they differentiated highly and lowly expressed genes. The selective advantages of translational efficiency and accuracy are clear and might explain the greater codon bias of highly expressed genes sharing one pattern of codon usage. However, it is unclear what selective processes, if any, might influence the persistence of the translationally suboptimal synonymous codons of the other group. It is possible that selection might influence the frequency of both major and minor codons in some taxa (Kreitman and Antezana 2000). This antagonistic pleiotropy hypothesis posits that minor codon frequencies may be driven by conflicting adaptive forces acting in competition on synonymous codons (Akashi and Eyre-Walker 1998). According to this hypothesis, major codons reflect translationally optimal codons, whereas minor codons represent translationally suboptimal codons. Translationally suboptimal codons, however, are still subject to other forms of selection. Different forms of natural selection on synonymous codons may thus compete for the synonymous state of the same codon.  a Bold identifies the synonymous codon in a codon family with the strongest significant negative correlation.
b Spearman's correlation coefficient between frequency of a synonymous codon usage to the overall codon bias of the gene.
Natural selection and mutation affect base composition and DNA sequence variation at the level of the genome, the gene, and at individual sites within genes. Genome-wide studies are capable of identifying major effectors of codon bias but the focus on major and minor codons may miss more subtle forms of natural selection. Identifying patterns of synonymous codon usage has proven useful in identifying group of genes likely under translational selection ). However, more than two clusters of synonymous codon usage patterns have been recently identified in both E. coli and Bacillus subtilus, and only one cluster in each species was associated with translational selection (Bailly-Bechet et al. 2006). These results are intriguing because they suggest that different patterns of synonymous codon usage might exist among different groups of genes and that multiple forms of selection (or mutation) may be acting on the different groups. Before this study, it is unclear whether multiple clusters of codon usage groups occur outside of bacteria and yeast.
Previous work on codon bias in metazoans has been performed primarily in the protostomes such as Drosophila and C. elegans. In deuterostomes, codon bias has been studied in vertebrates such as Xenopus (Musto et al. 2001) and Gallus (Rao et al. 2011), as well as in mammals (Chamary et al. 2006). However, there are a large number of nonchordate deuterostome species with large effective population sizes that could enable natural selection to act effectively on the subtle selection coefficients associated with synonymous codon usage (i.e., N e s % 1). Here, we investigate the patterns of synonymous codon usage in the purple sea urchin, Strongylocentrotus purpuratus. The purple sea urchin is an excellent candidate nonvertebrate deuterostome species in which to study codon bias. It is a widely distributed, broadcast spawner with a free swimming larval period of 2-3 months (Strathmann 1978) that results in a large effective population size, extensive levels of nucleotide polymorphism, minimal population structuring, and a well-supported phylogeny (Britten et al. 1978;Palumbi and Wilson 1990;Addison and Pogson 2009;Kober and Bernardi 2013). S. purpuratus has been studied as a model organism for early development for more than a century and a half (Pederson 2006), has a well-annotated genome assembly and transcriptome (Sodergren et al. 2006;Cameron et al. 2009), and is phylogenetically positioned to provide insight into vertebrate model systems (Cameron and Davidson 2007). Furthermore, the majority of its predicted~23,300 genes reside in regions with a narrow range of GC between 35 and 39% (Sodergren et al. 2006), thus minimizing the contribution of variable mutational bias across the genome to patterns of codon bias.
The objectives of the present study were to (1) investigate the presence of codon bias and the identity of preferred codons in the S. purpuratus genome, (2) assess the presence of distinct clusters of codon usage, and (3) examine the relative contributions of translational selection and selection on mRNA stability on the observed patterns of codon bias. We describe for the first time the existence of codon bias a marine invertebrate and show how mutational bias is an unlikely explanation for the nonrandom usage of synonymous codons. We document the existence of five distinct clusters of genes with different codon usage patterns and show how translational selection and selection for mRNA secondary structure differentially affect subsets of the codon usage groups. Finally, we identify the presence of five codon usage clusters in Drosophila melanogaster and show how translational selection is acting upon one group (as in S. purpuratus), suggesting the wide generality of these patterns.

MATERIALS AND METHODS
S. purpuratus genomics S. purpuratus genomic sequences were obtained from the Baylor College of Medicine (http://www.hgsc.bcm.tmc.edu/ftp-archive/Spurpuratus/ fasta/Spur_v2.1/). The gene models (i.e., genomic coordinates) of the Official Gene Set (OGS), GLEAN3_b-2.1-revised.gff and coding sequence annotations were obtained from the Sea Urchin Genome Database SpBase.org (Cameron et al. 2009). Coding sequence was obtained from the genomic sequence using the OGS gene models. Estimates of mRNA expression levels for early developmental stages (2, 15, 30, 48, and 72 hr) for all S. purpuratus genes were downloaded from the National Institute of Dental and Craniofacial Research (http://urchin.nidcr.nih.gov/blast/exp.html). Following Wei et al. (2006), we defined high expression as having an AU .25,000 (Arbitrary Units) and used the maximum expression observed across all five stages for all analyses.
Although the genome assembly is a high-quality draft and the OGS a well-annotated set of gene models, we applied a series of filters to minimize errors in our gene models. Each manually annotated gene prediction in the OGS underwent basic validation using CodonW (http://codonw.sourceforge.net/). Sequences identified by CodonW to have nonrecognized start codons, nontranslatable codons, partial last codons, and internal stop codons were discarded. Gene predictions without annotation data in SpBase (or annotated as "duplicates"), or as having excessive repeats or lacking sufficient information (i.e., having insufficient degrees of freedom) as identified by SCUMBLE (Kloster and Tang 2008) also were excluded. Of the remaining gene models, those with maximum expression scores less than 200 AU were removed. Finally, genes with identical expression scores for all time periods were treated as unhandled duplicates and those with the smallest codon bias scores (Wright's Nc; see the section Genome-wide patterns of codon bias) were retained. These filters left 4623 unique gene models for the study. Gene ontology (GO) terms (Harris et al. 2004) for 4232 of the 4623 genes were annotated using blastx interrogation of the SWISSPROT database (v. b2g.may10) with Blast2GO v. 2.4.9 (Conesa et al. 2005). Fisher's exact test with false discovery rate multiple testing corrections were used to assess enrichment of GO terms associated with genes in each group relative to entire set.
Secondarily filtered counts of tRNA genes predicted by tRNAscan-SE (Lowe and Eddy 1997) were obtained from the UCSC tRNA lab as a conservative set (http://gtrnadb.ucsc.edu/Spurp/Spurp-stats.html). Amino acids identified by a single tRNA gene were excluded from the analysis. tRNA gene species counts were used as proxies for tRNA gene concentrations. General wobble rules were used to identify the synonymous codons recognized by isoaccepting tRNA genes. Following Ikemura (1985), the frequency of use for tRNA genes with anticodons recognizing a single codon was estimated from the observed frequency of synonymous codon usage. For tRNA genes with anticodons n identifying multiple codons, their frequency of use was the sum of the observed frequencies of each synonymous codon (Ikemura 1985). The frequencies of tRNA gene copy numbers and synonymous codons for each tRNA species were normalized to 1.0 using the greatest frequency observed. We tested for relationships between frequencies of the isoaccepting tRNA gene copy numbers and the usage of their corresponding synonymous codons. Correlations were determined using the cor() function in the R statistical package (Ihaka and Gentleman 1996) for the entire dataset and separately for each codon usage group (see the section Clustering genes by codon usage distributions below). Data manipulations were performed using the Jim Kent's source tools (http://hgwdev.cse.ucsc.edu/~kent/src), the UCSC Sea Urchin Genome Browser strPur2 (Karolchik et al. 2007), BioPerl (Stajich et al. 2002), and BioPython (Cock et al. 2009). Comparative alignments and dS estimates for 2954 genes Allocentrotus fragilis and S. franciscanus genes were kindly provided by David Garfield (Oliver et al. 2010).

Codon bias measures
We used the effective number of codons, Nc (Wright 1990), to estimate codon bias. Nc (or ENC) is analogous to the effective number of alleles at a locus and ranges from 61 (random usage) to 20 (one codon per amino acid). To account for background nucleotide composition, Nc9 scores also were determined using ENCprime (Novembre 2002). Background composition was determined from the genomic nucleotide sequences of all nucleotides 1000 bp upstream and downstream of transcript initiation start and stop sites, respectively. Coding sequence GC levels (GC cds ) and third position codon GC levels (GC3) were calculated using CodonW. The GC content of flanking regions (GC f ) was estimated from nucleotide sequences 1000 bp upstream and downstream of the predicted start and stop codons, respectively. GC levels for introns (GC i ) were calculated for all genes possessing introns (n = 4389). These data for S. purpuratus are provided in File S1.
To examine the relative contributions of individual amino acids to the overall codon bias of a gene, sENC-X scores were calculated for each amino acid (where X is a particular amino acid) . Because unscaled ENC-X scores depend on the degeneracy of the amino acid (e.g., fourfold degenerate amino acids have ENC-X scores between 0 and 4), the sENC-X values are reset to fall between 0 (no bias) and 1 (maximum bias) for all amino acids. ENC-X scores were estimated using the 'cusage' program kindly provided by Etsuko Moriyama.
To identify genome-wide preferred codons, we determined frequency of each codon's usage within its codon family for each gene and tested the Spearman correlation between these frequencies with the overall codon bias of that gene (Nc9). The preferred codon for each amino acid was identified as that with the most significant negative correlation with Nc9 (Hershberg and Petrov 2008). In other words, the genome-wide preferred codons are defined as the codons that are used more frequently as the bias of the gene increases (Supporting Information, Figure S1). The significance threshold was defined by a P-value smaller than 0.05/n, where n is the number of codons in the codon family. This adjustment is to correct for performing more comparisons on more degenerate codon families (Hershberg and Petrov 2008).

Clustering genes by codon usage distributions
We used the clustering method of Bailly-Bechet et al. (2006) to identify groups on the basis of their codon usage patterns. Each group was defined by a distribution of synonymous codon usage that differed from group to group. The number of clusters, S, was optimized in terms of the information content. S was determined by maximizing the difference in self-consistency, b ðsÞ g , between the real data and a null model. b ðsÞ g provides a value reflecting the quality of the assignment of the gene g to cluster s, where the quality of the assignment relates to how close the value is to unity (Bailly-Bechet et al. 2006). The geometric average, B(S), of all clusters provides the quality of corresponding assignments. A null distribution was generated using an artificial dataset generated from the real data and the quality of this set, B random (S), also was determined. The maximum difference D(S) = B(S) 2 B random (S) gives the number of clusters S retained. Clustering and assessments were performed using the software implementation of Bailly-Bechet et al. (2006). For visualization in heatmaps, codon usage frequencies were centered on a random expectation using the Codon usage frequencies were centered on the equal usage expectation (e.g., for fourfold degenerate codons, frequencies were centered on 1/4). Centered frequencies clustering was performed with hierarchical cluster analysis using pairwise complete-linkage by Euclidean distance for both columns (CUG) and rows (synonymous codons) using Cluster 3.0 (Eisen et al. 1998). Heatmap plots were generated with Java TreeView (Saldanha 2004). Synonymous codon labels are colored by the base composition of the third position (N3) with N3 of 'A' and 'T' nucleotides, colored pink, and 'G' and 'C,' colored green. The asterisk ( Ã ) denotes genome-wide preferred codons in S. purpuratus (see Materials and Methods).
n Table 3 Summary of codon bias, GC content, gene size, and mRNA expression levels in the five codon usage groups  (1) We also applied the probabilistic model of codon bias implemented by SCUMBLE v 1.0 (Kloster and Tang 2008) to identify clusters of genes with similar patterns of codon usage. The algorithm fits a codon usage model in which each gene g is assigned a given number of "offsets" b i (g). An offset indicates to what extent gene g is affected by estimated bias ('trend') number i. A "preference function" E i (c) for each trend indicates how much trend i favors/disfavors codon c (Kloster and Tang 2008). We explored models with 0 to 10 trends using SCUMBLE. The number of trends used for the analyses was based on the normalized variance NV(g) of the genes.

Conservation of preferred synonymous codons
To test for the conservation of preferred codons, we used Akashi's test (Akashi 1994). For all genes, each codon was identified as preferred or nonpreferred and as conserved (synonymous) or nonconserved (nonsynonymous) between S. purpuratus and either A. fragilis or S. franciscanus (Stoletzki and Eyre-Walker 2007) using alignments provided by David Garfield (Oliver et al. 2010). Preferred codons within groups delineated by the clustering methods described above were identified as having the highest frequency of usage within each separate group. Only genes with sufficient data for the contingency tables were used (n = 2349). The Woolf test for homogeneity across genes in each cluster (Woolf 1955) was implemented using woolf_test from the vcd library in the R statistical package (Ihaka and Gentleman 1996). All tables for genes clustering into different groups were combined for analysis using the Cochran-Mantel-Haenszel procedure (Cochran 1954;Mantel and Haenszel 1959) and tested using mantelhaen.test from the stats library in R. This allowed us to determine if site conservation (i.e., synonymous substitutions) was independent of codon preference in any given codon usage cluster. To assess the strength of the association between conservation and codon preference, we followed Drummond and Wilke (2008) and generated a reference distribution of the odds ratio for each codon usage group by performing Akashi's test and computing the odds ratio for 1000 randomly generated sets of synonymous codons from the genes of that group.

Preference for mRNA folding stabilizing codons
To test if the third nucleotide positions of preferred codons were more likely to be stem-paired in RNA secondary structure, the data were stratified by gene. RNA secondary structure predictions were performed using RNAfold from the Vienna RNA Package v1.8.5 (Hofacker 2004). Nucleotides in the third position of each codon (N3) were identified as either preferred or non-preferred and as falling in a stem or loop in the folding prediction. Preferred codons for each grouping were determined as described above for the Akashi (1994) test. Then, 2 · 2 contingency tables were constructed for each gene (e.g., Table S1) and Woolf tests for homogeneity and association tests using the Cochran-Mantel-Haenszel procedure were performed as described previously. This allowed us to test if folding role (i.e., paired in a stem or unpaired in a loop) is independent of codon preference.

D. melanogaster genomics
The genomic assembly for D. melanogaster was obtained from the UCSC Genome Browser (dm3, Apr. 2006, BDGP R5). A total of 11,776 gene models were obtained for the Release 5.4 Annotation Update from flyBase.org (Flybase 2007_03 Update). Each gene prediction underwent the same validation procedure described earlier for S. purpuratus, leaving 11,223 gene models for analyses. The coding sequence for Notch and 1000 bp of flanking sequence was obtained from flyBase.org. Estimates of mRNA expression were obtained from the LS mean of overall expression of D. melanogaster (y;cnbwsp) from the Affymetrix Gene Chip Drosophila Tiling 1.0R Arrays (Graze et al. 2009). Expression values for Notch were not available in these data. These gene data for D. melanogaster are provided in File S2.

Genome-wide patterns of codon bias
The mean GC content of the S. purpuratus genome is 36.9% and the vast majority of the~23,300 genes fall in regions of GC between 35 and 39% Sodergren et al. (2006). For the 4623 genes analyzed in the present study, the GC content of coding regions was greater than the genome-wide mean (48.4%) but the GC content of introns and flanking regions were not (35.1% and 36.3%, respectively). Figure 1 presents the overall extent of codon bias in our data. Wright's (1990) effective number of codons (Nc) ranged from 32 to 61, exhibiting a mean of 54.5 (Figure 1). Correcting for background GC yielded Nc9 scores ranging from 25.4 to 61 with mean of 51.0.
Despite the weak overall signal of codon bias, we successfully identified preferred codons at all 18 amino acids having two or more synonymous codons (Table 1). Figure S1 shows for isoleucine (i.e., Ile) how the usage of the preferred codon (ATC) increases as the magnitude of codon bias increases. In contrast to the AT bias in the S. purpuratus genome, all preferred codons used a C or a G at the third position and the former were used far more frequently than the latter (15 vs. 3; Table S2). We also observed significant genome-wide correlations between isoaccepting tRNA gene copy numbers and the usage of both synonymous codon and amino acids (Table 2). Across all groupings, synonymous codons identified by tRNA genes with a greater species counts were used more commonly. Many of the isoaccepting tRNA gene predictions were redundant for the synonymous codons they identified (not shown).

Identification of different codon usage groups
The clustering method of Bailly-Bechet et al. (2006) identified five significantly distinct clusters of genes having different codon usage patterns (Table S3 and Figure 2A). The support for five groups was significantly better than six groups for both assignment probability [B(S) = 0.900 6 0.002 vs. 20.880 6 0.002, respectively] and maximum cluster stability [D(S) = 0.312 6 0.003vs. 0.306 6 0.002, respectively; see Figure S2A]. Table 3 shows that although codon bias was evident across all five groups (i.e., GC3 was elevated above both GC i and GC f ), it was most highly skewed in groups 0 and 1. Genes in groups 0 and 1 showed the highest levels of GC3 and the most extreme degrees of codon bias. Although groups 0 and 1 were similar in their levels of codon bias and patterns of synonymous codon usage (Table S3 and Figure 2A), group 0 genes exhibited dramatically greater levels of mRNA expression than the others (Table 3).
The SCUMBLE model of Kloster and Tang (2008) produced results that corroborated the existence of five distinct codon usage n groups ( Figure 3). Histograms of normalized variance (NV) suggested that at least four trends were required to explain the data ( Figure S3A) and models with more than four trends failed to show improvement ( Figure S3B). Offsets b 1 and b 3 correlated most strongly with GC3 and CT3 in all groups and in the pooled data, respectively (Table S4). Groups were differentiated, however, only by offset b 2 that was most strongly associated with mRNA expression levels at 72 hr for group 0, GT3 in group 2, and CT3 in groups 1, 3, and 4 (Table S4).

Patterns of codon usage bias
Codon bias in S. purpuratus was driven by synonymous codon usage in a subset of amino acids (Table 4). Four amino acids contributed significantly to codon bias in all five groups and in the pooled data [leucine (Leu), isoleucine (Ile), serine (Ser), and threonine (Thr)], whereas two amino acids did not contribute to codon bias in any group [histidine (His) and cysteine (Cys)]. Groups 0 and 1 exhibited the most strongly negative correlations (i.e., P , 10 210 ) between sENC-X and Nc9 (9 each) and group 3 had the fewest (3). Although there was considerable variability in the contributions of individual amino acids across groups, the strongest correlations in the entire set of 4623 genes tended to show significant patterns in four or five of the groups.
The GC content of third positions (GC3) and coding regions (GC cds ) exhibited significant correlations with the degree of codon bias (Figure 4, A and B and Table 5). By contrast, there was a limited association between the GC content of introns and flanking regions and the degree of codon bias (Figure 4, C and D and Table 5). Our results show that GC content of introns or flanking regions for all genes displayed weak associations with Nc (Table S5) and no significant correlations with Nc9 (Table 4). GC3 was strongly correlated with GC cds in all groups (Table 5). Relationships between the GC content of introns and flanking regions with both GC3 and GC cds were much weaker and inconsistent among groups ( Table 5).
The magnitude of codon bias was significantly associated with gene expression levels, but only in group 0 (Nc' r = 20.2578, P , 0.0001; Nc r = 20.2695, P , 0.0001). Highly biased, highly transcribed genes were primarily found in group 0 ( Figure 5A). For example, 51 of the 54 annotated ribosomal proteins were assigned to group 0 (of the remaining three, one was in group 1 and two in group 4). However, as with Notch in Drosophila, we observed highly expressed, highly biased genes that did not share a synonymous codon usage pattern with ribosomal proteins and the other members of group 0. For example, Sp-Ets1/2 (SPU_002874) is a member of the ETS family of transcription factors. It acts as a transcriptional activator or repressor and is involved in cell senescence and death, stem cell development and tumorigenesis (RefSeq Release 49, July 2011). Sp-Ets1/2 is highly expressed (41,923 AU) and highly biased (Nc9 = 42.443) but is a member of group 1 rather than group 0.
Gene length was also negatively correlated with mRNA expression and Nc9 (Table 6, Figure S4). We did not observe relationships between codon bias and position along the protein using intragenic spatial codon usage estimates (Qin et al. 2004). The magnitude of codon bias did not differ between the middle sections of genes relative to amino-or carboxy-terminal regions, did not change over the length of the gene for different size classes and was not correlated with intron number or length (not shown). Thus, we were unable to find support for interference selection acting on synonymous codon usage in S. purpuratus (Comeron and Kreitman 2002).

Conservation of preferred codons
We tested whether preferred codons were more evolutionarily constrained by applying Akashi's (1994) method. The conservation of codons was assessed using alignments containing S. purpuratus, A. fragilis, and S. franciscanus. We observed that preferred codons were more likely to be conserved than unpreferred codons in all five groups and for all genes combined (Table 7). This is remarkable considering that each group had a different set of preferred codons defined for the test. The Woolf test on homogeneity of odds ratios (Woolf 1955) showed significant three-way associations for all groups. To further test this result, we randomly generated sets of preferred codons and found that our previously identified list exhibited a significantly stronger association between codon preference and conservation across each (Table 7). In all five codon usage groups, stronger correlations were detected between our identified preferred codons and the evolutionarily conserved codons than that for the all genes combined.

Role of preferred codons in stabilizing RNA structure
We found that the third positions of preferred codons were significantly more likely to be paired in stems within folded RNA secondary structures in groups 2 and 3 (Table 8). Although all five groups had different distributions of preferred third positions (Table S2), each had two N3 preferred codons that were significantly overrepresented in stem pairs. In all five groups, G3 was significantly associated to be paired in stems in preferred codons. C3 of preferred codons were more likely to be stem-paired in all groups except group 2, where it was replaced by U3. The identity of the majority codon N3 base did not match the preferred stems for any group. The Woolf test on homogeneity of odds ratios (Woolf 1955) produced significant 3-way associations for all groups.

Comparison to codon usage patterns in Drosophila
To determine whether similar clusters of codon usage patterns occur in Drosophila melanogaster, we applied the Bailly-Bechet et al. (2006) clustering method to 11,223 gene models obtained from flybase.org. Similar to our results for S. purpuratus, we detected five significantly distinct groups ( Figure 2B, Figure S2B). Although similar in overall number, the patterns of codon usage differed between the species. Overall, 2675 (23.8%) genes were assigned to group 0, 2919 (26.0%) to group 1, 607 (5.4%) to group 2, 4156 (37.0%) to group 3, and 867 (7.7%) to group 4. Highly expressed, highly biased genes were almost entirely represented by Drosophila group 4 ( Figure 5B). As found in S. purpuratus, a majority of annotated small and large subunit ribosomal proteins were observed in this highly expressed group (54 of 60). Third positions in Drosophila groups 1, 2, 3, and 4 were more likely to be G or C ( Figure 2B). Conversely, N3 for genes in Drosophila Group 2 were more likely to be A or T. Despite being highly biased (Nc9 = 43.31) and typically highly expressed, the synonymous codon usage in Notch was more similar to other members of Drosophila Group 3. Genome-wide, S. purpuratus genes were not as biased as D. melanogaster (Nc9 = 46.04 6 6.83 and 51.08 6 5.08, respectively). S. purpuratus group 0 genes were more biased than D. melanogaster genome-wide average (Nc9 = 42.50 6 15.5) but less biased than D. melanogaster group 4 (Nc9 = 39.08 6 7.51).

DISCUSSION
Synonymous codon usage is influenced by the competing actions of random genetic drift, natural selection, and mutation (Sharp et al. 1995;Duret 2002;Hershberg and Petrov 2008). Despite decades of study on codon bias the relative contributions of mutational bias vs. selection favoring the speed and accuracy of protein translation, n  (4623)  The black arrow labeled "U" points to "ubiquitin-like/S30 ribosomal fusion protein" (SPU_005280), an example of a highly expressed gene in S. purpuratus. The black arrow labeled "E" points to Sp-Ets1/2 (SPU_002874), an example of a highly expressed, highly biased gene that does not belong to codon usage group 0.
translational robustness, or mRNA stability remains unclear (Duret 2002;Hershberg and Petrov 2008;Plotkin and Kudla 2011). Selection for translational efficiency has long been recognized as an important factor affecting synonymous codon usage bias in unicellular and multicellular organisms (e.g., Powell and Moriyama 1997;Akashi 1994). However, other forms of selection can favor translationally suboptimal codons (Hershberg and Petrov 2008) that in competition with translational selection can result in heterogeneous patterns of codon usage across the genome (Akashi and Eyre-Walker 1998). The present study has confirmed that multiple distinct patterns of codon usage bias exist in S. purpuratus that appear to result from different modes of selection (i.e., translational efficiency and mRNA stability). Our results also suggest that mutational bias has played a negligible role in generating codon bias in S. purpuratus.
Distinct codon usage groups occur in S. purpuratus Synonymous codon usage varies between genes varies in both composition and magnitude Shields and Sharp 1987). Genome-wide patterns of codon bias are driven primarily by the most highly biased genes, which, if they share preferred codons, will dominate the synonymous codon usage patterns. It is thus easy to overlook more subtle patterns of codon bias. This is evident in yeast where codon usage differs between highly and lowly expressed genes in both the level and nature of the bias . A similar situation is seen in Bacillus subtilis where correspondence analysis identified an axis with ribosomal proteins at one end and genes with different codon usage patterns at the other end (Shields and Sharp 1987). Notch is a highly expressed and highly biased gene in Drosophila but exhibits a codon usage distribution different from other similar loci (Dumont et al. 2004;Singh et al. 2007). These subtle patterns of codon usage appear to be biologically significant and need to be considered in studies examining codon bias. We identified five distinct codon usage groups in S. purpuratus using two independent methods (Figures 2A and Figure 3). The magnitude of codon bias and the usage of genome-wide preferred codons varied dramatically between groups (Table 1, Table 3 and Table S3), as did the mean levels of maximum mRNA expression during early development ( Figure 5A). Previously, the detection of distinct codon usage clusters has been restricted to bacteria and yeast (e.g., Bailly-Bechet et al. 2006;Kloster and Tang 2008). In E. coli K12 and B. subtilis, Bailly-Bechet et al. (2006) identified four and five distinct codon usage clusters, respectively. In both species, genes belonging to different clusters were associated together in different chromosomal regions implicating some role for translational selection. The present study represents the first documentation of similar codon usage clusters existing in a metazoan. To assess the generality of our results in S. purpuratus we also identified five distinct codon usage groups in D. melanogaster ( Figure 2B), a well studied species where translational selection on rate and accuracy has been previously documented . Notch, being an interesting anomaly for synonymous codon usage in Drosophila, was expected to have a different codon usage pattern from other highly expressed and highly biased genes. As expected, Notch had a different codon usage distribution from the majority of ribosomal proteins in Drosophila. Our identification of distinct codon usage groups in D. melanogaster supports the reliability of the patterns we observed in S. purpuratus and suggests that similar clusters may be present in other species.
Mutational processes cannot explain codon bias in S. purpuratus If mutational bias is responsible for generating codon usage bias, synonymous sites are expected to have nucleotide compositions similar to introns and flanking regions. S. purpuratus is an excellent species to test for mutational bias because it possesses an AT-biased genome lacking isochores and the majority of genes reside in regions of GC content between 35 and 39% (Sodergren et al. 2006). Despite this overall AT bias, all 18 preferred codons used a C or a G at their third positions. In our data, the mean GC content of introns and flanking regions were similar to the genome-wide average and did not vary among the different codon usage clusters (Table 3). The magnitude of codon bias exhibited strong correlations with mean GC3 and GC cds in the most highly biased groups (0 and 1). However, the degree of codon bias in groups 0 and 1 showed no associations with either GC i or GC f (Table 5, Table S5, and Figure 4). These observations suggest that mutational bias has played a negligible role in influencing patterns of codon bias in S. purpuratus.

Selection on translation drives major codon usage patterns
Codon usage and tRNA abundances are thought to have coevolved to maximize the speed (Bulmer 1987) or the accuracy of protein translation (Akashi 1994). Speed of translation is an adaptation for translation rate whereas translational accuracy acts to minimize the incorporation of incorrect amino acids into proteins. These two interrelated drivers of selection on translational efficiency (speed and accuracy) are difficult to disentangle and share similar predictions. For example, both hypotheses predict that the abundances of tRNA species should correlate with preferred codons in highly expressed genes (Ikemura 1985) and that high levels of gene expression should correlate with high levels of codon bias (Duret and Mouchiroud 1999). In S. purpuratus we obtained evidence for both of these predictions. We found that tRNA gene frequencies were significantly correlated with n  (4623), cluster 0 (396), cluster 1 (861), cluster 2 (1154), cluster 3 (912), cluster 4 (1300). b log 10 (maximum observed expression in AU).
genome-wide preferred codons and tRNA gene copy number exhibited a significant positive relationship with amino acid counts. Furthermore, the magnitude of codon bias was strongly related to mRNA expression levels during early development (Table 3 and Figure 5A). Although we detected genome-wide evidence for translational selection, it was driven exclusively by genes in group 0 that represented only 8.6% of all loci. The ability of such a small number of genes to drive this overall pattern is attributable to their dramatically greater levels of mRNA expression (roughly four times the other groups; see Table 3). Strikingly similar results were obtained for D. melanogaster, in which highly expressed and highly biased genes were represented in only one of the five clusters identified (group 4; Figure 5B). In both species, the vast majority of ribosomal protein genes were present in the highly expressed cluster and S. purpuratus group 0 was overrepresented with GO terms related to transcription and cell growth (Table S6). Translational selection thus apparently acts on only one of the five codon usage groups identified in both species and only a small proportion of genes (,10%) are driving the genome-wide correlations between codon bias and mRNA expression.
In S. purpuratus, codon usage group 1 shares similar patterns of codon preference to group 0 but fails to exhibit a correlation with gene expression levels. There are three amino acids [serine (Ser), proline [Pro], and glycine [Gly]) with different preferred codons between groups 0 and 1 (Table 4). Only Ser contributes to codon bias in both groups (Table S3), whereas Pro contributes only to group 0. If group 1 genes have also been selected for translational efficiency then one explanation for this anomalous result is they are not expressed during early development but are highly expressed at later developmental stages. Recent studies in Drosophila have found that larval stages exhibit greater overall codon bias than adult stages, supporting the hypothesis that codon usage bias is influenced by selection for efficient protein production (Vicario et al. 2008). In contrast, genes expressed after embryogenesis in C. elegans have greater synonymous substitution rates than those expressed during embryogenesis (Castillo-Davis and Hartl 2002). The possibility of translational selection acting on group 1 genes in S. purpuratus at later developmental stages needs to be investigated further.
Preferred codons are evolutionarily conserved Selection for the accuracy of translation is an additional explanation for the relationship between codon bias and rates of translation in Drosophila (Akashi 1994). The model assumes that the misincorporation of amino acids due to errors in translation adversely affects the functionality of proteins and reduces cell fitness. There are two unique predictions of the translational accuracy model: (1) synonymous codon usage at evolutionary conserved sites will be biased toward preferred codons, and (2) the degree of conservation will be stronger in highly expressed genes. As found for Drosophila, we observed that preferred codons were evolutionarily conserved across all genes in S. purpuratus (Table 7). In addition, we observed that preferred codons were conserved to a similar degree in all five codon usage groups with their respective sets of preferred codons despite differences in their synonymous codon usage patterns, levels of mRNA expression, and their magnitude of codon bias (Table 7). This is particularly noteworthy for group 2, whose preferred codons end primarily in A and T. Selection for translational accuracy should also be strongest in the most highly expressed group 0 but instead appeared to be independent of mRNA expression. Therefore, selection for translational accuracy cannot account for the different patterns of codon usage observed between the clusters.
Preferred third codon positions are more likely to pair in mRNA stems Selection to optimize heteroduplex base paring and maximize the stability of mRNA secondary structure has also been considered as a potential driver of codon usage bias for many years (Fitch 1974;Klämbt 1975;Grantham et al. 1980). A relationship between folding stability and codon preference has been established through the examination of mRNA secondary structure in mammals and other groups (Fitch 1974;Hasegawa et al. 1979;Chamary and Hurst 2005;Meyer and Miklós 2005;Chamary et al. 2006;Shabalina et al. 2006). mRNA stability is expected to be influenced by base composition at third positions (N3) (Carlini et al. 2001). A preference for G or C in N3 is seen in the most highly biased genes of Drosophila, which is almost entirely explained by an increased use of C over G (Shields et al. 1988). This preference for C in N3 of highly biased loci is not easily explained by local mutation effects but may result from its effect on mRNA stability (Gillespie 1991).The stability of mRNA secondary structure can be increased through GC heteroduplex pairing in N3 or by an increase in the number of heteroduplex pairs. Recognizing there are other competing processes that determine preferred N3 composition (Chamary and Hurst 2005), and different definitions of preference not limited to GC (this study), we restricted our analysis to stems and loops and tested if the third positions of preferred codons were more likely to reside as paired members of stems than unpaired members of loops. More specifically, if natural selection acts on synonymous codons to increase mRNA stability then we expect N3 of n  (Woolf 1955) shows significant three-way association for all groups. b P(Better Codon Set) is the fraction of 1000 randomly generated alternate preferred synonymous codon sets having a stronger association with conserved codons than the observed preferred set (see Materials and Methods).
n preferred codons to more likely to reside as paired members of stems than unpaired members of loops. In S. purpuratus, we observed that any nucleotide in the third position of preferred codons (N3) were more likely to be in paired in stems than unpaired in loops only for genes in codon usage groups 2 and 3 (Table 8). Ignoring codon usage groups, the Mantel-Haenszel tests were significant for C3 and G3, which is consistent with the observation that all genome-wide preferred codons had a G or C in their 3 rd positions. Considering the groups separately, group 2 differed sharply from the others in having G3 or U3 of preferred codons pairing in stems over loops. Group 0 exhibited the weakest relationships, which is not surprising because selection for mRNA stability may conflict with translational selection and other processes. No significant relationships were observed between mRNA stability and expression levels. Here again, mRNA stability can impact rates of translation and highly stable mRNA structures may be specifically avoided in rapidly transcribed regions of genes (Mita et al. 1988;Zama 1990). Overall these results implicate a role for mRNA folding, perhaps acting in a compensatory fashion, in determining the patterns of synonymous codon usage in S. purpuratus.

Codon bias decreases with gene length
Codon bias decreases with increasing gene length in S. purpuratus similar to that observed in other species (e.g., Comeron et al. 1999;Duret and Mouchiroud 1999;Ingvarsson 2007;Qiu et al. 2011). This decrease in codon bias with increasing gene length may be a sideeffect of the economy of translation where highly expressed, highly biased genes tend to be short (Akashi 2003). Indeed, in group 0 gene length was correlated with both codon bias and mRNA expression levels, further supporting the action of translational selection on this group. However, in the other codon usage groups gene length was correlated with either codon bias (groups 1, 3, and 4) or expression (group 2), but not both. The negative correlation between gene length and codon usage may also be due to GC-biased gene conversion (Stoletzki 2011). Our results do not support an alternative prediction for translational accuracy where the cost of incorporation of erroneous amino acids during elongation will drive large genes to have higher codon bias than small genes (Eyre-Walker 1996) as observed in yeast (Eyre-Walker 1996;Moriyama and Powell 1998;Stoletzki and Eyre-Walker 2007).
In summary, we detected significant nonrandom patterns of synonymous codon usage in the purple sea urchin that were incompatible with the predicted effects of mutational bias. Five distinct clusters of genes were identified that used different sets of preferred codons. However, only one cluster exhibited a correlation between codon bias and mRNA expression consistent with translational selection. The third positions of preferred codons were more likely to be paired in stems in folded mRNA secondary structures in two additional codon usage groups, implicating selection for mRNA stability on synonymous codon usage. The discovery of five distinct codon usage groups in D. melanogaster was similar to that found for S. purpuratus, but the patterns of codon usage differed between species. However in both species only one cluster representing ,10% of all genes was associated with translational selection. Our results suggest that other forms of selection might be more important in determining the overall patterns of synonymous codon usage and that similar codon usage clusters might be present in other groups.