Genomic Sequence Diversity and Population Structure of Saccharomyces cerevisiae Assessed by RAD-seq

The budding yeast Saccharomyces cerevisiae is important for human food production and as a model organism for biological research. The genetic diversity contained in the global population of yeast strains represents a valuable resource for a number of fields, including genetics, bioengineering, and studies of evolution and population structure. Here, we apply a multiplexed, reduced genome sequencing strategy (restriction site−associated sequencing or RAD-seq) to genotype a large collection of S. cerevisiae strains isolated from a wide range of geographical locations and environmental niches. The method permits the sequencing of the same 1% of all genomes, producing a multiple sequence alignment of 116,880 bases across 262 strains. We find diversity among these strains is principally organized by geography, with European, North American, Asian, and African/S. E. Asian populations defining the major axes of genetic variation. At a finer scale, small groups of strains from cacao, olives, and sake are defined by unique variants not present in other strains. One population, containing strains from a variety of fermentations, exhibits high levels of heterozygosity and a mixture of alleles from European and Asian populations, indicating an admixed origin for this group. We propose a model of geographic differentiation followed by human-associated admixture, primarily between European and Asian populations and more recently between European and North American populations. The large collection of genotyped yeast strains characterized here will provide a useful resource for the broad community of yeast researchers.

Interest in natural isolates has increased as it has become clear that many nonlaboratory strains (including those adapted to various food/industrial processes) have properties absent from the laboratory strains, such as the ability of several wine strains to ferment xylose (Wenger et al. 2010). The wider population of yeast strains represents a deep pool of naturally occurring sequence variation that has been leveraged to investigate the genetic architecture of polygenic traits (Swinnen et al. 2012). In addition, the polymorphisms that are observed in the global yeast population have been acted upon by evolution, making this set of sequences a powerful tool for investigating protein and regulatory sequence function as well as evolution (Nieduszynski and Liti 2011). Understanding the genetic diversity of yeast is therefore relevant to both the food/industrial roles of yeast and its role as a model organism in scientific research.
The question of the global population structure of S. cerevisiae is itself an ongoing topic of research. In several publications in the past few years, investigators have explored the genetic diversity and population structure of yeast by using techniques such as multigene sequencing (Fay and Benavides 2005; Aa et al. 2006; Ramazzotti et al. 2012; Stefanini et al. 2012; Wang et al. 2012), whole-genome sequencing (WGS; Liti et al. 2009), tiling array hybridization (Schacherer et al. 2009), and microsatellite comparisons (Legras et al. 2005, 2007; Ezov et al. 2006; Goddard et al. 2010; Schuller et al. 2012). These studies demonstrated that S. cerevisiae is not a purely domesticated organism but can be isolated from a variety of natural environments around the globe. Although there appears to be some clustering of yeast genotypes by geography (Liti et al. 2009), it also appears that yeast involved in particular human food-industrial processes often are genetically similar to one another. For example, wine strains isolated from around the world display a very high degree of sequence similarity (Fay and Benavides 2005; Legras et al. 2007; Liti et al. 2009; Schacherer et al. 2009). Unfortunately, several of the most diverged groups identified in these studies were represented by relatively small numbers of strains, suggesting that analysis of additional strains might help clarify the structure of global yeast diversity.
WGS of a large, diverse set of individuals is the most comprehensive approach to exploring the population structure and genetic diversity of an organism. However, despite the decreasing costs of DNA sequencing, complete genome sequencing of several hundred yeast strains is still a significant expense. In contrast, methods that compare strains by genotyping relatively small numbers of loci, such as microsatellites or a small number of genes (Fay and Benavides 2005), are less expensive, but the results may not reflect the relationships between strains genome-wide. A genome reduction strategy referred to as restriction site-associated sequencing (RAD-seq; Miller et al. 2007; Baird et al. 2008) directs sequence reads to genomic locations adjacent to particular restriction sites. Because most restriction sites are common across strains of the same species, nearly the same subset of every genome is sequenced. Thus, RAD-seq permits the genotyping of a set of strains across a large number of positions scattered across the genome at modest cost.
In this work, we apply a multiplexed RAD-seq reduced genome sequencing strategy to explore genetic diversity and population structure in S. cerevisiae. Using this approach we sequenced more than 200 strains over ~1% of the yeast genome. The strains include multiple representatives from six continents and 38 different countries and were isolated from disparate sources, including fruits, insects, plants, soil, and a variety of human fermentations, such as ragi, togwa, cacao, and olives. From analysis of the resulting multiple alignment, we observed a clear geographical stratification of strains along with evidence of admixture between populations and human-associated strain dispersal.

MATERIALS AND METHODS
The S. cerevisiae strains analyzed in this study were obtained from a variety of sources, including the Phaff Yeast Culture Collection (http://phaffcollection.ucdavis.edu), the Agricultural Research Service (NRRL) Culture Collection (http://nrrl.ncaur.usda.gov/), published strains from individual laboratories or our own isolates from wild or domesticated sources. Details, including references and information about strain requests, are included in Supporting Information, Table S1. While analyzing the data, we came across a small number of anomalies, such as two dissimilar genome sequences for strain 322134S. These are likely to represent errors in strain labeling.

Yeast isolation
Soil, bark and leaves, or food samples were bathed in medium consisting of 2 g/L Yeast Nitrogen Base without Amino Acids (Difco, BD), 5 g/L ammonium sulfate, and 80 g/L glucose. Chloramphenicol (30 mg/mL) and carbenicillin (50 mg/mL) were added to the medium to suppress bacterial growth, and cultures were incubated at 30°. When necessary to suppress mold overgrowth, cultures were subcultured to liquid medium containing 1-5% ethanol. Cultures were examined by microscopy at 3 and 10 d, and those harboring budding yeast were plated onto CHROMagar Candida (DRG International, Inc.) and incubated at 30° for 3-5 d. CHROMagar Candida is a culture medium containing proprietary chromogenic substrates that can aid the identification of clinically important yeast (Odds and Bernaerts 1994). On CHROMagar Candida, S. cerevisiae colonies are known to range in hue from white to lavender to deep purple with most exhibiting the "purple" phenotype (C. L. Ludlow and A. M. Dudley, unpublished data; Boekhout and Robert 2003). Colonies exhibiting these color phenotypes were picked and saved for further study.

RAD-sequencing and alignment
A subset of strains were RAD-sequenced previously (Hyma and Fay 2013). For the rest, RAD-sequencing was performed as previously described (Lorenz and Cohen 2012; Hyma and Fay 2013). In summary, yeast genomic DNA was extracted in 96-well format and fragmented by restriction enzyme digestion with MfeI and MboI. P1 and P2 Adaptors were then ligated onto the fragments. The P1 adaptor contains the Illumina PCR Forward sequencing primer sequence followed by one of 48 unique 4-nucleotide barcodes and finally the MfeI overhang sequence. The P2 adaptor contains the Illumina PCR Reverse primer sequence followed by the MboI overhang sequence. After ligation, the barcoded ligation products were pooled, concentrated, and size selected on agarose gels, with fragments from 150 to 500 bp extracted from the gel. Gel-extracted DNA was further pooled to multiplex 48 uniquely barcoded samples in one sequencing library. The multiplexed DNA library was then enriched with a polymerase chain reaction using Illumina PCR Forward and Reverse primers. Sequencing runs were performed on the Genome Analyzer IIx (Illumina) for 40 bp single-end reads, with one library of 48 multiplexed samples per flow cell lane, yielding 20-40 million reads. The read sequences generated for this study are available at the Sequence Read Archive under accession ERP003504, and for the subset of strains that were RAD-sequenced previously, DRYAD entry doi:10.5061/dryad.g5jj6.
Multiple sequence alignments were generated by mapping reads to the S288c reference genome (chromosome accessions: NC_001133.8, NC_001134.7, NC_001135.4, NC_001136.8, NC_001137.2, NC_001138.4, NC_001139.8, NC_001140.5, NC_001141.1, NC_001142.7, NC_001143.7, NC_001144.4, NC_001145.2, NC_001146.6, NC_001147.5, NC_001148.3) and generating consensus reduced-genome sequences for each strain. The tagged reads were split into strain-pools by their 4 base prefix barcodes. Reads with Ns or with Phred quality scores less than 20 in the barcode sequence were removed. Any reads with more than 2 Ns outside the barcode also were removed. Reads were aligned to the S288c reference using BWA (version 0.5.8; Li and Durbin 2009), with six or fewer mismatches tolerated. Samtools (version 0.1.8) was then used to generate a pileup from the aligned reads using the "pileup" command and the "-c" parameter. Base calls were retained if they had a consensus quality greater than 20. Positions with root mean squared mapping qualities less than 15 and insertion/deletion polymorphisms were ignored. After filtering there was an average of 209,765 bp for each strain. Sequences from each strain were combined into a multiple sequence alignment via their common alignment to the S288c genome. Sites with more than 10% missing data were removed, resulting in a multiple sequence alignment of 116,880 bp.
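The read-level and site-level filters described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline used in the study: the function names, the representation of Phred scores as a list of integers, and the set of characters treated as missing data are all assumptions made for the example.

```python
BARCODE_LEN = 4  # 4-base prefix barcode, as described in the text


def keep_read(seq, phred, barcode_len=BARCODE_LEN, min_q=20, max_n=2):
    """Return True if a read passes the barcode and N-content filters.

    seq   : read sequence including the barcode prefix
    phred : per-base Phred scores as integers (hypothetical input format)
    """
    barcode = seq[:barcode_len].upper()
    # Reject reads with an N or a low-quality base inside the barcode.
    if "N" in barcode or any(q < min_q for q in phred[:barcode_len]):
        return False
    # Reject reads with more than max_n Ns outside the barcode.
    return seq[barcode_len:].upper().count("N") <= max_n


def filter_sites_by_missing(alignment, max_missing=0.10, missing="N-?"):
    """Drop alignment columns whose missing-data fraction exceeds max_missing.

    alignment : dict mapping strain name -> aligned sequence (equal lengths)
    Returns (kept_positions, filtered_alignment).
    """
    names = list(alignment)
    if not names:
        return [], {}
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all sequences must have the same length")
    kept = []
    for pos in range(length):
        n_missing = sum(alignment[n][pos].upper() in missing for n in names)
        if n_missing / len(names) <= max_missing:
            kept.append(pos)
    filtered = {n: "".join(alignment[n][p] for p in kept) for n in names}
    return kept, filtered
```

With the default `max_missing=0.10`, a column is kept only if at most 10% of strains lack a base call at that position, which mirrors the threshold stated in the text.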

WGS alignment
Previously generated WGS were incorporated into the RAD-seq dataset for population genetic analysis. For genomes with an S288c NCBI coordinate system, sequences were extracted directly based on S288c reference coordinates. For genomes using an alternative coordinate system (Saccharomyces Genome Resequencing Project, SGRP), blat was used to convert from the S288c NCBI reference coordinates to the alternative coordinate system prior to extracting sequences. For assembled genomes without S288c alignments, coordinates were obtained by blast. A fasta file of the S288c reference sequence was generated for each contiguous segment in the multiple sequence alignment. The resulting files were used to query each genome assembly using blast. When quality scores were available, sites with sequence quality less than 20 were converted to "N" before blasting or after sequence retrieval.

Duplicated strains
Some strains were sequenced by both WGS and RAD-seq. For duplicate strains with pairwise divergence less than 0.0005 substitutions per site, excluding singleton alleles (i.e., found in only 1 strain), only the RAD-seq data were retained for analysis. For duplicate strains that exceeded the threshold, both RAD-seq and WGS data were retained and strain names were labeled with an "r" and "g," respectively. Differences between the WGS and RAD-seq data could be a result of: (1) sequencing/alignment errors, (2) different monosporic clones from an originally heterozygous isolate, or (3) mislabeled strains. However, we were not able to confidently distinguish between these possibilities.
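One way to compute the duplicate-strain divergence described above is sketched below. The text does not specify exactly how singleton alleles were excluded, so this example makes an assumption: a site is skipped for a given pair if either strain carries an allele found in only one sequence of the full alignment. The function name and the set of missing-data characters are hypothetical.

```python
from collections import Counter

MISSING = set("N-?")


def divergence_excluding_singletons(seq_a, seq_b, alignment):
    """Substitutions per site between two sequences, ignoring singleton alleles.

    alignment : list of all aligned sequences (including seq_a and seq_b),
                used to count how many strains carry each allele.
    """
    compared = 0
    diffs = 0
    for pos, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a in MISSING or b in MISSING:
            continue  # pairwise deletion of missing data
        counts = Counter(s[pos] for s in alignment if s[pos] not in MISSING)
        if counts[a] < 2 or counts[b] < 2:
            continue  # assumption: skip sites where either allele is a singleton
        compared += 1
        if a != b:
            diffs += 1
    return diffs / compared if compared else float("nan")


# Hypothetical usage: keep only the RAD-seq copy when divergence < 0.0005.
alignment = ["ACGT", "ACGT", "ACGA"]
d = divergence_excluding_singletons(alignment[0], alignment[2], alignment)
retain_only_rad = d < 0.0005
```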

Population analysis
Neighbor-joining phylogenetic tree construction was carried out using MEGA (version 5.0; Tamura et al. 2011), based on P-distance with pairwise deletion. Population structure was inferred using InStruct (Gao et al. 2007). Because InStruct failed to converge using all sites, it was instead run on 759 sites with allele frequency greater than or equal to 10%. Polymorphic sites were made biallelic by treating third alleles as missing data. InStruct was run with the parameters "-u 40000 -b 20000 -t 10 -c 10 -sl 0.95 -a 0 -g 1 -r 1000 -p 2 -v 2" with K (number of populations) ranging from 3 to 15. Although the lowest deviance information criterion (DIC) was obtained from a chain with K = 15, there was substantial variation among independent chains. We chose K = 9 as the optimal model to work with based on the average DIC for K = 10 being nearly identical to that of K = 9 and subsequent decreases in DIC for larger values of K being small compared with the standard deviation in DIC among chains (Table S3). Consensus population assignments for K = 8, 9, and 10 were obtained for the five chains with the highest likelihood using CLUMPP (version 1.1.2; Jakobsson and Rosenberg 2007) with parameters "-m 3 -w 0 -s 2" and with greedy option = 2 and repeats = 10,000. The similarity among the five chains (H') was 0.995 for K = 9, very close to the maximum similarity of 1.0. Compared with K = 9, populations 6 (African, S. E. Asia/Palm, Cacao, Fruit) and 7 (Israel/Soil) were merged for K = 8, and a new population was inferred within populations 3 (Asian/Food, Drink) and 6 (African, S. E. Asia/Palm, Cacao, Fruit) for K = 10 (Figure S1). A second InStruct analysis was performed by the use of a pruned dataset to better conform to InStruct's assumption of independence among markers. We eliminated SNPs within 5 kb of another SNP based on the decline in r² as a function of distance between SNPs (Figure S2).
The pruned dataset contains 495 SNPs and an average distance between SNPs of 22.4 kb compared with 14.9 kb in the full set of 759 SNPs. In comparison with the full dataset, the pruned data also had an optimum of 9 populations but with more variance among runs as indicated by H' (Table S3). The similarity (H') between the CLUMPP consensus of the full and pruned dataset was 0.90. Seven strains showed population admixture proportions that changed by more than 0.25 for any population. The seven strains are all Israeli strains in the Israeli population (#7) and showed an increase in admixture with European (#8) and Human (#4) populations in the pruned analysis. Most of the strains (211) showed no changes in admixture proportions greater than 0.125.
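The distance-based pruning step can be illustrated with a simple greedy thinning pass. The text states only that SNPs within 5 kb of another SNP were eliminated; the greedy left-to-right rule below (keep a SNP if it lies at least 5 kb from the last SNP kept on the same chromosome) is an assumption about how that might be done, and the function and argument names are hypothetical.

```python
def prune_snps(positions_by_chrom, min_distance=5000):
    """Greedy thinning: keep SNPs at least min_distance bp apart per chromosome.

    positions_by_chrom : dict mapping chromosome name -> iterable of SNP
                         coordinates (bp)
    Returns a dict with the retained coordinates, sorted, per chromosome.
    """
    pruned = {}
    for chrom, positions in positions_by_chrom.items():
        kept = []
        for pos in sorted(positions):
            # Keep the first SNP, then any SNP far enough from the last kept one.
            if not kept or pos - kept[-1] >= min_distance:
                kept.append(pos)
        pruned[chrom] = kept
    return pruned


# Hypothetical coordinates for illustration only.
example = prune_snps({"chrI": [100, 3000, 5100, 9000, 20000]})
```

A different rule, such as dropping every SNP that has any neighbor within 5 kb, would retain fewer sites; which variant was applied in the study is not stated.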
Multidimensional scaling was performed on all 5868 sites and 262 strains using the identity by state distance between each pair of strains and the "cmdscale" function in R with three dimensions. Hierarchical clustering of either sites or strains was performed using the "hclust" function in R with complete linkage and the Euclidean distance of identity by state.
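For readers who do not work in R, the multidimensional scaling step can be approximated with NumPy. The sketch below is not the authors' code (they used R's "cmdscale"); it implements classical (Torgerson) scaling, which is the method that function performs. The genotype coding (0, 1, 2 copies of an allele, with NaN for missing calls) and the definition of identity-by-state distance as the mean absolute genotype difference divided by 2 are assumptions made for illustration.

```python
import numpy as np


def ibs_distance(genotypes):
    """Pairwise identity-by-state distance between strains.

    genotypes : array-like, strains x sites, coded 0/1/2 with NaN for missing.
    Distance is the mean |g_i - g_j| / 2 over sites called in both strains.
    """
    g = np.asarray(genotypes, dtype=float)
    n = g.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            both = ~np.isnan(g[i]) & ~np.isnan(g[j])
            d = np.abs(g[i, both] - g[j, both]).mean() / 2.0 if both.any() else np.nan
            dist[i, j] = dist[j, i] = d
    return dist


def classical_mds(dist, k=3):
    """Classical MDS: eigendecomposition of the double-centered squared distances."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (dist ** 2) @ centering
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:k]        # largest eigenvalues first
    top_vals = np.clip(vals[order], 0, None)  # guard against tiny negatives
    return vecs[:, order] * np.sqrt(top_vals)
```

Calling `classical_mds(ibs_distance(genotypes), k=3)` returns one row of three coordinates per strain, corresponding to the three dimensions used in the analysis.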

RESULTS
In an effort to expand the number and diversity of characterized S. cerevisiae strains available to the yeast community, we assembled and characterized a collection of >200 strains (Materials and Methods and Table S1). This strain set covers a diverse range of ecological niches and geographical locations, including strains used in previous studies of yeast global and local population structure (Fay and Benavides 2005; Ezov et al. 2006; Liti et al. 2009; Schacherer et al. 2009; Goddard et al. 2010) and strains with published WGS data. We sequenced each of these strains using a RAD-seq strategy to produce an initial multiple alignment (Materials and Methods). Strains with published WGS data were then added to the alignment to facilitate comparison between the results generated using WGS and RAD-seq data (Materials and Methods). The final dataset contained 262 strains genotyped across 116,880 base positions, of which 5868 sites were polymorphic (File S1).

Genetic relationship among strains
To visualize the phylogenetic relationships between the strains, we generated a neighbor-joining tree from the reduced genome multiple alignment (Figure 1 and File S2). The tree agrees well with the geographic origins of the strains and, for the subset of strains in common, is also consistent with a previous study that used WGS (Liti et al. 2009). To more directly compare our results to those obtained using WGS, we constructed a phylogenetic tree for only the subset of strains (38) analyzed in both our study and the previous whole-genome analysis (Figure S3). The structure of the resulting tree is very similar to that produced in the previous study (Figure 1C in Liti et al. 2009) and shows the same clustering of "Wine," "West African," "Malaysian," "Sake," and "North American" strains. Similarly, using our full dataset, these groups are found in clear and well-separated clusters on our tree (Figure 1). We also identified a small isolated cluster of strains from Ghana involved in cacao fermentation and another discrete cluster of strains from the Philippines.
A clear exception to this geographical stratification is the dispersal of European/wine strains around the globe, a result that is also consistent with the previous study (Liti et al. 2009). We identified two clusters of strains that appear closely related to the European/wine cluster, one isolated from European olives and another consisting primarily of a collection of environmental isolates from New Zealand (Goddard et al. 2010). Results for this second group are consistent with the hypothesis that the strains largely reflect a population brought to New Zealand as a consequence of European settlement. Together with the main "European/Wine" cluster, these two groups of strains appear to identify a "greater-European" region of the tree.
Strains isolated from North America fell into two highly diverged regions of the tree. One set of strains (Figure 1, "North America Wild") defines a cluster of strains almost universally isolated from North America (largely environmental samples from soil and vegetation). The second set is genetically similar to the European/wine strains, with strains scattered within the main European/wine cluster and related groups (Figure 1 and Table S1). There are also a small number of strains isolated from North American environments in the "New Zealand" cluster. As previously observed (Hyma and Fay 2013), North American strains isolated from even the same locale (e.g., a single vineyard) split into subsets from both the North American Wild cluster and greater-European regions of the tree. These results are consistent with the assertion that in many locations across North America (particularly vineyards), a native population of yeast strains coexists sympatrically with a population introduced by European settlement (Hyma and Fay 2013).
Another instance in which highly diverged strains were isolated from a single small geographical location is provided by the set of strains isolated from "Evolution Canyon", a well-studied location in Mount Carmel National Park of Israel (Ezov et al. 2006). These strains fell into one large and two smaller clusters on the tree (Figure 1; Israel 1, Israel 2, and a third cluster within a diverse set of strains labeled "Mixed"). The genomic diversity of these strains is remarkable, given that they were collected within a few hundred meters of each other.

Strains widely used in the laboratory
Included in the multiple alignment and phylogenetic tree is a group of seven strains widely used in the laboratory (S288c, W303, RM11-1a, FL100, Sigma 1278b, SK1, Y55), several of which are known to be closely related (Winzeler et al. 2003). The strains SK1 and Y55 are closely related to the West African cluster while S288c, FL100 and W303 are related and close to the European/Wine cluster. The position of these strains on the tree agrees with two previous studies (Liti et al. 2009; Schacherer et al. 2009), both of which described the limited sequence diversity of the lab strains. For example, none of the commonly used lab strains are derived from certain major populations, including the Asian group and the North America Wild group (Figure 1). Together, these results suggest that the total sequence diversity of the yeast global population is poorly sampled by this set of strains in common laboratory use.
To compare the total sequence diversity captured by the full set of 262 strains relative to that present in the subset of laboratory strains, we analyzed all alleles (defined as single base pair polymorphisms) that occurred in more than one strain. Alleles found in only one strain (singletons) were ignored to reduce the effect of sequencing errors, as were heterozygous calls. The results show a total of 3321 polymorphic loci with 6680 total alleles (3283 biallelic, 38 triallelic, and 0 tetraallelic positions). Only 1703 of these 6680 alleles were observed in the set of lab strains, and thus the set of strains assembled in our panel represents a significant increase (~4-fold) in sequence diversity over the set of laboratory strains.

Figure 1 Neighbor-joining tree of the 262 S. cerevisiae strains based on multiple alignments of 116,880 bases. Branch lengths are proportional to sequence divergence measured as P-distance. Scale bar indicates 5 polymorphisms/10 kb of sequence. Geographical and environmental clusters of strains are named and are indicated by black-outlined/gray-filled ovals. Colored ovals with numbering refer to strain populations identified in Figure 2. Seven strains widely used in the laboratory are labeled.
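The allele tally described above (counting only alleles seen in more than one strain and ignoring heterozygous calls) can be sketched as follows. The input format, one string per alignment column with one character per strain, and the function name are assumptions for the example; heterozygous calls are assumed to be encoded as IUPAC ambiguity codes and are therefore skipped along with missing data.

```python
from collections import Counter


def tally_shared_alleles(columns):
    """Count polymorphic loci and alleles, ignoring singletons.

    columns : iterable of strings, one per alignment column, one char per strain.
    Only unambiguous A/C/G/T calls are counted, so heterozygous (IUPAC) calls
    and missing data are ignored.
    Returns (n_loci, n_alleles, loci_by_allele_count).
    """
    n_loci = 0
    n_alleles = 0
    by_class = Counter()
    for col in columns:
        counts = Counter(c for c in col.upper() if c in "ACGT")
        shared = [allele for allele, n in counts.items() if n > 1]
        if len(shared) >= 2:  # polymorphic after dropping singletons
            n_loci += 1
            n_alleles += len(shared)
            by_class[len(shared)] += 1
    return n_loci, n_alleles, dict(by_class)


# Toy columns: monomorphic, biallelic, singleton-only, triallelic, mostly missing.
result = tally_shared_alleles(["AAAA", "AACC", "AACG", "AACCGG", "AAN?"])
```

The reported totals are internally consistent with this way of counting: 3283 × 2 + 38 × 3 = 6680 alleles over 3283 + 38 = 3321 loci, and 6680 / 1703 is approximately 3.9.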

Population structure
The infrequent sexual cycle of S. cerevisiae, combined with its high rate of self-mating, promotes the establishment of strong population structure and enables clonal expansion of admixed populations. To infer population structure and admixture between populations while accounting for selfing, we applied a Monte Carlo Markov chain algorithm, InStruct (Gao et al. 2007), to the 759 sites with an allele frequency of 10% or more. On the basis of the deviance information criterion, we inferred the most likely number of populations to be nine (Materials and Methods) and labeled each population by the most common geographic location and/or substrate from which the strains were originally isolated (Table S2). The relevant genotypes of each strain along with their inferred population ancestry are shown in Figure 2 and Table S1. The nine populations consist of two North American oak populations, an Asian food and drink population, a European wine and olive population, an African/S. E. Asian population, a New Zealand population, an Israeli population, and two populations associated with industrial/food processes. These populations match well with the major groupings seen on the phylogenetic tree, with the two North American populations identified by InStruct corresponding to the "North America Wild" grouping (Figure 1 and Figure S1). It is notable that these two subdivisions do not reflect a clear geographic pattern within North America (Figure 2 and Table S1). The New Zealand population clearly shares many alleles with the European strains, but harbors a small number of sites that make it unique. One of the two human-associated groups contains the majority of laboratory strains, emphasizing the uneven sampling of yeast populations represented by the set of laboratory strains.

Admixture
For each population, strains were observed with high levels of ancestry to that population. However, 38% of strains showed appreciable levels of admixture, defined as less than 80% ancestry from a single population. To assess the overall coincidence of mixture between pairs of populations we tabulated the number of strains with at least 20% ancestry from each pair of populations (Figure 3). Most admixed strains involved the European, Asian or African populations. However, not all pairs of populations were equally likely to admix. Admixture was detected between the European population and the first North American (InStruct #1), but not the second North American (InStruct #2) populations. More generally, admixture with the two North American populations was largely restricted to the African and European populations or to admixture between the two populations themselves. Like the European population, the Asian population showed admixture with most other groups. The two human-associated populations were largely admixed with either the Asian or European populations. Finally, the New Zealand population only admixed with the European population, and the Israeli population was largely admixed with the Asian and one of the human-associated populations.

Figure 2 Clustered genotypes with inferred population structure and membership. Sites were clustered by complete hierarchical clustering by use of the Euclidean distance of allele sharing (identity by state). Strains were grouped by population structure and memberships inferred using InStruct. Minor alleles are shown in red, heterozygous sites in yellow, common alleles in black, and missing data are gray. Populations are labeled by the most common source and/or geographic location from which they were originally isolated.
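The two thresholds used in this section (less than 80% ancestry from any single population to call a strain admixed, and at least 20% ancestry from each of two populations to count a pair) translate directly into a small tabulation. The sketch below assumes ancestry proportions are available as one list per strain, as would be read from the InStruct/CLUMPP output; the function name and input layout are hypothetical, and populations are indexed from 0 rather than by the InStruct numbering.

```python
from collections import Counter
from itertools import combinations


def admixture_summary(ancestry, admixed_below=0.80, pair_min=0.20):
    """Identify admixed strains and count co-occurring ancestry pairs.

    ancestry : dict mapping strain name -> list of ancestry proportions,
               one per population (0-based index).
    Returns (admixed_strains, pair_counts), where pair_counts[(a, b)] is the
    number of strains with at least pair_min ancestry from both a and b.
    """
    admixed = [s for s, q in ancestry.items() if max(q) < admixed_below]
    pair_counts = Counter()
    for q in ancestry.values():
        present = [k for k, v in enumerate(q) if v >= pair_min]
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1
    return admixed, pair_counts


# Toy ancestry vectors for illustration only.
toy = {
    "s1": [0.90, 0.10, 0.00],
    "s2": [0.50, 0.50, 0.00],
    "s3": [0.40, 0.30, 0.30],
    "s4": [0.79, 0.21, 0.00],
}
admixed, pairs = admixture_summary(toy)
```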

Heterozygosity
Matings within or between populations can result in strains with a large proportion of heterozygous sites. Most strains in this study had zero or a relatively small number of such sites. These strains could be naturally occurring homozygotes, haploids, or converted to homozygous diploids, a standard practice in some laboratories. However, we did identify 65 strains with more than 20 heterozygous sites (Table S1). The two strains with the greatest number of heterozygous sites, DCM6 (n = 305) and DCM21 (n = 288), were isolated from cherry trees in North America and appear to be hybrids between the European and North American populations (Hyma and Fay 2013). Other strains with a large number of heterozygous sites (Table S1) also were isolated from fruit-related sources, including three from cacao fermentations, one from banana fruit, one from fruit juice, and one from a spontaneous grape juice fermentation. Across these 65 strains, 42 also exhibit notable admixture, defined by less than 80% ancestry from a single population. The proportion of heterozygous strains exhibiting appreciable admixture (65%) is significantly greater (Fisher's exact test, P = 1.5 × 10⁻⁴) than strains with little or no heterozygosity (38%), suggesting that heterozygosity was derived in part by admixture between populations. The proportion of admixture in heterozygous strains (71%) compared with strains with little or no heterozygosity (27%) is even more significant if strains with WGS are removed. Among the heterozygous strains, the greatest proportion of ancestry comes from one of the human-associated populations (#4, 31%), followed by the European (20%), Asian (17%) and African (14%) populations. To examine rates of heterozygosity across populations, we compared expected to observed heterozygosity within each population (Table S1).
Although most populations exhibit a deficit of observed compared to expected heterozygosity, the two human-associated populations show noticeably more heterozygosity than the other populations.
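The comparison of admixture between heterozygous and non-heterozygous strains is a test on a 2 × 2 table, which can be sketched in pure Python. This is an illustration rather than a reproduction of the reported analysis. The heterozygous row (42 admixed of 65) comes from the text, but the counts for the remaining strains are not given: the 75 admixed of 197 used below are back-calculated from the reported 38% and 262 − 65 strains, and should be read as an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Rows: heterozygous strains, other strains; columns: admixed, not admixed.
# The second row (75, 122) is an assumed back-calculation, not a reported count.
p_value = fisher_exact_two_sided(42, 23, 75, 122)
```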

Relatedness between populations
Whereas heterozygosity and admixture can provide information about strain ancestry, relatedness between populations can provide information about the history of entire populations, some of which may themselves be derived from historical admixture events. To examine relatedness between populations, we applied multidimensional scaling (Materials and Methods) to the entire dataset (Figure 4). The first principal coordinate differentiates the European population from the other populations; the second principal coordinate distinguishes the two North American populations from the Asian population; and the third principal coordinate differentiates the African/S. E. Asian population from the others. The remaining populations and most of the admixed strains lie between these four major groups (Figure 4). Consistent with their positions on the neighbor-joining tree (Figure 1) and their genotypes (Figure 2), the New Zealand and Israeli populations are most closely related to the European population, and the two human-associated populations lie between the European and Asian populations. These results, combined with its high rate of heterozygosity, also suggest that the first human-associated population (population #4) is a recently derived population originating from hybrids between the European and Asian populations.

Subpopulations
Low frequency alleles (<10%) can sometimes define subpopulations not captured by inference of population structure based on common alleles. Two-dimensional hierarchical clustering of the low frequency alleles identified a number of such subgroups (Figure 5 and Table S1). Whereas the average number of derived low frequency alleles shared between any two strains is 3.5, there are 86 strains that share at least 100 derived low frequency alleles with another strain. These subgroups include a previously described Malaysian/Bertram Palm population (Liti et al. 2009), but also groups of strains from Philippines/Nipa palm, togwa, olives, sake, and cacao. Although the number of strains in each group is small, the number of sites defining the groups is not. In support of these groups representing populations that have been isolated from other populations for some time, many of the rare variants that define these groups are not present in other strains but, in at least some cases, are variable within the subgroup. Interestingly, the subpopulations defined by the largest numbers of alleles are strains with primary membership to the African/S. E. Asian population, suggesting that there may be undiscovered subpopulations and diversity among strains of African or S. E. Asian origin.

Figure 3 Coincidence of admixture between pairs of populations. Each bar shows the number of strains with at least 20% ancestry from a reference population (bar labels) and 20% ancestry with another population (indicated by color in the legend). For comparison, gray filled circles show the number of strains with more than 80% ancestry from each population.

DISCUSSION
Genetic variation within S. cerevisiae has been shaped by a complex history, influenced by human-associated dispersal and admixture. Understanding this history and the resulting patterns of diversity is important for capturing and harnessing its fermentative capabilities as well as for quantitative and population genetics research. In this study, we used a reduced genome sequencing strategy to characterize the genetic diversity among a global sample of 262 strains isolated from a wide range of ecological habitats and environmental substrates. Our findings indicate that the major axes of differentiation correspond to broad geographic regions. In addition to previously described populations and patterns of differentiation (Fay and Benavides 2005; Legras et al. 2007; Liti et al. 2009; Schacherer et al. 2009; Goddard et al. 2010; Warringer et al. 2011), two new patterns indicative of human influence also emerge. First, we find a population represented by multiple human-associated strains that contains a mixture of European and Asian alleles. Second, we find human-associated subpopulations from togwa, olive, cacao, and sake fermentations that are defined by a unique set of variants not present elsewhere in the global population. Although inferences of population structure can depend on sampling, and our analysis does point to areas of uncertainty, the structure of S. cerevisiae described here is based on the largest collection of strains typed across the genome. This work also provides a foundation for studying the genetic underpinnings of complex traits, the origin and evolution of strains used by humans, and the relationships between such traits and population history.

Geographic differentiation
A major unanswered question in the study of yeast population structure has been the relative importance of geography vs. ecological niche. Although the strains in this study were isolated from many different ecological habitats, a number of lines of evidence suggest that the groups they form are defined better by geographic differentiation than by ecological niche. The two North American populations contain predominantly oak-associated strains, but they also contain strains from plants and insects. Similarly, the European population contains primarily vineyard-associated strains, but also contains a number of European soil and clinical isolates (Liti et al. 2009). The Asian population also includes strains isolated from multiple countries and several different habitats, including strains used in sake fermentation and several strains isolated from food. The Asian population shares many alleles with the North American populations, but is genetically distinct and includes only a handful of strains from outside of Asia. What is less clear is how this Asian population is related to a number of diverged lineages represented by strains from primeval and secondary forests in China (Wang et al. 2012).
In comparison with the European, North American and Asian populations, the African/S. E. Asian population is not as well defined. Most of the strains are inferred to have mixed ancestry, and the strains that are most representative of the population (>80% ancestry) combine previously separated populations (Liti et al. 2009) of West Africa and Malaysia, two populations that are also separated on our tree (Figure 1). Because the trees are consistent, the different results of the two population analyses could be a result of differences in the methods of analysis (e.g., Structure vs. InStruct), the larger number of strains used in this study, or the larger number of sites used by Liti et al. (2009).

Figure 4
Relatedness among strains and the inferred populations to which they belong. The first and second principal coordinates (A) and the first and third principal coordinates (B) obtained from multidimensional scaling. Each circle shows a strain with color indicating the population contributing the largest proportion of ancestry and size indicating the proportion of ancestry from that population (see key). Circles ringed in black show strains with more than 20 heterozygous sites. The first, second, and third coordinates explain 29%, 9.3%, and 3.9% of variation among strains, respectively.

Admixture
Evidence of admixture was seen in a large fraction of strains and in every population. Although admixture is most common among the European, African, and Asian populations (Figure 3), the smaller number of admixed strains from the North American and New Zealand populations may represent the more recent establishment of European strains in these locations or may be related to the frequency of mating in the oak tree or soil environment. Some of the admixed strains also exhibit high rates of heterozygosity, indicating a relatively recent mating between strains with different ancestries. Interestingly, many of the heterozygous strains were isolated from fruits or orchards, an observation that is consistent with the isolation of admixed (mosaic) strains from fruits and orchards in China (Wang et al. 2012).
Because yeast can grow asexually, entire populations can arise as a consequence of even rare admixture events. The two human-associated populations bear a strong signature of an admixed origin as they carry alleles from both European and Asian populations and lie between these two groups in the principal coordinate analysis (Figure 2 and Figure 4). Human-associated population #4 bears the additional signature of high rates of heterozygosity, implying relatively recent mating events in the origin of this group. In contrast, human-associated population #5 harbors fewer heterozygous strains, but also contains multiple laboratory strains (Sigma 1278b, FL100, W303, S288C, and FY4), some of which show mosaic patterns across their genome indicative of an admixed origin (Winzeler et al. 2003; Doniger et al. 2008; Liti et al. 2009).
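The heterozygosity signal discussed above amounts to a per-strain count of heterozygous sites. The following is a minimal sketch under stated assumptions: genotype calls are encoded as 0 (homozygous reference), 1 (heterozygous), 2 (homozygous alternate), or None (missing); the strain names are hypothetical; and the default cutoff of more than 20 heterozygous sites is borrowed from the Figure 4 legend rather than from any analysis code.

```python
def flag_heterozygous_strains(calls, min_het_sites=20):
    """Count heterozygous sites per strain and flag highly heterozygous strains.

    calls: dict mapping strain name -> list of genotype codes
           (0 = hom. reference, 1 = heterozygous, 2 = hom. alternate,
           None = missing). This encoding is an assumption of the sketch.
    min_het_sites: a strain is flagged when it has MORE than this many
           heterozygous sites.
    """
    het_counts = {
        strain: sum(1 for g in genotypes if g == 1)
        for strain, genotypes in calls.items()
    }
    flagged = sorted(s for s, n in het_counts.items() if n > min_het_sites)
    return het_counts, flagged


# Toy data with hypothetical strain names.
toy_calls = {
    "s1": [1] * 25 + [0] * 5,   # 25 heterozygous sites: flagged
    "s2": [0, 2, 1, None],      # 1 heterozygous site
    "s3": [1] * 20,             # exactly 20: not flagged (cutoff is strict)
}
het_counts, flagged = flag_heterozygous_strains(toy_calls)
```

A raw count depends on how many sites were successfully typed in each strain, so a per-site rate may be preferable when coverage differs widely among strains.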
The New Zealand and Israeli populations may also have an admixed origin. These two populations carry a large subset of the European alleles, similar to many of the admixed European strains, but also carry a small number of alleles present at high frequency in the North American or Asian populations. This pattern is consistent with New Zealand and Israeli populations being derived from an admixture event between the European and these other populations followed by clonal (or nearly clonal) expansion. However, the New Zealand and Israeli populations also carry a small number of alleles that are not present in either the North American or Asian populations (Figure 2). This raises the possibility that the New Zealand and Israeli populations were derived from admixture between the European and as yet undiscovered populations, or that, rather than being derived from an admixture event, they represent lineages with roots in an ancestral European population (similar to the "Olive" grouping). The diversity of strains sampled from Evolution Canyon in Israel is particularly notable. Of the 15 Israeli strains, seven define the nearly clonal Israeli population, three are assigned with 100% ancestry to the human-associated population #4, and four show comparable percentages of ancestry from the Asian, Israeli and human-associated (#5) populations.

Derived subpopulations
The use of common sites to infer population structure eliminates the detection of small populations defined by rare variants. With clustering based solely on rare variants, we identified a number of such subpopulations (Figure 5). Although many of these groups were isolated from human-associated fermentations, the number of strains is too small to clearly indicate whether they are related by geographic or environmental origin. For example, the olive strain group contains isolates from Spanish olives imported to Seattle and one from olives in Spain. Yet, this group does not contain two strains isolated from the brine of olives from Mexico and one from an olive tree in California. The two North American groups contain strains from different states, and the togwa and cacao strains were each sampled from the same country. Although some of these subpopulations may be the result of recently expanded clones, several of them are defined by sites that are variable within the subpopulation. This latter observation points to the establishment of small groups that have remained isolated due to either geographic or ecological barriers to gene flow.
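The distinction drawn above, between a recently expanded clone and a small group that has been isolated long enough to accumulate its own variation, can be made concrete with a short sketch. It assumes a 0/1 encoding (1 = rare derived allele) and hypothetical strain names, and it is not the two-dimensional hierarchical clustering shown in Figure 5; it only classifies sites for a group that has already been chosen.

```python
def classify_group_sites(genotypes, group):
    """Split a group's private sites into fixed and variable classes.

    genotypes: dict mapping strain name -> list of calls per site
               (1 = derived allele, 0 = ancestral, None = missing).
    group:     iterable of strain names forming the candidate subpopulation.

    Returns (private_fixed, private_variable): lists of site indices where
    the derived allele is absent outside the group and is carried by all
    (fixed) or only some (variable) of the typed group members.
    """
    group = set(group)
    n_sites = len(next(iter(genotypes.values())))
    private_fixed, private_variable = [], []

    for j in range(n_sites):
        inside = [genotypes[s][j] for s in group if genotypes[s][j] is not None]
        outside = [
            calls[j] for strain, calls in genotypes.items()
            if strain not in group and calls[j] is not None
        ]
        # Skip sites with no group data or with the derived allele seen elsewhere.
        if not inside or any(outside):
            continue
        carriers = sum(inside)
        if carriers == 0:
            continue
        if carriers == len(inside):
            private_fixed.append(j)
        else:
            private_variable.append(j)
    return private_fixed, private_variable


# Toy data with hypothetical strain names.
toy = {
    "olive1": [1, 1, 0, 0],
    "olive2": [1, 0, 0, 1],
    "wine1":  [0, 0, 1, 1],
    "wine2":  [0, 0, 0, 0],
}
fixed_sites, variable_sites = classify_group_sites(toy, ["olive1", "olive2"])
```

Under this reading, a group with many private sites but few or no variable ones looks like a recent clonal expansion, whereas a substantial number of private variable sites is what would be expected of a group that has remained isolated for some time.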

Prospects for future studies
As our understanding of S. cerevisiae population history increases, so does the need to incorporate such information into quantitative and population genetic studies. Our results highlight the complex relationships between strains and populations, but also characterize a set of strains and sequences that can be used by the community. Using WGS or a reduced genome sequencing strategy, such as the RAD-seq method used here, new strains can be readily placed in the context of global population structure. We anticipate that new genetic diversity will be discovered, particularly in Africa, for which we found less certain relationships and a number of derived subpopulations. Our results may also prove useful to studies of existing strains, either by controlling for population history in genome-wide association studies or by aiding the selection of strains for linkage analysis. In both cases, strain choice is an important consideration as the results can depend on what variation is captured and the structure of this variation across strains. Although many quantitative genetic studies have been based on crosses with laboratory strains, our results underscore the presence of additional variation that is available beyond those strains. Finally, the global diversity and increased variation uncovered by our study highlight the potential for identifying novel properties which could prove valuable to the improvement of existing strains or the engineering of new strains for use in industrial fermentations.

Figure 5
Subpopulations defined by clustering of low frequency alleles. Two-dimensional hierarchical clustering of low frequency sites and strains. InStruct assignments, from Figure 2, are shown on the left, clustered genotypes are shown in the middle, with minor alleles in red, heterozygous sites in yellow, common alleles in black, and missing data in gray. Selected subpopulations are labeled on the right.