Population Structure in a Comprehensive Genomic Data Set on Human Microsatellite Variation

Over the past two decades, microsatellite genotypes have provided the data for landmark studies of human population-genetic variation. However, the various microsatellite data sets have been prepared with different procedures and sets of markers, making it difficult to synthesize the available data for a comprehensive analysis. Here, we combine eight human population-genetic data sets at the 645 microsatellite loci they share, accounting for procedural differences in the production of the different data sets, to assemble a single data set containing 5795 individuals from 267 worldwide populations. We perform a systematic analysis of genetic relatedness, detecting 240 intra-population and 92 inter-population pairs of previously unidentified close relatives and proposing standardized subsets of unrelated individuals for use in future studies. We then augment the human data with a data set of 84 chimpanzees at the 246 loci they share with the human samples. Multidimensional scaling and neighbor-joining analyses of these data sets offer new insights into the structure of human populations and enable a comparison of genetic variation patterns in chimpanzees with those in humans. Our combined data sets are the largest of their kind reported to date and provide a resource for use in human population-genetic studies.


Figure S1
Intra-population allele-sharing for pairs of individuals in those populations in the African data set for which we inferred at least one relative pair. First- and second-degree intra-population relative pairs are reported in Tables S5 and S6.

Figure S3
Intra-population allele-sharing for pairs of individuals in the Pacific Islander data set (part 1). Symbols are as defined in Figure S1. First- and second-degree relative pairs are reported in Tables S10 and S11, respectively.

Figure S4
Intra-population allele-sharing for pairs of individuals in the Pacific Islander data set (part 2). Symbols colored black in the Nasioi population are relative pairs identified in Rosenberg [1]. Monozygotic, first-degree, and second-degree relative pairs are reported in Tables S9, S10, and S11, respectively.

First- and second-degree relative pairs are reported in Tables S10 and S11, respectively.

Inter-population allele-sharing for pairs of individuals in each of eight subsets that group populations by their geographic affiliation (Africa, the Middle East, Europe, Central/South Asia, East Asia, Oceania, and the Americas) or admixture status (Afro-European). Latino individuals were included in the Americas analysis, as they were genotyped concurrently with the Native American data set. First- and second-degree relative pairs in the Africa analysis are reported in Tables S13 and S14, respectively. Monozygotic, first-degree, and second-degree relative pairs in the Oceania analysis are reported in Tables S15, S16, and S17, respectively. Second-degree relative pairs in the Americas analysis are reported in Table S19.
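
The pairwise allele-sharing comparisons summarized in these captions can be illustrated with a minimal sketch. It assumes one simple definition, the proportion of shared alleles averaged over the loci at which both individuals are genotyped; the exact statistic used in the paper is not given in this section. The function name, the genotype encoding, and the missing-data code of -9 are all hypothetical choices made for illustration.

```python
from collections import Counter

MISSING = -9  # assumed missing-data code; not taken from the paper


def allele_sharing(ind_a, ind_b):
    """Mean proportion of alleles shared by two diploid individuals.

    Each individual is a list of (allele1, allele2) tuples, one per locus,
    with loci in the same order for both individuals. Loci at which either
    individual has a missing allele are skipped. Returns None if no locus
    can be compared.
    """
    total = 0.0
    n_loci = 0
    for geno_a, geno_b in zip(ind_a, ind_b):
        if MISSING in geno_a or MISSING in geno_b:
            continue
        # Multiset intersection counts shared allele copies (0, 1, or 2).
        shared = sum((Counter(geno_a) & Counter(geno_b)).values())
        total += shared / 2
        n_loci += 1
    return total / n_loci if n_loci else None


# Hypothetical genotypes (allele sizes in nt) at three loci.
person_1 = [(120, 124), (200, 200), (150, MISSING)]
person_2 = [(120, 120), (200, 200), (150, 152)]
print(allele_sharing(person_1, person_2))
```

Under this definition, pairs with unusually high sharing relative to the rest of their population would be the natural candidates for closer inspection as possible relatives.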

Genotype data sets
File S1 is a Zip archive containing the genotype data of the combined human data set of 5795 individuals and the combined human-chimpanzee data set of 5879 individuals in the format used by the Structure program, along with a list of individual memberships in the subsets described in the paper. File S1 is available for download at http://www.g3journal.org/lookup/suppl/doi:10.1534/g3.113.005728/-/DC1

Table S1
Allele size adjustments used to make the Pacific Islander data set comparable to the combined HGDP-CEPH, Native American, Latino, Jewish, Asian Indian, and CGP data set

ID in combined data set | ID in Pacific Islander data set | Amount added to genotypes in the Pacific Islander data set (c*)
a Friedlaender et al. [2] used an adjustment of -45 nt.
b This locus was not present in the list of adjusted loci reported by Friedlaender et al. [2].
c Friedlaender et al. [2] used an adjustment of -25 nt.
d Friedlaender et al. [2] used an adjustment of 27 nt.
e Friedlaender et al. [2] used an adjustment of -103 nt.
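
The adjustment described by this table amounts to adding a fixed per-locus amount to every allele size in one data set so that its allele calls line up with those of the combined data set. The following is a minimal sketch of that arithmetic, not the authors' pipeline: the locus IDs, the offsets, the genotype layout, and the missing-data code of -9 are hypothetical placeholders rather than values from the table.

```python
MISSING = -9  # assumed missing-data code; not taken from the paper


def adjust_allele_sizes(genotypes, offsets):
    """Return a copy of genotypes with a per-locus offset added to each allele.

    genotypes: dict mapping locus ID -> list of (allele1, allele2) tuples
    offsets:   dict mapping locus ID -> integer amount to add (in nt)
    Loci without an entry in offsets are left unchanged, and missing
    alleles are never shifted.
    """
    adjusted = {}
    for locus, pairs in genotypes.items():
        shift = offsets.get(locus, 0)
        adjusted[locus] = [
            tuple(a if a == MISSING else a + shift for a in pair)
            for pair in pairs
        ]
    return adjusted


# Hypothetical locus IDs and offsets, for illustration only.
genotypes = {
    "LOCUS_A": [(180, 184), (MISSING, MISSING)],
    "LOCUS_B": [(96, 100)],
}
offsets = {"LOCUS_A": -2}
print(adjust_allele_sizes(genotypes, offsets))
```

Returning a new dictionary rather than modifying the input keeps the original calls available for checking the adjusted sizes against the reference data set.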

Table S2
Allele size adjustments used to make the African data set comparable to the combined HGDP-CEPH, Native American, Latino, Jewish, Asian Indian, CGP, and Pacific Islander data set

ID in combined data set | ID in African data set | Amount added to genotypes in the African data set (c*)
a This locus was not present in the list of adjusted loci reported by Tishkoff et al. [3].
b Tishkoff et al. [3] used an adjustment of 27 nt.

Bedzan 71580 71584 FS R †
† Allele-sharing suggests this pair is a second-degree relative pair (Figure S1). However, in the RELPAIR analysis, the likelihood ratio statistic for all other relative types was <0.0001 for this pair. To be conservative, this pair was treated as a first-degree relative pair when creating the standardized subsets MS5547 and MS5435.

Figure S3). In the RELPAIR analysis, the likelihood ratio statistic for HS was the next highest after FS for this pair. To be conservative, this pair was treated as a first-degree relative pair when creating the standardized subsets MS5547 and MS5435.

Population IDs are the same as those used in Becquet et al. [12].