Imputation-Based Genomic Coverage Assessments of Current Human Genotyping Arrays

Microarray single-nucleotide polymorphism genotyping, combined with imputation of untyped variants, has been widely adopted as an efficient means to interrogate variation across the human genome. "Genomic coverage" is the total proportion of genomic variation captured by an array, either by direct observation or through an indirect means such as linkage disequilibrium or imputation. We have performed imputation-based genomic coverage assessments of eight current genotyping arrays that assay from ~0.3 to ~5 million variants. Coverage was determined separately in each of the four continental ancestry groups in the 1000 Genomes Project phase 1 release. We used the subset of 1000 Genomes variants present on each array to impute the remaining variants and assessed coverage based on the correlation between imputed and observed allelic dosages. More than 75% of common variants (minor allele frequency > 0.05) are covered by all arrays in all groups except for African ancestry, and up to ~90% in all ancestries for the highest density arrays. In contrast, less than 40% of less common variants (0.01 < minor allele frequency < 0.05) are covered by low density arrays in all ancestries, and 50–80% by high density arrays, depending on ancestry. We also calculated genome-wide power to detect variant-trait association in a case-control design, across varying sample sizes, effect sizes, and minor allele frequency ranges, and compared these array-based power estimates with a hypothetical array that would type all variants in 1000 Genomes. These imputation-based genomic coverage and power analyses are intended as a practical guide to researchers planning genetic studies.

KEYWORDS: genome-wide association study; genomic coverage; power; SNP microarrays

Microarray genotyping has been widely adopted as an efficient means to interrogate variation across the human genome. Such arrays are the technological foundation of genome-wide association studies (GWAS), which have uncovered numerous genetic associations for a wide variety of traits and diseases (Manolio 2010; Visscher et al. 2012). The arrays used in GWAS often are designed to genotype a maximally informative set of variants that capture, or "tag," a substantial proportion of genome-wide variation (Carlson et al. 2004). "Genomic coverage" refers to the total proportion of genomic variation captured by an array, either directly, via genotyping, or indirectly, through linkage disequilibrium (LD) with one or more genotyped variants (Barrett and Cardon 2006).
The determination of whether an untyped variant is "captured" by a genotyped variant can be based on one of two measures. The first measure is the maximum pairwise squared correlation (r²) between discrete allelic dosages at the untyped variant and any of the genotyped variants. Thus one way to describe genomic coverage is as the fraction of variants in a reference set having a maximum r² with array variants above a given threshold (e.g., r² greater than 0.8). Genomic coverage for the first generation of SNP microarrays typically was estimated using this pairwise LD paradigm and with the HapMap Project (Frazer et al. 2007) as the reference set (Barrett and Cardon 2006). A second measure for determining whether an untyped variant is captured by an array involves using array variants to impute, or predict, the genotype at the untyped variant. Imputation yields a different r² from the pairwise LD r² metric described previously. "Imputation r²" is the squared correlation between the actual (discrete) allelic dosage at a genomic variant and the imputed (continuous) allelic dosage, over a defined set of samples. For a variant with two alleles A and a, the imputed dosage is calculated as 2P(AA) + P(Aa), where P(AA) and P(Aa) are posterior genotype probabilities from imputation. Under this imputation paradigm, genomic coverage can be reported as the fraction of variants in a reference set with imputation r² above a given threshold (again, usually greater than 0.8) (Hoffmann et al. 2011b). The reference set of variants used to determine imputation-based genomic coverage then becomes the variants present in an imputation reference panel such as the 1000 Genomes Project. Imputation-based genomic coverage is especially relevant to current practice in association studies, in which untyped variants in the 1000 Genomes Project are routinely imputed from a scaffold of array genotypes and tested for association with traits of interest (Marchini and Howie 2010).
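As a concrete illustration, the imputed-dosage and imputation-r² definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the per-sample posterior probabilities are hypothetical, and real pipelines would read the posteriors from imputation software output.

```python
# Minimal sketch of the imputed dosage 2*P(AA) + P(Aa) and the
# imputation r^2 metric (squared Pearson correlation between the
# observed discrete dosage and the imputed continuous dosage).

def imputed_dosage(p_aa, p_het, p_AA):
    """Expected count of allele A given posterior probabilities
    P(aa), P(Aa), P(AA): 2*P(AA) + 1*P(Aa)."""
    return 2.0 * p_AA + p_het

def imputation_r2(observed, posteriors):
    """Squared correlation between observed dosages (0, 1, or 2 copies
    of A per sample) and imputed dosages over the same samples."""
    imputed = [imputed_dosage(*p) for p in posteriors]
    n = len(observed)
    mean_o = sum(observed) / n
    mean_i = sum(imputed) / n
    cov = sum((o - mean_o) * (i - mean_i) for o, i in zip(observed, imputed))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    if var_o == 0 or var_i == 0:
        return float("nan")  # monomorphic variant or constant imputation
    return cov * cov / (var_o * var_i)
```

With perfectly confident, correct posteriors the metric is 1; uncertainty or errors pull it below 1.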
An important application of r² coverage metrics is to estimate the power to detect an association when the causal locus is not observed directly. Pritchard and Przeworski (2001) found that a sample size of N for direct observation of genotypes at a causal locus gives approximately the same power as a sample size of N/r² for observed genotypes at a marker locus, where r² is the squared correlation between the discrete allele dosages at the causal and marker loci. This relationship has been used to estimate the genome-wide power to detect an association with various SNP arrays, using the maximum pairwise r² for each variant in a genome-wide reference set such as the HapMap or 1000 Genomes Projects (Lindquist et al. 2013). For genome-wide power, an average is taken over all variants in the reference set, using variant-specific allelic frequency and r² values (Jorgenson and Witte 2006). Here we show that the relationship between power and r² also holds when using imputation r², the squared correlation between the discrete allelic dosage at the causal locus and the imputed allelic dosage (a continuous variable), assuming a linear relationship. We also provide genome-wide power estimates with imputation r² values, estimated for each ancestry group in a 1000 Genomes Project reference set.
Genomic coverage and power have been previously described for a number of arrays, using either HapMap or 1000 Genomes Project pilot data (Lindquist et al. 2013) as reference sets and using maximum pairwise r². Updated coverage assessments are provided here using new arrays, a denser and more robust data release from the 1000 Genomes Project, and imputation-based r². New arrays have been developed in response to both the changing needs of the genetic research community and technological advances in genotyping. While the design of the first generation of arrays was informed primarily by the HapMap Project (Frazer et al. 2007), array content has since expanded to include variants cataloged by the 1000 Genomes Project and the Exome Sequencing Project (Tennessen et al. 2012), in addition to pharmacogenetic and expression quantitative trait loci markers.
Despite a growing focus on rare variants, the tagging, LD-based variant selection scheme is still of prime importance in array design. Many arrays still include a core, or "backbone," of GWAS markers selected using the same principles as the first generation of arrays. Each of the major commercial vendors has recently released (Affymetrix 2012; Illumina 2012) products with a foundation of approximately 240,000 GWAS tagging markers: the Illumina (www.illumina.com) Infinium HumanCore and the Affymetrix (www.affymetrix.com) Axiom Biobank. Although vendors frequently provide genomic coverage estimates in product documentation, the methods used are often different and/or not well-described, making it difficult to objectively compare across arrays. Here we have used a consistent, imputation-based approach to evaluate genomic coverage and power across eight different genotyping arrays. These analyses are intended to function as a practical guide to researchers who are planning genetic studies. The results may be used to compare the cost efficiency of commercially available arrays, to assess the power to detect trait associations using imputed allelic dosages, and to design custom arrays. They also provide an evaluation of the accuracy of imputation for different categories of minor allele frequency (MAF).

Genomic coverage
We assessed genomic coverage with the design outlined in Figure 1, using the publicly available 1000 Genomes Project phase 1 integrated variant set, in variant call format (VCF, available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/), comprising 1,092 samples from 14 populations (Table 1). For each array in turn, we selected 1000 Genomes Project variants from the VCF file having the same position as variants on the array. The array variants were first pre-phased with SHAPEIT2 software (Delaneau et al. 2013) and then used to impute the remaining 1000 Genomes variants with IMPUTE2 software (Howie et al. 2011). For these pre-phasing and imputation steps, the 1,092 samples were randomly divided into 10 batches (balancing across populations); for each batch, the remaining samples served as a worldwide imputation reference panel. After imputation, the ten batches were combined to calculate accuracy metrics within each ancestry group. This approach is similar to the leave-one-out strategy used by Lindquist et al. (2013) to estimate imputation-based genomic coverage.
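The population-balanced batching step can be sketched as follows. This is one plausible way to balance batches across populations (shuffle within each population, then deal samples out round-robin), not necessarily the exact procedure used in this study; the sample and population identifiers are hypothetical.

```python
import random
from collections import defaultdict

def assign_batches(sample_to_pop, n_batches=10, seed=1):
    """Randomly split samples into n_batches while balancing across
    populations: shuffle within each population, then deal the samples
    out to the batches round-robin."""
    rng = random.Random(seed)
    by_pop = defaultdict(list)
    for sample, pop in sample_to_pop.items():
        by_pop[pop].append(sample)
    batches = [[] for _ in range(n_batches)]
    i = 0
    for pop in sorted(by_pop):          # deterministic population order
        members = sorted(by_pop[pop])
        rng.shuffle(members)            # random order within population
        for s in members:
            batches[i % n_batches].append(s)
            i += 1
    return batches
```

Each batch then serves as the imputation target while all remaining samples form the reference panel.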
To assess imputation accuracy, and by extension genomic coverage, we compared imputed results at all the nonarray variants to observed genotypes from the initial VCF files. These comparisons were performed separately in the four different ancestry groups (African, "AFR"; American, "AMR"; Asian, "ASN"; and European, "EUR") and restricted to variants with at least two copies of the minor allele. For each imputed variant we calculated three metrics: (1) the squared correlation between observed and imputed allelic dosage, which we call "imputation r²"; (2) the concordance between observed and most likely imputed genotype, the "genotype concordance"; and (3) the concordance between observed and most likely imputed genotype when at least one of those two genotypes contains one or two copies of the minor allele, which we call "minor allele (MA) concordance." Array (observed) variants are included in these metrics summaries and are given imputation r², genotype concordance, and MA concordance values of 1. A more detailed account of data preparation, pre-phasing, imputation, and metrics calculation is available in Supporting Information, File S1.
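The two concordance metrics can be sketched as below. This is an illustrative reconstruction, assuming genotypes are coded as minor allele counts (0, 1, 2) and that posterior-probability ties resolve to the first (major homozygote) class; the inputs are hypothetical.

```python
def best_guess(p_aa, p_het, p_AA):
    """Most likely imputed genotype, coded as minor allele count,
    assuming A is the minor allele."""
    probs = [p_aa, p_het, p_AA]
    return probs.index(max(probs))

def concordances(observed, posteriors):
    """Overall genotype concordance and minor allele (MA) concordance.
    MA concordance counts only pairs in which the observed or the most
    likely imputed genotype carries at least one copy of the minor allele."""
    guesses = [best_guess(*p) for p in posteriors]
    agree = sum(o == g for o, g in zip(observed, guesses))
    overall = agree / len(observed)
    ma_pairs = [(o, g) for o, g in zip(observed, guesses) if o > 0 or g > 0]
    ma = (sum(o == g for o, g in ma_pairs) / len(ma_pairs)
          if ma_pairs else float("nan"))
    return overall, ma
```

Because correctly imputed major homozygotes are excluded, MA concordance is not inflated at low MAF the way overall concordance is.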
We focus on imputation r² as the primary coverage metric mainly due to its simple relationship to power, in addition to the following advantages: (1) precedent in the literature for evaluating both imputation accuracy (Howie et al. 2011, 2012; Delaneau et al. 2013) and array coverage (Hoffmann et al. 2011a,b; Lindquist et al. 2013); (2) less sensitivity to allele frequency than concordance; (3) similarity to information metrics commonly reported by imputation software (for a review, see Marchini and Howie 2010); and (4) incorporation of imputation uncertainty by using expected allelic dosage rather than most likely genotype. However, one important caveat is that r² has high variance at low MAF (Evangelou and Ioannidis 2013). Overall genotype concordance is also a widely used metric that is easily interpretable, although it ignores imputation uncertainty and is very sensitive to allele frequency, as low MAF variants may yield high concordance purely by chance (Lin et al. 2010). MA concordance helps correct for sensitivity to MAF by not counting correctly imputed major homozygous genotypes. Thus, although we have used imputation r² as our primary coverage metric and the basis for the power analyses, we also provide coverage in terms of MA concordance and genotype concordance to give downstream users the flexibility to focus on one of these alternative metrics.

Power analyses
The power to detect a variant-trait association was calculated for a case-control study under the assumption of an additive genetic model, using the following parameters, where A is the risk allele and a is the alternate allele: genotype relative risk (GRR, of Aa relative to aa) from 1.1 to 1.4 in increments of 0.1; disease prevalence (K) of 0.05; frequency of the risk allele (p), assumed in each case to be the minor allele; sample size (N cases and N controls) from N = 1000 to 10,000 in increments of 50; and a significance level of 5 × 10⁻⁸ (appropriate for genome-wide testing). The assumption of an additive model is based on how imputed dosages are used in association tests, and the range of GRR is based on a summary of effect estimates from genome-wide association studies (Lindquist et al. 2013). Power was calculated for an allelic association test (χ² with 1 degree of freedom) using the non-centrality parameter derived in the Appendix. Sample size was N for observed (array) variants and N/(imputation r²) for imputed variants. A genome-wide power estimate was obtained for each array and parameter set by averaging over all variants in a reference set, as indicated by Jorgenson and Witte (2006). Here, the reference set consists of 1000 Genomes variants with at least two copies of the minor allele (or a subset thereof). In principle, power would be calculated for each individual variant (using its individual MAF and r² values) and averaged over all variants in the set. In practice, to reduce computation, we binned all autosomal variants by a combination of MAF and r² values, calculated power for the mean values of MAF and r² within each bin, and took the mean over bins weighted by the fraction of variants in each bin. Binning for MAF was from 0 to 0.05 in increments of 0.005, from 0.05 to 0.10 in increments of 0.01, and from 0.1 to 0.5 in increments of 0.05. Binning for r² was from 0 to 1 in increments of 0.05.

RESULTS AND DISCUSSION
We have assessed imputation-based genomic coverage for eight commercially available genotyping arrays using the four ancestry groups in the 1000 Genomes Project phase 1 release. The number of assays and unique genomic positions assayed by each array are summarized in Table 2. Only those array variants also found in the 1000 Genomes phase 1 integrated variant set were able to inform the imputation and thus contribute to these genomic coverage estimates. To see how this fact may differentially impact arrays, in Table 2 we also report (1) the percent overlap with 1000 Genomes at any MAF and (2) the percent overlap with 1000 Genomes at the requisite MAF to be included in the imputation (i.e., at least two copies of the minor allele observed in any one of the four ancestry groups). As may be expected, the overlap with 1000 Genomes for arrays with a high proportion of rare and exome variants is less than for arrays with less content devoted to rare and exome variants. The possible impact of the differential 1000 Genomes overlap across arrays on the estimated genomic coverage appears to be small (see File S1). Figure 2 shows the fraction of variants passing an imputation r² threshold of 0.8, by MAF bin and ancestry group, whereas Figure 3 shows the mean MA concordance. Genomic coverage assessments commonly use the MAF groupings of > 0.01 and > 0.05, which we have shown in Figure 2 and Figure 3, C and D, respectively. All the data plotted in Figure 2 and Figure 3 are presented numerically in Table 3 and Table 4, which also include tabular summaries of mean imputation r² and genotype concordance and the counts of variants in each MAF bin, as this count differs by ancestry group. Line plots of mean imputation r² and mean genotype concordance are available in Figure S1 and Figure S2. Plots organized by ancestry panel rather than by metric are available for AFR in Figure S3, AMR in Figure S4, ASN in Figure S5, and EUR in Figure S6.
As one might expect, coverage generally improves with increasing array density, regardless of either ancestry group or MAF bin. Common variants (MAF > 0.05) are well covered by all the arrays in the AMR, ASN, and EUR ancestry groups. The percentage of common variants with imputation r² ≥ 0.8 is greater than 75% in these three ancestry groups, and up to ~90% in all four ancestry groups for the highest density arrays. The most dramatic increase in coverage occurs for common variants in the AFR group: the fraction of common variants passing the imputation r² threshold of 0.8 increases by almost 40% from the least to most dense array. This finding is likely explained by the genetic diversity and lower LD characteristic of present day African ancestry populations compared to the other ancestry groups (Wall et al. 2008).
Table 2 legend: Summaries of each array included in these genomic coverage analyses, restricted to assays mapped to chromosomes 1 through 22 and the non-pseudoautosomal portion of the X chromosome. For each array, the columns give (1) the company manufacturing the array; (2) the array name; (3) additional product information; (4) the total number of assays; (5) the total number of unique map locations represented by those assays; (6) the percent overlap with the 1000 Genomes phase 1 integrated variant set, at any frequency; and (7) the percent overlap with the 1000 Genomes phase 1 integrated variant set, with a minimum of two copies of the minor allele seen in at least one of the four ancestry groups. The counts in (5) are less than (4) when there is more than one assay/feature for a given genomic position. The percentages in (6) and (7) use unique positions, rather than unique assays, as the denominator. MAF, minor allele frequency. (a) Product information: for Illumina arrays, the number of samples (e.g., 4, 8, or 12), the version of the array (e.g., "v1"), and the version of the array manifest file (.bpm, e.g., "A" or "B"); for Affymetrix arrays, the NetAffx release number.

However, coverage at common variants eventually levels off with increasing array density, a phenomenon of diminishing returns previously observed by Barrett and Cardon (2006). The array density at
which this occurs differs by ancestry group. For AMR, ASN, and EUR, leveling off begins at the OmniExpress, whereas for AFR, coverage continues to improve up until the Omni2.5M. This plateau in coverage may be in part due to the limited size of the reference sample, and in part due to the underlying structure of the genome being such that some regions have insufficient LD to accurately impute variants, regardless of MAF. Furthermore, even with high density arrays, there may still be some regions that are sparsely covered and therefore difficult to impute. Also as expected, less common variation is not as well covered as common variation. The percentage of variants with MAF between 0.01 and 0.05 and an imputation r² ≥ 0.8 is substantially lower than for MAF greater than 0.05: less than 40% for low density arrays in all ancestries and 50–80% for high density arrays, depending on ancestry. In all ancestry groups there is some improvement in coverage of less common variants when moving from the Omni2.5M to the Omni5M array, most notably in EUR (61% vs. 77% with r² ≥ 0.8 for the Omni2.5M and Omni5M, respectively). Rare (MAF < 0.01) variants are generally not well imputed by either low or high density arrays.

Figure 2 legend: Fraction of variants passing an imputation r² threshold of 0.8, by MAF bin and ancestry group. The imputation r² metric plotted here is the squared correlation between imputed and observed allelic dosage in the samples comprising the ancestry group. The y-axis is the proportion of variants (imputed and observed) with imputation r² ≥ 0.8, restricted to variants with at least two copies of the minor allele in the given ancestry group. The x-axis position of each array corresponds to the number of unique positions assayed by that array (see Table 2, column 5). Thus, the order of the arrays on each axis is: HumanCore, HumanCore+Exome, Axiom Biobank, OmniExpress, Axiom World Array 4, Omni2.5M, Omni2.5M+Exome, and Omni5M.
Imputation accuracy for rare variants is greatest in the AMR ancestry group, which might be a consequence of the lower genetic diversity and greater LD in Native Americans than in other ancestry groups, due to a recent population bottleneck associated with human migration into the Americas (Wall et al. 2011). Imputation quality at rare variants may be improved by replacing pre-phased imputation with the traditional haplotype sampling approach, although the latter is more computationally intensive (Howie et al. 2012).

Figure 3 legend: Mean MA concordance, by MAF bin and ancestry group. The y-axis values are mean MA concordance in the samples comprising the given ancestry group. MA concordance is defined as the concordance (percent agreement) between observed and most likely imputed genotype, when at least one of those two genotypes contains one or two copies of the minor allele. Variants were restricted to those with at least two copies of the minor allele in the given ancestry group. The x-axis position of each array corresponds to the number of unique positions assayed by that array (see Table 2, column 5). Thus, the order of the arrays on each axis is: HumanCore, HumanCore+Exome, Axiom Biobank, OmniExpress, Axiom World Array 4, Omni2.5M, Omni2.5M+Exome, and Omni5M. (A) Variants with at least two copies of the minor allele and MAF ≤ 0.01; (B) 0.01 < MAF ≤ 0.05; (C) MAF > 0.01; (D) MAF > 0.05.
All coverage plots are essentially monotonic, with the exception of the transition from the OmniExpress to the Axiom World Array 4 in the AFR ancestry group. This array was designed for use in Latinos and involved preferential selection of variants in European and Native American ancestries (Hoffmann et al. 2011b), which might explain the lower coverage in the AFR group. We also note that the addition of the exome content has little effect on coverage and accuracy, particularly for the Omni2.5M. The HumanCore+Exome, however, shows slight improvement over the HumanCore alone for lower MAF variants, most likely because the addition of the exome content affords a greater relative increase in density for the HumanCore than for the Omni2.5M. Although exome content contributes little to overall genomic coverage, it clearly has additional value when directly observed, in that exomic variants are more likely to be functional than non-exomic variants. Furthermore, this value comes at relatively low additional cost.

Genome-wide power
Previous studies of genome-wide power using arrays have utilized a finding that the sample size required to achieve a given power is N/r² when using a tag marker compared with N when using the causal locus genotypes, where r² is the squared correlation between the discrete allelic dosages at each locus (Pritchard and Przeworski 2001). The Appendix shows that this relationship also holds under an additive genetic model when using a continuous imputed dosage rather than the discrete dosage of the causal locus. Therefore, we used imputation r² for estimating the genome-wide power to detect an association.
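The gist of this relationship can be sketched under a linear trait model. This sketch is introduced here for illustration (it is not reproduced from the Appendix) and treats the residual variance as approximately equal for the two regressions, which is reasonable for weak effects:

```latex
% Trait model with causal dosage G; D is the imputed dosage.
Y = \alpha + \beta G + \varepsilon, \qquad \operatorname{Var}(\varepsilon) = \sigma^2 .
% Regressing Y on D instead of G gives slope
b = \frac{\operatorname{Cov}(Y, D)}{\operatorname{Var}(D)}
  = \beta \, \frac{\operatorname{Cov}(G, D)}{\operatorname{Var}(D)} ,
% so the 1-df test statistic has noncentrality
\lambda_D = \frac{N b^2 \operatorname{Var}(D)}{\sigma^2}
          = \frac{N \beta^2 \operatorname{Cov}(G, D)^2}{\operatorname{Var}(D)\,\sigma^2}
          = r^2 \cdot \frac{N \beta^2 \operatorname{Var}(G)}{\sigma^2}
          = r^2 \lambda_G ,
% because r^2 = \operatorname{Cov}(G, D)^2 /
%               (\operatorname{Var}(G)\,\operatorname{Var}(D)).
% Replacing N by N / r^2 therefore restores the noncentrality, and hence
% the power, of the analysis that observes G directly.
```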
Genome-wide power estimates for GRR values of 1.2, 1.3, and 1.4 are displayed in Figure 4 for MAF > 0.05 (common variants), Figure 5 for 0.01 < MAF ≤ 0.05 (less common), Figure S7 for MAF > 0.01, and Figure S8 for all variants with at least two copies of the minor allele within an ancestry group. Figure S9 displays results for GRR = 1.1 for each of the four MAF categories. The power curves show a sigmoidal relationship between power and sample size for all arrays, including the hypothetical 1000 Genomes array ("1000 Genomes"), which is equivalent to direct observation of all 1000 Genomes Project variants passing the ancestry-specific MAF filters (i.e., at least two copies of the minor allele observed in the given ancestry group) and thus does not involve uncertainty introduced by imputation. Power is generally low for the range of parameters considered here (even for the "1000 Genomes" array), except at the upper end of genetic effect and sample sizes. This is particularly true for the less common variants, where power is generally < 50% even at the high end of the parameter space for the actual arrays, and up to ~60% for the "1000 Genomes" array.
Table 3 legend: Four metrics are presented for each array: fraction of variants with imputation r² ≥ 0.8; mean imputation r²; mean minor allele concordance ("MA conc"); and mean genotype concordance ("geno conc"); separately by ancestry group (AFR, AMR) and MAF bin. Note that the lower bound of MAF in the first MAF bin differs across ancestry groups. This is because we required at least two copies of the minor allele in each panel in order to contribute to these metrics summaries, and there are differing numbers of samples in each group. AFR, African; AMR, American; MAF, minor allele frequency.
The cost efficiency of achieving equal power can be compared among array choices using these power estimates. For example, for the variant set with MAF > 0.05 and GRR = 1.3, achieving 50% power requires N = 3000 for the Omni2.5M array and N = 4000 for the HumanCore array in the African ancestry group. A power of 80% requires a larger sample size difference (5300 vs. 8900). Generally, the array differences required to achieve a set power are notably larger in regions where the power curves plateau. One array is more cost-efficient than another when the ratio of cost per sample is less than the ratio of sample sizes required to achieve equal power. Of course, this evaluation assumes that the extra samples required are available and that the main interest of the study is to detect effects characterized by the parameters used in the power calculation. All of the data used to generate the power plots are available for download at http://jhir.library.jhu.edu/handle/1774.2/36508, so that users can select parameter sets appropriate to their project goals. We also provide the number of variants per MAF and r² combination within each ancestry group so that users can readily calculate power using parameters beyond the set provided here.
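That cost-efficiency rule amounts to comparing total genotyping cost at equal power. A trivial sketch, using the sample sizes quoted above; the per-sample prices here are purely hypothetical placeholders, not vendor list prices.

```python
def more_cost_efficient(cost_per_sample_a, n_a, cost_per_sample_b, n_b):
    """True when array A's total cost (price per sample times the sample
    size needed to reach the target power) is no greater than array B's.
    Equivalent to: cost_a / cost_b <= n_b / n_a."""
    return cost_per_sample_a * n_a <= cost_per_sample_b * n_b

# Hypothetical prices: a low density array at $60/sample needing N = 4000
# vs. a high density array at $200/sample needing N = 3000 for 50% power.
low_density_wins = more_cost_efficient(60, 4000, 200, 3000)
```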
As noted previously in a similar analysis by Lindquist et al. (2013), increasing sample size can benefit genome-wide power more than increasing array density, even up to full genotyping of the complete 1000 Genomes variant set. For example, with GRR = 1.3 and MAF > 0.05 in African ancestry, power is 26.7% for the "1000 Genomes" array with N = 2000, whereas doubling the sample size and using the HumanCore array brings power up to 50.2%. Lindquist et al. (2013) focused their analysis on variants with characteristics similar to those found in previous GWAS (odds ratios mainly in the range of 1.2–1.3 and MAF > 0.10). They concluded that only about one fifth of such variants have been detected with existing GWAS and that the potential for increasing this fraction is greater by increasing sample size than by increasing genomic coverage. Our results support this conclusion (using somewhat different methods and arrays), given the current situation where low density arrays such as the Illumina HumanCore and Affymetrix Biobank are much lower in cost than genomic sequencing or even high density arrays such as the Omni2.5M. However, we also note their caveat that increasing sample size significantly is not an option for many diseases.
For rare variants, power is low, even for complete coverage and very large sample sizes. Therefore, analysis of rare variants generally involves aggregation strategies to achieve reasonable power to detect an effect (Li and Leal 2008). The value of low density arrays in this context appears to be low because rare variants are not well imputed and aggregate tests are sensitive to genotype errors (Powers et al. 2011). Therefore, direct detection of rare variants using genomic sequencing, high density arrays and/or arrays supplemented with selected variants of interest (such as exome content) are likely to be required for making significant progress in detecting trait variation caused by rare variants.
Table 4 legend: Four metrics are presented for each array: fraction of variants with imputation r² ≥ 0.8; mean imputation r²; mean minor allele concordance ("MA conc"); and mean genotype concordance ("geno conc"); separately by ancestry group (ASN, EUR) and MAF bin. Note that the lower bound of MAF in the first MAF bin differs across ancestry groups. This is because we required at least two copies of the minor allele in each panel to contribute to these metrics summaries, and there are differing numbers of samples in each group. ASN, Asian; EUR, European; MAF, minor allele frequency.
In addition to planning experiments with existing arrays, our genomic coverage results can be used to design new genotyping arrays. For example, one might wish to design an array focused on specific candidate variants or regions, while also providing a backbone for genome-wide imputation, such as the existing HumanCore or Biobank variant set. In that case, one might begin with the backbone and then add candidate variants that are not already well imputed by the backbone variants in the target ancestry group(s). For this purpose, we also provide the individual imputation r² values for all 1000 Genomes variants from imputation using each of the arrays considered here (available at http://jhir.library.jhu.edu/handle/1774.2/36508).
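Using per-variant imputation r² values such as those provided above, the candidate-filtering step might look like this; the variant IDs and the 0.8 threshold are illustrative, and the function name is hypothetical.

```python
def custom_array_additions(candidate_ids, backbone_r2, threshold=0.8):
    """From a list of candidate variant IDs, keep those the backbone does
    not already impute well.  `backbone_r2` maps variant ID -> imputation
    r^2 from a backbone-only imputation; variants absent from the map are
    treated as not imputable (r^2 = 0) and are therefore kept."""
    return [v for v in candidate_ids if backbone_r2.get(v, 0.0) < threshold]
```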
In conclusion, these imputation-based genomic coverage and power analyses are intended as a practical guide to researchers planning genetic studies. Deciding among available arrays is often a delicate balance between scientific aims, monetary resources, and genotyping timeframe (i.e., which arrays are currently in use). Our analysis has the advantage of consistently applying the same approach to evaluate several different arrays from different manufacturers. There are, however, some limitations to this study design that may result in either over- or underestimated genomic coverage. First, the genomic coverage reported here might be overestimated by imputing subsets of 1000 Genomes samples using the remaining samples as reference, as the haplotypes are likely to match better than in real world applications, where the samples to be imputed are likely from different source populations than the reference. This advantage may be offset to some degree by using a smaller reference panel than is available for independent study populations, although this effect is likely to be small because the reference panel used here is approximately 90% of the full 1000 Genomes phase 1 reference panel. Second, coverage may be underestimated by having pre-phased each batch separately, as small sample sizes may negatively impact phasing accuracy (Browning and Browning 2011). An increased sample size during pre-phasing may disproportionately improve coverage in situations where phasing accuracy is already a challenge, such as rare variants and samples with high haplotype diversity (e.g., African ancestry).

Figure 4 legend: Genome-wide power estimates for GRR values of 1.2, 1.3, and 1.4, for common autosomal variants (MAF > 0.05). The array Omni2.5M+Exome is not shown in these plots because it is indistinguishable at this resolution from the Omni2.5M array. In the legend, "1000 Genomes" refers to a hypothetical array in which all variants in the 1000 Genomes dataset would be typed.
Finally, some of the arrays assay a substantial number of variants that are not in the 1000 Genomes phase 1 release (see Table 2). These omissions would tend to underestimate genomic coverage, although probably not significantly, since they are expected to consist mainly of rare variants, given the 1000 Genomes Project's identification of 98% of SNPs with frequency > 0.01 in related populations. Despite these caveats (discussed further in File S1), the 1000 Genomes data set is clearly the best resource available for evaluating complete genomic coverage in multiple ancestry groups. We expect the results presented here to be useful in planning studies with the current generation of human genotyping arrays.

Figure 5 legend: Genome-wide power estimates for GRR values of 1.2, 1.3, and 1.4, for less common autosomal variants (0.01 < MAF ≤ 0.05). The array Omni2.5M+Exome is not shown in these plots because it is indistinguishable at this resolution from the Omni2.5M array. In the legend, "1000 Genomes" refers to a hypothetical array in which all variants in the 1000 Genomes dataset would be typed.