## Abstract

Although ascertainment bias in single nucleotide polymorphisms is a well-known problem, it is generally accepted that microsatellites have mutation rates too high for bias to be a concern. Here, we analyze in detail the large set of microsatellites typed for the Human Genetic Diversity Panel (HGDP)-CEPH panel. We develop a novel framework based on rarefaction to compare heterozygosity across markers with different mutation rates. We find that, whereas di- and tri-nucleotides show similar patterns of within- and between-population heterozygosity, tetra-nucleotides are inconsistent with the other two motifs. In addition, di- and tri-nucleotides are consistent with 16 unbiased tetra-nucleotide markers, whereas the HPGP-CEPH tetra-nucleotides are significantly different. This discrepancy is due to the HGDP-CEPH tetra-nucleotides being too homogeneous across Eurasia, even after their slower mutation rate is taken into account by rarefying the other markers. The most likely explanation for this pattern is ascertainment bias. We strongly advocate the exclusion of tetra-nucleotides from future population genetics analysis of this dataset, and we argue that other microsatellite datasets should be investigated for the presence of bias using the approach outlined in this article.

The rapidly decreasing cost of high-throughput genotyping means that large datasets with both detailed genetic coverage and wide geographic scope are finally a reality (*e.g.*, Li *et al.* 2008; Novembre *et al.* 2008; Jakobsson *et al.* 2008). These datasets provide fantastic opportunities to investigate human historical demography, and they promise to allow us to unravel the relative role of mutation, drift, and selection in shaping human diversity. However, the sets of single nucleotide polymorphisms (SNP) typed by current technologies were originally chosen to represent diversity within panels of narrow geographic scope (Tishkoff and Verrelli 2003; Carlson *et al.* 2003) and thus suffer strongly of ascertainment bias (Rogers and Jorde 1996; Kuhner *et al.* 2000; Wakeley *et al.* 2001; Akey *et al.* 2003; Bustamante *et al.* 2005), making their use problematic when reconstructing the past of human populations at a worldwide level.

A key step in assessing the importance of ascertainment bias in affecting demographic reconstructions has been the comparison with similar analyses performed on microsatellites (also known as single-tandem repeats, STR). Ascertainment schemes have been argued to have little effect on these highly polymorphic markers (Bowcock *et al.* 1994; Rogers and Jorde 1996; Harpending and Rogers 2000), which are generally thought to provide an unbiased estimate of neutral variation; thus, they are used as a yardstick against which attempts to remove ascertainment bias from SNP are assessed. For example, when analyzing a dataset of 3024 SNPs from the Human Genetic Diversity Panel [HGDP-CEPH (Cann *et al.* 2002), arguably the most comprehensive dataset on human genetic diversity covering over 1000 individuals from 52 populations], Conrad *et al.* (2006) investigated how using differently sized windows to define haplotypes affected estimates of within-population heterozygosity and concluded that a window of 20 Kb provided the best estimates as it gave the tightest fit with similar estimates obtained from 783 STRs typed in the same populations. Later analyses of a larger number of SNPs from the same panel have kept comparing the STR data, validating a variety of approaches, such as investigations of isolation by distance using patterns of between-population diversity (*F*_{ST}) (Jakobsson *et al.* 2008), of decrease in within-population variability with increasing distance from Africa (Li *et al.* 2008), and of population clustering using STRUCTURE (Li *et al.* 2008; Jakobsson *et al.* 2008).

The assumption that microsatellites are free of bias has been questioned by two studies looking at the HGDP-CEPH data (Ray *et al.* 2005; Foll and Gaggiotti 2006). Romero *et al.* (2009) failed to find a bias when comparing the full set of 783 STRs used in the HGDP-CEPH dataset with a set of 16 unbiased STRs discovered in a multiethnic panel; however, the small number of markers in the unbiased set limits the power of their analysis. Here, we look in detail for signs of ascertainment bias in the 783 STRs used in the HGDP-CEPH panel (Rosenberg 2006). Specifically, we compare results obtained using different statistical approaches applied to di-, tri-, and tetra-nucleotides separately, as we would expect the effect of bias to depend on the different mutation rates found for markers with different motif lengths. As some inconsistencies in results using different motifs might be ascribed to their different sensitivity to demographic processes (such as bottlenecks) rather than ascertainment, we develop a metric that reflects the underlying gene genealogy, effectively providing a description of diversity that is unaffected by mutation rates. More specifically, we use a rarefaction framework to generate estimates of expected heterozygosity (both within and between populations) that are rescaled to a single reference mutation rate and, thus, are directly comparable from a statistical point of view. Using this novel framework, which removes the differential effect of demographic processes on markers with different mutation rates, we demonstrate that tetra-nucleotides are inconsistent with di- and tri-nucleotides, a likely sign of ascertainment bias.

## Materials and Methods

### Identifying and removing inconsistencies in STR genotype data

We first developed a simple approach to determine the motif length of microsatellite markers from their fragment lengths, providing an objective way of removing inconsistencies from the data prior to our analysis. All fragment lengths *L* at a given locus can be written in the form *L* = *L _{m} n* + γ, where

*L*is the length of the locus’s repeat unit,

_{m}*n*is the number of repeat units (plus an unknown offset from the flanking regions), and γ is an integer between zero and

*L*− 1 (the remainder of dividing

_{m}*L*by

*L*). For each locus, we are interested in obtaining a reliable estimate of

_{m}*L*and γ.

_{m}The analysis, which is repeated for each locus, consists of three steps. First, *L*_{m} is determined by finding the shift (by 2, 3, 4, or 5 nucleotides) that maximizes the relative overlap , where *f _{i}* is the frequency of allele

*i*in the sample. Second, we calculate the remainder γ from dividing each allele length

*L*by

*L*. Ideally, there should be a unique value of γ obtained from all lengths

_{m}*L*for a given locus. In reality, several loci have multiple values. Loci for which the most common value of γ exceeded 95% of the total count were cleaned by recoding individuals with at least one allele with an unusual offset as missing data for that locus (supporting information, Figure S1, A and B); loci with less than 95% estimates of γ with same value were excluded from future analysis as they deviate too strongly from the assumed model (Figure S1, C and D). For all clean loci, allele lengths were converted into repeat numbers as

*n*= (

*L*− γ)/

*L*. A Matlab implementation of this procedure is available in File S1.

_{m}### Statistical analysis of filtered genotype data

We investigated the results obtained for different motif lengths from two types of statistical approaches: hierarchical clustering with STRUCTURE v. 2.2.3, (Pritchard *et al.* 2000) and pairwise population differentiation estimated as *F*_{ST} [calculated in Matlab v7.11 from the expected within- and between-population heterozygosities weighted according to sample size (Nei 1978)]. We would expect both clustering and *F*_{ST} patterns to be relatively robust to differences in mutation rates between the markers with different motif lengths (Haasl and Payseur 2011).

### Mutation rarefaction

Demographic processes such as bottlenecks might be sampled differently by markers with different mutation rates. Even the normalization of between- by within-population variability used when computing pairwise *F*_{ST} will not fully account for this effect. To be able to statistically compare markers with different mutation rates, we would ideally need a metric that reflects the underlying gene genealogy, effectively providing a description of diversity that is unaffected by mutation rates. To this end, we have developed a rarefaction framework to generate estimates of expected heterozygosity (both within and between populations) that are rescaled to a single reference mutation rate and, thus, are directly comparable from a statistical point of view.

Heterozygosity, defined as the probability that two alleles sampled at random are identical, is a frequently used measure of genetic diversity. It is most commonly computed to estimate within-population diversity, but it can be equally applied to describe between-population differentiation. However, heterozygosity estimates from markers with different mutation rates cannot be compared directly and do not obey a simple scaling rule; how heterozygosity scales with mutation rate depends on the detailed shape of the underlying gene genealogy (and thus on the distribution of time to most recent common ancestor, TMRCA). To solve this problem, we analyze the more general question of how the distribution *p*(Δ) of pairwise differences in repeat count, Δ, depends on the mutation rate.

The principle of our method is best understood by considering a single microsatellite locus with mutation rate μ. What would *p*(Δ) be if we had a lower mutation rate, μ′? If we knew the underlying gene genealogy of locus in a sample, and the location of mutations on the gene genealogy, we could rarefy the mutations to the new mutation rate by randomly removing mutations such that on average a fraction μ′/μ of the mutations remains, and calculate the new *p*(Δ) from the resulting genetic variation. This principle is illustrated in Figure 1, A and B, and the effect of mutation rarefaction on the distribution *p*(Δ) is illustrated in Figure 1C.

In reality, we can observe p(Δ), but we do not know the underlying gene genealogy, how many mutations occurred, or where they are placed on the gene genealogy. Nevertheless, for large samples, it is possible to estimate the rarefied *p*(Δ) from the observed distribution by implicitly removing a given fraction of the underlying mutations. Consider the characteristic function of the distribution *p*(Δ), defined as

Using the standard Stepwise Mutation Model [SMM, Kimura and Ohta (1975, 1978)], we can express this function in terms of the (unknown) distribution of TMRCA for pairs of individuals. In the SMM, each mutation leads to the addition or removal of single repeat units, with equal probability. Assuming a mutation occurs independently with rate μ per generation, the difference Δ in repeat count for a given pair of individuals with *t* generations to the MRCA can be written as the sum of 2*t* independent random variables, which are −1, 0, or 1 with probability μ/2, 1 − μ, or μ/2, respectively. As the characteristic function of the sum of two independent variables is the product of their characteristic functions, and each variable here has the characteristic function , the characteristic function of Δ is . For a sample of more than two individuals, we average the characteristic function over the distribution *g _{t}* of pairwise time to MRCA in the sample:where is a generating function for

*g*. Solving for ω as a function of

_{t}*z*, one obtains

As the underlying distribution of time to the MRCA is assumed to be independent of the mutation rate (because of neutrality), this expression holds for any mutation rate. Hence, for mutation rate μ′, we have

Writing in terms of Δ gives an explicit expression for the rarefied distribution of Δ in terms of the original distribution: (1)

From the characteristic function, we can obtain the probability of observing difference Δ in repeat counts between individuals, *p*_{rarefied}(Δ), in the rarefied sample by taking an inverse Fourier transform: (2)

Figure 1C illustrates the effect of rarefaction on *p*(Δ). Finally, using this relation, we obtain the following scaling rule for heterozygosity:

Mathematically, this method only works when going from larger to smaller mutation rates. This makes intuitive sense, because for μ′ < μ, the method is removing information, but going in the opposite direction would correspond to increasing the amount of information in the sample. This type of extrapolation is so unstable that it is useless for all practical purposes.

Finally, as a sanity check, we verify that our scaling method obey the scaling relation , where *V* is the variance of the counts of repeat units, which is the standard method in SMM theory for calibrating the relative mutation rate of two markers [*e.g.*, see Zhivotovsky *et al.* (2003) and references therein]. It is straightforward to show that this relation holds for the rarefaction method, using the Taylor expansion of Equation 1 around ω = 0.

### Estimating mutation rates from mutation rarefaction

For the markers used in the HGDP-CEPH panel, only the mutation rate for di-nucleotides has been estimated directly from pedigree data [μ_{2} = 1.52 × 10^{−3}, Zhivotovsky *et al.* (2000)], whereas mutation rates for tri- and tetra-nucleotides have to be estimated indirectly (we do not consider penta-nucleotides, as there are too few of them). As the scaling rule for heterozygosity depends only on the ratio of mutation rates (*i.e.*, on the fraction of mutations kept in the rarefaction process), we can use it to find the best estimates of mutation rates in tri- and tetra- nucleotides.

For example, if we want to estimate the mutation rate for tri-nucleotides, we can find the optimal rarefaction factor, *k*_{23} (*i.e.*, μ_{3}/μ_{2}), that brings the within- and between-population heterozygosities of di-nucleotides closest to those of tri-nucleotides. The mutation rate estimate of tri-nucleotides is then μ_{3} = *k*_{23}μ_{2}. An additional estimate for tri-nucleotides can be obtained using tetra-nucleotides as a reference for scaling the mutation rate of di- and tri-nucleotides. Given the optimal rarefaction factors *k*_{24} (*i.e.*, μ_{4}/μ_{2}) and *k*_{34} (*i.e.*, μ_{4}/μ_{3}), we estimate the mutation rate of tri-nucleotides as μ_{3} = μ_{2}*k*_{24}/*k*_{34}.

### Trends in within- and between-population heterozygosity

To test whether the matrix of within- and between population heterozygosities of motifs with different lengths are consistent after scaling, we separately scaled di- and tri-nucleotide markers to match the HGDP-CEPH tetra-nucleotide markers (by minimizing mean-square difference of the heterozygosity matrices). To test whether the one-to-one relation can be significantly rejected, we estimated 95% confidence intervals of the regression lines (di- and tri-nucleotide heterozygosities *versus* tetra-nucleotide heterozygosities) from 10,000 bootstrap samples of the data using a Matlab script. We then repeated the same analysis comparing di-, tri- and tetra-nucleotide heterozygosities to the heterozygosities of the unbiased tetra-nucleotide markers, after scaling to this set of markers.

Finally, we analyzed the dependence of within-population expected heterozygosity and allelic richness [calculated using ADZE v.1.0 (Szpiech *et al.* 2008)] on the distance from sub-Saharan Africa [calculated as in (Manica *et al.* 2005)] for di-, tri-, and tetra-nucleotide markers. Especially, we analyzed the effect of subsetting the markers and scaling of heterozygosities on the cline in heterozygosity away from sub-Saharan Africa.

## Results

### Identifying and removing inconsistencies in STR genotype data

Despite our strict demands for accepting a locus, our approach gave 749 markers (out of 783) that could be used for analysis, subdivided into 54 di-, 166 tri-, 519 tetra-, and 10 penta-nucleotides (Table S1). The 10 penta-nucleotide markers were dropped from further analysis because they were deemed too few. The cleaned data from the 749 markers [as well a cleaned version of the 16 unbiased markers from Romero *et al.* (2009)] are available (File S1 and File S2). For those markers that had been previously classified based on the reference Human Genome sequence (Pemberton *et al.* 2009), Pemberton *et al.* and our classifications were in agreement, with the exception of five markers (Table S2, File S3, and File S4).

### The effect of motif length on STRUCTURE clusters and *F*_{ST} patterns

Analysis of di-, tri- and tetra-nucleotides with STRUCTURE yielded the same clusters irrespective of which motif length was used (Figure 2). However, when comparing *F*_{ST} patterns for these three categories of markers, we found a strong effect of marker length for *F*_{ST} patterns: di- and tri-nucleotides were in very good correspondence (Figure 3, A and B), but tetra-nucleotides showed much lower population differentiation than the shorter motifs (Figure 3C). To determine whether this discrepancy can be attributed to demographic processes being sampled differently by markers with dissimilar mutation rates, we apply our rarefaction approach to compute statistically comparable, rarefied heterozygosity estimates.

### Mutation rates from mutation rarefaction

To rarefy estimates of within- and between-population expected heterozygosity, we first use our scaling procedure to find the relative mutation rates of the different groups of markers (see *Materials and Methods*). Table 1 shows the result of estimating the mutation rates of di-, tri-, and tetra-nucleotides, as well as a set of 16 unbiased tetra-nucleotides from Romero *et al.* (2009). Each row corresponds to using a given reference set of markers, and the columns contain the mutation estimates for different sets. In addition to fitting the within- and between-population heterozygosities, *p*(0), we show mutation rates calculated by fitting the distribution of count differences *p*(Δ) and by scaling the variances (the approach used in previous articles). We find that the mutation rates obtained by rarefaction analysis are insensitive to which markers are used as reference and whether the fitting was based on the full distribution of differences, *p*(Δ), or on only the heterozygosities, *p*(0). The scaling of variance, however, yields a slightly lower mutation rate for the tri-nucleotide loci (0.65 × 10^{−3} *versus* 0.71 × 10^{−3} – 0.77 × 10^{−3}). There is also a mismatch for the mutation rate for the unbiased tetra-nucleotides (0.46 × 10^{−3} *versus* 0.23 × 10^{−3}), but the small number of markers makes these estimates somewhat unreliable. For comparison, we also show the mutation rates found by Zhivotovsky *et al.* (2003), which are based on a subset of the CEPH-HGDP markers obtained by removing markers with extreme variances.

### Trends in within- and between-population heterozygosity

Figure 4 shows pairwise comparisons of the rarefied within- and between-population average heterozygosities of HGDP-CEPH markers after fitting to the tetra-nucleotide markers (left panels) or unbiased tetra-nucleotide markers (right panels, together with lines indicating the one-to-one relationships and 95% confidence intervals of the regression lines. The di-, tri- and unbiased tetra-nucleotides are in good agreement (Figure 4, A, B, and D), whereas the HGDP-CEPH tetra-nucleotides are significantly different from all other groups (Figure 4, C, E, and F), consistent with patterns in *F*_{ST} (Figure 3).

This discrepancy seems to be due to populations in Eurasia being relatively homogeneous for tetra-nucleotides. This effect can be visualized by plotting the decline in within-population genetic diversity with increasing distance from sub-Saharan Africa [a pattern attributed to the effect of sequential founder events during the spread out of Africa by anatomically modern humans (Prugnolle *et al.* 2005; Ramachandran *et al.* 2005; Manica *et al.* 2007)]. Irrespective of whether we look at raw estimates of expected within-population heterozygosity (Figure 5A) or at rarefied estimates using the approach outlined in this article (Figure 5B), it is clear that, whereas di- and tri-nucleotides show a smooth decline of heterozygosity with increasing distance from Africa, the heterozygosity of tetra-nucleotides is much flatter across Eurasia (corresponding to approximately 5,000–15,000 km in Figure 5), with a steep decline evident only once we reach the Americas. A similar pattern holds for allelic richness (Figure S2).

To what extend do tetra-nucleotides affect previous estimates of relationship between within-population heterozygosity and distance from sub-Saharan Africa? To answer this question, we estimated the relationship for three different averages: (*i*) of the complete set of markers for the HGDP-CEPH panel (*i.e.*, pooling all di-, tri- and tetra-nucleotides and ignoring differences in mutation rates, as is commonly done when estimating within-population heterozygosity); (*ii*) of only the subset of di- and tri-nucleotides (*i.e.*, pooling only the markers that were consistent with each other, again ignoring mutation rates); and (*iii*) of an estimate of the latter after rarefying the di-nucleotide to the mutation rate of tri-nucleotides (*i.e.*, the correct way of combining markers with different mutation rates). As can be seen in Figure 6, the slope estimated for the full set of markers (−6.792·10^{−6} ± 4.277·10^{−7}) is significantly shallower than that estimated for the subset of di- and tri-nucleotides (−8.866·10^{−6} ± 4.584·10^{−7}; F_{1,98} = 10.94, *P* = 0.001) and for the rarefied dataset (−8.656·10^{−6} ± 4.498·10^{−7}; F_{1,98} = 9.02, *P* = 0.003). The is no difference in the slope estimates for subsets of di- and tri-nucleotides and for the rarefied dataset (F_{1,98} = 0.11, *P* = 0.744), but as one would expect, the intercept is significantly lower for the latter (7.820·10^{−01} ± 5.929·10^{−3} *versus* 7.958·10^{−1} ± 6.043·10^{−3}; F_{1,99} = 6.92, *P* = 0.010).

## Discussion

Our analyses clearly illustrates that, whereas di- and tri-nucleotides show patterns that are highly consistent with each other, the tetra-nucleotides originally typed for the HGDP-CEPH panel are inconsistent as they are too homogeneous in their individual distributions across Eurasia. This discrepancy cannot have arisen from demographic processes affecting markers with different mutation rates to different extents: rarefying diversity to the slowest mutation rate should always give equivalent estimates, as the underlying distribution of gene genealogies is the same irrespective of the marker of choice. Although this is true for di- and tri-nucleotides, for which both within and between-population heterozygosity estimates are equivalent after rarefaction, tetra-nucleotides remain inconsistent even when we consider rarefied estimates of heterozygosity.

The most likely explanation for this discrepancy seems to be some form of bias in the choice of markers. We know that tetra-nucleotide markers included in the HGDP-CEPH panel show significantly higher diversity than the unbiased markers developed by Romero *et al.* (2009). The patterns observed in the data would be consistent with a scenario where the HGDP-CEPH tetra-nucleotides were selected for high diversity from a panel mostly composed by Eurasians. It is difficult to say whether di- and tri-nucleotides were selected from a broader panel (and thus there was little or no bias in their selection process) or whether they escaped the effect of bias because of their higher mutation rates (*i.e.*, most markers have a diversity well above the threshold for selection). In any case, the discrepancy between different microsatellites of different motif lengths indicates that microsatellites are probably not as immune to ascertainment bias as suggested in the past (Rogers and Jorde 1996; Bowcock *et al.* 1994; Harpending and Rogers 2000).

It is clear that the choice of metrics extracted from the markers is important, as some are more affected by bias than others are. In contrast to the heterozygosity-based measures, we found STRUCTURE clusters to be relatively robust to the bias affecting tetra-nucleotides. This result is consistent with simulation studies of microsatellites and STRs, showing that STRUCTURE is relatively robust against ascertainment bias (Haasl and Payseur 2011).

Average heterozygosity is a commonly computed quantity, often used to parameterize population genetic models (Zhivotovsky *et al.* 2003; Liu *et al.* 2006; Deshpande *et al.* 2009). However, when the markers have heterogeneous mutation rates, this quantity does not correspond to the heterozygosity of the average mutation rate (or to any other single mutation rate). An important contribution of our article is the development of a rarefaction approach to combine markers with different mutation rates in heterozygosity-based measures corresponding to a single mutation rate. This approach allows all markers to carry equal weight in determining the overall metric, and is a much more accurate approach than simply averaging estimates obtained for different types of markers. Our approach is model free with respect to the underlying demographic processes (*i.e.*, we do not explicitly reconstruct bottlenecks, migrations, etc.). It is important to note that to combine heterozygosities, we do not need to know the mutation rate of any of the types but only have an *a priori* classification scheme that allows us to group markers with relatively similar mutation rates. However, if the mutation rate of one of the marker types is known, rarefaction yields the mutation rate of the combined heterozygosities. This way of estimating mutation rates has the advantage to be relatively robust to the inclusion of markers with extreme values, a known weakness of the commonly used approach of scaling the variances (Zhivotovsky *et al.* 2003).

Although we have not performed a formal investigation of the number of makers needed for our approach to be stable, we estimate our rarefaction method to require at least 10, and ideally over 20 markers in each group to avoid artifacts while rescaling. In the case of microsatellites, we obviously need markers to be clean enough to comply with the SMM framework. We provide several tools to check such compliance.

The SMM model is the simplest and most widely used model of microsatellite evolution, but it may seem overly simplistic given the heterogeneous nature of human microsatellites (Chakraborty *et al.* 1997; Ellegren 2004). For example, the fact that the SMM model assumes symmetric mutation rates may seem to limit its validity for our method. However, our method uses only the pairwise difference in number of repeats between individuals and is, therefore, independent of any constant, directional bias in the mutations (toward higher or lower repeat counts). It is also possible to use our method with more complex models. In the appendix, we show how our method can be adapted to use mutation models with arbitrary length-dependent increment and decrement rates and for models in which multiple repeat units can sometimes be added or removed. The main conclusion from analyzing these models is that the scaling relation between rarefied and original distribution of repeat counts depends only on the ratio of the two mutation rates, and they can be used in the same way as the simple SMM model. Although in many cases, it is not possible to write the scaling relation in a closed form, it is straightforward to compute it numerically. For the case of constraining the SMM model to have repeat counts between one and *L*, we show in the appendix that the scaling relation is the same as in the standard SMM model.

What are the implications of the bias in HGDP-CEPH tetra-nucleotides for studies that have used these markers to study the effect of SNP ascertainment bias? Conrad *et al.* (2006) plotted within-population haplotype heterozygosity against microsatellite heterozygosity for different haplotype lengths. Although SNPs were found to have little relation to microsatellites, they found that points fall on a curve for haplotypes at least 20 kb long (see Figure 3 in Conrad *et al.* 2006). If we were to remove the tetra-nucleotide markers from this plot, the American data points would be shifted strongly left, the Asian data points slightly so, and the remaining populations would be essentially unchanged (Figure 6). Although the resulting pattern would be slightly less linear, the points would still fall on a curve (because heterozygosity of microsatellites and SNPs depend differently on the underlying gene genealogies, the relation between the two is not linear in general). The conclusion that sufficiently long haplotypes are effectively free of ascertainment bias is not strongly affected by the bias in the tetra-nucleotides. Sun *et al.* (2009) found a linear relation between microsatellite average-squared difference (ASD) and sequence divergence for individuals from different populations. The bias observed in the tetra-nucleotides has little effect on ASD (unpublished) and, therefore, does not change any of their conclusions in that article.

For models describing the expansion of anatomically modern humans out of Africa that have been parameterized using the average heterozygosity of all markers (*e.g.*, Ramachandran *et al.* 2005; Liu *et al.* 2006; Deshpande *et al.* 2009), we would not expect the biases discussed in this article to put their main conclusions into doubt. On the other hand, we would expect the actual estimates of most parameters to be affected. This is especially true of the timing of the expansion out of Africa, which is usually the most interesting quantity. In addition, the relatively flat *F*_{ST} pattern from tetra-nucleotide markers (Figure 3), as well as the weaker cline in within-population expected heterozygosity (Figure 6), would cause such models to overestimate gene flow (migration) and underestimate population bottlenecks during the expansion; the magnitude of these effects depends on the details of the models, but it is approximately proportional to the relative effect of the bias on *F*_{ST} and the trend in expected heterozygosity.

The HGDP-CEPH panel has been used as a blueprint for the development of large datasets based on microsatellites, and subsets of the markers used in the HGDP-CEPH have been adopted for investigating genetic diversity in Africa (Tishkoff *et al.* 2009), India (Rosenberg *et al.* 2006), Oceania (Friedlaender *et al.* 2008), and the Americas (Wang *et al.* 2007). A simple precaution for future users of these datasets would be to limit their analysis to di- and tri-nucleotides. Although tetra-nucleotides constitute the majority of the markers in the HGDP-CEPH panel, di- and tri-nucleotides together still provide 220 markers, a very large number in terms of population genetics. Furthermore, our analyses above show that these markers are enough to perform all the type of analytical approaches usually employed on this type of data and, indeed, drive the results for measures such as *H* and *F*_{ST} when the data are combined. However, as several of these datasets concentrate on geographic areas at the boundaries of the coverage from the HGDP-CEPH panel, we would recommend following the same steps adopted in this article to investigate the presence of biases and make sure that they are fully removed.

## Acknowledgments

This work was supported by a Philip Leverhume Award and Biotechnology and Biological Sciences Research Council Grant BB/H005854/1.

## Appendix

### The SMM Model with Arbitrary Jump Distributions

Consider an extension of the standard SMM model to arbitrary jump distributions, such that at each mutation the probability of a change of *k* repeats is *q _{k}* (with

*q*

_{0}= 0). The standard SMM model corresponds to taking

*q*

_{1}=

*q*

_{-1}= 1/2, and

*q*= 0 otherwise. Using the characteristic functionwe can write the characteristic function for the difference Δ in repeat count between two alleles aswhere

_{k}*g*(

*z*) is the generating function for the distribution of the TMRCA. Expanding to the first order in µ, we obtainwhere

Hence, for any small mutation rate µ,where denotes the inverse function of *G*. Thus, for mutation rate µ′, we express in terms of :

Finally, we can use the inverse Fourier transform to obtain the distribution of Δ at mutation rate µ′ from . As for the standard SMM model, the scaling relation depends only on ratio of the two mutation rates, which means that our scaling method works equally well for these models.

We also note that only the component of the jump distribution symmetric around *k* = 0 enters into the final expression. This is a consequence of the fact that we consider only differences in repeat count, and implies that the scaling relation is independent of any systematic increase or decrease in the repeat count.

If the jump distribution is fully symmetric, the scaling relation is exact. Otherwise, there is a small correction to the scaling relation, but it is still possible to express in terms of . However, when the mutation rates are small the correction is negligible.

### The *L*-allele Mutation Model

In this section, we consider the effect of limiting the SMM models to repeat counts in the range [1, *L*] (shifting the range with an arbitrary amount makes no difference to the analysis). At the ends of the interval, we introduce reflecting boundary conditions such that the stationary distribution is uniform over the range. For alleles with repeat count in the range [2, *L*-1], mutations occur independently each generation with probability µ, with equal probability of gaining or losing a single repeat unit. At repeat count one, only mutations leading to an increment are accepted, and the repeat count *L* only decrements are accepted. Starting from a distribution *x* at time zero, the distribution of *p* of alleles *t* generations later can be written on matrix form aswhere the matrix M has elements given by *M _{i,i}* = 1,

*M*

_{i}_{-1,}

*=*

_{i}*M*

_{i}_{+1,}

*= −1/2, and*

_{i}*M*

_{i}_{,}

*= 0 otherwise, except at the boundaries where*

_{j}*M*

_{1,1}=

*M*

_{L}_{,}

*= 1/2. Correspondingly, the joint distribution*

_{L}*P*for two individuals becomes (given that

_{ij}*M*is symmetric) (A1)where

*D*denotes the diagonal matrix with diagonal elements equal to

*x*.

Because *M* is symmetric, all eigenvectors and eigenvalues are real-valued, and the eigenvectors form an orthonormal basis. Expanding *M* in the eigenvectors *u*_{α} and eigenvalues *λ*_{α}, we havewhere the superscript *T* denotes transpose. Inserting this in the equation for *P* above, and multiplying from the left and right with and , respectively, gives

If we now assume that the distribution of alleles in the MRCA of the two individuals is the stationary distribution (*i.e.*, the locus is old enough), this equation simplifies towhere is Kroenecker’s delta, which is equal to one if and is zero otherwise, is a diagonal matrix. If we now consider an arbitrary distribution of TMRCA with generating function *g*(*z*), as in the previous section, we obtain

Again, the function *g*(*z*) is independent of the mutation rate. Thus, this equation suggests a scaling relation between the eigenvalues at mutation rate *µ* and *µ*’ for the same underlying distribution of TMRCA. The joint distribution of alleles for mutation rate µ′, *P*(µ′), can then written in terms of the diagonal elements :where denotes the value of α corresponding to α′ under the scaling relation. A problem with this view is that the scaled values of α may not be an integer. In this case we can use that the underlying function *g*(*z*) is smooth: hence, if we sort the eigenvalues in increasing order, we can estimate at non-integer α using interpolation.

For the *L*-SMM, the eigenvalues are

Solving the scaling relation gives the correspondence

Identifying the angular frequency , we see that this relation is identical to the scaling relation for the standard SMM model (see Equation 1). Indeed, since the homozygosity *F* is the sum of the diagonal of *P*, and this scalar is invariant under unitary matrix transformations, we can write *F* as which is the direct correspondence to Equation 2 with Δ = 0.

### SMM Models with Length-Dependent Mutation Rates

The analyses in the previous section can be extended to single-step mutation models (with finite number of states) where the probability of increment and decrement mutations can both have an arbitrary dependence on the number of repeats in the sequence. As in the preceding section, we will assume that the locus is old enough that the ancestral probability distribution is approximately equal to the stationary distribution (this is an assumption implicit in methods typically used in estimating length-dependent mutation rates). Our starting point is Equation A1 in the preceding section, where *M* is the (tri-diagonal) mutation matrix, and *D* the diagonal matrix corresponding to the stationary distribution. Here, *M* is generally not symmetric, but it is straightforward to verify that the matrix *MD* is (it follows directly from the condition for having a stationary distribution). Hence, we can use an approach similar to above using

Inserting Equation A1, we obtain

The matrix is symmetric, as

Hence, we can use the method from the previous section to relate to the generating function *g*(*z*) of the TMRCA of the locus,where are the eigenvalues of (which are identical to those of *M*), and the value of , for 0 < *µ*′ < *µ*, can again be found by interpolation from . Finally, the joint distribution of alleles at the lower mutation rate can be calculated usingwhere are the eigenvectors of .

## Footnotes

Supporting information is available online at http://www.g3journal.org/lookup/suppl/doi:10.1534/g3.111.001016/-/DC1

*Communicating editor: S. W. Scherer*

- Received August 29, 2011.
- Accepted September 20, 2011.

- Copyright © 2011 Eriksson, Manica

This is an open-access article distributed under the terms of the Creative Commons Attribution Unported License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.