## Abstract

Gene diversity, or expected heterozygosity (*H*), is a common statistic for assessing genetic variation within populations. Estimation of this statistic decreases in accuracy and precision when individuals are related or inbred, due to increased dependence among allele copies in the sample. The original unbiased estimator of expected heterozygosity underestimates true population diversity in samples containing relatives, as it only accounts for sample size. More recently, a general unbiased estimator of expected heterozygosity was developed that explicitly accounts for related and inbred individuals in samples. Though unbiased, this estimator’s variance is greater than that of the original estimator. To address this issue, we introduce a general unbiased estimator of gene diversity for samples containing related or inbred individuals, which employs the best linear unbiased estimator of allele frequencies, rather than the commonly used sample proportion. We examine the properties of this estimator, relative to alternative estimators using simulations and theoretical predictions, and show that it predominantly has the smallest mean squared error relative to others. Further, we empirically assess the performance of on a global human microsatellite dataset of 5795 individuals, from 267 populations, genotyped at 645 loci. Additionally, we show that the improved variance of leads to improved estimates of the population differentiation statistic, which employs measures of gene diversity within its calculation. Finally, we provide an R script, *BestHet*, to compute this estimator from genomic and pedigree data.

The gene diversity of a locus, also known as its expected heterozygosity (*H*), is a fundamental measure of genetic variation in a population, and describes the proportion of heterozygous genotypes expected under Hardy-Weinberg equilibrium (Nei 1973). Formally, gene diversity is the probability that a pair of randomly sampled allele copies from a population are different, and is computed as(1)where *I* is the number of distinct alleles at a locus, and () is the frequency of allele *i* in the population.

For a sample without related or inbred individuals composed of *n* allele copies, an unbiased estimator of expected heterozygosity is (Nei and Roychoudhury 1974)(2)where is the sample proportion of allele *i*. is a biased estimator when inbred or related individuals are included in the sample (DeGiorgio and Rosenberg 2009). This result is based on the idea that, as the proportion of related individuals in the sample increases, the number of independent allele observations decreases.

When two alleles are drawn from a sample, one each from a pair of related individuals, there is a nonzero probability that they will be identical by descent (IBD), rather than just identical by state (Lange 2002). This IBD probability is known as the kinship coefficient, and is denoted by for a pair of individuals *j* and *k*. Thus, the observed diversity will be lower than the true value because a greater proportion of identical alleles are observed than for a sample in which there are no related individuals. DeGiorgio *et al.* (2010) developed an estimator of expected heterozygosity,(3)which is unbiased for samples containing related and inbred individuals of any ploidy, and employs a weighted mean kinship coefficient as a bias correction factor. is the average of all kinship coefficients for every pair of individuals within the sample (see *Methods*). Further, DeGiorgio *et al.* (2010) derived the theoretical variance of as well as its approximate value for samples wherein individuals are related to no more than one other sampled individual.

As an alternative to the sample proportion (), McPeek *et al.* (2004) introduced the best linear unbiased estimator (BLUE, denoted as ) of population allele frequency, which is an unbiased linear estimator with smaller variance than the unbiased linear estimator The BLUE incorporates the relatedness of individuals in the sample as a covariance matrix to define the weight of each observation. Simulations and analytical evaluation corroborating their result suggest that the mean squared error (MSE) of is always smaller than that of and this difference is especially evident for samples with complex pedigrees.

Because has the smallest variance of any unbiased linear estimator of allele frequencies, we expect its low variance to translate to smaller variance of gene diversity statistics that use We developed such a statistic, termed that is an unbiased estimator of expected heterozygosity in samples containing related and inbred individuals of arbitrary ploidy. Through simulations, analytical predictions, and empirical assessments, we compare the performance of to that of and for samples containing related individuals of various types across different ploidy and inbreeding status. Additionally, we derive the variance of any measure of expected heterozygosity that uses unbiased linear estimators of allele frequencies. We find that the increased precision of allele frequency estimates transfers to our unbiased estimator, yielding values for MSE invariably equal to or smaller than those of while occasionally exceeding the precision of The improved properties of translate to its applications as well, which we demonstrate in the calculation of the population differentiation statistic, (Wright 1951). can be written in terms of intrapopulation and interpopulation gene diversity as (Hudson *et al.* 1992)(4)where and are the values of expected heterozygosity within each of two compared populations, and is the expected heterozygosity between them.

## Methods

Consider a locus with *I* distinct alleles in a sample of *n* individuals. Let denote the fraction of alleles at the locus in individual *k* that are of type *i*, An unbiased linear estimator of population allele frequencies denoted by is defined as(5)where is the weight of individual *k*, and Formally, we have thatwhere is an indicator random variable whose value is 1 if allele *t* of individual *k* is of type *i*, and zero otherwise, and where is the ploidy of individual *k*. As an example, if individual *k* were diploid at the locus, then Taking the expectation of shows that it is an unbiased estimator of

### Unbiased estimation of gene diversity using unbiased linear estimators of allele frequencies

In this section, we construct an unbiased estimator, of expected heterozygosity that uses a general unbiased linear estimator, of allele frequency (Proposition 1). We then show that the unbiased estimator, of DeGiorgio *et al.* (2010) follows as a corollary, assuming that the sample proportion allele frequency estimator (Corollary 2). We then derive a new estimator, also as a corollary, assuming that the BLUE of allele frequency (Corollary 3).

#### Proposition 1:

Consider a locus with *I* distinct alleles and parametric allele frequencies and For a sample of size *n* individuals of any ploidy, inbreeding status, and relatedness,(6)is an unbiased estimator of expected heterozygosity, whereis a weighted mean kinship coefficient of the sample for all pairs of individuals in the sample, and where is the weight for individual *k*. The proof of Proposition 1 is found in the Appendix.

From the sample proportion estimator of allele frequency *i*, is recovered when for individual *k*, leading toHere, each individual is weighted by its contribution to the number of allele copies in the sample.

#### Corollary 2:

Consider a locus with *I* distinct alleles and parametric allele frequencies and For a sample of size *n* individuals of any ploidy, inbreeding status, and relatedness,(7)is an unbiased estimator of expected heterozygosity, whereis the sample proportion estimator of allele frequency *i*, whereis a weighted mean kinship coefficient of the sample for all pairs of individuals, and where is the ploidy for individual *k*. The proof of Corollary 2 is found in the Appendix.

It may be beneficial to apply an unbiased linear estimator of allele frequencies that has minimum variance. McPeek *et al.* (2004) introduced the BLUE of allele frequencies, which we formally define here. We will use the BLUE of allele frequencies to construct a new unbiased estimator of gene diversity that would ideally have improved variance over other estimators. Let be an symmetric matrix of kinship coefficients, with The BLUE () of allele frequency is obtained when yieldingwhere denotes the inverse matrix of **1** is a column vector of *n* elements with all entries equal to 1, and is the transpose of **1**.

#### Corollary 3:

Consider a locus with *I* distinct alleles, and parametric allele frequencies and For a sample of size *n* individuals of any ploidy, inbreeding status, and relatedness,(8)is an unbiased estimator of expected heterozygosity, whereis the BLUE of allele frequencies, and whereis a weighted mean kinship coefficient of the sample for all pairs of individuals. The proof of Corollary 3 is found in the Appendix.

### Variance of H estimators using unbiased linear estimators of allele frequencies

We now derive the equation (Proposition 4) describing the variance of the unbiased estimator which takes as the unbiased linear estimate of population allele frequency This value depends on the weighted mean kinship coefficients of the sample for all pairs, trios, quartets, and pairs of pairs of individuals in the sample, defined asHere, is the probability that three randomly sampled alleles, one each from individuals *j*, *k*, and are IBD. is the probability that four randomly sampled alleles, one each from individuals *j*, *k*, and are IBD. Finally, is the joint probability that two randomly sampled alleles, one each from individuals *j* and *k* are IBD, and two randomly sampled alleles, one each from individuals and are IBD. Note that individuals *j*, *k*, and are not necessarily distinct. The variances of and of follow as Corollaries 7 and 8, once again differing only in the weight of a sampled individual in the mean kinship coefficient calculation.

#### Proposition 4:

Consider a locus with *I* distinct alleles and parametric allele frequencies and For a sample of size *n* individuals of any ploidy, inbreeding status, and relatedness,(9)is the variance of the unbiased estimator of expected heterozygosity where is a weighted mean kinship coefficient of the sample, and where for is the weight of individual *k*. Further, we have(10)The proof of Equation 10 is presented for the specific case of in Appendix B of DeGiorgio *et al.* (2010), where is substituted for and and and coefficients are substituted for and coefficients, respectively. We provide an abbreviated version of this proof for the general case in the Appendix. Further, the approximate value of Equation 10 for samples wherein no individual is related to more than one other is(11)For this simplifying case, the terms and are negligible compared to

In the Appendix, we reintroduce the definition of from DeGiorgio *et al.* (2010) (Corollary 7), and then define (Corollary 8), both of which take the form illustrated in Proposition 4. As demonstrated by DeGiorgio *et al.* (2010), the mean kinship coefficients composing Equation 10 derive from the relationship between the 15 identity states available to four alleles (Gillois 1965; Cockerham 1971), and the coefficients of kinship between pairs, trios, quartets, and pairs of pairs of alleles within those four.

### Bias of for samples containing related or inbred individuals

Here, we briefly derive an equation (Equation 12) within Proposition 5 that describes the bias of which we display in the left panels of Supplemental Material, Figure S1A and Figure S2A. We include Corollaries 9 and 10 to Proposition 5 within the Appendix for specific cases of bias derived from -based and -based estimations, respectively. We also note that Equation A10 of Corollary 9 represents the form of the bias typically encountered in applications of as well as in all of our experimental scenarios.

#### Proposition 5:

Consider a locus with *I* distinct alleles and parametric allele frequencies and For a sample of *n* possibly related or inbred individuals, the bias of the estimator of expected heterozygosity changes with the true locus expected heterozygosity such that(12)where

#### Proof:

We begin by substituting Equation 6 into Equation 13 such thatandFrom the definition of bias,

□### Variance of estimators using unbiased linear estimators of allele frequencies

Because the population differentiation statistic (Wright 1951) can be defined in terms of expected heterozygosities, it is possible to theoretically evaluate its approximate variance. A general estimator of can be written as(14)where is an unbiased estimator for the expected heterozygosity between a pair of sampled populations, numbered 1 and 2, defined as (where is a linear unbiased estimator of the frequency of allele *i* in population 2, analogous to in population 1), while and are the within-population expected heterozygosities for populations 1 and 2, respectively. Referring to the numerator as *x*, and the denominator as *y*, we can write the expression for an approximation of the variance of a ratio as(15)following the definition for the approximate variance of a ratio (Wolter 2007).

#### Proposition 6:

Consider a locus with *I* distinct alleles across two populations and parametric allele frequencies and for population 1, and and for population 2. For samples of size and individuals from populations 1 and 2, respectively, each with individuals of any ploidy, inbreeding status, and relatedness, the variance of the population differentiation statistic calculated from their respective expected heterozygosities is approximated as(16)where(17)In the Appendix, we provide a derivation of the variance and covariance components of Equations 16 and 17. For each of these equations, the result and proof are fairly long, and do not simplify when arranged into Equation 16.

### Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

## Results

### Analytical validation of

We tested the performance of using both theory and simulations against that of the unbiased estimator (DeGiorgio *et al.* 2010), and of (Nei and Roychoudhury 1974). Here, we applied the estimators to samples of individuals wherein each individual was related to exactly one other. Thus, for samples of size *n* individuals, the number of relative pairs was *n*/2. When inbred or closely related individuals are included in a sample, is a biased estimator of gene diversity for which we use the symbol To construct an unbiased estimator with we also applied to a reduced sample in which one member of each relative pair was removed randomly for samples containing only diploid individuals, and the haploid member was removed for each haploid-diploid (*i.e.*, male-female) pair (reduced sample size of *n*/2), and we denote this estimator by To evaluate the performance of the four estimators ( and ), we modified the factors upon which their variance depends: true locus expected heterozygosity (*H*), sample size *n*, and relatedness of individuals within the sample ().

### Effect of true locus expected heterozygosity, H, on estimators

We first evaluated the theoretical bias, variance, and mean squared error (MSE) of each estimator across the 645 human microsatellite loci from across the genome in the composite dataset MS5795 of Pemberton *et al.* (2013), where MSE is the sum of the squared bias and variance. The data used in our analyses is freely available online within File S1 of Pemberton *et al.* (2013) (http://www.g3journal.org/content/early/2013/03/27/g3.113.005728/suppl/DC1). We took the sample allele frequencies calculated from all individuals in the MS5795 dataset as the true population allele frequencies for the variance calculations, and, from these, determined the true expected heterozygosity at each locus using Equation 1 (see File S1; incorporated into Equation A10). Here, each sample contained 60 diploid individuals composed of 10 inbred full-sibling, 10 outbred full-sibling, and 10 outbred avuncular pairs. Each point in Figure 1 and Figure S1 represents a single analytical computation for a sample of 60 (or 30 for ) individuals at a microsatellite locus. We report the approximate variance and MSE because each individual is related to exactly one other in the sample, satisfying the assumption of Equation 11. Further, under this scenario DeGiorgio *et al.* (2010) showed that this was a reasonable approximation of the exact variance.

We begin by demonstrating the relative performance of the unbiased estimators and measured in terms of MSE, against the biased estimator (Figure 1). While the variance of is invariably smaller than that of the other estimators, and the MSE and variance of each estimator decrease with increasing locus expected heterozygosity (), accumulates bias quadratically with increasing *H*, and thus yields an increasingly unreliable estimate with increasing site diversity (Figure S1A, left). However, the effect of this trend differs for each comparison. The MSE of always exceeds that of because the removal of relatives to create the reduced sample causes a substantial increase in estimator variance, though, for high diversity markers, the MSE values of and converge (Figure 1, left). In contrast, outperforms for most loci, demonstrating that the rate of decrease in MSE with increasing *H* is greater for than for (Figure 1, center). Interestingly, the comparison of with shows an opposite trend to the preceding two. Despite the impact of bias, the decrease in variance of over the analyzed range outpaces that of Even so, uniformly yields a smaller MSE for the analyzed diploid samples (which contain a proportion of inbred individuals) across all loci (Figure 1, right).

To validate these theoretical predictions, we simulated 30 independent genotypes for each locus, and, for each independent genotype, simulated a single relative’s genotype (inbred full-sibling, outbred full-sibling, or avuncular). Briefly, we generated the independent genotypes by sampling alleles uniformly at random from the distribution of allele frequencies at each microsatellite locus, and generated relatives by copying zero, one, or two alleles from the relative according to the probability the pair would share zero, one, or two alleles IBD [see Lange (2002), Chapter 5]. The patterns observed for the simulated data accord with those of the theoretical predictions (Figure S2, each point is based on simulations). It is clear from these results that locus expected heterozygosity is heavily influential on estimator MSE. However, we also find that the observed value of expected heterozygosity for a locus normalized to its range of expected heterozygosity values has an impact on estimator MSE. The maximum and minimum values of expected heterozygosity for a locus depend on the number of distinct alleles (*I*), and the frequency of the most frequent allele (*M*), at that locus [see Theorem 2 of Reddy and Rosenberg (2012)]. We quantify proximity of *H* for a locus to its maximum possible value as where *D* is the observed value of expected heterozygosity for a locus minus its minimum possible value given *I* and *M*, and *R* is the maximum minus the minimum value of expected heterozygosity, given *I* and *M*, such that Loci with a smaller value of *B* yield a smaller MSE for all estimators (Figure S3).

### Effect of sample size, n, on estimators

We next examined the properties of each estimator as a function of sample size. All estimators perform increasingly well for samples of increasing size. We demonstrate this property by measuring estimator MSE for samples containing 2–100 relative pairs of various type and ploidy at the D3S2427 locus, selected to highlight the improved performance of as the bias of increases ( Figure 2). For these tests, we considered only a single relative pair type at a time. The unbiased estimators and perform identically for diploid samples of first- and second-degree relative pairs regardless of inbreeding (Figure 2, A–D). Additionally, estimator MSE is uniformly smaller for samples containing only second-degree relative pairs than it is for samples containing only first-degree pairs (*cf*. Figure 2, A and B, and Figure 2, C and D; see also, Figure S4A). However, unambiguously outperforms the other estimators with relative pairs of varying ploidy (in this case, male-female full-sibling pairs at an X-linked locus). In this scenario, provides a more accurate estimate of expected heterozygosity than does when the reduced set is created by removing only males from the original while retaining females (Figure 2E). When all females are removed instead, and males retained (Figure 2F), the MSE of is markedly the largest of the four estimators because 2/3 of the alleles in the sample are discarded, rather than 1/3. For samples with inbred full-siblings whose parents are brother and sister (Figure 2, C and D), the trend of MSE with sample size mirrors that of outbred diploid samples (Figure 2, A and B), but with larger MSE. However, the relative performance of is notably worse for samples containing inbred diploid avuncular pairs (Figure 2D) than for samples containing outbred diploid avuncular pairs (Figure 2B). That is, its MSE remains greater than, or equal to, that of the other estimators over the range of sample sizes considered for the inbred diploid avuncular pair scenario (Figure 2D), but consistently has smaller MSE than for the outbred diploid avuncular pair scenario (Figure 2B). Generally, increasing the sample size is most effective for samples of <20 individuals, and it is over this range that the difference in performance of the estimators is most apparent.

### Effect of varying sample relative pair composition on estimators

Finally, we calculated the MSE of each estimator for all 1326 combinations of one to three relative pair types for samples of 100 individuals fixed at 50 relative pairs, which we represent as triangular heat maps, across samples containing outbred diploids, males and females at an X-linked locus, or inbred diploids (each individual related to exactly one other; Figure 3, Figure S4, Figure S5, Figure S6, Figure S7, and Figure S8). The kinship coefficients () for each relative pair type considered across our tests are defined in Lange (2002, Chapter 5) and DeGiorgio *et al.* (2010, see Table 2), and modeled on the D3S2427 locus ().

The outbred diploid samples included parent-offspring (), avuncular (), and full-sibling () relative pairs. Because parent-offspring and full-sibling pairs have the same kinship coefficient, the heat maps in Figure 3, Figure S4A, Figure S5A, Figure S6A, and Figure S7A are symmetrical with parent-offspring and full-sibling pairs on the bottom vertices, and avuncular pairs on the top vertex. yielded the largest MSE of the four estimators, and this value was constant throughout the space of the heat map (Figure S4A, second triangle), because all reduced sets are identical for outbred diploid samples. consistently yielded the smallest MSE across configurations (Figure S4A, fourth triangle). As was the case in Figure 2, the MSE of the estimators and was smallest for samples with only avuncular pairs, because these contain fewer dependent allele observations on average. We observed these features in simulated data as well (Figure S8A).

Although performed best overall for samples including outbred diploid relative pairs at D3S2427, the estimator with the smallest variance in all situations is the biased estimator (Figure S6A). However, because its squared bias increases with the number of first-degree pairs (Figure S5A), its relative performance declines compared to as more of these pairs are sampled (Figure 3A, left triangle). The relative performance of is highest when the number of first degree pairs is maximized, but this is due to the decreasing performance of as more dependent observations are included (Figure 3A, center triangle). While the difference in MSE between and is always slight for samples of noninbred diploids, these values diverge as the complexity of the sample increases (Figure 3A, right triangle). That is, as the numbers of first- and second-degree pairs approach each other, emerges decisively as the more accurate estimator, with the maximum value of this difference reached at 23 second-degree and 27 first-degree pairs. Thus, while the performance of the estimators for a sample containing relatives follows the same general trend, provides the greatest accuracy for heterogeneous samples of outbred diploid individuals.

We also considered the relative performance of each estimator when using either the BLUE () or the sample proportion () to estimate allele frequencies. Notably, all estimators perform best when the BLUE () of allele frequency rather than the sample proportion () is used to infer population allele frequencies. We calculated the theoretical MSE for each estimator once with and once with across all combinations of relative pairs for diploid individuals at the D3S2427 locus and mapped its value for the estimate with minus the estimate with (Figure S7A). Because both frequency estimations yield the same values in samples of unrelated individuals, performs identically for and and is not included. The MSE of an estimator calculated with is always smaller than that of the estimator calculated with and the pattern of divergence between their MSEs follows a similar trend across all estimators, resembling the rightmost panel in Figure 3A. This result suggests that the difference in MSE between and is driven primarily by the difference in performance between and Both the and estimators yield the same value at the vertices of the triangles, and the difference in their MSEs reaches a maximum at 22 second-degree pairs for and 24 second-degree pairs for and (Figure S7A, center and right triangles). The MSE of calculated with is, at most, on the order of greater than that of calculated with indicating its robustness to variance in allele frequency determination (Figure S7A, right triangle). In contrast, the other estimators return a maximum difference in MSE on the order of The estimation of expected heterozygosity with or will always yield a smaller MSE for samples of outbred, diploid individuals when rather than is taken as the estimator of population allele frequency.

We repeated these tests in samples of mixed ploidy (Figure 3B, Figure S4B, Figure S5B, Figure S6B, Figure S7B, and Figure S8B), and emerged similarly superior to the other estimators, once again yielding the smallest MSE. We analyzed the D3S2427 locus as X-linked for these tests, counting males as haploid and females as diploid, and observed full-sibling pairs [similarly to DeGiorgio *et al.* (2010), for male-male pairs, for male-female pairs, and for female-female pairs] for samples of 100 individuals and 50 relative pairs. All estimators reach their maximum MSE in samples containing only male-male pairs (Figure S4B). This is because the number of independent observations (indicated by a larger mean kinship coefficient) is smallest when there are no females in the sample. Correspondingly, the estimators yield smaller MSE values with increasing incorporation of male-female pairs. The minimum MSE of is reached at 50 male-female pairs, as with and because its squared bias (Figure S5B) decreases with increasing male-female pairs, though its variance is smallest at 50 female-female pairs, due to the greater number of alleles in the sample (Figure S6B). To create the reduced sets, males were removed from male-female pairs to minimize the subsequent increase in MSE. That is, the removal of males removes 1/3 of the allele copies from the sample, rather than 2/3 if females are removed, or 1/2 for a pair of same-ploidy individuals, and so has the same value across samples with the same number of male-male pairs (Figure S4B, second triangle).

The direct comparison of with the other estimators once again yielded different signatures for each subtraction for mixed-ploidy samples (Figure 3B). The point of greatest difference in MSE between and occurs when all relative pairs are male-male, while the point of least difference occurs for samples of only male-female pairs (Figure 3B, left triangle). This pattern broadly resembles the squared bias of (Figure S5B, first triangle), underscoring the effect of bias on estimator performance. The pattern of difference in performance between and differs markedly, and the two estimators perform most similarly as the number of male-male pairs decreases, reaching a minimum at 33 male-female pairs plus 17 female-female pairs (Figure 3B, middle triangle). yields the closest MSE to that of for all relative pair configurations, and their difference is, at most, on the order of (Figure 3B, right triangle). The pattern here mainly reflects the difference in performance between and estimates of population allele frequency, as in Figure S7B, where estimators yield increasingly smaller comparative MSE values as the numbers of relative pairs in the sample approach each other.

We repeated the preceding tests once more for a sample in which full-siblings resulting from a brother-sister mating were included alongside second-degree and outbred full-sibling pairs (Figure 3C, Figure S4C, Figure S5C, Figure S6C, Figure S7C, and Figure S8C). Here, the kinship of inbred individuals with each other was 3/8 rather than 1/4. For all estimators, the inclusion of inbred full-siblings increased the MSE of the estimator, with a maximum MSE at 50 inbred full-sibling pairs, and a minimum at 50 second-degree pairs. For this minimum was also reached for any sample in which there were no inbred individuals, because the reduced sample is identical for these (Figure S4C, second triangle). Again, was the least errant estimator across the space of sample configurations (Figure S4C, fourth triangle), and its advantage over the other estimators differs for each estimator (Figure 3C). Because the bias of is largest at 50 inbred full-sibling pairs, the greatest difference in performance between it and is at this point (Figure 3C, left triangle). Meanwhile, the largest differences in MSE between and are near the top vertex, where second-degree relative pairs predominate, while the smallest are toward the bottom vertices (Figure 3C, center triangle). The difference in MSE between and is at least an order of magnitude less than for the other comparisons, and increases for increasing sample complexity, but reaches its maximum for samples of 28 inbred full-sibling plus 22 second-degree pairs (Figure 3C, right triangle). This pattern reflects the decreased MSE for the estimators when calculated with compared to their calculation with (Figure S7C). Ultimately, the performance of the estimators of expected heterozygosity across varying sample compositions depends on the estimator of allele frequency incorporated into the expected heterozygosity calculation. No matter the sample type, estimators based on outperform estimators based on and outperforms and

### Tests of on single-nucleotide polymorphism (SNP) loci

Because SNP datasets are more common in recent studies, we performed analyses equivalent to our microsatellite analyses for 50 hypothetical SNP loci. These loci were biallelic with minor allele frequency (MAF) between 0.01 and 0.5, with increments of 0.01, corresponding to expected heterozygosity values ranging from 0.0198 to 0.5. We first measured the difference in MSE of with that of or as a function of true locus expected heterozygosity (*H*), as we did in Figure 1 (Figure S9). For each locus, the MSE of was smallest, while that of was generally second-smallest, following the trend for microsatellite loci visible in Figure 1, wherein less diverse loci yielded a smaller MSE for than for However, unlike for microsatellite loci, estimator MSE peaks midway through the range of evaluated SNP loci, such that the smallest MSE values lie at either extreme of the range and the largest MSE value, as well as the largest difference in MSE values for all comparisons, is at the locus with MAF (). Additionally, performs comparatively better than (Figure S9, left) and (Figure S9, center) as *H* approaches 0.255, but is outperformed by these unbiased estimators as *H* approaches 0.5. Once more, the trend is opposite for the comparison between and showing the greatest comparative performance by at the same locus (MAF ). Thus, considering the results presented in Figure 1 and Figure S9, the greatest relative performance of for inbred samples is achieved at loci for which estimator MSE is largest.

We next examined the effect of sample size on estimator performance for hypothetical samples of outbred diploid, inbred diploid, and outbred male-female relative pairs at the simulated locus with MAF (). As we varied the sample size from two relative pairs to 100 (each individual related to exactly one other, one relative pair type per sample), we found that yielded the smallest MSE of all estimators only for samples containing male-female full-sibling pairs modeled at an X-linked locus (Figure S10, E and F). This observation mirrors the trend seen in Figure 2, wherein outperformed the other estimators across all sample sizes. However, yielded the smallest MSE across all sample sizes for outbred and inbred diploid full-siblings and avuncular pairs (Figure S10, A–D). This result is because the samples modeled here are minimally complex, with only one relative pair type, and modeled for a highly homozygous marker—two conditions under which the low bias and variance of result in favorable performance.

Finally, we analyzed estimator performance once more for the locus with MAF (), for a sample of 50 individuals across changing outbred diploid, inbred diploid, and male-female full-sibling relative pair compositions (Figure S11, A–C). We display these results as heat maps, and find that our results here are broadly concordant with those for the D3S2427 human microsatellite locus (). As with the experiments displayed in Figure S10, the least complex samples yielded a smaller MSE for estimates than for estimates. Correspondingly, samples whose relative pair compositions resulted in fewer independent allele observations were more accurately and precisely evaluated with Thus, while sampling lower-diversity markers may occasionally favor the use of the inclusion of two or more relative pair types in the sample is likely to bias , and require the use of to yield accurate inferences.

### Empirical application of

To conclude our investigation into the performance of we applied it to empirical data from the MS5795 dataset. We retrieved human microsatellite data from 5795 individuals (11,590 allele copies) across 645 autosomal loci sampled genome wide. We assumed the mean value across loci for in each of 267 populations to be the true expected heterozygosity value for these populations, as it is an unbiased estimate. We additionally chose to compare the other estimators with because an important basis for their evaluation is their agreement with this unbiased estimator, irrespective of the data to which they are applied.

To emphasize this, we performed three Wilcoxon signed-rank tests to compare the ranking of populations by their mean expected heterozygosity across all loci calculated with and either or (Table 1). At the significance level, the comparisons showed that the inclusion of relatives for was highly significant on the rankings it yielded, indicating that not correcting for relatedness among samples can significantly alter the estimates of expected heterozygosity. However, both and, especially, yielded *P*-values greater than the threshold for the test against These results indicate that the estimates of expected heterozygosity are not significantly affected by the inclusion of related individuals in the sample when relatedness is taken into account. Furthermore, a test between and yielded a *P*-value of suggesting no significant difference in the ranking of populations by mean expected heterozygosity with these two estimators.

Although the unbiased estimators and have smaller MSE than for samples with related individuals, their variance tends to be larger than that of DeGiorgio *et al.* (2010) previously showed that the difference in SD of with was small, while the mean values of and were much more similar to each other than either of them was to the mean of We again show this to be the case, and find as well that not only repeats, or improves upon, the concordance of with but, in some cases, has a smaller SD than does (Figure 4, left and center panels). A direct comparison of the performance of against that of (Figure 4, right panel) shows that has a generally improved SD, and similarity to the estimate over For some samples (primarily those from the Americas), this is not the case, possibly because all close relatives were not identified in the original dataset, resulting in an incorrect kinship matrix for calculation of the statistic.

### Improving estimates of by application of

We predicted that the smaller MSE of would translate to improved accuracy for estimators that are summaries of expected heterozygosity when samples contain related individuals. To test this hypothesis, we calculated the population differentiation statistic, (Equation 4), for pairs of populations whose samples in the MS5795 dataset contained related individuals. Our intent was to compare the MSE and bias of the commonly used estimator of Reynolds *et al.* (1983), which is based on and which we label as to an estimate of calculated from which we label The formulas for these estimators follow the form of the general estimator of (Equation 14). We first measured the MSE of both methods (and an estimate using ) on simulated data, where the of pairs of populations with samples of size 60 diploids each (30 relative pairs, 10 inbred full-sibling, 10 outbred full-sibling, and 10 avuncular pairs; Figure 5) was averaged across simulated replicates. The calculations included here were performed for simulated Gujarati and Maya (left), Gujarati and Japanese (center), or Gujarati and Hadza (right) samples for the least diverse (TCTA015M_22), median diverse (D10S2327), and most diverse (D3S2427) loci of the MS5795 dataset, following their allele frequency distribution in MS5795. consistently has a smaller MSE than the others, and the MSE of all estimators of decreases with increasing locus diversity, as the MSE of the estimator of expected heterozygosity decreases.

We additionally find that has an upward bias compared with (calculated with ), as well as a larger SD in general than (Figure 6). Furthermore, all values of are smaller than the paired value of calculated for the same population. The difference in the mean of and of across all loci with the mean of an estimator which serves as a proxy for the true value of is displayed on the vertical axis, while the horizontal axis measures the SD of and of (Figure 6). Supporting our observations indicating the improved accuracy of over Wilcoxon signed-rank tests (Table 2) between and either or indicate that the inclusion of relatives significantly affects the estimate of population differentiation at the significance level. Meanwhile, and are not significantly different in their estimates. These results suggest that the improved properties of transfer to the summaries that include it in their calculations.

## Discussion

We have introduced an extension to the estimator () of expected heterozygosity developed by DeGiorgio *et al.* (2010) that yields a smaller mean squared error in samples containing related individuals, while maintaining unbiasedness. Conveniently, the derivations of and its variance, are parallel in form to those of and we were therefore able to analytically evaluate the performance of the new estimator simultaneously with that of its predecessor. Our updated estimator, is based on results from McPeek *et al.* (2004), who characterized the BLUE () of allele frequency. The BLUE improves the precision of allele frequency estimation in complex pedigrees, for which the sample proportion ( the estimator of allele frequency used in and ) is unbiased, but increases in variance with inclusion of related and inbred individuals. Because the properties of the estimator of allele frequency transfer to the estimator of expected heterozygosity, is likely to outperform in situations where has a smaller variance than This trend is true for genome-wide data as well (Figure 4 and Table 1).

Overall, yields identical results to in samples containing only one relative pair type, but the two diverge in performance as sample complexity increases (see heat maps in Figure 3, Figure S4, Figure S5, Figure S6, Figure S7, and Figure S8). While both estimators are unbiased, experiences a larger increase in variance for each additional relative pair type introduced into a sample after the first. This holds true for all sample types regardless of ploidy and inbreeding, suggesting that will outperform in practice, where datasets are often complex. Furthermore, the results of our empirical analysis provide an equally important complement to this observation. Of the 93 populations from the MS5795 dataset we considered that contained relative pairs in their samples, each contained sampled individuals that were not related to any other in the sample. Thus, these samples were more complex than those in which each individual was part of a relative pair of the same type. For most of these cases, except for some American populations (discussed below), outperformed This is corroborated by the Wilcoxon signed-rank test (Table 1). We expect therefore that any scenario in which there is heterogeneity in relative pair type among sampled individuals, as is observed in many human population-genetic datasets (Pemberton *et al.* 2010, 2013), should favor the application of over other estimators.

In addition, random sampling of small isolated populations yields an increased chance that related individuals will be included with large enough sample sizes. Further, inbreeding may confound estimates of diversity, and mislead to underreport true population expected heterozygosity. Populations of interest that may display these attributes include geographically isolated human settlements in remote alpine (Coia *et al.* 2012; Capocasa *et al.* 2013), South American rainforest (Wang *et al.* 2007), and Siberian taiga and steppe habitats (Dulik *et al.* 2012), and groups such as the Old Order Amish (Van Hout *et al.* 2010), Hutterites (Abney *et al.* 2002; Chong *et al.* 2011), and Mennonites (Payne *et al.* 2011). Further, though our analysis did not directly consider polyploid organisms, the applicability of to samples containing individuals of any, and varying, ploidy highlights its usefulness for such data. Prominently, analysis on polyploid organisms such as plants including tetraploid *Arabidopsis thaliana* (Hollister *et al.* 2012), and hexaploid bread wheat (Nielsen *et al.* 2014), both of which self-fertilize, and may therefore be inbred, as well as commercially and ecologically significant Hymenopteran insects, including honeybees (Solignac *et al.* 2003; Harpur *et al.* 2014), bumblebees (Lye *et al.* 2011), and ants (Butler *et al.* 2014), whose males are haploid at all loci, while females are diploid, is likely to benefit from the improved accuracy and precision of

We additionally believe that continued investigations into the diversity at single sites in organisms as diverse as dogs (Sutter *et al.* 2007), gray wolves (Zhang *et al.* 2014), humans living at high altitude (Simonson *et al.* 2010; Huerta-Sánchez *et al.* 2013), and rice (Huang *et al.* 2012), in addition to host-microbiome studies (Blekhman *et al.* 2015), will benefit from the advances provided by These studies, as well as many others, have performed scans for positive selection using genomic outliers of population differentiation-based statistics (*e.g.*, locus-specific branch length, and the population branch statistic), where the calculation is performed per-site, rather than averaged across a large number of sites. Such studies would benefit from estimators of genetic diversity, such as and with improved variance.

It is pertinent at this point to revisit a pair of potential limitations in our method and examine their implications. First, in Figure 4 (rightmost panel), the mean of is either closer to that of than to has smaller SD than or both for certain samples (predominantly from the Americas). These observations indicate that the accuracy and precision of may be impacted by the accuracy of the kinship information incorporated into the calculation. The pedigrees of smaller, more remotely located, populations may be more complex compared to those of larger groups. Further, with a greater proportion of relative pairs in each sample, the effect of relative pair type misidentification may be larger. For RELPAIR (Epstein *et al.* 2000), which was the software chosen to identify relative pairs in MS5795 samples, second-degree pairs cannot be identified as confidently as first-degree pairs (Pemberton *et al.* 2010). Even so, although may exhibit a somewhat greater robustness to relative pair misclassification, it is still generally outperformed by

The second point we address is the smaller MSE of at less diverse loci in the dataset, especially for samples with fewer relative pairs. While the variance of is always smaller than that of the other estimators, its bias increases with increasing locus allelic diversity. It is for this reason that the unbiasedness of is its most desirable property. In practice, the mean of expected heterozygosity is often taken across loci. Based on such an approach, (and as well) will return the mean expected heterozygosity, and the variance of this estimation (as with all estimators taking the mean across loci) approaches zero as more loci are sampled. An interesting property of all estimators is that their variance (and therefore MSE) is larger for loci whose value for *B* is closer to 1, where ( see *Results* and Figure S3). Because this effect is greatest for loci with lower true values of *H*, we expect to have the smallest MSE of all estimators at less diverse loci that are close to their maximum expected heterozygosity, and for which the sample mean kinship coefficient is insufficiently large to appreciably bias the estimator (Equation 12). It is thus important to note that no estimator is uniformly superior to the others. Accordingly, the unique limitation of is that the sample kinship matrix must be invertible for the calculation to proceed.

additionally confers its improved MSE over downstream to calculations that incorporate estimates of expected heterozygosity. To illustrate this point, we computed as a function of three estimators: and For simulated data, we found that yielded an estimate with smaller MSE for the three tested loci than did (Figure 5) or and a much smaller mean distance from the true value than For empirical data (Figure 6), we observed a consistent upward bias for compared to in samples containing relatives that followed much the same pattern as the downward bias of for such samples. This trend is clear when we consider the formula for which can be written as Taking and as and this expression yields a larger value than if and were used, because the ratio is smaller for downwardly biased estimators. Interestingly, the SD of is, in most cases, smaller than that of for the dataset, while the SD of was frequently (though not consistently) larger than that of (Figure 4, center panel).

It is thus noteworthy to consider that the performance of and may diverge further in their applications, where any improvement in MSE for may be magnified downstream. This is highlighted by the increased concordance between and compared to and (*cf*. *P*-values between Table 1 and Table 2). With this in mind, applications of can also be considered. Two such examples are the locus-specific branch length (LSBL; Shriver *et al.* 2004) and the similar population branch statistic (PBS; Yi *et al.* 2010). These statistics incorporate values between three populations as measures of branch length to detect positive selection at a locus. Loci for which the unrooted three-taxon tree indicates a significantly longer branch length in a particular lineage may represent regions possibly under selection. To allow for the easy application of we have written an R script, *BestHet*, that computes and given matrices of genotype and kinship data for a sample (download available at http://www.personal.psu.edu/mxd60/best_het.html).

## Acknowledgments

We thank two anonymous reviewers for their insightful comments. This work was supported by Pennsylvania State University startup funds from the Eberly College of Science.

## Appendix

### Derivations of unbiased estimators of expected heterozygosity

In this section, we derive the general unbiased estimator of expected heterozygosity for any unbiased linear estimator of population allele frequencies, defined in Proposition 1, and show how the formulas for (DeGiorgio *et al.* 2010), and (Corollaries 2 and 3), emerge from specific cases of

##### Proof of Proposition 1:

We need to show that Note thatTaking the expectation, we obtain(A1)Therefore

□##### Proof of Corollary 2:

We show that defining the weight of each individual in the calculation of in terms of an individual’s relative allele copy contribution yields from Letting we have thatandPlugging in yields

□##### Proof of Corollary 3:

We show that defining the weight each individual according to their relative contribution to the inverted kinship matrix of the sample yields from Letting we have thatandPlugging in yields

□### Derivations of variances of expected heterozygosity estimators

In this section, we summarize the procedure by which DeGiorgio *et al.* (2010) derived the equation for the variance of illustrating the variance of the general case, For the full derivation, see Appendix B of DeGiorgio *et al.* (2010). We then provide the specific formulation for the variance of (Corollary 7) and (Corollary 8).

##### Abbreviated proof of Proposition 4:

The variance of (Equation 9) is defined asBy definition of variance, we getwithandRecalling that for the th allele copy of individual *j*, whose ploidy is we have thatandfor the case We have previously shown in Equation A1 that and the value of similarly follows. Thus, we need to calculate and for the case. These are(A2)and(A3)Substituting Equation A2 into and solving for we obtainand, substituting Equation A3 into and solving for we obtainThus, substituting the values for variance and covariance into the definition of variance, we have

##### Corollary 7:

Consider a locus with *I* distinct alleles, and parametric allele frequencies and For a sample of size *n* individuals of any ploidy, inbreeding status, and relatedness,(A4)and(A5)where and are mean kinship coefficients, weighted by the contribution of individuals to the number of allele copies in the sample, with subscripts corresponding to the number of individuals considered for the calculation. Additionally,(A6)The proof of Corollary 7 follows from the proof of Proposition 4, where is substituted for and and are substituted for and respectively.

##### Corollary 8:

Consider a locus with *I* distinct alleles and parametric allele frequencies and For a sample of size *n* individuals of any ploidy, inbreeding status, and relatedness,(A7)and(A8)where and are mean kinship coefficients, weighted by the contribution of individuals to the inverted kinship matrix, with subscripts corresponding to the number of individuals considered for the calculation. Additionally,(A9)The proof of Corollary 8 follows from the proof of Proposition 4, where is substituted for and and are substituted for and respectively.

### Derivations of bias measurements in the application of

For samples containing related and inbred individuals, has a downward bias, which is defined in Equation 12 for the general estimator of population allele frequency Here, we present Corollaries 9 and 10 for the specific estimators of population allele frequency and respectively.

##### Corollary 9:

Consider a locus with *I* distinct alleles and parametric allele frequencies and For a sample of size *n* possibly related or inbred individuals, the bias of the estimator of expected heterozygosity changes with the true locus expected heterozygosity such that(A10)whereAs this is the standard application of (Equation 2), Equation A10 describes the bias of in the *Results*. However, is biased with any unbiased linear estimator of allele frequency for samples containing related or inbred individuals. The proof of Corollary 9 follows from the proof of Proposition 5, where is substituted for

##### Corollary 10:

Consider a locus with *I* distinct alleles and parametric allele frequencies and For a sample of size *n* possibly related or inbred individuals, the bias of the estimator of expected heterozygosity changes with the true locus expected heterozygosity such that(A11)where(A12)The proof of Corollary 10 follows from the proof of Proposition 5, where is substituted for

### Derivations of components for the variance of estimators

In this final section of the Appendix, we provide derivations for the components of Equations 16 and 17, which describe the variance of We derive the variance of as well as the covariances of with (and interchangeably, with ), and of with Because the complete expression for is unwieldy, we stop at the derivation of the final component.

##### Lemma 11:

Consider a locus with *I* distinct alleles across two independent populations and parametric allele frequencies and for population 1, and and for population 2. For two samples of size and individuals from populations 1 and 2, respectively, each with individuals of any ploidy, inbreeding status, and relatedness,(A13)where the superscript of the mean kinship coefficient corresponds to the population for which it is calculated. The equations for the variance of and are obtained by substituting and respectively, into Equation A13 as the mean kinship coefficients in place of

##### Proof:

By definition of variance,whereandBecause and are unbiased estimators of population allele frequency, and populations 1 and 2 are independent,Similarly, Next, we have(A14)where takes the same form as (Equation A1), except that the resulting weighted mean kinship coefficient is for population 2, indicated by the superscript. By substituting Equation A14 into we have(A15)We now derive an expression for Let be an indicator random variable in population 2 analogous to the indicator random variable which we have previously defined for population 1.whereandConsider a scenario in which we have two allele copies. Let be the identity state with probability in which two randomly drawn alleles are not IBD, and be the identity state occurring with probability in which the two alleles are IBD.Note that, because and (same with ), we have Thus,andSubstituting, we now have(A16)and substituting Equation A16 into yields(A17)Therefore, using Equations A15 and A17,

□##### Lemma 12:

Consider a locus with *I* distinct alleles across two independent populations and parametric allele frequencies and for population 1, or and for population 2. For two samples of size and individuals from populations 1 and 2, respectively, each with individuals of any ploidy, inbreeding status, and relatedness,(A18)and(A19)where the superscript of the mean kinship coefficients and corresponds to the population for which these are calculated. The formulas for and are obtained by substituting and (for ), or and (for ) into Equations A18 and A19, respectively.

##### Proof:

The covariance between and isThe value of the covariance calculated for the case where can be written asFrom the proof of Lemma 11, we have derived the value of and, from the proof of Proposition 1 we know the value for We therefore only need to computewhere Solving for we haveThe value of depends on the probabilities of distinct identity states in which three alleles are drawn from the sample (one each from individuals *j*, *v*, and ). We define state 1 as no IBD alleles drawn (probability ), state 2 as IBD alleles drawn from *j* and *v* (probability ), state 3 as IBD alleles drawn from *v* and IBD (probability ), state 4 as IBD alleles drawn from *j* and IBD (probability ), and state 5 as all three IBD (probability ), with Thus, the probabilities for the relevant kinship coefficients arewhich yieldsThus, is(A20)and from Equations A20 and A1, and the definition of (A21)Meanwhile, for the case of This is intuitively sensible because the products and are independent, describing different alleles, and should not covary.

Finally, we can see that, when the two populations considered are independent from one another, the value of (or equivalently of ) is driven entirely by the case in which such that□We now need to derive the final term required to compute

##### Lemma 13:

Consider a locus with *I* distinct alleles across two independent populations and parametric allele frequencies and for population 1, or and for population 2. For two samples of size and individuals from populations 1 and 2, respectively, each with individuals of any ploidy, inbreeding status, and relatedness,(A22)where the superscript of the mean kinship coefficients and corresponds to the population for which these quantities are calculated. The formulas for and are obtained by substituting and (for ), or and (for ) into Equation A22.

##### Proof:

We begin by breaking up the covariance into its components,This equation is composed of terms that we previously derived (Equations A13, A18, and A19). Therefore,

□## Footnotes

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.116.037168/-/DC1.

*Communicating editor: B. J. Andrews*

- Received September 20, 2016.
- Accepted December 18, 2016.

- Copyright © 2017 Harris and DeGiorgio

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.