## Abstract

Determination of degree of relationship traditionally has been undertaken using genotypic data on individual loci, typically assumed to be independent. With dense marker data as now available, it is possible to identify the regions of the genome shared identical by descent (ibd). This information can be used to determine pedigree relationship (**R**), *e.g.*, cousins *vs.* second cousins, and also to distinguish pedigrees that have the same Wright’s relationship (*R*) such as half-sibs and uncle–nephew. We use simulation to investigate the accuracy with which pedigree relationship can be inferred from genome sharing for uniparental relatives (a common ancestor on only one side of their pedigree), specifically the number, position (whether at chromosome ends), and length of shared regions ibd on each chromosome. Moments of the distribution of the likelihood ratio (including its expectation, the Kullback-Leibler distance) for alternative relationships are estimated for model human genomes, with the ratio of the mean to the SD of the likelihood ratio providing a useful reference point. Two relationships differing in *R* can be readily distinguished provided at least one has high *R*, *e.g.*, approximately 98.5% correct assignment of cousins and half-cousins, but only approximately 75% for second cousins once removed and third cousins. Two relationships with the same *R* can be distinguished only if *R* is high, *e.g.*, half-sibs and uncle–nephew, with probability of correct assignment being approximately 5/6.

Relatives carry individual genes and also genomic regions identical by descent (ibd). In many situations in human, natural, or agricultural populations, it is important to identify relatives and, if possible, degree of relationship using this information. Traditionally, methods of identifying relatives have used information regarding identity in state (ibs) of individual genes (Weir 1996), with increasingly dense markers enabling increasingly high precision, using methods such as CERVUS (Marshall *et al.* 1998), components of PLINK (Purcell *et al.* 2007), or COANCESTRY (Wang 2011).

Traditionally, establishing relationships does not use information regarding location in the genome, and statistical properties are often based on assuming unlinked markers. Linkage information can be incorporated, however, by using the linkage map and taking into account the Markovian nature of the ibd process underlying the genotypes of relatives at linked loci (Epstein *et al.* 2000; McPeek and Sun 2000; Kyriazopoulou-Panagiotopoulou *et al.* 2011) using methods such as RELPAIR (Epstein *et al.* 2000). Regions of the genome that are shared ibd can be established using identity in state (*e.g.*, Abecasis *et al.* 2002; Roberson and Pevsner 2009) using programs such as MERLIN (Abecasis *et al.* 2002).

Alternatively, distantly related individuals can be identified from multilocus sharing of even quite small regions of the genome (Browning and Browning 2011, 2012, 2013). If it is known that two individuals are related, then the allelic information adds little on regions already clearly shared ibd as determined by common sequence (except perhaps on ibs of two very-low-frequency genes). Further, the use of information on shared regions rather than just individual loci allows, at least in principle, discrimination between relationships with the same Wright’s relationship *R* but different pedigree **R**, *e.g.*, uncle–nephew and half-sib, both of which have *R* = 0.25. *R* used here is strictly Wright’s *numerator* relationship, which equals twice the kinship (coancestry), but it is the same as Wright’s relationship in the absence of inbreeding. **R** defines the pedigree (Table 1). Further, the actual proportion of the genome shared can, by chance, be higher by more distant (*e.g.*, second cousins: *R* = 1/32) than closer relatives (*e.g.*, first cousins once removed: *R* = 1/16). The proportion of overlap of the distribution of actual relationship increases as the relationship of each of the pairs becomes more distant (*e.g.*, *R* = 1/64 *vs.* 1/32) (Hill and Weir 2011), further increasing the problem of determining the pedigree **R**.

The pedigree relationship may be needed in a number of situations. The estate of an individual who dies intestate may by law have to be divided among his or her closest relatives. Courts would assume this to be defined by pedigree. Another situation would be in identification of individuals in forensic cases, for example, in identifying a body or a body part in a disaster zone, or in familial searching for relatives of an offender already in the database (Rohlfs *et al.* 2012). In studies of natural populations, pedigree construction is an important component in determining breeding structure and estimation of genetic parameters (Blouin 2003; Pemberton 2008).

Detection of genomic regions for which there is biparental sharing, *i.e.*, individuals with ibd genotype at each diploid locus due to relationship through both parents (*e.g.*, full-sibs or double first cousins), is quite straightforward because there is ibs at each locus in the region. We consider here only the much more common situation of uniparental sharing, in which case *R* is half the probability relatives share one allele ibd at a locus, or half the expected proportion of uniparental genome shared.

Therefore, a quantitative description of the number, position, and length of shared regions is all the information we can have about relationship of a pair of individuals in the absence of pedigree information, and this sets an upper limit to what we can detect. Our objective is to find what this limit is for different alternative pedigrees. Therefore, as a reference point, we work on the premise that we have precise estimates of these quantities but later consider this assumption. We also assume there are no confounding factors, such as inbreeding of the common ancestor or relationships among other ancestors of the pair. Data on gene frequencies and genotypes at individual loci then add no further information.

We focus on identifying specified pedigree relationship from actual or realized genomic sharing, for example, whether a pair of individuals are related either as second cousins or as cousins once removed, in each case assuming there is uniparental sharing. Such comparisons can be undertaken based on a likelihood ratio, although the appropriate test or discrimination depends on the questions to be answered, such as the following. Which of two or more alternative relationships is the most probable? How sure are we? What relationships can we exclude?

The variation in the total length can be computed (Hill and Weir 2011) and there are also various methods and approximations for computing the numbers and distribution of the lengths of shared regions (Fisher 1954; Donnelly 1983; Stam 1980). Recently, Huff *et al.* (2011) have proposed methods to identify whether pairs of individuals taken from the population are related more closely, *e.g.*, as second cousins, than background relationship among all population members from distant relationships in a finite closed population.

There is no theory available that enables prediction of the numbers and distribution of shared segments exactly for arbitrary relationships. Therefore, we use simulation to generate the required probability distributions. There are, however, approximations for some of these distributions available: for example, Huff *et al.* (2011) assumed a Poisson distribution of number and exponential distribution of shared regions (*i.e.*, independence), and we also investigated their accuracy. We conclude with a discussion on inference. The primary objective was to set the theoretical framework and compute what can be achieved rather than focus on applications *per se*.

## Materials and Methods

### Simulation

The simulation program was used previously to check theoretical results for the variance of the length of shared regions on a chromosome (Hill and Weir 2011), which in turn provided a check on the program itself. Simulations were undertaken for a single chromosome, for example, of length *l* Morgans, in each independent replicate. There was assumed to be a uniform recombination rate and no interference, *i.e.*, corresponding to a Haldane mapping function. The number of recombination events was sampled from a Poisson distribution and their positions were sampled as real-valued numbers independently from the uniform distribution. All regions of ancestral chromosomes were labeled by the same integer value, *e.g.*, 1, 2, and so on. Hence, a chromosome of a descendant was defined by the position (*π*) of each of the *n* − 1 recombination events, *e.g.*, 0 = *π*_{0} < *π*_{1} < *π*_{2} < … < *π*_{*n*−1} < *π*_{*n*} = 1, defining the *n* chromosomal regions labeled *h*_{1}, …, *h*_{*n*}. Then, for example, *n* = 4, *π*_{1} = 0.1256, *π*_{2} = 0.5701, *π*_{3} = 0.9012, and *h*_{1} = 1, *h*_{2} = 2, *h*_{3} = 1, *h*_{4} = 3 denote a chromosome for which the first region (from 0.0 to 0.1256) and third region (from 0.5701 to 0.9012) were derived from ancestor 1, the second was derived from ancestor 2, and the fourth was derived from ancestor 3, and thus ibd for that genomic region with these respective ancestors. This does not imply that the parent has a chromosome with exactly that haplotype, but that a gamete could be formed from it that does, *i.e.*, the shared region may span grandpaternal and grandmaternal origins. Hence, for a second descendant of the same individuals with, for example, *n* = 2 and *π*_{1} = 0.3659, *h*_{1} = 2, *h*_{2} = 1, there is sharing between the two descendants in two regions, between 0.1256 and 0.3659 from ancestor 2 and between 0.5701 and 0.9012 from ancestor 1, *i.e.*, internal regions of length 0.2403 and 0.3311, respectively, with a total proportion of 0.5714.

To obtain the results presented here, 100,000 or more independent replicates were performed. For each replicate the sharing among different kinds of relatives was computed, so for a founder full-sib family, the degree of sharing of, for example, uncle and nephew (or aunt and nephew, etc., because only autosomes were simulated), great-uncle and great-nephew, and cousins of degree up to third cousins were sampled successively. Although this induced sampling correlations, these were trivial because replicates were numerous and independent. Simulations were performed independently for chromosomes of different length and for three different founder relationships: lineal descendants, full-sib–based, and half-sib–based (Table 1).
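The chromosome representation used in the simulation can be illustrated with a minimal Python sketch. It is not the authors' program: the function names `sample_crossovers` and `shared_regions`, and the `(breakpoints, labels)` tuple layout, are assumptions introduced here. The first function draws crossover positions for one meiosis under the stated no-interference model; the second finds the regions in which two descendant chromosomes carry the same ancestral label, and is applied to the two example chromosomes described in this section.

```python
import random


def sample_crossovers(length_morgans, rng):
    """Crossover positions for one meiosis, as proportions of the chromosome.

    Exponential gaps at rate 1 per Morgan form a Poisson process, so the
    count is Poisson with mean `length_morgans` and positions are uniform
    (no interference, i.e. a Haldane mapping function).
    """
    positions = []
    x = rng.expovariate(1.0)
    while x < length_morgans:
        positions.append(x / length_morgans)
        x += rng.expovariate(1.0)
    return positions


def shared_regions(chrom_a, chrom_b):
    """Return (start, end, label) intervals where both chromosomes match.

    Each chromosome is (breakpoints, labels) with breakpoints
    0 = pi_0 < pi_1 < ... < pi_n = 1 and labels h_1, ..., h_n.
    Assumes adjacent regions within a chromosome have different labels.
    """
    (bp_a, lab_a), (bp_b, lab_b) = chrom_a, chrom_b
    i = j = 0
    out = []
    while i < len(lab_a) and j < len(lab_b):
        start = max(bp_a[i], bp_b[j])
        end = min(bp_a[i + 1], bp_b[j + 1])
        if lab_a[i] == lab_b[j] and end > start:
            out.append((start, end, lab_a[i]))
        # Advance whichever region finishes first.
        if bp_a[i + 1] <= bp_b[j + 1]:
            i += 1
        else:
            j += 1
    return out


# The two descendant chromosomes from the worked example in the text.
first = ([0.0, 0.1256, 0.5701, 0.9012, 1.0], [1, 2, 1, 3])
second = ([0.0, 0.3659, 1.0], [2, 1])
regions = shared_regions(first, second)
lengths = [round(end - start, 4) for start, end, _ in regions]
print(regions)                           # sharing from ancestor 2, then ancestor 1
print(lengths, round(sum(lengths), 4))   # 0.2403 and 0.3311, total 0.5714
```

Generating the crossovers as a Poisson process, rather than drawing a Poisson count and then uniform positions, is an equivalent formulation chosen here only because it needs nothing beyond the standard library.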

### Distribution of shared segments

#### Numbers of shared segments:

We provide examples to illustrate the kind of data available from the simulation for a map length of an “average” human chromosome of 1.632 M (based on Kong *et al.* 2004). Table 2 shows the distribution of numbers of shared segments (*n*_{s}) for a range of relationships from a full-sib base and for a more limited number of half-sib–derived and lineal relationships. For this length of chromosome there is a less than 1% chance that uncle and nephew share no genome and approximately 15%, 35%, and 31% probability that they share 1, 2, and 3 regions, respectively. For half-sibs, who have the same Wright’s relationship (1/4) as uncle–nephew, the probabilities are 2%, 25%, 41%, and 24%, respectively. Of course, more distant relatives share fewer and smaller regions. For longer chromosomes (in terms of map length or expected number of recombinations) than shown in Table 2, the expected number of shared regions increases and length of individual segments decreases.

#### Positions of shared segments:

Information also can be obtained from position of the shared regions, specifically whether they include the chromosome ends. Examples of the distribution of shared regions on the chromosome according to their position, specifically whether they include both, one, or no ends of the chromosome (*p*_{s} = 2, 1, 0, respectively), are shown in Table 3. A single region sharing both ends rarely occurs unless the relationship is close, and the proportion sharing at neither end increases as the relationship becomes more distant. Half-sibs are more likely to share regions including both chromosome ends than are uncle and nephew.

#### Lengths of shared segments:

The expected proportion of genome shared (*i.e.*, 2*R*) is of course the same as the overall length of shared regions expressed as a proportion of the genome length, but the distribution of the lengths of the total and of individual shared segments depends on the pedigree **R**. Examples are also given in Table 3 for the mean and SD of the total length actually shared, expressed as a proportion of the chromosome length *l* = 1.632 M, as a function of whether the shared regions include zero, one, or two chromosome ends. A special case is when *n*_{s} = 1 and *p*_{s} = 2, when the length is invariant because the whole chromosome is shared.

### Summary of simulated statistics

Because the numbers (*n*_{s}) and positions (*p*_{s}) are discrete-valued variables, to facilitate subsequent analysis the total length shared on the chromosome was also summarized in discrete values, namely as the number of tenths of the chromosome shared (*t*_{s}): if *x* (> 0) is the length shared, then *t*_{s} = 1 for 0 < *x* ≤ 0.1, *t*_{s} = 2 for 0.1 < *x* ≤ 0.2, …, and *t*_{s} = 10 for 0.9 < *x* ≤ 1.0. The distribution of the length of individual shared segments conditional on the numbers, positions, and total length shared on each chromosome was not included in subsequent analyses because it contains no additional information. For example, if there are two shared segments of total length *x*, then the relative lengths *y* and *x* − *y* tell us nothing about the number of generations apart. Although this was shown by simulation, on reflection it is obvious because the distribution is uniform.
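As an illustration of how one chromosome's sharing might be reduced to the triple {*n*_{s}, *p*_{s}, *t*_{s}}, the following Python sketch takes a list of shared intervals on a chromosome scaled to [0, 1]. The function name `summarize_sharing`, the interval format, and the convention of returning *t*_{s} = 0 when nothing is shared are assumptions made here for illustration, not details given in the text.

```python
import math


def summarize_sharing(segments):
    """Return (n_s, p_s, t_s) for shared (start, end) intervals on [0, 1].

    n_s: number of shared segments
    p_s: number of chromosome ends (0, 1 or 2) covered by a shared segment
    t_s: total shared length in tenths, with t_s = j when (j-1)/10 < x <= j/10
    """
    n_s = len(segments)
    if n_s == 0:
        return 0, 0, 0  # assumed convention: no sharing gives t_s = 0
    p_s = sum(1 for start, _ in segments if start == 0.0)
    p_s += sum(1 for _, end in segments if end == 1.0)
    total = sum(end - start for start, end in segments)
    # Rounding before ceil guards against floating-point noise at bin edges.
    t_s = max(1, math.ceil(round(10 * total, 9)))
    return n_s, p_s, t_s


# Two internal segments totalling 0.5714 of the chromosome.
print(summarize_sharing([(0.1256, 0.3659), (0.5701, 0.9012)]))  # (2, 0, 6)
# Whole chromosome shared: one segment covering both ends.
print(summarize_sharing([(0.0, 1.0)]))                          # (1, 2, 10)
```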

To simulate the 22 human autosomes, map lengths were simplified into five classes, based on the data of Kong *et al.* (2004), and previously were used for illustration (Figure 5 of Hill and Weir, 2011), namely two chromosomes of 0.75 M, eight chromosomes of 1.25 M, six chromosomes of 1.75 M, four chromosomes of 2.1 M, and two chromosomes of 2.75 M, totaling 35.9 M. Simulation also was undertaken assuming 22 chromosomes each of 1.632 M, *i.e.*, with the same average length as in the model using five lengths. As shown later, there is little difference in predictions of discriminating ability between the five-length and one-length models, so further subdivision of chromosome lengths to more closely match those for humans for analysis would have little impact on calculations or conclusions. This does not, however, imply that individual lengths should be ignored in analyses of real data.

As inferred from Hill and Weir (2011), from variances of actual relationship and also from simulations, for half-sib–based relationships the distribution of shared regions (*n*_{s}, *p*_{s}, and *t*_{s}) depends only on Wright’s numerator relationship *R*. For example, it is the same for half-cousins and half-great-uncle–great-nephew relationships (both *R* = 1/16), and for half second cousins and half-cousins twice removed (*R* = 1/64). Similarly, for full-sib–based relationships, the distribution is the same for second cousins and first cousins twice removed (*R* = 1/32), but is not the same for great-uncle–great-nephew and cousins (*R* = 1/8).

### Likelihood ratios

#### Computation:

Let *k* denote a specific realization {*n*_{s}, *p*_{s}, *t*_{s}} of genome sharing on a specified chromosome, and let *P*_{**R**}(*k*) denote the probability of this outcome dependent on the chromosome length and conditional on the relationship being **R** (*e.g.*, half-sibs). If, for example, only information on *n*_{s} is used, then the realization is simply {*n*_{s}}. The contribution provided by the observation *k* to the log likelihood ratio *λ*(**A** : **B**) for relationships **A** and **B** is then log *P*_{**A**}(*k*) − log *P*_{**B**}(*k*), using the logarithm *inter alia* because it has better sampling properties. We use the simulation results to obtain these probabilities, computed simply as the proportion of replicates with the appropriate outcome. Thus, using only data on *n*_{s}, for example, and assuming three shared segments, then *λ*(UN : HS) ∼ ln(0.310/0.243) = 0.245 and *λ*(UN : GUGN) ∼ ln(1.91) = 0.645 (Table 2). Because segregation over chromosomes is independent, the total log likelihood ratio is obtained by summing contributions to the log likelihood ratio from different chromosomes, using probabilities appropriate to the map length and realization for each chromosome. If there is previous information regarding the relationships from nongenetic data and these can be quantified, then Bayes theorem can be used straightforwardly to compute posterior probabilities of alternative relationships. Otherwise, application is context-dependent, and we discuss that subsequently.
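The summation over chromosomes and the Bayes step can be sketched in Python as follows. This is an illustration only: the function names, the `"1.632M"` class key, and the dictionary layout are assumptions introduced here, and the probabilities are the rounded uncle–nephew (UN) and half-sib (HS) frequencies quoted in the text for a 1.632 M chromosome, not the full Table 2.

```python
import math


def log_likelihood_ratio(observations, prob_a, prob_b):
    """Total log likelihood ratio lambda(A : B) over independent chromosomes.

    observations: list of (length_class, k) pairs, one per chromosome
    prob_a, prob_b: dicts mapping length_class -> {k: P(k)} under A and B
    """
    return sum(
        math.log(prob_a[cls][k]) - math.log(prob_b[cls][k])
        for cls, k in observations
    )


def posterior_prob_a(llr, prior_a):
    """Posterior probability of A when only A and B are considered."""
    odds = prior_a / (1.0 - prior_a) * math.exp(llr)
    return odds / (1.0 + odds)


# Rounded frequencies of n_s = 1, 2, 3 quoted in the text (n_s only).
UN = {"1.632M": {1: 0.15, 2: 0.35, 3: 0.310}}
HS = {"1.632M": {1: 0.25, 2: 0.41, 3: 0.243}}

one_chrom = log_likelihood_ratio([("1.632M", 3)], UN, HS)
two_chrom = log_likelihood_ratio([("1.632M", 3), ("1.632M", 1)], UN, HS)
print(one_chrom)   # ln(0.310/0.243), about 0.24 from the rounded inputs
print(two_chrom)   # adds ln(0.15/0.25) for the second chromosome
print(posterior_prob_a(two_chrom, prior_a=0.5))
```

Working on the log scale keeps the per-chromosome contributions additive, which is what makes the later moment calculations simple sums over chromosomes.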

#### Moments:

Although any testing is situation-specific, we can investigate the properties of the log likelihood ratio as a function of the data used and possible relationships to be compared. Thus, we consider its moments, specifically its mean and variance. If the real relationship is **A**, then the contribution to the mean from a single chromosome is

$$E_{\mathbf{A}}[\lambda(\mathbf{A}:\mathbf{B})] = \sum_{k} P_{\mathbf{A}}(k)\,[\log P_{\mathbf{A}}(k) - \log P_{\mathbf{B}}(k)] \tag{1}$$

and there is an equivalent formula for the variance. The overall mean and variance of *λ* are obtained by summing contributions over chromosomes. We also compute its skew and kurtosis.
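The variance formula is not written out in the text. One standard form, the second moment minus the squared mean, is given below as a sketch; the chromosome index *j*, the number of chromosomes *c*, and the subscript "genome" are notation introduced here, and independence of chromosomes is assumed as stated in the text.

```latex
\begin{align}
\operatorname{Var}_{\mathbf{A}}[\lambda(\mathbf{A}:\mathbf{B})]
  &= \sum_{k} P_{\mathbf{A}}(k)\,
     \bigl[\log P_{\mathbf{A}}(k) - \log P_{\mathbf{B}}(k)\bigr]^{2}
     - \bigl(E_{\mathbf{A}}[\lambda(\mathbf{A}:\mathbf{B})]\bigr)^{2}, \\
E_{\mathbf{A}}[\lambda_{\text{genome}}]
  &= \sum_{j=1}^{c} E_{\mathbf{A}}[\lambda_{j}], \qquad
\operatorname{Var}_{\mathbf{A}}[\lambda_{\text{genome}}]
   = \sum_{j=1}^{c} \operatorname{Var}_{\mathbf{A}}[\lambda_{j}].
\end{align}
```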

The mean *λ* is the (directed) Kullback-Leibler distance between the two distributions *P*_{**A**} and *P*_{**B**} (Kullback and Leibler 1951; Burnham and Anderson 2001). This “distance” is not symmetric, *i.e.*, in general, E_{**A**}[*λ*(**A** : **B**)] ≠ E_{**B**}[*λ*(**B** : **A**)]. Subsequently, we tabulate values over the correct distribution (*i.e.*, real relationship) such that they are positive.

Because the numbers of shared segments and their positions are count data and because lengths shared were analyzed similarly as discrete variables, the numbers in each defined class *k* have a multinomial distribution with parameters estimated from the simulation results. In computing the moments of *λ*, the expected probabilities *P*_{**R**}(*k*) were assumed to have been estimated by simulation with negligible error. If the estimate from simulation of *P*_{**A**}(*k*) was not zero but that of *P*_{**B**}(*k*) was zero, then in computing the term *P*_{**A**}(*k*)[log *P*_{**A**}(*k*) − log *P*_{**B**}(*k*)], it was assumed that *P*_{**B**}(*k*) = 1/(2*N*), where *N* was the number of replicates simulated. This term becomes important only when the distributions differ greatly [in which case E(*λ*) is already large] and when expected numbers in cells become very small. To reduce errors such as this due to simulation, because data regarding numbers of segments itself included data regarding lengths, results given utilizing *n*_{s}, *p*_{s}, and *t*_{s} used all three for 1 ≤ *n*_{s} ≤ 4, but only *n*_{s} and *p*_{s} for *n*_{s} > 4.

The ability to discriminate between alternative relationships using the likelihood ratio depends on the distribution of *λ*, mainly on the relative sizes of its mean and SD, so we tabulate E(*λ*)/SD(*λ*). Because there is replication of observations across chromosomes, SDs were computed over the aggregate, and therefore might be regarded as standard errors, but we retain the SD notation. We also found that λ typically has close-to-normal form.
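The moment calculations described in this section can be sketched in Python as below. The function names and the toy two-class distributions are hypothetical and chosen only to make the arithmetic easy to follow; the 1/(2*N*) substitution for empty cells follows the rule stated above.

```python
import math


def llr_moments(p_a, p_b, n_replicates):
    """Per-chromosome mean and variance of lambda(A : B) when A is true.

    p_a, p_b: dicts {k: estimated probability} for relationships A and B.
    A zero (or missing) estimate of P_B(k) is replaced by 1/(2N).
    """
    floor = 1.0 / (2 * n_replicates)
    mean = second = 0.0
    for k, pa in p_a.items():
        if pa == 0.0:
            continue  # classes never observed under A contribute nothing
        pb = p_b.get(k, 0.0) or floor
        lam = math.log(pa) - math.log(pb)
        mean += pa * lam
        second += pa * lam * lam
    return mean, second - mean * mean


def discrimination_ratio(per_chromosome):
    """E(lambda)/SD(lambda) for the genome, summing over chromosomes."""
    total_mean = sum(m for m, _ in per_chromosome)
    total_var = sum(v for _, v in per_chromosome)
    return total_mean / math.sqrt(total_var)


# Hypothetical two-class distributions, for illustration only.
p_a = {0: 0.5, 1: 0.5}
p_b = {0: 0.25, 1: 0.75}
m_ab, v_ab = llr_moments(p_a, p_b, n_replicates=100_000)
m_ba, v_ba = llr_moments(p_b, p_a, n_replicates=100_000)
print(m_ab, m_ba)  # the two directed distances differ
print(discrimination_ratio([(m_ab, v_ab)] * 22))
```

Summing means and variances separately before taking the ratio is what makes the genome-wide ratio grow with the square root of the number of chromosomes when chromosomes are alike.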

## Results

### Moments of log likelihood ratios

#### Expectation:

Information available for contrasting relationships expressed as expected log likelihood ratios [E(*λ*), Kullback-Leibler distances] are provided in the upper part of Table 4 for a subset of relationships using the full simulated data for numbers (*n*_{s}), positions (*p*_{s}), and lengths (*t*_{s}) of shared segments. In these and subsequent tables, rows denote the real relationship and columns denote the hypothesized relationship. Values of E(*λ*) for all 19 relationships analyzed and incorporating successively more information are given in Appendix Table A1 (using *n*_{s} only), Table A2 (using *n*_{s} and *p*_{s}), and Table A3 (using *n*_{s}, *p*_{s}, and *t*_{s}, *i.e.*, as in Table 4). In all these tables, values were computed from simulation runs for each of the designated five map lengths (0.75, 1.25, 1.75, 2.10, and 2.75 M), each replicated 100,000 times, *i.e.*, as weighted averages over a total of 500,000 replicates.

It was seen that E(*λ*) is small when relationships are distant and of similar magnitude (Table 4, upper part), *e.g.*, second cousins and half-cousins once removed (for both of which *R* = 1/32). Although Kullback-Leibler distances are not symmetric, the reciprocal cases here are usually close but not identical in value, so only half the pairs of assumed relationships are included in Table 4 (but all are in the Appendix Tables). E(*λ*) is typically higher when the likelihood ratio is conditional on the higher relationship of the two, presumably because there is a wider distribution of numbers and lengths of segments shared among close relatives and therefore there is more information in the data.

The increment in E(*λ*) by incorporating position and length can be substantial for comparisons involving quite closely related individuals (Appendix Table A1, Table A2, and Table A3). As they become distant, *e.g.*, half-cousins *vs.* third cousins, the absolute and proportional increase is small, for two reasons: few shared segments are at the ends of chromosomes, and the coefficient of variation in length shared decreases as the number of segments shared increases.

### Expectation *vs.* sampling error

SDs of *λ* values using all information (*n*_{s}, *p*_{s}, *t*_{s}) are given in Table 4 (lower) for a number of relationships. Examples of E(*λ*)/SD(*λ*) are given for two subsets of relationships, one including pairs of high relationships (1/16 ≤ *R* ≤ 1/4, Table 5) and the other including pairs of more distant relationship (1/128 ≤ *R* ≤ 1/16, Table 6). Later, we discuss the interpretation of these values and show that the ratio is, at least approximately, a noncentrality parameter determining the probability of misassignment. Approximately, a value of 2.0 or more indicates a pair of relationships that can be distinguished with reasonable confidence. Full data fitting different amounts of information are given for SD(*λ*) in Appendix Table A4, Table A5, and Table A6 and for E(*λ*)/SD(*λ*) in Appendix Table A7, Table A8, and Table A9. It is seen that SD(*λ*) tends to increase along with E(*λ*) as relationships become more different, *e.g.*, uncle–nephew *vs.* half-sib and *vs.* cousin (Table 5), and therefore E(*λ*)/SD(*λ*) diverges less rapidly than E(*λ*).
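One rough way to read the ratio is sketched below in Python. It rests on two assumptions made here for illustration, not on a rule stated at this point in the text: that *λ* is approximately normally distributed, and that a pair is assigned to the true relationship whenever *λ* > 0. Under those assumptions the probability of correct assignment is the standard normal distribution function evaluated at E(*λ*)/SD(*λ*). The function name is hypothetical.

```python
import math


def prob_correct_assignment(ratio):
    """P(lambda > 0) when lambda is normal with mean/SD equal to `ratio`."""
    return 0.5 * (1.0 + math.erf(ratio / math.sqrt(2.0)))


print(prob_correct_assignment(2.0))   # about 0.977 at the reference value 2.0
print(prob_correct_assignment(0.98))  # about 0.84 for a ratio just under 1
```

A ratio of 0.98 is the value reported later in the text for a true half-sib pair tested against uncle–nephew, and the resulting probability is close to the figure of approximately 5/6 quoted in the Abstract for that comparison.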

### Contributions from segment position and length

The contributions of different components of the data to E(*λ*)/SD(*λ*) are illustrated in Figure 1 for some of the relationships in Table 5 and Table 6. It shows the ratio fitting only numbers of shared segments and shows the increments in the ratio by fitting positions and then lengths. A high proportion [in some cases almost all the information as judged by E(*λ*)/SD(*λ*)] is contained in the number of shared segments. A little more is added by including position, but only for close relationships when chromosome ends are likely to be shared (Table 3). More information is obtained by incorporating length of chromosome shared, although not with a clearly defined pattern over relationships.

### Approximating likelihoods

#### Equal chromosome lengths:

To facilitate analysis of the distribution of log likelihood ratios, we consider a computational simplification, namely assuming all chromosomes have the same length rather than ranging over five different lengths. Hence, data also were simulated using a larger number of replicates (300,000) for chromosomes of length 1.632 M, the mean of those simulated previously, and likelihood ratios computed for genomes with 22 such chromosomes. Very similar values of E(*λ*), SD(*λ*), and, consequently, E(*λ*)/SD(*λ*) as those in Appendix Tables A1 through A9 were obtained. Results in Appendix Table A10 for E(*λ*)/SD(*λ*) enable comparison directly with those in Appendix Table A9 computed using the five chromosome lengths model. In summary, of the 342 off-diagonal comparisons of E(*λ*)/SD(*λ*) for the 19 relationships, only 32 deviated by more than 2% and of these 32, E(*λ*)/SD(*λ*) exceeded 1.0 in only 11, *i.e.*, large proportional differences typically occurred when absolute differences were small.

#### Replication:

Because differences in moments of *λ* ascribed to different models can arise from differences in expectation and from sampling in the simulation, a further run of 300,000 replicates for chromosomes of length 1.632 M as in Appendix Table A10 was undertaken (results not shown). The differences in E(*λ*)/SD(*λ*) between the replicates were very small; of the 342 off-diagonal comparisons, only 11 differed by more than 2%, and of those E(*λ*)/SD(*λ*) exceeded 1.0 in only 5. The main results (*e.g.*, Table 4, Table 5, Table 6 and corresponding Appendix Tables) computed for five lengths of chromosome involved a total of 500,000 unequally weighted runs, rather than 300,000 equally weighted runs (as we performed), so we conclude that sufficient replication was used.

### Higher moments and distributions of log likelihood ratios

To simplify calculations, and in view of these results showing a good approximation of likelihood statistics computed for a model of chromosomes of equal length as that for chromosomes of different lengths, higher moments and distribution of *λ* were computed assuming all chromosomes had length 1.632 M (from simulations as in Appendix Table A10). Coefficients of skew and kurtosis are given in Appendix Table A11 and Table A12, respectively, for a subset of relationships. In general, both coefficients are small, indicating closeness to a normal distribution. The kurtosis coefficient is generally smaller than the skew, and kurtosis tends to be seen only in the presence of skew. The largest skew generally is found when the true relationship is weak and the assumed relationship is stronger, in which case there is negative skew. Positive skew is found less often, but typically when the assumed relationship is weaker than the true relationship. The apparent near-normality is not unexpected because each sample is of size 22 and the central limit theorem applies (as it would to results simulated for samples from chromosomes of five different lengths). Examples of the distribution of the log likelihood ratio, scaled as *λ/*SD(*λ*), are given in Figure 2, showing near-“normal” form as anticipated in these particular examples.

### Approximations to sampling distributions

The results we have used have been based entirely on simulation. We investigate, however, theoretical results available that could be used to obtain some more simply computed but potentially less informative tests of pedigree relationship.

Based on work by Thomas *et al.* (1994), Huff *et al.* (2011) give an expression for the expected number of shared segments in the genome that, for a single chromosome, becomes

$$E(n_{s}) = a\,(dl + 1)\left(\tfrac{1}{2}\right)^{d-1} \tag{2}$$

where *a* is the number of ancestors (1 for half-sib mating, 2 for full-sib mating), *d* is the total number of meioses separating ancestors and descendants (back to the grandparents), and *l* is the map length. For lineal descendants, numbers of shared segments are typically one-half those of half-sib descendants, and the expected number shared with the grandparent (or founder of a recurrent backcross line) is [(*d* − 1)*l* + 1](½)^{*d*−1}, where *d* is the number of meioses back to the grandparent (*i.e.*, founder, hence terms in *d* − 1 because recombination to the parent is irrelevant). Thus, for example, *R* = 1/16 and *d* = 4 for full-sib–based (cousins once removed), half-sib–based (half cousins), and lineal descendants (great-great grandparent–great-great grandoffspring). The formulae do not apply to the cases of uncle–nephew, for which (surmised from simulations as in Table 2) E(*n*_{s}) = (5*l* + 2)/4, or great-uncle–great-nephew, for which E(*n*_{s}) = (7*l* + 2)/8. The mean numbers of shared segments from simulation agree (within sampling error) with prediction (Table 7).
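The expected numbers of shared segments can be evaluated with a short Python sketch. It assumes that equation 2 has the form E(*n*_{s}) = *a*(*dl* + 1)(½)^{*d*−1}, which is the form implied by the half-sib expressions used in the following paragraph and which reproduces the value E(*n*_{s}) = 1.882 quoted later for cousins with *a* = 2 and *d* = 4. The function names are hypothetical.

```python
def expected_segments(a, d, l):
    """Expected shared segments on one chromosome, assumed form of equation 2.

    a: number of common ancestors (1 half-sib based, 2 full-sib based)
    d: number of meioses separating the pair through the ancestors
    l: map length of the chromosome in Morgans
    """
    return a * (d * l + 1) * 0.5 ** (d - 1)


def expected_segments_lineal(d, l):
    """Expected segments shared with a lineal ancestor d meioses back."""
    return ((d - 1) * l + 1) * 0.5 ** (d - 1)


def expected_segments_uncle_nephew(l):
    """Special case surmised from simulation in the text."""
    return (5 * l + 2) / 4


def expected_segments_great_uncle(l):
    """Special case surmised from simulation in the text."""
    return (7 * l + 2) / 8


l = 1.632  # "average" human chromosome, Morgans
print(expected_segments(2, 4, l))         # cousins
print(expected_segments(1, 2, l))         # half-sibs
print(expected_segments_uncle_nephew(l))  # uncle-nephew
print(expected_segments_great_uncle(l))   # great-uncle-great-nephew
```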

Huff *et al.* (2011) also state that given *d*, the expected length of a shared segment is 1/*d*, based on the calculations of length surrounding a specific marker (Fisher 1949). They assume independence of numbers and length, implying from equation 2 that the expected total length of a chromosome shared is (*dl* + 1)(½)^{*d*−1}/*d* for half-sib descendants. However, because the expected proportion of the genome shared is 2*R* = (½)^{*d*−1}, the mean total length shared is actually *l*(½)^{*d*−1}. It is partitioned over the expected number (*dl* + 1)(½)^{*d*−1} of shared segments and, therefore, taking into account the finite length of the chromosome, the expected length of an individual segment is *l*/(*dl* + 1) = 1/(*d* + 1/*l*), not 1/*d*. These equations also hold for full-sib and lineal descendants. For example, for a chromosome of length 1.632 M, the expected lengths of a shared segment are 0.383 M, 0.277 M, and 0.217 M for half-sibs, half-uncle, and half-cousins, respectively, rather than 0.5 M, 0.333 M, and 0.25 M without the correction. The proportionate difference becomes smaller for more distant relatives, *e.g.*, 0.151 M rather than 0.167 M for half second cousins. For uncle–nephew and great-uncle–great-nephew, the expected lengths of individual segments on a chromosome are, from simulation, 2/(5*l* + 2) and 2/(7*l* + 2), respectively.
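The corrected segment lengths quoted above follow directly from *l*/(*dl* + 1), as this small Python check illustrates. The function name and the dictionary of *d* values (2, 3, 4, and 6 meioses for the four half-sib–based relationships) are introduced here for illustration.

```python
def expected_segment_length(d, l):
    """Expected length (Morgans) of one shared segment on a chromosome
    of map length l, allowing for the finite chromosome length."""
    return l / (d * l + 1)


l = 1.632
meioses = {"half-sibs": 2, "half-uncle": 3, "half-cousins": 4,
           "half second cousins": 6}
for name, d in meioses.items():
    corrected = expected_segment_length(d, l)
    print(f"{name}: {corrected:.3f} M corrected, {1 / d:.3f} M uncorrected")
```

The correction term 1/*l* in the denominator matters most when *d* is small, which is why the discrepancy shrinks for more distant relatives.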

Huff *et al.* (2011) also made the simplifying assumption that the number of shared segments is Poisson-distributed, implying Var(*n*_{s}) = E(*n*_{s}) on individual chromosomes and the whole genome, but simulations show departures between mean and variance (Table 7). For a chromosome of length *l* = 1.632 M, the actual distribution is rather less dispersed than the Poisson for close relatives, but slightly more dispersed for more distant relatives. For cousins, for example, E(*n*_{s}) = 1.882, V(*n*_{s}) = 1.311, and the proportion sharing no segments is ∼10% (Table 2), but the Poisson expectation is ∼15%. Further, the distribution of shared segment lengths was assumed by Huff *et al.* to be independently exponentially distributed, in which case the coefficient of variation (CV) of the total length of shared segments on a chromosome would be proportional to 1/√*n*_{s}. For close relations who may share a high proportion of the chromosome, the actual distribution is substantially underdispersed compared with the Poisson and the CV of total length shared deviates from the 1/√*n*_{s} prediction. As relationships get more distant, these predictions hold better.
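The comparison with the Poisson assumption for cousins can be reproduced from the two moments quoted above. The sketch below uses only those values; the function name `poisson_pmf` is introduced here.

```python
import math


def poisson_pmf(n, mu):
    """Poisson probability of observing n events with mean mu."""
    return math.exp(-mu) * mu ** n / math.factorial(n)


mean_ns, var_ns = 1.882, 1.311          # cousins, l = 1.632 M (from the text)
p0_poisson = poisson_pmf(0, mean_ns)    # Poisson chance of no shared segment
dispersion_index = var_ns / mean_ns     # equals 1 for a Poisson variable

print(f"Poisson P(n_s = 0) = {p0_poisson:.3f}")        # about 0.15
print(f"variance / mean    = {dispersion_index:.2f}")  # below 1: underdispersed
```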

### Using approximate sampling distributions to distinguish relationships

Because the predicted numbers (*n*_{s}) of shared segments (Huff *et al.* 2011) have the correct mean, they provide a simple route to likelihood calculations without simulations. Further, as illustrated in Figure 1, most of the information can be obtained from the numbers of shared segments without using their positions and length. As the actual distribution departs from the Poisson (Table 7), however, there would be some reduction in discriminating power in computation of likelihoods, even from number of segments shared alone. To investigate this, we computed the log likelihood ratio for alternative types of relationships using data only regarding *n*_{s} assuming it is Poisson-distributed, and we computed its mean and SD using the actual frequency distribution obtained from simulation. For simplicity, we assumed 22 chromosomes each of length 1.632 M. Examples are given in Table 8 for the log likelihood ratio computed using both the Poisson and the actual distributions.

The log likelihood ratios remain zero when the real and assumed relationships are the same. In general, E(*λ*) is smaller when the Poisson approximation is used, but the proportional reduction is inconsistent. There are cases when it is larger, which seems illogical, but there is no guarantee *λ* decreases because the test is against a false hypothesis, with the actual distribution fitting closer to the Poisson with the wrong parameters. Because the SD is also substantially affected and typically is smaller, the ratio E(*λ*)/SD(*λ*) is often larger than that computed using the correct distribution obtained by simulation, but the pattern is not consistent. In view of this, such approximations should be used with care, and in any case we have provided an exact approach (strictly, more nearly exact, from replicate simulations).

### Extension to other species: impact of chromosome number and length

Results have been given for a model human genome of *c* = 22 autosomes with a total map length of *L* = 35.9 M; however, to assess how they need modifying for other species, we consider how *c* and *L* influence results. We have shown that a model of 22 chromosomes of equal average length (1.632 M) approximates that with lengths ranging from 0.75 M to 2.75 M, with most in mid range. Therefore, if chromosomes have similar mean length to those of humans and the distribution of lengths is no more dispersed, moments for different numbers of chromosomes can be predicted well by scaling as E(*λ*) ∝ *c* and E(*λ*)/SD(*λ*) ∝ √*c* because they are independent. To investigate the impact of wider variation in length we considered alternatives with total genome length 36 M, comprising 72 chromosomes each of 0.5 M or 12 chromosomes each of 3 M.

Ability to discriminate, expressed in terms of E(*λ*)/SD(*λ*), is given for some examples of relationship in Appendix Table A13 using either numbers of segments alone or all sources, *i.e.*, numbers, positions, and lengths. In summary, when there are many independent chromosomes, E(*λ*)/SD(*λ*) is generally higher than when there are few, particularly when Wright’s relationship *R* differs, because probabilities of ibd at individual loci are mostly uncorrelated with many small chromosomes. Independent loci do not provide evidence to distinguish relationships such as uncle–nephew and half-sibs having the same *R*. Information is contained in the distribution of number and length of shared segments, however, and the differences in E(*λ*)/SD(*λ*) between the 12 and 72 chromosome models for the same total map length are small, although generally higher for *c* = 72 when comparing relationships with different *R*. For relationships with the same *R* there is negligible difference, *e.g.*, for real relationship half-sib and assumed relationship uncle–nephew, E(*λ*)/SD(*λ*) = 0.97 for 72 chromosomes and 0.98 for 12 chromosomes (Appendix Table A13), and 0.98 for 22 variable-length chromosomes (Table 5). Overall, therefore, the discriminating power clearly depends more on the total amount of genome than on the individual chromosome lengths for the typical range of lengths in mammals.

## Discussion

### Inference

Although likelihood ratios are a natural way to describe the plausibility of alternative relationships, how to draw inferences from them is less clear-cut. Let **Ω** denote the set of all pedigree relationships **R** under consideration. Because this is a finite set of discrete elements, it removes some of the difficulties in assigning prior probabilities when, typically, these are neither specified nor easy to specify. Bayes theorem can then be used to combine likelihoods and prior probabilities to produce a posterior distribution over the elements of **Ω**. Unless some form of ordering, or measure of distance, is introduced in **Ω**, it is impossible to speak of means or variances of this distribution, but it will usually have a unique mode, and the corresponding relationship **R** will be our “best guess” at the true relationship. A confidence set could be obtained by ordering relationships by posterior probability and dropping relationships with the smallest probabilities until a desired probability level is achieved for the remainder.
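As an illustration of the Bayesian calculation over a finite set **Ω**, the sketch below combines log likelihoods with prior probabilities and then builds the set described above. The log likelihood values and the function names are hypothetical; keeping the most probable relationships until the target level is reached is equivalent to dropping the least probable ones.

```python
import math

def posterior(log_lik, prior=None):
    """Posterior probabilities over a finite set of candidate relationships."""
    names = list(log_lik)
    if prior is None:
        # Uniform prior: the posterior mode is then the maximum likelihood estimate.
        prior = {r: 1.0 / len(names) for r in names}
    log_w = {r: log_lik[r] + math.log(prior[r]) for r in names}
    top = max(log_w.values())  # subtract the maximum for numerical stability
    w = {r: math.exp(v - top) for r, v in log_w.items()}
    total = sum(w.values())
    return {r: w[r] / total for r in names}

def posterior_set(post, level=0.95):
    """Most probable relationships whose total posterior mass reaches `level`."""
    kept, mass = [], 0.0
    for r, p in sorted(post.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(r)
        mass += p
        if mass >= level:
            break
    return kept

# HYPOTHETICAL log likelihoods for one pair of individuals.
log_lik = {"cousins": -40.0, "half-cousins": -43.0, "second cousins": -48.0}

post = posterior(log_lik)
best_guess = max(post, key=post.get)
print(best_guess, posterior_set(post, level=0.99))
```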

Without prior probabilities, everything hangs on the likelihood. The likelihood function is defined on **Ω**, and the relationship **R** in **Ω** that produces the maximum value of the likelihood is the maximum likelihood estimate of the true relationship, corresponding to the posterior mode with a uniform prior. Without a distance measure, and with discrete relationship classes, standard asymptotic results for maximum likelihood estimates are not available. The distribution of the maximum likelihood estimate could be calculated by simulation, however, assuming any particular **R** to be true.

Any particular **R** can be tested as a null hypothesis against the general alternative that the true relationship is not **R** by using a maximum likelihood ratio test (McPeek and Sun 2000). The set of those **R** in **Ω** for which this test is not significant at a given significance level constitutes a confidence set for the unknown relationship. McPeek and Sun (2000, p. 1079) point out that although the sampling distribution of log likelihood ratios for two fixed relationships is often close to a normal distribution (as we have shown previously; Figure 2, Appendix Tables A11 and A12), the sampling distribution of the maximized version tends to be skewed (the difference is between the estimate of **R** fixed or varying from sample to sample). Nevertheless, even in the normal case, simulation is required to obtain the mean and variance of the null distribution.

An issue that arises with both Bayesian and likelihood approaches is the completeness or otherwise of **Ω**. The true relationship might be one we neglected to consider; it might be bilineal, but not so detected (*e.g.*, paternal half-sibs and maternal second cousins), or an ancestor might be inbred so the probabilities of ibd sharing of descendants differ from those assumed here. Some relationships could be excluded based, for example, on ages of the individuals concerned, *e.g.*, some lineal or avuncular relationships.

If all that is required is to identify the “best guess” among all relationships under consideration, then we select the relationship with the largest likelihood, or the largest posterior probability. This can be regarded as a discrimination problem, with the relationships treated symmetrically. The two solutions correspond to the maximum likelihood or Bayes discriminant rules (Mardia *et al.* 1979), and the performance of such rules is judged by the set of misclassification probabilities.

Discriminating between two relationships amounts to choosing one if the log likelihood ratio *λ* > 0 and choosing the other if *λ* < 0. If the two relationships are **A** and **B**, and the distribution of *λ* is normal in each case, then the misclassification probabilities are Φ(−*m*_{B}/*s*_{B}) when we choose **A**, and Φ(−*m*_{A}/*s*_{A}) when we choose **B**, where *m*_{A} = E_{A}[*λ*(**A** : **B**)], *i.e.*, the mean of *λ* when **A** is the true relationship, and *m*_{B} = E_{B}[*λ*(**B** : **A**)] (= −E_{B}[*λ*(**A** : **B**)]) when **B** is the true relationship and *s*_{A} and *s*_{B} are the corresponding SDs. Ratios of *m*/*s* for various pairs of relationship are in Table 5 and Table 6, with more in Appendix Tables A7–A9.

As an example, let us assume X dies intestate and a search locates one living relative, indisputably a half-cousin. Subsequently, Y appears claiming to be a cousin of X (but otherwise unrelated to X), and thus to be more closely related. Given only DNA data, can the claim be substantiated or disproved? There are two competing hypotheses: for **A**, Y is a cousin of X; and for **B**, Y is a half-cousin of X. To keep this argument simple, we discount other possible relationships. Given a prior probability that X and Y are cousins, the Bayesian approach provides a posterior probability, but it is not clear what a reasonable prior probability would be in the absence of any background information for Y. With the likelihood approach, we can clearly discriminate with confidence in this situation because both misclassification probabilities are small, ∼1.5%: using Table 5, Φ(−2.22) ∼ 0.013 if we decide half-cousins, and Φ(−2.14) ∼ 0.016 if we decide cousins.
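The two misclassification probabilities in this example follow from the standard normal distribution function. A minimal sketch, assuming *λ* is normally distributed and using the ratios 2.22 and 2.14 quoted from Table 5 (the function name is illustrative):

```python
from statistics import NormalDist

def misclassification_prob(m, s=1.0):
    """P(lambda has the wrong sign) when lambda ~ Normal(m, s^2) under the truth."""
    return NormalDist().cdf(-m / s)

# Ratios m/s quoted in the text for the cousin versus half-cousin comparison.
p_decide_half_cousins = misclassification_prob(2.22)  # cousins are the true relationship
p_decide_cousins = misclassification_prob(2.14)       # half-cousins are the true relationship

print(f"{p_decide_half_cousins:.3f} {p_decide_cousins:.3f}")
```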

Taking as a simple criterion a difference of 2 SD in log likelihood ratio as an indicator of discriminating ability (corresponding to a misclassification probability of approximately 0.02), it is seen that although it is possible to distinguish between a distant and a close relationship with high power, it is more difficult between relationships of the same degree (*R*), increasingly so as *R* becomes smaller (Table 5 and Table 6). There is little power to discriminate between relationships for which *R* is 1/64 or less; for example, the probability of correct assignment (based simply on sign of the log likelihood ratio) is approximately 3/4 for second cousins once removed *vs.* third cousins as E(*λ*)/SD(*λ*) ∼ 0.6. It is easier to distinguish lineal relationships, *e.g.*, great-great-great-grandparent–offspring from second cousins, than it is to distinguish second cousins from half-cousins once removed (for all of which *R* = 1/32) because the lineal recombination and transmission process differs more than that between half and full-sib descendants.

Without use of information as shown here regarding shared genomic regions and merely considering resemblance locus by locus, relationships such as uncle–nephew and half-sib cannot be distinguished at all. It is seen that E(*λ*)/SD(*λ*) ∼ 1, whichever relationship is the real one. Hence, the likelihood ratio will be in the correct direction approximately 5/6 of the time—not certainty at a level looked for in significance tests, but not valueless. For more distant pairs with the same *R*, the probability of correct assignment will decline; for second cousins and half-cousins once removed (*R* = 1/32), the probability declines to approximately 2/3. This illustrates the limitations of making decisions about the relationship between a pair of individuals even if based on full genomic data.
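The fractions quoted here can be turned around to ask what value of E(*λ*)/SD(*λ*) each one implies. The sketch below does this under the same normal approximation; it is an illustrative check, and the function names are assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def correct_assignment_prob(ratio):
    """P(lambda > 0) under the true relationship, given ratio = E(lambda)/SD(lambda)."""
    return STD_NORMAL.cdf(ratio)

def ratio_for_prob(prob_correct):
    """E(lambda)/SD(lambda) needed to reach a given probability of correct assignment."""
    return STD_NORMAL.inv_cdf(prob_correct)

# Fractions quoted in the text for pairs of relationships.
for label, prob in [("5/6", 5 / 6), ("3/4", 3 / 4), ("2/3", 2 / 3)]:
    print(f"P(correct) = {label}: ratio of about {ratio_for_prob(prob):.2f}")
```

Under this approximation a probability of 5/6 corresponds to a ratio close to 1, and 2/3 to a ratio somewhat below 0.5.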

### Assumptions

Many assumptions have been made in this analysis. The first is that the number of shared segments is accurately recorded, and the main risk is that short segments are missed. In population studies, Browning and Browning (2013) and S.R. Browning (personal communication) report good power to detect segments of 1.5 cM and higher using dense SNP data and 1 cM or higher with sequence data. For exponentially distributed segment lengths of expected length *a* (cM), this would imply a probability of missing an individual segment of approximately 1.5/*a* (1/*a*) from SNP (sequence) data. For half-cousins, for example, the expected segment length is 21.7 cM for a chromosome of 1.632 M (see *Results* regarding approximations to sampling distributions), implying an approximately 7% chance of missing a random segment using SNPs, slightly less for closer relatives or using sequence data. Thus, there would be bias towards underestimating both Wright’s and pedigree relationship, but little in comparing relatives with the same *R*. For known relatives, however, as considered here, the probability would be expected to be much lower because the individuals are already identified as relatives and not trawled from the population. Errors in estimating segment length would be comparatively unimportant (Figure 1).
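The chance of missing a segment can be computed from the exponential distribution and compared with the linear approximation used above. This is a minimal sketch assuming exponentially distributed lengths and a hard detection threshold; the function name is illustrative.

```python
import math

def prob_segment_missed(threshold_cm, mean_length_cm):
    """P(length < threshold) for an exponential length with the given mean."""
    return 1.0 - math.exp(-threshold_cm / mean_length_cm)

mean_half_cousins = 21.7  # expected segment length (cM) quoted for half-cousins

p_snp = prob_segment_missed(1.5, mean_half_cousins)  # SNP data, 1.5 cM threshold
p_seq = prob_segment_missed(1.0, mean_half_cousins)  # sequence data, 1 cM threshold
approx_snp = 1.5 / mean_half_cousins                 # linear approximation 1.5/a

print(f"SNP: {p_snp:.3f} (approx {approx_snp:.3f}), sequence: {p_seq:.3f}")
```

The exact and approximate values differ only in the third decimal place, so the approximation 1.5/*a* is adequate at these segment lengths.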

Errors therefore will not necessarily lead to wrong assignment but to miscalculation of the likelihood ratios. As Table 2 and Table 3 show, however, the pattern of numbers shared is unlikely to change greatly if the error rate is no more than a few percent, and the relative parameters for different relationships will remain approximately the same. A detailed analysis of consequences of errors is beyond the scope of this article, however.

Further assumptions made when information on chromosome length is included are that a Haldane mapping function is appropriate and that map length can be accurately inferred from physical length of the chromosome. We consider the number of segments and the probability that shared segments reach chromosome ends would depend little or not at all on the mapping function. Problems might be encountered in measuring the segment length distribution, converted to map units, before using the data and methods presented here. If there are major experimental technical problems in measuring lengths or concern about the mapping functions or conversion from physical length, then that information could just be ignored with, for most pairings of **R**, little impact on discriminating ability (Figure 1).

We also have taken no account of distant background relationship, assuming all genome sharing was due to recent common ancestry, whereas Huff *et al.* (2011) did so. Such sharing will bias predictions towards higher relationship. As in the example here, an extra rather than lost shared segment on one or two chromosomes will have little effect on likelihood calculations for fairly close relationships. Proportional errors become larger as relationships become more distant, but as results such as in Table 6 show, the power to discriminate among quite distant relationships is low in any case.

### General conclusions

The results presented here show what can, in theory, be achieved in determining pedigree relationships from information on genome sharing. No further information is, in principle, available from analysis at the individual locus level (except perhaps from sequencing and tracing point mutations in the pedigree). The low levels of expected likelihood ratios compared with their sampling error for pairs of quite distant relationships illustrate both how much variability in actual relationship in terms of shared genome comes from random Mendelian segregation and linkage and the consequent difficulty in assigning relationship.

## Acknowledgements

We thank Bruce Weir for helpful comments and discussion and reviewers for their useful criticisms. This work was supported by grants from the Leverhulme Trust to William G. Hill and from the National Institutes of Health (GM 099568) to Bruce Weir, University of Washington, and by USS.

## Appendices

Simulated data in supplementary files.

## Footnotes

Supporting information is available online at http://www.g3journal.org/lookup/suppl/doi:10.1534/g3.113.007500/-/DC1

*Communicating editor: D.-J. De Koning*

- Received April 9, 2013.
- Accepted July 5, 2013.

- Copyright © 2013 Hill and White

This is an open-access article distributed under the terms of the Creative Commons Attribution Unported License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.