# Genomic Relatedness Strengthens Genetic Connectedness Across Management Units

- Haipeng Yu,
- Matthew L. Spangler,
- Ronald M. Lewis and
- Gota Morota
^{1}

- 1Corresponding author: Department of Animal Science, University of Nebraska-Lincoln, PO Box 830908, Lincoln, NE 68583-0908. E-mail: morota{at}unl.edu

## Abstract

Genetic connectedness refers to a measure of genetic relatedness across management units (*e.g.*, herds and flocks). With the presence of high genetic connectedness in management units, best linear unbiased prediction (BLUP) is known to provide reliable comparisons between estimated genetic values. Genetic connectedness has been studied for pedigree-based BLUP; however, relatively little attention has been paid to using genomic information to measure connectedness. In this study, we assessed genome-based connectedness across management units by applying prediction error variance of difference (PEVD), coefficient of determination (CD), and prediction error correlation r to a combination of computer simulation and real data (mice and cattle). We found that genomic information () increased the estimate of connectedness among individuals from different management units compared to that based on pedigree (). A disconnected design benefited the most. In both datasets, PEVD and CD statistics inferred increased connectedness across units when using - rather than -based relatedness, suggesting stronger connectedness. With r once using allele frequencies equal to one-half or scaling to values between 0 and 2, which is intrinsic to connectedness also increased with genomic information. However, PEVD occasionally increased, and r decreased when obtained using the alternative form of instead suggesting less connectedness. Such inconsistencies were not found with CD. We contend that genomic relatedness strengthens measures of genetic connectedness across units and has the potential to aid genomic evaluation of livestock species.

- coefficient of determination
- genomic connectedness
- prediction error correlation
- predication error variance of difference
- relatedness

The problem of connectedness or disconnectedness is particularly important in genetic evaluation of managed populations such as domesticated livestock. When selecting among animals from different management units (*e.g.*, herds and flocks), caution is needed; choosing one animal over others across management units may be associated with greater uncertainty than selection within management units. Such uncertainty is reduced if individuals from different management units are genetically linked or connected. In such a case, BLUP offers meaningful comparison of the breeding values across management units for genetic evaluation (*e.g.*, Kuehn *et al.* 2007).

Structures of breeding programs have a direct influence on levels of connectedness. Wide use of artificial insemination (AI) programs generally increases genetic connectedness across management units. For example, dairy cattle populations are considered highly connected due to dissemination of genetic material from a small number of highly selected sires. The situation may be different for species with less use of AI and more use of natural service mating such as for beef cattle or sheep populations. Under these scenarios, the magnitude of connectedness across management units is reduced and genetic links are largely confined within management units.

Pedigree-based genetic connectedness has been evaluated and applied in practice (*e.g.*, Kuehn *et al.* 2009; Eikje and Lewis 2015). However, there is a relative paucity of use of genomic information such as single nucleotide polymorphisms (SNPs) to ascertain connectedness. In what scenarios genomics can strengthen connectedness, and how much gain can be expected relative to the use of pedigree information alone, still remains unknown. Connectedness statistics have been used to optimize selective genotyping and phenotyping in simulated livestock (Pszczola *et al.* 2012) and plant populations (Maenhout *et al.* 2010), and in real maize (Rincent *et al.* 2012; Isidro *et al.* 2015) and rice data (Isidro *et al.* 2015). These studies concluded that the greater the connectedness between the reference and validation populations, the greater the predictive performance. However, (1) connectedness among different management units and (2) differences in connectedness measures between pedigree and genomic relatedness were not explored in those studies. For better understanding of genome-based connectedness, it is critical to examine how the presence of management units comes into play. For instance, genomic relatedness provides relationships between distant individuals that appear disconnected according to the available pedigree information. In addition, it captures Mendelian sampling that is not present in pedigree relationships (Hill and Weir 2011). Thus, genomic information is expected to strengthen measures of connectedness, which in turn refines comparisons of genetic values across different management units. The objective of this study was to assess measures of genetic connectedness across management units with use of genomic information. We leveraged the combination of real data and computer simulation to compare gains in measures of connectedness when moving from pedigree to genomic relationships. First, we studied a heterogenous mice dataset stratified by cage. Then, we investigated approaches to measure connectedness using real cattle data coupled with simulated management units to have greater control over the degree of confounding between fixed management groups and genetic relationships.

## Materials and Methods

### Mice data

We analyzed a heterogeneous stock mouse population established for quantitative trait mapping (Solberg *et al.* 2006; Valdar *et al.* 2006). It was originally derived from eight inbred strains (DBA/2J, C3H/HeJ, AKR/J, A/J, BALB/cJ, CBA/J, C57BL/6J, and LP/J), followed by 50 generations of pseudorandom mating. This process introduced recombinants that allow high-resolution mapping (Solberg *et al.* 2006; Valdar *et al.* 2006). This population was used for one of the first empirical applications of genomic selection in animals (Legarra *et al.* 2008) and later used for an array of quantitative genetic studies. The data consisted of 1884 individuals from 169 full-sib families with ∼11 siblings per family. Each individual was genotyped with 10,946 SNPs, yet none of the full-sib parents were genotyped. We removed SNPs with a minor allele frequency (MAF) < 0.05, resulting in 10,339 markers for analysis. The mice were reared in 523 cages or management units that created shared environments. The majority of full-sibs were housed in the same cages and distributed to three cages on average, *i.e.*, a full-sib family was typically reared together in three cages. Pedigree relationships within and across full-sib individuals were 0.5 and 0, respectively. This resulted in an extreme case of genetic disconnectedness across management units. Thus, the extent of connectedness was determined by the presence or absence of full-sibs in different management units.

### Cattle data

Pedigree information of dairy cattle was available on 1929 cattle collected over six generations starting from a base generation 0 to generation 5 (Wimmer *et al.* 2015). Among those, 500 individuals, mostly coming from generations 2 and 3 (90%), had both phenotypes and genotypes. Historic pedigree information in addition to the 500 individuals are a source of connectedness as the pedigree-based relationship matrix was constructed from the entire pedigree. The 500 individuals were genotyped for 7250 SNP markers. The average missing rate of genotypes across the entire SNP was 0.0002. We imputed missing genotypes by sampling alleles from a Bernoulli distribution with the marginal allele frequency used as a parameter. We retained 6714 SNP after removing markers with MAF < 0.05. We simulated management units in two steps: (1) individuals were clustered and (2) clusters were assigned to management units. The k-medoid clustering was performed to cluster individuals into distinctive groups. In particular, we used partitioning around medoids, which is considered a robust version of K-means (Kaufman and Rousseeuw 1990; Reynolds *et al.* 2006). We formed sets of clusters so that individuals in the same groups were more similar to each other than to those in other groups. We selected the number of clusters by optimum average silhouette width algorithm implemented in the cluster and fpc R packages. This algorithm minimizes dissimilarity measures among individuals within the same cluster using the Euclidean metric and finds the optimal number of clusters that returns the lowest average dissimilarity computed from each cluster. The clustering was based on the matrix, which was converted to a dissimilarity matrix by calculating the distance from the highest similarity to each similarity value in such a way that the relationship with the largest value becomes zero. We simulated the four following scenarios.

Scenario 1: Completely disconnected, all clusters allocated to their own management units.

Scenario 2: Disconnected, one-half of clusters allocated to management unit 1 and remaining half assigned to management unit 2.

Scenario 3: Partially connected, approximately one-third of clusters allocated to management unit 1, another one-third to management unit 2, and the remaining one-third of clusters assigned to both managements to act as a link to connect the two management units indirectly.

Scenario 4: Connected, all clusters equally allocated to the two management units.

Subsequently, appropriate incidence matrices were constructed and we computed connectedness statistics across management units employing pedigree and genomic relationships.

### Prediction error variance (PEV)

Genetic connectedness statistics are typically defined as a function of the inverse of the coefficient matrix. For instance, Kennedy and Trus (1993) proposed a genetic connectedness measure as the average PEVD in predicted genetic values between all pairs of individuals in different management units. The PEV can be obtained from Henderson’s mixed model equations (MME) (Henderson 1984). We constructed MME according to a standard linear mixed model where is a vector of phenotypes, is an incidence matrix of management units, is a vector of effects of management units, is an incidence matrix relating individuals to phenotypic records, is a vector of random additive genetic effects, and is a vector of residuals. The phenotypic vector was standardized to have mean of 0 and variance of 1 so that results could be compared across different scenarios. The variance–covariance structure for this model iswhere is the genetic variance, is the residual variance, and is a positive (semi)definite relationship matrix defined later.

The inverse of the MME coefficient matrix is represented aswhere λ is the ratio of variance components The PEV of genetic value for the *i*th individual () is given bywhere is the *i*th diagonal element of coefficient matrix. Note that PEV can be interpreted as the proportion of additive genetic variance not accounted for by the prediction. Equivalently, the matrix of PEV can be computed aswhere is the absorption (projection) matrix for fixed effects where which is orthogonal to the vector space defined by (*i.e.*, ). This avoids calculating the inverse of the entire coefficient matrix, which is useful when the number of columns of is large or analysis involves repeated computation of PEV.

### Genetic connectedness

We computed three genetic connectedness statistics: the PEVD between genetic values (Kennedy and Trus 1993), the CD of the difference between predicted genetic values (Laloë 1993), and the r between genetic values of individuals from different management units (Lewis *et al.* 1999). The first two statistics were originally used to evaluate the accuracy of individual estimated breeding values and later extended to assess inherent risk in comparing individuals across management units. First, genetic connectedness between two individuals, *i* and *j*, was measured as PEVD (Kennedy and Trus 1993)where PEC is the prediction error covariance (PEC) or covariance between errors of genetic values, which is the off-diagonal element of the PEV matrix. If PEVD is small, individuals are said to be connected. The idea behind using PEVD as a measure of connectedness is that the accurately estimated genetic values of individuals have smaller PEV and that the pairs of genetically related individuals in the different management units have a positive PEC. Throughout this study, we used a scaled PEVD following Kuehn *et al.* (2008) by scaling PEVD by the additive genetic variance to express connectedness without units of measurement.

Similarly, CD is closely related to PEVD and is defined by scaling the inverse of the coefficient matrix by corresponding coefficients from the relationship matrix. We can view CD as the squared correlation or reliability between the predicted and the true difference in the breeding values (Laloë *et al.* 1996). This is given byfor pairwise comparison. In contrast to PEVD, CD accounts for the reduction of connectedness due to relationship variability between individuals under comparison. This statistic is bounded between 0 and 1, with larger values indicating increased connectedness.

The r is obtained by transforming a PEV matrix into a predictive error correlation matrix. For individuals *i* and *j*, this statistic is given byThe rationale behind r is that there is no connectedness when PEC is zero (Lewis *et al.* 1999). Similar to CD, r is also bounded between 0 and 1. The larger the r, the greater the connectedness.

### Connectedness summary

We can generalize connectedness between any pair of management units and by setting up a corresponding contrast vector that sums to zero (*i.e.*, ) (Laloë 1993). The PEVD of contrast in genetic values is given bywhere is a column vector including and 0, for the elements corresponding to th unit, th unit, and the remaining units, respectively, where and were the numbers of individuals belonging to th and th units, respectively. In a contrast vector notation, pairwise CD between management units and is given byFor the r statistic, a similar summary statistic can be derived aswhere and were the sums of the elements of and respectively (Kuehn *et al.* 2008). However, in the *Appendix* we show that when this summary statistic is applied across units it provides a reasonable summary for a pedigree relationship matrix, but it is not suitable for a genomic relationship matrix when the total number of management units is two. Thus, we reported connectedness by averaging the r statistic for all pairs of individuals across management units.

### Relationship matrix

Connectedness is realized through a genetic relationship matrix under the BLUP framework. Three genetic connectedness statistics defined above require information about covariance structures among individuals or genetic values that evaluate relatedness. We considered five relationship kernel matrices () in this study, where *n* is the number of individuals. The numerator relationship matrix, is based on relatedness due to expected additive genetic inheritance. This can be computed directly from pedigree information, and reflects the probability that alleles are inherited from a common ancestor and thereby are identical by descent (IBD). The off-diagonal elements are twice the kinship coefficients and are equivalent to the numerators of Wright’s correlation coefficients (Wright 1921, 1922). The majority of genetic connectedness literature is based on the pedigree relationship matrix, *i.e.*, average relationships assuming conceptually, an infinite number of loci. On the other hand, the genomic relationship matrix, captures genomic similarity among individuals. The matrix is a function of the matrix of allelic counts where and denote the indices of individuals and of markers, respectively. Each element of the allele content matrix is the number of copies of the reference allele. Under Hardy–Weinberg equilibrium, and so that is a standardized incidence matrix of allelic counts, where is the allele frequency at the *j*th marker. The matrix is constructed from a cross-product of scaled marker genotype matrix divided by some constant, *i.e.*, the number of markers under assumption of unity marker varianceThe standardization of and the constant in the denominator make the matrix analogous to the matrix (VanRaden 2008). This genomic relationship matrix estimates the proportion of the genomes of two individuals that is identical by state (IBS).

One concern that arises when comparing the and matrices is that these two matrices are not on the same scale. The matrix represents the estimate of a covariance (correlation) structure among individuals marked by SNPs with the potential having some negative off-diagonal entries. Such negative values indicate that some individuals are molecularly less related than average pairs of individuals in the sense of IBS if the population were in Hardy–Weinberg equilibrium (*e.g.*, Toro *et al.* 2002). This mostly happens when the current population is defined as a base population, namely, computing the matrix by using the estimates of observed allele frequencies from the current population (Powell *et al.* 2010). While the negative coefficients arising from IBS can be interpreted as negative correlations of alleles (Toro *et al.* 2002), this is in contrast to the matrix, which is defined as an IBD. In the matrix, a founder population is assumed to be the unselected base population. This may impact some of the connectedness statistics used in this study. For this reason, we also considered two other genomic relationship matrices: a matrix and a scaled matrix, so that the genomic relationship matrix is on nearly the same scale as the matrix. The matrix was created by scaling the by instead of where is the estimate of allele frequency in the base population. Because allele frequencies in the base population are unknown, we set all equal to 0.5 under the assumption of no selection (VanRaden 2007; Toro *et al.* 2011; Vitezica *et al.* 2011). The matrix constructed in this way does not create any negative coefficients for either the mice or cattle datasets. The correlations between and (defined as correlation between elements of upper triangular matrix including diagonals) were 0.81 and 0.98 for mice and cattle, respectively.

Alternatively, a min–max scaler, one of the common scaling methods, was employed to scale the matrix. The min–max scaler transforms inputs into the given range of minimum and maximum values. The scaled genomic relationship between *i*th and *j*th individual was given bywhere and are the minimum and maximum elements of and is the *i*th, *j*th element of The and define the range of minimum and maximum values of elements of These values were set to 0 and 2, respectively, according to the minimum and maximum values of numerators of Wright’s correlation coefficients. This scaling sets negative off-diagonal entries in the matrix to 0 (Momen *et al.* 2017). Note that the correlation between and is equal to one because a correlation is invariant to changes in scale.

Lastly, the covariance between ungenotyped and genotyped individuals was jointly modeled through a hybrid matrix where The matrix can be viewed as a matrix that combines pedigree and genomic relationships. By considering the distribution of genetic values of ungenotyped individuals conditioned on genetic values of genotyped individuals, it can be shown (Legarra *et al.* 2009; Christensen and Lund 2010) thatwhere and are numerator relationship matrices among ungenotyped, ungenotyped and genotyped, and genotyped individuals, respectively. is the genomic relationship matrix for genotyped individuals. In addition to and the matrix was used for the cattle dataset that spans several generations. We treated individuals at generations three, four, and five as genotyped individuals, and earlier generations as ungenotyped individuals. This reflects a practical situation in typical breeding programs, where the majority of genotyped individuals are concentrated in more recent generations. This partitioning resulted in 65% ungenotyped and 35% genotyped individuals, simulating a realistic scenario where there are more ungenotyped than genotyped individuals (*e.g.*, Legarra *et al.* 2009).

### Principal component analysis (PCA) of measures of connectedness

PCA of PEVD, CD, and r pairwise individual-based matrices computed under the four different simulated scenarios in the cattle dataset was used to cluster individuals. The prcomp function in R was used to produce principal component (PC) scores and the PC plots were generated with the ggbiplot package based on the first two PC.

### Heritability

For simulation, we used two heritability values ( and ) by varying the ratio of variance components assuming an animal model, where and are residual and genetic variances, respectively.

### Data availability

The mouse dataset is available at http://wp.cs.ucl.ac.uk/outbredmice/heterogeneous-stock-mice/ and the cattle dataset is downloadable from the synbreedData R package at https://cran.r-project.org/web/packages/synbreedData/index.html.

## Results

### Mice data

#### Absence of full-sibs:

The average (SD) of pedigree relationships among individuals in the same management units was 0.491 (0.058) because of the aforementioned full-sib family assignments. The genomic counterpart () gave a similar estimate of 0.494 with a slightly increased SD of 0.087 due to Mendelian sampling variation (Hill and Weir 2011). The average across-management unit pedigree-based genetic connectedness was 1.299 when measured by PEVD and (Table 1). Measures of connectedness increased using genomic data () by reducing PEVD to 0.456. With while the overall genetic connectedness decreased, genomic information () lowered PEVD compared to that of pedigree. Use of the reduced PEVD more than that of the hence increased the measures of connectedness. Using the scaled genomic relationship matrix increased connectedness statistics compared to those of the pedigree-based, but they were less than those with Similarly, use of the matrix compared to the matrix strengthened measured connectedness in CD for both and The matrix also increased measures of connectedness compared to those of the and the matrix resulted in the greatest measures of connectedness among the four relatedness matrices. Both PEVD and CD statistics confirmed that genome-wide markers increased the degree of connectedness estimated between individuals across management units. However, the connectedness measures assessed by r were less when the was compared with the On the other hand, the and the scaled genomic relationship matrix estimated greater connectedness measures than those of the

#### Presence of full-sibs:

The increased estimates of disconnectedness were less when at least one full-sib was present in different management units for PEVD. For instance, comparisons between absence or presence of full-sibs across management units were 1.299 *vs.* 0.354 and 0.456 *vs.* 0.127 for pedigree-based *vs.* genome-based () PEVD, respectively. The presence of full-sibs in different management units decreased PEVD. However, corresponding statistics for CD were lower with the existence of full-sibs. This is explained by the fact that CD penalizes the estimates of connectedness when genetic variability is small. The CD statistic attempts to decrease the average PEV of the contrast while maintaining the variability of relatedness. Laloë (1993) stated that increased estimate of connectedness should not be achieved by simply using genetically similar individuals and that CD is the most relevant connectedness statistic in terms of genetic progress of agricultural species. This was confirmed in the mice data illustrating that the presence of full-sibs decreased the estimates of CD. Regardless of the absence or presence of full-sibs across units, genomic information elucidated additional relationships, thus increasing connectedness estimates relative to pedigree. This trend was also true for the and matrices. With r, when transitioning from to the values of the statistic reduced; however, the and yielded greater values of connectedness than those of pedigree in the existence of full-sibs. In all cases, using one of the or matrices increased the estimates of connectedness statistics as compared to using the As shown in Table 1, we found a similar overall pattern when was set to 0.2, although connectedness remained less than the alternative higher heritability. Replacing pedigree with genome-wide markers increased the degree of connectedness captured among individuals in disconnected management units.

#### Illustrative examples:

To illustrate how matrix impacted our measures of connectedness, we chose five management units including full-sib and nonfull-sib individuals. In this example, management units “19F,” “29A,” and “36F” share at least one pair of full-sib individuals, whereas management units “12A” and “13C” do not share any full-sib individuals across management units. Figure 1 shows PEVD-derived connectedness across management units when Comparison across-management units with full-sibs in common had smaller PEVD, hence greater connectedness. The molecular information captures more of the genetic connectedness relative to pedigree across-management units. We further investigated how the or increased connectedness measures across management units relative to the using PEVD and r. To do so, we examined the specific components in the PEV matrix derived from several management units including full-sib and nonfull-sib individuals. As shown in detail in Supplemental Material, Text S1 in File S1, we found that the rates of PEV (diagonals) and PEC (off-diagonals) reductions from to or explain the changes of connectedness measures.

### Cattle

#### Clustering:

The partitioning around medoids clustering method yielded eight clusters. Table 2 contains descriptive statistics for those clusters. The number of individuals per cluster varied from 36 to 127. The average of within-cluster pedigree-based relationships was ∼0.05, except for cluster 6 in which distant relatives were grouped together. Between clusters, all pedigree-based relationships were close to zero. Each cluster was assigned to management units in four simulated scenarios, as summarized in Figure 2.

Scenario 1: Each cluster was assigned to its own management unit.

Scenario 2: Clusters 1, 2, 3, 4, and 5 were assigned to management unit 1 and clusters 6, 7, and 8 were assigned to management unit 2.

Scenario 3: Clusters 1, 2, and 3 were assigned to management unit 1; clusters 7 and 8 were assigned to management unit 2; and individuals in clusters 4, 5, and 6 were assigned to both management units 1 and 2 to act as link among clusters or individuals that partially connect the two management units.

Scenario 4: Individuals in clusters 1 to 8 were equally assigned to management units 1 and 2.

The number of individuals in management units 1 and 2 were approximately equal in scenarios 2, 3, and 4. We computed PEVD, CD, and r for each of the four scenarios and compared genetic connectedness when using the and kernel matrices.

#### PEVD:

Across-management unit PEVDs for each of the four scenarios are presented in Table 3. Connectedness estimates increased across management units when transitioning from scenario 1 to scenario 4 for both heritability levels. Figure 3 shows the relative increase of genetic connectedness as measured with PEVD, as a percentage, across management units in comparison to scenario 1. Genetic connectedness across management units in scenario 1 was compared to across-management unit connectedness obtained from scenarios 2, 3, and 4. We observed increased genetic connectedness as more individuals from the same clusters were shared between management units, resulting in the highest connectedness estimates in scenario 4. Transitioning from scenario 1 to scenario 4 increased connectedness for and for both heritability levels. The proportional increases in genetic connectedness in pedigree-based relationships were larger than those of genomic-based relationships because matrix substantially increased measured connectedness between disconnected management units in scenario 1, reducing the gains in the following scenarios 2, 3, and 4. Also, as heritability increased, larger values of connectedness were observed. In general, and increased the estimates of connectedness compared to those of the regardless of heritability levels. This is in agreement with the mice dataset. However, with values of PEVD were unexpected; although scaled produced estimates of connectedness that were higher than those with when was set to 0.8, the same pattern was not observed for

#### CD:

Across CDs for each of the four scenarios are presented in Table 3. Similar to PEVD, the extent of connectedness across management units increased when moving from scenario 1 to scenario 2 and 3 regardless of the heritability levels. Figure S1 in File S1 shows the percentage increase in CD across management units when scenario 1 was treated as a base comparison. As with PEVD, CD statistics revealed an increase in the degree of connectedness as more individuals from the same clusters were assigned to different management units. However, the increase of CD was not observed when transitioning from scenario 1 to scenario 4. Again, this is because CD accounts for the reduction of connectedness due to reduced relatedness variability between individuals under comparison in scenario 4. This pattern was observed for both pedigree and genomic-based connectedness. Overall, and all produced CD greater than those with regardless of heritability level, yielding consistent measures of connectedness.

#### Prediction error correlation:

Prediction error correlations across management units for each of four scenarios are presented in Table 3. The results align with those of the mice dataset in that -based r statistics behave erratically in all scenarios, making them difficult to interpret. However, the anticipated increases in r were observed with the transition from the to the or the scaled matrix. Here, and -based measures consistently yielded greater connectedness values than those of pedigree counterparts. Figure S2 in File S1 shows the percentage increases in r across management units when scenario 1 was treated as a base comparison. Here, the instead of the matrix was used. The results align with those of PEVD and CD, where the extent of pedigree-based and genomic-based r statistics increase the most when more individuals from the same clusters were assigned to different management units. The magnitude of the increase was larger when heritability was greater. However, the increase of connectedness moving from scenario 1 to 2 was not observed in pedigree-based measures. While both scenarios are not connected designs because pedigree-based relationships across the eight clusters were close to zero, it is interesting to note that with pedigree-based r scenario 2 was more disconnected than scenario 1. From Kennedy and Trus (1993), this is because stronger within unit connectedness can reduce between unit connectedness.

#### Ungenotyped and genotyped individuals:

We considered a scenario where only individuals in younger generations were genotyped in the cattle dataset. For this purpose, we used the matrix, which blends ungenotyped and genotyped individuals. As shown in Table S1 in File S1, results using the H matrix lie somewhere between the results obtained when using the and matrices. This is expected because the matrix was created from a combination of and or Although an increase in measures of connectedness was observed compared to using the pedigree alone, this increase was smaller than when all individuals were genotyped. This finding suggests the possibility of strengthening the degree of connectedness even when only a subset of individuals was genotyped. An exception was observed when was constructed from for PEVD; in this case, the measures of connectedness were less than that from

#### PCA of connectedness:

PC plots for CD derived from and matrices for scenarios 1 and 4 are presented in Figure 4 and Figure 5, respectively. These correspond to the two extreme scenarios considered in the cattle dataset. In scenario 1 with eight clusters assigned to distinctive management units were separated from each other as expected using pedigree-based relationships (Figure 4). Genomic information brought these eight clusters closer to each other, thus shortening the distance between individuals from different management units. While eight clusters were less distinguishable from one another due to lower heritability, the same pattern was observed when was 0.2. These findings align with the fact that use of genomic information increases measures of connectedness compared to pedigree. In both cases, cluster 6, which consisted of unrelated individuals, was clustered far away from the other clusters in the pedigree-based analysis. PCA yielded two clear clusters in scenario 4 when which correspond to the two management units considered (Figure 5). Replacing with resulted in a tighter concentration of a single cluster. A similar tendency was observed when supporting the findings that the extent of measures of connectedness between individuals from different management units is enhanced with genomic information. The remaining PC plots for PEVD ( and ) and r ( and ) are in Figure S3, Figure S4, Figure S5, and Figure S6 in File S1, which present a patterns similar to those in Figure 4 and Figure 5.

## Discussion

With sufficient connectedness across management units, BLUP of genetic values can be fairly compared. Without such connectedness, making selection decisions based on breeding values of individuals from different management units might be associated with an increased risk of uncertainty in genetic evaluation due to imperfect separation of the genetic signal from noise. In addition to PEVD, CD, and r, other connectedness measures have been applied to pedigree data (*e.g.*, Foulley *et al.* 1992; Fouilloux *et al.* 2008), which have their own characteristics. Advancement of molecular biotechnology now enables us to assess connectedness at the genomic level. Although genomic data are clearly important in genetic evaluations due to increased accuracy of estimates of genetic merit for nonparent individuals, little consideration has been given to the effect of genomic information on connectedness measures. In this study, we employed three measures of connectedness to examine the extent to which genomic information increases the estimates of connectedness.

### Relatedness in quantitative genetics

The majority of connectedness among management units was driven by the degree of genetic links or relatedness. The theory behind relatedness is largely entrenched in quantitative genetics dating back to the work of Fisher (1918) and Wright (1921). Quantitative genetics offers a useful framework to study traits and diseases that are controlled by a considerable number of small effect genes. For traits with polygenic genetic architectures, inheritance does not exhibit a clear genotype–phenotype pattern. However, genetic resemblance between relatives (*e.g.*, the genetic correlations between parent and offspring or between pairs of different types of siblings) can be exploited to estimate quantitative genetic parameters. For this reason, genetic resemblance between relatives has been at the heart of quantitative genetics. Consequently, the vast majority of the theoretical developments and applications of the last century were built around family data. The availability of dense panels of common SNPs has made it possible to trace Mendelian sampling and hence capture more detailed relatedness compared to pedigree information. It enables the quantification of genomic kinships among related individuals that are not otherwise apparent because of incomplete pedigrees or the general assumption that animals in a baseline or founder population are unrelated. Thus, it has opened up new opportunities for quantitative genetic analysis using data from distant relatives. The rationale is that individuals are genomically related to some extent and that molecular similarity introduces covariance even if individuals are not related in the sense of known pedigree. These factors possibly contribute to the reduction of PEV or increase of PEC, and hence lead to increased capturing of genetic connectedness in PEVD, CD, and r such that genetic merit estimates can be better compared across management units.

### The impact of genomic information on connectedness

We found from the mice data that genomic information increased favorable changes in measures of connectedness among individuals from different management units and reduced the risk of potential uncertainty in EBV-based comparisons when selecting individuals across management units. In addition, the rate of improvement in measures of connectedness in PEVD and r was greater when there was at least one full-sib in different management units. This is in concordance with Legarra *et al.* (2008), who used the same dataset and reported that the use of genome-wide selection increased predictive performance up to 0.22 across families and up to 0.03 within families compared to pedigree-based regression counterparts. On the other hand, CD accounted for the reduction of variability of relatedness between individuals under comparison resulting in decreased estimates of connectedness. Analysis of cattle data supported the results from mice and revealed that the benefit of using genomic information is greater for a disconnected design rather than a connected design. PCA was performed to visualize improvement in connectedness when moving from pedigree to genomic-based relationships. The PC plots supported the evidence that genomic information can improve detection of connectedness between individuals from different management units. This is particularly so when more individuals from the same clusters are assigned to different management units.

### Choice of kernel matrices

Unlike PEVD and CD, comparisons between the and kernel matrices evaluated by the r statistic behaved irregularly. By examining the specific components of the PEV matrix for and in the mice dataset, we found that genomic information reduces off-diagonal elements more than diagonals. This illustrates a fundamental difference between r and either PEVD or CD because this statistic is based on the ratio, rather than the magnitude, of individual elements. It may be argued that the inconsistent connectedness results from r occur because the matrix is not on the same scale as the matrix, suggesting that r statistics are not invariant regarding how genomic relatedness is defined. Given pedigree information, the numerator relationship matrix is defined as IBD. On the other hand, given a marker matrix, there are a number of ways to construct a genomic relationship matrix, as discussed by Toro *et al.* (2011). The matrix we used captures the proportion of the genome that is IBS by accounting for the covariance structure among individuals by molecular markers (Toro *et al.* 2002). This kernel matrix is an estimator of IBD relationships (Powell *et al.* 2010). Caution should be exercised when interpreting connectedness measures derived using genomic data, as the underlying assumption is that relationships are built based on alleles being IBS and not necessarily being IBD. Therefore, we attempted to make more compatible to by using derived from allele frequencies equal to 0.5 and by using the min–max scaler transformation to produce the scaled genomic relationship matrix For instance, compared to using entries of PEV matrix from using were more similar to those especially when there was connectedness, and in turn r statistics yielded greater connectedness values. Although connecting marker-based genomic relatedness to classical theory is still an open question in quantitative genetics, care needs to be taken when comparing genetic connectedness with genomic connectedness, especially when the ratio-based statistic is used. Moreover, many additional factors may influence the elements of IBS matrix such as the choice of MAF, the density of SNP, imperfect linkage disequilibrium (LD) between markers and QTL, and errors associated with estimating genomic relationships from a finite set of markers (*e.g.*, Goddard 2009).

### Choice of connectedness statistics

There was an issue with PEVD coupled with the matrix in the cattle dataset when = 0.2, as the estimates of connectedness were less than those using (Table 3). Note that this was not the case when = 0.8. The matrix blended from the and kernel matrices yielded the estimates of connectedness that lie somewhere between the results obtained when using and alone. However, this pattern was not observed when was used in conjunction with to compute PEVD (Table S1 in File S1). Apparently, scaling has a negative influence on blending for PEVD, which warrants further research. One potential reason with for the discrepancy is that the proportional increase of PEC relative to PEV is larger when transitioning from the to This issue of proportional change is similar to that observed earlier with the r statistic coupled with These results illustrate that connectedness statistics are not invariant with respect to how the genomic relationship matrix is created and that each of them captures different aspects of genomic connectedness. The CD was the only statistic that yielded consistent estimates of increased connectedness throughout this study. Its consistency was observed regardless of choice of kernel matrices, heritability levels, datasets used, and simulated scenarios for management units.

### Inferring variance components from data

One concern with the current study is fixing heritability levels for all scenarios based on the assumption that both pedigree and genomic relationship matrices explain the equal amount of heritability. In practice, this assumption might not hold true when SNPs do not capture the entire QTL signals. To address this concern, additional analysis was carried out such that variance components were estimated from the data rather than assuming these were known. We analyzed two publicly available phenotypes in the synbreedData R package (Wimmer *et al.* 2015) for the cattle data used in this study. The heritabilities of these traits are 0.66 and 0.41. Connectedness analysis in Table 3 was repeated based on variance components estimated by a restricted maximum likelihood. The measures of CD derived from the **A** and **G** matrices are shown in Table S2 in File S1. We found that genomic relatedness increased connectedness measures more so than those of pedigree when variance components were directly estimated from the data. This result was consistent with what we found in the cattle data analysis reported in Table 3.

### Future direction

One important direction for future study is to investigate whether increased connectedness observed by genomic relatedness also leads to increased predictive accuracy of genetic values across management units assessed by cross-validation. In this case, across-management units can be considered as training and testing sets. In addition, while the current norm of genomic prediction is to use an IBS relationship matrix that aims to capture relationships at unknown QTL through LD between markers and QTL, we argue that improving the quality of breeding value comparisons and improving the accuracy of genomic prediction can be viewed as relevant but two different items. In this regard, a genome-wide IBD relationship matrix (*e.g.*, Fernando and Grossman 1989), where marker inheritance is traced through a known pedigree, may be worthwhile to revisit for the purpose of ascertaining connectedness in a future study.

Also, for the r statistic, we summarized connectedness by averaging the r statistic of pairs of individuals across units rather than by averaging the relevant components of PEC and PEV followed by taking their ratio; our justification for that choice is provided in the *Appendix*. When the latter summary statistic was used for r, the differences were negligible in the mice data and the pattern was the same for scenario 1 of the cattle data.

In conclusion, this study confirms that use of genomic relatedness improved genetic connectedness across management units compared to the use of pedigree relationships. To our knowledge, this marks the first thorough investigation of genomic connectedness. We contend that our work is a critical first step toward better understanding genetic connectedness that may have a positive impact on genomic evaluation of agricultural species.

## Acknowledgments

The authors thank Dale Van Vleck and Larry Kuehn for their valuable comments on the manuscript. This work was supported in part by the University of Nebraska startup funds to G.M.

## Appendix

### Prediction error correlation statistic across units

When a population is divided into two management units, and relatedness between those two units is based on the matrix, the flock connectedness or unit connectedness correlation r of Kuehn *et al.* (2008) always yields an estimate of −1. Here, we wish to illustrate that result. Flock connectedness is derived by averaging the relevant components of PEC and PEV followed by taking their ratio. Suppose that there are two units and the numbers of individuals in units and are and respectively. The total number of individuals is The assumptions of the matrix is that the genotyped individuals represent the base population where the expected value of self-relatedness is one, assuming no inbreeding, and that the mean relatedness of any one individual to the rest of the individuals is zero. Consequently, the expectation of diagonal elements of the matrix is equal to the number of individuals assuming no inbreeding and the expectation of off-diagonal elements is (*e.g.*, Yang *et al.* 2013). Then, the expectation of numerator in the r statistic is proportional toand the expectation of denominator is the square root of the product betweenandso thatNote that the first three terms in the numerator are equal to zerobecause Therefore, the r statistic between units and is given byWhen this result holds regardless of relatedness level, connectedness level, and how individuals are partitioned into the two management units and The partitioning of animals into two distinct units is particularly relevant in the context of genomic prediction, where animals may be divided into training and testing sets. In this scenario, computing the connectedness between the two sets along the lines of Rincent *et al.* (2012) and Isidro *et al.* (2015) is potentially informative relative to expectations of the performance of resulting genomic predictors. The or matrix changes the expectation of off-diagonal elements to positive values and shifts the statistic by a constant, as explained in the *Materials and Methods* section, yielding connectedness between units and of close to one. Because scenarios 2–4 in the cattle dataset simulated two management units, the average of the r statistic of pairs of individuals in different management units was used to summarize connectedness in this study. Note that this is shown as lamb connectedness or individual connectedness in Kuehn *et al.* (2008). The two types of connectedness differ mainly by whether we take the average followed by the ratio (unit connectedness) or take the ratio first followed by the average (individual connectedness).

## Footnotes

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.300151/-/DC1.

*Communicating editor: G. de los Campos*

- Received May 6, 2017.
- Accepted August 30, 2017.

- Copyright © 2017 Yu
*et al.*

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.