Extensions of BLUP Models for Genomic Prediction in Heterogeneous Populations: Application in a Diverse Switchgrass Sample

Genomic prediction is a useful tool to accelerate genetic gain in selection using DNA marker information. However, this technology typically relies on standard prediction procedures, such as genomic BLUP, that are not designed to accommodate population heterogeneity resulting from differences in marker effects across populations. In this study, we assayed different prediction procedures to capture marker-by-population interactions in genomic prediction models. Prediction procedures included genomic BLUP and two kernel-based extensions of genomic BLUP which explicitly accounted for population heterogeneity. To model population heterogeneity, dissemblance between populations was either depicted by a unique coefficient (as previously reported), or a more flexible function of genetic distance between populations (proposed herein). Models under investigation were applied in a diverse switchgrass sample under two validation schemes: whole-sample calibration, where all individuals except selection candidates are included in the calibration set, and cross-population calibration, where the target population is entirely excluded from the calibration set. First, we showed that using fixed effects, from principal components or putative population groups, appeared detrimental to prediction accuracy, especially in cross-population calibration. Then we showed that modeling population heterogeneity by our proposed procedure resulted in highly significant improvements in model fit. In such cases, gains in accuracy were often positive. These results suggest that population heterogeneity may be parsimoniously captured by kernel methods. However, in cases where improvement in model fit by our proposed procedure is null-to-moderate, ignoring heterogeneity should probably be preferred due to the robustness and simplicity of the standard genomic BLUP model.

typically benefit accuracy of the models (Lorenzana & Bernardo 2009, VanRaden et al. 2009). However, in practice, increasing the CS size may often involve calibrating prediction models on individuals with inconsistent LD patterns and/or backgrounds, which may result in reduced accuracy (Wientjes et al. 2016). This issue will arise in the typical situation where an initially homogeneous CS is augmented with individuals from extraneous populations, that is, multi-populationor (in the animal literature) multi-breedcalibration . Recently, studies in both plant and animal breeding have assessed the usefulness of combining populations from different genetic backgrounds in genomic prediction.
Under standard genomic prediction models (where population heterogeneity is ignored), the simulation study of de Roos et al. (2009) suggested that adding an extraneous population to a CS may benefit prediction accuracy if the added population is not too dissimilar (in terms of divergence time) from the initial CS. These authors also suggested that high enough marker density could prevent prediction accuracy from decreasing, even in cases of strong divergence between populations. Consistently, most empirical studies of multi-population calibration with high marker density, based on standard GBLUP and/ or BLR, have reported little or no gain in accuracy under strong population structure , Jarquin et al. 2016, Hayes et al. 2009c, Erbe et al. 2012. In contrast, only a few studies have reported substantial increases in accuracy from multi-population calibration in similar conditions (Technow et al. 2013, Daetwyler et al. 2012. In multi-population prediction models (where marker-by-population interactions are explicitly accounted for), studies have proposed to fit, to the whole set of available individuals, models that were capable of accommodating population heterogeneity explicitly. This type of models includes multi-trait GBLUP models, with "traits" corresponding to population backgrounds (Karoui et al. 2012, Carillier et al. 2014, and random regression models based on markers interacting with discrete population cluster coefficients , with an extension of a standard BLR model). To our knowledge, the implementation of these methods has not been adapted to contexts of admixture, where population structure variables are continuous. Furthermore, when calibration involves many populations, the increase in model complexity of these methods will make them computationally intractable and statistically inefficient. Parsimonious multi-population models, based on only a few parameters to capture population heterogeneity, have also been proposed (Zhou et al. 2014, Heslot and. In the presence of many populations, such models are more practical and potentially more useful than multi-trait and random interaction models. Also, since they generally assume some underlying basis for population heterogeneity (e.g., inconsistency in LD patterns), they may generate insight about the causes of marker-by-population interactions.
In this study, we first considered a standard GBLUP model and evaluated different types of fixed effects to reflect population structure, then we investigated the usefulness of standard GBLUP and two kernelbased extensions for coping with population heterogeneity. These procedures were compared to a standard GBLUP model under two validation schemes: whole-sample calibration, where all individuals except selection candidates are included in the calibration set, and cross-population calibration, where the target population is excluded from the calibration set. The two multi-population GBLUP extensions are one previously-reported model, derived from Heslot and Jannink (2015), and one proposed model, based on a flexible kernel function of population principal components. We applied the procedures to the analysis of three traits (plant height, heading date, and standability) in switchgrass (Panicum virgatum L.), an herbaceous biomass crop showing good promise for bioenergy production (Sanderson et al. 1996, Langholtz et al. 2016. This species is characterized by an extensive diversity which makes it particularly suitable for studying population heterogeneity (Casler 2012). The sample under study comprised seven population clusters from two diverse panels, assayed in the Midwestern region of the United States, which represent differentiation by ecotype (upland or lowland), geographical origin (latitudinal and longitudinal gradients) and ploidy level (tetraploid or octoploid). This sample exemplified the heterogeneity of data available for practical applications of genomic prediction, which pose both opportunities (by increased sample sizes) and challenges (by inconsistencies in marker effects across populations) for breeding based on DNA markers.
In this study, we did not fit BLR models (usually based on Markov chain Monte Carlo optimizations), since we focused on deterministic methods for model fit and considered only models based on computationally efficient best linear unbiased predictors (BLUP). Further research would be needed to develop kernel-based extensions of this type of models.

Panels and populations
In this study, two multi-population panels were assayed and considered together in one sample. The first panel was the breeding panel (BP) described in Ramstein et al. (2016), comprising two tetraploid breeding populations of half-sib families: WS4U-C2, which consisted of 137 half-sib families derived from a diverse upland-ecotype pool of 162 plants (Casler et al. 2006), and Liberty-C2, which consisted of 110 half-sib families derived from the lowland-upland cultivar Liberty (Casler and Vogel 2014). The second panel was the association panel (AP) described in Lu et al. (2013) and Evans et al. (2015), comprising six putative populations of clonally propagated genotypes of different ecotypes (U: upland; L: lowland), ploidy levels (4X: tetraploid; 8X: octoploid) and geographical origins (S: South; W: West; N: North; E: East): U4X-N (135 plants), U8X-W (129 plants), U8X-E (97 plants), U8X-S (10 plants), L4X-NE (106 plants) and L4X-S (37 plants). These populations corresponded to 66 diverse accessions (Lu et al. 2013, Evans et al. 2015 with up to 10 individuals per accession. In WS4U-C2, one individual was discarded so as to avoid assigning it to a population in AP, since it was too distantly related to the other individuals in BP (based on principal component analysis). In total, n ¼ 760 individuals were considered in this analysis. The main goal of this study was to assess different methods for accommodating genetic heterogeneity when predicting phenotypic means in a given target population. Four targets were chosen, with a defined focus on tetraploid populations with at least 100 relatively homogeneous individuals: WS4U-C2 and Liberty-C2 (from BP), and U4X-N and L4X-NE (from AP).

Marker data
Exome capture sequencing of individuals (parents in BP and clonally propagated plants in AP) was performed using the Roche-Nimblegen protocol for preparation of SeqCap EZ Developer libraries using the Roche-Nimblegen probeset '120911_Switchrass_GLBRC_R_EZ_HX1' as described previously (Evans et al. , 2015. Reads from sequencing were aligned to the hardmasked P. virgatum v1.1 reference genome (http://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Pvirgatum). Counts of reads corresponding to alternate and reference alleles for each individual were then determined as described previously (for BP, Ramstein et al. 2016;for AP, Evans et al. 2014 at 2,179,164 single nucleotide polymorphism (SNP) loci, which were identified as polymorphic in two diversity panels: the Northern Switchgrass Panel, corresponding to AP (Evans et al. , 2015, and a southern switchgrass panel (E. C. Brummer, unpublished data). The numbers of alternate allele at the SNP loci were then called by using the expectation-maximization algorithm of Martin et al. (2010) fitted in each population (in BP) or accession (in AP) separately, under the assumption of disomic inheritance. Although this assumption is supported in switchgrass for tetraploid genotypes (Okada et al. 2010;Li et al. 2014), it does not hold for octoploid genotypes, which would presumably exhibit tetrasomic inheritance. However, we did not adapt the algorithm of Martin et al. (2010) to accommodate possible tetrasomic inheritance, as the sequencing depth of $24 was deemed insufficient for calling intermediate heterozygotes (simplex and triplex) with high enough accuracy (Evans et al. 2015). Indeed, a sequencing depth of at least 60-80 was recommended by Uitdewilligen et al. (2013) for accurately calling tetrasomic genotypes. Therefore, for all individuals, the resulting marker-data matrix consisted of expected allelic dosages (sums of alternate-allele counts weighted by their posterior probabilities, for every individual and SNP) between 0 and 2.
The SNP data were filtered based on the following criteria: (i) proportion of missing values strictly lower than 2%, a stringent threshold given prior filtering of SNPs on read depth $ 5 (Evans et al. , 2015; (ii) minor allele frequency strictly greater than 1 2n and variance of allelic dosages strictly greater than 2 À 1 2n ÁÀ 1 2 1 2n Á , with n ¼ 760 individuals; (iii) p-value for Hardy-Weinberg equilibrium strictly greater than 10 24 in each BP population; (iv) availability of genomic-location information (as per v1.1 of the reference genome of P. virgatum). Missing values at SNPs were imputed by their mode in the whole sample. The resulting n · m filtered and imputed marker-data matrix X consisted of allelic dosages at m ¼ 717; 814 SNP markers.

Phenotypic data
Populations in BP were assayed each year between 2012 and 2014, in Arlington, WI (USA), in a randomized complete block design, with four replicates for WS4U-C2 and three replicates for Liberty-C2. Populations in AP were assayed each year between 2009 and 2011 in Ithaca, NY (USA), in a sets-in-reps design, with two replicates per individual and 10 sets within each replicate, with each set comprising at most one individual from each of the 66 accessions in AP (Lu et al. 2013, Evans et al. 2015. In each panel, three phenotypic traits were considered: plant height, heading date and standability. Plant height (PH) was measured in centimeters as the height from the ground to the top of the tallest panicle. Heading date (HD) was measured in growing degrees days as the cumulated sum of daily average temperatures (in degrees Celsius;°) above 10°, from January 1 st to the day of heading, defined as the emergence of at least half of the panicles from the boot (Mitchell et al. 1997); daily average temperatures were estimated by the average of the minimum and maximum daily temperatures. Standability (St) was measured on a 0-10 scale to describe plants' stature and stiffness, with 0 qualifying plants that are prostrate and 10 qualifying upright and rigid plants (Lipka et al. 2014).
Not all traits were measured every year in any given population: only HD was measured in all three years in AP populations and Liberty-C2. For all other cases, measurements were available for only a subset of years (Table 1).
In BP, observational units were plants within half-sib families from a given genotype (maternal parent) i. Half-sib families were arranged in a randomized complete block design and assayed in multiple years; so the following mixed model was fitted to phenotypic measurements P ijlm , to estimate half-sib family effects f i 's: where m is the population mean; f i , r j and t l are the effects of half-sib family i (fixed), block j (random) and year l (random) respectively; · indicates interactions (random); e ijlm are residuals for plant m within plot ij in year l. In AP, observational units were clones of a given genotype i. Genotypes were arranged in a sets-in-reps design and assayed in multiple years; so the following model was fitted to measurements P ijkl to estimate genotype effects g i 's: where m is the panel mean; g i , r j , s jk and t l are the effects of genotype i (fixed), replicate j (random), set k within replicate j (random) and year l (random) respectively; · indicates interactions (random); e ijl are residuals for clone ij in year l.
The models described above conform to analyses of strip-plot (splitblock) designs (Steel et al. 1996), in which years and genotype classes (half-sib families in BP, individual genotypes in AP) are whole-plot factors in cross-classification and sub-plot factors are combinations of years and genotype classes. For each random term, the corresponding effects were modeled as independent and identically normally distributed. The linear mixed models described above were fitted using ASREML-R (Butler et al. 2009).
Effects f i 's in BP are transmitting abilities of genotypes (the mean of their half-sib progeny in their respective breeding population), so where BV i is the breeding value of genotype i. In comparison, effects g i 's in AP are genotypic values, such that where D i is the deviation from additivity due to dominance and/or epistasis. Thus, outcomes of interest for genomic prediction were set to be genotype means y i 's such that y i ¼m þ 2f i in BP and y i ¼m þĝ i in AP, withm the estimated population mean in BP or estimated panel mean in AP.

Population structure and relatedness
Admixture analysis: The soft clustering model from the ADMIXTURE software was fitted on the whole sample and the whole set of SNPs, i.e., without selection on individuals or markers (Alexander et al. 2009). Based on the 10-fold cross-validation implemented in ADMIXTURE (Alexander and Lange 2011), the number of population clusters in the admixture model was set to K ¼ 7, as cross-validation error reached a plateau at that value ( Figure S1). The resulting n · K matrix A of admixture coefficients comprised inferred membership probabilities at each cluster ( Figure 1a). For convenience (in prediction models), minimum values in A (10 25 ) were set to zero while ensuring that each row still summed to one (by dividing each element in A by its row sum).
Principal component analysis: Principal component analysis (PCA) was performed on the whole sample and the whole set of SNPs. The number of principal components (PCs) to choose for depicting population structure was chosen based on the proportion of variance explained and the grouping patterns captured by PCs (Figure 1b). The resulting n · d PC matrix P consisted of coordinates for each individual at the first d ¼ 4 PCs.
Recent relationships among individuals: Let G ¼ _ X _ X9=v be the genomic relationship matrix as defined by VanRaden (2008), where _ X is the centered marker-data matrix and v ¼ 2 P m l¼1p l ð1 2p l Þ is a scaling factor depending on allele frequenciesp l 9s (estimated on the whole sample), where m ¼ 717; 814 is the number of SNP markers. Following Fan et al. (2013), G was decomposed as G ¼ PP 0 þ G P , where P consisted of the first d ¼ 4 PCs, as described above. Matrix PP 0 is the dense part of the relationship matrix G, representing resemblance among individuals through population structure, whereas matrix G P represents recent relationships conditional on population structure, similarly to the adjusted relationships introduced by Thornton et al. (2012) and Conomos et al. (2016), with the difference that here coefficients in G P are not scaled for direct estimation of recent-kinship coefficients. Here the graphical LASSO was applied to G P to infer a graph of recent relationships among individuals, according to a regularization parameter l. Parameter l was chosen to maximize the restricted likelihood in a GBLUP model based on the regularized genomic relationship matrix G, fitted to the whole sample (see Appendix 1 for technical details and discussion on graph inference). The GBLUP model depicting relationships throughG was fitted for each trait separately, so different optimal values of l were inferred for different traits.
Genomic prediction models All linear mixed models described below were fitted using the R package rrBLUP (Endelman 2011). For a given marker-data matrix X and vector y of outcomes, the standard GBLUP model is described as follows: where y is the n-vector of genotype means (y i 9s, as described above); X is the n · m marker-data matrix (here consisting of allelic dosages), and s 2 b is the variance of marker effects; Q is the model matrix for fixed effects a; I n s 2 e is the covariance matrix for errors considered independent and identically distributed.
Hereafter, the testing set TS is defined as the set of individuals left out for model validation. The calibration set CS is the set of individuals used to fit the prediction models, which excludes the TS but does not necessarily consist of all remaining (available) individuals.
We defined the mean structure in fitted models by Q being one of the following: (Intercept) a n-vector of ones 1 n , such that fixed effects consisted of a single intercept; (PCA) the n · 5 matrix ½ 1 n P of column vector of ones and first four PCs; (Panel) the n · 2 model matrix attributing observations to panel AP or BP, such that fixed effects reflected differences in genetic compositions and environments across panels; or (Group) the n · 7 matrix model matrix attributing individuals to the following putative population groups: WS4U-C2, Liberty-C2, U4X-N, U8X-W+U8X-S, U8X-E, L4X-NE and L4X-S. Genotypes from U8X-S were grouped with U8X-W on the basis of their proximity according to the first 4 PCs, to avoid having one group with too few observations.
In this study, we first compared mean structures with respect to prediction accuracy under the standard prediction procedure (GBLUP, as described below). Then, we focused on Q ¼ 1 n and compared prediction procedures for accommodating population heterogeneity (see below; GBLUP, GBLUP-Target, MPM-Mixture, MPM-Matérn). Prediction accuracy of models (differing either by mean structure or prediction procedure) was assessed by cross-validation as described in the next subsection (Validations).

Whole-sample model: GBLUP:
In the whole-sample model (GBLUP), we fitted model (1) to all available individuals, thereby assuming that the whole sample consists of only one population. This method consists of ignoring population heterogeneity and relying on robustness of standard GBLUP to interactions between markers and population backgrounds.  Target-population model: GBLUP-Target: In the target-population model (GBLUP-Target), we fitted model (1) to individuals belonging to the same population as the TS, when possible (see below). This method corresponds to a typical choice of reducing population heterogeneity and basing predictions only on individuals that have genetic backgrounds that are a priori similar to those in the TS.
Multi-population models: MPM-Mixture, MPM-Matérn: Multipopulation models (MPM) were extensions of model (1) intended to accommodate population heterogeneity. The following general model was fitted: where s is the element-wise (Hadamard) product, and V n is a n · n covariance matrix depicting population differentiation among individuals (see Appendix 2 for derivations and technical details). To parsimoniously estimate V n , we used two different procedures: MPM-Mixture (based on A) and MPM-Matérn (based on P). In both procedures, we did not model any heteroscedasticity for additive genetic effects u.
where J n is the n · n matrix of ones and Q K is a K · K matrix depicting relationships among population clusters as inferred in A. Here, we simply set Q K ¼ I K (I K is the K · K identity matrix), so V n ¼ rAA 0 þ ð1 2 rÞJ n . Therefore in this procedure, r 2 ½0; 1 set a trade-off between the case where relationships were cluster-specific (r ¼ 1) and the case where relationships assumed one single homogeneous population for all individuals (r ¼ 0). This approach is similar (but not exactly equivalent) to the K-kernel method of Heslot and Jannink (2015), which set a similar balance between cluster-specific and overall relationships, but using G for relationships (VanRaden 2008), instead of XX9, and considering only discrete population clusters (in which case values in A would then be only 0 or 1). Alternatively, MPM-Mixture may be viewed as a multi-kernel model where rs 2 b and ð1 2 rÞs 2 b are the variance components respectively associated to cluster-specific and main marker effects.
In MPM-Matérn (the proposed MPM procedure), V n ¼ ðk n;h ðp i ; p j ÞÞ n·n , where k n;h is a Matérn kernel function of p i and p j : is the Euclidean distance between the d-vectors of PC coordinates for any pair (i, j) of individuals, n . 0 is a shape parameter, h . 0 is a scale parameter, and R n fÁg is the modified Bessel function of the second kind, of order n (Abramowitz andStegun 1984, Ober et al. 2011). Matérn functions have been used in various contexts, including in genomic prediction for depicting relationships among individuals (Ober et al. 2011). Here, we used Matérn functions to depict relationships among populations, with the input p i 2 p j 2 representing differentiation with respect to population structure in d ¼ 4 orthogonal directions. We used Matérn functions instead of more typical kernel functions (e.g., an exponential or Gaussian kernel function) to allow for some flexibility in the shape of the correlation in V n : n ¼ 0:5 and n ¼ N correspond respectively to the exponential and Gaussian kernels as special cases, while different shapes can also be fitted (Ober et al. 2011).
The parameter r in MPM-Mixture was estimated by maximizing the restricted likelihood of model (2) using the optimization algorithm implemented in the R function optimize. The parameters n and h in MPM-Matérn were estimated by maximizing the restricted likelihood of model (2) using the Nelder-Mead algorithm implemented in the R function constrOptim, with constraints for positivity. In order to control (to some extent) for the possible presence of local maxima in the restricted likelihood surface in MPM-Matérn, we used four different starting points ðn 0 ; h 0 Þ: ð0:5; D max =2Þ, ð0:5; D max Þ, ð10; D max =2Þ and ð10; D max Þ, with D max the maximum distance p i 2 p j 2 observed over pairs of individuals (i, j). In cross-validation (see next section), parameters r, n and h were estimated in each CS separately.

Validations
We assessed the accuracy of our prediction procedures by crossvalidation (CV) under two schemes: whole-sample calibration, where all individuals except the TS are included in the CS, and cross-population calibration, where the target population (the population to which the TS belongs) is excluded from the CS. The target-population model GBLUP-Target was only assayed in wholesample calibration, since this model could only rely on individuals from the target population for calibration (in GBLUP-Target, the CS could only consist of individuals in the target population, which was not possible in cross-population calibration).
For each target population (L4X-NE, U4X-N, Liberty-C2 or WS4U-C2), we used as the TS a random subset of the target sample. The size of the TS was one fifth of the target sample size. All remaining individuals were used as input to the prediction procedures. Such validations were replicated n rep ¼ 20 times for each target.
Prediction procedures were evaluated for accuracy by c TS ¼ Corðy TS ;ŷ TS Þ, i.e., the correlation between actual and predicted outcomes in a given TS. To assess the significance of differences in prediction accuracy between two procedures, we performed a t-test is the vector of prediction accuracies over testing sets for the tested procedure (baseline procedure); and z is the Fisher transformation. The standard error of the mean difference in prediction accuracy, SDð dÞ, was estimated in two different ways: where SDðd TS Þ is the standard deviation of d, with all testing sets assumed to be independent datasets; (conservative t-test) based on the first method of Nadeau and Bengio (2003), , where redundancy over testing sets is accounted for by the additional term o 1 2 o , with o being the expected fraction of overlap among testing sets; here o ¼ 1 5 and o 1 2 o ¼ 1 4 because testing sets were random subsets consisting of a fifth of any given target sample. We considered that this method for estimating SDð dÞ was conservative in whole-sample calibration because Nadeau and Bengio (2003) derived it by assuming that the CV criterion (the "loss function", analog here to zðc TS;t Þ 2 zðc TS;0 Þ, for a given TS) did not depend on the CS instances, given a particular CS size. Therefore the adjustment from Nadeau and Bengio (2003) may have overestimated the correlation among values of the CV criterion across replicates, in whole-sample calibration, since prediction procedures are probably quite sensitive to differences in the composition of the CS. In all comparisons between procedures, we reported the results from both tests in order to characterize the significance of differences in prediction accuracy.

Data availability
Population information (population assignment and geographical origin of genotypes, when available), raw phenotypic data (trait measurements at individual plants) and estimated genotype means (for maternal parents in BP and individuals in AP) are available in Files S1, S2 and S3, respectively. These supplementary files as well as the marker data (allelic dosages at the 717,814 selected SNP markers; in .rds format readable in R) are available from figShare. Supplemental material available at Figshare: https://doi.org/10.25387/g3.7464863.

Population structure in the sample
Population-level differentiation: Seven population clusters were inferred from the ADMIXTURE software ( Figure S1; Alexander et al. 2009). These clusters corresponded roughly to populations L4X-NE, L4X-S, Liberty-C2 and U4X-N, WS4U-C2, U8X-E, U8X-W. One population with little representation in our sample, U8X-S, appeared to be of mixed origin (Figure 1a). The other populations generally displayed a low level of admixture, with relatively few individuals having intermediate admixture coefficients. There seemed to be some admixture involving upland populations (WS4U-C2 and U4X-N, WS4U-C2 and U8X-W, U8X-E and U8X-W), with even some shared ancestry between WS4U-C2 and U4X-N. The PCA confirmed that population structure was relatively discrete (Figure 1b). Expectedly, the first PC separated genotypes by ecotype while the second PC reflected geographical origin within the lowland ecotype (Lu et al. 2013, Evans et al. 2015. The third and four PCs discriminated upland genotypes by geographical origin and ploidy level, and distinguished L4X-S from the two other lowland populations (L4X-NE and Liberty-C2).
Differences in mean and range among populations were quite typical of previously reported differences between ecotypes (Table 1; Casler 2012). Indeed, L4X-S and Liberty-C2 (populations of lowland origin) had high mean values and range values for PH, HD and St, compared to upland populations (excluding U8X-S). However, L4X-NE stood out as a lowland population for being relatively short, early-flowering, and prone to lodging, with corresponding values for PH, HD, and St more similar to those of the upland populations.
Recent relationships in the sample: Here, marginal genomic relationships were defined as the elements of G ¼ _ X _ X9=n, with _ X consisting of centered marker variables, and n being some scaling factor. The strong and quite discrete population structure in the sample translated into multimodal marginal genomic relationship coefficients, with the multiple peaks in off-diagonal elements of G reflecting differentiation of population with respect to allele frequencies ( Figure S2a). Conditioning relationships on population structure (as depicted by the first four PCs of _ X) yielded the matrix G P , with G P ¼ G 2 PP 0 and P reflecting structure in G due to population-level variation (Fan et al. 2013). The conditional genomic relationships seemed sparser, in the sense that they appeared to cluster around zero, so most individuals could be assumed to be unrelated after accounting for population structure in the sample ( Figure S2b). Conditional relationships in G P were particularly relevant in this study, since among-population variation, captured by PP 0 , contributed little to variation within any given TS. Indeed, any TS generally consisted of selection candidates from a relatively homogeneous target sample (made of individuals from WS4U-C2, Liberty-C2, U4X-N or L4X-NE), where variation with respect to P was minimal. Graphs of recent relationships, inferred by the graphical LASSO, were rather dense, with average degrees (number of neighbors by node/individual in the graph) ranging from 217 to 458 ( Figure 2). However, some noticeable features of populations emerged from the inferred graphs ( Figure 2): WS4U-C2, U4X-N and U8X-E appeared quite connected to one another; U8X-W also showed some connection with other upland populations but seemed more distinct, as reflected by a relatively lower average degree ( Figure S3); Liberty-C2 and L4X-S were somewhat connected to both upland and lowland populations, which certainly explains why their individual degrees were generally high ( Figure S3); most notably, L4X-NE displayed an outstandingly low level of connection with the other populations, which translated in a clear separation of this population in the graph, after placing the nodes based on a force-directed algorithm (Fruchterman and Reingold 1991). These features exemplify the usefulness of conditional relationships and their associated graphs for describing relationships among individuals.

Impact of mean structure on prediction accuracy
Prior to assessing different prediction procedures accommodating population heterogeneity, models were compared for different fixed-effect specifications used to characterize population structure (mean structures of models). Mean structures were tested for prediction accuracy under GBLUP, in which the whole sample, excluding the TS, was used to fit a standard GBLUP model. In wholesample calibration (where the target population was included in the CS), the various mean structures assayed differed marginally with respect to their prediction accuracy. There were improvements over Intercept (only intercept as fixed effect) by mean structures which explicitly captured population structure, i.e., PCA (intercept and effects of PCs), Panel (effects of panels AP and BP) and Group (effects of putative population groups). However, those were small, inconsistent and moderately significant (hereafter "moderately significant" refers to P # 0:05 based on the liberal "naïve" t-test) (Table 2a). Conversely, in cross-population calibration (where the target population was excluded from the CS), fixed effects explicitly depicting population structure resulted in highly significant decreases in prediction accuracy compared to Intercept (hereafter "highly significant" refers to P # 0:05 based on the conservative t-test adapted from Nadeau and Bengio 2003; see Material and methods for details). In particular, prediction accuracy was substantially lower with PCA for L4X-NE (HD, St), as well as with Group for U4X-N (PH, HD), L4X-NE (PH, HD) and Liberty-C2. Mean structure Panel was not as sensitive to cross-population calibration compared to PCA and Group, and even showed one highly significant increase in prediction accuracy compared to Intercept, for Liberty-C2 (PH). But it also showed decreases in prediction accuracy, with U4X-N (PH) and L4X-NE (HD), which were stronger and highly significant (Table 2b). Interestingly, the deterioration of prediction accuracy in cross-population calibration by PCA and Group may be due to different factors. Indeed, while PCA may fail to properly extrapolate effects of PCs outside the set of populations represented in the CS, Group would fail to capture any difference due to population differentiation in the TS (since all individuals in the TS belong to the same unobserved population group). Due to the relative stability in performance of Intercept, hereafter we chose to focus on this mean structure when comparing prediction procedures.

Impact of prediction procedures on prediction accuracy
Target-population model: For prediction in a given TS, the targetpopulation model (GBLUP-Target) consisted in restricting the CS to the subset of the sample belonging to the same population as the TS. Compared to GBLUP, the target-population model yielded decreases in prediction accuracy which appeared moderately significant for PH (WS4U-C2, U4X-N) and St (Liberty-C2) ( Table 3a, Table S1). However, prediction accuracy for St (WS4U-C2) was higher, with a moderately significant difference. More intriguing is the consistent increase in prediction accuracy with L4X-NE, with differences being small yet highly significant for PH and HD, and moderately significant for St. It is unclear whether these differences are due to the consistently higher accuracies achieved with L4X-NE (in GBLUP-Target) compared to other populations, or a result of L4X-NE being relatively underconnected to the other populations in the sample (Figure 2, Figure  S3). Both factors could very well contribute to the observed decreases in accuracy when incorporating information from the whole sample.

Multi-population models and marker-by-population interactions:
The inferred mixing parameter r from the MPM-Mixture model was null (or close to null), low and intermediate, for PH, St and HD respectively, with estimations being quite consistent over CV replicates (Table 4). The improvement in fit, relatively to GBLUP, was nonsignificant for PH, rather significant (P , 0.05) for St, and strongly significant (P , 0.001) for HD (Table 4). In MPM-Matérn, the inferred correlation functions differed substantially across traits (Table  4), while being quite consistent over CV replicates in whole-sample calibration and across validation schemes, with similar shapes of the correlation function k n;h in whole-sample calibration and crosspopulation calibration (Figure 3): k n;h roughly resembled an exponential kernel with PH and HD, and was more similar to a Gaussian Figure 2 Inferred graphs of relationships, conditional on population structure Each graph represents the relationships as depicted by the graphical LASSO applied to the whole sample of individuals. The parameter l represents the degree of regularization on conditional relationships, fitted by maximum restricted likelihood for each trait, in a GBLUP model based on regularized relationships (Appendix 1). Nodes (individuals) were positioned using the force-directed placement algorithm of Fruchterman and Reingold (1991), as implemented in function ggnet (R package GGally), so aggregation of nodes reflects connectedness.
kernel with St, for which a "shoulder" maintained high correlation in marker effects for individuals that were relatively close to each other, based on their PCs. Remarkably, the shapes of inferred correlation functions were quite consistent in cross-population calibration, despite entire populations being left out from one CS to another (Figure 3b). Inferences regarding among-population correlations (V n ) in MPM-Matérn were weakly significant for PH and St, with p-values close to 0.05; in contrast, inferences regarding V n for HD were strongly significant, with P , 0.001 (Table 4). Interestingly, distances based on PCs may be equivalent to distances based on allele frequencies. Specifically, p i 2 p j 2 ¼ 2 p Pi 2 p Pj 2 , where p Pi (p Pj ) is the m-vector of individual-specific allele frequencies of individual i (j) as described by Conomos et al. (2016), with population structure described by ½ 1 n P (Appendix 3). Therefore, the significant relationship between PC-based distances and correlations in marker effects (depicted by V n ) for HD in MPM-Matérn indicates that marker effects for this trait were highly sensitive to variation in allele frequencies across genetic backgrounds.
In whole-sample calibration, the performance of MPM-Mixture was very similar to that of GBLUP, with differences in accuracy ranging from -0.018 to +0.009 (Table 3a). Quite surprisingly, MPM-Mixture displayed slightly deteriorated accuracies for HD (with the exception of U4X-N), despite the strongly significant improvement in fit for this trait. In contrast, MPM-Matérn yielded larger differences in accuracy, ranging from -0.019 to +0.060 in whole-sample calibration (Table 3a). With the two upland target populations (WS4U-C2 and U4X-N), noteworthy increases in prediction accuracy (+0.060 and +0.030 respectively) were observed for HD. But with the two other target populations (Liberty-C2 and L4X-NE), smaller differences in accuracy (-0.008 and +0.007 respectively) were observed for HD.
In cross-population calibration, MPM-Mixture showed more differences in accuracy compared to GBLUP, with differences in accuracy ranging from -0.150 to 0.032 (Table 3b). Again, MPM-Mixture displayed deteriorated accuracy for HD despite a strong improvement in fit, with a dramatic decrease by 0.15 for U4X-N. Such decrease in accuracy may be due to the lack of flexibility of MPM-Mixture in depicting among-population resemblance, as it only fits one correlation coefficient for all pairs of populations. In cross-population calibration, MPM-Matérn also resulted in large differences in accuracy compared to GBLUP, ranging from -0.304 to 0.147. The dramatic decreases in prediction accuracy with PH (-0.304 and -0.250 for WS4U-C2 and L4X-NE, respectively) could be explained by the relatively weak improvement in model fit from GBLUP to MPM-Matérn (Table 4). Interestingly, large and significant improvements in prediction accuracy were observed with HD, similarly to the results from whole-sample calibration, with nonetheless more dramatic increases in accuracy. While MPM-Mixture simply estimates a general coefficient for among-population resemblance, based on the CS, MPM-Matérn may be more suitable for extrapolation to unobserved population backgrounds, as it estimates the resemblance between any two populations as a function of their specific properties (here, PC coordinates). Consistently, the relative improvement in accuracy from GBLUP to MPM-Matérn seemed more predictable based on the relative improvement in model fit. Specifically, a decrease in Bayesian information criterion (BIC) seemed to discriminate cases where an improvement in accuracy could be achieved by MPM-Matérn, especially in crosspopulation calibration where a correct depiction of population heterogeneity seemed more critical ( Figure S4). : P # 0.05 in t-test corrected for overlap in testing sets as in Nadeau and Bengio (2003) (conservative). Underlined values correspond to the highest prediction accuracy for each validation scheme, trait and population.

Conclusions
The present study assessed different mean structures to represent population differentiation and evaluated various procedures to accommodate population heterogeneity in diverse samples, with an application in switchgrass. We considered different approaches to reflect population structure, i.e., characterizing it implicitly by random marker effects, using only an intercept as fixed effect (Intercept), or characterize it explicitly, by continuous differentiation (PCA) or discrete effects at the level of panels (Panel) or putative population groups (Group). Furthermore, we employed three typical strategies for dealing with marker-by-population interactions, i.e., ignoring (GBLUP), reducing (GBLUP-Target), or modeling (MPM) the source of heterogeneity in the data. Our assessment of mean structures points to a simple fixed-effect specification being preferable in genomic prediction analyses, since accuracies from Intercept were relatively high across populations and traits, and relatively stable across validation schemes (whole-sample calibration or cross-population calibration). These conclusions are consistent with those of Phocas and Laloë (2004), who showed larger prediction errors in cattle when including putative genetic groups as fixed effects (comparable to Group and Panel). Notably, deteriorations of prediction accuracies from PCA and Group were especially large in cross-population calibration in which entire populations were excluded from the CS. Moreover, these were often noted for L4X-NE which was underconnected to other populations in the sample (Figure 2). Decreases in accuracy with PCA suggest that linear fixed effects capturing population structure may fail to properly extrapolate on unobserved populations whose genetics may differ markedly from other populations in the sample (Figure 1b). However, it is worth noting that the switchgrass sample under study was highly structured. Samples in other species, e.g., in maize or rice, may not display such discrete population differentiation and therefore may not suffer as much from fixed effects at population level (Guo et al. 2014).
In whole-sample calibration, GBLUP often seemed robust to population heterogeneity, regarding prediction accuracy (Table 3a). This robustness was certainly due to the ability of GBLUP models to combine information from individuals according to the specified relationship matrix, by transferring information preferentially from the more related individuals (Searle et al. 2009, Habier et al. 2013. Furthermore, GBLUP models were probably all the more robust as marker density was high, such that genomic relationships were accurately estimated Berger 2002, Endelman andJannink 2012). However, some decreases in prediction accuracy compared to GBLUP-Target suggest that robustness of GBLUP may have been affected by other factors. Such factors may be related to relationships within the sample, i.e., under-connectedness of some populations with others (Figure 2), or differences in accuracy of the prediction model across populations, as reflected by GBLUP-Target being more accurate in certain populations (Table 3a).
In whole-sample calibration, prediction was mostly determined by individuals in the CS belonging to the same population as the TS. Consistently, MPM procedures, which shrink relationships involving individuals from distantly-related populations, did not dramatically affect prediction accuracy (Table 3a). However, in cross-population calibration, decreasing the contribution of individuals from distantly-related populations must have been more pertinent, so that there were more opportunities for improvement of prediction accuracy by MPM procedures. In this context, MPM-Matérn proved more useful than MPM-Mixture, especially with HD for which MPM-Matérn resulted in a dramatic improvement of fit (Table  3b). Importantly, this relative superiority may be due to the fact that MPM-Matérn extrapolated correlations between populations, through leading PCs, while MPM-Mixture merely interpolated such correlations, by estimating a common coefficient of correlation across populations. This lack of flexibility, and the subsequent inability to extrapolate to unobserved populations, must have resulted in high sensitivity of MPM-Mixture to the composition of the CS, making it particularly inappropriate in a cross-population context.
Marker-by-population interactions captured by MPM-Mixture and MPM-Matérn were presumably not confounded by marker-byenvironment interactions, since interactions between panel and markers were not significant (P . 0.25 in a model, similar to MPM-Mixture, which depicted correlation in marker effects between BP, assayed in n Table 3 Average prediction accuracy by prediction procedure WI, and AP, assayed in NY; Figure S5). Therefore, models analyzed in this study would reflect actual differences in genetic bases across populations. Moreover, for every trait, genomic variability (variance of marker effects) would be similar across panels. Indeed, the non-significant improvement in fit from an extension of a GBLUP model where genomic variance can vary by panel (P $ 0.18), suggested limited differences in variance of marker effects by panel ( Figure S6). This result further implied that estimation of genotype effects and half-sib family effects (in AP and BP, respectively) and scaling of half-sib family effects (multiplied by two, so they corresponded to breeding values) were effective to ensure concordance in genomic variability across panels.
Here, we modeled interactions between markers and population structure through products of relationships at markers, which were linear (XX9), and relationships about population structure, which were linear in MPM-Mixture and nonlinear in MPM-Matérn (V n ; Appendix 3). Using kernel functions to estimate relationships dispensed us from fitting effects of many variables, by estimating instead n breeding values directly from relationships. A similar strategy was adopted quite recently by Jarquín et al. (2014), who modeled genotype-by-environment interactions for genomic prediction, through products of linear kernels at markers and linear kernels at environmental covariates. As noted by these authors, such decomposition with respect to interactions had been introduced in quantitative genetics much earlier, by Kempthorne (1954) and Cockerham (1954) for depicting epistatic effects, based on expected relationships under an infinitesimal model. Importantly, relationships about population structure not only allowed us to efficiently specify a genome-wide marker-by-population interaction model, but they also enabled the use of nonlinear kernels at the population level (with V n produced by nonlinear functions in MPM-Matérn). Matérn kernels, introduced to genomic prediction research by Ober et al. (2011), were used here to estimate covariance among individuals in a flexible yet parsimonious way (Figure 3), while still using simple linear kernels for depicting within-population variability (by XX9). Our results exemplify the potential usefulness of parsimonious multi-population models, which are all the more interesting that they can be applied on samples comprising many populations. In contrast, typical multi-trait models would be computationally intractable or statistically inefficient here, since those would rely on one parameter for each population pair to model correlations among populations in V n (e.g., 21 parameters for K ¼ 7 population clusters). As a matter of fact, multivariate genomic BLUP models fitted by ASREML-R to estimate such correlations among putative population groups (WS4U-C2, Liberty-C2, U4X-N, U8X-W +U8X-S, U8X-E, L4X-NE and L4X-S) failed to converge.

Improvement of procedures
Our results suggest that a very high increase in quality of fit, as was observed for HD with MPM-Matérn, may allow for an increase in accuracy, especially in a cross-population context. In the analysis of n Table 4 Multi-population model fit: parameter estimates, likelihood-ratio test statistic and p-value, by trait and procedure In parentheses: range of values for every one of the four target populations omitted, in cross-population calibration. Trait: plant height (PH), heading date (HD) or standability (St). MPM: multi-population model with among-population correlations based on admixture coefficients (MPM-Mixture; r: mixture parameter) or PC distances (MPM-Matérn; n: shape parameter; h Ã ¼ h=D max , with h the scale parameter and D max the maximum distance observed over pairs of individuals). LRT (likelihood-ratio test) statistic: 22logðL 0 =L 1 Þ where L 0 and L 1 are the restricted maximum likelihoods of GBLUP and one of the MPM models, respectively; p-values were obtained from a x 2 -ditribution with one (MPM-Mixture) or two (MPM-Matérn) degrees of freedom. Heslot and Jannink (2015) across various multi-population contexts, there seemed to be a positive relationship between differences in quality of fit, as measured by the Akaike information criterion (AIC), and differences in prediction accuracy. Although this relationship was quite loose, it could be noted that for very high increases in AIC ( $ 30), gains in accuracy were null to high, similarly to the situation of MPM-Matérn with HD, for which increases in AIC varied from 28.68 to 42.88, across CV replicates in whole-sample calibration, and from 24.57 to 36.25 in cross-population calibration. Therefore, stringent thresholds on AIC increases could probably be used in MPM to avoid relative decreases in accuracy. In this study, one criterion more conservative than the AIC, the BIC, could discriminate cases where prediction accuracy was improved by MPM-Matérn, compared to GBLUP ( Figure  S4). Therefore, a possible improvement of MPM procedures could simply come from model selection as an integral part of the fitting process, based for example on the BIC. The BIC differences relative to GBLUP were almost always negative for PH and St in MPM. For these two traits, differences in prediction accuracy from GBLUP to MPM were quite inconsistent, especially with MPM-Matérn, so model selection could probably have made MPM procedures more robust. However, such conclusions are based on a restricted set of populations and genetic architectures. So future studies on other datasets would certainly be necessary to test this post hoc hypothesis and determine whether criteria such as the BIC can indicate cases where MPM-Matérn should be used instead of GBLUP.
Another way of potentially improving MPM procedures would be to use other types of kernels than those used here. For example, one may use linear kernels based on population-level covariates (e.g., PCs) in place of AA 0 in MPM-Mixture, hence taking an approach similar to that of Jarquín et al. (2014). Besides, modeling resemblance among population clusters in MPM-Mixture, by AQ K A 0 in place of AA 0 (where Q K captures similarity based on metrics at the population level), could be useful to increase quality of fit, and possibly prediction accuracy. Finally, an interesting way of extending the MPM procedures described here would be to incorporate more information at the population level. Here in MPM, population homogeneity was captured through admixture coefficients (MPM-Mixture) or differences in PC coordinates (MPM-Matérn), the latter reflecting differences in allele frequencies (Appendix 3). However, marker-bypopulation interactions may also be due to differences in LD patterns (Wientjes et al. 2016). Therefore metrics depicting such differences could be particularly appropriate for capturing population heterogeneity. Further research would be necessary to determine the type of metrics to use for reflecting differences in LD patterns, and the appropriate way to parsimoniously combine the different types of information regarding population differentiation in MPM. Interestingly, geographical distance may succinctly depict population differentiation, due to differences in allele frequencies and/or differences in LD patterns. Fitting population-level correlations in V n as a function of distance of origin would then be particularly useful in species under strong geographical structure, which include switchgrass (Grabowski et al. 2014), but also human (Coop et al. 2009), as was clearly shown in samples from Europe (Novembre et al. 2008), Africa (Bryc et al. 2010) and Latin America (Ruiz-Linares et al. 2014). Models such as MPM-Matérn, which are parsimonious yet flexible in the shape of the fitted correlation function (Figure 3), are promising in various applications on diverse samples, in prediction studies, but also in inferential studies aiming at characterizing the basis for population differentiation. In this study, marker data were based on exome capture sequencing, which targets a selected subset of exons for sequencing and subsequent SNP calling (Hirsch et al. 2014. The potential lack of representation of causal variants by our assay may have resulted in loss of prediction accuracy. While total lack of representation of some genomic regions imposes a limit on prediction accuracy achievable by our procedures, the relative overrepresentation of some genomic regions could be, to some extent, alleviated by genomic relationships which account for correlation among markers and differential degrees of tagging of loci in the marker data (Speed et al. 2012, Ramstein et al. 2016, Wang et al. 2017. Another limitation in our study is the assumed homogeneity of genetic and residual variances across populations. Here we focused on parsimonious models estimating genetic correlations (not covariances) between populations. Extending MPM models to capture variance heterogeneity across populations and/or environments would certainly deserve further investigation. Such models ought to fit functions of variance over genetic and/or environmental variables, similarly to Ou et al. (2015) who reported improvements in fit and marginal gains in prediction accuracy in swine, by modeling residual variance over sexes and slaughter dates. Indeed, a decisive advantage of models like MPM-Matérn (for correlations) and those of Ou et al. (2015) (for variances) is their ability to extrapolate population covariances to genotypes from unobserved population backgrounds.

Applications and prospects
Based on our case study, we would recommend using MPM whenever a strong improvement in model fit is achieved. Otherwise GBLUP would be the method of choice, since it is often robust enough to perform at least as well as GBLUP-Target. However, fitting a GBLUP model to a CS restricted to the target population may be preferred when making predictions on "outlier populations" such as L4X-NE, which are under-connected to other populations and are characterized by relatively high prediction accuracy in a single-population context. Nevertheless, more empirical studies on population heterogeneity should follow to support the conclusions from our specific application. Such studies could apply to various contexts: in particular, predictions on diverse samples and dynamic breeding programs. The former includes analyses similar to our case study as well as analyses on more complex data, such as historical datasets, in which not only population heterogeneity but also genotype-by-environment interactions must be taken into account (Dawson et al. 2013, Rutkoski et al. 2015. The latter involves selection across multiple breeding generations, which might not necessarily suffer from strong population heterogeneity (Sallam et al. 2015, Auinger et al. 2016) but could nonetheless benefit from robust multi-population models for potential increase in persistency of accuracy over generations (Habier et al. 2007). In the context of diverse samples or dynamic breeding, simulation studies could also be useful for assessing the suitability of procedures to accommodate population heterogeneity. Differences in allele frequencies and differential LD patterns could be simulated by various genealogies, as was done for example by de Roos et al. (2009). Additionally, dependency of marker effects on allele frequencies could be simulated through underlying non-additive genetic effects, as well as allele fixation in specific populations. Indeed, marginal additive marker effects, as captured by linear models such as standard GBLUP, depend on allele frequencies at the loci with which they interact (Hill et al. 2008, Mäki-Tanila and Hill 2014, Hill and Mäki-Tanila 2015. Therefore, dominance and epistatic effects could be simulated to generate dependency of marker effects on allele frequencies, which may then be captured by methods such as MPM-Matérn (Appendix 3). Though investigations based on simulations would be complex and, to some extent, arbitrary by their choice of genealogies and genetic architectures, they would provide useful frameworks for assessing procedures, such as those presented in this study, in contexts of population heterogeneity.

ACKNOWLEDGMENTS
We are grateful to Jeremy Schmutz of the Department of Energy Joint Genome Institute and Hudson Alpha for his work on the switchgrass genome. We are also grateful to Nick Baker and Joseph Halinar, USDA-ARS, Madison, WI for assistance with field operations and data collection. This research was funded in part by the following agencies and organizations: the U.S. Department of Energy Great Lakes Bioenergy Research Center, DOE Office of Science BER DE-FC02-07ER64494 (laboratory operations, genotyping, and bioinformatics), the U.S. Department of Energy Joint Genome Institute supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 (sequencing), Agriculture and Food Research Initiative Competitive Grant No. 2011-68005-30411 from the USDA National Institute of Food and Agriculture (CenUSA; field operations and phenotypic measurements), USDA-ARS Congressionally allocated funds (field operations, technical support, and logistics), and the University of Wisconsin Agricultural Research Stations (field operations). Mention of commercial products and organizations in this manuscript is solely to provide specific information. The USDA is an equal opportunity provider and employer. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

CONFLICT OF INTEREST
The authors declare that they have no conflict of interest.

APPENDIX 1
Inference of recent relationships by the graphical LASSO LetG P be a regularized form of G P ¼ G 2 PP 0 , with G the genomic relationship matrix as defined by VanRaden (2008), and P the matrix of d ¼ 4 PCs. We applied the graphical LASSO to G P to infer a sparse matrixG P 21 , which yielded a graph of relationships among individuals. Indeed, a zero ij-element inG P 21 indicates that individuals i and j are unrelated conditionally on all other individuals, which corresponds to no edge between individuals i and j in the underlying graph of recent relationships.
For a given sample covariance matrix Σ, the graphical LASSO infers a sparse precision matrix Σ 21 by maximizing the Gaussian likelihood of the data (as represented by Σ), penalized by a L 1 -norm penalty lkΣ 21 k 1 , where l is the regularization parameter and kΣ 21 k 1 is the sum of absolute values in Σ 21 (Friedman et al. 2008).
In this study, regularization of G P was performed as follows: 1. Standardizing G P to obtain the corresponding correlation matrix G P : G P ¼ diagðG P Þ 21=2 G P diagðG P Þ 21=2 2. Applying the graphical LASSO algorithm to G P , to obtain the regularized correlation matrixG P 3. Rescaling G P to obtainG P ¼ diagðG P Þ 1=2G P diagðG P Þ 1=2 The graphical LASSO algorithm was run using the R package huge (Zhao et al. 2012). The regularization parameter l was chosen to maximize the restricted likelihood for the following model: where y is the n-vector of genotype means at a given trait;G ¼ PP 0 þG P , depending on l, consists of regularized relationships; s 2 u is the variance of breeding values (here not equal to the variance of marker effects due to scaling and regularization onG); 1 n m is the n-vector of intercept values; I n s 2 e is the covariance matrix for errors considered independent and identically distributed. Here we chose the value of l for which the corresponding matrixG resulted in the highest restricted likelihood for the aforementioned model fitted to the whole sample. Possible values of l were the q-quantiles of absolute values in G P , with q varying from 0.05 to 1 by step of 0.05. The restricted likelihood as a function of l depended on y, so optimization of l was performed for each trait separately.

APPENDIX 2
Multi-population GBLUP models for heterogeneous calibration sets In this section, 1 t , I t , and J t refer to the vector of ones, identity matrix, and matrix of ones, respectively, of dimensions t, t · t and t · t (where t is specified). Consider the following model for population-specific marker effects with respect to K populations and n genotypes: where 5 indicates the Kronecker product; Q ¼ ð1 K 5QÞ is the Kn · p design matrix for the p-vector a of fixed effects; X ¼ ðI K 5XÞ is the Kn · Km marker-data matrix for the Km-vector b of marker effects at each of the K populations, with variance ðV K 5I m Þs 2 b . The matrix V K reflects covariances in marker effects between populations. The Kn-vector y, containing the phenotypic values for the n genotypes at each of the K populations, is hypothetical (and ill-defined from a practical standpoint), since genotypes typically do not belong to more than one population. The Kn-vector of residuals e, with unspecified variance R ¼ ðI K 5RÞ, is assumed to be uncorrelated to marker effects b. Let u ¼ X b be the Kn-vector of additive genetic effects at each of the K populations, as a linear combination of a normally-distributed vector ( b), u follows a normal distribution with expectation and variance as follows : So a multi-population model for breeding values that is equivalent to the model described above, by identical mean and variance structures, is as follows: