## Abstract

Genomic-enabled prediction in plant and animal breeding has become an active area of research. Many prediction models address the collinearity that arises when the number (*p*) of molecular markers (e.g. single-nucleotide polymorphisms) is larger than the sample size (*n*). Here we propose four Bayesian approaches to the problem based on commonly used data reduction methods. Specifically, we use a Gaussian linear model for an orthogonal transformation of both the observed data and the matrix of molecular markers. Because shrinkage of estimates is affected by the prior variance of transformed effects, we propose four structures of the prior variance as a way of potentially increasing the prediction accuracy of the models fitted. To evaluate our methods, maize and wheat data previously used with standard Bayesian regression models were employed for measuring prediction accuracy using the proposed models. Results indicate that, for the maize and wheat data sets, our Bayesian models yielded, on average, a prediction accuracy that is 3% greater than that of standard Bayesian regression models, with less computational effort.

- Bayesian regression
- shrinkage
- prior distribution
- inverse regression
- GenPred
- shared data resources
- genomic selection

Predicting complex traits possibly affected by large numbers of genes (each one having a small effect) that are greatly affected by the environment is a difficult task. Information from dense molecular markers attempts to exploit linkage disequilibrium between at least one marker and at least one putatively causal locus, so as to predict the genetic values of individuals based on their phenotypic data.

There is a vast literature describing methods that use different functions of markers for predicting genetic values (*e.g.*, de los Campos *et al.* 2012; Gianola 2013), starting with the seminal work of Meuwissen *et al.* (2001), in which they propose genome-based prediction functions that regress phenotypes on all available markers by using a Gaussian linear model with three different prior distributions on marker effects. Several regularized regression models, such as ridge regression (Hoerl and Kennard 1970), the Least Absolute Shrinkage and Selection Operator (*i.e.*, LASSO) (Tibshirani 1996), and its Bayesian counterpart (Park and Casella 2008; de los Campos *e al*. 2010), have been described and used for genomic-based prediction in animals and plants (de los Campos *et al.* 2009, 2010; Crossa *et al.* 2010, 2011; Gónzalez-Camacho *et al.* 2012; Heslot *et al.* 2012; Pérez-Rodríguez *et al.* 2010, 2012).

The basic quantitative genetic model describes the *i*^{th} response or phenotype () as the sum of an unknown genetic value () plus a residual expressed as a deviation from some general mean (μ); thus, the basic model is (). The unknown genetic value can be represented as a function of genotypes with a large number of genes that may involve all gene × gene interactions, if there are any. Because the genes affecting a trait are unknown, this complex function must be approximated by, for example, a regression of phenotype on marker genotypes. Large numbers of bi-allellic markers on ( is the number of copies of one of the two alleles observed in the *i ^{th}* individual at the

*j*marker) may be used in a regression function for predicting the genetic value of the

^{th}*i*

^{th}individual. The regression can be formulated as such that the basic genomic model becomes , where is a model error that may include errors due to unspecified environmental effects, imperfect linkage disequilibrium between markers and the actual loci affecting the trait, and unaccounted for gene × gene and gene × environment interactions. In several applications, is a parametric linear regression of the form , where is the substitution effect of the allele coded as ‘one’ at the

*j*marker.

^{th}The linear regression function on markers becomes , or, in matrix notation, (1)where ** X** is the

*n*×

*p*incidence matrix,

**is the vector of unknown marker effects, and**

*β***is an**

*ε**n*×1 vector of random errors, typically assumed to follow the normal distribution where is the random error variance component. If the vector of marker effects is assumed random with normal distribution , where is the variance of marker effects, then the genetic value of the individuals is with a variance-covariance matrix proportional to the genomic relationship matrix (where is the frequency of allele “1” often estimated from the data at hand); this random effect leads to the genomic best linear unbiased predictor (Vanraden, 2007, 2008), which is extensively used in genomic-assisted prediction. Note that the Bayesian ridge regression (BRR) is equivalent to the genomic best linear unbiased predictor when the distribution of single-nucleotide polymorphism (SNP) effects is regarded as Gaussian with mean

**0**and variance (Pérez-Rodríguez

*et al.*2012).

The regression function also can be represented by semiparametric approaches, such as reproducing Kernel Hilbert space regressions or different types of neural networks (Gianola *et al.* 2006, 2011; Gianola and Van Kaam 2008; de los Campos *et al.* 2010; González-Camacho *et al.* 2012; Pérez-Rodríguez *et al.* 2012). de los Campos *et al.* (2012) reviewed penalized linear regression models and Bayesian shrinkage estimation methods.

The basic linear model (1) is generally overparameterized and therefore ill-conditioned because there are many more predictors (markers) than individuals with observations (*p* > >*n*), as well as strong collinearity among predictors due to linkage disequilibrium; hence ** X** is not a full column rank. These challenges can be overcome using what is called the inverse problem approach (Aster

*et al.*2005; Tarantola 2005). The discrete inverse problem approach for matrices of incomplete rank or that are ill-conditioned due to possible collinearity is discussed at length by Aster

*et al.*(2005), whereas Tarantola (2005) addresses the continuous version of the same problem. Cavalier (2008) summarized the inverse problem as a nonparametric regression, , of a transformed data vector

**on the linear operator with as a vector of regression parameters and as a vector of Gaussian errors. Within a Bayesian framework, Knapick**

*d**et al.*(2012) proposed a solution by using the singular value decomposition of

**such that the decay of singular values is mimicked in the prior distribution of**

*T**θ*. Thus, estimating the unknown transformed marker effects of

*θ*from this perspective seems of interest but has not been attempted in genome-enabled prediction. The inferential solution to the inverse problem is based on the posterior distribution of

*θ*, with the variability of and prior beliefs about

*θ*conveyed through probabilistic models.

In the prediction context, de los Campos *et al.* (2010) used the fact that the matrix ** G** (or

**) may be eigen-decomposed as , with**

*X***containing the eigenvalues of**

*D***, and**

*G***being its corresponding eigenvectors. This transformation allows reducing the original highly dimensional data to fewer dimensions by extracting most of the signals in the “first” components of the decomposition and relegating the noise to the “last” components. As de los Campos**

*U**et al.*(2010) demonstrated, the regression of phenotypes on all markers is equivalent to the regression of phenotypes on a set of rotated covariates. Implementing regression methods with generally requires either shrinkage estimation or reducing the dimensionality of the data, which is the approach considered here. Recently, Gianola (2013) studied the influence of the prior on the posterior inference of marker effects in the overparameterized models used in genome-enabled prediction. He concluded that a main driving force in the various Bayesian models (Bayesian alphabet) is the prior and not the data; thus different priors will produce different results mainly because their shrinkage effect varies.

In this work, we propose using the parametric model (1) within the framework of inverse problems and a Bayesian approach to predict genetic values of individuals when *p* > >*n*. This proposal is similar to that introduced by de los Campos *et al.* (2010) for dimension reduction. However, ours has several different features, including an orthogonal transformation of the observed data (*y**)*, as well as differential shrinkage as a result of the prior variance assigned to the regression parameters.

## Materials and Methods

### Statistical methods

In model (1), when ** X** is not of full rank, the least squares solution for the unknown

**is neither unique nor stable. These ill-conditioned problems may be analyzed within the framework of an inverse problem theory by using the singular value decomposition of**

*β***of order**

*X*,*n*×

*p*(Aster

*et al.*2005), as where

**and**

*U***are orthonormal matrices of orders**

*V**n*×

*n*and

*p*×

*p*, respectively, and

**is a rectangular**

*S**n*×

*p*matrix containing

*n*non-negative singular values ordered from largest to smallest, . One can write , where is the matrix with singular values along its diagonal, and is a matrix of order

*n*×(

*p-n*) and all its entries are null. Consider the linear transformation on both sides of (1):and let , and . Since , the model (1) becomes: (2)In (2),

**and are vectors each of order**

*d**n*×1 and

**is a vector of order**

*b**p*×1. The column vector

**is partitioned as where is an**

*b**n*×1, and has order (

*p-n*)×1. Therefore, equation (2) becomes:because is zero for any value of . Here only the first

*n*entries of

**can be inferred.**

*b*Since , it follows that . The distribution of the transformed data ** d**, given

**and iswhere is the**

*b**i-th*element of

**d**and has the form) (3)since . To recover the original parameter

**, the matrix**

*β***V**is partitioned into , where contains the first

*n*columns of

**, and contains the remaining , so that . The**

*V**p*original parameters are recovered as whereas does not contribute to fitting the data, that is: (4)where is the

*i-th*column of . Hence (5)and, for example, independent Gaussian priors can be assigned to each of the elements of . Obviously, is an unknown quantity, so a prior must be assigned to this vector. However, as shown previously and in Gianola (2013), the likelihood function does not depend on , so the data do not provide information about . Here, the marginal posterior of is the likelihood times its prior, and the marginal posterior of is equal to its prior. Thus, if the Dirac delta function is assigned as a prior to each element of , we get (4) as the quantity of interest and can safely leave out of the analysis. Thus , and will herein after be denoted as

**,**

*b***and**

*V***, respectively, without loss of generality.**

*S*The ordinary least squares estimator of ** β** found using the generalized Moore-Penrose inverse method is (6)Thus, depends on the ratios between transformed data points and the decaying singular values, which could be interpreted as representing a decreasing signal-noise ratio sequence. So, as singular values decay, there is an increase in the noise component of the least squares estimate of component

**. This signal-noise ratio is inversely proportional to the singular values, as is clearly shown by the expression: (7)The least squares estimator of can be computed from expression (7) as and serves as a basis for regularizing estimation through truncation or weighting.**

*β*### A Bayesian inverse regression model

Several methods have been proposed for mitigating ill-conditioned regression problems by shrinking the solution space via restricting the magnitudes of the estimates and their variance. One of the first proposals was the ridge regression estimator (Hoerl and Kennard 1970). Since model (3) has the form , with , the ridge regression estimator is:(8)with ** Γ** as a diagonal matrix of dimensions

*n*×

*n*with values on the diagonal. Here is a vector of parameters that reduce ill-conditioning. If their magnitude increases excessively, this can lead to a poor fit of the model to the data. Therefore, the choice of vector

**is critical.**

*γ*Hoerl and Kennard (1970) showed that there is a range of ** γ** values where the mean squared error (MSE) of estimates is smaller than the corresponding MSE of the ordinary least squares solution. They minimized MSE by choosing (9)This requires knowing the true values of

**. However, (9) can be used to justify the choice (10)where , , and are estimates of**

*b***, ,, respectively, thus providing a single (global) shrinkage parameter.**

*b*In the Bayesian approach, the prior distribution reflects prior beliefs about , and its variance affects the extent to which the least squares estimate moves toward the prior mean. To construct a model along the lines of de los Campos *et al.* (2012), we adopted as the conditional prior distribution of (11)where the coefficients (given ) are conditionally independent. Therefore, the joint posterior distribution of ** b** and in model (3) is given by:(12)where (

*i = 1,…,n*) are variance parameters and is the prior distribution of . Usually the conjugate prior for is a scaled inverse χ

^{2}distribution with degrees of freedom and scale parameter , that is, (13)From (12), the conditional posterior distribution of

**is: (14)Expression (14) indicates that the conditional posterior mean depends on the data (), whereas the variance is a fixed but unknown quantity.**

*b*The conditional expectation of ** b** in (14) can also be expressed aswhere is known as the ‘weighting factor’ that weights the least squares estimate .

Because the magnitude of variance of grows with *i*, Casella (1985) suggested applying a lower weighting factor for estimates with larger variance. The weighting factor assigns weights close to one to the first elements of , whereas the remaining elements receive lower weights, and can take on values close to zero, or zero itself.

### Bayesian inverse parametric regression models

Four versions of the Bayesian inverse regression models are described in this section. The first two, Bayesian inverse ridge regression (BIRR) and Bayes A inverse (BAI), are versions of the standard BRR and Bayes A (Meuwissen *et al.* 2001), respectively. In these models, the decay of the prior variances is not explicitly considered, whereas the other two proposed models do incorporate a prior decay of the variances.

#### Bayesian inverse ridge regression (BIRR):

The posterior expectation of , given and in (14), can also be written as:which agrees with the ridge estimator in (8). A special case (BIRR) is when all prior distributions of are assigned a constant variance . ThenIf a scaled inverse χ^{2} distribution is assigned to , the joint posterior distribution of is given by:

Note that model BIRR is analogous to the standard BRR in the sense that they both assign to the regression parameters a prior distribution with mean zero and constant variances.

#### Bayes A inverse (BAI):

Bayes A (Meuwissen *et al.* 2001) assigns the scaled *t* distribution as a prior to each marker effect, which can be represented as a normal-gamma mixture. We then have a hierarchical model with two levels: in the first one, a normal distribution with zero mean and variance is used as a conditional prior for ** b** and, in the second level, the same scaled inverse χ

^{2}prior distribution is used for all (Yi and Xu 2008). Thus the joint posterior distribution of ) is (16)According to Yi and Xu (2008), when , this induces an improper distribution of

**. Therefore, these hyper-parameters should be greater than zero. We assigned and, for , a uniform distribution with support on the interval (0,A), where A > 0. It should be noted that model BAI is not equivalent to the standard Bayes A because the prior variance of**

*b***under BAI is not the prior variance of the transformation under Bayes A.**

*b*#### Bayesian inverse regression with decay model 1 (BIR1):

The mean of the posterior distribution of ** b** can be seen as the product of a filter producing values that tend to shrink the least squares estimate toward 0 through the weighting factor , which depends on singular values of

**. It is then reasonable to assume a prior that will shrink the values of , in accordance with the singular values decay. Suppose we assign as prior distribution to**

*X***with being the eigen-decomposition of**

*β***, where**

*Ψ***is the right orthogonal matrix of the singular value decomposition . Then the prior distribution of**

*V**is because and the vector of eigenvalues of and the vector*

**b****of eigenvalues of**

*λ***share the same**

*Ψ***eigenvectors. Therefore, in this case it can be assumed that there is a relationship between and , . Note that this relationship is consistent with what de los Campos**

*V**et al.*(2010) presented for semiparametric regression.

Knapick *et al.* (2012) proposed that the variances of the prior distribution be represented as in an attempt to mimic the decay of the singular values of ** X**, as suggested above. Thus,

*ϕ*is a parameter that scales the rate of decay of the prior variance with respect to

*i*. Knapick

*et al.*(2012) suggested fixing

*α*and letting

*ϕ*be the regularization parameter to be inferred, but without indicating how to do it. Since

*ϕ*is an unknown scale parameter, it seems appropriate to assign an inverse scaled χ

^{2}distribution to it with parameters . However, this may produce overshrinkage, which would be reflected in a poor fit and, thus, low prediction power. To address this problem, we redefined as (17)where

*h*represents a smoothing parameter. The idea of adjusting the decay of a prior variance with a smoothing parameter was given by Maruyama and George (2011).

Thus the posterior density of given is:(18)where is a proper prior distribution for .

#### Bayesian inverse regression with decay model 2 (BIR2):

A variant of BIR1 is obtained by defining . In this case, we may assign as prior for *ϕ* an inverse scaled χ^{2} distribution, with *α* kept fixed. Thus, the posterior is (19)with being a proper prior distribution. When the value of *α* increases, shrinkage tends to increase, and when *α* decreases, shrinkage decreases and approximates BIRR when *α*=0. In the examples below, *α*=1 (the default value).

### Gibbs sampler

It is difficult to sample from the joint posterior distributions (15), (16), (18), and (19), because there are no known closed forms. However, it is possible to obtain the closed form for conditional distributions of the parameters (see the Appendix). This allows using Markov Chain Monte Carlo through the Gibbs Sampler (Gelfand and Smith 1990) algorithm, which samples sequentially from the full conditional distributions until it reaches a stationary process, converging with the joint posterior distribution.

We carried out convergence and diagnostic tests on different data sets. The Gelman-Rubin convergence tests for the four models were satisfactory, using lag-10 thinning results in low autocorrelations in each of the four models. The Raftery-Lewis test suggested a small burn-in and a number of iterations between 10,000 and 20,000 for the data set used below.

With the aim of decreasing the possible impact of Markov Chain Monte Carlo errors on prediction accuracy, we performed a total of 60,000 iterations with a burn-in of 10,000 and a thinning of 10 so that 5000 samples were used for inference.

## Experimental Data

### Maize data set

The maize data represented 21 trait-environment combinations for 300 tropical inbred lines genotyped with 55,000 SNPs each (González-Camacho *et al.* 2012). A first group of traits included female flowering or days to silking, male flowering (MFL) or days to anthesis, and the anthesis-silking interval. Each trait was evaluated under severe drought stress and in well-watered environments. This data set was also used by Crossa *et al.* (2010) for assessing prediction performance, but they used only 1148 SNPs.

In the second group of traits, grain yields (GYs) were obtained under severe drought stress and well-watered environments. Furthermore, GYs of the 300 maize lines also were measured in a large number of relatively high yielding environments and low yielding environments. In the third group of traits, the 300 maize lines also were evaluated in nine international environments for gray leaf spot (GLS), a disease caused by the fungus *Cercospora zeae-maydis*. Finally, in the fourth group, the same 300 lines were evaluated in another set of trials for northern corn leaf blight (NCLB), a disease caused by the fungus *Exserohilum turcicum*.

### Wheat data set

This data set included 306 elite wheat lines from CIMMYT’s Global Wheat Program that were also used by Pérez-Rodríguez *et al.* (2012). These lines were genotyped with 1717 diversity array technology markers generated by Triticarte Pty. Ltd. (Canberra, Australia; http://www.triticarte.com.au). Two traits were analyzed: days to heading, measured in 10 different environments, and GY, measured in seven different environments.

### Comparing models using cross-validation

Model predictions for each of the maize and wheat data sets were done in each of 50 random partitions, with 90% of the individuals in the training set and 10% of individuals in the testing set to be predicted. We used the same 50 random partitions as González-Camacho *et al.* (2012) and Pérez-Rodríguez *et al.* (2012). Pearson’s correlation between predicted and observed values and the predictive mean squared error (PMSE) were used as measures of predictive ability.

### Data and software

The 21 maize data sets and the 17 wheat data sets, as well as the R scripts developed to fit the predictive statistical models BIRR, BAI, BIR1, and BIR2 used in this study, are deposited at http://repository.cimmyt.org/xmlui/handle/10883/4036.

## Results

### Maize data sets

Table 1 shows mean correlations and mean PMSE from 50 random partitions for four models (BIRR, BAI, BIR1, and BIR2), along with the results of BRR, obtained using the BLR package in R (de los Campos and Pérez-Rodríguez 2010; Pérez-Rodríguez *et al.* 2010). The largest correlations and smallest PMSE for each trait-environment combination are in boldface. Models BIRR and BAI gave results similar to those of BRR. Models BIR1 and BIR2, which explicitly considered the decay of the singular values in the prior distribution, in general had greater average correlations and lower average PMSE, indicating better prediction ability than BRR, BIRR, and BAI.

Table 1 also shows average correlations and percentage differences from BIR1 for each group of trials: Flowering, GY, GLS, and NCLB. The approximate differences from BIR1 or BRR and BIRR were 4% for GY and 9% for NCLB, whereas the differences for GLS and Flowering were 2.6% and 1.2%, respectively. BIR1 and BIR2 had similar performance. BAI and BIR1 had similar results in the Flowering and NCLB groups, but BAI was slightly worse for GY and GLS. This finding indicates a 3% improvement in the overall average prediction ability of BIR1 and BIR2 over that of BRR, BIRR, and BAI. Model BIR1 had greater predictive correlations than BRR in 18 of the 21 trait-environment combinations, whereas BIR1 had smaller PMSE than BRR in 18 of the trait-environment combinations. Models BIR1 and BIR2 had the lowest PMSE in 16 of the 21 maize data sets. The PMSE of BIRR and BAI were similar.

Table 2 shows the number of times a model had a greater correlation than another model in 50 random partitions for each of the 21 trait-environment combinations. BIR1 was better than BIRR in 17 trait-environment combinations, *i.e.*, in 697 of 1050 comparisons. BIR1 was better than BAI in 16 trait-environment combinations, *i.e.*, in 669 partitions. Also, BIR2 was superior to BIRR and BAI in 640 and 610 partitions, respectively, of a total of 1050 random partitions.

### Wheat data sets

Table 3 shows mean correlations of models BIRR, BAI, BIR1, and BIR2, along with results for models BL, BRR, Bayes A, and Bayes B, as reported by Pérez-Rodríguez *et al.* (2012) using the same 50 random partitions. The largest average correlation for each trait-environment combination is in boldface. The table also shows correlation standard deviations (in parentheses). In the case of trait days to heading, the average correlations for BIRR, BAI, BIR1, and BIR2 are very similar and better than those for BL, BRR, Bayes A, and Bayes B. Within the GY group, BIRR, BAI, and BIR1 have similar correlations; however, BIR2, BL, BRR, and Bayes B exhibit a smaller global average than BIRR, BAI, and BIR1. Table 4 shows the PMSE for both traits, and their values are in agreement with the results in Table 3.

## Discussion

The SVD transformation of the data generates a basic probability model such that the joint density of the data (given parameters) is the product of univariate independent random variables, . Moreover, the new parameterization allows reducing the dimensionality of the vector of regression parameters from *p* to *n*. Because *p-n* parameters are not estimable, they do not contribute to data fitting and are considered to have a prior distribution with mean and variance equal to zero (a Dirac delta function). Because the proposed transformation yields *n* positive singular values, it reduces the number of parameters to *n*. This gives BIRR, BAI, BIR1, and BIR2 an extra computational advantage over BRR because the Gibbs sampler algorithm is faster due to the posterior distributions simulating a smaller number of parameters with univariate distributions. The proposed inverse Bayesian models implement the idea put forward by Casella (1985) that the increased instability of the OLS estimates can be ameliorated by decreasing the value of the weighting factor. Shrinkage is achieved using the weighting factor, that is, , which weights the estimates such that the instability of is controlled with a smaller weighting factor. The weights are highly dependent on the singular values and on the variance of the prior distribution.

In the following section, we describe the prediction performance of the four inverse Bayesian methods in terms of the magnitude and decay of the prior variance and the behavior of the OLS estimates . Finally, we consider the pattern of noise amplification as a result of the decay of the singular values, which also depends on the data .

### Shrinkage and prediction in the maize data set

For the 21 maize data sets, the ** X** matrix had a rank equal to

*n*with moderate decay in singular values, indicating moderate ill-conditioning. Figure 1, A−C shows the decay of singular values for data set MFL-WW. Figure 1A shows all singular values, and Figure 1B excludes the first 10 singular values and depicts the rapid decay in the first subsequent 45 singular values, a smaller rate of decay in singular values 45−200 and the rapid decay of singular values after the 200th singular value, reflecting the actual collinearity trend. Figure 1C shows that the OLS estimator of is very stable for the first 200 singular values and becomes very erratic at the end due to noise, coinciding with the rapid decrease in singular values after the first 200. In short, Figure 1, A−C show that more weight should be given to the first 200 least squares estimates ( and less weight should be assigned to the last 50 ´s, which basically represent noise. The four inverse Bayesian methods had good predictive power, with differences being due to the variance assigned to the prior distribution of parameter .

Figure 2, A and B depicts the prior variance for models BIRR, BAI, BIR1, and BIR2 for trait-environment combination MFL-WW. In Figure 2A, the prior variance of for BIRR is represented by a solid line, whereas used in BAI are scattered dots, each representing an individual. It is interesting to note that, for MFL-WW, most of the values for BAI, BIR1, and BIR2 were smaller than those for BIRR, represented by a solid line. This indicates that BAI, BIR1, and BIR2 cause more shrinkage. Figure 2B depicts the decay of singular values for BIR2 (dashed line) and BIR1 (solid line), both mimicking the current decay of singular values shown in Figure 1, A and B. The decay of BIR1 reflects the polynomial function , but smoothed by the parameter *h*, as indicated in (18), with less shrinkage for the first singular values and increasing shrinkage toward the later singular values.

These prior variances are reflected in the weighting factor. For example, Figure 3A shows the weights for BAI and for BIRR and, in general, there is more shrinkage for BAI than for BIRR. Figure 3B shows that the weights for BIR1 are slightly larger than those for BIR2. When comparing Figures 2 and 3, it is clear that as shrinkage increases, the weights come closer to 0. The four inverse Bayesian methods assigned weights that parallel the decay of the singular values (Figure 1, A and B), but for trait MFL-WW, the methods that reduced the weighting factors the most were BIR2 and BIR1, which had larger mean predictive correlations (0.850 and 0.841, respectively) than BIRR, with a mean predictive correlation of 0.822 (Table 1). This may be a reflection of some overfit caused mainly by larger weights used in the last 50 estimators, where noise is concentrated (Figure 1C). Trends were different for other traits; for example, for trait-environment combination GLS-3, BIRR, BIR1, and BIR2 had mean predictive correlations of 0.589, 0.586, and 0.584, respectively.

Comparing Figure 1C (OLS estimates for MFL-WW) and Figure 4B (OLS estimates for GLS-3), instability of OLS estimates is observed for the last 50 OLS values in MFL-WW but this occurred in only a few of the last OLS estimates for GLS-3. This indicates that trait MFL-WW required more shrinkage than that needed for trait GLS-3. This may explain why, for MFL-WW, model BIR2, which showed more shrinkage than BIRR, had a better predictive correlation than BIRR. For the same reasons, model BIRR was a better predictor than BIR2 for trait GLS-3. BIR1 had a good predictive correlation for most traits because the decay of the variance was smoothed out by (see Equation 17), which allowed “intermediate” shrinkage.

### Shrinkage and prediction in the wheat data set

The decay pattern of singular values and the noise in the wheat data set were different from those found in the maize data set. The ** X** matrix had a rank equal to

*n*-1, indicating stronger collinearity and, thus, greater decay of singular values than that found in the maize data sets; in addition, data for various trait-environment combinations had outliers and were more complex. This is shown, for example, in Figure 5, A−C for trait-environment combination DHT-1. Figure 5A shows that the decay of singular values for this trait was more pronounced than that for MFL-WW in maize (Figure 1, A and B). This data set has more collinearity than the maize data, making it difficult to visualize and detect where random noise is manifested in the OLS estimates; this represents a moderate-to-high ill-conditioning situation. Figure 5B shows that one singular value close to zero (upper right-hand side of the figure) magnified the OLS estimate, but when that outlier was removed, the noise pattern was more clearly delineated (Figure 5C). Noise increased systematically toward the last singular values, where the signal was practically lost.

In cases with strong collinearity, Kyung *et al.* (2010) suggested ridge regression as an effective model. This may explain, in part, why BIRR, BAI, and BIR1 had similar performance, whereas BIR2, with more shrinkage than the other models, had a mean predictive correlation similar to those of other models but with differential performance for each trait-environment combination. The greater complexity of the ** X** matrix, as well as of the phenotypes included in the wheat data, suggests that other prediction models, such as the semiparametric regression described by Pérez-Rodríguez

*et al.*(2012), may be more suitable.

## CONCLUSIONS

Models were developed within the framework of inverse regression theory. Inverse solutions induce some parsimony while keeping conditional independence of the transformed phenotypes. Inverse solutions make it possible to visualize that noise is inversely proportional to singular values. The univariate structure allows graphical depiction of shrinkage behavior according to the prior variance and weighting factor. The models developed here seem to deal well with the ill-conditioning and random noise problems arising in genomic prediction, where *p* > >*n*.

The differences among models depend on several factors, such as the pattern and decay of singular values. It is expected that when the decay of the singular values is moderated, the prediction accuracy of the proposed models will be adequate or improved. For the maize data set, with a large number of markers, the moderate decay of singular values causes a low level of ill-conditioning. In these cases, BIR1 and BIR2 seemed to give slightly better predictions than BL, BIRR, and BAI. On the other hand, the wheat data set had a greater level of ill-conditioning, with several outliers and a drastic decay of singular values. BIRR, which assigns a constant prior variance (as standard Bayesian ridge regression), tends to over- or undershrink; this would favor markers with small singular values while penalizing markers with large singular values.

## Appendix

Recalling that model (1) is represented as a deviation from the overall mean μ, closed form distributions can be derived from their joint distributions (15), (16), (18), and (19). The conditional distribution of *μ* is univariate normal:The conditional distribution of is:(small values are given to ).

Specifically for the BIRR model, the conditional distributions for and are:

For the BAI model, the conditional distribution of is:The Bayesian model is completed by giving *υ* a value of 3 to avoid infinite variance in the distribution of . The conditional distribution of isFor BIR 1 and BIR2, the conditional distributions of *ϕ* are:,respectively.

For BIR1, ** Ω** is a diagonal matrix of order

*n*×

*n*, with elements in the diagonal for

*i*=1,2,…,

*n*. For BIR2,

**is equal to matrix . For BIR1, the default value of smoothing parameter**

*Ω**h*is .

## Footnotes

*Communicating editor: D. J. de Koning*

- Received July 6, 2014.
- Accepted August 18, 2014.

- Copyright © 2014 Cuevas
*et al.*

This is an open-access article distributed under the terms of the Creative Commons Attribution Unported License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.