Bayesian Genomic-Enabled Prediction as an Inverse Problem

Genomic-enabled prediction in plant and animal breeding has become an active area of research. Many prediction models address the collinearity that arises when the number (p) of molecular markers (e.g. single-nucleotide polymorphisms) is larger than the sample size (n). Here we propose four Bayesian approaches to the problem based on commonly used data reduction methods. Specifically, we use a Gaussian linear model for an orthogonal transformation of both the observed data and the matrix of molecular markers. Because shrinkage of estimates is affected by the prior variance of transformed effects, we propose four structures of the prior variance as a way of potentially increasing the prediction accuracy of the models fitted. To evaluate our methods, maize and wheat data previously used with standard Bayesian regression models were employed for measuring prediction accuracy using the proposed models. Results indicate that, for the maize and wheat data sets, our Bayesian models yielded, on average, a prediction accuracy that is 3% greater than that of standard Bayesian regression models, with less computational effort.

Keywords: Bayesian regression; shrinkage; prior distribution; inverse regression; GenPred; shared data resources; genomic selection

Predicting complex traits possibly affected by large numbers of genes (each one having a small effect) that are greatly affected by the environment is a difficult task. Information from dense molecular markers attempts to exploit linkage disequilibrium between at least one marker and at least one putatively causal locus, so as to predict the genetic values of individuals based on their phenotypic data. There is a vast literature describing methods that use different functions of markers for predicting genetic values (e.g., de los Campos et al. 2012; Gianola 2013), starting with the seminal work of Meuwissen et al. (2001), which proposed genome-based prediction functions that regress phenotypes on all available markers by using a Gaussian linear model with three different prior distributions on marker effects. Several regularized regression models, such as ridge regression (Hoerl and Kennard 1970), the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani 1996), and its Bayesian counterpart (Park and Casella 2008; de los Campos et al. 2010), have been described and used for genome-based prediction in animals and plants (de los Campos et al. 2009; Crossa et al. 2010, 2011; González-Camacho et al. 2012; Heslot et al. 2012; Pérez-Rodríguez et al. 2010). The basic quantitative genetic model describes the i-th response or phenotype (y_i) as the sum of an unknown genetic value (g_i) plus a residual e_i expressed as a deviation from some general mean (m); thus, the basic model is y_i = g_i + e_i (i = 1, ..., n). The unknown genetic value can be represented as a function of genotypes with a large number of genes that may involve all gene × gene interactions, if there are any.
Because the genes affecting a trait are unknown, this complex function must be approximated by, for example, a regression of phenotype on marker genotypes. Large numbers of biallelic markers {x_i1, ..., x_ip} (x_ij is the number of copies of one of the two alleles observed in the i-th individual at the j-th marker) may be used in a regression function for predicting the genetic value of the i-th individual. The regression can be formulated as u(x) = u(x_i1, ..., x_ip), such that the basic genomic model becomes y_i = u_i + e_i, where e_i is a model error that may include errors due to unspecified environmental effects, imperfect linkage disequilibrium between markers and the actual loci affecting the trait, and unaccounted-for gene × gene and gene × environment interactions. In several applications, u(x) is a parametric linear regression of the form u(x_i1, ..., x_ip) = Σ_{j=1}^p x_ij β_j, where β_j is the substitution effect of the allele coded as 'one' at the j-th marker.
The linear regression function on markers becomes y_i = Σ_{j=1}^p x_ij β_j + e_i, or, in matrix notation,

y = Xβ + e,   (1)

where X is the n × p incidence matrix, β is the vector of unknown marker effects, and e is an n × 1 vector of random errors, typically assumed to follow the normal distribution N(0, I_n σ²_e), where σ²_e is the random error variance component. If the vector of marker effects is assumed random with normal distribution N(0, I_p σ²_b), where σ²_b is the variance of marker effects, then the genetic value of the individuals is u = Xβ with a variance-covariance matrix proportional to the genomic relationship matrix G = XX′ / Σ_{j=1}^p 2q_j(1 − q_j) (where q_j is the frequency of allele "1", often estimated from the data at hand); this random effect leads to the genomic best linear unbiased predictor (VanRaden 2007, 2008), which is extensively used in genomic-assisted prediction. Note that Bayesian ridge regression (BRR) is equivalent to the genomic best linear unbiased predictor when the distribution of single-nucleotide polymorphism (SNP) effects is regarded as Gaussian with mean 0 and variance I_p σ²_b, that is, N(0, I_p σ²_b) (Pérez-Rodríguez et al. 2012). The regression function u(x_i1, ..., x_ip) can also be represented by semiparametric approaches, such as reproducing kernel Hilbert space regressions or different types of neural networks (Gianola et al. 2006, 2011; Gianola and van Kaam 2008; González-Camacho et al. 2012; Pérez-Rodríguez et al. 2012). Model (1) is generally overparameterized and therefore ill-conditioned because there are many more predictors (markers) than individuals with observations (p >> n), as well as strong collinearity among predictors due to linkage disequilibrium; hence X is not of full column rank. These challenges can be overcome using what is called the inverse problem approach (Aster et al. 2005; Tarantola 2005).
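As an illustration, the genomic relationship matrix G = XX′ / Σ_j 2q_j(1 − q_j) can be computed from a centered genotype matrix as in the following NumPy sketch; the simulated genotypes and all variable names are our own, not part of the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                       # individuals, markers
q = rng.uniform(0.1, 0.9, size=p)    # allele frequencies of allele "1"
# Genotypes coded as counts of allele "1" (0, 1, or 2) per marker
M = rng.binomial(2, q, size=(n, p))

# Center marker columns at twice the allele frequency (VanRaden 2008)
X = M - 2 * q
# G = XX' / sum_j 2 q_j (1 - q_j)
G = X @ X.T / np.sum(2 * q * (1 - q))
```

In practice q_j would be estimated from the genotype data at hand rather than known, as the text notes.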
The discrete inverse problem approach for matrices of incomplete rank, or that are ill-conditioned due to possible collinearity, is discussed at length by Aster et al. (2005), whereas Tarantola (2005) addresses the continuous version of the same problem. Cavalier (2008) summarized the inverse problem as a nonparametric regression, d = Tu + ẽ, of a transformed data vector d on the n × n linear operator T, with u ∈ Θ_n ⊆ ℝ^n as a vector of regression parameters and ẽ as a vector of Gaussian errors. Within a Bayesian framework, Knapik et al. (2012) proposed a solution by using the singular value decomposition of T, such that the decay of the singular values is mimicked in the prior distribution of u. Thus, estimating the unknown transformed marker effects u from this perspective seems of interest but has not been attempted in genome-enabled prediction. The inferential solution to the inverse problem is based on the posterior distribution of u, with the variability of ẽ and prior beliefs about u conveyed through probabilistic models.
In the prediction context, de los Campos et al. used the fact that the matrix G (or X) may be eigendecomposed as G = UDU′, with D containing the eigenvalues of G and U its corresponding eigenvectors. This transformation allows reducing the original high-dimensional data to fewer dimensions by extracting most of the signal in the "first" components of the decomposition and relegating the noise to the "last" components. As de los Campos et al. demonstrated, the regression of phenotypes on all markers is equivalent to the regression of phenotypes on a set of rotated covariates. Implementing regression methods with p >> n generally requires either shrinkage estimation or reducing the dimensionality of the data, which is the approach considered here. Recently, Gianola (2013) studied the influence of the prior on the posterior inference of marker effects in the overparameterized models used in genome-enabled prediction. He concluded that a main driving force in the various Bayesian models (the "Bayesian alphabet") is the prior and not the data; thus, different priors will produce different results, mainly because their shrinkage effect varies.
In this work, we propose using the parametric model (1) within the framework of inverse problems and a Bayesian approach to predict genetic values of individuals when p >> n. This proposal is similar to that introduced by de los Campos et al. for dimension reduction. However, ours has several different features, including an orthogonal transformation of the observed data (y), as well as differential shrinkage as a result of the prior variance assigned to the regression parameters.

Statistical methods
In model (1), when X is not of full rank, the least squares solution for the unknown β is neither unique nor stable. These ill-conditioned problems may be analyzed within the framework of inverse problem theory by using the singular value decomposition of X, of order n × p (Aster et al. 2005), X = USV′, where U and V are orthonormal matrices of orders n × n and p × p, respectively, and S is a rectangular n × p matrix containing n non-negative singular values ordered from largest to smallest, s_1 ≥ s_2 ≥ ... ≥ s_n. One can write S = [S_1, S_2], where S_1 is the n × n matrix with the singular values along its diagonal, and S_2 is a matrix of order n × (p − n) with all entries null. Consider the linear transformation of both sides of (1):

d = U′y = SV′β + ẽ.   (2)

In (2), d and ẽ are vectors of order n × 1 and b = V′β is a vector of order p × 1. The column vector b is partitioned as b = [b′_1, b′_2]′, where b_1 is n × 1 and b_2 is of order (p − n) × 1. Therefore, equation (2) becomes

d = S_1 b_1 + ẽ.   (3)

Here only the first n entries of b can be inferred. Since ẽ = U′e, it follows that ẽ ~ N(0, U′U σ²_e) = N(0, I_n σ²_e). The distribution of the transformed data d, given b and σ²_e, is

p(d | b, σ²_e) = Π_{i=1}^n N(d_i | s_i b_i, σ²_e),   (4)

where d_i is the i-th element of d and has mean s_i b_i, since S_1 = diag{s_1, ..., s_n}. To recover the original parameter β, the p × p matrix V is partitioned as V = [V_1, V_2], where V_1 contains the first n columns of V and V_2 contains the remaining p − n, so

β = V_1 b_1 + V_2 b_2.

The p original parameters are recovered as β = V_1 b_1 = Σ_{i=1}^n V_{i,1} b_i, where V_{i,1} is the i-th column of V_1; b_2 does not contribute to fitting the data, because Xβ = US_1 b_1 regardless of b_2. Hence, for example, independent Gaussian priors can be assigned to each of the elements of b_1. Obviously, b_2 is an unknown quantity, so a prior must be assigned to this vector. However, as shown previously and in Gianola (2013), the likelihood function does not depend on b_2, so the data do not provide information about b_2.
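The transformation leading to models (2) and (3) can be sketched numerically; this is an illustrative reconstruction on simulated data, not code from the study, and all names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 200
X = rng.normal(size=(n, p))            # marker incidence matrix, p >> n
beta = rng.normal(size=p)
e = rng.normal(scale=0.5, size=n)
y = X @ beta + e                       # model (1): y = X beta + e

# Thin SVD: X = U S1 V1', keeping only the n positive singular values
U, s, V1t = np.linalg.svd(X, full_matrices=False)

d = U.T @ y                            # transformed data d = U'y
b1 = V1t @ beta                        # first n transformed effects b1 = V1'beta
e_tilde = U.T @ e                      # rotated errors, still N(0, I_n sigma2_e)

# Model (3): d = S1 b1 + e~, with S1 = diag(s_1, ..., s_n)
assert np.allclose(d, s * b1 + e_tilde)
```

The assertion confirms that, after rotation, each transformed observation depends on a single transformed effect, which is what makes the univariate treatment in the following sections possible.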
Here, the marginal posterior of b_1 is the likelihood times its prior, and the marginal posterior of b_2 is equal to its prior. Thus, if the Dirac delta function is assigned as a prior to each element of b_2, we get (4) as the quantity of interest and can safely leave b_2 out of the analysis. Hereinafter, b_1, V_1, and S_1 will therefore be denoted as b, V, and S, respectively, without loss of generality. The ordinary least squares (OLS) estimator of b found using the generalized Moore-Penrose inverse is b* = S⁻¹d. Thus, b* depends on the ratios d_i/s_i, i = 1, ..., n, between transformed data points and the decaying singular values, which can be interpreted as a decreasing signal-to-noise ratio sequence. So, as singular values decay, there is an increase in the noise component of the least squares estimate of b. This noise is inversely proportional to the singular values, as is clearly shown by the expression

b*_i = d_i / s_i = b_i + ẽ_i / s_i.   (7)

The least squares estimator of b_i can be computed from expression (7) as b*_i = d_i/s_i and serves as a basis for regularizing estimation through truncation or weighting.
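The noise amplification just described is easy to demonstrate: in a toy transformed model, the OLS error b*_i − b_i equals ẽ_i/s_i, so it blows up exactly where the singular values are small. The numerical values below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
s = np.linspace(5.0, 0.05, n)          # decaying singular values
b_true = rng.normal(size=n)
e = rng.normal(scale=0.3, size=n)      # rotated noise e~
d = s * b_true + e                     # transformed data, model (3)

# OLS via the Moore-Penrose inverse: b*_i = d_i / s_i
b_ols = d / s
# The error is exactly e~_i / s_i: tiny for leading components,
# amplified by 1/s_i for the trailing ones
noise = b_ols - b_true
```

Here the last component's error is inflated by a factor 1/0.05 = 20 relative to the raw noise, while the first is damped by 1/5, mirroring the "stable head, erratic tail" pattern discussed for the maize data below.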
A Bayesian inverse regression model

Several methods have been proposed for mitigating ill-conditioned regression problems by shrinking the solution space, that is, by restricting the magnitudes of the estimates and their variance. One of the first proposals was the ridge regression estimator (Hoerl and Kennard 1970). Since model (3) has the form d = Sb + ẽ, with ẽ ~ N(0, I_n σ²_e), the ridge regression estimator is

b̂ = (S′S + Γ)⁻¹ S′d,   (8)

with Γ a diagonal matrix of dimensions n × n with values γ_i on the diagonal. Here γ > 0 is a vector of parameters that reduce ill-conditioning; if their magnitude increases excessively, however, this can lead to a poor fit of the model to the data. Therefore, the choice of the vector γ is critical. Hoerl and Kennard (1970) showed that there is a range of γ values where the mean squared error (MSE) of the estimates is smaller than the corresponding MSE of the ordinary least squares solution. They minimized the MSE by choosing

γ_i = σ²_e / b²_i.   (9)

This requires knowing the true values of b. However, (9) can be used to justify the choice γ̂ = ŝ²_e/ŝ²_b, where b̂, ŝ²_b = b̂′b̂/n, and ŝ²_e are estimates of b, σ²_b, and σ²_e, respectively, thus providing a single (global) shrinkage parameter.
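Because S is diagonal in the transformed model, the ridge estimator can be computed coordinate-wise. The following sketch uses a single global shrinkage parameter; the value of γ and the simulated data are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
s = np.linspace(5.0, 0.05, n)          # decaying singular values
d = s * rng.normal(size=n) + rng.normal(scale=0.3, size=n)

gamma = 0.5                            # single (global) shrinkage parameter
b_ols = d / s                          # ordinary least squares
# Ridge solution of d = S b + e~: (S'S + gamma I)^{-1} S'd, element-wise
b_ridge = s * d / (s**2 + gamma)
```

Each ridge coordinate equals s_i²/(s_i² + γ) times the OLS coordinate, so every estimate is pulled toward zero, and the pull is strongest precisely where s_i is small and the OLS estimate is noisiest.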
In the Bayesian approach, the prior distribution reflects prior beliefs about b_i, and its variance affects the extent to which the least squares estimate moves toward the prior mean. To construct a model along the lines of de los Campos et al. (2012), we adopted as the conditional prior distribution of b_i

b_i | λ_i ~ N(0, λ_i),

where the b_i coefficients (given λ_i) are conditionally independent. Therefore, the joint posterior distribution of b and σ²_e in model (3) is given by

p(b, σ²_e | d) ∝ Π_{i=1}^n N(d_i | s_i b_i, σ²_e) N(b_i | 0, λ_i) p(σ²_e),   (12)

where the λ_i (i = 1, ..., n) are variance parameters and p(σ²_e) is the prior distribution of σ²_e. Usually the conjugate prior for σ²_e is a scaled inverse χ² distribution with ν_e degrees of freedom and scale parameter Sc_e. From (12), the conditional posterior distribution of b_i is

b_i | d_i, σ²_e ~ N( s_i λ_i d_i / (s_i² λ_i + σ²_e), λ_i σ²_e / (s_i² λ_i + σ²_e) ).   (14)

Expression (14) indicates that the conditional posterior mean depends on the data (d_i), whereas the variance is a fixed, albeit unknown, quantity.
The conditional expectation of b_i in (14) can also be expressed as E(b_i | d_i, σ²_e) = f_i b*_i, where

f_i = s_i² λ_i / (s_i² λ_i + σ²_e)

is known as the 'weighting factor' that weights the least squares estimate b*_i. Casella (1985) suggested applying a lower weighting factor to estimates with larger variance. The weighting factor assigns weights close to one to the first elements of b*, whereas the remaining elements receive lower weights, which can take values close to zero, or zero itself.
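The behavior of the weighting factor can be seen directly: with a constant prior variance, the leading components are left nearly unshrunk while the trailing ones are driven toward zero. The parameter values below are our own illustrative choices.

```python
import numpy as np

n = 50
s = np.linspace(5.0, 0.05, n)      # decaying singular values
lam = 1.0                          # constant prior variance lambda_i
sigma2_e = 0.3                     # residual variance

# Weighting factor: posterior mean is E(b_i | d_i) = f_i * b*_i
f = s**2 * lam / (s**2 * lam + sigma2_e)
```

With these values, f_1 ≈ 0.99 (almost no shrinkage of the first OLS estimate), while f_n ≈ 0.008, effectively zeroing out the noisiest trailing estimate, exactly the behavior Casella (1985) recommends.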

Bayesian inverse parametric regression models
Four versions of the Bayesian inverse regression models are described in this section. The first two, Bayesian inverse ridge regression (BIRR) and Bayes A inverse (BAI), are versions of the standard BRR and Bayes A (Meuwissen et al. 2001), respectively. In these models, the decay of the prior variances is not explicitly considered, whereas the other two proposed models do incorporate a prior decay of the variances.
Bayesian inverse ridge regression (BIRR): The posterior expectation of b_i, given d_i, s_i, λ_i, and σ²_e in (14), can also be written as

E(b_i | d_i, σ²_e) = s_i d_i / (s_i² + σ²_e/λ_i),

which agrees with the ridge estimator in (8) with γ_i = σ²_e/λ_i. A special case (BIRR) arises when all prior distributions of b_i are assigned a constant variance, λ_i = σ²_b. If a scaled inverse χ² distribution is then assigned to σ²_b, the joint posterior distribution of (b, σ²_b, σ²_e) is given by

p(b, σ²_b, σ²_e | d) ∝ Π_{i=1}^n N(d_i | s_i b_i, σ²_e) N(b_i | 0, σ²_b) p(σ²_b) p(σ²_e).   (15)

Note that model BIRR is analogous to the standard BRR in the sense that both assign to the regression parameters a prior distribution with mean zero and constant variance.
Bayes A inverse (BAI): Bayes A (Meuwissen et al. 2001) assigns a scaled t distribution as prior to each marker effect, which can be represented as a scale mixture of normal distributions. We then have a hierarchical model with two levels: in the first, a normal distribution with zero mean and variance λ_i = σ²_bi is used as a conditional prior for b_i and, in the second, the same scaled inverse χ² prior distribution is used for all σ²_bi (Yi and Xu 2008). Thus, the joint posterior distribution of (b, σ²_b1, ..., σ²_bn, Sc_b, σ²_e) is

p(b, σ²_b1, ..., σ²_bn, Sc_b, σ²_e | d) ∝ Π_{i=1}^n N(d_i | s_i b_i, σ²_e) N(b_i | 0, σ²_bi) χ⁻²(σ²_bi | ν_b, Sc_b) p(Sc_b) p(σ²_e).   (16)

According to Yi and Xu (2008), setting ν_b = 0 and Sc_b = 0 induces an improper distribution of b; therefore, these hyperparameters should be greater than zero. We assigned ν_b = 3 and, for Sc_b, a uniform distribution with support on the interval (0, A), where A > 0. It should be noted that model BAI is not equivalent to the standard Bayes A because the prior variance of b under BAI is not the prior variance of the transformation V′β under Bayes A.
Bayesian inverse regression with decay model 1 (BIR1): The mean of the posterior distribution of b can be seen as the product of a filter that shrinks the least squares estimate toward 0 through the weighting factor f_i, which depends on the singular values of X. It is then reasonable to assume a prior that shrinks the values of b* in accordance with the decay of the singular values. Suppose we assign N(0, C) as the prior distribution of β, with C = VΛV′ being the eigendecomposition of C, where V is the right orthogonal p × p matrix of the singular value decomposition X = USV′. Then the prior distribution of b is N(0, Λ), because b = V′β, and the vector s² of eigenvalues of X′X and the vector λ of eigenvalues of C share the same eigenvectors V. Therefore, in this case it can be assumed that there is a relationship between s_i and λ_i, i = 1, ..., n. Note that this relationship is consistent with what de los Campos et al. (2010) presented for semiparametric regression. Knapik et al. (2012) proposed that the variances of the prior distribution be represented as λ_i = θ i^(−1−2α) (i = 1, ..., n) in an attempt to mimic the decay of the singular values of X, as suggested above. Thus, θ is a parameter that scales the rate i^(−1−2α) of decay of the prior variance with respect to i. Knapik et al. (2012) suggested fixing α and letting θ be the regularization parameter to be inferred, but without indicating how to do so. Since θ is an unknown scale parameter, it seems appropriate to assign it a scaled inverse χ² distribution with parameters ν_θ, Sc_θ. However, this may produce overshrinkage, which would be reflected in a poor fit and, thus, low predictive power. To address this problem, we redefined λ_i so that the polynomial rate of decay i^(−1−2α) is tempered by a smoothing parameter h. The idea of adjusting the decay of a prior variance with a smoothing parameter was given by Maruyama and George (2011).
Thus, the posterior density of (b, θ, Sc_θ, σ²_e), given d and S, is

p(b, θ, Sc_θ, σ²_e | d, S) ∝ Π_{i=1}^n N(d_i | s_i b_i, σ²_e) N(b_i | 0, λ_i) χ⁻²(θ | ν_θ, Sc_θ) p(Sc_θ) p(σ²_e),

where p(Sc_θ) is a proper prior distribution for Sc_θ.
Bayesian inverse regression with decay model 2 (BIR2): A variant of BIR1 is obtained by defining λ_i = θ s_i^α. In this case, we may assign a scaled inverse χ² prior to θ, with α kept fixed. Thus, the posterior is

p(b, θ, Sc_θ, σ²_e | d, S) ∝ Π_{i=1}^n N(d_i | s_i b_i, σ²_e) N(b_i | 0, θ s_i^α) χ⁻²(θ | ν_θ, Sc_θ) p(Sc_θ) p(σ²_e),

with p(Sc_θ) a proper prior distribution. When the value of α increases, shrinkage tends to increase; when α decreases, shrinkage decreases, and the model approaches BIRR when α = 0.
In the examples below, α = 1 (the default value).
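The two decay structures can be contrasted numerically. The exact smoothed form of the BIR1 prior variance is not reproduced in the text, so the BIR1 curve below uses a plausible tempered polynomial decay (i^(−1−2α) plus a smoothing constant h); the BIR2 curve follows λ_i = θ s_i^α as defined above. All parameter values are our own illustrative choices.

```python
import numpy as np

n = 50
s = np.linspace(5.0, 0.05, n)      # decaying singular values
i = np.arange(1, n + 1)
theta, alpha, h = 1.0, 1.0, 0.1    # illustrative values; h is the smoothing parameter

# BIR1-style decay: polynomial in the index i, tempered by h (assumed form)
lam_bir1 = theta * (i**(-1.0 - 2 * alpha) + h)
# BIR2-style decay: tied to the singular values themselves
lam_bir2 = theta * s**alpha
```

Both prior variances decrease toward the trailing components, where the OLS estimates are dominated by noise; the smoothing constant h keeps the BIR1 variances bounded away from zero, which is what prevents the overshrinkage discussed above.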

Gibbs sampler
It is difficult to sample directly from joint posterior distributions such as (15) and (16); hence, we used the Gibbs sampler (Gelfand and Smith 1990), an algorithm that samples sequentially from the full conditional distributions until the chain converges to its stationary distribution, the joint posterior.
We carried out convergence and diagnostic tests on different data sets. The Gelman-Rubin convergence tests for the four models were satisfactory, and lag-10 thinning resulted in low autocorrelations in each of the four models. The Raftery-Lewis test suggested a small burn-in and a number of iterations between 10,000 and 20,000 for the data set used below.
With the aim of decreasing the possible impact of Markov Chain Monte Carlo errors on prediction accuracy, we performed a total of 60,000 iterations with a burn-in of 10,000 and a thinning of 10 so that 5000 samples were used for inference.
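A minimal Gibbs sampler for the BIRR-style model (constant prior variance, scaled inverse χ² priors on both variances) can be sketched as follows. This is our own simplified illustration on simulated data, not the authors' implementation; hyperparameter values and chain lengths are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated transformed data d_i = s_i b_i + e_i (model (3))
n = 40
s = np.linspace(4.0, 0.1, n)
b_true = rng.normal(size=n)
d = s * b_true + rng.normal(scale=0.5, size=n)

# Hyperparameters of the scaled inverse chi-square priors (assumed values)
nu_e, Sc_e = 3.0, 1.0
nu_b, Sc_b = 3.0, 1.0

b, sigma2_b, sigma2_e = np.zeros(n), 1.0, 1.0
keep = []
for it in range(2000):
    # Full conditional of each b_i is univariate normal
    prec = s**2 / sigma2_e + 1.0 / sigma2_b
    mean = (s * d / sigma2_e) / prec
    b = mean + rng.normal(size=n) / np.sqrt(prec)
    # Scaled inverse chi-square full conditionals for the variances
    sigma2_b = (nu_b * Sc_b + b @ b) / rng.chisquare(nu_b + n)
    resid = d - s * b
    sigma2_e = (nu_e * Sc_e + resid @ resid) / rng.chisquare(nu_e + n)
    if it >= 500:                      # discard burn-in
        keep.append(b)

b_post = np.mean(keep, axis=0)         # posterior mean of transformed effects
```

Because every full conditional is univariate (or a scalar variance draw), each sweep costs O(n) rather than requiring any matrix inversion, which is the computational advantage over sampling the p marker effects directly.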

Maize data set
The maize data represented 21 trait-environment combinations for 300 tropical inbred lines, each genotyped with 55,000 SNPs. A first group of traits included female flowering, or days to silking; male flowering (MFL), or days to anthesis; and the anthesis-silking interval. Each trait was evaluated under severe drought stress and in well-watered environments. This data set was also used by Crossa et al. (2010) for assessing prediction performance, but they used only 1148 SNPs.
In the second group of traits, grain yields (GYs) were obtained under severe drought stress and well-watered environments. Furthermore, GYs of the 300 maize lines also were measured in a large number of relatively high yielding environments and low yielding environments. In the third group of traits, the 300 maize lines also were evaluated in nine international environments for gray leaf spot (GLS), a disease caused by the fungus Cercospora zeae-maydis. Finally, in the fourth group, the same 300 lines were evaluated in another set of trials for northern corn leaf blight (NCLB), a disease caused by the fungus Exserohilum turcicum.

Wheat data set
This data set included 306 elite wheat lines from CIMMYT's Global Wheat Program that were also used by Pérez-Rodríguez et al. (2012). These lines were genotyped with 1717 diversity array technology markers generated by Triticarte Pty. Ltd. (Canberra, Australia; http://www.triticarte.com.au). Two traits were analyzed: days to heading, measured in 10 different environments, and GY, measured in seven different environments.

Comparing models using cross-validation
Model predictions for each of the maize and wheat data sets were computed in each of 50 random partitions, with 90% of the individuals in the training set and 10% of the individuals in the testing set to be predicted. We used the same 50 random partitions as González-Camacho et al. (2012).
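The cross-validation scheme (50 random 90/10 partitions) can be sketched as follows; the seed and array layout are our own.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_partitions = 300, 50              # 300 lines, 50 random partitions

partitions = []
for _ in range(n_partitions):
    idx = rng.permutation(n)
    test = idx[: n // 10]              # 10% testing set (30 lines)
    train = idx[n // 10 :]             # 90% training set (270 lines)
    partitions.append((train, test))
```

For each partition, a model is fitted on the training lines and prediction accuracy is measured as the correlation (and PMSE) between observed and predicted phenotypes in the testing set, averaged over the 50 partitions.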

RESULTS
Maize data sets

Table 1 shows mean correlations and mean prediction mean squared errors (PMSE) from 50 random partitions for the four models (BIRR, BAI, BIR1, and BIR2), along with the results of BRR obtained using the BLR package in R (de los Campos and Pérez-Rodríguez 2010; Pérez-Rodríguez et al. 2010). The largest correlations and smallest PMSE for each trait-environment combination are in boldface. Models BIRR and BAI gave results similar to those of BRR. Models BIR1 and BIR2, which explicitly consider the decay of the singular values in the prior distribution, in general had greater average correlations and lower average PMSE, indicating better prediction ability than BRR, BIRR, and BAI. Table 1 also shows average correlations and percentage differences from BIR1 for each group of trials: Flowering, GY, GLS, and NCLB. The approximate differences of BIR1 from BRR and BIRR were 4% for GY and 9% for NCLB, whereas the differences for GLS and Flowering were 2.6% and 1.2%, respectively. BIR1 and BIR2 had similar performance. BAI and BIR1 had similar results in the Flowering and NCLB groups, but BAI was slightly worse for GY and GLS. These findings indicate a 3% improvement in the overall average prediction ability of BIR1 and BIR2 over that of BRR, BIRR, and BAI. Model BIR1 had greater predictive correlations than BRR in 18 of the 21 trait-environment combinations, and smaller PMSE than BRR in 18 of the combinations. Models BIR1 and BIR2 had the lowest PMSE in 16 of the 21 maize data sets. The PMSE of BIRR and BAI were similar. Table 2 shows the number of times one model had a greater correlation than another in the 50 random partitions for each of the 21 trait-environment combinations. BIR1 was better than BIRR in 17 trait-environment combinations, i.e., in 697 of 1050 comparisons. BIR1 was better than BAI in 16 trait-environment combinations, i.e., in 669 partitions.
Also, BIR2 was superior to BIRR and BAI in 640 and 610 partitions, respectively, of a total of 1050 random partitions.
Wheat data sets

Table 3 shows mean correlations of models BIRR, BAI, BIR1, and BIR2, along with results for models BL, BRR, Bayes A, and Bayes B, as reported by Pérez-Rodríguez et al. (2012) using the same 50 random partitions. The largest average correlation for each trait-environment combination is in boldface. The table also shows correlation standard deviations (in parentheses). For trait days to heading, the average correlations for BIRR, BAI, BIR1, and BIR2 are very similar and better than those for BL, BRR, Bayes A, and Bayes B. Within the GY group, BIRR, BAI, and BIR1 have similar correlations; however, BIR2, BL, BRR, and Bayes B exhibit a smaller global average than BIRR, BAI, and BIR1. Table 4 shows the PMSE for both traits, and the values are in agreement with the results in Table 3.

DISCUSSION
The SVD transformation of the data generates a basic probability model in which the joint density of the data (given the parameters) is a product over univariate independent random variables, p(d | S, b, σ²_e) = Π_{i=1}^n N(d_i | s_i b_i, σ²_e). Moreover, the new parameterization reduces the dimensionality of the vector of regression parameters from p to n. Because p − n parameters are not estimable, they do not contribute to data fitting and are assigned a prior distribution with mean and variance equal to zero (a Dirac delta function). Because the proposed transformation yields n positive singular values, it reduces the number of parameters to n. This gives BIRR, BAI, BIR1, and BIR2 an extra computational advantage over BRR: the Gibbs sampler is faster because it simulates a smaller number of parameters from univariate posterior distributions. The proposed inverse Bayesian models implement the idea put forward by Casella (1985) that the increased instability of the OLS estimates b*_i can be ameliorated by decreasing the value of the weighting factor. Shrinkage is achieved through the weighting factor f_i; that is, the instability of b*_i is controlled with a smaller weighting factor. The weights are highly dependent on the singular values and on the variance λ_i of the prior distribution.
In the following section, we describe the prediction performance of the four inverse Bayesian methods in terms of the magnitude and decay of the prior variance and the behavior of the OLS estimates b Ã i . Finally, we consider the pattern of noise amplification as a result of the decay of the singular values, which also depends on the data d i .
Shrinkage and prediction in the maize data set

For the 21 maize data sets, the X matrix had rank equal to n, with moderate decay in singular values, indicating moderate ill-conditioning. Figure 1, A–C shows the decay of singular values for data set MFL-WW. Figure 1A shows all singular values; Figure 1B excludes the first 10 singular values and depicts the rapid decay in the next 45 singular values, a smaller rate of decay in singular values 45–200, and the rapid decay after the 200th singular value, reflecting the actual collinearity trend. Figure 1C shows that the OLS estimator of b_i is very stable for the first 200 singular values and becomes very erratic at the end due to noise, coinciding with the rapid decrease in singular values after the first 200. In short, Figure 1, A–C shows that more weight should be given to the first 200 least squares estimates (b*_i) and less weight assigned to the last 50 b*_i's, which basically represent noise. The four inverse Bayesian methods had good predictive power, with differences being due to the variance assigned to the prior distribution of parameter b_i. Figure 2, A and B depicts the prior variance for models BIRR, BAI, BIR1, and BIR2 for trait-environment combination MFL-WW. In Figure 2A, the prior variance of b_i for BIRR is represented by a solid line, whereas the λ_i used in BAI are scattered dots, each representing an individual. It is interesting to note that, for MFL-WW, most of the λ_i values for BAI, BIR1, and BIR2 were smaller than those for BIRR, indicating that BAI, BIR1, and BIR2 cause more shrinkage. Figure 2B depicts the decay of the prior variances for BIR2 (dashed line) and BIR1 (solid line), both mimicking the actual decay of singular values shown in Figure 1, A and B.
The decay for BIR1 reflects the polynomial function i^(−1), smoothed by the parameter h, as indicated in (18), with less shrinkage for the first singular values and increasing shrinkage toward the later ones.
These prior variances are reflected in the weighting factor. For example, Figure 3A shows the weights for BAI and for BIRR; in general, there is more shrinkage for BAI than for BIRR. Figure 3B shows that the weights for BIR1 are slightly larger than those for BIR2. Comparing Figures 2 and 3, it is clear that as shrinkage increases, the weights approach 0. The four inverse Bayesian methods assigned weights that parallel the decay of the singular values (Figure 1, A and B), but for trait MFL-WW, the methods that reduced the weighting factors the most were BIR2 and BIR1, which had larger mean predictive correlations (0.850 and 0.841, respectively) than BIRR, with a mean predictive correlation of 0.822 (Table 1). This may reflect some overfitting in BIRR, caused mainly by the larger weights given to the last 50 estimators, where noise is concentrated (Figure 1C). Trends were different for other traits; for example, for trait-environment combination GLS-3, BIRR, BIR1, and BIR2 had mean predictive correlations of 0.589, 0.586, and 0.584, respectively.
Comparing Figure 1C (OLS estimates for MFL-WW) and Figure 4B (OLS estimates for GLS-3), instability of the OLS estimates is observed for the last 50 values in MFL-WW, but only for a few of the last estimates in GLS-3. This indicates that trait MFL-WW required more shrinkage than trait GLS-3, and may explain why, for MFL-WW, model BIR2, which showed more shrinkage than BIRR, had a better predictive correlation than BIRR. For the same reasons, model BIRR was a better predictor than BIR2 for trait GLS-3. BIR1 had a good predictive correlation for most traits because the decay of the variance was smoothed out by the parameter h (see Equation 17), which allowed "intermediate" shrinkage.

Shrinkage and prediction in the wheat data set
The decay pattern of singular values and the noise in the wheat data set were different from those found in the maize data set. The X matrix had rank equal to n − 1, indicating stronger collinearity and, thus, greater decay of singular values than in the maize data sets; in addition, data for various trait-environment combinations had outliers and were more complex. This is shown, for example, in Figure 5, A–C for trait-environment combination DHT-1. Figure 5A shows that the decay of singular values for this trait was more pronounced than that for MFL-WW in maize (Figure 1, A and B). This data set has more collinearity than the maize data, making it difficult to visualize and detect where random noise is manifested in the OLS estimates; this represents a moderate-to-high ill-conditioning situation. Figure 5B shows that one singular value close to zero (upper right-hand side of the figure) magnified the OLS estimate, but when that outlier was removed, the noise pattern was more clearly delineated (Figure 5C). Noise increased systematically toward the last singular values, where the signal was practically lost.
In cases with strong collinearity, Kyung et al. (2010) suggested ridge regression as an effective model. This may explain, in part, why BIRR, BAI, and BIR1 had similar performance, whereas BIR2, with more shrinkage than the other models, had a mean predictive correlation similar to those of other models but with differential performance for each trait-environment combination. The greater complexity of the X matrix, as well as of the phenotypes included in the wheat data, suggests that other prediction models, such as the semiparametric regression described by Pérez-Rodríguez et al. (2012), may be more suitable.

CONCLUSIONS
Models were developed within the framework of inverse regression theory. Inverse solutions induce some parsimony while keeping conditional independence of the transformed phenotypes, and they make it possible to see that the noise is inversely proportional to the singular values. The univariate structure allows graphical depiction of shrinkage behavior according to the prior variance and weighting factor. The models developed here seem to deal well with the ill-conditioning and random-noise problems arising in genomic prediction, where p >> n.
The differences among models depend on several factors, such as the pattern and decay of the singular values. When the decay of the singular values is moderate, the prediction accuracy of the proposed models is expected to be adequate or improved. For the maize data set, with a large number of markers, the moderate decay of singular values causes a low level of ill-conditioning. In these cases, BIR1 and BIR2 seemed to give slightly better predictions than BL, BIRR, and BAI. On the other hand, the wheat data set had a greater level of ill-conditioning, with several outliers and a drastic decay of singular values. BIRR, which assigns a constant prior variance (as standard Bayesian ridge regression does), tends to over- or undershrink; this would favor markers with small singular values while penalizing markers with large singular values.