Comparison Between Linear and Non-parametric Regression Models for Genome-Enabled Prediction in Wheat

In genome-enabled prediction, parametric, semi-parametric, and non-parametric regression models have been used. This study assessed the predictive ability of linear and non-linear models using dense molecular markers. The linear models were linear on marker effects and included the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B. The non-linear models (this refers to non-linearity on markers) were reproducing kernel Hilbert space (RKHS) regression, Bayesian regularized neural networks (BRNN), and radial basis function neural networks (RBFNN). These statistical models were compared using 306 elite wheat lines from CIMMYT genotyped with 1717 diversity array technology (DArT) markers and two traits, days to heading (DTH) and grain yield (GY), measured in each of 12 environments. It was found that the three non-linear models had better overall prediction accuracy than the linear regression specification. Results showed a consistent superiority of RKHS and RBFNN over the Bayesian LASSO, Bayesian ridge regression, Bayes A, and Bayes B models.

Genome-enabled prediction of complex traits based on marker data are becoming important in plant and animal breeding, personalized medicine, and evolutionary biology (Meuwissen et al. 2001;Bernardo and Yu 2007;de los Campos et al. 2009Crossa et al. 2010Crossa et al. , 2011Ober et al. 2012). In the standard, infinitesimal, pedigree-based model of quantitative genetics, the family structure of a population is reflected in some expected resemblance between relatives. The latter is measured as an expected covariance matrix among individuals and is used to predict genetic values (e.g. Crossa et al. 2006;Burgueño et al. 2007Burgueño et al. , 2011. Whereas pedigree-based models do not account for Mendelian segregation and the expected covariance matrix is constructed using assumptions that do not hold (e.g. absence of selection and mutation and random mating), the marker-based models allow tracing Mendelian segregation at several positions of the genome and observing realized (as opposed to expected) covariances. This enhances the potential for improving the accuracy of estimates of genetic values, thus increasing the genetic progress attainable when these predictions are used for selection purposes in lieu of pedigree-based predictions. Recently, de los Campos et al. (2009 and Crossa et al. (2010Crossa et al. ( , 2011 used Bayesian estimates from genomic parametric and semi-parametric regressions, and they found that models that incorporate pedigree and markers simultaneously had better prediction accuracy for several traits in wheat and maize than models based only on pedigree or only on markers. The standard linear genetic model represents the phenotypic response of the i th individual (y i ) as the sum of a genetic value, g i , and of a model residual, e i , such that the linear model for n individuals ði ¼ 1; :::; nÞ is represented as y i ¼ g i þ e i . However, building predictive models for complex traits using a large number of molecular markers (p) with a set of lines comprising individuals (n) with p n is challenging because individual marker effects are not likelihoodidentified. In this case, marker effects can be estimated via penalized parametric or semi-parametric methods or their Bayesian counterparts, rather than via ordinary least squares. This reduces the mean-squared error of estimates; it also increases prediction accuracy of out-of-sample cases and prevents over-fitting . In addition to the well-known Bayes A and B linear regression models originally proposed by Meuwissen et al. (2001) for incorporating marker effects into g i , there are several penalized parametric regression methods for estimating marker effects, such as ridge regression, the least absolute shrinkage and selection operator (LASSO), and the elastic net (Hastie et al. 2009). The Bayesian counterparts of these models have proved to be useful because appropriate priors can be assigned to the regularization parameter(s), and uncertainty in the estimations and predictions can be measured directly by applying the Bayesian paradigm.
Regression methods assume a linear relationship between phenotype and genotype, and they typically account for additive allelic effects only; however, evidence of epistatic effects on plant traits is vast and well documented (e.g. Holland 2001Holland , 2008. In wheat, for instance, detailed analyses have revealed a complex circuitry of epistatic interactions in the regulation of heading time involving different vernalization genes, day-length sensitivity genes, and earliness per se genes, as well as the environment (Laurie et al. 1995;Cockram et al. 2007). Epistatic effects have also been found to be an important component of the genetic basis of plant height and bread-making quality traits (Zhang et al. 2008;Conti et al. 2011). It is becoming common to study gene · gene interactions by using a paradigm of networks that includes aggregating gene · gene interaction that exists even in the absence of main effects (McKinney and Pajewski 2012). Interactions between alleles at two or more loci could theoretically be represented in a linear model via use of appropriate contrasts. However, this does not scale when the number of markers (p) is large, as the number of 2-locus, 3-locus, etc., interactions is mind boggling.
An alternative approach to the standard parametric modeling of complex interactions is provided by non-linear, semi-parametric methods, such as kernel-based models (e.g. Gianola et al. 2006;Gianola and van Kaam 2008) or artificial neural networks (NN) Gianola et al. 2011), under the assumption that such procedures can capture signals from high-order interactions. The potential of these methods, however, depends on the kernel chosen and on the neural network architecture. In a recent study, Heslot et al. (2012) compared the predictive accuracy of several genome-enabled prediction models, including reproducing kernel Hilbert space (RKHS) and NN, using barley and wheat data; the authors found that the non-linear models gave a modest but consistent predictive superiority (as measured by correlations between predictions and realizations) over the linear models. In particular, the RKHS model had a better predictive ability than that obtained using the parametric regressions.
The use of RKHS for predicting complex traits was first proposed by Gianola et al. (2006) and Gianola and van Kaam (2008).  further developed the theoretical basis of RHKS with "kernel averaging" (simultaneous use of various kernels in the model) and showed its good prediction accuracy. Other empirical studies in plants have corroborated the increase in prediction accuracy of kernel methods (e.g. Crossa et al. 2010Crossa et al. , 2011Heslot et al. 2012). Recently, Long et al. (2010), using chicken data, andGonzález-Camacho et al. (2012), using maize data, showed that NN methods provided prediction accuracy comparable to that obtained using the RKHS method. In NN, the bases functions (adaptive "covariates") are inferred from the data, which gives the NN great potential and flexibility for capturing complex interactions between input variables (Hastie et al. 2009). In particular, Bayesian regularized neural networks (BRNN) and radial basis function neural networks (RBFNN) have features that make them attractive for use in genomic selection (GS).
In this study, we examined the predictive ability of various linear and non-linear models, including the Bayes A and B linear regression models of Meuwissen et al. (2001); the Bayesian LASSO, as in Park and Casella (2008) and de los Campos et al. (2009); RKHS, using the "kernel averaging" strategy proposed by ; the RBFNN, proposed and used by González-Camacho et al. (2012); and the BRNN, as described by Neal (1996) and used in the context of GS by Gianola et al. (2011). The predictive ability of these models was compared using a cross-validation scheme applied to a wheat data set from CIMMYT's Global Wheat Program.

Experimental data
The data set included 306 elite wheat lines, 263 lines that are candidates for the 29 th Semi-Arid Wheat Screening Nursery (SAWSN), and 43 lines from the 18 th Semi-Arid Wheat Yield Trial (SAWYT) from CIMMYT's Global Wheat Program. These lines were genotyped with 1717 diversity array technology (DArT) markers generated by Triticarte Pty. Ltd. (Canberra, Australia; http://www.triticarte. com.au). Two traits were analyzed: grain yield (GY) and days to heading (DTH) (see Supporting Information, File S1).
The traits were measured in a total of 12 different environments (1-12) ( Table 1): GY in environments 1-7 and DTH in environments 1-5 and 8-12 (10 in all). Different agronomic practices were used. Yield trials were planted in 2009 and 2010 using prepared beds and flat plots under controlled drought or irrigated conditions. Yield data from experiments in 2010 were replicated, whereas data from trials in 2009 were adjusted means from an alpha lattice incomplete block design with adjustment for spatial variability in the direction of rows and columns using the autoregressive model fitted in both directions.
Data used to train the models for GY and DTH in 2009 were the best linear unbiased estimator (BLUE) after spatial analysis, whereas the BLUE data for 2010 were obtained after performing analyses in each of the 12 environments and combined. The experimental designs in each location consisted of alpha lattice incomplete block designs of different sizes, with two replicates each.
Broad-sense heritability at individual environments was calculated as h 2 ¼ s 2 g =ðs 2 g þ s 2 e nreps Þ, where s 2 g and s 2 e are the genotype and error variance components, respectively, and nreps is the number of replicates. For the combined analyses across environments, broad-sense heritability was calculated as h 2 ¼ s 2 Þ, where the term s 2 ge is the genotype · environment interaction variance component, and nenv is the number of environments included in the analysis.

Statistical models
One method for incorporating markers is to define g i as a parametric linear regression on marker covariates x ij with form gi ¼ P p j¼1 xijb j , such that yi ¼ P p j¼1 xijb j þ ei ( j = 1,2,. . .,p markers); here, b j is the partial regression of y i on the j th marker covariate (Meuwissen et al. 2001). Extending the model to allow for an intercept We adopted Gaussian assumptions for model residuals; specifically, the joint distribution of model residuals in Equation 1 was assumed normal with mean zero and variance s 2 e . The likelihood function is is a normal density for random variable x ij b j and with variance s 2 e . Depending on how priors on the marker effects are assigned, different Bayesian linear regression models result.
Linear models: Bayesian ridge regression, Bayesian LASSO, Bayes A, and Bayes B A standard penalized regression method is ridge regression (Hoerl and Kennard 1970); its Bayesian counterpart, Bayesian ridge regression (BRR), uses a prior density of marker effects, pðb j jvÞ, that is, Gaussian, centered at zero and with variance common to all the markers, that is, pðb j js 2 b Þ ¼ Nðb j j0; s 2 b Þ, where s 2 b is a prior-variance of marker effects. Marker effects are assumed independent and identically distributed a priori. We assigned scaled inverse chi distributions x 22 ðdf : ; s : Þ to the variance parameters s 2 e and s 2 b . The prior degrees of freedom parameters were set to df : ¼ 4 and s : ¼ 1. It can be shown that the posterior mean of marker effects is the best linear unbiased predictor (BLUP) of marker effects, so Bayesian ridge regression is often referred to as RR-BLUP .
The Bayesian LASSO, Bayes A, and Bayes B relax the assumption of common prior variance to all marker effects. The relationship among these three models is as follows: Bayes B can be considered as the most general of the three, in the sense that Bayes A and Bayesian ridge regression can be viewed as special cases of Bayes B. This is because Bayes A is obtained from Bayes B by setting p ¼ 0 (the proportion of markers with null effects), and Bayesian ridge regression is obtained from Bayes B by setting p ¼ 0 and assuming that all the markers have the same variance.
Bayes B uses a mixture distribution with a mass at zero, such that the (conditional) prior distribution of marker effects is given by The prior assigned to s 2 j ; j ¼ 1; ::::; p is the same for all markers, i.e. a scaled inverted chi squared distribution are the degrees of freedom and s b is a scaling parameter. Bayes B becomes Bayes A by setting p = 0.
In the case of Bayes B, we took p ¼ 0:95; df b ¼ 4, and q j is the allele frequency for marker j ands 2 S is the additive genetic variance explained by markers [see Habier et al. (2011) and Resende et al. (2012) for more details]. In the case of s 2 e , we assigned a flat prior as in Wang et al. (1994).
The Bayesian LASSO assigns a double exponential (DE) distribution to all marker effects (conditionally on a regularization parameter l), centered at zero and with marker-specific variance, that is,

Non-linear models: RBFNN, BRNN, and RKHS
In this section, we describe the basic structure of the non-linear single hidden layer feed-forward neural network (SLNN) with two of its variants, the radial basis function neural network and the Bayesian regularized neural network. We also give a brief explanation of RKHS with the averaging kernel method at the end of this section.
Single hidden layer feed-forward neural network: In a single-layer feed-forward (SLNN), the non-linear activation functions in the hidden layer enable a NN to have universal approximation ability, giving it great potential and flexibility in terms of capturing complex patterns. The structure of the SLNN is depicted in Figure 1, which illustrates the structure of the method for a phenotypic continuous response. This NN can be thought of as a two-step regression (e.g. Hastie et al. 2009). In the first step, in the non-linear hidden layer, S data-derived basis functions (k = 1, 2,. . ., S neurons), fz ½k i g, are inferred, and in the second step, in the linear output layer, the response is regressed on the basis functions (inferred in the hidden layer). The inner product between the input vector and the weight n Table 1 Twelve environments representing combinations of diverse agronomic management (drought or full irrigation, sowing in standard, bed, or flat systems), sites in Mexico, and years for two traits, grain yield (GY) and days to heading (DTH), with their broadsense heritability (h 2 ) measured in 2010 this is then transformed using a non-linear activation function is a vector of regression coefficients or "weights" of each neuron k in the hidden layer. The g k ð:Þ is the activation function, which maps the inputs into the real line in the closed interval [21,1]; for example, g k ðxÞ ¼ expð2xÞ 2 1 expð2xÞ þ 1 is known as the tangent hyperbolic function.
Finally, in the linear output layer, phenotypes are regressed on the data-derived features, fz ½k i g, according to Radial basis function neural network: The RBFNN was first proposed by Broomhead and Lowe (1988) and Poggio and Girosi (1990). Figure 2 shows the architecture of a single hidden layer RBFNN with S non-linear neurons. Each non-linear neuron in the hidden layer has a Gaussian radial basis function (RBF) defined as z where kx i 2 c k k is the Euclidean norm between the input vector x i and the center vector c k and h k is the bandwidth of the Gaussian RBF. Subsequently, in the linear output layer, phenotypes are regressed on the data-derived features, fz ½k i g, accord- Estimating the parameters of the RBFNN: The vector of weights v ¼ fw 1 ; :::; w S g of the linear output layer is obtained using the ordinary least-squares fit that minimizes the mean squared differences between theŷ i (from RBFNN) and the observed responses y i in the training set, provided that the Gaussian RBFs for centers c k and h k of the hidden layer are defined. The centers are selected using an orthogonalization least-squares learning algorithm, as described by Chen et al. (1991) and implemented in Matlab 2010b. The centers are added iteratively such that each new selected center is orthogonal to the others. The selected centers maximize the decrease in the meansquared error of the RBFNN, and the algorithm stops when the number of centers (neurons) added to the RBFNN attains a desired precision (goal error) or when the number of centers is equal to the number of input vectors, that is, when S = n. The bandwidth h k of the Gaussian RBF is defined in terms of a design parameter of the net spread, that is, h k ¼ 0:8326 spread 2 for each Gaussian RBF of the hidden layer. To select the best RBFNN, a grid for training the net was generated, containing different values of spread and different precision values (goal error). The initial value of the spread was the median of the Euclidean distances between each pair of input vectors (x i ), and an initial value of 0.02 for the goal error was considered. The parameter spread allows adjusting the form of the Gaussian RBF such that it is sufficiently large to respond to overlapping regions of the input space but not so big that it might induce the Gaussian RBF to have a similar response.
Bayesian regularized neural networks: The difference between SLNN and BRNN is in the function to be minimized (see the penalized function below); therefore, the basic structure of a BRNN can be represented in Figure 1 as well. The SLNN described above is flexible enough to approximate any non-linear function; this great flexibility allows NN to capture complex interactions among predictor Figure 1 Structure of a single-layer feedforward neural network (SLNN) adapted from González-Camacho et al. (2012). In the hidden layer, input variables x i = ðx i1 ; :::; x ip Þ ( j = 1,. . .,p markers) are combined for each neuron (k=1,. . .,S neurons) using a linear function, u ½k ½k j , and subsequently transformed using a non-linear activation function, yielding a set of inferred scores, z These scores are used in the output layer as basis functions to regress the response using the linear activation function on the data- variables (Hastie et al. 2009). However, this flexibility also leads to two important issues: (1) as the number of neurons increases, the number of parameters to be estimated also increases; and (2) as the number of parameters rises, the risk of over-fitting also increases. It is common practice to use penalized methods via Bayesian methods to prevent or palliate over-fitting. MacKay (1992MacKay ( , 1994) developed a framework for obtaining estimates of all the parameters in a feed-forward single neural network by using an empirical Bayes approach. Let u ¼ (w 1 ,. . .,w S ; b 1 ,. . .,b S ; b 1 [1] , . . ., b p [1] ;. . ., b 1 [S] , . . ., b p [S] , m)9 be the vector containing all the weights, biases, and connection strengths. The author showed that the estimation problem can be solved in two steps, followed by iteration: (1) Obtain the conditional posterior modes of the elements in u assuming that the variance components s 2 e and s 2 u are known and that the prior distribution for the all the elements in u is given by pðujs 2 u Þ ¼ MNð0; s 2 u IÞ. It is important to note that this approach assigns the same prior to all elements of u, even though this may not always be the best thing to do. The density of the conditional (given the variance parameters) posterior distribution of the elements of u, according to Bayes' theorem, is given by The conditional modes can be obtained by maximizing Equation 5 over u. However, the problem is equivalent to minimizing the following penalized sum of squares [see Gianola et al. (2011) for more details] , a ¼ 1=ð2s 2 u Þ, e i is the difference between observed and predicted phenotypes for the fitted model, and u j (j ¼ 1; :::; m) is the j th element of vector u.
(2) Update s 2 e and s 2 u . The updating formulas are obtained by maximizing an approximation to the marginal likelihood of the data pðyjs 2 e ; s 2 u Þ (the "evidence") given by the denominator of Equation 5.
The original algorithm developed by MacKay was further improved by Foresee and Hagan (1997) and adopted by Gianola et al. (2011) in the context of genome and pedigree-enabled prediction. The algorithm is equivalent to estimation via maximum penalized likelihood estimation when "weight decay" is used, but it has the advantage of providing a way of setting the extent of "weight decay" through the variance component s 2 u . Neal (1996) pointed out that the procedure of MacKay (1992MacKay ( , 1994 can be further generalized. For example, there is no need to approximate probabilities via Gaussian assumptions; furthermore, it is possible to estimate the entire posterior distributions of all the elements in u, not only their (conditional) posterior modes. Next, we briefly review Neal's approach to solving the problem; a comprehensive revision can be found in Lampinen and Vehtari (2001).
Prior distributions: a) Variance component of the residuals: Neal (1996) used a conjugate inverse Gamma distribution as a prior for the variance associated with the residual, e i , given in Equation 4, that is, s 2 e Inv-Gammaðs e ; df e Þ, where s e and df e are the scale and degrees of freedom parameters, respectively. These parameters can be set to the default values given by Neal (1996), s e =0.05, df e =0.5. These values were also used by Lampinen and Vehtari (2001). b) Connection strengths, weights, and biases: Neal (1996) suggested dividing the network parameters in u into groups and then using hierarchical models for each group of parameters; for example, connection strengths (b 1 [1] , . . ., b p [1] ;. . .; b 1 [S] , . . ., b p [S] ), biases (b 1 ,. . .,b S ) of the hidden layer, and output weights (w 1 ,. . .,w S ), and general mean or bias (m) of the linear output layer. Suppose that u 1 ,. . .,u k are parameters of a given group; then assume In the hidden layer, information from input variables ðx i1 ; :::; x ip Þ (j = 1,. . .,p markers) is first summarized by means of the Euclidean distance between each of the input vectors fx i g with respect to S (data-inferred) (k=1,. . .,S neurons) centers fc k g, that is, u ½k i ¼ h k jjx i 2 c k jj 2 . These distances are then transformed using the Gaussian function z And, at the last stage of the model, assign the prior s 2 u Inv-Gammaðs u ; df u Þ. The scale parameter of the distribution associated with the group of parameters containing the connection strengths (b 1 [1] , . . ., b p [1] ;. . .; b 1 [S] , . . ., b p [S] ) changes according to the number of inputs, in this case, s u ¼ ð0:05=p 1=dfu Þ 2 with df u ¼ 0:5 and p is the number of markers in the data set.
By using Markov chain Monte Carlo (MCMC) techniques through an algorithm called hybrid Monte Carlo, Neal (1996) developed a software termed flexible Bayesian modeling (FBM) capable of obtaining samples from the posterior distributions of all unknowns in a neural network (as in Figure 1).
Reproducing kernel Hilbert spaces regression: RKHS models have been suggested as an alternative to multiple linear regression for capturing complex interaction patterns that may be difficult to account for in a linear model framework (Gianola et al. 2006). In RKHS model, the regression function takes the form where x i ¼ ðx i1 ; :::; x ip Þ 9 and x i 9 ¼ ðx i 9 1 ; :::; x i 9 p Þ 9 are input vectors of marker genotypes in individuals i and i9; a i 9 are regression coefficients; and Kðx i ; x i 9 Þ ¼ expð2 hkx i 2x i 9 k 2 Þ is the reproducing kernel defined (here) with a Gaussian RBF, where h is a bandwidth parameter and kx i 2 x i 9 k is the Euclidean norm between each pair of input vectors. The strategy termed "kernel averaging" for selecting optimal values of h within a set of candidate values was implemented using the Bayesian approach described in de los . Similarities and connections between the RKHS and the RBFNN are given in González-Camacho et al. (2012).

Assessment of the models' predictive ability
The predictive ability of the models given above was compared using Pearson's correlation and predictive mean-squared error (PMSE) using predicted and realized values. A total of 50 random partitions were generated for each of the data sets, and each partition randomly assigned 90% of the lines to the training set and the remaining 10% to the validation set. The partition scheme used was similar to that in Gianola et al. (2011) and González-Camacho et al. (2012). All scripts were run in a Linux work station; for Bayesian ridge regression and Bayesian LASSO, we used the R package BLR (de los Campos and Perez 2010), whereas for RKHS, we used the R implementation described in de los , which was kindly provided by the authors. In the case of Bayes A and Bayes B, we used a program described by Hickey and Tier (2009), which is freely available at http://sites.google.com/site/hickeyjohn/alphabayes. For the BRNN, we used the FMB software available at http://www.cs. toronto.edu/~radford/fbm.software.html. Because the computational time required to evaluate the predictive ability of the BRNN network was great, we used the Condor high throughput computing system at the University of Wisconsin-Madison (http://research.cs.wisc.edu/ condor). The RBFNN model was run using Matlab 2010b for Linux. The differences in computing times between the models were great. The computing times for evaluating the prediction ability of the 50 partitions for each trait were as follows, 10 min for RBFNN, 1.5 hr for RKHS, 3 hr for BRR, 3.5 hr for BL, 4.5 hr for Bayes B, 5.5 hr for Bayes A, and 30 days for BRNN. In the case of RKHS, BRR, BL, Bayes A, and Bayes B, inferences were based on 35,000 MCMC samples, and on 10,000 samples for BRNN. The estimated computing times were obtained using, as reference, a single Intel Xeon CPU 5330 2.4 GHz and 8 Gb of RAM memory. Significant reduction in computing time was achieved by parallelizing the tasks.

RESULTS
Data from replicated experiments in 2010 were used to calculate the broad-sense heritability for each trait in each environment (Table 1). Broad-sense heritability across locations for 2010 data were 0.67 for GY and 0.92 for DTH. These high estimates can be explained, at least in part, by the strict environmental control of trials conducted at CIMMYT's experiment station at Ciudad Obregon. The heritability of the two traits for 2009 was not estimated because the only available phenotypic data were adjusted means for each environment.
Predictive assessment of the models The predictive ability of the different models for GY and DTH varied among the 12 environments. The model deemed best using correlations (Table 2) tended to be the one with the smallest average PMSE ( Table 3). The three non-parametric models had higher predictive correlations and smaller PMSE than the linear models for both GY and DTH. Within the linear models, the results are mixed, and all models gave similar predictions. Within the non-parametric models, RBFNN and RKHS always gave higher correlations between predicted values and realized phenotypes, and a smaller average PMSE than the BRNN. The mean of the correlations and the associated standard errors can be used to test for statistically significant improvements in the predictability of the non-linear models vs. the linear models. The t-test (with a ¼ 0:05) showed that RKHS gave significant improvements in prediction in 13/19 cases (Table 3) compared with the BL, whereas RBFNN was significantly better than the BL in 10/19 cases. Similar results were obtained when comparing RKHS and RBFNN with Bayes A and Bayes B.
Correlations between observed and predicted values for DTH were lowest overall in environments 4 and 8, in Cd. Obregon, 2009, andin Toluca, 2009. Average PMSE was in agreement with the findings based on correlations. Although accuracies in environment 4 were much lower than in other environments, the higher accuracy of the non-parametric models (RKHS, RBFNN, and BRNN) over that of the linear models (BL, BRR, Bayes A, and Bayes B) was consistent with what was observed in the other environments. Figures 3 and 4 give scatter plots of the correlations obtained with the three non-parametric models vs. the BL for DTH and GY, respectively; each circle represents the estimated correlations for each of the two models included in the plot. In Figure 3, A-C, DTH had a total of 500 points (10 environments and 50 random training-testing partitions). In Figure 4, A-C, GY had a total of 350 points (7 environments and 50 random partitions in each environment). A point above the 45-degree line represents an analysis where the method whose predictive correlation is given on the vertical axis (RKHS, RBFNN, BRNN) outperformed the one whose correlation is given on the horizontal axis (BL). Both figures show that although there is a great deal of variability due to partition, for both DTH and GY, the overall superiority of RKHS and RBFNN over the linear model BL is clear. For both traits, BL had slightly better prediction accuracy than the BRNN in terms of the number of individual correlation points. It is interesting to note that some cross-validation partitions picked subsets of training data that had negative, zero, or very low correlations with the observed values in the validation set. These results indicate that lines in the training set are not necessarily related to those in the validation set.

DISCUSSION AND CONCLUSIONS
Understanding the impact of epistasis on quantitative traits remains a major challenge. In wheat, several studies have reported significant epistasis for grain yield and heading or flowering time (Goldringer et al. 1997). Detailed analyses have shown that vernalization, day-length sensitivity, and earliness per se genes are mainly responsible for regulating heading time. The vernalization requirement relates to the sensitivity of the plant to cold temperatures, which causes it to accelerate spike primordial formation. Transgenic and mutant analyses, for example, have suggested a pathway involving epistatic interactions that combines environment-induced suppression and upregulation of several genes, leading to final floral transition (Shimada et al. 2009).
There is evidence that the aggregation of multiple gene · gene interactions (epistasis) with small effects into small epistatic networks n Table 2 Average correlation (SE in parentheses) between observed and predicted values for grain yield (GY) and days to heading (DTH) in 12 environments for seven models n Table 3 Predictive mean-squared error (PMSE) between observed and predicted values for grain yield (GY) and days to heading (DTH) in 12 environments for seven models is important for explaining the heritability of complex traits in genome-wide association studies (McKinney and Pajewski 2012). Epistatic networks and gene · gene interactions can also be exploited for GS via suitable statistical-genetic models that incorporate network complexities. Evidence from this study, as well as from other research involving other plant and animal species, suggests that models that are non-linear in input variables (e.g. SNPs) predict outcomes in testing sets better than standard linear regression models for genome-enabled prediction. However, it should be pointed out that better predictive ability can have several causes, one of them the ability of some nonlinear models to capture epistatic effects. Furthermore, the random cross-validation scheme used in this study was not designed to specifically assess epistasis but rather to compare the models' predictive ability. It is interesting to compare results from different predictive machineries when applied to either maize or wheat. Differences in the prediction accuracy of non-parametric and linear models (at least for the data sets included in this and other studies) seem to be more pronounced in wheat than in maize. Although differences depend, among other factors, on the trait-environment combination and the number of markers, it is clear from González-Camacho et al. (2012) that for flowering traits (highly additive) and traits such as grain yield (additive and epistatic) in maize, the BL model performed very similarly to the RKHS and RBFNN. On the other hand, in the present study, which involves wheat, the RKHS, RBFNN, and BRNN models clearly had a markedly better predictive accuracy than BL, BRR, Bayes A, or Bayes B. This may be due to the fact that, in wheat, additive · additive epistasis plays an important role in grain yield, as found by  Crossa et al. (2006) and Burgueño et al. (2007Burgueño et al. ( , 2011 when assessing additive, additive · additive, additive · environment, and additive · additive · environment interactions using a pedigree-based model with the relationship matrix A. As pointed out first by Gianola et al. (2006) and subsequently by Long et al. (2010), non-parametric models do not impose strong assumptions on the phenotype-genotype relationship, and they have the potential of capturing interactions among loci. Our results with real wheat data sets agreed with previous findings in animal and plant breeding and with simulated experiments, in that a non-parametric treatment of markers may account for epistatic effects that are not captured by linear additive regression models. Using extensive maize data sets, González-Camacho et al. (2012) found that RBFNN and RKHS had some similarities and seemed to be useful for predicting quantitative traits with different complex underlying gene action under varying types of interaction in different environmental conditions. These authors suggested that it is possible to make further improvements in the accuracy of the RKHS and RBFNN models by introducing differential weights in SNPs, as shown by Long et al. (2010) for RBFs.
The training population used here was not developed specifically for this study; it was made up of a set of elite lines from the CIMMYT rain-fed spring wheat breeding program. Our results show that it is possible to achieve good predictions of line performance by combining phenotypic and genotypic data generated on elite lines. As genotyping costs decrease, breeding programs could make use of genome-enabled prediction models to predict the values of new breeding lines generated from crosses between elite lines in the training set before they reach the yield testing stage. Lines with the highest estimated breeding values could be intercrossed before being phenotyped. Such a "rapid cycling" scheme would accelerate the fixation rate of favorable alleles in elite materials and should increase the genetic gain per unit of time, as described by Heffner et al. (2009).
It is important to point out that proof-of-concept experiments are required before genome-enabled selection can be implemented successfully in plant breeding programs. It is necessary to test genomic predictions on breeding materials derived from crosses between lines of the training population. If predictions are reliable enough, an experiment using the same set of parental materials could be carried out to compare the field performance of lines coming from a genomic-assisted recurrent selection program scheme vs. lines coming from a conventional breeding scheme. The accuracies reported in this study represent prediction of wheat lines using a training set comprising lines with some degree of relatedness to lines in the validation set. When the validation and the training sets are not genetically related (unrelated families) or represent populations with different genetic structures and different linkage disequilibrium patterns, then negligible accuracies are to be expected. It seems that successful application of genomic selection in plant breeding requires some genetic relatedness between individuals in the training and validation sets, and that linkage disequilibrium information per se does not suffice (e.g. Makowsky et al. 2011).

ACKNOWLEDGMENTS
Financial support by the Wisconsin Agriculture Experiment Station and the AVIAGEN, Ltd. (Newbridge, Scotland) to Paulino Pérez and Daniel Gianola is acknowledged. We thank the Centro Internacional de Mejoramiento de Maíz y Trigo (CIMMYT) researchers who carried out the wheat trials and provided the phenotypic data analyzed in this article.