Threshold Models for Genome-Enabled Prediction of Ordinal Categorical Traits in Plant Breeding

Categorical scores for disease susceptibility or resistance often are recorded in plant breeding. The aim of this study was to introduce genomic models for analyzing ordinal characters and to assess the predictive ability of genomic predictions for ordered categorical phenotypes using a threshold model counterpart of the Genomic Best Linear Unbiased Predictor (i.e., TGBLUP). The threshold model was used to relate a hypothetical underlying scale to the outward categorical response. We present an empirical application where a total of nine models, five without interaction and four with genomic × environment interaction (G×E) and genomic additive × additive × environment interaction (G×G×E), were used. We assessed the proposed models using data consisting of 278 maize lines genotyped with 46,347 single-nucleotide polymorphisms and evaluated for disease resistance [with ordinal scores from 1 (no disease) to 5 (complete infection)] in three environments (Colombia, Zimbabwe, and Mexico). Models with G×E captured a sizeable proportion of the total variability, which indicates the importance of introducing interaction to improve prediction accuracy. Relative to models based on main effects only, the models that included G×E achieved 9–14% gains in prediction accuracy; adding additive × additive interactions did not increase prediction accuracy consistently across locations.

ABSTRACT Categorical scores for disease susceptibility or resistance often are recorded in plant breeding. The aim of this study was to introduce genomic models for analyzing ordinal characters and to assess the predictive ability of genomic predictions for ordered categorical phenotypes using a threshold model counterpart of the Genomic Best Linear Unbiased Predictor (i.e., TGBLUP). The threshold model was used to relate a hypothetical underlying scale to the outward categorical response. We present an empirical application where a total of nine models, five without interaction and four with genomic · environment interaction (G·E) and genomic additive · additive · environment interaction (G·G·E), were used. We assessed the proposed models using data consisting of 278 maize lines genotyped with 46,347 single-nucleotide polymorphisms and evaluated for disease resistance [with ordinal scores from 1 (no disease) to 5 (complete infection)] in three environments (Colombia, Zimbabwe, and Mexico). Models with G·E captured a sizeable proportion of the total variability, which indicates the importance of introducing interaction to improve prediction accuracy. Relative to models based on main effects only, the models that included G·E achieved 9-14% gains in prediction accuracy; adding additive · additive interactions did not increase prediction accuracy consistently across locations.
KEYWORDS prediction accuracy threshold model disease resistance GBLUP genotype · environment interaction GenPred shared data resource Since genomic selection (GS) initially was proposed by Meuwissen et al. (2001), a large number of plant breeding studies have assessed the prediction accuracy of GS in different economically important crops. These studies have used different marker platforms and marker densities, as well as different parametric and nonparametric statistical models (e.g., de los Campos et al. 2009de los Campos et al. , 2010Heffner et al. 2009Heffner et al. , 2010Crossa et al. 2010Crossa et al. , 2011Crossa et al. , 2013aHeslot et al. 2012;Pérez-Rodríguez et al. 2010, 2012. Both simulation and empirical studies have shown that GS has greater prediction accuracy than standard pedigree-based prediction, and most of the benefits of GS arise from obtaining accurate predictions early in the breeding cycle. The success of genomic prediction is affected by the choice of model, the size of the training data, the heritability of the trait, the span of linkage disequilibrium, the marker density, and the strength of the genetic relationships between the training and validation populations, among other factors. A number of studies have assessed how these factors affect prediction accuracy for quantitative traits (Kizilkaya et al. 2014); however, little has been done with categorical traits. As first pointed out by Gianola (1980Gianola ( , 1982 and Gianola and Foulley (1983), these traits cannot be dealt with by the use of Gaussian models. Several studies support the notion that when the number of categories is large and the data seem to follow an approximately normal distribution, failure to address the ordinality of the data is likely negligible (Atkinson 1988). However, significant bias is observed when the number of categories is less than five and sample size is not large enough. In addition, analyzing the data as normally distributed has less power of detection of effects and often produces estimates of frequency categories outside the 021 range, which does not make sense. These problems often occur when the frequencies are small or very large. Also, even when treatment means obtained with a normal approximation can be interpreted as estimates of the proportions of each category, this is not the case for the variance of their estimates as the data are binary and polychotomous (Stroup 2012). If the data are transformed, many of the aforementioned problems remain in a linear model analysis, given that the most commonly used transformations are intended to stabilize variance but fail to address the problem of skewness. Consequently, transformations often are ineffective and, in addition, express the data on scales that are unfamiliar to those who use the results of the analysis (Littell et al. 2002).
Categorical traits are scored by assigning a data point to one of several mutually exclusive and exhaustive ordered categories. If these scores are treated as continuous variables, as in standard GS linear models, the following assumptions do not hold: (1) the relationship between genomic data and phenotypes is linear; (2) phenotypes follow a normal distribution; and (3) the variance is constant and not a function of the expected value. Therefore, standard GS linear models currently used in plant and animal breeding do not meet the assumptions required for categorical data.
Results of empirical and simulation studies indicate that generalized linear mixed models (GLMMs) provide more sensible results and have greater power to identify model effects as statistically significant (Stroup 2012). In a linear model, the response variable (observation) equals the sum of explanatory variables plus a residual, and a probability assumption about the residual is made. In GLMM, the statistical model for a multinomial variable is expressed in a probability distribution form. It is important to point out that the cumulative probit model (also known as the threshold model) assumes that the process that gives rise to the observed categories is an underlying continuous variable with a standard normal distribution that in many applications uses the linear predictor with reversed signs for the fixed and random effects: Threshold models have been used in animal GS to relate a hypothetical underlying scale to the outward categorical response. For example, González-Recio and Forni (2011) developed versions of BayesA, Bayesian Lasso, and two machine-learning methods for analyzing dichotomous traits and showed that the differences between methods are small, particularly with a large number of quantitative trait loci. On the other hand, Villanueva et al. (2011) developed a version of the BayesB method for dichotomous traits and concluded that the threshold BayesB method improves GS accuracy when diseaseresistant dichotomous phenotypes are encountered, compared with accuracies obtained with the linear model. The threshold model showed an increase in accuracy of up to 16%, as well as significant advantages when heritability and disease prevalence were low and individuals were genotyped but not measured (testing set). Both González-Recio and Forni (2011) and Villanueva et al. (2011) pointed out that the models they developed for dichotomous phenotypes can easily be extended to traits with more than two discrete categories.
Recently, Wang et al. (2012) proposed threshold models using the GS framework and extended three Bayesian methods (BayesA, BayesB, and BayesCp) to estimate genomic breeding values of animal threshold traits with more than two categories; they named these methods extended BayesTA, extended BayesTB, and extended BayesTCp. They derived computing procedures for the three BayesT methods by using the Gibbs sampler algorithm. Through a simulation study, they found that the three BayesT methods generally performed better than the corresponding standard Bayesian methods, especially when only two phenotypic categories are present. They also found that BayesTB and BayesTCp produced similar accuracies and that both performed better than BayesTA. Also, they addressed how heritability, number of quantitative trait loci, incidence, and number of phenotypic categories affect the performance of the three BayesT methods. Recently, Kizilkaya et al. (2014) proposed performing genomic prediction of ordinal categorical phenotypes using BayesTCp and compared it with BayesCp; they found that the accuracies of BayesTCp for ordinal categorical phenotypes were greater than those of BayesCp and that there was a greater advantage in using BayesTCp when the training population was small. These authors pointed out that a 2.25-fold increase in the training population size for ordinal categorical phenotypes analyzed as a linear model using BayesCp was sufficient to achieve an accuracy equal to or greater than that for continuous phenotypes with a training population size of 1000 individuals.
In plant breeding, the most economically important trait is grain yield, which is greatly affected by environmental factors, as well as by genotype · environment interaction (G·E). The GS models first introduced by Meuwissen et al. (2001) were further extended into multienvironment models that can handle large numbers of individuals genotyped with large numbers of markers and evaluated in multiple environments. Burgueño et al. (2012) analyzed multienvironment data using a multivariate version of the genomic best linear unbiased predictor (GBLUP). Jarquín et al. (2014) extended GBLUP to incorporate highly dimensional markers, environmental covariates, and their interactions into a class of random-effects models where main and interaction effects are modeled using Gaussian processes with covariance functions based on genetic and environmental similarity among entries. In their study, Jarquín et al. (2014) used as a covariance function the structure induced by a reaction norm model, and applied the proposed approach to the grain yield trait in wheat lines evaluated over multiple years and locations.
However, in plant breeding, many economically important traits, such as the degree of resistance/susceptibility, are categorical. Despite this, to our knowledge, threshold models have not been considered in GS for plant breeding. Therefore, in this study, we introduced a threshold GS model that is an extension of the GBLUP of Jarquín et al. (2014) which incorporates G·E and additive · additive · environment (G·G·E) interactions. The proposed model was evaluated using a data set consisting of 278 maize lines scored for resistance to gray leaf spot (GLS) measured using an ordered categorical scale [1 (no disease) to 5 (complete infection)] in three environments and genetic information consisting of 46,347 single-nucleotide polymorphisms (SNPs).

Experimental data
Phenotypic data: The trait analyzed in 278 maize lines was gray leaf spot (GLS), caused by the fungus Cercospora zeae-maydis, which was evaluated in three environments, Mexico, Zimbabwe, and Colombia. GLS is one of the most important foliar diseases of maize worldwide. The GLS trait was analyzed using an ordinal scale from 1 (no disease), 2 (low infection), 3 (moderate infection), 4 (high infection), to 5 (complete infection). These data are part of the data set previously analyzed by Crossa et al. (2011) and González-Camacho et al. (2012), among others.
Genotypic data: Genotypes of all 278 lines were obtained using 53,401 SNPs. Those SNPs with .10% missing values or a minor allele frequency #0.05 were excluded from the data. Call rate per line was considered sufficient (.90%) for all lines. Most lines (250 of 278) had a call rate greater than 99%. After line-specific quality control (applying the same quality control to each line separately), the maize data still contained 46,347 SNPs, which were used in the analysis.

Statistical models
Threshold (cumulative probit) model: Let y ¼ fy ijk g (i ¼ 1; . . . ; I; j ¼ 1; 2; . . . ; J; k ¼ 1; 2; . . . r ij ; where i represents the environment, j denotes the genotype and k is the number of replicates of each genotype in each environment. The response variable (disease resistance), y ijk , represents an assignment into one of C mutually exclusive and exhaustive categories (here C=5, I = 3 and J = 278) that follow an order, since 1 indicates no infection, 2 low infection, 3 moderate infection, 4 high infection, and 5 complete infection. Therefore, in a GLMM framework, this model can be described by defining the distribution, the linear predictor and the link function.
Distribution: There are two distributions, one for observations in the response variable: ðy 1ij ; y 2ij ; . . . ; y Cij jb; bÞ $ MultinomialðN ij ; p 1ij ; p 2ij ; . . . ; p Cij Þ, where b is the I·1 vector of fixed environmental effects and another distribution for the random effects, b $ Nð0; GÞ, where G is the variance-covariance matrix of b ¼ fb j g and b j is the effect of line j.
where h cij denotes the c th link (c ¼ 1; 2; . . . ; C 2 1Þ for the fixed and random effects combination, g c is the intercept (threshold) for the c th link, and x T ij and z T ij are known row incidence vectors corresponding to fixed and random effects in b and b, respectively. Since there are C categories, a total of C 2 1 link functions are required to fully specify the model.
FðÁÞ is the cumulative distribution function of a standard normal distribution (probit link) and F 21 its corresponding inverse. The inverse link for this model is given as follows: Once we have estimates of Fðh 1ij Þ; Fðh 2ij Þ; . . . ; Fðh ðC 2 1Þij Þ; we esti- Þ. This threshold model assumes that the process that gives rise to the observed categories is an underlying continuous variable with a normal distribution where l ijk are called "liabilities," e ijk $ Nð0; 1Þ (e.g., Gianola 1982, andSorensen et al. 1995). Furthermore, to generate ordinal categorical phenotypes with C categories, the underlying phenotypic values are mapped to ordinal categorical phenotypes based on threshold parameters g T ¼ ðg min , g 1 , ⋯ , g C 2 1 , g max Þ with g min ¼ 2N, and g max ¼ N; which are cutpoints of the continuous scale such that the observed ordinal categorical response (y ijk Þ is given by: That is, y ijk falls into category C when the latent variable falls into the c th interval of values. Given b and b, the liabilities l ijk are conditionally independent and distributed as where s 2 e is fixed to 1 to achieve identifiability in the likelihood because the liabilities are unobservable. Also, one of the thresholds was fixed (g 1 ¼ 0Þ to center the distribution. Therefore, is the predictor of the GLMM for the multinomial data given previously. Then, the conditional probability that y ijk falls into category c (c ¼ 1; . . . ; 5Þ given b,b; and g ¼ ðg min ; g 1 ; g 2 ; g C 2 1 ; g max Þ; is given by The data are assumed conditionally independent, given b, b, and g. Therefore, the sampling model is where I fyijk¼cg is an indicator function taking a value of 1 if the response falls into category c; and 0 otherwise.
Models fitted: Based on Jarquín et al. (2014), nine models are proposed for analyzing these data sets (Table 1). The sequence of models described below is similar to those presented by Jarquín et al. (2014) for a continuous variable. However, for simplicity, only the underlying latent variables (l ijk Þ are presented together with the distribution of the corresponding random effects that give rise to the observed categorical phenotypes, given that a probit link function is assumed by all models described below.
Model 1: The first model used to explain the liability value of the k th individual in the j th line at the i th environment is where E i is the fixed effect of the i th environment and coefficient regressions were assigned a flat Gaussian prior with mean zero and n Table 1 Nine models used to fit the data set

Model
Main Effects Interaction X X X X X X E, environment; L, line; G, marker covariates; G·G, additive · additive epistasis term; G·E, environment · marker interaction; G·G·E, additive · additive epistasis · environment interaction term.
variance equal to 1 · 10 10 , L j is the random effect of the j th line which is assumed to be identically and independently distributed (IID) as normal, L j $ IID Nð0; s 2 L Þ; and e ijk is an error term distributed as e ijk $ IID Nð0; 1Þ. The unknown variance parameters (s 2 L Þ were assigned scaled-inverted x 2 distributions as prior. This density is indexed by two hyper-parameters: degrees of freedom (df) and the scale (S) parameter. In this case, we defined these values using the internal rules provided by BGLR. By default, that is, if the user does not specify these parameters, BGLR assigns mildly informative priors with df =5 and a scale parameter that gives a prior mode, S df þ2 , that obeys a prior variance partition where 50% of the variance of the liability score corresponds to error terms, and the remaining 50% to the random effects included in the model. The internal rules implemented in BGLR are fully explained in Pérez-Rodríguez and de los Campos (2014).
Model 2: The second model is an extension of the GBLUP for ordered categorical phenotypes. This model was obtained from model 1 by replacing the line effect, L j , with a random effect that incorporates marker information using the genomic relationship matrix G. Model 2 expresses the liability value of the k th individual in the j th line in the i th environment as with g j ¼ P p m¼1 x jm b m ; where g j represents an approximation to the true genetic value of the j th line, x jm is the genotype of the j th line at the m th marker (scored as 0 or 2 for genotypes that are homozygous at minor and major allele frequency, respectively, and 1 for heterozygous genotypes), and b m is the effect of the m th marker with b m $ IID Nð0; s 2 b Þ, ðm ¼ 1; . . . ; pÞ. Therefore, g ¼ Xb m contains the genomic values of all lines, and g ¼ ðg 1 ; . . . ; g 278 Þ9 is assumed to follow the normal distribution g $ Nð0; Gs 2 g Þ; where G is a marker-derived genomic relationship matrix that has an expected value equal to (under ideal conditions) twice the coefficient of parentage matrix of lines, and s 2 g is a genomic variance. The entries of G were computed as G jj ' ¼ p 21 P p m¼1 z jm ðz j ' m Þ, where z jm ¼ ðx jm 2 2p m Þ= ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi 2p m ð1 2 p m Þ p and p m is the estimated frequency of the m th marker. Subtracting 2p m from the genotype codes (centering) and dividing each marker covariate by ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi 2p m ð1 2 p m Þ p (standardizing) is not strictly needed; however, standardization allows interpreting s 2 g as a genomic variance. G was constructed using the genotypes of all 278 lines.
Model 3: The third model is an extension of model 2 that accounts for "marked" epistatic additive · additive relationships, g A $ Nð0; G A s 2 Ag Þ; where g A represents a regression on genomic epistatic additive · additive relationships with G A ¼ G#G (# is the element-wise multiplication operator, that is, a Hadamard product). The prior used for s 2 Ag was a scaled-inverted x 2 distribution. In this model, the liability score of the k th individual in the j th line in the i th environment is equal to Model 4: The third model combines model 1 and model 2 as Model 4 partitions the line effects into two components, one that is explained by a regression on markers (g j ) and another representing variation among lines that is not explained by regression on markers.
Model 5: This model is an extension of model 4 that accounts for epistatic additive · additive relationships, with liability equal to Note that none of the aforementioned models accounts for G·E (gE ij Þ and G·G·E (g A E ij Þ interactions. Next, we consider models that incorporate G·E and G·G·E. Model 6: Model 6 considers model 2 but, in addition to the main effects of environments (E) and markers (G), it takes into account the interaction between markers and environments (G·E). Because the data include multiple phenotypic records per line, the genetic covariance structure of genetic effects in the full data set is equal to Z g GZ T g , where Z g is the incidence matrix for the vector of additive genetic effects of markers (Jarquín et al. 2014). Therefore, the covariance structure for the vector of interaction terms in the full data set gE ¼ fgE ij g is the Hadamard product of Z g GZ T g and Z E Z T E ; where Z E represents the incidence matrix of the effects of environments (i.e., the matrix that connects phenotypes with environments) (Jarquín et al. 2014). Therefore, the liability value of the k th individual in the j th line in the i th environment is explained by where g $ Nð0; Gs 2 g Þ; gE $ Nð0; Z g GZ T g #Z E Z T E s 2 gE Þ and e ijk $ IID Nð0; 1Þ.
Model 7: Model 7 extends model 6 to account for marked epistatic additive · additive relationships and the interaction between the epistatic additive · additive term and the environments. The liability value of the k th individual in the j th line in the i th environment is now Model 8: Model 8 extends model 6 by adding the random effect of the lines (L j ). Thus, the liability value of the k th individual in the j th line in the i th environment is explained by Model 9: Finally, model 9 extends model 8 by adding marked epistatic additive · additive relationships and the interaction between the epistatic additive · additive term and the environments. The liability value of the k th individual in the j th line in the i th environment is explained by Implementation with BGLR: The nine models were fitted using the R-package BGLR (de los Campos and Pérez-Rodríguez 2013) in the R-software (R Core Team 2014). Parametric and semi-parametric models with both molecular marker and pedigree data can be fitted in this R package. It also allows including various random effects with user-defined covariance matrices. In Appendix 1, we provide the R code used for fitting the most parameterized model (model 9). The implementation of the other models is similar to that of model 9 but with fewer random effects. All these models are implemented via a Bayesian approach using the Gibbs sampler algorithm, and sampling from the fully conditional distributions, as shown in Sorensen et al. (1995). For more details about Gibbs sampler implementation used in the R-package BGLR (de los Campos and Pérez-Rodríguez 2013) for binary and ordered categorical phenotypes, refer to Sorensen et al. (1995). The prior distributions of the variance components of all the models are described in the appendix of the BGLR package (de los Campos and Pérez-Rodríguez 2013).
Assessing predictive ability for ordered categorical phenotypic data: The performance of the models was evaluated employing a cross-validation approach to estimate their prediction accuracies. To that end, we split each of the data sets into two parts (training and validation sets), where the training set was used to fit the model and the validation set was used to evaluate the training model's prediction ability. A total of 20 random partitions were performed, and the validation results were averaged over the 20 partitions with observations assigned to training and testing completely at random. This represents a prediction problem similar to that labeled as CV2 in Burgueño et al. (2012), where the performance of some lines was observed in some environments (training set) but not in others (testing set). Measuring prediction accuracy for categorical traits is more challenging than for quantitative traits, where the Euclidean metric appears as a natural choice. A variety of scoring rules have been proposed to assess accuracy for categorical traits. A scoring rule provides a summary measure for evaluating probabilistic prediction by assigning a numerical score based on the predictive distribution and on the event or value that materializes (Garthwaite et al. 2005;Kneib et al. 2007). This goal is achieved in a reasonable way by taking into account not only point prediction but the whole predictive distribution.
In the case of a multinomial model, this predictive distribution is simply obtained by computing the probabilities for all C categories of the response according to the estimated model, i.e., we obtain the predictive distributionp i ¼ ðp i1 ;p i2 ; . . . ;p iC ), derived from the estimated model for observation i, and c i is the realized value for this observation in the data set. A scoring rule is any real-value function Sðp ic ; c i Þ, c i ¼ 1; 2; . . . ; C, that assigns a value to the event that category c is observed whenp ic is the predictive probability for individual i in category c. A suitable score is the sum where S is the sum over all observations in the test data set. The hit rate (i.e., the percentage of true positive predictions) and the loglikelihood are two popular scoring rules. However, both have the drawback that they involve only one of the probabilities of the predictive distribution (p i ). In addition, it has been well documented that the log-likelihood score is sensitive to extreme observations. In our applications, we utilize the Brier score (Brier 1950), which is equal to where d ic takes the value of 1 if the ordinal categorical response observed for individual i falls into category c, and d ic ¼ 0 otherwise. This scoring rule uses all the information contained in the predictive distribution, not just a small part such as the hit rate or the loglikelihood score. Therefore, it is a reasonable choice for comparing categorical regression models, even though there are other scoring rules that also have good properties. The range of BS in equation (11) is between 0 and 2. For this reason, we took BS=2; to get the Brier score bound between 0 and 1, and lower scores imply better predictions.
Data and software: The phenotypic data for GLS in three environments (Mexico, Zimbabwe, and Colombia) for the 278 maize lines, the 46,347 SNPs data and the R scripts developed to fit the predictive models used in this study are given in the GLScode.rar file deposited at http://repository.cimmyt.org/xmlui/handle/10883/4128. The BGLR package (de los Campos and Pérez-Rodríguez 2013) can be downloaded from CRAN. Figure 1 shows the relative frequencies of each category for the whole data set and for each country. The whole data set contains 2798 observations; category 3 has the most observations (923), and category 1 is the one with the fewest (234). This pattern was also observed in Zimbabwe (1485 total observations; 37 in category 1 and 581 in category 3) but somewhat more pronounced. Colombia had 832 observations, with 215 in category 3, 208 in category 2, and only 63 in category 5. Mexico had 481 observations, most of them in category 2 (212), and category 1 had the fewest data (26 observations). When applying eigenvalue decomposition to the genomic relationship matrix, 82 eigenvectors (components) of a total of 278 were needed to explain 80% of the total variance of genotypes. The first eigenvector captured 13.68% of the total variation of marker genotypes, whereas the second eigenvector explained 6.28% of the total variation. For these reasons, we can say that there is some, but not strong, evidence of population (and family) stratification, indicating a relatively diverse set of lines. This was expected because these lines came from different breeding programs. Table 2 gives the estimates of the fixed (environment) effects and threshold parameters for each of the nine models; the first five models (125) had similar parameter estimates (fixed and threshold) that were different from those of models 629. Models 629 produced similar fixed and threshold parameter estimates. In plant and animal breeding, the focus is on estimating response probabilities associated with specific linear parameter combinations (Gianola and Foulley 1983). Figure 2 gives the probabilities for each ordinal categorical phenotype for model 9, for the whole data set, and for each country. In Figure 2, the average probabilities for category 5 (complete infection) were slightly greater than 0.20 (20%) in the whole data set and in each country. Colombia shows the greatest probability of complete infection (category 5), but it was only slightly greater than the probabilities in Mexico and Zimbabwe.

RESULTS
It is important to point out that the average probabilities of high infection (category 4) were very similar to the average probabilities of complete infection (category 5) for the whole data set and for each country, and that no infection (category 1) was the category with the lowest average probabilities (around 15%). These probability estimates were very different from estimates obtained based on raw frequencies (Figure 1) because they take into account the distribution of records across countries, the structure of lines, and the interaction between markers and environments (countries). The linear predictor used for doing this calculation was taken from model 9, which is given in Equation (10).
Estimates of variance components derived from the full data analysis are given in Table 3. The interaction (G·E) explained the largest proportion of the variance of liabilities, with estimated posterior means greater than 1.03 (models 629). The total variance explained by model 1 was 1.20, which is 1.7659 times smaller than the total variance captured by the most parameterized model (model 9), with a total variance of 2.1191. Also, model 2 (with E and G as main effects) only captured a total variance equal to 1.1911, which was 1.7791 times smaller than the total variance captured by model 9. The variance explained by models 3, 4, and 5 were, respectively, Figure 1 Relative frequency of each category in the whole data set.
n Table 2 Mean and SD of posterior distributions of fixed (environment) effects (b 1 for Colombia; b 2 for Zimbawe; and b 3 for Mexico) and threshold parameters of the nine proposed models

Model Mean Fixed Parameters
Mean Threshold Parameters 1.7755, 1.7767, and 1.782 times smaller than the variance captured by model 9. However, models 6, 7, and 8 captured a total variance equal to 2.1321, 2.0967, and 2.116, respectively, which were almost identical to the total variance captured by model 9; model 6 captured the highest variability. It should be noted that in models 6, 7, 8, and 9, the percentages of variance explained by the interaction term (G·E) were 51. 68, 49.42, 51.23, and 50.19%, respectively, clearly indicating that the interaction term was the most important source of variability in these models. It should be noted that if interactions were omitted, a large proportion of the variability in models 6 and 8 was not captured by main effects and the variance of the error term e ijk was the greatest source of variability. Another important finding was that the additive · additive epistatic terms explained 1.72, 1.06, 0.58, and 0.46% of the total variability in models 3, 5, 7, and 9, respectively, which means that adding this type of epistasis is not very important for this trait in these populations. Table 4 presents the Brier scores of the validation samples for each country and for the nine models. Because phenotype was ordinal categorical, we used Brier scores instead of Pearson's correlation coefficient for assessing prediction accuracy, because the Brier scores are bounded between 0 and 1, and values closer to zero imply better prediction accuracy. In Table 4, models 6, 7, 8, and 9 presented the lowest Brier scores, so these four models had better prediction ability. Models 125 showed the worst prediction ability. In Colombia, the prediction ability of model 9 was 19.85% greater than that of model 1. In Mexico, the increase in the prediction ability of model 9 over model 1 (the simplest model) was 11.26%, while in Zimbabwe this increase was 8.71%. Although Table 3 showed that inclusion of the interaction term (G·E) explained the largest proportion of variability in models 6 and 8, this did not produce a large increase in the prediction ability of these two models over that of the models without interactions (models 1, 2, and 4). This may be due to three reasons: (1) predictive power Figure 2 Estimated probability of each category in the whole data set and of each location in model 9.
n Numbers in parenthesis are the percentages of variance explained by each component. L, line; G, marker covariates; G·G, additive · additive epistasis term; G·E, environment · marker interaction; G·G·E, additive · additive epistasis · environment interaction term; TotVar, total variance explained by each model including the variance of e ijk ; which is equal to 1.
differs from goodness of fit, that is, a model may fit a particular data set well even though the predictive power the model provides is small; (2) even though the added interaction term explained the largest proportion of variability in models 6 and 8, the total variability explained in all models was not large; and (3) the set of lines was relatively diverse. However, including the interaction terms G·E plus the additive · additive epistatic terms [main (g Aj Þ and interaction term (g A E ij Þ did not improve predictions by much, and the inclusion of these terms did not explain a large percentage of the total variability.

DISCUSSION
GS offers important opportunities to improve genetic gains in plant and animal breeding programs; however, more powerful estimation and prediction methods are required to fully exploit the advantages of GS. Most GS methods assume a Gaussian response; however, many important traits in plant breeding, such as disease resistance, percent protein content, and proportion of seed or plant damage, are not normally distributed and need special treatment. In this study, we extended the GBLUP method to ordinal categorical responses. To this end, we used the implementation available in BGLR with a probit link that can accommodate both binary and ordinal traits (Pérez-Rodríguez and de los Campos 2014). To our knowledge, this is the first study in GS that uses an ordinal threshold model to analyze a categorical response in plant breeding. Most non-normal response variables are not linear with respect to model parameters and, unlike normal responses, the variance of the phenotypes is dependent on the mean. The GLMM uses: (1) a link function (typically nonlinear on the parameters) to specify the relationship between the mean of the response variable and a linear model; and (2) a variance function to describe the relationship between the mean and the variance of the distribution of the response variable, which enables it to relate traditional linear regression to non-normal data. This gives the GLMM framework a stronger basis for hypothesis testing and for more precise estimates of fixed and random effects. This statement applies to response variables coming from many different distributions, and not only to ordinal categorical responses.
GS methods have been implemented for threshold (ordinal categorical) traits in animal breeding. For example, González-Recio and Forni (2011) developed versions of BayesA, Bayesian Lasso, and two machine learning methods for dichotomous traits; Villanueva et al. (2011) also developed a version of BayesB for dichotomous traits; Wang et al. (2012) implemented methods BayesA, BayesB, and BayesCp for ordinal categorical traits; and Kizilkaya et al. (2014) implemented BayesCp for ordinal categorical traits. All these studies concluded that the traditional GS linear models are not suitable for threshold traits, since the basic assumptions of the linear model are violated, which produces a considerable reduction in prediction accuracy. However, to our knowledge, no plant breeding study so far has addressed the analysis of categorical traits. Our study fills this gap by providing a full description of a multithreshold GBLUP model (TGBLUP) and an empirical evaluation based on real data.
Prediction accuracy in GS usually is assessed using the sample correlation between predictions and phenotypes. However, this metric is not appropriate for categorical outcomes. For this reason, we used the Brier score for assessing the prediction ability of the proposed threshold model, which assigns a numerical score based on the predictive distribution. This scoring rule is "strictly proper" in the sense that it maximizes the expected score of an observation (and is unique) and allows making fair comparisons among models. To our knowledge, our study is the first one in GS to use this score for assessing prediction accuracy.
We presented nine specifications of the TGBLUP model which differ on the type of effects included. We considered main effects models and models for interactions between genetic factors (e.g., additive ·additive epistatic) and between genetic and environmental factors e.g., additive ·additive· environment interaction. For specifying these interactions, we used the framework proposed by Jarquín et al. (2014) within a threshold model framework. We found that models that take into account interactions (models 629) explained almost two times more variability than those without interaction (models 125). However, the interaction that was clearly most important was the G·E term; this effect captured the largest proportion of the total variability explained by the models (51.68% in model 6 and 50.19% in model 9), and was the one that led to the largest gain in prediction accuracy. This result is in agreement with previous studies that have highlighted the importance of modeling G·E in GS in plant breeding (Burgueño et al. 2012;Jarquin et al. 2014).
Inclusion of additive · additive epistatic terms did not explain much of the total variability, and did not help much to improve prediction ability. Also, we found that estimated threshold parameters could be clustered into two groups, one for those without interactions (models 125), and another for those with interaction terms (models 629). This was because including interactions increases the variance of the liability score and, therefore, changes in threshold values are needed to accommodate the observed probabilities of each of the categories.
Finally, although the analyses presented here used Gaussian priors for marker effects, the multi-threshold model used in this study can be implemented with any of the priors commonly used in GS, including those that induce differential shrinkage of estimates of effects or a combination of variable selection and shrinkage. The current implementation of BGLR allows users to do this.
n Table 4 Brier scores (mean, minimum and maximum; smaller indicates better prediction) evaluated for the validation samples We extended the GBLUP, a model commonly used in GS for analyses of normal traits, to situations where the response is ordinal (TGBLUP). We provided a detailed description of the model and introduced a metric, the Brier score, for assessing prediction accuracy of ordinal categorical outcomes. We presented an empirical evaluation using a (real) data set of 278 maize lines genotyped with 46,347 SNPs and evaluated for disease resistance using an ordinal categorical scoring system in three environments (Colombia, Zimbabwe, and Mexico). A total of nine models were used. In addition, we provide details of the R code used to implement these models using the BGLR package.
Our results highlight the importance of including G·E (capturing at least 49.42% of the total variability); when this interaction was taken into account, this increased the total variability explained by these models and increased prediction accuracy between 8 and 19% relative to models based on main effects only. Considering additive · additive epistasis did not produce a sizable increase in prediction accuracy.