## Abstract

Most genomic-enabled prediction models developed so far assume that the response variable is continuous and normally distributed. The exception is the probit model, developed for ordered categorical phenotypes. In statistical applications, because of the easy implementation of the Bayesian probit ordinal regression (BPOR) model, Bayesian logistic ordinal regression (BLOR) is implemented rarely in the context of genomic-enabled prediction [sample size (*n*) is much smaller than the number of parameters (*p*)]. For this reason, in this paper we propose a BLOR model using the Pólya-Gamma data augmentation approach that produces a Gibbs sampler with similar full conditional distributions of the BPOR model and with the advantage that the BPOR model is a particular case of the BLOR model. We evaluated the proposed model by using simulation and two real data sets. Results indicate that our BLOR model is a good alternative for analyzing ordinal data in the context of genomic-enabled prediction with the probit or logit link.

Genomic-enabled prediction models are revolutionizing animal and plant breeding. There is some evidence that they are powerful for predicting the genomic merit of animals and plants based on high-density single-nucleotide polymorphism (SNP) marker panels and are being recommended increasingly for genomic prediction in human health (Yang and Tempelman 2012). However, most genomic-enabled prediction models assume a continuous and normally distributed phenotype. Because often this assumption is not fulfilled, researchers normally approach phenotypes in three ways: (a) they ignore the lack of normality in the phenotypes; (b) they transform the non-normal phenotype to approximate it to normality; or (c) they use generalized linear mixed models (GLMMs) to model the appropriate distribution of the phenotype (Stroup 2015).

The use of the first approach is justified for large sample sizes with the central limit theorem, which states that treatment means have an approximate normal distribution if the sample size is large enough. However, there is a lot of evidence indicating that the first approach produces highly biased results for small- and moderate-sample sizes (Stroup 2012, 2015). Transformations introduced by Bartlett (1947) for non-normal data were proposed for variance-stabilization to fulfill the assumption of homogeneous variance; they are still considered standard procedures in many agricultural disciplines. Implementation with transformations is equal to that for phenotypes normally distributed (based on the linear model). However, there is mounting evidence that transformations do more harm than good for the models required by most agricultural research, because the use of the linear model with or without transformed data produces a great loss of accuracy and power (Stroup 2015), mostly in small sample sizes.

Nelder and Wedderburn (1972) introduced generalized linear models, a major departure from the usual approach to non-normal data. GLMMs extend the linear model theory to accommodate non-normal data with heterogeneous variance and even correlated observations. Viewed through the GLMM lens, the pre-1990s understanding of non-normal data––still pervasive in the agricultural research community––is antiquated at best, obsolete at worst. Today, small sample investigations are providing an increasing body of evidence that GLMMs work as well in practice as they do in theory because they produce more accuracy and power than approaches (a) and (b). Also, now there are textbooks and software available for the implementation of GLMMs, although the implementation of approaches (a) and (b) is still dominant in agricultural research (Stroup 2015). In genomic-enabled prediction, the use of GLMMs is still in its early stages because their implementation is not straightforward given that the number of observations (*n*) is usually smaller than the number of covariates (*p*) included in the model. In addition, a complex dependence structure is observed among covariates (markers) and observations (lines) due to the joint involvement of biological processes and pathways.

To overcome this situation in the pregenomic era, Gianola (1980, 1982) and Gianola and Foulley (1983) proposed a probit (threshold) model for dealing with ordinal categorical traits in animal breeding. This probit model was extended to deal with *p >> n* in the genomic era by González-Recio and Forni (2011) and Villanueva *et al.* (2011) for binary trials, and by Wang *et al.* (2012) and Montesinos-López *et al.* (2015) for more than two ordinal categories. Also, BGLR (Bayesian generalized linear regression), software developed for genomic-enabled prediction that is able to deal with normal, binary, ordinal, and censored data (de los Campos and Perez-Rodriguez 2013; Perez-Rodriguez and de los Campos 2014), is now available. However, no GLMMs are available for genomic-enabled prediction for counts and percentage phenotypes.

For modeling ordinal categorical phenotypes, the ordinal logistic regression model is often preferred over the ordinal probit model in statistical applications, because it provides regression coefficients that are more interpretable due to their connection to odds ratios (Zucknick and Richardson 2014). However, in genomic-enabled prediction (when *p >> n*), only the Bayesian probit model is frequently implemented, given that Bayesian methods that introduce sparseness through additional priors on the model size are very well-suited to this problem. Therefore, because of the lack of a Bayesian logistic ordinal model analogous to the Bayesian probit ordinal model that uses a data augmentation approach, the logistic model is not practical for genomic selection.

Both logistic and normal distributions are symmetric with a basic, unimodal “bell curve” shape. The only difference is that the logistic distribution has a somewhat heavier tails, which means that it is less sensitive to outlying data (and hence somewhat more robust for modeling mis-specifications or erroneous data). This is another advantage of logistic regression over probit regression. Because of its easy implementation, the use of BPOR is extremely common, even though it is less robust for modeling mis-specifications and its coefficients are less interpretable. Because of the aforementioned properties of the logistic model, some researchers have proposed approximations to logit regression. For example, Bartholomew and Knott (1999) proposed , where Φ is the cumulative density function for the standard normal distribution and , whereas Camilli (1994) proposed using *k* = 1.702, obtained by minimizing the maximum distance between two cumulative distribution functions. Although Amemiya (1981) proposed a value of *k* = 1.6 and computed tables for representative values of the density function for different values of *k*, he did not explain why he used . More recently, Savalei (2006) obtained a value of *k* = 1.75 based on minimizing the Kullback-Leibler information. However, although some of these approximations do a reasonable job of approximating the logistic distribution, they are only approximations, and it goes without saying that an exact solution is preferred.

In this paper, we propose a Bayesian logistic ordinal regression (BLOR) model for genomic-enabled prediction by using a data augmentation approach. We illustrate our proposed method with simulation and real data. We compare the BLOR with the Bayesian probit ordinal regression (BPOR) model with and without approximation.

## Materials and Methods

### Gray leaf spot (GLS) and Septoria data sets

GLS is one of the most important foliar diseases of maize worldwide. The GLS data set is composed of 278 maize lines; the ordinal trait measured in each line was GLS [1 (no disease), 2 (low infection), 3 (moderate infection), 4 (high infection), 5 (complete infection)] caused by the fungus *Cercospora zeae-maydis*, evaluated in three environments (México, Zimbabwe, and Colombia). These data are part of a data set previously analyzed by Crossa *et al.* (2011), González-Camacho *et al.* (2012), and Montesinos-López *et al.* (2015). Genotypes of all 278 lines were obtained using the 55k SNP Illumina platform. SNPs with >10% missing values or a minor allele frequency of ≤ 0.05 were excluded from the data. After line-specific quality control (applying the same quality control to each line separately), the maize data still contained 46,347 SNPs.

On the other hand, the Septoria data set contains 268 wheat lines planted in Toluca, México, in 2010, and the trait (Septoria scores) was measured using an ordinal four-point scale. Genotypes of these lines were obtained with 45,000 genotype by sequencing (GBS), following the protocol of Poland *et al.* (2012). We kept only 13,913 GBS that had <50% missing data; after filtering for minor allele frequency, we ended up with 6787 GBS that were used in the analysis.

For the implementation of the proposed model, we formed five data sets from these two real data sets (GLS and Septoria), four from the GLS data set and one from the Septoria data set. The first three data sets formed from GLS correspond to each environment in which they were evaluated for GLS; the last one was formed by pooling the data from the three environments (information from the three environments without taking into account the environments as covariates).

### Bayesian logistic ordinal regression

Let (where *i* represents the genotype and *j* denotes the number of replicates or experimental units of each genotype. The total number of observations is In other words, the observed vector contains elements, and the *n*-dimensional vector **y** of all responses can be written as . The response variable represents an assignment into one of *C* mutually exclusive and exhaustive categories that follow an order. Therefore, the ordinal logistic regression model can be written in terms of a latent response variable as follows: (1)where are called “liabilities”, , where denotes the logistic distribution, and the vectors (*p* × 1) are explanatory variables associated with the fixed effects ** β**. The random effect . In genomic-enabled prediction, Since are unobservable and can be measured indirectly by an observable ordinal variable , then can be defined by: This means that is divided by thresholds into

*C*intervals, corresponding to

*C*ordered categories. The first threshold, , defines the upper bound of the interval corresponding to observed outcome 1. Similarly, threshold defines the lower bound of the interval corresponding to observed outcome

*C*. Threshold defines the boundary between the interval corresponding to observed outcomes

*c*− 1 and

*c*for (

*c*= 1,2,…,

*C*− 1). Threshold parameters are with , and

Assuming that the error term of the latent response is distributed as , the cumulative response probability for the *c* category of the ordinal outcome is: for . (2)Similarly, model (2) can be written as a cumulative logit model:This GLMM model is described by: (1) two distributions, one for observations in the response variable Multinomial ), where ** β** is the p × 1 vector of fixed effects, and another for the random effects, or , where

*b*is the effect of line

_{i}*i*; (2) linear predictor , where denotes the

*c*

^{th}link (

*c*= 1,2,…,

*C*− 1) for the fixed and random effects combination, is the intercept (threshold) for the

*c*

^{th}link, and are known row incidence vectors corresponding to fixed effects in

**. Because there are**

*β**C*categories, a total of

*C*− 1 link functions are required to fully specify the model; and (3) link function: cumulative logit {, (

*c*= 1,2,…,

*C*− 1)}.

Using the inverse link for this model, we can calculate as follows:.Since we have latent variables distributed as and we observe if, and only if, , then the joint posterior density of the parameter vector and latent variable becomes Let’s assume a scaled independent inverse chi-square prior for , a normal prior distribution for , a normal prior distribution for , and also a prior for (Gianola 2013). Following Sorensen *et al.* (1995), the prior for the *C* − 1 unknown thresholds has been given as order statistics from U(, distribution, where .

The fully conditional posterior distributions are provided below and details of all derivations are given in Appendix A.

### Liabilities and Pólya-Gamma values

The fully conditional posterior distribution of liability is a truncated normal distribution and its density is (3)For simplicity, *ELSE* is the data and the parameters, except the one in question. is a normal density with parameters as indicated in the argument, Φ is the cumulative distribution function of a normal density with mean and variance **, and the fully conditional posterior distribution of ***ω _{ij}* is

### Regression coefficients (*β*)

The fully conditional posterior of ** β** is as follows: (5)where , . With , ,

**, , , , . It is important to point out that if we use a prior for (improper uniform distribution), then in and we need to make**

**0**the term .

### Polygenic effects (*b*)

Now the fully conditional posterior of ** b** is given as

### Variance of polygenic effects

Next, the fully conditional posterior of is

(7)### Threshold effects (*γ*)

*γ*

The density of the fully conditional posterior distribution of the *c*^{th} threshold, , is(8)

### Variance of regression coefficients

The fully conditional posterior of is

(9)### The Gibbs sampler

The Gibbs sampler is implemented by sampling repeatedly from the following loop:

Sample liabilities from the truncated normal distribution in (3).

Sample values from the Pólya-Gamma distribution in (4).

Sample the regression coefficients from the normal distribution in (5).

Sample the polygenic effects from the normal distribution in (6).

Sample the variance effect ( from the scaled inverted χ

^{2}distribution in (7).Sample the thresholds from the uniform distribution in (8).

Sample the variance of regression coefficients ( from the scaled inverted χ

^{2}distribution in (9).Return to step 1 or terminate if chain length is adequate to meet convergence diagnostics.

In the absence of polygenic effects (**b**), the aforementioned Gibbs sampler can be used only by ignoring steps 4 and 5. If all marker effects are taken into account in the design matrix, ** X**, with a prior for the beta regression coefficients, we end up with a threshold Bayesian ridge regression. This is a version for ordinal categorical data of the ridge estimator of Hoerl and Kennard (1970), since the posterior expectation of

**is equal to with pseudo-response**

*β***. Another important point is that by setting each , the aforementioned Gibbs sampler for the BLOR with the logistic link is reduced to the Gibbs sampler for the BPOR with the probit link proposed by Albert and Chib (1993). This implies that the proposed BLOR model is more general and includes the Gibbs sampler for the BPOR model as a particular case.**

*l*### Simulation study

The purpose of this simulation study is twofold: (1) to compare the performance of the proposed BLOR with: (a) the approximation resulting from using the estimates of the BPOR with , denoted as BLOR*, (b) with the results of using maximum likelihood estimators (MLEs) with logit link for ordinal data (MLLOR), and (c) the approximation resulting from multiplying the MLEs with probit link for ordinal data by 1.75, denoted as MLLOR*; (2) to evaluate the performance of the BLOR in the presence of outliers. We used the value 1.75 because, according to the literature review, it is the most reasonable value (Savalei 2006).

To reach these two goals, two data sets were simulated. Both simulation studies were carried out with the following liability:where and , and the vectors have been drawn independently, with components following a uniform distribution within the interval [−0.1, 0.1]. The threshold parameters used were γ_{1}= −0.8416, γ_{2}= −0.2533, γ_{3}= −0.2533, and γ_{4}= −0.8416. Then the response variable was generated as follows: For simulated data set 1, we used four values of sample size *n _{i}* = 5, 10, 20, and 40 and all the were drawn independently from a

*L*(0,1), whereas for simulated data set 2, we used only one sample size (

*n*= 40) and the error terms () were obtained from two distributions [

_{i}*L*(0,1) and a student's

*t*distribution with four degrees of freedom, denoted as t4]. We studied four scenarios: Scenario 1, the percentage of outliers (PO) from the t4 was 5% and the remaining percentage was obtained from the

*L*(0,1) distribution; Scenario 2: the PO from the t4 was 10%, and 90% from the

*L*(0,1); Scenario 3: the PO from the t4 was 20%, and 80% from the

*L*(0,1); and Scenario 4: the PO from the t4 was 30%, and 70% from the

*L*(0,1).

The MLE estimates for the ordinal regression were obtained using the polr function of the MASS package in R (R Core Team 2015). It is important to point out that the priors used for the Bayesian methods were not informative for , and for the hyperparameters for thresholds, we used and . We computed 20,000 Markov chain Monte Carlo (MCMC) samples. Bayes estimates were computed using 10,000 samples because the first 10,000 were discarded as burn-in.

### Model implementation

The Gibbs sampler described previously for the BLOR model was implemented with the R-software (R Core Team 2015). Implementation was done with a Bayesian approach and MCMC through the Gibbs sampler algorithm, which samples sequentially from the full conditional distributions until it reaches a stationary process, converging with the joint posterior distribution (Gelfand and Smith 1990). In the real data, to reduce the potential impact of MCMC errors on prediction accuracy, we performed a total of 60,000 iterations with a burn-in of 20,000, so that 40,000 samples were used for inference. We did not apply thinning of the chains, following the suggestions of Geyer (1992), Maceachern and Berliner (1994), and Link and Eaton (2012), who provide justification for the ban on subsampling MCMC output for approximating simple features of the target distribution (*e.g.*, means, variances and percentiles), since thinning is neither necessary nor desirable, and unthinned chains are more precise. It is important to point out that implementation of the BLOR model for the real data sets was done using the hyperparameters , , *γ*_{min} = −1000 and *γ*_{max}= −1000 for thresholds parameters, all of which were chosen to lead weakly informative priors.

### Assessing prediction accuracy

We used cross-validation to estimate the prediction accuracy of the proposed models. The data set was divided into training and validation sets 10 times, with 90% of the data set used for training and 10% for testing; this was done only for the pooled data. The training set was used to fit the model and the validation set was used to evaluate the prediction accuracy of the proposed models. Since the phenotype response is ordinal categorical, we used the Brier score (Brier 1950) to measure prediction ability, which is equal to (10)where BS denotes the Brier Score, and *d _{ic}* takes a value of 1 if the ordinal categorical response observed for individual

*i*falls into category

*c*; otherwise,

*d*= 0. This scoring rule uses all the information contained in the predictive distribution, not just a small portion like the hit rate or the log-likelihood score. Therefore, it is a reasonable choice for comparing categorical regression models, although there are other scoring rules that also have good properties. The range of BS in equation (10) is between 0 and 2. For this reason, we divided Brier scores (BS)/2 to get the BS bounded between 0 and 1; lower scores imply better predictions. It is important to point out that we also used the BS when analyzing the full data sets. We also used the deviance information criterion (DIC) to compare Bayesian models, as suggested by Gelman

_{ic}*et al.*(2003); here, the lower the DIC, the better the model.

### Data availability

The two real data sets and two simulated data sets together with R codes are deposited in the link http://hdl.handle.net/11529/10254. The phenotypic data for GLS in three environments (México, Zimbabwe, and Colombia) for the 278 maize lines, the 46,347 SNPs, and the R scripts developed to fit the predictive models used in this study are given in the files PhenoGLS.RData, and MarkersGLSFinal.RDat. The repository also contains the Septoria genotypic and phenotypic data sets, in files SeptoriaGenotypic.RDat, and SeptoriaPhenotypic. RDat, respectively. The R codes to generate the simulated data sets 1 and 2, and for analyzing the real data set from Colombia are directly given in Appendices B, C, and D, respectively.

## Results

In the following sections, we investigate the performance of the proposed BLOR estimator through a simulation study and with real data.

### Simulated data set

In Table 1, we report average estimates obtained by all methods, along with SDs; all the results are based on 50 replications. From Table 1, it is clear that as the sample size increases, the average biases and SD decrease in all cases. This confirmed the consistency properties of all the estimates. Table 1 also shows that, in general, the point estimates of the Bayesian estimates (BLOR and BLOR*) are similar to the MLEs (MLLOR, and MLLOR*, which was approximated with the MLE with probit ordinal regression); however, the approximations (BLOR* and MLLOR*) have greater bias and SD. BLOR has less bias and SD in most of the studied parameters, producing better parameter estimates than the MLLOR (which is the correct method that use maximum likelihood with the logit link function). For this reason, the proposed BLOR is an excellent alternative. However, a more in-depth simulation study is required to ensure that these findings are valid for all possible scenarios.

Table 2 shows that the smaller the PO, the less bias there is in the parameter estimates under the four methods (two Bayesians and two under Maximum Likelihood). However, under the two Bayesian methods, the proposed BLOR showed less bias than BLOR*, and this behavior is observed for all parameters and PO under study, except when the PO is 30% and for the parameter *γ*_{2}. Under the two maximum likelihood methods, the approximate method (MLLOR*) showed greater bias in all scenarios and parameters. Finally, although BLOR and MLLOR produced better results in term of bias, in most cases, MLLOR was better than BLOR.

### Real data sets

Next we compared BLOR and BPOR with the real data set (GLS and Septoria data) described in the *Materials and Methods*. However, because with this data set the number of parameters (betas and thresholds) to be estimated is high, we compared both models with point and interval estimates of the probabilities estimated for each category on the four- and five-point ordinal scale for each data set studied. In Table 3, we compare BLOR, BPOR, and BLOR* (this is the approximate method since the estimates resulting from BPOR were used to approximate BLOR with with five data sets (three locations, the resulting pooling of the three locations and the Septoria data set) made up from the data sets described in the section *Materials and Methods*. First, when comparing BLOR with BLOR*, we see that there are differences in the point probability estimates, and the lower the probabilities, the greater the differences. However, in general, the widths of the credible sets are similar. Second, when comparing BLOR with BPOR, we see that the point and credible sets for each location and Septoria produced similar results; however, for the pooled data, BLOR produced estimates with narrower confidence sets than BPOR (Table 3). We observed that the estimates produced using BPOR are less accurate because wider confidence sets are produced when the data are pooled; this could be because the assumption of errors normally distributed with mean zero and variance one when the data are pooled is not fulfilled. Also, there were no differences in the BS of the three models (BLOR, BLOR*, and BPOR).

Comparing the models with DIC, we see that in Zimbabwe, BPOR produced the lowest DIC (DIC = 4150.18), whereas BLOR* produced the highest (DIC = 4156.86). However, the DIC of BLOR (DIC = 4150.29) was very close to that of BPOR. In México, the lowest DIC was for BLOR (DIC = 1313.82), whereas the highest was for BLOR* (DIC = 1315.21). In Colombia, BLOR had the lowest DIC (2577.92) and BLOR* had the highest (2581.66). In the Septoria data set, BPOR had the lowest DIC (651.79) and BLOR had the highest (654.33). Finally, in the pooled data, BLOR (8327.36) also had the lowest DIC, whereas BLOR* had the highest (8339.539). All these results shows that even approximations that did a reasonable job (BLOR*) were sometimes very far from the exact methods (BLOR and BPOR). For this reason, whenever possible, an exact model (BLOR or BPOR) should be chosen. The fact that BPOR is sometimes the best (lower DIC) implies that for these data sets, the assumption in Equation (1) is enough.

Finally, in Table 4 we present the BS for the testing sets of the pooled real GLS data with 10 cross-validations and 90% of data used for the training set and 10% for the testing set. For models BLOR and BPOR, the BS are almost identical, which means that, with regard to prediction, both models had a similar performance. However, although the approximation method (BLOR*) produced a higher BS, its prediction accuracy was very close to that of BLOR and BPOR. Although we found that the three models had a similar performance regarding prediction accuracy with this data set, more in-depth research is required to validate this observation.

This paper proposes a method for BLOR using the Pólya-Gamma data augmentation approach of Scott and Pillow (2013), which produces a Gibbs sampler with full conditional distributions similar to that of the BPOR model of Albert and Chib (1993). The proposed method is reduced to the BPOR model when the sampled values, , from the Pólya-Gamma distribution in Equation (4) are set to 1. This is an advantage because with the proposed model, researchers can perform an exact logistic or probit ordinal regression without having to do approximations to perform a logistic ordinal regression. The performance of the proposed method was compared with the approximation using the probit ordinal regression model in a small simulation study and real data sets (GLS and Septoria data sets) using a four- and five-point ordinal scale. On the basis of the simulation study, it is clear that the estimation of parameters using the approximation produces a considerable amount of bias and can give rise to wrong conclusions in association studies. However, we observed with the real data that, in terms of prediction ability, both models (BLOR and BPOR) have a similar performance even though we observed BLOR had lower DIC values in México, Colombia, and the pooled data. This means that when violation of the assumption in Equation (1) is not strong, any model can be used. For this reason, we observed greater accuracy (narrow confidence sets) for the BLOR model compared with the exact BPOR model (BPOR) only with the pooled data set without a covariate for location.

Although with the real data we did not observe an advantage in prediction accuracy with the proposed BLOR model, it is very well documented in statistical literature that logistic ordinal regression is more robust for dealing with outlying data, because logistic distribution has heavier tails which was corroborated in terms of parameter estimates with the simulation done using simulated data set 2. For this reason, the proposed BLOR should be preferred because it is usually not practical to test if the error term in Equation (1) is or . In addition to being more robust, the proposed method also provides regression coefficients that are more interpretable because of their connection to odds ratios (Zucknick and Richardson 2014). However, this advantage does not make sense when *p >> n*, because the main driving force in Bayesian models in the case of *p >> n* is the prior and not the data (Gianola 2013). Even with this restriction, this paper unifies logistic and probit ordinal regression under a Bayesian framework and is a useful alternative for genomic-enabled prediction of ordinal categorical trials where available data sets have a larger number of parameters to estimate than observations. Also, the proposed method should be preferred over BPOR when outliers are present and not easily detected. This is especially true for multidimensional data, since many times a few outliers in a data set can have strong influence on parameter estimation and inference.

Finally, it is important to point out that, to devise the method proposed in this paper, we generalized the work of Scott and Pillow (2013) for ordered categorical responses. Our method is elegant, easy to implement, and produces a unified Gibbs sampler framework useful for both the logit and the probit link. For this reason, we believe it is an appealing alternative for plant and animal researchers. Also, the proposed BLOR model can be easily extended to take into account genotype × environment interactions, which play an extremely important role in plant breeding, especially when selecting candidate genotypes.

## Acknowledgments

We thank all International Maize and Wheat Improvement Center (CIMMYT) researchers in the Global Maize Breeding Program (GMP) and in the Global Wheat Breeding Program (GWP), as well as the researchers in the National Programs, that generate the data we use in this study.

## Appendix A

### Derivation of full conditional distributions

#### Liabilities and Pólya-Gamma values

The fully conditional posterior distribution of liability is The last inequality was obtained using a technique called the Pólya-Gamma method (Scott and Pillow 2013), which is useful when working with logistic likelihoods, and has the formwhere and denotes the density of the random variable , where denotes a Pólya-Gamma distribution with parameters *b* and *d* and density , where denotes the hyperbolic cosine.

Then the joint distribution of and is equal to Therefore, the fully conditional posterior distribution of liability is a truncated normal distribution and its density is where is a normal density with parameters as indicated in the argument, Φ is the cumulative distribution function of a normal density with mean and variance **, and the fully conditional posterior distribution of isand from here and equation (5) of Polson ***et al.* (2013) we get that

#### Regression coefficients (*β*)

First note that the fully conditional posterior of is where . Then, the fully conditional posterior distribution of ** β** is where , . It is important to point out that if we use a prior for (improper uniform distribution), then in and we need to make

**0**the term . Finally, the fully conditional posterior of

**is**

*β*#### Polygenic effects (b)

Now the fully conditional posterior of ** b** is given as This implies that the fully conditional posterior of

**is**

*b*#### Variance of polygenic effects

Next, the conditional distribution of is obtained. If , then This is the kernel of the scaled inverted distribution; therefore, the fully conditional posterior is

#### Threshold effects (*γ*)

The density of the fully conditional posterior distribution of the *c*^{th} threshold, *γ _{c}*, is (A.1)If Equation (A.1) is seen as a function of

*γ*, it is evident that the value of

_{c}*γ*must be larger than all the and smaller than all the . Hence, as a function of

_{c}*γ*, Equation (A.1) leads to the uniform density (A.2)Equation (A.2) corresponds to a uniform distribution on the interval (Albert and Chib 1993; Sorensen

_{c}*et al.*1995).

#### Variance of location effects

If we give *, then This is the kernel of the scaled inverted distribution; therefore, the fully conditional posterior is*

## Appendix B

### R code for simulating data set 1

We used this code for each sample size studied, and thus we only need to change ni, that denotes *n*_{i}, for the other values (10, 20, 40). For each sample size, we used this code 50 times for estimating the results in Table 1.

thetav = c(-5,-10,15)

Datos<-numeric(0)

nC=40;ni=5 ####Change this ni for other sample sizes

for(i in 1:nC)

{xi1<-runif(length(thetav),-0.1,0.1)

Eta=t(xi1)%*%thetav

for(j in 1:ni)

{L=Eta + rlogis(1, location = 0, scale = 1)

y=ifelse(L<=-0.8416,1,ifelse(L<=-0.2533,2,ifelse(L<0.2533,3,ifelse(L<0.8416,4,5))))

Datos<-rbind(Datos,c(j,y,t(xi1)))

}}

colnames(Datos) = c(’j’,’y’,’x1’,’x2’,’x3′)

Datos

## Appendix C

### R codes for simulating data set 2

We used this code for each percentage of outliers (PO) studied; for this reason, we only need to change sam.out for values: 4, 8, and 12, which represent 10, 20, and 30% of outliers. For each sample size, we used this code 50 times for estimating the results in Table 2.

thetav = c(-5,-10,15)

Datos<-numeric(0)

nC=40; ni=40, sam.out =2 ###(this 2 represents 5% of outliers)

for(i in 1:nC)

{xi1<-runif(length(thetav),-0.1,0.1)

Eta=t(xi1)%*%thetav

for(j in 1:ni)

{ L=Eta + ifelse(j>sam.out,rlogis(1, location = 0, scale = 1), rt(1, df=4, ncp=0))

y=ifelse(L<=-0.8416,1,ifelse(L<=-0.2533,2,ifelse(L<0.2533,3,ifelse(L<0.8416,4,5))))

Datos<-rbind(Datos,c(j,y,t(xi1)))

}}

colnames(Datos) = c(’j’,’y’,’x1’,’x2’,’x3′)

Datos

## Appendix D

### R code for fitting the proposed BLOR for Colombia

This code was used for all the analysis given in Table 3.

#####Code for the Bayesian Ordinal REgression ##################################

rm(list=ls()) # remove everything from memory in the working environment.

library(matrixcalc)

library(BayesLogit)

library(mvtnorm)

##### We load the matrix of markers ############################################

load(’MarkersFinal.RData’)

M=as.matrix(X)

DataOrd1=data.frame(read.table(’GLSdataOsval1.csv’,sep=’,’,h=T));

DataOrd=na.omit(DataOrd1)

DataOrd=DataOrd[order(DataOrd$Stock), ]

DataOrd=subset(DataOrd, Loc==’Colombia’)

y=DataOrd[,4]

XD=DataOrd[,1:4]

##### Calculating the marker-derived genomic relationship matrix (GRM) #########

M<-scale(M,center=TRUE,scale=TRUE)

G<-tcrossprod(M)/ncol(M)

LL=t(chol(G))

##### Incidence matrix and covariance for main eff. of lines ###################

XD$Stock<-factor(x=XD$Stock, levels=rownames(M), ordered=TRUE)

ZL<-model.matrix(∼XD$Stock-1)

Z=(ZL%*%LL)

X=Z

tX=t(X)

##### Starting values of Beta0, gammas(thresholds) #############################

gammas<-c(-Inf,-3.11,-2.54,-1.89,-1.27,Inf)

nt=length(gammas)-2

m = dim(X)[1]

nB=dim(X)[2]

In =diag(nB)

p=nB

Betas=rep(0,,dim(X)[2])

SigmaB=1

L<-rep(0,m)

##### Priors for the parameters ################################################

vB=3

SB=0.001

gam.Min=-1000

gam.Max=1000

##### Number of iterations requiered for the Gibbs sampler #####################

Niter<-60000

#### Matrices for saving the output ############################################

MBetas<-matrix(nrow=Niter,ncol=nB)

Mgammas<-matrix(nrow=Niter,ncol=nt)

Mvar_Beta<-matrix(nrow=Niter,ncol=1)

#### Function for sampling normal truncated values###############################

ntruncada<-function(lo,hi,media,std)

{

U = runif(m)

F0 = pnorm(lo,media,std)

c = pnorm(hi,media,std)-F0

muestra = qnorm(c*U+F0)*std+media

}

##### Gibbs sampler ############################################################

for(i in 1:Niter)

{

##### Linear predictor #########################################################

eta=as.numeric(Betas)

##### Samples from polya gamma distribution ####################################

##### Replacing w=rpg(num=m,2,-L+eta) with w=rep(1,m) we get the BPOR #######

w=rpg(num=m,2,-L+eta)

D=diag(w)

##### Sample liabilities (L) from a truncated normal distribution ##############

L = ntruncada(gammas[y],gammas[y+1],media=eta,1/sqrt(w))

##### Sample of thresholds #####################################################

newgammas<-numeric(0)

for(k in 2:(nt+1))

{

lo<-max(c(max(L[y==k-1]),gammas[k-1],gam.Min))

hi<-min(c(min(L[y==k]),gammas[k+1],gam.Max))

newgammas<-append(newgammas,runif(1,lo,hi))

}

gammas<-c(-Inf,newgammas,Inf)

##### Sample of Betas ##########################################################

C = 1/SigmaB*In

X_tD = tX%*%D

MM=X_tD%*%X

diag(MM)=diag(MM)+diag(C)

S_a=solve(MM,tol=1e-19)

mu_a = S_a%*%(t(X)%*%D%*%L)

Betas=t(rmvnorm(1, mean =mu_a, sigma =S_a))

##### Sample of sigma_Beta##########################################################

SigmaB = 1/rgamma(1,shape=(vB+p)/2,(SB+tcrossprod(Betas))/2)

##### Saving output ############################################################

MBetas[i,]=t(Betas)

Mgammas[i,]=t(newgammas)

Mvar_Beta[i]=SigmaB

##### Printing some posterior values estimated #################################

cat(’Niter’,i,’SigmaB =’,SigmaB,’Beta1=’,Betas[1],’\n’)}

## Footnotes

*Communicating editor: E. Huang*

- Received May 13, 2015.
- Accepted August 11, 2015.

- Copyright © 2015 Montesinos-López
*et al.*

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.