## Abstract

Because of the increased availability of genome-wide sets of molecular markers along with reduced cost of genotyping large samples of individuals, genomic estimated breeding values have become an essential resource in plant and animal breeding. Bayesian methods for breeding value estimation have proven to be accurate and efficient; however, the ever-increasing data sets are placing heavy demands on the parameter estimation algorithms. Although a commendable number of fast estimation algorithms are available for Bayesian models of continuous Gaussian traits, there is a shortage for corresponding models of discrete or censored phenotypes. In this work, we consider a threshold approach of binary, ordinal, and censored Gaussian observations for Bayesian multilocus association models and Bayesian genomic best linear unbiased prediction and present a high-speed generalized expectation maximization algorithm for parameter estimation under these models. We demonstrate our method with simulated and real data. Our example analyses suggest that the use of the extra information present in an ordered categorical or censored Gaussian data set, instead of dichotomizing the data into case-control observations, increases the accuracy of genomic breeding values predicted by Bayesian multilocus association models or by Bayesian genomic best linear unbiased prediction. Furthermore, the example analyses indicate that the correct threshold model is more accurate than the directly used Gaussian model with a censored Gaussian data, while with a binary or an ordinal data the superiority of the threshold model could not be confirmed.

- genomic selection
- multiocus association model
- G-BLUP
- threshold model
- ordinal
- binary
- censored Gaussian
- GenPred
- shared data resources

Genomic estimated breeding values based on genome-wide sets of molecular markers have become an essential resource in plant and animal breeding (Eggen 2012; Nakaya and Isobe 2012). The most commonly used approach to predict the genomic breeding values is the genomic best linear unbiased prediction (G-BLUP), a direct descendant of the pedigree-based BLUP model. G-BLUP uses the marker information in estimating realized relationships between the individuals, and then uses the marker-estimated genomic relationship matrix in a mixed model context (*e.g.*, VanRaden 2008; Powell *et al.* 2010). A relatively recent contender for the BLUP-type of model in the genomic selection field is to apply simultaneous estimation and variable selection or variable regularization to multilocus association models (*e.g.*, Meuwissen *et al.* 2001; Xu 2003). Contrary to G-BLUP, a multilocus association model uses the marker information directly by assigning different, possibly zero, effects to the marker genotypes. The genomic breeding value of an individual is then quantified as a sum of the marker effects. Because the number of genetic markers is usually orders of magnitude greater than the number of individuals available for the study, the selection or regularization of the predictors is the key factor of a multilocus association model. In Bayesian genomic selection models the regularization of the excess predictors is performed by shrinking the effects of the markers not linked to the phenotype toward zero by assigning a suitable shrinkage inducing prior density for the marker effects. The most widely used shrinkage inducing priors comprise the Student’s *t* and the Laplace densities, the former of which has been used by Meuwissen *et al.* (2001), by Xu (2003), Yi and Banerjee (2009), Hayashi and Iwata (2010), and Habier *et al.* (2011), whereas the latter has been used, *e.g.*, by de los Campos *et al.* (2009), Meuwissen *et al.* (2009), and Sun *et al.* (2010). The models relying on the Laplace density are commonly denoted as Bayesian LASSO (Park and Casella 2008).

In the basic form of these linear models, the response is assumed continuous with normally distributed residual variation. However, in many instances the actual phenotypic records are given as binary case-control, ordered categorical (*e.g.*, from nonaffected via different severity levels to strongly affected) or censored Gaussian records (*e.g.*, a logarithm of an event history or survival data, or spiked phenotypes as in Broman 2003). With a binary response, either logit or probit transformation is used to convert the binary response into the probability of the positive outcome. Both logit and probit models are developed by assuming an underlying continuous response, often called the latent variable or liability, dichotomized by setting a threshold to limit the two classes. The difference between the logit and probit models is the assumed distribution of the underlying response; the logit model assumes a logistic and the probit model a Gaussian density for the underlying variable. In the frequentist framework, the discrete response is usually modeled without considering the underlying continuous variable, leading to quite different estimation procedure for the model parameters than in linear Gaussian regression. Although under a linear Gaussian model the maximum-likelihood estimate has a closed-form solution, under the logit and probit models the partial derivatives of the likelihood function with respect to the regression coefficients either do not exist or cannot be determined analytically, and the maximum-likelihood estimate must be computed iteratively. The logistic link function leads to somewhat simpler algebraic expressions when handled as an integrated part of the linear model and is therefore often preferred by the frequentists (McCullagh and Nelder 1989).

Contrary to majority of the frequentist models, in the Bayesian context the underlying continuous response is included into the model as a separate latent variable layer, usually assumed to follow a Gaussian density. These two factors, that the augmentation of the latent variable is now an additional layer in the hierarchical model and that the augmented variable is assumed Gaussian, permit the usage of the original linear Gaussian model as such without further modifications. Moreover, in genetics, the normal assumption is especially reasonable as the inheritance of complex traits is determined by multiple genes and environmental factors resulting the liability likely to be normally distributed.

The ordered categorical records can be dichotomized and analyzed with a binary model, or alternatively incorporated as Gaussian observations in a general linear model (*e.g.*, Meijering and Gianola 1985; Wang *et al.* 2013). The problem with the former procedure is that it loses the information contained by the extra categories, whereas in the latter method the model is in effect not compatible with the data. Similarly to a binary phenotype, the ordinal phenotypes can be assumed to have an underlying continuous response discretized by introducing thresholds delimiting the categories (Hackett and Weller 1995). Now there are several thresholds at unknown positions, but otherwise the binary model can be seen as a special case of the ordinal model. The threshold idea can be extended to censored data sets by simply using the uncensored data as such while considering the censored phenotypes as latent variables. The advantage of this approach is that it does not matter which part of the data are censored (right, left, interval, two way censoring), the latent variable is always handled similarly.

Threshold models for ordinal and binary traits have been considered previously by several authors. A threshold model for BLUP with fixed thresholds has been covered by Gianola (1982) and with unknown, estimated threshold positions by Sorensen *et al.* (1995). Multilocus association models of binary and ordinal traits have been considered by Hoti and Sillanpää (2006), Iwata *et al.* (2009), González-Recio *et al.* (2009), González-Recio and Forni (2011), and Wang *et al.* (2013). Furthermore, the threshold approach for censored observations has been considered by Broman (2003), within BLUP context by Sorensen *et al.* (1998), and with multilocus association models by Sillanpää and Hoti (2007) and Iwata *et al.* (2009).

In this article we aim to enlarge on the threshold model more generally. We consider two Bayesian hierarchical models representing the alternative modeling approaches, a Bayesian version of the G-BLUP and a hierarchical Bayesian LASSO (*e.g.*, Park and Casella 2008; Kärkkäinen and Sillanpää 2012a) as a representative of a multilocus association model with variable regularization, and show that in case of a binary, ordinal, or censored Gaussian phenotype the same additional latent variable layer can be plugged into both types of the genomic selection models. In fact, the additional latent variable layer can be subsumed into legions of different linear Gaussian models; Wang *et al.* (2013) have used it with BayesA, BayesB, and BayesC*π*, whereas in our previous work (Kärkkäinen and Sillanpää 2012a) we incorporated a binary threshold-based latent layer into 13 distinct models, including a Bayesian G-BLUP, BayesA, BayesB and both hierarchical and nonhierarchical Bayesian LASSO. In this work, we exemplify the threshold method with a hierarchical Bayesian LASSO as it proved the best working model in the aforementioned previous work and, on the other hand, we did not want to pick anything lesser known, such as the extended Bayesian LASSO (introduced by Mutshinda and Sillanpää 2010, used successfully, *e.g.*, in Kärkkäinen and Sillanpää 2012b) to serve as an example.

A more immediate practical offering of this paper is the fast *maximum a posteriori* (MAP) estimation algorithm presented. Traditionally the parameter estimation for Bayesian models has been performed by finding the posterior density by Markov chain Monte Carlo (MCMC) sampling; however, because of the ever-increasing number of markers available, the focus within the genomic breeding value prediction field has been shifting to the fast MAP estimation methods (*e.g.*, Meuwissen *et al.* 2009; Yi and Banerjee 2009; Hayashi and Iwata 2010; Shepherd *et al.* 2010; Xu 2010; Sun *et al.* 2012). The MAP estimation is based on finding the maximum of the posterior density rather than the whole posterior distribution, usually by an expectation-maximization (EM) algorithm (Dempster *et al.* 1977; McLachlan and Krishnan 1997). The difference in speed between an MCMC and a MAP estimation algorithm is far from trivial: while the run time of an MCMC algorithm is typically hours at the lowest, our generalized expectation maximization (GEM) algorithms perform the example analyses in some 20 sec. With existing genome-wide data sets a fast estimation algorithm is an invaluable asset because it will significantly facilitate the frequent re-estimation of the marker effects and breeding values, the use of cross-validation and permutation-based techniques, and massive simulation studies of breeding programs. Nonetheless, in all of the Bayesian methods for threshold traits found in the literature the parameter estimation has been performed with MCMC. In this respect the methods for discrete data are a bit out of date compared to the methods for Gaussian traits.

In our previous work (Kärkkäinen and Sillanpää 2012a) we already have considered a kindred threshold approach for binary traits in MAP-estimation context. However, an ordinal data set poses an additional challenge because the model has to address the unknown thresholds as well as the latent response. Hence, although a binary phenotype can be regarded as a special case of the ordinal model considered here, a binary model is not readily expandable to several categories. Since the MAP-estimation methods are able to handle large data sets far more efficiently than MCMC methods, it is clear that an applicable MAP-algorithm is needed for all conceivable types of phenotypic observations.

## Materials and Methods

Our hierarchical Bayesian model, depicted as a directed acyclic graph in Figure 1, consists of two separate parts, the linear Gaussian model and the threshold model. Under the linear Gaussian model the phenotype measurements are assumed to be continuous and follow a Gaussian density, while the additional threshold model handles binary, ordinal and censored Gaussian observations.

Under the threshold model, we assume that the observed phenotype **w** consists of either ordered categorical or censored Gaussian observations, and that the ordered categorical variable has arisen as an underlying normally distributed continuous response **y** is rendered discrete with a known number of thresholds at unknown positions. The underlying Gaussian response **y** can be explained by genetic factors with either a multilocus association model (1)or with a G-BLUP model (2)(the linear Gaussian model module in Figure 1). Both (1) and (2) are linear Gaussian models commonly used in genomic selection. In the former model (1), the matrix **X** denotes the genotypic records of *p* biallelic single-nucleotide polymorphisms (SNP) of *n* learning set individuals, coded with respect to the number of the rare alleles and standardized to have zero mean and unity variance, and **β** denotes the marker effects. In the latter model (2) the *N*-vector **u** denotes the additive genetic values of the *N* learning and prediction set individuals, whereas **Z** is a *n* × *N* design matrix connecting the genetic values of the *n* learning set individuals to the latent response. Although the additive genetic values of both learning and prediction set individuals are present in the model (2), the latter do not contribute to the likelihood. *β*_{0} denotes the population intercept in both equations. The residuals **ε** are considered independent and identically distributed under both models, with . To avoid overparametrization and ensure identifiability, the residual variance component is set to unity when the Gaussian response is unobservable (see *e.g.*, Cox and Snell 1989). When the actual Gaussian response is fully observed, the threshold module is omitted from the model (Figure 1) and the residual variance is estimated simultaneously to other model parameters.

In this work the regularization of the excess predictors is performed by a hierarchical Bayesian LASSO (Park and Casella 2008), by specifying a Laplace prior density for the regression coefficients. The Laplace density works best and provides an easy derivation of the fully conditional posterior densities for the effect variances (even though not actual conjugacy) when expressed hierarchically as a scale mixture of normal densities (Park and Casella 2008; de los Campos *et al.* 2009; Kärkkäinen and Sillanpää 2012a). The hierarchical formulation of the prior densities, also known as model or parameter expansion, is a well-known method to simplify MCMC algorithms by transforming the prior into a conjugate and hence enabling Gibbs sampling, and to accelerate convergence of the sampler by adding more working parts and therefore more space for the random walk to move (see *e.g.*, Gilks *et al.* 1996; Gelman *et al.* 2004; Gelman 2004). In our previous work (Kärkkäinen and Sillanpää 2012a), we showed that the hierarchical formulation of the Laplace density is superior to its nonhierarchical counterpart also in EM context. The hierarchy is acquired by setting a Gaussian prior for the marker effects and an exponential prior to the effect variances . Unconditionally for the effects this leads to a Laplace density. In Figure 1 the hierarchical formulation is observable as the fourth, latent parameters, layer. The scale parameter *λ*^{2} of the Laplace prior determines the amount of shrinkage introduced by the prior, and hence the sparseness of the model. Because the optimal amount of shrinkage depends on the heritability and the genetic architecture of the trait, the number of markers and amount of linkage disequilibrium (LD) present in the data, the selection of the hyperparameter *λ*^{2} is of central importance. Although some authors prefer to give a fixed value to *λ*^{2} (*e.g.*, Figueiredo 2003; Meuwissen *et al.* 2009; Xu 2010), the most prevalent solution is to estimate it simultaneously to the model parameters (*e.g.*, Yi and Xu 2008; de los Campos *et al.* 2009; Shepherd *et al.* 2010). To this end we give the hyperparameter *λ*^{2} a Gamma(*κ*, *ξ*) hyperprior, and tune the rate parameter *ξ* of the gamma density to a suitable data specific value (sixth layer in Figure 1). The prior densities for the population intercept *β*_{0} and the residual variance (when applicable, *i.e.*, under the Gaussian phenotype model) are uniform *p*(*β*_{0}) ∝ 1 and Jeffreys’ , respectively. As the model parameters are considered *a priori* independent, the joint posterior density of the parameter vector becomes(3)where is a vector of the marker variances.

Under the Bayesian G-BLUP (2) the prior density for the genetic values **u** is a conjugate multivariate normal , where the realized relationship matrix **G** has been estimated from the marker data. In this work the estimation has been performed with the second method described in VanRaden (2008). Contrary to the classical framework, in a Bayesian approach the variance components are estimated simultaneously with the genomic breeding values (Hallander *et al.* 2010; Kärkkäinen and Sillanpää 2012a). The genetic variance component has been given a flat Inverse-*χ*^{2}(*ν*, *τ*^{2}) prior distribution with a substantially large *τ*^{2} to ensure the flatness of the density. Similarly to the multilocus association model, the prior densities for the population intercept and the residual variance (under the Gaussian phenotype model) are *p*(*β*_{0}) ∝ 1 and . The joint posterior density of the G-BLUP model parameters is given by

Given the value of the continuous, normally distributed latent variable *y _{i}*, the binary or ordinal variable

*w*has value

_{i}*k*∈ {1, …,

*K*} with a probability (5)where

*t*

_{k}_{−1}and

*t*are the thresholds delimiting the

_{k}*k*th category. If the ordinal variable has

*K*categories, there will be

*K*+ 1 thresholds, such that

**t**= {(

*t*

_{0},

*t*

_{1}, …,

*t*)|

_{K}*t*

_{0}<

*t*

_{1}< … <

*t*,

_{K}*t*

_{0}= −∞,

*t*

_{1}= 0,

*t*= ∞}. One of the thresholds must be fixed in order to center the underlying distribution; we adopt the common fashion to set

_{K}*t*

_{1}into zero (

*e.g.*, Cox and Snell 1989; Sorensen

*et al.*1995). The

*K*− 2 of the thresholds

**t**

^{⋆}= {(

*t*

_{2}, …,

*t*

_{K}_{−1})|

*t*

_{2}< … <

*t*

_{K}_{−1}} are considered unknown, and are estimated simultaneously to the model parameters. With a binary response (

*K*= 2) there obviously are no unknown threshold values.

Although the likelihood of the observed phenotype **w** follows a categorical density, conditionally on the underlying response and the thresholds the observed ordinal phenotype is known with certainty and hence the likelihood P(*w _{i}* =

*k*|

*y*,

_{i}**t**) gets a constant value, zero or one. Therefore, the fully conditional posterior density of the latent Gaussian variable

*y*, given the value of the ordinal observation

_{i}*w*, corresponds the prior density of

_{i}*y*when

_{i}*t*

_{k}_{−1}<

*y*<

_{i}*t*and is zero otherwise. As the latent variable is an additional layer in the hierarchical model, the prior density for the underlying Gaussian response

_{k}**y**is the likelihood of the latent variable under the linear Gaussian models (1) or (2). The likelihood of the latent Gaussian variable, given by the model equations (1) or (2) and the assumptions below them, is a multivariate normal centered at

*β*

_{0}+

**Xβ**under the multilocus association model and at

*β*

_{0}+

**Zu**under the G-BLUP model, respectively, the covariance being set to unity under both models. Hence, the fully conditional posterior density of

*y*is a truncated normal distribution (truncated at points

_{i}*t*

_{k}_{−1}and

*t*) with a density function (for simplicity, the ⋆ denotes the data and all other parameters) (6)where

_{k}*ϕ*(⋅) and Φ(⋅) denote the standard normal density and cumulative distribution functions, respectively, while (

*y*) is the linear predictor of the model (1) or (2).

_{i}Following Sorensen *et al.* (1995) the prior for the *K* − 2 unknown thresholds **t**^{⋆} = (*t*_{2}, …, *t _{K}*

_{−1}) has been given as order statistics from an Uniform(0,

*t*) distribution, (7)The fully conditional posterior density for a

_{max}*t*is proportional to the product of the prior and the likelihood of the observed ordinal phenotype

_{k}**w**. Note, that the threshold values

**t**

^{⋆}appear in the prior density (7) only at the definition of the support of the distribution. As the terms not including the parameter are discarded as constants from the fully conditional posterior, the support definition is all that passes from the prior to the posterior. Therefore, the fully conditional posterior density for a

*t*is given by the likelihood of the observed ordinal phenotype

_{k}**w**, within the set of values determined by the prior density of

**t**

^{⋆}, (8)for 0 <

*t*

_{2}< …

*t*

_{K}_{−1}<

*t*and 0 otherwise. If (8) is seen as a function of

_{max}*t*, it can be seen that the value of

_{k}*t*must be larger that all of the

_{k}*y*|

_{i}*w*=

_{i}*k*and smaller than all of the

*y*|

_{i}*w*=

_{i}*k*+ 1. Hence, as a function of

*t*, (8) leads to the uniform density (9)For a Gaussian phenotype with censored observations we define an additional binary variable

_{k}*ω*= 1 if the

_{i}*i*th observation is censored and

*ω*= 0 if not. As the threshold model assumes an unity variance for the latent Gaussian response, the observed phenotype must be standardized accordingly. This is done by regarding the available observations as a sample from a truncated normal density and using the connection between the quantiles and the standard deviation of a Gaussian density (

_{i}*e.g.*, 25% of the observations are ≤

*μ*− 0.67

*σ*, or 15.73% are ≤

*μ*−

*σ*). Now, if

*ω*= 0 the standardized Gaussian phenotype is used directly, and if

_{i}*ω*= 1 the underlying uncensored response is computed as previously. The latent variable parametrization of the binary phenotype is similar to a generalized linear model with the probit link function (Albert and Chib 1993), whereas the parametrization of the censored phenotype corresponds to a generalized linear model with the tobit link function (see

_{i}*e.g.*, Tobin 1958; Sorensen

*et al.*1998; Iwata

*et al.*2009).

The model parameters are estimated by the GEM (Neal and Hinton 1999) algorithm, which is presented in the Appendix A2. The algorithm finds a *maximum a posteriori* point estimate for each of the parameters and latent variables by repeatedly updating the parameters one at the time to their conditional expectations (see Kärkkäinen and Sillanpää 2012a). Due to the conjugate or otherwise suitable prior densities chosen, the fully conditional posterior densities for the parameters and the latent Gaussian response are known probability density functions. This guarantees an easy derivation of the GEM-algorithm; as the expected values of the known densities are automatically available, we do not need to find the fully conditional posterior expectations by integration. In addition, if preferred it would be trivial to implement an MCMC Gibbs sampler to sample from these densities. The fully conditional posterior densities for the latent Gaussian response and for the unknown thresholds are given in the aforementioned models (6) and (9), whereas the fully conditional posterior densities for the parameters of the linear Gaussian models can be easily extracted from the joint posterior densities of the models (3) and (4). The derivations of the fully conditional posterior densities of the multilocus association model are presented in the Appendix A1.

The fully conditional posterior densities for the multilocus association model (1) parameters are as follows. The fully conditional posterior density for a marker effect *β _{j}* is normal (10)where the residual variance unless the actual Gaussian phenotype is observed. The fully conditional posterior density for the inverse of a marker variance is an inverse-Gaussian (Chhikara and Folks 1989) (11)The fully conditional posterior density for the LASSO parameter

*λ*

^{2}is a gamma density (12)The fully conditional posterior density for the population intercept equals a normal density (13)where again the residual variance unless the actual Gaussian phenotype is observed. Finally, when estimated, the fully conditional posterior density of the residual variance is an inverse-

*χ*

^{2}(14)The fully conditional posterior densities for the Bayesian G-BLUP (2) parameters are the following. The fully conditional posterior density for the additive genetic values is a multivariate normal (15)where the residual variance is again one if the actual Gaussian phenotype is not observed. The fully conditional posterior density for the additive genetic variance is an inverse-

*χ*

^{2}density (16)where the capital

*N*denotes the total number of individuals in the learning and test sets. The fully conditional posterior density for the population intercept is normal (17)where the residual variance unless the actual Gaussian phenotype is observed. When estimated, the fully conditional posterior density of the residual variance is an inverse-

*χ*

^{2}

The generalized expectation maximization algorithms presented in the Appendix A2 work by updating the parameters to the expected values of the aforementioned fully conditional posterior densities. In our method, as in nearly all Bayesian approaches, the user has to provide some, usually data specific, parameter values for the hyperprior densities at the very bottom of the model hierarchy. In the multilocus association model the hyperprior parameters for the LASSO parameter *λ*^{2} ∼ Gamma(*κ*, *ξ*) have to be given by the user, whereas in the Bayesian G-BLUP this role falls to the hyperprior parameters of the additive genetic variance component . The selection of the data specific hyperprior parameters is called tuning of the algorithm. The tuning is the easier to perform the fewer parameters there are to be tuned. Because the number of markers (*p*) is very large, the impact of *κ* into the fully conditional posterior expectation of the LASSO parameter , derived from the fully conditional posterior density in (12), is obviously negligible. As the only information the GEM algorithm uses in the update process is the fully conditional expectation, we shall simplify the tuning by setting the value of *κ* to a constant value *κ* = 1. Thereby, the rate parameter *ξ* is the only entity in the model to which the user has to provide a data specific value. Accordingly, under the Bayesian G-BLUP the degrees of freedom *ν* of the inverse-*χ*^{2} density do not have a substantial contribution to the fully conditional posterior expectation of the additional genetic variance , and we therefore set permanently *ν* = 2, while the scale parameter *τ*^{2} may need data specific tuning.

## Example analyses

In our example analyses, we have considered the predictive performance of the Bayesian multilocus association model and the Bayesian G-BLUP *per se* and with the three different latent variable layers, with two different data sets.

The first of the data sets consists of a simulated data introduced in the XII QTL-MAS Workshop 2008 (Lund *et al.* 2009). The data set can be downloaded from the workshop homepage at http://www.computationalgenetics.se/QTLMAS08/QTLMAS/DATA.html. There are 5865 individuals from seven generations of half sib families with information on 6000 biallelic SNP loci, the loci are evenly distributed over six chromosomes of length 100 cM each (see Lund *et al.* 2009 for details). Since SNPs with minor allele frequency <0.05 within the learning set were discarded, the actual number of markers in the analysis is 5726. The first four generations of the data, 4665 individuals, have both marker information and a phenotypic record, and function as a learning set, whereas the generations five to seven, comprising 1200 individuals, are treated as a prediction set. There are 48 simulated quantitative trait loci (QTL) in the data set, with allele substitution effects drawn from a Gamma(0.42, 1.85) distribution (with shape and rate parametrization). The cumulative effect of the simulated QTL equals the genetic value of the individuals, while the phenotypes of the individuals have been obtained as the sum of the individuals’ genetic value and a random residual drawn from a normal distribution with mean zero and a variance set to produce heritability value 0.3 (Lund *et al.* 2009). As in our previous works, we have generated 100 replicates of the data set by resampling the residuals from a normal density N(0, *var*(*TBV*)(1/*h*^{2} − 1)), where *var*(*TBV*) denotes the observed variance of the genetic values and the heritability *h*^{2} equals 0.3 (Kärkkäinen and Sillanpää 2012a,b). After this each of the generated phenotype sets was scaled to have zero mean and unity variance. The advantage of using a simulated data set in the example analysis is the availability of the true genetic values of the individuals, enabling us to determine the accuracy of the estimates by a direct comparison of the simulated and estimated genetic values.

The second data set, described in detail by Cleveland *et al.* (2012), is a real pig (*Sus scrofa*) data, provided by the Genetics Society of America to be used for benchmarking of genomic selection methods. The pig dataset consists of phenotypic records of 3184 individuals for a quantitative trait (standardized to zero mean and unity variance) with predetermined heritability 0.62, and genotypic records for 60k biallelic SNP markers (45,317 with minimum allele frequency over 0.05 actually included in the analysis). Contrary to the simulated data set, there are neither true genetic values of the individuals nor true effects of the QTL available, and hence we estimate the accuracy of the predicted genomic breeding values by dividing the correlation between the estimates and the original Gaussian phenotypic values by the square root of the predetermined heritability of the trait (Legarra *et al.* 2008). Since the data does not consist a separate validation population we compute the result statistics using cross-validation, where the 3184 individuals are randomly partitioned into 10 subsets (10-fold cross-validation) of 318 or 319 individuals. At each round 9 of the sets are treated as a learning set and the remaining one as the prediction set.

The binary, ordinal, and censored phenotypes of the data sets were constructed as follows. We tested two binary phenotypes with success probabilities 50% and 80%, two ordinal phenotypes with four classes, and three right censored phenotypes consisting of 20%, 50%, or 80% of censored observations. The proportions of observations belonging into each class of the four-class phenotype were either even 20:30:30:20% of observations in each class, or highly unbalanced with 70% belonging to the first class and 10% in the subsequent three classes. The value of the censored observations was set to equal the largest of the noncensored values, leading to a spiked Gaussian phenotype (see Broman 2003). The binary and the evenly distributed ordinal data sets are generated in preparation for an easy ascertainment of the extra power acquired by using the category information compared to the dichotomized phenotype. The binary phenotype with 80% success probability simply sets the first category of the ordinal phenotype as a failure and the subsequent three classes as a success, while the binary response with 50% success probability sets the first and second category as a failure and the third and fourth as a success. The same holds true for the censored data, as the threshold values are set to correspond the thresholds of the binary phenotype: the 20% and 80% censored data can be compared with the binary data with 80% success rate, as the proportions of the observations belonging to the classes is 20:80, and similarly the 50% censored data are equivalent to the 50% or 50:50 binary data. All threshold values were determined as standard normal distribution function parameters leading to the desired threshold value, *e.g.*, a threshold at 0.84, leading to 20% success probability, since Φ(0.84) = 0.8.

The multilocus models are not able to handle an unlimited number of loci with respect to the sample size. Hoti and Sillanpää (2006) have proposed an upper limit of 10 times more loci than individuals, but it seems that in practice a smaller number of loci might be optimal (*e.g.*, real data analysis in Kärkkäinen and Sillanpää 2012a). Furthermore, to our experience, the best results are necessarily not acquired by using as many markers as the model can possible handle, but with a significantly smaller marker set (results not shown). In the QTL-MAS data, the proportion of markers to individuals is almost one-to-one, and no extra measures are needed, but with the pig data the multilocus association model becomes too oversaturated to function properly. Therefore, with the real data, in the beginning of each cross-validation round the number of SNPs is first reduced from 45,317 to 10,000 by the sure independence screening method (Fan and Lv 2008). The method works by ranking the markers with respect to their marginal correlation with the phenotype within the current learning set, and selecting the 10,000 best ranking markers to the multilocus association model. The marginal correlation is computed as the Pearson’s product-moment coefficient, by using the same phenotypic records (binary, ordinal, censored or Gaussian) as is used in the actual multilocus association model, except in the case of the 80% censored Gaussian phenotype, where it proved better to dichotomize the phenotype by setting the uncensored data into zero. In Kärkkäinen and Sillanpää (2012a) we performed the preselection in advance and used the same set of markers in all cross-validation rounds, whereas here the preselection is integrated into the cross-validation procedure. The correlation produced by the former approach appeared to be slightly overestimated. An advantage of the G-BLUP over the multilocus association model is that no preselection of the markers is needed. Some authors have found out that preselection of the markers might have a positive impact also to G-BLUP (see Resende *et al.* 2012), but we did not observe such a behavior (results not shown).

The hyperprior parameters for the example analyses are selected to produce best accuracy. The tuning is performed by testing different parameter values and choosing the one resulting the best correlation between the estimated and the true genetic values. In practice, we simply select two arbitrary values for the parameter, observe the correlation acquired under these values, and proceed to search for an optimal value to the direction pointed by the better performing one. This step could be automatized, but so far we have performed it manually. As we have used both learning and prediction sets in the parameter tuning, the obtained accuracies must be considered as best-case scenarios. Under the multilocus association model (1), we give the LASSO parameter *λ*^{2} a Gamma(1, *ξ*) hyperprior and tune the rate (*ξ*) of the gamma density into a data specific value, while under the G-BLUP model (2) the scale hyperparameter *τ*^{2} for the genetic variance is tuned. The values selected for *ξ* and *τ*^{2} and some of the proposed values, along with the corresponding accuracy estimates under the multilocus association model and the Bayesian G-BLUP, are given in Tables 1 and 2, respectively.

Comparing the relative performance of the multilocus association model and the Bayesian G-BLUP with the two data sets (Table 3) clearly shows that the multilocus association model is superior when the trait is controlled by a moderate number of genes (QTL-MAS data), whereas the G-BLUP is a reasonable choice when the trait is either truly polygenic or there is strong linkage disequilibrium present (the pig data). With the 100 Gaussian QTL-MAS data replicates, the multilocus association model produces an average correlation 0.89 whereas the G-BLUP produces an average correlation 0.80. Consistently, with all of the binary, ordinal, and censored Gaussian QTL-MAS data sets, there is an approximately 10-point difference in favor of the multilocus association model in the average correlation. Regarding the pig data, the G-BLUP has the advantage over the multilocus association model: the average correlation in the 10 cross-validation sets is three points higher with the fully observed Gaussian phenotype and two to four points higher with most of the other phenotypes. The advantage of the G-BLUP is even more significant with the 80% success rate binary phenotype and the 80% censored phenotype, the G-BLUP being on average six points more accurate. In this case, however, the culprit is not only the multilocus association model itself but also the sure independence screening used beforehand to reduce the number of markers: if the marginal correlation was computed by using the Gaussian phenotype instead of the binary, the final average correlation would be more consistent 0.51 instead of the now observed 0.49 (data not shown).

As the binary and the evenly distributed ordinal data sets are related to each other it is easy to ascertain the extra power acquired by using the category information compared with the dichotomized phenotype. Table 3 shows a significantly improved accuracy if the additional categories are taken into account: with the ordinal data the mean correlation is 6–10 points or 7–20% higher than with the 80% success rate binary data, and 3–4 points or 5–7% higher than with the 50% success rate binary data. The percentage advantage is greater in situations in which the power of the analysis is lower. As expected, the accuracy is lower with the 70:10:10:10% ordinal phenotype than with the evenly distributed ordinal phenotype, the difference being 2–4 points with both data sets under both models. The additional accuracy gained by using the correct model (*i.e.*, the threshold model) for the binary and ordinal phenotypes, instead of using the linear Gaussian model directly, was minor. The threshold model was a trifle of more accurate in some cases (Table 3). The correlation obtained with the threshold model was one point greater in 7 cases of 16, more often with the unevenly distributed responses (binary 80% and ordinal 70:10:10:10) than with the evenly distributed (binary 50% and ordinal 20:30:30:20); with the unevenly distributed phenotypes there was five cases in which a modicum of extra accuracy was gained with the threshold model, whereas with the evenly distributed there was only two. The extra accuracy was also observed more often with the pig data than with the QTL-MAS data (five and two cases, respectively), and with the Bayesian G-BLUP than with the Bayesian LASSO (also five and two cases, respectively).

The censored data sets consist of a continuous normally distributed phenotype with 20%, 50%, or 80% right censored observations, set to equal the maximum of the non-censored observations. The non-censored observations clearly contain extra information compared to the corresponding binary data (Table 3). The correlations acquired with the data sets with 20% censored observations were in all cases considerably, 6–11 points or 6–22%, higher than with the corresponding 20:80 binary data sets. With the 50% censored data sets the difference is 1–3 points, or 1–5%, compared to the 50:50 binary data. Even the data sets with 80% censored observations may be slightly more informative than the 20:80 binary data, the correlation being one point in favor of the censored phenotype with the QTL-MAS data under the threshold LASSO and threshold G-BLUP. The 20% censorship weakens the accuracy only slightly compared to the fully observed Gaussian phenotype: no more than one point if the threshold model is used, and 1–2 points if the Gaussian model is used directly for the censored phenotype. Contrary to the ordinal-phenotype-case, with a censored phenotype the threshold model is clearly more accurate than the Gaussian model (Table 3). The difference between the models with the data sets with 20% censored phenotypes is tiny (one point in three cases out of four), but increases when the censoring grows stronger. With 50% censored observations the difference is 1–3 points or 1–5%, and with 80% censored observations 3–7 points, or 4–17%.

The steps of the GEM algorithm are repeated until convergence. The algorithm is considered to be converged when the correlation between the estimated breeding values of two consecutive iterations is greater than 1−10^{−6}. The convergence is confirmed visually by examining the behavior of parameter values during the iterations and verifying that all of the parameters have reached a constant level; this is also how the suitable value for the convergence rule has been originally ascertained. The required number of iterations is usually between 40 and 80 under the multilocus association model, and around 10 under the G-BLUP. So far we have not encountered problems in the convergence, given that appropriate hyper(prior) parameter values have been selected, and that the number of markers with respect to the sample size in the multilocus association model has not been too large. Depending on the data and the model variate, the computation time is around 15–50 sec with a 64-bit Windows 7 desktop computer with 3.50 GHz Intel(i7) CPU and 16.0 GB RAM.

## Discussion

Our example analyses show that using the extra classes and the uncensored observations present in an ordered categorical and a censored Gaussian data set, instead of dichotomizing the data into case-control observations, increases the accuracy of genomic breeding values predicted by Bayesian multilocus association models or by Bayesian G-BLUP. The amount of extra information of an ordinal data depends on the number of the classes and the distribution of the observations into the classes, higher number and more even distribution corresponding to higher information content. With a mildly to moderately (20–50% of observations) censored Gaussian data, the increase of the accuracy is substantial compared to binary data, but even if 80% of the observations are censored the remaining observations seem to possess some extra information.

Our results indicate that only a minor benefit is gained by using the correct threshold model compared with using the linear Gaussian model directly with a binary or ordinal data. These results are in concordance with the observations by Wang *et al.* (2013) under BayesB (Meuwissen *et al.* 2001) and BayesC*π* (Habier *et al.* 2011), and with the early observations of Meijering and Gianola (1985) for BLUP. However, for the same data set, under BayesA Wang *et al.* (2013) noted a substantial increase in the accuracy when the threshold model was used. Also, Meijering and Gianola (1985) observed that the threshold model was more reliable than BLUP if the number of fixed effects required in the mixed model was high. Even though our results do not confirm the practical superiority of the correct threshold model over the linear Gaussian model, we urge caution when applying a Gaussian model directly for an ordinal data. Some data sets may be less well-behaving than the ones we have studied and, as proven by Wang *et al.* (2013), different linear models may be less robust to the incompatible data.

The example analyses support the observation of Sorensen *et al.* (1998) that in the case of a censored Gaussian data the threshold model behaves better than the linear Gaussian model used directly. The benefit of the correct threshold model increases as the proportion of the censored observations increase. With 20% censored observations, the threshold model is slightly more accurate, whereas with 50% censored observations the accuracy gain is substantial. On the basis of our results, a heavily censored (80%) Gaussian data should not be analyzed with a linear Gaussian model, or, if analyzed, it should be dichotomized into a binary case-control data.

The sure independence screening (Fan and Lv 2008) works very well for such a strikingly simple method. It works so well probably because all it needs to do is to let all of the important markers pass to the next step whereas, because the final variable regularization is performed by the multilocus association model, it does not matter if unimportant ones are also selected. The optimal number of SNPs selected into the multilocus association model is data specific as it depends on the number of individuals in the learning set, and probably also on the genetic architecture of the trait and the LD structure. Additionally, multilocus association models with different shrinkage or variable selection mechanisms may be able to cope with different amount of oversaturation. The number of markers selected to the multilocus association model can be tuned into an optimal value similarly to the prior parameters. The model performance seems to be reasonably robust to the number of markers: in the pig data 10,000–20,000 markers produced almost identical accuracies with all of the response types. However, even though the sure independence screening seems to be a decent method indeed, and we have contented ourselves with using it for the marker preselection for the time being, there probably is room for improvement in this respect.

The difference in computation time between MCMC and (G)EM-algorithms is massive. Some authors have compared the speed difference between an MCMC and a MAP-estimation algorithm, for example, the fast BayesB implementation of Meuwissen *et al.* (2009) took 2–5 min to converge, whereas the MCMC-based BayesB required 47 hr. Using the same QTL-MAS data we have used, Shepherd *et al.* (2010) reported a computer time of few minutes for a MAP-estimation algorithm and 2 d for an MCMC algorithm (with a 2-GHz computer). Accordingly, with the same QTL-MAS data, the frequentist LASSO-LARS implementation of Usai *et al.* (2009) took more than 7 hr to converge. The enormity of the time difference can be illustrated by extrapolating the computer times reported by Shepherd *et al.* (2010) into the analysis of our 100 replicated QTL-MAS data sets: although the 100 analyses take 20 min with our GEM algorithm, with an MCMC algorithm the computation time would be stunning 600 hr, or 25 d. To get the best possible results, it seldom is sufficient to run an algorithm once, for instance, due to the tuning of the prior parameters and sensitivity analysis. The extremely short time requirement facilitates the adjusting for optimal performance, not to mention the usage of computer intensive techniques such as cross-validation and empirical threshold determination by phenotype permutation. Fast estimation is also extremely useful in simulation studies of entire breeding programs (see, *e.g.*, Pedersen *et al.* 2009; Axelsson *et al.* 2013).

The Bayesian threshold approach for binary, ordinal, and censored data enables the usage of a variety of different linear Gaussian model types—here we have demonstrated the method with Bayesian LASSO multilocus association model and Bayesian G-BLUP; however, depending on the genetic architecture and LD structure of the data, other variable selection or regularization methods than LASSO may be preferred. Whether the additional threshold layer actually increases the accuracy of the genomic breeding value estimates is questionable with a binary or an ordinal data but less so with a censored Gaussian data. To our experience it seems that especially with a heavily censored Gaussian data the threshold model should be used, but there is no harm in using it also with the binary and the ordinal phenotypes.

## Acknowledgments

Supported by research grant from the Academy of Finland and Tekes funding for public research ”Individualized Connected Health.”

## Appendix A1

### Derivations of the fully conditional posterior densities

The joint posterior distribution of the unknown parameters, given the data, is proportional to the product of the joint prior and the likelihood. As the model parameters are considered *a priori* conditionally independent, the joint posterior density of the parameter vector under the multilocus association model (1) becomes(A1)The fully conditional posteriors of the parameters can be easily determined by selecting the terms including the parameter in question from the joint posterior. (For simplicity: * = ”the data, and the parameters except the one in question”)

As the prior density for the population intercept *β*_{0} is proportional to one, the fully conditional posterior distribution is proportional to the likelihood, (A2)Note that we select from the joint posterior (A1) only the terms including *β*_{0}. The right hand side of (A2) is a product of *n* kernels of normal probability density functions, with a common variance , and means , *i* = 1, …, *n*. The set of Gaussian functions is closed under multiplication, *i.e.* the product of normal densities is also a normal density, with the mean and the variance of the product density given byrespectively. Hence, since in this case variance is same for all of the factors, the product density reduces to the sum of the means of the individual distributions, divided by *n*, while the variance of the product density is given simply by the variance of the individual distributions divided by *n*, leading to the fully conditional posterior density function given in (13).

For the fully conditional posterior distribution of the regression a coefficient *β _{j}* we select from (A1) the terms including

*β*, located at the likelihood and at the prior , and get (A3)Here we have a product of two types of kernels of normal densities. The latter part of (A3) comes from the prior, it has mean 0 and variance . The former part represents the likelihood—regarding

_{j}*β*there are

_{j}*n*kernels of normal distributions with meansand variances ,

*i*= 1, …,

*n*. As a product we get a normal distribution with a meanand variancewhich corresponds to the function in (10) when expanded by .

The residual variance appears in the joint posterior density (A1) in the likelihood and in the prior density , leading to (A4)Regarding , this is an unnormalized probability density function of an inverse *χ*^{2}-distribution (14), with *n* degrees of freedom and scale parameter equal to

The fully conditional posterior density for the variance of the marker effects (11) is proportional to the product of the conditional prior for the marker effect and the exponential prior of the effect variance ,for all *j* = 1, …, *p*. Since exponential density is not conjugate to normal density, we need to consider the inverse of the marker variancewhere the last term is the Jacobian of the transformation . By rearranging the terms, we getand completing the numerator of the exponent into squareAs is canceled out from the last term inside the exponent, the term becomes constant and is left out. After that the exponent is expanded by and we get (A5)This is an inverse-Gaussian probability density function (Chhikara and Folks 1989) with mean *μ*′ and shape *λ*′the parametrization of the inverse-Gaussian density being

The LASSO parameter *λ*^{2} occurs in the joint posterior (A1) in the exponential priors for the effect variances and in the gamma prior for the LASSO parameter. (A6)that is an unnormalized gamma probability density function with shape *κ* + *p* and ratecorresponding to (12).

## Appendix A2

### The GEM algorithms

Below we present the GEM algorithms for *maximum a posteriori* parameter estimation under the threshold model combined with the Bayesian LASSO and the Bayesian G-BLUP. In both algorithms, the steps are repeated until convergence. Matlab codes of the algorithms are included as Supporting Information, File S1.

### GEM algorithm for the multilocus association model

Set initial values for parameter vectors. We use zeros for

*β*_{0},**β**, and the latent variable**y**, and small positive values, namely 0.1, for the variances**σ**^{2}and (when estimated), and for the LASSO parameter*λ*^{2}. The initial value for the threshold vector .Update the values of the latent variable

**y**by replacing the current values*y*with the expected values of the truncated normal distribution (6),where , and_{i}*ϕ*(⋅) and Φ(⋅) denote the standard normal density function and distribution function, respectively. If the actual Gaussian phenotype**y**is fully available, this step naturally becomes obsolete.Update the K − 2 unknown thresholds to their conditional expectations,for all

*k*= 2, …, K-1. This step is bypassed if i) the phenotype is Gaussian and fully observed, ii) the phenotype is binary,*i.e. K*= 2, or iii) the phenotype is left or right censored (no unknown thresholds).Maximize the posterior distributions of

*β*_{0}and*β*(for all j) by substituting the fully conditional expectations for the current values of the parameters, one at the time.Note that the residual variance unless the Gaussian phenotype is fully observed._{j}In case of a fully observed Gaussian phenotype, update the error variance into its conditional expectation

Update the effect variances (for all j) to their conditional expectations. The precision, or inverse of the variance parameter , has an inverse-Gaussian fully conditional posterior distribution leading to following fully conditional expectation for the effect variances

Update the hyperparameter λ

^{2}into its conditional expectation

### GEM algorithm for the Bayesian G-BLUP

Set initial values for parameter vectors; zeros for

*β*_{0},**u**and**y**, and 0.1 for and (when estimated), .Update the latent variable

**y**as previously, except that here*μ*_{i}=*β*_{0}+*u*._{i}Update the thresholds as previously.

Update the population intercept

*β*_{0}into its fully conditional expectationUpdate the polygenic effects

**u**by replacing the current values with the conditional expectationswhere**G**^{−1}denotes a generalized inverse of the realized relationship matrix. Note that the residual variance except in the fully observed Gaussian phenotype case.In case of a fully observed Gaussian phenotype, update the error variance into its fully conditional expectation

Replace the additive genetic variance component with its conditional expectationwhere and

*N*is the total number of individuals in the learning and validation sets (*i.e.*the dimension of**G**is*N*×*N*), and**G**^{−1}is a generalized inverse of the realized relationship matrix.

## Footnotes

Supporting information is available online at http://www.g3journal.org/lookup/suppl/doi:10.1534/g3.113.007096/-/DC1

*Communicating editor: D.-J. De Koning*

- Received March 28, 2013.
- Accepted June 24, 2013.

- Copyright © 2013 Kärkkäinen and Sillanpää

This is an open-access article distributed under the terms of the Creative Commons Attribution Unported License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.