## Abstract

Genomic tools allow the study of the whole genome, and facilitate the study of genotype-environment combinations and their relationship with phenotype. However, most genomic prediction models developed so far are appropriate for Gaussian phenotypes. For this reason, appropriate genomic prediction models are needed for count data, since the conventional regression models used on count data with a large sample size () and a small number of parameters (*p*) cannot be used for genomic-enabled prediction where the number of parameters (*p*) is larger than the sample size (). Here, we propose a Bayesian mixed-negative binomial (BMNB) genomic regression model for counts that takes into account genotype by environment interaction. We also provide all the full conditional distributions to implement a Gibbs sampler. We evaluated the proposed model using a simulated data set, and a real wheat data set from the International Maize and Wheat Improvement Center (CIMMYT) and collaborators. Results indicate that our BMNB model provides a viable option for analyzing count data.

- Bayesian model
- count data
- genome enabled prediction
- Gibbs sampler
- GenPred
- shared data resource
- genomic selection

In most living organisms, phenotype is the result of genotype (*G*), environment (*E*) and genotype by environment interactions . Garrod (1902) observed that the effect of genes on phenotype could be modified by the environment (*E*). Similarly, Turesson (1922) demonstrated that the development of a plant is often influenced by its surroundings. He postulated the existence of a close relationship between crop plant varieties and their environment, and stressed that the presence of a particular variety in a given locality is not a chance occurrence; rather, there is a genetic component that helps the individual adapt to that area.

For these reasons, today the consensus is that *G×E* is useful for understanding genetic heterogeneity under different environmental exposures (Kraft *et al.* 2007; Van Os and Rutten 2009), and for identifying high-risk or productive subgroups in a population (Murcray *et al.* 2009); it also provides insight into the biological mechanisms of complex traits such as disease resistance and yield (Thomas 2011), and improves the ability to discover resistance genes that interact with other factors that have few marginal effects (Thomas 2011). However, finding significant *G×E* interactions is challenging. Model misspecification, inconsistent definition of environmental variables, and insufficient sample sizes are just a few of the issues that often lead to low-power and nonreproducible findings in *G×E* studies (Jiao *et al.* 2013; Winham and Biernacka 2013).

Genomics and its breeding applications are developing very quickly with the goal of predicting yet-to-be observed phenotypes, or unobserved genetic values for complex traits, and inferring the underlying genetic architecture utilizing large collections of markers (Goddard and Hayes 2009; Zhang *et al.* 2014). Also, genomics is useful when dealing with complex traits that are multigenic in nature, and have major environmental influence (Pérez-de-Castro *et al.* 2012). For these reasons, the use of whole genome prediction models continues to increase. In genomic prediction, all marker effects are fitted simultaneously on a model and simulation studies promote the use of this methodology to increase genetic progress in less time. For continuous phenotypes, models have been developed to regress phenotypes on all available markers using a linear model (Goddard and Hayes 2009; de los Campos *et al.* 2013). However, in plant breeding, the response variable in many traits is a count (*y* = 0,1,2,…), for example, number of panicles per plant, number of seeds per panicle, weed count per plot, etc. Count data are discrete, non-negative, integer-valued, and typically have right-skewed distributions (Yaacob *et al.* 2010).

Poisson and negative binomial regression are often used to deal with count data. These models have a number of advantages over an ordinary linear regression model, including a skewed, discrete distribution (0,1,2,3,…,) and the restriction of predicted values for phenotypes to non-negative numbers (Yaacob *et al.* 2010). These models differ from an ordinary linear regression model. First, they do not assume that counts follow a normal distribution. Second, rather than modeling y as a linear function of the regression coefficients, they model a function of the response mean as a linear function of the coefficients (Cameron and Trivedi 1986). Regression models for counts are usually nonlinear and have to take into consideration the specific properties of counts, including discreteness and non-negativity, and are often characterized by overdispersion (variance greater than the mean) (Zhou *et al.* 2012).

However, in the context of genomic selection, it is still common practice to apply linear regression models to these data, or to transformed data (Montesinos-López *et al.* 2015a, 2015b). This does not take into account that: (a) many distributions of count data are positively skewed, many observations in the data set have a value of 0, and the high number of 0s in the data set does not allow a skewed distribution to be transformed into a normal one (Yaacob *et al.* 2010); and (b) it is quite likely that the regression model will produce negative predicted values, which are theoretically impossible (Yaacob *et al.* 2010; Stroup 2015). When transformation is used, it is not always possible to have normally distributed data, and often transformations not only do not help, they are counterproductive. There is also mounting evidence that transformations do more harm than good for the models required by the vast majority of contemporary plant and soil science researchers (Stroup 2015). To the best of our knowledge, only the paper of Montesinos-López *et al.* (2015c) is appropriate for genomic prediction of count data in a Bayesian framework; however, it does not take into account interaction.

In this paper, we extend the negative binomial (NB) regression model for counts proposed by Montesinos-López *et al.* (2015c) to take into account by using a data augmentation approach. A Gibbs sampler was derived since all full conditional distributions were obtained, which allows samples to be drawn from them to estimate the required parameters. In addition, we provide all the details of the efficient derived Gibbs sampler so that it can be implemented easily by most plant and animal scientists. We illustrate our proposed methods with a simulated data set and a real data set on wheat *Fusarium* head blight. We compare our proposed models (NB and Poisson) with the Normal and Log-Normal models commonly implemented for analyzing count data. We also provide R code for implementing the proposed models.

## Materials and Methods

The data used in this study were taken from a PhD thesis (Falconi-Castillo 2014) aimed at identifying sources of resistance to *Fusarium* head blight (FHB), caused by *Fusarium graminearum*, and at identifying genomic regions and molecular markers linked to FHB resistance through association analysis.

### Experimental data

#### Phenotypic data:

A total of 297 spring wheat lines developed by the International Maize and Wheat Improvement Center (CIMMYT) was assembled and evaluated for resistance to *F. graminearum*. Phenotyping was done at CIMMYT’s El Batan experimental station in Mexico over two years (2012 and 2014), and at the Santa Catalina Experimental Station of the National Institute for Agricultural Research (INIAP), Ecuador, for one year (2014). For the application, we considered these three environments, which we named Batan 2012, Batan 2014, and Ecuador 2014. In all the experiments (environments), the genotypes were arranged in a randomized complete block design, in which each plot comprised two 1-m double rows separated by a 0.25 m space. In Ecuador 2014, the nursery was inoculated with maize seeds infected with a local *F. graminearum* isolate (SC01). The inoculum was broadcast in the field at 3 and 2 wk before anthesis, at a rate of 50 g/m^{2}.

FHB severity data were collected shortly before maturity by counting symptomatic spikelets on 10 randomly selected spikes in each plot. In Mexico, plots were inoculated with a mixture of five *F. graminearum* isolates (CIMFU235, 702, 715, 720, and 770) at each line’s flowering period by spraying 30 ml of an *F. graminearum* macroconidial suspension (50,000 spores/ml) using a CO_{2}-powered backpack sprayer (model T R&D Sprayers, Opelousas, LA) calibrated to 40 psi. High humidity was maintained in the field by a mist irrigation system controlled by a programmable timer that applied 10 min of spray every hour from 9:00 to 20:00. FHB severity data were collected at 25 days after inoculation by counting spikelets showing FHB symptoms on 10 spikes that had been tagged at anthesis. In this study, we used only 182 spring wheat lines because we had complete marker information only for those lines.

#### Genotypic data:

DNA samples were extracted from young leaves (2- to 3-wk-old) taken from each line, using Wizard Genomic DNA purification (Promega) following the manufacturer’s protocol. DNA samples were genotyped using an Illumina 9K SNP chip with 8632 SNPs (Cavanagh *et al.* 2013). For a given marker, the genotype for the *i*th line was coded as the number of copies of a designated marker-specific allele carried by the *i*th line (absences equal to zero, and presents equal to one). SNP markers with unexpected genotype AB (heterozygous) were recoded as either AA or BB, based on the graphical interface visualization tool of GenomeStudio (Illumina) software. SNP markers that did not show clear clustering patterns were excluded. In addition, 66 simple sequence repeat (SSR) markers were screened. After filtering the markers for 0.05 minor allele frequency (MAF), and deleting markers with more than 10% of no calls, the final set of SNPs was 1635 SNPs.

### Data and software availability

The phenotypic (FHB) and genotypic (marker) data used in this study, as well as basic R codes (R Core Team 2015), for fitting the models can be downloaded directly from the repository at http://hdl.handle.net/11529/10575.

### Statistical models

We assume that, at each environment, the *J* genotypes were grown in a randomized complete block design, and we let represent the count response for the *t*th replication of the *j*th line in the *k*th block in the *i*th environment, with , and we propose the following combined linear predictor for the response variable:(1)where represents environment *i*, represent the effect of block *k* within environment *i*, is the marker effect of genotype *j*, and is the interaction between markers and the environment; since we have three environments (Batan 2012, Batan 2014, and Ecuador 2014), , since it is the number of lines under study, , since only two blocks are available per environment, and represents the number of replicates of each line in each block and environment but this was the same for all combinations of *i*, *j* and *k n* was 10 since 10 spikes were selected at random from each plot). The number of observations in each environment *i* is , while the total number of observations is . *IJ* is the product of the number of environments and number of lines. Four models were implemented using the linear predictor given in expression (1).

#### Model NB:

**Model NB** stands for model negative binomial and is defined by three distributions: , with *r* being the scale parameter, , and . Note that the NB distribution has expected value and variance and for . and were assumed known, with computed from marker as ; this matrix is called the Genomic Relationship Matrix (GRM) (VanRaden 2008). The matrix defines the covariance between individuals based on observed similarity at the genomic level, rather than on expected similarity based on pedigree, so that more accurate predictions of merit can be achieved. While is computed as of order *IJ* × *IJ* and ⊗ denotes the Kronecker product, means that we assume independence between environments.

#### Model Pois:

Model Poisson (**Model Pois**) is the same as **Model NB**, except that Poisson(). Since, according to Zhou *et al.* (2012) and Teerapabolarn and Jaioun (2014), the ** Model Pois** was implemented using the same method as **Model NB**, but fixing *r* to a large value, depending on the mean count. We used which is a good choice when the mean count is less than 50 (see Figure 1). However, when the count is between 50 and 200, we suggest using , and, when the count is larger than 200, we suggest a value of or larger. These suggestions are supported by Figure 1, where we plot the mean and variance of **Model NB** as a function of the scale parameter *r*, with three values of *r* (1000; 5000; 10,000). Good approximations to the **Model Pois** with the **Model NB** occur when the mean and variance are very similar. For this reason, good approximations are those that follow the diagonal in Figure 1 where . We can see that the mean count and variances are very similar for mean counts of less than 50 with ; however, when the mean count is larger than 50 and less than 200, we should use , and for counts greater than 200, we suggest using a value of or larger. In our applications with simulated and real data, the mean count is less than 50; for this reason, we used a value of .

#### Model Normal:

**Model Normal** is similar to **Model NB,** except that with identity link function , and is the scale parameter of the normal distribution and is associated with the residual in the *i* environment, *k* block, *j* line and replication *t*. The parameter must be estimated since the Normal distribution, Log-normal distribution, and the Negative binomial distribution belong to the two-parameter exponential family, while the Poisson distribution belongs to the one-parameter exponential family. For this reason, only the need to be estimated since the mean is equal to the variance. However, the scale parameter in the NB distribution is represented by *r*.

#### Model LN:

Model Log-Normal (**Model LN**) is similar to **Model NB**, except that with identity link function and is the scale parameter associated with the residual in the *i* environmment, *k* block, *j* line, and *t* replication.

When the number of markers is larger than the number of observations , implementing **Models NB** and **Pois** is challenging. For this reason, we propose a Bayesian method for dealing with situations when and our model takes into account all markers through the GRM described above. **Models Normal** and **LN** were implemented in the BGLR package of de los Campos *et al.* (2014). Therefore, our proposed Bayesian model for count data is a so-called Genomic Best Linear Unbiased Prediction (GBLUP) method, since it utilizes genomic relationships to predict the genetic value of an individual.

##### Bayesian mixed negative binomial regression:

Rewriting the linear predictor (1) as , with , where and are indicator variables that take the value of 1 if the observed environment *i* is 1, 2, and 3, respectively, and 0 otherwise, and are indicator variables that take the value of 1 if the block *k* is observed within environment *i*, and 0 otherwise. where the first three beta coefficients belong to the effects of environment, and the last six beta coefficients correspond to the blocks effects in each of the environments (that is, **β** is a vector of beta coefficients of order , with *p*=I+I). Therefore, , and . Note that, under the **Model NB**, because , conditionally on and , the probability that the random variable takes the value is equal to(2)We arrive at Equation (2) since we make , with , and . Therefore, in Equation (2), we have the connection between the probability distribution of the response ( induced by the assumed relation between the linear predictor ( and the expected value of ( under the **model NB**. Then we can rewrite the given in Equation (2) as:(3)Expression (3) was obtained using the following equality given by Polson *et al.* (2013): , where and denotes the density of the Pólya-Gamma distribution ( with parameters *b* and [] (see Definition 1 in Polson *et al.* 2013). From here, conditioning on , we have that (4)To be able to get the full conditional distributions, we provide the prior distributions, for all the unknown model parameters , , , , , *r*). We assume prior independence between the parameters, that is,We assign conditionally conjugate but weakly informative prior distributions to the parameters because we have no prior information. Prior specification in terms of instead of **β** is for convenience. We adopt proper priors with known hyper-parameters whose values we specify in model implementation to guarantee proper posteriors. We assume that denotes a scaled inverse chi-square distribution with shape and scale parameters, , , , , and . Next we combine (Equation 4) using all data with priors to get the full conditional distribution for parameters , , , , , and *r*.

##### Full conditional distributions:

The full conditional distribution of is given as:(5)where , , , , , , , , , , , , , , , , , , , , , and , where indicates the horizontal Kronecker product between and The horizontal Kronecker product performs a Kronecker product of and ** X**, and creates a new matrix by stacking these row vectors into a matrix. and

**must have the same number of rows, which is also the same number of rows in the result matrix. The number of columns in the result matrix is equal to the product of the number of columns in and**

*X**. When the prior for constant, the posterior distribution of is also normally distributed, , but we set the term to zero in both and .*

**X**The fully conditional distribution of is(6)Defining , the conditional distribution of is given as(7)with . Similarly, by defining , the conditional distribution of is(8)where .

The fully conditional distribution of for is(9)with =*J* and =*IJ*.

The conditional distribution of is (10)Taking advantage of the fact that the NB distribution can also be generated using a Poisson representation (Quenouille 1949) as and is independent of , where Log and Pois denote logarithmic and Poisson distributions, respectively. Then, we infer a latent count *L* for each conditional on *Y* and *r*. Therefore, following Zhou *et al.* (2012), we obtain the full conditional of *r* by alternating(11)(12)where denotes a Chinese restaurant table (CRT) count random variable that can be generated as where . For details of the CRT random variable derivation, see Zhou and Carin (2012, 2015).

#### Gibbs sampler:

The Gibbs sampler for the latent parameters of the NB with can be implemented by sampling repeatedly from the following loop:

Sample values from the Pólya-Gamma distribution in (6).

Sample from (12).

Sample the scale parameter from the gamma distribution in (11).

Sample the location effects ( from the normal distribution in (5).

Sample the random effects from the normal distribution in (7).

Sample the random effects from the normal distribution in (8).

Sample the variance effects ( with from the scaled inverted distribution in (9).

Sample the variance effect ( from the scaled inverted distribution in (10).

Return to step 1 or terminate when chain length is adequate to meet convergence diagnostics.

#### Model implementation:

The Gibbs sampler described above for the BMNB model was implemented in R-Core Team (2015). Implementation was done under a Bayesian approach using Markov Chain Monte Carlo (MCMC) through the Gibbs sampler algorithm, which samples sequentially from the full conditional distribution until it reaches a stationary process, converging with the joint posterior distribution (Gelfand and Smith 1990). To decrease the potential impact of MCMC errors on prediction accuracy, we performed a total of 60,000 iterations, with a burn-in of 30,000, so that 30,000 samples were used for inference. We did not apply thinning of the chains following the suggestions of Geyer (1992), MacEachern and Berliner (1994), and Link and Eaton (2012), who provide justification of the ban on subsampling MCMC output for approximating simple features of the target distribution (*e.g.*, means, variances, and percentiles). We implemented the prior specification given in the section *Bayesian mixed negative binomial regression* with , where is the GRM, that is, the covariance matrix of the random effects, , is the covariance matrix of the random effects that belong to the term, and All these hyper-parameters were chosen to lead weakly informative priors. The convergence of the MCMC chains was monitored using trace plots and autocorrelation functions. We also conducted a sensitivity analysis on the use of the inverse gamma priors for the variance components, and we observed that the results are robust under different choices of priors.

#### Assessing prediction accuracy:

We used cross-validation to compare the prediction accuracy of the proposed models for count phenotypes. We implemented a 10-fold cross-validation, that is, the data set was divided into 10 mutually exclusive subsets; each time we used nine subsets for the training set, and the remaining one for the validation set. The training set was used to fit the model, and the validation set was used to evaluate the prediction accuracy of the proposed models. To compare the prediction accuracy of the proposed models, we calculated the Spearman correlation (Cor) and the mean square error of prediction (MSEP), both calculated using the observed and predicted response variables of the validation set. Models with large values of Cor indicate better prediction accuracy, while small MSEP indicate better prediction performance. The predicted observations, , were calculated with *M* collected Gibbs samples after discarding those of the burn-in period. For **Models NB** and **Pois,** the predicted values were calculated as , where , , and are estimates of , and , for line *j*, block *k*, in environment *i* obtained in the s*th* collected sample. For **Model Normal** as , and for **Model LN**, the predicted observations were calculated as , using the corresponding estimates of each model.

#### Simulation study:

To show the performance of the proposed Gibbs sampler for count phenotypes that takes into account , we performed a simulation study under model (1) with the following linear predictor: with two scenarios (S1 and S2). Scenario 1 had three environments (), 20 genotypes (, , and with four different numbers of replicates of each genotype in each environment, 5, 10, 20, and 40. Scenario 2 is equal to scenario 1, except that , where is a square matrix of ones of the order . In this second scenario, we imitated the correlation between lines of real data available in genomic selection. The priors used for the simulation study in both scenarios (S1 and S2) were approximately flat for all parameters: for , for , for and a , while for , and for . We computed 20,000 MCMC samples; Bayes estimates were computed with 10,000 samples, since the first 10,000 were discarded as burn-in. We report average estimates obtained by using the proposed Gibbs sampler, along with SD (Table 1). All the results in Table 1 are based on 50 replications.

## Results

Table 1 list the results of the simulation study of both scenarios (S1 and S2). The bias when estimating the parameters is a little larger in S1 compared to S2. Also, parameter is the parameter with larger bias (underestimated). Both variances (, ) are overestimated in scenario 1, but only is overestimated in scenario 2. Also, with a sample size of , parameter *r* had a larger SD; however, for larger sample sizes (, the SD were considerably reduced. In general, there was not a large reduction in SD when the sample size increased from 5 to 10, 20, and 40, the exception being the estimation of *r* in both scenarios, and the estimation of in S1, where there was a large reduction in SD when the sample size increased. Although estimations do not totally agree with the true values of the parameters, the proposed Gibbs sampler for count data, which takes into account , did a good job of estimating the parameters, since the estimates are close to the true values with a SD of reasonable size.

In all the experiments (environments) using the real data set, the genotypes were arranged in a randomized complete block design with two blocks; thus the linear predictor used was that given in Equation (1). Using the real data set, we compared four scenarios (S1–S4, given in Table 2) for each model. Table 2 shows that, in the linear predictor, S1 and S2 do not take into account interaction effects between genotypes and environments, only the main effects of these factors. Also, S1 and S3 do not use marker information. These four scenarios were studied to investigate the gain in model fit and prediction ability taking into account the interaction effects, and using the marker information available.

The posterior means (Mean), posterior SD of the scalar parameters, and posterior predictive checks for each scenario of the proposed models are given in Table 3. For the four models, the posterior means of the beta regression coefficients, variance components, and overdispersion parameters (*r*) are similar between S1 and S2, and between S3 and S4. In terms of goodness-of-fit measured by the loglikelihood posterior mean (Loglik), the scenarios rank as follows: S3, rank 1; S4, rank 2; S1, rank 3; and S2, rank 4, for the four proposed models, with the exception of **Model Pois**, where the ranking was S3, rank 1; S4, rank 2; S2, rank 3; and S1, rank 4. Therefore, there is evidence that, with the four proposed models in terms of goodness-of-fit, the best scenario is S3. Of the four models under study, Table 3 shows that **Model LN** reports the best fit since it has the largest Loglik.

Table 4 presents the mean and SD of the posterior predictive checks (Cor and MSEP) for each location (Batan 2012, Batan 2014, and Ecuador 2014) resulting from the 10-fold cross-validation implemented for the four models and four scenarios. The predictive checks given in Table 4 were calculated using the testing set. In **Model NB**, according to the Spearman correlation, the ranking of scenarios was as follows: in Batan 2012, 1 for S4, 2 for S3, 3 for S1, and 4 for S2. In Batan 2014, the ranking was 1 for S4, 2 for S3 and 3 for S1 and S2. In Ecuador 2014, the ranking was 1 S3, 2 for S2, 3 for S1, and 4 for S4. With the MSEP, the ranking for **Model NB** in Batan 2012 was 1 for S3, 2 for S4, 3 for S1 and 4 for S2. In Batan 2014, the ranking was 1 for S2, 2 for S1, 3 for S3, and 4 for S4. In Ecuador 2014, the ranking in terms of MSEP was 1 for S3, 2 for S2, 3 for S4, and 4 for S1. Under **Model Pois,** the ranking of the four scenarios in each locality was exactly the same as the ranking reported for **Model NB**. For **Model Normal** in terms of the Spearman correlation, S1 was the best in prediction accuracy in Batan 2012, while scenario 4 was the worst in all three locations. In terms of MSEP, the best scenario was S3 in Batan 2012 and Ecuador 2014, and the worst was S4 in Batan 2014 and Ecuador 2014. For **Model LN** in terms of the Spearman correlation, the best scenarios were scenarios S1, S2 and S3 and the worst was S4 in Batan 2012. In Batan 2014, the best scenario was S1, then scenario S3 and the worst was scenario S4. In Ecuador 2014, the best scenario was scenario S1 and S3, then S2 and S4. In terms of MSEP for Batan 2012, the best scenario was S3, then S1 and S2 and the worst was S4. In Batan 2014, the best scenario was S1, then S2 and the worst was scenario S4. Finally, in Ecuador 2014, the best scenario was S3, then S2 and the worst was scenario S1.

Table 5 gives the average of the ranks of the two posterior predictive checks (Cor and MSEP) that were used. Since we are comparing four scenarios for each model, the values of the ranks range from 1 to 4, and the lower the values, the better the scenario. For ties, we assigned the average of the ranges that would have been assigned had there been no ties. Table 5 shows that the best scenarios were S3 and S4 under **Models NB** and **Pois** in Batan 2012. In Batan 2014, under **Models NB** and **Pois**, the best scenario was S2, while in Ecuador 2014, the best scenarios were S3. Under **Model Normal**, the best scenario was S1 in Batan 2014 S1 and S3 in Ecuador 2014, while in Batan 2012, the best scenarios were S2 and S3. Finally, under **Model LN**, the best scenario was S3 in Ecuador 2014, S3 in Batan 2012 and S1 in Batan 2014.

Results in Table 4 and Table 5 indicate that the best models, in terms of prediction accuracy, are **Models NB** and **Pois**, since they had better predictions in the validation set based on both posterior predictive checks (Cor and MSEP) implemented, although, in terms of goodness-of-fit, **Model LN** was the best. These results are in partial agreement with the findings of Montesinos-Lopez *et al.* (2015c), who came to the conclusion that **Models NB** and **Pois** are good alternatives for modeling count data, although in this study, the best predictions were produced by **Model LN**. However, this model did not take into account interaction.

## Discussion

Generalized linear mixed models (GLMM) are widely recognized as one of the major methodological developments of the second half of the twentieth century. The main factor contributing to the success of their wide applicability over the last 30 years or so has been their flexibility, since they can be applied to many different types of data (Berridge and Crouchley 2011). These types of data include continuous interval/scale, categorical (including binary and ordinal) data, count data, beta data, and others. Each member of the GLMM family is appropriate for a specific type of data (Berridge and Crouchley 2011). However, GLMM for non-normal data are scarce in the context of genome-enabled prediction, since most of the models developed so far are linear mixed models (mixed models for Gaussian data). For this reason, we believe that developing specific methods for count data for genome-enabled prediction can help to improve the selection of candidate genotypes early when the phenotypes are counts. Because using transformation to approximate the counts to normality, or assuming that the counts are normally distributed, frequently produces poor parameter estimates and lower power. Also, parameter interpretation is more difficult when transformation is used (Stroup 2015). However, in genomic selection, phenotypic data (dependent variable) are not currently taken into account before deciding on the modeling approach to be used, mainly due to the lack of genome-enabled prediction models for non-normal phenotypes. Although our proposed Bayesian regression models are only for count data, they help fill this lack of genome-enabled prediction models for non-normal data.

Another advantage of our proposed methods for count data is that they take into account the nonlinear relationship between responses, and consider the specific properties of counts, including discreteness, non-negativity, and overdispersion (variance greater than the mean); this guarantees that the predictive response will not be negative, which makes no sense for count data. In addition, our methods help modeling for count data in the context of genome-enabled prediction, which plays a central role in plant breeding for the selection of candidate genotypes that present high stability over a wide range of environmental conditions, and for the prediction of yet-to-be observed phenotypes when the relative performance of genotypes varies across environments.

Another advantage of our proposed method is that the proposed Gibbs sampler has an analytical solution because we were able to obtain all the analytically required full conditional distributions. This is important, because, of all the computational intensive methods for fitting complex multilevel models, the Gibbs sampler is the most popular due to its simplicity and ability to effectively generate samples from high-dimensional probability distributions (Park and van Dyk 2009). This was possible because we constructed our Gibbs sampler using the data augmentation approach proposed by Polson *et al.* (2013). For this reason, we believe it is an attractive alternative for fitting complex count data that arise in the context of genomic selection.

Our proposed methods showed superior performance in terms of prediction accuracy compared to **Models Normal** and **LN**. Also, we observed that, in **Models NB** and **Pois**, taking into account considerably increased the prediction accuracy, which was expected since there is enough scientific evidence that including interaction improves prediction accuracy. However, to use these models correctly, it is important to first understand the types of data we have before deciding on the modeling approach to be used. If the phenotypic data are normally distributed, the linear mixed models for genome-enabled prediction developed so far for Gaussian phenotypes should be used. If the phenotypic data are binary or categorical ordinal, the methods proposed by Montesinos-López *et al.* (2015a, 2015b) developed for ordinal data for genome-enabled prediction are preferred. If the phenotypic data are counts (number of panicles per plant, number of seeds per panicle, weed count per plot, etc.), and the counts are small, the models developed in this study, and those proposed by Montesinos-López *et al.* (2015c), are the best option, since they have more advantages over the conventional linear mixed models with Gaussian response, as was observed when we applied them to the real data set. We also need to keep in mind that **Model Pois** will be enough when the equi-dispersion (equality of mean and variance) is supported by the data at hand. However, when this assumption is violated, and the variance of the counts exceeds the mean count, overdispersion is present; in this situation, the most appropriate model is the NB model because it can control the overdispersion with the scale parameter , and improve parameter estimates, power, and predictions (Yaacob *et al.* 2010). Finally, more research is needed to study the proposed methods using other real data sets, and extend the proposed genomic-enabled prediction models to deal with the large number of zeros in count response variables, and for modeling multiple traits.

## Acknowledgments

We very much appreciate the International Maize and Wheat Improvement Center (CIMMYT) field collaborators, laboratory assistants, and technicians who collected the phenotypic and genotypic data used in this study.

## Appendix A

### Derivation of full conditional distribution for all parameters.

#### Full conditional for

where .

#### Full conditional for

#### Full conditional for *:*

Defining the conditional distribution of is given aswhere.

#### Full conditional for *:*

Defining , the conditional distribution of is given aswhere.

#### Full conditional for

with J, IJ and

#### Full conditional for

#### Full conditional for *r*

To make the inference of *r*, we first place a gamma prior on it as . Then we infer a latent count L for each count conditional on *Y* and *r*. To derive the full conditional of *r*, we use the following parameterization of the NB distribution: with . Since by construction we can use the Gamma-Poisson conjugacy to update *r*. Therefore,(A5)According to Zhou *et al.* (2012), the conditional posterior distribution of is a Chinese restaurant table (CRT) count random variable. That is, and we can sample it as where For details of the CRT random variable derivation, see Zhou and Carin (2012, 2015).

## Appendix B

### The Pólya-Gamma distribution

According to Polson *et al.* (2013), random variable ω has a Pólya-Gamma distribution with parameters and , denoted if (B1)where ∼ Gamma(b, 1) are independent gamma random variables, and D = indicates equality in distribution (Polson *et al.* 2013). However, it is not easy to simulate Pólya-Gamma random variables from Equation (B1), which is a sum of gamma random variables. To avoid the difficulties that can result from truncating the infinite sum given in Equation (B1), the density of a Pólya-Gamma random variable is expressed as an alternating-sign sum of inverse-Gaussian densities as:(B2)where denotes the hyperbolic cosine. A further useful fact is that all finite moments of a Pólya-Gamma random variable are available in closed form (Polson *et al.* 2013). In particular, the expectation may be calculated directly, and is equal to , where tanh denotes the hyperbolic tangent. Also, the Pólya-Gamma distribution is closed under convolution for random variates with the same scale parameter, given that if , and if are independent, then (Polson *et al.* 2013). This is used to construct the proposed Gibbs sampler. More details on simulating Pólya-Gamma random variates can be found in section 4 of the paper of Polson *et al.* (2013). Also, it is important to point out that this method for simulating Pólya-Gamma random variables is implemented in the R package Bayeslogit.

## Footnotes

*Communicating editor: D. J. de Koning*

- Received December 19, 2015.
- Accepted February 17, 2016.

- Copyright © 2016 Montesinos-López
*et al*

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.