- Split View
-
Views
-
Cite
Cite
Yanyan Liu, Sican Xiong, Wei Sun, Fei Zou, Joint Analysis of Strain and Parent-of-Origin Effects for Recombinant Inbred Intercrosses Generated from Multiparent Populations with the Collaborative Cross as an Example, G3 Genes|Genomes|Genetics, Volume 8, Issue 2, 1 February 2018, Pages 599–605, https://doi.org/10.1534/g3.117.300483
- Share Icon Share
Abstract
Multiparent populations (MPP) have become popular resources for complex trait mapping because of their wider allelic diversity and larger population size compared with traditional two-way recombinant inbred (RI) strains. In mice, the collaborative cross (CC) is one of the most popular MPP and is derived from eight genetically diverse inbred founder strains. The strategy of generating RI intercrosses (RIX) from MPP in general and from the CC in particular can produce a large number of completely reproducible heterozygote genomes that better represent the (outbred) human population. Since both maternal and paternal haplotypes of each RIX are readily available, RIX is a powerful resource for studying both standing genetic and epigenetic variations of complex traits, in particular, the parent-of-origin (PoO) effects, which are important contributors to many complex traits. Furthermore, most complex traits are affected by >1 genes, where multiple quantitative trait locus mapping could be more advantageous. In this paper, for MPP-RIX data but taking CC-RIX as a working example, we propose a general Bayesian variable selection procedure to simultaneously search for multiple genes with founder allelic effects and PoO effects. The proposed model respects the complex relationship among RIX samples, and the performance of the proposed method is examined by extensive simulations.
Most human genes have functional mouse counterparts, and genomes of both organisms are organized similarly. Thus, mouse serves as a good model organism for complex human diseases. Recombinant inbred (RI) mice are among the major mouse resources in biomedical and genetic research. However, traditional mouse RI lines are derived from only two inbred parental strains, with limited a number of lines available, resulting in a low percentage (15%) of genetic variation across all mouse inbred strains (Yuan et al. 2011) and extensive blind spots where a sizable proportion of the genome is identical by descent. These limitations make traditional RI insufficient for studying complex traits. With the need for more powerful resources, multiparent populations (MPP) (de Koning and McIntyre 2017), a set of inbred lines using multiple lines as founders, can overcome the limitations of traditional RI lines and have become an innovative tool for fine quantitative trait locus (QTL) mapping. In the past 15 yr, different kinds of MPP have been established in plants and animals, such as nested association mapping design (Yu et al. 2008) and multiparent advanced generation intercross (MAGIC) populations (Cavanagh et al. 2008; Kover et al. 2009; Huang et al. 2015; Ladejobi et al. 2016) in plants, and the Drosophila Synthetic Population Resource (King et al. 2012) in animals. The mouse MPP include the collaborative cross (CC) (Complex Trait Consortium 2004), in which a genetically diverse set of eight inbred strains were selected as breeding founders (Iraqi et al. 2008) to generate a large number of RI lines. The eight founder strains were predicted to represent ∼90% of the genetic variation presented in laboratory mice (Threadgill and Churchill 2012). The CC thus greatly overcomes the limitations of the traditional mouse RI lines, and is the only mammalian resource with high genome-wide genetic variation that is uniformly distributed across a large, heterogeneous, and infinitely reproducible population (Threadgill et al. 2011).
Through the generation of RI intercrosses (RIX) of RI lines, a large number of potential “outbred” RIX samples can be generated. That is, given L RI lines, or genetically distinct reciprocal or nonreciprocal individuals, or RIX, can be produced. An example of using CC-RIX to illustrate a distinct response to Ebola virus infection, by Rasmussen et al. (2014), was summarized by the editor: “the CC-RIX mice could prove valuable for preliminary screens of candidate therapeutics and vaccines.” Since all parental RI lines are isogenic at each locus, genotypes of RIX can be imputed in advance from those of their parental RI lines. Additional advantages of RIX can be found in Threadgill and Churchill (2012), in particular, its power in support of analysis of parent-of-origin (PoO) effects, where effects of certain alleles are different depending on whether those alleles are inherited maternally or paternally. PoO makes a significant contribution to the heritability of most complex traits (Mott et al. 2014). In addition, genomic PoO effects provide a great model to study epigenetic regulation of gene expression (Barlow 2011).
During the past 20 yr, QTL mapping methods, including analysis of variance (ANOVA), interval mapping (Lander and Botstein 1989), composite interval mapping (Jansen and Stam 1994; Zeng 1994), and multiple interval mapping (Kao and Zeng 1997; Kao et al. 1999), have been well developed using experimental crosses, such as backcross, , and RI (see Broman (2001) for reviews), and many excellent open software packages, such as QTLCart (Basten et al. 1999), MapManager (Manly and Olson 1999), and R/qtl (Broman et al. 2003), are freely available online. For diallel data obtained from the CC founder strains, Lenarcic et al. (2012) employed a general Bayesian model for decomposing phenotypic variance into biologically intuitive components. Crowley et al. (2014) also applied a Bayesian method to a diallel cross of the eight founder strains to estimate genome-wide genetic and PoO effects (not QTL mapping) on responses to haloperidol, an antipsychotic drug. Zhang et al. (2014) proposed a general single QTL Bayesian framework for MAGIC data to coherently estimate haplotype and diplotype effects of founder alleles. Similarly, for MAGIC data, Wei and Xu (2016) developed a single QTL mapping method with a random-effects model by treating the founder allelic effects of each locus as random, and scanning the entire genome one locus at a time using a likelihood ratio test. For RIX mice, a mixed-effects model was developed for modeling the unbalanced relatedness among them (Tsaih et al. 2005; Zou et al. 2005). Gong and Zou (2012) proposed a more flexible, nonparametric single QTL mapping method for detecting QTL with time-varying coefficients. Hallin et al. (2016) applied a random-effects model to map single QTL by partitioning the variance of growth traits across different environments of yeast strains into additive, dominance, and pairwise epistatic components. However, many complex traits are affected by >1 gene, and multiple QTL mapping may be more powerful than locus-by-locus analysis. For CC-RIX data, Yuan et al. (2011) constructed a mixed-effects model to simultaneously map multiple QTL, but treated the eight CC founder alleles as standard biallelic single-nucleotide polymorphisms (SNPs). For many complex traits, it is arguable that modeling the effects of the eight founder alleles could lead to improved mapping power (Collaborative Cross Consortium 2012; Vered et al. 2014). In addition, for complex traits where PoO effects are suspected, modeling such effects may further improve QTL mapping power (Lawson et al. 2013) and understanding of etiologies of complex traits (Threadgill and Churchill 2012).
In this paper, for MPP-RIX data, we develop a Bayesian variable selection procedure to simultaneously map multiple genes with founder allelic effects and PoO effects. We demonstrate our method with CC-RIX data, but the method is general enough for other MPP-RIX populations. We place parameter-expanded Gaussian (PeG) priors on both the random founder allelic effects and PoO effects for variable selection.
The paper is organized as follows. In the Statistical Method section, we first introduce the CC-RIX experiment, then propose a random-effects model and describe a Bayesian variable selection procedure. In the Simulation Study section, we perform extensive simulation to examine the proposed method. Summary comments are given in the Discussion section.
Statistical Method
The CC-RIX panel is the RI intercross of CC lines. For L CC RI lines, a total of reciprocal CC-RIX and nonreciprocal CC-RIX can be generated [see Figure 1 of Zou et al. (2005), Yuan et al. (2011)]. Let n be the total number of CC-RIX samples and p be the total number of genetic markers. Further, let the phenotype of individual i (the dependent variable) be
Model
At a given locus, PoO effects can only be estimated from individuals with heterogeneous genotypes, since for those with homozygote genotypes, the parental contributions cannot be distinguished. Label the eight-founder alleles A, B, C, D, E, F, G, and H by numbers 1–8, and let the element of be equal to 1 or −1 if the maternal or paternal allele of the ith subject at the jth locus equals the kth founder allele, and 0 otherwise. For those CC-RIX samples with a homozygous genotype at a given locus, such as AA or BB, we let Therefore, represents the PoO effects of the jth locus.
To enable selection of QTL and PoO effects, we reexpress the eight-dimensional vectors and as the expanded parameters and for respectively, such that and This parameter-expansion idea was first proposed by Kuo and Mallick (1998) and has an advantage in selecting important variables and shrinking coefficients.
Block Gibbs sampling algorithm for posterior computation
The specific priors above result in known marginal conditional distributions for all variables. The blockwise Gibbs sampling algorithm that we employ can be summarized as follows:
First, we initiate , and from uniform distribution then sample other parameters , and indicators and from their priors. We then perform the following block Gibbs sampling procedures. Superscripts and signify the Markov chain Monte Carlo (MCMC) iterations, and refers to the initial iteration.
- Step 1. Updating μ: is drawn from the normal distribution.
Step 2. Updating we divide the long vector into two blocks and corresponding to the selected () and unselected () predictors, respectively. We then sample from their priors, i.e., the eight-dimensional multivariate normal distribution with zero mean and covariance matrix and sample from the eight-dimensional multivariate normal distribution with mean and covariance matrix
Step 3. Updating similar to Step 2, we divide into two parts, and corresponding to the selected () and unselected () predictors, respectively. Then is sampled from their priors, i.e., the eight-dimensional multivariate normal distribution with zero mean and covariance matrix and is sampled from the eight-dimensional multivariate normal distribution with mean and covariance matrix
Step 4. Updating is drawn from the multivariate normal distribution with mean and covariance matrix
Step 5. Updating is sampled from the scale-inverted distribution,
Step 6. Updating is sampled from the scale-inverted distribution,
Step 7. Updating the random-effects variance is sampled from the scale-inverted distribution,
Step 8. Updating the residual variance is sampled from the scale-inverted distribution,
Step 9. Updating is sampled from the Bernoulli distribution with success probability where and
Step 10. Updating is sampled from the Bernoulli distribution with success probability where and
Step 11. Updating is sampled from the beta distribution
Step 12. Updating is sampled from the beta distribution
Data availability
Supplemental Material, File S1 contains five supplemental figures. The MATLAB code used to analyze the simulated data is provided in File S2.
Simulation Study
In this section, we run extensive simulations to evaluate the performance of the proposed Bayesian method. We apply the loop design in Zou et al. (2005) and Yuan et al. (2011) by ordering the L CC-RI lines randomly and forming them into a circle, and then mating each CC-RI line (clockwise) with the next three CC-RI lines after it (i.e., CC-RI1 mated with CC-RI2, CC-RI3, and CC-RI4; CC-RI2mated with CC-RI3, CC-RI, and CC-RI; and so on). For in this way, we can generate a total of CC-RIX samples. Parental CC-RI genotypes are simulated in R/qtl (Broman et al. 2003). All the simulation results are based on total 100 replications for each simulation setup.
In model (2), we set the overall mean μ = 1, the variance of random errors the polygenic effect variance , and Nineteen chromosomes each with a length of 70 cM are simulated, on which total p = 133, 266, and 1330 evenly spaced markers (resulting in 10-, 5-, and 1-cM intervals between nearby markers on each chromosome, named cases 1–3) are generated, corresponding to three marker density cases. Among the total p simulated markers, we randomly select five markers, and let the first three markers have QTL effects the last three markers have PoO effects. That is, the first two genes only have QTL effects, and the last two only have PoO effects. However, the middle-selected marker has both QTL and PoO effects. The corresponding variances of QTL or PoO effects of the selected markers are all set to 1.
To better assess the performance of our method in situations where multiple nearby SNPs jointly affect the outcome, for cases 1–3, we now let the number of causal SNPs in each of the first two QTL be 2 instead of 1. Specifically, we randomly select two nearby SNPs for each of the first two simulated QTL and denote the alleles of the two SNPs A and a. Based on the haplotype frequencies , and (without loss of generality, we assume that ), we create three haplotype allelic groups, where haplotype is group 1, is group 2, and the other two are group 3, and set the genetic effects of the three groups to 1, 2, and 3, respectively. The three new simulation setups are labeled as cases 1∗ to 3∗.
For each simulated data point, we generate a single long chain with 20,000 cycles, of which the first 10,000 cycles are discarded as burn-in, resulting in a total of 10,000 samples for post-MCMC analysis. All the analysis is done in MATLAB, and the MATLAB source code is submitted as supplemental material (File S2).
For comparison, we fit each simulated data point with the following three models.
The same model as (2), but without the PoO terms
All the priors are set to be the same as their counterparts in model (2). For convenience, we subsequently call model (3) the “mixed model”.
Linear mixed-effects model (LMM) for single gene mapping
Yuan’s Bayesian method
Here, is the effect of the jth putative QTL with , where equals −1, 0, or 1 if the putative QTL genotype is , or respectively. The other parameters are set the same as those in the mixed model. The prior of is set to 1 ≤ j ≤ p in Yuan et al. (2011).
To compare the methods, ROC curves (Fawcett 2006), where true positive rates (also known as sensitivity) are plotted against false positive rates (also known as 1-specificity) evaluated at various threshold cut-offs. To estimate the sensitivity (the proportion of positives that are correctly identified as such) and specificity (the proportion of negatives that are correctly identified as such), we define false and true positive findings as follows. If a detected locus falls no more than 10 cM apart from any simulated genes, we call it a true positive finding, otherwise a false positive finding. Then, combining the outputs of each model for the 100 data sets, we can calculate the corresponding sensitivity and specificity. Specifically, for the jth marker, we record the LOD scores, , for the LMM model; the maximum posterior frequency between and for our proposed method; the posterior frequency for the mixed model; and the posterior mean of for Yuan’s model. The corresponding area under the ROC curve (AUC) is also calculated for each of the four models, and the results are presented in Table 1. Generally speaking, a model with a higher AUC value indicates on average a better performance compared with those with lower AUC scores (Fawcett 2006).
Simulation settings and AUC values
Case . | ISa . | p . | b . | c . | d . | e . |
---|---|---|---|---|---|---|
1 | 10 | 133 | 0.9642 | 0.8499 | 0.5155 | 0.9436 |
2 | 5 | 266 | 0.9722 | 0.9022 | 0.5029 | 0.9463 |
3 | 1 | 1330 | 0.9915 | 0.9489 | 0.5000 | 0.9525 |
1* | 10 | 133 | 0.9499 | 0.6885 | 0.5085 | 0.9264 |
2* | 5 | 266 | 0.9624 | 0.9056 | 0.4998 | 0.9238 |
3* | 1 | 1330 | 0.9908 | 0.9393 | 0.5040 | 0.9482 |
Case . | ISa . | p . | b . | c . | d . | e . |
---|---|---|---|---|---|---|
1 | 10 | 133 | 0.9642 | 0.8499 | 0.5155 | 0.9436 |
2 | 5 | 266 | 0.9722 | 0.9022 | 0.5029 | 0.9463 |
3 | 1 | 1330 | 0.9915 | 0.9489 | 0.5000 | 0.9525 |
1* | 10 | 133 | 0.9499 | 0.6885 | 0.5085 | 0.9264 |
2* | 5 | 266 | 0.9624 | 0.9056 | 0.4998 | 0.9238 |
3* | 1 | 1330 | 0.9908 | 0.9393 | 0.5040 | 0.9482 |
Interval space (IS; cM) between nearby markers.
AUC values of our proposed model (2).
AUC values of the mixed model (3).
AUC values of Yuan’s model (5).
AUC values of the LMM model (4).
Case . | ISa . | p . | b . | c . | d . | e . |
---|---|---|---|---|---|---|
1 | 10 | 133 | 0.9642 | 0.8499 | 0.5155 | 0.9436 |
2 | 5 | 266 | 0.9722 | 0.9022 | 0.5029 | 0.9463 |
3 | 1 | 1330 | 0.9915 | 0.9489 | 0.5000 | 0.9525 |
1* | 10 | 133 | 0.9499 | 0.6885 | 0.5085 | 0.9264 |
2* | 5 | 266 | 0.9624 | 0.9056 | 0.4998 | 0.9238 |
3* | 1 | 1330 | 0.9908 | 0.9393 | 0.5040 | 0.9482 |
Case . | ISa . | p . | b . | c . | d . | e . |
---|---|---|---|---|---|---|
1 | 10 | 133 | 0.9642 | 0.8499 | 0.5155 | 0.9436 |
2 | 5 | 266 | 0.9722 | 0.9022 | 0.5029 | 0.9463 |
3 | 1 | 1330 | 0.9915 | 0.9489 | 0.5000 | 0.9525 |
1* | 10 | 133 | 0.9499 | 0.6885 | 0.5085 | 0.9264 |
2* | 5 | 266 | 0.9624 | 0.9056 | 0.4998 | 0.9238 |
3* | 1 | 1330 | 0.9908 | 0.9393 | 0.5040 | 0.9482 |
Interval space (IS; cM) between nearby markers.
AUC values of our proposed model (2).
AUC values of the mixed model (3).
AUC values of Yuan’s model (5).
AUC values of the LMM model (4).
From Table 1, it is clear that our proposed method outperforms the other three methods for all cases, regardless of whether there is a single causal SNP or multiple causal SNPs in each QTL. Yuan’s method fails in all the simulated cases, which is expected as the method only models biallelic SNP effects instead of the founder allelic effects that we simulate. The LMM model outperforms the mixed model, in particular when the marker density is sparse. This phenomenon is further confirmed by Figure 1, where the ROC curves of the proposed method are always higher than the ROC curves of the other three methods; the ROC curves of Yuan’s model fall along the 45-degree line, indicating its low power for mapping genes with founder allelic effects. The ROC curves of the LMM model fall between those of our proposed model and the mixed model in cases 1 and 2, and cases 1* and 2*, but cross with the ROC curves of the mixed model in cases 3 and 3*. Figure 1 and Table 1 show that the mapping precisions of all the methods increase as the number of markers increases, as expected.
Figure 2 presents the Manhattan plots of the four models based on the average estimate of a total of 100 replications across the whole genome for case 1. Clearly, our proposed method is the winner in mapping genes with founder allelic effects (marked by an asterisk) and PoO effects (marked by a circle). The Manhattan plots for cases 2 and 3, and cases 1*, 2*, and 3*, are included in Figures S1–S5 in File S1, respectively, and show similar patterns.
Discussion
The CC (Threadgill and Churchill 2012), a panel of newly emerged multiparent RI mouse strains, was developed, similar to other MPPs, to provide greater genetic diversity than the traditional RI populations and thereby to improve our power of understanding of complex traits. CC-RIX, F1 crosses that are generated from parental CC RI lines (Zou et al. 2005) can serve as excellent mouse models for mapping genes with both traditional genetic effects and epigenetic effects, such as imprinting. This study extends the model of Gong and Zou (2012) and Yuan et al. (2011) and, to the best of our knowledge, it is the first use of Bayesian variable selection methods to jointly map multiple QTL with both founder allelic effects and PoO effects for MPP-RIX data, in particular, CC-RIX data.
In this article, it is assumed that the RI lines are equally distanced from each other in terms of genetic distance which is a sensible assumption given the funnel design used for generating CC RI lines. However, we do observe that some RI lines share more or fewer founder alleles than expected. Our limited simulation shows that such genetic unbalance has a negligible effect on mapping genes, but this issue deserves further investigation. Alternatively, as genotypes of the parental RI lines are available, we could modify the design matrices , and in model (1) and replace them with local and genome-wide similarity matrices as done in sequence kernel association tests (SKAT) (Ionita-Laza et al. 2013) and genome-wide complex trait analysis (GCTA) (Yang et al. 2011). In addition, in our model we assume additive founder allelic effects. This assumption can be easily relaxed to allow genes with both additive and dominance effects (Woods et al. 2012; Zhang et al. 2014). Because dominant genetic effects are orthogonal to PoO effects, missing the dominant genetic effects does not cause any bias in the PoO effects estimate.
One of the drawbacks of Bayesian methods is their computational cost, especially for data with large numbers of samples and markers. Our model employs a block Gibbs sampler, which dramatically reduces computational time. However, further improvements may be possible. For example, instead of modeling each marker separately, we could jointly model multiple nearby markers and reduce the magnitude of p. Similar ideas have been proposed for genome-wide association study data (Wu et al. 2010; Lu et al. 2015). This approach is also biologically meaningful, since for some complex traits multiple causal SNPs may be located in a single region, and our simulations (cases 1*–3*) have shown that the proposed method is powerful in mapping regions with multiple causal SNPs. However, grouping nearby markers or SNPs may offer further help in improving mapping power, which deserves more thorough investigation.
Acknowledgments
We thank two anonymous referees for their valuable comments on our manuscript. This research was partially supported by NIH grant R01 GM105785, and by the scholarship from China Scholarship Council (CSC) under the Grant CSC No.201506270042.
Footnotes
Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.300483/-/DC1.
Communicating editor: D. J. de Koning