Abstract
Genotyping by sequencing allows for large-scale genetic analyses in plant species with no reference genome, but sets the challenge of sound inference in presence of uncertain genotypes. We report an imputation-based genome-wide association study (GWAS) in reed canarygrass (Phalaris arundinacea L., Phalaris caesia Nees), a cool-season grass species with potential as a biofuel crop. Our study involved two linkage populations and an association panel of 590 reed canarygrass genotypes. Plants were assayed for up to 5228 single nucleotide polymorphism markers and 35 traits. The genotypic markers were derived from low-depth sequencing with 78% missing data on average. To soundly infer marker-trait associations, multiple imputation (MI) was used: several imputes of the marker data were generated to reflect imputation uncertainty and association tests were performed on marker effects across imputes. A total of nine significant markers were identified, three of which showed significant homology with the Brachypodium dystachion genome. Because no physical map of the reed canarygrass genome was available, imputation was conducted using classification trees. In general, MI showed good consistency with the complete-case analysis and adequate control over imputation uncertainty. A gain in significance of marker effects was achieved through MI, but only for rare cases when missing data were <45%. In addition to providing insight into the genetic basis of important traits in reed canarygrass, this study presents one of the first applications of MI to genome-wide analyses and provides useful guidelines for conducting GWAS based on genotyping-by-sequencing data.
Perennial crops, which include herbaceous energy crops (HEC), are increasingly studied as potential significant sources of energy because of their environmental benefits and the increase in prices of petroleum. In its 2005 billion-ton supply report, the US Department of Agriculture (USDA) and the US Department of Energy (USDOE) set the goal of a 30% replacement of US petroleum consumption with biofuels by 2030. This goal implies a production of approximately 1 billion dry matter tons of biomass per year from forest and agricultural lands. According to the report’s projections and assumptions, achieving that objective will require the following: (i) the conversion of active croplands, pasture land, and lands under the conservation reserve program (CRP) into perennial crop lands and (ii) achieving biomass yields ranging between 5.5 and 8 dry tons per acre from perennial crops (between 12.4 and 19.8 Mg ha-1) (US Department of Agriculture and US Department of Energy, 2005). Plant breeding has an important part to play in achieving such yields. To efficiently select for higher biomass yield, selection may act on secondary traits such as plant height and flowering time (Price and Casler 2014), resistance to biotic stress, tiller density (Boe and Beck 2008), leaf area, and plant architecture. Biomass quality considerations, for conversion into bioenergy, bring another suite of traits to bear in HEC breeding (e.g., Vogel et al. 2011). Common methods for transforming biomass feedstock into energy include direct combustion, pyrolysis, and fermentation of sugars (soluble sugars, starch, cellulose, and hemicellulose) into ethanol (Wrobel et al. 2009).
Reed canarygrass (Phalaris arundinacea L., Phalaris caesia Nees) is a promising HEC in North America. It belongs to the tribe Avenae (sub-family Pooideae, family Poaceae), which includes the oat genus Avena (Quintanar et al. 2007). Of the grass model species, such as Brachypodium distachyon (Brachypodium), Oryza sativa (Rice), Zea mays (maize), and Sorghum bicolor (Sorghum), Brachypodium is the most closely related to reed canarygrass (Bouchenak-Khelladi et al. 2008). Reed canarygrass is a species complex that comprises two chromosomal races: the tetraploid race, Phalaris arundinacea L. (2n = 4x = 28), which is thought to be native to Europe, Asia, and North America (Jakubowski et al. 2012), and the hexaploid race, Phalaris caesia Nees (2n = 6x = 42) (McWilliam and Neal-Smith 1962; Baldini 1995). Most cultivars and wild accessions of P. arundinacea found in North America are of European ancestry (Casler et al. 2009a; Jakubowski et al. 2011). Reports of breeding efforts in P. arundinacea trace back to the early 20th century in North America (Casler 2010), but this species was already cultivated in the 18th century in Europe (Alway 1931). Some noncrop uses for reed canarygrass have been reported, such as phytoremediation (Picard et al. 2005), erosion control (Rice and Pinkerton 1993), and paper production (Pahkala and Pihala 2000). However, reed canarygrass has mostly been used as a forage crop. Consistently, most recent breeding efforts have focused on low alkaloid content for palatability to livestock, not on biomass yield (Wrobel et al. 2009).
Most of the research on HEC in the US has been focused on switchgrass (Panicum virgatum L.) (Sanderson et al. 1996), but reed canarygrass presents characteristics that may complement those of switchgrass; it is particularly tolerant to northern climates (Casler et al. 2009a), as well as to soil acidity, alkalinity, and moisture content (Bittman et al. 1980) and high levels of metals and minerals (Cureton et al. 1991). However, biomass yields in reed canarygrass are not very high. Based on trials in the US Midwest involving wild accessions and cultivars, genotype means for dry matter yield have been estimated to range from 7.6 to 10 Mg.ha-1 (Casler et al. 2009b). Also, quality tends to be lower in reed canarygrass than it is in switchgrass (Cherney et al. 1988). Nonetheless, judging from the significant genotypic variation observed in both biomass yield (Asay et al. 1968; Casler et al. 2009b) and quality (Carlson et al. 1996; Olmstead et al. 2013), improvement by selection of these primary biofuel traits is feasible.
Because no linkage map and no genome sequence are available in reed canarygrass, association studies have not been conducted in this species complex. However, genotyping by sequencing (GBS) provides the opportunity to call polymorphisms without prior marker development. With this technology, a genome-wide association study (GWAS) can be performed on reed canarygrass, but because no reference sequence is available, one limitation of such study is the inability to assign single nucleotide polymorphisms (SNP) to specific physical positions in the reed canarygrass genome. This has two important implications: (i) SNP-trait associations cannot be mapped to a particular region of the genome and (ii) no imputation method based on marker sequences, e.g., hidden Markov models (HMM) or sliding-window algorithms, may be used. However, other methods are available for imputing unordered marker data (Poland et al. 2012; Rutkoski et al. 2013). Another limitation of using SNPs derived from GBS is the high amount of missing values in marker data and, therefore, the importance of accounting for uncertainty in the imputation of marker genotypes.
The purpose of this study was to make statistically sound inferences about associations between GBS markers with a high frequency of missing values and biofuel traits (related to biomass yield and quality) in reed canarygrass. GWAS was performed under the assumption of disomic inheritance in reed canarygrass, which is suggested by the alloploidy of a closely related species, Phalaris aquatica (Carlson et al. 1996). To perform statistically valid association tests, it was necessary to avoid false positives, not only by controlling for population structure and familial relatedness (Yu et al. 2006) but also by accounting for imputation uncertainty. Our study exemplifies the use of multiple imputation, initially proposed by Rubin (1987), to account for this source of variability when making inferences. Classification tree models, shown by Poland et al. (2012) and Rutkoski et al. (2013) to be promising, were used under this framework to impute missing values without information on markers’ genomic location and order. After a description of the variability present in the panels under study, we examine the general behavior of inferences in multiple imputation compared to more traditional methods that would not fully account for imputation uncertainty. Then, we present our results of GWAS performed in a multiple-imputation framework and describe the significant markers in light of where they map onto the B. distachyon genome.
Materials and Methods
Populations
Three panels were assessed in this study: two randomly segregating populations derived from biparental crosses (LP1 and LP2) comprising 177 and 189 clones, respectively (Supporting Information, Table S1), and one association panel (AP) comprising 590 clones originating from North America and Europe (Table S1). LP1 and LP2 were derived from two distinct crosses involving genotypes of WR00, a tetraploid cultivar originating from Wisconsin. In AP, four populations (AR Upland, Superior, PI-284179, and PI-236525) were accessions of P. caesia (hexaploid). The AP subset consisting of only the 550 P. arundinaceae clones is referred to as AP-4x.
Clones were assessed in spaced-plant trials arranged in a Sets-in-Reps design with two replicates. The trials were performed in two locations: Arlington, Wisconsin (43.3°N, 89.4°W) and Ithaca, New York (42.6°N, 76.4°W). Soil type was Plano silt loam (fine-silty, mixed, mesic Typic Argiudoll) in Arlington and was Niagara silt loam (fine-silty, mixed, active, mesic Aeric Endoaqualf) in Ithaca.
Phenotypic data
Traits related to disease resistance (Ds), morphology [standability (St); leaf width (LW); leaf length (LL); total stem count (STC); plant height (PH); full height (FH)], phenology [heading date (HD); anthesis date (AD)], and quality [dry matter percentage (DM); neutral detergent fiber (NDF); acid detergent fiber (ADF); NDF digestibility (NDFD); acid insoluble ash (AIA); Klason lignin (Lignin); crude protein (CP); Ash (ASH); calcium (Ca); chlorine (Cl); copper (Cu); iron (Fe); potassium (K); magnesium (Mg); manganese (Mn); sodium (Na); phosphorus (P); sulfur (S); zinc (Zn); glucose (GLC); galactose (GAL); xylose (XYL); arabinose (ARA); GLC conversion efficiency (GLC_Eff); XYL conversion efficiency (XYL_Eff); energy content (BTU)] were determined on plants grown in Ithaca in 2009–2011 and in Arlington in 2010–2011 (Table 1). Quality traits were predicted from near-infrared reflectance spectroscopy (NIRS) measurement, using methodology described by Vogel et al. (2011). The NIRS prediction equations were developed on a diverse set of 110 reed canarygrass samples from the experiments described by Casler et al. (2009b). Wet-laboratory traits were determined for these samples using the procedures described by Vogel et al. (2011). The validity of predictions from the predictive models was verified by the extremely low frequency of outliers: nine biomass-quality samples with Mahalanobis distance more than three out of a total of >3300 samples (Shenk and Westerhaus 1991). The raw phenotypic data are available for download from http://dfrc.wisc.edu/sniper/.
Association analyses were performed on best linear unbiased estimations (BLUEs) of genotypes’ performance for a given trait, inferred from the following linear mixed model:where
is the measurements at one of the 35 traits considered,
is the genotypic value of clone i, modeled as fixed (to guarantee convergence of the fitting algorithm and to avoid assumptions about genotypes’ sampling). For all other terms, the corresponding effects were considered random, independent, and identically normally distributed:
is the effect of location j;
is the effect of year l within location j;
is the effect of replicate m within location j; × indicates interactions; and
is the effect of the plot at row r and column c, within an environment (year l within location j). Plot effects within environments were modeled as
, where
is the Kronecker product of the first-order autoregressive covariance matrices on rows and on columns, respectively. The mixed models were fitted using ASREML-R (Butler et al. 2007).
The matrix of genotypes’ BLUEs, computed as described above, can be downloaded from http://dfrc.wisc.edu/sniper/.
Marker data and quality control
Genome reduction, by ApeKI restriction, and sequencing were performed according to Elshire et al. (2011). Reduced DNA samples were sequenced on the Illumina HiSequation 2000, with 95 samples plus one negative control per lane. To simultaneously discover SNP markers and call genotypes, the UNEAK pipeline was used (http://www.maizegenetics.net/gbs-bioinformatics; Lu et al. 2013). This pipeline trims reads to a 64-bp length to limit sequencing errors and speedup computation, and discards markers based on a network filter designed to detect and eliminate markers showing complex relationships with others, which suggests paralogy and/or sequencing errors (Lu et al. 2013). A total of 29,313 SNPs were called out of the UNEAK pipeline. Marker genotypes were coded as allelic dosages: −1 for homozygotes at the reference allele, 0 for heterozygotes, and 1 for homozygotes at the alternate allele, assuming disomic inheritance in reed canarygrass (Carlson et al. 1996). The raw genotypic data are available for download from http://dfrc.wisc.edu/sniper/.
Markers were selected based on proportion of missing values (PMV) <0.90 in each of three panels, separately. It is typical to filter out GBS SNPs by a predetermined low missing rate (e.g., PMV <0.20). However, we did not filter these markers to avoid removing potentially useful information, and also because a rigorous study of the behavior of the MI approach at different missing rates is merited. After this first filtering step, 18,818 markers were retained. Marker variables were then discarded if they met one of the following criteria: (i) being “constant”, i.e., having a variance close to 0 (only 24 markers were discarded based on that criterion) or (ii) being “collinear”, i.e., being correlated by at least 0.999 (in absolute value) to some other marker variable with a smaller amount of missing values in the dataset. This filtering step was recommended by van Buuren and Groothuis-Oudshoorn (2011) and was implemented in the mice R package. This avoided overly conservative tests in GWAS due to inadequate adjustment for multiple testing on highly correlated test statistics and also ensured that multiple imputations were not biased, as a result of strong correlations among predictors, and appropriately reflected imputation uncertainty. At this point, 6138 markers were considered for further analyses. Figure 1 shows the distributions of PMV and MAF for marker data after the first filtering step (18,818 markers) and after the second filtering step (6138 markers). As expected, filtering for variability and noncollinearity preferentially discarded markers with high PMV and low MAF. Note that filtering out markers on PMV <0.80 across panels (which is still a very lenient criterion) would have resulted in only 3419 markers being considered for subsequent analyses (Table S2). Also, for markers not meeting the criteria of van Buuren and Groothuis-Oudshoorn (2011), estimates of marker effects from averaged imputed data ((AD); see Association analyses) tended to show larger deviations from those obtained based on nonmissing data only ((CC); see Association analyses), no matter what the imputation uncertainty was (Table S3, Figure S2). Finally, effects of discarded markers as estimated from multiple imputes ((MI); see Association analyses) tended to show excessively strong shrinkage in comparison with (AD) estimates, with little adjustment of estimates in response to imputation uncertainty (Table S4, Figure S3).
Summary statistics and distributions of proportions of missing values (PMV) and minor allele frequency (MAF) on markers retained after the first filtering step (“on PMV only”, i.e. PMV < 0.9 by panel; 18,818 markers) or after both filtering steps [“on PMV + variability and noncollinearity”, i.e. PMV < 0.9 by panel + filtering step recommended by van Buuren and Groothuis-Oudshoorn (2011), in which collinear and constant marker variables are discarded; 6138 markers]. Statistics are the mean, SD, and range for “on PMV only” / “on PMV + variability and noncollinearity”. Filtering for variability and noncollinearity preferentially discarded markers with high PMV and low MAF.
The high values of PMV among selected markers (78% of missing values on average) reflect low sequencing depth of the GBS. From the quality control step, the marker data matrix was produced, in which missing values were coded as NA. Matrix
, used as input to the imputation procedure, is available for download from http://dfrc.wisc.edu/sniper/.
Marker data imputation
General principle of multiple multivariate imputation:
Consider some incomplete data set , with
and
denoting the observed and missing data, respectively. Given an estimand of interest θ (e.g., a model parameter), multiple imputation (MI) aims at producing a correct Monte Carlo (MC) approximation of
from m imputes of
, with
denoting some imputation of
. One implication of the MC approximation being correct is that, given an impute
, the distribution
is well-approximated, i.e., the sampling model is correct. Another implication that is of particular concern when performing MI is that the distribution
is also well-approximated, i.e., the imputation procedure is proper (Rubin 1987), which is defined as follows: let
be an estimate of θ from a given impute, let W be the within-impute estimation variance of
, and let
be the among-impute variance of
(corrected for a finite number of imputes): (i)
, i.e., the average of
over imputes
is an unbiased estimator of
, the estimate of θ from the hypothetically complete dataset; (ii)
; and (iii)
(Rubin 1987; van Buuren 2012). Condition (i) implies correct assumptions regarding the imputation model and the source of missingness in the data. Typically, the missing-at-random assumption (no factor other than those accounted for in the imputation model caused missingness) (Rubin and Little 2002) is necessary to guarantee that condition (i) holds. Conditions (ii) and (iii) imply that inferences from MI are confidence-valid (van Buuren 2012), i.e., that the total variance of
is realized in a conservative way:
. In this study, we assess imputation models for unbiasedness, under specific assumptions [condition (i)], but we did not test imputation models for their ability to preserve variability in the data [condition (ii)] or to correctly reflect among-impute variability [condition (iii)].
When performing MI on data sets containing missing values at several variables (SNP markers), one must sample from the joint distribution of missing values at all variables. To achieve this, two strategies have been proposed: joint modeling (JM) and fully conditional specification (FCS). JM consists of sampling from the joint distribution directly, by ordinary MC (Rubin and Schafer 1990; Schafer 2010). This method is theoretically sound, but it requires complex model specifications and assumptions that, if not correct, may result in imputation bias. JM cannot accommodate tree-based approaches that have the advantage of being flexible and not requiring any model specification. FCS, however, implicitly samples from some hypothetical joint distribution by repeatedly sampling from the fully conditional distributions at each variable of interest using a Gibbs sampling scheme (van Buuren et al. 2006; van Buuren 2007). Because it relies on Markov chain MC (MCMC), FCS can be computationally costly. Also, there is no guarantee that the joint distribution actually exists, so FCS is usually described as a pseudo-Gibbs sampling procedure (Gelman and Raghunathan 2001; van Buuren et al. 2006; van Buuren 2007). However, simulation studies have suggested that FCS is robust to such issues (van Buuren et al. 2006). A critical incentive for using FCS rather than JM is that FCS allows for much more flexibility in the imputation procedure: each fully conditional distribution can be derived separately, using either parametric (e.g., linear regression) or nonparametric models (e.g., classification trees). This flexibility is particularly valuable when dealing with missing values at many variables, in which case JM may simply not be practical (Gelman and Raghunathan 2001).
Implementation of multiple imputation:
In this study, we sampled missing values from by FCS. In the pseudo-Gibbs sampling process, missing values at the
SNP were sampled from
based on a classification and regression tree model (CART) (Breiman et al. 1984). For unbiasedness, imputation of
relied on the missing-completely-at-random (MCAR) assumption, stating that missingness occurred at random, with no factor causing SNP genotypes to be systematically missing (Rubin and Little 2002). CART presents the advantages of not requiring explicit model specification and conveniently accommodating nonlinear effects.
The general algorithm was:
For impute
For
fill in missing values at the
marker by random draws
from
.
For
For
sample
from
Return
as
,
where m is the total number of imputes of , q is the number of marker variables, and L is the number of MCMC iterations.
The number of imputes m was set to 20, based on available computational and memory resources. The number of iterations L up to which the actual sampling occurred (i.e., the burn-in period) was set to 10, based on previous studies (Dai et al. 2006; Burgette and Reiter 2010) and available computational resources.
MI, based on CART, was implemented according to Doove et al. (2014) with packages mice (van Buuren and Groothuis-Oudshoorn 2011) and rpart (Therneau and Atkinson 1997) in the R programming language (R Development Core Team 2014). CARTs were fitted with pruning up to at least five donors per terminal node (i.e., no less than five observations at each leaf in the tree). For the MI algorithm to be computationally tractable, only a subset of SNPs was considered for fitting the CART models: for each SNP k, only the 500 SNPs showing the highest marginal mutual information with SNP k were considered as potential predictors.
MI implemented as described above took 1 d per imputation (chain of 10 iterations) on a workstation consisting of 24 Intel Xeon X7460 CPU processors at 2.66 GHz with 264 Gb of RAM. The procedure was parallelized on five threads.
Two types of marker data were generated from the multiple imputation procedure: the MI set of matrices and the average-dosage (AD) matrix
(i.e., each element in
was the mean of genotype codes across the
imputes). The array
and the matrix
from the imputation procedure can be downloaded from http://dfrc.wisc.edu/sniper/.
Association analyses
Data subsets:
Only tetraploid samples were considered for GWAS, and the following subsets of the data were examined separately when testing marker-trait associations: LP1, LP2, and AP-4x (AP panel with only P. arundinacea samples) consisting of individuals, respectively. After discarding marker variables with MAF <0.05 (as estimated from
), in each subset separately, 5024, 5096, and 5228 markers were assayed in association testing, respectively. Figure S1 shows, for each subset separately, the distribution of PMV and MAF for markers selected prior to conducting GWAS. The distributions are equivalent across subsets and similar to that observed with the larger set of 6138 markers, with the exception that the average MAF increased from 0.28 to 0.32.
Association model:
Within each of the three data subsets (LP1, LP2, and AP-4x), the linear mixed model of Yu et al. (2006) was fitted for each SNP k retained, using the P3D (population parameters previously determined) approximation for computational efficiency (Zhang et al. 2010; Kang et al. 2010):where
is the n-vector of clones’ genotypic values as described above;
,
, and
are the vectors of allelic dosage for SNP k and correspond to the three types of marker data described above;
is the effect of SNP k; Q is the matrix of the first t components used to account for population structure (Price et al. 2006) obtained from a principal component analysis (PCA) performed either on
or
; v is the vector of their effects; Z is the design matrix for relating observations to clones; u is the vector of random polygenic effects; K is the realized genetic relationship matrix estimated from either
or
; e is the vector of residuals; and I is the identity matrix.
and
were estimated by restricted maximum likelihood (REML). The R package rrBLUP (Endelman 2011) was used to calculate the realized relationship matrix K and estimate
and
. Subscript
refers to the subset of individuals for which there was no missing value at SNP k.
(CC) is the analysis restricted to complete cases (nonmissing values) for a given marker tested, with the approximation of Q and K estimated from used as fixed, to account for population structure and relatedness. In (AD), marker effects are assessed from
, considered as fixed. In (MI) and (MI*), marker effects are assessed from
and the variability across imputes
is accounted for. Relying on the MCAR assumption, (CC) served as a reference for assessing the consistency of estimates from (AD) or (MI): (AD) estimates departing too much from (CC) estimates were considered unreliable, especially when imputation uncertainty (γ; see below) was high. Estimates from (MI) were expected to show appropriate control over imputation uncertainty, i.e., shrinkage toward zero as γ increases. In (MI*), marker effects are assessed from
as in (MI), but variability across imputes for estimates of Q and K is also accounted for, through
and
, thus allowing for full control over imputation uncertainty in inferences regarding β. Although (CC), (AD), and (MI) are convenient to assess the impact of imputation uncertainty on β estimates, (MI*) should be the method permitting the most statistically sound inferences regarding marker-trait associations.
In AP-4x, population structure and relatedness were, respectively, accounted for by one principal component and by matrix calculated on AP-4x individuals. In LP1 and LP2, no principal component and no genetic relationship matrix were included in the GWAS model, then equivalent to a single-marker analysis model.
Combination of parameter estimates in MI and calculation of p values:
In (CC) and (AD), significance of marker-trait associations was assessed by performing an F-test for the regression coefficient estimate β, as described in Kang et al. (2008). In (MI) and (MI*), an F-test was performed in which the p value was (Rubin and Schenker 1986; van Buuren 2012):where
is the average estimate ofβ over imputations, with
being the estimate of β based on the
imputation;
is the (estimated) total variance of β estimates, partitioned into
, the average within-impute variance of β estimates, and
, the among-impute variance of β estimates, corrected by
for unbiasedness (
means “asymptotically equal”). Barnard and Rubin (1999) derived a formula for the number of denominator degrees of freedom v of the F-statistic:
where
, with
;
, with
being the number of degrees of freedom for the hypothetically complete dataset.
For a particular coefficient β, the uncertainty in its estimate due to imputation is characterized by γ, the fraction of information about β lacking due to missingness:with
(Rubin 1987; van Buuren 2012). γ reflects the dependency of the inferences about β on the imputation procedure. Although quite complex, the γ-statistic may simply be interpreted as λ, the proportion of among-impute variance in inferences, adjusted for a finite number of imputes (van Buuren 2012). In this study,
. According to Li et al. (1991),
, and
indicate a “modest,” “moderately large,” and “high” missing data problem, respectively.
False discovery rate and significance for marker-trait association:
The p values obtained from (CC), (AD), or (MI) were transformed into adjusted false discovery rates (FDR) using the method of Storey and Tibshirani (2003). Marker-trait associations for which FDR <0.1 in (CC) or (MI*) were deemed significant and considered for further analyses, although evidence from (MI*) was preferred because (CC) relied on the approximation of Q and K estimated from used as fixed in the GWAS model and also because (CC) was considered more prone to false positives or false negatives due to smaller sample sizes,
(see Discussion). The R package qvalue (Dabney et al. 2013) was used to compute FDR.
Results
Genotypic variability in the panels
Phenotypic traits:
Traits were measured in varying numbers of years and locations (Table 1). Some traits were measured in only one location (Ds, TSC, St, and Quality traits in AP, measured/predicted in Ithaca only) and/or during only 1 yr (Ds only in 2009 and Quality traits only in 2011). As a result, they would show an inflated genotypic variance. Ds, St, LL, LW, and all quality traits showed a much higher ratio of genotypic-to-phenotypic variance (H) in AP compared to LP1 and LP2. PH showed a higher H in LP1 and AP compared to LP2. AD showed a higher H in LP2 and AP compared to LP1 (Table 2).
Population structure:
To analyze population structure, principal component analysis (PCA) was performed on the average dosage matrix . For the whole dataset (panels LP1, LP2, and AP combined), the first four principal components (PC) were deemed relevant for describing population structure based on proportions of variance explained and grouping patterns on PCs (Figure 2). Grouping patterns in the whole dataset are consistent with race (first PC, distinguishing accessions that are Phalaris arundinacea from those that are Phalaris caesia), panel (second and third PCs), and geographical origin—Eastern Europe vs. North America or Western Europe (fourth PC). The fifth and sixth PCs do not seem to reflect any grouping and explain a small proportion of the total genotypic variation compared with the first four PCs.
Principal component analysis (PCA) on the three panels combined: first six components and the proportion of marker variation explained, in parentheses on axis labels. Colors refer to location of origin (as in Table S1). Shapes refer to the panel (as in Table S1). Data points for P. caesia clones are circled in black.
PCA was also performed for four other subsets of the data: LP1, LP2, AP, and AP-4x (AP panel with only P. arundinacea samples). The numbers of selected PCs for those subsets were 0, 0, 2, and 1, respectively (data not shown). The relevant PCs in AP were consistent with race and geographical origin, respectively, in the same way as in the whole data set. The relevant PC in AP-4x was consistent with geographical origin.
Analysis of imputation uncertainty
Consistency of PCs and genetic relationship coefficients across imputes:
Consistency of a variable in MI is defined as the correlation with the same variable in (AD), in absolute value, averaged over imputes. Such value aims at assessing to what extent it is appropriate to use Q and K, estimated from , as fixed, in (CC), (AD), and (MI) association analyses. In the whole dataset, K estimates seem very consistent over imputes (consistency of 0.90 ± 0.0002; Table 3), but although estimates of
(first PC) are quite consistent, the estimates at the subsequent PCs lose coherence across imputes, with the fourth PC having a consistency of only 0.043 (± 0.032). This result suggests that accounting for imputation uncertainty with respect to population structure variables—when those are estimated from marker data—is important, which implies that (MI*) should be the most appropriate type of analysis for assessing marker-trait associations. In the AP-4x subset, the consistency of the only relevant PC is 0.076 (± 0.067) and the consistency of genetic relationship coefficients is 0.934 (± 0.0002).
Consistency across imputation schemes of marker effects and significance:
From complete case (CC) to average dosage (AD):
Under the MCAR assumption, the relationship between and the trait of interest adequately reflects the effect of markers in the whole sample. That is, for any marker-trait association, the estimate
based on complete cases is unbiased with respect to
, the estimate of β based on the hypothetically complete data set. It follows from the MCAR assumption that imputations are unbiased if estimates of marker effects from (AD),
, are unbiased with respect to those from (CC),
. The plot (Figure 3A) and the regression analysis (Table 4) of
on
, across traits and markers, strongly suggest that there is no bias from (CC) to (AD). For marker-trait associations with low imputation uncertainty
, there is a close relationship between
and
(R2 = 0.94). However, for marker-trait associations with moderate-to-high imputation uncertainty
, large differences between
and
can be observed, but at random. Such differences indicate noise in inferred values of β and opportunities for false positives or false negatives (Figure 3B).
Concordance in inferences from (CC) to (AD) in the AP-4x subset of the data. The points’ size refers to the stability of inferences across imputes (1−γ). (A) Concordance across procedures and over all traits in marker effect estimates, standardized within each trait. (B) Concordance across procedures and over all traits in significance [−log10(p)]; green areas correspond to significant associations according to a Bonferroni correction on a single-trait basis (p < 9.56 × 10−6).


From average dosage (AD) to multiple imputation (MI):
The purpose of MI is to make sound inferences when one bases statistical analyses on imputed data. Therefore, MI should produce more conservative estimates when imputation uncertainty regarding marker-trait associations is high, in order to avoid declaring as significant associations that are due to “fortuitous” imputations. The plot (Figure 4A) and the regression analysis (Table 5) of
on
, across traits and markers, show that as γ increases, estimates of β tend to shrink toward zero, from (AD) to (MI). This behavior results in higher p values (lower significance) for those associations in which imputation uncertainty is high: whether associations inferred in (AD) are true or false, their p values will tend to fall above the significance threshold in (MI) if their among-impute variability is too large, but there is good agreement between (AD) and (MI) for low values of γ (Figure 4B). Only for
(no variation among impute) would there be no adjustment on
and
. In other words, one should rely on (AD) for inferences only if it can be assumed that the situation is “close enough” to the case where
. Clearly, this is not the case for these data. In Figure S4, the consistency of marker effects from (CC) to (MI) is shown.
Concordance in inferences from (AD) to (MI) in the AP-4x subset of the data. The points’ size refers to the stability of inferences across imputes (1−γ). (A) Concordance across procedures and over all traits in marker-effect estimates, standardized within each trait. (B) Concordance across procedures and over all traits in significance [−log10(p)]; green areas correspond to significant associations according to a Bonferroni correction on a single-trait basis (p < 9.56 × 10−6).


From multiple imputation (MI) to multiple imputation with full account of imputation uncertainty (MI*):
In (MI*), the marker data matrix X, but also the principal component and the genetic relationship matrices Q and K (both estimated from X) are allowed to vary over imputes. Although considering Q and K as fixed (estimated by and
) was convenient for studying the behavior of inferences from (CC) to (AD) to (MI), (MI*) should permit the safest inferences by appraising the imputation uncertainty in the imputed genotypes as well as in the resulting Q and K estimates. Figure 5A suggests concordance of β estimates from (MI) to (MI*), despite the relatively low consistency of Q estimates across imputes (Table 3). Also, marker effects with low imputation uncertainty seem less shrunk toward zero in (MI*), where the variability in Q and K estimates is accounted for. As a result, more significant associations could be detected in (MI*) than in (MI) (Figure 5B). Values of γ were quite consistent from (MI) to (MI*), with increased noise around perfect concordance for values of γ above 0.2 (Figure 6).
Concordance in inferences from (MI) to (MI*) in the AP-4x subset of the data. The points’ size refers to the stability of inferences across imputes (1−γ). (A) Concordance across procedures and over all traits in marker-effect estimates, standardized within each trait. (B) Concordance across procedures and over all traits in significance [−log10(p)]; green areas correspond to significant associations according to a Bonferroni correction on a single-trait basis (p < 9.56 × 10−6).
Concordance in imputation uncertainty, as reflected by the γ-statistic, over all traits in the AP-4x subset from (MI*) to (MI). The blue dashed lines correspond to the seemingly critical threshold of γ > 0.2, beyond which the γ-statistic loses coherence across procedures.
Imputation uncertainty and potential gains in power:
As Figure 7A suggests, for (i.e., markers with a modest missing data problem) there were generally small differences in significance
from (CC) to (MI*), but some opportunity for a few outstanding gains in significance, with increases of up to 8.08 in
(for the association between marker TP87762 and TSC; see next paragraph and Table 7). Assuming that the associations detected are true positives, this increase in significance may be considered an increase in power. This higher sensitivity in GWAS would logically come from a higher effective sample size due to imputation, with a modest missing data problem at the same time. For
, however, there was little benefit from imputation, with very few opportunities for higher sensitivity from (CC) to (MI*): most values for the differential in
are actually negative. Those results show again that a threshold of 0.2 for γ is a good guideline for setting apart inferences with excessively high imputation uncertainty.
(A) Relationship between γ, in (MI*), and the difference in significance from (CC) to (MI*) [Δ{−log10(p)}]. Decrease in significance [−log10(p)] seems to occur more often and with higher intensity for γ > 0.2; presumably, there would to be more opportunities for gaining detection power with γ < 0.2. (B–D) Potential factors affecting imputation uncertainty (γ): relationship between γ, in (MI*), and (B) PMV (proportion of missing values), (C) MAF (minor allele frequency), and (D) the average mutual information between one given marker and all other markers in the dataset. In purple are the smoothed curves obtained from thin plate regression (default smoother in mgcv R package; Wood 2003). The γ-statistics and p values are from analyses on the AP-4x subset.
Potential factors influencing imputation uncertainty:
Because there is good consistency between γ and the apparent usefulness of imputation for increasing power, it would be helpful to identify the factors that allow the analyst to effectively predict γ and determine how likely imputation is to generate gains in power. Figure 7, B–D, respectively, show the marginal relationships between γ and proportion of missing values (PMV), MAF, or the average mutual information (AMI) between given markers and all other markers in the data set. There seems to be a strong positive relationship between γ and PMV, with values of γ that are below 0.2 for markers that have up to approximately 25% of missing values. As PMV increases, the slope for γ decreases and the variability around the conditional mean, as determined by a smoothing curve, increases. There seems to be a weak correlation between γ and MAF, with markers having high MAF showing slightly higher imputation uncertainty, especially for MAF <0.1. However, this relationship might be an artifact from markers with low PMV, which tended, by happenstance, to have lower MAF. There seems to be some (nonlinear) relationship between γ and AMI, but it is not clear whether it is due to the fact that markers with very low AMI tended to have lower PMV on average: for AMI <0.02, average PMV is 0.45 (SD 0.32); for AMI >0.02, average PMV is 0.80 (SD 0.071). A simple additive model fitted to the data with regressed on PMV, MAF, and AMI suggests that that all three factors considered have a significant effect on imputation uncertainty: MAF (usually equivalent to marker genotype variance) and—to a much larger extent—PMV generate imputation uncertainty, whereas AMI reduces imputation noise (Table 6). The model fitted has some predictive value (r2CV, prediction reliability in 10-fold cross-validation, was 0.21), but a rather large part of the variation could not be accounted for (Table 6).
Association analyses
Significance of marker-trait associations:
Significant associations, for which FDR <0.1 in (CC) or (MI*), were detected in the AP-4x subset and in the LP2 subset; no significant association was detected in LP1. These involved nine markers (thereafter “significant markers”): TP140584; TP184396; TP191264; TP217634; TP268059; TP341988; TP477925; TP521945; and TP87762. The one association involving TP341988 was detected only in LP2 (TP341988 was not included in the analysis in AP-4x due to the threshold of MAF >0.05), whereas the others were detected only in AP-4x. As shown in Table 7, behavior of inferences from (CC) to (MI*) was highly dependent on γ: for values of γ above 0.3 (associations involving TP184396, TP191264, TP217634, TP268059, TP341988, and TP521945), marker effects estimated from (MI*) were closer to zero and significance of associations decreased, certainly as a result of the shrinkage of marker effects as well as the high among-impute variance. These marker-trait associations that lost significance from (CC) to (MI*) generally had a high PMV (above 0.69). However, associations with (involving TP140584, TP477925, and TP87762), characterized by lower PMV (below 0.34), were more prone to show higher significance from (CC) to (MI*), with similar (sometimes larger) estimated marker effects. In some cases, such as TP140584-TSC, TP477925-ARA, TP477925-Mg, TP477925-Ds, TP477925-P, TP477925-TSC,TP87762-ARA, and TP87762-Mg, associations were deemed significant in (MI*) but not in (CC). Such behavior suggests an increase in detection power despite the uncertainty associated with imputation. Those associations generally show consistency from (CC) to (MI*), which would bring evidence for unbiasedness of the imputation procedure. However, for associations TP477925-Mn
and TP477925-Cu
, marker effects were estimated to be weaker in (MI*). Under the assumption that the imputation procedure is proper, such results suggest that the associations detected in (CC) were actually false positives.
The good overall concordance between observed and expected quantiles of p values for (CC) suggests that potential confounders (due to population structure and relatedness in AP-4x) were well accounted for in the GWAS models (Figure 8). Regarding (MI*), the strong overall deflation of suggests that p values simply do not follow a
distribution under the null hypothesis (no significant marker-trait association) because of the extra variability due to imputation uncertainty, which was large for the majority of markers (Figure 6).
Q-Qplot of p values from (CC), for St, TSC, AD, CP, Lignin (in AP-4x subset), and GLC (in LP2 subset), with corresponding p values from (MI*). Expected quantiles are from the (CC) analysis; the corresponding values from (MI*) are shown unordered to reflect how consistent quantiles are across analyses. Open circles, −log10(p) based on (CC); full circles, −log10(p) based on (MI*). Green symbols indicate associations for which FDR <0.1. Gray areas correspond to the 95% C.I. of quantiles under the null hypothesis that p values follow a Uniform(0,1). The p values from (CC) seem to follow a Uniform(0,1), indicating good control for potential confounders (population structure and relatedness). However, p values from (MI*) do not follow a Uniform(0,1) because of losses of significance caused by imputation uncertainty.
Correlation among significant markers:
In the absence of haplotypic information, the squared correlation between markers’ allelic dosages was used to reflect linkage disequilibrium (LD) between markers. The
values were calculated based on
. Among the nine significant markers, there seemed to be one group of four markers in moderately strong association with each other (TP140584, TP217634, TP477925, and TP87762) and one group of two markers in mild association (TP521945 and TP268059) (Figure 9). This grouping is consistent with the associations inferred from GWAS (i.e., markers in one group tend to be associated to similar traits; Table 7), except for TP217634, which is not associated with the same traits as TP140584, TP477925, or TP87762.
Squared coefficients of correlation among significant markers based on (below diagonal elements). Above diagonal elements indicate significance; *: p < 0.05; **: p < 0.01; ***: p < 0.001. TP140584, TP217634, TP477925, and TP87762 form a group of fairly correlated marker loci. TP521945 and TP268059 are in very mild correlation with each other. TP191264, TP184396, and TP341988 each seem independent of other significant markers.
Homology of SNP 64-bp sequences with the Brachypodium distachyon genome:
For each marker, a 64-bp read containing the corresponding SNP was obtained from the UNEAK pipeline. The reads corresponding to the nine significant markers were analyzed for homology with the Brachypodium distachyon genome (v1.0) (Table 8). Three of the nine marker reads significantly match regions of the Brachypodium distachyon genome: (i) TP184396, shown to be negatively associated to Lignin (the nonreference allele has a negative effect on the trait), is in a region homologous to the last intron of an arginine-tRNA ligase gene (e-value = 1.2E-6); (ii) TP268059, shown to be negatively and positively associated with CP and XYL_Eff, respectively, is in a region homologous to the only exon in a phenylalanine/histidine-ammonia-lyase gene (e-value = 3.4E-12); and (iii) TP341988, shown to be negatively associated with GLC in LP2, is in a region homologous to the second exon of a translation elongation factor G gene (e-value = 4.3E-19). The moderate significance of TP184396's alignment to the B. distachyon genome may be due to the fact that the marker is located in an intron, which is usually more likely than exons to be under low selection pressure and diverge substantially across species.
Discussion
Genotype calling uncertainty
Genotype calling uncertainty in GBS typically arises from sequencing error, which generates miscalls in SNP alleles, and low sequencing depth, which causes heterozygotes to be miscalled as homozygotes. Genotype calling uncertainty may be accounted for by applying cut-offs on proportion of reads (e.g., calling individuals homozygous for a SNP if >80% of their reads is of a particular allele) or probabilistic methods (returning posterior probabilities of genotypes), which rely on some estimates of error rate and population allele frequency at SNPs (Nielsen et al. 2011). In this study, sequencing depth was very low; there were, on average, 0.29 reads per SNP called (SD across SNPs: 0.38). As a result, none of the aforementioned methods could be used conveniently (probabilistic methods were not used because of the high amounts of missing values, particularly prohibitive for estimating population allele frequencies). Genotype calling uncertainty was therefore not accounted for, and the genotypes were used as returned by the UNEAK pipeline in the imputation procedure and association tests. Because genotype calling uncertainty translates into random error under a GWAS model, not accounting for it typically results in shrunk marker effects and loss of power (Nielsen et al. 2011). Because there was presumably no systematic loss in precision (true discovery rate), any GWAS inference made from (CC) or (MI*) was still deemed valid in this study. Nonetheless, false positives may have arisen at random from noise (due to genotype-calling uncertainty or other sources), particularly in (CC), which usually had low sample size .
The use of tree models for imputation
Here, no reference genome or genetic map was available for imputing the GBS marker data. Although it was possible to generate genetic maps on LP1 and LP2, the proportion of missing values was such that, in both populations, parental SNP genotypes could not be determined clearly (in particular, 37.6% and 25.6% of selected markers had missing values at both parents in LP1 and LP2, respectively) and imputation uncertainty would have to be accounted to produce reliable genetic maps, which would have been a serious statistical and computational challenge. Consequently, imputation methods based on HMM, which can be very accurate, were not used. In presence of unordered marker data with a general pattern of missingness and no reference panel, the strategy described here to impute missing values (i.e., FCS based on tree models), should be pertinent: although computationally intensive, the proposed imputation procedure was flexible, easy to implement, and appropriate for modeling marker data, as was suggested by Dai et al. (2006).
Structure in the panels
Analysis of structure in the three panels combined revealed stratification by panels, race, and geographical origin. The observed stratification by race and geographical origin was consistent with the results from Jakubowski et al. (2011), i.e., P. caesia being genetically distinct from P. arundinacea and, within P. arundinacea, East European strains being distinct from West European and North American strains. Even though accessions native to North America with a genetic background distinct from that of European accessions do exist (Jakubowski et al. 2012), most accessions of reed canarygrass found in North America, including all those evaluated in this study, share some common ancestry with West European accessions.
Results of imputation-based association tests
In this association study, for the rare cases in which γ was low enough , there were gains of significance from (CC) to (MI*) (and, presumably, a gain of power) (Figure 7A, Table 7). Another interesting outcome from MI was the decreased significance, given one marker (TP477925), for only a subset of the associations detected in (CC) (Table 7). Such results suggest a possible gain of precision in MI-based association tests. Unfortunately, no novel significant markers could be detected from (CC) to (MI*). As expected, for high values of γ
, MI had the desired property of decreasing significance of associations for the sake of precision.
Values of γ lower on average than 0.3 would correspond to PMV lower than approximately 0.45 according to the model presented in Table 6, with values of MAF and AMI set to 0.1 and 0.05, respectively. PMV lower than 0.45 corresponded to 1.6%, 1.8%, and 1.6% of the markers considered for GWAS in the LP1, LP2, and AP-4x subsets, respectively. As discussed below, if a reference panel had been available for imputation, then there would have been many more opportunities for gains in power from MI.
GWAS results and similarity of marker sequences to Brachypodium distachyon
This GWAS revealed associations of markers with multiple traits, which may indicate pleiotropy of one single tagged causal variant, LD between distinct causal variants affecting different traits, or genetic correlation among traits. For example, TP87762 was negatively associated with TSC and positively associated with Ds, St, ARA, and Mg. This result would probably suggest that, through lower tiller density, reed canarygrass plants carrying the nonreference allele at TP87762 were less prone to disease and lodging and had higher concentration of arabinose and magnesium.
Significant markers could not be mapped to a particular region of the reed canarygrass genome, because no reference map or genome sequence is available for that species. However, the significant markers identified here may be used in further studies making use of DNA sequences in reed canarygrass. Potentially, our studies may bring some insight about the function of genes in related species such as Brachypodium or oat. However, actual causal genes were probably not directly tagged by significant markers. Markers that did not map to the Brachypodium genomes (TP140584, TP191264, TP217634, TP477925, TP521945, and TP87762; because of absence of homology or because of evolutionary divergence) were probably in LD with unmarked functional regions, which possibly could have been mapped to the Brachypodium genome. TP184396 and TP 341988 could be mapped to the Brachypodium genome, but within genes with very general purposes (both matched genes involved in mRNA translation). Such results could be “true hits,” but this does not seem very likely. However, TP268059, negatively associated to CP and positively associated to XYL_Eff, was mapped to the exon of a gene coding for Phenylalanine/Histidine-Ammonia-Lyase. Although it is entirely possible that TP268059 did not directly tag the true causal gene, it is plausible that TP268059 is a true hit: Phenylalanine-Ammonia-Lyase (PAL) is involved in the very first step of the monolignol biosynthetic pathway leading to the synthesis of lignin (Boerjan et al. 2003), and a decreased ability to synthesize lignin is consistent with lower crude protein content and higher xylose conversion efficiency because lignin inhibits fermentation (Vogel and Jung 2001). However, one would then expect to find significant associations involving Lignin (lignin content) and GLC_Eff (glucose conversion efficiency), which was not the case here. That said, the association between TP268059 and lignification still makes good sense from the homology of the marker in the Brachypodium genome and the relative directions of the detected effects.
A genome-wide association study based on multivariate MI
Historically, MI has been prevalent in epidemiological studies, probably because of the frequent and prohibitive occurrence of nonresponse in survey data (Rubin 1987; Rubin 1996; Klebanoff and Cole 2008; Sterne et al. 2009). Here, the large amount of missing values in our GBS data motivated us to use MI to account for imputation uncertainty and make sound inferences about marker-trait associations. There have been previous quantitative genetics studies that used MI to increase the precision of significance tests for marker-associated effects: Dai et al. (2006) used an imputation procedure similar to that presented here and compared it with expectation-maximization techniques. However, their study dealt with small marker data (10 SNPs in real data and 4 SNPs in simulated data) and limited PMV (10% or 20% of missing data). Bobb et al. (2011) used MI to gain precision in QTL mapping, but they generated multiple imputes of the phenotypic data, not the genomic data. To our knowledge, this study is the first report of MI applied in a genome-wide context (with thousands of markers over the genome). We believe the methodology presented here could be useful in large-scale genetic analyses involving hypothesis testing in two ways. First, it exemplifies the use of tree models to impute unordered marker data in GWAS and, in the context of multivariate MI, approximate the distribution by following the methodologies developed by Dai et al. (2006) and Burgette and Reiter (2010). As multiplexed GBS, which trades sequencing depth (and therefore genotyping costs) for uncertainty in genotype calling and imputation, is increasingly used for association mapping (Poland and Rife 2012), this type of imputation procedures in a MI context should be particularly useful in species where no or only part of a reference genome is available, like wheat or switchgrass. Second, it shows how estimates from different imputes about marker effects in the unified linear mixed model (Yu et al. 2006) could be pooled using the rules developed by Rubin (1987) to conveniently account for imputation uncertainty and perform statistically valid association tests. That said, MI is not the only method that has been developed to explicitly account for imputation uncertainty. A basic approach would be to use expected genotype counts based on some distribution
and to perform association tests as if the marker data were fixed, as was done in (AD) here (Guan and Stephens 2008; Zheng et al. 2011). As noted previously (Guan and Stephens 2008), such simplification may yield erratic behavior in association testing if imputation uncertainty is high and marker effects are large, which is consistent with the results obtained here (Figure 3B). A more sophisticated approach would be to use tests with explicit account for increased variance of parameter estimates due to imputation uncertainty that are based on some approximation of
(Marchini et al. 2007; Guan and Stephens 2008; Zheng et al. 2011). In this framework, frequentist tests include score tests and likelihood ratio tests, implemented in SNPTEST (Marchini et al. 2007). The Bayesian counterparts of this type of tests, implemented in SNPTEST and BIMBAM (Servin and Stephens 2007), feature some interesting advantages compared to the frequentist tests; in particular, they avoid an inflation in significance arising from high imputation uncertainty (Guan and Stephens 2008). Unfortunately, both the frequentist and Bayesian methods described above, in the current state of their implementations, are limited in the type of models that can be fitted to the data: the tests do not apply to linear mixed models (accounting for relatedness through the K matrix), and uncertainty in covariates cannot be conveniently accounted for, as was done in (MI*) here. Although in human GWAS such models may be appropriate (e.g., Burton et al. 2007), the strong population stratification in plant GWAS panels call for mixed models that incorporate information on population structure and relatedness (Zhu et al. 2008). We believe the approach used here was particularly useful in that it allowed accounting for imputation uncertainty when using linear mixed models for testing associations.
Properness of imputations
An imputation procedure is proper if it is unbiased and it is confidence-valid, i.e., the variability among imputed values is equivalent to what it would be if the data had been complete. Here, we made the MCAR assumption stating that no factor (in particular, not the phenotypes of interest) influenced missingness at the marker data (Rubin and Little 2002; Sterne et al. 2009). Because marker effect estimates based on complete cases are unbiased under the MCAR assumption, (CC) analyses served as the reference to assess consistency of imputation-based tests. Judging from regression analyses on marker-effect estimates from (AD) (Table 4) and from (MI) (Table 5), it seemed that imputations did not generate bias, with shrinkage in marker-effect estimates from MI occurring only because of imputation uncertainty in (MI) [and (MI*)], which is desirable. Imputation bias may have occurred if the MCAR assumption—more generally, the MAR assumption—had not been valid. This may occur in survey data but is unlikely to occur, systematically, in GBS data; the probability of a marker read being available is unlikely to vary systematically across individuals. Confidence validity could only be assumed here. For other imputation methods, which do not preserve the variability present in the original dataset, this assumption cannot be made. Such methods, which include mean imputation and major allele imputation, should therefore be avoided in inference studies.
In the future, a simulation study might be conducted to characterize the unbiasedness and confidence validity of our imputation procedure in a setting where both marker effects and missing values are known a priori. Similar analyses were already performed on SNP data (e.g., Dai et al. 2006), but assessing the properness of MI based on GBS data in a genome-wide context could certainly be useful.
Practical issues for MI in GWAS
Here, given the amount of missing data (78% on average) and the information available for imputation (no reference panel and no genetic map of markers), imputation uncertainty was too prohibitive for detecting novel significant markers when performing imputation-based association tests. Gains of significance were nonetheless possible, mostly for low values of γ (Figure 7A), corresponding approximately to PMV <0.45 (Figure 7B). Pasaniuc et al. (2012) showed that a very substantial gain in imputation accuracy on low-depth GBS data in humans could be achieved when using a reference panel (the 1000-genomes data). For example, with 0.1× sequencing depth, they reported an increase in imputation accuracy from approximately 0.05 to more than 0.70 when the reference panel was used. In light of the results from Pasaniuc et al. (2012) and this article, we recommend imputation-based association studies on GBS data with sporadically missing values only when coverage is high enough to produce few missing values or when a reference panel is available for imputation.
When dealing with marker datasets that are much larger than the one considered here (numbers of markers q on the order of or
), implementing MI in GWAS would be a challenge. Computational and memory requirements will of course increase and parallelization should be devised accordingly. When dealing with unordered marker data, MI based on CART should be a good choice with regard to computational—and statistical—efficiency. However, the mice package uses a
predictor adjacency matrix to keep record of the candidate predictors for each marker. With many threads and/or high q, memory usage can be reduced by replacing the adjacency matrix with a
adjacency list (
: number of variables selected as potential predictors; e.g., 500 in this study) or determining the set of candidate predictors at each iteration with no storage, hence trading computational efficiency for memory efficiency. Both of these measures would involve modifications of the mice code. Importantly, if a (completely typed) reference panel H is available for imputation, then one can base imputations of
on H only, i.e.,
becomes
. In such settings, when imputing a given variable k, there would be no uncertainty about values at the predictors to account for (supposedly, genotypes or haplotypes in H have been perfectly called). As a result, the multiple imputed datasets in MI could be sampled by ordinary MC instead of MCMC. This would dramatically reduce the computational and memory requirements when implementing MI. Note also that when imputing at markers that are completely untyped in X (in silico genotyping), basing imputations on
rather than
may actually yield more accurate imputations (Guan and Stephens 2008). Thus, in presence of a reference panel, MI should be not only more useful (imputations being more accurate) but also more tractable (probably applicable when dealing with hundreds of thousands of markers).
Acknowledgments
The authors thank two anonymous reviewers for remarks and suggestions that greatly helped with improving the manuscript. This research was supported by the US Department of Energy & Department of Agriculture Plant Feedstock Genomics for Bioenergy Program Project Number DE-A-102-07ER64454, by Agriculture and Food Research Initiative Competitive Grant No. 2011-68005-30411 from the USDA National Institute of Food and Agriculture (CenUSA), and by USDA-ARS Congressionally allocated funds. Mention of commercial products and organizations in this manuscript is solely to provide specific information. The USDA is an equal opportunity provider and employer. G.P.R. was supported by the Gabelman-Shippo Wisconsin Distinguished Graduate Fellowship at the University of Wisconsin–Madison.
Footnotes
Supporting information is available online at http://www.g3journal.org/lookup/suppl/doi:10.1534/g3.115.017533/-/DC1
Communicating editor: K. M. Devos
- Received February 17, 2015.
- Accepted March 10, 2015.
- Copyright © 2015 Ramstein et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution Unported License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.