LinkImpute: Fast and Accurate Genotype Imputation for Nonmodel Organisms

Table 1

Performance of the different imputation methods on the apple dataset

Method	Genotype Error	Allele Error	Run Time, sec
Mode	23.0%	12.4%	^a
kNNi^b	20.6%	10.8%	18
MF	9.9%	5.1%	40,107
fastPHASE	7.7%	3.9%	52,399
Beagle	7.6%	3.9%	424
LD-kNNi^c	7.4%	3.9%	104

kNNi, k-nearest neighbors imputation; LD-kNNi, linkage disequilibrium k-nearest neighbors imputation.

^a Run time was under a second.

^b Using a fixed value of k = 8.

^c Using fixed values of k = 5 and l = 20.
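The two error columns in Table 1 can be illustrated with a short sketch. It assumes that genotypes are coded as allele counts (0, 1, or 2), that genotype error is the fraction of masked genotypes imputed incorrectly, and that allele error is the fraction of individual alleles imputed incorrectly. The function name `imputation_errors` and these definitions are assumptions made for illustration; they are not stated in this section.

```python
def imputation_errors(true_genotypes, imputed_genotypes):
    """Return (genotype_error, allele_error) for paired masked genotypes.

    Assumes each genotype is coded 0, 1 or 2 (count of one allele).
    """
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("inputs must have the same length")
    n = len(true_genotypes)
    if n == 0:
        raise ValueError("at least one genotype is required")
    pairs = list(zip(true_genotypes, imputed_genotypes))
    # A genotype counts as wrong if it differs at all from the truth.
    wrong_genotypes = sum(1 for t, i in pairs if t != i)
    # Each genotype carries two alleles; |t - i| alleles are wrong.
    wrong_alleles = sum(abs(t - i) for t, i in pairs)
    return wrong_genotypes / n, wrong_alleles / (2 * n)


genotype_error, allele_error = imputation_errors([0, 1, 2, 0], [0, 0, 0, 0])
```

Under these definitions the allele error is at least half the genotype error, and exceeds half only when some homozygotes are imputed as the opposite homozygote, which is the pattern seen in Table 1.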

Our results show that LD-kNNi performs slightly better than Beagle and fastPHASE, which are the most accurate of the other methods tested (Table 1). MF performs noticeably worse than these methods, although this may be due in part to imputing on a per-chromosome basis. kNNi performs significantly worse than any of these methods, only slightly outperforming Mode imputation. We investigated the difference between LD-kNNi and kNNi further by computing, for each imputed genotype, the number of neighbors shared between the kNNi and LD-kNNi methods, using k = 5 in both cases. We found that in 56% of imputations the two methods share no neighbors (Figure 1). This finding suggests that in many cases kNNi imputes from samples that, although similar across the whole genome, may not be informative for the SNP being imputed. Figure S3 further supports this hypothesis by measuring the average distance, using the LD-kNNi methodology (d_l, Equation 3), from the sample to be imputed to the neighbors used in the imputation. The average distance, again using k = 5 in both cases, is much greater for kNNi (6.4) than for LD-kNNi (1.8).

Figure 1

The number of shared neighbors between the k-nearest neighbors imputation (kNNi) and linkage disequilibrium k-nearest neighbors imputation (LD-kNNi) methods. The value of k was set to 5 for both methods.
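The shared-neighbor comparison can be sketched as follows. This is a minimal illustration, not the LinkImpute implementation: it assumes complete genotypes coded 0, 1, or 2, a summed absolute-difference distance between samples, and a precomputed list `ld_snps` of the SNPs in highest LD with the target SNP. The function names and the tie-breaking rule (lower sample index first) are hypothetical.

```python
def distance(a, b, snps):
    """Summed absolute genotype difference over the given SNP indices."""
    return sum(abs(a[s] - b[s]) for s in snps)


def k_nearest(genotypes, sample, snps, k):
    """Indices of the k samples closest to `sample` over `snps`."""
    others = [i for i in range(len(genotypes)) if i != sample]
    # Ties are broken by sample index so the result is deterministic.
    others.sort(key=lambda i: (distance(genotypes[sample], genotypes[i], snps), i))
    return set(others[:k])


def shared_neighbor_count(genotypes, sample, all_snps, ld_snps, k):
    """Neighbors chosen by both the genome-wide and the LD-restricted search."""
    genome_wide = k_nearest(genotypes, sample, all_snps, k)
    ld_based = k_nearest(genotypes, sample, ld_snps, k)
    return len(genome_wide & ld_based)


genotypes = [
    [0, 0, 0, 0],
    [0, 0, 2, 2],
    [2, 2, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]
shared = shared_neighbor_count(genotypes, sample=0, all_snps=[0, 1, 2, 3],
                               ld_snps=[2, 3], k=2)
```

In this toy matrix the genome-wide search picks samples 3 and 1, whereas the search restricted to SNPs 2 and 3 picks samples 2 and 3, so only one neighbor is shared.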

Unsurprisingly, we found that the performance of LD-kNNi is dependent on the level of LD between the SNP to be imputed and the SNPs used to find the nearest neighbors. Where the average LD between the SNPs used and the imputed SNP is high, the imputation error is lower (Figure S4). Although the apple reference genome is not used in LD-kNNi, we exploited it to investigate how often our nearest neighbor calculations used SNPs from chromosomes other than the chromosome on which the imputed SNP is located. To do this, we calculated the probability of being on the same chromosome as the imputed SNP for the 20 SNPs in greatest LD with the imputed SNP. Figure 2 shows that for the SNP with the highest LD, there is a probability of 0.7 of being on the same chromosome and that this drops off to 0.31 for the 20th-ranked SNP.

Figure 2

The probability of a single-nucleotide polymorphism (SNP) being on the same chromosome as the imputed SNP as a function of linkage disequilibrium (LD) with the imputed SNP. SNPs are ranked according to LD, with the SNP most in LD with the imputed SNP ranked one.
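The quantity plotted in Figure 2 can be sketched as below, assuming the LD ranking has already been computed. The inputs are hypothetical: `ranked_partners` maps each imputed SNP to its partner SNPs ordered by decreasing LD, and `chromosome` maps each SNP to its reference chromosome.

```python
def same_chromosome_by_rank(ranked_partners, chromosome):
    """Fraction of imputed SNPs whose rank-r LD partner shares their chromosome.

    Returns a list whose element r corresponds to LD rank r + 1.
    """
    if not ranked_partners:
        raise ValueError("at least one imputed SNP is required")
    n_ranks = min(len(partners) for partners in ranked_partners.values())
    probabilities = []
    for rank in range(n_ranks):
        hits = sum(
            1
            for target, partners in ranked_partners.items()
            if chromosome[partners[rank]] == chromosome[target]
        )
        probabilities.append(hits / len(ranked_partners))
    return probabilities


chromosome = {"a": 1, "b": 1, "c": 2, "d": 2}
ranked_partners = {"a": ["b", "c"], "c": ["d", "a"], "b": ["c", "a"]}
by_rank = same_chromosome_by_rank(ranked_partners, chromosome)
```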

We investigated the performance of the different imputation methods as a function of the MAF of the imputed SNPs. Figure 3 shows the genotype error rate of each method stratified by MAF. While the error rate increases noticeably with MAF for Mode and kNNi, the increase is small for the other four methods.

Figure 3

Imputation accuracy as a function of the minor allele frequency (MAF) of the imputed SNP for each of the six imputation methods. MAF is binned in 5% bins and the number of SNPs in each bin is shown in parentheses. kNNi, k-nearest neighbors imputation; LD-kNNi, linkage disequilibrium k-nearest neighbors imputation.
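The stratification in Figure 3 can be sketched as a binning step, assuming each masked genotype is available as a `(maf, true_genotype, imputed_genotype)` record. The record layout and the function name are assumptions for illustration; the 5% bin width follows the caption.

```python
from collections import defaultdict


def genotype_error_by_maf_bin(records, bin_width=0.05):
    """Map MAF bin index -> genotype error rate.

    `records` holds (maf, true_genotype, imputed_genotype) tuples, one per
    masked genotype, with 0 <= maf <= 0.5. Bin 0 covers [0, bin_width).
    """
    n_bins = round(0.5 / bin_width)
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for maf, true_gt, imputed_gt in records:
        # The small epsilon guards against floating-point error at bin
        # edges; a MAF of exactly 0.5 is placed in the last bin.
        bin_index = min(int(maf / bin_width + 1e-9), n_bins - 1)
        totals[bin_index] += 1
        if true_gt != imputed_gt:
            wrong[bin_index] += 1
    return {b: wrong[b] / totals[b] for b in sorted(totals)}


records = [
    (0.02, 0, 0), (0.03, 1, 0),
    (0.12, 2, 2),
    (0.43, 1, 2), (0.44, 1, 1),
    (0.50, 0, 0),
]
error_by_bin = genotype_error_by_maf_bin(records)
```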

Run time

Comparing the run time of the various imputation methods (Table 1), we note that both MF and fastPHASE took significantly longer than any of the other methods: these two methods take on the order of 10 hr, compared with a few minutes or less for the other methods. Further analysis of the run time suggests that, as the numbers of samples and SNPs increase, LD-kNNi will continue to have a shorter run time than Beagle (Figure S7 and Figure S8).

Comparing the performance of LinkImpute and Beagle on multiple datasets (Table 2), we note that LinkImpute achieves slightly better accuracy than Beagle on all three datasets. Its run time is shorter than Beagle's on the apple and maize datasets and longer on the small grape dataset.

Table 2

Performance of LinkImpute and Beagle on different datasets

Dataset	Number of SNPs	Number of Samples	Genotype Error, LinkImpute^a	Genotype Error, Beagle	Run Time (sec), LinkImpute^a	Run Time (sec), Beagle
Apple	8,404	711	7.4%	7.6%	104	424
Maize	43,696	4,300	18.1%	18.7%	7,608	16,585
Grape	8,506	77	9.5%	11.0%	28	16

SNP, single-nucleotide polymorphism.

^a Using the LD-kNNi option and optimized values of k and l.

Accuracy of allele frequency estimation

Figure 4 shows a bubble plot of actual and incorrectly imputed genotypes for each of the six imputation methods. This shows that all six methods have a bias toward imputing the major allele. This allele bias is pronounced for Mode and kNNi and is less severe for the other methods.

Figure 4

Bubble plots of the actual and imputed genotypes for each of the 10,000 masked genotypes using each of the six imputation methods. Bubbles are not shown for the correctly imputed cases. The size of the bubbles is proportional to the frequency of observations in that category. kNNi, k-nearest neighbors imputation; LD-kNNi, linkage disequilibrium k-nearest neighbors imputation.
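The counts behind a plot such as Figure 4 can be sketched as a tally of (actual, imputed) pairs for the incorrectly imputed genotypes. The sketch assumes genotypes are coded as minor-allele counts (0, 1, or 2), so that an imputed value lower than the actual value is an error toward the major allele; the function name and this coding are assumptions.

```python
from collections import Counter


def misimputation_summary(actual, imputed):
    """Return (error_counts, fraction_toward_major).

    error_counts maps (actual, imputed) -> count for mismatched pairs only.
    """
    errors = Counter((a, i) for a, i in zip(actual, imputed) if a != i)
    n_errors = sum(errors.values())
    # With minor-allele-count coding, imputed < actual means the imputed
    # genotype carries more copies of the major allele than the truth.
    toward_major = sum(c for (a, i), c in errors.items() if i < a)
    fraction = toward_major / n_errors if n_errors else 0.0
    return errors, fraction


errors, toward_major_fraction = misimputation_summary(
    actual=[0, 1, 2, 1, 0], imputed=[0, 0, 1, 1, 1]
)
```

A fraction above 0.5 would indicate the major-allele bias described above.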

The allele bias observed in Figure 4 is expected to affect allele frequency estimation. We investigated this further by using our smaller dataset. For each of the six methods, we calculated the MAF across the 1001 SNPs without missing genotype data using both the observed genotypes and the imputed genotypes. Figure 5 shows that every imputation method biases the MAF downward. This finding is consistent with our observation of allele bias in Figure 4. The resulting bias is least pronounced for genotype specific methods, which all bias the MAF downward by 0.5% as opposed to a minimum of 0.6% for any of the other methods.

Figure 5

Minor allele frequency (MAF) computed by the use of actual and imputed genotypes for each of the six imputation methods. kNNi, k-nearest neighbors imputation; LD-kNNi, linkage disequilibrium k-nearest neighbors imputation.

Figure 5 shows that the MAF tends to be underestimated when calculated from an imputed dataset, no matter which imputation method is used. In addition, LD-kNNi outperforms every other method in estimating MAF: the points cluster much closer to the line for LD-kNNi than for any of the other methods. Moreover, LD-kNNi's most extreme deviation (3.8%) from the observed MAF is lower than that of any other tested method (Figure S5). The two groups in the Mode plot are caused by the two different modal values for SNPs (0 or 1): the bottom left group is where the modal value is 0, and the top right group is where it is 1 (Figure S6).
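The downward bias can be reproduced on a toy SNP. The sketch below assumes genotypes coded as minor-allele counts with `None` for missing calls, and uses Mode imputation because it is the simplest of the six methods; the function names are hypothetical.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """MAF computed from the called (non-None) genotypes only."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def impute_mode(genotypes):
    """Fill each missing genotype with the most common called genotype."""
    called = [g for g in genotypes if g is not None]
    mode = Counter(called).most_common(1)[0][0]
    return [mode if g is None else g for g in genotypes]


observed = [0, 0, 0, 0, 1, 1, None, None]
maf_nonmissing = minor_allele_frequency(observed)               # 2 / 12
maf_after_mode = minor_allele_frequency(impute_mode(observed))  # 2 / 16
```

Here the modal genotype is the major homozygote, so filling the two gaps adds four major alleles and lowers the estimated MAF relative to the estimate from called genotypes alone.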

Discussion

LD-kNNi performs well compared with the most commonly used imputation methods. On our apple dataset it results in both superior imputation accuracy (Table 1) and more accurate allele frequency estimates (Figure 4 and Figure 5). Accuracy results on the two other tested datasets are similar, and the results presented here suggest that performance should be comparable on other similar datasets. In particular, Figure 3 suggests that the MAF distribution should have little effect on the relative performance of LD-kNNi.

The run time of LinkImpute also compares favorably with existing methods. Only two of the methods studied here have both high accuracy and reasonable run times, namely Beagle and the LD-kNNi option of LinkImpute. Of these, our method is slightly faster. In addition, as the number of samples and SNPs increases, LD-kNNi is expected to outperform the other methods (Figure S7 and Figure S8), which is particularly noteworthy because increasing sample size is critical to augmenting the statistical power of GWA studies (Spencer et al. 2009).

A recently developed imputation algorithm was designed for heterozygous species without a reference genome and was applied to raspberry (genus Rubus; Ward et al. 2013). However, this method applies only to biparental populations and relies on the construction of a genetic map. The primary advantage of LD-kNNi over existing methods is that it does not rely on ordered markers and can be applied to diverse and heterozygous populations (Figure S9), not just biparental crosses. Although we called SNPs using the apple reference genome, LD-kNNi makes no use of this information during imputation. Indeed, Figure 2 shows that in many cases our algorithm uses information from SNPs that are not on the same chromosome as the imputed SNP. It is worth noting that linkage group assignments from apple F1 populations conflict with reference genome locations for 14–18% of SNPs (Antanaviciute et al. 2012; Gardner et al. 2014). It is therefore likely that a significant number of sequences are anchored incorrectly in the version of the apple genome used here, so that some SNP pairs that truly share a chromosome appear to lie on different chromosomes. Thus, the values in Figure 2 may be downward biased. Nevertheless, LD-kNNi clearly often makes use of information from SNPs on other chromosomes, and the quality of the apple reference genome has no effect on its performance.

We demonstrated that the performance of LD-kNNi improves as the LD between the imputed SNPs and the SNPs used to find the nearest neighbors increases (Figure S4). This suggests that, as the SNP density of a dataset increases and more SNPs are in LD with one another, one can expect improvements in the imputation accuracy of LD-kNNi. One way of obtaining more SNPs would be to allow greater levels of missing genotypes, although the increase in missing data is likely to have a negative effect on imputation accuracy. Whether this negative effect is offset by the positive effect of increased SNP density warrants further study.

Like most other imputation methods, LinkImpute is applied to a table of genotypes that have been called by a genotype calling algorithm. In many cases, a genotype without sufficient sequence coverage is set to missing in the table even though it has several supporting sequence reads in the original data source. In such cases, the information from those reads is lost and remains unused during imputation. Including the information from these reads during imputation is likely to improve imputation performance. In turn, this should enable higher-confidence genotype calls from lower read depths, thereby significantly increasing the total number of genotypes called. Moreover, integrating imputation and SNP calling in this manner should help reduce genotyping error rates, especially in cases of low read depth. This is an active area of research, and future improvements are expected to increase both genotype quality and quantity.

Our results suggest that LD-kNNi produces more accurate allele frequency estimates at the cost of a slight decrease in imputation accuracy. Biased allele frequencies are known to adversely affect downstream analyses (Han et al. 2014), whereas increased imputation accuracy does not always lead to improved phenotype prediction (Rutkoski et al. 2013). For many studies, an imputation method with less bias in allele frequency estimation, such as LD-kNNi, may therefore be preferable to a method with slightly increased accuracy. It is worth noting that, in cases where one is only interested in the MAF, one can simply estimate it from the nonmissing genotypes. We show that such an estimate is indeed unbiased and that it is more accurate than estimating MAF after imputation (Figure S5 and Figure S10). The relationship between imputation accuracy, allele frequency bias and their effects on downstream analyses warrants further investigation.

Genotype imputation is a crucial step in many genomic studies, as all existing genotyping methods result in some missing data. Most imputation algorithms rely on physical or genetic maps, either directly or in the generation of ordered SNPs, and are not suitable for use in nonmodel organisms with poor or underdeveloped genomic resources. Our novel genotype imputation method, LD-kNNi, does not rely on physical or genetic maps and imputes genotypes as accurately as the best existing methods that require ordered markers. In addition, it is fast and outperforms other methods in its ability to accurately estimate allele frequencies. Thus, LinkImpute is a valuable tool for improving genome-wide analyses in nonmodel organisms, especially for GWA and GS in highly diverse and heterozygous organisms.

Acknowledgments

We thank Patrick J. Brown and Kate Crosby for providing useful discussion. This work was supported by a Genome Canada Bioinformatics and Computational Biology grant; the Canada Research Chairs program; and the Natural Sciences and Engineering Research Council of Canada.

Footnotes

Supporting information is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.115.021667/-/DC1

Communicating editor: J. Ross-Ibarra

Literature Cited

Adam-Blondon, A.-F., O. Jaillon, S. Vezzulli, A. Zharkikh, M. Troggio et al., 2011 Genome sequence initiatives. Genet. Genomics Breed. Grapes 211–234.

Altshuler, D., M. J. Daly, and E. S. Lander, 2008 Genetic mapping in human disease. Science 322: 881–888.

Antanaviciute, L., F. Fernández-Fernández, J. Jansen, E. Banchi, K. M. Evans et al., 2012 Development of a dense SNP-based linkage map of an apple rootstock progeny using the Malus Infinium whole genome genotyping array. BMC Genomics 13: 203.

Bovine HapMap Consortium, R. A. Gibbs, J. F. Taylor, C. P. Van Tassell, W. Barendse, K. A. Eversole et al., 2009 Genome-wide survey of SNP variation uncovers the genetic structure of cattle breeds. Science 324: 528–532.

Browning, S. R., and B. L. Browning, 2007 Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81: 1084–1097.

Chagné, D., R. N. Crowhurst, M. Troggio, M. W. Davey, B. Gilmore et al., 2012 Genome-wide SNP detection, validation, and development of an 8K SNP array for apple. PLoS One 7: e31745.

Danecek, P., A. Auton, G. Abecasis, C. A. Albers, E. Banks et al., 2011 The variant call format and VCFtools. Bioinformatics 27: 2156–2158.

Elshire, R. J., J. C. Glaubitz, Q. Sun, J. A. Poland, K. Kawamoto et al., 2011 A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6: e19379.

Gardner, K. M., P. Brown, T. F. Cooke, S. Cann, F. Costa et al., 2014 Fast and cost-effective genetic mapping in apple using next-generation sequencing. G3 (Bethesda) 4: 1681–1687.

Glaubitz, J. C., T. M. Casstevens, F. Lu, J. Harriman, R. J. Elshire et al., 2014 TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline. PLoS One 9: e90346.

Han, E., J. S. Sinsheimer, and J. Novembre, 2014 Characterizing bias in population genetic inferences from low-coverage sequencing data. Mol. Biol. Evol. 31: 723–735.

Hayes, B. J., P. J. Bowman, A. J. Chamberlain, and M. E. Goddard, 2009 Invited review: genomic selection in dairy cattle: progress and challenges. J. Dairy Sci. 92: 433–443.

Hearne, S., C. Chen, E. Buckler, and S. Mitchell, 2014 Unimputed GbS derived SNPs for maize landrace accessions represented in the SeeD-maize GWAS panel. Available at: http://data.cimmyt.org/dvn/dv/seedsofdiscoverydvn/faces/study/StudyPage.xhtml;jsessionid=6dede0d2bfdf0cb29ddb610981cc?globalId=hdl:11529/10034&tab=files&studyListingIndex=0_6dede0d2bfdf0cb29ddb610981cc. Accessed September 25, 2015.

International HapMap 3 Consortium, D. M. Altshuler, R. A. Gibbs, L. Peltonen, E. Dermitzakis, S. F. Schaffner et al., 2010 Integrating common and rare genetic variation in diverse human populations. Nature 467: 52–58.

Jaillon, O., J.-M. Aury, B. Noel, A. Policriti, C. Clepet et al., 2007 The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449: 463–467.

Li, H., and R. Durbin, 2009 Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25: 1754–1760.

Li, Y., C. Willer, S. Sanna, and G. Abecasis, 2009 Genotype imputation. Annu. Rev. Genomics Hum. Genet. 10: 387–406.

Marchini, J., and B. Howie, 2010 Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11: 499–511.

Matukumalli, L. K., C. T. Lawley, R. D. Schnabel, J. F. Taylor, M. F. Allan et al., 2009 Development and characterization of a high density SNP genotyping assay for cattle. PLoS One 4: e5350.

McClure, K. A., J. Sawler, K. M. Gardner, D. Money, and S. Myles, 2014 Genomics: a potential panacea for the perennial problem. Am. J. Bot. 101: 1780–1790.

McKenna, A., M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis et al., 2010 The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20: 1297–1303.

Myles, S., J.-M. Chia, B. Hurwitz, C. Simon, G. Y. Zhong et al., 2010 Rapid genomic characterization of the genus Vitis. PLoS One 5: e8219.

Poland, J., J. Endelman, J. Dawson, J. Rutkoski, S. Wu et al., 2012 Genomic selection in wheat breeding using genotyping-by-sequencing. Plant Genome J. 5: 103.