Sequencing, Assembling, and Correcting Draft Genomes Using Recombinant Populations

Current de novo whole-genome sequencing approaches often are inadequate for organisms lacking substantial preexisting genetic data. Problems with these methods are manifest as: large numbers of scaffolds that are not ordered within chromosomes or assigned to individual chromosomes, misassembly of allelic sequences as separate loci when the individual(s) being sequenced are heterozygous, and the collapse of recently duplicated sequences into a single locus, regardless of levels of heterozygosity. Here we propose a new approach for producing de novo whole-genome sequences—which we call recombinant population genome construction—that solves many of the problems encountered in standard genome assembly and that can be applied in model and nonmodel organisms. Our approach takes advantage of next-generation sequencing technologies to simultaneously barcode and sequence a large number of individuals from a recombinant population. The sequences of all recombinants can be combined to create an initial de novo assembly, followed by the use of individual recombinant genotypes to correct assembly splitting/collapsing and to order and orient scaffolds within linkage groups. Recombinant population genome construction can rapidly accelerate the transformation of nonmodel species into genome-enabled systems by simultaneously producing a high-quality genome assembly and providing genomic tools (e.g., high-confidence single-nucleotide polymorphisms) for immediate applications. In populations segregating for important functional traits, this approach also enables simultaneous mapping of quantitative trait loci. We demonstrate our method using simulated Illumina data from a recombinant population of Caenorhabditis elegans and show that the method can produce a high-fidelity, high-quality genome assembly for both parents of the cross.


Introduction
Here we present a new approach for producing de novo whole genome sequences--recombinant population genome construction (RPGC)--that solves many of the problems encountered in standard genome assembly (see more details in the accompanying manuscript). This manual goes through the analysis explained in the manuscript with great detail.

Requirements
To run our RPGC walk-through below, you will need the following software installed: pIRs v1.10 ALLPATHS-LG build 41292 BWA v0.6. You will also need python 2.7 or above to run our home-brew python scripts.
Note that RPGC does not require any particular software at each step. The ones listed above best served our interests; e.g. you could use SOAPdenovo instead of ALLPATHS-LG to generate your assemblies. The general approach remains the same. 3. Programs within square brackets indicate that valid paths to where the programs are installed are required. In the example below, you need to specify the path to the directory where pirs can be found.
[pirs] diploid -i n2.fasta \ -s 0.001 -d 0.0001 -v 0.000001 \ -o n2b.fasta 2/41 4. Command-line arguments within <> indicate valid paths to where les or directories are stored are expected. In the example below, you need to provide both a path to the n2.fasta le and a path to the output.

RPGC Manual
We used the reference C. elegans strain N2 as one parental genome, and simulated a second parent, N2b, using pIRS (v1.10).
We simulated 70 individual RILs equivalent to eight generations of sel ng by a set of individual F2s that were produced from crossing perfectly isogenic parental strains and self-fertilizing the resulting F1. We assumed that this resulted in a total of ve crossing-over events per chromosome in the sequenced RILs, and that recombination events were uniformally distributed along chromosomes. [

RPGC Manual
We built a de novo genome assembly using ALLPATHS-LG (build 41292). We rst prepared the data as ALLPATHS-LG suggests. Reads from two parents plus 10 randomly chosen RILs were used.

8/41
Reads from the parents and all of the RILs were mapped against our assembled genome using BWA (v0.6.1) with default settings. Here you can decide to either run BWA (or your mapper of choice) on each of the individuals separately, or to use our home-brew python wrapper script that calls BWA internally. But before making up your mind, you need rst to index your assembled genome using the index subprogram built in BWA.

RPGC Manual
Input: con guration le assembled_genome Output: SAM les The command-line for using our python script is shown below. Mapping results will be organized by individual under speci ed "mapping_dir".

Input:
assembled_genome.fasta Output: assembled_genome Because we use paired-end read mapping, two subprograms in BWA are called internally from our python script: aln and sampe, using the default settings. Below is an example of our script calling BWA on one of the RILS.

10/41
To run our python script, you have to provide a con guration le whose format is de ned as follows: 1. one individual per line 2. each line consists of three space-separated columns: (i). individual ID (ii). fastq le of one end of the read (iii). fastq le of the other end of the read (iv). fastq le of single-end reads if any (Optional) 3. lines starting with "#" will be taken as comments and will not be read by the script An example of con guration le, bwa.con g, will be provided along with our python scripts.

11/41
After Read Mapping The raw mapping results in SAM format are further processed using Picard (v1.77). Similar to what we did for read mapping, we used another home-brew python wrapper script to call Picard internally with its built-in subprograms: To run our python script, you are required to provide a con guration le whose format is de ned as follows: 1. one individual per line 2. each line consists of three space-separated columns: (i). read group de nition (ii) SAM le from pair-end mapping (iii) SAM le from single-end mapping if any (Optional) 3. lines starting with "#" will be taken as comments and will not be read by the script An example of con guration le, picard.con g, will be provided along with our python scripts. To ensure call quality, we independently used SAMtools (v0.1.18) and the GATK HaplotyeCaller .
We then used GATK Uni edGenotyper on the realigned BAM le to make the initial call set. Note that because we did not have a set of known variants, we could not run base quality recalibration at this stage. The overlapping call set was then further ltered to include only i) sites with two alleles, ii) sites where at least one of the 10 individuals is identi ed as homozygous for the non-reference base with a minimum of 4X coverage, and iii) sites where at most one RIL is genotyped as heterozygous. Below, we rst ltered sites with more than two alleles using GATK SelectVariants.

RPGC Manual
Then we used our home-made python script to further lter the biallelic call-set based on the second and third rules. In cases where di erent thresholds de ned in rule 2 and 3 are required, you can customize them through -lter option as long as its de ned format is followed.

20/41
After Initial Call and Before Second Call The ltered call-set was then treated as a high-con dence set of variants and used for base recalibration (BQSR) using GATK, with the default set of covariates. In addition, we extracted the top 10% of the highest-scoring calls in the resulting set as a training set for variant-quality recalibration (VQSR) of the entire set of calls, using a home-made python script shown below. The value for the percentage of top highest-scoring calls is customizable through "-top_percent" option.

Variant Recalibration
As GATK suggests on their website, the VQSR was done for SNPs and indels separately because they were called by Uni edGenotyper. For both SNP and indel recalibration, we tried di erent annotation combinations ("-an" option) and picked the one that best trained the mixture Gaussian models based on the .recal, .tranche, and .plots les. Also from the .tranche le you can decide at what threshold level (speci ed by "-ts_ lter_level" in ApplyRecalibration command-line setting below) you want to lter out your calls.
Then we applied the best SNP and indel model, respectively, on our second call-set, using the GATK ApplyRecalibration. All 70 RILs and two parents were genotyped at all variant sites in the nal call set, using the realigned and base-quality recalibrated BAM le as input to GATK Uni ed Genotyper.

RPGC Manual
To ensure genotyping quality, we independently genotyped all the data using SAMtools. The SAMtools-generated genotypes were used later as one of the inputs to our identi cation of split loci.