Construction of Reference Chromosome-Scale Pseudomolecules for Potato: Integrating the Potato Genome with Genetic and Physical Maps

The genome of potato, a major global food crop, was recently sequenced. The work presented here details the integration of the potato reference genome (DM) with a new sequence-tagged site marker−based linkage map and other physical and genetic maps of potato and the closely related species tomato. Primary anchoring of the DM genome assembly was accomplished by the use of a diploid segregating population, which was genotyped with several types of molecular genetic markers to construct a new ~936 cM linkage map comprising 2469 marker loci. In silico anchoring approaches used genetic and physical maps from the diploid potato genotype RH89-039-16 (RH) and tomato. This combined approach has allowed 951 superscaffolds to be ordered into pseudomolecules corresponding to the 12 potato chromosomes. These pseudomolecules represent 674 Mb (~93%) of the 723 Mb genome assembly and 37,482 (~96%) of the 39,031 predicted genes. The superscaffold order and orientation within the pseudomolecules are closely collinear with independently constructed high density linkage maps. Comparisons between marker distribution and physical location reveal regions of greater and lesser recombination, as well as regions exhibiting significant segregation distortion. The work presented here has led to a greatly improved ordering of the potato reference genome superscaffolds into chromosomal “pseudomolecules”.

Genome sequencing of crop plants has become increasingly routine, primarily due to the reduction in cost and increase in throughput brought about by continuing advances in sequencing technologies. First reports on the whole-genome sequences of plants, such as Arabidopsis thaliana (The Arabidopsis Genome Initiative 2000) and rice (International Rice Genome Sequencing Project 2005), were mainly accomplished with the use of clone-based (e.g., "BAC by BAC") strategies. In this approach, a library of bacterial artificial chromosome (BAC) clones is mapped onto chromosomes by the use of molecular markers, the aim being to generate a clone-based physical map with a "minimum tiling path." This assures good genome coverage while minimizing the sequencing effort. More recently, plant genome sequencing has been based on whole-genome shotgun approaches involving conventional Sanger sequencing, next-generation sequence technologies, or a combination of both (Hamilton and Buell 2012). The whole-genome shotgun approach does not require a physical map, and no prior assumption is made about the position of the resulting sequence assemblies. Several research groups have developed "scaffolding" algorithms to assemble these typically short sequence contigs into larger constructs (Miller et al. 2010). However, because of the genome size and complexity of most crop plants, scaffolds typically remain unoriented and without chromosomal coordinates, despite being well annotated for gene content. A reference genome sequence requires that the products of the assembly process (contigs and scaffolds) be globally ordered and oriented to generate chromosomal pseudomolecules (PMs). In the absence of a clone-based physical map or genetic map of the reference sequenced genotype, this task is a significant and challenging one.
One widely adopted approach has been to link the sequence assembly to a genetic map using the presence of mapped sequence-tagged site (STS) genetic markers (Green and Green 1991) in the genome sequence. For example, a set of 409 molecular markers was used to order 69% of the assembled 487 Mb grapevine genome along the 19 grape linkage groups (The French-Italian Public Consortium for Grapevine Genome Characterization 2007). The link between the genome sequence and its genetic maps is critical in moving between trait loci and candidate genes underlying such loci. Successful genetic anchoring of a plant genome sequence assembly with the use of maps developed in the reference-sequenced genotype depends on marker density and distribution, as well as map accuracy and resolution. Other approaches can also be implemented to augment the anchoring process, including comparative analysis with physical and genetic maps of closely related species.
The Potato Genome Sequencing Consortium (Potato Genome Sequencing Consortium 2011) has published the genome of the doubled monoploid Solanum tuberosum group Phureja DM1-3 516 R44 (hereafter referred to as DM). At the time the genome sequencing was initiated, DM did not have a physical map, nor was there any preexisting genetic map for this genotype. Therefore, a genome-anchoring strategy was developed that included the generation of a segregating biparental mapping population involving DM as a parent, and generation of a dense STS-based genetic map. Other genetic mapping resources, such as the ultra-high density (UHD) map of diploid potato genotype RH89-039-16 (RH) (van Os et al. 2006), and the tomato-EXPEN 2000 genetic reference map (Fulton et al. 2002) were also used.
We describe for the first time in detail the generation of an integrated de novo genetic/physical map of potato and significant refinements to the previously published assembly. Our combined map orders the genome sequence into 12 chromosomal PMs corresponding to each of the 12 potato chromosomes and is linked to previously existing potato and Solanaceae mapping resources. The work represents the assimilation of various data types that required complex interpretation for correct ordering and orientation of superscaffolds. This process involved considerable manual curation, driven largely by a novel approach for visualization of mate-pair sequences from large genomic clones (BAC and fosmid) and long insert 454 reads (20 kb and 8 kb). This allowed us to assign robust orientations to many superscaffolds and also enabled the inclusion of many superscaffolds that remained unanchored when the reference genome sequence was published (Potato Genome Sequencing Consortium 2011). This resource will facilitate exploitation of the potato genome sequence for genetic analysis and crop improvement, and our approach can serve as a guide for others wishing to engage in genome sequencing of genotypes which lack physical or genetic maps.

MATERIALS AND METHODS
Genetic cross/population construction
A segregating diploid potato population (BC1) derived from the reference sequence clone DM 1-3 516 R44 (DM) was developed. The homozygous DM clone (2n = 2x = 24) was generated by chromosome doubling of a monoploid (2n = 1x = 12) derived from a heterozygous accession of S. tuberosum Group Phureja (Paz and Veilleux 1999). A heterozygous diploid clonal accession (CIP 703825, referred to as D) belonging to the Solanum tuberosum diploid Andigenum Group Goniocalyx cultivar group (Spooner et al. 2007; Ovchinnikova et al. 2011) was crossed to DM. The direction of the cross (DM × D) was chosen because DM is male sterile. One of the resulting F1 hybrids (DM/D, CIP 305156.17) was used as the stylar parent in a backcross with D as pollen parent. The mapping population comprising 180 backcross progeny clones (hereafter referred to as DMDD) was raised in the greenhouse for DNA extraction and pathogen testing and is also maintained pathogen-free in vitro (https://research.cip.cgiar.org/confluence/display/dm/Home) at the International Potato Center, Peru.

Plant material and genomic DNA extraction
Genomic DNA from 180 progeny clones of the mapping population and the pedigree parents was isolated by the use of standard protocols (Herrera and Ghislain 2000). DNA concentration was estimated with a TBS-380 Fluorometer (Turner BioSystems) with PicoGreen reagent using salmon sperm DNA at 500 ng/mL as a reference. All DNA samples were normalized to a final concentration of 250 ng/mL and distributed among members of the Potato Genome Sequence Consortium (PGSC) mapping group to perform multilocation genotyping by using diversity arrays technology (DArT), simple sequence repeat (SSR), single-nucleotide polymorphism (SNP), and amplified fragment-length polymorphism (AFLP) markers.
SNP markers: A custom filtering pipeline was developed to select 1920 SNPs from a set of 69,011 high-confidence SolCAP SNPs (Hamilton et al. 2011) that were incorporated into five 384-plex (5 × 384) Illumina GoldenGate oligonucleotide pool assays (OPAs; Fan et al. 2003), hereafter referred to as POPA (potato OPAs). Hamilton et al. (2011) identified these SNPs by comparing RNA-Seq and EST sequences from six potato cultivars (Atlantic, Premier, Snowden, Bintje, Kennebec, and Shepody) to the draft DM potato reference genome. Our filtering pipeline involved finding nonrepetitive positions on the DM assembly, avoiding overlapping SNPs that may have interfered with the Illumina SNP genotyping assay, and striving to cover the genome as fully as possible. In addition, a POPA containing SNPs derived from pre-existing potato ESTs in the public databases was also designed and used. Table S1B shows details of 2304 SNPs, derived from pre-existing potato ESTs (POPA1) and SolCAP markers (POPA2-6) used in the study. Genotyping was performed using an Illumina BeadXpress platform following the recommendations of the manufacturer (GoldenGate Genotyping Assay, Illumina VeraCode Manual, VC-901-1001). All reagents, unless stated otherwise in the standard protocol, were provided by Illumina. The data files were processed and genotypes called using Genome Studio software.
AFLP markers: AFLP analysis was performed according to the procedures described by Vos et al. (1995) using the restriction enzyme combination EcoRI and MseI. AFLP fragments were separated on a LI-COR 4300 DNA Sequencer (LI-COR Biosciences) using 4.5% polyacrylamide denaturing gels (acrylamide:bisacrylamide, 19:1) as described in the user manual. The LI-COR size standard ladder was loaded into each lane to facilitate the semiautomatic analysis of the gel and the sizing of the fragments. The names of the markers indicate the enzymes used, the selective nucleotides, and the size of the fragment; for instance, EACTMAAC_205.0 is an AFLP marker derived from a primer combination with the enzymes EcoRI and MseI, selective nucleotides ACT and AAC, and a mobility that corresponds to a fragment with an estimated size of 205 bp. Polymorphic bands were manually scored following the intensity degree and the parent backcross pattern. The details of the enzyme combinations, selective nucleotides, and adapter sequences are provided in Table S1C.
DArT markers: Representations from 180 DMDD progeny clones and the pedigree parents (DM, DM/D, D) were obtained by subjecting DNA from each clone to double restriction enzyme digestion (PstI/TaqI) and ligation to PstI adaptors for reducing genome complexity followed by PCR amplification for preparation of targets (Wenzl et al. 2004). Cy3-labeled representations (targets), mixed in an ExpressHyb buffer containing cy5-labeled polylinker fragment of the plasmid used for library preparation (as a reference), were denatured and hybridized to a high-resolution potato genotyping array containing 7680 DArT probes (Sliwka et al. 2012). After overnight hybridization at 62°C, arrays were washed and scanned with 20 μm resolution at 543 nm (cy3) and 488 nm (FAM) on a LS300 confocal laser scanner (Tecan, Grödig, Austria) to detect fluorescent signals emitted from the hybridized fragments. The data from the scanned images were extracted and analyzed using the DArTsoft 7.4 software (Diversity Arrays Technology P/L, Canberra, Australia). The logarithm of the ratio between the two background-subtracted averages of feature pixels in the cy3 and cy5 channels (log2[cy3/cy5]) was used as a measure of the difference in abundance of the corresponding DNA fragment in the two representations hybridized to the array. The log2[cy3/FAM] and log2[cy5/FAM] values, which are approximate measures of the amount of hybridization signal per amount of DNA spotted on the array, were used for quality-control purposes. The unique signal pattern obtained by hybridizing each sample pair (individual clone and reference) to the genotyping array was recorded as "0" or "1." All DArTs were sequenced and are available from the Spud DB site (http://potato.plantbiology.msu.edu/); the detailed methodology is published on the Diversity Arrays Technology website (http://www.diversityarrays.com).
groups. The backcross parents and offspring were coded according to the cross-pollinated (CP) population type (outbreeder full-sib family after two independent meioses). A female-male combined DMDD map was generated that included markers informative in one or both parents. Linkage groups were formed using the Independence LOD parameter under "population grouping" with a range from 2 to 15. Before grouping and ordering markers within linkage groups, loci or progeny clones with ≥20% missing values were removed along with all identically segregating loci. The regression mapping algorithm with modified settings (recombination frequency threshold < 0.49, LOD threshold > 0.01) was used to order loci within each linkage group. All linkage groups were subjected to three rounds of mapping. Recombination frequencies were converted into map distances using the "Kosambi" mapping function.
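For reference, the Kosambi mapping function converts a recombination frequency r into a map distance d = 25 ln[(1 + 2r)/(1 − 2r)] cM. The short sketch below illustrates the conversion only; the function name is hypothetical and is not taken from the mapping software used in the study.

```python
import math


def kosambi_cm(r: float) -> float:
    """Convert a recombination frequency r (0 <= r < 0.5) to Kosambi centimorgans."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination frequency must satisfy 0 <= r < 0.5")
    # Kosambi (1944): d = 1/4 * ln((1 + 2r) / (1 - 2r)) Morgans = 25 * ln(...) cM
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))
```

For small r the distance in cM is close to 100r, and it grows without bound as r approaches 0.5.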

Locating STS markers on the DM assembly
STS markers were aligned to the reference genome assembly using SSAHA2 (Ning et al. 2001) or BLAST. The total set of alignments was processed as follows. First, alignments caused by short repetitive sequences were removed using a custom depth/coverage filter. In detail, any alignment covering a region of the query or target sequence that overlapped with five or more other competing alignments in that region was removed if this depth threshold was exceeded over 20% or more of the alignment length. In this way alignments spanning short repeats were not penalized, but alignments largely composed of likely repeats were removed. Second, short alignments were grouped by sequence into "hits" that allowed for indels. Third, where applicable, the relative distance and orientation of the forward and reverse reads for the marker were taken into consideration. Pairs of forward and reverse reads with an incorrect orientation or implausible separation were removed. Finally, only markers with a unique, high-scoring alignment position on the genome assembly were selected as anchor points in the physical map. The final positions of all the STS markers (SSRs, SNPs, and DArTs) are provided in Table S2.
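The depth/coverage filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the custom pipeline itself: alignments are reduced to half-open intervals on a single sequence, and the function names and interval representation are hypothetical. Only the two thresholds (five competing alignments, 20% of the alignment length) come from the text.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one sequence; end > start


def repeat_fraction(aln: Interval, others: List[Interval], min_depth: int = 5) -> float:
    """Fraction of `aln` covered by at least `min_depth` competing alignments."""
    start, end = aln
    # Positions inside `aln` where the depth of competing alignments can change.
    cuts = {start, end}
    for s, e in others:
        for p in (s, e):
            if start < p < end:
                cuts.add(p)
    points = sorted(cuts)
    deep = 0
    for a, b in zip(points, points[1:]):
        # Depth is constant on [a, b); count competitors spanning the segment.
        depth = sum(1 for s, e in others if s <= a and e >= b)
        if depth >= min_depth:
            deep += b - a
    return deep / (end - start)


def depth_filter(alns: List[Interval], min_depth: int = 5,
                 max_fraction: float = 0.2) -> List[Interval]:
    """Keep alignments whose high-depth portion is below `max_fraction`."""
    kept = []
    for i, aln in enumerate(alns):
        others = alns[:i] + alns[i + 1:]
        if repeat_fraction(aln, others, min_depth) < max_fraction:
            kept.append(aln)
    return kept
```

With one 1000-bp alignment and six competing 50-bp alignments stacked on the same repeat, the long alignment is retained (5% of its length is high-depth) while the six short ones are removed, which matches the stated intent of not penalizing alignments that merely span a short repeat.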

Integration of additional sequence-based and physical resources
DM BAC- and fosmid-end sequences, RH BAC-end sequences, and tomato BAC- and fosmid-end sequences were aligned to the DM superscaffolds using SSAHA2 (Ning et al. 2001). The resulting alignments were filtered as described previously. Roche 454 Paired-end (PE) reads from 14- and 20-kb insert-size libraries from DM, representing 0.7 and 1.0 Gb of raw data, respectively, were aligned to the superscaffold sequences using Newbler (Margulies et al. 2006) with all the default settings. Unsequenced BAC clones from the RH physical map (de Boer et al. 2012) were positioned on the superscaffolds using BLAST alignment of their whole-genome profiling (WGP) sequence tags. For each BAC, the alignment hits of the individual 25 nt tags were processed to retain only unique hits. The aligned BAC clones that carried AFLP markers provided the link between the DM superscaffolds and the RH UHD genetic map (van Os et al. 2006). In addition, sequenced RH BAC clones and RH BAC-end sequences were used for anchoring and scaffolding of the DM sequences. Finally, sequences from the available tomato PMs (v2.40, The Tomato Genome Sequencing Consortium 2012) were aligned using ATAC (Istrail et al. 2004).

Manual scaffolding using the "link-peak" strategy
All paired-end and mate-pair (PEMP) sequence data that could be reliably mapped to the DM superscaffolds were combined to compute a composite directional link-score across each superscaffold. In detail, the link-score combined PEMPs that had unique, high-scoring alignments for both ends of each mate pair sequence, but with the two end sequences aligning to different non-adjoining superscaffolds. A reciprocally high link-score between the ends of a pair of superscaffolds indicated a probable scaffolding link between them. The composite directional link-score is calculated in a sliding window along the length of a superscaffold (the source) as follows:
1. All mate pairs with one end aligning in that window and the other corresponding mate pair end reliably mapping to another superscaffold (the target) are selected. These are designated as unsatisfied mate pairs.
2. These mate pairs are grouped according to the target superscaffold.
3. For each target superscaffold group, a score is calculated by summing the value for each mate pair in that group (see below for details of how the value is determined).
4. The link-peak score is the greatest score of all the target groups.
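The four steps above can be sketched as a small sliding-window routine. This is an illustrative reconstruction only: the tuple layout, window and step sizes, and the per-pair values are assumptions, and the noise-score is not modelled.

```python
from collections import defaultdict
from typing import List, Optional, Tuple

# (position of one end on the source superscaffold,
#  ID of the target superscaffold hit by the other end,
#  value assigned to the pair, e.g. by library type) -- all hypothetical
MatePair = Tuple[int, str, float]


def link_peak_scores(pairs: List[MatePair], length: int,
                     window: int = 10_000, step: int = 5_000
                     ) -> List[Tuple[int, Optional[str], float]]:
    """Return (window start, best target, link-peak score) for each window."""
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        end = start + window
        by_target = defaultdict(float)
        # Steps 1-3: select unsatisfied pairs in the window, group by target, sum values.
        for pos, target, value in pairs:
            if start <= pos < end:
                by_target[target] += value
        # Step 4: the link-peak score is the greatest group score.
        if by_target:
            best = max(by_target, key=by_target.get)
            results.append((start, best, by_target[best]))
        else:
            results.append((start, None, 0.0))
    return results
```

A high score near one end of a source superscaffold that is matched by a high score pointing back from the end of the target superscaffold would correspond to the reciprocal link described in the text.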
Different link-score values were empirically assigned to the different PEMP sequence libraries, with greater scores assigned to DM-based libraries over RH- and tomato-based libraries and greater values given to longer sequences that have more accurate alignments. In addition to accumulating link-evidence from consistent unsatisfied PEMPs, a noise-score was calculated for unsatisfied PEMPs that suggested links to multiple different target superscaffolds. The noise score allowed spurious, high-scoring links caused by repeats to be identified. In this way the evidence for links between pairs of superscaffolds could be conveniently described as a continuous value in wiggle format (https://www.genome.ucsc.edu/goldenPath/help/wiggle.html), which allows for visualization as tracks in GBrowse, alongside genetic and physical evidence from other sources.
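As an illustration of how per-window scores might be written as a wiggle track, the sketch below emits a fixedStep block. The function name, track name, and the choice of setting the span equal to the step (so that data values do not overlap) are assumptions; the exact track settings used for the GBrowse installation are not given in the text.

```python
from typing import Iterable


def to_wiggle(chrom: str, scores: Iterable[float], step: int,
              name: str = "link_peak") -> str:
    """Format one score per window as a fixedStep wiggle block (1-based start)."""
    lines = [
        f'track type=wiggle_0 name="{name}"',
        # span == step: each value covers exactly one non-overlapping step
        f"fixedStep chrom={chrom} start=1 step={step} span={step}",
    ]
    lines.extend(f"{score:g}" for score in scores)
    return "\n".join(lines)
```

Here the `chrom` field would carry the superscaffold identifier, for example `PGSC0003DMB000000001`.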

Visualization of integrated genetic and physical map
The integrated genetic and physical maps of the DM genome were visualized with the software 'DMAP' (D. M. A. Martin, unpublished data). DMAP takes as input the accessioned golden path (AGP) file describing the PM architecture, a GFF file describing the sequence positions of the markers on the superscaffolds, and the JoinMap output file from linkage mapping for each linkage group. As there are many more markers than can be coherently visualized on a printed figure, DMAP employs a selection and layout algorithm in which only a user-determined maximum number of labels is displayed.
DM chromosome idiogram figures were reproduced from the potato reference genome publication (Potato Genome Sequencing Consortium 2011) and were aligned by orienting the short arms toward the start of the PM sequence, except for chromosomes 5 and 11, where the PM sequence begins in the long arm (Tang et al. 2009; Potato Genome Sequencing Consortium 2011).

Identification of centromere positions and pericentromeric regions
Centromere positions were determined with the sequence information provided by Gong et al. (2012). For chromosomes 4, 6, 9, 10, 11, and 12, the DM superscaffolds covering the centromere locations were identified from the major peaks in the CENH3 chromatin immunoprecipitation sequence read plots on the DM V2.1.10 PM sequences. Satellite repeat analysis was performed by searching for the repeats in the DM sequence at http://yh.genomics.org.cn/potato/search.jsp and by evaluating the repeat coverage through dot plot alignment of candidate DM sequences with the repeat sequence. In addition, centromere positions were indirectly inferred from the marker density in the RH UHD genetic map (van Os et al. 2006).
The revised physical positions of all of the Illumina Potato 8303 Infinium array SNPs, reported by Felcher et al. (2012) using their customized version (2.1.11) of potato reference PMs, were obtained for the latest version (4.03) of PMs (Table S3). Graphs depicting the progression of genetic distance and recombination rate vs. physical distance were calculated for all of the SNPs included in the current PMs and D84 and DRH genetic maps, using the MareyMap package (Rezvoy et al. 2007). The pericentromeric heterochromatin regions of the DM PMs were identified in these plots from the absence of genetic recombination between the SNP markers in such regions. In addition, AFLP markers from the marker-dense pericentromeric bins of the RH genetic maps were used to define heterochromatin boundaries in the PMs (Park et al. 2007), especially in cases where the genetic maps of Felcher et al. (2012) offered limited resolution.
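The relationship plotted in such graphs can be illustrated with a simple calculation of local recombination rate as the slope of genetic distance against physical distance. The sketch below uses finite differences between adjacent markers, which is a deliberately simpler technique than the curve fitting provided by the MareyMap package; the function name and input layout are hypothetical.

```python
from typing import List, Tuple


def local_recombination_rates(markers: List[Tuple[int, float]]
                              ) -> List[Tuple[float, float]]:
    """markers: (physical position in bp, genetic position in cM).

    Returns (interval midpoint in bp, rate in cM/Mb) for adjacent marker pairs.
    """
    ordered = sorted(markers)
    rates = []
    for (bp1, cm1), (bp2, cm2) in zip(ordered, ordered[1:]):
        if bp2 == bp1:
            continue  # co-located markers give no slope estimate
        rates.append(((bp1 + bp2) / 2, (cm2 - cm1) / ((bp2 - bp1) / 1e6)))
    return rates
```

Long physical intervals with a rate near zero are the signature used in the text to recognize pericentromeric heterochromatin.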

BAC assembly and comparison with PMs
A total of 96 DM BACs spanning scaffolding gaps on chromosome 4 were selected (using DM BAC-end hits; Potato Genome Sequencing Consortium 2011). The BACs were picked from the library and end-sequenced to verify correct selection. Eighty-two verified BACs were further processed and grouped into six normalized pools as well as a composite master pool containing all 82 BACs. Each of the six BAC pools was subjected to Roche 454 single-end shotgun sequencing and the master pool to 3-kb PE sequencing. Single-end data for each pool were combined with the PE data and were assembled together using the Newbler GSAssembler (Margulies et al. 2006). The sequences were deposited in the EBI Short Read Archive (accession number: ERP000934).
Candidate BAC scaffolds containing BAC-end sequences were identified with BLAST, filtering hits with a minimum match length of 400 bases and bit score exceeding 700 before manual curation. BAC scaffolds were matched to DM genomic superscaffolds with MUMmer (Kurtz et al. 2004). Matching regions were filtered to retain only matches longer than 1000 bp with >97% identity. Data were expressed graphically with matches as edges and BAC end sequences, superscaffolds and BAC scaffolds as nodes using the graphical exchange format. Code was written in Python with the pygexf library and visualization performed with Gephi (http://www.gephi.org). In addition, BAC ends were linked by a BAC label as a node. Assemblies which linked superscaffolds with sequence data could then be readily observed as cycles containing a BAC label in the graph. BAC-oriented GFF files were generated and visualized with R.
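The match filtering step can be illustrated with a short sketch that applies the stated thresholds (longer than 1000 bp, more than 97% identity) and reports BAC scaffolds that match two or more superscaffolds. This is a simplification, not the graph-based procedure itself: it does not use pygexf or Gephi, and the tuple layout and function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (BAC scaffold ID, superscaffold ID, match length in bp, percent identity)
Match = Tuple[str, str, int, float]


def bridging_bac_scaffolds(matches: Iterable[Match], min_len: int = 1000,
                           min_identity: float = 97.0) -> Dict[str, List[str]]:
    """Map each BAC scaffold to the superscaffolds it links, if two or more."""
    links = defaultdict(set)
    for bac_scaffold, superscaffold, length, identity in matches:
        # Retain only matches passing both thresholds.
        if length > min_len and identity > min_identity:
            links[bac_scaffold].add(superscaffold)
    return {bac: sorted(targets) for bac, targets in links.items()
            if len(targets) >= 2}
```

A BAC scaffold returned by this function is a candidate for spanning a scaffolding gap; in the study such candidates were confirmed visually as cycles in the graph.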

RESULTS AND DISCUSSION
DM genome assembly: a brief summary
Sequencing of the potato nuclear genome involved generation of ~96.6 Gb of raw sequence, which assembled into 66,254 "superscaffolds" comprising a net sequence assembly of 727 Mb, 117 Mb less than the estimated genome size of 844 Mb. Superscaffold length decreases as the numerical value in the name of each DM superscaffold (DMB) increases: the largest DMB (7.1 Mb) bears the ID "PGSC0003DMB000000001" and the smallest (100 bp) "PGSC0003DMB000066254." Approximately 94% of the assembled genome is nongapped sequence and more than 90% of the genome (N90) is represented by 622 superscaffolds that are equal to or larger than 0.25 Mb. The anchoring strategy preferentially targeted the larger superscaffolds. At the time of publication 649 superscaffolds equaling 623 Mb (86%) of the assembled genome and 90% of the 39,031 estimated genes were anchored (Potato Genome Sequencing Consortium 2011). Draft PMs for the 12 chromosomes had been constructed but superscaffolds were mostly unoriented. Since the original publication, continuous efforts have been made to perform further anchoring and orientation of the DM superscaffolds in order to generate the revised and improved genome PMs presented here (version 4.03).
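The N90 statistic quoted above can be computed from a list of superscaffold lengths as sketched below. The function name is hypothetical; the calculation simply accumulates lengths from largest to smallest until the requested fraction of the assembly is reached.

```python
from typing import Iterable, Tuple


def nx_length(lengths: Iterable[int], fraction: float = 0.9) -> Tuple[int, int]:
    """Return (number of scaffolds, length of the smallest of them) needed to
    cover `fraction` of the total assembly, taking scaffolds largest first."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= fraction * total:
            return count, length
    raise ValueError("lengths must be a non-empty collection of positive values")
```

Applied to the DM assembly, the two returned values would correspond to the reported 622 superscaffolds and the 0.25 Mb minimum length.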

Genetic analysis of the mapping population
The DMDD mapping population was genotyped for AFLP, SSR, SNP, and DArT markers. Twenty-two AFLP primer pairs (EcoRI/MseI) amplified 213 detectable fragments. A total of 356 SSR loci were assayed. Of 2304 POPA SNPs and 7680 DArTs interrogated, 2160 and 2174 yielded genotype data, respectively. The compiled set of 4903 markers was screened for presence of polymorphism, data integrity, and concordance between parental and progeny genotypes, as well as meeting the missing data threshold (<20%) and other standard quality control checks. These data filtering and quality measures resulted in considerable reduction in the total number of markers used for linkage mapping to 2597, which comprised 187 AFLPs, 234 SSRs, 367 SNPs, and 1809 DArTs. After excluding co-segregating markers, we used a subset of 1864 uniquely segregating loci for linkage grouping; 1751 unique loci were incorporated into a combined parental linkage map with the 12 expected linkage groups, whereas the remaining 113 remained unmapped. The 12 chromosomal linkage groups span 936.2 cM with an average marker spacing of 0.54 cM per interval. The individual linkage groups ranged in size from 62.9 cM (Chr11) to 101.8 cM (Chr03). A combination of the use of previously mapped SSR markers (Veilleux et al. 1995; Milbourne et al. 1998; Feingold et al. 2005; Tang et al. 2008a; Ghislain et al. 2009) and other available resources such as the RH genetic map (van Os et al. 2006), the RH WGP map (de Boer et al. 2012) and the tomato-EXPEN 2000 map (Fulton et al. 2002) allowed orientation and assignment of all 12 linkage groups to their respective chromosomes. Table 1 shows the summary statistics of linkage mapping in the DMDD cross.
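The average marker spacing follows from the definition given with the summary statistics: map length divided by the number of intervals, where the number of intervals for the whole map is the number of mapped markers minus the number of linkage groups. A minimal worked check, with a hypothetical function name:

```python
def mean_spacing_cm(map_length_cm: float, n_markers: int, n_groups: int = 1) -> float:
    """Average interval length: each linkage group has (markers - 1) intervals."""
    return map_length_cm / (n_markers - n_groups)


# Whole-map values reported in the text: 936.2 cM, 1751 loci, 12 linkage groups.
spacing = mean_spacing_cm(936.2, 1751, 12)
```

With these inputs there are 1739 intervals, and the result rounds to the reported 0.54 cM.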
Departure from Mendelian segregation has been observed frequently in potato crosses. Markers showing segregation distortion were not excluded from the mapping process and most could be mapped to their appropriate linkage groups. The frequency of segregation distortion was highly variable among different chromosomes with the most significant distorted regions observed on chromosomes 1 and 4. Previous potato mapping studies have also shown varying levels of segregation distortion (Gebhardt et al. 1991; Felcher et al. 2012). Figure S1 shows genome-wide distribution of levels of segregation distortion for all STS markers used in DMDD.
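Segregation distortion at a single marker is commonly assessed with a chi-square goodness-of-fit test against the expected Mendelian ratio. The sketch below assumes a marker expected to segregate 1:1 and a 5% significance level with one degree of freedom; the text does not state which test or threshold was applied in this study, so both are assumptions, as are the function names.

```python
CRITICAL_1DF_P05 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05


def chi_square_1to1(n_a: int, n_b: int) -> float:
    """Chi-square statistic for observed counts of two genotype classes vs 1:1."""
    total = n_a + n_b
    if total == 0:
        raise ValueError("at least one genotype call is required")
    expected = total / 2
    return ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected


def is_distorted(n_a: int, n_b: int, critical: float = CRITICAL_1DF_P05) -> bool:
    """True if the marker departs from 1:1 at the chosen critical value."""
    return chi_square_1to1(n_a, n_b) > critical
```

For a population of 180 clones, counts of 110 and 70 would be flagged as distorted under these assumptions, whereas 100 and 80 would not.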

Linkage map-based (direct) anchoring
The linkage map of DMDD is predominantly composed of STS markers. The primary map-based anchoring strategy involved locating these sequence-based markers in the DM superscaffolds. SNPs and previously unpublished SSR markers (prefixed with "PM") used in the DMDD linkage map were designed a priori against genome superscaffolds so their unique positions in the relevant superscaffolds were known. The positions of DArT and previously reported SSRs were determined using the bioinformatics alignment and filtering pipeline illustrated in Figure 1.
Co-segregating markers removed during linkage map construction were included in the anchoring process as such genetically redundant markers represent distinct, but physically linked sites in the genome.
The complete set of STS markers was filtered for unique and unambiguous marker-assembly sequence alignments as described. The combined sequence and genetic map coordinates for these unique STS markers were used to assign and order superscaffolds for constructing a framework physical map. The integrated genetic and physical anchoring strategy is shown in Figure 2. Using this strategy, we anchored 1730 (1305 DArTs, 345 SNPs, and 80 SSRs) of the 2292 mapped STS markers, including co-segregating markers, to their unique positions on the DM superscaffolds. This approach anchored 54.2% (394 Mb) of the DM genome assembly arranged into 334 superscaffolds (Table 2). The proportion of genetic markers anchored on the genome sequence from each marker category was 96% (SNPs), 28% (SSRs), and 76% (DArTs). Mapped AFLP fragments were not used in the anchoring process, due to a lack of sequence information. Table S2 contains genomic positions for all the STS markers used in the study. Genetic and physical coordinates for the DMDD mapped markers, including 718 co-segregating markers, are provided in Table S4.

In silico approach-based (indirect) anchoring
The DMDD-based framework physical map was extended by integrating two additional sources of syntenic map data, from potato and tomato, respectively. First, superscaffolds anchored using the RH UHD genetic and physical maps (van Os et al. 2006; de Boer et al. 2012) were added. Second, 2,604 sequence-based markers from the tomato-EXPEN 2000 derived maps, which are closely collinear with potato (Tanksley et al. 1992; Fulton et al. 2002; The Tomato Genome Sequencing Consortium 2012), were used to add superscaffolds. In the case of RH, sequence anchoring was derived from the AFLP- and WGP-based hybrid RH physical map (de Boer et al. 2012) as well as by direct alignment of RH BAC end sequences and fully sequenced RH seed BACs to the DM sequence. In both cases, the (proxy) marker sequences were aligned to the DM assembly using BLAST, adopting stringent matching criteria. The results were processed into reliable genetic anchor points as described previously for the DM markers.
The RH- and tomato-based in silico anchoring strategies independently anchored 470 (527 Mb, 72.5%) and 402 (417 Mb, 57.4%) superscaffolds, respectively (Table 2). Figure 3 shows the superscaffold anchoring summary for both the linkage (DM map) and the two in silico (RH and tomato maps) approaches. The total set of 649 superscaffolds anchored in at least one map was integrated hierarchically, starting with the DMDD-based framework map, placing additional superscaffolds using first the RH and then tomato assignment. The hierarchical 'alignment' of the maps is described below.
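The priority order of this hierarchical integration can be sketched as follows. The example only captures which map supplies the placement for each superscaffold (DMDD first, then RH, then tomato); it does not reproduce the alignment of genetic positions between maps, and the dictionary layout and function name are hypothetical.

```python
from typing import Dict, Tuple

Placement = Tuple[int, float]  # (chromosome, map position); illustrative only


def integrate_hierarchically(dmdd: Dict[str, Placement],
                             rh: Dict[str, Placement],
                             tomato: Dict[str, Placement]
                             ) -> Dict[str, Tuple[str, Placement]]:
    """Assign each superscaffold the placement from the highest-priority map."""
    placed: Dict[str, Tuple[str, Placement]] = {}
    for source, anchors in (("DMDD", dmdd), ("RH", rh), ("tomato", tomato)):
        for scaffold, placement in anchors.items():
            # setdefault keeps an earlier (higher-priority) assignment untouched
            placed.setdefault(scaffold, (source, placement))
    return placed
```

A superscaffold anchored in all three maps therefore keeps its DMDD placement, while one seen only in the tomato map is added last.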

Construction of chromosome-scale PMs
Following anchoring, the superscaffolds were ordered into chromosome-scale PMs in a hierarchical process using genetic, sequence and physical map data. The process is broken into two stages.
Stage I: In the first stage the STS markers from the DMDD genetic map were aligned to the DM superscaffolds and used to construct the "backbone" PMs. Additional sequence-linked and sequence-based markers from the RH and tomato genetic maps were subsequently used to add superscaffolds into the DM backbone PMs (Figure 2). Superscaffolds that were anchored in multiple maps were used as reference points to align the genetic positions in the three different maps. Superscaffolds were added into 'gaps' in the backbone PMs where the positions indicated by the RH and tomato markers were in agreement with the positions initially established by the DMDD map data. The known set of chromosomal inversions on chromosomes 5, 6, 9, 10, 11, and 12 between potato and tomato (Tanksley
Table footnotes:
a Based on the SSRs mapped in previous studies and further confirmed by using in silico approaches.
b Excluding 718 co-segregating markers; when the segregation pattern of two or more markers was identical, only a single marker per set of identical markers was retained to generate the maps; 128 ungrouped markers (including 15 unassigned co-segregating markers) that did not fit any linkage group were also excluded.
c Calculated as the map length divided by the number of intervals (mapped markers minus 1, for "total" it is mapped markers minus 12).
Generally the different anchoring approaches provided direct support for each other with respect to the relative placement of superscaffolds in the PM. With an optimal alignment/agreement for the superscaffold order among the three different maps used for anchoring, 294 of 374 superscaffolds present in at least one map were found to be in the same order as in the other two maps. In some instances, we observed that ordering of superscaffolds derived using RH and tomato maps was inconsistent with that obtained from the DMDD genetic map.
The observed differences could be due to many factors, including technical issues such as mapping or assembly errors or biological properties, such as previously unknown structural differences between the compared genomes. However, given the size and complexity of the potato genome, it is encouraging that the placement of 79% of the superscaffolds was corroborated by the different methods employed.
Although superscaffolds were integrated into genomic blocks at this stage, they were unoriented and, due to the difficulty of aligning genetic maps, largely unordered at the chromosome level. To add, orient, and refine the order of superscaffolds into an AGP for constructing chromosome-scale PMs, a separate process was implemented, as described below.

Figure 2
Step-wise linkage group assignment and ordering of DM superscaffolds using genetic anchoring information successively from the DM, RH, and tomato genetic maps.
Stage II: To orient the DM superscaffolds, and to further refine the DMDD linkage map-based PMs, sequence and physical data from a variety of sources were combined as described in the Materials and Methods section and visualized on a standard GBrowse installation (Figure 4). Custom sequence features were created representing high-scoring inter-superscaffold links, allowing the user to "click-and-walk" along the physical evidence from superscaffold to superscaffold in GBrowse. To aid this visualization, the processed RH WGP and tomato alignments, including the aligned sequence markers from the genetic maps used in stage I, were added to GBrowse as additional sequence feature tracks.
Using this integrated visualization tool, we performed three important types of manual improvements to the stage I PMs: (1) scaffolding links were used to provide the relative orientation of superscaffolds, (2) adjacent superscaffolds not previously included in the integrated genetic/physical map were added, and (3) errors in the assembly were identified. These manual improvements were mainly carried out for the euchromatic (gene-rich) regions and for the euchromatin/heterochromatin borders. In addition to orientating the majority of the anchored superscaffolds, the "link-peak" walk strategy combined with manual curation led to the incorporation of an additional 277 previously unanchored superscaffolds into the PMs.
a Only markers mapped in DMDD and uniquely and reliably anchored to DM assembly are included.

Figure 4
Depiction of the "link-peak" walk strategy, taking superscaffold PGSC0003DMB000000159 as an example. (A) Custom GBrowse "link-peak" intensity track features (shown as red and blue arrows) provided ordered navigation through superscaffolds using the aggregated PEMP. Link peaks to the right (red arrow) indicate the "suggested path" downstream of the AGP, whereas those to the left (blue arrow) indicate the converse. Reversal of this trend indicates a negative strand for the superscaffold in question. Traversing from one superscaffold to another by taking leads from these "link-peak" intensity tracks assisted in manually curating all 12 PMs. (B) Visualization of the underlying PEMP data.

During this process, 67 chimeric superscaffolds were identified. Of these, 62, 3, and 2 superscaffolds were revealed to have one, two, and three misassembly locations, respectively, where false sequence joins had occurred. Many of these errors explained incongruities initially observed in the construction of the backbone PMs from the DMDD map (stage I). Chimeric superscaffolds were manually split and allocated to their respective positions in the PMs. For example, the sequence coordinates 1 to 1117982 bp of PGSC0003DMB000000002 were allocated to chromosome 4, whereas those from 1117983 to 6562806 bp were allocated to chromosome 5. These results further illustrate the utility of an integrated genetic and in silico anchoring-based approach for refining and correcting genome assembly errors. Included in the refinement process were dot plot alignments of DM chromosome PM sequences to pre-release and finished versions of the tomato genome sequence (The Tomato Genome Sequencing Consortium 2012). These alignments focused on the euchromatic regions and the adjacent heterochromatin border regions, where potato and tomato display homology in their sequences. The dot plot alignments to tomato suggested placements for as yet unordered potato superscaffolds and superscaffold blocks, after which BAC end sequence links confirming the suggested orientation were nearly always identified in potato. Very occasionally, the potato PM description relied on the tomato alignment for placing potato sequence blocks in their presumed orientation, e.g., from PGSC0003DMB000000729 to PGSC0003DMB000000835 at the top of chromosome 1 and from PGSC0003DMB000000692 to PGSC0003DMB000001163 in the south heterochromatin border on chromosome 8.
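The manual split of a chimeric superscaffold described above amounts to cutting its coordinate range at each misassembly location and assigning each piece to a chromosome. The sketch below reproduces the PGSC0003DMB000000002 example under stated assumptions: the helper `split_chimeric` and the labels `chr04` and `chr05` are hypothetical, and 6562806 is assumed to be the last base of the superscaffold. It does not reflect the format of the actual AGP file.

```python
def split_chimeric(scaffold_id, length, breakpoints, chromosomes):
    """Split a superscaffold into 1-based, inclusive segments.

    breakpoints: last base of every segment except the final one.
    chromosomes: one chromosome label per resulting segment.
    """
    if len(chromosomes) != len(breakpoints) + 1:
        raise ValueError("need exactly one chromosome label per segment")
    segments = []
    start = 1
    for end, chrom in zip(list(breakpoints) + [length], chromosomes):
        if end < start:
            raise ValueError("breakpoints must be increasing and within range")
        segments.append((scaffold_id, start, end, chrom))
        start = end + 1
    return segments

# Coordinates taken from the example in the text; labels are hypothetical
segments = split_chimeric(
    "PGSC0003DMB000000002", 6562806, [1117982], ["chr04", "chr05"]
)
for seg in segments:
    print(*seg, sep="\t")
```

A superscaffold with two or three misassembly locations would simply pass a longer `breakpoints` list and one extra chromosome label per additional piece.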

Inversions with tomato
The potato-tomato dot plot alignments explained the discrepancies that were found between the potato and tomato genetic maps. In the euchromatic regions and the adjacent heterochromatin border regions we collected the sequence positions of the 19 largest paracentric inversions (with a length of at least 0.3 Mb), which are listed in Table S5 and also indicated in the DM PM figures. Newly identified were, among others, a tandem inversion with minor additional rearrangements on potato chromosome arm 1L, a nested inversion on 2L, and an arm inversion on 8S. Furthermore, the known arm inversions on 9S and 11L were found to be tandem inversions, with the second inversion being located in the heterochromatin border. The chromosomal rearrangements on 2L have also been described by Peters et al. (2012), who presented a scenario involving four structural conversions between potato and tomato. However, our dot plot sequence alignment for this region is less complex and shows a single, smaller inversion inside a larger inversion. This nested inversion model requires only two structural conversion steps and remains compatible with the cytogenetic results of Peters et al. (2012).
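In a dot plot, an inversion appears as a run of alignments on the reverse strand. As a minimal sketch of the size criterion stated above (at least 0.3 Mb), the code below filters reverse-strand alignment blocks by length. The `Block` structure, its field names, and all coordinates are hypothetical; the inversions reported here were collected from the dot plots by inspection, and a nested inversion would appear as a forward block inside a reverse block, which this simple filter does not resolve.

```python
from dataclasses import dataclass

MIN_INVERSION_BP = 300_000  # 0.3 Mb threshold stated in the text

@dataclass
class Block:
    chrom: str
    potato_start: int  # 1-based, inclusive
    potato_end: int    # 1-based, inclusive
    strand: str        # "+" forward, "-" reverse relative to tomato

def candidate_inversions(blocks, min_len=MIN_INVERSION_BP):
    """Return reverse-strand blocks at least min_len bases long."""
    return [
        b for b in blocks
        if b.strand == "-" and (b.potato_end - b.potato_start + 1) >= min_len
    ]

# Hypothetical merged alignment blocks
blocks = [
    Block("chr09", 1, 2_500_000, "-"),
    Block("chr09", 2_500_001, 2_600_000, "-"),   # too short to report
    Block("chr09", 2_600_001, 9_000_000, "+"),
]
hits = candidate_inversions(blocks)
print([(b.chrom, b.potato_start, b.potato_end) for b in hits])
```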
No paracentric inversions were identified on chromosome 3. However, on the short arm, the tomato sequence differs from the potato sequence by a 7.0-Mb insertion, which is located at position 2.4 Mb in the DM chromosome 3 PM, and which runs from 1.3 to 8.3 Mb in the tomato SL2.40 assembly. In its center, this tomato insert has 4.2 Mb of sequence that is largely devoid of genes (http://potato.plantbiology.msu.edu/), while the start and end regions align with gene-containing potato sequence segments from region 42.0 to 50.4 Mb on the south arm of chromosome 3. Although these data suggest a translocation of sequences across the centromere, further investigation is needed to exclude sequence assembly errors.

Validation of link peak-based orientation strategy for chromosome 4
The strategy for PM construction and assembly correction was validated on chromosome 4 by targeted sequencing of 82 DM BAC clones that were selected to overlap candidate links as well as 10 of the 15 putative chimeric superscaffolds mapped to this chromosome. Thirty-one BAC clones could be assembled with contigs which spanned multiple superscaffolds and provided full coverage between the BAC end sequence matches to the superscaffolds, both validating the assembly and providing direct evidence for all 10 chimeric breakpoints. Seven of these sequenced BACs allow the inclusion of further superscaffolds that had not previously been assigned to a PM, and one provides evidence for a superscaffold that had been erroneously included.
In addition to the complete assemblies described previously, most other clones could be assembled to a series of contigs that did not span multiple superscaffolds and that have not been included in the BAC pool assembly summary (Table S6). Details of the BAC analysis are given in the Materials and Methods section, and a representative example validating a potential breakpoint in chromosome 4 is illustrated in Figure 5. A list of putative erroneous superscaffold assembly locations (breakpoints), and the BACs that provide validation for them, is given in Table S7.

Figure 5
Assembled BAC sequence for LuSP197F07. Each scaffold assembly is derived from PE sequences of a combined pool of 82 DM BACs (spanning scaffolding gaps on chromosome 4) and single end sequence at greater read depth from one of the six subpools derived from the same BACs. The assemblies show a direct sequence running from PGSC0003DMB000000278 (- orientation, full length, cyan) through into PGSC0003DMB000000051 (+ orientation, blue) in accordance with the AGP and fully validating the decision to split PGSC0003DMB000000278 at position 824768 and to split PGSC0003DMB000000051 at position 1859342 as indicated in the AGP file. Regions of good alignment (>98% identity, >1000 bases) are indicated as thick lines. Thin lines indicate no good alignment between the superscaffold and BAC sequences. The BAC end sequences are labeled with their GenBank IDs and are indicated at each end of the plot by black arrows. Breakpoints in the BAC sequences are indicated by orange diagonal lines and annotated with the assigned breakpoint coordinates from the AGP.

Demarcating centromeres and pericentromeric boundaries in the PMs
The putative centromere locations for 7 of the 12 potato chromosomes were identified in the PM sequences based on data published by Gong et al. (2012) (Table S8). Six centromere locations were identified from chromatin-immunoprecipitated sequences. Of the seven published centromeric satellite repeat sequences (Gong et al. 2012), only the St24 repeat, specific for the chromosome 1 centromere, identified DM sequences with the high repeat copy number characteristic of centromeric regions. With the other six centromeric repeat sequences, we could not find reliable centromeric targets in the DM assembly because these sequences identified only locations with very few repeat copies, some of which occurred on a chromosome other than the one expected from their designated centromeres.
Pericentromeric boundaries were deduced by comparing the SNP-based D84 and DRH genetic maps of Felcher et al. (2012) to the current version of the PMs. For all chromosomes, the typical pattern of distinctly reduced recombination in pericentromeric regions and increased, variable recombination rates in euchromatic regions was observed (Figure 6). These patterns were used as the primary information source to demarcate putative pericentromeric regions in the PMs, and the boundaries of these regions were well supported and, where needed, refined by the RH genetic maps (van Os et al. 2006). Figure 7 and Figure S2 depict the centromere and pericentromeric locations in the PMs. The pachytene chromosome idiograms in these figures are adapted from Potato Genome Sequencing Consortium (2011).
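The pattern described above can be made concrete by computing the local recombination rate (cM per Mb) between consecutive markers along a PM; intervals with very low rates are candidates for pericentromeric regions. The sketch below is an illustration only: the marker coordinates and the 0.5 cM/Mb cutoff are hypothetical, and the boundaries in this study were set by inspecting the genetic versus physical plots rather than by a fixed threshold.

```python
def recombination_rates(markers):
    """Local recombination rate between consecutive markers.

    markers: list of (physical_mb, genetic_cm), sorted by physical position.
    Returns a list of (start_mb, end_mb, cm_per_mb).
    """
    rates = []
    for (mb0, cm0), (mb1, cm1) in zip(markers, markers[1:]):
        span = mb1 - mb0
        if span <= 0:
            continue  # skip markers sharing one physical position
        rates.append((mb0, mb1, (cm1 - cm0) / span))
    return rates

# Hypothetical markers: (position in Mb, position in cM)
markers = [(0.5, 0.0), (2.5, 10.0), (6.5, 20.0),
           (26.5, 21.0), (46.5, 22.0), (50.5, 34.0)]
rates = recombination_rates(markers)

LOW_CM_PER_MB = 0.5  # hypothetical cutoff for "reduced recombination"
low_intervals = [(a, b) for a, b, r in rates if r < LOW_CM_PER_MB]
print(low_intervals)
```

In this toy data set, the two long intervals in the middle of the chromosome carry almost no genetic distance, which is the signature used to place the dashed pericentromeric boundaries.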

Current status of the reference PMs
The genome anchoring, ordering, and orienting process, as described previously, led to the joining of 951 genome superscaffolds, or nonchimeric segments thereof, into 144 larger, contiguous sequence blocks, and enabled construction of an AGP assembly for the reference DM potato genome. These chromosome-scale PMs, version 4.03, contain 93% (compared with 86%; Potato Genome Sequencing Consortium 2011) of the assembled genome, comprising 674 Mb in 951 superscaffolds, and include 37,482 (~96%) of the 39,031 predicted genes. A total of 938 superscaffolds (655 Mb or ~90% of the assembled genome sequence) are assigned absolute or relative orientation within the PMs, whereas the remaining 13 superscaffolds (19 Mb) are assigned a random orientation. For 279 Mb of superscaffold sequence blocks from the heterochromatin, the exact chromosome position and absolute orientation could not be determined. These partially unordered regions are marked yellow in the PM figures (Figure 7 and Figure S2). No attempts were made to estimate gap sizes between the superscaffolds, and in the PM sequences all superscaffolds are separated from each other by a fixed gap sequence of 50,000 Ns. The N90 of the DM potato genome assembly is 0.25 Mb and comprises 622 superscaffolds, of which 28 (equalling 17 Mb, ~2% of the assembled genome sequence) remain unanchored. The longest anchored superscaffold is 7.1 Mb (PGSC0003DMB000000001; chromosome 1) and the longest unanchored superscaffold (PGSC0003DMB000000064) is 2.2 Mb. The increase in average N50 from 1.5 Mb to 4.1 Mb in DM version 4.03 (Table 3) further supports the enhanced quality of the constructed PMs. The current version of the PMs/AGP is provided in Table S9 and includes the list of unanchored (chromosome 0) and chimeric superscaffolds.
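Because every pair of neighboring superscaffolds is separated by a fixed gap of 50,000 Ns, PM coordinates follow directly from the superscaffold order and lengths. The sketch below lays out such coordinates; the function `layout_pseudomolecule` and the superscaffold names and lengths are hypothetical, and the rows are a simplified stand-in for AGP records rather than the actual content of Table S9.

```python
GAP_BP = 50_000  # fixed gap of Ns between superscaffolds, as stated in the text

def layout_pseudomolecule(superscaffolds, gap_bp=GAP_BP):
    """Assign 1-based, inclusive PM coordinates to ordered superscaffolds.

    superscaffolds: ordered list of (name, length_bp).
    Returns rows of (name, start, end); gaps are reported as "gap".
    """
    rows = []
    pos = 1
    for i, (name, length) in enumerate(superscaffolds):
        if i > 0:  # a gap separates each pair of neighbors
            rows.append(("gap", pos, pos + gap_bp - 1))
            pos += gap_bp
        rows.append((name, pos, pos + length - 1))
        pos += length
    return rows

# Hypothetical superscaffolds in PM order
rows = layout_pseudomolecule(
    [("sc_A", 1_000_000), ("sc_B", 250_000), ("sc_C", 500_000)]
)
pm_length = rows[-1][2]
for row in rows:
    print(*row, sep="\t")
```

Since the gaps are fixed rather than estimated, PM coordinates indicate order and approximate position only; physical distances that span a gap should be interpreted with that in mind.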
To visualize the differences and improvements in the constructed PMs, we compared dot plots of the current PMs (ver 4.03) to the earlier version 2.1.11 (Figure 8). Superscaffold misplacements were apparent as horizontal or vertical shifts in parts of the alignments in all pairwise comparisons. The overall structural integrity of the constructed PMs is visible from the expected gradual transition from gene-rich to gene-poor regions, which is complemented by the typical pattern of high repeat density in the pericentromeric locations gradually declining toward the gene-rich euchromatic regions (Figure 6). The PMs, along with the integrated DMDD and RH genetic maps, were visualized using DMAP as described in the Materials and Methods section. Figure 7 shows a representative illustration for chromosome 1 (chromosomes 2-12 are shown in Figure S2). Good correspondence between the DMDD and RH genetic maps and the PMs was observed.
Although the DMDD map-based strategy was critical in providing the basic anchoring to the DM genome, it had its limitations. Certain superscaffolds lacked sufficient polymorphic STS markers for genomic anchoring and were possibly affected by homozygosity, segregation distortion, or other issues (Figure S1). This mainly occurred in pericentromeric/heterochromatin regions (marked by dashed lines, Figure 7 and Figure S2), which generally displayed a sparse coverage with DMDD markers, possibly due to the customized marker design strategy that precluded the design of markers in highly repetitive, relatively gene-poor regions. For example, SNPs were designed against coding regions using RNA-Seq data (Hamilton et al. 2011) and, thus, were mainly localized to gene-rich regions, which occupy a different "genomic space" to the gene-poor, high-repeat content regions (Figure 6). The DM-based "PM series" SSRs were designed from repeat-masked genome sequence to avoid placement in repetitive DNA. The DArT methodology also uses genome complexity reduction and has been shown to target the low copy fraction of a plant genome through judicious selection of certain restriction enzymes (Jaccoud et al. 2001). Thus, the unavoidable bias toward nonrepetitive sequences in the STS markers employed in the DMDD map resulted in many unanchored superscaffolds. This issue was resolved by using additional resources that we refer to as the in silico anchoring approach. For example, the large block of "orphaned" superscaffolds, not directly connected to the DMDD map, stretching from DMB 394 to DMB 705 (with the exception of DMBs 193, 15, 59, 100, and 200) on chromosome 1 (see Figure 7) was anchored by the evidence derived from the WGP/AFLP-based RH map and the tomato-EXPEN 2000 map and further extended by the "link-peak walk" strategy, illustrating the importance of the multi-layered anchoring approach adopted here.
Potato genomic resources are provided as tracks/features in the GBrowse for the DM genome (hosted at the Spud DB site "http://potato.plantbiology.msu.edu/"). One such resource, widely adopted by the potato community, is the Illumina Potato 8303 SNP Infinium array (Felcher et al. 2012), released after our map was constructed. This SNP array was used by Felcher et al. (2012) to construct two genetic maps, both involving DM as the female parent. Although the homozygosity of DM precluded segregation of DM loci in these populations, the resulting maps showed good congruence for most linkage groups to the prerelease version (a modified ver 2.1.10, latterly referred to as ver 2.1.11) of the DM PMs. Version 4.03 of the PMs provides an improved correspondence with the genetic maps of Felcher et al. (2012) (Figure 6). An updated annotation of the Illumina Potato 8303 SNP Infinium array is provided in Table S3. The DMDD genetic map and associated data files are available at http://solgenomics.net/, and include hyperlinks to the MSU Genome Browser. All of the supplementary data, wherever applicable, are available to download as GFF format files from the Spud DB site "http://potato.plantbiology.msu.edu/". The potato GBrowse, including all of the hosted genomic resources/tracks/features, has also been updated to the latest version (PM 4.03) of the DM PMs.

Figure 6
Enhanced accuracy of the current DM PMs. Panels A and E show anchoring of superscaffolds to the PM versions 4.03 and 2.1.11, respectively. Superscaffolds with known and unknown orientations are depicted in alternating shades of blue and red, respectively. Gaps in between the superscaffolds are marked in gray. Black areas in panel E represent unanchored superscaffolds (version 2.1.11) that were eventually anchored and ordered in PM version 4.03. Panels B and C show gene and repeat region densities, respectively, in 1-Mb bins of PM version 4.03. Gene and repeat region densities range from 0 to >150 genes/Mb and 0 to >900 repeats/Mb, respectively. Panels D and F show the correspondence of the genetic maps (D84, green; DRH, black), adapted from Felcher et al. (2012), to PM versions 4.03 and 2.1.11, respectively. Graphs show the genetic (cM) positions plotted against the physical coordinates (Mb) for the SolCAP SNP markers; panels G (D84) and H (DRH) show elaborated examples of good correspondence from chromosome 9.

Figure 7
Illustration of the chromosome 1 PM integrated with the DM and RH genetic maps. STS and AFLP markers anchor sequence locations in the chromosome 1 PM to the DMDD and RH genetic maps, respectively. The AFLP marker positions in the PM were identified through sequence tag alignment of BAC clones from the RH WGP physical map. Superscaffolds comprising the PM are shown as alternating gray and white rectangular blocks. The layout of the PM for each of the genetic maps is shown separately but is identical, with superscaffold IDs depicted in the middle. The pachytene idiogram is adapted from the potato reference genome publication (Potato Genome Sequencing Consortium 2011). The putative centromere region and pericentromeric/heterochromatic boundaries are demarcated by an asterisk and dashed lines, respectively. Each DMDD marker type is color coded: blue = DArTs, yellow = SNPs, green = SSRs. Blue and magenta lines emerging from the RH genetic map represent AFLP anchors, and the intensity of green color corresponds to the AFLP marker density per bin as reported by Van Os et al. (2006). Magenta lines represent AFLP markers with a relatively inaccurate mapping position on the RH genetic map, covering an interval of 5 or more bins. Regions in the central heterochromatin where superscaffold order and orientation are not completely resolved are indicated in yellow. Inversions with the tomato sequence are indicated with red interval bars.

Conclusions
The integrated genetic and physical reference map presented here comprising nearly 2500 markers, which are mostly STS, provides a platform for exploiting the potato reference genome. The most obvious and immediate application is the ability to position any sequence-based marker locus to a precise location in the DM genome. This will revolutionize trait analysis, although progress will be dependent on the complexity of the trait concerned, population size, replication and accuracy of phenotypic data and other factors that impinge on map resolution. Once mapped, the genome sequence around the locus can be used to design additional genetic markers for fine-scale mapping, and to identify putative candidate genes using the genome annotation. Such genes can be resequenced from informative plants showing phenotypic variation for the target trait. This ability to move directly from "map to genome to gene" will hasten the identification of genes responsible for traits. However, the automated annotation still includes many genes of "unknown function" and there are likely to be as yet unannotated genes in the genome sequence. Moreover, the DM genome represents only one haplotype in a species known to exhibit abundant sequence diversity.
The conversion of ~93% of the assembled genome sequence to well-structured, oriented, and annotated PMs has made potato more amenable to modern genomic/genotyping approaches, such as genotyping-by-sequencing (Uitdewilligen et al. 2013). The clear and irreversible shift toward sequence-based polymorphism in place of "fragment-based" markers will have the effect of augmenting centimorgan positions with genome sequence coordinates, providing a means for verifying the accuracy of mapping studies. The integrated DMDD map complements the published potato genome sequence and adds to a growing number of resources for genetic and genomic analyses.
The integrated map presented here and associated resources will help to alleviate many of the complicating aspects of potato as a genetic system. Potato is the most economically important crop in which cultivars are highly heterozygous polyploids that suffer severe inbreeding depression on self-pollination. Such breeding systems make breeding and genetic studies difficult, and cultivar development generally requires simultaneous recurrent selection for several traits over many years of evaluation. Introduction of traits that would make such crops more sustainable, e.g., drought and salinity tolerance as well as nutrient use efficiency, will be targeted as we confront global climate change and dwindling natural resources (Levy et al. 2013). Moreover, attempts to convert the cross-pollinated tetraploid breeding system into an F1 hybrid diploid-based scheme are also in progress (Lindhout et al. 2011). The isolation of genes coding for key traits, and characterization of their functional allelic diversity, will be greatly facilitated by the resources provided in this study. A recent example is the identification of a gene largely responsible for the adaptation of Andean-derived potato germplasm to the longer day-lengths of temperate latitudes (Kloosterman et al. 2013).
The work presented here has generated a greatly improved ordering of the potato reference genome superscaffolds into chromosomal PMs. The reconfigured PMs and their links with genetic maps provide a major new resource for the research community. They form the basis by which geneticists can identify genes underlying important traits and through which comparative genomics can be further exploited in diversity assessment, phylogenetic inference, and plant breeding.

ACKNOWLEDGMENTS
We thank Andrzej Kilian (Diversity Arrays Technology, Australia) for DArT genotyping of the DMDD mapping population. We acknowledge Peter E. Hedley and Clare Booth (The James Hutton Institute, UK) for help with SNP genotyping. We thank S. B. Divito (Instituto Nacional de Tecnología Agropecuaria, Balcarce, Argentina) for technical assistance. We are also grateful to Luke Ramsay and Peter E. Hedley (The James Hutton Institute, UK) for comments on the manuscript. AFLP and WGP are (registered) trademarks owned by KeyGene N.V. We acknowledge the funding made available by the Potato Genome Sequencing grant, UK [Scottish Government Rural and Environmental Science and Analytical Services Division (RESAS), Department for Environment, Food

a Refers to the status of PMs before execution of the "Link-peak" walk strategy.
b Refers to the status of PMs after execution of the "Link-peak" walk strategy.
c Only attempted at stage II.
d Total.
e Average.
f Chimeric superscaffolds have been included more than once (net number of DMBs anchored = 951).

Figure 8
NUCmer sequence alignment dot plots for the twelve potato chromosomes using current (ver 4.03, plotted on x-axis) and previous (ver 2.1.11, plotted on y-axis) versions of DM PMs. Sequences aligned in forward and reverse orientations are represented by red and blue lines, respectively. Scaffold misplacements are shown as horizontal or vertical shifts in parts of the aligned blocks.