Integration of the Draft Sequence and Physical Map as a Framework for Genomic Research in Soybean (Glycine max (L.) Merr.) and Wild Soybean (Glycine soja Sieb. and Zucc.)

Soybean is a model for the legume research community because of its importance as a crop, densely populated genetic maps, and the availability of a genome sequence. Even though a whole-genome shotgun sequence and bacterial artificial chromosome (BAC) libraries are available, a high-resolution, chromosome-based physical map linked to the sequence assemblies is still needed for whole-genome alignments and to facilitate map-based gene cloning. Three independent G. max BAC libraries combined with genetic and gene-based markers were used to construct a minimum tiling path (MTP) of BAC clones. A total of 107,214 clones were assembled into 1355 FPC (FingerPrinted Contigs) contigs, incorporating 4628 markers and aligned to the G. max reference genome sequence using BAC end-sequence information. Four different MTPs were made for G. max that covered from 92.6% to 95.0% of the soybean draft genome sequence (gmax1.01). Because our purpose was to pick the most reliable and complete MTP, and not the MTP with the minimal number of clones, the FPC map and draft sequence were integrated and clones with unpaired BES were added to build a high-quality physical map with the fewest gaps possible (http://soybase.org). A physical map was also constructed for the undomesticated ancestor (G. soja) of soybean to explore genome variation between G. max and G. soja. 66,028 G. soja clones were assembled into 1053 FPC contigs covering approximately 547 Mbp of the G. max genome sequence. These physical maps for G. max and its undomesticated ancestor, G. soja, will serve as a framework for ordering sequence fragments, comparative genomics, cloning genes, and evolutionary analyses of legume genomes.

and segmental duplications that cannot be spanned by the short sequence reads (Lewin et al. 2009).
Clone-based maps have been integral to several genome sequencing projects, contributing to gene cloning, the understanding of genome structure, and evolutionary studies. McPherson et al. illustrated the benefit of using the clone-based physical map in the assembly of the human genome sequence (McPherson et al. 2001). A physical map also contributed to the sequencing of the Drosophila melanogaster genome (Hoskins 2000), and a combination strategy of physical mapping and sequencing was applied to the mouse genome (Bouck et al. 2000;Pennisi 2000). To support the increasing interest in map-based gene cloning of important genes, the physical map of Arabidopsis thaliana was constructed, resulting in deeper understanding of genome structure and evolution (Mozo et al. 1999). Rice genome sequencing data were integrated with a physical map, and this integrated high-resolution physical map facilitated genome sequencing through a minimal tiling path of BAC clones (Chen et al. 2002). To build a foundation to sequence the maize genome, physical and genetic maps of maize were developed and anchored to each other, resulting in an useful tool for evolutionary studies of maize (Cone et al. 2002;Wei et al. 2007;Wei et al. 2009).
For soybean, physical maps were constructed using BAC libraries from cv. Forrest and cv. Faribault (Wu et al. 2004a,b). However, the community selected the cultivar Williams82 for a reference genome sequence. A high-quality physical map was needed as a foundation to improve the usefulness of the whole genome sequence for the research community. An initial physical map for Williams 82 was derived from two BAC libraries made with different restriction enzymes (Pampanwar et al. 2005;Soderlund et al. 2000;Warren 2006). This map consisted of 97,272 fingerprinted BAC clones comprising 1893 contigs and approximately 30,000 singletons. The physical map needed to be integrated with the genome sequence and oriented with the genetic map to identify genes underlying quantitative trait loci, which is important for the genetic improvement of soybean and to understand the molecular and genetic basis of traits (Jackson et al. 2006). To improve the genetic anchoring of physical map of G. max, 3290 microsatellites (simple sequence repeat [SSR]) markers were identified from BAC end sequences (BES) of clones comprising the initial physical map and 265 of these SSR were genetically mapped (Shoemaker et al. 2008).
The genomes of G. max and G. soja have been sequenced using whole-genome shotgun sequencing, G. max with traditional Sanger sequencing, and G. soja with next-generation sequencing. In both instances, a physical map can be used to improve the genome sequence by spanning gaps and correcting alignments. Wild soybean, G. soja, is a promising source of genes/alleles that were lost during n domestication bottlenecks (Hyten et al. 2006). Thus, the physical map of G. soja will be useful to clone potentially valuable genes, to improve the quality of the G. soja genome sequence, and as a foundation for comparative evolutionary studies. For the G. max physical map, a minimum tiling path (MTP) can be picked using BESs aligned to the genome sequence. Traditionally, the main purpose of a MTP has been to efficiently select clones to be sequenced; in other words, to minimize the number of clones to be sequenced by selecting clones that are adjacent and overlap minimally. In the case of G. max, in which the whole-genome shotgun data are available, the primary purpose of the MTP is to have a physical map anchored to the genome sequence, thereby providing a framework for genomic research. A reliable MTP covering nearly the whole genome complements a genome shotgun sequence in that it can be used to correct misalignments and to span gaps, which is important for finishing regions and cloning genes. For G. soja, the physical map provides an anchored, clone-based resource to shuttle between the two genomes, domesticated and undomesticated.

Source BAC libraries
The DNA source for soybean BAC libraries was from the cultivar Williams 82 that has been chosen as the standard genotype by the soybean community for genomic studies (Stacey et al. 2004). Three different restriction enzymes HindIII, BstyI, and EcoRI, were used to construct the three libraries, GM_WBa, GM_WBb, and GM_WBc, respectively ( Table 1). The DNA for G. soja BAC library, GSS_Ba, was from a single plant of accession PI468916, and HindIII was used to construct the library ( Table 1).

Source of sequences
Assembly of shotgun sequenced fragments in soybeans presents substantial challenges because of the duplicated nature of the genome (Shoemaker et al. 1996), many repeat sequences, and common domains of several gene families. Although the shotgun sequencing data (gmax 1.01) has several fold coverage of the entire genome, 377 gaps remain (Schmutz et al. 2010). We integrated 950,068,807 bp of sequence length from the 20 pseudomolecules with the physical map ( Table 2).

Source of MTP
Because the gmax 1.01 soybean assembly did not filter out clones with unusually long or short inserts, we limited BAC lengths to a range of 75 kb to 225 kb when MTPs were picked from two different clone pools; one pool contained only BAC clones, which were used to construct the FingerPrinted Contigs (FPC) map (clone pool A), and the other contained all the BAC clones from the three BAC libraries (clone pool B). Two kinds of MTPs were picked from each clone pool by using Dijkstra's shortest path algorithm (Dijkstra 1983). One MTP was picked from only the BAC clones with paired BESs and the other from BAC clones with both paired and unpaired BESs in order to try and extend coverage into sequence gaps ( Figure 2).

Spanning gaps in the FPC map
To span the gaps in the preliminary FPC map having 1893 contigs, the map was integrated with a preliminary 4x sequence assembly from the Joint Genome Institute and the Stanford Human Genome Center. The average length of contigs was 157,040 bp, and the maximum size was 20,109,437 bp (Batzoglou et al. 2002;Jaffe et al. 2003). The integration was performed using the BSS and MTP modules of FPC as described in Nelson and Soderlund (2009). The 148 spanned gaps (contig merges) were automatically identified and performed by FPC (Table 3).
There are many gaps represented as a series of Ns in the 8x soybean sequence (gmax1.01). A total of 1000 Ns indicate gaps between scaffolds that were not spanned using the Arachne assembler, 100 Ns indicate gaps without length information, and a specific number of Ns indicate gaps of known size ( Figure 2). We assumed that a BAC clone would span at least part of a gap when one BES aligned near the edge of a contig abutting the gap and the clone pointed into the sequence gap. Some of the larger gaps with thousands of Ns were spanned by BAC clones with paired and/or unpaired BES by blast searching against the physical map already integrated with the 8x draft sequence data. To increase the coverage of the MTP picked from the clones building the FPC map, the physical location of the gaps on the FPC map were checked and the clones with unpaired BESs corresponding to the location were added to the MTP.

BAC libraries
Three Glycine max cv. Williams 82 BAC libraries, GM_WBa, GM_WBb, and GM_WBc (http://genome.arizona.edu), were made with three different restriction enzymes, HindIII, BstyI, and EcoRI, respectively, to reduce the likelihood of missing parts of the genome attributable to cloning bias. All three libraries were used to construct n the G. max physical map. A BAC library was constructed using HindIII for Glycine soja PI468916, called GSS_Ba. The average insert size of GM_WBa, GM_WBb, GM_WBc, and GSS_Ba were 150, 150, 131, and 150 kb and represent 5.4, 12, 10.9, and 12x coverage of each genome, respectively. Subsets of each library were fingerprinted for construction of the FPC maps (Table 1).
FPC maps for G. max and G. soja Fingerprinted clones were clustered into contigs on the basis of their probability of coincidence score using the FPC software package (Soderlund et al. 1997(Soderlund et al. , 2000 contigs, and 26,968 and 15,219 clones remained as singletons (BACs that did not order into a contig), respectively (Table 4). Of the contigs, 1355 (78%) of G. max's and 1053 (37%) of G. soja's were ordered and oriented to 20 soybean chromosomes (Schmutz et al. 2010) using the alignment function of FPC (Nelson and Soderlund 2009 Figure 1). For the G. max alignment, unanchored sequence scaffolds were included in gmax 1.01, whereas for G. soja, only anchored scaffolds were used. In terms of the consensus FPC map, 607,788 and 426,033 cb units (Consensus Bands) were included in the aligned contigs for G. max and G. soja, respectively (93% of 648,007 cb units and 52% of 815,128; Table 5).
Genetic marker data for G. max For a physical map to be useful in the assembly of a whole-genome sequence, it must be anchored to the genetic map (Jackson et al. 2006). A genetically anchored physical map is helpful not only for gene cloning but for a better understanding of genome structure that might confound a whole genome sequencing strategy (Shoemaker et al. 2008). Genetic markers and gene-based sequences from G. max were used to screen the BAC libraries (results available at http://www. soymap.org) to integrate the genetic and physical maps. The soybean genome sequence was then combined with the physical map using BES (Schmutz et al. 2010) so that the FPC contigs could be further integrated with the sequence and genetic maps. In this study, 4628 genetic markers consisting of 3952 SSR markers and 676 RFLP markers were anchored to the G. max physical map. Of these markers, 1725 were multiple-hit markers (MHM), indicating that the markers were anchored more than two BAC clones, 1181 MHM were linked to more than two contigs, 503 MHM were anchored to multiple clones on one contig, and 41 MHM were anchored to multiple singletons. The average number of contigs hit by the 3952 SSR markers was 1.5, and the average number of contigs hit by 676 RFLP markers was 1.6. Of 3952 SSR markers, 417 hit 0 contigs, 2601 hit 1 contig, 301 hit 2 contigs, and 633 markers hit more than 2 contigs. Of the 676 RFLP markers, 98 hit 0 contigs, 331 hit 1 contig, 145 hit 2 contigs, and 102 hit more than 2 contigs (Table 6). There were many MHM primarily as the result of the short sequences used to screen the BAC libraries and the duplicated soybean genome; however, these data are useful for confirmation of clone order and contig integrity and alignment to the sequence map.
Minimum tiling path (MTP) for G. max Four paradigms have been used to pick minimal tiling paths from FPC fingerprint maps. The first is a map-based approach. Fingerprints of clone pairs that appear to have minimal overlap are analyzed in the FPC gel image display (Coulson et al. 1986). The second is a BESbased approach in which a seed clone is selected and sequenced. This sequence is used to query a BES database to find a minimally overlapping clone; the process is then repeated iteratively (Venter et al. 1996). The third is a hybrid of the first two in which the seed clone selecting and extending process is the same as mentioned previously but the overlap is verified using a map-based approach to reduce the risk of false-positive overlaps (Marra et al. 1999). The fourth approach makes use of both BES and existing genomic sequence by using BESto-sequence alignments to estimate BAC overlaps more accurately than is possible from fingerprint overlaps alone. Functions to implement this approach are built into FPC (Nelson and Soderlund 2009).
In the case of soybean, a genome sequence data (gmax 1.01) is already available. We integrated the sequence with the FPC map to build BAC-based pseudomolecules representing the 20 soybean chromosomes (http://soybase.org). Therefore, our MTP does not need to be "minimal" in the sense of budget constraints for BAC sequencing, and we instead selected BAC clones with the greatest reliability while attempting to minimize overlap between adjacent BACs (Figure 1). Two types of MTPs were picked from two different clone pools: (A) using only the clones contained in the FPC map; and (B) using all the clones from all three BAC libraries that had BESs, which may have been excluded from the FPC map because of fingerprinting errors (hereafter referred to as clone pools A and B, respectively). In the first approach, proximity in FPC provides an additional confirmation of overlapping MTP clones; however, a number of clones that have BES are not contained in the FPC map because of fingerprinting failures.
FPC provides an approximation of where clones should be relative to one another in a contig as there may be error in the band calling of individual clones or in the determination of clone overlap. Therefore, for the clone-ordering process, clones may not end up in the FPC map although BESs can be used to order clones relative to the genome sequence. Thus, we used not only the FPC clones but also the clones not in FPC but having BESs to improve the accuracy of the BACbased maps. The MTP with only FPC clones consists of 1422 GM_WBa, 3887 GM_WBb, and 2035 GM_WBc BAC clones containing 914 gaps and an average of 21.9 kbp overlap between n  (Table 7). To attempt to span gaps in the sequence scaffolds, clones with unpaired BES were added to MTPs. BACs with unpaired BES were anchored to MTP only when they aligned near the edge of a contig pointing toward the gap (Figure 2). In the MTP composed of clones only in the FPC map, 146 gaps were spanned by clones with unpaired BESs and the average overlapping region was elongated by an average of 1.5 kbp. In the MTP built with all three BAC libraries, 160 gaps were covered by the clones with unpaired BESs and the BAC overlaps were extended by an average of 1.4 kbp (Tables 3 and 7).
Alignment of G. soja BESs to G. max genome sequence G. soja's BES were aligned to G. max's whole-genome sequence (gmax1.01) to detect structural difference between G. max and G. soja.
Of 180,099 total BESs, 88,950 clones have paired end sequences, and 2199 clones have sequence for one end only (Table 8). Alignments of these BESs to the gmax1.01 genome resulted in 2675 of the 88,905 clones having only one end aligned to the reference genome. A majority of the clones, 67,047, had BESs that could be aligned to the same chromosome; however, 19,143 clones had BESs that aligned to different chromosomes, indicative of potential rearrangements ( Figure 3A). By examining the distance and orientation of paired BESs, we were able to look at intrachromosomal rearrangements. BES pairs when aligned to the genome should be inverted relative to each other (sequencing from either end of the cloning vector) and we expected the distance between the ends to be within 75 to 225 kbp of each other ( Figure 3B). Of the 67,047 clones where paired BESs aligned to same chromosomes, 89.3% (59,899) were within a range of 75 kbp to 225 kbp, 2.9% (1965) were greater than 225 kbp, and 5.0% (3352) less than 75 kbp apart (supporting information, Figure S1). BAC clones where paired BESs aligned more than 1.5 Mbp apart were excluded as potential artifacts. Of 3796 clones, 1965 were included within 225-kbp to 1.5-Mbp range. A majority of clones fell within the expected distance of an average BAC library insert distribution although there were many clones that had potential insertions/deletions.
In terms of orientation of BESs where both BESs were located on same chromosome, 63,888 clones had the expected orientation (BESs pointing toward each other; Figure 3C). A total of 1184 clones had BESs pointing in the opposite direction, and another 1975 clones had BESs pointing in the same direction, indicative of potential inversions (Table 8).

DISCUSSION
The MTP with the fewest gaps and the most coverage Over G. max genome sequence To increase the coverage of the physical map but maintain reliability, three approaches were considered. First, the preliminary FPC map was integrated with whole-genome draft sequence, meaning that the Figure 2 Representation of integration of the G. max draft sequence and the physical maps of G. max and G. soja. By integrating the draft sequence and the physical maps, gaps in the sequence could be spanned using clones from the physical maps based on BES and gaps in physical map can be spanned by the sequence map. By adding clones with unpaired BES, gaps existing in both the sequence and the physical maps were filled. The yellow bold lines indicate FPC contigs from both physical maps. The black bold line (Chr) represents a sequence scaffold from gmax1.01, and blue fragments represent shotgun sequences that are part of a sequence scaffold. Black and red lines represent BAC clones and green boxes represent BESs. Red lines indicate BAC clones from the MTP. Purple lines indicate the clones with unpaired BESs. Purple dotted line represents a gap that can be partially filled or spanned by adding clones with unpaired BESs.
n draft sequence was aligned to the FPC contigs via BES alignments. A number of FPC contigs were merged based with this approach, and 148 gaps in the FPC map were closed (Table 3). This was done using the preliminary 4x sequence assembly from the Joint Genome Institute-Stanford Human Genome Center, using the Arachne assembler (Batzoglou et al. 2002;Jaffe et al. 2003); later assemblies did not yield additional FPC merges. Second, to increase coverage of the sequence map, clones with unpaired BES were added to the draft sequence and to the MTP (Figure 2). The 8x draft sequence (gmax1.01) that consists of 20 scaffolds covering 950,068,807 bp of sequence length has 377 gaps indicated with 1000 Ns that are not spanned by only paired BES information (Table 3). Nearly one-third, 126 of 377 sequence gaps (33.4%), were spanned by BAC clones from the clone pool A (only the clones contained in the FPC map) with or without paired BESs and an additional 26 gaps (152 of 377, 40.3%) by clones from clone pool B (all fingerprinted clones from the three BAC libraries) with or without paired BES. The MTP picked from all the fingerprinted clones (clone pool B) with paired BESs had 835 gaps of which 160 were covered by adding 244 clones with unpaired BES resulting in additional coverage of as much as 5,947,374 bp. In the case of the MTP picked from only the FPC clones (clone pool A), 6,457,374 bp was covered from clones with unpaired BESs.
Finally, four different MTPs were picked from two different BAC clone pools to maximize coverage and minimize gaps: (1) (Table 7). Comparing MTPs 3 and 4 to 1 and 2, 80 sequence gaps were spanned, and the average length of overlap was similar. Because only 60% of the three BAC libraries (134,182/ 223,640) were used to construct the FPC map, there were more options with the larger pools to select clones that had more sequence coverage and less overlap with adjacent clones. Thus, when all clones were used, the number of clones used to build the MTPs decreased and the coverage length increased. When only clones with paired BES were used, it increased by 14,561,053 bp (from MTP1 to MTP3), and when both paired and unpaired BES were utilized, it increased by 14,050,753 bp (from MTP2 to MTP4).
Comparing MTPs 2 and 4 to 1 and 3, in terms of BESs, 140 gaps were spanned, and the average length of overlap was increased by 1.4 kbp. Once an MTP was picked using clones with paired BES, clones with unpaired BES were used only where we were unable to place clones with paired BESs. Therefore, it was reasonable that both the total numbers of clones used to build the MTPs and the average lengths of overlap increased. The sum of gaps covered by the clones with unpaired BES in both pools was 306, which was 1.8 times more than the sum of gaps spanned when MTPs were picked in the larger pool with all the three BAC libraries.
We conclude that the MTP selected using all the three BAC libraries containing clones with paired and unpaired BES is the best in that it had fewer gaps and the greatest coverage of the sequence map. In instances in which users need to know the relative locations of clones, this can be inferred through the FPC map constructed using clones with both paired and unpaired BESs. This high-resolution chromosome-anchored physical map will serve as an important tool for (1) improving the genome sequence by spanning gaps (in progress); (2) resolving assembly errors caused by repetitive sequences, large gene families and segmental duplications; (3) map-based cloning; and (4) cloning sequences that are too large or repetitive for polymerase chain reaction2based cloning (http://soybase.org).
The physical map of G. soja parallel to G. max genome sequence FPC-based physical maps were originally made to assist in clone-byclone sequencing by identifying minimal tiling paths; indeed, the maize FPC map was used for this purpose as recently as 2009 (Schnable et al. 2009). In the case of whole-genome shotgun sequencing, physical maps may be used for closing sequence gaps, confirmation of the sequence assembly, and to provide an anchored, clone-based resource for further research. With the transition to "next-generation" sequencing technologies, BAC-based maps can be even more crucial for ordering sequence contigs/scaffolds and confirming assemblies (Mardis 2008;Shendure and Ji 2008). Wild soybean, G. soja, genome was sequenced using the Illumina Genome Analyzer resulting in 48.8 Gbp of sequence, 52-fold sequence coverage of the genome. The short reads (35 or 76 bp) were mapped to gmax1.01 reference for assembly (Kim et al. 2010). Although it covered 43-fold of the reference genome, structural differences between two genomes were difficult to analyze because of the short read lengths and short distances between paired reads (Findley et al. 2010;Mahama et al. 1999;Yang et al. 2008).
Putative chromosomal structural rearrangements between G. soja and G. max could be detected through the alignment of BESs from G. soja against the G. max reference sequence (gmax 1.01; Table 8). BAC clones in which paired BESs aligned to different chromosomes indicate potential translocations; however, this interpretation is complicated by recent polyploidy events that occurred in the genus glycine. Insertions and deletions could be predicted from clones where paired BESs aligned too far (.225 kb) or too close (,75 kb) from each other on a chromosome. Inversions were predicted from paired BESs that pointed in either the opposition or same direction, as opposed to the expected orientation of toward each other (Figure 3). The average insert size of paired BESs between 75 kbp and 225 kbp was 146 kbp, consistent with the average insert size of GSS_Ba G. soja library (150 kbp; Table 1). The average insert size of paired BESs greater than 225 kbp was 445 kbp and less than 75 kbp was 37 kbp ( Figure S1). This is an underestimate because small insertions or deletions would be missed because of the variability in BAC insert sizes. However, we were able to calculate a rough estimate of how much of the genome might be in flux between the two species (Kim et al. 2007). Considering insertions and deletion only, we estimate that at least 998 kbp is flux between G. soja and the domesticated G. max. The estimated sizes of insertions and deletions were 300 kbp and 110 kbp, respectively, and deletions were 71% more frequent than insertions. A few hotspots for insertions, deletions, and inversions were detected on the G. max chromosomes ( Figure S2).
The importance of wild soybean (G. soja) as genetic resource for potentially valuable genes for introgression into soybean cannot be overstated. This was the reasoning for the sequencing of G. soja accession IT182932 as well another 17 other accessions of wild soybean (to 5x sequence coverage) (Kim et al. 2010;Lam et al. 2010). The sequence similarity between G. max and G. soja is 98%; however, structural differences are not captured in this statistic. Reciprocal translocations, segmental duplications, and insertions/deletions complicate the ability to map G. soja using G. max as a reference and short read WGS does not currently capture this information. Thus, physical maps remain useful for investigating and describing structural evolution that has occurred between these two genomes and to allow researchers to effectively shuttle between the genomes to capture useful genetic information for crop improvement and basic genetics.

ACKNOWLEDGMENTS
This work was supported in part by the United Soybean Board (Project 7268), the United States Department of Agriculture-Agricultural Research Service, and the National Science Foundation (DBI-0501877 and 082225).