Updated Genome Assembly and Annotation for Metrosideros polymorpha, an Emerging Model Tree Species of Ecological Divergence

Accurate feature annotation as well as assembly contiguity are important requisites of a modern genome assembly. They allow large-scale comparison of genomes across and within species and identification of polymorphisms, leading evolutionary and functional studies. We report an updated genome resource for Metrosideros polymorpha, the most dominant tree species in the Hawaiian native forests and a unique example of rapid and remarkable ecological diversification of woody species. Ninety-one percent of the bases in the sequence assembly (304 Mb) were organized into 11 pseudo-molecules, which would represent the chromosome structure of the species assuming the synteny to a close relative Eucalyptus. Our complementary approach using manual annotation and automated pipelines identified 11.30% of the assembly to be transposable elements, in contrast to 4.1% in previous automated annotation. By increasing transcript and protein sequence data, we predicted 27,620 gene models with high concordance from the supplied evidence. We believe that this assembly, improved for contiguity, and annotation will be valuable for future evolutionary studies of M. polymorpha and closely related species, facilitating the isolation of specific genes and the investigation of genome-wide polymorphisms associated with ecological divergence.

RepeatMasker (Smit et al. 2013) identified 4.1% of the bases to be associated with repeated sequences and transposable elements (TEs), based on the homology with Arabidopsis thaliana repeat sequences. AUGUSTUS (Stanke et al. 2004) predicted 39,305 gene models based on a Hidden Markov model trained with RNA-seq data from one M. polymorpha plant. Although this genome resource was useful to obtain genomic insights of the ecological diversification in the species (Izuno et al. , 2017, there was still room for improvement in the genome resource. Slightly more number of gene models (39,305) were identified compared to other tree genomes (approximately 33,000 genes on average; Neale et al. 2017), indicating that some gene models could be fragmented. The annotated portion of transposable elements was relatively low (4.1%), considering other tree genomes with similar genome size (300-350 Mb) contain transposable elements in 20-30% of the whole genomes (He et al. 2013;Verde et al. 2013;Wu et al. 2014). This suggests that M. polymorpha probably contained additional TE/repeat regions that escaped detection.
Here, we report the updated version of the M. polymorpha genome sequences (version 2.0 hereon). It was obtained by assigning the ver. 1.0 assembly (36,375 scaffolds) to 11 pseudo-chromosomes by means of the presumed collinearity with the closely related Eucalyptus grandis.
To improve the quality of gene prediction, we used an increased number of evidence data, such as RNA-Seq and protein sequences data, to train gene prediction programs. We manually annotated repeat regions in addition to using automatic pipelines to gather repeat sequences that were not identified in the ver. 1.0 annotation. The resulting ver. 2.0 annotation contained more complete gene models that were well supported by supplied evidence data and showed that an increased, but still small, fraction of the genomes is composed by TEs and repeated sequences.

Pseudo-molecule assembly
Metrosideros polymorpha pseudo-molecules were constructed based on the Eucalyptus grandis genome (ver. 2.0; Bartholomé et al. 2014), which is composed of 11 pseudo-chromosomes and 4,932 scaffolds. We assumed chromosome structure be mostly conserved between the two species because the basic chromosome number for Myrtaceae is x = 11 (Atchison 1947) and M. polymorpha also has 11 chromosomes (2n = 22; Carr 1978). We identified the positions of 32,152 E. grandis coding DNA sequences (CDSs) on the 36,376 M. polymorpha scaffolds (ver. 1.0) by NCBI BLASTN (ver. 2.2.31+) with the threshold cutoff of 1.0E-3. Metrosideros polymorpha scaffolds with more than 10 CDSs were used for anchoring to E. grandis pseudo-chromosomes. For all CDSs on these scaffolds, it was first determined to which E. grandis chromosome they map. Then, we removed CDSs that map to different E. grandis chromosomes than the majority of CDS on the scaffold or to very distant regions of the same E. grandis chromosome. This step removed approximately 12% of the CDS (a low value due to good collinearity between E. grandis and M. polymorpha). To determine a single anchor point for each scaffold, the average position of the central five CDSs on the Figure 1 Workflow for re-annotation of the Metrosideros polymorpha genome using the MAKER2 pipeline.
n■ Table 1 Summary of Metrosideros polymorpha pseudo-molecules. Sequences were assigned to 11 chromosomes based on the collinearity with Eucalyptus genes, assuming a complete synteny between the two species  sequences with DNA homology were used to identify non-autonomous deletion derivatives (i.e., insertions that lack parts or all of the coding regions). Multiple sequences from individual families were used to generate a consensus sequence, which was added to the reference TE library. We further de novo identified DNA transposons by searching terminal inverted repeats (TIRs). Candidate TIR elements were then used in BLASTN searches against the whole genome. Those that occurred in multiple copies were considered novel DNA transposon families. For TE annotation, the identified reference TE families were mapped back to the genome using an in-house TE annotation pipeline.

Genome annotation
We used the MAKER genome annotation pipeline (ver. 2.31.8; Holt and Yandell 2011) to identify gene models on the M. polymorpha genome ver. 2.0 ( Figure 1). MAKER was initially run to make crude gene models based on transcript and protein evidence.  Campbell et al. (2014) and Bowman et al. (2017). MAKER was run again to conduct SNAP-and AUGUSTUS-based gene prediction with the trained HMMs. In the second run, we allowed predicted genes with more than 50 amino acids, and`keep_preds`and`always_completeẁ ere set to 1.
To compare annotation quality between the current and previous genome annotations, we added MAKER's quality-control metrics to the ver. 1.0 annotation. Using MAKER, we calculated AED scores for 41,874 mRNAs in the ver. 1.0 annotation against the evidence data used for the current annotation (obtained with the 19 RNA-seq data and protein sequences of E. grandis and A. thaliana).   Figure 2; DDBJ accession BCNH02000001-BCNH02000011), which constitute 91.1% of the entire M. polymorpha genome assembly. This result indicated .90% of the ver. 1.0 assembly represented non-redundant regions of the genome, against our previous speculation that the relatively high heterozygosity could split the assembly. Long read sequencing technologies could align the remaining 10% of the scaffolds, which were short (an average of 571 bp in length) and repeat-rich (data not shown), and clarify how the complexity of genome affects de novo assembly. The current assembly (pseudo-molecules) highlighted highly conserved gene content between the two species. After diverging from a common ancestor in the Paleocene (66-55 million years ago), Eucalyptus diversified in Oligocene (38-24 million years ago) on the Australia continent whereas Metrosideros spread throughout the Pacific islands in the Miocene (23.3-5.2 million years ago) (Thornhill et al. 2015;Wright et al. 2000). The independent evolution of the two species after split could have led to the accumulation of TEs in Eucalyptus, explaining the different genome size between E. grandis (640 Mb; Myburg et al. 2014) and M. polymorpha (330 Mb).

TE annotation
Our manual annotation identified 8.03% of the M. polymorpha genome (ver. 2.0) to be comprised of TEs (Table 2). Most of them were derived from Gypsy (2.46%) and Copia (2.21%) long terminal repeat retrotransposons (LTR-RTs) (Figure 3), of which 19 and 24 families were identified, respectively. DNA transposons constituted 2.43% of the genome (Table 2). We identified between one and 4 families of Mutator, CACTA, hAT and Helitron based on the homology search approach and 9 families based on the features of TIRs (Table 2). Different TE families were enriched in different regions on pseudo-molecules ( Figures S1-S4), suggesting different proliferation strategies among TE families. RepeatMasker annotated 2.88% of the genome to be TEs (Table 2) and RepeatRunner identified 1,469 additional TEs spanning 0.28% of the genome. Gene annotation with MAKER further found 147 genes (spanning 0.20% of the genome) bearing TEs. With only 0.09% of the bases being identified by two or more of the annotations above, 11.30% of the assembly was characterized as TE related.
The three annotation approaches, i.e., manual annotation, RepeatMasker and RepeatRunner, complementary identified TEs in Figure 3 Contributions of the 20 most abundant transposable element (TE) families to the whole genome. Fifteen families could be assigned to four different superfamilies (see inset), the remaining did not contain coding sequences (e.g., transposase), which would have allowed their classification into known superfamilies. the M. polymorpha genome. Our manual annotation identified DNA transposons occupying 2.43% of the M. polymorpha genome, whereas RepeatMasker suggested much lower levels (0.37%; Table 2). While our manual annotation exhaustively identified the distributions of frequent TE superfamilies (.1% in TE space), RepeatMasker could be effective to identify rare TEs, which were not handled in our manual annotation because of the insufficient number of sequences to generate representative elements. RepeatRunner identified completely different TE features from those by RepeatMasker or our manual annotation except for one TE feature. These findings indicate that TE annotation in a novel genome, for which no useful libraries of close relatives are available, can profit from complementary approaches, such as manual identification of TE families and automated annotation using software (e.g., RepeatMasker).
In some cases, the uneven distribution of TE superfamilies along chromosomes was diagnostic to infer the location of the centromeres. In most small plant genomes, LTR-RTs are enriched in centromeric and paracentromeric regions, while non-LTR RTs and DNA transposons (more often associated with genes) are found mostly in distal chromosomal regions (Bennetzen and Wang 2014). Thus, we analyzed the abundance of DNA transposons and LTR-RTs in the M. polymorpha pseudomolecules. We found that LTR-RTs are indeed enriched in distinct regions ( Figures S1-S4). In particular, the Gypsy family RLG_V has very narrow distributions (i.e., it is found only in very narrow windows of 1-3 Mb in size while it is practically absent from all other chromosomal regions; Figures S1-S4). This distribution is reminiscent of that of RLG_Cereba in wheat (Wicker et al. 2018) and CRG retroelement in cotton (Luo et al. 2012), which are found almost exclusively in centromeric regions. Interestingly, some of the pseudomolecules (1, 7, 8 and 10) contain two regions that are enriched in RLG_V elements and other LTR-RTs ( Figures S1-S4). Aware of the caveat that M. polymorpha pseudomolecules were built based on collinearity with E. grandis chromosomes, the presence of multiple regions enriched in potential centromeric LTR-RTs may highlight orientation issues or suggest that the ver. 1.0 assembly included redundant scaffolds due to heterozygosity. Alternatively, one of the two regions may be a non-functional centromere (either being recently inactivated or gaining function). Additional sequencing with long reads or DNA physical/optical mapping techniques will be necessary to build more reliable pseudomolecules.
Compared to other tree genomes sequenced (reviewed in Neale et al. 2017), the M. polymorpha genome is small and contains only a small fraction of TEs. Interestingly, other plants that are adapted to extreme conditions such as mangrove trees (Lyu et al. 2018), a carnivorous plant (Ibarra-Laclette et al. 2013) or the Antarctic midge (Kelley et al. 2014) were reported to have small genomes with low TE content. Thus, the small genome of M. polymorpha could also be the result of convergent evolution due to similar selective constraints that act on M. polymorpha plants when they populate their extreme and diverse environments.

Improved gene annotation
The final MAKER run, in which SNAP and AUGUSTUS were run with custom-trained HMMs, predicted 27,620 gene models putatively coding 40,206 different transcript isoforms (Table 3; File S1). While the assembly fraction occupied by coding regions was almost the same between ver. 1.0 and ver. 2.0, the number of gene models decreased and intron sizes increased in ver. 2.0 (Table 3), suggesting fragmented gene models in ver. 1.0 were concatenated. A sharp increase in mean gene size (3.37 Kb vs. 4.94 Kb in ver. 1.0 and ver. 2.0, respectively) accompanied to a negligible increase in total size of the coding regions (132 and 136 Mb, respectively) validate this observation. This may be due to genes that were split in two ver. 1.0 scaffolds or to the increased supplies of evidence data.
The concordance of the input evidence into the current gene annotation was improved: 66.5% and 88.4% transcript isoforms showed AED , 0.5 and 20.9% and 3.4% isoforms showed AED = 1 in ver. 1.0 and ver. 2.0, respectively (Figure 4). Higher proportion of isoforms in ver. 2.0 was supported by the Pfam protein domains and SwissProt proteins in comparison with ver. 1.0, possibly due to the increased length of isoforms in ver. 2.0 ( Table 3). The 40,206 transcript isoforms represented 90% of the core transcripts from the 1,440 single-copy conserved orthologs in plants (Table 3). Although this was comparable to the completeness of the 43,894 isoforms in the 11 E. grandis chromosomes, which covered 92.1% of the core n■  genes, increased evidence data from other plant tissues besides leaf buds may predict more comprehensive gene models. Overall, with increased effort on manual annotation of TEs and the increased number of evidence data, we considerably improved the quality of the Metrosideros gene models. This will enable more sound and complete evolutionary studies on functional genes relevant to the environmental adaptation of woody species.