Gene Model Annotations for Drosophila melanogaster: Impact of High-Throughput Data

We report the current status of the FlyBase annotated gene set for Drosophila melanogaster and highlight improvements based on high-throughput data. The FlyBase annotated gene set consists entirely of manually annotated gene models, with the exception of some classes of small non-coding RNAs. All gene models have been reviewed using evidence from high-throughput datasets, primarily from the modENCODE project. These datasets include RNA-Seq coverage data, RNA-Seq junction data, transcription start site profiles, and translation stop-codon read-through predictions. New annotation guidelines were developed to take into account the use of the high-throughput data. We describe how this flood of new data was incorporated into thousands of new and revised annotations. FlyBase has adopted a philosophy of excluding low-confidence and low-frequency data from gene model annotations; we also do not attempt to represent all possible permutations for complex and modularly organized genes. This has allowed us to produce a high-confidence, manageable gene annotation dataset that is available at FlyBase (http://flybase.org). Interesting aspects of new annotations include new genes (coding, non-coding, and antisense), many genes with alternative transcripts with very long 3′ UTRs (up to 15–18 kb), and a stunning mismatch in the number of male-specific genes (approximately 13% of all annotated gene models) vs. female-specific genes (less than 1%). The number of identified pseudogenes and mutations in the sequenced strain also increased significantly. We discuss remaining challenges, for instance, identification of functional small polypeptides and detection of alternative translation starts.


SUPPORTING INFORMATION
Table S1. FlyBase gene model and transcript comments
Table S2. FlyBase annotation IDs and the changes that occur to them as a result of annotation updates
Table S3. Improved UTR annotations in FlyBase annotation set R6.03
Table S4. Overlap of incorporated RNA-Seq exon junctions with annotated CDS, UTR and non-coding RNA
Table S5. Incorporation of modENCODE embryonic TSS regions into gene annotations
Table S6. Improvement of 3'UTR annotations
File S1. Gene model annotation correspondence between FlyBase R5.24 and R6.03
File S2. Improved UTR annotation of R5.24 protein coding transcripts in R6.03
File S3. Incorporation of RNA-Seq exon junction evidence into gene model annotations
File S4. Overlap of modENCODE TSS regions to R5.24 and R6.03 transcripts
File S5. Comparison of R5.24 and R6.03 3'UTR annotations
File S6. Small polypeptides
File S7. Sex-specific transcripts
File S8. Genes with known disruptive mutations in the reference genome assembly

Table S4. Overlap of incorporated RNA-Seq exon junctions with annotated CDS, UTR and non-coding RNA.
RNA-Seq exon junctions incorporated into FlyBase gene model annotations (i.e., matching an annotated intron) were assessed for their overlap to CDS, UTRs and non-coding RNA (lncRNA or pseudogene transcripts). Overlap was calculated separately for junctions that were "previously incorporated" in R5.24 (and therefore incorporated independently of RNA-Seq evidence), and for junctions that were "newly incorporated", based primarily on RNA-Seq data, subsequent to R5.24 (and still incorporated in R6.03).
Class of incorporated exon junction                     Percent of exon junctions overlapping*
                                                        5'UTR    CDS      3'UTR    ncRNA
Newly incorporated exon junctions (n = 9,033)           37.3     46.2     4.4      12.1
Previously incorporated exon junctions (n = 46,319)     11.6     87.2     0.7      0.5

* Exon junctions that bridge a UTR and CDS, or that overlap different elements in different transcripts, were excluded from this tabulation. Also excluded from this tabulation were 254 exon junctions previously incorporated in R5.24 but subsequently withdrawn by R6.03. Results here are for 9,033 of 9,264 newly incorporated exon junctions, and 46,319 of 48,297 junctions previously incorporated in R5.24 and still incorporated in R6.03.

3.0
TSS overlaps transcript, but not transcript 5' end      30.0     6.9*
TSS overlaps no transcripts                             2.9      0.1

* The 595 cases in which a TSS region overlapped a transcript, but not its 5' end, are currently being re-assessed with additional, independent TSS datasets.

File S1. Gene model annotation correspondence between FlyBase R5.24 and R6.03.
For each gene model, the annotation ID, FlyBase gene ID, gene symbol and gene class are shown in columns C-F for R5.24 and in columns H-K for R6.03, respectively. Annotations in R5.24 and R6.03 that have some relationship are listed in the same row, and the relationship is described in columns A and B. Column A classifies annotations in R5.24 and R6.03 as being "common" to both sets, related by gene merge, split or reclassification, or as being unique to R5.24 or R6.03. Additional descriptions of the relationship are provided in column B. Gene models encoding mRNA, ncRNA and pseudogene are shown in the first tab ("mRNA-ncRNA-pseudogene"), and "small RNA"-encoding genes are shown in the second tab. Note that although gene models will retain the same annotation ID in the absence of major structural reorganization or reclassification, there can be extensive changes in the number of transcript isoforms and the extent of the UTR annotations. Also note that only the net change after 26 annotation updates is shown; major changes to gene models may have taken place in several steps, and/or involved additional gene models created some time between R5.24 and R6.03 (these intermediate steps are not shown).

File S2. Improved UTR annotation of R5.24 protein coding transcripts in R6.03.
Transcripts for 13,104 protein coding gene models common to both R5.24 and R6.03 are listed; this set excludes transcripts of 9 gene models for which R5.24-R6.03 equivalence exists, but annotation IDs have changed (see File S1 for list and explanation). In column A, transcripts are indicated to be "common" to both R5.24 and R6.03, or specific to R5.24 or R6.03. The transcript annotation ID and FlyBase transcript ID are shown in columns B and C. mRNA, 5'UTR and 3'UTR length for R5.24 and R6.03 are shown in columns D-F and H-J. Columns G and K classify the UTR status of each transcript in R5.24 and R6.03, as having "no UTRs", "no 3'UTR", "no 5'UTR" or "both UTRs". The change (nucleotides) in mRNA, 5'UTR and 3'UTR size from R5.24 to R6.03 is calculated in columns L-N.
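
The UTR status labels and size changes described for File S2 can be illustrated with a minimal Python sketch. It assumes that a UTR length of zero means the UTR is not annotated, and that the change is computed as R6.03 minus R5.24; the function names and the example lengths are hypothetical and are not taken from the file itself.

```python
def utr_status(utr5_len: int, utr3_len: int) -> str:
    """Label a transcript by which UTRs are annotated (assumed: length > 0)."""
    if utr5_len > 0 and utr3_len > 0:
        return "both UTRs"
    if utr5_len > 0:
        return "no 3'UTR"
    if utr3_len > 0:
        return "no 5'UTR"
    return "no UTRs"


def size_change(old, new):
    """Change in (mRNA, 5'UTR, 3'UTR) length in nucleotides, new minus old."""
    return tuple(n - o for o, n in zip(old, new))


# Hypothetical transcript: (mRNA, 5'UTR, 3'UTR) lengths in each release.
r5_24 = (1500, 0, 200)
r6_03 = (2100, 150, 650)

status_r5_24 = utr_status(r5_24[1], r5_24[2])
status_r6_03 = utr_status(r6_03[1], r6_03[2])
delta = size_change(r5_24, r6_03)
```

For this made-up transcript the status moves from "no 5'UTR" to "both UTRs", and the three length changes are 600, 150 and 450 nucleotides.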

File S3. Incorporation of RNA-Seq exon junction evidence into gene model annotations.
All 71,514 RNA-Seq exon junctions obtained as evidence by FlyBase are listed in Column A by their FlyBase ID (FBsf#). The dataset(s) of origin for each exon junction is/are shown in Column B: Daines et al., 2011 (BCM), Graveley et al., 2011 (modENCODE), and unpublished lower confidence modENCODE junctions (modENCODE_Extra). The incorporation of each junction into a gene model ("Y" for yes, "N" for no) from annotation sets R5.24 and R6.03 is shown in Columns C and D, respectively. The overlap of each junction with 5' UTR, CDS and 3' UTR elements in R6.03 gene models is shown in columns E to G, respectively ("Y" for yes, "N" for no). Note that some junctions join a UTR to a CDS right at the UTR/CDS boundary; in these cases, overlap to both the CDS and UTR is scored as "Y", with an explanatory comment in Column I. Also note that a given junction can overlap different elements in different transcripts (for example, it may be found in the CDS of one transcript, and in the 5' UTR of another transcript). As such, the number of distinct ways that a junction overlaps transcript elements (i.e., overlap patterns) is indicated in Column H, with explanatory comments in Column I.
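
The distinction between a junction that sits on a UTR/CDS boundary and one that overlaps different elements in different transcripts can be sketched in Python as follows. This is an illustration only: the input representation (one set of element names per transcript containing the junction) and the function name are assumptions, not the procedure used to build File S3.

```python
ELEMENTS = ("5'UTR", "CDS", "3'UTR")


def summarize_junction(per_transcript):
    """Summarize one junction.

    per_transcript: list with one set per transcript containing the
    junction, naming the element(s) the junction overlaps there.
    Returns Y/N flags per element and the number of distinct patterns.
    """
    flags = {
        element: "Y" if any(element in s for s in per_transcript) else "N"
        for element in ELEMENTS
    }
    patterns = {frozenset(s) for s in per_transcript}
    return {"flags": flags, "n_patterns": len(patterns)}


# Junction in the CDS of one transcript and the 5'UTR of another:
# two distinct overlap patterns.
different_elements = summarize_junction([{"CDS"}, {"5'UTR"}])

# Junction joining a 5'UTR to a CDS at the boundary in a single transcript:
# both elements flagged, but only one overlap pattern.
boundary = summarize_junction([{"5'UTR", "CDS"}])
```

Both examples produce the same "Y"/"N" flags, which is why a separate count of overlap patterns is needed to tell them apart.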

File S4. Overlap of modENCODE TSS regions to R5.24 and R6.03 transcripts.
For each high-confidence ("validated") modENCODE embryonic TSS region, its overlap to annotated transcripts in R5.24 and R6.03 was determined. Four kinds of overlap were scored:
edge match: the TSS 90% point matches the transcript 5' end;
edge overlap: the TSS spans the annotated transcript 5' end (but the 90% point does not match the transcript 5' end);
other overlap: the TSS overlaps the transcript but does not span the transcript 5' end;
no overlap (to any transcripts).
A single TSS region could overlap different transcripts in different ways: in these cases, only one type of overlap was reported according to the following ranking: edge match > edge overlap > other overlap. FlyBase TSS IDs are listed in Column A, and the overlap type to R5.24 and R6.03 transcripts in Columns B and C, respectively. Overlapping R5.24 and R6.03 transcripts are listed in Columns D and E, respectively.
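
The scoring and ranking rules above can be sketched in Python. This is a simplified illustration under stated assumptions: all features lie on the plus strand, a "match" means exact coordinate equality, and the class and field names (TSS, Transcript, point90) are hypothetical rather than part of the FlyBase data model.

```python
from dataclasses import dataclass

# Higher rank wins when a TSS region overlaps several transcripts.
RANK = {"edge match": 3, "edge overlap": 2, "other overlap": 1, "no overlap": 0}


@dataclass
class TSS:
    start: int
    end: int
    point90: int  # hypothetical field holding the "90% point" coordinate


@dataclass
class Transcript:
    start: int  # 5' end (plus strand assumed)
    end: int    # 3' end


def overlap_type(tss: TSS, tx: Transcript) -> str:
    """Score the overlap of one TSS region with one transcript."""
    if tss.point90 == tx.start:
        return "edge match"
    if tss.start <= tx.start <= tss.end:
        return "edge overlap"
    if tss.start <= tx.end and tss.end >= tx.start:
        return "other overlap"
    return "no overlap"


def classify(tss: TSS, transcripts) -> str:
    """Report only the highest-ranked overlap type across all transcripts."""
    return max(
        (overlap_type(tss, tx) for tx in transcripts),
        key=RANK.__getitem__,
        default="no overlap",
    )


tss = TSS(start=100, end=120, point90=110)
best = classify(tss, [Transcript(50, 500), Transcript(110, 400)])
```

Here the TSS region lies inside the first transcript ("other overlap") and its 90% point coincides with the 5' end of the second, so only "edge match" is reported.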

File S5. Comparison of R5.24 and R6.03 3'UTR annotations.
The unique set of annotated 3'UTRs is listed for R5.24 and R6.03 (on separate spreadsheet tabs). Each annotated 3'UTR was uniquely defined by its location (Column A). The 3'UTR's size and the location of its 3'end are listed in Columns B and C. The number of transcripts containing the indicated 3'UTR, and IDs for these transcripts, are listed in Columns D and E. 3'UTR ends that are supported by a polyadenylated cDNA, as determined by the associated transcript model comments, are indicated in Column F. Columns G and H list the annotation and FlyBase gene IDs for the gene associated with each distinct 3'UTR, with Column I indicating if the gene contains di/polycistronic transcripts. Column J indicates if the gene model is present in both R5.24 and R6.03 (as judged by an unchanged gene annotation ID).
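
Collapsing per-transcript 3'UTRs into a unique set keyed by location can be sketched as below. The sketch assumes each 3'UTR is a single interval with 1-based inclusive coordinates and uses made-up transcript names; 3'UTRs that span several exons would need a richer key than the one shown.

```python
from collections import defaultdict


def unique_3utrs(utrs):
    """Group transcripts by 3'UTR location.

    utrs: iterable of (transcript_id, chrom, start, end, strand), one per
    transcript, describing that transcript's 3'UTR (single interval assumed).
    """
    groups = defaultdict(list)
    for transcript_id, chrom, start, end, strand in utrs:
        groups[(chrom, start, end, strand)].append(transcript_id)

    rows = []
    for (chrom, start, end, strand), ids in sorted(groups.items()):
        rows.append({
            "location": f"{chrom}:{start}..{end}({strand})",
            "size": end - start + 1,
            # The 3' end is the higher coordinate on the plus strand,
            # the lower coordinate on the minus strand.
            "three_prime_end": end if strand == "+" else start,
            "n_transcripts": len(ids),
            "transcript_ids": sorted(ids),
        })
    return rows


# Hypothetical input: txA and txB share one 3'UTR, txC has a longer one.
rows = unique_3utrs([
    ("txA", "2L", 1000, 1499, "+"),
    ("txB", "2L", 1000, 1499, "+"),
    ("txC", "2L", 1000, 2999, "+"),
    ("txD", "X", 500, 799, "-"),
])
```

Four transcripts reduce to three distinct 3'UTRs, each with its size, 3' end coordinate and the transcripts that contain it.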

File S6. Small polypeptides.
Predicted polypeptides of 50 residues or less from the R6.03 annotation set are listed.
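
A length filter of this kind can be sketched in Python. The sketch assumes protein sequences are plain strings in which a trailing "*" marks the stop codon and does not count as a residue; the function name and the example peptides are hypothetical.

```python
def small_polypeptides(proteins, max_len=50):
    """Return {name: residue count} for polypeptides of max_len residues or fewer."""
    result = {}
    for name, seq in proteins.items():
        residues = seq.rstrip("*")  # assumed: trailing '*' is a stop marker
        if 0 < len(residues) <= max_len:
            result[name] = len(residues)
    return result


# Hypothetical sequences of 50, 51 and 3 residues.
example = {
    "pepA": "M" + "A" * 49 + "*",
    "pepB": "M" + "A" * 50,
    "pepC": "MKT",
}
small = small_polypeptides(example)
```

With the default threshold, pepA (exactly 50 residues) and pepC are retained and pepB (51 residues) is excluded.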

File S7. Sex-specific transcripts.
As described in the text, the FlyBase RNA-Seq Search tool was used to identify 129 genes with female-specific expression and 2,414 genes with male-specific expression within the R6.03 annotation set. These genes are listed in separate tabs within the spreadsheet, along with additional information on gonadal expression (ovary or testis) and early embryo expression (for female-specific genes only).

File S8. Genes with known disruptive mutations in the reference genome assembly.
R6.03 gene model annotations with disruptive mutations in the genome assembly are listed. These disruptive mutations are typically specific to sequences derived from the "iso-1" reference strain. Gene IDs, annotation IDs and gene symbols are listed in Columns A-C, respectively. The type of disruptive mutation is described in Column D. Columns E and F describe any similarity of the "disrupted" gene to other protein coding genes in the R6.03 annotation set. Additional comments, associated alleles and transposon insertions are listed in Columns G-I, respectively.