Annotation of Contig21

Sarah Swiezy
Dr. Elgin, Dr. Shaffer, Dr. Bednarski
Bio434W
27 April 2015

Abstract:

Contig21, a 40 kb region of the fourth (“dot”) chromosome of Drosophila elegans (D. elegans) containing three GENSCAN predictions and two high-quality BLASTX alignments to annotated Drosophila melanogaster (D. melanogaster) genes, was finished and annotated. Contig21 was analyzed for the presence of genes, transcription start sites, repeats, pseudogenes, and non-coding RNAs using NCBI BLAST, FlyBase, and ClustalW2, in addition to the UCSC Genome Browser, Gene Record Finder, Gene Model Checker, and other programs maintained by the Genomics Education Partnership (GEP). The final annotation included two genes conserved with D. melanogaster, fd102C and CG11148 (Figure 1), as well as several repetitive elements, a putative pseudogene, and a non-coding RNA. CG11148 contains a GYF domain, which is highly conserved among Drosophila species and a variety of mammals; this domain is likely involved in recognizing and binding proline-rich sequences of proteins. The high level of conservation of CG11148 between D. elegans and other Drosophila species also allowed the transcription start site to be annotated to within 15 base pairs.

Figure 1. Final annotation map of Contig21. Gene annotations in D. elegans shown in blue; BLASTX results aligning D. elegans to the D. melanogaster genome shown in red; gene predictions based on ab initio gene finders shown in gold (GENSCAN) and green (N-SCAN); repetitive elements shown beneath the heading “Repeating Elements by RepeatMasker.”

Introduction:

Recent innovations in sequencing technologies have made the goal of comparative genomics a promising new reality. Drosophila melanogaster (D. melanogaster), one of the best-studied model organisms, and the other Drosophila species sequenced by the Drosophila 12 Genomes Consortium are ideally suited to intensive study by geneticists and evolutionary biologists. With this end in mind, the Genomics Education Partnership, based at the Biology Department and The Genome Institute at Washington University in St. Louis, has helped engage university student researchers across the United States in a project to finish and annotate portions of these sequenced Drosophila draft genomes to the high quality required for detailed analyses. The Washington University 2015 Bio4342 students have finished and annotated sections of the Drosophila elegans (D. elegans) fourth chromosome. This chromosome is unique in Drosophila in that it is composed primarily of heterochromatin but has a number of functionally important genes that are transcribed at a high level. The present analysis will open the door to future questions regarding the role of transposable elements and repeats in heterochromatin formation, the degree of conservation between species in the Drosophila genus, and the structure of genes and their regulatory regions in a heterochromatin domain.

Initial GENSCAN predictions in Contig21:

The initial output from GENSCAN showed three predicted features in Contig21. Prediction 1 showed two exons, prediction 2 showed three exons, and prediction 3 showed eight exons (though the last exon of this feature fell outside of the region shown) (Figure 2). Given that prediction 1 had a small number of exons and that prediction 2 did not have an easily identifiable ortholog in D. melanogaster (i.e., there was no BLASTX alignment overlapping with prediction 2), prediction 3 was used to anchor the annotation in this region.

Figure 2. GENSCAN output for Contig21. Black box denotes prediction 1, green box denotes prediction 2, red box denotes prediction 3.
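The reasoning used to pick an anchor prediction can be written down as a small rule. The following is only a sketch of one way to formalize that reasoning, not a procedure used in the annotation itself: the `Prediction` class and `choose_anchor` function are hypothetical names, and the rule (keep predictions that overlap a BLASTX alignment, then take the one with the most exons) is an assumption that reproduces the choice described above. Exon counts and BLASTX overlap flags are taken from this report.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical record for one GENSCAN prediction."""
    name: str
    exon_count: int
    has_blastx_overlap: bool


def choose_anchor(predictions):
    """Return the BLASTX-supported prediction with the most exons, or None.

    Assumed rule: a prediction with no overlapping BLASTX alignment has no
    easily identifiable ortholog, so it is not used as an anchor.
    """
    supported = [p for p in predictions if p.has_blastx_overlap]
    return max(supported, key=lambda p: p.exon_count, default=None)


# Values as reported for Contig21 in this document.
predictions = [
    Prediction("prediction 1", exon_count=2, has_blastx_overlap=True),
    Prediction("prediction 2", exon_count=3, has_blastx_overlap=False),
    Prediction("prediction 3", exon_count=8, has_blastx_overlap=True),
]

anchor = choose_anchor(predictions)
print(anchor.name)
```

In practice the choice also rests on RNA-Seq and splice junction evidence, which this sketch does not model.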
UCSC Genome Browser View of Contig21 Features:

To better understand the homology of this region with the D. melanogaster genome, the BLASTX track on the UCSC Genome Browser was compared to the GENSCAN predictions (Figure 3).

Prediction 1: As shown in Figure 3, there is a high-quality (red) BLASTX alignment of prediction 1 to fd102C-PA in D. melanogaster. Prediction 1 is not supported by a high level of RNA-Seq data or by TopHat splice junctions. (There are two splice junctions; however, these do not correspond to the GENSCAN-proposed intron/exon boundary in this prediction.) A second BLASTX match (brown) overlaps with prediction 1; however, this match is also inconsistent with the RNA-Seq and TopHat splice junction data in this area.

Prediction 2: Prediction 2 does not show a BLASTX alignment to a gene in D. melanogaster and has no RNA-Seq or TopHat splice junction data to support it as a gene in D. elegans.

Prediction 3: All eight exons found in prediction 3 show strong similarity to the exons of the homologous D. melanogaster gene, and the general exon structure of this gene appears to be well supported by RNA-Seq and TopHat splice junction data (this will become clearer in later figures at higher resolution), suggesting their presence in the D. elegans genome. There are four isoforms of the presumed ortholog of prediction 3 in D. melanogaster (Figure 3).

Figure 3. View of Contig21 via UCSC Genome Browser. Blue box marks prediction 1, green box marks prediction 2, pink box marks prediction 3, and black box marks the two TopHat splice junctions that do not correspond to the intron/exon boundary of GENSCAN prediction 1.

Annotation of Prediction 3:

To establish an ortholog in D. melanogaster, a BLASTp search was carried out with the amino acid sequence of D. elegans prediction 3 as the query and the annotated D. melanogaster protein database maintained by FlyBase as the subject. The top four matches aligned over the entire length of the F, D, H, and G isoforms of CG11148 with high similarity; all four of these alignments had an E-value of zero, with the next best alignments having E-values >20 (Figure 4). Thus, prediction 3 in D. elegans is an ortholog of CG11148 in D. melanogaster.

Figure 4. BLASTp alignment of prediction 3, establishing orthology to all four isoforms of D. melanogaster CG11148. Query: amino acid sequence of prediction 3 in D. elegans; subject: annotated D. melanogaster proteins database.

Annotation of Prediction 3, Exon 4:

To begin, the D. melanogaster gene CG11148 was looked up in the Gene Record Finder, which is maintained by the Genomics Education Partnership and uses GFF3 files from FlyBase. This database provided a list of coding exons previously annotated for this gene in D. melanogaster, with their corresponding nucleotide and amino acid sequences, the strand on which each is transcribed, and the isoforms to which each belongs (Figure 5). The longest exon (found in each of the isoforms), exon 4, had a size of 872 amino acids; this sequence was used as the subject for a BLASTX search (using the nucleotide sequence of Contig21 as the query), which showed that this exon is translated in frame -3 (Figure 6). Using the UCSC browser, the region at the 5’ end of exon 4 was expanded to view the nucleotide sequence, and all AG (canonical 3’ splice acceptor sequence) and GT (canonical 5’ splice donor sequence) pairs were highlighted (Figure 7).

Figure 5. Output from Gene Record Finder for D. melanogaster gene CG11148. Top panel: table of coding exons present in each isoform; bottom panel: table of coding exons showing coordinates, strand, phase, and size in amino acids. Exon 2 and exon 3 are alternative second exons; therefore, throughout, exons will be referred to as exons 1 to 8 (in both isoforms), corresponding to their numerical order in the gene prediction, rather than by their FlyBase ID.

Figure 6. BLASTX alignment placing exon 4 in frame -3; alignment is truncated after subject amino acid 115. Query: nucleotide sequence of Contig21; subject: amino acid sequence of exon 4 from Gene Record Finder.

Figure 7. Expanded view of 5’ end of exon 4. Blue box denotes splice acceptor sequence; black box shows that this is a phase 2 acceptor in frame -3. (Note that CG11148 is transcribed on the reverse strand, and therefore, the 5’ end is on the right of this figure.)

Looking in frame -3, the 3’ AG splice acceptor sequence corresponds to bases 35,373-35,372. The first base of the exon is immediately downstream of this pair, at base 35,371, a start that is supported by both the N-SCAN and GENSCAN gene predictors, TopHat splice junction sites, the start of deep RNA-Seq coverage, and the predicted amino acid sequence. This is a phase 2 splice acceptor (Figure 7). The first base of this exon is not supported by the BLASTX alignment between the translated nucleotide sequence of Contig21 and the amino acid sequence of D. melanogaster; however, BLASTX (which looks for conservation) often extends sequence alignments beyond exon boundaries, and therefore this result should not be used as evidence against starting exon 4 at base 35,371.

As further support for the first base of this exon, the phase of the splice donor of this intron was evaluated. First, a BLASTX alignment showed that exon 3 is also translated in frame -3 (Figure 8).

Figure 8. BLASTX alignment, showing exon 3 is translated in frame -3. Query: nucleotide sequence of Contig21; subject: amino acid sequence of D. melanogaster exon 3 from Gene Record Finder; note: compositional adjustment turned off in this search.
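Because CG11148 lies on the reverse strand, the coordinate bookkeeping for splice sites is easy to get backwards. The sketch below illustrates it under stated assumptions: the function names are hypothetical, the toy sequence is made up, and the code only locates candidate dinucleotides; it does not weigh RNA-Seq, splice junction, or gene predictor evidence as the annotation does. On the forward strand, a reverse-strand AG appears as CT and a reverse-strand GT appears as AC.

```python
def reverse_strand_splice_sites(forward_seq, first_coord=1):
    """Locate candidate splice sites for a gene on the reverse strand.

    forward_seq: forward-strand nucleotides as displayed in the browser.
    first_coord: 1-based contig coordinate of forward_seq[0].
    Returns (acceptors, donors); each site is a (high, low) coordinate
    pair, i.e. the two bases in the order the transcript reads them.
    """
    seq = forward_seq.upper()
    acceptors, donors = [], []
    for i in range(len(seq) - 1):
        low = first_coord + i
        pair = seq[i:i + 2]
        if pair == "CT":      # reverse complement of AG (3' acceptor)
            acceptors.append((low + 1, low))
        elif pair == "AC":    # reverse complement of GT (5' donor)
            donors.append((low + 1, low))
    return acceptors, donors


def first_exon_base(acceptor):
    """First exon base on the reverse strand: one below the AG pair."""
    return acceptor[1] - 1


def last_exon_base(donor):
    """Last exon base on the reverse strand: one above the GT pair."""
    return donor[0] + 1


# Made-up toy sequence, not taken from Contig21.
toy = "GGCTAAACGG"
acceptors, donors = reverse_strand_splice_sites(toy, first_coord=100)
print(acceptors, donors)

# The exon 4 acceptor reported above: AG at 35,373-35,372.
print(first_exon_base((35373, 35372)))  # 35371
```

Reporting each pair as (high, low) mirrors how the coordinates are written in this report for a reverse-strand gene.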
The GT pair at bases 38,833-38,832 coincides with the 3’ end of the N-SCAN and GENSCAN predictions and with the end of the RNA-Seq coverage, and was therefore chosen as the splice donor site. BLASTX likely overextended the alignment past this site. This is a phase 1 donor; added to the phase 2 acceptor of exon 4, it gives 3 bases, or one whole codon (Figure 9).
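The phase check can be stated as a one-line rule: the bases left over at the donor plus the bases contributed at the acceptor must make either no extra codon (0) or exactly one codon (3). A minimal sketch follows, with the hypothetical function name `phases_compatible`; it assumes phases are expressed as 0, 1, or 2 leftover bases at each site, as in the text above.

```python
def phases_compatible(donor_phase, acceptor_phase):
    """Return True if the two splice phases preserve the reading frame.

    Each phase is the number of bases (0, 1, or 2) between the splice
    site and the nearest complete codon inside the adjoining exon.
    The sum must be 0 or 3 for the codon spanning the intron to be whole.
    """
    for phase in (donor_phase, acceptor_phase):
        if phase not in (0, 1, 2):
            raise ValueError(f"phase must be 0, 1, or 2, got {phase}")
    return (donor_phase + acceptor_phase) % 3 == 0


# Exon 3 donor (phase 1) joined to exon 4 acceptor (phase 2).
print(phases_compatible(1, 2))
```

A donor and acceptor that fail this check would shift the reading frame of every downstream exon, which is why the phase of the exon 3 donor serves as independent support for the exon 4 start.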
