Annotation of Contig21
Sarah Swiezy Dr. Elgin, Dr. Shaffer, Dr. Bednarski Bio434W 27 April 2015 Swiezy 2
Abstract:
Contig21, a 40 kb region of the fourth (“dot”) chromosome of Drosophila elegans (D.
elegans), containing three GENSCAN predictions and two high-quality BLASTX alignments to
annotated Drosophila melanogaster (D. melanogaster) genes, was finished and annotated.
Contig21 was analyzed for the presence of genes, transcription start sites, repeats, pseudogenes,
and non-coding RNAs using NCBI BLAST, FlyBase, and ClustalW2 in addition to the UCSC
Genome Browser, Gene Record Finder, Gene Model Checker, and other programs maintained by
the Genomics Education Partnership (GEP). The final annotation included conservation of two
genes annotated in D. melanogaster, fd102C and CG11148 (Figure 1), as well as several
repetitious elements, a putative pseudogene, and a non-coding RNA. CG11148 contains a GYF
domain, which is highly conserved among Drosophila species and a variety of mammals as well;
this domain is likely involved in recognition of and binding to proline-rich sequences of proteins.
The high level of conservation of CG11148 between D. elegans and other Drosophila species
also allowed for annotation of the transcription start site to within 15 base pairs.
Figure 1. Final annotation map of Contig21. Gene annotations in D. elegans shown in blue; BLASTX results aligning D. elegans to the D. melanogaster genome shown in red; gene predictions based on ab initio gene finders shown in gold (GENSCAN) and green (N-SCAN); repetitious elements shown beneath the heading “Repeating Elements by RepeatMasker.”
Introduction:
Recent innovations in sequencing technologies have made the goal of comparative
genomics a promising new reality. Drosophila melanogaster (D. melanogaster), one of the
best-studied model organisms, and the other Drosophila species sequenced by the
Drosophila 12 Genomes Consortium are ideally suited to intensive study by geneticists and
evolutionary biologists. With this end in mind, the Genomics Education Partnership, based at the
Biology Department and The Genome Institute at Washington University in St. Louis, has
helped engage university student researchers across the United States in a project to finish and
annotate portions of these sequenced Drosophila draft genomes to the high quality required for
detailed analyses. The Washington University 2015 Bio4342 students have finished and
annotated sections of the Drosophila elegans (D. elegans) fourth chromosome. This particular
chromosome is unique in Drosophila in that it is composed primarily of heterochromatin, but has
a number of functionally important genes that are transcribed at a high level. The present
analysis will open the door to future questions regarding the role of transposable elements and
repeats in heterochromatin formation, the degree of conservation between species in the
Drosophila genus, and the structure of genes and their regulatory regions in a heterochromatin
domain.
Initial GENSCAN predictions in Contig21:
The initial output from GENSCAN showed three predicted features in Contig21.
Prediction 1 showed two exons, prediction 2 showed three exons, and prediction 3 showed eight
exons (though the last exon of this feature fell outside of the region shown) (Figure 2). Given
that prediction 1 had a small number of exons and that prediction 2 did not have an easily
identifiable ortholog in Drosophila melanogaster (i.e. there was no BLASTX alignment
overlapping with prediction 2), prediction 3 was used to anchor the annotation in this region.
Figure 2. GENSCAN output for Contig21. Black box denotes prediction 1, green box denotes prediction 2, red box denotes prediction 3.
UCSC Genome Browser View of Contig21 Features:
To better understand the homology of this region with the D. melanogaster genome, the
BLASTX track on the UCSC Genome Browser was compared to the GENSCAN predictions
(Figure 3).
Prediction 1: As can be seen from this picture of the region, there is a high quality (red)
BLASTX alignment of prediction 1 to fd102C-PA in D. melanogaster. Prediction 1 is not
supported by a high level of RNA-Seq data or by TopHat splice junctions. (There are two splice
junctions; however, these do not correspond to the GENSCAN-proposed intron/exon boundary in
this prediction.) A second BLASTX match (brown) overlaps with prediction 1; however, this
gene is also inconsistent with the RNA-Seq and TopHat splice junction data in this area.
Prediction 2: Prediction 2 does not show a BLASTX alignment to a gene in D.
melanogaster and has no RNA-Seq data or TopHat splice junction data to support the presence
of this GENSCAN prediction as a gene in Drosophila elegans.
Prediction 3: All eight exons found in prediction 3 show strong similarity to the exons in
the homologous D. melanogaster gene, and the general exon structure of this gene appears to be
well-supported by RNA-Seq and TopHat splice junction data (this will become clearer in later
figures at higher resolution), suggesting their presence in the D. elegans genome. There are four
isoforms for the presumed ortholog of prediction 3 in D. melanogaster (Figure 3).
Figure 3. View of Contig21 via UCSC Genome Browser. Blue box marks prediction 1, green box marks prediction 2, pink box marks prediction 3, and black box marks the two TopHat splice junctions that do not correspond to the intron/exon boundary of GENSCAN prediction 1.
Annotation of Prediction 3:
To establish an ortholog in D. melanogaster, a BLASTp search was carried out using the
amino acid sequence of D. elegans prediction 3 as the query and the annotated D.
melanogaster protein database maintained by FlyBase as the subject. The top four matches
aligned over the entire length of the F, D, H, and G isoforms of CG11148 with high similarity;
all four of these alignments had an E-value of zero, with the next best alignments having E-
values >20 (Figure 4). Thus, prediction 3 in D. elegans is an ortholog of CG11148 in D.
melanogaster.
Figure 4. BLASTp alignment of prediction 3, establishing orthology to all four isoforms of D. melanogaster CG11148. Query: amino acid sequence of prediction 3 in D. elegans; subject: annotated D. melanogaster proteins database.
Annotation of Prediction 3, Exon 4:
To begin, the D. melanogaster gene CG11148 was entered into the Gene Record Finder
maintained by the Genomics Education Partnership, using GFF3 files from FlyBase. This
database provided a list of coding exons previously annotated for this gene in D. melanogaster
with their corresponding nucleotide and amino acid sequences, the strand on which each is
transcribed, and to which isoform each belongs (Figure 5). The longest exon (found in each of
the isoforms), exon 4, had a size of 872 amino acids; this sequence was used as the subject for a
BLASTX search (using the nucleotide sequence of Contig21 as the query), which showed that this
exon is transcribed in frame -3 (Figure 6). Using the UCSC browser, the region at the 5’ end of
exon 4 was expanded to view the nucleotide sequence and all AG (canonical 3’ splice acceptor
sequence) and GT (canonical 5’ splice donor sequence) pairs were highlighted (Figure 7).
Figure 5. Output from Gene Record Finder for D. melanogaster gene CG11148. Top panel: table of coding exons present in each isoform; bottom panel: table of coding exons showing coordinates, strand, phase, and size in amino acids. Exon 2 and exon 3 are alternative second exons; therefore, throughout, exons will be referred to as exons 1 to 8 (in both isoforms), corresponding to their numerical order in the gene prediction, rather than by their FlyBase ID.
Figure 6. BLASTX alignment placing exon 4 in frame -3; alignment is truncated after subject amino acid 115. Query: nucleotide sequence of Contig21; subject: amino acid sequence of exon 4 from Gene Record Finder.
Figure 7. Expanded view of 5’ end of exon 4. Blue box denotes splice acceptor sequence; black box shows that this is a phase 2 acceptor in frame -3. (Note that CG11148 is transcribed on the reverse strand, and therefore, the 5’ end is on the right of this figure.)
Looking in frame -3, the 3’ AG splice acceptor sequence corresponds to bases 35,373-
35,372. The first base of the exon is immediately downstream of this pair, at base 35,371, a start
that is supported by both N-SCAN and GENSCAN gene predictors, TopHat splice junction sites,
the start of deep RNA-Seq coverage, and the predicted amino acid sequence. This is a phase 2
splice acceptor (Figure 7). The first base of this exon is not supported by the BLASTX alignment
between the translated nucleotide sequence of Contig21 and the amino acid sequence of D.
melanogaster; however, BLASTX (which is looking for conservation) often extends sequence
alignments beyond the exon boundaries, and therefore, this result should not be used as evidence
against starting exon 4 at base 35,371.
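The coordinate bookkeeping for an exon on the reverse strand can be sketched as follows. This is a minimal illustration rather than part of the GEP toolchain; the function names are hypothetical. It assumes 1-based forward-strand coordinates, as displayed in the UCSC Genome Browser, so that "upstream" of a reverse-strand exon means a higher coordinate.

```python
def acceptor_site_minus(first_base: int) -> tuple[int, int]:
    """Return the forward-strand coordinates of the A and the G of the AG
    splice acceptor that immediately precedes a reverse-strand exon."""
    # The intron lies at higher coordinates, so the AG dinucleotide
    # occupies the two bases just above the first base of the exon.
    return first_base + 2, first_base + 1


def donor_site_minus(last_base: int) -> tuple[int, int]:
    """Return the forward-strand coordinates of the G and the T of the GT
    splice donor that immediately follows a reverse-strand exon."""
    # The intron continues toward lower coordinates past the exon end.
    return last_base - 1, last_base - 2


if __name__ == "__main__":
    # Exon 4 of the CG11148 ortholog begins at base 35,371.
    print(acceptor_site_minus(35371))  # (35373, 35372)
```

The same helper applies to any reverse-strand exon boundary in this contig; only the direction of the offsets would change for a gene on the forward strand.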
As further support for the first base of this exon, the phase of the splice donor of this
intron was evaluated. First, a BLASTX alignment showed that exon 3 is also transcribed in
frame -3 (Figure 8).
Figure 8. BLASTX alignment, showing exon 3 is transcribed in frame -3. Query: nucleotide sequence of Contig21; subject: amino acid sequence of D. melanogaster exon 3 from Gene Record Finder; note: compositional adjustment turned off in this search.
The GT pair from base 38,833-38,832 corresponds to the 3’ end of the N-SCAN and
GENSCAN predictions and the end of the RNA-Seq coverage, and therefore, was chosen as the
splice donor site. BLASTX likely overextended the alignment. This is a phase 1 donor, which,
when added to the phase 2 acceptor, equals 3 (or, a whole codon) (Figure 9).
Figure 9. Expanded view of the 3’ end of exon 3. Blue box denotes splice donor sequence; black box shows that this is a phase 1 donor in frame -3.
Next, the 3’ end of exon 4 was scanned for GT splice donors. The GT pair from base
32,716-32,715 corresponds to the 3’ end of the N-SCAN and GENSCAN predictions and the end
of the RNA-Seq coverage, and therefore, was chosen as the splice donor site. This is a phase 1
donor in frame -3. The last base of exon 4 is therefore base 32,717 (Figure 10).
To further support position 32,717 as the last base of exon 4, the phase of the splice
acceptor of this intron was evaluated. A BLASTX search showed that exon 5 is also transcribed
in frame -3 (Figure 11). The 5’ end of exon 5 was expanded and scanned for AG splice acceptors.
The AG pair from base 30,759-30,758 corresponds to the boundary of the N-SCAN and GENSCAN
predictions and a sharp increase in RNA-Seq coverage; therefore, this was chosen as the splice
acceptor site, which is a phase 2 acceptor in frame -3. This acceptor, when combined with the
phase 1 splice donor of exon 4, sums to 3 nucleotides (one whole codon), lending further support to base 32,717 as
the last base of exon 4 (Figure 12). The annotation of exon 4 is summarized in Table 1 below.
Figure 10. Expanded view of end of exon 4. Blue box denotes splice donor sequence; black box shows that this is a phase 1 donor in frame -3.
Figure 11. BLASTX alignment, showing exon 5 is transcribed in frame -3. Query: nucleotide sequence of Contig21; subject: amino acid sequence of D. melanogaster exon 5 from Gene Record Finder.
Figure 12. Expanded view of 5’ end of exon 5. Blue box denotes splice acceptor sequence; black box shows that this is a phase 2 acceptor in frame -3.
Exon   First Base   Last Base   Length (bp)   Phase of 3’ Splice Acceptor   Phase of 5’ Splice Donor   Frame
3      --           --          --            --                            Phase 1                    -3
4      35,371       32,717      2655          Phase 2                       Phase 1                    -3
5      --           --          --            Phase 2                       --                         -3
Table 1. Summary of exon 4 annotation.
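The phase arithmetic used to pair a splice donor with a splice acceptor can be written as a short check. This is a minimal sketch, assuming the convention followed in this report: the donor phase is the number of bases left over after the last complete codon of the upstream exon, and the acceptor phase is the number of bases at the start of the downstream exon that complete the split codon. The function name is illustrative only.

```python
def phases_compatible(donor_phase: int, acceptor_phase: int) -> bool:
    """Return True when a donor and an acceptor together preserve the reading frame."""
    for phase in (donor_phase, acceptor_phase):
        if phase not in (0, 1, 2):
            raise ValueError(f"phase must be 0, 1, or 2, got {phase}")
    # Either no codon is split (0 + 0) or the split codon is completed (sum of 3).
    return (donor_phase + acceptor_phase) % 3 == 0


if __name__ == "__main__":
    print(phases_compatible(1, 2))  # exon 3 donor with exon 4 acceptor: True
    print(phases_compatible(1, 1))  # this pairing would shift the frame: False
```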
Annotation of Prediction 3 Isoforms, Alternative Second Exons:
A procedure similar to that described above was used to establish the frame (exon-by-exon
BLASTX) (Table 2), first base, and last base of exon 2-DF (PD/PF isoforms) and exon 2-GH
(PG/PH isoforms), as well as the GT/GC splice donor sites and the AG splice acceptor sites.
Exon 2-GH:
The end of exon 1 has a GT splice donor that is phase 0 in frame -1, which is compatible
with the phase 0 AG splice acceptor at the beginning of exon 2-GH in frame -2. The end of exon
2-GH has a non-canonical GC splice donor that is phase 0 in frame -2, which is compatible with
the phase 0 AG splice acceptor at the beginning of exon 3 in frame -3 (Figure 13a). The non-
canonical splice donor is not conserved in the other Drosophila species examined (Figure 13b).
Figure 13a. Top right panel: end of exon 1; top left panel: beginning of exon 2-GH; bottom right panel: end of exon 2-GH; bottom left panel: beginning of exon 3. In all panels, blue box shows phase 0 donor or acceptor. Arrows denote reading frame for each panel.
Figure 13b. Blue box shows sequence of the splice donor of exon 2-GH; this sequence is non-canonical, GC, in D. elegans, but canonical, GT, in the six other Drosophila species shown.
Exon 2-DF:
As stated in the “Exon 2-GH” section above, the 3’ end of exon 1 has a GT splice donor
that is phase 0 in frame -1; this is compatible with the phase 0 AG splice acceptor at the
beginning of exon 2-DF in frame -3. Thus, the annotation of the 3’ end of exon 1 is the same
across all isoforms. There is TopHat splice junction data (JUNC00000711) to support the
truncation of exon 2-GH by 21 base pairs, which creates exon 2-DF in the PD and PF isoforms;
however, JUNC00000711 is only seen in the Mixed Embryo track and is only supported by four
reads (Figure 14), whereas JUNC00000712 is supported by 121 reads, JUNC00000209 by 43
reads, and JUNC00000292 by 37 reads. Nevertheless, given the high level of conservation to the
ortholog of this gene in D. melanogaster, exon 2-DF was annotated separately from exon 2-GH
in order to maintain all four isoforms found in D. melanogaster in D. elegans (Figure 15).
The annotation of the end of exon 2-DF was the same as that of the end of exon 2-GH,
and thus, the annotation of the beginning of exon 3 was also the same for all four isoforms
(Figure 13).
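The comparison of read support among the TopHat junctions named above can be sketched as a simple filter. The read counts are those quoted in the text; the cutoff of 10 reads and the function name are arbitrary choices made for illustration and are not a GEP guideline.

```python
# Read counts quoted in the text for four TopHat splice junctions.
junction_reads = {
    "JUNC00000711": 4,
    "JUNC00000712": 121,
    "JUNC00000209": 43,
    "JUNC00000292": 37,
}


def weakly_supported(reads: dict[str, int], min_reads: int) -> list[str]:
    """Return the IDs of junctions supported by fewer than min_reads reads."""
    return sorted(jid for jid, count in reads.items() if count < min_reads)


# min_reads=10 is an illustrative threshold only.
weak = weakly_supported(junction_reads, min_reads=10)
print(weak)  # ['JUNC00000711']
```

A junction flagged this way is not necessarily spurious; as in the case of exon 2-DF, conservation with the D. melanogaster model can justify keeping it.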
Figure 14. TopHat splice junction information for JUNC00000711. Top panel: black arrows represent the position of JUNC00000711; bottom panel: red box indicates the depth of coverage for this junction, which is only four reads.
Figure 15. Right panel: end of exon 1, phase 0 splice donor denoted by blue box; left panel: beginning of exon 2-DF, phase 0 splice acceptor denoted by blue box. End of exon 2-DF and beginning of exon 3 have identical annotation to exon 2-GH (Figure 13).
The difference between exon 2-GH and exon 2-DF is the only difference between the
PD/PF and PG/PH forms. The PD/PF isoforms are identical in their protein sequences, but differ
in the annotation of their 5’ UTRs, which was not a part of the present analysis. Similarly, the
PG/PH isoforms are identical in their protein sequences, but differ in the annotation of both their
3’ and 5’ UTRs (Figure 16).
Figure 16. Top panel: 3’ UTR has an extra intron in isoform PH relative to the other three isoforms. Middle panel: the G, F, and D isoforms share the same 3’ UTR. Bottom panel: the H and D isoforms share the same 5’ UTR while the F and G isoforms share a different 5’ UTR.
Annotation of Prediction 3, Exon 8:
A procedure similar to that described above was used to establish the frame (Table 2), first and last
base, and the splice donor and splice acceptor sites for exon 8. The exon-by-exon BLASTX
result for this exon revealed that the query sequence (translated nucleotide sequence of
Contig21) was missing 11 amino acids from the 5’ end relative to the subject (D. melanogaster
exon 8 from Gene Record Finder) (Figure 17). At the 5’ end of the alignment to exon 8, there is
high RNA-Seq coverage, N-SCAN and GENSCAN gene model predictions, and TopHat splice
junctions, as well as a consistent splice acceptor phase that supports base 24,878 as the first base
of exon 8 (Figure 18). In addition, using base 24,878 as the starting base would restore the
“missing” 11 amino acids in the BLASTX alignment.
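The relationship between the 11 unaligned residues and the annotated first base can be checked with a small calculation. This is a sketch under two assumptions that are not stated in the BLASTX output itself: the 11 residues are contiguous at the 5’ end of the exon, and the exon begins on a complete codon. The alignment start computed here is therefore implied rather than reported, and the function name is hypothetical.

```python
def implied_alignment_start_minus(annotated_first_base: int, missing_residues: int) -> int:
    """Coordinate at which a reverse-strand alignment would begin if it
    skipped `missing_residues` amino acids at the 5' end of the exon."""
    # Each amino acid spans three bases; moving 3' along the reverse strand
    # means moving to lower forward-strand coordinates.
    return annotated_first_base - 3 * missing_residues


if __name__ == "__main__":
    # 11 residues = 33 bases downstream of the annotated first base of exon 8.
    print(implied_alignment_start_minus(24878, 11))  # 24845
```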
Figure 17. Exon-by-exon BLASTX alignment, showing 11 missing amino acids from the 5’ end of exon 8. Query: nucleotide sequence of Contig21; subject: amino acid sequence of exon 8 from Gene Record Finder.
Figure 18. Expanded view of beginning of exon 8. Blue box denotes splice acceptor sequence; black box shows that this is a phase 0 acceptor in frame -3.
The start and stop codons chosen for CG11148 are 39,739-39,737 and 24,755-24,753
respectively (Figures 19 and 20).
Figure 19. Start codon for CG11148.
Figure 20. Stop codon for CG11148.
Annotation of Prediction 3, Remaining Exons:
Annotation of the exons not discussed above (exons 1, 5, 6, and 7) was straightforward;
in general these exons had deep RNA-Seq coverage and TopHat splice junctions that were
consistent with both N-SCAN and GENSCAN predictions. The final annotations of all exons and
isoforms are summarized in Tables 2, 3, and 4.
Exon    1    2-DF   2-GH   3    4    5    6    7    8
Frame   -1   -2     -2     -3   -3   -3   -3   -2   -3
Table 2. Frame of each coding exon established by exon-by-exon BLASTX search against the nt/nr database maintained by NCBI.
Annotation Summary:
Exon         Coordinates     Length (bp)   Phase of 3’ Splice Acceptor   Phase of 5’ Splice Donor   Frame
1            39,739-39,407   333           --                            Phase 0                    -1
2-GH         39,348-39,175   174           Phase 0                       Phase 0                    -2
3            39,113-38,834   280           Phase 0                       Phase 1                    -3
4            35,371-32,717   2655          Phase 2                       Phase 1                    -3
5            30,757-29,789   969           Phase 2                       Phase 1                    -3
6            29,719-29,645   75            Phase 2                       Phase 1                    -3
7            25,085-24,931   155           Phase 2                       Phase 0                    -2
8            24,878-24,756   123           Phase 0                       --                         -3
Stop Codon   24,755-24,753   3             --                            --                         -3
Table 3. Final annotation summary for PG/PH isoforms.
Exon         Coordinates     Length (bp)   Phase of 3’ Splice Acceptor   Phase of 5’ Splice Donor   Frame
1            39,739-39,407   333           --                            Phase 0                    -1
2-DF         39,348-39,175   174           Phase 0                       Phase 0                    -2
3            39,113-38,834   280           Phase 0                       Phase 1                    -3
4            35,371-32,717   2655          Phase 2                       Phase 1                    -3
5            30,757-29,789   969           Phase 2                       Phase 1                    -3
6            29,719-29,645   75            Phase 2                       Phase 1                    -3
7            25,085-24,931   155           Phase 2                       Phase 0                    -2
8            24,878-24,756   123           Phase 0                       --                         -3
Stop Codon   24,755-24,753   3             --                            --                         -3
Table 4. Final annotation summary for PD/PF isoforms.
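The internal consistency of a gene model of this kind can be checked by recomputing exon lengths and donor phases from the coordinates. The sketch below uses the PG/PH coordinates and acceptor phases from Table 3; the helper names are illustrative, and the check is a simplified stand-in for, not a reimplementation of, the Gene Model Checker.

```python
# (exon label, first base, last base, acceptor phase) for the PG/PH model.
# None marks the first exon, which begins at the start codon.
EXONS = [
    ("1", 39739, 39407, None),
    ("2-GH", 39348, 39175, 0),
    ("3", 39113, 38834, 0),
    ("4", 35371, 32717, 2),
    ("5", 30757, 29789, 2),
    ("6", 29719, 29645, 2),
    ("7", 25085, 24931, 2),
    ("8", 24878, 24756, 0),
]


def exon_length(first: int, last: int) -> int:
    """Inclusive length of a reverse-strand exon (first > last)."""
    return first - last + 1


def donor_phase(length: int, acceptor_phase: int) -> int:
    """Number of bases left after the last complete codon of the exon."""
    return (length - acceptor_phase) % 3


def summarize(exons):
    """Return (label, length, donor phase) for each exon."""
    rows = []
    for label, first, last, acceptor in exons:
        length = exon_length(first, last)
        rows.append((label, length, donor_phase(length, acceptor or 0)))
    return rows


rows = summarize(EXONS)
cds_length = sum(length for _, length, _ in rows)  # stop codon excluded
print(cds_length, cds_length % 3)
```

With these coordinates the recomputed lengths and donor phases agree with Table 3, and the coding sequence (excluding the stop codon) is a whole number of codons.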
Gene Model Checker Results:
Both models were confirmed, showing high sequence similarity to the D. melanogaster
models (Figure 21).
Figure 21. Gene Model Checker dot plots showing conservation with D. melanogaster gene models for PG/PH annotation (left) and PD/PF annotation (right).
Annotation of Prediction 1:
Because the UCSC Genome Browser showed two overlapping BLASTX alignments in this
region, it was first necessary to choose which of these was the more likely candidate for a gene
in D. elegans (Figure 22). The high-scoring segment pairs (HSPs) for each BLASTX
alignment showed higher percent identities and lower (more significant) E-values for the fd102C-PA
alignment than for the fd19B-PA alignment (Figure 23). This can also be seen from the colors of
the BLASTX alignments, red (fd102C-PA) being a better quality alignment than brown (fd19B-
PA). Lastly, fd102C is found on the fourth chromosome in D. melanogaster whereas fd19B is
found on the X chromosome in D. melanogaster.
Figure 22. UCSC Genome Browser shows overlapping BLASTX results in the region of prediction 1.
Figure 23. HSP summaries for BLASTX alignments shown in Figure 22. Black boxes denote the E-values for each HSP; red boxes denote the percent identities for each HSP. The HSPs for fd102C-PA have lower (more significant) E-values and higher percent identities than the HSPs for fd19B-PA; therefore, fd102C-PA is the better alignment to the D. elegans sequence.
A BLASTp search, using the amino acid sequence of D. elegans prediction 1 as the query
and the annotated D. melanogaster proteins database maintained by FlyBase as the subject, was
used to corroborate this putative ortholog in D. melanogaster. This search revealed a number of
protein matches with significant E-values, but the score (=696.041) for fd102C-PA was much higher
than the scores for the other matches (≤116.316). In addition, the visual representation of these
matches showed that only the alignment to fd102C-PA was consistently high quality over the
entire length of the query sequence (Figure 24). The significant E-values of the other matches are
likely due to the presence of the highly conserved forkhead domain, found in the matches
beginning with “fd.”
Figure 24. BLASTp alignment using the amino acid sequence of D. elegans prediction 1 as the query and the annotated D. melanogaster proteins database as the subject. Top panel: best alignment to fd102C-PA; bottom panel: conservation of the forkhead domain is likely responsible for the significant E-values given to the other alignments.
Both GENSCAN and N-SCAN predicted two coding exons for this gene. However, according to
Gene Record Finder, fd102C-PA has only one translated exon; the second exon contains the first
portion of the 5’ UTR in D. melanogaster (Figure 25). Additional support for only one coding
exon comes from the fact that there are no stop codons in the reading frame of this gene
throughout the length of the prediction in D. elegans (Figure 26).
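The open-reading-frame argument above can be illustrated with a short scan for in-frame stop codons on the reverse strand. This is a minimal sketch: the 12-base sequence is invented for the example (the Contig21 sequence is not reproduced here), the function names are hypothetical, and frame -1 is taken to begin at the last base of the forward-strand sequence, following the BLASTX convention.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def stop_positions_minus(seq: str, frame: int) -> list[int]:
    """Return 0-based codon indices of stop codons in reverse frame -1, -2, or -3."""
    if frame not in (-1, -2, -3):
        raise ValueError("frame must be -1, -2, or -3")
    rc = reverse_complement(seq)
    offset = -frame - 1  # frame -1 starts at the first base of the reverse complement
    codons = [rc[i:i + 3] for i in range(offset, len(rc) - 2, 3)]
    return [i for i, codon in enumerate(codons) if codon in STOP_CODONS]


if __name__ == "__main__":
    toy = "TTATTTGGCCAT"  # made-up forward strand; reverse strand reads ATG GCC AAA TAA
    print(stop_positions_minus(toy, -1))  # [3]
    print(stop_positions_minus(toy, -2))  # []
```

An empty result across the span of a predicted intron is consistent with, but does not by itself prove, a single uninterrupted coding exon.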
Figure 25. Left panel: fd102C-PA has one translated (coding) exon in D. melanogaster; right panel: fd102C-PA has two transcribed exons in D. melanogaster.
Figure 26. UCSC Genome Browser view of nucleotide sequence of fd102C-PA region in D. elegans. Transcription of this feature is in frame -1 (red arrow); no red stop codons are seen within this frame, extending throughout the length of the prediction.
Interestingly, the FlyBase Genome Browser showed a 700 base pseudogene annotated
within the intron separating the two pieces of the 5’ UTR of fd102C-PA in D. melanogaster
(Figure 27). A BLASTn search was then used to discover whether this pseudogene was also
conserved in the D. elegans sequence. The alignment shown below shared 71% identity to the D.
melanogaster sequence, but extended over only 54% of the D. melanogaster sequence (Figure
28). However, this search does not rule out the possibility that a pseudogene (or the remnant of a
pseudogene) is present in the D. elegans genome. Because pseudogenes are not translated into
functional proteins, there is a lack of negative selection pressure to maintain their sequence;
hence, evolution within these features occurs rapidly. Given that the common ancestor for D.
melanogaster and D. elegans lived about 10 million years ago, we would expect the sequence of
a pseudogene (if present in the ancestor) to differ significantly between these two species. Based
on the BLASTn alignment, we can conclude that a putative ortholog to CR33797 extends from
base 6,005 to base 6,705 (Figure 28).
Figure 27. FlyBase Genome Browser showing single coding exon (orange) and two additional transcribed exons (gray). Pseudogene CR33797-RB appears within the intron separating the pieces of the 5’ UTR in the D. melanogaster assembly (red boxes).
Figure 28. Results of BLASTn search using query, nucleotide sequence of CR33797-RB in D. melanogaster and subject, nucleotide sequence of contig21. The BLAST parameters were optimized for “somewhat similar sequences” and the filtering for “low complexity regions” was turned off. Top left panel: dot plot shows that the identity between these sequences is low and that a little more than half of the pseudogene sequence (on the x axis) is found in the sequence of contig21 (on the y axis); top right panel: the top alignment shows that the first 502 nucleotides and the last 104 nucleotides of the D. melanogaster pseudogene are not conserved in contig21; bottom panel: 71% sequence identity exists between the sequences of CR33797-RB and contig21, 54% of CR33797-RB is present in contig21.
RNA-Seq coverage in Mixed Embryos is high throughout prediction 1. In addition,
there is only a single small repeat within the region of this feature; this repeat lines up exactly
with and is likely responsible for the novel intron proposed by both GENSCAN and N-SCAN,
given that these gene predictors both run using the masked genomic sequence (Figure 29). Both
high RNA-Seq coverage and the lack of other repeats, combined with the BLASTX homology to
the D. melanogaster sequence, corroborate the presence of a gene in this region. A BLASTX
search revealed that the single exon is in Frame -1 and begins at the methionine at base 3,166
and ends at base 1,304 immediately before the stop codon from base 1,303 to base 1,301 (Figure
30a). While this is the first coding exon, it is the second transcribed exon. The canonical AG
splice acceptor site is found upstream of the methionine, where the RNA-Seq also ends. The first
base of this exon that is transcribed (part of the 5’ UTR) is at base 3,232 (Figure 30b).
Figure 29. UCSC Genome Browser view of fd102C-PA; black box shows repeat falling within the novel intron proposed by the GENSCAN and N-SCAN gene models. Red arrow shows the high level of RNA-Seq coverage throughout the region of this feature.
Figure 30a. UCSC Genome Browser View of 3’ end of fd102C-PA, stop codon boxed in blue.
Figure 30b. UCSC Genome Browser View of 5’ end of fd102C-PA, beginning at methionine boxed in blue.
Annotation Summary:
Exon         Coordinates   Length (bp)   Phase of 3’ Splice Acceptor   Phase of 5’ Splice Donor   Frame
1            3,166-1,304   1863          --                            --                         -1
Stop Codon   1,303-1,301   3             --                            --                         -1
Table 5. Final annotation summary for fd102C-PA.
Gene Model Checker Results:
Gene Model Checker confirmed the model proposed above for the
ortholog of fd102C-PA in D. elegans. The dot plot generated shows homology between the D.
melanogaster and D. elegans annotations throughout, albeit with several gaps (Figure 31).
Figure 31. Gene Model Checker output for fd102C-PA annotation. Left panel: dot plot showing global sequence homology between D. melanogaster gene (x axis) versus D. elegans annotation (y axis); right panel: sequence similarity between D. melanogaster and D. elegans annotations displayed at amino acid resolution.
Annotation of Prediction 2:
This prediction was unique in that only the SNAP and GENSCAN predictions suggested that
there may be a feature present at this position. There was no BLASTX alignment between the D.
elegans and D. melanogaster sequences, a low level of RNA-Seq data that does not match up
with specific exons (more likely to be noise), and a large number of repeats present throughout
the region. There was one TopHat junction that appeared to support the middle exon, and there
was some conservation of this region across Drosophila species; however, the conservation
evidence did not support the middle exon (Figure 32).
Figure 32. UCSC Genome Browser view of Prediction 2. Blue box indicates the TopHat junction that supports the middle exon; the pink boxes indicate conservation among other Drosophila species. Note that the TopHat junction and conservation track support different exons in this feature.
Next, a BLASTp search was completed using the predicted protein sequence of
this feature as the query and the annotated D. melanogaster protein database maintained by
FlyBase as the subject. Using the default “expect value” of 10 returned no significant hits; changing
the expect value to 100 returned a number of very poor alignments with poor scores and poor E-values
(Figure 33). However, a BLASTn search using the nucleotide sequence of this feature as the query
and the D. melanogaster nucleotide sequence database maintained by FlyBase as the subject
returned an alignment to CR45123-RA (Figure 34). Checking the synteny on the FlyBase genome
browser confirmed the presence of this non-coding RNA in D. melanogaster between CG11148
and fd102C-PA (Figure 35).
Figure 33. BLASTp search using query, amino acid sequence of prediction 2, and subject, annotated D. melanogaster protein database. Top panel: summary of alignments showing poor E-values for all alignments; middle panel: visual representation of summary table in top panel; bottom panel: side-by-side amino acid alignment for best BLAST hit.
Figure 34. BLASTn search using query, nucleotide sequence of prediction 2, and subject, the D. melanogaster nucleotide sequence database. Top panel: summary of alignments, red box shows E-value for top hit is almost 43 orders of magnitude smaller than that of the second best hit; middle panel: visual representation of summary table in top panel; bottom panel: side-by-side nucleotide alignment for best BLAST hit.
Figure 35. Genome browser view of the D. melanogaster homologous region to contig21. Red box denotes non- coding RNA prediction.
A BLASTn search confirms the presence of CR45123 in contig21 (Figure 36).
If this non-coding RNA is transcribed, it may be the source of the
RNA-Seq and TopHat junctions in the region of this feature and may also be the reason that
GENSCAN originally made a prediction in this region.
Figure 36. Top panel: BLASTn search using query, nucleotide sequence of CR45123 in D. melanogaster, and subject, nucleotide sequence of contig21; bottom panel: dot plot showing homology throughout the sequences (query on x axis, subject on y axis).
After synthesizing all of the above information, we can conclude that GENSCAN
prediction 2 is a consequence of the presence of an ortholog to CR45123 in D. elegans. Given
the above BLAST alignment, the CR45123 ortholog in D. elegans extends from base 18,971 to
base 19,465.
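The relative order of the features annotated in Contig21 can be confirmed with a small interval check. The coordinates below are those given in the text; the labels and the function name are illustrative, and the check covers only position and overlap, not strand or sequence similarity.

```python
# (name, low coordinate, high coordinate) on Contig21, taken from the text.
FEATURES = [
    ("CG11148 ortholog (start codon to stop codon)", 24753, 39739),
    ("fd102C ortholog (transcribed exon to stop codon)", 1301, 3232),
    ("CR45123 ortholog", 18971, 19465),
    ("CR33797 putative ortholog", 6005, 6705),
]


def ordered_without_overlap(features):
    """Sort features by position and confirm that no two of them overlap."""
    ordered = sorted(features, key=lambda feature: feature[1])
    for (_, _, prev_high), (_, next_low, _) in zip(ordered, ordered[1:]):
        if next_low <= prev_high:
            raise ValueError("features overlap")
    return [name for name, _, _ in ordered]


for name in ordered_without_overlap(FEATURES):
    print(name)
```

Sorted this way, the CR45123 ortholog falls between the fd102C and CG11148 orthologs, matching the arrangement described for D. melanogaster.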
Annotation of Mary’s Gene:
In addition to the genes annotated in contig21, I was also given one gene from contig30
to annotate (Figure 37); for an analysis of contig30 outside of this feature, see Mary
Richardson’s annotation report.
There are two alignments to D. melanogaster genes in the region of this prediction;
however, as discussed previously, the color of the BLASTX alignment indicates its quality, with
the red being a better alignment than the brown. Using a BLASTp search, I was able to confirm
that Eph is the correct D. melanogaster ortholog to this GENSCAN prediction (Figure 38).
Figure 37. GENSCAN prediction contig30.1 (prediction 1).
Figure 38. BLASTp search using query, amino acid sequence of prediction 1, and subject, annotated D. melanogaster proteins database maintained by FlyBase. Top panel: summary of BLAST alignments (not all hits shown); bottom panel: visual representation of best BLAST alignments (not all hits shown).
Gene Record Finder revealed that the annotation of Eph in D. melanogaster has six
unique isoforms (with two isoforms sharing the same coding sequence) and 18 unique coding
exons (Figure 39).
Figure 39. Gene Record Finder information about Eph, showing six unique isoforms and 18 unique coding exons.
Annotation of Isoforms PA and PE:
The first exon of isoform PA has no corresponding GENSCAN prediction in contig30;
however, looking at the upstream project shows a canonical splice acceptor site, RNA-Seq data
and TopHat junctions, as well as a methionine to support the beginning of this exon. The next
nine exons of the PA and PE isoforms showed routine annotations with exon boundaries
supported by RNA-Seq data, TopHat junctions and in-phase splice donor and acceptor pairs. The
eleventh exon (15_2213_0) showed two different alignments via the BLASTX track. Base
12,711, corresponding to the BLASTX alignment to the PD isoform, was chosen as the final base
of exon 11 for isoforms PC, PE, PA, PD, and PB in agreement with the RNA-Seq and TopHat
junction data.
Figure 40. Black line shows that the BLASTX alignment to the D isoform corresponds to the RNA-Seq and TopHat junction data; this base (12,711) was chosen as the last base of exon 11 in the PC, PE, PA, PD, and PB isoforms.
Annotation Summary:
Exon         Coordinates   Length (bp)   Phase of 3’ Splice Acceptor   Phase of 5’ Splice Donor   Frame
1            23-275        253           --                            1                          2
2            346-425       80            2                             0                          3
3            2123-2303     181           0                             1                          2
4            3247-3732     486           2                             1                          3
5            5342-6229     888           2                             1                          1
6            6291-6467     177           2                             1                          2
7            11201-11329   129           2                             1                          1
8            11400-11495   96            2                             1                          2
9            11566-11936   371           2                             0                          3
10           12169-12315   147           0                             0                          1
11           12382-12711   330           0                             0                          1
12           12770-12871   102           0                             --                         2
Stop Codon   12872-12874   3             --                            --                         2
Table 6. Annotation summary for Eph-PA and Eph-PE.
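For a gene on the forward strand, the reading frame of each exon follows directly from its coordinates, which gives a quick cross-check of the frame column in Table 6. The sketch below assumes the BLASTX convention that frame +1 begins at base 1 of the contig, and it treats the first exon (which begins at the start codon) as having acceptor phase 0; the function name is illustrative.

```python
# (exon, first base, last base, acceptor phase, reported frame) from Table 6.
EPH_EXONS = [
    (1, 23, 275, 0, 2),  # begins at the start codon, so no bases are skipped
    (2, 346, 425, 2, 3),
    (3, 2123, 2303, 0, 2),
    (4, 3247, 3732, 2, 3),
    (5, 5342, 6229, 2, 1),
    (6, 6291, 6467, 2, 2),
    (7, 11201, 11329, 2, 1),
    (8, 11400, 11495, 2, 2),
    (9, 11566, 11936, 2, 3),
    (10, 12169, 12315, 0, 1),
    (11, 12382, 12711, 0, 1),
    (12, 12770, 12871, 0, 2),
]


def forward_frame(first_base: int, acceptor_phase: int) -> int:
    """Frame (1, 2, or 3) of the first complete codon of a forward-strand exon."""
    first_codon_base = first_base + acceptor_phase
    return (first_codon_base - 1) % 3 + 1


mismatches = [exon for exon, first, _, phase, frame in EPH_EXONS
              if forward_frame(first, phase) != frame]
print(mismatches)  # []
```

An empty list indicates that every reported frame agrees with the exon coordinates and acceptor phases in the table.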
Gene Model Checker Results:
The gene model summarized in Table 6 was confirmed for both isoform PA and PE
(Figure 41). However, in both dot plots generated from the Gene Model Checker, the annotation
seems to be lacking support for the first coding exon. A side-by-side alignment of the amino acid
sequence of the first exon shows that there is significant homology between the D. melanogaster
and D. elegans sequences in this region, although it is not as highly conserved as the second,
third, or fourth exon (Figure 42).
Figure 41. Dot plot showing homology between Eph-PA (left) and Eph-PE (right) in D. melanogaster to contig30 ortholog in D. elegans. Note the appearance of a lack of homology in the first exon on both dot plots (black brackets).
Figure 42. Side-by-side amino acid alignment of Eph-PE (D. melanogaster), and Submitted_Seq (D. elegans ortholog); note significant similarity in the first exon (black box).
Annotation of Isoform PB:
According to the model annotated in D. melanogaster, there should be 13 exons in the PB isoform.
However, the BLASTX alignment and the gene predictors on the UCSC Genome Browser
show only 12 exons. Gene Record Finder reveals that the first exon should encode only three
amino acids and should begin downstream of the start of, but lie within, the first exon of the PA
and PE isoforms (Figure 43). This leaves only two viable options for the beginning of the exon
where a methionine is present to begin translation and which will not truncate the second exon of
this isoform (Figure 44); however, neither of these conserves the MTK sequence and only one has a
suitable splice donor sequence (Figure 45). Taken together, these results suggest that we cannot
make a model in D. elegans that is orthologous to the PB isoform in D. melanogaster.
Figure 43. Top table: first three exons of Eph-PB in D. melanogaster; bottom table: first three exons of Eph- PE in D. melanogaster; inset: amino acid sequence of exon 1 of Eph-PB.
Figure 44. Start codons that are possible for isoform PB boxed in blue; start codons that are impossible in isoform PB due to truncation of exon 2 boxed in black; exon 2 boxed in green.
Figure 45. Each of the possible start codons boxed in Figure 44; only the first has a suitable GT splice donor site nearby.
Annotation of Isoform PC:
The annotation of this isoform was straightforward; the only difference between Eph-PC
and Eph-PA/PE is the introduction of a small exon, exon 7 (Figure 46). There were TopHat
junctions as well as an N-SCAN prediction and in-phase splice donors and acceptors to support
the presence of this exon.
Figure 46. Short exon found in Eph-PC, Eph-PD, and Eph-PF along with the corresponding prediction of this exon by N-SCAN and the supporting TopHat junctions.
Annotation Summary:
Exon        Coordinates   Length (bp)  Phase of 3’ Splice Acceptor  Phase of 5’ Splice Donor  Frame
1           23-275        253          --                           1                         2
2           346-425       80           2                            0                         3
3           2123-2303     181          0                            1                         2
4           3247-3732     486          2                            1                         3
5           5342-6229     888          2                            1                         1
6           6291-6467     177          2                            1                         2
7           7975-8022     48           2                            1                         3
8           11201-11329   129          2                            1                         1
9           11400-11495   96           2                            1                         2
10          11566-11936   371          2                            0                         3
11          12169-12315   147          0                            0                         1
12          12382-12711   330          0                            0                         1
13          12770-12871   102          0                            --                        2
Stop Codon  12872-12874   3            --                           --                        2
Table 7. Annotation summary for Eph-PC.
Gene Model Checker Results:
The gene model summarized in Table 7 was confirmed for Eph-PC (Figure 47). Note that,
as in Eph-PA and Eph-PE, the first exon appears to be poorly conserved on the dot plot
comparing the sequence similarity between the D. melanogaster and D. elegans models; see
Figure 42 for the side-by-side amino acid comparison.
Figure 47. Gene Model Checker generated dot plot showing sequence homology between D. melanogaster Eph-PC gene model (x axis) and orthologous model in D. elegans (y axis). Note what appears to be complete lack of homology in the model for exon 1 (black brackets).
Annotation of Isoform PD:
All but one of the exons present in isoform PD were annotated in either isoform PA, PE, or
PC above. The BLASTX alignment suggests that the last exon in Eph-PD appears more than 4
kb downstream of the end of the prior exon. However, this is inconsistent with the D.
melanogaster model seen on FlyBase and with Gene Record Finder, which both predict only 12
coding exons. In addition, there are no TopHat junctions, RNA-Seq, or predictions by
GENSCAN or N-SCAN to support this exon (Figure 48).
Figure 48. Top left panel: UCSC Genome Browser view of last exon proposed by BLASTX alignment with D. melanogaster; top right panel: FlyBase genome browser view of Eph-PD gene model in D. melanogaster; bottom panel: expanded view of top right panel.
Omitting the final exon proposed by BLASTX and instead ending the model at the same place
as the PA, PC, and PE isoforms would leave this gene model without a stop codon. Hence,
the last exon was extended until the next available stop codon in frame 1, from base 12775 to
12777 (Figure 49). However, Gene Record Finder showed that in D. melanogaster, the last base
of Eph-PD exon 12 is upstream of the first base of Eph-PE exon 12; this is inconsistent with the
stop codon chosen, but the stop codon from base 12775 to base 12777 represents the best option
for conserving the PD isoform in D. elegans (Figure 50).
Figure 49. Stop codon chosen for Eph-PD final exon, exon 12 (black box).
Figure 50. Left panel: gene model for Eph-PE in D. melanogaster, showing exon 12, boxed in black, begins at base 620,254; right panel: gene model for Eph-PD in D. melanogaster, exon 12, boxed in black, ends at 620,234 (before the start of Eph-PE exon 12).
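Extending an exon to the next available in-frame stop codon, as was done for the final exon of Eph-PD, amounts to stepping through the sequence three bases at a time from a position known to be in frame. The following is a minimal sketch, assuming a forward-strand sequence held as a Python string; the function name is hypothetical and the toy sequence is invented for illustration (it is not taken from contig30).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def next_stop_in_frame(seq, start, offset=1):
    """Find the first stop codon at or after `start`, reading codons from `start`.

    `start` is a coordinate in the same system as `offset`, where `offset` is
    the coordinate of seq[0] (1 for a 1-based contig). Returns the coordinates
    (first_base, last_base) of the stop codon, or None if none is found.
    """
    i = start - offset
    while i + 3 <= len(seq):
        if seq[i:i + 3].upper() in STOP_CODONS:
            return (i + offset, i + offset + 2)
        i += 3  # move to the next codon in the same frame
    return None


toy = "ATGGCCTTTTGAAAA"  # invented example sequence
print(next_stop_in_frame(toy, 1))  # reads ATG GCC TTT TGA
print(next_stop_in_frame(toy, 2))  # a different frame with no stop codon
```

Starting the scan at the first base of the exon's first complete codon keeps the search in the frame recorded for that exon in the annotation summary.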
Annotation Summary:
Exon        Coordinates   Length (bp)  Phase of 3’ Splice Acceptor  Phase of 5’ Splice Donor  Frame
1           23-275        253          --                           1                         2
2           346-425       80           2                            0                         3
3           2123-2303     181          0                            1                         2
4           3247-3732     486          2                            1                         3
5           5342-6229     888          2                            1                         1
6           6291-6467     177          2                            1                         2
7           7975-8022     48           2                            1                         3
8           11201-11329   129          2                            1                         1
9           11400-11495   96           2                            1                         2
10          11566-11936   371          2                            0                         3
11          12169-12315   147          0                            0                         1
12          12382-12774   393          0                            --                        1
Stop Codon  12775-12777   3            --                           --                        1
Table 8. Annotation summary for Eph-PD.
Gene Model Checker Results:
The model proposed above for an ortholog to Eph-PD in D. elegans passed the guidelines
set by Gene Model Checker. The apparent lack of homology between exon 1 in the gene models
has previously been discussed and accounted for (Figure 42). The lack of homology seen at the
very end of the last exon in this model is likely due to the choice of stop codon, which according
to Gene Record Finder should not have extended so far, but for which there was no better
alternative (Figure 51).
Figure 51. Gene Model Checker output. Left panel: dot plot (x axis, Eph-PD model in D. melanogaster; y axis, D. elegans orthologous model) showing lack of homology at end of exon 12 (black brackets); right panel: side-by-side amino acid alignment, showing lack of sequence homology in final exon (top line, Eph-PD model in D. melanogaster; bottom line, D. elegans orthologous model).
Annotation of Isoform PF:
This isoform uses the same first exon as isoform Eph-PB; we previously decided that
there is no viable methionine to make this a reasonable model for the first exon. Hence, in the
annotation of Eph-PF, we decided that the first exon would start at the upstream methionine
consistent with the start position of Eph-PA, Eph-PC, Eph-PD, and Eph-PE.
This isoform also uses different final exons than any of the other isoforms. The D.
melanogaster gene model shows that the final exon of Eph-PF is only one base long and only
codes for a portion of the stop codon; this base is found at the first base of the final exons in the
other isoforms (Figure 52).
Figure 52. Gene model for Eph-PF via FlyBase Genome Browser; red boxes indicate the transcript (top panel, red box) and coding (middle panel, red box) lengths of the PF isoform. The final exon in this isoform is one base long and codes for a portion of the stop codon (bottom panel, red box).
A suitable splice donor was found in the region of the end of the BLASTX alignment of
contig30 to D. melanogaster. Immediately prior to the GT donor sequence, a TG sequence is
present. This is a phase 2 donor in frame 1 (Figure 53). In frame 3 we find a suitable splice
acceptor that is phase 1. Immediately after the AG splice sequence, an A is present. This A
completes the stop codon (TGA) begun in the previous exon and is consistent with the start of an
exon predicted by N-SCAN (Figure 54).
Figure 53. Boxed in blue is the TG sequence at the end of exon 14 immediately upstream of the GT splice donor sequence (phase 2 in frame 1).
Figure 54. Boxed in blue is the A at the beginning of exon 15 immediately downstream of the AG splice acceptor sequence (phase 1 in frame 3); this A completes the stop codon sequence TGA.
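The way the stop codon of Eph-PF is assembled across the splice junction can be illustrated with a few lines of Python. This is a sketch under stated assumptions rather than part of the annotation workflow: the function name is hypothetical, only canonical GT donor and AG acceptor sites are accepted, and the donor phase is taken to be the number of codon bases left at the end of the upstream exon (so the acceptor supplies the remaining bases).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_codon(exon_end, donor, acceptor, exon_start, donor_phase):
    """Rebuild the codon that spans an intron.

    exon_end    -- bases at the 3' end of the upstream exon
    donor       -- first two intron bases (expected GT)
    acceptor    -- last two intron bases (expected AG)
    exon_start  -- bases at the 5' end of the downstream exon
    donor_phase -- 1 or 2 codon bases contributed by the upstream exon
    """
    if donor.upper() != "GT" or acceptor.upper() != "AG":
        raise ValueError("non-canonical splice site")
    if donor_phase not in (1, 2):
        raise ValueError("a split codon requires a donor phase of 1 or 2")
    acceptor_phase = 3 - donor_phase
    codon = exon_end[-donor_phase:] + exon_start[:acceptor_phase]
    return codon.upper()


# Values described in the text: TG before the GT donor (phase 2)
# and A after the AG acceptor (phase 1).
codon = split_codon("TG", "GT", "AG", "A", donor_phase=2)
print(codon, codon in STOP_CODONS)
```

With the TG and A described above, the rebuilt codon is TGA, which is one of the three stop codons.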
While this gene model can be made, it is primarily made for the purpose of conserving
the isoforms found in D. melanogaster. Note that there was no significant drop in RNA-Seq
coverage, and there were no TopHat junctions, to support the end of exon 14 at this position.
Given that the RNA-Seq coverage is >600 reads deep in the Mixed Embryos track, it seems
unlikely that there would be no TopHat junctions if this isoform did exist in D. elegans and
were transcribed at levels consistent with the other isoforms.
Annotation Summary:
Exon        Coordinates              Length (bp)  Phase of 3’ Splice Acceptor  Phase of 5’ Splice Donor  Frame
1           23-275                   253          --                           1                         2
2           346-425                  80           2                            0                         3
3           2123-2303                181          0                            1                         2
4           3247-3732                486          2                            1                         3
5           5342-6229                888          2                            1                         1
6           6291-6467                177          2                            1                         2
7           7975-8022                48           2                            1                         3
8           11201-11329              129          2                            1                         1
9           11400-11495              96           2                            1                         2
10          11566-11936              371          2                            0                         3
11          12169-12315              147          0                            0                         1
12          12382-12639              258          0                            --                        1
Stop Codon  12640-12641 and 12770    3            --                           --                        1 → 3
Table 9. Annotation summary for Eph-PF.
Gene Model Checker Results:
Gene Model Checker does not allow the stop codon to be split up this way; hence, this
portion of the model failed. All other exons in the model proposed above passed the criteria set
by the program (Figure 55). The apparent lack of homology in the first exon has been previously
discussed and accounted for (Figure 42).
Figure 55. Dot plot showing homology between Eph-PF gene model in D. melanogaster (x axis) and orthologous model in D. elegans (y axis).
Overview Eph:
I annotated five isoforms of Eph in D. elegans, Eph-PA, Eph-PC, Eph-PD, Eph-PE, and
Eph-PF (Figure 56); Eph-PB was not annotated as an isoform because changing the annotation
of the first exon would make the model exactly consistent with Eph-PA (both in terms of
transcript sequence and coding sequence).
Figure 56. Final gene model for Eph ortholog in D. elegans, including isoforms PA, PC, PD, PE, and PF.
Repeat Elements (contig21):
The total repeat content for contig21 returned by RepeatMasker was 31.72% (Figure 57).
There were eight repeats over 500 bp in length, which were limited to the regions between genes
or within long introns of the genes (Table 10 and Figure 58). Two independent BLASTn
searches using as the subject, Wolbachia (taxid:953) and Wolbachiae (taxid:952), and as the
query, nucleotide sequence of contig21, returned no significant hits, suggesting that the 40
unclassified repeats (19.88%) are not remnants of Wolbachia DNA. We would expect the repeat
density to be this high given that contig21 is a sequence taken from the D. elegans dot
chromosome.
Figure 57. RepeatMasker output for contig21.
Repeat Family       Coordinates    Length (bp)
rnd-5_family-237    4473-5632      1159
rnd-5_family2544    3768-4472      704
rnd-3_family113     142-799        657
rnd-3_family113     6459-7104      645
rnd-3_family113     17977-18580    603
rnd-1_family259     31194-31723    529
rnd-5_family815     23275-23802    527
rnd-5_family237     76-588         512
Table 10. Summary of repeats >500 bp in length.
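A summary like Table 10 can be reproduced from a list of repeat coordinates with a simple filter. The sketch below is illustrative only: the function name is hypothetical and the tuples are typed in by hand from Table 10 rather than parsed from RepeatMasker output. The lengths in Table 10 appear to equal end minus start, so the same convention is assumed here; an inclusive base count would be one larger.

```python
# Repeats from Table 10: (family, start, end, reported_length_bp).
REPEATS = [
    ("rnd-5_family-237", 4473, 5632, 1159),
    ("rnd-5_family2544", 3768, 4472, 704),
    ("rnd-3_family113", 142, 799, 657),
    ("rnd-3_family113", 6459, 7104, 645),
    ("rnd-3_family113", 17977, 18580, 603),
    ("rnd-1_family259", 31194, 31723, 529),
    ("rnd-5_family815", 23275, 23802, 527),
    ("rnd-5_family237", 76, 588, 512),
]


def long_repeats(repeats, min_length=500):
    """Return (family, start, end, length) for repeats longer than min_length.

    Length is computed as end - start (assumed convention) and the result
    is sorted from longest to shortest.
    """
    hits = [(fam, s, e, e - s) for fam, s, e, _ in repeats if e - s > min_length]
    return sorted(hits, key=lambda hit: hit[3], reverse=True)


for family, start, end, length in long_repeats(REPEATS):
    print(f"{family}\t{start}-{end}\t{length}")
```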
Figure 58. UCSC Browser view of repeat elements >500 bp in length.
Synteny:
Comparing my annotations with the orthologous region on the dot chromosome in D.
melanogaster revealed the same order of genes, all transcribed on the negative strand (Figure
59). I also included the putative pseudogene and ncRNA found in this region in D. melanogaster
in my model, and these are uploaded into the final track below.
Figure 59. Synteny between D. elegans (top panel) and D. melanogaster (bottom panel) annotations.
Transcription Start Site Annotation, CG11148:
I chose to evaluate the transcription start site (TSS) for CG11148 since this gene was
well-conserved between the two species and had a high amount of RNA-Seq data throughout. To
begin, the annotation of the TSS in D. melanogaster was evaluated (Figure 60). Given that there
are only two positions annotated as a TSS by Celniker, this was classified as a “peaked”
promoter. In the D. melanogaster model, the TSS is annotated ten bases upstream of the
beginning of the annotation of the 5’ UTR; the 5’ UTR begins 759 bases upstream of the
beginning of the first coding exon. By extrapolation to D. elegans, because the first coding exon
begins at base 39,739, adding 769 bases to this means that the TSS should fall at about 40,508.
Figure 60. Annotation of transcription start site for CG11148 in D. melanogaster. Left panel: characterization of peaked promoter; right panel: expanded view of promoter, black box defines the region of sequence used for the BLASTn search below.
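The extrapolation of the TSS position from D. melanogaster to D. elegans is simple coordinate arithmetic. As a sketch (the function name and the strand argument are hypothetical; the calculation assumes that CG11148 lies on the negative strand of contig21, so upstream positions have larger contig coordinates):

```python
def extrapolate_tss(first_cds_base, utr_start_to_cds, tss_to_utr_start, strand="-"):
    """Project a TSS position from distances measured in a reference species.

    first_cds_base    -- contig coordinate of the first base of the first coding exon
    utr_start_to_cds  -- bases between the start of the 5' UTR and the first coding exon
    tss_to_utr_start  -- bases between the annotated TSS and the start of the 5' UTR
    strand            -- "-" moves upstream toward larger coordinates, "+" toward smaller
    """
    distance = utr_start_to_cds + tss_to_utr_start
    if strand == "-":
        return first_cds_base + distance
    return first_cds_base - distance


# Values from the text: first coding exon at 39,739; 759 + 10 = 769 bases upstream.
print(extrapolate_tss(39739, 759, 10))
```

This reproduces the estimate of about 40,508 given above; it assumes that the UTR length is conserved between the two species, which the later BLASTn comparisons test.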
We would expect to see conserved core promoter motifs close to base 40,508. Within this
region, two BRE motifs and one DPE motif offer the best support for the presence of a promoter
(Figure 61).
Figure 61. Conserved core promoter motifs in the region of the promoter in D. elegans.
Next, a BLASTn search was conducted, using as subject the nucleotide sequence of the
promoter region in the D. melanogaster annotation (defined in Figure 60) and as query the
nucleotide sequence of contig21. This returned an alignment covering the estimated TSS in D.
elegans; however, the alignment was missing the first 15 bases of the D. melanogaster promoter
sequence (Figure 62).
Figure 62. BLASTn alignment using query, nucleotide sequence of contig21, and subject, nucleotide sequence of CG11148 promoter in D. melanogaster.
In order to discover the reason for the 15 base deletion in the D. elegans sequence relative
to the D. melanogaster sequence, the conservation track on the D. melanogaster browser was
evaluated. We can see that the D. melanogaster sequence in the region of the 5’ UTR and TSS is
conserved to the highest degree in D. yakuba and D. erecta, with the sequences of other
Drosophila species showing increasing divergence. The sequence of D. elegans in this region
shows more conservation to the D. biarmipes sequence than to the D. melanogaster sequence, so
we are not surprised by the missing bases in the BLASTn alignment above (Figure 63).
Figure 63. Conservation in the region of the CG11148 promoter. D. melanogaster conservation to sequences boxed in red; conservation between D. elegans and D. biarmipes sequences shown in blue.
Given the similarity between the D. biarmipes and D. elegans sequences in this region, the
D. biarmipes RNA PolII and RNA-Seq data were evaluated for a more complete description of
this region. The RNA PolII data showed two peaks associated with the promoter, which suggests
that there are two different isoforms for this UTR; these two peaks are consistent with the two
different sets of TopHat splice junctions (Figure 64). To evaluate the homology between the D.
biarmipes sequence and the D. elegans sequence further, a BLASTn search was conducted using
as query the nucleotide sequence of contig21 and as subject the nucleotide sequence of the
region corresponding to the promoter in D. biarmipes (defined in Figure 64, from base 835,662
to base 836,062) (Figure 65).
Figure 64. UCSC Genome Browser view of CG11148 promoter; note two peaks seen in RNA PolII data, suggesting two isoforms for this UTR may be present; this is consistent with the two different sets of TopHat splice junctions seen in this region as well. The black brackets define the promoter region of CG11148 in D. biarmipes—DNA from this region was used as the subject in the BLASTn search in Figure 65.
Figure 65. BLASTn alignment using query, the nucleotide sequence of contig21, and subject, the nucleotide sequence of the CG11148 promoter in D. biarmipes (i.e. base 1 of the subject is base 835,662 in the D. biarmipes whole genome assembly as of April 2013).
We estimate that the TSS for CG11148 in D. elegans lies between base 40540 and base 40555
of contig21, for three reasons. First, the farthest upstream base of the subject of the BLASTn
search in Figure 65 aligned to base 40552 of the query, in the region where we would expect the
TSS to be in D. elegans. Second, this result is roughly in line with the result from D.
melanogaster, which placed the TSS at base 40542 (40527 + 15 missing bases). Third, there are
core promoter motifs within 50 bp of either of these bases.
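The two BLASTn-based estimates and the final window can be laid out explicitly. The following sketch simply restates the arithmetic from the text; the function names are hypothetical, and adding the 15 missing bases assumes that they lie upstream of the aligned region on the negative strand (that is, at larger contig coordinates).

```python
def tss_candidates(mel_aligned_base, mel_missing_bases, biarmipes_aligned_base):
    """Return the candidate TSS position suggested by each BLASTn comparison."""
    return {
        "D. melanogaster BLASTn": mel_aligned_base + mel_missing_bases,
        "D. biarmipes BLASTn": biarmipes_aligned_base,
    }


def within(window, positions):
    """True if every position falls inside the inclusive (low, high) window."""
    low, high = window
    return all(low <= p <= high for p in positions)


candidates = tss_candidates(40527, 15, 40552)
estimated_window = (40540, 40555)  # window reported in the text

print(candidates)
print(within(estimated_window, candidates.values()))
```

Both candidates (40542 and 40552) fall inside the reported window and differ from each other by 10 bases.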
Evolution of CG11148 Orthologs:
ClustalW2 can be a powerful tool to evaluate the evolution of a sequence over a variety of
different species. This analysis attempted to characterize the conserved GYF domain in
CG11148. The GYF domain may have a “critical importance in tripeptide ligand binding” [1].
This domain has been found in a number of eukaryotic proteins and “could also be involved in
proline-rich sequence recognition” [1]. Also, note that “there is limited homology within the C-
terminal 20-30 amino acids of various GYF domains, supporting the idea that this part of the
domain is structurally, but not functionally important” [1]. In D. melanogaster the GYF domain
is 49 amino acids long and extends from amino acid 566 to amino acid 614 of CG11148 [2]; the
sequence of this conserved domain is shown below (Figure 66). Based on the ClustalW2
alignment below, we can see that this domain is highly conserved throughout a number of
Drosophila species (Figure 67) as well as a number of other insect and animal species
(Figure 68).
Figure 66. Sequence of the conserved GYF domain in D. melanogaster.
Figure 67. ClustalW2 alignment of eight Drosophila species; this analysis showed varying conservation of the amino acid sequence throughout CG11148. The conserved GYF domain boxed in black showed exceptional similarity between all of the fly species.
Figure 68. ClustalW2 alignment of eight animal and insect species; this analysis showed strong conservation, but the sequences cluster into two groups of species: the insects (Drosophila, mosquitoes, and honey bees) and the other animal species. The conservation of the GYF domain, however, was seen for all species.
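Conservation within a region of a multiple alignment, such as the GYF domain, can be summarized as a per-column identity score. The sketch below does not run ClustalW2; it assumes the sequences have already been aligned to equal length with "-" as the gap character. The function names are hypothetical, and the three short sequences are invented toy data, not real CG11148 orthologs.

```python
from collections import Counter


def column_identity(aligned):
    """Fraction of sequences sharing the most common residue in each column.

    Gaps ("-") never count as the shared residue, so they lower the score.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores


def mean_identity(aligned, start, end):
    """Mean column identity over alignment columns start..end (1-based, inclusive)."""
    window = column_identity(aligned)[start - 1:end]
    return sum(window) / len(window)


# The D. melanogaster GYF domain spans residues 566-614, i.e. 49 amino acids.
print(614 - 566 + 1)

toy_alignment = ["MKTGYFAA", "MRTGYFAV", "M-SGYFAL"]  # invented toy data
print(mean_identity(toy_alignment, 4, 6))  # fully conserved columns
print(mean_identity(toy_alignment, 1, 3))  # partially conserved columns
```

Note that the coordinates passed to `mean_identity` are alignment columns, which differ from residue numbers in any single protein wherever the alignment contains gaps.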
Discussion:
The fourth chromosome of D. elegans investigated here shows a higher density of repeats
than, but a density of genes similar to, that expected in a region of euchromatin on one of the
other Drosophila chromosomes. Although the fourth chromosome is known to be composed mainly of
heterochromatin, it is interesting to note that the genes that are on this chromosome are still
transcribed and translated at normal levels. Evidence of a high level of gene expression was
apparent in both CG11148 and Eph, which had RNA-Seq coverage of 600-800 reads throughout
the genes. Further analyses of the transcription start sites of genes on the fourth chromosome,
like that presented above, may give insights into the mechanism for expression of genes located
in a region of heterochromatin.
Contig21 included both a non-coding RNA gene and a putative pseudogene. A
pseudogene is an especially rare feature for Drosophila and may represent a retrotransposition
event. This may be especially impressive given the heterochromatic nature of the DNA on the
fourth chromosome; in order for a retrotransposition event to occur, the gene would first need to
be expressed, requiring the transcription machinery to be able to access the DNA, and then need
to find a way to reintegrate into DNA that is tightly incorporated into nucleosomes. Future
characterization of the gene from which this pseudogene originated in D. melanogaster, if from a
fourth chromosome gene, could be important in determining which regions of the fourth
chromosome are less rigidly packaged.
Appendix:
GFF, pep, and FASTA files submitted to Wilson Leung.
Acknowledgements:
Thank you to Dr. Elgin, Dr. Shaffer, Wilson Leung, Dr. Bednarski, and Dr. Mardis for
their support throughout this project.
References:
[1] Mitchell, Alex, et al. "The InterPro protein families database: the classification resource after 15 years." Nucleic Acids Research (2014): gku1243. http://nar.oxfordjournals.org/content/early/2014/11/26/nar.gku1243.full.
[2] UniProt Consortium. "UniProt: a hub for protein information." Nucleic Acids Research (2014): gku989. http://nar.oxfordjournals.org/content/early/2014/10/27/nar.gku989.abstract.