
Annotation of Contig21

Sarah Swiezy Dr. Elgin, Dr. Shaffer, Dr. Bednarski Bio434W 27 April 2015 Swiezy 2

Abstract:

Contig21, a 40 kb region of the fourth (“dot”) chromosome of Drosophila elegans (D.

elegans), containing three GENSCAN predictions and two high-quality BLASTX alignments to

annotated Drosophila melanogaster (D. melanogaster) genes, was finished and annotated.

Contig21 was analyzed for the presence of genes, transcription start sites, repeats, pseudogenes,

and non-coding RNAs using NCBI BLAST, FlyBase, and ClustalW2 in addition to the UCSC

Genome Browser, Record Finder, Gene Model Checker, and other programs maintained by

the Genomics Education Partnership (GEP). The final annotation included conservation of two

genes annotated in D. melanogaster, fd102C and CG11148 (Figure 1), as well as several

repetitious elements, a putative pseudogene, and a non-coding RNA. CG11148 contains a GYF

domain, which is highly conserved among Drosophila species and a variety of mammals as well;

this domain is likely involved in recognition of and binding to proline-rich sequences of .

The high level of conservation of CG11148 between D. elegans and other Drosophila species

also allowed for annotation of the transcription start site to within 15 base pairs.

Figure 1. Final annotation map of Contig21. Gene annotations in D. elegans shown in blue; BLASTX results aligning D. elegans to the D. melanogaster genome shown in red; gene predictions based on ab initio gene finders shown in gold (GENSCAN) and green (N-SCAN); repetitious elements shown beneath the heading “Repeating Elements by RepeatMasker.”

Introduction:

Recent innovations in sequencing technologies have made the goal of comparative

genomics a promising new reality. Drosophila melanogaster (D. melanogaster), as one of the

best studied model organisms, along with the other Drosophila species sequenced by the

Drosophila 12 Genomes Consortium, are ideally suited to intensive study by geneticists and

evolutionary biologists. With this end in mind, the Genomics Education Partnership, based at the

Biology Department and The Genome Institute at Washington University in St. Louis, has

helped engage university student researchers across the United States in a project to finish and

annotate portions of these sequenced Drosophila draft genomes to the high quality required for

detailed analyses. The Washington University 2015 Bio4342 students have finished and

annotated sections of the Drosophila elegans (D. elegans) fourth chromosome. This particular

chromosome is unique in Drosophila in that it is composed primarily of heterochromatin, but has

a number of functionally important genes that are transcribed at a high level. The present

analysis will open the door to future questions regarding the role of transposable elements and

repeats in heterochromatin formation, the degree of conservation between species in the

Drosophila genus, and the structure of genes and their regulatory regions in a heterochromatin

domain.

Initial GENSCAN predictions in Contig21:

The initial output from GENSCAN showed three predicted features in Contig21.

Prediction 1 showed two exons, prediction 2 showed three exons, and prediction 3 showed eight

exons (though the last exon of this feature fell outside of the region shown) (Figure 2). Given

that prediction 1 had a small number of exons and that prediction 2 did not have an easily

identifiable ortholog in Drosophila melanogaster (i.e. there was no BLASTX alignment

overlapping with prediction 2), prediction 3 was used to anchor the annotation in this region.

Figure 2. GENSCAN output for Contig21. Black box denotes prediction 1, green box denotes prediction 2, red box denotes prediction 3.

UCSC Genome Browser View of Contig21 Features:

To better understand the homology of this region with the D. melanogaster genome, the

BLASTX track on the UCSC Genome Browser was compared to the GENSCAN predictions

(Figure 3).

Prediction 1: As can be seen from this picture of the region, there is a high quality (red)

BLASTX alignment of prediction 1 to fd102C-PA in D. melanogaster. Prediction 1 is not

supported by a high level of RNA-Seq data or by TopHat splice junctions. (There are two splice

junctions; however, these do not correspond to the GENSCAN-proposed intron/exon boundary in

this prediction.) A second BLASTX match (brown) overlaps with prediction 1; however, this

gene is also inconsistent with the RNA-Seq and TopHat splice junction data in this area.

Prediction 2: Prediction 2 does not show a BLASTX alignment to a gene in D.

melanogaster and has no RNA-Seq data or TopHat splice junction data to support the presence

of this GENSCAN prediction as a gene in Drosophila elegans.

Prediction 3: All eight exons found in prediction 3 show strong similarity to the exons in

the homologous D. melanogaster gene, and the general exon structure of this gene appears to be

well-supported by RNA-Seq and TopHat splice junction data (this will become clearer in later

figures at higher resolution), suggesting their presence in the D. elegans genome. There are four

isoforms for the presumed ortholog of prediction 3 in D. melanogaster (Figure 3).

Figure 3. View of Contig21 via UCSC Genome Browser. Blue box marks prediction 1, green box marks prediction 2, pink box marks prediction 3, and black box marks the two TopHat splice junctions that do not correspond to the intron/exon boundary of GENSCAN prediction 1.

Annotation of Prediction 3:

In order to establish an ortholog in D. melanogaster, a BLASTp search was carried out using

the amino acid sequence of D. elegans prediction 3 as the query and the annotated D.

melanogaster protein database maintained by FlyBase as the subject. The top four matches

aligned over the entire length of the F, D, H, and G isoforms of CG11148 with high similarity;

all four of these alignments had an E-value of zero, with the next best alignments having E-

values >20 (Figure 4). Thus, prediction 3 in D. elegans is an ortholog of CG11148 in D.

melanogaster.

Figure 4. BLASTp alignment of prediction 3, establishing orthology to all four isoforms of D. melanogaster CG11148. Query: amino acid sequence of prediction 3 in D. elegans; subject: annotated D. melanogaster proteins database.

Annotation of Prediction 3, Exon 4:

To begin, the D. melanogaster gene CG11148 was entered into the Gene Record Finder

maintained by the Genomics Education Partnership, using GFF3 files from FlyBase. This

database provided a list of coding exons previously annotated for this gene in D. melanogaster

with their corresponding nucleotide and amino acid sequences, the strand on which each is

transcribed, and to which isoform each belongs (Figure 5). The longest exon (found in each of

the isoforms), exon 4, had a size of 872 amino acids; this sequence was used as the subject for a

BLASTX search (using the nucleotide sequence of Contig21 as query), which showed that this

exon is transcribed in frame -3 (Figure 6). Using the UCSC browser, the region at the 5’ end of

exon 4 was expanded to view the nucleotide sequence and all AG (canonical 3’ splice acceptor

sequence) and GT (canonical 5’ splice donor sequence) pairs were highlighted (Figure 7).
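The scan for candidate AG and GT pairs described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the browser: the function names, the toy sequence, and the window coordinate are hypothetical. It assumes the sequence is given on the forward strand with 1-based coordinates and reports each reverse-strand site as a (higher, lower) coordinate pair, matching the way reverse-strand sites are written in this report.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def minus_strand_sites(plus_seq: str, window_start: int, motif: str) -> List[Tuple[int, int]]:
    """Find every occurrence of `motif` (e.g. "AG" or "GT") as read on the
    reverse strand, and report it in forward-strand coordinates.

    `window_start` is the 1-based coordinate of plus_seq[0] (hypothetical
    parameter name). Each hit is returned as (higher, lower) coordinates.
    """
    rc = reverse_complement(plus_seq)
    n = len(rc)
    sites = []
    i = rc.find(motif)
    while i != -1:
        # Index i on the reverse complement maps to index n-1-i on the forward strand.
        high = window_start + (n - 1 - i)
        sites.append((high, high - 1))
        i = rc.find(motif, i + 1)
    return sites


# Toy 5-base window starting at (hypothetical) coordinate 100.
print(minus_strand_sites("ACCTG", 100, "AG"))  # [(103, 102)]
print(minus_strand_sites("ACCTG", 100, "GT"))  # [(101, 100)]
```

Each candidate pair still has to be weighed against the RNA-Seq, TopHat, and gene-predictor evidence, as is done in the text.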

Figure 5. Output from Gene Record Finder for D. melanogaster gene CG11148. Top panel: table of coding exons present in each isoform; bottom panel: table of coding exons showing coordinates, strand, phase, and size in amino acids. Exon 2 and exon 3 are alternative second exons; therefore, throughout, exons will be referred to as exons 1 to 8 (in both isoforms), corresponding to their numerical order in the gene prediction, rather than by their FlyBase ID.

Figure 6. BLASTX alignment placing exon 4 in frame -3; alignment is truncated after subject amino acid 115. Subject: nucleotide sequence of Contig21; query: amino acid sequence of exon 4 from Gene Record Finder.

Figure 7. Expanded view of 5’ end of exon 4. Blue box denotes splice acceptor sequence; black box shows that this is a phase 2 acceptor in frame -3. (Note that CG11148 is transcribed on the reverse strand, and therefore, the 5’ end is on the right of this figure.)

Looking in frame -3, the 3’ AG splice acceptor sequence corresponds to bases 35,373-35,372.

The first base of the exon is immediately downstream of this pair, at base 35,371, a start

that is supported by both N-SCAN and GENSCAN gene predictors, TopHat splice junction sites,

the start of deep RNA-Seq coverage, and the predicted amino acid sequence. This is a phase 2

splice acceptor (Figure 7). The first base of this exon is not supported by the BLASTX alignment

between the translated nucleotide sequence of Contig21 and the amino acid sequence of D.

melanogaster; however, BLASTX (which is looking for conservation) often extends sequence

alignments beyond the exon boundaries, and therefore, this result should not be used as evidence

against starting exon 4 at base 35,371.

As further support for the first base of this exon, the phase of the splice donor of this

intron was evaluated. First, a BLASTX alignment showed that exon 3 is also transcribed in

frame -3 (Figure 8).

Figure 8. BLASTX alignment, showing exon 3 is transcribed in frame -3. Query: amino acid sequence of D. melanogaster exon 3 from Gene Record Finder; subject: nucleotide sequence of Contig21; note: computational adjustment turned off in this search.

The GT pair from base 38,833-38,832 corresponds to the 3’ end of the N-SCAN and

GENSCAN predictions and the end of the RNA-Seq coverage, and therefore, was chosen as the

splice donor site. BLASTX likely overextended the alignment. This is a phase 1 donor, which,

when added to the phase 2 acceptor, equals 3 (or, a whole codon) (Figure 9).
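The phase arithmetic used here can be written as a one-line check. The sketch below is illustrative only; the function name is hypothetical, and it assumes the convention used in this report, in which the phase is the number of bases left over between the splice site and the nearest complete codon.

```python
def phases_compatible(donor_phase: int, acceptor_phase: int) -> bool:
    """Return True when a splice donor and the following splice acceptor
    preserve the reading frame, i.e. their leftover bases sum to 0 or to
    one whole codon (3)."""
    for phase in (donor_phase, acceptor_phase):
        if phase not in (0, 1, 2):
            raise ValueError("phase must be 0, 1, or 2")
    return (donor_phase + acceptor_phase) % 3 == 0


# Exon 3 donor (phase 1) joined to exon 4 acceptor (phase 2): 1 + 2 = 3.
print(phases_compatible(1, 2))  # True
# A phase 1 donor joined to a phase 1 acceptor would shift the frame.
print(phases_compatible(1, 1))  # False
```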

Figure 9. Expanded view of the 3’ end of exon 3. Blue box denotes splice donor sequence; black box shows that this is a phase 1 donor in frame -3.

Next, the 3’ end of exon 4 was scanned for GT splice donors. The GT pair from base

32,716-32,715 corresponds to the 3’ end of the N-SCAN and GENSCAN predictions and the end

of the RNA-Seq coverage, and therefore, was chosen as the splice donor site. This is a phase 1

donor in frame -3. The last base of exon 4 is therefore base 32,717 (Figure 10).

To further support position 32,717 as the last base of exon 4, the phase of the splice

acceptor of this intron was evaluated. A BLASTX search showed that exon 5 is also transcribed

in frame -3 (Figure 11). The 5’ end of exon 5 was expanded and scanned for AG splice acceptors.

The AG pair from base 30,759-30,758 corresponds to the end of the N-SCAN and GENSCAN

predictions and a sharp increase in RNA-Seq coverage; therefore, this was chosen as the splice

acceptor site, which is a phase 2 acceptor in frame -3. This acceptor, when combined with the

phase 1 splice donor of exon 4, adds to 3 nucleotides, lending further support to base 32,717 as

the last base of exon 4 (Figure 12). The annotation of exon 4 is summarized in Table 1 below.
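Because CG11148 lies on the reverse strand, coordinates decrease along the gene, which makes the off-by-one bookkeeping easy to get wrong. The sketch below, with hypothetical function names, shows how the exon boundaries quoted above follow from the splice-site coordinates. It assumes each site is written as (higher, lower) coordinates, as in the text.

```python
from typing import Tuple


def _check_pair(site: Tuple[int, int]) -> None:
    """A dinucleotide on the reverse strand spans two adjacent bases."""
    high, low = site
    if high - low != 1:
        raise ValueError("expected adjacent coordinates written as (higher, lower)")


def exon_first_base(acceptor_ag: Tuple[int, int]) -> int:
    """First exon base lies immediately downstream of the AG (reverse strand)."""
    _check_pair(acceptor_ag)
    return acceptor_ag[1] - 1


def exon_last_base(donor_gt: Tuple[int, int]) -> int:
    """Last exon base lies immediately upstream of the GT (reverse strand)."""
    _check_pair(donor_gt)
    return donor_gt[0] + 1


def exon_length(first_base: int, last_base: int) -> int:
    """Length of a reverse-strand exon, where first_base > last_base."""
    return first_base - last_base + 1


print(exon_last_base((32716, 32715)))   # 32717, last base of exon 4
print(exon_first_base((30759, 30758)))  # 30757, first base of exon 5
print(exon_length(35371, 32717))        # 2655, length of exon 4 in bp
```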

Figure 10. Expanded view of end of exon 4. Blue box denotes splice donor sequence; black box shows that this is a phase 1 donor in frame -3.

Figure 11. BLASTX alignment, showing exon 5 is transcribed in frame -3. Query: amino acid sequence of D. melanogaster exon 5 from Gene Record Finder; subject: nucleotide sequence of Contig21.

Figure 12. Expanded view of 5’ end of exon 5. Blue box denotes splice acceptor sequence; black box shows that this is a phase 2 acceptor in frame -3.

Exon   First Base   Last Base   Length (in base pairs)   Phase of 3’ Splice Acceptor   Phase of 5’ Splice Donor   Frame
3      --           --          --                       --                            Phase 1                    -3
4      35,371       32,717      2655                     Phase 2                       Phase 1                    -3
5      --           --          --                       Phase 2                       --                         -3

Table 1. Summary of exon 4 annotation.

Annotation of Prediction 3 Isoforms, Alternative Second Exons:

A procedure similar to that used above was used to establish the frame (exon-by-exon

BLASTX) (Table 2), first base, and last base of exon 2-DF (PD/PF isoform) and exon 2-GH

(PG/PH isoform), as well as the GT/GC splice donor sites and the AG splice acceptor sites.

Exon 2-GH:

The end of exon 1 has a GT splice donor that is phase 0 in frame -1, which is compatible

with the phase 0 AG splice acceptor at the beginning of exon 2-GH in frame -2. The end of exon

2-GH has a non-canonical GC splice donor that is phase 0 in frame -2, which is compatible with

the phase 0 AG splice acceptor at the beginning of exon 3 in frame -3 (Figure 13a). The

non-canonical splice donor is not conserved in the other Drosophila species examined (Figure 13b).

Figure 13a. Top right panel: end of exon 1; top left panel: beginning of exon 2-GH; bottom right panel: end of exon 2-GH; bottom left panel: beginning of exon 3. In all panels, blue box shows phase 0 donor or acceptor. Arrows denote reading frame for each panel.

Figure 13b. Blue box shows sequence of the splice donor of exon 2-GH; this sequence is non-canonical, GC, in D. elegans, but canonical, GT, in the six other Drosophila species shown.

Exon 2-DF:

As stated in the “Exon 2-GH” section above, the 3’ end of exon 1 has a GT splice donor

that is phase 0 in frame -1; this is compatible with the phase 0 AG splice acceptor at the

beginning of exon 2-DF in frame -3. Thus, the annotation of the 3’ end of exon 1 is the same

between all isoforms. There is TopHat splice junction data (JUNC00000711) to support the

truncation of exon 2-GH by 21 base pairs, which creates exon 2-DF in the PD and PF isoforms;

however, JUNC00000711 is only seen in the Mixed Embryo track and is only supported by four

reads (Figure 14), whereas JUNC00000712 is supported by 121 reads, JUNC00000209 by 43

reads, and JUNC00000292 by 37 reads. Nevertheless, given the high level of conservation to the

ortholog of this gene in D. melanogaster, exon 2-DF was annotated separately from exon 2-GH

in order to maintain all four isoforms found in D. melanogaster in D. elegans (Figure 15).

The annotation of the end of exon 2-DF was the same as that of the end of exon 2-GH,

and thus, the annotation of the beginning of exon 3 was also the same for all four isoforms

(Figure 13).

Figure 14. TopHat splice junction information for JUNC00000711. Top panel: black arrows represent the position of JUNC00000711; bottom panel: red box indicates the depth of coverage for this junction, which is only four reads.

Figure 15. Right panel: end of exon 1, phase 0 splice donor denoted by blue box; left panel: beginning of exon 2-DF, phase 0 splice acceptor denoted by blue box. End of exon 2-DF and beginning of exon 3 have identical annotation to exon 2-GH (Figure 13).

The difference between exon 2-GH and exon 2-DF is the only difference between the

PD/PF and PG/PH forms. The PD/PF isoforms are identical in their protein sequences, but differ

in the annotation of their 5’ UTRs, which was not a part of the present analysis. Similarly, the

PG/PH isoforms are identical in their protein sequences, but differ in the annotation of both their

3’ and 5’ UTRs (Figure 16).


Figure 16. Top panel: 3’ UTR has an extra intron in isoform PH relative to the other three isoforms. Middle panel: the G, F, and D isoforms share the same 3’ UTR. Bottom panel: the H and D isoforms share the same 5’ UTR while the F and G isoforms share a different 5’ UTR.

Annotation of Prediction 3, Exon 8:

A procedure similar to that used above was used to establish the frame (Table 2), first and last

base, and the splice donor and splice acceptor sites for exon 8. The exon-by-exon BLASTX

result for this exon revealed that the query sequence (translated nucleotide sequence of

Contig21) was missing 11 amino acids from the 5’ end relative to the subject (D. melanogaster

exon 8 from Gene Record Finder) (Figure 17). At the 5’ end of the alignment to exon 8, there is

high RNA-Seq coverage, N-SCAN and GENSCAN gene model predictions, and TopHat splice

junctions, as well as a consistent splice acceptor phase that supports base 24,878 as the first base

of exon 8 (Figure 18). In addition, using base 24,878 as the starting base would restore the

“missing” 11 amino acids in the BLASTX alignment.

Figure 17. Exon-by-exon BLASTX alignment, showing 11 missing amino acids from the 5’ end of exon 8. Query: amino acid sequence of exon 8 from Gene Record Finder; subject: nucleotide sequence of Contig21.

Figure 18. Expanded view of beginning of exon 8. Blue box denotes splice acceptor sequence; black box shows that this is a phase 0 acceptor in frame -3.

The start and stop codons chosen for CG11148 are 39,739-39,737 and 24,755-24,753

respectively (Figures 19 and 20).

Figure 19. Start codon for CG11148.

Figure 20. Stop codon for CG11148.

Annotation of Prediction 3, Remaining Exons:

Annotation of the exons not discussed above (exons 1, 5, 6, and 7) was straightforward;

in general these exons had deep RNA-Seq coverage and TopHat splice junctions that were

consistent with both N-SCAN and GENSCAN predictions. The final annotations of all exons and

isoforms are summarized in Tables 2, 3, and 4.

Exon    1    2-DF   2-GH   3    4    5    6    7    8
Frame   -1   -2     -3     -3   -3   -3   -2   -3   -3

Table 2. Frame of each coding exon established by exon-by-exon BLASTX search against the nt/nr database maintained by NCBI.

Annotation Summary:

Exon         Coordinates     Length (bp)   Phase of 3’ Splice Acceptor   Phase of 5’ Splice Donor   Frame
1            39,739-39,407   333           --                            Phase 0                    -1
2-GH         39,348-39,175   174           Phase 0                       Phase 0                    -2
3            39,113-38,834   280           Phase 0                       Phase 1                    -3
4            35,371-32,717   2655          Phase 2                       Phase 1                    -3
5            30,757-29,789   969           Phase 2                       Phase 1                    -3
6            29,719-29,645   75            Phase 2                       Phase 1                    -3
7            25,085-24,931   155           Phase 2                       Phase 0                    -2
8            24,878-24,756   123           Phase 0                       --                         -3
Stop Codon   24,755-24,753   3             --                            --                         -3

Table 3. Final annotation summary for PG/PH isoforms.

Exon         Coordinates     Length (bp)   Phase of 3’ Splice Acceptor   Phase of 5’ Splice Donor   Frame
1            39,739-39,407   333           --                            Phase 0                    -1
2-DF         39,348-39,175   174           Phase 0                       Phase 0                    -2
3            39,113-38,834   280           Phase 0                       Phase 1                    -3
4            35,371-32,717   2655          Phase 2                       Phase 1                    -3
5            30,757-29,789   969           Phase 2                       Phase 1                    -3
6            29,719-29,645   75            Phase 2                       Phase 1                    -3
7            25,085-24,931   155           Phase 2                       Phase 0                    -2
8            24,878-24,756   123           Phase 0                       --                         -3
Stop Codon   24,755-24,753   3             --                            --                         -3

Table 4. Final annotation summary for PD/PF isoforms.
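The internal consistency of an annotation table like Table 3 can be checked mechanically before submitting it to the Gene Model Checker. The sketch below is an informal sanity check, not part of the GEP toolchain; the function name and tuple layout are hypothetical. It assumes reverse-strand coordinates (first base greater than last base) and the phase convention used in this report.

```python
# (exon, first base, last base, acceptor phase, donor phase), from Table 3 (PG/PH isoforms).
PG_PH_EXONS = [
    ("1",    39739, 39407, None, 0),
    ("2-GH", 39348, 39175, 0,    0),
    ("3",    39113, 38834, 0,    1),
    ("4",    35371, 32717, 2,    1),
    ("5",    30757, 29789, 2,    1),
    ("6",    29719, 29645, 2,    1),
    ("7",    25085, 24931, 2,    0),
    ("8",    24878, 24756, 0,    None),
]


def check_gene_model(exons):
    """Return (total coding length, list of problems) for a reverse-strand model."""
    problems = []
    total = 0
    for i, (name, first, last, acc, don) in enumerate(exons):
        length = first - last + 1
        total += length
        # Bases left after removing the partial codon at the start of the exon
        # determine the donor phase implied by the coordinates.
        implied_donor = (length - (acc or 0)) % 3
        if don is not None and implied_donor != don:
            problems.append(f"exon {name}: donor phase {don}, coordinates imply {implied_donor}")
        if i + 1 < len(exons):
            next_acc = exons[i + 1][3]
            if (don + next_acc) % 3 != 0:
                problems.append(f"exon {name}: donor phase incompatible with next acceptor")
    if total % 3 != 0:
        problems.append("total coding length is not a multiple of 3")
    return total, problems


total, problems = check_gene_model(PG_PH_EXONS)
print(total, total // 3, problems)  # 4764 1588 []
```

The total of 4,764 coding bases excludes the stop codon and corresponds to 1,588 codons.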

Gene Model Checker Results:

Both models were confirmed, showing high sequence similarity to the D. melanogaster

models (Figure 21).

Figure 21. Gene Model Checker dot plots showing conservation with D. melanogaster gene models for PG/PH annotation (left) and PD/PF annotation (right).

Annotation of Prediction 1:

Because there were two overlapping BLASTX alignments via the UCSC Genome

Browser, it was first necessary to choose which of these was the more likely candidate for a gene

in D. elegans (Figure 22). Looking at the high-scoring segment pairs (HSPs) for each BLASTX

alignment showed higher percent identities and lower (more significant) E-values for the fd102C-PA alignment

over the fd19B-PA alignment (Figure 23). Alternatively, this can also be seen by the colors of

the BLASTX alignments, red (fd102C-PA) being a better quality alignment than brown (fd19B-

PA). Lastly, fd102C is found on the fourth chromosome in D. melanogaster whereas fd19B is

found on the X chromosome in D. melanogaster.

Figure 22. UCSC Genome Browser shows overlapping BLASTX results in the region of prediction 1.

Figure 23. HSP summaries for BLASTX alignments shown in Figure 22. Black boxes denote the E-values for each HSP; red boxes denote the percent identities for each HSP. The HSPs for fd102C-PA have lower (more significant) E-values and higher percent identities than HSPs for fd19B-PA; therefore, fd102C-PA is the better alignment to the D. elegans sequence.

A BLASTp search, using as query, the amino acid sequence of D. elegans, prediction 1,

and as subject, the annotated D. melanogaster proteins database maintained by FlyBase, was

used to corroborate this putative ortholog in D. melanogaster. This search revealed a number of

protein matches with significant E-values, but the score (=696.041) for fd102C-PA was much higher

than the scores for the other matches (≤116.316). In addition, the visual representation of these

matches showed that only the alignment to fd102C-PA was consistently high quality over the

entire length of the query sequence (Figure 24). The significant E-values of the other matches are

likely due to the presence of the highly conserved forkhead domain, found in the matches

beginning with “fd.”

Figure 24. BLASTp alignment using as query, the amino acid sequence of D. elegans prediction 1, and as subject, the annotated D. melanogaster proteins database. Top panel: best alignment to fd102C-PA; bottom panel: conservation to the highly conserved forkhead domain is likely responsible for the significant E-values given to the other alignments.

Both GENSCAN and N-SCAN predicted two coding exons for this gene. However, according to

Gene Record Finder, fd102C-PA has only one translated exon; the second exon contains the first

portion of the 5’ UTR in D. melanogaster (Figure 25). Additional support for only one coding

exon comes from the fact that there are no stop codons in the reading frame of this gene in

D. elegans (Figure 26).
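The absence of in-frame stop codons, judged visually in the browser, can also be checked programmatically. The sketch below is illustrative only: the function names and the toy sequence are hypothetical. It assumes the input is forward-strand sequence and that reverse frame -1 starts at the last base of the supplied sequence, -2 one base in, and -3 two bases in.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def stops_in_reverse_frame(plus_seq: str, frame: int) -> list:
    """Return the offsets (0-based, on the reverse complement) of every stop
    codon in reverse frame `frame`, given as 1, 2, or 3 for -1, -2, or -3."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    rc = reverse_complement(plus_seq)
    return [i for i in range(frame - 1, len(rc) - 2, 3) if rc[i:i + 3] in STOP_CODONS]


# Toy forward-strand sequence whose reverse complement is ATG ATG TAA.
toy = "TTACATCAT"
print(stops_in_reverse_frame(toy, 1))  # [6]  only the terminal stop codon
print(stops_in_reverse_frame(toy, 2))  # [1]  an internal TGA
print(stops_in_reverse_frame(toy, 3))  # []
```

For a single-exon coding region, the expected result in the correct frame is an empty list once the terminal stop codon is excluded from the input.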

Figure 25. Left panel: fd102C-PA has one translated (coding) exon in D. melanogaster; right panel: fd102C-PA has two transcribed exons in D. melanogaster.

Figure 26. UCSC Genome Browser view of nucleotide sequence of fd102C-PA region in D. elegans. Transcription of this feature is in frame -1 (red arrow); no red stop codons are seen within this frame, extending throughout the length of the prediction.

Interestingly, the FlyBase Genome Browser showed a 700 base pseudogene annotated

within the intron separating the two pieces of the 5’ UTR of fd102C-PA in D. melanogaster

(Figure 27). A BLASTn search was then used to discover whether this pseudogene was also

conserved in the D. elegans sequence. The alignment shown below shared 71% identity to the D.

melanogaster sequence, but extended over only 54% of the D. melanogaster sequence (Figure

28). However, this search does not rule out the possibility that a pseudogene (or the remnant of a

pseudogene) is present in the D. elegans genome. Because pseudogenes are not translated into

functional proteins, there is a lack of negative selection pressure to maintain their sequence;

hence, evolution within these features occurs rapidly. Given that the common ancestor for D.

melanogaster and D. elegans lived about 10 million years ago, we would expect the sequence of

a pseudogene (if present in the ancestor) to differ significantly between these two species. Based

on the BLASTn alignment, we can conclude that a putative ortholog to CR33797 extends from

base 6,005 to base 6,705 (Figure 28).
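The two summary statistics quoted for this alignment, percent identity and the fraction of the query that is aligned, measure different things. The sketch below shows how each is conventionally computed from a gapped pairwise alignment. The function name and the toy alignment are hypothetical and are not taken from the CR33797 search.

```python
def identity_and_coverage(aligned_query: str, aligned_subject: str, query_length: int):
    """Return (percent identity, percent query coverage) for one gapped alignment.

    Identity is matches divided by alignment length (gap columns included);
    coverage is the number of query bases in the alignment divided by the
    full length of the query sequence.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    columns = len(aligned_query)
    matches = sum(1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-")
    query_bases = sum(1 for q in aligned_query if q != "-")
    return 100.0 * matches / columns, 100.0 * query_bases / query_length


# Toy alignment: 10 columns, 8 matches, 9 query bases out of an 18-base query.
identity, coverage = identity_and_coverage("ACGT-ACGTA", "ACGTTACCTA", 18)
print(identity, coverage)  # 80.0 50.0
```

A high identity over a short aligned span therefore does not by itself show that the whole feature is conserved, which is why both values are reported.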

Figure 27. FlyBase Genome Browser showing single coding exon (orange) and two additional transcribed exons (gray). Pseudogene CR33797-RB appears within the intron separating the pieces of the 5’ UTR in the D. melanogaster assembly (red boxes).

Figure 28. Results of BLASTn search using query, nucleotide sequence of CR33797-RB in D. melanogaster and subject, nucleotide sequence of contig21. The BLAST parameters were optimized for “somewhat similar sequences” and the filtering for “low complexity regions” was turned off. Top left panel: dot plot shows that the identity between these sequences is low and that a little more than half of the pseudogene sequence (on the x axis) is found in the sequence of contig21 (on the y axis); top right panel: the top alignment shows that the first 502 nucleotides and the last 104 nucleotides of the D. melanogaster pseudogene are not conserved in contig21; bottom panel: 71% identity exists between the sequences of CR33797-RB and contig21, and 54% of CR33797-RB is present in contig21.

RNA-Seq coverage in Mixed Embryos is high throughout prediction 1. In addition,

there is only a single small repeat within the region of this feature; this repeat lines up exactly

with and is likely responsible for the novel intron proposed by both GENSCAN and N-SCAN,

given that these gene predictors both run using the masked genomic sequence (Figure 29). Both

high RNA-Seq coverage and the lack of other repeats, combined with the BLASTX homology to

the D. melanogaster sequence, corroborate the presence of a gene in this region. A BLASTX

search revealed that the single exon is in Frame -1 and begins at the methionine at base 3,166

and ends at base 1,304 immediately before the stop codon from base 1,303 to base 1,301 (Figure

30a). While this is the first coding exon, it is the second transcribed exon. The canonical AG

splice acceptor site is found upstream of the methionine, where the RNA-Seq also ends. The first

base of this exon that is transcribed (part of the 5’ UTR) is at base 3,232 (Figure 30b).

Figure 29. UCSC Genome Browser view of fd102C-PA; black box shows repeat falling within the novel intron proposed by the GENSCAN and N-SCAN gene models. Red arrow shows the high level of RNA-Seq coverage throughout the region of this feature.

Figure 30a. UCSC Genome Browser View of 3’ end of fd102C-PA, stop codon boxed in blue.

Figure 30b. UCSC Genome Browser View of 5’ end of fd102C-PA, beginning at methionine boxed in blue.

Annotation Summary:

Exon         Coordinates   Length (bp)   Phase of 3’ Splice Acceptor   Phase of 5’ Splice Donor   Frame
1            3,166-1,304   1863          --                            --                         -1
Stop Codon   1,303-1,301   3             --                            --                         -1

Table 5. Final annotation summary for fd102C-PA.

Gene Model Checker Results:

Gene Model Checker confirmed the model proposed above for the

ortholog of fd102C-PA in D. elegans. The dot plot generated shows homology between the D.

melanogaster and D. elegans annotations throughout, albeit with several gaps (Figure 31).

Figure 31. Gene Model Checker output for fd102C-PA annotation. Left panel: dot plot showing global sequence homology between D. melanogaster gene (x axis) versus D. elegans annotation (y axis); right panel: sequence similarity between D. melanogaster and D. elegans annotations displayed at amino acid resolution.

Annotation of Prediction 2:

This prediction was unique in that only the SNAP and GENSCAN predictions suggested that

there may be a feature present at this position. There was no BLASTX alignment between the D.

elegans and D. melanogaster sequences, a low level of RNA-Seq data that does not match up

with specific exons (more likely to be noise), and a large number of repeats present throughout

the region. There was one TopHat junction that appeared to support the middle exon, and there

was some conservation of this region across Drosophila species; however, the conservation

evidence did not support the middle exon (Figure 32).

Figure 32. UCSC Genome Browser view of Prediction 2. Blue box indicates the TopHat junction that supports the middle exon; the pink boxes indicate conservation among other Drosophila species. Note that the TopHat junction and conservation track support different exons in this feature.

Next, a BLASTp search was completed, using as query, the predicted protein sequence of

this feature, and as subject, the annotated D. melanogaster protein database maintained by

FlyBase. Using the default “expect value” of 10 returned no significant hits; changing the expect

value to 100 returned a number of very poor alignments with poor scores and poor E-values

(Figure 33). However, a BLASTn search, using as query, the nucleotide sequence of this feature,

and as subject, the D. melanogaster nucleotide sequence database maintained by FlyBase,

returned an alignment to CR45123-RA (Figure 34). Checking the synteny on the FlyBase genome

browser confirmed the presence of this non-coding RNA in D. melanogaster between CG11148

and fd102C-PA (Figure 35).

Figure 33. BLASTp search using query, amino acid sequence of prediction 2, and subject, annotated D. melanogaster protein database. Top panel: summary of alignments showing poor E-values for all alignments; middle panel: visual representation of summary table in top panel; bottom panel: side-by-side amino acid alignment for best BLAST hit.

Figure 34. BLASTn search using query, nucleotide sequence of prediction 2, and subject, the D. melanogaster nucleotide sequence database. Top panel: summary of alignments, red box shows E-value for top hit is almost 43 orders of magnitude smaller than that of the second best hit; middle panel: visual representation of summary table in top panel; bottom panel: side-by-side nucleotide alignment for best BLAST hit.

Figure 35. Genome browser view of the D. melanogaster homologous region to contig21. Red box denotes non- coding RNA prediction.

A BLASTn search confirms the presence of CR45123 in contig21 (Figure 36).

If this non-coding RNA is transcribed, it may be the source of the

RNA-Seq and TopHat junctions in the region of this feature and may also be the reason that

GENSCAN originally made a prediction in this region.

Figure 36. Top panel: BLASTn search using query, nucleotide sequence of CR45123 in D. melanogaster, and subject, nucleotide sequence of contig21; bottom panel: dot plot showing homology throughout the sequences (query on x axis, subject on y axis).

After synthesizing all of the above information, we can conclude that GENSCAN

prediction 2 is a consequence of the presence of an ortholog to CR45123 in D. elegans. Given

the above BLAST alignment, the CR45123 ortholog in D. elegans extends from base 18,971 to

base 19,465.

Annotation of Mary’s Gene:

In addition to the genes annotated in contig21, I was also given one gene from contig30

to annotate (Figure 37); for an analysis of contig30 outside of this feature, see Mary

Richardson’s annotation report.

There are two alignments to D. melanogaster genes in the region of this prediction;

however, as discussed previously, the color of the BLASTX alignment indicates its quality, with

the red being a better alignment than the brown. Using a BLASTp search, I was able to confirm

that Eph is the correct D. melanogaster ortholog to this GENSCAN prediction (Figure 38).

Figure 37. GENSCAN prediction contig30.1 (prediction 1).

Figure 38. BLASTp search using query, amino acid sequence of prediction 1, and subject, annotated D. melanogaster proteins database maintained by FlyBase. Top panel: summary of BLAST alignments (not all hits shown); bottom panel: visual representation of best BLAST alignments (not all hits shown).

Gene Record Finder revealed that the annotation of Eph in D. melanogaster has six

unique isoforms (with two isoforms sharing the same coding sequence) and 18 unique coding

exons (Figure 39).

Figure 39. Gene Record Finder information about Eph, showing six unique isoforms and 18 unique coding exons.

Annotation of Isoforms PA and PE:

The first exon of isoform PA has no corresponding GENSCAN prediction in contig30;

however, looking at the upstream project shows a canonical splice acceptor site, RNA-Seq data

and TopHat junctions, as well as a methionine to support the beginning of this exon. The next

nine exons of the PA and PE isoforms showed routine annotations with exon boundaries

supported by RNA-Seq data, TopHat junctions and in-phase splice donor and acceptor pairs. The

eleventh exon (15_2213_0) showed two different alignments via the BLASTX track. Base

12,711, corresponding to the BLASTX alignment to the PD isoform, was chosen as the final base

of exon 11 for isoforms PC, PE, PA, PD, and PB in agreement with the RNA-Seq and TopHat

junction data.

Figure 40. Black line shows that the BLASTX alignment to the D isoform corresponds to the RNA-Seq and TopHat junction data; base 12,711 was chosen as the last base of exon 11 in the PC, PE, PA, PD, and PB isoforms.

Annotation Summary:

Exon        Coordinates   Length (bp)  Phase of 3’ Splice Acceptor  Phase of 5’ Splice Donor  Frame
1           23-275        253          --                           1                         2
2           346-425       80           2                            0                         3
3           2123-2303     181          0                            1                         2
4           3247-3732     486          2                            1                         3
5           5342-6229     888          2                            1                         1
6           6291-6467     177          2                            1                         2
7           11201-11329   129          2                            1                         1
8           11400-11495   96           2                            1                         2
9           11566-11936   371          2                            0                         3
10          12169-12315   147          0                            0                         1
11          12382-12711   330          0                            0                         1
12          12770-12871   102          0                            --                        2
Stop Codon  12872-12874   3            --                           --                        2

Table 6. Annotation summary for Eph-PA and Eph-PE.
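The internal consistency of an annotation summary like Table 6 can be checked from the coordinates alone. The following is a minimal Python sketch, not part of the GEP tools; it assumes that the acceptor phase is the number of bases at the start of an exon that complete the previous codon, that the donor phase is the number of bases of an incomplete codon left at the end of an exon, and that the frame (1-3) is that of the first complete codon in the exon. The function names are illustrative.

```python
# Rows transcribed from Table 6:
# (exon, start, end, reported length, acceptor phase, donor phase, reported frame)
TABLE_6 = [
    (1, 23, 275, 253, None, 1, 2),
    (2, 346, 425, 80, 2, 0, 3),
    (3, 2123, 2303, 181, 0, 1, 2),
    (4, 3247, 3732, 486, 2, 1, 3),
    (5, 5342, 6229, 888, 2, 1, 1),
    (6, 6291, 6467, 177, 2, 1, 2),
    (7, 11201, 11329, 129, 2, 1, 1),
    (8, 11400, 11495, 96, 2, 1, 2),
    (9, 11566, 11936, 371, 2, 0, 3),
    (10, 12169, 12315, 147, 0, 0, 1),
    (11, 12382, 12711, 330, 0, 0, 1),
    (12, 12770, 12871, 102, 0, None, 2),
]


def exon_length(start, end):
    """Length of an exon given inclusive 1-based coordinates."""
    return end - start + 1


def first_codon_frame(start, acceptor_phase):
    """Frame (1-3) of the first complete codon in the exon (assumed convention)."""
    offset = acceptor_phase or 0  # the first exon has no acceptor
    return (start + offset - 1) % 3 + 1


def check_table(rows):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    for i, (exon, start, end, length, acc, don, frame) in enumerate(rows):
        if exon_length(start, end) != length:
            problems.append(f"exon {exon}: length {length} != {exon_length(start, end)}")
        if first_codon_frame(start, acc) != frame:
            problems.append(f"exon {exon}: frame {frame} != {first_codon_frame(start, acc)}")
        if i + 1 < len(rows):
            next_acc = rows[i + 1][4]
            # Donor and following acceptor must together complete one codon (or none).
            if (don + next_acc) % 3 != 0:
                problems.append(f"exon {exon}: donor/acceptor phases do not pair")
    return problems


coding_length = sum(exon_length(s, e) for _, s, e, *_ in TABLE_6)
print(check_table(TABLE_6), coding_length, coding_length % 3)
```

Under these assumptions the rows of Table 6 are self-consistent, and the total coding length (excluding the stop codon) is a multiple of three.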

Gene Model Checker Results:

The gene model summarized in Table 6 was confirmed for both isoform PA and PE

(Figure 41). However, in both dot plots generated from the Gene Model Checker, the annotation

seems to be lacking support for the first coding exon. A side-by-side alignment of the amino acid

sequence of the first exon shows that there is significant homology between the D. melanogaster

and D. elegans sequences in this region, although it is not as highly conserved as the second,

third, or fourth exon (Figure 42).

Figure 41. Dot plot showing homology between Eph-PA (left) and Eph-PE (right) in D. melanogaster to contig30 ortholog in D. elegans. Note the appearance of a lack of homology in the first exon on both dot plots (black brackets).

Figure 42. Side-by-side amino acid alignment of Eph-PE (D. melanogaster), and Submitted_Seq (D. elegans ortholog); note significant similarity in the first exon (black box).

Annotation of Isoform PB:

According to the model annotated in D. melanogaster, the PB isoform should have 13 exons,

but the BLASTX alignment and the gene predictors on the UCSC Genome Browser show only 12.

Gene Record Finder reveals that the first exon should encode only three amino acids and should

lie within the first exon of the PA and PE isoforms, beginning downstream of its start

(Figure 43). This leaves only two viable positions for the beginning of the exon where a

methionine is present to begin translation and which will not truncate the second exon of this

isoform (Figure 44); however, neither of these conserves the MTK sequence and only one has a

suitable splice donor sequence (Figure 45). Taken together, these results suggest that we cannot

make a model in D. elegans that is orthologous to the PB isoform in D. melanogaster.

Figure 43. Top table: first three exons of Eph-PB in D. melanogaster; bottom table: first three exons of Eph- PE in D. melanogaster; inset: amino acid sequence of exon 1 of Eph-PB.

Figure 44. Start codons that are possible for isoform PB boxed in blue; start codons that are impossible in isoform PB due to truncation of exon 2 boxed in black; exon 2 boxed in green.

Figure 45. Each of the possible start codons boxed in Figure 44; only the first has a suitable GT splice donor site nearby.

Annotation of Isoform PC:

The annotation of this isoform was straightforward; the only difference between Eph-PC

and Eph-PA/PE is the introduction of a small exon, exon 7 (Figure 46). There were TopHat

junctions as well as an N-SCAN prediction and in-phase splice donors and acceptors to support

the presence of this exon.

Figure 46. Short exon found in Eph-PC, Eph-PD, and Eph-PF along with the corresponding prediction of this exon by N-SCAN and the supporting TopHat junctions.

Annotation Summary:

Exon        Coordinates   Length (bp)  Phase of 3’ Splice Acceptor  Phase of 5’ Splice Donor  Frame
1           23-275        253          --                           1                         2
2           346-425       80           2                            0                         3
3           2123-2303     181          0                            1                         2
4           3247-3732     486          2                            1                         3
5           5342-6229     888          2                            1                         1
6           6291-6467     177          2                            1                         2
7           7975-8022     48           2                            1                         3
8           11201-11329   129          2                            1                         1
9           11400-11495   96           2                            1                         2
10          11566-11936   371          2                            0                         3
11          12169-12315   147          0                            0                         1
12          12382-12711   330          0                            0                         1
13          12770-12871   102          0                            --                        2
Stop Codon  12872-12874   3            --                           --                        2

Table 7. Annotation summary for Eph-PC.

Gene Model Checker Results:

The gene model summarized in Table 7 was confirmed for Eph-PC (Figure 47). Note that

as in Eph-PA and Eph-PE, the first exon appears to be poorly conserved on the dot plot

comparing the sequence similarity between the D. melanogaster and D. elegans models; see

Figure 42 for the side-by-side amino acid comparison.

Figure 47. Gene Model Checker generated dot plot showing sequence homology between D. melanogaster Eph-PC gene model (x axis) and orthologous model in D. elegans (y axis). Note what appears to be complete lack of homology in the model for exon 1 (black brackets).

Annotation of Isoform PD:

All but one of the exons present in isoform PD were annotated in either isoform PA, PE, or

PC above. The BLASTX alignment suggests that the last exon in Eph-PD appears more than 4

kb downstream of the end of the prior exon. However, this is inconsistent with the D.

melanogaster model seen on FlyBase and with Gene Record Finder, which both predict only 12

coding exons. In addition, there are no TopHat junctions, RNA-Seq, or predictions by

GENSCAN or N-SCAN to support this exon (Figure 48).

Figure 48. Top left panel: UCSC Genome Browser view of last exon proposed by BLASTX alignment with D. melanogaster; top right panel: Flybase genome browser view of Eph-PD gene model in D. melanogaster; bottom panel: expanded view of top right panel.

Omitting the final exon proposed by BLASTX and instead ending at the same place as the

PA, PC, and PE isoforms would leave this gene model without a stop codon. Hence,

the last exon was extended until the next available stop codon in frame 1, from base 12775 to

12777 (Figure 49). However, Gene Record Finder showed that in D. melanogaster, the last base

of Eph-PD exon 12 is upstream of the first base of Eph-PE exon 12; this is inconsistent with the

stop codon chosen, but the stop codon from base 12775 to base 12777 represents the best option

for conserving the PD isoform in D. elegans (Figure 50).
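Extending an exon to the next available stop codon amounts to walking codon by codon from a known in-frame position until TAA, TAG, or TGA is reached. A minimal Python sketch of that search follows; it assumes 1-based coordinates, and the function name and the toy sequence are illustrative rather than taken from contig30.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def next_in_frame_stop(seq, codon_start):
    """Return 1-based (start, end) of the first in-frame stop codon.

    `codon_start` is the 1-based position of the first base of any codon
    known to be in the reading frame; returns None if no stop is found.
    """
    seq = seq.upper()
    for pos in range(codon_start - 1, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return (pos + 1, pos + 3)
    return None


# Toy example: ATG GCC TAA GGG, read from base 1.
print(next_in_frame_stop("ATGGCCTAAGGG", 1))
```

Starting the same search one base later reads a different frame, which is why the frame of the preceding exon has to be fixed before the stop codon is chosen.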

Figure 49. Stop codon chosen for Eph-PD final exon, exon 12 (black box).

Figure 50. Left panel: gene model for Eph-PE in D. melanogaster, showing exon 12, boxed in black, begins at base 620,254; right panel: gene model for Eph-PD in D. melanogaster, exon 12, boxed in black, ends at 620,234 (before the start of Eph-PE exon 12).

Annotation Summary:

Exon        Coordinates   Length (bp)  Phase of 3’ Splice Acceptor  Phase of 5’ Splice Donor  Frame
1           23-275        253          --                           1                         2
2           346-425       80           2                            0                         3
3           2123-2303     181          0                            1                         2
4           3247-3732     486          2                            1                         3
5           5342-6229     888          2                            1                         1
6           6291-6467     177          2                            1                         2
7           7975-8022     48           2                            1                         3
8           11201-11329   129          2                            1                         1
9           11400-11495   96           2                            1                         2
10          11566-11936   371          2                            0                         3
11          12169-12315   147          0                            0                         1
12          12382-12774   393          0                            --                        1
Stop Codon  12775-12777   3            --                           --                        1

Table 8. Annotation summary for Eph-PD.

Gene Model Checker Results:

The model proposed above for an ortholog to Eph-PD in D. elegans passed the guidelines

set by Gene Model Checker. The apparent lack of homology between exon 1 in the gene models

has previously been discussed and accounted for (Figure 42). The lack of homology seen at the

very end of the last exon in this model is likely due to the choice of stop codon, which according

to Gene Record Finder should not have extended so far, but for which there was no better

alternative (Figure 51).

Figure 51. Gene model checker output. Left panel: dot plot (x axis, Eph-PD model in D. melanogaster; y axis, D. elegans orthologous model) showing lack of homology at end of exon 12 (black brackets); right panel: side- by-side amino acid alignment, showing lack of sequence homology in final exon (top line, Eph-PD model in D. melanogaster; bottom line, D. elegans orthologous model).

Annotation of Isoform PF:

This isoform uses the same first exon as isoform Eph-PB; we previously decided that

there is no viable methionine to make this a reasonable model for the first exon. Hence, in the

annotation of Eph-PF, we decided that the first exon would start at the upstream methionine

consistent with the start position of Eph-PA, Eph-PC, Eph-PD, and Eph-PE.

This isoform also uses a different final exon than any of the other isoforms. The D.

melanogaster gene model shows that the final exon of Eph-PF is only one base long and only

codes for a portion of the stop codon; this base is found at the first base of the final exons in the

other isoforms (Figure 52).

Figure 52. Gene model for Eph-PF via FlyBase Genome Browser; red boxes indicate the transcript (top panel, red box) and coding (middle panel, red box) lengths of the PF isoform. The final exon in this isoform is one base long and codes for a portion of the stop codon (bottom panel, red box).

A suitable splice donor was found in the region of the end of the BLASTX alignment of

contig30 to D. melanogaster. Immediately prior to the GT donor sequence, a TG sequence is

present. This is a phase 2 donor in frame 1 (Figure 53). In frame 3 we find a suitable splice

acceptor that is phase 1. Immediately after the AG splice sequence, an A is present. This A

completes the stop codon (TGA) begun in the previous exon and is consistent with the start of an

exon predicted by N-SCAN (Figure 54).

Figure 53. Boxed in blue is the TG sequence at the end of exon 14 immediately upstream of the GT splice donor sequence (phase 2 in frame 1).

Figure 54. Boxed in blue is the A at the beginning of exon 15 immediately downstream of the AG splice acceptor sequence (phase 1 in frame 3); this A completes the stop codon sequence TGA.
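The split stop codon described above (TG before the GT donor, A after the AG acceptor) can be checked mechanically. The sketch below is a hypothetical helper, not a GEP tool; it assumes 1-based coordinates on the coding strand and uses a toy sequence, not contig30.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_split_stop(seq, upstream_exon_end, downstream_exon_start):
    """Check a stop codon split 2 + 1 across an intron (1-based coordinates)."""
    seq = seq.upper()
    last_two = seq[upstream_exon_end - 2:upstream_exon_end]      # e.g. "TG"
    first_one = seq[downstream_exon_start - 1]                   # e.g. "A"
    donor = seq[upstream_exon_end:upstream_exon_end + 2]         # expect "GT"
    acceptor = seq[downstream_exon_start - 3:downstream_exon_start - 1]  # expect "AG"
    codon = last_two + first_one
    return {
        "codon": codon,
        "is_stop": codon in STOP_CODONS,
        "donor_ok": donor == "GT",
        "acceptor_ok": acceptor == "AG",
    }


# Toy sequence: exon ends "...TG", intron "GTAAGTTTTTCAG", next exon begins "A..."
toy = "CCTG" + "GTAAGTTTTTCAG" + "ACC"
print(check_split_stop(toy, upstream_exon_end=4, downstream_exon_start=18))
```

A check of this kind only confirms that the sequence permits the model; it says nothing about whether the junction is actually used.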

While this gene model can be made, it is primarily made for the purpose of conserving

the isoforms found in D. melanogaster. Note that there was no significant drop in RNA-Seq or

TopHat junctions to support the end of exon 14 at this position. Given that the RNA-Seq

coverage is >600 reads deep in the Mixed Embryos track, it seems unlikely that if this isoform

does exist in D. elegans and is transcribed at levels consistent with the other isoforms, there

would be no TopHat junctions.

Annotation Summary:

Exon        Coordinates            Length (bp)  Phase of 3’ Splice Acceptor  Phase of 5’ Splice Donor  Frame
1           23-275                 253          --                           1                         2
2           346-425                80           2                            0                         3
3           2123-2303              181          0                            1                         2
4           3247-3732              486          2                            1                         3
5           5342-6229              888          2                            1                         1
6           6291-6467              177          2                            1                         2
7           7975-8022              48           2                            1                         3
8           11201-11329            129          2                            1                         1
9           11400-11495            96           2                            1                         2
10          11566-11936            371          2                            0                         3
11          12169-12315            147          0                            0                         1
12          12382-12639            258          0                            --                        1
Stop Codon  12640-12641 and 12770  3            --                           --                        1→3

Table 9. Annotation summary for Eph-PF.

Gene Model Checker Results:

Gene Model Checker does not allow the stop codon to be split up this way; hence, this

portion of the model failed. All other exons in the model proposed above passed the criteria set

by the program (Figure 55). The apparent lack of homology in the first exon has been previously

discussed and accounted for (Figure 42).

Figure 55. Dot plot showing homology between Eph-PF gene model in D. melanogaster (x axis) and orthologous model in D. elegans (y axis).

Overview Eph:

I annotated five isoforms of Eph in D. elegans, Eph-PA, Eph-PC, Eph-PD, Eph-PE, and

Eph-PF (Figure 56); Eph-PB was not annotated as an isoform because changing the annotation

of the first exon would make the model exactly consistent with Eph-PA (both in terms of

transcript sequence and coding sequence).

Figure 56. Final gene model for Eph ortholog in D. elegans, including isoforms PA, PC, PD, PE, and PF.

Repeat Elements (contig21):

The total repeat content for contig21 returned by RepeatMasker was 31.72% (Figure 57).

There were eight repeats over 500 bp in length, which were limited to the regions between genes

or within long introns of the genes (Table 10 and Figure 58). Two independent BLASTn

searches using as the subject, Wolbachia (taxid:953) and Wolbachiae (taxid:952), and as the

query, nucleotide sequence of contig21, returned no significant hits, suggesting that the 40

unclassified repeats (19.88%) are not remnants of Wolbachia DNA. We would expect the repeat

density to be this high given that contig21 is a sequence taken from the D. elegans dot

chromosome.

Figure 57. RepeatMasker output for contig21.

Repeat Family      Coordinates   Length (bp)
rnd-5_family-237   4473-5632     1159
rnd-5_family2544   3768-4472     704
rnd-3_family113    142-799       657
rnd-3_family113    6459-7104     645
rnd-3_family113    17977-18580   603
rnd-1_family259    31194-31723   529
rnd-5_family815    23275-23802   527
rnd-5_family237    76-588        512

Table 10. Summary of repeats >500 bp in length.

Figure 58. UCSC Browser view of repeat elements >500 bp in length.

Synteny:

Comparing my annotations with the orthologous region on the dot chromosome in D.

melanogaster revealed the same order of genes, all transcribed on the negative strand (Figure

59). I also included the putative pseudogene and ncRNA found in this region in D. melanogaster

in my model, and these are uploaded into the final track below.

Figure 59. Synteny between D. elegans (top panel) and D. melanogaster (bottom panel) annotations.

Transcription Start Site Annotation, CG11148:

I chose to evaluate the transcription start site (TSS) for CG11148 since this gene was

well-conserved between the two species and had a high amount of RNA-Seq data throughout. To

begin, the annotation of the TSS in D. melanogaster was evaluated (Figure 60). Given that there

are only two positions annotated as a TSS by Celniker, this was classified as a “peaked”

promoter. In the D. melanogaster model, the TSS is annotated ten bases upstream of the

beginning of the annotation of the 5’ UTR; the 5’ UTR begins 759 bases upstream of the

beginning of the first coding exon. By extrapolation to D. elegans, because the first coding exon

begins at base 39,739, adding 769 bases to this means that the TSS should fall at about 40,508.
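The extrapolation above is simple coordinate arithmetic. A minimal Python sketch follows, assuming (as stated in the Synteny section) that the gene lies on the negative strand, so that "upstream" means a higher contig coordinate; the function name is illustrative.

```python
def extrapolate_tss(first_cds_base, utr_to_cds, tss_to_utr, strand):
    """Estimate a TSS coordinate from distances measured in the reference species.

    first_cds_base: first base of the first coding exon in the target contig
    utr_to_cds:     bases from the start of the 5' UTR to the first coding base
    tss_to_utr:     bases from the annotated TSS to the start of the 5' UTR
    strand:         "+" or "-" (upstream is a higher coordinate on "-")
    """
    upstream = utr_to_cds + tss_to_utr
    if strand == "-":
        return first_cds_base + upstream
    return first_cds_base - upstream


# Values from the text: 759 + 10 = 769 bases upstream of base 39,739.
print(extrapolate_tss(39739, 759, 10, "-"))
```

This assumes that the 5' UTR length is conserved between the two species, which is only a first approximation.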

Figure 60. Annotation of transcription start site for CG11148 in D. melanogaster. Left panel: characterization of peaked promoter; right panel: expanded view of promoter, black box defines the region of sequence used for the BLASTn search below.

We would expect to see conserved core promoter motifs close to base 40,508. Within this

region, two BRE motifs and one DPE motif offer the best support for the presence of a promoter

(Figure 61).

Figure 61. Conserved core promoter motifs in the region of the promoter in D. elegans.
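A search for core promoter motifs of this kind can be sketched as a degenerate-consensus scan. The Python example below is not the tool used for Figure 61; the consensus strings (SSRCGCC for BRE, RGWYV for DPE) are commonly cited forms and are assumptions here, and the toy sequence is invented for illustration.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Assumed consensus sequences; other published variants exist.
MOTIFS = {"BRE": "SSRCGCC", "DPE": "RGWYV"}


def find_motif(seq, consensus):
    """Return 1-based start positions of all (overlapping) matches."""
    pattern = "".join(IUPAC[base] for base in consensus)
    # A lookahead lets overlapping matches be reported.
    return [m.start() + 1 for m in re.finditer(f"(?=({pattern}))", seq.upper())]


toy = "TTGGACGCCTTAGACGTT"
for name, consensus in MOTIFS.items():
    print(name, find_motif(toy, consensus))
```

For a gene on the negative strand, the scan would be run on the reverse complement of the contig, and short motifs such as these match often by chance, so position relative to the expected TSS matters more than the match itself.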

Next, a BLASTn search was conducted, using as subject, the nucleotide sequence of the

promoter region in the D. melanogaster annotation (defined in Figure 60), and as query, the

nucleotide sequence of contig21. This returned an alignment covering the estimated TSS in D.

elegans; however, the alignment was missing the first 15 bases of the query sequence (Figure

62).

Figure 62. BLASTn alignment using query, nucleotide sequence of contig21, and subject, nucleotide sequence of CG11148 promoter in D. melanogaster.

In order to discover the reason for the 15 base deletion in the D. elegans sequence relative

to the D. melanogaster sequence, the conservation track on the D. melanogaster browser was

evaluated. We can see that the D. melanogaster sequence in the region of the 5’ UTR and TSS is

conserved to the highest degree in D. yakuba and D. erecta, with the sequences of other

Drosophila species showing increasing divergence. The sequence of D. elegans in this region

shows more conservation to the D. biarmipes sequence than to the D. melanogaster sequence, so

we are not surprised by the missing bases in the BLASTn alignment above (Figure 63).

Figure 63. Conservation in the region of the CG11148 promoter. D. melanogaster conservation to sequences boxed in red; conservation between D. elegans and D. biarmipes sequences shown in blue.

Given the similarity between the D. biarmipes and D. elegans sequences in this region, the

D. biarmipes RNA PolII and RNA-Seq data were evaluated for a more complete description of

this region. The RNA PolII data showed two peaks associated with the promoter, which suggests

that there are two different isoforms for this UTR; these two peaks are consistent with the two

different sets of TopHat splice junctions (Figure 64). To evaluate the homology between the D.

biarmipes sequence and the D. elegans sequence further, a BLASTn search was conducted using

query, the nucleotide sequence of contig21, and subject, the nucleotide sequence of the RNA-Seq

corresponding to the promoter region in D. biarmipes, defined in Figure 64, from base 835,662

to base 836,062 (Figure 65).

Figure 64. UCSC Genome Browser view of CG11148 promoter; note two peaks seen in RNA PolII data, suggesting two isoforms for this UTR may be present; this is consistent with the two different sets of TopHat splice junctions seen in this region as well. The black brackets define the promoter region of CG11148 in D. biarmipes—DNA from this region was used as the subject in the BLASTn search in Figure 65.

Figure 65. BLASTn alignment using query, the nucleotide sequence of contig21, and subject, the nucleotide sequence of the CG11148 promoter in D. biarmipes (i.e. base 1 of the subject is base 835,662 in the D. biarmipes whole genome assembly as of April 2013).

Because the farthest upstream base of the subject of the BLASTn in Figure 65 aligned to

base 40552 of the query (in the region where we would expect the TSS to be in D. elegans),

because this result is roughly in line with the result in D. melanogaster, which placed the TSS at

40542 (40527+15 missing bases), and because there are core promoter motifs within 50 bp of

either of these bases, we estimate that the TSS for CG11148 in D. elegans will be between base

40540 and base 40555 (of contig21).

Evolution of CG11148 Orthologs:

ClustalW2 can be a powerful tool to evaluate the evolution of a sequence over a variety of

different species. This analysis attempted to characterize the conserved GYF domain in

CG11148. The GYF domain may have a “critical importance in tripeptide ligand binding” [1].

This domain has been found in a number of eukaryotic proteins and “could also be involved in

proline-rich sequence recognition” [1]. Also, note that “there is limited homology within the C-

terminal 20-30 amino acids of various GYF domains, supporting the idea that this part of the

domain is structurally, but not functionally important” [1]. In D. melanogaster the GYF domain

is 49 amino acids long and extends from amino acid 566 to amino acid 614 of CG11148 [2]; the

sequence of this conserved domain is shown below (Figure 66). Based on the ClustalW2

alignment below, we can see that this domain is highly conserved throughout a number of

Drosophila species (Figure 67) as well as a number of other insect and animal

species (Figure 68).

Figure 66. Sequence of the conserved GYF domain in D. melanogaster.

Figure 67. ClustalW2 alignment of eight Drosophila species; this analysis showed varying conservation of the amino acid sequence throughout CG11148. The conserved GYF domain boxed in black showed exceptional similarity between all of the fly species.

Figure 68. ClustalW2 alignment of eight animal and insect species; this analysis showed strong conservation, with the sequences clustering into two groups: the insects (Drosophila, mosquitoes, and honey bees) and the other animal species. Conservation of the GYF domain, however, was seen for all species.
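Conservation across a multiple alignment such as those in Figures 67 and 68 can be summarized column by column. The sketch below is a simplified illustration and not ClustalW2's own scoring: it reports, for each alignment column, the fraction of sequences that share the most common residue, counting gaps as mismatches. The toy alignment is invented; only the domain coordinates (amino acids 566 to 614) come from the text.

```python
from collections import Counter


def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue in each column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]  # gaps never count as matches
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores


def domain_length(first_residue, last_residue):
    """Length of a domain given inclusive residue numbers."""
    return last_residue - first_residue + 1


toy_alignment = ["GYFDP", "GYFEP", "GYF-P"]  # hypothetical, not CG11148
print(column_conservation(toy_alignment))
print(domain_length(566, 614))  # GYF domain coordinates from the text
```

A summary of this kind makes it easy to compare the average conservation inside the domain with that of the flanking sequence.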

Discussion:

The fourth chromosome of D. elegans investigated here shows a higher density of repeats

than, but a density of genes similar to, that expected in a region of euchromatin on one of the other

Drosophila chromosomes. Although the fourth chromosome is known to be composed mainly of

heterochromatin, it is interesting to note that the genes that are on this chromosome are still

transcribed and translated at normal levels. Evidence of a high level of gene expression was

apparent in both CG11148 and Eph, which had RNA-Seq coverage of 600-800 reads throughout

the genes. Further analyses of the transcription start sites of genes on the fourth chromosome,

like that presented above, may give insights into the mechanism for expression of genes located

in a region of heterochromatin.

Contig21 included both a non-coding RNA gene and a putative pseudogene. A

pseudogene is an especially rare feature for Drosophila and may represent a retrotransposition

event. This may be especially impressive given the heterochromatic nature of the DNA on the

fourth chromosome; in order for a retrotransposition event to occur, the gene would first need to

be expressed, requiring the transcription machinery to be able to access the DNA, and then need

to find a way to reintegrate into DNA that is tightly incorporated into nucleosomes. Future

characterization of the gene from which this pseudogene originated in D. melanogaster, if from a

fourth chromosome gene, could be important in determining which regions of the fourth

chromosome are less rigidly packaged.

Appendix:

GFF, pep, and FASTA files submitted to Wilson Leung.

Acknowledgements:

Thank you to Dr. Elgin, Dr. Shaffer, Wilson Leung, Dr. Bednarski, and Dr. Mardis for

their support throughout this project.

References:

[1] Mitchell, Alex, et al. "The InterPro protein families database: the classification resource after 15 years." Nucleic Acids Research (2014): gku1243. http://nar.oxfordjournals.org/content/early/2014/11/26/nar.gku1243.full.

[2] UniProt Consortium. "UniProt: a hub for protein information." Nucleic Acids Research (2014): gku989. http://nar.oxfordjournals.org/content/early/2014/10/27/nar.gku989.abstract.