
A Thesis

entitled

Human Genome and Transcriptome Analysis with Next-Generation Sequencing

by

Basil Khuder

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the

Master of Science in Biomedical Sciences Degree in

Bioinformatics, Proteomics, and Genomics

______Dr. Alexei Fedorov, Committee Chair

______Dr. David Kennedy, Committee Member

______Dr. Robert Blumenthal, Committee Member

______Dr. Amanda Bryant-Friedrich, Dean, College of Graduate Studies

The University of Toledo

August 2017

Copyright 2017, Basil Khuder

This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the express permission of the author.

An Abstract of

Human Genome and Transcriptome Analysis with Next-Generation Sequencing

by

Basil Khuder

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science in Biomedical Sciences Degree in Bioinformatics, Proteomics, and Genomics

The University of Toledo

August 2017

Advancements in Next-Generation Sequencing technologies and steep declines in cost have enabled sequencing to occur at astronomical rates. With this technology, researchers have made great strides in advancing our understanding of the human genome.

Additionally, this surplus of data has opened the doors to many popular databases, such as NCBI’s Sequence Read Archive, that help host the enormity of files. The onslaught of data has also, however, created predicaments for scientists, as researchers are still trying to find the optimal methods for using and processing these data to answer some of their most challenging questions. Here we present two approaches to normalizing and analyzing high-throughput data that can help answer questions about the human genome and transcriptome, and we demonstrate how constructing such methods can produce substantial biological implications. The first approach uses NCBI’s Sequence Read Archive to analyze unnormalized RNA-Seq data. An algorithm was created to manually normalize the data, allowing us to use the SRA to look for plausible expression levels of long, highly conserved, non-coding sequences within the Fat-Mass and Obesity Gene, FTO. Bioinformatic software was then used to try to confirm both our preliminary results and predicted patterns of expression. The second approach uses pre-existing genome assembly software, namely the Genome Analysis Toolkit by the Broad Institute, to normalize exome-sequencing data from two individuals afflicted with retinitis pigmentosa and to find variants that might have contributed to the disease. These two approaches showcase how researchers can readily analyze their data and gain far better insight into the human genome and transcriptome.


I dedicate this thesis to my loving parents, Maha and Sadik, for their support throughout this process.

Acknowledgements

I would like to extend my deepest gratitude to my adviser, Dr. Alexei Fedorov, for his great mentorship over the past two years. His helpfulness and guidance have allowed me to explore my passion in the fields of Genomics and Bioinformatics and have guided me towards a future career. I would also like to greatly thank Dr. David Kennedy and Dr. Robert Blumenthal for assisting me through the process as members of my thesis committee.

Lastly, I would like to thank my lab colleagues Patrick Brennan, Sharmistha Chakraborthy, Rajib Dutta, Joseph Mainsah, and Yuriy Yatskiv for their friendship and assistance.


Table of Contents

Abstract

Acknowledgements

Table of Contents

List of Tables

List of Figures

List of Abbreviations

1 Transcriptome Analysis of Ultra-Conserved Elements in Human Fat-Mass and Obesity Gene (FTO)

1.1 Abstract

1.2 Introduction

1.2.1 FTO GWAS and Regulatory Studies

1.2.2 FTO Transcripts

1.3 Methods

1.3.1 Non-Code Database

1.3.2 SRA-to-BLAST

1.3.3 RNA-Seq by Expectation-Maximization

1.4 Results

1.4.1 SRA-to-BLAST Hits

1.4.2 SRA-to-BLAST Controls

1.4.3 RSEM

1.5 Discussion

1.6 Conclusion

2 Investigations into Exome Sequencing Data Provide Insights into Rare Retinal Disease

2.1 Abstract

2.2 Introduction

2.2.1 RP1L1

2.3 Methods

2.3.1 Genome Alignment

2.3.2 Variant Calling

2.3.3 Custom Variant Filtration

2.4 Results

2.5 Discussion

2.6 Conclusion

References

A Experimental Results from SRA-to-BLAST

B Commands for Genomic Alignment and Variant Calling

List of Tables

1.1 FTO Intron and Exon Coordinates

1.2 FTO UCNE Coordinates

1.3 FTO Transcripts

1.4 RSEM Results

List of Figures

1-1 FTO Gene Depiction

1-2 FTO Downstream Model

1-3 UCNE Expression Workflow

1-4 UCNE with Highest Hits

1-5 Average UCNE Hits

1-6 Read Spots and Average UCNE Hits

1-7 Intron #5 Hits and Average UCNE Hits

1-8 Intron #5 Hits and UCNE #4 Hits

2-1 Workflow for Genome Alignment and Variant Calling

2-2 Representation of Reference Indexing

2-3 Filtration Steps Conducted

2-4 Distribution of SNPs and INDELs across Chromosomes

2-5 Location of INDEL in RP1L1 using IGV

2-6 RP1L1 Protein-Coding Sequence Affected by Variant

List of Abbreviations

BAM: Binary Alignment/Map Format

BWA: Burrows-Wheeler Aligner

FTO: Fat-mass and Obesity Gene

GATK: Genome Analysis Toolkit

GWAS: Genome-Wide Association Studies

INDEL: Insertions and Deletions

RP1L1: Retinitis Pigmentosa 1 Like 1 Gene

RSEM: RNA-Seq by Expectation-Maximization

SNP: Single Nucleotide Polymorphism

SRA: Sequence Read Archive

SAM: Sequence Alignment/Map Format

TPM: Transcripts Per Million

TSL: Transcript Support Level

UCE: Ultra-Conserved Element

UCNE: Ultra-Conserved Non-Coding Element

Chapter 1

Transcriptome Analysis of Ultra-Conserved Elements in Human Fat-Mass and Obesity Gene (FTO)

1.1 Abstract

The Fat-Mass and Obesity Gene, FTO, has been the subject of much scrutiny from the moment the gene was implicated in metabolic syndromes, notably obesity and type 2 diabetes. Current research on the gene has cast doubt on this earlier assertion and instead postulates that FTO regulates downstream targets that are responsible for these metabolic ailments. FTO contains ten ultra-conserved non-coding elements (UCNEs), 200-300 base pair long sequences of DNA that are extremely well conserved in two or more species.

However, it is not apparent how these ultra-conserved elements contribute to FTO’s functionality. In this study, we analyzed 57 RNA-seq experiments from 18 different human tissues in NCBI’s Sequence Read Archive. Using an algorithmic approach to manually normalize the raw sequencing data, we examined expression levels of these ultra-conserved elements to see whether they are being transcribed, adding to their plausible regulatory functionality, or whether they are just bystanders in the complexity that is the FTO gene. This experimentation was followed up with the RNA-Seq by Expectation-Maximization software, RSEM, to determine whether non-protein-coding FTO transcripts, which span the regions of some of the UCNEs, have detectable expression levels, which would confirm our preliminary results. Our findings from the SRA-to-BLAST experiments were relatively consistent in comparison to our controls and did not indicate any potential for expression that was either alone or a part of longer transcripts. Our follow-up results using RSEM showed low to below-baseline expression levels for four non-protein-coding transcripts. We conclude that using the SRA-to-BLAST database, in conjunction with pre-determined controls, is a useful preliminary investigation tool for expression and that there is a strong possibility that these ultra-conserved elements are being expressed.

1.2 Introduction

The Fat-mass and Obesity-associated gene, also known as FTO or the alpha-ketoglutarate-dependent dioxygenase gene, is an eight-intron gene on chromosome 16 [1]. The protein encoded by FTO is an enzyme capable of mRNA demethylation, specifically converting N6-methyladenosine (m6A) to adenosine. The protein is a part of the AlkB family, known for its activity in DNA repair mechanisms [2]. Table 1.1a shows the distribution of regions for the FTO gene and the length of each region as per the GRCh37/hg19 assembly, and Table 1.1b shows the division of regions as per the GRCh38/hg38 assembly. What is intriguing about FTO is that it contains ten highly conserved elements distributed across the gene. These elements have been identified as ultra-conserved elements, UCEs, or, more specifically, ultra-conserved non-coding elements, UCNEs.

UCEs are segments of DNA that are longer than 200 base pairs and have 100% conservation in orthologous regions of at least the human, mouse, and rat genomes [3]. The Ultra-Conserved Non-Coding Elements database contains over 4,000 non-coding ultra-conserved elements, conserved among eight species [4]. The UCNEs within FTO are contained in introns 1, 7, and 8. The largest UCNE is #3, at 394 base pairs, while the second largest is #7, at 370 base pairs. The full coordinates of each of the ultra-conserved elements within FTO, their location within the gene, and the length of each element can be seen in Table 1.2, with 1.2a giving the GRCh37/hg19 coordinates and 1.2b giving the GRCh38/hg38 coordinates. Figure 1-1 depicts the FTO gene, the locations of the ultra-conserved elements, and how the gene varies between the genome assemblies.
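The definition of a UCE given above can be illustrated with a minimal Python sketch. It assumes the orthologous sequences have already been aligned column by column (the alignment step itself is not shown), and the function name, the toy sequences, and the lowered length threshold in the demo are all hypothetical choices made for illustration, not the procedure used to build the UCNE database.

```python
def ultra_conserved_runs(aligned_seqs, min_length=200):
    """Return (start, end) column ranges, end exclusive, where every aligned
    sequence carries the same non-gap base for at least min_length columns."""
    if not aligned_seqs:
        return []
    width = len(aligned_seqs[0])
    if any(len(seq) != width for seq in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")

    runs = []
    run_start = None
    for col in range(width):
        bases = {seq[col].upper() for seq in aligned_seqs}
        identical = len(bases) == 1 and "-" not in bases
        if identical and run_start is None:
            run_start = col                      # a conserved run begins
        elif not identical and run_start is not None:
            if col - run_start >= min_length:    # keep only long enough runs
                runs.append((run_start, col))
            run_start = None
    if run_start is not None and width - run_start >= min_length:
        runs.append((run_start, width))          # run reaching the last column
    return runs


# Toy alignment (hypothetical sequences); the threshold is lowered from
# 200 to 5 columns only so that the demonstration stays short.
human = "ACGTACGTTTGACA"
mouse = "ACGTACGTATGACA"
rat = "ACGTACGTTTGACA"
print(ultra_conserved_runs([human, mouse, rat], min_length=5))
```

With the default threshold of 200 columns, only stretches that are identical across all supplied species for at least 200 aligned positions would be reported, which mirrors the 100% conservation criterion described above.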

1.2.1 FTO GWAS and Regulatory Studies

FTO has been implicated in many genome-wide association studies (GWAS), with the earlier studies showing that variants within FTO have a strong association with obesity and type 2 diabetes [5]. One of the first GWAS to implicate FTO in obesity found that participants who possessed a common variant, rs9939609, were more prone to diabetes and had a change in their BMI [6]. Specifically, the 16% of individuals who were homozygous for the risk allele weighed about 3 kilograms more and had 1.67-fold increased odds of obesity [6]. A later study showed that the loss of the FTO gene in mice caused postnatal growth retardation and low levels of adipose tissue and lean body mass [7].

However, more recent studies have cast doubt on the notion that FTO itself is solely responsible for the effects on obesity and metabolic disorders and instead suggest that it is FTO’s regulatory ability that allows it to act on nearby targets.

In 2014, Smemo et al. indicated that FTO contains a functional long-range enhancer that acts on the IRX3 gene to alter body mass. It is the noncoding regions within FTO that are connected with IRX3 and can directly interact with the IRX3 promoter to alter its expression [8]. This revelation began to shift the focus towards further investigation of FTO’s regulatory abilities. In 2015, Claussnitzer et al. identified a ten-kilobase super-enhancer that contained the variant rs1421085 within a conserved motif. This motif is the binding site for ARID5B, which can act as a transcriptional repressor. The FTO risk allele alters this binding site, resulting in over-expression of IRX3 and IRX5 [9]. Figure 1-2 depicts this binding site. However, even with this information on how FTO acts upon other gene targets, little attention has been given to the roles, if any, of the UCNEs present within FTO. Some studies have simply referenced the entire intronic regions within FTO and noted that they are involved in the regulatory landscape, while others have not mentioned the UCNEs at all.

1.2.2 FTO Transcripts

Per Ensembl, FTO has 23 transcripts, which are listed in Table 1.3 [10]. Of these 23, five are classified as non-protein-coding transcripts, and four of those five include UCNE sequences. Non-protein-coding transcripts fall under four categories in Ensembl, which include non-coding RNA and long non-coding RNA [11]. This means that these four transcripts could potentially have retained introns that contain UCNEs, or could be some type of functional non-coding RNA. Transcript FTO-021 includes UCNE #10; Transcript FTO-011 spans UCNEs #3, #4, #5, #6, #7, and #8; Transcript FTO-007 includes UCNEs #9 and #10; and, lastly, Transcript FTO-009 spans only UCNE #1. Three of these four transcripts are listed as having transcript support level 4, while Transcript FTO-009 has transcript support level 5. Transcript support level, or TSL, is the measure that Ensembl uses to indicate how much scientific support a specific transcript has. The best-supported transcripts have a TSL of 1, and the least supported have a TSL of 5. These four transcripts therefore have very limited support [12], at least in the 18 tissues assessed (see Methods, 1.3.2).

1.3 Methods

1.3.1 Non-Code Database

The Noncode Database provides one of the most complete listings and annotations of long non-coding RNAs [13]. To see whether any of our ultra-conserved non-coding elements were present in the database, we submitted each of the UCNEs to the Noncode BLAST database. Sequences were submitted using their GRCh37/hg19 and GRCh38/hg38 genomic coordinates. If sequences matched the Noncode database, this would imply that these UCNEs are already known long non-coding RNAs. If they did not appear, this would mean either that these UCNEs are not long non-coding RNAs or that they simply have not yet been identified as such.

1.3.2 SRA-to-BLAST

To conduct the expression analysis, sequencing datasets were taken from NCBI’s Sequence Read Archive [14]. All datasets were total RNA-seq experiments that were depleted of ribosomal RNAs. In total, 57 human experiments were analyzed, encompassing the tissues of the pancreas, placenta, adrenal gland, spleen, small intestine, muscle, osteoblast, cervix, adipose, brain, stomach, thymus, heart, thyroid, kidney, prostate, lung, and liver. Exact experiment numbers for each of the experiments can be seen in Appendix A. SRA-to-BLAST was then used to analyze each of the ten ultra-conserved elements. Before doing so, each sequence was submitted to the RepeatMasker online database [15] to be screened for repetitive elements, ensuring that repeats would not produce false positives in the results of our BLAST search. The HMMER search engine was specifically used, which matches the sequences against the human Dfam database [16].

The only UCNE that showed any repetitive elements was UCNE #4, in which 36 bp, 14.48% of the total sequence, were deemed to be simple repeats. All the default BLAST settings were used, except for “max target sequences”, which was set to 20,000, and “expect threshold”, which was set to 10^-4. We used three different controls against which to compare our UCNE BLAST hit results: FTO mRNA, FTO Intron #5, and the number of read spots per experiment. Read spots are a feature unique to SRA experiments; a spot refers to where the reads originated from, together with any associated meta-information. For example, one spot might contain multiple reads, in addition to the adapter information and sequencing barcodes.

1.3.3 RNA-Seq by Expectation-Maximization

RNA-Seq by Expectation-Maximization was used on a portion of the SRA experiments to see whether any of the four non-protein-coding FTO transcripts had significant expression levels. FASTQ files from the SRA database were aligned using the RSEM STAR pipeline [17]. This pipeline allows for simultaneous alignment of sequencing data while also generating a final output that contains expression information. The alignment jobs were based upon Ensembl’s GRCh38 DNA primary assembly for Homo sapiens. The output files from the RSEM software provided transcript expression levels in both Transcripts Per Million (TPM) and Reads Per Kilobase per Million (RPKM). RPKM is a technique that removes both library-size effects and feature-length effects, but it can only be used for within-sample normalization [18]. TPM normalizes for differences in transcript composition, which allows it to be used as an expression value across libraries and samples [18]. Because of this, TPM was used to assess plausible expression of these FTO transcripts.
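The difference between the two units can be made concrete with a minimal Python sketch of their standard textbook definitions. The function names, read counts, and transcript lengths below are hypothetical and chosen only for illustration; RSEM itself derives expected counts and effective lengths through its expectation-maximization procedure, which this sketch does not attempt to reproduce.

```python
def rpkm(read_counts, lengths_bp):
    """Reads Per Kilobase per Million mapped reads for each transcript."""
    total_reads = sum(read_counts)
    if total_reads == 0:
        return [0.0 for _ in read_counts]
    # count / (length in kb) / (total reads in millions)
    return [count * 1e9 / (total_reads * length)
            for count, length in zip(read_counts, lengths_bp)]


def tpm(read_counts, lengths_bp):
    """Transcripts Per Million for each transcript."""
    # Normalize by length first, then rescale so the sample sums to 1e6.
    rates = [count / length for count, length in zip(read_counts, lengths_bp)]
    total_rate = sum(rates)
    if total_rate == 0:
        return [0.0 for _ in rates]
    return [rate * 1e6 / total_rate for rate in rates]


# Hypothetical read counts and transcript lengths, not data from this study.
counts = [100, 200, 700]
lengths = [1000, 2000, 500]
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

Because TPM values always sum to one million within a sample, a given TPM represents the same share of the transcript pool in every library, which is the property that makes it the more convenient unit for comparisons across samples.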

1.4 Results

1.4.1 SRA-to-BLAST Hits

All results from the SRA-to-BLAST analysis can be viewed in Appendix A. Across all 57 experiments, the highest number of BLAST hits received by any UCNE came in experiment SRX1603429, of parietal brain lobe, where UCNE #4 had 177 hits. The second highest UCNE in this experiment was UCNE #1, with 93 hits, and the third highest was UCNE #6, with 75 hits. In this experiment, the average number of hits across all UCNEs was 66, our control Intron #5 had 276 hits, and our mRNA control had 4,432 hits. The proportion of average UCNE hits to mRNA hits was 0.01489, and the experiment contained 102.8 million read spots. The second highest number of hits for any UCNE came in experiment SRX1603506, where UCNE #4 was again the highest, with a total of 174 hits. The second highest UCNE in this experiment was UCNE #5, with 89 hits, and the third highest was UCNE #1, with 88 hits. Our control, Intron #5, had a total of 206 hits in this experiment, while mRNA had 14,574 hits. The average UCNE hit count was 70.6, the proportion of average UCNE hits to mRNA hits was 0.034272, and the experiment contained 195.8 million read spots. The third highest number of hits for any UCNE came in experiment SRX1603439, of thyroid. Again, UCNE #4 was the highest, this time with 134 hits. The second highest was UCNE #6, with 74 hits, and the third highest was UCNE #7, with 45 hits. Intron #5 had 206 hits, and mRNA had 14,574 hits. This experiment contained 48.5 million read spots. Across all the experiments, UCNE #4 had the highest number of BLAST hits in 29 of them, while UCNE #7 was the highest in ten experiments, the second most of any UCNE.
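As a worked illustration of how raw hit counts can be put on a comparable footing, the short Python sketch below reproduces the UCNE-to-mRNA proportion reported for experiment SRX1603429 and also expresses a hit count per million read spots. The function names are hypothetical, and scaling by read spots is shown only as one simple option; it is an assumption of this sketch rather than a calculation taken from the analysis.

```python
def hit_ratio(avg_ucne_hits, mrna_hits):
    """Proportion of average UCNE hits to FTO mRNA control hits."""
    return avg_ucne_hits / mrna_hits


def hits_per_million_spots(hits, read_spots):
    """Scale a raw hit count by the experiment's size in millions of spots."""
    return hits / (read_spots / 1e6)


# Values reported for SRX1603429: 66 average UCNE hits, 4,432 mRNA hits,
# 177 hits for UCNE #4, and 102.8 million read spots.
print(round(hit_ratio(66, 4432), 5))                    # 0.01489
print(round(hits_per_million_spots(177, 102.8e6), 2))   # 1.72
```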

Figure 1-4 shows each of the UCNEs and the number of times that each one had the highest number of hits in an SRA experiment. Experiments that had notably low hit counts included experiment SRX1830410 for brain and experiment SRX1830412 for heart. No total RNA-seq experiment had zero hits for every single UCNE. We can also see the variability between tissues in Figure 1-5, which shows a box and whisker plot of average UCNE hits across the various tissues that were tested. Variability was the highest in tissues for which multiple experiments were available.

1.4.2 SRA-to-BLAST Controls

We then compared our SRA-to-BLAST hits for the UCNEs to the hits attained by our controls, notably Intron #5 hits and the number of read spots per experiment. Figure 1-6 shows a scatter plot with read spots on the x-axis as the independent variable and average UCNE BLAST hits on the y-axis as the dependent variable. We computed the coefficient of determination for the two variables and obtained an r-squared value of 0.11, or 11%. Figure 1-7 shows another scatter plot, this time with average Intron #5 BLAST hits, normalized for length, on the x-axis and average UCNE BLAST hits, also normalized for length, on the y-axis. The coefficient of determination for these two variables was 0.33, or 33%. Lastly, we also plotted the normalized UCNE #4 hits against the normalized Intron #5 hits in Figure 1-8 and conducted a t-test to see whether there was a significant difference between the means of these two groups. Our results indicated a p-value of 5 × 10^-8 for the difference between the means of the two groups.
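The two statistics used in these comparisons can be sketched in a few lines of Python using only the standard library. The data below are hypothetical placeholders rather than values from Appendix A, and because the text does not state which form of the t-test was applied, Welch's unequal-variance version is assumed here purely for illustration. Converting the t statistic to a p-value requires the t distribution, which is left to a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def r_squared(x, y):
    """Coefficient of determination for a simple linear fit of y on x
    (the square of the Pearson correlation coefficient)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for the
    difference between the means of two independent groups."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical values for illustration only.
read_spots_millions = [48.5, 102.8, 150.0, 195.8]
avg_ucne_hits = [30.0, 66.0, 41.0, 70.6]
print(r_squared(read_spots_millions, avg_ucne_hits))

ucne4_normalized = [12.0, 15.5, 18.0, 14.0]
intron5_normalized = [0.8, 1.1, 1.3, 0.9]
print(welch_t(ucne4_normalized, intron5_normalized))
```

An r-squared near zero means that the control explains little of the variation in UCNE hits, whereas a large absolute t statistic indicates that the two group means are far apart relative to their spread.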

1.4.3 RSEM

Four total RNA-seq SRA experiments were used in conjunction with the RNA-Seq by Expectation-Maximization software: SRX1603425, SRX1603427, SRX1603429, and SRX1603432. The full results from the RSEM experiments can be seen in Table 1.4. For SRX1603425, all four of the non-protein-coding transcripts had 0 Transcripts Per Million (TPM). In SRX1603427, three of the transcripts had 0 TPM, while ENST00000472835.1 had a TPM of 0.04, which corresponds to very low to below-baseline expression. For SRX1603429, two of the FTO transcripts had 0 TPM, while ENST00000570395.1 had 0.02 TPM and ENST00000472835.1 had 0.01 TPM, which again corresponds to very low to below-baseline expression. Finally, for SRX1603432, all four of the transcripts had TPM levels of 0.

1.5 Discussion

Our SRA-to-BLAST results indicated two things: a large number of UCNE hits were seen in the tissues of lung and brain, and UCNE #4 had the highest number of hits in the most experiments. Figure 1-5 demonstrates how lung and brain had the highest hit counts, while Figure 1-4 shows that UCNE #4 had the highest number of hits in 29 of the 57 experiments. This means that in 51% of our SRA experiments, which came from multiple researchers and different library construction protocols, UCNE #4 had the highest number of SRA-to-BLAST hits. If we look at how many times a UCNE had either the highest or second highest number of hits, UCNE #4 was still at the top of this category, being the first or second highest 68% of the time. Having UCNE #4 at the top of our experiments, even with the multiple confounding variables present across significantly different SRA experiments, tells us that a non-random phenomenon is occurring. When we compared our results to our first control, read spots, we surprisingly found an extremely low correlation with average UCNE hits. Specifically, the coefficient of determination was 0.11, or 11%. This may be because read spots are extremely variable, unlike traditional read-depth values. They also differ considerably between experiments, depending on the amount of meta-information and the number of reads per individual spot, as designated by the researcher who submitted the data.

We then compared our results to our other control, Intron #5. When we plotted normalized Intron #5 hits against the normalized average UCNE hits from every experiment, we noted a higher correlation than we saw with read spots, with an r-squared of 0.33, or 33%, but still not a strong one. This tells us that there is a slight positive correlation between normalized Intron #5 hits and normalized average UCNE hits. When we then looked only at normalized UCNE #4 hits and Intron #5 hits, we began to see a significant difference in the means of these two groups, as shown in Figure 1-8. For normalized Intron #5, we see a mean of 1 for the number of hits, whereas we see a mean of 15 for normalized UCNE #4. The t-test tells us that this is a significant difference. What is important here is that we were looking to see whether our control would increase in proportion to the UCNE hits. If this were the case, it would make it harder for us to conclude that some type of expression is going on. However, because we see a significant difference between the two, it is likely that there is some kind of expression going on with our ultra-conserved elements, especially with UCNE #4.

Because our SRA-to-BLAST model is not a conventional approach to looking at expression, there is not much literature on how to use this information to assess our findings. For example, although we have seen many hits across multiple experiments, no article or paper describes the minimum number of hits, in comparison to controls, needed to state whether a specific sequence is expressed. Furthermore, another issue in dealing with intronic sequences is that, because they are inside the pre-mRNA transcript, traces of them should be present in total RNA-seq libraries. Thus, we had to be extremely careful in correlating hits with expression, as these hits could be traces of the pre-mRNA transcripts.

Our follow-up RSEM results show very little to no expression of any of the FTO non-protein-coding transcripts in four SRA experiments. This could be because of the very small sample of RNA-Seq experiments that were used, which was only four, or because these transcripts are not being expressed. The issue with using RSEM arises from the fact that it relies on an existing database, in this case Ensembl’s transcript database, as the basis for its expression output. Thus, we were limited to looking only at poorly supported transcripts that could potentially have some type of intron retention encompassing UCNEs. Because this analysis did not indicate any expression, and because of the small sample size, we are unable to conclude that expression is occurring in these four non-protein-coding transcripts.

1.6 Conclusion

Our results showcase the usefulness of the Sequence Read Archive and the SRA-to-BLAST system as a preliminary investigation tool in any transcriptome or genome analysis. Furthermore, the non-randomness we observed for UCNE #4, and the significant mean difference between it and our control, tell us that there is a potential for these UCNEs to be expressed. Much of the evidence points to UCNE #4, as it had the highest mean difference from the controls. Our subsequent RSEM experiments show only low to below-baseline expression levels across four experiments. More experimentation is needed before we can confidently state whether these transcripts are expressed.

Table 1.1: A.) Coordinates for each of the exonic and intronic regions within FTO per the GRCh37/hg19 reference assembly. In total, FTO is 410,504 base pairs long [10]. B.) Coordinates for each of the regions within FTO per the GRCh38/hg38 reference assembly [10].

Table 1.1 A.)

Table 1.1 B.)

Table 1.2: A.) Coordinates for each of the UCNEs based upon the GRCh37/hg19 assembly. B.) Coordinates for each of the UCNEs based upon the GRCh38/hg38 reference assembly [10].

Table 1.2 A.)

Table 1.2 B.)

Figure 1-1: Depiction of the FTO gene per the GRCh37/hg19 reference assembly. As can be seen, FTO consists of nine exons and eight introns. The first intron is the largest. Also shown are the ten ultra-conserved elements and how they span the FTO gene: intron one contains the first UCNE, intron seven contains two UCNEs, and intron eight holds the remaining seven UCNEs.

Figure 1-2: Illustration of one of the possible models of how FTO interacts with downstream targets. Here we can see that FTO contains a sequence motif, which binds ARID5B. With the reference allele, ARID5B is fully functional and interacts with the IRX3 enhancer, repressing transcription. However, when the risk allele is present within the sequence motif, transcription factor binding is lost and expression occurs. Image adapted from Figure 1 of the article by Herman et al., 2015, titled Making Biological Sense of GWAS Data: Lessons from the FTO Locus [19].

Table 1.3: There are 23 transcripts located inside FTO, with four non-protein-coding transcripts highlighted in yellow. The base pair lengths listed for these four transcripts assume that they follow traditional splicing patterns. However, due to the low Transcript Support Level associated with them, their actual structure and length might ultimately be different.

Figure 1-3: Depiction of the workflow for the expression analysis. Controls were used to assess the levels of both FTO mRNA and FTO Intron #5. Afterward, the RSEM software was used to compare the results from SRA-to-BLAST with the expression levels given by RSEM.

Figure 1-4: Pie chart showing the number of times each UCNE had the highest number of BLAST hits in an SRA experiment. UCNE #4 was the highest in 29 experiments, UCNE #7 was the highest in ten experiments, and UCNE #8 was the highest in eight experiments.

Figure 1-5: Box and whisker chart plotting the average UCNE hits for each of the tissues studied. As shown, the highest variability was seen in brain, lung, stomach, and thyroid.

Figure 1-6: Scatter plot with read spots (in millions) on the x-axis plotted against average UCNE BLAST hits on the y-axis. The r-squared value for the two variables is 0.11, or 11%, indicating a very low correlation between the two.

Figure 1-7: Scatter plot with average Intron #5 hits per experiment, normalized for length, on the x-axis and average UCNE hits per experiment, also normalized for length, on the y-axis. As shown, as the number of average hits for Intron #5 increased, so did the number of UCNE hits.

Figure 1-8: Box and whisker plot of normalized Intron #5 hits versus normalized UCNE #4 hits. We can see a substantial difference in the means of these two groups. Specifically, the mean of Intron #5 hits was 1, whereas the mean for UCNE #4 was 15. A t-test of the difference between the two groups’ means shows a p-value of 5 × 10^-8.

Table 1.4: A.) For SRA experiment SRX1603425, there was no expression for any of the four FTO transcripts. B.) For SRA experiment SRX1603427, the only transcript that had expression was ENST00000472835.1, with 0.04 TPM. C.) For SRA experiment SRX1603429, transcript ENST00000570395.1 had 0.02 TPM and ENST00000472835.1 had 0.01 TPM, while the other transcripts had no expression. D.) For SRA experiment SRX1603432, none of the transcripts had any expression.

Table 1.4 A.) SRX1603425

Transcript           TPM
ENST00000570395.1    0
ENST00000472835.1    0
ENST00000635892.1    0
ENST00000636091.1    0

Table 1.4 B.) SRX1603427

Transcript           TPM
ENST00000570395.1    0
ENST00000472835.1    0.04
ENST00000635892.1    0
ENST00000636091.1    0

Table 1.4 C.) SRX1603429

Transcript           TPM
ENST00000570395.1    0.02
ENST00000472835.1    0.01
ENST00000635892.1    0
ENST00000636091.1    0

Table 1.4 D.) SRX1603432

Transcript           TPM
ENST00000570395.1    0
ENST00000472835.1    0
ENST00000635892.1    0
ENST00000636091.1    0

Chapter 2

Investigations into Exome Sequencing Data Provide Insights into Rare Retinal Disease

2.1 Abstract

Genome alignment provides a clear way to normalize data based on a reference genome. Once the data are normalized, we can see how they differ from the consensus and conduct further analysis to process the results. One such analysis is variant calling, a method of pinpointing every variant that is present and producing a file that lists their location and frequency. We demonstrate the usefulness of this approach by analyzing three exome-sequencing datasets, two of which are from individuals suffering from a rare retinal disorder that has caused vision loss, and one from the two individuals’ mother. This heritable retinal disease has been blanketed by the term retinitis pigmentosa, which describes a wide range of retinal diseases, with no known cause of the disease having been established. By using genomic alignment and variant calling on the exome-sequencing data, we were able to look for candidate variants that could have caused this disease. We used the Genome Analysis Toolkit, Picard Tools, SAMtools, SnpEff, and SnpSift to produce an initial variant call format file, or VCF file, containing 924,444 variants. Through variant filtration procedures, we narrowed the list of candidate variants down to 1,530 single nucleotide polymorphisms and 715 insertions and deletions. After manually analyzing each of these variants, we were able to find a deletion within a retinal-related gene, RP1L1, that could be one of the causes of the disease. Our methodology and analysis show how genome alignment, variant calling, and variant filtration can provide great insight into diseases within the human genome.

2.2 Introduction

Sequencing analysis can provide a wide range of biologically relevant information to help answer complex scientific questions about the human genome. This becomes even more evident when the questions concern individuals with rare diseases. One such rare disease is retinitis pigmentosa, a blanket term for a broad range of retinal diseases that cause blindness [20]. Because of the rarity and the wide range of diseases that could have caused the condition, it is incredibly difficult to identify the genetic bases in affected individuals. However, when sequencing information is available from the affected individuals, advanced techniques can be employed in hopes of providing some answers to this genomic puzzle. One of those techniques is genome alignment followed by variant calling.

Genome alignment is a method of aligning sequencing data against a reference genome, allowing researchers to find locations where their sequencing data differ from the consensus reference. The GRCh37/hg19 human reference genome was released in February 2009 by the Genome Reference Consortium and is a highly accurate, contiguous reference assembly [21]. A newer reference assembly, GRCh38/hg38, was published in March of 2014 and improves on the accuracy of the previous reference genome [21]. Many software packages are available for genome alignment, one being the Burrows-Wheeler Aligner.

The Burrows-Wheeler Aligner, or BWA, maps low-divergent sequences against a large reference genome. It contains three algorithms: BWA-MEM, BWA-backtrack, and BWA-SW. BWA-backtrack is suitable for Illumina reads of up to 100 base pairs, while BWA-MEM and BWA-SW are used for longer sequences, ranging from 70 base pairs to 1 megabase pair [22]. The software and all the listed algorithms are based on the Burrows-Wheeler Transform, a method that is commonly used to compress and index files [23]. Once the files are aligned, variant calling can be run to find all variants within the sequencing files that differ from the reference genome. A popular variant-calling software package is the Genome Analysis Toolkit.
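To make the Burrows-Wheeler Transform more concrete, the following minimal Python sketch computes the transform of a short sequence and inverts it again. It sorts every rotation explicitly, which is only practical for toy inputs; BWA builds its index with far more memory-efficient methods. The function names and the "$" end-of-text sentinel are illustrative assumptions and are not taken from BWA.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the Burrows-Wheeler Transform of text (naive rotation sort)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    # Sort all cyclic rotations and read off the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Recover the original text by repeatedly prepending and re-sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row that ends with the sentinel is the original text.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    read = "GATTACA"
    transformed = bwt(read)
    print(transformed, inverse_bwt(transformed))
```

The transform tends to group identical characters together, which helps compression, and the sorted-rotation structure is what allows an index over the reference to be searched quickly.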

The Genome Analysis Toolkit, or GATK, is software developed by the Broad Institute that provides a wide variety of variant discovery and genotyping tools [24]. When the software is run on aligned data, it produces a variant call format file, or VCF file, that contains all the variants. However, because variant callers are usually run at low stringency, variant filtration is needed afterward to remove false positives. GATK has built-in filtration tools that can remove many of the false positives within the data. Other software, such as SnpEff, can be used to create more customized filtration as well [25]. Once all the filtration has been completed, we are left with candidate variants that can be analyzed more thoroughly to see exactly how they affect the genome of interest.

In this study, we received exome-sequencing data in the FASTQ format from three individuals: a mother and her two daughters, both of whom are afflicted with retinitis pigmentosa. To gain a better picture of what could be the cause of their disease, genome alignment and variant calling, as described above, were conducted on their sequencing files to produce a VCF file. After rounds of variant filtration, candidate SNPs and INDELs were identified and analyzed further to determine which of them could plausibly contribute to the disease.

2.2.1 RP1L1

RP1L1, or Retinitis Pigmentosa 1 Like 1, is a gene that encodes a retina-specific protein that plays important roles in photosensitivity and in the photoreceptors [26]. The gene sits on chromosome 8, is composed of three introns, and is 105,838 base pairs long [27]. There are two known transcripts for this gene: one protein-coding, named RP1L1-001, and one non-protein-coding, named RP1L1-002. Transcript RP1L1-001 is 7,973 base pairs long, and the protein that it codes for is made up of 2,400 amino acids. The non-protein-coding transcript RP1L1-002 is 1,536 base pairs long. The two transcripts have transcript support levels (TSL) of 1 and 2, respectively, denoting strong support for their existence [27]. Mutations within the RP1L1 gene have been shown to cause severe retinal issues, including retinal dystrophy [26]. RP1L1 is a polymorphic paralog of the RP1 gene, which also encodes a protein specific to retinal functions [28].

2.3 Methods

2.3.1 Genome Alignment

Raw exome-sequencing FASTQ files were provided for each of the three individuals. Whole-exome sequencing was carried out on the Illumina HiSeq platform using the Agilent SureSelect Human All Exon V5 kit. The three individuals are members of one family: P1 is the healthy mother, and P2 and P3 are the two affected daughters. The provided FASTQ files contained 100 base-pair paired-end reads, with one file holding the first read of each pair, R1, and the other holding the second, R2. To map possible single nucleotide polymorphisms, SNPs, and insertions or deletions, INDELs, that may be the cause of the retinitis pigmentosa, we ran our files through the Broad Institute's Best Practices guide for exome-sequencing files [29].

The Broad Institute's Best Practices guide involved taking our raw reads, mapping them to a reference genome, calling variants from the aligned file, and then filtering the variants to reduce false positives. The programs involved in this process were SAMtools [30], Picard Tools [31], the Burrows-Wheeler Aligner [22], the Genome Analysis Toolkit [24], SnpSift [32], and SnpEff [25]. SAMtools is a software suite for working with alignment files that contains many post-processing tools, such as reference indexing, variant calling, and an alignment viewer [33]. In our workflow, SAMtools was used only for reference indexing. Picard Tools is a Java-based program used to manipulate high-throughput sequencing data [31], and it was used in multiple steps of the process. The Burrows-Wheeler Aligner was used alongside SAMtools in the reference indexing step and to align the raw sequencing reads to the reference genome. The Genome Analysis Toolkit, or GATK, is another Java-based program, created by the Broad Institute, that encompasses a wide variety of genome analysis tools [34]. GATK was first used in this workflow to produce the variant call format file, VCF, and to run variant filtration on our data. We later also used GATK to separate insertions and deletions, INDELs, from our VCF file so that the subsequent analysis could be conducted. An overview of the entire process can be seen in Figure 2.1, and all the command lines can be viewed in Appendix B.


The first phase of the genome alignment process is referred to as reference indexing, and it begins with downloading a reference genome. Reference indexing was an essential step, as it allowed the Burrows-Wheeler Aligner to efficiently align our sequencing reads against the reference genome. The reference genome used was UCSC's GRCh37/hg19 reference, and BWA with the BWTSW algorithm was used to index it initially. A depiction of how BWA breaks up the reference genome for indexing can be seen in Figure 2.2.

SAMtools and its faidx command were then used to produce an index file that enables random access to arbitrary portions of the reference genome [33]. Once SAMtools had generated its index file, reference indexing had produced six files in total: Hg19.fasta.amb, Hg19.fasta.ann, Hg19.fasta.bwt, Hg19.fasta.fai, Hg19.fasta.pac, and Hg19.fasta.sa. Once reference indexing was completed, the next step was to produce an unmapped Binary Alignment Map file, uBAM [35]. An unmapped file was produced first because traditional mapped BAM files do not contain the proper metadata. Thus, in this step, we converted the reads to a uBAM file to record the necessary information about our sequencing files before converting them to a mapped BAM at a later step. The conversion to uBAM was carried out with Picard's FastqToSam tool.
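The random access provided by the .fai index can be illustrated with a short sketch. Each line of a .fai file holds five tab-separated values for one sequence: its name, its length, the byte offset of its first base, the number of bases per line, and the number of bytes per line. The Python below is a minimal sketch, with hypothetical helper names, of how a position on a chromosome maps to a byte offset in the FASTA file; it is not the SAMtools implementation.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    name: str        # sequence name, for example "chr8"
    length: int      # number of bases in the sequence
    offset: int      # byte offset of the first base in the FASTA file
    linebases: int   # bases on each full line
    linewidth: int   # bytes on each full line, including the newline


def parse_fai_line(line: str) -> FaiRecord:
    """Parse one tab-separated line of a .fai index."""
    name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
    return FaiRecord(name, int(length), int(offset), int(linebases), int(linewidth))


def byte_offset(record: FaiRecord, position: int) -> int:
    """Byte offset in the FASTA file of a 1-based position on the sequence."""
    if not 1 <= position <= record.length:
        raise ValueError("position lies outside the sequence")
    full_lines, column = divmod(position - 1, record.linebases)
    return record.offset + full_lines * record.linewidth + column


if __name__ == "__main__":
    # Toy index entry: 100 bases, 50 per line, header occupying 9 bytes.
    record = parse_fai_line("chrTest\t100\t9\t50\t51\n")
    print(byte_offset(record, 51))
```

Because the offset is computed arithmetically, a tool can seek directly to any region without reading the chromosomes that precede it.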

The next step in the workflow involved marking the Illumina adapters. The adapter sequences used in the initial sequencing were still present in the reads, and they had to be marked so that they could be excluded, preventing false positives from appearing in later steps. This was completed using Picard Tools' MarkIlluminaAdapters tool.

After the adapters had been marked, the files were ready to be converted back into the FASTQ format so that reference alignment could proceed. The conversion was completed using Picard Tools' SamToFastq tool. As mentioned previously, the Burrows-Wheeler Aligner was used to conduct the reference alignment, specifically with the BWA-MEM algorithm, which is the algorithm recommended by the Broad Institute [29]. Successful alignment produces a Sequence Alignment Map file, or SAM file. Because of their size, SAM files are not the preferred file format for sequence analysis. Instead, these files are converted to Binary Alignment Map files, BAM, which allow for easier handling and analysis. This conversion from SAM to BAM was done using Picard Tools' MergeBamAlignment tool.

2.3.2 Variant Calling

At this point, variant calling was conducted on the BAM files to produce a variant call format file listing all variants present in each individual's exome, including all single nucleotide polymorphisms, insertions, and deletions. To complete this process, the Broad Institute's Genome Analysis Toolkit, GATK, was employed. The first step was to run GATK's HaplotypeCaller tool in its default discovery mode, which determines the most likely alleles present within the data [36]. Because exome-sequencing data were used, HaplotypeCaller was run on specific exome coordinates to make certain that variants were not called on any low-quality intronic reads. The exome coordinate file was downloaded for the Agilent SureSelect kit and is based upon hg19 coordinates.

After variant calling had been conducted, a VCF file with the raw variants was available. However, due to the nature of HaplotypeCaller, the VCF file contained some SNPs that were false positives, so a filtration step was conducted using GATK's hard-filtering procedure. Although hard filtering is not the preferred filtration method in the Broad Institute's Best Practices guide, it was needed because of the limited cohort of samples provided within the three sequencing files [37]. Hard filtering produced two separate VCF files: one containing only single nucleotide polymorphisms and the other containing only insertions and deletions.
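The logic of hard filtering can be sketched in a few lines of Python. The thresholds below are the ones listed in the filter expressions of Appendix B; everything else, including the function names and the decision to let a missing annotation pass, is an assumption made for illustration and does not reproduce GATK's own expression evaluator.

```python
# A variant FAILS if any listed annotation crosses its threshold
# (thresholds copied from the Appendix B filter expressions).
SNP_FAIL_RULES = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}

INDEL_FAIL_RULES = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 200.0,
    "ReadPosRankSum": lambda v: v < -20.0,
}


def parse_info(info: str) -> dict:
    """Turn a VCF INFO string such as 'QD=1.5;FS=3.2;DB' into a dict."""
    fields = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        fields[key] = value
    return fields


def fails_hard_filter(info: str, rules: dict) -> bool:
    """True if any annotation present in INFO crosses its threshold."""
    fields = parse_info(info)
    for key, is_bad in rules.items():
        if key not in fields:
            continue  # assumption: a missing annotation does not fail the variant
        try:
            value = float(fields[key])
        except ValueError:
            continue  # flag-style or non-numeric entries are ignored
        if is_bad(value):
            return True
    return False


if __name__ == "__main__":
    print(fails_hard_filter("QD=1.5;FS=3.0;MQ=60.0", SNP_FAIL_RULES))
```

The separate rule sets mirror the reason SNPs and INDELs are split into two files: the same annotation, such as FS, is held to a different threshold for each variant type.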

2.3.3 Custom Variant Filtration

Unlike the variant filtration at the end of Section 2.3.2, which removed variants that did not meet specified statistical standards, the custom variant filtration removed variants that matched entries in the 1000 Genomes Phase 1 and Phase 3 datasets. This was done to drastically reduce the list of candidate variants and allow for manual analysis. It should be noted, however, that variants filtered out by these custom parameters may still have contributed to the retinitis pigmentosa at play. The custom filtration was needed to make the large datasets easier to handle, and it left only those variants that have not previously been identified. We refer to these variants as rare. To conduct this custom filtration, SnpSift and SnpEff were used to remove variants from our files that were present in 1000 Genomes Phase 1 [38] and Phase 3 [39].
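The effect of this step can be illustrated with a small Python sketch. It assumes that the VCF has already been annotated against the 1000 Genomes data, so that known variants carry an identifier in the ID column and unknown variants carry a "."; only the latter are kept. The function name is hypothetical, and the sketch stands in for, rather than reproduces, the SnpSift commands that were used.

```python
from typing import Iterable, Iterator


def keep_unannotated(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and records whose ID column (third field) is '.'."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep the VCF header untouched
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) > 2 and columns[2] == ".":
            yield line  # no known identifier, so the variant is treated as rare


if __name__ == "__main__":
    example = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t1000\trs0000001\tA\tG\t50\tPASS\t.",
        "chr8\t2000\t.\tTC\tT\t60\tPASS\t.",
    ]
    for kept in keep_unannotated(example):
        print(kept)
```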

Lastly, a custom PERL program was created for the final round of filtration to retain only variants that were present in both affected daughters, along with variants that could have arisen through compound heterozygosity.
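The first of these genotype criteria, presence in both daughters, can be sketched as follows. This is a Python illustration and not the PERL program that was used; it assumes a multi-sample VCF whose sample columns are ordered P1, P2, P3, and it does not attempt the compound heterozygosity check. All names are hypothetical.

```python
def alt_allele_count(sample_field: str, format_field: str) -> int:
    """Count non-reference alleles in a sample's GT entry (e.g. '0/1' -> 1)."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    genotype = values[keys.index("GT")]
    alleles = genotype.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


def shared_by_daughters(record: str, daughter_columns=(10, 11)) -> bool:
    """True if every daughter carries at least one alternate allele.

    Columns are 0-based: 8 is FORMAT, 9 is assumed to be the mother (P1),
    and 10 and 11 are assumed to be the daughters (P2 and P3).
    """
    columns = record.rstrip("\n").split("\t")
    format_field = columns[8]
    return all(
        alt_allele_count(columns[index], format_field) > 0
        for index in daughter_columns
    )


if __name__ == "__main__":
    record = "chr8\t2000\t.\tTC\tT\t60\tPASS\t.\tGT:DP\t0/1:30\t1/1:28\t0/1:25"
    print(shared_by_daughters(record))
```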

2.4 Results

After starting with a total of 924,444 variants, our filtration procedures retained a total of 1,530 candidate SNPs and 715 candidate INDELs. Figure 2.3 presents the variants that were filtered out during each round of variant filtration to reach the final list of candidate variants. As shown, many of the variants were removed when we filtered based upon exon coordinates. This is because the GATK software calls variants on the entire sequencing data set, including low-quality regions that may contain intronic reads. The only way to properly remove these variants is to restrict the analysis to specified exon coordinates that correspond to the reference genome being used. Another step that filtered out a large number of variants was the second filtration process, in which we used hard filters. This step removed many of the SNPs that had the highest probability of being false positives, based upon statistical tests set forth by the GATK.

Figure 2.4a and Figure 2.4b show the chromosomal distribution of variants across each round of the filtration process. On the circular plots, the number of variants decreases dramatically after each round of filtration until only the remaining candidate variants are left. As expected, larger chromosomes carried more variants throughout the process than smaller chromosomes. Comparing the two plots also shows that the SNP and INDEL distributions are most similar after the final filtration step.

From the remaining list of variants, manual analysis was conducted on the INDELs file, looking for variants that were related to retinitis pigmentosa and that affected a large portion of a gene and its encoded protein. We narrowed this list down to one possible deletion, located inside the gene RP1L1. We settled on this single candidate because it was one of the few insertions or deletions in a gene associated with retinitis pigmentosa, and because the variant removes 23 amino acids. This INDEL thus became our lead candidate.

Figure 2.5 shows the INDEL within RP1L1 in the Integrative Genomics Viewer (IGV). The RP1L1 deletion is 69 base pairs long and, per the GRCh37/hg19 reference genome, is located on chromosome 8 at position 10,465,965. The reference allele for the deletion is “TCCTTCTGCCTCTGGGGCCTCTACATCTTCTGACTCTGGCTGGGCCTCCCCTTCAGCCTCCTGGGCATCC,” while the alternative allele is T. IGV traditionally marks deletions with a “D”; however, due to the great length of this variant, IGV simply places two arrows around the alternative allele that is now present.
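The size of the deletion and the number of affected codons follow directly from the two alleles. The short Python sketch below works through that arithmetic using the reference and alternative alleles given above; the function name is hypothetical, and the codon count is a simple division that assumes the deleted bases fall entirely within coding sequence.

```python
REF_ALLELE = (
    "TCCTTCTGCCTCTGGGGCCTCTACATCTTCTGACTCTGGCTGGGCCTCCCCTT"
    "CAGCCTCCTGGGCATCC"
)
ALT_ALLELE = "T"


def deletion_summary(ref: str, alt: str) -> dict:
    """Summarize a deletion from its VCF-style REF and ALT alleles."""
    deleted = len(ref) - len(alt)  # bases lost relative to the reference
    return {
        "deleted_bases": deleted,
        "in_frame": deleted % 3 == 0,   # a multiple of 3 keeps the reading frame
        "codons_removed": deleted // 3,
    }


if __name__ == "__main__":
    print(deletion_summary(REF_ALLELE, ALT_ALLELE))
```

For these alleles the deletion spans 69 bases, a multiple of three, which is consistent with the loss of 23 amino acids without a shift in the reading frame.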

Figure 2.6 depicts the protein encoded by the entire RP1L1 gene and shows how the deletion disrupts its coding region. The figure shows how substantial the deletion is in the context of the whole gene and how many amino acids are removed as a result. The amino acids removed in individuals carrying this deletion are highlighted in yellow. Furthermore, Figure 2.6 shows how close the deletion is to the cap-terminus of the protein.

2.5 Discussion

Our filtration process ultimately led us to identify a candidate deletion in the RP1L1 gene. A search of the literature did not reveal any articles that reference this particular deletion. However, it should be noted that our filtration process was not foolproof. Even though we filtered out many SNPs and INDELs that failed our filtration criteria, this does not mean that the filtered variants have no implications for the disease. This holds especially true for the 1000 Genomes filtering step, in which variants were filtered out if they matched known variants in the 1000 Genomes Phase 1 and Phase 3 data, leaving only the unknown or rare variants intact. This type of filtration was conducted because of time constraints and the sheer number of variants we were working with. We may very well have filtered out a variant that is related to, or could be the cause of, the daughters' disease; however, the step was needed given the nature of this project and to ensure we could quickly reach the INDELs with higher probabilities of causing the disease. Furthermore, time constraints limited our ability to manually go through the list of candidate SNPs and identify those with a high probability of being linked to the disease. Thus, there could also be SNPs with serious implications that were not examined.

The next step needed to further analyze the RP1L1 deletion is to verify its existence in the three individuals. Although there is a high probability that the deletion exists, any computational software has room for error, and its existence must therefore be confirmed through subsequent analysis, such as targeted sequencing.

To further illustrate the need for confirmation, consider Figure 2.4a and Figure 2.4b once more. A small number of variants fall within the Y chromosome. Biologically, this does not make sense, since we are dealing with two daughters. However, following recommendations by the Broad Institute, our variant-calling and filtration procedures were not run at the strictest settings, since overly strict settings can cause too few variants to be called; as a result, some false positives remain in the data. This again underscores the need for additional analysis to ensure that the lead candidate variants do exist in the studied individuals.
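A simple sanity check of this kind can be automated. The Python sketch below tallies VCF records per chromosome and reports how many fall on the Y chromosome, which should be zero for female samples; the function names and the accepted chromosome labels ("chrY" and "Y") are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable


def variants_per_chromosome(vcf_lines: Iterable[str]) -> Counter:
    """Count VCF records per chromosome, skipping header and blank lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[line.split("\t", 1)[0]] += 1
    return counts


def unexpected_y_calls(counts: Counter, y_names=("chrY", "Y")) -> int:
    """Number of calls on the Y chromosome; nonzero is suspicious for females."""
    return sum(counts[name] for name in y_names)


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.1",
        "chr8\t2000\t.\tTC\tT\t60\tPASS\t.",
        "chrY\t5000\t.\tA\tG\t31\tPASS\t.",
    ]
    counts = variants_per_chromosome(example)
    print(counts, unexpected_y_calls(counts))
```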


2.6 Conclusion

Sequence alignment and variant calling are two valuable bioinformatic tools that can help identify variants within sequencing files. By using these technologies together with our custom filtration methods, we have identified a candidate deletion on chromosome 8, within the gene RP1L1, that could have contributed to the retinitis pigmentosa with which the two daughters are afflicted. We came to this conclusion because the deletion removes a substantial part of the protein encoded by RP1L1 and because of the gene's close association with retinitis pigmentosa and other retinal diseases. However, although we believe this to be an ideal candidate variant, other important variants may exist that have also contributed to the daughters' retinal disease. More research should be conducted on these sequencing files to look for other candidate variants that may have been filtered out or not considered by our methodology.


Figure 2-1: Workflow of the genome alignment and variant calling steps. The first nine steps are based on the Broad Institute’s Best Practice Guide for Exome-Sequencing Data. The last two steps were custom steps, used to produce VCF files with only rare SNPs or INDELs.


Figure 2-2: Representation of the reference indexing process using the Burrows-Wheeler Aligner. The purpose of reference indexing is to index the reference genome (in this case GRCh37/hg19) so that BWA can quickly align our raw exome-sequencing reads against it. Reference indexing utilizes eight files in total, one being the original reference genome, marked as hg19.fasta, and the other seven being files that are generated through the indexing process.


Figure 2-3: Demonstration of the filtration steps taken to produce analysis-ready VCF files. The GATK initially called a total of 924,444 variants across the three individuals. In step #2, we ran variant calling over exon coordinates to remove non-exonic variants. In step #3, we used hard filters, a technique that applies statistical thresholds so that only variants meeting them remain. In steps #4 and #5, we filtered against the 1000 Genomes Phase 1 and Phase 3 data so that only variants without SNP IDs were kept. In the last step, we filtered using custom PERL scripts so that only variants meeting the genotype criteria were kept.

Figure 2-4: Depiction of the SNP and INDEL filtration process across chromosomes. Each track of the circular plot shows the variants remaining at one step of the filtration process. Figure 2.4a shows the SNPs, while Figure 2.4b shows the INDELs.


Figure 2-5: The location of the INDEL within RP1L1, on chromosome 8, using the Integrative Genomics Viewer [35]. The reference allele is TCCTTCTGCCTCTGGGGCCTCTACATCTTCTGACTCTGGCTGGGCCTCCCCTTCAGCCTCCTGGGCATCC, while the alternative allele is T. Also shown is the Integrative Genomics Viewer's view of where this INDEL lies with respect to the gene.


Figure 2-6: Protein coding sequence of the RP1L1 gene. Highlighted in yellow are the affected amino acids due to the deletion.


References

[1] "FTO FTO, alpha-ketoglutarate dependent dioxygenase [Homo sapiens (human)] - Gene - NCBI." National Center for Biotechnology Information. U.S. National Library of Medicine, n.d. Web. 22 May 2017.

[2] Fedeles, Bogdan I., et al. "The AlkB family of Fe (II)/α-ketoglutarate-dependent dioxygenases: repairing nucleic acid alkylation damage and beyond." Journal of Biological Chemistry 290.34 (2015): 20734-20742.

[3] Bejerano, Gill, et al. "Ultraconserved elements in the human genome." Science 304.5675 (2004): 1321-1325.

[4] Dimitrieva, Slavica, and Philipp Bucher. "UCNEbase—a database of ultraconserved non-coding elements and genomic regulatory blocks." Nucleic acids research (2012): gks1092.

[5] Hertel, Jens K., et al. "FTO, type 2 diabetes, and weight gain throughout adult life." Diabetes 60.5 (2011): 1637-1644.

[6] Frayling, Timothy M., et al. "A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity." Science 316.5826 (2007): 889-894.

[7] Fischer, Julia, et al. "Inactivation of the Fto gene protects from obesity." Nature 458.7240 (2009): 894-898.

[8] Smemo, Scott, et al. "Obesity-associated variants within FTO form long-range functional connections with IRX3." Nature 507.7492 (2014): 371-375.

[9] Laber, Samantha, and Roger D. Cox. "Commentary: FTO obesity variant circuitry and adipocyte browning in humans." Frontiers in genetics 6 (2015).

[10] "Gene: FTO." Summary - Homo sapiens - Ensembl genome browser 88. N.p., n.d. Web. 22 May 2017.

[11] "Vega gene and transcript types." Vega Genome Browser. Ensembl, n.d. Web. 12 May 2017.

[12] "TSL (Transcript Support Level)." Help - Glossary - Homo sapiens - Ensembl genome browser 88. N.p., n.d. Web. 22 May 2017.

[13] Zhao, Yi, et al. "NONCODE 2016: an informative and valuable data source of long non-coding RNAs." Nucleic acids research 44.D1 (2016): D203-D208.

[14] "Home - SRA - NCBI." National Center for Biotechnology Information. U.S. National Library of Medicine, n.d. Web. 10 May 2017.

[15] Tarailo-Graovac, Maja, and Nansheng Chen. "Using RepeatMasker to identify repetitive elements in genomic sequences." Current Protocols in Bioinformatics (2009): 4-10.

[16] Hubley, Robert, et al. "The Dfam database of repetitive DNA families." Nucleic acids research 44.D1 (2016): D81-D89.

[17] ENCODE-DCC. "ENCODE-DCC/long-rna-seq-pipeline." GitHub. N.p., 04 Mar. 2017. Web. 22 May 2017.

[18] Conesa, Ana, et al. "A survey of best practices for RNA-seq data analysis." Genome biology 17.1 (2016): 13.

[19] Herman, Mark A., and Evan D. Rosen. "Making biological sense of GWAS data: lessons from the FTO locus." Cell metabolism 22.4 (2015): 538-539.

[20] "Retinal dystrophies." Kent Association for the Blind. N.p., n.d. Web. 2 May 2017.

[21] "Human Genome Browser - hg19 assembly." UCSC Genome Browser Gateway. University of California Santa Cruz, n.d. Web.

[22] "Burrows-Wheeler Aligner." Burrows-Wheeler Aligner. N.p., n.d. Web. 2 May 2017.

[23] Sirén, Jouni. "Burrows-Wheeler transform for terabases." Data Compression Conference (DCC), 2016. IEEE, 2016.

[24] "GATK Home Page." Genome Analysis Toolkit. Broad Institute, n.d. Web.

[25] SnpEff. N.p., n.d. Web. 2 May 2017.

[26] "RP1L1 RP1 like 1 [Homo sapiens (human)] - Gene - NCBI." National Center for Biotechnology Information. U.S. National Library of Medicine, n.d. Web. 1 May 2017.

[27] "Gene: RP1L1." Summary - Homo sapiens - Ensembl genome browser 88. N.p., n.d. Web. 13 May 2017.

[28] Bowne, Sara J., et al. "Characterization of RP1L1, a highly polymorphic paralog of the retinitis pigmentosa 1 (RP1) gene." Molecular vision 9 (2003): 129.

[29] Data Science & Data Engineering @ Broad Institute. "GATK Best Practices Recommended workflows for variant discovery analysis with GATK." GATK | Best Practices. N.p., n.d. Web. 1 May 2017.

[30] "SAMtools." SAMtools. N.p., n.d. Web. 22 May 2017.

[31] "Picard Tools." Picard Tools - By Broad Institute. N.p., n.d. Web. 2 May 2017.

[32] SnpSift. N.p., n.d. Web. 22 May 2017.

[33] Li, Heng, et al. "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25.16 (2009): 2078-2079.

[34] McKenna, Aaron, et al. "The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data." Genome research 20.9 (2010): 1297-1303.

[35] "BAM." BAM | Integrative Genomics Viewer. Broad Institute, n.d. Web. 08 May 2017.

[36] Van der Auwera, Geraldine. "How to Call variants with HaplotypeCaller." Broad Institute GATK Forum. Broad Institute, Feb. 2016. Web. 08 May 2017.

[37] "(howto) Apply hard filters to a call set." GATK-Forum. N.p., n.d. Web. 10 May 2017.

[38] "IGSR: The International Genome Sample Resource." Phase 1 | 1000 Genomes. N.p., n.d. Web. 5 May 2017.

[39] "IGSR: The International Genome Sample Resource." Phase 3 | 1000 Genomes. N.p., n.d. Web. 22 May 2017.

Appendix A

Experimental Results from SRA-to-BLAST

All experimental results from SRA-to-BLAST can be seen below. Listed are the specific tissue of the SRA experiment, the SRX number, the number of BLAST hits per UCNE, and BLAST hits from our controls.


Appendix B

Linux Commands for Genomic Alignment and Variant Calling

These are all the Linux commands that were used to produce the VCF files from the raw FASTQ files for the exome-sequencing project. Because sequencing data from three individuals were used, all examples are based upon the files for individual P1, P1_R1.fastq and P1_R2.fastq. The Linux commands can be downloaded in a text file via the following link: https://ydsoa.org/bioinformatics/.

Command Line 1:

java -jar picard.jar FastqToSam \
FASTQ=P1_R1.fastq \
FASTQ2=P1_R2.fastq \
OUTPUT=P1_fastqtosam.bam \
READ_GROUP_NAME=C7H7:5 \
SAMPLE_NAME=P1 \
LIBRARY_NAME=Agilent_SureSelect \
PLATFORM=Illumina \
RUN_DATE=2017-01-20T00:30:30+00:00

Command Line 2:

java -jar picard.jar MarkIlluminaAdapters \
I=P1_fastqtosam.bam \
O=P1_fastqtosam_markilluminaadapters.bam \
M=P1_snippet_markilluminaadapters_metrics.txt

Command Line 3:

java -jar picard.jar SamToFastq \
I=P1_fastqtosam_markilluminaadapters.bam \
FASTQ=P1_reverted.fastq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \
NON_PF=true

Command Line 4:

bwa mem -M -t 30 -p hg19.fasta \
P1_reverted.fastq > P1_Hg19.

Command Line 5:

java -jar picard.jar MergeBamAlignment \
R=hg19.fasta \
UNMAPPED_BAM=P1_fastqtosam.bam \
ALIGNED_BAM=P1_Hg19.sam \
O=P1_final_hg19.bam \
CREATE_INDEX=true \
ADD_MATE_CIGAR=true \
CLIP_ADAPTERS=false \
CLIP_OVERLAPPING_READS=true \
INCLUDE_SECONDARY_ALIGNMENTS=true \
MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
ATTRIBUTES_TO_RETAIN=XS

Command Line 7:

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R hg19.fasta \
-I P1_final_hg19.bam \
-L coordinates.bed \
--genotyping_mode DISCOVERY \
-stand_call_conf 30 \
-o raw_variants.vcf

Command Line 8:

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R hg19.fasta \
-V raw_variants.vcf \
-selectType SNP \
-o raw_snps.vcf

Command Line 9:

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R hg19.fasta \
-V raw_snps.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName "P1_snp_filter" \
-o filtered_snps.vcf

Command Line 10:

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R hg19.fasta \
-V raw_variants.vcf \
-selectType INDEL \
-o raw_indels.vcf

Command Line 11:

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R hg19.fasta \
-V raw_indels.vcf \
--filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
--filterName "P1_indel_filter" \
-o filtered_indels.vcf