ALLELE SPECIFIC EXPRESSION IN VARIOUS TISSUES OF GALLUS GALLUS DOMESTICUS

by

M. Joseph Tomlinson IV

A thesis submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Master of Science in Bioinformatics and Computational Biology

Fall 2018

© 2018 M. Joseph Tomlinson IV All Rights Reserved

Approved: ______Behnam Abasht, Ph.D. Professor in charge of thesis on behalf of the Advisory Committee

Approved: ______Limin Kung, Jr., Ph.D. Chair of the Department of Animal and Food Sciences

Approved: ______Mark W. Rieger, Ph.D. Dean of the College of Agriculture and Natural Resources

Approved: ______Douglas J. Doren, Ph.D. Interim Vice Provost for the Office of Graduate and Professional Education

ACKNOWLEDGMENTS

I would like to thank my graduate advisor Dr. Behnam Abasht. Without his guidance and support this project would not have been possible. I would also like to thank him for giving me the opportunity to gain extensive experience in the bioinformatics field that greatly improved my skillsets.

I would also like to thank my thesis committee members Dr. Shawn Polson, Dr. Jing Qiu and Dr. Randall Wisser for their advice and guidance on this project, which was essential in the overall success of this project.

I also express my thanks to my classmates, labmates and friends at UD, who helped encourage me throughout the journey (Michael Papah, Daniel Chazi Capelo, Felix Francis, Juniper Lake, Emma Fare, Steve Chiou, Matt Saponaro, Terence Mhora and many more names).

I would like to thank my family: my mom and step-dad, who have supported and encouraged me over the years in my academic studies, and my lovely wife, who wholeheartedly encouraged me when I went back to school at UD and who also decided to join me in my academic endeavors and earn a degree at the University of Delaware too.

Finally, I would love to dedicate this thesis to my daughter, Anna Bai Tomlinson!


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

Chapter

1 IMPORTANCE OF ALLELE SPECIFIC EXPRESSION

1.1 Prefix for Project
1.2 Importance of ASE
1.3 Types of ASE
1.4 Imprinting and ASE
1.5 Brief History of ASE
1.6 ASE in Chickens

REFERENCES

2 ALLELE SPECIFIC EXPRESSION ANALYSIS IN CHICKENS

2.1 Introduction
2.2 Materials and Methods

2.2.1 Sample Collection and Quality Control
2.2.2 Sequence Alignment and Variant Calling
2.2.3 Analyzing Unmappable Reads
2.2.4 600K Genotyping Data
2.2.5 Validating RNA-Seq Analysis
2.2.6 VCF ASE Detection Tool (VADT)

2.2.6.1 VADT – Filtering of Data
2.2.6.2 VADT – Detection of Reference Allele Bias
2.2.6.3 VADT – Binomial Testing
2.2.6.4 VADT – Statistical Analysis of Binomial Results

2.2.6.4.1 Sample Level Analysis
2.2.6.4.2 Meta-Analysis

2.2.6.5 VADT – Settings Utilized in Our Study of ASE

2.2.7 Investigation Functional Significance
2.2.8 Validation of VADT’s Robustness

2.3 Results

2.3.1 Mapping and Initial Variant Results
2.3.2 Mapping Issue Between Tissue
2.3.3 Validation of Variant Calling Pipeline
2.3.4 VADT Analysis and Results
2.3.5 Functional Significance and Pathway Enrichment
2.3.6 Identification of Tissue Specific Pathways with Strong ASE Signals
2.3.7 Identification of Robust ASE Found in All Three Tissues
2.3.8 VADT Robustness

2.4 Discussion

REFERENCES

3 FUTURE STUDIES OF ASE

3.1 Introduction
3.2 Future Developments of ASE Detection Analysis Pipeline

3.2.1 Improving VADT’s ASE Detection Model
3.2.2 Strandness of Read Counts

3.3 Future Biological Directions
3.4 Conclusion

REFERENCES

Appendix

A DERIVATION OF BINOMIAL TEST
B FALSE DISCOVERY RATE
C FISHER’S METHOD FOR COMBINED PROBABILITY
D BREAKDOWN OF INFORMATIVE VARIANT COUNTS
E EXPANDED VIEW OF DAVID RESULTS
F TOP ASE CANDIDATE GENES


LIST OF TABLES

Table 2.1: Summary statistics of STAR alignment (1st Pass) for all the samples, separated by project, that were used to create the initial VCF used for masking. The wooden breast project samples had a diversity of input lengths (*) and represent the average between the samples.

Table 2.2: Overall summary statistics from the unmappable reads for R1 and R2 (sample 47337) for all three tissues. The “Average Trimmed Length” is the length that FastqBLAST trimmed the sequence and the “Average Hit Seq Length” is the length BLAST returned for the match.

Table 2.3: Top 5 FastqBLAST count results for R1 and R2 of unmappable reads for the three tissues (breast muscle, abdominal fat and liver) for sample 47337.

Table 2.4: Summary table of the 1st Pass of STAR alignment for the three tissues with chimeric setting turned on.

Table 2.5: Comparison of variant calls between 600K Genotyping panel and RNA-seq variants. The initial total variant counts for both the panel and RNA-seq are based on variant calls after filtering and represent all high-quality variants that can be compared between the datasets.

Table 2.6: Top genes identified using VADT’s significant results and Ensembl’s VEP tool where genes show 100% ASE in all three tissues after normalization. Gene’s overall biological function summary based on GeneCards [55].

Table 2.7: Variants previously verified using Sanger sequencing and corresponding VADT results. VADT was able to identify all variants as statistically significant in their original tissue designation. Coordinates were lifted over from Gallus gallus 4.0 to Gallus gallus 5.0.

Table 3.1: Variants previously verified using Sanger sequencing and corresponding VADT results from both chicken datasets tested using VADT. Variants were previously reported by Zhuo et al. [8] as showing ASE and were verified using Sanger sequencing in the tissue designated on the left-hand side of the table. ASE statistical significance was determined by VADT. Coordinates were lifted over from Gallus gallus 4.0 to Gallus gallus 5.0 for comparison purposes using the UCSC genome browser [9].

Table B.1: Example of duplicate p-values and their corresponding corrected p-values (q-values) and overall significance based on <0.05 cutoff.

Table D.1: Summary statistics of informative variant read counts per tissue.

Table D.2: Binning of all informative variant read counts broken down by tissue. Bins refer to the total counts for a variant and reflect the number of variants with those counts.

Table E.1: Breast Muscle tissue results from DAVID, 1319 Ensembl IDs submitted and 817 Ensembl IDs matched for analysis. Only annotation cluster 1 showed statistical significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms are listed.

Table E.2: Breast Muscle/Liver tissue results from DAVID, 430 Ensembl IDs submitted and 281 Ensembl IDs matched for analysis. Only annotation cluster 1 showed statistical significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms are listed in the table.

Table E.3: Breast Muscle/Abdominal Fat tissue results from DAVID, 1320 Ensembl IDs submitted and 833 Ensembl IDs matched for enrichment. Only annotation cluster 1 showed significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms are listed in the table.

Table E.4: All three tissue results from DAVID, 1715 Ensembl IDs submitted and 1036 Ensembl IDs matched for enrichment. A total of five clusters showed significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms are listed in the table.

Table E.5: Liver tissue results from DAVID, 1057 Ensembl IDs submitted and 703 Ensembl IDs matched for enrichment. Only 2 annotation clusters are visualized; however, a total of 7 clusters showed significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms are listed in the table.

Table E.6: Breast muscle tissue results from DAVID, 543 Ensembl IDs submitted and 272 Ensembl IDs matched for enrichment. Only annotation cluster 1 showed significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms are listed in the table.

Table E.7: Abdominal fat tissue results from DAVID, 536 Ensembl IDs submitted and 243 Ensembl IDs matched for enrichment. Only annotation cluster 1 showed significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms are listed in the table.

Table E.8: Liver tissue results from DAVID, 484 Ensembl IDs submitted and 260 Ensembl IDs matched for enrichment. Only annotation cluster 1 showed significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms are listed in the table.

Table F.1: Top ASE genes identified using VADT, VEP and normalization of the variants called. Gene variants were normalized using the following: (sig. variants/informative variants * 100). Genes were considered “top” if they were identified in all three tissues (breast muscle, abdominal fat, and liver) and the average normalization value across all three tissues was >=80%.

LIST OF FIGURES

Figure 1.1: Three types of allele specific expression found in an organism: 1) monoallelic expression where only one allele is expressed, but both alleles are present, 2) allele specific bias expression, one allele is overexpressed when compared to the second allele and 3) allele specific isoform expression, one allele is expressed because exons are missing from the other isoform expressed. Arrows correspond to transcription and the larger the arrow, the greater the expression of that allele. (Image adapted from [16])

Figure 1.2: Timeline of ASE research into chickens. Solid line boxes represent chicken ASE research and dashed line boxes represent key ASE research developments for anchoring [6, 7, 15, 19, 20, 24-28].

Figure 2.1: Samples used in this ASE study. All samples were used to create a reference VCF for masking of genome, but only feed efficiency samples were utilized in ASE analysis. *Samples utilized in study by Zhuo et al (2015) [30]. **Samples utilized in study by Zhou et al (2015) [29]. ***Samples utilized in study by Mutryn et al (2015) [31].

Figure 2.2: RNA-seq variant calling pipeline. The pipeline consists of three parts: part 1, the initial calling of variants using an unmasked genome; part 2, the use of those variants to create a global g.vcf file that was used to create a masked genome; and part 3, repeating the variant calling pipeline but aligning with the masked genome to help remove reference allele bias [24, 25, 33-38].

Figure 2.3: All feed efficiency samples from the various tissues plotted by their % chimeric reads versus % unmappable reads. Chimeric reads and unmappable reads were reported by STAR aligner.

Figure 2.4: Zoomed in view of Figure 2.3 that shows only breast muscle samples plotted by % chimeric reads versus % unmappable reads with corresponding linear regression equation and R².

Figure 2.5: Matrix comparisons of genotype calls for 600K versus RNA-seq for feed efficiency samples. On the y-axis is the 600K genotyping dataset and on the x-axis the RNA-Seq dataset. Diagonal lines (highlighted in yellow) represent self to self.

Figure 2.6: VCF ASE Detection Tool (VADT) pipeline for all three tissues analyzed (liver, abdominal fat and breast muscle). The pipeline consists of three major steps (blue outlined boxes): filtering using GATK (user defined filters), quality control filters and the actual statistical analysis for ASE variants.

Figure 2.7: Venn diagram comparing overlapping variants among the three tissues. On the left are variants considered informative and on the right are variants found to be statistically significant for ASE. On average the informative variants overlap by 32.5% and the significant variants only overlap by 3.7%.

Figure 2.8: Ensembl’s VEP summary statistics for significant variants for all three tissues. Downstream variants specifically refer to variants found within 5,000 bases of the end of the gene. Downstream variants most likely represent issues with the current genome annotation found in VEP and the average distance for downstream variants was found to be 1695.79 nt for breast muscle, 1775.12 nt for abdominal fat, and 1670.81 nt for liver.

Figure 2.9: Venn diagram comparing overlapping genes among the three tissues. On the left are genes identified from the informative variants using VEP and on the right are genes identified from significant variants using VEP.

Figure 2.10: Pathway enrichment of statistically significant ASE genes. Genes were identified using Ensembl’s VEP and overlapping groups were broken apart. Ensembl gene IDs were uploaded into DAVID for pathway enrichment. The pathway enrichment results (rectangular boxes) do not represent all the possible gene clusters identified and have been simplified for brevity and visualization purposes (see Appendix E). The cutoff for significance was an FDR p-value <0.1 and Enrichment Score >1.3.

Figure 2.11: Comparison of top normalized genes and corresponding pathway enrichment. Variant counts for the identified genes were normalized (significant variants/informative variants) and genes with scores >=50% were compared between tissues. Only tissue specific genes were uploaded to DAVID for pathway enrichment due to small sample size. A full expansion of the identified clusters with enrichment terms can be seen in Appendix E. The cutoff for significance was an FDR p-value <0.1 and Enrichment Score >1.3.

Figure 2.12: Network analysis of genes enriched for ASE in all three tissues. Genes that showed an average combined ASE score of 50% after normalization in all three tissues were submitted to STRING (n = 208) to examine interactions. The largest network identified in the analysis was ribosomal genes (zoomed in image). There were a total of 176 nodes and 374 edges, with an average node degree of 4.25 and a PPI enrichment p-value of <1.0e-16 (significant interactions identified).

Figure 2.13: Venn diagram comparing VADT results versus the results from Zhuo et al. (2017) in brain and liver tissue. VADT_Sig refers to variants considered significant by VADT and VADT_Informative refers to variants considered informative by the program. Overall VADT was able to capture the majority of previously published ASE variants.

Figure 3.1: Comparison of variants considered statistically significant by VADT from the study by Zhuo et al. [8] and our ASE study of the BCD cross. Venn diagram was produced using an online bioinformatics tool [10].

Figure A.1: Example Python code to perform modified binomial test using SciPy.

Figure B.1: Comparing variants considered significant by two different FDR methods when analyzing breast muscle tissue using VADT. The less stringent FDR assigned the most significant q-value to all duplicate p-values and the more stringent FDR assigned the least significant q-value to duplicate p-values.

Figure B.2: Example Python code to perform FDR that outputs adjusted p-values (q-values). The results are reported as a “key:value” dictionary where the original value is the key and the q-value is the value. All original values highlighted yellow are considered significant because their q-values are <0.05.

Figure C.1: Example Python code to combine p-values using SciPy. The first value reported is the chi-square value and the second value is the actual p-value.

ABSTRACT

Allele specific expression (ASE) is the process where one allele in a heterozygous individual is expressed at a higher level than the other when equal expression is expected. ASE is of great interest because it helps identify cis-regulatory mutations or epigenetic modifications that influence gene expression. In this study, the effect of these cis-regulatory elements is investigated by examining single nucleotide polymorphisms (SNPs) from RNA-seq. Identified SNPs may then be utilized in future animal breeding programs. We performed RNA-sequencing on 100 samples collected from various populations of chickens. We then followed the Genome Analysis Toolkit’s (GATK) “Best Practices for Variant Calling on RNA-seq” using recommended settings. We aligned the sequence reads to the chicken reference genome sequence (Gallus gallus 5.0) from Ensembl. Based on the first round of STAR alignment, the average number of reads was 36,196,012 with an average mapping rate of 85%. We identified a total of 3,147,284 variants (SNPs and indels), then used these variants to mask the reference genome used for alignment and re-ran the pipeline. The final variants from a large subset of the prior samples (n = 68), which consisted of samples from different tissues (breast muscle, abdominal fat, liver) but from the same population, were examined for ASE. ASE analysis was performed using custom analysis software called the VCF ASE Detection Tool (VADT), where VCF refers to a variant call file. VADT automatically filters the variant data based on user parameters, detects ASE using a binomial test and automatically performs statistical correction of the results. On average, ~174,000 SNPs in each tissue passed our filtering criteria and were considered informative, of which ~24,000 (~14%) showed ASE. The overlap of ASE SNPs among the three tissues was only 3.7%, with ~83% of variants showing tissue specificity. When variants are mapped to genes and the genes are compared between the tissues, this overlap increases to 20.1%. Overall, ASE genes showed enrichment for tissue specificity as well as for pathways involved with translation and ribosomes. When the top ASE genes that were found in all three tissues were examined, there was a clear enrichment for KEGG pathways involved in ribosomes and metabolism.


Chapter 1

IMPORTANCE OF ALLELE SPECIFIC EXPRESSION

1.1 Prefix for Project

Poultry, which includes chickens, turkeys, ducks and other fowl, represents over 32% of meat intake in the USA [1]. The broiler chicken industry in the U.S. alone is worth $65-95 billion [2], with the Delmarva (Delaware, Maryland and Virginia) peninsula comprising over $3.4 billion of this market [3], an astounding output given its size. Despite the importance of chickens in our economy and food systems and the immense amount of breeding they have undergone, many biological mechanisms affecting the birds’ health and financial value are still poorly understood. The focus of this thesis is on one of these biological mechanisms, allele specific expression.

Allele specific expression (ASE) is the process where one allele in a heterozygous individual is expressed at a higher level than the other. ASE can also be referred to as allelic imbalance, differential allelic expression and allelic bias, but for simplicity it will hereafter only be referred to as ASE [4]. For a typical heterozygous gene, it is expected that both alleles are expressed in a 1:1 ratio; however, this is not always the case and some alleles are expressed at higher levels. There are many biological causes of ASE, but it is generally believed that the mechanism is cis-acting because of the localized influence on allele expression and the expression of both alleles [5, 6].

The main goal of this project was to investigate ASE among various tissues of chickens; however, as the project progressed, it became apparent that no bioinformatics tool existed that could easily detect ASE from a VCF file. As a result, the project evolved to have two main goals: 1) create a robust analysis tool to detect ASE and 2) investigate the overall functional significance of ASE in various tissues of chickens.

1.2 Importance of ASE

Allele specific expression has seen increased interest among researchers because it offers a new functional mechanism to improve our understanding of the genome. Other tools used to investigate the genome, such as genome wide association studies (GWAS) and expression quantitative trait loci (eQTL) studies, have proven extremely helpful in elucidating disease risks, but both have limitations, which are discussed below [7]. Genome wide association studies have helped identify risk factors for diseases, but many important variants detected by GWAS are located in intronic regions, gene deserts and regions of large linkage disequilibrium (LD) [8, 9]. Fine mapping genomic regions in strong LD to find true causative variants is time consuming and often incapable of identifying the true causative variant in an LD block [10]. Also, many diseases are associated with a large number of loci, each contributing a small amount of risk [8]. For many of these complex diseases, the risk factors that ultimately cause the disease are still being elucidated and may follow an “omnigenic” model where many gene pathways are implicated [11, 12].

In eQTL studies, a specific single nucleotide polymorphism (SNP) or genetic marker is investigated to see if it is associated with a measurable change in expression of a specific gene. It was hypothesized that cis-regulatory mechanisms would be identified in many of these studies. However, many eQTL studies have not only identified proximal “cis” regulatory mechanisms, but have also found many trans-acting mechanisms (incorrectly classified as cis), complicating the analysis of the overall mechanism [13]. Here, cis refers to a mechanism that acts locally in the same genomic region as the gene of interest and trans refers to a distant mechanism that affects the gene of interest [14]. Expression quantitative trait loci studies also have difficulty with mapping resolution (the variant is in a large LD block) and with detecting small effects [15]. As useful as eQTL studies have been, they cannot answer all of the questions about the underpinnings of gene expression.

Allele specific expression analysis is a relatively new tool in this ongoing investigation of causal variants in the genome. The major benefit of ASE analysis when compared to GWAS and eQTL studies is its ability to identify alleles associated with a measurable change in expression using a few individuals, or even a single individual. Understanding ASE is also very important economically: because the molecular mode of action is better understood than for GWAS, ASE findings can be directly implemented in agricultural breeding programs (selection based on ASE variants) [7]. At the same time, the functional implications of ASE need to be further investigated to better understand its influence in an organism.

1.3 Types of ASE

There are three types of ASE that can occur at any given heterozygous locus (Figure 1.1). The first type is monoallelic expression, where only one allele is expressed and the other allele is completely silenced. The second type is allele-specific bias, where one allele is expressed at a higher level than the other allele. The third type is allele specific isoform expression, where the alleles of a heterozygous locus are associated with different isoforms of a gene. Allele specific expression may not be consistent across cells and can also differ based on the tissue or developmental stage of an organism [16].

Figure 1.1: Three types of allele specific expression found in an organism: 1) monoallelic expression where only one allele is expressed, but both alleles are present, 2) allele specific bias expression, one allele is over expressed when compared to the second allele and 3) allele specific isoform expression, one allele is expressed because exons are missing from the other isoform expressed. Arrows correspond to transcription and the larger the arrow, the greater the expression of that allele. (Image adapted from [16])


1.4 Imprinting and ASE

Imprinting is the process where an allele’s expression in an offspring is silenced or otherwise altered due to epigenetic modification inherited from a specific parent [17]. Imprinting constitutes a subset of ASE that merits brief discussion. Genetic imprinting is associated with some diseases, such as Prader-Willi and Angelman syndromes, which are caused by deletion of a gene cluster at 15q11-q13, a region that contains the imprinted UBE3A gene. Which parent the deletion comes from determines the disease the child develops: if the deletion is inherited from the mother, the child has Angelman syndrome, and if it is inherited from the father, the child has Prader-Willi syndrome [18]. Imprinting is not examined further in this thesis because the study did not include parental information. It is also possible that imprinting does not occur in chickens; a recent study by Zhuo et al. (2017) found no evidence of it [19].

1.5 Brief History of ASE

Allele specific expression studies are becoming increasingly common in scientific research due to the limitations of GWAS and eQTL studies, where pinpointing a quantifiable trait to an exact location in the genome is impossible. This explosion of literature can be seen in a simple Google Scholar search (May 2018), where over 1,100,100 results are returned for a search of “allele specific expression” and “”. ASE was first discovered in humans in 2002 using Centre d’Etude du Polymorphisme Humain (CEPH) samples [20], the same year that ASE was also discovered in mice [21]. Allele specific expression was then found in the plant world in 2004 in maize (Zea mays) hybrids by Guo et al. [22]. Although ASE was identified in various genes, no global study had been performed that interrogated the entire genome until 2008, when ASE was reported to occur globally in a high throughput screen by Serre et al. (2008) [23]. At that time, prior knowledge about a genetic region was required to design probes and micro-arrays to investigate ASE, which severely limited overall detection capabilities. In 2009, techniques were developed to identify ASE using RNA-seq data [24], which allowed for greater flexibility because no background knowledge was needed about a genomic region and the entire transcriptome could be interrogated. Since then, RNA-seq has become the gold standard for detecting ASE, and many studies still use the techniques pioneered by Degner et al. for analyzing and detecting ASE variants.

1.6 ASE in Chickens

The first ASE study in chickens was reported in 2006, only four years after the first ASE study was reported in humans [20, 25]. Since then, ASE research in chickens has progressed steadily. A full timeline of the ASE research in chickens, with key developments in the field (non-poultry based research), can be seen in Figure 1.2.


Figure 1.2: Timeline of ASE research into chickens. Solid line boxes represent chicken ASE research and dashed line boxes represent key ASE research developments for anchoring [6, 7, 15, 19, 20, 24-28].

ASE was first reported in chickens in 2006 specifically in the hsp70 gene (heat shock protein) in chickens under heat stress [25]. Then in 2011, ASE was found to occur in a cis-regulatory element of Sonic hedgehog (Shh) associated with limb development in Silkie chickens [26]. After these groundbreaking studies, the field shifted to focus more on ASE associated with disease and has continued down this route for the last five years.

At around the same time that the Dunn et al. study was being performed with Silkie chickens in 2011, two different studies were investigating Marek’s disease in chickens. Marek’s disease virus (MDV) is an oncogenic herpesvirus that largely affects chickens and can cause huge financial losses for the poultry industry when outbreaks occur. The disease is generally controlled by vaccines, but the virulence of the disease has increased in the last 25 years and the virus is constantly evolving to escape vaccine development [7, 29]. Both studies found evidence of ASE in chickens when challenged with Marek’s disease [6, 15, 27]. Additional ASE studies into Marek’s disease were performed with broiler and layer chickens in 2013 and successfully identified thousands of ASE SNPs [28]. Ultimately, all of these studies culminated in a proof of concept by Cheng et al., who found that ASE SNPs in chickens were extremely good markers for selecting for resistance to Marek’s disease. In this study, they were able to reduce the incidence of Marek’s disease by 22% in progeny based on selection of ASE SNPs that were also in eQTL with disease resistance [7]. This study by Cheng et al. was the first of its kind in poultry breeding programs and demonstrates the applicability of directly using ASE SNPs for animal selection.

In 2017, an ASE study was performed on chickens in an unchallenged situation (no disease) by Zhuo et al. This study specifically looked at the brain and liver tissue of embryos and identified 5,197 ASE SNPs in the brain (18.3% of genes) and 4,638 ASE SNPs in the liver (17.3% of genes) [19]. However, no further functional investigation of these SNPs occurred, and possible mechanisms of action were not explained. It should be noted that the final VCF file from this study was later used for validation of the VCF analysis tool described in Chapter 2.

Overall, there are many unexplored paths in research relating to ASE in chickens. Only one study has focused on the global behavior of ASE in chickens, and it was limited to two tissues and did not follow up on the significance of these ASE SNPs [19]. There is the potential to discover more ASE SNPs in many different tissues of chickens. Identifying these SNPs is important because they could have a direct impact on chicken breeding programs when selecting for traits like feed efficiency, meat yield or disease resistance. Gaining a better understanding of how ASE affects various tissues could be extremely important in understanding the mechanism of ASE globally in an organism.


REFERENCES

1. Daniel, C.R., et al., Trends in meat consumption in the USA. Public Health Nutr, 2011. 14(4): p. 575-83.

2. National Chicken Council. Broiler Chicken Industry Key Facts 2018. 2018 [cited 2018 5-16-2018]; Available from: http://www.nationalchickencouncil.org/about-the-industry/statistics/broiler-chicken-industry-key-facts/.

3. Delmarva Poultry Industry, Inc. DPI Facts and Figures. 2018 [cited 2018 5-09-2018]; Available from: http://www.dpichicken.org/facts/facts-figures.cfm.

4. Spillane, C. and P.C. McKeown, Plant Epigenetics and Epigenomics. 2014, New York: Springer Science and Business Media.

5. Marsh, S., Pyrosequencing Protocols. Detection of allelic imbalance in gene expression using pyrosequencing, ed. H. Wang and S.C. Elbein. 2007, Totowa, NJ: Humana Press.

6. Meydan, H., et al., Allele-specific expression analysis reveals CD79B has a cis-acting regulatory element that responds to Marek's disease virus infection in chickens. Poult Sci, 2011. 90(6): p. 1206-11.

7. Cheng, H.H., et al., Fine mapping of QTL and genomic prediction using allele-specific expression SNPs demonstrates that the complex trait of genetic resistance to Marek's disease is predominantly determined by transcriptional regulation. BMC Genomics, 2015. 16: p. 816.

8. Maurano, M.T., et al., Systematic localization of common disease-associated variation in regulatory DNA. Science, 2012. 337(6099): p. 1190-5.

9. Bartonicek, N., et al., Intergenic disease-associated regions are abundant in novel transcripts. Genome Biol, 2017. 18(1): p. 241.

10. Tomlinson IV, M.J., et al., Fine mapping and functional studies of risk variants for type 1 diabetes at 16p13.13. Diabetes, 2014. 63: p. 4360-4368.


11. Boyle, E.A., Y.I. Li, and J.K. Pritchard, An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell, 2017. 169(7): p. 1177-1186.

12. Robinson, M.R., N.R. Wray, and P.M. Visscher, Explaining additional genetic variation in complex traits. Trends Genet, 2014. 30(4): p. 124-32.

13. Hasin-Brumshtein, Y., et al., Allele-specific expression and eQTL analysis in mouse adipose tissue. BMC Genomics, 2014. 15(471): p. 1-13.

14. Rockman, M.V. and L. Kruglyak, Genetics of global gene expression. Nat Rev Genet, 2006. 7(11): p. 862-72.

15. Maceachern, S., et al., Genome-Wide Identification and Quantification of cis- and trans-Regulated Genes Responding to Marek's Disease Virus Infection via Analysis of Allele-Specific Expression. Front Genet, 2011. 2: p. 113.

16. Gregg, C., Known unknowns for allele-specific expression and genomic imprinting effects. F1000Prime Rep, 2014. 6: p. 75.

17. Pfeifer, K., Review Article: Mechanisms of Genomic Imprinting. Am. J. Hum. Genet., 2000. 67: p. 777-787.

18. Strachan, T. and A. Read, Molecular Genetics. 2011: Garland Science.

19. Zhuo, Z., S.J. Lamont, and B. Abasht, RNA-Seq Analyses Identify Frequent Allele Specific Expression and No Evidence of Genomic Imprinting in Specific Embryonic Tissues of Chicken. Sci Rep, 2017. 7(1): p. 11944.

20. Yan, H., et al., Allelic variation in human gene expression. Science, 2002. 297(5584): p. 1143.

21. Cowles, C.R., et al., Detection of regulatory variation in mouse genes. Nat Genet, 2002. 32(3): p. 432-7.


22. Guo, M., et al., Allelic variation of gene expression in maize hybrids. Plant Cell, 2004. 16(7): p. 1707-16.

23. Serre, D., et al., Differential allelic expression in the human genome: a robust approach to identify genetic and epigenetic cis-acting mechanisms regulating gene expression. PLoS Genet, 2008. 4(2): p. 1-16.

24. Degner, J.F., et al., Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics, 2009. 25(24): p. 3207-12.

25. Zhen, F.S., et al., Tissue and allelic-specific expression of hsp70 gene in chickens: basal and heat-stress-induced mRNA level quantified with real-time reverse transcriptase polymerase chain reaction. Br Poult Sci, 2006. 47(4): p. 449-55.

26. Dunn, I.C., et al., The chicken polydactyly (Po) locus causes allelic imbalance and ectopic expression of Shh during limb development. Dev Dyn, 2011. 240(5): p. 1163-72.

27. Maceachern, S., et al., Genome-wide identification of allele-specific expression (ASE) in response to Marek's disease virus infection using next generation sequencing. BMC Proc, 2011. 5 Suppl 4: p. S14.

28. Perumbakkam, S., et al., Comparison and contrast of genes and biological pathways responding to Marek's disease virus infection using allele-specific expression and differential expression in broiler and layer chickens. BMC Genomics, 2013. 14(64): p. 1-10.

29. Osterrieder, N., et al., Marek's disease virus: from miasma to model. Nat Rev Microbiol, 2006. 4(4): p. 283-94.


Chapter 2

ALLELE SPECIFIC EXPRESSION ANALYSIS IN CHICKENS

2.1 Introduction

Allele specific expression (ASE) is the process where one allele in a heterozygous individual is expressed at a higher level in comparison to the other when equal expression is expected. The phenomenon is of great interest because it helps in the identification of cis-regulatory mutations or epigenetic modifications that influence gene expression. ASE is extremely useful because it helps overcome some of the limitations of GWAS and eQTL studies. GWAS studies, while important for identifying loci associated with disease, have limitations because many hits are found in large LD blocks, introns, intergenic regions and gene deserts and have unknown effects [1, 2].

It is also speculated that GWAS miss many variants that would require much greater statistical power to identify [3, 4]. Expression quantitative trait loci (eQTL) studies have been helpful in identifying possible mechanisms of action for many variants, but it is speculated that many sites may be incorrectly classified as cis or trans [5]. Currently, there is an initiative to perform large-scale eQTL studies (GTEx Consortium) and integrate the results with GWAS and ASE data to better understand genetic risk factors for diseases [6].

Research in chickens is extremely important because the market for broilers in the U.S.A. is estimated to be worth $65-$95 billion and helps create over 1.6 million jobs [7]. There have been numerous studies of ASE in chickens, some looking at specific genes and others at the overall response to disease [8-15]. One of the latest studies, by Cheng et al. (2015), found that ASE SNPs can be used as markers to select for resistance to Marek’s disease in chickens [8]. Another study looked at ASE in the brain and liver of chicken embryos and found that ASE affected ~17.8% of genes [14]; however, ASE has not been characterized in other economically important tissues like breast muscle.

Currently, the most straightforward way to detect ASE is the binomial test, which was first implemented by Degner et al. (2009) for the first RNA-Seq ASE analysis [16]. Depending on the type of analysis being performed, various software packages are available to analyze ASE; a comprehensive summary of the currently available packages can be found in a review by Gu and Wang (2015) [17]. However, there is currently no software package that performs ASE detection on a per-sample basis from a standard VCF file. Bioinformatics pipelines for ASE detection require customized input files, which are not trivial to create and often require additional information such as haplotype blocks, which is not available for many non-model organisms [18-20].

Detection of ASE can be summarized in four major stages: sequencing of RNA, alignment of the sequencing reads, variant calling and finally detection of ASE variants. During the first stage, RNA is isolated from the tissue of interest using an mRNA isolation kit, the mRNA is converted to cDNA using a reverse transcriptase enzyme, and the library is sent for sequencing. In the second stage, the sequenced reads are aligned to the reference genome and various quality controls are applied to the aligned reads so that only high-quality reads are retained. Next, variants are called from the aligned reads. Variants are identified by examining the aligned reads for differences from the reference sequence, such as SNPs and indels. Finally, the variants are examined for ASE using a statistical analysis program that looks at biallelic variants and examines the base counts for those alleles.

During ASE analysis, it is important to account for reference allele bias, a phenomenon in which the reference allele is favored over the alternative allele during alignment, which ultimately affects the base counts for a variant [16]. Bias tends to occur in areas with clusters of alternative alleles [21]. In a study of the HLA gene by Brandt et al. (2015), 18.6% of variant calls for the gene were incorrect and ~25% of frequencies were incorrect due to reference allele bias [21]. Various methods of correcting reference allele bias have been suggested, such as creating a new style of reference genome that captures all variant information [22], creating phased genotypes for alignment [22], correcting for reference allele frequencies in a sample [23] and masking the genome with an alternative allele [16]. The model proposed by Degner et al. (2009) [16] used an alternative allele because technological limitations at the time permitted only A, T, C and G. Building on that model, it is now possible to place N’s (designating an unknown base) in the genome using a tool like BEDTools [24]. This creates a new reference genome in which all known variation is masked, hence the name “masked genome”. This technique was utilized by Zhuo et al. (2017) [14] with the RNA-seq aligner STAR, where alleles aligned at these N positions were neither penalized nor rewarded in the alignment process, which ultimately improves the final base counts for a variant [25].
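The masking idea can be illustrated with a few lines of Python. This is a minimal sketch only and not how BEDTools is implemented; the function name `mask_sequence` and the toy contig are hypothetical, and variant positions are assumed to be 1-based as in a VCF file.

```python
def mask_sequence(sequence, variant_positions):
    """Return a copy of `sequence` with an N at every 1-based variant position."""
    bases = list(sequence)
    for pos in variant_positions:
        if not 1 <= pos <= len(bases):
            raise ValueError(f"position {pos} is outside the sequence")
        bases[pos - 1] = "N"  # convert the 1-based VCF position to a 0-based index
    return "".join(bases)


# Toy example (hypothetical data): mask SNPs at positions 3 and 7 of a 10-base contig
masked = mask_sequence("ACGTACGTAC", [3, 7])
print(masked)
```

Because both alleles of a masked site mismatch the N equally, neither allele is favored when reads are aligned to the masked sequence.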


This study investigates ASE in three different tissue types (liver, fat and breast muscle) from multiple chicken lines. Variants were detected with a two-step process. First, sequences were aligned using the tools specified in the Genome Analysis Toolkit’s (GATK) best practices [26-28]. The genome was then masked using these results, and sequences were re-aligned to the masked genome to produce final variant calls in the form of a variant call format (VCF) file. ASE variants were then detected using a custom bioinformatics tool called “VCF ASE Detection Tool” (VADT), after which the functional significance of ASE variants was explored.

2.2 Materials and Methods

2.2.1 Sample Collection and Quality Control

Samples used in this study came from two different projects, feed efficiency and Wooden Breast Disease; work on some of these samples has been previously published [29-31]. A full breakdown of all the samples utilized in this study can be seen in Figure 2.1. All of the feed efficiency samples came from a cross of three lines called BCD and were collected from the liver, abdominal fat and breast muscle of the same birds. The wooden breast samples came from two lines of chickens called B and C and also from the BCD cross line. These samples were collected from breast muscle and liver. All of the feed efficiency samples were prepared using Illumina’s TruSeq Stranded mRNA kit and were sequenced at 2 x 75 cycles on an Illumina HiSeq 2000 sequencer. The wooden breast project samples were prepared using either a TruSeq Stranded mRNA kit or a TruSeq mRNA kit and were sequenced at 2 x 75, 2 x 101 or 2 x 102 cycles on an Illumina HiSeq 2000 or HiSeq 2500 sequencer. All fastq files were reviewed for sequence quality using FastQC [32].

Feed Efficiency Project BCD Line Liver (n = 23) * BCD Line Abdominal Fat (n = 22) ** BCD Line Breast Muscle (n = 23)

Wooden Breast Project B and BCD Line Breast Muscle (n = 13) *** C Line Breast Muscle (n = 11) C Line Liver (n = 8)

Figure 2.1: Samples used in this ASE study. All samples were used to create a reference VCF for masking of genome, but only feed efficiency samples were utilized in ASE analysis. *Samples utilized in study by Zhuo et al (2015) [30]. **Samples utilized in study by Zhou et al (2015) [29]. ***Samples utilized in study by Mutryn et al (2015) [31].

2.2.2 Sequence Alignment and Variant Calling

All sequencing reads that passed FastQC were analyzed using GATK’s Best Practices for Calling Variants in RNA-seq [26-28, 33]. This consisted of aligning sequencing reads using STAR (version 2.5.2b) with the two-pass setting [25, 34] to the Gallus gallus 5.0 genome downloaded from Ensembl [35, 36]. The index file initially used for STAR was created using gtf file 86 from Ensembl [36, 37]. After alignment, duplicate reads were marked using Picard (version 2.8.1) [38]. Next, the data were run through GATK (version 3.7), which includes “Split’N’Trim,” “Base Recalibration” and variant calling using HaplotypeCaller (using the g.vcf setting). The g.vcfs from all the samples listed in Figure 2.1 were merged to create a global vcf file for masking. The global vcf file was then used to mask the reference genome using BEDTools [24]. All the samples were re-aligned to the masked genome and re-processed using all the previously mentioned steps to call variants. A full summary of this entire process can be seen in Figure 2.2. The final g.vcf for the feed efficiency samples was created by merging samples by tissue and finally filtering variants with “Fisher Strand > 30.0”, “Quality Depth < 2.0” and “Depth < 100” using GATK (the program flags these variants as failing). Samples from the wooden breast project were excluded from the remainder of the analysis to minimize confounding factors, such as the methods used for cDNA library preparation, sequencing and lines of chickens, which could introduce unknown bias.
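The three hard filters can be expressed as a simple predicate. The sketch below is illustrative and is not GATK code; it assumes the INFO annotations are available in a dictionary under the keys `FS` (Fisher Strand), `QD` (Quality by Depth) and `DP` (Depth), and the function name `failed_filters` is hypothetical.

```python
def failed_filters(info, fs_max=30.0, qd_min=2.0, dp_min=100):
    """Return the names of the hard filters a variant fails (empty list means PASS)."""
    failed = []
    if info["FS"] > fs_max:   # Fisher Strand > 30.0
        failed.append("FS")
    if info["QD"] < qd_min:   # Quality Depth < 2.0
        failed.append("QD")
    if info["DP"] < dp_min:   # Depth < 100
        failed.append("DP")
    return failed


# Hypothetical variants for illustration
good_variant = {"FS": 1.2, "QD": 15.0, "DP": 850}
bad_variant = {"FS": 45.0, "QD": 1.1, "DP": 850}
print(failed_filters(good_variant), failed_filters(bad_variant))
```

A variant with an empty result would keep its "PASS" flag, while any non-empty result corresponds to a variant flagged as failing.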


Part 1. Initial Calling of Variants Based on GATK

Part 2. Merge All VCF Files and Mask Genome

Part 3. Call Variants using Masked Genome


Figure 2.2: RNA-seq variant calling pipeline. The pipeline consists of three parts: part 1. the initial calling of variants using an unmasked genome, part 2. variants were used to create a global g.vcf file that was used to create a masked genome and part 3. repeating the variant calling pipeline but aligning with the masked genome to help remove reference allele bias [24, 25, 33-38].

2.2.3 Analyzing Unmappable Reads

Unexpected alignment rates in some of the samples warranted further investigation of potential alignment issues. To characterize unmappable reads in a high-throughput, systematic manner, a custom python program called FastqBLAST was developed in the lab; it remotely submits sequences to NCBI’s BLAST functions from a personal computer using the Biopython package [39]. More specifically, the program takes in a fastq file and randomly selects a sample of sequences based on a user-defined parameter. The program then trims end bases (<20 base quality score) until passing bases are identified; if no passing bases are identified, the entire sequence is trimmed. The program then BLASTs the filtered sequences and retrieves the top hit from the BLAST results. Additional gene information is then retrieved using NCBI’s EFetch function and the results are merged. Finally, FastqBLAST tallies all the results and prints out summary reports of the BLAST results and EFetch results. For this experiment, we adjusted FastqBLAST’s parameters to closely match NCBI’s Megablast parameters and verified that the results from both “blasters” were comparable using a test dataset (results not shown). We used FastqBLAST to blast ~1000 fastq sequences from each of our unmappable reads files from the first STAR alignment and limited the BLAST organism to Gallus gallus.
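The sampling and end-trimming steps described above can be sketched as follows. This is not the FastqBLAST source code; the function names are hypothetical, quality strings are assumed to be Phred+33 encoded, and trimming is assumed to be applied to both ends of the read.

```python
import random


def trim_ends(seq, qual, min_q=20, offset=33):
    """Trim bases with quality below `min_q` from both ends of a read.

    Returns an empty string when no base passes the threshold.
    """
    scores = [ord(c) - offset for c in qual]  # assumes Phred+33 encoding
    start, end = 0, len(seq)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end]


def sample_reads(reads, n, seed=0):
    """Randomly select up to `n` reads; the fixed seed only makes the sketch repeatable."""
    rng = random.Random(seed)
    return rng.sample(reads, min(n, len(reads)))


# '#' encodes quality 2 and 'I' encodes quality 40 under Phred+33
print(trim_ends("ACGTAC", "##IIII"))
```

Only low-quality bases at the ends are removed; a low-quality base surrounded by passing bases is kept in this sketch.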

2.2.4 600K Genotyping Data

All the samples were genotyped with the ThermoFisher Axiom Chicken Genotyping Array [40]. The raw genotyping data (cel files) were analyzed using Axiom Analysis Suite Software (version 3.0.1 64 bit) with the Gallus gallus 5.0 genome (downloaded from Axiom server) following the software’s Best Practices Workflow using recommended settings for agricultural animals [41, 42]. The final results were exported, including a raw VCF of all the genotype calls and a txt file of all variants with >= 97% call rate. The txt file was utilized to filter low quality variants from the raw VCF.

2.2.5 Validating RNA-Seq Analysis

To validate the RNA-Seq pipeline, 600K genotyping data was compared to the RNA-Seq merged g.vcf files using a custom python script that performs filtering and matching of two vcf files. The 600K genotyping data was filtered to remove all variants with <97% call rates and all non-overlapping samples. The RNA-Seq data was filtered to remove all variants that failed the previously mentioned GATK filters, variants within 75 base pairs of INDELs, variants with a quality score < 20 and samples with read counts < 20. The final filtered datasets were then compared for matching variants, and overlapping variants were analyzed for overall concordance among samples.
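The matching and concordance calculation can be sketched as below. This is an assumption-laden illustration rather than the custom script itself: genotypes for one sample are assumed to be stored in dictionaries keyed by (chromosome, position), each genotype is a tuple of allele strings, and the function name `genotype_concordance` is hypothetical.

```python
def genotype_concordance(calls_a, calls_b):
    """Compare two genotype dictionaries for one sample.

    Returns (number of overlapping variants, fraction with identical genotypes).
    Allele order is ignored, so ("A", "G") matches ("G", "A").
    """
    shared = set(calls_a) & set(calls_b)
    if not shared:
        return 0, None
    matches = sum(1 for key in shared if sorted(calls_a[key]) == sorted(calls_b[key]))
    return len(shared), matches / len(shared)


# Hypothetical calls for illustration only
array_calls = {("1", 100): ("A", "G"), ("1", 200): ("C", "C"), ("2", 50): ("T", "T")}
rnaseq_calls = {("1", 100): ("G", "A"), ("1", 200): ("C", "T"), ("3", 10): ("A", "A")}
print(genotype_concordance(array_calls, rnaseq_calls))
```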


2.2.6 VCF ASE Detection Tool (VADT)

To identify ASE variants, a custom program called VADT (VCF ASE Detection Tool) was written; the complete code for the program can currently be found at https://github.com/mjtiv/VADT (currently hidden in private mode). VADT takes in a raw VCF file, filters the data and then performs various statistical tests for detection of allele specific expression (ASE) to identify highly confident occurrences of ASE. VADT was written in python3.6 and utilizes the following modules: SciPy [43], numpy [44], future, os.path, sys, time and copy [45]. A full explanation of the tool’s process can be found in the README.md at the previously mentioned github page, but the major steps are briefly discussed below.

2.2.6.1 VADT - Filtering of Data

VADT performs various filtering steps on a VCF file before testing for ASE. The first filter removes all variants that have been flagged as failing (FILTER column of the VCF) in the last step of GATK’s Best Practices. Specifically, the program looks for a "PASS" flag in the filter column to include that variant in further analysis. VADT also removes all variants within a user-defined distance of an indel, all variants that are not biallelic and all variants with less than the minimum quality score (default setting <20).

VADT also performs sample-level filtering of the data. This consists of removing all samples that are homozygous, have a low read count (user-defined setting), have an allele count that is <1% of the total counts or have "no data" reported. Variants can fail any one of these filters or a combination of them.

If a variant passes all the previously mentioned filters, applied globally or on a per-sample basis, it is classified as “informative.” All informative variants were then utilized in the statistical analyses discussed below.
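The sample-level filters can be sketched as a single check per sample. This is not VADT's code; the function name `is_informative`, the genotype string format (`0/1`, `./.`) and the default thresholds are assumptions chosen to mirror the description above.

```python
def is_informative(genotype, ref_count, alt_count, min_reads=20, min_allele_frac=0.01):
    """Return True if a sample passes the sample-level filters for one variant."""
    if genotype is None:
        return False  # no data reported
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return False  # missing genotype call
    if len(set(alleles)) < 2:
        return False  # homozygous samples carry no allelic information
    total = ref_count + alt_count
    if total < min_reads:
        return False  # low read count
    if min(ref_count, alt_count) < min_allele_frac * total:
        return False  # one allele makes up <1% of the counts
    return True


print(is_informative("0/1", 30, 25))
```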

2.2.6.2 VADT – Detection of Reference Allele Bias

A major concern with detection of ASE variants is reference allele bias, where the explicit definition of a reference allele by use of a reference genome causes mapping bias that favors the reference allele. So, VADT tests all informative variants that pass the prior filters for reference allele bias by examining reference and alternative allele counts. The program reports the total bias per variant and reports the final global bias value for the entire dataset. It is possible to use this final bias value to re-run the entire dataset in an attempt to remove some portion of that bias.
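One simple way to summarize reference allele bias is the fraction of reads that carry the reference allele, where 0.5 indicates no bias. The exact definition VADT uses is not given in this section, so the sketch below is an assumption: it pools counts across variants, and the function name `reference_fraction` is hypothetical.

```python
def reference_fraction(counts):
    """Pooled fraction of reads supporting the reference allele.

    `counts` is an iterable of (ref_count, alt_count) pairs, one per observation.
    """
    counts = list(counts)
    ref_total = sum(ref for ref, _ in counts)
    alt_total = sum(alt for _, alt in counts)
    total = ref_total + alt_total
    if total == 0:
        raise ValueError("no reads to summarize")
    return ref_total / total


# Hypothetical counts: per-variant values and the pooled (global) value
per_variant = [reference_fraction([pair]) for pair in [(30, 20), (25, 25)]]
global_bias = reference_fraction([(30, 20), (25, 25)])
print(per_variant, global_bias)
```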

2.2.6.3 VADT – Binomial Testing

VADT performs a binomial test on all informative samples in the dataset using SciPy's binomial test module [43]. A complete breakdown (derivation) of this statistical test can be found in Appendix A, along with the corresponding python code to run the analysis. The program utilizes the raw read counts from the VCF file, found in the “genotype fields” of a VCF under the AD (unfiltered allele depth) sub-field for each sample.
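For illustration, an exact two-sided binomial test under the null hypothesis of equal allelic expression (p = 0.5) can be written with the standard library alone. VADT itself relies on SciPy, so this sketch is not the code from Appendix A; the function name `binomial_two_sided` is hypothetical, and the p-value is taken as the summed probability of all outcomes no more likely than the observed one.

```python
from math import comb


def binomial_two_sided(ref_count, alt_count):
    """Exact two-sided binomial p-value for H0: both alleles equally expressed."""
    n = ref_count + alt_count
    observed = comb(n, ref_count)
    # With p = 0.5 every outcome has probability C(n, i) / 2**n, so outcomes can be
    # compared through their binomial coefficients using exact integer arithmetic.
    extreme = sum(comb(n, i) for i in range(n + 1) if comb(n, i) <= observed)
    return extreme / 2 ** n


# Hypothetical AD values for one heterozygous sample: 8 reference and 2 alternative reads
p_value = binomial_two_sided(8, 2)
print(p_value)
```

Working with integer coefficients avoids floating-point overflow at high read depth, at the cost of being slow for very large counts.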

2.2.6.4 VADT – Statistical Analysis of Binomial Results

VADT performs two different types of statistical analysis (varying models) to identify significant ASE results after the initial binomial test: a per-sample analysis of all informative variants and a meta-analysis across each variant considering all informative samples. Both tests give similar results but address the question of identifying statistically significant ASE variants in slightly different ways. It is up to the user to choose which model is best for their dataset. Each test is explained below.

2.2.6.4.1 Sample Level Analysis

After performing the initial binomial test, VADT performs a false discovery rate (FDR) correction [46] on a per-sample basis for all informative variants. The full breakdown of how the FDR is calculated and the code used in the program can be found in Appendix B. Any sample with a significant FDR-corrected p-value, based on the cutoff defined by the user, is considered significant, and any variant with at least one significant sample is in turn considered significant. The final significant results are then tallied and reported to the user in various summary report files.
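A common FDR procedure is the Benjamini-Hochberg step-up method, and the sketch below assumes that this is the correction intended here; the code in Appendix B may differ. The function names and the toy p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def variant_is_significant(sample_q_values, cutoff=0.05):
    """A variant is called significant if at least one sample passes the cutoff."""
    return any(q < cutoff for q in sample_q_values)


# Toy p-values for four informative variants within one sample
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(q_values)
```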


2.2.6.4.2 Meta-Analysis

After performing the initial binomial test, VADT performs a meta-analysis across all tested samples per variant using Fisher's Method to combine p-values [47]. An explanation of Fisher’s Method and the corresponding python code can be found in Appendix C. After producing p-values for all informative variants, an FDR correction is performed and significant variants are identified based on the corrected p-value cutoff defined by the user. The final results are then tallied for this model and reported to the user in various summary report output files.
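Fisher's Method combines k independent p-values through the statistic X = -2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below is a standard-library illustration and not the code from Appendix C; it uses the closed-form chi-square tail for even degrees of freedom, and the function name `fisher_combined_p` is hypothetical.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine independent p-values (each in (0, 1]) with Fisher's Method."""
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in the interval (0, 1]")
    k = len(p_values)
    half = -sum(log(p) for p in p_values)  # half of the chi-square statistic
    # Chi-square survival function with 2k degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return min(1.0, exp(-half) * total)


# Two hypothetical samples, each with a binomial p-value of 0.05 for the same variant
combined = fisher_combined_p([0.05, 0.05])
print(combined)
```

In this toy case two samples that are each only borderline significant combine to a p-value of roughly 0.017.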

2.2.6.5 VADT – Settings Utilized in Our Study of ASE

This ASE study utilized an indel filter distance of 75 base pairs (the read length for the feed efficiency dataset), a minimum quality score of 20 and a read count threshold of 20 for each sample. The p-value cutoff for significance was 0.05 for both the meta-analysis and the sample-level analysis.

2.2.7 Investigation of Functional Significance

To validate the functional significance of the variants identified in the study, the variants were analyzed using Ensembl’s Variant Effect Predictor (VEP) tool [48]. The results were downloaded as a txt file and custom python scripts were written to parse the results. The final gene list was then uploaded to DAVID for functional annotation [49]. The significance cutoff for pathway enrichment from DAVID was an FDR p-value of 0.1. When investigating the functional significance, the overlap of variants or genes among tissues was visualized using two online bioinformatics tools for creating Venn diagrams [50, 51].

2.2.8 Validation of VADT’s Robustness

The raw VCF from Zhuo et al. (2017) was used to validate the overall robustness of VADT [14]. The VCF was analyzed using VADT with settings similar to those in our prior analysis. The meta-analysis results were compared to the previously published results, which included variants validated by Sanger sequencing. In the study by Zhuo et al., ASE variants were analyzed and grouped according to parental lineage (Fayoumi vs. Leghorn), whereas in our analysis we treated the data as a single pool. All the variant coordinates in Zhuo et al. (2017) were lifted over from Gallus gallus 4.0 to Gallus gallus 5.0 when necessary using the UCSC genome browser lift-over tool [52].

2.3 Results

2.3.1 Mapping and Initial Variant Results

We sequenced a total of 100 samples, which had an average input of 36,196,012.79 reads and an average uniquely mapping rate of 85.94%. The feed efficiency samples (n = 68) had an average input of 33,221,292.6 reads with an average uniquely mapping rate of 87.24%, whereas the wooden breast samples (n = 32) had an average input of 42,517,293.19 reads with an average uniquely mapping rate of 83.17%. A summary of the mapping results can be found in Table 2.1. Together, these samples produced a total of 3,147,284 variants, which were used to mask the reference genome for the second round of alignment.
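The overall averages quoted above are consistent with sample-weighted means of the per-project values in Table 2.1. The short check below reproduces them; the variable names are hypothetical, and it assumes each project average is itself a mean over that project's samples.

```python
# (number of samples, average input reads, average % uniquely mapped) per project
projects = [
    (68, 33221292.6, 87.24),   # feed efficiency
    (32, 42517293.19, 83.17),  # wooden breast
]

total_samples = sum(n for n, _, _ in projects)
avg_input_reads = sum(n * reads for n, reads, _ in projects) / total_samples
avg_mapping_rate = sum(n * rate for n, _, rate in projects) / total_samples

print(total_samples, round(avg_input_reads, 2), round(avg_mapping_rate, 2))
```

The weighted means round to 36,196,012.79 reads and 85.94%, matching the values reported for all 100 samples.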

Table 2.1: Summary statistics of STAR alignment (1st Pass) for all the samples separated by project that were used to create the initial VCF used for masking. The wooden breast project samples had a diversity of input lengths (*) and represent the average between the samples.

| Feed Efficiency Project | Avg. Input Reads | Input Length | Avg. Uniquely Mapped Reads | % Uniquely Mapped Reads |
| --- | --- | --- | --- | --- |
| Breast Muscle (n=23) | 34212079.35 | 150 | 27477522.91 | 80.11% |
| Abdominal Fat (n=22) | 32096465.36 | 150 | 29137588.18 | 90.78% |
| Liver (n=23) | 33306427.57 | 150 | 30305271.57 | 90.98% |
| All Samples (n=68) | 33221292.6 | 150 | 28971047.25 | 87.24% |

| Wooden Breast Project | Avg. Input Reads | Input Length | Avg. Uniquely Mapped Reads | % Uniquely Mapped Reads |
| --- | --- | --- | --- | --- |
| Breast Muscle (n=24) | 51070107.71 | 174.42* | 42340701.46 | 82.94% |
| Liver (n=8) | 16858849.63 | 202 | 13961441.13 | 83.85% |
| All Samples (n=32) | 42517293.19 | 181.31* | 35245886.38 | 83.17% |

2.3.2 Mapping Issue Between Tissue

As seen in Table 2.1, the feed efficiency liver and abdominal fat samples had mapping rates ~10 percentage points higher than the breast muscle samples. All feed efficiency samples were produced using the same library preparation procedures and sequenced with the same read length and sequencing machines, suggesting that this variation is not due to the preparation process. The wooden breast samples show much greater diversity in the preparation process (input length, library prep etc.), so variation is expected. Due to this added complexity, the wooden breast samples were excluded from further analysis for ASE discovery.

This mapping discrepancy could potentially be due to variation in read quality scores, which were originally inspected with FastQC using its pass/fail categorization and visual inspection of the program’s output report. For a more quantitative assessment, a custom python program was written that parses all the FastQC results and reports various base quality scores that could potentially have affected mapping. However, after analysis of all the samples, no issues were found with base quality. For the feed efficiency samples, the average base quality was 34.79 for breast muscle, 34.15 for liver and 36.00 for abdominal fat.

To further investigate this issue, the STAR alignment results were assessed, and it was found that breast muscle samples had a substantially higher percentage of reads considered “too short” (15.39%, compared to 4.50% for abdominal fat and 5.33% for liver). STAR describes reads as “too short” if 2/3 of the read cannot be aligned correctly [25, 34]. A sample was randomly chosen from the feed efficiency project and its unmappable reads files for both R1 and R2 in all three tissues were blasted in a high-throughput manner against the chicken genome using FastqBLAST. It should be noted that the sample’s muscle SAM files for unmapped reads were ~3 times the size of those for the other tissues. The overall statistics of the 1000 blasted sequences for R1 and R2 can be seen in Table 2.2. More muscle reads returned BLAST hits than abdominal fat reads for both R1 and R2 (736 vs. 544 and 608 vs. 415) and than liver reads for R2 (608 vs. 330), whereas for R1 the muscle and liver counts were similar (736 vs. 734). The exact genes identified by the BLAST search were quantified and the top five hits (based on counts of hits) can be seen in Table 2.3. Muscle samples returned muscle related genes in both the forward and reverse reads, whereas abdominal fat and liver samples returned similar ribosomal and mitochondrial genes.

Table 2.2: Overall summary statistics from the unmappable reads for R1 and R2 (sample 47337) for all three tissues. The “Average Trimmed Length” is the length that FastqBLAST trimmed the sequence and the “Average Hit Seq Length” is the length BLAST returned for the match.

| FastqBLAST Results | Sequences in File | BLASTed | Avg. Trimmed Length | Blast Hits | Avg. Hit Seq. Length |
| --- | --- | --- | --- | --- | --- |
| Muscle R1 | 5,735,211 | 1000 | 70.54 | 736 | 70.19 |
| Muscle R2 | 5,735,211 | 1000 | 57.58 | 608 | 69.19 |
| Abdominal Fat R1 | 1,882,124 | 1000 | 70.72 | 544 | 68.64 |
| Abdominal Fat R2 | 1,882,124 | 1000 | 60.99 | 415 | 66.14 |
| Liver R1 | 1,821,565 | 1000 | 70.48 | 734 | 70.84 |
| Liver R2 | 1,821,565 | 1000 | 43.75 | 330 | 66.76 |


Table 2.3: Top 5 FastqBLAST count results for R1 and R2 of unmappable reads for the three tissues (breast muscle, abdominal fat and liver) for sample 47337.

Top Muscle Genes from Unmappable Reads (R1)

| Accession ID | Description | Counts |
| --- | --- | --- |
| NM_205119.1 | Gallus gallus enolase 3 (beta, muscle) (ENO3), mRNA | 51 |
| NM_001198744.1 | Gallus gallus myosin light chain, phosphorylatable, fast skeletal muscle (MYLPF), mRNA | 46 |
| NM_205519.1 | Gallus gallus ATPase sarcoplasmic/endoplasmic reticulum Ca2+ transporting 1 (ATP2A1), mRNA | 46 |
| NM_205507.1 | Gallus gallus creatine kinase, M-type (CKM), mRNA | 41 |
| KY039437.1 | Gallus gallus isolate ACAD15101_Palawan_Philippines mitochondrion, complete genome | 33 |
| Total | | 217 |

Top Muscle Genes from Unmappable Reads (R2)

| Accession ID | Description | Counts |
| --- | --- | --- |
| NM_205519.1 | Gallus gallus ATPase sarcoplasmic/endoplasmic reticulum Ca2+ transporting 1 (ATP2A1), mRNA | 44 |
| NM_205119.1 | Gallus gallus enolase 3 (beta, muscle) (ENO3), mRNA | 39 |
| NM_205507.1 | Gallus gallus creatine kinase, M-type (CKM), mRNA | 39 |
| NM_001198744.1 | Gallus gallus myosin light chain, phosphorylatable, fast skeletal muscle (MYLPF), mRNA | 37 |
| XM_025151021.1 | PREDICTED: Gallus gallus creatine kinase M-type-like (LOC107051134), partial mRNA | 23 |
| Total | | 182 |

Top Abdominal Fat Genes from Unmappable Reads (R1)

| Accession ID | Description | Counts |
| --- | --- | --- |
| XR_003078040.1 | PREDICTED: Gallus gallus 28S ribosomal RNA (LOC112533599), rRNA | 31 |
| XR_003078044.1 | PREDICTED: Gallus gallus 18S ribosomal RNA (LOC112533603), rRNA | 23 |
| BX931917.1 | Gallus gallus finished cDNA, clone ChEST790c21 | 12 |
| XM_025155812.1 | PREDICTED: Gallus gallus avian tenascin X (TNX), transcript variant X18, mRNA | 9 |
| XR_003078043.1 | PREDICTED: Gallus gallus 18S ribosomal RNA (LOC112533602), rRNA | 8 |
| Total | | 83 |

Top Abdominal Fat Genes from Unmappable Reads (R2)

| Accession ID | Description | Counts |
| --- | --- | --- |
| XR_003078040.1 | PREDICTED: Gallus gallus 28S ribosomal RNA (LOC112533599), rRNA | 38 |
| XR_003078044.1 | PREDICTED: Gallus gallus 18S ribosomal RNA (LOC112533603), rRNA | 16 |
| BX931917.1 | Gallus gallus finished cDNA, clone ChEST790c21 | 13 |
| XM_025155812.1 | PREDICTED: Gallus gallus avian tenascin X (TNX), transcript variant X18, mRNA | 7 |
| KY039437.1 | Gallus gallus isolate ACAD15101_Palawan_Philippines mitochondrion, complete genome | 7 |
| Total | | 81 |

Top Liver Genes from Unmappable Reads (R1)

| Accession ID | Description | Counts |
| --- | --- | --- |
| XR_003078040.1 | PREDICTED: Gallus gallus 28S ribosomal RNA (LOC112533599), rRNA | 74 |
| XM_025149533.1 | PREDICTED: Gallus gallus albumin (ALB), transcript variant X5, mRNA | 45 |
| KY039437.1 | Gallus gallus isolate ACAD15101_Palawan_Philippines mitochondrion, complete genome | 35 |
| XM_025144502.1 | PREDICTED: Gallus gallus cytochrome P450 2G1-like (LOC112530469), mRNA | 34 |
| U16848.1 | GGU16848 Gallus gallus complement C3 precursor mRNA, complete cds | 30 |
| Total | | 218 |

Top Liver Genes from Unmappable Reads (R2)

| Accession ID | Description | Counts |
| --- | --- | --- |
| XR_003078040.1 | PREDICTED: Gallus gallus 28S ribosomal RNA (LOC112533599), rRNA | 60 |
| XM_025144502.1 | PREDICTED: Gallus gallus cytochrome P450 2G1-like (LOC112530469), mRNA | 30 |
| XR_003078044.1 | PREDICTED: Gallus gallus 18S ribosomal RNA (LOC112533603), rRNA | 24 |
| U16848.1 | GGU16848 Gallus gallus complement C3 precursor mRNA, complete cds | 20 |
| XM_025149533.1 | PREDICTED: Gallus gallus albumin (ALB), transcript variant X5, mRNA | 8 |
| Total | | 142 |


The results presented in Tables 2.2 and 2.3 indicate that the mapping issue with muscle samples is not caused by an inability to map R1 or R2 independently to the chicken genome. To further validate this finding, all samples were re-aligned using STAR, but using only the 1st pass and with the chimeric flag turned on (the default setting is off). Chimeric reads are reads in which different parts of the same read map to multiple locations. With the chimeric flag turned on, STAR reports the chimeric read statistics in the final alignment report. The results of this alignment can be seen in Table 2.4.

Table 2.4: Summary table of the 1st Pass of STAR alignment for the three tissues with chimeric setting turned on.

| Tissue | Avg. Uniquely Mapping Reads | Avg. Unmappable Reads | Avg. Chimera Reads |
| --- | --- | --- | --- |
| Breast Muscle (n=23) | 80.11% | 15.39% | 2.73% |
| Abdominal Fat (n=22) | 90.78% | 4.50% | 0.61% |
| Liver (n=23) | 90.98% | 5.33% | 0.44% |

The muscle samples have 4-6 times more chimeric reads than the other tissues, suggesting that either the current read length or the overall quality of muscle related genes in the current genome assembly still needs improvement. Longer reads may help to better align the sequences to the reference genome or the current genome has incorrectly assembled genes due to . The specific relationship between these chimeric reads and mapping can be seen in Figures 2.3-2.4. In Figure 2.3 all samples were plotted, and a clear grouping of properly mapping samples can be seen, whereas the muscle samples spread out in a linear fashion. For the muscle samples, the two quantities were found to be moderately correlated (R² = 0.5202), as seen in Figure 2.4. Overall, a small change in the percentage of chimeric reads is associated with a large change in the percentage of unmappable reads.

[Figure 2.3 plot: % Chimeric Reads (y-axis, 0.00% to 4.50%) versus % Unmappable Reads (x-axis, 0.00% to 30.00%); series: Breast Muscle, Liver, Abdominal Fat]

Figure 2.3: All feed efficiency samples from the various tissues plotted by their % chimeric reads versus % unmappable reads. Chimeric reads and unmappable reads were reported by STAR aligner.


[Scatter plot: % chimeric reads (y-axis, 0.00% to 4.50%) versus % unmappable reads (x-axis, 0.00% to 30.00%) with fitted line y = 0.1594x + 0.0027, R² = 0.5202.]

Figure 2.4: Zoomed-in view of Figure 2.3 showing only breast muscle samples plotted by % chimeric reads versus % unmappable reads, with the corresponding linear regression equation and R².
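The fit reported in Figure 2.4 is an ordinary least-squares regression of % chimeric reads on % unmappable reads. The following is a minimal sketch of that calculation, assuming a simple one-predictor fit; the function name `linear_fit` and the per-sample values are hypothetical placeholders and are not the thesis data.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-predictor fit, R^2 equals the squared Pearson correlation.
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical per-sample fractions (NOT the thesis data).
unmappable = [0.08, 0.12, 0.15, 0.20, 0.26]
chimeric = [0.018, 0.020, 0.030, 0.031, 0.045]

slope, intercept, r2 = linear_fit(unmappable, chimeric)
print(f"y = {slope:.4f}x + {intercept:.4f}, R^2 = {r2:.4f}")
```

Read at face value, the reported slope of 0.1594 means that each additional percentage point of unmappable reads is associated with roughly 0.16 percentage points more chimeric reads.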

In summary, the unmappable muscle reads appear to BLAST correctly to muscle-related genes but, owing to limitations of the current chicken genome assembly (Gallus gallus 5.0), are being treated as chimeric reads. Some caution is therefore needed when interpreting the final muscle results.


2.3.3 Validation of Variant Calling Pipeline

To validate the RNA-Seq pipeline, the variant calls were compared to separate genotyping data produced from the same samples (Table 2.5). The same filtering criteria used in the later VADT ASE analysis were applied, excluding the homozygous allele filter. On average the datasets overlapped by ~13,997 variants, and the per-sample genotype calls were 99.46% concordant. The discordant calls were largely driven by a new allele appearing that was neither the reference nor the alternative allele, but no strong conclusions can be drawn at a frequency of about 0.25%. To verify that the overall concordance was not driven by random chance or an unknown systematic error, samples were shuffled and compared to each other in a concordance matrix (Figure 2.5). Different samples compared to each other had concordance rates of ~0.5, whereas self-by-self comparisons were ~0.9, ruling out random chance as the driver of the high concordance rates. Overall, the RNA-Seq pipeline appears to be working properly and the calls can be trusted under the current filtering criteria.


Table 2.5: Comparison of variant calls between 600K Genotyping panel and RNA-seq variants. The initial total variant counts for both the panel and RNA-seq are based on variant calls after filtering and represent all high-quality variants that can be compared between the datasets.

Data | Muscle | Liver | Abdominal Fat
Number of Samples | 23 | 23 | 22
Total Variants in Genotyping Panel | 572,645 | 572,645 | 572,645
Total Variants in RNA-Seq Dataset | 193,245 | 199,330 | 268,744
Total Matching (Ref and Alt) | 12,769 | 12,764 | 16,459

Matching Stats
Total Matches | 168,885 | 187,463 | 254,846
Total Non-matches | 997 | 962 | 1,323
Concordance (%) | 99.41 | 99.49 | 99.48

Non-Matching Stats
Homozygous Ref Allele | 299 | 362 | 401
Homozygous Alt Allele | 131 | 104 | 152
Discordant Genotype (new allele) | 567 | 496 | 770
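
The concordance percentages in Table 2.5 can be reproduced from the match counts. The sketch below assumes concordance is computed as matches divided by all compared calls (matches plus non-matches); that formula is an assumption, although it does reproduce the tabulated values. The function name is illustrative only.

```python
def concordance_percent(matches, non_matches):
    """Percent of compared genotype calls that agree between the two datasets."""
    return 100.0 * matches / (matches + non_matches)


# (Total Matches, Total Non-matches) taken from Table 2.5.
table_2_5 = {
    "Breast Muscle": (168885, 997),
    "Liver": (187463, 962),
    "Abdominal Fat": (254846, 1323),
}

for tissue, (matches, non_matches) in table_2_5.items():
    print(f"{tissue}: {concordance_percent(matches, non_matches):.2f}%")
```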


A. Concordance Matrix Comparison of Breast Muscle Samples

B. Concordance Matrix Comparison of Abdominal Fat Samples

C. Concordance Matrix Comparison of Liver Samples

Figure 2.5: Matrix comparisons of genotype calls for 600K versus RNA-seq for feed efficiency samples. On the y-axis is the 600K genotyping dataset and on the x-axis the RNA-Seq dataset. Diagonal lines (highlighted in yellow) represent self to self.


2.3.4 VADT Analysis and Results

Between 1.5 and 2 million variants were identified in the three tissues. After filtering based on strand bias, quality depth and coverage, ~76% of variants were removed in each of the three tissues. The remaining variants were then filtered using various quality controls (GATK fail flag, proximity to INDELs, homozygosity, non-biallelic sites, low counts, etc.), which removed an additional ~55% of the remaining variants in each tissue. After filtering, 148,860 variants remained in breast muscle, 217,628 in abdominal fat and 155,875 in liver, with an average overlap of 53.47% among the three tissues; these variants were considered “informative”.

A full breakdown of summary statistics for the informative variants (avg. size, bins etc.) can be found in Appendix D.

We also checked for reference allele bias to determine whether masking the genome helped remove possible issues with the data. This was done by comparing the reference allele ratio of informative variants in each tissue against that of the unmasked informative variants. Breast muscle went from 51.77% to 50.17%, abdominal fat from 51.93% to 50.19% and liver from 51.91% to 50.17%.

Masking the genome helped remove potential bias in the dataset, as evidenced by the shift of the ratio toward the expected 50.0%, which should strengthen our statistical power to detect ASE with higher confidence.
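
A reference allele ratio of this kind can be computed from per-variant read counts. The sketch below assumes the ratio is pooled over all reads at informative variants; the text does not state whether the ratio was pooled or averaged per variant, so that choice, the function name and the counts shown are all hypothetical.

```python
def reference_allele_ratio(variant_counts):
    """Pooled fraction of reads that support the reference allele.

    variant_counts: list of (ref_reads, alt_reads), one pair per variant.
    """
    ref_total = sum(ref for ref, _ in variant_counts)
    all_total = sum(ref + alt for ref, alt in variant_counts)
    return ref_total / all_total


# Hypothetical read counts (NOT thesis data) before and after masking.
unmasked = [(30, 25), (18, 17), (52, 48)]
masked = [(27, 27), (18, 17), (50, 50)]

print(f"unmasked: {100 * reference_allele_ratio(unmasked):.2f}%")
print(f"masked:   {100 * reference_allele_ratio(masked):.2f}%")
```

A ratio close to 50% is what would be expected if reads carrying either allele map equally well.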

Informative variants were then tested for ASE using the binomial test, followed by a meta-analysis with FDR correction. In the end, 21,916 variants in liver, 28,190 variants in abdominal fat and 20,775 variants in breast muscle were identified as showing ASE. Only 2,196 (3.7%) of these variants overlapped among the three tissues. On average, only about 13.65% of informative variants showed ASE in each tissue. A full breakdown of this pipeline can be seen in Figure 2.6, and Venn diagrams comparing informative versus significant variants can be seen in Figure 2.7. The overall prevalence of ASE was 13.89% for breast muscle, 14.92% for abdominal fat and 15.55% for liver.
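
The testing steps named above (per-sample binomial test, meta-analysis across samples, FDR correction) can be sketched as follows. This is not the VADT source code. It assumes a null allele ratio of 0.5, Fisher's method for the meta-analysis and the Benjamini-Hochberg procedure for FDR control; the last two are inferred from the reference list [46, 47] rather than stated in this section. All function names and read counts are hypothetical, and only the Python standard library is used so that the sketch is self-contained.

```python
import math


def binom_two_sided_p(ref_count, alt_count, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    no more likely than the observed one. Intended for modest read depths."""
    n = ref_count + alt_count
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_count]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-7)))


def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) follows chi-square with 2k df."""
    k = len(p_values)
    half_stat = -sum(math.log(max(p, 1e-300)) for p in p_values)  # X / 2
    # Closed-form chi-square survival function for even degrees of freedom.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical (ref, alt) read counts for two variants across three samples.
variants = {
    "var_1": [(30, 10), (28, 12), (35, 9)],
    "var_2": [(20, 21), (18, 19), (25, 24)],
}
names = list(variants)
combined = [
    fisher_combined_p([binom_two_sided_p(r, a) for r, a in variants[name]])
    for name in names
]
fdr = benjamini_hochberg(combined)
for name, p_comb, q in zip(names, combined, fdr):
    print(f"{name}: combined p = {p_comb:.3g}, FDR = {q:.3g}")
```

Combining per-sample p-values before the FDR step means each variant contributes a single test to the multiple-testing correction, regardless of how many samples carry it.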

Figure 2.6: VCF ASE Detection Tool (VADT) pipeline for all three tissues analyzed (liver, abdominal fat and breast muscle). The pipeline consists of three major steps (blue outlined boxes): filtering using GATK (user defined filters), quality control filters and the actual statistical analysis for ASE variants.


Figure 2.7: Venn diagram comparing overlapping variants among the three tissues. On the left are variants considered informative and on the right are variants found to be statistically significant for ASE. On average the informative variants overlap by 32.5% and the significant variants only overlap by 3.7%.

2.3.5 Functional Significance and Pathway Enrichment

All significant variants were investigated for function using Ensembl’s online VEP tool. VEP returns a variety of results, such as the location of each variant, its predicted protein consequence and the gene it is predicted to influence. In general, variants appear to be enriched in the 3’ UTR and downstream regions of genes when these two groupings are combined (the largest proportion of consequences), but this may be an artifact of how mRNA is prepared and isolated. A full breakdown of the variant locations and overall consequences can be seen in Figure 2.8. No significant difference was identified between tissues, and no detrimental coding consequences were found. Consequences of informative variants were also examined (data not shown), but no significant differences were detected between informative and significant variants.

The results were then further parsed by Ensembl gene ID, and overlapping genes between tissues were examined as seen in Figure 2.9, which also compares genes identified from informative variants. In the informative variants group, breast muscle captured 10,577 genes, abdominal fat captured 11,878 genes and liver captured 10,277 genes, with a high overlap of 66.9%. When only significant variants were examined, breast muscle captured 4,784 genes, abdominal fat captured 5,709 genes and liver captured 4,095 genes, with an overlap of 20.1%. There was also significant enrichment of tissue-specific genes.


Figure 2.8: Ensembl’s VEP summary statistics for significant variants for all three tissues. Downstream variants specifically refer to variants found within 5,000 bases of the end of the gene. Downstream variants most likely represent issues with the current genome annotation found in VEP and the average distance for downstream variants was found to be 1695.79 nt for breast muscle, 1775.12 nt for abdominal fat, and 1670.81 nt for liver.


Figure 2.9: Venn diagram comparing overlapping genes among the three tissues. On left are genes identified from the informative variants using VEP and on right are genes identified from significant variants using VEP.

The genes from the significant variants identified in each group were submitted to DAVID for pathway enrichment. It was found that ASE usually involves tissue specific pathways but is also found in common functional pathways that are found in all three tissues (Figure 2.10 and Appendix E). The pathways found in common for all three tissues are related to ribosomes and translation. Some groupings of genes did not show statistical significance which could be due to limitations of DAVID enrichment software, because many Ensembl gene IDs were not identified by DAVID. The total hits found in DAVID for each grouping are noted in Figure 2.10 for reference and a more extensive expansion of the DAVID results is in Appendix E.


Figure 2.10: Pathway enrichment of statistically significant ASE genes. Genes were identified using Ensembl’s VEP and overlapping groups were broken apart. Ensembl gene IDs were uploaded into DAVID for pathway enrichment. The pathway enrichment results (rectangular boxes) do not represent all the possible gene clusters identified and have been simplified for brevity and visualization purposes (see Appendix E). The cutoff for significance was an FDR p-value <0.1 and Enrichment Score >1.3.

2.3.6 Identification of Tissue Specific Pathways with Strong ASE Signals

To further isolate genes and tissue-specific pathways involved in ASE, the variant counts for each gene were normalized (significant variants/informative variants). The gene list was then filtered to retain only genes with an ASE signal >=50% that were specific to that tissue. The final list of genes for each tissue was uploaded to DAVID, where on average 258.33 genes were searchable; the results can be seen in Figure 2.11 and Appendix E. All three lists showed a more focused tissue-specific enrichment when compared to Figure 2.10. In this analysis, abdominal fat does show some significant enrichment, but considerably less than the other tissues. It should be recognized that the overlapping genes were searchable in DAVID, but the focus of this analysis was to identify strong tissue-specific ASE genes and pathways.
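
The normalization and tissue-specific filter described above can be sketched as follows. The interpretation of "tissue specific" used here (a gene passes the cutoff in one tissue and in no other) is one plausible reading of the text, not a confirmed description of the analysis; the function names, gene labels and counts are hypothetical.

```python
def normalized_ase_scores(sig_counts, informative_counts):
    """Per-gene ASE score = significant variants / informative variants."""
    return {
        gene: sig_counts.get(gene, 0) / n_informative
        for gene, n_informative in informative_counts.items()
        if n_informative > 0
    }


def tissue_specific_genes(scores_by_tissue, tissue, cutoff=0.5):
    """Genes at or above the cutoff in `tissue` and in no other tissue."""
    passing = {
        name: {gene for gene, score in scores.items() if score >= cutoff}
        for name, scores in scores_by_tissue.items()
    }
    others = set().union(
        *(genes for name, genes in passing.items() if name != tissue)
    )
    return passing[tissue] - others


# Hypothetical per-gene variant counts (NOT thesis data).
informative = {
    "muscle": {"GENE_A": 4, "GENE_B": 10, "GENE_C": 2},
    "fat": {"GENE_A": 5, "GENE_B": 8},
    "liver": {"GENE_B": 6, "GENE_C": 4},
}
significant = {
    "muscle": {"GENE_A": 3, "GENE_B": 6},
    "fat": {"GENE_A": 1, "GENE_B": 5},
    "liver": {"GENE_C": 4},
}

scores = {
    tissue: normalized_ase_scores(significant[tissue], informative[tissue])
    for tissue in informative
}
for tissue in scores:
    print(tissue, sorted(tissue_specific_genes(scores, tissue)))
```

Dividing by the number of informative variants keeps long or highly covered genes from dominating the list simply because they carry more testable variants.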

Figure 2.11: Comparison of top normalized genes and corresponding pathway enrichment. Variant counts for the identified genes were normalized (significant variants/informative variants), and genes with scores >=50% were compared between tissues. Only tissue-specific genes were uploaded to DAVID for pathway enrichment due to small sample size. A full expansion of the identified clusters with enrichment terms can be seen in Appendix E. The cutoff for significance was an FDR p-value <0.1 and Enrichment Score >1.3.


2.3.7 Identification of Robust ASE Genes Found in All Three Tissues

Building on the technique used in the previous section, we next focused on genes common to all three tissues that showed a strong ASE signal. The ASE gene scores for all three tissues were normalized (significant variants/informative variants), the normalized scores were averaged across the three tissues, and the final gene lists were filtered at various cutoffs depending on the type of analysis being performed. As seen in Figure 2.10, 1,715 genes were originally found in common among all three tissues; using a cutoff of >=50% for the normalized value, a total of 344 genes were identified, 208 of which had identifiable gene symbols. These 208 genes were uploaded to STRING [53] to examine protein-protein interactions (Figure 2.12). This figure shows that ribosomal genes form the largest and most complex network of interactions among all the genes. The top three KEGG pathways [54] identified by STRING were Ribosomes (count = 18, FDR = 8.62e-15), metabolic pathways (count = 32, FDR = 3.33e-07) and oxidative phosphorylation (count = 11, FDR = 8.93e-07). This relationship with ribosomes was previously shown in Figure 2.10, but with less stringent cutoffs. If a more stringent cutoff of >=80% is applied, only 78 genes are identifiable, of which 23 show 100% ASE (Appendix F, Table 1). Of these 23 genes, 10 had identifiable gene symbols that could be referenced for biological function in GeneCards [55]. These 10 genes and their corresponding functions can be seen in Table 2.6. In this list, 3 out of 10 genes are ribosomal coding genes (RPS29, RPL35A and MRPL43) and 3 out of 10 genes are directly involved in metabolism (UBL5, RARRES2 and MT-ND2).
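
The cross-tissue averaging step can be sketched as below: keep only genes scored in every tissue, average their normalized scores, then apply a cutoff. This is an illustrative reading of the procedure described above; the function name, gene labels and scores are hypothetical.

```python
def shared_gene_average(scores_by_tissue, cutoff):
    """Average normalized ASE score for genes present in every tissue,
    keeping only genes whose average is at or above the cutoff."""
    tissues = list(scores_by_tissue.values())
    shared = set(tissues[0]).intersection(*tissues[1:])
    averages = {
        gene: sum(t[gene] for t in tissues) / len(tissues) for gene in shared
    }
    return {gene: avg for gene, avg in averages.items() if avg >= cutoff}


# Hypothetical normalized scores (NOT thesis data).
scores_by_tissue = {
    "muscle": {"GENE_A": 1.0, "GENE_B": 0.6, "GENE_C": 0.2},
    "fat": {"GENE_A": 1.0, "GENE_B": 0.3, "GENE_C": 0.9},
    "liver": {"GENE_A": 1.0, "GENE_B": 0.9, "GENE_D": 1.0},
}

print(sorted(shared_gene_average(scores_by_tissue, cutoff=0.5)))
print(sorted(shared_gene_average(scores_by_tissue, cutoff=0.8)))
```

Raising the cutoff from 0.5 to 0.8 trades list size for confidence, which mirrors the reduction from 344 to 78 genes reported above.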


Figure 2.12: Network analysis of genes enriched for ASE in all three tissues. Genes that showed an average combined ASE score of >=50% after normalization in all three tissues were submitted to STRING (n = 208) to examine protein interactions. The largest network identified in the analysis consisted of ribosomal genes (zoomed-in image). There were a total of 176 nodes and 374 edges, with an average node degree of 4.25 and a PPI enrichment p-value of <1.0e-16 (significant interactions identified).


Table 2.6: Top genes identified using VADT’s significant results and Ensembl’s VEP tool where genes show 100% ASE in all three tissues after normalization. Gene’s overall biological function summary based on GeneCards [55].

Ensembl Gene ID | Gene Symbol | Gene Name | Biological Function
ENSGALG00000021139 | IGLL1 | Immunoglobulin Lambda Like Polypeptide 1 | Cell surface receptor involved in signal transduction for proliferation and differentiation of ProB cell to preB cell.
ENSGALG00000029837 | OST4 | Oligosaccharyltransferase Complex Subunit 4, Non-Catalytic | Forms part of a complex that transfers oligosaccharides to polypeptide chains.
ENSGALG00000035138 | UBL5 | Ubiquitin Like 5 | Codes for a protein that functions like ubiquitin, but instead binds to a protein and interferes with its function; appears to show an association with metabolism.
ENSGALG00000007611 | RPL35A | Ribosomal Protein L35A | Codes for a protein that forms part of the 60S subunit of ribosomes.
ENSGALG00000008066 | UQCR10 | Ubiquinol-Cytochrome C Reductase, Complex III Subunit X | Codes for a subunit of a protein that forms part of the inner mitochondrial membrane.
ENSGALG00000034352 | MRPL43 | Mitochondrial Ribosomal Protein L43 | Codes for a protein that forms part of a subunit of mitochondrial ribosomes.
ENSGALG00000012229 | RPS29 | Ribosomal Protein S29 | Codes for a protein that forms part of the 40S subunit of ribosomes.
ENSGALG00000042642 | RARRES2 | Retinoic Acid Receptor Responder 2 | Receptor that helps transmit signaling for the regulation of various biological functions such as adipogenesis, metabolism and inflammation.
ENSGALG00000021365 | DCTN3 | Dynactin Subunit 3 | Forms a subunit that is part of the dynactin protein complex. Dynactin is involved in a variety of functions.
ENSGALG00000043768 | MT-ND2 | Mitochondrially Encoded NADH:Ubiquinone Oxidoreductase Core Subunit 2 | Forms an important subunit in mitochondria that is involved in NADH dehydrogenase, which is extremely important in metabolism.


2.3.8 VADT Robustness

The robustness of VADT was tested using a dataset from Zhuo et al (2017) [14]. In that study, ASE was detected using parental lineage information to group ASE variants, and two separate meta-analyses were performed based on the parental information. The analysis described herein pooled all the data together and excluded parental information. Only variants with an FDR <0.05 from the meta-analysis were compared. Overall, VADT captured 57.34% of the significant variants reported in the brain and 65.61% of those reported in the liver. If variants classified as informative are included, VADT captured 94.91% of variants reported in the brain and 94.80% of variants reported in the liver. The comparison of results from the two methods can be seen in Figure 2.13. The Zhuo et al (2017) study also used Sanger sequencing to verify fourteen variants found to be significant for ASE in tissue samples from brain and liver: seven variants in brain and seven in liver. VADT identified all of these variants as significant in the tissue of origin designated by Zhuo et al (2017), a 100% match rate (Table 2.7).
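
The capture percentages above amount to a set comparison between published variants and VADT calls. A minimal sketch, assuming variants are keyed by a "chromosome:position" string on the same assembly; the function name and the coordinates shown are hypothetical placeholders, not variants from either study.

```python
def percent_captured(published, detected):
    """Percent of published variants that also appear in the detected set."""
    published = set(published)
    return 100.0 * len(published & set(detected)) / len(published)


# Hypothetical variant keys (NOT real coordinates from either study).
published_sig = {"1:1000", "1:2000", "2:3000", "3:4000"}
vadt_sig = {"1:1000", "2:3000", "5:9000"}
vadt_informative = {"1:2000", "7:100"}

print(f"significant only: {percent_captured(published_sig, vadt_sig):.2f}%")
print(
    "significant or informative: "
    f"{percent_captured(published_sig, vadt_sig | vadt_informative):.2f}%"
)
```

Because coordinates must match exactly, any published variants reported on an older assembly would need to be lifted over before such a comparison, as was done for Table 2.7.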


Figure 2.13: Venn diagram comparing VADT results versus the results from Zhuo et al (2017) in brain and liver tissue. VADT_Sig refers to variants considered significant by VADT and VADT_Informative refers to variants considered informative by the program. Overall VADT was able to capture the majority of previously published ASE variants.


Table 2.7: Variants previously verified using Sanger sequencing and corresponding VADT results. VADT was able to identify all variants as statistically significant in its original tissue designation. Coordinates were lifted over from Gallus gallus 4.0 to Gallus gallus 5.0.

Variant Information (first four columns); ASE Statistical Significance (VADT) (last two columns)

Original Tissue Designation | Gene | Variant (Gallus gallus 5.0) | Ref. Allele | Brain (Zhuo 2017) | Liver (Zhuo 2017)
Brain | AKR1D1 | 1:56735606 | C | yes | yes
Brain | MPZL1 | 1:90891019 | G | yes | yes
Brain | BRP44 | 1:90873705 | C | yes | yes
Brain | RFT1 | 12:1200202 | T | yes | yes
Brain | ROGD1 | 14:14415138 | G | yes | yes
Brain | PECAM1 | 18:6837730 | A | yes | no
Brain | FAM110B | 2:111605284 | G | yes | no
Liver | KCNK1 | 3:38009219 | T | yes | yes
Liver | SOD2 | 3:44930274 | T | yes | yes
Liver | MANSC | 1:71975277 | G | no | yes
Liver | CLN5 | 1:153672059 | G | no | yes
Liver | PPM1K | 4:46422860 | T | yes | yes
Liver | 27334 | Z:54543069 | G | no | yes
Liver | FETUB | 9:15627917 | C | yes | yes


2.4 Discussion

VADT is a streamlined ASE detection pipeline that is robust, accurate and straightforward to run. The pipeline follows GATK’s Best Practices to produce an initial VCF, which can be fed directly into VADT for ASE detection. The output from VADT can easily be uploaded to VEP and the final output parsed with simple scripts, so results can readily be used in downstream analyses, such as pathway enrichment, to investigate ASE’s biological influence. Understanding the biological effect of ASE is key to gaining a better understanding of cis-regulatory elements in the genome.

Our analysis pipeline for variant calling was found to be highly accurate, as seen by the strong concordance with our DNA genotyping calls. However, it should be recognized that a large number of variants were excluded due to low overlap with the 600K chicken genotyping panel. This limited overlap between RNA-seq and the genotyping panel is most likely due to the general limitations of genotyping panels sold for a specific organism. Genotyping panels have a limited number of variants that must capture a variety of genetic information across different tissues and different genetic backgrounds, which means they can miss a lot of information. In their defense, genotyping panels are cheap to use and very accurate. The high overall concordance among variants that do match gives us higher confidence in our variant calling for ASE detection.

The biggest issue we encountered with our analysis pipeline was the limitations of the current (2016) chicken reference genome. Gallus gallus 5.0 appears to still need improvement, based on the alignment issues in breast muscle. Most likely these issues can only be overcome with longer reads in all aspects of the investigation, including the genome and transcriptome assembly along with RNA-seq. This would allow more precise mapping of reads that may be problematic to assemble due to homologous repeat regions.

VADT is a self-contained analysis program that requires minimal effort from the user to run; the only requirement is an HPC environment, owing to the size and complexity of VCF files. The entire program is contained in one Python script for simplicity. VADT is the first ASE analysis program that does not require any modification of user input files and that runs the entire detection analysis in one sweep with statistical corrections built in, allowing a user to focus more on the biological significance of the results. VADT was tested and validated on a different dataset and shown to be robust and accurate. VADT also allows a user to analyze multiple tissues in a high-throughput manner; this was previously impractical because an analysis pipeline that is broken up into separate steps run in parallel is highly prone to error.

This study found that ASE can be both tissue specific and shared across tissues. ASE tissue specificity was previously reported in chickens by Zhuo et al [14] and in other organisms [56-58]. However, pathway enrichment of ASE variants had not previously been performed in chickens. When ASE genes were uploaded to DAVID, it became evident that tissue-specific pathways were implicated in ASE; similar behavior was seen in mice in the study by Pinter et al [57]. It should be recognized that statistical power for identifying pathways was lost because many chicken genes are still being annotated and had no identifiable biological function, so DAVID had to exclude them from the analysis. This may explain why abdominal fat did not show any enrichment, although this could also have been due to the diversity and quantity of genes identified, which could weaken any statistical enrichment.

This study was the first to show in chickens that non-tissue-specific genes are implicated in ASE, specifically among ribosomes and translational machinery. This type of mechanism had previously been reported only in hybrid catfish, in 2016, which are being bred for key economically important traits in aquaculture [59]. Just this year, ASE affecting translational machinery was identified in mice and humans by Park et al (2018), who also showed that ASE-containing mRNA was being actively translated and built into ribosomes [60]. The identification of ASE among translational machinery was made possible only by comparing all three tissues for overlapping variants/genes, which was streamlined by using VADT. Possible causes of the ribosomal ASE enrichment are explored below.

Over the last century, chickens have been under great selective pressure to grow bigger and faster and with greater feed efficiency. When a strain of broilers from 1957 was compared to a strain of broilers from 2005, there was a drastic improvement in all economically important traits, for example increases in bird size, especially in breast muscle, and faster growth [61]. These selective pressures on chickens have most likely caused the enrichment of key biologically important genes and pathways that are involved in metabolism and protein production and that favor growth. In our ASE study, especially when comparing all three tissues, we see this type of enrichment in our results. The exact mechanism by which these changes cause improvements is speculative at this point. For example, ribosomes with certain ASE SNPs could favor the production of certain types of proteins that encourage increased cell growth, because the ASE ribosome makes a protein that is harder to flag for degradation and allows for longer expression. At the same time, we are assuming these traits from ASE ribosomes are beneficial; they could potentially be harmful, which may be why ASE variants have a low prevalence in a population. For example, if a ribosome with ASE SNPs allows for a slightly detrimental amino acid change, it could manifest itself as a disease, as with wooden breast, which manifests itself over time in fast-growing broiler birds as hardening of the muscles [62]. ASE variants in ribosomes could allow for proteins to be produced that cumulatively, over time, cause muscular disease. However, both of these situations are hypothetical, and further data from an eQTL study are needed to show that ASE SNPs in ribosomes are associated with any measurable change; they could be benign mutations that have no measurable influence.

Overall, in this study we created a new analysis pipeline that simplifies and standardizes ASE detection, which will help researchers focus more on the biological significance of ASE and enable a much greater understanding of how ASE affects various types of organisms. We have also shown the impact of ASE variants in chickens, especially their potential role in translational machinery. Combining the knowledge of tissue-specific ASE variants with non-tissue-specific variants has the potential to substantially advance the agricultural field.


REFERENCES

1. Maurano, M.T., et al., Systematic localization of common disease-associated variation in regulatory DNA. Science, 2012. 337(6099): p. 1190-5.

2. Bartonicek, N., et al., Intergenic disease-associated regions are abundant in novel transcripts. Genome Biol, 2017. 18(1): p. 241.

3. Visscher, P.M., et al., 10 Years of GWAS Discovery: Biology, Function, and Translation. Am J Hum Genet, 2017. 101(1): p. 5-22.

4. Robinson, M.R., N.R. Wray, and P.M. Visscher, Explaining additional genetic variation in complex traits. Trends Genet, 2014. 30(4): p. 124-32.

5. Hasin-Brumshtein, Y., et al., Allele-specific expression and eQTL analysis in mouse adipose tissue. BMC Genomics, 2014. 15(471): p. 1-13.

6. Consortium, G., The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science, 2015. 348(6): p. 648-660.

7. Council, N.C. Broiler Chicken Industry Key Facts 2018. 2018 [cited 5-16-2018]; Available from: http://www.nationalchickencouncil.org/about-the-industry/statistics/broiler-chicken-industry-key-facts/.

8. Cheng, H.H., et al., Fine mapping of QTL and genomic prediction using allele-specific expression SNPs demonstrates that the complex trait of genetic resistance to Marek's disease is predominantly determined by transcriptional regulation. BMC Genomics, 2015. 16: p. 816.

9. Dunn, I.C., et al., The chicken polydactyly (Po) locus causes allelic imbalance and ectopic expression of Shh during limb development. Dev Dyn, 2011. 240(5): p. 1163-72.

10. Maceachern, S., et al., Genome-wide identification of allele-specific expression (ASE) in response to Marek's disease virus infection using next generation sequencing. BMC Proc, 2011. 5 Suppl 4: p. S14.


11. Meydan, H., et al., Allele-specific expression analysis reveals CD79B has a cis-acting regulatory element that responds to Marek's disease virus infection in chickens. Poult Sci, 2011. 90(6): p. 1206-11.

12. Perumbakkam, S., et al., Comparison and contrast of genes and biological pathways responding to Marek's disease virus infection using allele-specific expression and differential expression in broiler and layer chickens. BMC Genomics, 2013. 14(64): p. 1-10.

13. Zhen, F.S., et al., Tissue and allelic-specific expression of hsp70 gene in chickens: basal and heat-stress-induced mRNA level quantified with real-time reverse transcriptase polymerase chain reaction. Br Poult Sci, 2006. 47(4): p. 449-55.

14. Zhuo, Z., S.J. Lamont, and B. Abasht, RNA-Seq Analyses Identify Frequent Allele Specific Expression and No Evidence of Genomic Imprinting in Specific Embryonic Tissues of Chicken. Sci Rep, 2017. 7(1): p. 11944.

15. Maceachern, S., et al., Genome-Wide Identification and Quantification of cis- and trans-Regulated Genes Responding to Marek's Disease Virus Infection via Analysis of Allele-Specific Expression. Front Genet, 2011. 2: p. 113.

16. Degner, J.F., et al., Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics, 2009. 25(24): p. 3207-12.

17. Gu, F. and X. Wang, Analysis of allele specific expression -- a survey. Tsinghua Science and Technology, 2015. 20(5): p. 513-529.

18. Mayba, O., et al., MBASED: allele specific expression detection in cancer tissues and cell lines. Genome Biol, 2014. 15(405): p. 1-21.

19. Pirinen, M., et al., Assessing allele-specific expression across multiple tissues from RNA-seq read data. Bioinformatics, 2015. 31(15): p. 2497-504.

20. Edsgard, D., et al., GeneiASE: Detection of condition-dependent and static allele-specific expression from RNA-seq data without haplotype information. Sci Rep, 2016. 6: p. 21134.


21. Brandt, D.Y.C., et al., Mapping bias overestimates reference allele frequencies at the HLA gene in the 1000 genomes project phase I data. G3 Genes, Genomes, Genetics, 2015. 5.

22. Turro, E., et al., Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biol, 2011. 12(R13): p. 1-15.

23. Montgomery, S.B., et al., Transcriptome genetics using second generation sequencing in a Caucasian population. Nature, 2010. 464(7289): p. 773-7.

24. Quinlan, A.R. and I.M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 2010. 26(6): p. 841-2.

25. Dobin, A., et al., STAR: Ultrafast Universal RNA-Seq Aligner. Bioinformatics, 2012. 29(1): p. 15-21.

26. DePristo, M.A., et al., A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet, 2011. 43(5): p. 491-8.

27. McKenna, A., et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 2010. 20(9): p. 1297-303.

28. Van der Auwera, G.A., et al., From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics, 2013. 43: p. 11 10 1-33.

29. Zhou, N., W.R. Lee, and B. Abasht, Messenger RNA sequencing and pathway analysis provide novel insights into the biological basis of chickens' feed efficiency. BMC Genomics, 2015. 16: p. 195.

30. Zhuo, Z., et al., RNA-Seq Analysis of Abdominal Fat Reveals Differences between Modern Commercial Broiler Chickens with High and Low Feed Efficiencies. PLoS One, 2015. 10(8): p. e0135810.


31. Mutryn, M.F., et al., Characterization of a novel chicken muscle disorder through differential gene expression and pathway analysis using RNA-sequencing. BMC Genomics, 2015. 16: p. 399.

32. Andrews, S. FastQC: a quality control tool for high throughput sequence data. 2010 2016-08-03; Available from: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

33. GATK. Calling Variants in RNA-seq. 2014 2015-12-07 [cited 2016 09-29]; Available from: https://software.broadinstitute.org/gatk/guide/article?id=3891.

34. Dobin, A., STAR Manual 2.5.1a, C.S. Harbors, Editor. 2016: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf.

35. ENSEMBL, Genome assembly: Gallus_gallus-5.0, ENSEMBL, Editor. 2016: www.ensembl.org.

36. Zerbino, D.R., et al., Ensembl 2018. Nucleic Acids Res, 2018. 46(D1): p. D754-D761.

37. ENSEMBL, Gallus gallus 5.0 86 GTF. 2016.

38. Broad. Picard. 2016; Available from: https://broadinstitute.github.io/picard/.

39. Cock, P.J., et al., Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 2009. 25(11): p. 1422-3.

40. Kranis, A., et al., Development of a high density 600K SNP genotyping array for chicken. BMC Genomics, 2013. 14: p. 59.

41. Biosystems, A., Axiom Analysis Suite 3.0 (User Guide) Revision 4. 2017, ThermoFisher Scientific.

42. Biosystems, A., Axiom Genotyping Solution- Data Analysis Guide. 2017, ThermoFisher Scientific.


43. Jones, E., et al. SciPy (the library). 2001- [cited 12-03-2017]; Available from: http://www.scipy.org/.

44. Van Der Walt, S., S.C. Colbert, and G. Varoquaux, The numpy array: a structure for efficient numerical computation. Computing in Science & Engineering, 2011. 13(2): p. 22-30.

45. Foundation, P.S., Python Language Reference (version 3.6).

46. Benjamini, Y. and Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 1995. 57(1): p. 289-300.

47. Fisher, R., Statistical Methods for Research Workers (Thirteenth Edition- Revised). 1958, New York: Hafner Publishing Company Inc.

48. McLaren, W., et al., The Ensembl Variant Effect Predictor. Genome Biol, 2016. 17(1): p. 122.

49. Huang da, W., B.T. Sherman, and R.A. Lempicki, Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc, 2009. 4(1): p. 44-57.

50. Ghent, B.I. Calculate and draw custom Venn diagrams. Bioinformatics & Evolutionary Genomics 2017-2-12]; Available from: http://bioinformatics.psb.ugent.be/webtools/Venn/.

51. Oliveros, J.C. Venny. An interactive tool for comparing lists with Venn's diagrams. 2007-2015; Available from: http://bioinfogp.cnb.csic.es/tools/venny/index.html.

52. Kent, W.J., et al., The human genome browser at UCSC. Genome Res, 2002. 12(6): p. 996-1006.


53. Szklarczyk, D., et al., STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res, 2015. 43(Database issue): p. D447-52.

54. Kanehisa, M. and S. Goto, KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 2000. 28(1): p. 27-30.

55. Stelzer, G., et al., The GeneCards suite: from gene data mining to disease genome sequence analyses. Curr Protoc Bioinformatics, 2016. 54(1): p. 1.30.1-1.30.33.

56. Zhang, K., et al., Digital RNA allelotyping reveals tissue-specific and allele-specific gene expression in human. Nat Methods, 2009. 6(8): p. 613-8.

57. Pinter, S.F., et al., Allelic Imbalance Is a Prevalent and Tissue-Specific Feature of the Mouse Transcriptome. Genetics, 2015. 200(2): p. 537-49.

58. Chamberlain, A.J., et al., Extensive variation between tissues in allele specific expression in an outbred mammal. BMC Genomics, 2015. 16: p. 993.

59. Chen, A., et al., Ribosomal protein genes are highly enriched among genes with allele-specific expression in the interspecific F1 hybrid catfish. Mol Genet Genomics, 2016. 291(3): p. 1083-93.

60. Park, M.M., et al., Variant ribosomal RNA alleles are conserved and exhibit tissue-specific expression. Sci Adv, 2018. 4(2): p. 1-13.

61. Zuidhof, M.J., et al., Growth, efficiency, and yield of commercial broilers from 1957, 1978, and 2005. Poult Sci, 2014. 93(12): p. 2970-82.

62. Sihvo, H.K., K. Immonen, and E. Puolanne, Myodegeneration with fibrosis and regeneration in the pectoralis major muscle of broilers. Vet Pathol, 2014. 51(3): p. 619-23.


Chapter 3

FUTURE STUDIES OF ASE

3.1 Introduction

Overall, ASE helps form the missing link between GWAS and eQTL studies and gives researchers a new tool to investigate the genome. A major limitation in studying ASE is the lack of reliable and easy-to-implement bioinformatics tools for detecting it. The standalone ASE detection tool VADT, described in this thesis, addresses this issue. However, as with any good bioinformatics program, there are always improvements to be made. In the future, VADT might benefit from an improved model or parsing techniques to identify novel biological features. Additionally, after using VADT to identify ASE in chickens, we appear to have only scratched the surface of the true biological significance of ASE. There are still many more questions that can be asked about ASE, specifically regarding its biological impact. This chapter is divided into two major parts: the first focuses on further development of VADT, and the second on future biological investigations of ASE to better understand its mechanisms of action in an organism.

3.2 Future Developments of ASE Detection Analysis Pipeline

Currently there are many different directions the ASE detection pipeline can take, with possible improvements to the tool itself or the addition of new features to better detect ASE. Here we discuss two possible avenues of improvement: improving the statistical model, and identifying hidden biological features of ASE through changes to the analysis pipeline.

3.2.1 Improving VADT’s ASE Detection Model

The statistical method used to detect ASE in VADT is a simple binomial model, which, like every attempt to model a biological system, represents an oversimplification of the actual regulatory mechanism. In addition, with a binomial test the probability value is fixed for all tests; the default setting in VADT is 0.5 and can be adjusted if a user desires. The binomial test does not consider the directionality of the ASE change, which can vary between samples. For example, allele “A” may be more highly expressed in one sample and more lowly expressed in another; the binomial test flags both situations as ASE, and their p-values are then combined using Fisher's meta-analysis [1]. A better way to incorporate this directionality when combining samples is to use a beta-binomial test instead of the binomial test. A beta-binomial allows the probability of success to change (beta-function) and takes directionality into account for all samples when calculating significance [2]. It should be mentioned that VADT does provide summary reports with information about whether the reference or alternative allele shows ASE, so a user can try to investigate this phenomenon, but it does not provide further insights.
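The directionality problem described in this section can be illustrated with a minimal sketch. It is not VADT's code: the read counts are hypothetical, the helper names are invented for illustration, and only the Python standard library is used (an exact two-sided binomial test for p = 0.5 and Fisher's method with the closed-form chi-square tail for an even number of degrees of freedom).

```python
import math

def binom_two_sided(k, n):
    """Exact two-sided binomial p-value for the null p = 0.5 (hypothetical helper)."""
    # With p = 0.5 the distribution is symmetric, so the two-sided p-value
    # is twice the smaller tail, capped at 1.
    tail = min(k, n - k)
    lower = sum(math.comb(n, j) for j in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * lower)

def fisher_combined(p_values):
    """Fisher's method; chi-square tail in closed form for 2k degrees of freedom."""
    half = -sum(math.log(p) for p in p_values)  # this is X^2 / 2
    k = len(p_values)
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))

# Hypothetical (reference-allele reads, total reads) for one variant in two samples:
# allele "A" is high in the first sample and low in the second.
samples = [(70, 100), (30, 100)]
p_each = [binom_two_sided(k, n) for k, n in samples]
p_combined = fisher_combined(p_each)
pooled_ref_fraction = sum(k for k, _ in samples) / sum(n for _, n in samples)

print(p_each)               # both samples are individually significant
print(p_combined)           # the combined p-value is even smaller
print(pooled_ref_fraction)  # yet the pooled reference fraction is exactly 0.5
```

In this toy case both samples look strongly imbalanced and the combined p-value is very small, even though the two samples point in opposite directions and the pooled allele fraction shows no imbalance at all. A model that tracks direction across samples would treat such a variant differently.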

3.2.2 Strandness of Read Counts

During ASE detection, the strandness (forward/reverse) of reads was not taken into account. This means we are missing information about ASE that could be driven by different genes that are transcribed on opposite strands. It is possible to turn on a stranded feature in HaplotypeCaller [3-6] to gain information about strands for variant counts; however, VADT would need to be modified to take advantage of this information. By incorporating this feature, it may be possible to tease apart the unique strandness of ASE in tissues, assuming it occurs at a high enough frequency. In humans and mice it was previously found that around 10% of genes were overlapping [7]. This means it may be possible to detect 1,000-2,000 variants that are strand specific in our dataset if the entire analysis were re-run with this setting turned on. This insight would provide more precise information about ASE and which genes it is influencing.

3.3 Future Biological Directions

ASE variants have the potential to be extremely important for breeding purposes, but many questions still exist. It is still unknown how common these variants are among certain populations, and whether the same variants always show ASE or whether another unknown biological factor drives ASE variation. To try to answer these questions, some of the VADT liver results from Zhuo et al [8] were lifted over using the UCSC Genome Browser [9]. The statistically significant results in both studies were examined for overall overlap (Figure 3.1), and it was found that only 29.55% of variants from the study by Zhuo et al overlapped with our current study. This limited overlap was further verified when variants validated with Sanger sequencing by Zhuo et al were compared with all three tissues: the overlap with statistically significant results from the BCD cross was found to be on average 14% (Table 3.1). These results are not entirely unexpected because the two lines are genetically very distinct; however, these findings raise key questions about how ASE functions in different populations. Clearly, there are ASE variants that can be described as “strong” that overlap between studies, but there could be other ASE variants that show “moderate” or “low” prevalence among different tissues or different genetic backgrounds. Overall, this comparison between the two studies has the potential for some very exciting work into how ASE functions in two populations of birds from very different genetic backgrounds.

Figure 3.1: Comparison of variants considered statistically significant by VADT from the study by Zhuo et al [8] and our ASE study of the BCD cross. Venn diagram was produced using online bioinformatics tool [10].


Table 3.1: Variants previously verified using Sanger sequencing and corresponding VADT results from both chicken datasets tested using VADT. Variants were previously reported by Zhuo et al [8] as showing ASE and were verified using Sanger sequencing in the tissue designated on the left-hand side of the table. ASE statistical significance was determined by VADT. Coordinates were lifted over from Gallus gallus 4.0 to Gallus gallus 5.0 for comparison purposes using the UCSC genome browser [9].

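The five right-hand columns report whether VADT called the variant statistically significant (VADT Results Sig. Variant).

| Original Tissue Designation | Gene | Variant | Ref. Allele | Brain (Zhuo 2017) | Liver (Zhuo 2017) | BCD Breast Muscle | BCD Ab. Fat | BCD Liver |
|---|---|---|---|---|---|---|---|---|
| Brain | AKR1D1 | 1:56735606 | C | yes | yes | no | no | no |
| Brain | MPZL1 | 1:90891019 | G | yes | yes | no | no | no |
| Brain | BRP44 | 1:90873705 | C | yes | yes | no | no | no |
| Brain | RFT1 | 12:1200202 | T | yes | yes | yes | no | no |
| Brain | ROGD1 | 14:14415138 | G | yes | yes | no | no | no |
| Brain | PECAM1 | 18:6837730 | A | yes | no | yes | yes | yes |
| Brain | FAM110B | 2:111605284 | G | yes | no | no | no | no |
| Brain | KCNK1 | 3:38009219 | T | yes | yes | no | no | no |
| Brain | SOD2 | 3:44930274 | T | yes | yes | no | no | no |
| Liver | MANSC | 1:71975277 | G | no | yes | no | yes | yes |
| Liver | CLN5 | 1:153672059 | G | no | yes | no | no | yes |
| Liver | PPM1K | 4:46422860 | T | yes | yes | no | no | no |
| Liver | 27334 | Z:54543069 | G | no | yes | no | no | no |
| Liver | FETUB | 9:15627917 | C | yes | yes | no | no | no |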

3.4 Conclusion

In this study we have created a new analysis pipeline that standardizes ASE detection, which, as demonstrated in this future directions chapter, will allow researchers to focus more on the biological impact of ASE in their organism. We have also demonstrated that ASE variants in chickens are influencing translational machinery. Combining this knowledge of tissue-specific ASE with non-tissue-specific ASE has the potential to drastically influence the field of agriculture. There are many directions in which this research can be pushed, both from the computational side and from the biological side, but it is clear that understanding ASE variants is key to helping bridge the gap between GWAS and eQTL studies and to explaining part of the molecular mechanism involved in gene expression. As previously demonstrated by Cheng et al [11], ASE variants, when utilized with prior genetic information, have the potential to revolutionize future breeding programs. Clearly, further research that provides a better understanding of ASE is warranted.

REFERENCES

1. Fisher, R., Statistical Methods for Research Workers (Thirteenth Edition- Revised). 1958, New York: Hafner Publishing Company Inc.

2. Harries, J.M. and G.L. Smith, The two-factor triangle test. J. Fd Technol, 1982. 17: p. 153-162.

3. McKenna, A., et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 2010. 20(9): p. 1297-303.

4. DePristo, M.A., et al., A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet, 2011. 43(5): p. 491-8.

5. GATK. Calling Variants in RNA-seq. 2014 2015-12-07 [cited 2016 09-29]; Available from: https://software.broadinstitute.org/gatk/guide/article?id=3891.

6. Van der Auwera, G.A., et al., From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics, 2013. 43: p. 11.10.1-11.10.33.

7. Sanna, C.R., W.H. Li, and L. Zhang, Overlapping genes in the human and mouse genomes. BMC Genomics, 2008. 9: p. 169.

8. Zhuo, Z., S.J. Lamont, and B. Abasht, RNA-Seq Analyses Identify Frequent Allele Specific Expression and No Evidence of Genomic Imprinting in Specific Embryonic Tissues of Chicken. Sci Rep, 2017. 7(1): p. 11944.

9. Kent, W.J., et al., The human genome browser at UCSC. Genome Res, 2002. 12(6): p. 996-1006.

10. Ghent, B.I. Calculate and draw custom Venn diagrams. Bioinformatics & Evolutionary Genomics 2017-2-12]; Available from: http://bioinformatics.psb.ugent.be/webtools/Venn/.

11. Cheng, H.H., et al., Fine mapping of QTL and genomic prediction using allele-specific expression SNPs demonstrates that the complex trait of genetic resistance to Marek's disease is predominantly determined by transcriptional regulation. BMC Genomics, 2015. 16: p. 816.

APPENDIX A

DERIVATION OF BINOMIAL TEST

The binomial test is a statistical test where there are two possible outcomes (success and failure). The binomial test is originally based on the binomial distribution equation (Equation A.1) [1], where n = number of trials, k = number of successes, p = probability of success and q = (1-p), but this equation can become very tedious and cumbersome to use with very large numbers.

Probability Mass Function of Binomial Distribution:

$$\Pr(X = k) = \frac{n!}{k!\,(n-k)!}\, p^{k} q^{n-k} \tag{A.1}$$

To simplify the analysis, a normal distribution equation can be utilized when there is a large sample size, np > 5 and nq > 5 [2]. The exact derivation of how a normal distribution is modified for a binomial test can be seen in Equations A.2-A.4.

First, standardize the normal variable for a normal distribution function, where a and b are the boundary points, µ = mean and σ = standard deviation [1].

$$\Pr(a < X < b) = \Pr\!\left(\frac{a-\mu}{\sigma} < Z < \frac{b-\mu}{\sigma}\right) \tag{A.2}$$

Now if a normal approximation is applied to the binomial distribution, a = b and the area is approximated by the interval from (a - 1/2) to (a + 1/2), which rearranges Equation A.2 into Equation A.3.

$$Z = \frac{a - \mu \pm 0.5}{\sigma} \tag{A.3}$$

Now, utilizing µ = np and σ² = npq for the binomial distribution and applying them to the normal distribution gives the final equation utilized in the binomial test (Equation A.4) [2]. In this final equation k = number of successes, n = total number of trials, p = probability of success and q = (1-p).

$$Z = \frac{k - np \pm 0.5}{\sqrt{npq}} \tag{A.4}$$

Once a Z value is determined, the p-value can be determined from a Z-table and, in this case, is multiplied by 2 to correct for the two-tailed test. Overall, the binomial test can be seen as a normal approximation applied to the binomial distribution.
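As a worked illustration of Equation A.4, the sketch below computes the continuity-corrected Z value and its two-tailed p-value for 121 successes in 200 trials, and compares it with an exact sum of Equation A.1. This is a minimal sketch rather than VADT's implementation: the function names are invented for illustration, only the standard library is used, and the 0.5 correction is applied towards the mean (an assumption about how the ± sign is resolved).

```python
import math

def normal_approx_binomial(k, n, p=0.5):
    """Z value from Equation A.4 and its two-tailed p-value (hypothetical helper)."""
    q = 1 - p
    mu = n * p
    sigma = math.sqrt(n * p * q)
    # Continuity correction: move the observed count 0.5 towards the mean.
    z = max(0.0, abs(k - mu) - 0.5) / sigma
    # The upper-tail area of the standard normal is 0.5 * erfc(z / sqrt(2));
    # doubling it gives the two-tailed p-value.
    return z, math.erfc(z / math.sqrt(2))

def exact_binomial(k, n):
    """Exact two-sided p-value for p = 0.5, summed directly from Equation A.1."""
    tail = min(k, n - k)
    lower = sum(math.comb(n, j) for j in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * lower)

z, p_approx = normal_approx_binomial(121, 200)
p_exact = exact_binomial(121, 200)
print(round(z, 3), p_approx, p_exact)
```

Here np = nq = 100, so the large-sample condition is comfortably met and the two p-values agree to within a few parts in ten thousand.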

This binomial test can easily be implemented in Python using the SciPy stats module as seen in Figure A.1 [3, 4].

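Code

    #!/usr/bin/env python3.6
    from scipy import stats

    x = 121  # number of successes
    n = 200  # total trials performed
    p = 0.5  # probability of success

    # Two-sided binomial test
    # (recent SciPy releases provide stats.binomtest(x, n, p).pvalue in place of binom_test)
    value_test = stats.binom_test(x, n, p)
    print(value_test)

Output

    0.00363494795253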

Figure A.1: Example Python code to perform modified binomial test using SciPy.


APPENDIX B

FALSE DISCOVERY RATE

The false discovery rate (FDR) is a statistical way to control the type I error (false positive) rate when performing multiple tests. The FDR method can also be referred to as the Benjamini-Hochberg method, but for simplicity it is referred to here solely as the FDR method. It is a preferred means of correcting for type I error when compared to the Bonferroni approach, which can be too stringent. The original FDR procedure created by Benjamini and Hochberg can be seen in Equation B.1 [5]. The first step in the procedure is sorting all p-values from lowest to highest and assigning a rank to each p-value. In Equation B.1, i is the rank of the p-value, m is the total number of p-values and q is 0.05.

$$p\_value \le \frac{i}{m}\, q \tag{B.1}$$

If the p-value being compared is less than or equal to the corrected value on the right of the inequality, it is considered significant. To simplify the equation, it is possible to rearrange the values so that an adjusted p-value (also known as a q-value) is created, which is considered significant if it is <0.05 (Equation B.2). This value is much more straightforward and easier to understand than an inequality.

$$q\_value = \frac{p\_value \cdot m}{i} \tag{B.2}$$

Because no FDR routine was available in the Python 3.6 modules used for VADT, a custom script was developed to perform this analysis. However, a dilemma occurs with duplicate p-values: it is unclear which corrected p-value, now referred to as a q-value, should be assigned to each duplicate, as demonstrated in Table B.1.

Table B.1: Example of duplicate p-values and their corresponding corrected p-values (q-values) and overall significance based on <0.05 cutoff.

| Order | P-value | Q-value | Significance |
|---|---|---|---|
| 1 | 0.04 | 0.280 | No |
| 2 | 0.04 | 0.140 | No |
| 3 | 0.04 | 0.093 | No |
| 4 | 0.04 | 0.070 | No |
| 5 | 0.04 | 0.056 | No |
| 6 | 0.04 | 0.047 | Yes |
| 7 | 0.04 | 0.040 | Yes |

There are four possible approaches when assigning corrected values to duplicates: assign the most stringent (highest) q-value calculated to all duplicates, assign the lowest (most significant) q-value to all duplicates, randomly assign the calculated q-values among the duplicates, or assign q-values based on the original order found in the dataset. All four methods have pros and cons, and the choice of method can slightly influence the end results and should be considered when interpreting them. In our code, we chose the more liberal approach and assigned the lowest q-value to all duplicates within a group. To ensure this did not bias our data, we also ran one of our datasets with the more stringent FDR approach and compared the results between the two datasets (Figure B.1), where no difference was found in which variants were considered significant. This was largely expected with such a large number of variants tested, but if our code is applied to a small dataset, caution must be taken when interpreting q-values, as demonstrated in Table B.1, where two q-values are significant and the rest are not, and deciding which value is correct is difficult. The final code implemented in VADT can be seen in Figure B.2, which uses a Python dictionary for speed and efficiency when looking up values.
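The liberal and stringent treatments of duplicates compared above can be sketched as follows. This is an illustrative example, not the code used in VADT: the function name and the `duplicates` argument are hypothetical, and the q-values are computed with Equation B.2 exactly as in Table B.1.

```python
def fdr_q_values(p_values, duplicates="liberal"):
    """Equation B.2, with a single q-value assigned to each group of tied p-values."""
    m = len(p_values)
    per_p = {}
    # Rank the sorted p-values from 1 to m and collect every q-value per p-value.
    for i, p in enumerate(sorted(p_values), start=1):
        per_p.setdefault(p, []).append(p * m / i)
    # "liberal" keeps the lowest q-value of a tied group, "stringent" the highest.
    pick = min if duplicates == "liberal" else max
    return {p: pick(qs) for p, qs in per_p.items()}

table_b1 = [0.04] * 7  # the seven duplicate p-values of Table B.1
liberal = fdr_q_values(table_b1, "liberal")
stringent = fdr_q_values(table_b1, "stringent")
print(liberal)    # q-value close to 0.04: all seven would be called significant
print(stringent)  # q-value close to 0.28: none would be called significant
```

With only seven tied values the two choices give opposite conclusions, which is the small-dataset caveat raised above. When there are no duplicates the two settings return identical q-values.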

Figure B.1: Comparing variants considered significant by two different FDR methods when analyzing breast muscle tissue using VADT. The less stringent FDR assigned the most significant q-value to all duplicate p-values and the more stringent FDR assigned the least significant q-value to duplicate p-values.


Code

    #!/usr/bin/env python3.6
    # Python code to create an adjusted p-value (q-value)

    # Original example values from the Benjamini-Hochberg 1995 paper
    p_values = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298, 0.0344,
                0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.000]

    def fdr_correction(values):
        p_value_dict = {}
        adj_p_values = []
        # sort list (smallest p-value to highest value)
        values.sort()
        # create i value (rank) for equation
        i = 1
        for x in range(0, len(values)):
            # append adjusted FDR p-value
            adj_p_values.append(values[x] * len(values) / i)
            i += 1  # increment i value
            # create a dictionary of values (key: original value, value: adjusted value);
            # a duplicate p-value overwrites its earlier entry with the later, lower q-value
            p_value_dict.update({values[x]: adj_p_values[x]})
        return p_value_dict

    def main():
        p_value_dict = fdr_correction(p_values)
        print(p_value_dict)

    main()

Output

    0.0001: 0.0015, 0.0004: 0.003, 0.0019: 0.0095, 0.0095: 0.035625, 0.0201: 0.0603, 0.0278: 0.069499, 0.0298: 0.06385, 0.0344: 0.0645, 0.0459: 0.0765, 0.324: 0.48600, 0.4262: 0.581182, 0.5719: 0.714875, 0.6528: 0.75323, 0.759: 0.81321, 1.0: 1.0

Figure B.2: Example Python code to perform FDR that outputs adjusted p-values (q-values). The results are reported as a “key:value” dictionary where the original value is the key and the q-value is the value. The first four original values (0.0001, 0.0004, 0.0019 and 0.0095, highlighted yellow in the original figure) are considered significant because their q-values are <0.05.

APPENDIX C

FISHER’S METHOD FOR COMBINED PROBABILITY

Fisher’s method is a meta-analysis across multiple samples that tests for overall statistical significance when the individual tests are combined and examined as a whole [6]. The test statistic follows a chi-square distribution with 2k degrees of freedom, where k is the number of tests combined.

The overall equation can be seen in Equation C.1, where k = number of tests and p_i refers to the p-value from each test [7].

$$X^{2} = 2 \sum_{i=1}^{k} \left(-\log_{e}(p_{i})\right) \tag{C.1}$$

The resulting value is looked up in a chi-square distribution table, where the degrees of freedom = 2*k. Fisher’s method can easily be implemented in Python using the SciPy stats module as seen in Figure C.1.

Code

    #!/usr/bin/env python3.6
    from scipy.stats import combine_pvalues

    # Original example values from Fisher (1958)
    values = [0.145, 0.263, 0.087]

    # The default method of combine_pvalues is Fisher's method
    print(combine_pvalues(values))

Output

    (11.4169, 0.076)

Figure C.1: Example Python code to combine p-values using SciPy. The first value reported is the chi-square value and the second value is the actual p-value.
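The values in Figure C.1 can also be checked by hand from Equation C.1. The worked calculation below uses the same three p-values; the closed-form tail probability in the last line is a standard identity for a chi-square distribution with an even number of degrees of freedom and is added here only as a check, not as part of the original derivation.

```latex
\begin{aligned}
X^{2} &= 2\left(-\ln 0.145 - \ln 0.263 - \ln 0.087\right) \\
      &= 2\,(1.9310 + 1.3356 + 2.4418) = 2\,(5.7085) \approx 11.417,
      \qquad \text{degrees of freedom} = 2k = 6 \\[4pt]
\Pr\!\left(\chi^{2}_{6} \ge X^{2}\right)
      &= e^{-X^{2}/2} \sum_{j=0}^{k-1} \frac{(X^{2}/2)^{j}}{j!}
       = e^{-5.7085}\left(1 + 5.7085 + \frac{5.7085^{2}}{2}\right)
       \approx 0.0033 \times 23.00 \approx 0.076
\end{aligned}
```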

APPENDIX D

BREAKDOWN OF INFORMATIVE VARIANT COUNTS

Informative variant counts were analyzed for overall summary statistics and broken down by tissue in the following two tables: Table D.1 gives summary statistics of the read counts, whereas Table D.2 breaks down the binning of read counts. The majority of variants were found to have read counts between 20 and 200.

Table D.1: Summary statistics of informative variants read counts per tissue.

| | Breast Muscle | Ab. Fat | Liver |
|---|---|---|---|
| Total Variants Analyzed | 148,860 | 217,628 | 155,875 |
| Total Read Counts | 70,385,651 | 112,093,032 | 83,352,055 |
| Average Read Count | 472.83 | 515.07 | 534.74 |
| Range of Read Counts | 20-73134 | 20-68765 | 20-82966 |

Table D.2: Binning of all informative variant read counts broken down by tissue. Bins refer to the total counts for a variant and reflect the number of variants with those counts

| Bins | Breast Muscle Counts | Breast Muscle Percent | Abdominal Fat Counts | Abdominal Fat Percent | Liver Counts | Liver Percent |
|---|---|---|---|---|---|---|
| Bin 20-200 | 87135 | 58.53% | 111401 | 51.19% | 88204 | 56.59% |
| Bin 200-400 | 24369 | 16.37% | 40583 | 18.65% | 27131 | 17.41% |
| Bin 400-600 | 12322 | 8.28% | 21632 | 9.94% | 13427 | 8.61% |
| Bin 600-800 | 6760 | 4.54% | 12287 | 5.65% | 7197 | 4.62% |
| Bin 800-1000 | 4232 | 2.84% | 7671 | 3.52% | 4235 | 2.72% |
| Bin 1000-1200 | 2644 | 1.78% | 5017 | 2.31% | 2836 | 1.82% |
| Bin 1200-1400 | 1900 | 1.28% | 3502 | 1.61% | 2100 | 1.35% |
| Bin >1400 | 9498 | 6.38% | 15535 | 7.14% | 10745 | 6.89% |
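The derived columns in Tables D.1 and D.2 can be recomputed from the raw totals, as in the short sketch below. The numbers are copied from the two tables; the `bin_label` helper is hypothetical and assumes that each bin includes its upper edge (so a count of 200 falls in "Bin 20-200" and 1400 in "Bin 1200-1400"), which the tables do not state explicitly.

```python
# Totals copied from Table D.1 and the first row of Table D.2.
tissues = {
    "Breast Muscle": {"variants": 148860, "reads": 70385651, "bin_20_200": 87135},
    "Abdominal Fat": {"variants": 217628, "reads": 112093032, "bin_20_200": 111401},
    "Liver": {"variants": 155875, "reads": 83352055, "bin_20_200": 88204},
}

def bin_label(count):
    """Assign a read count (>= 20) to a Table D.2 bin; upper edges assumed inclusive."""
    if count > 1400:
        return "Bin >1400"
    upper = max(200, -(-count // 200) * 200)  # round up to the next multiple of 200
    lower = 20 if upper == 200 else upper - 200
    return f"Bin {lower}-{upper}"

summary = {}
for name, t in tissues.items():
    average = t["reads"] / t["variants"]              # Average Read Count
    percent = 100 * t["bin_20_200"] / t["variants"]   # share of variants in Bin 20-200
    summary[name] = (average, percent)
    print(f"{name}: average {average:.2f}, Bin 20-200 {percent:.2f}%")

print(bin_label(20), bin_label(200), bin_label(201), bin_label(1400), bin_label(1401))
```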

APPENDIX E

EXPANDED VIEW OF DAVID RESULTS

Significant ASE variants identified using VADT were uploaded to Ensembl’s VEP [8] and results were parsed using a custom script to identify all unique Ensembl gene IDs, which were separated into various categories based on how they overlapped in Figure 2.8, and the results were uploaded to DAVID [9, 10]. The results of the analysis can be seen in Figure 2.10, and an expanded view of the pathway enrichment results from DAVID can be seen in Tables E.1-E.5.

Table E.1: Breast Muscle tissue results from DAVID, 1319 Ensembl IDs submitted and 817 Ensembl IDs matched for analysis. Annotation clusters 1, 3 and 5 showed statistical significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms are listed.

| Annotation Cluster | Enrichment Score | Category | Term | Count | P_Value | Benjamini |
|---|---|---|---|---|---|---|
| 1 | 2.89 | UP_KEYWORDS | Zinc-finger | 42 | 5.00E-04 | 5.50E-02 |
| 1 | 2.89 | UP_KEYWORDS | Zinc | 56 | 9.00E-04 | 6.60E-02 |
| 3 | 2.3 | GOTERM_CC_DIRECT | Cul3-RING ubiquitin ligase complex | 12 | 4.50E-04 | 4.80E-02 |
| 5 | 1.75 | GOTERM_BP_DIRECT | ubiquitin-dependent protein catabolic process | 17 | 5.00E-05 | 8.30E-02 |

Table E.2: Breast Muscle/Liver tissue results from DAVID, 430 Ensembl IDs submitted and 281 Ensembl IDs matched for analysis. Only annotation cluster 1 showed statistical significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms listed in table.

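| Annotation Cluster | Enrichment Score | Category | Term | Count | P_Value | Benjamini |
|---|---|---|---|---|---|---|
| 1 | 3.15 | KEGG_PATHWAY | Biosynthesis of antibiotics | 15 | 3.50E-04 | 3.50E-02 |
| 1 | 3.15 | KEGG_PATHWAY | Biosynthesis of amino acids | 8 | 9.60E-04 | 4.80E-02 |
| 1 | 3.15 | KEGG_PATHWAY | Carbon metabolism | 10 | 1.00E-03 | 3.50E-02 |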

Table E.3: Breast Muscle/Abdominal Fat tissue results from DAVID, 1320 Ensembl IDs submitted and 833 Ensembl IDs matched for enrichment. Only annotation cluster 1 showed significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms listed in table.

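| Annotation Cluster | Enrichment Score | Category | Term | Count | P_Value | Benjamini |
|---|---|---|---|---|---|---|
| 1 | 2.85 | INTERPRO | Collagen triple helix repeat | 14 | 2.30E-05 | 2.90E-02 |
| 1 | 2.85 | UP_KEYWORDS | Collagen | 11 | 1.10E-04 | 2.50E-02 |
| 1 | 2.85 | GOTERM_CC_DIRECT | collagen trimer | 10 | 2.30E-04 | 4.80E-02 |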

Table E.4: All three tissue results from DAVID, 1715 Ensembl IDs submitted and 1036 Ensembl IDs matched for enrichment. A total of five clusters showed significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms are listed in the table.

| Annotation Cluster | Enrichment Score | Category | Term | Count | P_Value | Benjamini |
|---|---|---|---|---|---|---|
| 1 | 8.27 | GOTERM_CC_DIRECT | cytosolic large ribosomal subunit | 21 | 2.50E-10 | 3.20E-08 |
| 1 | 8.27 | GOTERM_MF_DIRECT | structural constituent of ribosome | 37 | 4.60E-10 | 3.30E-07 |
| 1 | 8.27 | KEGG_PATHWAY | Ribosome | 37 | 9.60E-10 | 1.40E-07 |
| 1 | 8.27 | UP_KEYWORDS | Ribosomal protein | 24 | 2.40E-09 | 6.90E-07 |
| 1 | 8.27 | UP_KEYWORDS | Ribonucleoprotein | 25 | 5.10E-08 | 7.40E-06 |
| 1 | 8.27 | GOTERM_BP_DIRECT | translation | 29 | 1.80E-06 | 2.00E-03 |
| 2 | 2.97 | INTERPRO | Glutathione S-transferase, C-terminal-like | 12 | 3.50E-06 | 3.00E-03 |
| 2 | 2.97 | INTERPRO | Glutathione S-transferase, C-terminal | 8 | 7.90E-05 | 4.40E-02 |
| 2 | 2.97 | GOTERM_BP_DIRECT | glutathione metabolic process | 9 | 8.80E-05 | 3.80E-02 |
| 2 | 2.97 | GOTERM_MF_DIRECT | glutathione transferase activity | 8 | 2.90E-04 | 6.70E-02 |
| 2 | 2.97 | KEGG_PATHWAY | Glutathione metabolism | 11 | 4.40E-03 | 5.20E-02 |
| 2 | 2.97 | KEGG_PATHWAY | Metabolism of xenobiotics by cytochrome P450 | 9 | 1.20E-02 | 9.30E-02 |
| 3 | 2.65 | KEGG_PATHWAY | Fatty acid degradation | 16 | 2.60E-07 | 1.90E-05 |
| 3 | 2.65 | KEGG_PATHWAY | Valine, leucine and isoleucine degradation | 14 | 3.60E-04 | 8.80E-03 |
| 3 | 2.65 | KEGG_PATHWAY | beta-Alanine metabolism | 10 | 1.10E-03 | 2.30E-02 |
| 3 | 2.65 | KEGG_PATHWAY | Pyruvate metabolism | 11 | 1.80E-03 | 2.90E-02 |
| 3 | 2.65 | KEGG_PATHWAY | Arginine and proline metabolism | 12 | 2.40E-03 | 3.50E-02 |
| 4 | 1.93 | KEGG_PATHWAY | Glycolysis / Gluconeogenesis | 14 | 1.70E-03 | 3.00E-02 |
| 7 | 1.54 | KEGG_PATHWAY | Butanoate metabolism | 8 | 9.20E-03 | 8.00E-02 |
| 7 | 1.54 | KEGG_PATHWAY | Propanoate metabolism | 8 | 1.20E-02 | 9.50E-02 |

Table E.5: Liver tissue results from DAVID, 1057 Ensembl IDs submitted and 703 Ensembl IDs matched for enrichment. Only 2 annotation clusters are shown; however, a total of 7 clusters had significance (FDR p-value <0.1 and Enrichment Score >1.3). Only significant terms are listed in the table.

| Annotation Cluster | Enrichment Score | Category | Term | Count | P_Value | Benjamini |
|---|---|---|---|---|---|---|
| 1 | 4.25 | GOTERM_BP_DIRECT | fibrinolysis | 8 | 1.40E-07 | 2.40E-04 |
| 1 | 4.25 | SMART | Tryp_SPc | 14 | 1.60E-07 | 3.90E-05 |
| 1 | 4.25 | INTERPRO | Peptidase S1A, chymotrypsin-type | 14 | 2.80E-07 | 3.30E-04 |
| 1 | 4.25 | UP_KEYWORDS | Hemostasis | 7 | 5.60E-07 | 1.40E-04 |
| 1 | 4.25 | UP_KEYWORDS | Blood coagulation | 7 | 5.60E-07 | 1.40E-04 |
| 1 | 4.25 | INTERPRO | Peptidase S1 | 14 | 1.10E-06 | 6.60E-04 |
| 1 | 4.25 | INTERPRO | Trypsin-like cysteine/serine peptidase domain | 14 | 2.30E-06 | 9.20E-04 |
| 1 | 4.25 | UP_KEYWORDS | Serine protease | 13 | 4.20E-06 | 5.40E-04 |
| 1 | 4.25 | GOTERM_MF_DIRECT | serine-type endopeptidase activity | 16 | 1.70E-05 | 9.40E-03 |
| 1 | 4.25 | INTERPRO | Peptidase S1, trypsin family, active site | 10 | 2.00E-04 | 2.40E-02 |
| 1 | 4.25 | UP_KEYWORDS | Protease | 20 | 3.90E-03 | 9.50E-02 |
| 2 | 3.75 | INTERPRO | Cytochrome P450 | 11 | 2.80E-06 | 8.30E-04 |
| 2 | 3.75 | GOTERM_MF_DIRECT | iron ion binding | 18 | 2.20E-05 | 6.20E-03 |
| 2 | 3.75 | COG_ONTOLOGY | Secondary metabolites biosynthesis, transport, and catabolism | 14 | 2.70E-05 | 7.30E-04 |
| 2 | 3.75 | UP_KEYWORDS | Iron | 18 | 1.60E-04 | 1.30E-02 |
| 2 | 3.75 | GOTERM_MF_DIRECT | heme binding | 15 | 2.00E-04 | 3.60E-02 |
| 2 | 3.75 | UP_KEYWORDS | Monooxygenase | 9 | 8.60E-04 | 4.30E-02 |
| 2 | 3.75 | UP_KEYWORDS | Heme | 11 | 9.80E-04 | 4.10E-02 |
| 2 | 3.75 | UP_KEYWORDS | Oxidoreductase | 23 | 1.10E-03 | 3.80E-02 |

Informative and Significant ASE variants identified using VADT were uploaded to Ensembl’s VEP [8] and results were parsed using a custom script to tally all variant counts per gene. Gene counts were then normalized (statistically sig. counts / informative counts) for each gene. The genes with >=50% normalization value were compared between tissues and genes only found in one tissue were uploaded to DAVID [9, 10]. The results of the analysis can be seen in Figure 2.11, and an expanded view of the pathway enrichment results from DAVID can be seen in Tables E.6-E.8.
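The normalization and tissue comparison described above can be sketched as follows. This is an illustrative example, not the custom script used in this study: the gene names and per-gene tallies are hypothetical, and the 0.5 cutoff corresponds to the >=50% normalization value.

```python
# Hypothetical tallies: tissue -> gene -> (significant variants, informative variants)
counts = {
    "breast_muscle": {"GENE_A": (3, 4), "GENE_B": (1, 5), "GENE_C": (2, 2)},
    "abdominal_fat": {"GENE_A": (2, 4), "GENE_B": (0, 6)},
    "liver": {"GENE_A": (1, 4), "GENE_B": (4, 5)},
}

def genes_passing(per_gene, cutoff=0.5):
    """Genes whose normalized value (significant / informative) meets the cutoff."""
    return {
        gene
        for gene, (sig, informative) in per_gene.items()
        if informative > 0 and sig / informative >= cutoff
    }

passing = {tissue: genes_passing(per_gene) for tissue, per_gene in counts.items()}

# Keep only genes that pass the cutoff in exactly one tissue.
unique = {
    tissue: genes - set().union(*(g for t, g in passing.items() if t != tissue))
    for tissue, genes in passing.items()
}
print(passing)
print(unique)
```

In this toy input GENE_A passes in two tissues and is therefore excluded, while GENE_C and GENE_B are each specific to a single tissue and would be the ones submitted for enrichment analysis.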

Table E.6: Breast muscle tissue results from DAVID, 543 Ensembl IDs submitted and 272 Ensembl IDs matched for enrichment. Only annotation cluster 1 showed significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms listed in table.

| Annotation Cluster | Enrichment Score | Category | Term | Count | P_Value | Benjamini |
|---|---|---|---|---|---|---|
| 1 | 3.22 | KEGG_PATHWAY | Biosynthesis of antibiotics | 18 | 1.10E-05 | 1.10E-03 |
| 1 | 3.22 | UP_KEYWORDS | Glycolysis | 6 | 2.30E-05 | 2.10E-03 |
| 1 | 3.22 | GOTERM_BP_DIRECT | glycolytic process | 6 | 5.30E-05 | 4.00E-02 |
| 1 | 3.22 | GOTERM_BP_DIRECT | gluconeogenesis | 6 | 1.70E-04 | 6.20E-02 |
| 1 | 3.22 | KEGG_PATHWAY | Carbon metabolism | 10 | 1.40E-03 | 7.30E-02 |

Table E.7: Abdominal fat tissue results from DAVID, 536 Ensembl IDs submitted and 243 Ensembl IDs matched for enrichment. Only annotation cluster 1 showed significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms listed in table.

| Annotation Cluster | Enrichment Score | Category | Term | Count | P_Value | Benjamini |
|---|---|---|---|---|---|---|
| 1 | 3.25 | SMART | FBG | 5 | 3.30E-04 | 3.30E-02 |
| 1 | 3.25 | INTERPRO | Fibrinogen, alpha/beta/gamma chain, C-terminal globular domain | 5 | 5.60E-04 | 8.20E-02 |

Table E.8: Liver tissue results from DAVID, 484 Ensembl IDs submitted and 260 Ensembl IDs matched for enrichment. Only annotation cluster 1 showed significance (FDR p-value <0.1 and Enrichment Score >1.3) and only significant terms listed in table.

| Annotation Cluster | Enrichment Score | Category | Term | Count | P_Value | Benjamini |
|---|---|---|---|---|---|---|
| 1 | 2.14 | UP_KEYWORDS | Oxidoreductase | 14 | 1.60E-04 | 2.70E-02 |
| 1 | 2.14 | UP_KEYWORDS | Iron | 10 | 4.80E-04 | 4.10E-02 |
| 1 | 2.14 | COG_ONTOLOGY | Secondary metabolites biosynthesis, transport, and catabolism | 6 | 2.30E-03 | 3.00E-02 |

APPENDIX F

TOP ASE CANDIDATE GENES

Table F.1: Top ASE genes identified using VADT and VEP after normalizing the variants called. Gene variants were normalized using the following: (sig. variants / informative variants * 100). Genes were considered “top” if they were identified in all three tissues (breast muscle, abdominal fat, and liver) and the average normalization value across all three tissues was >=80%.

| No. | Ensembl Gene ID | Gene Symbol | Total Hits | Avg. Normalized |
|---|---|---|---|---|
| 1 | ENSGALG00000038687 | - | 3 | 100.00 |
| 2 | ENSGALG00000021139 | IGLL1 | 3 | 100.00 |
| 3 | ENSGALG00000042449 | - | 3 | 100.00 |
| 4 | ENSGALG00000039383 | - | 3 | 100.00 |
| 5 | ENSGALG00000041885 | - | 3 | 100.00 |
| 6 | ENSGALG00000029837 | OST4 | 3 | 100.00 |
| 7 | ENSGALG00000029729 | - | 3 | 100.00 |
| 8 | ENSGALG00000035138 | UBL5 | 3 | 100.00 |
| 9 | ENSGALG00000043742 | - | 3 | 100.00 |
| 10 | ENSGALG00000032272 | - | 3 | 100.00 |
| 11 | ENSGALG00000007611 | RPL35A | 3 | 100.00 |
| 12 | ENSGALG00000008066 | UQCR10 | 3 | 100.00 |
| 13 | ENSGALG00000037876 | - | 3 | 100.00 |
| 14 | ENSGALG00000039867 | - | 3 | 100.00 |
| 15 | ENSGALG00000034352 | MRPL43 | 3 | 100.00 |
| 16 | ENSGALG00000041562 | - | 3 | 100.00 |
| 17 | ENSGALG00000034151 | - | 3 | 100.00 |
| 18 | ENSGALG00000012229 | RPS29 | 3 | 100.00 |
| 19 | ENSGALG00000042642 | RARRES2 | 3 | 100.00 |
| 20 | ENSGALG00000012299 | - | 3 | 100.00 |
| 21 | ENSGALG00000046114 | - | 3 | 100.00 |
| 22 | ENSGALG00000021365 | DCTN3 | 3 | 100.00 |
| 23 | ENSGALG00000043768 | ND2 | 3 | 100.00 |
| 24 | ENSGALG00000005490 | RPS2 | 3 | 94.44 |
| 25 | ENSGALG00000041793 | RF00004 | 3 | 94.44 |
| 26 | ENSGALG00000014585 | - | 3 | 94.44 |
| 27 | ENSGALG00000040260 | TUBA1C | 3 | 93.33 |
| 28 | ENSGALG00000041380 | BF2 | 3 | 92.70 |
| 29 | ENSGALG00000040576 | - | 3 | 92.59 |
| 30 | ENSGALG00000022871 | HHLA2 | 3 | 91.85 |
| 31 | ENSGALG00000037783 | - | 3 | 91.67 |
| 32 | ENSGALG00000042863 | SMIM4 | 3 | 91.67 |
| 33 | ENSGALG00000043565 | - | 3 | 91.67 |
| 34 | ENSGALG00000010811 | LGMN | 3 | 90.91 |
| 35 | ENSGALG00000041597 | - | 3 | 90.48 |
| 36 | ENSGALG00000040628 | - | 3 | 90.48 |
| 37 | ENSGALG00000031403 | - | 3 | 88.89 |
| 38 | ENSGALG00000030109 | - | 3 | 88.89 |
| 39 | ENSGALG00000036158 | - | 3 | 88.89 |
| 40 | ENSGALG00000001529 | THYN1 | 3 | 88.89 |
| 41 | ENSGALG00000032757 | - | 3 | 88.89 |
| 42 | ENSGALG00000036173 | - | 3 | 88.89 |
| 43 | ENSGALG00000012488 | TST | 3 | 87.78 |
| 44 | ENSGALG00000033672 | - | 3 | 87.63 |
| 45 | ENSGALG00000004028 | - | 3 | 87.50 |
| 46 | ENSGALG00000039881 | - | 3 | 87.42 |
| 47 | ENSGALG00000038410 | - | 3 | 87.18 |
| 48 | ENSGALG00000034721 | - | 3 | 86.90 |
| 49 | ENSGALG00000033302 | - | 3 | 85.93 |
| 50 | ENSGALG00000044239 | - | 3 | 84.85 |
| 51 | ENSGALG00000033932 | BF1 | 3 | 84.52 |
| 52 | ENSGALG00000037405 | - | 3 | 83.33 |
| 53 | ENSGALG00000039548 | - | 3 | 83.33 |
| 54 | ENSGALG00000031701 | - | 3 | 83.33 |
| 55 | ENSGALG00000036214 | PARK7 | 3 | 83.33 |
| 56 | ENSGALG00000027483 | GLRX | 3 | 83.33 |
| 57 | ENSGALG00000030940 | BLB2 | 3 | 83.33 |
| 58 | ENSGALG00000031865 | - | 3 | 83.33 |
| 59 | ENSGALG00000045034 | - | 3 | 83.33 |
| 60 | ENSGALG00000046185 | - | 3 | 83.33 |
| 61 | ENSGALG00000005849 | PPDPF | 3 | 83.33 |
| 62 | ENSGALG00000028520 | CST3 | 3 | 83.33 |
| 63 | ENSGALG00000032142 | COX1 | 3 | 83.33 |
| 64 | ENSGALG00000036145 | - | 3 | 83.33 |
| 65 | ENSGALG00000001330 | ATP5MC1 | 3 | 83.33 |
| 66 | ENSGALG00000004521 | GPX3 | 3 | 82.58 |
| 67 | ENSGALG00000004769 | PSAP | 3 | 82.38 |
| 68 | ENSGALG00000001718 | CCNG1 | 3 | 82.25 |
| 69 | ENSGALG00000030189 | - | 3 | 82.14 |
| 70 | ENSGALG00000026152 | GBP | 3 | 81.93 |
| 71 | ENSGALG00000000293 | - | 3 | 81.68 |
| 72 | ENSGALG00000032653 | - | 3 | 81.24 |
| 73 | ENSGALG00000012119 | MARCO | 3 | 81.15 |
| 74 | ENSGALG00000037953 | TUBA1A | 3 | 80.64 |
| 75 | ENSGALG00000029339 | - | 3 | 80.56 |
| 76 | ENSGALG00000033116 | CLC2DL3 | 3 | 80.44 |
| 77 | ENSGALG00000040068 | - | 3 | 80.11 |
| 78 | ENSGALG00000007403 | PEBP1 | 3 | 80.00 |

APPENDIX REFERENCES

1. Rosner, B., Fundamentals of Biostatistics (8th Edition). 2016, Boston, MA: Cengage.

2. Berenson, M.L., D.M. Levine, and T.C. Krehbiel, Basic Business Statistics (Tenth Edition). 2006, Upper Saddle River, New Jersey: Pearson Prentice Hall. 897.

3. Foundation, P.S., Python Language Reference (version 3.6).

4. Jones, E., et al. SciPy (the library). 2001- 12-03-2017]; Available from: http://www.scipy.org/.

5. Benjamini, Y. and Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 1995. 57(1): p. 289-300.

6. Fisher, R., Statistical Methods for Research Workers (Thirteenth Edition- Revised). 1958, New York: Hafner Publishing Company Inc.

7. Brown, M.B., A Method for Combining Non-Independent One-Sided Test of Significance. Biometrics, 1975. 31: p. 987-992.

8. McLaren, W., et al., The Ensembl Variant Effect Predictor. Genome Biol, 2016. 17(1): p. 122.

9. Huang da, W., B.T. Sherman, and R.A. Lempicki, Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc, 2009. 4(1): p. 44-57.

10. Huang da, W., B.T. Sherman, and R.A. Lempicki, Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res, 2009. 37(1): p. 1-13.
