Integrated Assembly and Annotation of Fathead Minnow Genome Towards Prediction of Environmental Exposures

A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in the Department of Biomedical Engineering by

John Martinson, M.S., University of Cincinnati, March 2020

Committee Chair: Jaroslaw Meller, Ph.D.

Abstract

The fathead minnow (FHM, Pimephales promelas) is a temperate freshwater fish whose geographic range extends throughout much of North America, and it is widely used as a model organism for aquatic toxicity testing. Our team at the Environmental Protection Agency produced a new FHM assembly, which served as the foundation for accomplishing the aims of this project. Because the underlying assembly was essential to achieving those aims, the generation of the new assembly is presented in this dissertation, though it was not itself a specific aim.

The first aim of this research project was to annotate the protein-coding genes in a new FHM genome. A comprehensive set of 26,150 gene models was produced that can facilitate the analysis of RNA-seq expression profiles derived from exposures of P. promelas subjects to chemicals and other stressors.

The second aim of the project was to demonstrate the application and utility of the new gene models by using RNA-seq data generated in controlled exposure experiments to identify differentially expressed genes (DEGs) as markers of exposure. FHM were exposed to two chemicals with different modes of toxicity: the pyrethroid pesticide bifenthrin and copper. The new gene models were used to quantify mRNA expression levels, and statistical and machine learning techniques were applied to develop lists of DEGs between treated and untreated samples. The third aim of the study was to develop predictors of exposure from these data, using machine and statistical learning methods to combine the obtained markers into exposure signatures and optimize the predictive power of the resulting exposure classifiers. As part of these experiments, five different classifiers were evaluated using a cross-validation framework. Classifiers were able to distinguish treated samples from controls and were then applied to samples treated with the other chemical to evaluate how the classifiers performed when faced with an exposure scenario different from the one for which they were trained.


Assessment of the genome and gene models in terms of both BUSCO coverage and RNA-seq mapping rates shows that the new assembly and gene models represent not only a significant improvement over the previously published FHM assembly and gene annotations, but also compare very favorably with those of the highly studied and closely related zebrafish (Danio rerio). Given the mature state of the zebrafish genome, the FHM results presented here represent a significant success. Further validation was provided by the successful use of the new gene models in the bifenthrin/copper exposure study. For each of the two toxicants studied, successful classifiers of exposure could be developed from a variety of approaches based on mapping RNA-seq data to the new gene models. Functional analysis of the differentially expressed genes (DEGs) leveraged by the classifiers indicated that toxicant-specific responses at the gene level appeared to drive the ability to correctly classify samples. GLM elastic net ("glmnet") and random forest showed the most promise for avoiding false-positive classifications in the cross-chemical testing.


Acknowledgements

I would like to thank my academic advisor Dr. Jaroslaw Meller for the guidance and support he has provided as I've traveled this long and winding path, certainly the one less taken. Jarek saw something in me that made him decide to encourage me to follow through on the academic pursuits I chose to take up as a forty-something-year-old, and he has stuck with me, and gone to bat for me, for a long time. Thank you, Jarek. You've been a great mentor.

I'd also like to thank my former EPA colleague Dr. Mitchell Kostich. Mitch taught me so much while he was at EPA that I can't even begin to describe the full benefits I derived from working with him. Whether (or not) I wanted to know about ancient religions, oppressed Muslims in China, early Pink Floyd, biology, computing or statistics, Mitch was my go-to guy. We miss him a lot at the EPA, and I would not be at this point without Mitch having led our efforts to sequence the fathead minnow genome. Mitch was another great mentor and teacher.

There are a number of other people at the EPA and UC whom I would like to thank for their support and counsel over the years. Given the duration of my journey, there have been a lot. I will start with the other members of my dissertation committee, Dr. Daria Narmoneva, Dr. Marepalli Rao, and my current EPA supervisor, Dr. Adam Biales. Drs. Narmoneva and Rao were great additions to my committee who helped "keep me honest" by asking good questions and providing useful insights and good advice that helped me move forward. The support Adam supplied, particularly as I came down the "home stretch," was great and influenced my being able to finish as much as any other factor. Dr. Greg Toth at EPA, who ceded the protein-coding gene annotation responsibility to me and helped me along the way in so many other ways, also deserves special thanks. Weichun Huang's tireless efforts to generate the best assembly possible must also be recognized; his work set the basis for my success. Barbara Carter at the UC CEAS Graduate Office has also been incredibly helpful and supportive over the years and deserves special thanks too.

Many others helped and/or supported me along the way in one way or another. In no particular order, others I'd like to thank include Pete Kauffman, Lora Johnson, Dr. David Bencic, Bob Flick, Dr. Rong-Lin Wang, Denise Gordon, Mary Jean See, Dr. David Lattier, Dr. Florence Fulk, Dr. Mark Bagley, Dr. Eric Waits, Dr. Roy Martin, Janine Fetke, Dr. Erik Pilgrim, Sara Okum, Dr. John Darling, Chris Bourk, Margie Vazquez and Dr. Jing-Huei Lee. There are just too many others to thank by name. So to all of you, thanks.

Almost last, and certainly not least, I'd like to dedicate this dissertation to my dearly departed parents, who I know would be proud of me, and to my siblings and extended family, including my in-laws Nancy and Tom. I also want to say to my children, Alex, Ellen and Lily, whom I love immensely, "Guess what, kids? I finally did it! Whatever you might choose to do, try to do it sooner! But also, don't give up!"

Finally, I’d like to thank my wife Beth for all her patience, understanding and love. Putting up with me isn’t always easy, especially when I’m stressed out about overdue academic pursuits and don’t have time to do other things. Thanks for everything Beth. Love you then, now and always.


Table of Contents

Abstract ...... ii
Acknowledgements ...... v
Chapter 1. Introduction and Aims ...... 1
1.1. Motivation ...... 1
1.2. Background ...... 2
1.3. Specific Aims ...... 5
Specific Aim #1. Develop a comprehensive set of annotated protein-coding gene models for Pimephales promelas using the improved FHM genome assembly ...... 6
Specific Aim #2. Identify differentially expressed transcripts from the set of new gene models developed as aim 1 as markers of exposure in an RNA-seq exposure study ...... 7
Specific Aim #3. Develop and assess statistical and machine learning predictors of exposure from the RNA expression profiles developed as aim 2 ...... 7
Chapter 2. Generation of a New FHM Assembly ...... 8
2.1. Introduction and Background ...... 8
2.1a. DNA sequencing ...... 9
2.1b. Genome assembly ...... 15
2.2. Materials and methods ...... 18
2.2a. DNA sequencing ...... 18
2.2b. Assembly and scaffolding ...... 20
2.2c. BUSCO and RNA read mapping ...... 22
2.3. Results ...... 23
2.4. Discussion ...... 26
Chapter 3. Annotation of protein coding genes in the FHM genome ...... 28
3.1. Introduction and Background ...... 28
3.1a. and annotation ...... 31
3.1b. Annotation pipelines ...... 35
3.1c. Masking ...... 36
3.2. Materials and Methods ...... 38
3.2a. Generation of input RNA reads ...... 39
3.2b. Assembly of RNA reads ...... 40
3.2c. Preliminary annotation runs ...... 41
3.2d. Maker and PASA/EVidenceModeler (EVM) annotation runs ...... 43
3.2e. Assessment of model sets, selection of preferred set and additional processing ...... 50
3.2f. Final model filtering ...... 55
3.3. Results ...... 59
3.4. Discussion ...... 67
Chapter 4. Using the FHM gene models in an RNA-Seq expression experiment to develop predictors of exposure ...... 70
4.1. Introduction and Background ...... 70
4.2. Methods ...... 72
4.2a. Summary ...... 72
4.2b. Exposure organisms ...... 73
4.2c. Test chemicals and exposure water ...... 73
4.2d. Exposures ...... 74
4.2e. RNA isolation and preparation of sequencing libraries ...... 74
4.2f. RNA Sequencing ...... 75
4.2g. Mapping ...... 75
4.2h. Feature ranking ...... 76
4.2i. Classifier tuning, training, and testing ...... 77
4.2j. Functional analysis/biomarker discovery ...... 78
4.3. Results ...... 79
4.3a. Sequencing, Mapping, and Quality Control evaluation ...... 79
4.3b. Feature ranking (Differentially expressed gene lists, DEG lists) ...... 83
4.3c. Classifier performance ...... 83
4.3d. Final classifier testing and functional analysis ...... 86
4.4. Discussion ...... 89
4.4a. Sequencing, mapping and QC ...... 89
4.4b. Feature ranking (DEG lists) ...... 89
4.4c. Classifier performance ...... 90
4.4d. Final classifier testing and functional analysis ...... 92
4.4e. Limitations and future studies ...... 103
4.5. Conclusions ...... 104
Bibliography ...... 105
Supplementary Materials ...... 113
S.1. Specific steps/commands used in Maker and PASA/EVM annotation pipelines production runs ...... 113
S.1a. rCorrector (version 1.0.3.1) ...... 114
S.1b. TrimGalore (version 0.6.1) ...... 114
S.1c. STAR (version 2.6.0a) ...... 115
S.1d. Trinity (version 2.5.0) ...... 116
S.1e. Maker2 (version 2.31.9) ...... 117
S.1f. PASA (version 2.3.3) ...... 118
S.1g. TransDecoder (version 5.5.0) ...... 119
S.1h. EVidenceModeler (EVM, version 1.1.1) ...... 121
S.1i. Final PASA processing ...... 126
S.2. Preliminary Maker runs and notes on gene predictor training ...... 128
S.2a. Preliminary MAKER run using the "dbg2olc" assembly ...... 128
S.2b. Gene predictor training ...... 130
S.2c. Preliminary MAKER runs using the "c15dcphs.fa" assembly ...... 141
S.2d. Maker zebrafish run for qualitative comparison to Maker FHM process ...... 143
S.3. Additional Tables and Figures referred to in the text ...... 148
S.4. List of Publications ...... 155


Chapter 1. Introduction and Aims

1.1. Motivation

Water quality is a concern worldwide, in the contexts of both human health and ecological research. Agencies have been established throughout the world to set water quality criteria and monitor water quality. Anthropogenic effects on water quality are often studied and used to establish policies, laws and regulations designed to ameliorate, or at least manage, negative effects. The potential toxic effects of chemicals that might enter the aquatic environment are one aspect of water quality.

Numerous ways of measuring water quality and the toxic effects of waterborne chemicals have been developed, such as chemical analyses of water and sediment, physical habitat measures, and aquatic toxicology assays. In the United States some of these assays are used as regulatory tools, employed by state environmental agencies responsible for implementing Federal statutes related to water quality and the toxicity of new and existing chemicals such as pesticides, which must be registered as safe for use. Several toxicity assays involving P. promelas are employed in a variety of these regulatory contexts (Ankley and Villeneuve 2006).

The FHM is among the most commonly used model organisms in aquatic toxicology, but due to the lack of a fully sequenced and well-annotated genome, studies aimed at characterizing the mechanisms of action of toxicants have been limited. Further, the lack of a genome has limited the use of current sequence-based technologies. Beyond the broader goals of protecting water resources and the life that relies on them, and identifying the potential hazards of chemical contaminants, there are several specific and significant connections between the work presented here and current work at the Environmental Protection Agency's Office of Research and Development (ORD, where I am employed). Our fathead minnow (FHM) project supports specific tasks under high-level research programs within ORD. For the Chemical Safety and Sustainability Program (CSS), a specific task we are contributing to is one designed to "Improve ecological methods and models for predicting exposure, accumulation and effects of PFAS and other methodologically challenging compounds". Under ORD's Safe and Sustainable Water Resources Program (SSWR) we are part of specific efforts aimed at the "Development and evaluation of in vitro and in vivo bioassays for use in determining human health ambient water criteria for similarly acting groups of chemicals of high priority" and developing "Effects-based Methods for Assessing Chemical Contaminants in Wastewater and Reclaimed Water". All the support we supply to these research efforts will involve the use of FHM as the primary test organism, and simultaneous measurement of the relative abundance of RNAs transcribed from tens of thousands of individual genes to detect changes in physiology due to chemical exposure. We required a fully sequenced genome to do that effectively.

1.2. Background

Fish have been used to assess the toxicity of aquatic environments since the 1800s. Over time there was a shift away from the larger/older fish employed in early assays toward early developmental stage subjects and/or smaller species with relatively short life cycles. There was also a shift from mostly lethality-based assays to longer-term sub-lethal assays, to gain a better understanding of potential chronic effects of toxic exposures. Species such as the fathead minnow, zebrafish (Danio rerio), and Japanese medaka (Oryzias latipes) became the most commonly used model species in aquatic toxicology assays. These species were favored because they have relatively short lifespans and do well in culture.

In North America P. promelas has essentially served as the "lab rat" of aquatic toxicology for decades, being used extensively for both regulatory testing and research. A recent search of Google Scholar using the term "Pimephales promelas" and a date range of 1970-2017 produced "about 23,100 results." The fathead is a member of the family Cyprinidae and is widely distributed in North American freshwater environments. It is tolerant of a wide range of basic water quality characteristics such as alkalinity/hardness, pH, temperature, and turbidity. This combination of characteristics, along with its short life span and amenability to culture, is among the reasons the species has played such a prominent role in ecotoxicology. Standard methods involving the fathead support a number of regulatory activities. The short-term (48- or 96-hour) lethality test is used during pesticide registration, in the development of Water Quality Criteria (WQC), and in screening some chemicals under EPA's Toxic Substances Control Act (TSCA). The EPA also uses data from a 30-day early life-stage assay as inputs to the pesticide and WQC processes. A shorter 7-day early life-stage test is used by EPA in the whole-effluent monitoring program that operates under the authority of the National Pollutant Discharge Elimination System (NPDES).

To develop a deeper understanding of the effects of toxic chemicals at a fundamental level we need to push ecotoxicology into the realm of molecular biology. P. promelas has been a clear choice as a target in which molecular understanding and tools should be developed, given the amount of historical toxicology data generated from studies using it as a model organism, and there has been a push in that direction over the last decade (reviewed in Villeneuve and Ankley, 2006). By measuring the transcription levels of genes in response to different stimuli (e.g., a stressor) one can gain an understanding of exactly how an organism is responding. The study of co-expressed genes has often led to a deeper understanding of the systemic response, providing clues to protein function and helping to unravel complex molecular pathways. The sequenced genome of an organism provides the most fundamental level of understanding about it and a valuable basis for exploring biological processes as genes get "turned on and off" at various developmental stages and in response to various stimuli, resulting in the production or non-production of different gene products (primarily proteins). While there are ways to measure gene expression without a sequenced genome, to get a view of an organism's total response to a stimulus (e.g., a stressor) at the molecular level, one needs the entire genome available.

To gain experimental access to the molecular underpinnings of the FHM response, a team of researchers from the U.S. EPA and DuPont performed shotgun genome sequencing of P. promelas and published a draft genome assembly for the organism in January 2016 (Burns, Cogburn et al. 2016). A set of annotated gene models based on the draft assembly was subsequently published (Saari, Schroeder et al. 2017). The annotations attempt to map exactly where the genes are located on the genome, including the exact locations of their exons and introns, and to identify those genes by assessing their homology to the genes/transcripts/proteins of another model organism with a well-studied genome, the zebrafish Danio rerio (Howe, Clark et al. 2013). The annotations provide a functional context for the genes along with their locations in the genome.

While these two publications represented important advances in the field, due to the highly fragmented assembly they suffered from significant shortcomings that severely limited their usefulness and comprehensiveness, and likely introduced inaccuracies. To achieve a better genome assembly, with concordantly more complete and better-annotated gene models, our group at the Environmental Protection Agency in Cincinnati performed additional P. promelas DNA sequencing runs, including some using newer technologies (PacBio) that produce longer DNA sequences (additional details are provided later). The combination of "short-read" sequence data (both new reads and the sequences used to produce the published assembly) and "long-read" sequence data, together with additional techniques ("Hi-C" and optical mapping, also described later), was used to produce a much more contiguous P. promelas assembly than the one previously published. The new assembly (reviewed in Chapter 2) provided the foundation for the work that constitutes the specific aims of this dissertation.


The improved gene models based on this assembly will greatly enhance studies focused on the molecular level of response. They will allow researchers to fully leverage the power of RNA-seq expression profiles, increase the likelihood of finding the key genes involved in responses to perturbations, such as exposure to a toxic chemical or some other type of environmental stress, and aid in the functional interpretation of the global gene expression response. Further, they will provide a means of applying sequence-based technologies to important issues, such as chemical groupings, which are needed for chemical read-across. Improved functional annotations will also increase the probability of understanding the biological underpinnings (e.g., molecular pathways) involved in the organism's response, providing a better source of hypotheses for future research.

1.3. Specific Aims

The first aim of this study was to develop a comprehensive annotation of the protein-coding genes in the FHM genome. The annotation process resulted in a set of gene models that will facilitate the analysis of RNA-seq profiles derived from exposures of P. promelas subjects to various chemicals, mixtures, effluents and other environmental samples. The second aim of this project was to demonstrate the application and utility of the new gene models (and the new genome from which they were derived) by using RNA-seq data generated in a controlled exposure experiment to identify differentially expressed genes as markers of exposure. The new gene models were used to quantify mRNA expression levels, and statistical techniques were used to identify signatures of exposure based on the expression levels of select genes in treated and untreated samples. The third aim was to develop indicators of exposure from RNA-seq data, using machine and statistical learning methods to combine the obtained markers into exposure signatures and optimize the predictive power of the resulting exposure classifiers. Multiple learning methods were evaluated as classifiers using a cross-validation framework.

The work described in Chapters 2 and 3 of this dissertation will be presented in an upcoming publication (in preparation) from our team at EPA, in which the new genome assembly and its associated annotations will be unveiled. That publication should be released by Summer 2020. To be clear, though this author participated very significantly in the team effort to produce the new assembly described in Chapter 2, and though his name will appear as first author on the planned genome publication, the author's "real" thesis work is what is outlined in the specific aims below and described in detail in Chapters 3 and 4. The work in Chapter 4 will be released as a separate publication, which is currently under review and will soon be uploaded to bioRxiv.org ("BioArchive") prior to publication. Several other publications directly related to the new gene model annotations are also planned for release in 2020 and are listed at the end of the dissertation in the "Publications" section.

Specific Aim #1. Develop a comprehensive set of annotated protein-coding gene models for Pimephales promelas using the improved FHM genome assembly

Using the improved FHM genome assembly our team produced as a framework, a comprehensive and accurate set of annotated P. promelas protein-coding gene models was developed. Two gene annotation pipelines were employed, and each produced a set of gene models. The two model sets were evaluated, and one was chosen as the preferred set with which to continue. Additional processing of the chosen model set was done to improve the mapping rate of RNA reads to the models. A final filtering step based on RNA mapping rates and homology to reference proteins was then applied. Approximately 10,700 models that did not meet the established filtering criteria were removed, leaving 26,150 gene models as the final output. The new assembly and gene models have been uploaded to NCBI/GenBank (Accession WIOS00000000) and will become available to the public once the genome publication is released (expected no later than Summer 2020).

Specific Aim #2. Identify differentially expressed transcripts from the set of new gene models developed as aim 1 as markers of exposure in an RNA-seq exposure study

P. promelas were exposed to two toxicants: copper and the pyrethroid pesticide bifenthrin. Following exposure of FHM larvae to the chemicals, mRNA sequencing libraries were prepared from total RNA extracts of exposed and non-exposed fish. The RNA libraries were sequenced on a second-generation sequencing platform, and gene expression data were tabulated by mapping the resulting RNA reads to the gene models developed in Aim 1 and counting them. A custom pipeline employing the tools edgeR and limma was used to derive normalized estimates of gene expression and to identify differentially expressed genes (DEGs) as markers of exposure.
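The actual pipeline used the R packages edgeR and limma; purely to illustrate the underlying idea (library-size normalization followed by fold-change calculation), a minimal Python sketch might look like the following. The count matrix, column indices, and prior count here are hypothetical, not values from the study.

```python
import numpy as np

def cpm(counts):
    """Counts-per-million: scale each sample (column) by its library
    size so expression values are comparable across samples."""
    lib_sizes = counts.sum(axis=0, keepdims=True)
    return counts / lib_sizes * 1e6

def log2_fold_change(cpm_matrix, treated_cols, control_cols, prior=0.5):
    """Mean log2 ratio of treated vs. control expression per gene.
    The small prior count avoids taking the log of zero."""
    treated = cpm_matrix[:, treated_cols].mean(axis=1)
    control = cpm_matrix[:, control_cols].mean(axis=1)
    return np.log2((treated + prior) / (control + prior))

# Hypothetical 2-gene x 4-sample count matrix (controls, then treated).
counts = np.array([[100, 100, 400, 400],
                   [900, 900, 600, 600]])
lfc = log2_fold_change(cpm(counts), treated_cols=[2, 3], control_cols=[0, 1])
print(lfc)  # gene 0 up (~+2 on the log2 scale), gene 1 down
```

In practice edgeR's TMM normalization and limma's empirical Bayes moderation add robustness (dispersion modeling, significance testing) that this sketch omits entirely.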

Specific Aim #3. Develop and assess statistical and machine learning predictors of exposure from the RNA expression profiles developed as aim 2.

Employing the DEGs discovered in Aim 2, five statistical and machine learning techniques were used to develop predictors of exposure status for each of the two toxicants to which the fish were exposed. Classifiers were tuned and trained in a cross-validation framework, in which exposure signatures were derived from RNA-seq data using statistical methods combined with machine learning approaches for feature selection and aggregation, combining the DEGs into robust predictive signatures. "Cross-chemical" testing of the classifiers, and functional analysis based on the DEGs for each toxicant, were used to evaluate whether classification was driven by toxicant-specific responses.
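As a minimal sketch of this kind of cross-validated evaluation, the following uses scikit-learn stand-ins for two of the methods named later (elastic-net-penalized logistic regression as a "glmnet" analogue, and random forest) on a synthetic expression matrix. All data, dimensions, and hyperparameters here are hypothetical, not the study's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical expression matrix: 40 samples x 50 genes; the first
# 5 genes are shifted upward in the "exposed" class (label 1).
X = rng.normal(size=(40, 50))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :5] += 2.0

models = {
    # Elastic-net-penalized logistic regression (a "glmnet" analogue)
    "glmnet-like": LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0, max_iter=5000
    ),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    # Stratified 5-fold cross-validation accuracy
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 2))
```

Cross-chemical testing would then fit a classifier on one toxicant's samples and call `predict` on the other toxicant's samples, mirroring the evaluation described above.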


Chapter 2. Generation of a New FHM Assembly

2.1 Introduction and Background

As previously stated, our team's goal was to generate a more contiguous and complete FHM assembly, and more comprehensive gene models, than the previously published version (hereafter v1), to better harness the promise of RNA-seq expression assays. P. promelas has a haploid chromosome number of 25, but the v1 assembly is composed of 73,057 scaffolds and contigs, with half the total assembled bases on contigs less than 7,468 bases long (i.e., contig N50 = 7,468). Based on comparison with the well-characterized related species Danio rerio, which has an average total gene length of ~30 kb, this degree of fragmentation suggests that the average FHM gene will be split among multiple contigs (Moss, Joyce et al. 2011). The scaffold N50 (see following section) of the published assembly is 60,380, a framework that provides a much better chance of a given gene "landing" on a single scaffold. Unfortunately, the fraction of uninformative N's in the scaffolded genome is 33.48%, so there is a high chance that parts of a given gene might be "hidden" even if it does reside on a single scaffold. Finally, in the P. promelas draft genome paper, an analysis using CEGMA (Parra, Bradnam et al. 2007) indicated that 74% of 248 conserved core genes mapped to single scaffolds, meaning that >25% were not on single scaffolds, even before considering the possible problems of the high "N" fraction in the scaffolds.
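N50 is the contig (or scaffold) length at which the cumulative length of the pieces, sorted from longest to shortest, first reaches half of the total assembled bases. A small illustrative Python sketch (the toy lengths are made up):

```python
def n50(lengths):
    """Return N50: the length L such that pieces of length >= L
    contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input

# Toy assembly: total = 100 bases; sorted descending = 50, 30, 10, 10.
# The first piece (50) already covers half the total, so N50 = 50.
print(n50([10, 50, 30, 10]))  # 50
```

By this definition a highly fragmented assembly of many small contigs has a small N50, which is why the jump from a contig N50 of 7,468 to a scaffold N50 of 60,380 matters for capturing whole genes.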

To achieve a better genome assembly, with concordantly more complete and better-annotated gene models, our team at the Environmental Protection Agency in Cincinnati had additional P. promelas DNA sequencing runs done, including some using newer (third generation, see below) technologies that produce longer DNA sequences. The combination of "short-read" sequence data (both new reads and the sequences used to produce the previously published assembly) and "long-read" sequences was combined with the techniques of chromosome interaction mapping ("Hi-C", Lieberman-Aiden, van Berkum et al., 2009) and optical mapping (Schwartz, Li et al., 1993) to produce a final, improved, annotated FHM assembly (manuscript in preparation). It is emphasized again that although the new assembly is the foundation for the work presented in this thesis, it is described here primarily to lay out the framework upon which the author pursued his primary aims: to develop protein-coding gene models from the new genome assembly, and then to demonstrate the utility of those gene models in an RNA-seq-based exposure study. Additional background on several key elements of sequencing and assembly is presented in the following sections.

2.1a. DNA sequencing

2.1a.1. Next generation sequencing

In the early-to-mid 2000s, "next generation/high throughput" (NGS/HTS) DNA sequencing platforms emerged and greatly expanded the amount of sequence data that could be generated compared to the earlier Sanger-based capillary approaches, which accounted for the vast majority of existing sequence data prior to "next gen's" emergence. This development was driven by several factors, including the completion of the human genome (Venter, Adams et al., 2001), which provided a reference against which shorter sequences could be more readily mapped, along with developments in molecular biology that allow a variety of biological phenomena (e.g., genetic variation, RNA expression, protein-DNA interactions) to be assessed through high-throughput DNA sequencing (Shendure and Ji, 2008).

Commercially, three next generation platforms emerged as the primary products in the marketplace during the mid-2000s: the 454 Life Sciences (Roche) Genome Sequencer, the Illumina Genome Analyzer (based on the Solexa technology), and the SOLiD platform from Applied Biosystems (ABI). All three represented forms of "cyclic-array" sequencing, in which a dense array of DNA features is sequenced by iterative cycles of enzymatic manipulation and imaging-based data collection (Shendure and Ji 2008). The workflow of the three platforms was similar. The process generally involved randomly fragmenting DNA, ligating common adaptor sequences, then clonally amplifying individual DNA molecules that had been spatially isolated, either on a planar substrate or on micron-scale beads, which were subsequently arrayed. All three platforms relied on "sequencing by synthesis," where extension of primed template DNA was accomplished by either a polymerase or a ligase. After each enzymatic addition, the entire array was scanned, for example for the fluorescent signal from the labeled nucleotides incorporated by a polymerase. The reader is referred to the review article of Shendure and Ji (2008) for a more detailed discussion of the three platforms. Of the three, the Illumina platform emerged as the "winner," and updated variants of Illumina sequencers now produce most of the sequencing data in the world. Supplementary Figure S.1 provides a snapshot from 2013, when the dominance of the Illumina platform was already on display.

The sequencing "revolution" was accompanied by rapidly decreasing costs (supplementary Figure S.2), spurring the exploration and publication of more and more genomic and transcriptomic information. Smaller laboratories could undertake projects that previously could only have been attempted at genome centers with large banks of Sanger-based sequencers. Groups such as our "FHM genome team" at the EPA in Cincinnati can now pursue the production of assemblies and transcriptomes for higher organisms such as the fathead. Supplementary Figure S.3, which shows the number of eukaryotic genomes released at NCBI by year through 2015, illustrates that trend.

2.1a.2. The emergence of third generation sequencing

Over the last 6-8 years, commercial "third generation" sequencing technologies have emerged (note that they accounted for only a 3% slice of the pie chart in supplementary Figure S.1). These technologies are built around the concept of sequencing single long molecules of DNA. The two leaders in this arena have been Pacific Biosciences (PacBio, Roberts R. J. 2013) and Oxford Nanopore Technologies (ONT, Eisenstein 2012). ONT's "strand sequencing" is a technique that passes intact DNA polymers through a protein nanopore, sequencing in real time as the DNA translocates the pore. With nanopore sequencing, the user chooses the fragment length and the nanopore sequences the entire fragment; reads approaching 1 Mb have been reported. PacBio's Single Molecule Real Time (SMRT) sequencing is a sequencing-by-synthesis technology based on real-time imaging of fluorescently tagged nucleotides as they are synthesized along individual DNA template molecules. DNA polymerase drives the reaction, so while the template and polymerase remain associated, the same signal strength is maintained. As a result, instead of the uniform, relatively short read lengths of Illumina sequences (typically 250 to 300 bases, or less), SMRT read lengths follow an approximately log-normal distribution with a long tail, and are generally much longer (supplementary Figure S.4).

It is important to note that the benefits of longer reads come at a price. The single-pass error rate of base calls in PacBio SMRT reads, at around 10%, is much greater than that of Illumina reads, whose single-base calls average >99% accuracy. SMRT read accuracy is greatly improved by adequate coverage. Hybrid approaches, in which third-generation long-read and second-generation short-read data are used in combination, are common practice and can help ensure the most accurate base calls in the end. A 2017 publication (Weirather, de Cesare et al.) compared the utility of PacBio and ONT in transcriptome experiments and, evaluating general data quality, found that "PacBio shows overall better data quality, while ONT provides a higher yield." PacBio was gauged to perform marginally better than ONT in most respects for both long-read-only and Hybrid-Seq strategies in transcriptome analysis.


2.1a.3. Additional techniques to improve assembly

Beyond "third generation" sequencing, other methods have been used to improve assemblies by helping to better order and orient contigs. Among these techniques are chromosome interaction mapping (Hi-C; Lieberman-Aiden, van Berkum et al., 2009) and optical mapping (Schwartz, Li et al., 1993). These methods provide relatively inexpensive, high-resolution scaffolding data. Hi-C adapts chromosome conformation capture (3C; Dekker, Rippe et al., 2002), which identifies long-range chromosome interactions in a non-targeted, unbiased manner. The average frequency of such interactions decays predictably as the linear distance along a chromosome increases, so this property can be used to scaffold contigs on a large scale, even to that of whole chromosomes (Burton 2013). In optical mapping, genomic DNA is randomly sheared to produce a "library" of large genomic molecules. Single molecules of DNA are elongated on a slide under a fluorescent microscope, held in place by charge interactions. The molecules are digested by restriction enzymes, which cleave at specific recognition sites. The resulting fragments remain attached to the surface, and their ends are drawn back at the cleavage sites, leaving gaps. The fragments are stained with intercalating dye, visualized by fluorescence microscopy, and sized by measuring the integrated fluorescence intensity. This produces an optical map of single molecules, and the individual optical maps are combined to produce a consensus genomic optical map (Dimalanta, et al., 2004, via Wikipedia "optical mapping" page). The map can provide information that can be used to scaffold contigs (Riley, 2011) or correct assembly errors (Zhou, Lemos et al., 2011).

2.1a.4. Summary

To help us realize our goal of producing a high-quality assembly that was as contiguous as possible, our group obtained (under contract) PacBio long-read data for the FHM genome. These data were key to our efforts, making it very likely that our new assembly would significantly surpass the contiguity of the previously published assembly. Long-read technologies were in their infancy when the sequence data for the previous assembly were generated; obtaining long-read data at that time was not a practical option. The use of only Illumina short-read data for the previous assembly made the production of a highly contiguous genome difficult, and the end result of that effort (>73,000 contigs in the final assembly) demonstrated as much, even with the use of "jump libraries" (described below).

The principal reason short-read data can be inadequate, even at relatively deep sequencing depths, is the repetitive nature of eukaryotic genomes. "Repeats" in eukaryotes come in a number of forms (see next chapter) and are considered to pose the largest challenge to eukaryotic reference genome assembly (Phillippy, Schatz et al., 2008), as they can introduce intractable ambiguities in the assembly process. It is estimated that up to two-thirds of the human genome is composed of repetitive elements (de Koning, Gu, et al., 2011). Table 1 presents a preliminary analysis of the repeat content of the new FHM genome assembly (performed by a colleague). Though repeats of the different types can overlap, so the total repeat content of the genome cannot be calculated by simply summing the percentages in the table, the table demonstrates that repeats compose a significant fraction of the FHM genome. This high repeat content is why long-read data were so critical. If reads are long enough, they can span large regions (thousands of bases) of repetitive sequence, anchoring them to stretches of sequence unique enough to be assembled with greater ease; such long repetitive elements would otherwise be extremely difficult to assemble. In the previous assembly effort, jump libraries were employed, but at relatively low sequencing depths that were not adequate to overcome the challenge posed by the high repeat content of the FHM genome.


Table 1. Estimated repeat content of FHM genome (preliminary data provided by W. Huang)

    Repeat type                   Count     % of genome / Mb
    Total retrotransposons        80848         5.97 /  63.65
      SINE                         5440         0.10 /  10.89
      LINE                        35744         2.13 /  22.79
      LTR                         39664         3.74 /  39.86
    DNA                          617423        12.29 / 131.05
    Unclassified                  29362         1.06 /  11.30
    Total interspersed repeats      ---        19.32 / 206.00
    Small RNA                      7002         0.07 /   0.72
    Satellite                     34976         0.58 /   6.20
    Simple repeats               475608         2.18 /  23.28
    Low complexity                50135         0.25 /   2.66

In addition to leveraging our new long-read data to improve the contiguity of our final assembly, we also generated additional Illumina short-read data (2x250 bp paired-end data). The new reads provided 136.9 Gb of new data, representing 125X coverage of the anticipated 1.1 Gb genome size, and gave us additional high per-base-quality data that could be used to polish our assembly. Polishing uses the short-read data, which has a very high likelihood of "calling" any given base correctly, to refine regions of an assembly that are more uncertain. Essentially, this is accomplished by aligning the short-read data to the assembly and correcting bases in the assembly that do not agree with the bases in the overlapping high-quality short reads. The greater the coverage (the number of reads overlapping a given location), the greater the certainty that the correct base is identified at any given position. (This applies not just to polishing, but to overlap-based assembly in general.) In addition to having new long-read and short-read data available to aid us in generating the new assembly, we also employed both Hi-C and optical mapping. The combination of the new data available to us, which provided not only long reads but also greater coverage, and the additional approaches employed to improve scaffolding, allowed us to produce an assembly of much higher quality in terms of both contiguity and accuracy compared to the previously published genome.
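The polishing step just described can be illustrated with a toy majority-vote scheme. This is only a sketch of the underlying idea — real polishers such as Pilon also use base qualities and correct indels and mis-assemblies — and it assumes the short reads have already been aligned (here, reads are supplied as (start, sequence) pairs against the draft):

```python
from collections import Counter

def polish(draft, aligned_reads, min_depth=3, min_fraction=0.8):
    """Substitution-only polishing: replace a draft base when coverage is
    adequate and a large majority of overlapping reads agree on another base."""
    pileup = [Counter() for _ in draft]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1
    polished = list(draft)
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth >= min_depth:
            base, votes = counts.most_common(1)[0]
            if base != draft[pos] and votes / depth >= min_fraction:
                polished[pos] = base  # confident disagreement: correct the draft
    return "".join(polished)

draft = "ACGTAXGTAC"                                # 'X' marks a miscalled base
reads = [(3, "TAGGT"), (4, "AGGTA"), (5, "GGTAC")]  # three overlapping short reads
print(polish(draft, reads))                         # -> ACGTAGGTAC
```

The same principle is why greater coverage yields greater certainty: with more overlapping reads, the majority vote at each position becomes more reliable.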

2.1b. Genome assembly

The focus of the work described in this thesis is not genome assembly but rather, given an assembly, determining the location and identity of the protein-coding genes in it; rigorous details about the many assembly algorithms and approaches our team evaluated are therefore beyond the scope of this document. However, because the development of the gene models depended on the quality of the underlying assembly, the general methods we used to evaluate candidate assemblies in order to pick a "final" FHM assembly are discussed below and in the methods section. Some other general aspects of assembly, including some terminology, are also presented in the next several paragraphs.

2.1b.1. Two main approaches to assembly

The majority of DNA assemblers employ one of two broad algorithmic approaches. The first is overlap-layout-consensus (OLC, also known as Hamiltonian path) assembly, which identifies variably sized overlaps between reads (presumably representing the same genomic location), uses a graph representation of the overlap pattern to order and align the reads relative to one another, and then generates a consensus sequence from the alignments. Older examples of this strategy include well-known assemblers such as Celera, Phrap, Cap, and Tigr (Sutton 1995, Green 1996, Huang 1999, Myers, Sutton et al. 2000). The second approach is De Bruijn graph (DBG, also known as Eulerian path) assembly, originally described in Pevzner, Tang, et al. (2001). DBG assemblers break all the reads apart into component k-mers (where k is usually about 16 to 100 residues), then construct a DBG (or an equivalent data structure such as a suffix array or compressed suffix array) of successively overlapping k-mers, where all the overlaps must be exactly k-1 residues long. The DBG is then traversed to reconstruct the genomic assembly. DBG assemblers typically have lower hardware requirements and are faster than OLC assemblers.
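The DBG idea can be made concrete with a toy example (illustrative only, not any production assembler). Reads are decomposed into k-mers; nodes are (k-1)-mers joined by an edge for each distinct k-mer; walking the graph re-spells the genome. The toy genome is chosen so that every (k-1)-mer is unique; in real genomes, repeats create branching nodes that make the walk ambiguous, which is exactly the repeat problem discussed earlier:

```python
from collections import defaultdict

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def assemble(reads, k):
    """Toy DBG assembler: assumes the genome's (k-1)-mers are all unique,
    so the graph is a simple path and a greedy walk recovers the genome."""
    edges = defaultdict(list)       # (k-1)-mer -> successor (k-1)-mers
    indeg = defaultdict(int)
    for read in reads:
        for km in kmers(read, k):
            u, v = km[:-1], km[1:]
            if v not in edges[u]:   # one edge per distinct k-mer
                edges[u].append(v)
                indeg[v] += 1
    start = next(u for u in list(edges) if indeg[u] == 0)  # path start node
    contig, node = start, start
    while edges[node]:
        node = edges[node].pop()
        contig += node[-1]          # each step adds one base
    return contig

genome = "ATGGCGTGCAACT"
reads = [genome[i:i + 6] for i in range(len(genome) - 5)]  # overlapping 6-base "reads"
print(assemble(reads, 4) == genome)                        # -> True
```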

2.1b.2. Some assembly terminology

In an assembly, a "contig" (Staden 1980) is a set of overlapping (i.e., contiguous) DNA segments that together represent a consensus region of DNA. If enough information is available, a set of contigs can be gathered into longer contiguous pieces called scaffolds. Scaffolds consist of contigs separated by gaps of roughly known length; the unknown bases between joined contigs are represented by "Ns" in the assembly sequence, with the number of Ns representing the estimated gap length. Scaffolds are usually constructed using short-read "jump libraries," which when sequenced provide known (i.e., sequenced) bases at each end of a long piece of DNA whose non-sequenced middle is of known size but unknown base composition. When the sequenced ends of a jump-library fragment align to two different contigs with a given level of certainty, those contigs can be scaffolded together.

Another term that characterizes an assembly is its "N50." N50 is calculated by summing all sequence lengths in the assembly, starting with the longest, and noting the length at which the running sum passes 50% of the total assembly length. A related metric, adopted for Assemblathon 1 (Earl, Bradnam et al. 2011), is the NG50 length, which normalizes for differences in the sizes of the genome assemblies being compared. It is calculated in the same way as N50, except that the estimated genome size replaces the total assembly size in the calculation (Bradnam 2013). N50 and NG50 are considered simple measures of assembly quality, where higher values are better. The estimated genome size for the FHM is approximately 1.1 Gb (1.1 billion bases). Another metric is the L50 of a genome, which is the minimum number of scaffolds whose combined length is greater than or equal to half the total assembly length.
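These definitions translate directly into a few lines of code; the sketch below is a generic illustration, not a tool used in this project:

```python
def n50_stats(scaffold_lengths, genome_size=None):
    """Return (N50, L50). Passing an estimated genome size instead of None
    computes NG50 (the Assemblathon 1 variant), replacing the total assembly
    length with the genome-size estimate in the half-length threshold."""
    lengths = sorted(scaffold_lengths, reverse=True)
    threshold = (genome_size or sum(lengths)) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("assembly does not reach half the genome-size estimate")

# Toy assembly of five scaffolds (total length 270):
print(n50_stats([100, 80, 50, 30, 10]))        # -> (80, 2): N50 = 80, L50 = 2
print(n50_stats([100, 80, 50, 30, 10], 400))   # -> (50, 3): NG50 vs. a 400 bp genome
```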


2.1b.3. BUSCO and mapping as additional tools to assess an assembly

Another way to assess the quality of an assembly, and a key metric we used to pick our final assembly, is to compare its gene content to that of closely related organisms. BUSCO ("Benchmarking Universal Single-Copy Orthologs"; most recently, Seppey, Manni, Zdobnov, 2019) is a tool developed to perform such an assessment. BUSCO assesses the completeness of genomes, transcriptomes, and protein sets based on a set of "universal" orthologous single-copy genes shared between related organisms, where universal is defined as being present in >90% of the species represented in the reference data set, almost always as a single-copy gene. The BUSCO tool provides a reference set of 4,584 orthologs for Actinopterygii (ray-finned fish; set "actinopterygii_odb9"). There are currently 23 species of fish represented in the reference set, and the 4,584 "BUSCOs" in it are represented by consensus protein sequences. Of the 4,584 orthologs, 1,082 are found in multiple copies in some of the 23 species, but in at most two of them (i.e., they are present as single-copy genes in at least 21 of the 23 species). BUSCO uses tools from the BLAST suite (Altschul, Gish et al., 1990) and HMMER (Eddy, 1998) to identify regions in new sequences that share homology with the BUSCO reference proteins. For genome analyses, BUSCO also employs the gene predictor Augustus (Stanke, Keller et al. 2006) to produce gene models, then uses the protein sequences derived from the models as the input to HMMER, which performs the homology search. BUSCO classifies discovered genes (proteins) as 'complete' when their lengths are within two standard deviations of the BUSCO group mean length (i.e., within the ~95% expectation). 'Complete' genes found in more than one copy are classified as 'duplicated'; based on what is known about BUSCO orthologs, recovery of many duplicates indicates likely assembly errors. Genes only partially recovered are classified as 'fragmented', and BUSCOs in the reference set that are not discovered at all are classified as 'missing'.
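BUSCO's classification criteria, as described above, can be restated as a short function. This is a simplified sketch of the published rules with illustrative lengths, not BUSCO's actual implementation (which works from homology matches, not pre-computed lengths):

```python
from statistics import mean, stdev

def classify_busco(group_lengths, recovered_lengths):
    """group_lengths: reference lengths of one BUSCO ortholog across species.
    recovered_lengths: lengths of the matches recovered in a new genome.
    A match is 'complete' if within two standard deviations of the group
    mean; more than one complete copy means 'duplicated'."""
    if not recovered_lengths:
        return "missing"
    mu, sigma = mean(group_lengths), stdev(group_lengths)
    complete = [n for n in recovered_lengths if abs(n - mu) <= 2 * sigma]
    if len(complete) > 1:
        return "duplicated"
    if complete:
        return "complete"
    return "fragmented"

ref = [300, 310, 290, 305, 295]          # one ortholog's lengths across 5 species
print(classify_busco(ref, [302]))        # -> complete
print(classify_busco(ref, [298, 301]))   # -> duplicated
print(classify_busco(ref, [120]))        # -> fragmented
print(classify_busco(ref, []))           # -> missing
```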


A final, very simple way we evaluated assembly quality was by the mapping rates of RNA reads to the assembly. Given a set of high-quality RNA reads from FHM, if the assembly is complete and accurate, one would expect the vast majority of the reads to map successfully to the genome.

2.2. Materials and methods

2.2a. DNA sequencing

Brain and 11 tail muscle samples (about 30 mg each) were isolated from a single 10-month-old male FHM. High molecular weight (HMW) DNA was isolated from all 12 samples using a MagAttract kit (Qiagen, Germantown, MD) according to the manufacturer's protocol. All 12 samples were proteinase K digested (20 µl proteinase K plus 220 µl buffer ATL from MagAttract) overnight (16 h) on a thermomixer at 56 °C and 900 rpm. DNA concentrations were checked using a Nanodrop ND-1000 spectrophotometer (Nanodrop Technologies, Wilmington, DE), and the samples were pooled into high- and low-concentration groups. The two resulting pools were checked again on the Nanodrop ND-1000 as well as a Qubit 2.0 fluorometer (Invitrogen, Carlsbad, CA). The high-concentration pool was about 94 ng/µl with a 260/280 ratio of 1.81 and a total volume of about 300 µl; the low-concentration pool was about 46 ng/µl with a 260/280 ratio of 1.72 and a total volume of about 600 µl. A Bioanalyzer DNA 12,000 chip (Agilent Technologies, Santa Clara, CA) was run on both pools to confirm the size of the DNA; nothing was seen below 12,000 bp. DNA was stored at -20 °C.

A 15 µl aliquot of the low-concentration HMW DNA sample was shipped on dry ice to the RTSF Core at Michigan State University for subsequent processing. A DNA library was prepared using the Illumina TruSeq Nano DNA library preparation kit, loaded onto two lanes of an Illumina HiSeq 2500 Rapid Run flow cell, and sequenced in a paired-end 250-base format. Base calling was done by Illumina Real Time Analysis v1.18.64, and output was converted to FASTQ format with Illumina Bcl2fastq v1.8.4.

A 170 µl aliquot of the high-concentration HMW DNA sample was shipped on dry ice to the University of Florida Interdisciplinary Center for Biotechnology Research (ICBR) for subsequent processing. ICBR constructed a PacBio library after BluePippin (Sage Science, Beverly, MA) pulsed-field gel electrophoresis size selection with a 20 kb insert-size cutoff. PacBio single-molecule real-time (SMRT; Pacific Biosciences, Menlo Park, CA) sequencing was performed using 24 SMRT cells and P6/C4 chemistry.

High (105 µl) and low (600 µl) concentration aliquots of the HMW DNA samples were subsequently sent to ICBR for additional PacBio sequencing. The samples were purified using PowerClean (MO BIO Laboratories, Inc.; Carlsbad, CA); three separate SageELF (Sage Science, Beverly, MA) pulsed-field gel electrophoresis size-selected >20-kb libraries were constructed and sequenced using 48 PacBio SMRT cells and P6/C4 chemistry. PacBio subreads were extracted with PacBio SMRT Link (version 6). The extracted subreads were then corrected, trimmed, and assembled with CANU (version 1.8; Koren, Walenz et al. 2017; Koren, Rhie et al. 2018) to generate the first draft diploid assembly. Pilon (version 1.23; Walker, Abeel et al. 2014) was used to polish the draft assembly by correcting miscalled bases, fixing mis-assemblies, and filling small gaps. Two sets of Illumina reads were used together for this polishing: 1) paired-end 101-base reads from the NCBI SRA database (accession SRX423854; Burns et al., 2016), and 2) the new paired-end 250-base reads described above. For each Illumina data set, BWA-MEM (version 0.7.15; Li and Durbin 2009) and SAMTOOLS (version 1.3.1; Li, Handsaker et al. 2009) were employed to align the Illumina reads to the assembly and sort the alignments. After polishing, Purge_haplotigs (version 1.0.4; Roach, Schmidt et al. 2018) was used to remove haplotype copies of primary contigs. This was necessary because CANU often assembles both haplotypes from heterozygous regions of a diploid genome as primary contigs. The haplotig-purged assembly was then further polished with the Illumina short reads using Pilon to correct artifacts that might have been introduced during haplotig purging.

2.2b. Assembly and scaffolding

For both genome assembly and scaffolding, we tested several different tools with different parameter settings, looking for the configuration that produced the best FHM assembly. For example, we tested CANU (multiple versions), FALCON (version 1.2.4) and FALCON-UNZIP (version 1.1.4) in the pb-assembly pipeline (version 0.0.6; Chin, Peluso et al., 2016), DBG2OLC (Ye, Hill et al., 2016), the MARVEL assembler (Grohme, Schloissnig et al., 2018), and Minimap/Miniasm (Li, 2016), as well as some other assemblers. We also tested different assembly options, particularly for CANU, for both diploid and haploid assemblies. For scaffolding using mate-pair reads, we tested several scaffolding tools, e.g., SSPACE, BESST (Sahlin, Vezzi et al., 2014), BOSS (Luo, Wang et al., 2017), and OPERA_LG (Gao, Bertrand et al., 2016). In addition, we tried different alignment options and different scaffolding orders with different types of data; for example, we tried scaffolding first with the Illumina mate-pair data, then BioNano, and finally Hi-C data.

We evaluated a genome assembly in multiple ways. We checked the assembly size for consistency with a Feulgen densitometry estimate of the genome size (1.09 Gbp; Gold and Amemiya, 1987). We evaluated N50 and NG50 for assembly contiguity and examined mRNA read mapping rates to assess genome completeness. As our primary metric, we used BUSCO (version 3.0.2; Simao, Waterhouse et al., 2015; Waterhouse, Seppey et al., 2017) with the Actinopterygii data subset (version odb9) to evaluate completeness, contiguity, and duplication. We ended up using the best-performing sequence set, compiled step-wise as: 1) scaffolding using Hi-C library data, 2) scaffolding using Illumina mate-pair library data, 3) hybrid scaffolding with BioNano optical mapping data, and 4) a final scaffolding with Hi-C library data. Figure 1 displays a simple flow chart of the process that resulted in the final assembly.


To generate the Hi-C data, we harvested the liver from a euthanized 10-12 month old adult male FHM, snap froze the tissue in liquid nitrogen, and shipped it overnight on dry ice to Dovetail Genomics (Scotts Valley, CA) for Hi-C processing. The resulting sequence data were aligned to the assembly using BWA-MEM (version 0.7.15), and SALSA (version 2.2; Ghurye, Rhie et al. 2019) was used for scaffolding. For the first round of Hi-C scaffolding, reads were mapped to the assembly with the default BWA-MEM paired-end read mapping protocol. In the second round of Hi-C scaffolding, reads were aligned with the options "-SPM" to skip mate rescue and read pairing and to mark shorter split hits as secondary.

We used two sets of Illumina mate-pair (jump-library) sequencing reads downloaded from the NCBI SRA database (6 kb mate-pair library SRX566039 and 40 kb mate-pair library SRX566040; Burns et al., 2016) for scaffolding. SSPACE (version 3.0; Boetzer, Henkel et al. 2011) was used for scaffolding with the Illumina mate-pair data. SSPACE was run with the options "-x 0 -m 32 -o 20 -k 5 -a 0.70 -n 15 -p 1 -v 0 -z 0 -g 0 -S 0".

Figure 1. Basic flow chart of the process used to produce the final assembly (figure by W. Huang)


For optical mapping, blood was isolated from a 6-month-old male FHM using the Blood and Cell Culture DNA Isolation Kit (Bionano Genomics, San Diego, CA). Blood cells (fish erythrocytes are nucleated) were embedded in 2% agarose and treated with proteinase K according to the manufacturer's protocol. The agarose plugs were then shipped to the McDonnell Genome Institute at Washington University for further processing. The agarose was digested and the DNA was drop dialyzed. DNA quantity and quality were assessed using a Qubit dsDNA BR Assay kit and a CHEF gel. A 750-ng aliquot of DNA was labeled and stained following the Bionano Prep Direct Label and Stain (DLS) protocol. The stained DNA was quantified using a Qubit dsDNA HS Assay kit and run on a Saphyr chip. The BioNano Solve (version 3.3; https://bionanogenomics.com/support/software-downloads/) hybridScaffold.pl tool was used for hybrid scaffolding. The options "-B 2 -N 1" were used to ensure no breaks were introduced into the genome assembly while allowing breaks in the BioNano genome map in order to resolve conflicts.

2.2c. BUSCO and RNA read mapping

Once the final assembly was available, BUSCO (version 3.0.1), employing the Actinopterygii dataset (version odb9) as its reference lineage, was used in "genome" mode to assess the BUSCO content of the new genome and the previously published genome. For the BUSCO runs, the Augustus species used for gene prediction was zebrafish. BUSCO retrains Augustus after an initial iteration of prediction and BUSCO search, in order to search for additional results after tuning the predictor specifically for the species being examined, so the starting lineage is not of critical importance. The comparison results are presented in the following section.

The STAR aligner was employed to align a set of ~16M paired-end (150 bp) FHM RNA reads to the previously published assembly and to the new assembly, to compare the mapping rates achieved. Results of this comparison are presented below. As an additional qualitative assessment of genome completeness, STAR was used again to align a set of ~16 million randomly selected (USEARCH, version 9.2; Edgar, 2010) 50-base single-end FHM reads to the genome. The FHM reads were drawn from available in-house reads that were not used in the assembly process. STAR was also used to align an equal number of randomly selected zebrafish RNA reads of 50 bases (or less; ~0.3% less than 50) to the primary assembly of the latest zebrafish genome (GCF_000002035.6_GRCz11), downloaded from NCBI. Results are presented below.
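The per-sample rates reported below (unique, multi-mapping, and their sum) can be pulled from STAR's summary log. The sketch assumes the field names used by recent STAR versions in Log.final.out ("Uniquely mapped reads %" and "% of reads mapped to multiple loci"); treat those names as assumptions, and note that STAR reports reads mapped to too many loci in a separate field:

```python
def mapping_rates(log_text):
    """Parse unique/multi-mapping percentages from STAR Log.final.out text
    and return them along with their sum (a simple 'total mapping rate')."""
    rates = {}
    for line in log_text.splitlines():
        field, _, value = line.partition("|")
        field, value = field.strip(), value.strip()
        if field == "Uniquely mapped reads %":
            rates["unique"] = float(value.rstrip("%"))
        elif field == "% of reads mapped to multiple loci":
            rates["multi"] = float(value.rstrip("%"))
    rates["total"] = rates.get("unique", 0.0) + rates.get("multi", 0.0)
    return rates

sample = (
    "                   Uniquely mapped reads % |\t75.63%\n"
    "        % of reads mapped to multiple loci |\t7.68%\n"
)
r = mapping_rates(sample)
print(f"unique {r['unique']}%, multi {r['multi']}%, total {r['total']:.2f}%")
```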

2.3. Results

Table 2 displays summary statistics for the new genome (EPA_FHM_2.0) compared to the previously published genome (GCA_000700825.1). Across the board, the summary results clearly indicate the great improvements made with the new assembly. For example, the total number of scaffolds decreased from 73,057 to only 910, and the N50 increased nearly 200-fold, from ~60 kb to ~12 Mb. In the new assembly, 23 scaffolds contain over half of the assembly (the L50 value), while in the older assembly the L50 was 5,505. Along with the length-driven improvements, the fraction of unknown bases in the assembly decreased from 33.5% to 13.2%.

Table 3 shows the comparative BUSCO results between the new and old genomes, and Figure 2 presents the comparison graphically. The BUSCO results indicate that the new genome represents a substantially more complete assembly than the earlier one. Table 4 displays the comparative mapping rates of ~16M randomly selected 150 bp paired-end reads to the previously published genome and the new assembly; reads map at a ~10% higher rate to the new genome. Table 5 shows the comparative mapping rates of sets of ~16M randomly selected 50-base single-end RNA reads from FHM and zebrafish, respectively, to their respective genomes. The total mapping rates are both on the order of 90%.


Table 2. Summary statistics, new FHM genome assembly (EPA_FHM_2.0) compared to previously published assembly (GCA_000700825.1)

    Statistic                     EPA_FHM_2.0       GCA_000700825.1
    Number of scaffolds                   910                73,057
    Total size of scaffolds     1,066,412,313         1,219,326,373
    Longest scaffold               59,790,976               580,801
    Shortest scaffold                   1,007                 1,000
    Scaffolds >= 1K nt           910 (100.0%)       73,057 (100.0%)
    Scaffolds > 10K nt            757 (83.2%)        19,302 (26.4%)
    Scaffolds > 100K nt           354 (38.9%)          2,265 (3.1%)
    Scaffolds > 1M nt             133 (14.6%)              0 (0.0%)
    Scaffolds > 10M nt              26 (2.9%)              0 (0.0%)
    Mean scaffold size              1,171,882                16,690
    Median scaffold size               47,573                 3,513
    N50 scaffold length            11,952,773                60,308
    L50                                    23                 5,505
    Scaffold %A                         26.71                 20.67
    Scaffold %C                         16.68                 12.62
    Scaffold %G                         16.68                 12.59
    Scaffold %T                         26.70                 20.63
    Scaffold %N                         13.23                 33.48

Table 3. BUSCO genome comparison between new FHM assembly (EPA_FHM_2.0) and previously published genome (GCA_000700825.1_FHM SOAPdenovo).

                                                          GCA_000700825.1
                                          EPA_FHM_2.0     FHM_SOAPdenovo
    Complete BUSCOs (C)                   4538 (95.1%)    3506 (76.5%)
    Complete and single-copy BUSCOs (S)   4116 (89.8%)    3324 (72.5%)
    Complete and duplicated BUSCOs (D)     242 (5.3%)      182 (4.0%)
    Fragmented BUSCOs (F)                   73 (1.6%)      507 (11.1%)
    Missing BUSCOs (M)                     153 (3.3%)      571 (12.4%)


Figure 2. BUSCO genome comparison between new FHM assembly (EPA_FHM_2.0) and previously published genome (SOAPdenovo). The new genome exhibits significantly improved BUSCO coverage.

Table 4. Mapping rates of ~16M randomly selected 150 bp paired-end FHM RNA reads to the published FHM genome (GCA_000700825.1_FHM SOAPdenovo) and the new FHM genome.

    Target                            # of scaffolds   Unique mapping %   Multimapping %   Total mapping rate
    GCA_000700825.1_FHM SOAPdenovo            73,057             59.43%           15.07%               74.50%
    EPA_FHM_2.0                                  910             75.63%            7.68%               84.93%

Table 5. Mapping rates of ~16M randomly selected FHM and zebrafish RNA reads (50 bases, single-end), respectively, to the new FHM genome and the latest zebrafish genome primary assembly (GCF_000002035.6_GRCz11) downloaded from NCBI.

    Target                    # of scaffolds   Unique mapping %   Multimapping %   Total mapping rate
    EPA_FHM_2.0                          910             77.01%           12.23%               89.24%
    GCF_000002035.6_GRCz11               993             79.25%           11.01%               90.26%


Additional results are currently being generated for inclusion in the upcoming genome publication, including analyses of non-coding RNAs, the genome's repeat content, and phylogeny. Non-coding RNA (ncRNA) genes are important for our future toxicological work because the resulting ncRNAs are involved in a large number of structures and functions of the cellular machinery. For EPA, these ncRNAs also provide biomolecules that might serve as bioindicators of exposure or effects in toxicologic studies, as well as allow comparisons across species, enabling evolutionary insights and serving as the basis for cross-species toxicologic extrapolation. There are a number of types of ncRNA (e.g., Ozata et al., 2019; Uszczynska-Ratajczak et al., 2018; Bracken, Scott, Goodall, 2016), and the number of types continues to increase.

The preliminary results from the repeat analysis (referenced earlier; Table 1) indicate the FHM genome is enriched for simple and low-complexity repeats, which are rich in AT content. The genome overall is significantly enriched in AT content compared to zebrafish, and the difference in repeats may be the chief driver of that overall difference; again, however, these results are preliminary. Detailed results for both repeats and ncRNA will appear in the final publication, along with a phylogenetic analysis.

2.4. Discussion

The new genome represents a dramatic improvement over the previously published assembly. The improvements were driven by several factors. Likely the most important factor in improving contiguity was the availability of the PacBio long-read data, which allowed previously intractable repetitive regions to be correctly assembled. The addition of the optical mapping and Hi-C data provided further scaffolding power.

Given this dissertation's focus on the production of protein-coding gene models from the assembly, the greatly improved contiguity of the new assembly provided a much better framework for the gene discovery process. Consider that the old assembly had a scaffold N50 of ~60 kb. Based on a comparison to the closely related zebrafish, we expect FHM genes to have a mean length of roughly half that. The 60 kb N50 of the old assembly meant that over half the bases making up the entire assembly were on scaffolds less than twice the expected gene length, significantly limiting the number of genes we could expect to find residing on a single scaffold. Contrast this with the new assembly and its N50 of nearly 12 Mb: over half the total assembly is made up of scaffolds that, end to end, could each hold hundreds of full-length genes, clearly providing a superior basis upon which to hunt for genes.

The BUSCO results further demonstrate the improvement of the new genome over the previously published genome by providing evidence that it is significantly more complete; one could not expect to develop a comprehensive set of gene models from an incomplete genome. The high fraction of complete, single-copy BUSCOs discovered in the new assembly implies that the assembly is essentially correct and complete, providing an excellent framework for comprehensive gene prediction. The ~10% increase in the paired-end RNA mapping rate to the new assembly (Table 4) compared to the previously published assembly is less dramatic than one might expect, and is taken as an indication that the quality of the previous assembly was relatively high within the scaffolds that were assembled; its primary shortcoming was its poor contiguity. Our use of updated sequencing technology and additional scaffolding approaches (optical mapping and Hi-C) was the key to generating a much better assembly.

The RNA mapping rate comparison with zebrafish (Table 5) is another piece of evidence indicating a high level of genomic completeness. The FHM mapping rate (89.24%) was very close to the rate seen for zebrafish (90.26%). Considering the mature state of the zebrafish genome and the resources invested over the years to produce it, the comparable mapping rate achieved by the FHM RNA provides additional evidence that the assembly is a high-quality representation of the genome and represents a significant advance.


Chapter 3. Annotation of protein coding genes in the FHM genome

3.1. Introduction and Background

To make a genome a truly useful resource for biological studies it must be functionally annotated.

Annotations are descriptions of structural or functional features of a genome. A key annotation is that of genes, which we define here as genomic regions that are transcribed into RNA, including mRNAs, tRNAs, rRNAs, lncRNAs, snRNAs, micro-RNAs, etc. Our group's goal is to develop RNA-seq based tools to evaluate toxicity and to diagnose environmental exposures based, in part if not entirely, on mRNA profiling. To do that, it is critical to identify the genomic regions representing protein-coding genes (mRNA) and to know something about the identities and functional roles of those genes.

In this study, the approach used to develop protein coding gene models involved running two annotation pipelines, Maker and PASA/EVidenceModeler (EVM), and evaluating each pipeline’s output to select a preferred set of gene models. The 37,190 models resulting from the PASA/EVM pipeline were chosen. An approach that is, to the best of my knowledge, novel was then used to improve the mapping rates of RNA-seq reads to exemplar transcripts of the chosen models: the PASA/EVM models were run back through Maker. Due to a particular parameter setting, the reprocessing through Maker removed ~30% of the input models. As the goal of running the models back through Maker was to improve mapping rates, not to reduce the number of models, input PASA/EVM models that had no overlap with coding sequence (CDS) segments of the Maker output models were returned to the model set. This left 36,881 gene models after the Maker run, and the models did exhibit improved mapping rates compared to the 37,190 PASA/EVM input models. A final filtering process was then applied. Approximately 640M in-house generated RNA reads from multiple tissues and life stages of FHM were mapped to exemplar transcripts of each model, and the models were compared to our reference protein set using the tblastx algorithm. Models were then evaluated on two factors: the rate at which the RNA-seq reads mapped to them in the correct strand orientation, and their homology to reference proteins. Cutoff values for mapping rate and homology were established, and models were removed if they did not meet at least one of the two cutoffs. After filtering, 26,150 genes remained in the final model set.
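The two-criterion filter described above can be sketched as a simple keep/drop rule. In the sketch below, the function, the threshold values, and the model names are illustrative placeholders, not the actual cutoffs used in this study.

```python
# Sketch of the two-criterion filter: a model survives if it meets EITHER the
# stranded mapping-rate cutoff OR the protein-homology cutoff. The threshold
# values below are illustrative placeholders, not the cutoffs actually used.

MAPPING_RATE_CUTOFF = 0.50   # fraction of reads mapping in the correct strand
HOMOLOGY_CUTOFF = 1e-10      # e-value of best hit to a reference protein

def keep_model(correct_strand_rate, best_evalue):
    """Return True if the gene model passes at least one cutoff."""
    passes_mapping = correct_strand_rate >= MAPPING_RATE_CUTOFF
    passes_homology = best_evalue is not None and best_evalue <= HOMOLOGY_CUTOFF
    return passes_mapping or passes_homology

# Hypothetical models: (correct-strand mapping rate, best e-value or None).
models = {
    "model_a": (0.92, 1e-40),  # passes both criteria
    "model_b": (0.10, 1e-25),  # homology only
    "model_c": (0.75, None),   # mapping rate only
    "model_d": (0.05, None),   # fails both -> removed
}
kept = [name for name, (rate, ev) in models.items() if keep_model(rate, ev)]
print(kept)  # model_d is filtered out
```

The essential point is the disjunction: a model needs only one of the two lines of evidence to survive the filter.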

Maker and PASA/EVM do not operate in exactly the same fashion, which is why the two pipelines produced different output models; however, the general paradigm under which they operate is essentially the same: both use pieces of evidence generated by other tools to make informed predictions of the locations and structures of genes in a genome. Figures 3 and 4 immediately below are abstract representations of the overall annotation process used to arrive at the final gene models, and Figures 3.A and 4.A, which appear later, present the processes in slightly more detail as overview flowcharts keyed to specific sections of the text. More details are presented in the sections below about the production runs of the annotation pipelines and in supplementary section S1. Prior to getting into the details, however, some additional background information related to gene prediction and annotation is provided.


Figure 3. Conceptual diagram of the process used to produce two sets of gene models from the Maker and PASA/EVM annotation pipelines. The inputs/tools on the left were provided, in some form, to the two pipelines, which used evidence generated or derived from those inputs to make informed predictions of the locations and structures of genes present in the input FHM genome assembly. The most important input was ~320M 150 base pair paired-end RNA (cDNA) reads collected from multiple early life stage FHM and multiple adult (male and female) FHM tissues. The wide range of tissues employed provided a high probability that a nearly complete transcriptome was available from which the pipelines could infer gene structure. The RNA reads were first corrected and adapter trimmed, and then assembled into transcripts with Trinity. The 37,190 PASA/EVM models (highlighted) were selected as the preferred set of models. Additional details are provided in the text and in Figure 3.A.


Figure 4. Conceptual diagram of the process of going from the 37,190 models produced by the PASA/EVM annotation pipeline down to the final 26,150 gene models. Several steps not shown in this diagram appear in the more detailed Figure 4.A and are also described in the text.

3.1a. Gene prediction and annotation

For the purposes of this dissertation project, the focus was on protein-encoding genes. More broadly, genes can be defined as genomic regions that are transcribed into RNA, which can be mRNA, tRNAs, rRNAs, snRNAs, micro-RNAs, etc. Because the future RNA-seq transcript profiling work our group will do at the Environmental Protection Agency (ultimately developing RNA-seq based tools to evaluate toxicity and to diagnose environmental exposures) will most likely focus on mRNA profiling, delineating genomic regions that encode mRNA (exons) is the most important aspect of annotation and was the driver for this dissertation project.

There are several approaches to identifying mRNA-encoding genes in a new genomic assembly, including aligning mRNA sequences from the same species to its assembly, aligning protein sequences from other species to the target assembly, and de novo or evidence-informed gene prediction, where prediction algorithms are tuned for a particular species by examining factors such as codon usage frequencies and splice junction motif frequencies in a collection of well-characterized genes (a training set) from that organism. If such a set of genes is not available, one might use information from a closely related species to start, and then try to improve the training by inputting the initial predictions for the species being studied into another prediction round.
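As an illustration of the kind of statistic a species-specific training set provides, the following sketch computes codon usage frequencies from a collection of coding sequences. The sequences and the function are invented for illustration; real training uses thousands of curated genes.

```python
from collections import Counter

# Minimal sketch of one statistic derived from a training set of
# well-characterized genes: codon usage frequencies across the collection.
# The two toy coding sequences below are made up for illustration.

def codon_usage(coding_sequences):
    """Return the relative frequency of each codon across all sequences."""
    counts = Counter()
    for cds in coding_sequences:
        # walk the reading frame in steps of three, ignoring a trailing partial codon
        for i in range(0, len(cds) - len(cds) % 3, 3):
            counts[cds[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

training_set = ["ATGGCTGCT", "ATGGCTTAA"]   # 6 codons total across both CDSs
freqs = codon_usage(training_set)
print(freqs["GCT"])  # 0.5 -> GCT accounts for 3 of the 6 codons in this toy set
```

A gene predictor tuned for a species would use tables like this (along with splice motif statistics and exon/intron length distributions) to score candidate gene structures.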

The direct approach of mapping ESTs (expressed sequence tags - RNA reads hundreds of bases long generated from cloning and older Sanger sequencing) and short RNA reads (e.g., from Illumina sequencing runs) from the same organism to the organism’s own assembly is the best approach to gene identification if reads are long enough and numerous enough, and repetitive sequences have been properly masked (see below). If those conditions are met, it is relatively straightforward to identify genomic regions corresponding to transcript sequence reads using splice-aware mapping tools such as TopHat2 (Kim, Pertea et al. 2013), which builds on Bowtie2 (Langmead and Salzberg 2012), and STAR (Dobin, Davis et al. 2013). The evaluation in Engstrom et al. (Engstrom, Steijger et al. 2013) suggests that STAR may be the preferred option due to its high accuracy and speed.

The mapping approach has some limitations. Transcription appears to occur at low rates from non-gene regions in the genome (Carninci, Kasukawa et al. 2005, Consortium, Birney et al. 2007, Barrett, Fletcher et al. 2013), and genomic DNA and non-coding RNA can also contaminate RNA-seq and EST libraries. These issues can lead to false positive gene predictions. One way to address this is to require several RNA sequences to map to a given genomic region before including that region in a gene, since such spurious sequences should be relatively rare. Another problem facing the RNA-mapping based approach to gene finding is that some genes are only transcribed at certain points in development or in certain tissues. If the available RNA sequence collection was not derived from biological samples corresponding to those developmental stages or tissues, the corresponding transcript sequences will be missing and the genes


will not be detected by the mapping method. In addition, the number of transcript copies expressed can vary greatly between genes within a cell. This creates an issue for detecting genes with low levels of transcription even when the correct tissues and developmental stages are sampled. Deep sequencing may be necessary to capture transcripts of weakly expressed genes, but it results in much higher coverage of strongly expressed genes, making the process inefficient and unnecessarily costly. This issue would be compounded if one wants multi-transcript coverage to avoid false positive gene predictions. To achieve the most representative transcriptome possible, we collected RNA from several early life stages and from multiple adult tissues to capture the products of as many active genes as we could. Additional details appear in the methods section that follows.
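The multi-read support rule mentioned above can be sketched as a simple counting filter. The region identifiers and minimum-support value below are hypothetical, not parameters from the actual pipeline.

```python
from collections import Counter

# Sketch of the multi-read support rule: only accept a genomic region as
# gene evidence once several independent reads map to it, since spurious
# transcription and library contamination should yield mostly singletons.
# MIN_SUPPORT is an illustrative value, not a threshold from the pipeline.
MIN_SUPPORT = 3

def supported_regions(read_alignments, min_support=MIN_SUPPORT):
    """read_alignments: iterable of region IDs, one entry per mapped read."""
    counts = Counter(read_alignments)
    return {region for region, n in counts.items() if n >= min_support}

# Hypothetical alignments: 5 reads to one region, 1 stray read, 3 to another.
alignments = ["chr1:1000-2000"] * 5 + ["chr2:500-900"] + ["chr3:10-80"] * 3
print(supported_regions(alignments))
# the singleton hit on chr2 (possible genomic DNA contamination) is rejected
```

This is the trade-off discussed in the text: raising the support threshold suppresses false positives but makes weakly expressed genes even harder to detect.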

To help overcome the problems discussed above, one can map full-length transcripts or protein sequences from other species onto the target genome assembly. This has the advantage of starting with full-length sequences, which is not the case with FHM ESTs and short RNA reads. Using sequence collections from well-annotated organisms makes it more likely that complete sequences of rarely transcribed genes will be found, though sequences from well-annotated organisms will have diverged from the organism of interest, resulting in decreased sequence similarity. This divergence almost always affects nucleotide sequences more extensively than corresponding amino acid sequences, which is why mapping based on protein homology is preferred. Nucleotides in untranslated regions (UTRs) diverge more rapidly than protein coding regions, since mutations in UTRs are less likely to have functional consequences, so UTRs present more challenging targets for cross-species mapping. Also, as they are not translated, they are not present in the protein sequences from the well-annotated organism.

Probably the most common tool used to map RNA/DNA to proteins is blastx from the BLAST suite; if the sequences have already been translated into proteins, direct protein-protein searching can be done with blastp. Exonerate (algorithms described in Slater and Birney, 2005) is another splice-aware tool for sequence comparisons. To mine for more distant homology, one can also use position-specific iterated BLAST (psi-blast; Altschul, Madden, et al., 1997).

The third approach to gene prediction is ab initio gene prediction, where the presence of certain features in the genomic sequence is used to identify the presence of a gene. In eukaryotes, features such as transcription start sites and regulatory regions, CpG islands (associated with gene regulation, and often located 2-5 kilobases upstream of gene transcription start sites), and poly-adenylation signals help identify the first and last exons of mRNA-encoding genes. Augustus and SNAP (Korf 2004) are examples of gene predictors, evaluated by Goodswen, Kennedy et al. (2012) together with two other programs. All four tools discussed in that paper use a variation of hidden Markov models (HMMs) to statistically model the structure of DNA sequences, and each gene finder has its own algorithm to decode the HMM into exact predictions. Because the gene finders are HMM-based, training them with data from the specific organism under study generally produces the best results, assuming enough good training data are available.
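To make the HMM idea concrete, the following toy sketch decodes a sequence into "coding" versus "intergenic" states with the Viterbi algorithm. The two-state model and all probabilities are invented for illustration; real predictors such as Augustus and SNAP use far richer state structures (exons, introns, splice sites, UTRs) and trained parameters.

```python
import math

# Toy illustration of how an HMM-based gene finder decodes a sequence into a
# most-likely state path (Viterbi). Coding DNA in this invented model is
# simply GC-rich; all probabilities are made up for illustration.

states = ["intergenic", "coding"]
start = {"intergenic": 0.9, "coding": 0.1}
trans = {"intergenic": {"intergenic": 0.9, "coding": 0.1},
         "coding":     {"intergenic": 0.1, "coding": 0.9}}
emit = {"intergenic": {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2},
        "coding":     {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35}}

def viterbi(seq):
    """Return the most likely state path for seq under the toy HMM."""
    V = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    back = []
    for base in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            col[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][base])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # trace back the best path from the best final state
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi("ATATGCGCGCGCGCATAT"))  # the GC-rich core is labeled "coding"
```

The self-transition probabilities (0.9) act as a persistence penalty: short GC runs are not worth two state switches, which is analogous to how real gene finders avoid calling every compositional fluctuation an exon.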

In summary, the advent of second-generation sequencing made broad and deep sequencing of eukaryotic transcriptomes relatively affordable, so a high volume of mRNA data can be generated in a cost-effective manner. Such data provide a rich resource that allows for better gene prediction than is possible by mapping sequences from other organisms or by de novo gene prediction. However, genes that are rarely expressed, or expressed at very low levels, may be poorly represented in the RNA-seq data and may therefore be more readily identified by mapping sequences from other organisms or by de novo gene prediction. Attaining the best overall representation of all gene models will likely require a combination of the three approaches, which was the course taken here: two annotation pipelines (discussed below) that leverage all three approaches were employed to generate gene predictions from the new FHM genome assembly.

3.1b. Annotation pipelines

Over the last decade, several genome annotation pipelines that perform many of the tasks described above have emerged and gained widespread use and acceptance. Among freely available tools, probably the most prominent are MAKER (now MAKER2; Cantarel, Korf et al., 2008) and PASA (Program to Assemble Spliced Alignments). PASA is well suited to making gene predictions based on mapping assembled short sequence reads and ESTs to the target assembly. MAKER2 (henceforth referred to as MAKER) leverages assembled RNA reads and ESTs too, but also makes gene predictions using ab initio and evidence-based gene prediction methods. Both PASA and MAKER require short RNA reads to be pre-assembled, and both suggest genome assembly-guided Trinity (Haas, Papanicolaou et al. 2013) for this task; PASA also recommends a Trinity de novo assembly (Grabherr, Haas et al. 2011) as input. The entire “PASA” annotation pipeline involves the use of the “sister tools” TransDecoder (https://github.com/TransDecoder/TransDecoder/wiki) and EVidenceModeler (“EVM”; Haas, Salzberg, et al., 2008). Given their wide acceptance, we chose to use both pipelines to produce gene models from our new genome assembly. We ran Maker first so we could leverage outputs generated during the Maker runs as inputs for EVidenceModeler; EVM takes gene model predictions, RNA alignment information, protein alignment information, and a user-defined “weights“ file as input. After running the two annotation pipelines, we evaluated the model sets each produced and picked the set we found superior for further use. We then did additional processing and filtering of the chosen model set to arrive at a final model set. Details of the use of both pipelines to produce our initial model sets and the other processes we employed to arrive at our final set of protein coding gene models are provided in the following methods section and associated supplementary materials.


3.1c. Masking

As briefly discussed in the previous chapter, repetitive elements pose a challenge to assembly and proper read mapping. Repetitive sequences make up a significant portion of vertebrate genomes. For instance, it was noted above that up to two-thirds of the human genome might be comprised of repetitive elements, and 47% of D. rerio (zebrafish) introns are composed of repetitive sequences (Moss, Joyce et al. 2011). Repetitive DNA is generally divided into two groups: (1) short tandem repeats (STRs), which include DNA satellites, minisatellites, and microsatellites; and (2) dispersed (interspersed) repeats, including long interspersed repeats (LIRs). For the purposes of this dissertation, details about the different types of repetitive elements and any functional roles they may play are not necessary. Their relevance here is that repeats need to be “masked” prior to the gene annotation process, or they can cause problems. For example, STRs can produce sequence alignments with high statistical significance to low-complexity protein regions, which can create false homology (evidence) for some genes throughout the genome. LIRs (which mainly consist of “transposable elements”, sequences interspersed throughout a genome that can copy themselves and/or move from one genomic location to another) contain real protein coding genes. If, for example, a transposon is located within the intron (non-coding part) of a true gene, an ab initio gene predictor or homology-influenced prediction tool might include the transposon as extra exons in the model it is producing for the gene to which the intron belongs. Masking helps avoid such errors.

Masking takes two forms, “soft masking” and “hard masking.” In hard masking, all bases in the genomic sequence that are believed to be part of a repeat are changed to “N.” In soft masking, bases in repeat regions are changed to lower case letters (e.g., ATATATAT becomes atatatat). Programs used to do homology searches, such as blastn and blastp in annotation pipelines such as MAKER2 (Holt and Yandell 2011), handle hard and soft masking in different ways. In general, hard masked areas are not aligned to; once masked, those regions lose all ability to contribute information to the annotation process. Soft masked sequences can still contribute information, however. Typically, alignment programs such as blast will not seed (start) any new alignments in a soft-masked region, but alignments that begin in a non-masked region can extend into a soft-masked region. This is a desirable feature, since low-complexity regions can be present in real genes.
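The two masking styles can be illustrated directly. The toy sequence and repeat coordinates below are made up; the comments restate how aligners typically treat each form of masking.

```python
# Sketch of the two masking styles, applied to a toy sequence with a repeat
# annotated at positions 4-12 (0-based, end-exclusive). Coordinates and
# sequence are invented for illustration.

def hard_mask(seq, start, end):
    """Replace repeat bases with 'N'; they can no longer contribute to alignments."""
    return seq[:start] + "N" * (end - start) + seq[end:]

def soft_mask(seq, start, end):
    """Lower-case repeat bases; alignments may extend into, but not seed in, them."""
    return seq[:start] + seq[start:end].lower() + seq[end:]

genome = "GGCCATATATATGGCC"
print(hard_mask(genome, 4, 12))  # GGCCNNNNNNNNGGCC
print(soft_mask(genome, 4, 12))  # GGCCatatatatGGCC
```

The difference matters downstream: an alignment anchored in the flanking GGCC sequence can still extend through the soft-masked atatatat run, but nothing can be recovered from the hard-masked N run.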

Various programs have been developed to identify and mask repeats. Tandem Repeat Finder (Benson 1999) and RepeatMasker (and the corresponding RepBase “library” of repetitive elements; Smit 2013-2015) are well-known examples. RepeatRunner (Smith, Edgar et al., 2007) combines the blastx program with RepeatMasker to identify repetitive elements. Development of a species-specific library of repetitive elements can be (and was) accomplished using RepeatModeler (Smit, Hubley, 2008-2015), a companion program to RepeatMasker. Developing a species-specific repeat library was necessary because every genome has its own characteristics, including the nature of its repetitive elements.

For this project, when using Maker, the genome was masked prior to having RNA and protein alignments generated and gene predictions made by the gene predictors. Maker was configured so that two of the gene predictors (Augustus and SNAP) would subsequently make predictions on the unmasked sequences too. Those predictions would only be included in the final Maker output if other evidence gathered during the run supported them. As such supporting evidence comes in the form of alignments of transcripts or proteins to the same genomic area where the gene was predicted to reside, and since the protein and RNA alignments were generated on masked sequence, gene predictions from originally masked regions would be expected to be rare in the final output. The use of PASA/EVM provided another opportunity for such models to emerge, though. All “raw” outputs from the three gene predictors generated during the Maker run (with the exception of those from scaffolds 1-21 of the assembly, which were generated separately; see below), including the predictions made on the unmasked sequences, were included as inputs to EVM. If new transcript alignment information generated by PASA (run prior to EVM) provided supporting evidence for raw models located in originally masked regions, they might emerge in the final EVM output, as PASA transcript alignments were given a high weight in the EVM process. Additional details of the running of the annotation pipelines and the subsequent processing done to arrive at the final gene models are provided in the following methods section and the supplementary materials.

3.2. Materials and Methods

The process of producing the gene models required many steps. The text below attempts to explain the essentials without getting into too many details, though with so many details involved it was sometimes difficult to “boil things down” to a sentence or two. Readers unfamiliar with this type of annotation are therefore encouraged to refer to the two flow charts (Figures 3 and 4) developed for the overall annotation process to view the “big picture,” which should help in interpreting sections of the text. Supplementary materials section S.1 provides even more detail on the annotation process, focusing on the specific commands used for each step in the annotation pipelines, and may also aid understanding.

To start the annotation process, along with the newly assembled genome, a large number of FHM RNA reads collected from multiple early life stages and different tissues was required. Having enough tissues and life stages represented would provide a good chance of capturing the entire FHM transcriptome, which in turn provides the greatest opportunity to develop a comprehensive set of gene annotations. The annotation process essentially started with the generation of the RNA reads, and the entire process from that starting point is described below.

3.2a. Generation of input RNA reads

Total RNA was isolated using Tri-reagent (Applied Biosystems, Foster City, CA) following the manufacturer’s protocol and quantified using a Nanodrop ND-1000 (Thermo Fisher Scientific, Wilmington, DE). Prep quality was checked using an Agilent Bioanalyzer (Agilent Technologies, Santa Clara, CA) and RNA 6000 Nano kit.

Brain, liver, gonad, muscle, heart (n=3 from each sex), kidney, and gill were isolated from each of 5 adult male and 5 adult female FHMs (6-8 months old). Tissues from euthanized individual fish were snap frozen in liquid nitrogen and stored at -80 °C until processing. RNA for each tissue-sex combination was pooled, creating 14 pools (7 tissues x 2 sexes). Equal amounts of RNA from each pool were then combined to make a final adult pool for sequencing.

RNA was also isolated from 12 immature developmental stages: unfertilized eggs (3 replicate isolations, each with 30 eggs), 24 h post fertilization eggs (2 replicate isolations of 30 eggs each), 48 h post fertilization eggs (2 replicate isolations of 30 eggs each), 72 h post fertilization eggs (2 replicate isolations of 30 eggs each), newly hatched larvae (3 replicate isolations of 15 larvae each), 24 h post hatch larvae (3 replicate isolations of 15 larvae each), 48 h post hatch larvae (3 replicate isolations of 10 larvae each), 72 h post hatch larvae (3 replicate isolations of 10 larvae each), 96 h post hatch larvae (3 replicate isolations of 10 larvae each), 7 days post hatch larvae (3 replicate isolations of 8 larvae each), 17 days post hatch larvae (3 replicates of 4 larvae each), and 30 days post hatch larvae (5 replicate isolations of individual larvae). Equal amounts of total RNA from one replicate isolation of each developmental stage (except 30 d larvae, where all 5 replicates were used to make one 30 d pool) were combined to make a final juvenile pool for sequencing.

Adult and juvenile RNA pools were shipped on dry ice to the Research Technology Support Facility (RTSF) Genomics Core at Michigan State University for subsequent processing. After performing a quality check of the RNA, the RTSF prepared a separate library for each pool using the TruSeq Stranded mRNA library preparation kit LT (Illumina, San Diego, CA). Libraries were pooled in equal amounts and loaded onto both lanes of an Illumina HiSeq 2500 Rapid Run flow cell V1. Paired-end 150 base sequencing was performed using Illumina Rapid Run SBS reagents. Base calling was done with Illumina Real Time Analysis software v1.17.21.3, and output was converted to FASTQ format with Illumina Bcl2fastq v1.8.4.

The input RNA sequences (~320 million 150 bp paired-end reads) can be found at NCBI’s Sequence Read Archive (SRA) as accessions SRR10199005 (early life stage) and SRR10199006 (adult tissues). Other inputs into the annotation pipelines are discussed in the following sections as relevant.

3.2b. Assembly of RNA reads

3.2b.1. Error correction and trimming

To prepare for the production runs of the pipelines on the final genome, our ~320M paired-end input RNA sequence reads were error corrected using rCorrector (version 1.0.3.1; Song and Florea, 2015), after which we removed read pairs where one or both reads were flagged as 'unfixable_error' using FilterUncorrectabledPEfastq.py (github.com/harvardinformatics/TranscriptomeAssemblyTools). TrimGalore (version 0.6.1; github.com/FelixKrueger/TrimGalore; stringency was set to 1 and e-value to 0.1) was used to apply the Cutadapt algorithm (Martin, 2011) to remove adapter sequences from the RNA reads. Following these processes, ~280.5M paired-end reads remained.
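The pair-removal step can be sketched as follows. This mirrors the intent of FilterUncorrectabledPEfastq.py (a pair is dropped when either mate carries rCorrector's unfixable flag in its FASTQ header), but it is a simplified illustration, not that script's actual implementation; the toy records are invented.

```python
# Sketch of paired-end filtering after rCorrector: drop a read pair when
# EITHER mate is flagged as unfixable. FASTQ records are 4 lines each and the
# flag appears in the header line. Simplified illustration only; not the
# actual FilterUncorrectabledPEfastq.py implementation.

def read_fastq(lines):
    """Yield (header, seq, plus, qual) records from a list of FASTQ lines."""
    it = iter(lines)
    for header in it:
        yield header, next(it), next(it), next(it)

def filter_pairs(r1_lines, r2_lines, flag="unfixable_error"):
    kept1, kept2 = [], []
    for rec1, rec2 in zip(read_fastq(r1_lines), read_fastq(r2_lines)):
        if flag in rec1[0] or flag in rec2[0]:
            continue  # discard the whole pair, not just the flagged mate
        kept1.extend(rec1)
        kept2.extend(rec2)
    return kept1, kept2

# Toy paired files: pair 2 has an unfixable R1 mate and is dropped entirely.
r1 = ["@read1 cor", "ACGT", "+", "IIII",
      "@read2 unfixable_error", "ACGT", "+", "IIII"]
r2 = ["@read1 cor", "TGCA", "+", "IIII",
      "@read2 cor", "TGCA", "+", "IIII"]
k1, k2 = filter_pairs(r1, r2)
print(len(k1) // 4)  # 1 pair survives
```

Dropping the whole pair keeps the two output files synchronized, which downstream paired-end tools such as Trinity and STAR require.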


3.2b.2. RNA read assembly

Trinity (version 2.5.0) was used to perform both de novo and genome-guided assembly of the reads. Assembled reads were required as input to the gene model annotation pipelines. For the genome-guided assembly, STAR (version 2.6.0a; Dobin et al., 2013) was first used in two-pass mode to map the processed RNA-seq reads to the genome assembly. Additional details, including the specific commands used for the steps described in this paragraph, can be found in supplementary materials section S.1.

3.2c. Preliminary annotation runs

The Maker and PASA/EVM pipelines and all their required associated software were installed on two Linux workstations. Specific additional software required by the pipelines or otherwise used is listed in Table 6.

Table 6. Software tools employed in some manner in MAKER and PASA annotation pipelines

Tool (version)                          MAKER   PASA

RepeatMasker (open-4.0.7)                 X      X*

ncbi-blast+ (2.6.0 and 2.2.31)            X      X*

Exonerate (2.2.2)                         X      X*

BLAT (36x1)                                      X

gmap (2017-11-15)                                X

Augustus (3.2.2)                          X      X*

Genemark-ES (4.38)                        X      X*

SNAP (2006-07-28)                         X      X*

TransDecoder (5.5.0)                             X

Hmmer (3.1b2, vs. Pfam version 32.0)             X

EvidenceModeler (1.1.1)                          X

*Indirect use: outputs from these tools generated when run under MAKER were employed (primarily) as inputs to the “PASA” pipeline at the EVidenceModeler step.


Other software used is noted in the more detailed descriptions of the annotation process that follow.

After installing the pipelines, several preliminary runs were done to become familiar with the workflow and to make sure everything was working correctly. Of particular importance was one of the first Maker runs, done on an earlier draft genome assembly a member of our team had produced. This run is noted here because select gene models from it were used as inputs for species-specific training of our gene predictors, Augustus (version 3.2.2; Stanke, Keller et al. 2006), SNAP (version 2006-07-28; Korf, 2004), and GeneMark-ES (version 4.38; Lukashin and Borodovsky, 1998). Details about this preliminary Maker run and the training of the gene predictors are presented in supplementary materials section S.2.

Another set of preliminary Maker runs was also important. These runs were used to evaluate the use of rCorrector and TrimGalore, two programs eventually used in the annotation production runs described below in section 3.2d. The Maker models generated during this evaluation process were also used to retrain SNAP, and the SNAP parameter file generated from this training was used as input to the first production run iteration of Maker. These runs are described in supplemental section S.2c.

A final preliminary Maker run was done using a mature, well-annotated genome, that of zebrafish. This run was done to see what results Maker would produce for zebrafish if we mirrored the process used to run Maker on FHM. This was not an exhaustive evaluation, but rather a qualitative assessment. As the number of zebrafish genes has been well defined (though it continues to evolve), we decided to compare the number of zebrafish gene models produced by Maker to the published number of protein coding genes to see how Maker performed, thinking this could provide information useful when considering the FHM results produced by Maker. We hoped to see results reasonably consistent with those published for zebrafish, since zebrafish and FHM are relatively closely related. Similar results would be viewed as evidence that the Maker process we were employing for FHM was working as one should expect. Additional details of this preliminary Maker run are in supplementary materials section S.2d.

Preliminary testing of the PASA/EVM pipeline was also conducted. Details are not provided here, but using the same draft assembly as the first preliminary FHM Maker run, the pipeline was taken through all the required steps, up to and including the EVidenceModeler step, to successfully produce a set of single-transcript gene models.

3.2d. Maker and PASA/EVidenceModeler (EVM) annotation runs

With preliminary testing indicating that the pipelines were performing as expected, production runs of the two pipelines were conducted. Figure 3.A presents a flow chart of the process of running the two pipelines. Boxes in the flow chart are annotated with the numbers of the text sections below that discuss the corresponding steps. The reader is encouraged to refer to the figure to aid in understanding the following sections.

3.2d.1. Maker Production runs

For the final production runs, the first iteration of Maker (version 2.31.9; Cantarel et al., 2008) was run on the assembled genome using transposon sequences identified with RepeatMasker, the Trinity genome-guided RNA assembly, and ~22.5K non-redundant FHM ESTs downloaded from the Joint Genome Institute (JGI). An additional input was reference proteins from multiple species (Arabidopsis thaliana, Bacillus subtilis, Caenorhabditis elegans, Ciona intestinalis, Danio rerio, Dictyostelium discoideum, Drosophila melanogaster, Escherichia coli, Gallus gallus, Homo sapiens, Mus musculus, Lepisosteus oculatus, Oryzias latipes, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Xenopus tropicalis) downloaded from the European Bioinformatics Institute (EBI). The ab initio gene predictors Augustus, SNAP, and GeneMark-ES were used, employing hmm parameter files developed during the previously referenced species-specific training or, in the case of SNAP, the parameter file trained as


described in section S.2c. It is noted here that SNAP was retrained for the second iteration of Maker using all the gene models output from the first iteration; retraining in this manner is suggested in the Maker tutorial instructions. The final trained parameter files used for each predictor during the Maker runs, including both versions of the SNAP parameters, are available in the supplementary materials. Control files for all the Maker runs are also included in the supplementary materials, so all parameter settings employed by Maker are available.

Following the first Maker run, a second iteration of Maker was run. The input to the second run included the consolidated “.gff” output file from the initial run, which contained all the masking, protein alignment, and RNA alignment information collected during the first run. The consolidated gff file was made using Maker’s gff3merge utility script. An additional EBI reference protein file containing proteins from Branchiostoma floridae was also included as an input, along with the retrained SNAP parameter file referenced above. The parameter files for the other two gene predictors remained the same. The output from the second run was consolidated into a single gff file using gff3merge. A total of 30,909 gene models, with 47,716 transcripts, were output by Maker. Fasta sequence files containing all transcripts and their corresponding proteins were also output.

3.2d.2. PASA/EVidenceModeler (EVM) Production run

As an alternative path to produce gene models, PASA (version 2.3.3) was employed in concert with TransDecoder (version 5.5.0; transdecoder.sourceforge.net) and EVidenceModeler (EVM, version 1.1.1).

3.2d.2.1. PASA

PASA, used with default settings, aligned the concatenated de novo and genome-guided Trinity-assembled sequences to the genome using both blat (version 36x1; Kent, 2002) and gmap (version 2017-11-15; Wu and Watanabe, 2005), and produced final gene structures/transcripts by comparing and refining the alignment results from the two sources (Figure 3.A, step 3.2d.2.1).


Figure 3.A. Protein coding gene model annotation, Part 1 - Produce models from the MAKER and PASA annotation pipelines. Numbering to the side indicates the text sections where each step in the flow chart is discussed.


3.2d.2.2. TransDecoder

The PASA-assembled transcripts were used as input to the program TransDecoder.LongOrfs, which identified ORFs at least 100 amino acids long. As an optional step, ORFs with homology to known proteins were identified using blastp (version 2.2.31) alignment to our reference proteins and a hmmer (version 3.1b2) search of Pfam (version 32.0) domains. Following the optional homology searches, protein coding genes were predicted with TransDecoder.Predict using the default criteria: a minimum ORF length of 300 nucleotides and an internally calculated log-likelihood score >0. The outputs generated from the blastp and hmmer searches were leveraged by TransDecoder.Predict so that the final output coding region predictions included both regions having characteristics consistent with coding regions based on prediction and those that demonstrated blast homology to reference proteins or Pfam domain content. The TransDecoder steps are summarized as a single box in Figure 3.A, step 3.2d.2.2.

3.2d.2.3. EVidenceModeler (EVM) and final PASA processing

After Maker, PASA, and TransDecoder had been run, EVidenceModeler (EVM) and PASA were used in series to produce a set of final coding sequence models from the PASA/EVM pipeline, with some of the Maker output, as described below, serving as input. Most of the effort required to run EVidenceModeler centered on gathering the necessary inputs, so most of the text that follows covers that aspect of the EVM process.

3.2d.2.3.i. Preparation of inputs for EVM

3.2d.2.3.i.a. Gene predictions

Maker protein models and TransDecoder model peptide sequences were blasted against our reference proteins, and best hits were classified based on hit lengths as having coverage (of both query sequence and subject sequence) that was perfect or greater than or equal to 98%, 95%, 90%, 85%, 70%, 60%, or 50% (less than 50% not reported). Models found to have perfect coverage were extracted from the final (second iteration) Maker gff file and the TransDecoder gff file to serve as heavily weighted EVM inputs.

In addition to these “perfect” models, raw Augustus, SNAP, and GeneMark predictions from the second Maker run were extracted from the Maker output directories for scaffolds 22 and above; those predictions were contained in single files for each predictor (for Augustus and SNAP there were actually two files for each scaffold, one containing predictions made on the masked scaffold and one containing predictions made on the unmasked scaffold). For scaffolds 1-21, which represented the longest scaffolds in the assembly, the gene predictor outputs generated during the Maker run were parsed across multiple files representing overlapping (by 1M bases) chunks of the large scaffolds. For these scaffolds, the three gene predictors were run independently of Maker, as that was deemed simpler than developing a parser to piece together the models from the chunked files. To mirror Maker's behavior as closely as possible, hints files with intron/exon information were generated from the .bam file output of the second pass of the STAR alignment discussed earlier (section 2.2) for use with Augustus and GeneMark, and SNAP and Augustus were run on both the masked and unmasked input scaffolds (masked scaffolds were available from the "theVoid…" directory in the Maker output directories for each scaffold as the file "query.masked.fasta"). All the predicted gene models from Augustus, GeneMark, and SNAP were combined with the “perfect coverage” Maker and TransDecoder models described above to become part of a single input gene predictions file for EVM (“genePreds.all.with.maker.transdc.perfs.gff”).

3.2d.2.3.i.b. Protein and RNA alignment evidence

To prepare protein and RNA alignment evidence files as input for EVM, the blastn, blastx, est2genome, and protein2genome (the latter two generated by Exonerate) alignment information was extracted from the consolidated Maker gff file (the output of step 3.2d.1 in Figure 3.A). The PASA assemblies gff file (the output from Figure 3.A, step 3.2d.2.1) was concatenated with the blastn and est2genome (i.e., RNA alignment) information extracted from Maker to make the file “rna.evidence.gff”. The protein2genome and blastx (i.e., protein alignment) information extracted from Maker was concatenated into the file “protein.evidence.gff”. Custom Perl scripts were used to format the ninth field of the gff files into the format EVM expects.

3.2d.2.3.i.c. EVM weights file

EVM also requires a weights file as input; the assigned weights determine how heavily each form of input evidence supplied to EVM is counted. The weights assigned for the production run were as follows: Augustus, GeneMark, and SNAP predictions, 1; blastn, blastx, est2genome, and protein2genome alignments, 5; and PASA, TransDecoder, and Maker predictions, 10. The reader is reminded that the Maker and TransDecoder predictions included as EVM inputs were only those exhibiting perfect coverage to a reference protein, which is why they were assigned the highest weight.
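For reference, EVM's weights file is a tab-delimited listing with three columns: evidence class, source name, and weight. A sketch reflecting the weights above might look like the following; the source names shown are illustrative and must match the source (second) column of the corresponding gff evidence files:

```
ABINITIO_PREDICTION   augustus         1
ABINITIO_PREDICTION   genemark         1
ABINITIO_PREDICTION   snap             1
TRANSCRIPT            blastn           5
TRANSCRIPT            est2genome       5
PROTEIN               blastx           5
PROTEIN               protein2genome   5
TRANSCRIPT            assembler-pasa   10
OTHER_PREDICTION      transdecoder     10
OTHER_PREDICTION      maker            10
```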

The weights were selected based on guidance on the EVM web page and on testing done with a random subset of input genomic scaffolds run through EVM with different combinations of weights (details not provided). Gene models produced by EVM during the tests were assessed by running BUSCO on the output model proteins and by using STAR to align a randomly selected (USEARCH) subset of ~6.43M paired-end RNA-seq reads to the output model transcripts and comparing mapping rates.

Prior to mapping, the output EVM models were run back through two iterations (as recommended) of PASA in “annotCompare” mode to add UTRs and/or alternate transcripts to the models. After multiple rounds of weights testing, for the EVM production run we used what are essentially the “default” suggested weights, as those weights produced the best results during the testing.


3.2d.2.3.ii. EVM production run

Once all the input predictions and alignment evidence were gathered and the weights were settled on, the production run of EVM was performed. The specific commands used to accomplish the production run can be found in the supplementary section showing specific annotation pipeline commands. The production run of EVM produced 37,378 gene models for the entire genome, each represented by a single transcript/protein. There was no UTR information included in the output models; EVM only predicts coding sequence.

3.2d.2.3.iii. Final updating of EVM gene models with PASA

As mentioned in the previous paragraph, to add UTR sequence and/or generate alternative transcript models, the 37,378 EVM models were processed back through PASA in two iterations in “annotCompare” mode, along with the final PASA transcript assemblies fasta file, after first loading the EVM models into the underlying PASA MySQL database. Default settings were used. The output models produced from the first iteration were loaded into the database prior to the second iteration, supplanting the models that had been originally loaded. Following the second iteration there were 37,190 gene models with 59,057 transcripts, which represented the final output from our PASA/EVM pipeline. The output models were in gff format, and corresponding transcript and protein fasta files were parsed from the genome fasta file using custom Perl scripts or were generated with utility scripts downloaded with EVM.

The use of EVidenceModeler was the most complex part of the annotation process; however, for graphical simplicity it is summarized as a single box in Figure 3.A, 3.2d.2.3.

3.2d.3. Final comments on the annotation pipelines

Additional details for the steps described in this section and the preceding section (Maker), including the specific commands that were run and the names of input and output files, can be found in supplementary materials section 1. Figure 3.A provides a graphic overview of the overall process of running the annotation pipelines; however, particularly in the case of running EVM, the figure was not the best place to present all the details involved. The reader is encouraged to review supplemental section S.1 to gain a better understanding of all the details of running the annotation pipelines, particularly those related to running EVM.

3.2e. Assessment of model sets, selection of preferred set, and additional processing

The next step was to compare the models generated by the two pipelines to determine which set of models was preferred. Figure 4.A depicts the rest of the process used to arrive at our final model set beginning with the assessment of the model sets generated by each of the two pipelines.

3.2e.1. BUSCO analysis and mapping rate comparison

The 30,909 Maker models and 37,190 PASA/TransDecoder/EVM (aka “EVM”) models were assessed in two ways: presence of BUSCOs and RNA-seq mapping rates. To perform these assessments, the longest model proteins and their corresponding transcripts were selected as exemplars for any given gene. The reference set for the BUSCO analysis was BUSCO’s set of 4,584 single copy orthologs present in Actinopterygii (ray-finned fish). BUSCO was run in protein mode on both model sets. BUSCO reported 3,889 complete BUSCOs (3,625 single-copy, 264 duplicated) in the Maker gene models versus 4,312 complete BUSCOs (4,007 single-copy, 305 duplicated) in the EVM set. To assess mapping rates, the randomly selected ~6.43 million paired-end RNA-seq reads described earlier were mapped to the exemplar transcripts. STAR reported a total mapping rate of 77.93% (71.71% unique mappers) for the Maker models and 77.93% (72.85% unique mappers) for the EVM models. Table 7 summarizes the BUSCO results for the two model sets, and Figure 5 displays the results graphically. Table 8 presents the summarized mapping results. Based on the superior BUSCO results and equivalent mapping rates, the EVM set of models was selected as the preferred model set with which to continue.

Table 7. BUSCO comparisons between Maker and PASA/EVM models. 4,584 total BUSCOs. Exemplar proteins evaluated.

                                        Maker          PASA/EVM

Complete BUSCOs (C) 3889 (84.9%) 4312 (94.1%)

Complete and single-copy BUSCOs (S) 3625 (79.1%) 4007 (87.4%)

Complete and duplicated BUSCOs (D) 264 (5.8%) 305 (6.7%)

Fragmented BUSCOs (F) 271 (5.9%) 132 (2.9%)

Missing BUSCOs (M) 424 (9.2%) 140 (3.0%)

Table 8. Mapping rates of ~6.43M randomly selected paired-end FHM RNA reads to exemplar transcripts from Maker and PASA/EVM pipelines prior to selecting preferred model set.

Target               # of transcripts   Unique mapping %   Multimapping %   Total mapping rate
Maker exemplars      30,909             71.71%             6.22%            77.93%
PASA/EVM exemplars   37,190             72.85%             5.08%            77.93%


Figure 4.A. Protein coding gene model annotation, Part 2 – Evaluate candidate model sets, select preferred models, final processing and filtering


Figure 5. BUSCO comparison between exemplar proteins produced by the Maker (top) and PASA/EVM (bottom) annotation pipelines. Results are relative to 4,584 BUSCOs in the Actinopterygii reference data set.

3.2e.2. Process EVM models back through Maker

Given that the 30,909 Maker models exhibited mapping rates equivalent to the 37,190 EVM models, and therefore had a better mapping rate per gene, we decided to run the EVM models through Maker in an attempt to add additional UTR information and thereby boost mapping rates. The input to this Maker run included the alignment and repeat information collected during the first Maker run and, additionally, the PASA assemblies gff3 file (i.e., transcript alignment information generated by PASA after the original Maker production runs). The 37,190 EVM gene models were passed to Maker as predictions in the Maker control file rather than as models, so that they could be modified. Maker output only 25,612 final gene models from this run; it was later determined that the low number was primarily due to the “keep_preds” parameter in the Maker control file (“maker_opts.ctl”) being set to 0.

3.2e.3. Reintroduction of some EVM models

As our goal at this point was not to eliminate models but only to improve mapping rates, to return the number of models closer to the 37,190 that had been input to Maker, models from the input EVM set that did not exhibit coding sequence (CDS) overlap with any of the 25,612 models output by Maker were returned to the model set. During the overlap check, three sets of interleaved EVM models were discovered. While considered interesting and possibly bearing further examination, for now these interleaved models were resolved as follows. In the two cases where two genes were interleaved, the model showing the best coverage to a reference protein was kept. In the third case, involving three interleaved genes, the model with the biggest footprint was removed, which left two adjacent, non-interleaved models. Also while examining for overlap, it was discovered that 8 of the 25,612 models from Maker appeared to have introduced frame-shifts that resulted in protein sequences with premature stop codons, so they were also discarded. At this stage there were 36,881 models: 25,604 from the Maker run and 11,277 reintroduced from the EVM set input to Maker.
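The CDS-overlap screen used to decide which EVM models to reintroduce reduces to an interval comparison per scaffold. A minimal sketch follows; the data layout and function names are illustrative, not the actual in-house scripts:

```python
def cds_spans_overlap(a, b):
    """True if two (start, end) CDS spans on the same scaffold overlap.
    Coordinates are 1-based and inclusive, as in GFF."""
    (a_start, a_end), (b_start, b_end) = a, b
    return a_start <= b_end and b_start <= a_end

def models_to_reintroduce(evm_models, maker_models):
    """Return IDs of EVM models whose CDS span overlaps no Maker model
    on the same scaffold. Each model: (model_id, scaffold, start, end)."""
    by_scaffold = {}
    for _, scaf, start, end in maker_models:
        by_scaffold.setdefault(scaf, []).append((start, end))
    keep = []
    for model_id, scaf, start, end in evm_models:
        spans = by_scaffold.get(scaf, [])
        if not any(cds_spans_overlap((start, end), s) for s in spans):
            keep.append(model_id)
    return keep
```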

3.2e.4. BUSCO analysis and mapping rates comparison of input and output models

The exemplar proteins and transcripts from the 36,881 models were then assessed based on mapping rates and BUSCO completeness as before, where exemplars were chosen as the longest protein sequence from a given gene and its respective transcript. The results of these analyses were compared to those previously generated for the input set of 37,190 EVM models. For the 36,881 remaining models, the total mapping rate reported by STAR after mapping the ~6.43M randomly sampled paired-end RNA reads to the exemplar transcripts was 85.30% (78.55% unique mappers, Table 10). BUSCO reported 4,318 complete proteins (4,016 single-copy). The total mapping rate was thus improved by ~7.4 percentage points relative to the original 37,190 EVM models’ rate of 77.93% (Table 8), and the number of complete BUSCOs had also increased relative to the input EVM set (4,318 vs. 4,312, Tables 9 and 7 respectively). These 36,881 models (58,725 transcripts) were thus considered an improved model set. The final step would be to filter the models based on their mapping characteristics and homology to reference proteins, as described in the next section.

3.2f. Final model filtering

Given that the number of gene models remaining, 36,881, was higher than the number of genes one would expect to find based on, e.g., zebrafish, the models underwent a filtering process based on mapping evidence and homology to reference proteins. For each putative gene, as above, exemplars were chosen as the longest protein sequences and their corresponding transcripts. The exemplar proteins and transcripts were analyzed with BUSCO. The transcript BUSCO results showed fewer single-copy complete BUSCOs (3,988 vs. 4,016), so, conservatively, model mRNAs representing single-copy BUSCOs were identified from the transcript results in the “full_table” tab-separated-value file output by BUSCO. Bowtie2 (version 2.3.5.1; Langmead and Salzberg, 2012) was then used to map, as unpaired, our input set of ~640M 150-base FHM RNA reads to all 36,881 exemplar transcript models. In-house developed scripts were used to parse the “.sam” alignment output from Bowtie2 into a tab-delimited file with columns showing the name of the mRNA mapped to, the number of forward counts, and the number of reverse counts, where the counts represent the number of reads that mapped to a given mRNA (transcript) in the forward and reverse orientation, respectively. Altogether there were ~528M forward strand (considered “correct”) mappings and ~22M reverse strand mappings, with 35,736 of the 36,881 mRNA transcript models having reads map to them. Using the summarized mapping output, the tenth percentile of correct strand mappings/total mappings was calculated for the models (transcripts) that had been identified as single-copy BUSCOs. The resulting value was 0.842. This value was used as a cut-off for a two-sided lower 95% CI binomial estimate of the ratio of correct strand counts/total counts for non-BUSCO genes (transcripts); genes with an estimated lower CI boundary >=0.842 were retained in the final models.

Figure 6. BUSCO protein comparison between pre-filtered and final (filtered) gene models using exemplar protein sequences. Results are relative to 4,584 BUSCOs in the Actinopterygii reference data set. There was essentially no change in the number of BUSCOs as a result of filtering models from the 36,881 input models down to the final 26,150 models.
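The strand-bias criterion (a lower 95% binomial CI bound on the correct-strand fraction, with cutoff 0.842) can be sketched as below. The text does not specify which interval construction was used; the Wilson score interval shown here is one common choice, and the function names are hypothetical:

```python
import math

def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the ~95% Wilson score interval for a binomial proportion."""
    if total == 0:
        return 0.0
    p = successes / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / (1 + z * z / total)

def passes_strand_filter(forward: int, reverse: int, cutoff: float = 0.842) -> bool:
    """Retain a transcript if the lower CI bound on its correct-strand fraction
    (forward / (forward + reverse)) meets the cutoff derived from BUSCO models."""
    return wilson_lower_bound(forward, forward + reverse) >= cutoff
```

A transcript with heavy, strongly forward-biased coverage passes (e.g., 990 forward vs. 10 reverse reads), while one with an even strand split does not, since its lower bound falls well below the cutoff.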

The criterion described above was based on mapping. A homology criterion was applied to evaluate models that may be valid but that represent genes that were not highly expressed in our data. Diamond (version 0.9.22.123; Buchfink et al., 2015) was used to perform a blastx search in which the query sequences were transcript models not identified as single-copy BUSCOs, and the target was our EBI reference proteins. The tenth percentile of percent ID to the reference proteins was 0.503, and the tenth percentile of percent overlap to the best hit reference protein was 0.404. These values were used as homology cut-offs; models that met both homology criteria were also kept as final models.
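Putting the mapping and homology criteria together, the retention decision can be expressed compactly. The assumption that single-copy BUSCO models were retained outright is implied but not stated explicitly in the text, and the function signature is ours:

```python
from typing import Optional

def keep_model(is_single_copy_busco: bool,
               strand_ci_lower: float,
               best_hit_pct_id: Optional[float],
               best_hit_pct_overlap: Optional[float]) -> bool:
    """Final model filtering: keep single-copy BUSCO models, models passing
    the strand-mapping CI cutoff, or models meeting both homology cutoffs."""
    if is_single_copy_busco:
        return True
    if strand_ci_lower >= 0.842:          # mapping criterion
        return True
    if best_hit_pct_id is None or best_hit_pct_overlap is None:
        return False                       # no Diamond hit to the references
    return best_hit_pct_id >= 0.503 and best_hit_pct_overlap >= 0.404
```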

The filtering process reduced the final number of models from 36,881 to 26,150, which approximately equals the estimated protein coding gene number in zebrafish. Though this was promising, we checked to see that the filtering process had not significantly altered the BUSCO content or mapping rates. To evaluate the effect of filtering on the mapping rates, the previously described set of ~6.43M randomly selected paired-end reads was mapped with STAR to both the pre-filtered and filtered model sets. The effect on mapping rates (Table 10) was negligible. BUSCO analysis was performed on the filtered protein exemplar sequences and the results compared to the BUSCO results for the pre-filtered models. The BUSCO results (Table 9, Figure 6) also demonstrated almost no change, with the filtered set reporting the same results as the unfiltered set save for the loss of one single-copy BUSCO. Based on the maintenance of BUSCO content and mapping rates, and given the proximity of the number of remaining models (26,150) to the number of protein coding genes in zebrafish, we accepted the filtered models as our final gene set.


Table 9. BUSCO comparison between pre-filtered and final filtered gene models. 4,584 total BUSCOs. Exemplar proteins evaluated.

                                        Pre-filtering       Post-filtering
                                        (36,881 proteins)   (26,150 proteins)
Complete BUSCOs (C)                     4318 (94.2%)        4317 (94.2%)
Complete and single-copy BUSCOs (S)     4016 (87.6%)        4015 (87.6%)
Complete and duplicated BUSCOs (D)      302 (6.6%)          302 (6.6%)
Fragmented BUSCOs (F)                   120 (2.6%)          120 (2.6%)
Missing BUSCOs (M)                      146 (3.2%)          147 (3.2%)

Table 10. Mapping rates of ~6.43M randomly selected paired-end FHM RNA reads to pre-filtered and post-filtered (final) gene models.

Target                # of transcripts   Unique mapping %   Multimapping %   Total mapping rate
Pre-filtered models   36,881             78.55%             6.75%            85.16%
Filtered models       26,150             78.32%             6.61%            84.93%

Following final filtering, Maker was used a final time to generate additional information for the final transcripts/proteins. Maker was run as before using the 37,190 EVM models as prediction inputs, except in this case the “keep_preds” option was set to 1 instead of 0. This run resulted in 34,620 output gene models (as compared to 25,612 gene models when the “keep predictions” parameter had not been set to 1). The number of output models is provided for informational purposes only, as the output from this run was not used for anything other than as described hereafter. If an input EVM gene model output by this Maker run had not been included in the 25,612 output models from the previous Maker run, and if that model had survived filtering so that it remained in the final set of 26,150 gene models, and if its newly output transcript sequence AND protein sequence were identical to the corresponding sequences in the final filtered set, then it was deemed acceptable to transfer the Maker AED and eAED values generated in this Maker run to the models in the final set. (Note: AED and eAED are metrics produced by Maker during the annotation process that reflect the confidence in the annotations on a scale of 0-1, with 0 representing the best possible score and 1 the worst. Generally speaking, the metrics reflect the amount of evidence supporting an annotation. eAED in theory attempts to correct for some factors that AED doesn’t take into account; however, in most cases eAED and AED will be identical.) It is noted here that this final Maker run is not depicted in Figures 4 or 4.A, which stop after the filtering process.
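The three conditions governing AED/eAED transfer can be stated compactly; the names below are illustrative rather than taken from the actual scripts:

```python
def can_transfer_aed(model_id: str,
                     prev_maker_ids: set,
                     final_model_ids: set,
                     new_transcript: str, new_protein: str,
                     final_transcript: str, final_protein: str) -> bool:
    """True only when all three conditions hold: the model was absent from
    the earlier 25,612-model Maker run, survived filtering into the final
    26,150-model set, and both sequences are identical to the final ones."""
    return (model_id not in prev_maker_ids
            and model_id in final_model_ids
            and new_transcript == final_transcript
            and new_protein == final_protein)
```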

3.3. Results

There were 26,150 gene models with 47,578 transcripts in the final filtered models. The genomic coordinates for the gene models and transcript models can be found in the supplementary gff3 file “EPA_FHM.2.0.gff”. To evaluate improvements in the gene model annotations relative to the previously published annotations (Saari et al., 2017), BUSCO analysis was performed. The previous annotations had two primary forms: Augustus model predictions, and models derived using Exonerate’s coding2genome process, where zebrafish reference coding sequences, downloaded from EMBL, were aligned to the previously published draft assembly. The former consisted of 43,345 gene models and the latter 36,911 gene models. As the previous annotations consisted primarily of single transcript models, the BUSCO analysis of the new models was performed using only exemplar transcripts and proteins.

Table 11 summarizes the results of the BUSCO analyses on the three sets of proteins and transcripts, and Figure 7 displays the results graphically.


Table 11. BUSCO comparison between new (EPA_FHM_2.0) annotations and published first generation annotations. Protein and transcript exemplar sequences evaluated for EPA_FHM_2.0. Published annotations include Augustus predictions and predictions from Exonerate's coding2genome function, from aligning zebrafish reference transcripts from EMBL to the 2016 FHM assembly. 4,584 target BUSCOs. Displayed are “number of BUSCOs in category (percentage)”.

                          EPA_FHM 2.0                 Published annotations
                          Exemplar     Exemplar      Augustus     Augustus      coding2genome  coding2genome
                          proteins     transcripts   proteins     transcripts   proteins       transcripts
Complete BUSCOs (C)       4317 (94.2)  4336 (94.6)   3119 (68.0)  3036 (66.2)   3196 (69.7)    3174 (69.2)
Complete and single-copy
BUSCOs (S)                4015 (87.6)  3986 (87.0)   2930 (63.9)  2885 (62.9)   1958 (42.7)    1964 (42.8)
Complete and duplicated
BUSCOs (D)                302 (6.6)    350 (7.6)     189 (4.1)    151 (3.3)     1238 (27.0)    1210 (26.4)
Fragmented BUSCOs (F)     120 (2.6)    113 (2.5)     808 (17.6)   818 (17.8)    567 (12.4)     596 (13.0)
Missing BUSCOs (M)        147 (3.2)    135 (2.9)     657 (14.4)   730 (16.0)    821 (17.9)     814 (17.8)


Figure 7. BUSCO comparison between new (“exemplarProteins”, “exemplarTranscripts”) annotations and previously published first generation annotations. Published annotations include Augustus predictions and predictions from Exonerate's coding2genome function (from aligning zebrafish reference transcripts from EMBL to the published 2016 FHM assembly). Reported results are relative to 4,584 BUSCOs in the Actinopterygii reference data set.

Given that our primary goal in producing the new gene models was for them to serve as the basis for comprehensive RNA expression studies, we also compared mapping rates to transcripts from the previous annotations and to exemplar transcripts of the new models. We mapped ~16M randomly selected (USEARCH) paired-end reads from our ~320M early life stage/mixed-tissue FHM cDNA reads to transcript models from each annotation set using STAR with identical mapping parameters. Table 12 displays the mapping comparison results. Both the BUSCO analysis and mapping comparison demonstrate that the new models represent a significant improvement over the previously published first-generation annotations.

Table 12. Mapping rates of ~16M randomly selected paired-end RNA reads to transcripts, published FHM annotations (Augustus, coding2genome) and exemplar transcripts of new annotations (EPA_FHM).

Target                         # of transcripts   Unique mapping %   Multimapping %   Total mapping rate
EPA_FHM exemplar transcripts   26,150             73.60%             11.16%           84.76%
Augustus                       43,345             49.60%             9.29%            58.89%
coding2genome                  36,911             24.21%             26.69%           50.90%

The BUSCO analysis results on the new annotations indicate a high level of completeness. To gauge the completeness of our models relative to the well-studied and well-annotated genome of a similar species, we performed BUSCO analysis and examined mapping rates in Danio rerio (zebrafish). We downloaded from NCBI the most recent version of the predicted RNA and protein sequences produced by NCBI's Gnomon annotation pipeline for zebrafish (GCF_000002035.6_GRCz11_gnomon_rna.fna and GCF_000002035.6_GRCz11_gnomon_protein.faa, respectively). Comparing the new FHM annotations to the Gnomon annotations seemed appropriate given the process used to produce the new FHM annotations. BUSCO analysis results comparing the completeness of the NCBI zebrafish models to the new FHM models are in Table 13, while Figure 8 displays the results graphically. The BUSCO results indicate that the zebrafish models are more complete. This is not a surprising result given the overall resources and time that have gone into annotating the zebrafish genome. It is worth noting that the Gnomon models include 87,248 model transcripts and 74,211 model proteins, compared to the 47,578 transcripts/proteins in the new FHM models, which makes it more likely that the zebrafish models would produce more complete BUSCO results.


Table 13. BUSCO comparisons between new (EPA_FHM_2.0) annotations and NCBI zebrafish Gnomon annotation pipeline predictions. Protein and transcript models evaluated.

                                        Proteins                     Transcripts
                                        EPA_FHM 2.0   zfGnomon       EPA_FHM 2.0   zfGnomon
Complete BUSCOs (C)                     4320 (94.2)   4528 (98.8)    4342 (94.8)   4526 (98.7)
Complete and single-copy BUSCOs (S)     2816 (61.4)   2200 (48.0)    2744 (59.9)   2206 (48.1)
Complete and duplicated BUSCOs (D)      1504 (32.8)   2328 (50.8)    1598 (34.9)   2320 (50.6)
Fragmented BUSCOs (F)                   119 (2.6)     33 (0.7)       109 (2.4)     34 (0.7)
Missing BUSCOs (M)                      145 (3.2)     23 (0.5)       133 (2.8)     24 (0.6)

To get a simple evaluation of how the mapping rates of RNA (cDNA) reads to our new models compared to the rates seen when using a well-studied genome, we mapped reads to both the new FHM transcripts and to the zebrafish Gnomon RNA sequences. For FHM, we used the previously described ~16M randomly selected paired-end reads. For zebrafish, we randomly selected (USEARCH) ~16M reads from the paired-end reads that had been downloaded to do the preliminary Maker run on zebrafish. We used STAR with the same parameter settings in each case to perform the mapping. Table 14 summarizes the mapping results. Zebrafish exhibited a slightly higher mapping rate, but the zebrafish reads were on average only approximately half the length of the FHM reads, which likely boosted the zebrafish mapping rate. Considering this, along with the greater number of potential targets the zebrafish models offered, the mapping rates exhibited by the new FHM models compare very favorably, providing confidence that the new gene models represent a reasonably complete set of genes.


Figure 8. BUSCO comparison between new (“EPA_FHM”) model protein and transcript sequences and NCBI Gnomon annotation pipeline predictions for the latest zebrafish genome (GRCz11). Reported results are relative to 4,584 BUSCOs in the Actinopterygii reference data set.

Table 14. Mapping rates of ~16M randomly selected paired-end RNA reads to new FHM transcript models and NCBI Gnomon predicted transcripts for zebrafish.

Target                       # of transcripts   Unique mapping %   Multimapping %   Total mapping rate
NCBI-zf-Gnomon transcripts   87,248             35.21%             36.35%           71.56%
EPA_FHM transcripts          47,578             42.43%             25.75%           68.18%


One additional comparison was done to zebrafish (and four other fish species). In Moss, Joyce, et al. (2011), the authors examined intron structure in five teleost species including zebrafish. A genomic comparison focusing on introns was presented in the publication as Table 1. We calculated the same statistics for our final exemplar transcript models (using Perl scripts); the published data were derived from canonical transcripts (as designated by EMBL), so our use of exemplars seemed appropriate.

Table 15 displays the previously published data and the corresponding data generated for FHM. As might be expected, the statistics for FHM appear to most closely resemble those of zebrafish.

Interesting to note is the AT richness of the FHM introns. The preliminary analysis of the repeat content of the genome referred to in the previous chapter indicates that low complexity/simple repeats that are AT rich may be a distinguishing feature of the FHM genome. The AT richness of the FHM introns appears to support the preliminary genomic repeat data. We also calculated the frequency classes of splice signals within our exemplar introns: 98.44% of the introns had the canonical GT/AG signal, 0.90% were GC/AG, 0.13% were AT/AC, and 0.53% represented an alternative motif.
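Classifying splice signals from extracted intron sequences reduces to inspecting each intron's terminal dinucleotides. A minimal sketch (assuming intron sequences are supplied 5'->3' on the coding strand; the function name is ours):

```python
from collections import Counter

def splice_signal_classes(introns):
    """Tally introns by donor/acceptor dinucleotides: the first two bases
    (donor site) and the last two bases (acceptor site) of each intron."""
    counts = Counter()
    for seq in introns:
        seq = seq.upper()
        motif = f"{seq[:2]}/{seq[-2:]}"
        if motif in ("GT/AG", "GC/AG", "AT/AC"):
            counts[motif] += 1
        else:
            counts["other"] += 1
    return counts
```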

Though not used in the current project, Maker AED and eAED values were included as part of the annotations for transcript models (mRNAs) as potentially useful information; the AED values were added to the terminal field of “mRNA” lines in file EPA_FHM_2.0.gff. 43,544 of the 47,578 mRNAs had AED/eAED data assigned directly from Maker, while an additional 1,834 reintroduced EVM transcripts had AED and eAED values assigned based on the final Maker run described at the end of the methods section. A total of 4,034 reintroduced EVM transcript models were present in the final set post-filtering, so 45.5% of those had AED/eAED information added as a result of the final Maker run. The remaining 2,200 (54.5%) reintroduced EVM transcripts present in the final models have no AED information associated with them.

Table 15. Genomic and intron statistics for FHM exemplar transcripts and for five teleost species as published in Moss et al. (2011).

                         Danio          Pimephales     G.           Oryzias      Takifugu     Tetraodon
                         rerio          promelas       aculeatus    latipes      rubripes     nigroviridis
Genome size              1,412,464,843  1,066,412,313  461,533,448  868,983,502  393,312,790  358,618,246
Protein coding genes     24,803         26,150         20,109       18,920       17,876       18,872
Canonical transcripts    24,803         26,150         20,109       18,920       17,876       18,872
Introns per gene         8.93           9.71           9.93         9.80         10.51        9.96
Number of introns        221,589        253,821        199,624      185,494      187,962      187,875
Maximum intron length    378,145        381,255        175,269      295,125      93,537       631,227
Total intron length      622,476,590    505,387,110    151,619,269  219,591,667  108,524,412  90,447,562
Percentage of genome     44.07%         47.39%         32.85%       25.27%       27.59%       25.22%
Mean length              2,809          1,991          760          1,184        577          481
Median length            984            469            219          252          143          118
Mode length              84             85             85           77           78           76
25th percentile length   138            117            104          90           84           80
75th percentile length   2,563          1,500          615          1,026        450          350
GC content               50.58%         37.09%         50.48%       47.10%       40.39%       49.21%
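The FHM intron statistics in Table 15 were produced with Perl scripts not shown here; the quantities themselves are straightforward to compute from the intron sequences, e.g. (nearest-rank percentiles; the helper name is ours):

```python
import math
import statistics

def intron_stats(intron_seqs, genome_size):
    """Per-genome intron summary statistics analogous to Table 15."""
    lengths = sorted(len(s) for s in intron_seqs)
    total = sum(lengths)
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in intron_seqs)

    def percentile(p):
        # nearest-rank percentile on the sorted lengths
        return lengths[max(0, math.ceil(p / 100 * len(lengths)) - 1)]

    return {
        "number_of_introns": len(lengths),
        "maximum_intron_length": lengths[-1],
        "total_intron_length": total,
        "percentage_of_genome": round(100 * total / genome_size, 2),
        "mean_length": round(statistics.mean(lengths)),
        "median_length": statistics.median(lengths),
        "mode_length": statistics.mode(lengths),
        "p25_length": percentile(25),
        "p75_length": percentile(75),
        "gc_content": round(100 * gc / total, 2),
    }
```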


3.4. Discussion

The new gene models, coupled with the new genome, represent a major advance, providing a robust framework for gene expression studies and opening the door to a much deeper understanding of the changes occurring at the molecular level in response to toxins and other environmental stressors. The BUSCO analysis results for the exemplar proteins and transcripts indicate a high level of completeness (>94%) and represent a significant (>20%) improvement over the levels of completeness exhibited by the annotations developed from the earlier published genome (Table 11, Figure 7). However, there are over 300 duplicated BUSCOs reported for both the FHM transcripts and proteins, indicating that some regions of the genome assembly may still be redundant and that some transcript models require additional refinement. Some BUSCOs are duplicated in a limited number of the reference species, though, so additional investigation of the duplicates is merited. If there is overlap between BUSCOs duplicated in the FHM results and those duplicated in other species, the multiple gene copies seen in FHM may be real and, if so, would provide additional confirmation of the quality of the gene models and underlying assembly.

Comparing the BUSCO analysis of the final model proteins (Table 9) to that of the entire genome assembly (Table 3), there are 100 more complete single-copy BUSCOs present in the genome analysis. This indicates that there are likely additional gene models to be mined from the assembly. Given that we know the gene model filtering process didn't affect the number of reported BUSCOs, the best place to start looking for additional models is in the raw outputs provided by the three gene predictors. The genomic coordinates of the BUSCOs from the genome analysis that are not present in the protein analysis results will be known, so the raw output from the predictors can be examined to see if any raw predictions that were filtered out by the Maker and EVM pipelines were made at those coordinates and might be recovered, or at least serve as starting points for manual gene annotations. Overall, efforts will continue to refine, improve, and amend the gene models, including investigating the ones identified in the genome analysis that are missing from the protein analysis.

The final total of 26,150 protein-coding genes compares very favorably with the reference genome of the closely related zebrafish, which is currently estimated to have ~26,000 coding genes. The specific comparisons to the zebrafish Gnomon predictions provide further evidence that the models we have produced are relatively complete and of relatively high quality. Table 13 and Figure 8 display the comparative BUSCO results for predicted transcripts and proteins from the final FHM models and the zebrafish Gnomon predictions. For both transcripts and proteins, the zebrafish models report roughly 4% more complete BUSCOs (98+% vs. 94+%). This may in part be due to the larger number of model sequences available for zebrafish (87,248 vs. 45,578 for FHM). Even if the difference is not driven by the additional sequences, we are highly satisfied with the results given the maturity of the zebrafish genome and the amount of resources that have gone into developing and curating it. Given that this was our group's first attempt at genome assembly and protein-coding gene annotation, we consider our models' BUSCO coverage being as close as it is to the zebrafish's a major achievement.

The comparative analysis of (principally) introns between FHM and five other species (Table 15) served as an additional source of information from which the quality of our gene models might be inferred. The general concordance of the FHM statistics with those of its family member zebrafish is encouraging. The Moss publication reports that among the five species it examined, the GT/AG (GU/AG) splice signal motif occurs most frequently in zebrafish introns, representing 93.57% of the total splice signals. In our exemplar FHM transcripts, the rate of occurrence of this canonical motif was 98.44%. Because our models were produced in an automated fashion, we believe the annotation pipelines may be biased toward canonical splice signals, and manual curation of the new models might lower the fraction of the canonical signal.


A phylogenetic analysis, the details of which will be presented in the upcoming genome publication, was used to assign names to our gene models by mapping the final exemplar proteins to reference proteins from the species in our reference protein set. Of the 26,150 genes, 20,564 were assigned names through this analysis, with 5,603 of these having "-like" appended to the name inherited from their reference ortholog, indicating that their homology to the reference protein did not meet our criteria for full name inheritance. The names of the 20,564 named genes were added to the last field of the "gene" and "mRNA" lines in file EPA_FHM_2.0.gff. The 5,586 models that did not receive names kept the simple names generated during the annotation process (e.g., "scaf1.5" is scaffold 1, gene 5). Again, the naming process will be described in detail in the genome publication.

The genome and models were uploaded to NCBI (Accession WIOS00000000) and will become publicly available when the genome manuscript is published in the coming months. Additional information (e.g., repeat, ncRNA, and phylogenetic annotations) continues to be generated and written up by our team; it will be added to the uploaded genome or otherwise deposited at NCBI as part of our project, and will also be part of our publication.


Chapter 4 - Using the FHM gene models in an RNA-Seq expression experiment to develop predictors of exposure

4.1. Introduction and Background

Omics have long been identified as a potential solution for many issues with environmental regulation (Ankley, Daston et al. 2006). One potential application of omics-based technologies is the ability to group chemicals with similar modes of action (MOA). Such groupings can be used to facilitate read-across from data-rich to data-poor chemicals to more accurately predict risk. Alternatively, they can be used forensically to identify causes of impairment in impacted systems. Though limited examples exist in the ecotoxicological literature, there are examples in the human health arena where this has been done (Dumur, Fuller et al. 2011, Subramanian, Narayan et al. 2017).

One of the major hurdles to applying sequencing-based technologies to the FHM was the lack of a well-assembled genome with annotated protein-coding gene models. The generation of our high-quality new assembly and comprehensive protein-coding gene models allows, for the first time, the full application of sequencing-based technologies to the FHM, providing a valuable basis upon which meaningful RNA-seq transcript profiling studies can be executed and interpreted. Part of the reason for conducting this study is, in fact, to demonstrate that we can effectively employ the new genome and gene models as the basis for such work.

Our group has successfully applied a number of classification algorithms to identify chemical groups in aquatic exposures (Wang, Bencic et al. 2012, Flick, Bencic et al. 2014, Biales, Kostich et al. 2016, Kostich, Bencic et al. 2019). Specificity is one of the major concerns with omics-based groupings. It is difficult to test because a classifier cannot be evaluated against all possible substances; one approach, however, is to evaluate performance when the classifier is challenged with a chemical with very different toxicological properties. Though this approach is in no way exhaustive, it should allow us to determine whether classifiers are utilizing chemical-specific features or features associated with general stress. In this study, we evaluated the performance of gene expression-based classifiers on the toxicologically unrelated chemicals bifenthrin and copper. Our model test system employed FHM larvae, selected primarily on practical grounds. The use of adult organisms requires considerably more resources (i.e., water) per exposure, making them impractical for screening and for applications where water would need to be shipped to a test facility. Additionally, as chemicals target different tissues/organs, large-scale screening of chemicals would require classification in each tissue/organ separately, greatly adding to the resource demands. Use of whole larvae has the disadvantage of averaging across tissues and makes mechanistic interpretation difficult, as different tissues often respond in very different ways. We recently evaluated the larval model relative to tissue-specific classification and it performed similarly (Kostich et al., 2019).

Bifenthrin affects the kinetics of voltage-gated sodium channels in neuronal axons, causing an influx of sodium into the neuron and depolarization, resulting in overstimulation of innervated cells (reviewed in Soderlund and Bloomquist, 1989). Environmental concentrations of bifenthrin (and other pyrethroid pesticides) are thought to negatively impact aquatic ecosystems (e.g., Ali et al., 2011, Weston et al., 2008, Pennington et al., 2014). Copper (Cu) is a vital mineral essential for many biological processes. When copper homeostasis is disrupted, excess Cu leads to toxicity associated with hepatic disorders, neurodegenerative changes, and other deleterious effects and disease conditions. Disease and deleterious effects are most often linked to copper's role as a redox-active metal capable of initiating oxidative damage, driving cellular toxicity (reviewed in Gaetke, Chow-Johnson, and Chow, 2014).

The primary goals of this study were to develop classifiers of exposure to two toxicants having different mechanisms of action, and to test whether the specific toxic effects of each toxicant can be robustly detected at the transcriptomic level. To that end, we used a number of statistical and machine learning methods, in conjunction with a cross-validation framework, to train classifiers of exposure and assess their overall performance. We also compared the most differentially expressed genes from each treatment to look for evidence that classification was based on toxin-specific responses.

4.2. Methods

4.2a. Summary

In this study FHM larvae were exposed for 48 hours to sublethal concentrations of the toxicants bifenthrin (CAS RN 82657-04-3, exposure concentration = 3.2 µg/L) and copper (CAS RN 7440-50-8, exposure concentration = 40 µg/L). RNA was extracted from whole fish and RNA sequencing libraries were prepared, resulting in ten bifenthrin-treated samples, eleven copper-treated samples, and eleven untreated controls. Sequencing was performed on two lanes (16 samples/lane) of an Illumina HiSeq 4000, producing between 15.9M and 25.8M 50-base single-ended reads per sample.

The stranded single-end RNA reads (50 bases) were mapped to exemplar model transcripts from our group's newly annotated FHM genome using the STAR aligner. Reads mapping to transcripts with the expected strand orientation were tabulated. For each of the two toxicants, a feature (gene) by observation (biological sample) read count matrix was generated. The count data were normalized by standard methods, and lists of differentially expressed genes (DEGs) between control and exposed fish were generated. All 26,150 features were present in the lists, ranked by adjusted p-value (false discovery rate, FDR) in ascending order. For each toxicant/control combination, the DEG lists were fed to five different statistical classifiers, which were tuned and trained in a "wrapped" cross-validation framework, and classifier performance was evaluated based on the mean squared prediction error, cross-entropy, and misclassification rate. A final classifier was generated for each of the five algorithms, and final predictions were made on the samples treated with the opposite chemical from the one on which each classifier was trained. During tuning/training, the number of features used by each classifier was not predetermined but was selected during the tuning process (the tuning process was "seeded" with feature counts selected randomly on a log scale in the range of 10 to 1000). For each toxicant, a filtered set of DEGs was used in a functional analysis to identify potential biomarkers of exposure. Additional experimental details are provided in subsequent sections.

4.2b. Exposure organisms

Fathead minnows (Pimephales promelas) were obtained from the on-site culture at the U.S. EPA Andrew W. Breidenbach Environmental Research Center (AWBERC) in Cincinnati, OH. Larvae were obtained using methods outlined in Lewis et al. (1994). However, to ensure that larvae were closely synchronized in their development, spawning tiles were placed in breeding tanks in the morning and were removed after one hour. Eggs were placed in an aerated separatory funnel for three days, then transferred to a plastic container and gently aerated with an air stone. Eggs and larvae were maintained in an incubator at 25 °C. Adults, eggs and larvae were maintained in dechlorinated tap water supplemented with CaCO3 to a hardness of 180 mg/L. Animal use was approved by the AWBERC Animal Use and Care Committee.

4.2c. Test chemicals and exposure water

A master stock solution of bifenthrin was prepared in DMSO at a concentration of 0.16 mg/ml. The copper sulfate master stock was prepared in deionized water at a concentration of 2.0 mg/ml.

All exposures were performed in moderately hard reconstituted water (MHRW; Lewis et al., 1994).

Exposure solutions were prepared by diluting 10 µl of the appropriate stock solution in 500 mL MHRW to a final concentration of 40 µg copper/L or 3.2 µg bifenthrin/L. DMSO was added to both the copper solution and the MHRW control water (10 µl in 500 ml) to obtain a uniform 0.002% DMSO in all exposure solutions. Exposure solutions were prepared fresh each morning.

4.2d. Exposures

Only larvae that had hatched at the start of the experiment (approximately 96 hpf) were used in exposures. For each treatment and for the controls, ten larvae were placed in each of five replicate 150 mL beakers containing 100 mL exposure solution. Beakers were arranged randomly on a tray, covered with an acrylic sheet to minimize evaporation, and placed in an incubator at 25 °C with a 16:8 h light:dark cycle. At 6 and 24 h, beakers were checked, and any dead larvae were removed. Three larvae from each beaker were removed at each of these time points as part of a separate experiment. After 24 h, approximately 90 mL of the water was removed and replaced with fresh exposure solution. At the end of the 48-h exposure, three larvae were removed from each beaker, and each larva was placed into a 1.5 ml microcentrifuge tube and frozen in liquid nitrogen. A 3.2 mm stainless steel bead was placed in each tube prior to the addition of the larva. Samples were transferred to -80 °C for storage.

Over the course of the 48-h exposure, bifenthrin-exposed larvae became increasingly lethargic. Copper- exposed larvae were somewhat lethargic after 24 hours but showed signs of recovery and were swimming more normally at the end of the 48-h exposure. Three copper-exposed larvae, one bifenthrin- exposed larva and no control larvae died during the exposures.

4.2e. RNA isolation and preparation of sequencing libraries

RNA was extracted from larvae using the MagMAX™-96 Total RNA Isolation Kit (Thermo Fisher Scientific) following the manufacturer's protocol. Samples were removed from -80 °C and immediately placed on ice. MagMax lysing/binding buffer (100 µl) was added to each tube. Samples were homogenized using a Bullet Blender Storm 24 homogenizer (Next Advance, Troy, NY). RNA was quantified spectrophotometrically on a Synergy™ HTX Multi-Mode Microplate Reader using a Take3 Micro-Volume Plate (Biotek). RNA quality was assessed using a 4200 TapeStation (Agilent).

Sequencing libraries were prepared using the SENSE mRNA-Seq Library Prep Kit V2 (Lexogen). Ten or eleven larvae (at least two from each beaker) from each treatment were used. All libraries were prepared from 250 ng total RNA according to the manufacturer's protocol supplied with the kit. The concentration of each library was determined using the Qubit™ dsDNA HS Assay with a Qubit 2.0 Fluorometer (Thermo Fisher Scientific).

4.2f. RNA Sequencing

Sequencing libraries (10 bifenthrin treated, 11 copper treated and 11 controls) were combined into two separate pools of 16 libraries each, with equimolar amounts of each sample added to the pools. The pooled libraries were shipped overnight on dry ice to the Research Technology Support Facility (RTSF) Genomics Core at Michigan State University for sequencing. Pools were QC'd and quantified using a combination of Qubit dsDNA HS, Agilent 4200 TapeStation HS DNA1000 and Kapa Illumina Library Quantification qPCR assays. The two pools were loaded onto one lane each of an Illumina HiSeq 4000 flow cell and sequencing was performed in a 1x50bp single-read format using HiSeq 4000 SBS reagents. Base calling was done by Illumina Real Time Analysis (RTA) v2.7.7, and the output of RTA was demultiplexed and converted to FastQ format with Illumina Bcl2fastq v2.19.1.

4.2g. Mapping

The 50-base reads in the RNA-seq fastq(.gz) file for each sample were mapped with STAR to indexed exemplar transcripts from the 26,150 protein-coding gene models of the new FHM genome.

Following STAR mapping, two in-house-developed C programs were employed in series, producing four output text files: a file of the features (gene transcripts, one per line), a file of the sample identifiers (one per line), and two tab-separated counts matrices, in which the rows correspond to the lines in the sample file and the columns, left to right, correspond to the lines in the features file. One counts matrix file contained counts of reads mapping to the forward strand, the other counts of reads mapping to the reverse strand of the target sequences. As the SENSE protocol produces reads from the second (reverse) strand, the reverse counts matrix was used for further analysis. As described in section 4.3a below, an initial QC assessment was done at this stage, involving generation of density plots, a box plot, and MDS plots for each toxicant/control combination. These plots were generated in R (version 3.6.0; R Core Team, 2019).
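The counts tabulation itself was done by in-house C programs; purely as an illustration of the file layout described above, the following Python sketch (with hypothetical file names) reassembles the three text files of one strand into a feature-by-sample count lookup, assuming one matrix row per sample and one tab-separated column per feature.

```python
# Illustrative sketch only -- not the in-house C programs. Assumes the layout
# described in the text: a features file (one transcript ID per line), a sample
# file (one sample ID per line), and a counts matrix with one row per sample
# and one tab-separated column per feature.

def read_lines(path):
    """Return non-empty, stripped lines of a text file."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]

def load_counts(features_path, samples_path, matrix_path):
    """Build a {(feature, sample): count} lookup from the three text files."""
    features = read_lines(features_path)
    samples = read_lines(samples_path)
    counts = {}
    with open(matrix_path) as fh:
        for sample, line in zip(samples, fh):
            values = [int(v) for v in line.rstrip("\n").split("\t")]
            assert len(values) == len(features), "row width must match features"
            for feature, value in zip(features, values):
                counts[(feature, sample)] = value
    return features, samples, counts
```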

The Whole Genome Shotgun assembly with the protein-coding gene models has been deposited at DDBJ/ENA/GenBank under the accession WIOS00000000. The version used in this paper is WIOS01000000. It will become publicly available upon release of the genome assembly paper (Martinson et al., manuscript in preparation).

4.2h. Feature ranking

RNA-seq feature selection was performed using the R limma package (version 3.40.6; Ritchie et al., 2015). Between-sample normalization was conducted using the trimmed mean of M-values method (Robinson and Oshlack, 2010) implemented by the calcNormFactors function from the R edgeR package (version 3.26.8; Robinson et al., 2010). Precision weights (Law et al., 2014) were calculated using the voom function from the limma package. The limma lmFit function was used to fit a generalized linear model with a term for chemical treatment effects. Moderated t-statistics (Phipson et al., 2016) were calculated for chemical treatment effects using the limma eBayes function, and the limma topTable function was used to sort features based on their nominal p-values. No p-value or FDR cutoff was employed for feature selection, since the number of top features used during classification was treated as an empirically tunable parameter, as described in the classifier development section.
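The actual ranking used limma's voom precision weights and eBayes moderated t-statistics in R. As a much-simplified stand-in to make the idea concrete, the sketch below ranks genes by an ordinary Welch t-statistic computed on log-CPM values; it is not the moderated statistic and all names are illustrative.

```python
import math

def log_cpm(counts):
    """Per-sample log2 counts-per-million with a small offset (rough analogue
    of the transformation voom applies before modeling)."""
    total = sum(counts)
    return [math.log2((c + 0.5) / (total + 1.0) * 1e6) for c in counts]

def welch_t(xs, ys):
    """Plain two-sample Welch t-statistic (NOT limma's moderated t)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

def rank_features(matrix, labels, feature_names):
    """matrix: samples x genes log-expression; labels: 0 control / 1 treated.
    Returns feature names sorted by |t|, strongest evidence first."""
    stats = []
    for g, name in enumerate(feature_names):
        treated = [row[g] for row, lab in zip(matrix, labels) if lab == 1]
        control = [row[g] for row, lab in zip(matrix, labels) if lab == 0]
        stats.append((abs(welch_t(treated, control)), name))
    return [name for _, name in sorted(stats, reverse=True)]
```

In the real pipeline the empirical-Bayes moderation shrinks per-gene variances toward a common prior, which matters greatly at these small sample sizes; this sketch omits that step.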


4.2i. Classifier tuning, training, and testing

The paradigm employed for classification was statistically driven rather than mechanistically driven (Kostich, 2017). All classification was performed using R packages; Table 16 lists the classifiers and versions employed. Classifiers were empirically tuned using a grid search approach previously used in our laboratory. Sixty points in parameter space are chosen using a uniform random sample, either on the raw scale or, where indicated, on a log scale. For feature selection and classification, a subset of the most significantly differentially expressed genes, i.e., features with the lowest adjusted p-values, is defined by randomly selecting the number of top features on a log scale in the interval [10, 1000]. Additional details of parameter tuning, including the specific parameters tuned for each algorithm, can be found in (Kostich, et al., in review).

Table 16. Statistical classifiers employed to evaluate exposure status of samples

Classifier                                            R library             Version    References
random forest                                         randomForest          4.6-14     Breiman (2001)
logistic regression with elastic net regularization   glmnet                2.0-18     Friedman, Hastie, Tibshirani (2010)
naïve Bayes                                           caret                 6.0-84     Maron (1961)
support vector machine with gaussian kernel           e1071 (uses libsvm)   1.7-2      Cortes, Vapnik (1995)
gradient boosted decision trees                       xgboost               0.90.0.2   Friedman, Hastie, Tibshirani (2000); Friedman (2001)

Test sets consisted of the ten (bifenthrin) or eleven (copper) treated samples, each paired with the eleven untreated (control) samples. During cross-validation, only data in the training set were used for training-set normalization, feature ranking, parameter tuning and final fitting of each classifier; test sets were normalized separately to maintain independence. Misclassification rates (average 0-1 loss), cross-entropy, and MSPE were calculated using R base package functions. For the misclassification rate (0-1 loss), 0 is awarded when the class is correctly assigned and 1 when it is incorrectly assigned, employing a cutoff of 0.5.


For final model generation and testing, five-fold cross-validation was employed, and tuning and training were accomplished as described above. For both toxicants, the final model from each of the five algorithms was saved and then used to generate predictions for the samples treated with the other toxicant using the 'predict' function in R.
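The tuning was done in R; the only non-obvious detail is the log-scale sampling of candidate feature counts. A minimal sketch of that sampling step (function name and seed are illustrative):

```python
import math
import random

def sample_tuning_grid(n_points=60, feat_range=(10, 1000), seed=0):
    """Draw the number of top-ranked features for each of n_points candidate
    tuning settings, uniformly at random on a log scale in [10, 1000], as
    described for the grid search used here."""
    rng = random.Random(seed)
    lo, hi = math.log10(feat_range[0]), math.log10(feat_range[1])
    return [int(round(10 ** rng.uniform(lo, hi))) for _ in range(n_points)]
```

Sampling on the log scale spreads the 60 candidate settings evenly across orders of magnitude, so small feature counts (tens) are explored as thoroughly as large ones (hundreds).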

4.2j. Functional analysis/biomarker discovery

The Ingenuity Pathway Analysis program (IPA, version 49932394, Qiagen) was used to perform functional analyses. For both toxicants, the ranked feature lists were amended to include an additional first column containing gene identifiers for the transcripts, in the form of UniProt or GenBank reference protein accessions. Accessions and names were previously assigned to 20,564 of the 26,150 gene models (described in Martinson, et al., in preparation). The accessions and names assigned to the gene models are in supplemental file "fhm_gene_names_20190531a.xlsx". Regarding the naming, in some cases names were appended with '-like' to indicate that a model's homology to the reference protein from which its name was derived did not meet the criteria for exact name inheritance. Following the addition of the accessions to the ranked feature lists, the lists were uploaded to IPA, and IPA's "Core Analysis" function was applied to each set of ranked DEGs with default settings, apart from a filter restricting the analyses to genes with FDRs <= 0.05. It should be noted that since 5,586 of the 26,150 gene models did not map well enough to a reference ortholog to be named (above), those genes were excluded from functional analysis regardless of their FDR value, since an accession (or similar identifier) is required for inclusion in the analysis. Also, no distinction was made between genes with '-like' in their name and those without; accessions associated with '-like' genes were included in the analysis.

The IPA Core Analysis attempts to identify relationships, mechanisms, functions, and pathways relevant to an input dataset, and produces many results. We focused on the canonical pathway output, which, for each toxicant, was downloaded (supplemental files bifenthrinIPACanonicalPathways.xlsx and CopperIPACanonicalPathways.xlsx). We also used IPA's "Comparison Analysis" tool to compare the core analysis results from the two toxicants. The canonical pathways files contain predictions of pathways that are enriched based on the DEGs in the analysis sets; a second tab in each file lists the molecules (genes) used in the analyses. The results include two statistics, a p-value and a z-score. The p-value is calculated using a right-tailed Fisher's exact test and reflects the likelihood that the overlap between the molecules identified as significant in the experiment and a given pathway is due to random chance. The z-score, unlike the p-value, takes into account the directional effect of one molecule on another molecule or on a process, and the direction of change of molecules in the dataset, providing a statistical measure of the match between the expected relationship direction and the observed gene expression. A z-score > 2 or < -2 is considered significant. The reader is referred to http://pages.ingenuity.com/rs/ingenuity/images/0812%20upstream_regulator_analysis_whitepaper.pdf for an example of the details of the z-score calculation.
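The right-tailed Fisher's exact test used for pathway enrichment is the upper tail of a hypergeometric distribution. The sketch below computes that tail directly; it illustrates the statistic, not IPA's internal implementation, and the argument names are ours.

```python
from math import comb

def right_tailed_fisher(k, pathway_size, n_deg, universe):
    """P(X >= k) where X ~ Hypergeometric: of `universe` genes total, `n_deg`
    are significant DEGs, the pathway contains `pathway_size` genes, and `k`
    is the observed overlap. Small values mean the overlap is unlikely to be
    due to random chance."""
    total = comb(universe, n_deg)
    upper = min(pathway_size, n_deg)
    return sum(comb(pathway_size, i) * comb(universe - pathway_size, n_deg - i)
               for i in range(k, upper + 1)) / total
```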

4.3. Results

4.3a. Sequencing, Mapping, and Quality Control evaluation

The number of mapped reads in the samples ranged from 11.3 M to 18.7 M, while reads mapping to the correct (reverse) strand ranged from 10.7 M to 18.0 M. The percentage of reads mapping to the forward strand ranged from 2.2% to 5.2%. Table 17 and supplementary Figure 4.S.1 summarize the total reads and mapping results.

Table 17. Summary of read counts and mapping across all 32 samples.

          Reads        Mapped reads   Mapped reverse strand   Mapped forward strand   % mapped   % rev mapped   % fwd mapped
Mean      20,868,744   14,722,222     13,951,356              770,866                 70.50%     66.80%         3.70%
Median    21,378,162   14,754,111     13,761,702              752,598                 70.70%     66.80%         3.70%
Min       15,948,060   11,304,033     10,655,113              421,757                 66.90%     61.80%         2.20%
Max       25,768,888   18,723,719     17,993,700              1,126,483               74.10%     71.60%         5.20%

To visually evaluate the quality of the tabulated correct-strand mapping counts, read density plots (Figure 4.S.2) and a box plot (Figure 9) were generated. Additionally, feature counts were summed across all samples and the thousand features with the most counts were used to generate multidimensional scaling (MDS) plots for each toxicant/control pairing (Figure 10). The MDS plots showed separation of treated samples from controls, with no samples appearing to be obvious outliers, so all samples were retained for subsequent analysis. The density plots and box plot indicated relatively equal densities and counts across all samples.


Figure 9. Box plot of raw counts in each sample, indicating even distribution of count data across all 32 samples. Samples with "B" in their name were bifenthrin treated, "C", copper treated, and "M", untreated controls.


Figure 10. Multidimensional Scaling (MDS) plots for toxicant treated vs control (“M”) samples. bifenthrin/controls top, copper/controls bottom. Plots exhibit clear separation between treated and control samples and indicate there are no obvious outlier samples. Plots were constructed using the 1000 features that had the most counts summed across all samples in each of the two groups.


4.3b. Feature ranking (Differentially expressed gene lists, DEG lists)

Between-sample normalization and feature selection were performed using the R packages edgeR and limma. At the end of the process, moderated t-statistics were calculated for chemical treatment effects using the limma eBayes function, resulting in the ranked feature lists included in supplemental files bifenRankedFeatureLists.xlsx and CopperRankedFeatureList.xlsx for the bifenthrin/control and copper/control combinations, respectively. The initial 26,150 input features are ordered from lowest to highest adjusted p-value (false discovery rate, FDR). Bifenthrin had 7,908 of 26,150 features with an FDR < 0.05, while copper had 829 features with FDRs < 0.05.
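The FDR values reported here are Benjamini-Hochberg adjusted p-values (the default adjustment in limma's topTable). A minimal sketch of the adjustment, which the per-gene FDR < 0.05 counts above are based on:

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (FDR). For the i-th smallest
    p-value p(i) of m tests, the raw adjustment is p(i) * m / i, then a
    cumulative minimum from the largest rank downward enforces monotonicity."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end          # 1-based rank of p-value i
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```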

4.3c. Classifier performance

Five classifier algorithms were applied to the normalized expression data: random forest ("rf"; Breiman, 2001), logistic regression with elastic net regularization ("glmnet"; Friedman et al., 2010), naive Bayes ("nbayes"; Maron, 1961), support vector machines with a Gaussian kernel ("svm.rbf", SVM; Cortes and Vapnik, 1995), and gradient boosted decision trees ("gb.tree"; Friedman et al., 2000; Friedman, 2001).

For each toxicant, two rounds of nested tuning/training and prediction were done, and during each round both five-fold and ten-fold cross-validation were performed. In each round, at both fold levels, parameter tuning was accomplished in the inner loop of the nested process, while feature ranking and performance estimates using hold-out samples were generated in the outer loop. In the first round, for each fold, probability estimates of the hold-out samples' class membership were generated, where each fold consisted of a balanced number of control and treated samples. In the second round, rather than outputting the probability estimates for each hold-out sample, three metrics, the mean squared prediction error (MSPE), cross-entropy, and misclassification rate, were calculated within each fold from the prediction results obtained for the hold-out samples in that round of cross-validation.


For a binary prediction, the MSPE is calculated on the q data points that were not used in estimating the model, either because they were held back for this purpose or because the data were newly obtained:

MSPE = (1/q) Σ_{i=n+1}^{n+q} (Y_i − Ŷ_i)²

in which n is the number of samples used in training, q is the number of hold-out samples, Y is the vector of known values (0 for controls, 1 for treated samples) of the variable being predicted, and Ŷ is the vector of predicted values. Cross-entropy also measures the performance of a classification model whose output is a probability between 0 and 1. In binary classification, cross-entropy can be calculated as:

−(y * log(p) + (1−y) * log(1−p))

where y is a binary indicator of the correct class (0 for control, 1 for treated) and p is the predicted probability of an observation belonging to the correct class. Cross-entropy especially penalizes predictions that are confident and wrong. The misclassification rate is simply the number of wrong predictions over the total number of predictions. For all three metrics, lower values are better, with 0 indicating perfect model performance.

For bifenthrin, in the case of the by-sample estimates, all five of the classifiers correctly predicted every sample's membership as treated or control at both five-fold and ten-fold cross-validation. In only two cases was the probability of belonging to the correct class < 0.8: during five-fold CV, gb.tree assigned a probability of 0.719 to a bifenthrin sample and svm.rbf assigned a probability of 0.750 to a bifenthrin sample. For copper, the performance of the predictors was slightly worse than for bifenthrin. Eight of 110 correct-class predictions during five-fold CV were < 0.8, and another 4 (all gb.tree predictions) were < 0.5, the typical cutoff for class assignment. For the ten-fold CV, there were 11 instances of the predicted probability of the correct class being < 0.8, but no erroneous predictions using the 0.5 cutoff. Naïve Bayes was the only algorithm whose predicted class probabilities all exceeded 0.8. Table 18 shows, for each toxicant, the average probability estimate for correct classes by each algorithm for five-fold and ten-fold CV. Supplemental files bifenBySampleEstimates.xlsx and CopperBySampleEstimates.xlsx contain the full results of the sample class estimates by CV round.
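The three hold-out metrics described in this section were computed with R base functions; a minimal Python sketch of the same calculations, following the definitions above:

```python
import math

def mspe(y_true, y_prob):
    """Mean squared prediction error over the q hold-out samples."""
    q = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / q

def cross_entropy(y_true, y_prob):
    """Mean binary cross-entropy: -(y*log(p) + (1-y)*log(1-p)), averaged.
    Heavily penalizes predictions that are confident and wrong."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob)) / len(y_true)

def misclassification_rate(y_true, y_prob, cutoff=0.5):
    """Average 0-1 loss using a 0.5 class-assignment cutoff."""
    return sum(int((p >= cutoff) != bool(y))
               for y, p in zip(y_true, y_prob)) / len(y_true)
```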

Table 18. Average predicted probability of correct class assignment of samples during wrapped cross-validation tuning, training and testing

                        bifenthrin                            copper
Algorithm   CV folds    Treated           Control             Treated           Control
gb.tree     5           0.850 (±0.092)    0.879 (±0.021)      0.744 (±0.418)    0.831 (±0.289)
glmnet      5           0.996 (±0.011)    0.997 (±0.011)      0.939 (±0.258)    0.983 (±0.061)
nbayes      5           1.000 (±0.000)    1.000 (±0.000)      1.000 (±0.001)    1.000 (±0.000)
rf          5           0.991 (±0.039)    0.999 (±0.005)      0.922 (±0.254)    0.953 (±0.132)
svm.rbf     5           0.908 (±0.148)    0.895 (±0.062)      0.860 (±0.250)    0.894 (±0.110)
gb.tree     10          0.892 (±0.007)    0.900 (±0.009)      0.885 (±0.099)    0.890 (±0.107)
glmnet      10          0.998 (±0.004)    0.998 (±0.005)      0.973 (±0.142)    0.994 (±0.020)
nbayes      10          1.000 (±0.000)    1.000 (±0.000)      0.996 (±0.024)    1.000 (±0.000)
rf          10          0.993 (±0.031)    0.998 (±0.013)      0.938 (±0.203)    0.929 (±0.186)
svm.rbf     10          0.902 (±0.106)    0.912 (±0.074)      0.876 (±0.207)    0.869 (±0.174)

For the second round of wrapped training/testing, at both five-fold and ten-fold CV, the MSPE, mean cross-entropy, and mean misclassification rate were calculated for each round of CV. Table 19 shows the summarized results for these rounds of training and testing. Looking at the misclassification rate, bifenthrin prediction was perfect across all algorithms with the exception of a single misclassification by gb.tree. For copper, gb.tree, naïve Bayes, and svm.rbf all exhibited some misclassifications. Taken as a whole over both rounds of performance evaluation, naïve Bayes, glmnet, and random forest appeared to perform best. Full results for these tests are in supplemental files bifenPerformanceMetrics.xlsx and CopperPerformanceMetrics.xlsx.

Table 19. Mean MSPE, mean cross-entropy, and mean misclassification rate (L01) by algorithm during wrapped cross-validation tuning, training, and testing

                        bifenthrin                    copper
Algorithm  CV folds  MSPE   mean entropy  L01      MSPE   mean entropy  L01
gb.tree        5     0.043  0.193         0.050    0.077  0.285         0.080
glmnet         5     0.000  0.004         0.000    0.006  0.026         0.000
nbayes         5     0.000  0.000         0.000    0.000  0.000         0.000
rf             5     0.001  0.009         0.000    0.011  0.063         0.000
svm.rbf        5     0.010  0.094         0.000    0.031  0.164         0.000
gb.tree       10     0.010  0.108         0.000    0.081  0.299         0.075
glmnet        10     0.000  0.002         0.000    0.010  0.035         0.000
nbayes        10     0.000  0.000         0.000    0.029  0.230         0.025
rf            10     0.000  0.002         0.000    0.013  0.063         0.000
svm.rbf       10     0.008  0.082         0.000    0.021  0.119         0.050
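For reference, the three summary metrics in Table 19 can be computed directly from a classifier's predicted class probabilities. The sketch below is illustrative only, not the pipeline used in this study; the function name and toy values are hypothetical.

```python
import math

def performance_metrics(p_treated, labels, cutoff=0.5):
    """Compute MSPE, mean cross-entropy, and mean misclassification (L01)
    from predicted probabilities of the 'treated' class.
    p_treated: predicted P(treated) per sample; labels: 1=treated, 0=control."""
    eps = 1e-15  # guard against log(0) for extreme probabilities
    n = len(labels)
    # Mean squared prediction error between label and predicted probability
    mspe = sum((y - p) ** 2 for y, p in zip(labels, p_treated)) / n
    # Mean cross-entropy (log loss)
    entropy = -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(labels, p_treated)
    ) / n
    # 0-1 loss: fraction of samples on the wrong side of the cutoff
    l01 = sum(int((p >= cutoff) != bool(y)) for y, p in zip(labels, p_treated)) / n
    return mspe, entropy, l01

# Toy example: four samples, one borderline misclassification
mspe, ce, l01 = performance_metrics([0.9, 0.8, 0.4, 0.1], [1, 1, 1, 0])
```

MSPE and cross-entropy penalize over-confident wrong predictions differently: a confident wrong call inflates cross-entropy sharply, while L01 only registers that the call crossed the cutoff.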

4.3d. Final classifier testing and functional analysis

After the multiple performance evaluation runs, final classifiers, trained separately for bifenthrin and copper, were captured from all five algorithms. To approximate the specificity of the classifiers, they were tested against samples treated with the chemical other than the one with which they were trained. If the selected features were simply targeting general stress and xenobiotic response, it would be expected that cross-chemical classification would perform similarly to single-chemical classification. If not, then the features are likely classifying based on a chemical-specific transcriptomic response. Table 20 displays confusion matrices of the cross-chemical test results for the algorithms and the toxicant on which they were trained.

Table 20. Confusion matrices for predicted classes for copper samples (n=11) by bifenthrin-trained classifiers and for bifenthrin samples (n=10) by copper-trained classifiers (M = control)

             bifenthrin trained           copper trained
algorithm    Predicted B  Predicted M    Predicted C  Predicted M
gb.tree           0           11             10            0
glmnet            0           11              4            6
nbayes            3            8              8            2
rf                0           11              4            6
svm.rbf           0           11             10            0

For the bifenthrin-trained classifiers, there were only three cases where a copper sample was classified as a bifenthrin sample; all three erroneous predictions came from naïve Bayes. With respect to how bifenthrin-treated samples were classified by the copper-trained classifiers, all of the algorithms classified some (or all) of the bifenthrin samples as belonging to the copper class. Random forest and glmnet did the best job of not misclassifying bifenthrin-treated samples, with each classifying only four of ten bifenthrin samples as copper. The other algorithms had a much more difficult time, as naïve Bayes, gb.tree, and svm.rbf classified, respectively, 8, 10, and 10 of the bifenthrin-treated samples as copper. Given these results, in our study random forest and glmnet are considered the most promising algorithms in terms of distinguishing samples treated with chemicals with distinct mechanisms of toxicity. The results from these two algorithms are explored in more depth in the discussion section.

As another way to assess the uniqueness of the exposure profiles resulting from exposure to the two chemicals, we examined the top (ranked by FDR) genes from the bifenthrin and copper DEG lists for overlap at several “depths” (20, 100, 500, 1000). One thousand was chosen as the maximum because the tuning process for classifiers sets that as the maximum number of features that can be used, while the lowest value (20) was chosen because several of the classifiers (glmnet, naïve Bayes, and random forest) in this study typically employ 10-20 features for classification once tuned. Figure 11 shows the overlap between the two treatments at the different depths. At 20 features there is no overlap, and even when comparing the top 1000 features, less than 80% of DEGs are shared between the two chemicals. The lack of overlap provides additional evidence that we are seeing toxin-specific responses to the two treatments.

Figure 11. Comparison of overlap of genes identified as being most differentially expressed in bifenthrin-treated (blue) and copper-treated (red) samples compared to controls at different depths of DEG lists (20-1000). The relatively small amount of overlap between lists is viewed as evidence that toxin-specific responses are more prevalent than general stress responses in the two treatment groups.
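The depth-wise overlap comparison described above reduces to set intersections over the top-k entries of each FDR-ranked list. A minimal sketch, with hypothetical gene IDs:

```python
def overlap_at_depths(ranked_a, ranked_b, depths=(20, 100, 500, 1000)):
    """Fraction of genes shared between the top-k entries of two
    FDR-ranked DEG lists, for several depths k."""
    shared = {}
    for k in depths:
        top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
        shared[k] = len(top_a & top_b) / k
    return shared

# Toy gene IDs (hypothetical): disjoint at depth 2, half-shared at depth 4
a = ["g1", "g2", "g3", "g4"]
b = ["g5", "g6", "g3", "g4"]
shared = overlap_at_depths(a, b, depths=(2, 4))
```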

To look for further support of our hypothesis that classification was based on chemical-specific responses, as opposed to general stress or xenobiotic clearance responses, we conducted a functional analysis using the Ingenuity Pathway Analysis program (IPA, version 49932394, Qiagen) to see if expected pathways were statistically enriched. Functional analysis was conducted on all genes exhibiting FDR < .05 that could be associated with an IPA reference accession. Results downloaded from IPA can be found in supplemental files bifenthrinCanonicalPathways.xlsx and CopperCanonicalPathways.xlsx. The most significant pathways identified for each chemical are displayed in Figures 13 and 14, while Figures 15 and 16 show comparisons of the most significant pathways found for each chemical, with significance ranked in two different ways. Figure 15 shows the top 20 pathways based on z-score, which reflects how the patterns of expression of genes in the pathways compare to their expected directions (up or down) based on what is known about the pathways. A more positive z-score indicates experimental data are more in agreement with the expected trends; z-scores < -2 or > 2 are considered significant. Figure 16 shows the top 20 pathways in the comparison based on Benjamini-Hochberg FDR, where FDR <= .05 is considered significant. The results files contain the significantly enriched canonical pathways as reported by IPA.
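The Benjamini-Hochberg adjustment used for the FDR ranking can be sketched as follows. This is the standard step-up procedure, not IPA's internal implementation:

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (FDR q-values).
    q_(i) = min over j >= i of p_(j) * m / j, over the sorted p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

# Toy p-values (hypothetical); adjusted values preserve the p-value ordering
q = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06])
```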

4.4. Discussion

4.4a. Sequence mapping and QC

The sequencing data for each sample appeared to be of sufficient depth and quality that all samples could be included in the analysis, as indicated by the initial QC screening previously described. Mapping rates indicate that some genes may be missing from the target gene models, which would not be surprising given that the models were, for the most part, produced by automated pipelines from a newly assembled genome. The possibility also exists that some residual genomic DNA sequence is represented in our RNA libraries, which would also reduce the transcript mapping rates. Based on the success of the classifiers, however, there was adequate representation of genes in the target sequences to meet the goals of this study.

4.4b. Feature ranking (DEG lists)


The lists of ranked features are contained in supplemental files bifenRankedFeatureList.xlsx and CopperRankedFeatureList.xlsx. As described in the results section, there are many more genes with FDR < .05 in the bifenthrin set compared to the copper set. It is theorized that this may indicate the bifenthrin dose used in the study was “higher” (in terms of producing effects) than the copper dose. This theory is somewhat supported by the functional analysis results (below), where known effects of bifenthrin toxicity seem harder to detect than copper-specific effects; however, that comparison is confounded by the small fraction of nerve tissue in the total tissue (whole larvae).

4.4c. Classifier performance

During the two rounds of cross-validated performance evaluation conducted prior to performing final classification of all samples, within each cross-validation set, all five classifiers tested were able to distinguish between controls and the treated samples of the toxicant on which they were trained. The bifenthrin classifiers appeared to perform better than the copper classifiers, reporting higher probabilities of correct class assignment for all algorithms in the by-sample performance evaluations (Table 18) and better values for the three summary metrics (mean squared prediction error, cross-entropy, and misclassification rate; Table 19).

Our results suggest that a number of different classification algorithms can be successfully trained to accurately predict exposure to both toxicants, suggesting a strong and robust transcriptional response to the exposures. In fact, even for copper, which appeared the more challenging toxicant to classify, glmnet and rf were able to correctly classify all samples (average misclassification rate = 0 for both five-fold and ten-fold CV) within each round of cross-validation. While the svm.rbf (SVM) and gb.tree (gradient boosting) algorithms performed somewhat worse, they also achieved good separation of the two classes, consistent with the MDS plots that showed readily apparent separation between controls and treated samples.
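The wrapped cross-validation referred to above keeps hyperparameter tuning inside each outer training fold, so the held-out test fold never influences tuning. A minimal sketch of that fold structure, with function names that are illustrative rather than taken from the study's pipeline:

```python
import random

def kfold_indices(n, k, seed=0):
    """Split n sample indices into k roughly equal CV folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def wrapped_cv(n, outer_k=5, inner_k=5):
    """Yield (train, test, inner_folds): outer folds estimate performance,
    inner folds (built only from the outer training set) tune the model,
    so tuning never sees the held-out test samples."""
    for test in kfold_indices(n, outer_k):
        test_set = set(test)
        train = [i for i in range(n) if i not in test_set]
        inner = [[train[j] for j in fold]
                 for fold in kfold_indices(len(train), inner_k, seed=1)]
        yield train, test, inner
```

Each outer fold produces its own tuned model; the outer test fold then supplies an unbiased performance estimate for that model.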


To allay concerns that the classifiers were performing “too well,” given their near-perfect performance at distinguishing treated samples from controls during cross-validation, we examined the features that were selected during each round of cross-validation by glmnet, random forest, and naïve Bayes. For both copper and bifenthrin, these algorithms typically selected 10-20 features to use for classification during each round of cross-validation. We found that certain specific features tended to be picked consistently for classification. As an example, random forest selected 73 features (with replacement) overall during the five rounds of CV for bifenthrin, and 69 features (with replacement) were selected over the five rounds of copper CV. So for bifenthrin, random forest averaged picking 14.6 features per round upon which to base classification, and for copper, the average was 13.8 features per round. Figure 12 shows the frequency with which features were selected for each chemical over the five rounds of CV.


Figure 12. Features selected during five rounds of cross validation (CV) of the random forest classifier for bifenthrin and copper treated samples. The total number of features selected over the course of five rounds is represented by the “n=” at the top of each table. Feature names are represented as e.g. “scaf410.1-mRNA-1” (scaffold 410, gene 1, transcript 1). The number of times a given feature was selected over the five rounds of CV appears to the right of the feature name. For both bifenthrin and copper there is a set of features that tend to dominate, being selected at least 3 times over 5 CV folds. That consistent sets of features are used to distinguish treated from control samples is thought to explain the perfect/near perfect classification results that are exhibited during cross validation, regardless of the learning method/algorithm employed.
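The selection-frequency tallies summarized in Figure 12 amount to counting how often each transcript appears across CV rounds. A minimal sketch; the transcript IDs below are hypothetical examples in the scafN.g-mRNA-t naming convention described in the caption:

```python
from collections import Counter

def selection_frequency(rounds):
    """Tally how often each feature is chosen across CV rounds.
    rounds: one list of selected feature names per CV round."""
    counts = Counter(f for selected in rounds for f in selected)
    total = sum(len(s) for s in rounds)          # selections with replacement
    return counts, total, total / len(rounds)    # counts, n, mean per round

# Hypothetical transcript IDs; one feature dominates across rounds
rounds = [
    ["scaf410.1-mRNA-1", "scaf12.3-mRNA-1"],
    ["scaf410.1-mRNA-1", "scaf7.2-mRNA-1"],
    ["scaf410.1-mRNA-1", "scaf12.3-mRNA-1"],
]
counts, n, mean = selection_frequency(rounds)
```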

4.4d. Final classifier testing and functional analysis

When used for cross-chemical classification, results were inconsistent between the copper- and bifenthrin-trained classifiers. For the bifenthrin-trained classifiers, except for naïve Bayes, the final classifiers did not identify copper-treated samples as bifenthrin treated. This indicates that classification is based on chemical-specific features, or at least features that do not overlap with copper, as opposed to a generalized stress response. The copper-trained classifiers, in contrast, did not perform as well.

Bifenthrin-treated samples were classified as copper-treated samples across all five algorithms. Random forest and glmnet performed the best, with a 40% misclassification rate. The misclassification of bifenthrin-treated samples as copper may indicate that more features in the copper classifier were targeting general stress or, alternatively, it could indicate an overlap in the cellular effects elicited by copper and bifenthrin. Among the most prevalent responses observed in the functional analysis of the copper-exposed fish were those related to oxidative stress. Though a hallmark of copper toxicity, oxidative stress is often associated with xenobiotic challenge, and has been observed following bifenthrin exposure (Jin, et al., 2012).

Interpreting the importance of this evaluation is somewhat difficult. For use in environmental regulation, having some metric defining specificity would be an obvious advantage. Developing that metric is not particularly straightforward, as it is not possible to screen a classifier against all possible unique chemical classes or MOA. Further complicating this is that often more than one chemical, targeting more than one pathway, is active simultaneously. The experiment here was designed to develop binary classifiers, constraining results to either treated or not. Given that, experimental designs that employ a number of MOA during training will facilitate the development of multinomial classifiers. These will still be faced with the difficulty of classifying MOA not included in training, but will be optimized to discriminate the MOA that were included in training. There are some notable examples where this type of approach yielded a high-performing tool that is used in a regulatory context. The Pathwork tissue of origin test is one such example (Tothill, et al., 2005). This test is designed to identify the tissue of origin for secondary tumors, which is important in identifying effective treatment options. The problem of identifying the tissue of origin is somewhat analogous to identifying chemical exposure, except perhaps slightly simpler in that there is a finite set of tissue types; the analogy suggests that the development of a multinomial classifier for different environmental exposures may be possible.

If one examines the results provided by the glmnet and rf algorithms more closely, focusing on these algorithms because they had the lowest misclassification rates during cross-chemical testing, more nuanced and promising results can be found. Table 21 shows the mean probabilities of class membership assigned to each of the different sample types by glmnet and random forest when trained with either bifenthrin or copper, and supplementary Table 4.S.1 displays the results on a sample-by-sample basis for the copper-trained classifiers. In Table 21, for both algorithms, there is a clear difference between the mean probability values assigned to bifenthrin samples when the classifiers are trained with bifenthrin versus being trained with copper. For copper samples, there is an extreme difference, as the average probabilities for copper samples are an order of magnitude higher when the classifiers are trained with copper versus being trained with bifenthrin. Looking at the by-sample results in Table 4.S.1, for random forest, the four bifenthrin samples (B11, B2, B4, B5) identified as being copper have probabilities of belonging to that class of .566, .526, .794, and .502, respectively. If one contrasts this with the probabilities assigned to true copper-treated samples, nine of the eleven samples have a class probability >= 0.92. For glmnet, the results are not as compelling, as three of the four bifenthrin samples classified as being copper (B11, B2, and B4) have probabilities greater than the copper sample (C10) assigned the lowest probability (0.583).

Table 21. Mean probabilities of class assignment of different sample types B, C, or M (control) by glmnet and random forest when trained with B or C

                      glmnet                       rf
Sample type  bifenthrin   copper        bifenthrin   copper
             trained      trained       trained      trained
B            0.996        0.594         0.992        0.505
C            0.032        0.939         0.075        0.922
M            0.003        0.017         0.001        0.047

The results suggest that employing a cutoff for class inclusion greater than 0.5 may be beneficial, as the loss of sensitivity might be limited relative to the gains in precision. Table 22 shows the effect that employing different cutoff values has on sensitivity (recall), precision, and the resulting F1 score when classifying samples in our more challenging case of “cross-chemical” classification, copper. To calculate the values in the table, samples were binned as “copper” and “not copper”; when trained for copper we do not want classifiers to report copper “hits” unless samples were really exposed to copper. In our experiment, at a cutoff of 0.5, recall is perfect, as there are no false negatives reported by either algorithm; however, there are false positives (bifenthrin samples reported as being copper treated). As the cutoff value is raised, recall tends to decrease while precision improves for glmnet; for random forest, recall remains perfect up to a cutoff of 0.6. If one uses the F1 score (the harmonic mean of precision and recall) to represent the overall testing accuracy, cutoffs of 0.7 and 0.8 provide the maximum F1 score in the case of glmnet, while a cutoff of 0.6 results in the maximum F1 for random forest. Recommending one algorithm over the other given the limited data being examined is problematic, however. Deciding an appropriate cutoff in general will be dictated by the distribution of the empirical data, as well as the application for which the tool is being designed. For example, more stringent cutoff values would be prudent in applications where misclassification results in a significant resource expenditure, whereas in chemical screening, where there is a minor cost to incorrect grouping, a more inclusive cutoff value might be used.

Table 22. Sensitivity, precision, and F1 score for glmnet and random forest at different cutoffs for class assignment when classifiers were trained with copper-treated samples and controls

                    glmnet                             rf
cutoff   sensitivity  precision  F1 score  sensitivity  precision  F1 score
         (recall)                          (recall)
0.5      1.000        0.733      0.846     1.000        0.733      0.846
0.6      0.909        0.769      0.833     1.000        0.917      0.957
0.7      0.909        0.833      0.870     0.818        0.900      0.857
0.8      0.909        0.833      0.870     0.818        1.000      0.900
0.9      0.818        0.818      0.818     0.818        1.000      0.900
0.95     0.818        0.900      0.857     0.636        1.000      0.778
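The cutoff sweep summarized in Table 22 can be reproduced from per-sample class probabilities with a few lines. The sketch below is illustrative only; the probabilities are toy values, not the study's data:

```python
def prf_at_cutoff(p_pos, labels, cutoff):
    """Sensitivity (recall), precision, and F1 for the positive ('copper')
    class when a sample is called positive at P >= cutoff."""
    tp = sum(1 for p, y in zip(p_pos, labels) if p >= cutoff and y == 1)
    fp = sum(1 for p, y in zip(p_pos, labels) if p >= cutoff and y == 0)
    fn = sum(1 for p, y in zip(p_pos, labels) if p < cutoff and y == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

# Toy probabilities: two true copper samples, one bifenthrin false positive
probs = [0.95, 0.70, 0.55]
labels = [1, 1, 0]
```

Raising the cutoff from 0.5 to 0.6 in this toy case drops the false positive without losing a true positive, mirroring the precision gains seen in the table.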

Our primary goal was to see if we could develop classifiers specific to the two toxicants studied. The fish in this study were exposed to sub-lethal concentrations of bifenthrin and copper. Though the toxicant concentrations were sub-lethal, they were relatively high and likely resulted in general stress responses in the test organisms as well as toxicant-specific effects. The functional analyses support the idea that at least some toxicant-specific effects were evident. The most significant pathways identified in the copper-treated samples (e.g., mitochondrial dysfunction and oxidative phosphorylation; Figure 14) can be linked to copper’s primary known toxic effects of oxidative stress and disruption of mitochondrial function (e.g., Hosseini, et al., 2014; Borchard, et al., 2018; Mehta, et al., 2006). For bifenthrin, the pathways identified as most significant (Figure 13) are not as easily tied to the expected effects of the toxicant; however, some connections can be made. Bifenthrin is known to induce estrogenic effects in fish (reviewed in Brander, Gabler, et al. (2016)), and two of the most significant pathways (role of BRCA1 in DNA damage response, hereditary breast cancer signaling) may reflect the estrogenic effects of the toxin. None of the top pathways identified are obviously connected to the neurotoxic effects of bifenthrin. However, bifenthrin has been shown to have effects outside of the CNS as well (e.g., muscle; reviewed in Soderlund and Bloomquist, 1989). The lack of their appearance near the top of the significance list may be attributable to the use of whole larval samples, since CNS tissue makes up a very small fraction of the total tissues.


Figure 13. Most significant pathways identified by Ingenuity Pathway Analysis (IPA) when comparing differential expression between bifenthrin-treated samples and controls, ranked according to the results of Fisher's exact test. A positive z-score (orange bars) indicates pathways activated in the expected direction based on what is known about the genes in the canonical pathways. The presence of two pathways related to breast cancer in the top 4 is viewed as evidence that bifenthrin’s estrogenic effects are showing up in the exposed organisms.


Figure 14. Most significant pathways identified by IPA when comparing differential expression between copper-treated samples and controls, ranked according to the results of Fisher's exact test. A positive z-score (orange bars) indicates pathways activated in the expected direction based on what is known about the genes in the canonical pathways. The top pathways identified align well with copper acting as a redox-active metal capable of initiating oxidative damage.


The heatmaps of the IPA comparison analysis between bifenthrin and copper are displayed as Figures 15 and 16. In Figure 15, it is worth noting that for bifenthrin one “neuro” pathway (the neuroinflammation signaling pathway) does emerge as significant, indicating that some signal of its neurotoxic effects can be detected with appropriate filtering and analysis of the data. In Figure 15 there also appears to be a significant amount of overlap between the pathways activated by the two toxicants. This may seem counterintuitive, especially in the case of the bifenthrin-trained classifiers, where, except for a very few erroneous predictions made by naïve Bayes, copper-treated samples do not result in false positives (i.e., are not classified as bifenthrin samples). The apparent disconnect between the overlap seen in Figure 15 and the lack of false positives in the “cross-chemical” testing can be explained by the input to the functional analysis compared to the input used for classification. Typically, the number of features used by classifiers after tuning is limited. Glmnet, naïve Bayes, and random forest usually relied on between 10 and 20 features for classification, while gb.tree and svm.rbf most often used hundreds of features. The functional analysis, however, included all genes with FDR <= 0.05, which represents 4,323 DEGs for bifenthrin and 481 for copper. Figure 11 shows that there is little to no overlap of copper and bifenthrin DEGs at the levels of the top 20-1000 DEGs, and thus the genes driving classification are likely unique. In the larger population of genes used for functional analysis, however, especially given the high number of DEGs in the bifenthrin samples, more frequent overlap between the DEG lists is likely, which is evident in the results displayed in Figure 15.

The significant pathways comparison based on FDR (Figure 16) appears to offer a more toxin-specific perspective. Here copper is shown as significantly affecting mitochondrial function and oxidative phosphorylation, while bifenthrin induces cancer and cell cycle pathways (likely composed of many overlapping genes) and exhibits estrogenic activity (with the appearance of breast cancer pathways). In summary, there is evidence that copper exposure induced reactive oxygen species (ROS) production, oxidative stress, and associated outcomes, and that bifenthrin acted as an estrogen and neurotoxin (neuroinflammation), which is consistent with their known toxic effects.


Figure 15. Comparison of the most significant pathways activated in bifenthrin-treated samples (BSAll_FDR.05) vs copper-treated samples, based on IPA z-score, where orange coloration indicates activation in the expected direction and blue indicates activation opposite the expected direction. The strength of the response of oxidative phosphorylation in the copper-treated samples is seen as evidence that classifiers are at least in part leveraging genes related to copper toxicity. The appearance of the neuroinflammation signaling pathway as significant in the bifenthrin-treated samples is viewed as evidence that bifenthrin’s neurotoxic effects can be identified as part of the exposure response.


Figure 16. Comparison of the most significant pathways activated in bifenthrin-treated samples (BSAll_FDR.05) vs copper-treated samples, ranked based on BH-FDR as calculated by IPA. The top two pathways are associated with known toxic effects of copper. The dots in the corresponding columns for bifenthrin indicate the results for those two pathways were not found to be significant (FDR <= 0.05). In the bifenthrin-treated samples, two pathways related to breast cancer are significant, evidencing bifenthrin’s estrogenic effects. The same pathways are insignificant in the copper-treated samples. Overall, the results provide evidence that toxicant-specific effects drive the classification results.


4.4e. Limitations and future studies

This study was conducted using whole larvae, so examination of tissue/organ-specific effects is not possible. Additionally, the use of whole organisms results in the expression response being an average across all responsive tissues, confounding the identification of affected cellular targets and functions that are anticipated based on previous studies. Our study employed a single concentration of each of two toxicants and thus may not extrapolate to other toxicants, or to responses expected at lower concentrations. However, it did allow us to evaluate the specificity of a multigene classifier, at least in a limited system. Characterization of classifier performance would be greatly enhanced by evaluating it against a range of concentrations and a greater number of similarly and differently acting chemicals. The relatively high concentrations of the two toxicants employed in this study likely engendered some generalized stress responses that may be shared across other toxicant classes. This was evident in the functional analysis, as were toxicant-specific responses. Developing multinomial classifiers may effectively reduce the chances of including general stress response genes among selected features, as such genes will not be useful for discriminating across chemicals. In the future we plan to look at cross-classification of different concentrations of the same toxicants used for classifier training, as well as toxicants with similar mechanisms of action, and compare these to classifiers built to distinguish toxins with alternative modes of action. The ultimate goal is to develop robust classifiers capable of detecting and distinguishing exposures to different environmental stressors involving different mechanisms of toxic action.
One way we may be able to enhance classifier robustness would be to select features (DEGs) for classifier training that are the most biologically relevant based on known mechanisms of toxicity, and then build classifiers using a subset of such features from (to the extent practical) “non-overlapping” pathways expected to be, or identified as, significant.


4.5. Conclusions

The new FHM genome and its transcript models provide a sound basis for conducting RNA-Seq experiments, allowing both meaningful functional analysis and a foundation for the development of classifiers of exposure. Based on the current mapping results, additional mining of the genome may yield more annotated gene models; however, it appears a reasonably comprehensive set of genes has been identified. We have successfully developed a number of classifiers to identify toxicant exposure, which further supports this conclusion. Overall performance characteristics indicate that random forest and logistic regression with elastic net regularization (glmnet) may be better suited for use when examining samples affected by stressors other than the one used to train the classifier. We were able to use transcript profiling to detect exposure, but our ability to draw functional conclusions about impacted physiological processes was made more difficult by the averaging across disparate tissue types. Some findings that appear to be biologically relevant were discernible, however. Additional experiments are needed to investigate classifier sensitivity to lower concentrations of toxicants, their ability to avoid misclassifying toxicants with different modes of action, and the utility of functional analyses to uncover biologically meaningful signatures of exposure.


Bibliography

Aardema, M. J. and J. T. MacGregor (2002). "Toxicology and genetic toxicology in the new era of "toxicogenomics": impact of "-omics" technologies." Mutation Research 499: 13-25.
Ali, S. F., B. H. Shieh, Z. Alehaideb, M. Z. Khan, A. Louie, N. Fageh and F. C. Law (2011). "A review on the effects of some selected pyrethroids and related agrochemicals on aquatic vertebrate biodiversity." Canadian Journal of Pure & Applied Sciences 5(2): 1455-1464.
Altschul, S. F., W. Gish, W. Miller, E. W. Myers and D. J. Lipman (1990). "Basic Local Alignment Search Tool." Journal of Molecular Biology 215.
Altschul, S. F., T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Research 25(17): 3389-3402.
Anders, S. and W. Huber (2010). "Differential expression analysis for sequence count data." Genome Biology 11(10): R106.
Ankley, G. T. and D. L. Villeneuve (2006). "The fathead minnow in aquatic toxicology: past, present and future." Aquatic Toxicology 78(1): 91-102.
Ankley, G. T., G. P. Daston, S. J. Degitz, N. D. Denslow, R. A. Hoke, S. W. Kennedy, A. L. Miracle, E. J. Perkins, J. Snape, D. E. Tillitt, C. R. Tyler and D. Versteeg (2006). "Toxicogenomics in regulatory ecotoxicology." Environmental Science & Technology 40(13): 4055-4065.
Barrett, L. W., S. Fletcher and S. D. Wilton (2013). "Untranslated Gene Regions and Other Non-coding Elements." 1-56.
Benson, G. (1999). "Tandem repeats finder: a program to analyze DNA sequences." Nucleic Acids Research 27(2): 573-580.
Benjamini, Y. and D. Yekutieli (2001). "The control of the false discovery rate in multiple testing under dependency." The Annals of Statistics 29(4): 1165-1188. DOI: 10.1214/aos/1013699998
Biales, A. D., M. S. Kostich, A. L. Batt, M. J. See, R. W. Flick, D. A. Gordon, J. M. Lazorchak and D. C. Bencic (2016). "Initial development of a multigene 'omics-based exposure biomarker for pyrethroid pesticides." Aquatic Toxicology 179: 27-35. DOI: 10.1016/j.aquatox.2016.08.004
Bracken, C. P., H. S. Scott and G. J. Goodall (2016). "A network-biology perspective of microRNA function and dysfunction in cancer." Nature Reviews Genetics 17: 719-732. DOI: 10.1038/nrg.2016.134
Boetzer, M., C. V. Henkel, H. J. Jansen, D. Butler and W. Pirovano (2011). "Scaffolding pre-assembled contigs using SSPACE." Bioinformatics 27(4): 578-579.
Bolger, A. M., M. Lohse and B. Usadel (2014). "Trimmomatic: a flexible trimmer for Illumina sequence data." Bioinformatics 30(15): 2114-2120.
Borchard, S., F. Bork, T. Rieder, C. Eberhagen, B. Popper, J. Lichtmannegger, S. Schmitt, J. Adamski, M. Klingenspor, K.-H. Weiss and H. Zischka (2018). "The exceptional sensitivity of brain mitochondria to copper." Toxicology in Vitro 51: 11-22. DOI: 10.1016/j.tiv.2018.04.012
Bradnam, K. R., J. N. Fass, et al. (2013). "Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species." GigaScience 2(10).
Brander, S. M., N. L. Fowler, R. E. Connon and D. Schlenk (2016). "Pyrethroid Pesticides as Endocrine Disruptors: Molecular Mechanisms in Vertebrates with a Focus on Fishes." Environmental Science & Technology 50(17): 8977-8992. DOI: 10.1021/acs.est.6b02253


Breiman, L. (2001). "Random forests." Machine Learning 45(1): 5-32. DOI: 10.1023/a:1010933404324
Buels, R., E. Yao, C. M. Diesh, R. D. Hayes, M. Munoz-Torres, G. Helt, D. M. Goodstein, C. G. Elsik, S. E. Lewis, L. Stein and I. H. Holmes (2016). "JBrowse: a dynamic web platform for genome visualization and analysis." Genome Biology 17: 66.
Burns, F. R., A. L. Cogburn, G. T. Ankley, D. L. Villeneuve, E. Waits, Y.-J. Chang, V. Llaca, S. D. Deschamps, R. E. Jackson and R. A. Hoke (2016). "Sequencing and de novo draft assemblies of a fathead minnow (Pimephales promelas) reference genome." Environmental Toxicology and Chemistry 35(1): 212-217.
Burton, J. N., et al. (2013). "Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions." Nature Biotechnology 31: 1119-1125.
Burton, H. J. Huson, J. C. Nystrom, C. M. Kelley, J. L. Hutchison, Y. Zhou, J. Sun, A. Crisa, F. A. Ponce de Leon, J. C. Schwartz, J. A. Hammond, G. C. Waldbieser, S. G. Schroeder, G. E. Liu, M. J. Dunham, J. Shendure, T. S. Sonstegard, A. M. Phillippy, C. P. Van Tassell and T. P. Smith (2017). "Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome." Nature Genetics 49(4): 643-650.
Bushnell, B. (2014). "BBMap." U.S. Department of Energy Joint Genome Institute. https://jgi.doe.gov/data-and-tools/bbtools/
Cantarel, B. L., I. Korf, S. M. C. Robb, G. Parra, E. Ross, B. Moore, C. Holt, A. Sanchez Alvarado and M. Yandell (2008). "MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes." Genome Research 18.
Carninci, P., T. Kasukawa, S. Katayama, J. Gough, M. C. Frith, N. Maeda, R. Oyama, T. Ravasi, B. Lenhard, C. Wells, R. Kodzius, K. Shimokawa, V. B. Bajic, S. E. Brenner, S. Batalov, A. R. R. Forrest, M. Zavolan, M. J. Davis, L. G. Wilming, V. Aidinis, J. E. Allen, A. Ambesi-Impiombato, R. Apweiler, R. N. Aturaliya, T. L. Bailey, M. Bansal, L. Baxter, K. W. Beisel, T. Bersano, H. Bono, A. M. Chalk, K. P. Chiu, V. Choudhary, A.
Christoffels, D. R. Clutterbuck, M. L. Crowe, E. Dalla, B. P. Dalrymple, B. de Bono, G. D. Gatta, D. di Bernardo, T. Down, P. Engstrom, M. Fagiolini, G. Faulkner, C. F. Fletcher, T. Fukushima, M. Furuno, S. Futaki, M. Gariboldi, P. Georgii-Hemming, T. R. Gingeras, T. Gojobori, R. E. Green, S. Gustincich, M. Harbers, Y. Hayashi, T. K. Hensch, N. Hirokawa, D. Hill, L. Huminiecki, M. Iacono, K. Ikeo, A. Iwama, T. Ishikawa, M. Jakt, A. Kanapin, M. Katoh, Y. Kawasawa, J. Kelso, H. Kitamura, H. Kitano, G. Kollias, S. P. T. Krishnan, A. Kruger, S. K. Kummerfeld, I. V. Kurochkin, L. F. Lareau, D. Lazarevic, L. Lipovich, J. Liu, S. Liuni, S. McWilliam, M. M. Babu, M. Madera, L. Marchionni, H. Matsuda, S. Matsuzawa, H. Miki, F. Mignone, S. Miyake, K. Morris, S. Mottagui-Tabar, N. Mulder, N. Nakano, H. Nakauchi, P. Ng, R. Nilsson, S. Nishiguchi, S. Nishikawa, F. Nori, O. Ohara, Y. Okazaki, V. Orlando, K. C. Pang, W. J. Pavan, G. Pavesi, G. Pesole, N. Petrovsky, S. Piazza, J. Reed, J. F. Reid, B. Z. Ring, M. Ringwald, B. Rost, Y. Ruan, S. L. Salzberg, A. Sandelin, C. Schneider, C. Schönbach, K. Sekiguchi, C. A. M. Semple, S. Seno, L. Sessa, Y. Sheng, Y. Shibata, H. Shimada, K. Shimada, D. Silva, B. Sinclair, S. Sperling, E. Stupka, K. Sugiura, R. Sultana, Y. Takenaka, K. Taki, K. Tammoja, S. L. Tan, S. Tang, M. S. Taylor, J. Tegner, S. A. Teichmann, H. R. Ueda, E. van Nimwegen, R. Verardo, C. L. Wei, K. Yagi, H. Yamanishi, E. Zabarovsky, S. Zhu, A. Zimmer, W. Hide, C. Bult, S. M. Grimmond, R. D. Teasdale, E. T. Liu, V. Brusic, J. Quackenbush, C. Wahlestedt, J. S. Mattick, D. A. Hume, C. Kai, D. Sasaki, Y. Tomaru, S. Fukuda, M. Kanamori-Katayama, M. Suzuki, J. Aoki, T. Arakawa, J. Iida, K. Imamura, M. Itoh, T. Kato, H. Kawaji, N. Kawagashira, T. Kawashima, M. Kojima, S. Kondo, H. Konno, K. Nakano, N. Ninomiya, T. Nishio, M. Okada, C. Plessy, K. Shibata, T. Shiraki, S. Suzuki, M. Tagami, K. Waki, A. Watahiki, Y. Okamura-Oho, H. Suzuki, J. Kawai and Y. Hayashizaki (2005). 
"The Transcriptional Landscape of the Mammalian Genome." Science 309(5740): 1559-1563. Celius, T., J. B. Matthews, J. P. Giesy and T. R. Zacharewski (2000). "Quantification of rainbow trout (Oncorhynchus mykiss) zona radiata and vitellogenin mRNA levels using real-time PCR after in vivo treatment with estradiol-17β or α-zearalenol." The Journal of Steroid Biochemistry and Molecular Biology 75(2): 109-119.


Chikhi, R., Rizk, G. (2013). "Space-efficient and exact de Bruijn graph representation based on a Bloom filter." Algorithms for Molecular Biology 8(22). Chin, C.-S., D. H. Alexander, P. Marks, A. A. Klammer, J. Drake, C. Heiner, A. Clum, A. Copeland, J. Huddleston, E. E. Eichler, S. W. Turner and J. Korlach (2013). "Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data." Nature Methods 10: 563. Chin, C.-S., P. Peluso, F. J. Sedlazeck, M. Nattestad, G. T. Concepcion, A. Clum, C. Dunn, R. O'Malley, R. Figueroa-Balderas, A. Morales-Cruz, G. R. Cramer, M. Delledonne, C. Luo, J. R. Ecker, D. Cantu, D. R. Rank and M. C. Schatz (2016). "Phased diploid genome assembly with single-molecule real-time sequencing." Nature Methods 13: 1050. Cortes, C. and V. Vapnik (1995). "Support-vector networks." Machine Learning. 20(3): 273-297. DOI: 10.1007/bf00994018 Dekker, J., K. Rippe, M. Dekker and N. Kleckner (2002). "Capturing chromosome conformation." Science 295(5558): 1306-1311. de Koning, A. P. J., W. Gu, T. A. Castoe, M. A. Batzer and D. D. Pollock (2011). "Repetitive Elements May Comprise Over Two-Thirds of the Human Genome." PLOS Genetics 7(12): e1002384. Denslow N. D., N. Garcia-Reyero and D. S. Barber (2007). "Fish 'n' chips: the use of microarrays for aquatic toxicology." Mol Biosyst. 3(3):172-7. DOI: 10.1039/B612802P Dimalanta, E. T., A. Lim, R. Runnheim, et al. (2004). "A Microfluidic System for Large DNA Molecule Arrays." Anal. Chem 76: 5293-5301. Dobin A., C. A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M. Chaisson and T. R. Gingeras (2013). "STAR: ultrafast universal RNA-seq aligner." Bioinformatics. 29(1):15-21. DOI: 10.1093/bioinformatics/bts635 Dumur, C. I., C. E. Fuller, T. L. Blevins, J. C. Schaum, D. S. Wilkinson, C. T. Garrett and C. N. Powers (2011). "Clinical verification of the performance of the pathwork tissue of origin test: utility and limitations." Am J Clin Pathol 136(6): 924-933. Earl, D., K. Bradnam, J. 
St John, A. Darling, D. Lin, J. Fass, H. O. Yu, V. Buffalo, D. R. Zerbino, M. Diekhans, N. Nguyen, P. N. Ariyaratne, W. K. Sung, Z. Ning, M. Haimel, J. T. Simpson, N. A. Fonseca, I. Birol, T. R. Docking, I. Y. Ho, D. S. Rokhsar, R. Chikhi, D. Lavenier, G. Chapuis, D. Naquin, N. Maillet, M. C. Schatz, D. R. Kelley, A. M. Phillippy, S. Koren, S. P. Yang, W. Wu, W. C. Chou, A. Srivastava, T. I. Shaw, J. G. Ruby, P. Skewes-Cox, M. Betegon, M. T. Dimon, V. Solovyev, I. Seledtsov, P. Kosarev, D. Vorobyev, R. Ramirez-Gonzalez, R. Leggett, D. MacLean, F. Xia, R. Luo, Z. Li, Y. Xie, B. Liu, S. Gnerre, I. MacCallum, D. Przybylski, F. J. Ribeiro, S. Yin, T. Sharpe, G. Hall, P. J. Kersey, R. Durbin, S. D. Jackman, J. A. Chapman, X. Huang, J. L. DeRisi, M. Caccamo, Y. Li, D. B. Jaffe, R. E. Green, D. Haussler, I. Korf and B. Paten (2011). "Assemblathon 1: a competitive assessment of de novo short read assembly methods." Genome Res 21(12): 2224-2241. Eisenstein, M. (2012). "Oxford Nanopore announcement sets sequencing sector abuzz." Nature Biotechnology 30: 295. English, A. C., S. Richards, Y. Han, M. Wang, V. Vee, J. Qu, X. Qin, D. M. Muzny, J. G. Reid, K. C. Worley and R. A. Gibbs (2012). "Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology." PLOS ONE 7(11): e47768. Engstrom, P. G., T. Steijger, B. Sipos, G. R. Grant, A. Kahles, G. Ratsch, N. Goldman, T. J. Hubbard, J. Harrow, R. Guigo, P. Bertone and R. Consortium (2013). "Systematic evaluation of spliced alignment programs for RNA-seq data." Nat Methods 10(12): 1185-1191. Flick, R. W., D. C. Bencic, M. J. See and A. D. Biales (2014). "Sensitivity of the vitellogenin assay to diagnose exposure of fathead minnows to 17alpha-ethynylestradiol." Aquat Toxicol 152: 353-360. Friedman J., T. Hastie and R. Tibshirani (2000). "Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors)." The annals of statistics. 28(2):337-407. 
DOI: 10.1214/aos/1016120463


Friedman J. H. (2001). "Greedy function approximation: a gradient boosting machine." Annals of Statistics. 29(5): 1189-1232. DOI: 10.1214/aos/1013203451 Friedman J., T. Hastie and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent." Journal of statistical software. 33(1):1. DOI: 10.18637/jss.v033.i01 Gaetke, L. M., H. S. Chow-Johnson and C. K. Chow (2014). "Copper: Toxicological relevance and mechanisms." Arch Toxicol. 88(11): 1929–1938. DOI:10.1007/s00204-014-1355-y Gao, S., D. Bertrand, B. K. H. Chia, et al. (2016). "OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees." Genome Biol 17:102. DOI:10.1186/s13059-016-0951-y Gold J. R. and C. T. Amemiya. (1987). "Genome size variation in North American minnows (Cyprinidae). II. Variation among 20 species." Genome. 29(3):481-9. Hosseini M. J., F. Shaki, M. Ghazi-Khansari and J. Pourahmad (2014). "Toxicity of copper on isolated liver mitochondria: impairment at complexes I, II, and IV leads to increased ROS production." Cell Biochem Biophys. 70(1):367-81. DOI: 10.1007/s12013-014-9922-7. Ghurye, J., M. Pop, S. Koren, D. Bickhart, and C.-S. Chin. (2017) "Scaffolding of long read assemblies using long range contact information." BMC Genomics. 18: 527. DOI: 10.1186/s12864-017-3879-z Glasauer, S. M. and S. C. Neuhauss (2014). "Whole-genome duplication in teleost fishes and its evolutionary consequences." Mol Genet Genomics 289(6): 1045-1060. Goodswen, S. J., P. J. Kennedy and J. T. Ellis (2012). "Evaluating High-Throughput Ab Initio Gene Finders to Discover Proteins Encoded in Eukaryotic Pathogen Genomes Missed by Laboratory Techniques." PLOS ONE 7(11): e50609. Grabherr, M. G., B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson, I. Amit, X. Adiconis, L. Fan, R. Raychowdhury, Q. Zeng, Z. Chen, E. Mauceli, N. Hacohen, A. Gnirke, N. Rhind, F. di Palma, B. W. Birren, C. Nusbaum, K. Lindblad-Toh, N. Friedman and A. 
Regev (2011). "Full-length transcriptome assembly from RNA-Seq data without a reference genome." Nat Biotechnol 29(7): 644-652. Green, P. (1996). http://bozeman.mbt.washington.edu/phrap.docs/phrap.html. Grohme M. A., S. Schloissnig, A. Rozanski, et al. (2017) "The genome of Schmidtea mediterranea and the evolution of core cellular mechanisms." Nature. 554(7690):56-61. DOI: 10.1038/nature25473. Haas, B. J., A. Papanicolaou, M. Yassour, M. Grabherr, P. D. Blood, J. Bowden, M. B. Couger, D. Eccles, B. Li, M. Lieber, M. D. MacManes, M. Ott, J. Orvis, N. Pochet, F. Strozzi, N. Weeks, R. Westerman, T. William, C. N. Dewey, R. Henschel, R. D. LeDuc, N. Friedman and A. Regev (2013). "De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity." Nature protocols 8(8): 10.1038/nprot.2013.1084. Haas, B. J., S. L. Salzberg, W. Zhu, M. Pertea, J. E. Allen, J. Orvis, O. White, C. R. Buell and J. R. Wortman (2008). "Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments." Genome Biol 9(1): R7. Hastie, T., R. Tibshirani and J. Friedman (2001). The Elements of Statistical Learning (1st ed.). Springer. ISBN 0-387-95284-5 Holt, C. and M. Yandell (2011). "MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects." BMC Bioinformatics 12(1): 491. Howe, K., M. D. Clark, C. F. Torroja, J. Torrance, et al. (2013). "The zebrafish reference genome sequence and its relationship to the human genome." Nature 496(7446): 498-503. Huang, X. and A. Madan (1999). "CAP3: A DNA Sequence Assembly Program." Genome Res 9: 868-877. Jin, Y., X. Pan and Z. Fu (2014). "Exposure to bifenthrin causes immunotoxicity and oxidative stress in male mice." Environ Toxicol. 29(9):991-9. DOI: 10.1002/tox.21829. Kent, W. J. (2002). "BLAT - the BLAST-like alignment tool." Genome Res 12(4): 656-664.

108

Kielbasa, S. M., R. Wan, K. Sato, P. Horton and M. C. Frith (2011). "Adaptive seeds tame genomic sequence comparison." Genome Res 21(3): 487-493. Koren S., B. P. Walenz, K. Berlin, J. R. Miller and A. M. Phillippy (2017). "Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation." Genome Research. DOI:10.1101/gr.215087.116 Koren, S., et al. (2018). "De novo assembly of haplotype-resolved genomes with trio binning." Nature Biotechnology. 36(12): 1174-1182. Kostich M. S., D. C. Bencic, A. L. Batt, M. J. See, R. W. Flick, D. A. Gordon, J. M. Lazorchak and A. D. Biales (2019). "Multigene biomarkers of pyrethroid exposure: exploratory experiments." Environmental Toxicology and Chemistry. DOI: 10.1002/etc.4552 Kostich M. S. (2017). "A statistical framework for applying RNA profiling to chemical hazard detection." Chemosphere. 188:49-59. DOI: 10.1016/j.chemosphere.2017.08.136 Langmead, B. and S. L. Salzberg (2012). "Fast gapped-read alignment with Bowtie 2." Nat Methods 9. Law C. W., Y. Chen, W. Shi and G. K. Smyth (2014). "voom: Precision weights unlock linear model analysis tools for RNA-seq read counts." Genome Biology. 15(2):R29. DOI: 10.1186/gb-2014-15-2-r29 Lewis, P. A., D. J. Klemm, J. M. Lazorchak, T. J. Norberg-King and W. H. Peltier (1994). "Short-term methods for estimating the chronic toxicity of effluents and receiving water to freshwater organisms." Third edition. EPA/600/4-91/002, July 1994. Li, H. (2016). "Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences." Bioinformatics. 32(14):2103–2110. DOI:10.1093/bioinformatics/btw152 Li, H. and R. Durbin (2009). "Fast and accurate short read alignment with Burrows-Wheeler transform." Bioinformatics. 25(14):1754-60. DOI: 10.1093/bioinformatics/btp324 Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and R. Durbin (2009). "The Sequence Alignment/Map format and SAMtools." 
Bioinformatics 25(16): 2078-2079. Lieberman-Aiden, E., N. L. van Berkum, L. Williams, et al. (2009). "Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome." Science 326(5950): 289-293. Luo, J., J. Wang, Z. Zhang, M. Li and F. X. Wu (2017). "BOSS: a novel scaffolding algorithm based on an optimized scaffold graph." Bioinformatics. 33(2):169-176. DOI: 10.1093/bioinformatics/btw597 Maron, M. E. (1961). "Automatic indexing: an experimental inquiry." Journal of the ACM. 8(3):404-17. DOI: 10.1145/321075.321084 Martin M. (2011). "Cutadapt removes adapter sequences from high-throughput sequencing reads." EMBnet.journal. 17(1):10-2. DOI: 10.14806/ej.17.1.200 Mehta, R., D. M. Templeton and P. J. O'Brien (2006). "Mitochondrial involvement in genetically determined transition metal toxicity II. Copper toxicity." Chemico-Biological Interactions. 163:77–85. Mortazavi, A., B. A. Williams, K. McCue, L. Schaeffer and B. Wold (2008). "Mapping and quantifying mammalian transcriptomes by RNA-Seq." Nat Methods 5. Moss, S. P., D. A. Joyce, S. Humphries, K. J. Tindall and D. H. Lunt (2011). "Comparative analysis of teleost genome sequences reveals an ancient intron size expansion in the zebrafish lineage." Genome Biol Evol 3: 1187-1196. Myers, E. W., G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A. Kravitz, C. M. Mobarry, K. H. J. Reinert, K. A. Remington, E. L. Anson, R. A. Bolanos, H.-H. Chou, C. M. Jordan, A. L. Halpern, S. Lonardi, E. M. Beasley, R. C. Brandon, L. Chen, P. J. Dunn, Z. Lai, Y. Liang, D. R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D. Adams and J. C. Venter (2000). "A Whole-Genome Assembly of Drosophila." Science 287(5461): 2196-2204.


Nuwaysir, E. F., M. Bittner, J. Trent, J. C. Barrett and C. A. Afshari (1999). "Microarrays and toxicology: The advent of toxicogenomics." Mol Carcinog. 24(3):153-9. DOI: 10.1002/(SICI)1098-2744(199903)24:3<153::AID-MC1>3.0.CO;2-P Parra, G., K. Bradnam and I. Korf (2007). "CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes." Bioinformatics 23. Pennington P. L., H. Harper-Laux, Y. Sapozhnikova and M. H. Fulton (2014). "Environmental effects and fate of the insecticide bifenthrin in a salt-marsh mesocosm." Chemosphere. 112:18-25. DOI: 10.1016/j.chemosphere.2014.03.047. Pevzner, P. A., H. Tang and M. S. Waterman (2001). "An Eulerian path approach to DNA fragment assembly." Proceedings of the National Academy of Sciences 98(17): 9748-9753. Phillippy, A. M., M. C. Schatz and M. Pop (2008). "Genome assembly forensics: finding the elusive mis-assembly." Genome Biol 9(3): R55. Phipson B., S. Lee, I. J. Majewski, W. S. Alexander and G. K. Smyth (2016). "Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression." The annals of applied statistics. 10(2):946. DOI: 10.1214/16-AOAS920 Ozata, D. M., I. Gainetdinov, A. Zoch, et al (2019). "PIWI-interacting RNAs: small RNAs with big functions." Nat Rev Genet 20, 89–108. https://doi.org/10.1038/s41576-018-0073-3 R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Riley, M. C., et al (2011). "Rapid whole genome optical mapping of Plasmodium falciparum." Malaria Journal 10(252). Ritchie M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi and G. K. Smyth (2015). "limma powers differential expression analyses for RNA-sequencing and microarray studies." Nucleic acids research. 43(7):e47. DOI: 10.1093/nar/gkv007 Roach, M. J., S. A. Schmidt and A. R. Borneman (2018). 
"Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies." BMC Bioinformatics. 19:460. DOI:10.1186/s12859-018-2485-7 Roberts R. J., et al (2013). "The advantages of SMRT sequencing." Genome Biology 14(6): 4. Robinson, M. D., D. J. McCarthy and G. K. Smyth (2010). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics. 26(1): 139-140. https://dx.doi.org/10.1093/bioinformatics/btp616 Robinson, M. D. and A. Oshlack (2010). "A scaling normalization method for differential expression analysis of RNA-seq data." Genome Biology 11(3): R25. Saari, T. W., A. L. Schroeder, G. T. Ankley and D. L. Villeneuve (2017). "First-generation annotations for the fathead minnow (Pimephales promelas) genome." Environ Toxicol Chem. 36(12):3436-3442. DOI: 10.1002/etc.3929. Saeys, Y., I. Inza and P. Larranaga (2007). "A review of feature selection techniques in bioinformatics." Bioinformatics 23(19): 2507-2517. Sahlin, K., F. Vezzi, B. Nystedt, et al. (2014) "BESST - Efficient scaffolding of large fragmented assemblies." BMC Bioinformatics. 15:281 DOI:10.1186/1471-2105-15-281 Schwartz, D., X. Li, L. Hernandez, S. Ramnarain, E. Huff and Y. Wang (1993). "Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping." Science 262(5130): 110-114. Seppey M., Manni M., Zdobnov E. M. (2019). "BUSCO: Assessing Genome Assembly and Annotation Completeness." Methods Mol Biol. 1962:227-245. DOI: 10.1007/978-1-4939-9173-0_14. Shendure, J. and H. Ji (2008). "Next-generation DNA sequencing." Nature Biotechnology 26: 1135.


Simao, F. A., R. M. Waterhouse, P. Ioannidis, E. V. Kriventseva and E. M. Zdobnov (2015). "BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs." Bioinformatics 31(19): 3210-3212. Slater, G. S. and E. Birney (2005). "Automated generation of heuristics for biological sequence comparison." BMC Bioinformatics 6: 31. Smit, A. and R. Hubley (2008-2015). "RepeatModeler Open-1.0.". Smit, A., R. Hubley and P. Green (2013-2015). "RepeatMasker Open-4.0.". Smith, C. D., R. C. Edgar, M. D. Yandell, D. R. Smith, S. E. Celniker, E. W. Myers and G. H. Karpen (2007). "Improved repeat identification and masking in Dipterans." Gene 389(1): 1-9. Soderlund D. M. and J. R. Bloomquist (1989). "Neurotoxic actions of pyrethroid insecticides." Annual review of entomology. 34(1):77-96. DOI: 10.1146/annurev.en.34.010189.000453 Staden, R. (1980). "A new computer method for the storage and manipulation of DNA gel reading data." Nucleic Acids Res 8: 3673-3694. Stanke, M., O. Keller, I. Gunduz, A. Hayes, S. Waack and B. Morgenstern (2006). "AUGUSTUS: ab initio prediction of alternative transcripts." Nucleic Acids Res 34(Web Server issue): W435-439. Subramanian, A., R. Narayan, S. M. Corsello, D. D. Peck, T. E. Natoli, X. Lu, J. Gould, J. F. Davis, A. A. Tubelli, J. K. Asiedu, D. L. Lahr, J. E. Hirschman, Z. Liu, M. Donahue, B. Julian, M. Khan, D. Wadden, I. C. Smith, D. Lam, A. Liberzon, C. Toder, M. Bagul, M. Orzechowski, O. M. Enache, F. Piccioni, S. A. Johnson, N. J. Lyons, A. H. Berger, A. F. Shamji, A. N. Brooks, A. Vrcic, C. Flynn, J. Rosains, D. Y. Takeda, R. Hu, D. Davison, J. Lamb, K. Ardlie, L. Hogstrom, P. Greenside, N. S. Gray, P. A. Clemons, S. Silver, X. Wu, W. N. Zhao, W. Read-Button, X. Wu, S. J. Haggarty, L. V. Ronco, J. S. Boehm, S. L. Schreiber, J. G. Doench, J. A. Bittker, D. E. Root, B. Wong and T. R. Golub (2017). "A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles." 
Cell 171(6): 1437-1452.e17. Sutton, G., et al (1995). "TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects." Genome Science & Technology 1(1). Tothill, R. W., A. Kowalczyk, D. Rischin, A. Bousioutas, I. Haviv, R. K. van Laar, P. M. Waring, J. Zalcberg, R. Ward, A. V. Biankin, R. L. Sutherland, S. M. Henshall, K. Fong, J. R. Pollack, D. D. Bowtell and A. J. Holloway (2005). "An expression-based site of origin diagnostic method designed for clinical application to cancer of unknown origin." Cancer Res. 65(10):4031-40. Trapnell, C., B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold and L. Pachter (2010). "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation." Nat Biotech 28. Uszczynska-Ratajczak, B., J. Lagarde, A. Frankish, et al (2018). "Towards a complete map of the human long non-coding RNA transcriptome." Nat Rev Genet 19, 535–548. https://doi.org/10.1038/s41576-018-0017-y Vapnik, V. (2000). The Nature of Statistical Learning Theory, 2nd ed. Springer. Venter, J. C., M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans and R. A. Holt (2001). "The Sequence of the Human Genome." Science 291. Wagner, G. P., K. Kin and V. J. Lynch (2012). "Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples." Theory Biosci 131(4): 281-285. Walker, B. J., T. Abeel, T. Shea, M. Priest, A. Abouelliel, S. Sakthikumar, C. A. Cuomo, Q. Zeng, J. Wortman, S. K. Young and A. M. Earl (2014). "Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement." PLOS ONE 9(11): e112963. Wang, R. L., D. Bencic, A. Biales, R. Flick, J. Lazorchak, D. Villeneuve and G. T. Ankley (2012). "Discovery and validation of gene classifiers for endocrine-disrupting chemicals in zebrafish (Danio rerio)." BMC Genomics 13: 358.


Wang, Z., D. Neuburg, C. Li, L. Su, J. Y. Kim, J. C. Chen and D. C. Christiani (2005). "Global Gene Expression Profiling in Whole-Blood Samples from Individuals Exposed to Metal Fumes." Environmental Health Perspectives 113(2): 233-241. Weirather, J. L., M. de Cesare, Y. Wang, P. Piazza, V. Sebastiano, X. J. Wang, D. Buck and K. F. Au (2017). "Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis." F1000Res 6: 100. Weston D. P., M. Zhang, M. J. Lydy (2008). "Identifying the cause and source of sediment toxicity in an agriculture-influenced creek." Environ Toxicol Chem. 27(4):953-62. DOI: 10.1897/07-449.1. Ye, C., C. M. Hill, S. Wu, J. Ruan and Z. Ma (2016). "DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies." Scientific Reports 6: 31900. Ye, C., Z. S. Ma, C. Cannon, M. Pop and D. Yu (2012). "Exploiting sparseness in de novo genome assembly." BMC Bioinformatics 13(Suppl 6):S1. Zhou, J., B. Lemos, E. B. Dopman and D. L. Hartl (2011). "Copy-number variation: the balance between gene dosage and expression in Drosophila melanogaster." Genome Biol Evol 3: 1014-1024.


Supplementary Materials

S.1. Specific steps/commands used in the Maker and PASA/EVM annotation pipeline production runs

Running both annotation pipelines was an involved process, requiring many steps to arrive at a set of gene models from each pipeline. Many of the programs run as part of the two pipelines produced voluminous output, too much to detail here. Therefore,

“Outputs” shown below include only the key files that were used as inputs to succeeding steps in the pipelines.

Key input/output files and some other files (e.g., the Maker control files discussed below) are included as supplemental files. All others are available from the author upon request. Many of the commands displayed contain paths reflecting the directory hierarchy of the specific machine on which the command was run (e.g., in the first command below, the path “/data/hts/Mitch/fhm/Rna20140416a/” appears; this directory was the location of our “raw” RNA fastq sequence files). In some commands, an environment variable may have been defined to represent the path to a program. An example is the PASA command, where the home directory of the PASA program was defined as “$PASA_HOME”; this variable name appears in the PASA command, directing the command to the proper directory in which to find the PASA program. The reader is encouraged to disregard the specific paths and filenames referenced in commands, as they are particular to this project. That stated, the hope is that the reader can gain an understanding of the flow and essential details of how the Maker and PASA/EVM annotation pipelines were run from the commands presented below. The reader can assume default settings were used when programs were run, unless specifically noted in the text or in the commands themselves (where specific parameters may appear with assigned values). The reader is also reminded to refer to Figures 3 and 4 for assistance in understanding the


overall process flow. The annotation steps follow, in the order in which they occurred. Primarily to conserve space, single spacing is used in the commands section.

S.1a. rCorrector (version 1.0.3.1) rCorrector was used to make corrections to input RNA reads, one of the most important inputs to the protein-coding annotation process.

Input: Four sets of 150 bp paired-end read files of cDNA from juvenile (multiple time points) and adult FHM, approximately 320 million read pairs in total.

Commands: perl run_rcorrector.pl -1 /data/hts/Mitch/fhm/Rna20140416a/1_Adult_ACAGTG_L001_R1_001.fastq,/data/hts/Mitch/fhm/Rna20140416a/1_Adult_ACAGTG_L002_R1_001.fastq,/data/hts/Mitch/fhm/Rna20140416a/2_Juev_GTGAAA_L001_R1_001.fastq,/data/hts/Mitch/fhm/Rna20140416a/2_Juev_GTGAAA_L002_R1_001.fastq -2 /data/hts/Mitch/fhm/Rna20140416a/1_Adult_ACAGTG_L001_R2_001.fastq,/data/hts/Mitch/fhm/Rna20140416a/1_Adult_ACAGTG_L002_R2_001.fastq,/data/hts/Mitch/fhm/Rna20140416a/2_Juev_GTGAAA_L001_R2_001.fastq,/data/hts/Mitch/fhm/Rna20140416a/2_Juev_GTGAAA_L002_R2_001.fastq -t 46

Output: Eight files (four sets of paired-end files) named e.g., 1_Adult_ACAGTG_L001_R1_001.cor.fq and 1_Adult_ACAGTG_L001_R2_001.cor.fq.

The utility script FilterUncorrectabledPEfastq.py (from github.com/harvardinformatics/TranscriptomeAssemblyTools) was then used to remove read pairs in which one or both reads were flagged as "unfixable_error" by rCorrector. No version numbers appear to be associated with these utility scripts. The script was run for each of the four sets of paired-end output files from rCorrector. An example command for one of the four pairs is:

"python FilterUncorrectabledPEfastq.py -1 rcorrector_out_dir/1_Adult_ACAGTG_L001_R1_001.cor.fq -2 rcorrector_out_dir/1_Adult_ACAGTG_L001_R2_001.cor.fq", where the two "....cor.fq" files were the rCorrector output files for that set of paired-end files.
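The filtering that the script performs can be sketched as follows. This is a minimal illustration of the idea, not the actual harvardinformatics code: rCorrector appends a "cor" or "unfixable_error" tag to read headers, and the sketch drops any pair in which either mate carries the unfixable flag.

```python
def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from a FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        plus = handle.readline().rstrip()
        qual = handle.readline().rstrip()
        yield (header, seq, plus, qual)

def filter_unfixable_pairs(r1_handle, r2_handle):
    """Drop a whole read pair if either mate's header carries the
    rCorrector 'unfixable_error' flag; keep all other pairs."""
    kept = []
    for rec1, rec2 in zip(read_fastq(r1_handle), read_fastq(r2_handle)):
        if "unfixable_error" in rec1[0] or "unfixable_error" in rec2[0]:
            continue  # discard both mates of the pair
        kept.append((rec1, rec2))
    return kept
```

In practice the real script streams each pair to new "unfixrm_..." output files rather than holding records in memory; the list-returning version above is only for compactness.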

Outputs: Four pairs of corrected paired-end read files from adults and juveniles with “unfixable” read pairs removed, named e.g., unfixrm_1_Adult_ACAGTG_L001_R1_001.cor.fq and unfixrm_1_Adult_ACAGTG_L001_R2_001.cor.fq. S.1b. TrimGalore (version 0.6.1) TrimGalore, implementing the Cutadapt algorithm, was used to trim sequencing adapters and low-quality bases from the corrected reads produced in the previous step.

Inputs: four sets of paired-end reads that were output from FilterUncorrectabledPEfastq.py above (e.g., unfixrm_1_Adult_ACAGTG_L001_R1_001.cor.fq unfixrm_1_Adult_ACAGTG_L001_R2_001.cor.fq).

Command:


trim_galore --paired --phred33 --output_dir trim_galore_out_dir -q 6 --stringency 1 -e 0.1 rcorrector_out_dir/unfixrm_1_Adult_ACAGTG_L001_R1_001.cor.fq rcorrector_out_dir/unfixrm_1_Adult_ACAGTG_L001_R2_001.cor.fq rcorrector_out_dir/unfixrm_1_Adult_ACAGTG_L002_R1_001.cor.fq rcorrector_out_dir/unfixrm_1_Adult_ACAGTG_L002_R2_001.cor.fq rcorrector_out_dir/unfixrm_2_Juev_GTGAAA_L001_R1_001.cor.fq rcorrector_out_dir/unfixrm_2_Juev_GTGAAA_L001_R2_001.cor.fq rcorrector_out_dir/unfixrm_2_Juev_GTGAAA_L002_R1_001.cor.fq rcorrector_out_dir/unfixrm_2_Juev_GTGAAA_L002_R2_001.cor.fq

Outputs: Four pairs of adapter-trimmed, quality-trimmed, paired-end reads meeting the criteria specified in the previous command (-q 6 --stringency 1 -e 0.1), named e.g., unfixrm_1_Adult_ACAGTG_L001_R1_001.cor_val_1.fq and unfixrm_1_Adult_ACAGTG_L001_R2_001.cor_val_2.fq. S.1c. STAR (version 2.6.0a) STAR was employed in two-pass mode (to refine splice junctions) to align the output reads from TrimGalore to the assembled genome. S.1c.1. Pass 1 Inputs: four sets of paired-end validated reads from TrimGalore, e.g., unfixrm_1_Adult_ACAGTG_L001_R1_001.cor_val_1.fq and unfixrm_1_Adult_ACAGTG_L001_R2_001.cor_val_2.fq.

Command:

/data/jmartins/STAR-2.6.0a/bin/Linux_x86_64_static/STAR --runThreadN 48 --genomeDir genomeIndex --readFilesIn trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L001_R1_001.cor_val_1.fq,trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L002_R1_001.cor_val_1.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L001_R1_001.cor_val_1.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L002_R1_001.cor_val_1.fq trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L001_R2_001.cor_val_2.fq,trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L002_R2_001.cor_val_2.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L001_R2_001.cor_val_2.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L002_R2_001.cor_val_2.fq --limitSjdbInsertNsj 1100000 --outTmpDir tmp1 --outFilterType BySJout --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbOverhang 149 --outFileNamePrefix pass1

Outputs: Standard STAR output files. The only important one was “pass1SJ.out.tab”, the table of discovered splice junctions, which was used as an additional input on pass 2. S.1c.2. Pass 2 Inputs: same as pass 1, with the addition of the pass1SJ.out.tab file.

Command:


/data/jmartins/STAR-2.6.0a/bin/Linux_x86_64_static/STAR --runThreadN 48 --genomeDir genomeIndex --readFilesIn trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L001_R1_001.cor_val_1.fq,trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L002_R1_001.cor_val_1.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L001_R1_001.cor_val_1.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L002_R1_001.cor_val_1.fq trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L001_R2_001.cor_val_2.fq,trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L002_R2_001.cor_val_2.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L001_R2_001.cor_val_2.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L002_R2_001.cor_val_2.fq --limitSjdbInsertNsj 1100000 --outTmpDir tmp2 --limitBAMsortRAM 12000000000 --outFilterType BySJout --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbOverhang 149 --outSAMtype BAM SortedByCoordinate --outWigType wiggle --outWigStrand Stranded --sjdbFileChrStartEnd pass1SJ.out.tab --outFileNamePrefix pass2

Outputs: Standard STAR output files and other optionally requested output, the most important of which was the coordinate-sorted .bam alignment file, “pass2Aligned.sortedByCoord.out.bam”.

S.1d. Trinity (version 2.5.0) Trinity was used to assemble the corrected/trimmed reads into transcripts. The reads were assembled both de novo and in a genome-guided manner. S.1d.1. De novo assembly Inputs: TrimGalore output fastq files

Command:

Trinity --seqType fq --left trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L001_R1_001.cor_val_1.fq,trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L002_R1_001.cor_val_1.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L001_R1_001.cor_val_1.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L002_R1_001.cor_val_1.fq --right trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L001_R2_001.cor_val_2.fq,trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L002_R2_001.cor_val_2.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L001_R2_001.cor_val_2.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L002_R2_001.cor_val_2.fq --max_memory 10G --CPU 28 --normalize_reads --SS_lib_type RF --monitor_sec 180 --output trinity_denovo_out_dir &> trinity.denovo.out &

Outputs: Standard Trinity outputs, the most important of which for our purposes was the Trinity.fasta file of assembled transcripts.

S.1d.2. Genome-guided assembly

Inputs: TrimGalore output fastq files and the STAR output bam file pass2Aligned.sortedByCoord.out.bam.

Command:

Trinity --genome_guided_bam pass2Aligned.sortedByCoord.out.bam --genome_guided_max_intron 1000000 --seqType fq --left trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L001_R1_001.cor_val_1.fq,trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L002_R1_001.cor_val_1.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L001_R1_001.cor_val_1.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L002_R1_001.cor_val_1.fq --right trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L001_R2_001.cor_val_2.fq,trim_galore_out_dir/unfixrm_1_Adult_ACAGTG_L002_R2_001.cor_val_2.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L001_R2_001.cor_val_2.fq,trim_galore_out_dir/unfixrm_2_Juev_GTGAAA_L002_R2_001.cor_val_2.fq --max_memory 10G --CPU 24 --normalize_reads --SS_lib_type RF --monitor_sec 180 &>trinity.out&

Outputs: Standard Trinity outputs, the most important of which for our purposes was the Trinity-GG.fasta file of assembled transcripts, where "GG" indicates that the assembly was genome guided.

S.1e. Maker2 (version 2.31.9)

Maker was run in two iterations to produce a set of candidate gene models/transcripts/proteins.

S.1e.1. Iteration 1

Inputs: The fasta file of the assembled genome; transposon sequences identified with RepeatMasker (received from a colleague); and the Trinity genome-guided transcripts, to which were concatenated ~22.5K non-redundant FHM ESTs downloaded from the Joint Genome Institute (JGI). An additional input was a file of reference proteins (also received from the colleague who provided the transposon file) from multiple species (Arabidopsis thaliana, Bacillus subtilis, Caenorhabditis elegans, Ciona intestinalis, Danio rerio, Dictyostelium discoideum, Drosophila melanogaster, Escherichia coli, Gallus gallus, Homo sapiens, Mus musculus, Lepisosteus oculatus, Oryzias latipes, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Xenopus tropicalis) downloaded from the European Bioinformatics Institute (EBI). The ab initio gene predictors Augustus, SNAP, and GeneMark-ES were used, employing hmm parameter files developed during species-specific training, described in more detail in the section below. References to the input files are included in the Maker control file maker_opts.ctl during actual runs, saved as maker_opts.log post-run. The maker_bopts.ctl file is where parameters for the Exonerate and blast programs run under Maker can be adjusted; these parameters were left at the default values provided by Maker. The file maker_exe.ctl simply lists the paths to the executable files run under Maker, including NCBI+'s makeblastdb, blastn, and blastx, RepeatMasker, Exonerate, and the gene predictors SNAP, GeneMark-ES, and Augustus. The location of probuild, required for GeneMark, is also in the file.
All of the control files for each run are available as ".log" files in the supplemental files.

Command: Because Maker uses the three control files described above to control its run, the command for running Maker is simply "Maker". The command was run multiple times at once to take advantage of the 56 threads available on the workstation employed; a Maker log file generated during the run keeps track of which scaffolds are being processed or are already finished, so the different running instances of Maker choose different genomic scaffolds to process at any given time.

Outputs: A consolidated ".gff" output file from the initial run, which includes all of the masking, protein alignment, and RNA alignment information from the run. Final gene model predictions are also included in the file. The consolidated gff file was made from the individual gff files Maker produced for each scaffold using Maker's gff3_merge utility script. Fasta sequence files of model transcripts and proteins are also produced for each scaffold found to contain models. These were concatenated (cat command) into two files representing the model transcripts and proteins for the whole genome.


S.1e.2. Iteration 2

Inputs: The consolidated gff file from iteration 1 and the genome Fasta file, along with one additional EBI reference protein file containing proteins from Branchiostoma floridae (which was accidentally left out of the reference protein file originally), and a retrained SNAP parameter file. SNAP was retrained for the second iteration with all the gene models output from the initial production run, as suggested in the Maker tutorial instructions. Detailed commands for the retraining are contained in the section on gene predictor training below. The parameter files for the other two gene predictors remained the same. GeneMark-ES was retrained after the first Maker iteration; however, evaluation of output GeneMark models using the originally trained GeneMark parameters vs. the new parameters, based on blast coverage to the reference proteins, indicated that the earlier trained parameters produced superior results (data not shown). Augustus parameters were unchanged due to the extensive training Augustus had already undergone, as described in the section on gene predictor training below. All of the final trained parameter files used for each predictor during the Maker runs, including both versions of the SNAP parameters, are available in the supplementary materials.

Outputs: A consolidated gff file prepared from the produced scaffold-specific gff files using gff3_merge, and Fasta files containing all transcript and protein sequences. A total of 30,909 gene models, with 47,716 transcripts, were output by Maker. Additional outputs that were leveraged in running the PASA/Evidence Modeler pipeline are described in the Evidence Modeler section below.

S.1f. PASA (version 2.3.3)

PASA was used as the first step in an alternative (to Maker) annotation process. PASA uses spliced alignments of expressed transcript sequences (or, in this case, Trinity-assembled transcripts) to automatically model gene structures. The particular process employed was as described in PASA's "Build a comprehensive transcriptome database, integrating genome-guided and genome-free transcript reconstructions" on the PASA website (https://github.com/PASApipeline/PASApipeline/wiki).

S.1f.1. PASA preparation step 1

The first step was to concatenate the two sets of Trinity-assembled transcripts (de novo and genome-guided) into a single file.

Command: cat Trinity.fasta Trinity-GG.fasta > transcripts.all.fasta

S.1f.2. PASA preparation step 2

Step two was to use a PASA utility script to create a file containing the list of transcript accessions that correspond to the Trinity de novo assembly.

Command: $PASA_HOME/misc_utilities/accession_extractor.pl < Trinity.fasta > tdn.accs

S.1f.3. Run PASA

Inputs: c18pcphsoh2.fa (the genome Fasta file); "transcripts.all.fasta" (the combined genome-guided and de novo Trinity-assembled transcripts file from step S.1f.1); "tdn.accs" (the list of de novo assembled accessions from step S.1f.2); and alignAssembly.config, a simple PASA configuration file, which was used with default settings other than supplying a name in the "MYSQLDB" field (MYSQLDB=pasa_production_final_old). The contents of the file appear below.


______

File contents of alignAssembly.config:

# MySQL settings
MYSQLDB=pasa_production_final_old
#######################################################
# Parameters to specify to specific scripts in pipeline
# create a key = "script_name" + ":" + "parameter"
# assign a value as done above.

#script validate_alignments_in_db.dbi
validate_alignments_in_db.dbi:--MIN_PERCENT_ALIGNED=75
validate_alignments_in_db.dbi:--MIN_AVG_PER_ID=95
validate_alignments_in_db.dbi:--NUM_BP_PERFECT_SPLICE_BOUNDARY=0

#script subcluster_builder.dbi
subcluster_builder.dbi:-m=50
______

Command: /data/jmartins/progs/PASA/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g ../c18pcphsoh2.fa -t transcripts.all.fasta --ALIGNERS blat,gmap --CPU 28 --TDN tdn.accs --MAX_INTRON_LENGTH 400000 --transcribed_is_aligned_orient &>pasa.out&

Outputs: PASA generates much output, but the key outputs for the purposes of the annotation pipeline are the files "pasa_production_final.assemblies.fasta", which contains the final PASA transcript assemblies, and "pasa_production_final.pasa_assemblies.gff3", which contains the final PASA alignments to the genome of the transcripts that were used to generate the Fasta file produced by PASA.

S.1g. TransDecoder (version 5.5.0)

TransDecoder was used to identify candidate coding regions within the PASA-assembled transcript sequences.

S.1g.1. TransDecoder step 1

The internal TransDecoder script TransDecoder.LongOrfs was used to extract the long open reading frames from the PASA-assembled transcripts.

Input: “pasa_production_final.assemblies.fasta” from PASA

Command: TransDecoder.LongOrfs -t pasa_production_final.assemblies.fasta

Outputs: The primary outputs from this step are a longest_orfs peptide (protein sequence) file and a corresponding cds (coding sequence) Fasta file, longest_orfs.pep and longest_orfs.cds, respectively.

S.1g.2. TransDecoder step 2

As an optional step, the longest_orfs peptide file was subsequently searched with blastp (version 2.2.31) against the EBI reference proteins and with hmmer (version 3.1b2) against PFAM, for use in the prediction step.


Inputs: longest_orfs.pep. For blastp, ref_prots.faa (reference proteins blast database files, previously formatted with blast+ makeblastdb command). For hmmer, reference input was Pfam-A.hmm (PFAM reference domain file downloaded from PFAM).

Commands:

blastp -query ./transdc_out_dir/longest_orfs.pep -db ../production/ebi_protein_files/ref_prots.faa -max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 48 >trandc.longest_orfs.vs.ebi_prots.outfmt6

hmmer-linux-intel-x86_64/binaries/hmmscan --cpu 48 --domtblout pfam.all.domtblout /home/jmartins/pfam/Pfam-A.hmm transdecoder_dir/longest_orfs.pep

Outputs: From blastp, the output was the file trandc.longest_orfs.vs.ebi_prots.outfmt6, a tab-separated file of the best blast hits of the input proteins to our reference protein sequences; from hmmscan, the output was the file pfam.all.domtblout, the per-PFAM-domain hits tabular output.

S.1g.3. TransDecoder step 3

The next step was to do the prediction of likely coding regions using another internal TransDecoder script.

Inputs: The PASA assemblies fasta file and the blastp and hmmscan output files produced in the previous step. The -S option was also included because the PASA assemblies had been generated with stranded cDNA.

Command:

TransDecoder.Predict -t pasa_production_final.assemblies.fasta --retain_pfam_hits pfam.all.domtblout --retain_blastp_hits trandc.longest_orfs.vs.ebi_prots.outfmt6 -S

Outputs: Outputs include PASA_TransDecoder gff3, bed, cds, pep files (the latter two fasta files), named e.g., pasa_production_final.assemblies.fasta.transdecoder.cds, pasa_production_final.assemblies.fasta.transdecoder.gff3, etc.

(Notes: If a candidate ORF was fully encapsulated by the coordinates of another candidate, TransDecoder reported the longer ORF, but a single transcript could include multiple ORFs. As part of prediction, a PSSM is built/trained/used to refine the start codon prediction.)

S.1g.4. TransDecoder step 4

The next step was to use a TransDecoder utility script to translate the gff3 output from the prediction step to a genome-based coordinate system.

Inputs: the referenced "TransDecoder" gff3 file, the original PASA assembly gff3 file, and the PASA assemblies Fasta file.

Command: $path_to_transdecoder_directory/util/cdna_alignment_orf_to_genome_orf.pl pasa_production_final.assemblies.transdecoder.gff3 pasa_production_final.pasa_assemblies.gff3 pasa_production_final.assemblies.fasta > pasa_production_final.assemblies.transdecoder.genome.gff3


Outputs: The file pasa_production_final.assemblies.transdecoder.genome.gff3 was the key output from the TransDecoder process.

S.1h. EVidenceModeler (EVM, version 1.1.1)

EVM was used to generate a set of gene models. EVM models coding sequence only, providing a single transcript model for each gene.

S.1h.1. Prepare EVM inputs

To run EVM, one first needs to gather the necessary inputs. EVM requires genome sequences in Fasta format and input gene structures and alignment evidence described in GFF3 format. The alignment evidence includes protein alignments and RNA (transcript) alignments. The gene structures come primarily from gene predictors (in our case Augustus, SNAP, and GeneMark-ES). An advantage of running Maker first was that the Maker output gff3 file includes alignment information. Further, in the "theVoid…" directory generated for each scaffold during a Maker run, one can find the "raw" gene predictions from the predictors, as well as a repeat-masked version of the scaffold. These parts of the Maker output could be leveraged to provide inputs for EVM. In EVM, one can also provide a subset of gene predictions as "OTHER_EVIDENCE". This category was used to flag a subset of final gene models, produced by either Maker or TransDecoder, that appeared to have extremely high homology to a reference protein. The following sections provide additional details.

S.1h.1.1. Acquire “perfect coverage” gene models from Maker and TransDecoder output

Maker protein sequences (from iteration 2) and final TransDecoder peptide sequences were blasted against our EBI reference proteins, and best hits were classified based on hit lengths as having coverage (of both the query sequence and the subject sequence) that was perfect, 98%, 95%, 90%, 85%, 70%, 60%, or 50%; hits with less than 50% coverage were not reported. Models found to have perfect coverage were extracted from the final Maker gff file and the PASA_TransDecoder gff file to serve as EVM inputs. They were added to the gene predictions input file (see below).

The blast commands described above were, for the TransDecoder proteins,

/data/progs/ncbi-blast-2.6.0+-src/c++/ReleaseMT/bin/blastp -query pasa_production_final.assemblies.fasta.transdecoder.pep -db ../../production/ebi_protein_files/ref_prots.faa -num_threads 48 -num_descriptions 5 -num_alignments 5 -outfmt 0 -out pasa_assemblies_transdc.vs.ebi.ref_prots.blast.outfmt0

and, for the Maker proteins:

/data/progs/ncbi-blast-2.6.0+-src/c++/ReleaseMT/bin/blastp -query c18pcphsoh2.all.maker.proteins.final.defMod.fasta -db ../../production/ebi_protein_files/ref_prots.faa -num_threads 48 -num_descriptions 5 -num_alignments 5 -outfmt 0 -out /data/production_final/c18pcphsoh2.all.maker.proteins.final.defMod.vs.ebi.ref_prots.blast.outfmt0

Custom perl scripts were used to produce the coverage evaluation from each of the blast outputs and then extract the models (and their genomic coordinates) identified as having perfect coverage to a reference protein from the Maker and TransDecoder gff3 files. 1,997 Maker models and 6,156 TransDecoder models were extracted, saved in the files maker.covPerf.gff and pasa.transdc.covPerf.gff, respectively.
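The custom perl scripts themselves are not included here, but the coverage-tier logic they implement can be sketched as follows. This is an illustrative Python reconstruction under the assumption that a hit's coverage is the alignment length taken as a fraction of both the query and subject lengths (so the limiting fraction is what gets classified); the tier thresholds come from the list above, and the function name is hypothetical.

```python
# Illustrative sketch, not the actual perl implementation: classify a best
# blast hit into the coverage tiers used in the text. "Perfect" means the
# alignment spans 100% of both query and subject; tiers below 50% are dropped.
TIERS = [(100, "perfect"), (98, "98%"), (95, "95%"), (90, "90%"),
         (85, "85%"), (70, "70%"), (60, "60%"), (50, "50%")]

def coverage_tier(aln_len, query_len, subject_len):
    """Return the coverage tier label for a hit, or None if below 50%."""
    # Covering "both" sequences means the smaller of the two coverage
    # fractions, i.e. alignment length over the longer sequence.
    cov = 100.0 * aln_len / max(query_len, subject_len)
    for threshold, label in TIERS:
        if cov >= threshold:
            return label
    return None
```

Under this sketch, a model whose best hit covers the full length of both the model protein and the reference protein lands in the "perfect" tier and would be extracted for EVM.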

S.1h.1.2. Acquire gene model predictions from Augustus, SNAP, and GeneMark-ES

S.1h.1.2.i. The raw ab initio predictions from Augustus, SNAP, and GeneMark generated during the Maker run were extracted from the Maker output directory for scaffolds 22 and above, as for those scaffolds each predictor's predictions were contained in a single file. The output files are saved in the directory named "theVoid…" within the output directory created for every scaffold by Maker. Perl scripts were used to adjust the formats of the outputs as necessary into the gff3 format expected by EVM.

S.1h.1.2.ii. For scaffolds 1-21, the ab initio prediction outputs generated during the Maker run were spread across multiple files representing overlapping (by 1M bases) chunks of the scaffolds, so for these scaffolds the three gene predictors were run independently of Maker, as that was deemed easier than developing a parser to piece together the models from the chunked files. To mirror the way Maker does things as much as possible, hints files with intron/exon information were generated (data not shown) from the STAR alignment "pass2" output for use with Augustus and GeneMark, and SNAP and Augustus were run on both masked and unmasked input scaffolds; the masked scaffolds are available in the "theVoid..." Maker output directory for each scaffold as the file "query.masked.fasta".

Inputs: genomic scaffolds 1-21 (masked and unmasked for Augustus and SNAP) and hints files (for Augustus and GeneMark)

Commands:

#Genemark-ES Suite (version 4.38)

/data/jmartins/maker/gm_et_linux_64/gmes_petap/gmes_petap.pl --prediction --predict_with /data/jmartins/maker/gmhmm.mod --evidence ../genemark.hints.sorted.with.all.w2h.hints_scafs1-21.gff --max_contig 3000000 --cores 28 --max_intron 400000 --max_intergenic 2000000 --min_gene_prediction 179 --sequence ../scafs1-21.fasta

#Augustus (version 3.2.2)

/data/jmartins/maker/augustus-3.2.2/bin/augustus --species=fathead2_2 --UTR=off --hintsfile=all.hints.augustus.scafs1-21.gff --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg --gff3=on scafs1-21.masked.fasta

/data/jmartins/maker/augustus-3.2.2/bin/augustus --species=fathead2_2 --UTR=off --hintsfile=all.hints.augustus.scafs1-21.gff --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg --gff3=on scafs1-21.fasta

#SNAP (version 2006-07-28)

/data/jmartins/maker/snap/snap -gff snap.retrained.hmm /media/3f0f1386-6815-4473-acc5-cce84c971061/production_final/scafs1-21.fasta > scafs1-21.snap.gff

/data/jmartins/maker/snap/snap -gff snap.retrained.hmm /media/3f0f1386-6815-4473-acc5-cce84c971061/production_final/scafs1-21.masked.fasta > scafs1-21.snap.masked.gff


Outputs: Gene predictions for scaffolds 1-21 from each of the predictors. For Augustus and SNAP, there were two sets of predictions, one each from predicting on masked and unmasked input scaffolds. Perl scripts were used to extract the gene models from the various gtf/gff outputs from all the prediction files. The scripts took care of formatting the output, particularly field 9, to the format that EVM expects. All the predicted gene models then became part of the single input gene predictions file for EVM (see below).
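As an illustration of the kind of field-9 rewriting the perl scripts performed, the sketch below converts a GTF-style attribute string into the GFF3 key=value form (ID=/Parent=) that EVM parses. The function name and the gene_id/transcript_id case it handles are illustrative assumptions; the real predictor outputs vary and the actual scripts handled more cases.

```python
# Hypothetical helper standing in for the perl reformatting scripts:
# rewrite a GTF-style attribute column, e.g. 'gene_id "g1"; transcript_id "t1";',
# into GFF3 'ID=...;Parent=...' form. Only the common two-key case is handled.
import re

def gtf_attrs_to_gff3(attrs):
    """Return a GFF3 attribute string built from GTF-style key "value" pairs."""
    pairs = dict(re.findall(r'(\w+) "([^"]*)"', attrs))
    out = []
    if "transcript_id" in pairs:
        out.append("ID=" + pairs["transcript_id"])
    if "gene_id" in pairs:
        out.append("Parent=" + pairs["gene_id"])
    return ";".join(out)
```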

S.1h.1.3. Gather RNA transcript and protein alignment information from the first-iteration Maker output file

blastn, est2genome, blastx, and protein2genome alignment information was extracted from the consolidated Maker gff3 file using the Linux grep command (not shown), and again a simple Perl script was used to modify the format of gff field 9 slightly to the format that EVM expects. The alignment information became part of the protein evidence file and transcript evidence file for EVM in step S.1h.1.4 below.

S.1h.1.4. Prepare final evidence files for EVM

S.1h.1.4.i. Using the Linux "cat" command (not shown), the PASA assemblies gff3 file, another source of transcript alignment information, was concatenated with the blastn and est2genome gff files extracted from Maker to make the file "rna.evidence.gff" for EVM.

S.1h.1.4.ii. Similarly, the protein2genome and blastx gff files from Maker were concatenated into file "protein.evidence.gff" for EVM.

S.1h.1.4.iii. The ab initio prediction gff files (those extracted from Maker and those produced outside of Maker for scaffolds 1-21) were concatenated with the "perfect coverage" Maker and PASA_TransDecoder models described in S.1h.1.1 above into the file "genePreds.all.with.maker.transdc.perfs.gff". Table S.1.1 shows the number of models contained in the final gene predictions file from each source:

Table S.1.1. Summary of input gene models to Evidence Modeler

Source                                    Number of models
Augustus                                  134,968
Augustus_masked                           78,642
SNAP                                      230,408
SNAP_masked                               156,418
GeneMark-ES                               144,248
"Perfect coverage" Maker models           1,997
"Perfect coverage" TransDecoder models    6,156
Total predictions                         752,837
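As a quick arithmetic check, the per-source counts in Table S.1.1 can be summed to confirm the stated total of input predictions:

```python
# Arithmetic check of Table S.1.1: the per-source gene-model counts
# should sum to the stated total of 752,837 input predictions.
counts = {
    "Augustus": 134_968,
    "Augustus_masked": 78_642,
    "SNAP": 230_408,
    "SNAP_masked": 156_418,
    "GeneMark-ES": 144_248,
    "Perfect-coverage Maker models": 1_997,
    "Perfect-coverage TransDecoder models": 6_156,
}
total = sum(counts.values())
print(total)  # 752837
```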

S.1h.1.4.iv. Prepare EVM weights file

Evidence Modeler uses a weights file for the evidence types. Weights had been assigned based on earlier testing (data not shown) done on a randomly selected subset of scaffolds from an earlier draft assembly, where protein and transcript models produced by EVM using different weights were evaluated for BUSCO coverage and for the mapping rates, to the model transcripts produced during the test, of ~6.43M randomly selected paired-end RNA reads drawn from our initial 320M paired-end reads. Prior to mapping, the test EVM output was run through two iterations of PASA to (potentially) add UTRs and additional transcript models to the EVM models. Exemplar (longest) transcripts were then isolated and used as the target for mapping. (The process of running EVM outputs back through PASA is discussed further in the next section.) The testing showed that using what are essentially the default weights recommended by EVM produced the best results, so the final weights for our EVM "production" run, contained in the file "weights1.txt", were as follows:

ABINITIO_PREDICTION snap 1
ABINITIO_PREDICTION snap_masked 1
ABINITIO_PREDICTION genemark 1
ABINITIO_PREDICTION augustus 1
ABINITIO_PREDICTION augustus_masked 1
PROTEIN protein2genome 5
PROTEIN blastx 5
TRANSCRIPT blastn 5
TRANSCRIPT est2genome 5
TRANSCRIPT assembler-pasa_production_final_old 10
OTHER_PREDICTION transdecoder 10
OTHER_PREDICTION maker 10
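The weights file has a simple three-column layout (evidence class, source name, integer weight). A minimal sketch of parsing it into a lookup table, roughly as EVM consumes it, is shown below; the function name and the keying choice are illustrative, not EVM's internal code.

```python
# Sketch: parse an EVM-style weights file (evidence class, source, weight)
# into a dict keyed by (class, source). Blank lines and comments are skipped.
def parse_weights(lines):
    weights = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ev_class, source, weight = line.split()
        weights[(ev_class, source)] = int(weight)
    return weights

# A few rows from the weights1.txt listing above:
w = parse_weights([
    "ABINITIO_PREDICTION snap 1",
    "TRANSCRIPT est2genome 5",
    "OTHER_PREDICTION transdecoder 10",
])
```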

S.1h.2. Partition inputs

EVM uses partitioned inputs to increase processing speed, and provides a utility script to accomplish this.

Inputs: The genome Fasta file and the EVM evidence files prepared as described above. The overlapSize (between partitions) was set based on an EVM recommendation that it be approximately twice the average anticipated gene size. The segmentSize was set to ten times that value.
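Under these settings (segmentSize 600,000, overlapSize 60,000), the partition start coordinates for a scaffold can be sketched as follows. This is an illustrative reconstruction of the partitioning arithmetic, not EVM's actual code: each segment advances by segmentSize minus overlapSize, so adjacent segments share 60 kb.

```python
# Sketch (illustrative, not partition_EVM_inputs.pl itself): compute the
# start coordinates of overlapping segments for one scaffold, given the
# segmentSize and overlapSize values passed on the command line below.
def partition_starts(scaffold_len, segment_size=600_000, overlap_size=60_000):
    """Return the 0-based start coordinate of each segment."""
    step = segment_size - overlap_size  # 540,000 with the values used here
    starts = []
    pos = 0
    while pos < scaffold_len:
        starts.append(pos)
        if pos + segment_size >= scaffold_len:
            break  # last segment reaches the scaffold end
        pos += step
    return starts
```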

Command:

$EVM_HOME/EvmUtils/partition_EVM_inputs.pl --genome c18pcphsoh2.fa

--gene_predictions genePreds.all.with.maker.transdc.perfs.gff

--protein_alignments protein.evidence.gff

--transcript_alignments rna.evidence.gff

--segmentSize 600000 --overlapSize 60000 --partition_listing partitions_list.out

Outputs: The file partitions_list.out, showing paths to the partitioned inputs.

S.1h.3. Generate EVM commands

The next step in the EVM process is to generate commands, which was done with another EVM utility script.


Inputs: The genome Fasta file, the EVM weights file, the EVM evidence files, the partitions list file.

Command:

$EVM_HOME/EvmUtils/write_EVM_commands.pl --genome c18pcphsoh2.fa --weights /media/3f0f1386-6815-4473-acc5-cce84c971061/production_final/EVM/weight1/weights1.txt

--gene_predictions genePreds.all.with.maker.transdc.perfs.gff

--protein_alignments protein.evidence.gff

--transcript_alignments rna.evidence.gff --output_file_name evm.out --partitions partitions_list.out > commands.list

Outputs: The file commands.list, containing 2837 commands to be executed on the partitioned inputs. These commands are of the form:

/data/jmartins/progs/EVidenceModeler-1.1.1/EvmUtils/.././evidence_modeler.pl -G c18pcphsoh2.fa -g genePreds.all.with.maker.transdc.perfs.gff -w /media/3f0f1386-6815-4473-acc5-cce84c971061/production_final/EVM/weight1/weights1.txt -e rna.evidence.gff -p protein.evidence.gff --exec_dir /media/3f0f1386-6815-4473-acc5-cce84c971061/production_final/EVM/weight1/scaf1/scaf1_1-600000 > /media/3f0f1386-6815-4473-acc5-cce84c971061/production_final/EVM/weight1/scaf1/scaf1_1-600000/evm.out 2> /media/3f0f1386-6815-4473-acc5-cce84c971061/production_final/EVM/weight1/scaf1/scaf1_1-600000/evm.out.log

("/media/3f0f1386-6815-4473-acc5-cce84c971061" is an external drive where the overall working directory "production_final" was located.)

S.1h.4. Execute EVM commands

The next step was to execute the 2837 commands. A custom perl script was used to run them in parallel to decrease total processing time. Details of that step are not presented.
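Although the custom perl runner is not shown, the general approach can be sketched as bounded parallel execution of the shell commands in commands.list. The Python below is an illustrative stand-in; the worker count and the choice to collect exit codes are assumptions, not details of the actual script.

```python
# Sketch of a parallel command runner (illustrative stand-in for the
# unpublished perl script): execute each shell command from commands.list
# with a bounded pool of workers, returning exit codes in input order.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_commands(commands, workers=8):
    """Run each shell command string; return the list of exit codes, in order."""
    def run_one(cmd):
        return subprocess.run(cmd, shell=True).returncode
    # Threads suffice here because the work happens in child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))
```

In practice the command list would be read from commands.list (one command per line) and workers tuned to the available cores.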

Outputs: Files "evm.out", created in each partitioned input folder.

S.1h.5. Convert EVM output to gff format and gather into one file

After running the main EVM commands, EVM utility scripts were used to (S.1h.5.1) combine the partitioned output and (S.1h.5.2) convert the output to gff3 format. The Linux 'find' command (S.1h.5.3) was then used to concatenate all of the gff3 outputs into a single file, "EVM.all.gff3".

S.1h.5.1.

Inputs: partitions list and the evm.out files.

Command: $EVM_HOME/EvmUtils/recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out

Outputs: A single evm.out file in each partitioned folder representing the combined output from the evm output files for the partitions in that folder.

S.1h.5.2.


Inputs: partitions_list, consolidated evm.out file, and genome Fasta file.

Command:

$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl --partitions partitions_list.out --output evm.out --genome c18pcphsoh2.fa

Outputs: a gff3 results file for each input genomic scaffold, evm.out.gff3

S.1h.5.3.

Inputs: the evm.out.gff3 file for each scaffold.

Command: find . -regex ".*evm.out.gff3" -exec cat {} \; > EVM.all.gff3

Outputs: Consolidated results file for all scaffolds in gff3 format, EVM.all.gff3.
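A gene-model count like the one reported for the consolidated file can be obtained by tallying rows whose feature-type (third) column is "gene". The sketch below is an illustrative check, assuming standard tab-separated GFF3 with comment lines prefixed by "#":

```python
# Sketch: count gene models in a GFF3 file by tallying rows whose third
# (feature type) column is "gene". Since EVM emits one transcript per gene,
# counting gene features gives the model total directly.
def count_genes(gff3_lines):
    n = 0
    for line in gff3_lines:
        if line.startswith("#"):
            continue  # skip headers/comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 2 and cols[2] == "gene":
            n += 1
    return n
```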

There were 37,378 gene models, each represented by a single transcript, in the final output file.

S.1i. Final PASA processing

Once the EVM gff3 output was saved into a single file, the next step in the pipeline was to employ PASA again to update the EVM models, which contain coding sequence only and are limited to a single transcript. PASA updates the EVM models by adding UTRs and/or providing models of alternate transcripts when the alignment information generated during the initial run of PASA (section S.1f above) supports such actions. The recommendation from the PASA developers is to run two iterations of PASA at this stage. The commands were as follows:

S.1i.1. PASA update 1

S.1i.1.1. First load the EVM-produced models/annotations.

Inputs: consolidated EVM gff3 file, genomic Fasta file, and PASA configuration file.

Command: $PASAHOME/scripts/Load_Current_Gene_Annotations.dbi -c ../../PASA/alignAssembly.config -g c18pcphsoh2.fa -P EVM.all.gff3

Outputs: The outcome of this step is simply the loading of the EVM annotations into the underlying MySQL database.

S.1i.1.2. Do the update.

Inputs: the genomic Fasta file, the transcript models Fasta file ("pasa_production_final.assemblies.fasta") produced by the original PASA run (step S.1f.3 above), and the PASA configuration file annotCompare.config. The run used default entries, so the file's contents (below) are unchanged from the file downloaded with PASA except for the name of the MySQL database:

File contents of annotCompare.config:

# MySQL settings

MYSQLDB=pasa_production_final_old
#######################################################


# Parameters to specify to specific scripts in pipeline
# create a key = "script_name" + ":" + "parameter"
# assign a value as done above.

#script cDNA_annotation_comparer.dbi
cDNA_annotation_comparer.dbi:--MIN_PERCENT_OVERLAP=<__MIN_PERCENT_OVERLAP__>
cDNA_annotation_comparer.dbi:--MIN_PERCENT_PROT_CODING=<__MIN_PERCENT_PROT_CODING__>
cDNA_annotation_comparer.dbi:--MIN_PERID_PROT_COMPARE=<__MIN_PERID_PROT_COMPARE__>
cDNA_annotation_comparer.dbi:--MIN_PERCENT_LENGTH_FL_COMPARE=<__MIN_PERCENT_LENGTH_FL_COMPARE__>
cDNA_annotation_comparer.dbi:--MIN_PERCENT_LENGTH_NONFL_COMPARE=<__MIN_PERCENT_LENGTH_NONFL_COMPARE__>
cDNA_annotation_comparer.dbi:--MIN_FL_ORF_SIZE=<__MIN_FL_ORF_SIZE__>
cDNA_annotation_comparer.dbi:--MIN_PERCENT_ALIGN_LENGTH=<__MIN_PERCENT_ALIGN_LENGTH__>
cDNA_annotation_comparer.dbi:--MIN_PERCENT_OVERLAP_GENE_REPLACE=<__MIN_PERCENT_OVERLAP_GENE_REPLACE__>
cDNA_annotation_comparer.dbi:--STOMP_HIGH_PERCENTAGE_OVERLAPPING_GENE=<__STOMP_HIGH_PERCENTAGE_OVERLAPPING_GENE__>
cDNA_annotation_comparer.dbi:--TRUST_FL_STATUS=<__TRUST_FL_STATUS__>
cDNA_annotation_comparer.dbi:--MAX_UTR_EXONS=<__MAX_UTR_EXONS__>
cDNA_annotation_comparer.dbi:--GENETIC_CODE=<__GENETIC_CODE__>
______

Command: $PASAHOME/scripts/Launch_PASA_pipeline.pl --CPU 52 -c ../../PASA/annotCompare.config -A -g c18pcphsoh2.fa -t ../../PASA/pasa_production_final.assemblies.fasta

Outputs: An updated gff3 file of genome annotations, pasa_production_final_old.gene_structures_post_PASA_updates.36011.gff3, which was, for simplicity, copied to the file "EVM.all.update1.gff3". The file contained 37,195 gene models with 55,556 transcripts.

S.1i.2. PASA update 2

For the second iteration of the PASA update, the same process used in iteration 1 was repeated, this time using the output gff3 file from the first iteration.

S.1i.2.1. First load the updated models/annotations produced in the first iteration.

Inputs: Updated EVM gff3 file, genomic Fasta file, and PASA configuration file.

Command:

$PASAHOME/scripts/Load_Current_Gene_Annotations.dbi -c ../../PASA/alignAssembly.config -g c18pcphsoh2.fa -P EVM.all.update1.gff3

Outputs: The outcome of this step is simply to load the updated EVM annotations into the underlying MySQL database.

S.1i.2.2. Run PASA to do the updating.

$PASAHOME/scripts/Launch_PASA_pipeline.pl --CPU 52 -c ../../PASA/annotCompare.config -A -g c18pcphsoh2.fa -t ../../PASA/pasa_production_final.assemblies.fasta&>annot_compare2.out&


Outputs: An updated gff3 file of genome annotations, pasa_production_final_old.gene_structures_post_PASA_updates.3168.gff3, which was, for simplicity, copied to the file "EVM.all.update2.gff3". The file contained 37,190 gene models with 59,057 transcripts and represented the final output of the PASA/EVM annotation pipeline.

The commands outlined above were the ones used to arrive at the 30,909 Maker models and 37,190 PASA/EVM models that were compared based on BUSCO content and RNA mapping rates to determine which set to proceed with. The PASA/EVM set was chosen, based primarily on its superior BUSCO coverage.

S.2. Preliminary Maker runs and notes on gene predictor training

S.2a. Preliminary MAKER run using the “dbg2olc” assembly

Once Maker was installed on the Linux workstation a preliminary run was done to verify that everything appeared to be configured correctly and to become familiar with the general work flow and outputs produced by Maker. Given that our 320M+ paired-end RNA-Seq reads were expected to provide the best set of evidence for defining intron/exon boundaries in our to be produced models, this first “full”

Maker run was performed using a draft genome assembly our group had produced and with the Maker options control file (maker_opts.ctl) set up without any gene predictors (e.g, Augustus, SNAP) invoked.

Instead in the control file was set so that only Exonerate’s est2genome process would generate predictions. This meant the only gene predictions generated by Maker would be based on how assembled transcripts (see following) aligned to the genome. Given that detailed commands for running the annotation pipelines were provided in the preceding supplementary section, detailed commands for the preliminary run are not provided here.

Prior to running Maker, the adult and juvenile RNA-seq reads were combined into two files, r1.fastq and r2.fastq, and sequencing adapters were trimmed with BBDuk (version 37.41). Reads shorter than 40 bases and reads with average quality scores <20 were discarded (TrimGalore was not yet in use at this time). Following trimming, ~291.4 M paired-end sequence pairs remained. To be usable in the Maker annotation process, the mRNA reads needed to be assembled into a more tractable set of longer transcripts; Trinity (version 2.4.0) was used in "genome guided" mode to accomplish this. Trinity's genome-guided assembly first required that the mRNA reads be mapped to the reference genome ("dbg2olc_final_pilon_scaffolded") in a splice-aware manner. STAR (version 2.5.3a) was used to accomplish that, in two-pass mode, and samtools (version 1.4) was used to convert the output *.sam alignment file to a coordinate-sorted *.bam file. Trinity was then run in "genome guided" mode with the following parameters: --genome_guided_max_intron 1000000 --genome_guided_bam cleanTrimLeftMaq20vsDefMod_sorted.bam --seqType fq --left bbClean1_trimLeftMaq20.fq --right bbClean2_trimLeftMaq20.fq --max_memory 10G --normalize_reads --SS_lib_type RF --monitor_sec 180, where the "...sorted.bam" and "...20.fq" files in the parameter list represent the .bam file and the trimmed RNA read files, respectively.

The output from Trinity consisted of 1,050,431 "transcripts" representing 706,934 "genes," ranging in length from 201 to 40,435 bases, with a median length of 381. The assembled Trinity "transcripts" were used as input to a Maker run, along with the test genome, with the est2genome parameter set to 1, so that gene structure would be inferred from the blastn alignments of the transcripts to the genome, after refinement of the exact boundaries of the hits by the program Exonerate. The default setting to ignore genomic scaffolds less than 10,000 bases long was used, so 8,310 of the 8,630 available scaffolds were examined. For RepBase masking in RepeatMasker, model_org=all was chosen. A "basic" fathead minnow-specific repeat library was also provided to RepeatMasker, constructed from the target genome using RepeatModeler (Smit 2008-2015) with the BuildDatabase command (-engine ncbi) followed by the RepeatModeler command. The transposable element protein file that downloads with Maker was also provided (te_proteins.fasta). All other Maker parameters were left at their defaults except max_dna_len, which was set to 600,000, and split_hit, which was set to 200,000 (the expected maximum intron size).

For each of the analyzed scaffolds, Maker output a ".gff" (general feature format, version 3) file providing details of exactly where repeat masking occurred, what it was based on, and where the Trinity sequences aligned (blastn and Exonerate alignment results). Along with these data, the coordinates of predicted gene models, based on Exonerate's est2genome process, were also provided. For any scaffold where one or more genes were predicted, two fasta files were also output: one containing all the predicted transcripts from the scaffold, and the other the corresponding proteins the transcripts would be translated into. Using the Linux "cat" command, the fasta files were concatenated into two files, one of all the transcripts and one of all the proteins. The output consisted of 30,657 gene models with 64,804 transcripts. The protein sequences were analyzed with BUSCO (version 3.01), using the Actinopterygii data subset (version odb9, consisting of 4,584 single-copy ortholog proteins present in ray-finned fish), to evaluate completeness, continuity, and duplication. BUSCO reported a total of 2,725 "Complete BUSCOs" (59.5%) and 531 (11.6%) "Fragmented BUSCOs". The Maker run was considered successful in that it produced models from an entire assembly; however, the BUSCO "coverage" fell well short of what we as a group hoped to achieve, our unofficial goal for complete single-copy BUSCOs being at least 85%.

S.2b. Gene predictor training

S.2b.1. Augustus training

The main reason the Maker run above was presented is that it served as a source of gene models that were used to train our gene predictors (Augustus, SNAP, and GeneMark-ES). Species-specific training for the predictors was important because every genome has its own characteristics; the intron comparison we did against five other fish species, where we saw the greater AT richness of the FHM genome relative to all the other species, provides an example of genome-specific structure. The training process for SNAP and GeneMark is relatively simple, while training Augustus is more involved, so most of the information presented in this section focuses on Augustus. As was the case where the annotation pipeline commands were given in supplemental section 1, specific paths or environment variables (faux or real) representing directory paths may appear in the commands below; the reader should disregard these, as they are not important. The important part is the sequence of steps that were performed. This section, while detailed, is not necessarily exhaustive.

Readers may contact the author for additional specific information and guidance if desired, or follow the link provided below should confusion arise. The training that was accomplished is described next.

The protein sequences from the preliminary Maker run discussed above were blasted against the NCBI nr database. Using custom perl scripts, the blast results were assessed to find models that had perfect coverage to non-predicted proteins (i.e., NCBI accessions not starting with "XP"). 321 such models were identified and chosen as an initial training set for Augustus. For the purpose of training, we verified that none of the genes overlapped, per Augustus's guidance. The Augustus utility script "gff2gbSmallDNA.pl", which can convert gff files (the format of the output models from Maker) to the GenBank (gb) format required for Augustus training, had yet to be discovered, so instead the Maker utility script maker2zff, the SNAP program fathom (part of the SNAP download), and the SNAP utility script convert_fathom2genbank.pl (found online, in a Maker help thread) were employed to do the conversion. The two commands were:

$PATH_TO_MAKER/bin/maker2zff trainees_DANRE_noOverlaps.gff

$PATH_TO_SNAP/fathom genome.ann genome.dna -categorize 1000

The output files from the fathom command, uni.ann and uni.dna, were then used as input for the conversion to GenBank format step:


convert_fathom2genbank.pl uni.ann uni.dna

At this point the input (uni.ann) to the conversion command contained only 267 of the 321 models that had been selected for training; no investigation was done as to why some models were lost. All 267 were present in the GenBank-converted output file (called "DANRE_trainees.gb"). I note that files had "DANRE" in their names because the plan was to use the zebrafish parameter files downloaded with Augustus as a starting point for comparison; DANRE thus stands for Danio rerio. The sequences in the .gb (GenBank) file had no flanking sequence, just CDS and introns. I used the Augustus utility script randomSplit.pl to choose 200 of the 267 as trainers, leaving 67 as the test set. The command to accomplish that was:

/data/progs/augustus-3.2.2/scripts/randomSplit.pl DANRE_trainees.gb 67
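randomSplit.pl does nothing more than shuffle the GenBank entries and peel off a test set of the requested size; the logic is equivalent to the following sketch (a hypothetical reimplementation for illustration, not the actual Augustus script):

```python
import random

def random_split(records, test_size, seed=0):
    """Randomly partition records into (train, test) sets, mimicking
    what Augustus's randomSplit.pl does (hypothetical sketch)."""
    rng = random.Random(seed)
    shuffled = list(records)  # work on a copy; leave caller's list intact
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# 267 models -> 200 trainers and a 67-model test set, as in the text
records = [f"LOCUS_{i}" for i in range(267)]
train, test = random_split(records, test_size=67)
print(len(train), len(test))  # 200 67
```

The split is random but reproducible here via the seed; the real script writes the two partitions to `.train` and `.test` files.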

All the files in the $AUGUSTUS_CONFIG_PATH/species/zebrafish directory were copied into a new directory called $AUGUSTUS_CONFIG_PATH/species/fathead2. After modifying all the files in the new directory appropriately, changing any occurrences of “zebrafish” to “fathead2”, I ran the Augustus training command on the 200 trainers, then ran Augustus on the 67 test samples using both the zebrafish parameter files and the newly trained fathead parameter files. Those commands were:

/data/progs/augustus-3.2.2/bin/etraining --species=fathead2 DANRE_trainees.gb.train

/data/progs/augustus-3.2.2/bin/augustus --species=fathead2 DANRE_trainees_1-9-18.gb.test >test.out

/data/progs/augustus-3.2.2/bin/augustus --species=zebrafish DANRE_trainees_1-9-18.gb.test >zebratest.out

The fathead parameters performed quite badly (~9% gene level sensitivity), but considerably better than zebrafish parameters (<4.5% gene level sensitivity), so at this point it was determined to leave the zebrafish parameter files behind and move forward with the new fathead parameter files.


Returning to the initial set of 321 potential trainers, 11 of the 321 models were removed because they might be too similar to other models in the training set; Augustus recommends that training sequences be <70% identical. Similarity was assessed by blasting the 321 models against themselves, and the eleven deemed too similar to other potential trainers were removed. The final list of 310 was saved in "DANRE_trainees_final_list.txt", and the trainer gff file was now called DANRE_trainees_final.gff (and DANRE_trainees_final_noFasta.gff, which had the fasta sequences removed from the bottom of the gff file, for use in the next step).
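The <70% identity screen amounts to blasting the trainers against themselves and greedily dropping one member of each too-similar pair; a sketch of that logic follows (the function and its input format are illustrative, not the actual custom perl script):

```python
def drop_similar(ids, self_blast_hits, max_ident=70.0):
    """Greedily drop one member of every pair of training models that are
    more than max_ident percent identical, per Augustus's guidance.
    self_blast_hits: (query_id, subject_id, percent_identity) tuples from
    blasting the training set against itself (hypothetical sketch)."""
    removed = set()
    for q, s, ident in self_blast_hits:
        if q == s or ident <= max_ident:
            continue  # ignore self-hits and acceptably dissimilar pairs
        if q not in removed and s not in removed:
            removed.add(s)  # keep the query, drop its partner
    return [i for i in ids if i not in removed]

ids = ["m1", "m2", "m3", "m4"]
hits = [("m1", "m1", 100.0), ("m1", "m2", 85.0), ("m3", "m4", 60.0)]
print(drop_similar(ids, hits))  # ['m1', 'm3', 'm4']
```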

As a next step, different flanking region lengths were examined. Having now discovered the Augustus utility script gff2gbSmallDNA.pl, we used it to produce GenBank-formatted sequence files with different lengths of sequence flanking the 310 mRNA sequences, for use in further training and testing. Flanking regions of 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, and 6000 bases were produced.

Augustus was run (i.e., predictions were made) on all the files; the best gene-level sensitivity (0.31) was obtained at 2000 bases. However, 2000 bases of flanking sequence was considered a rather conservative flanking region size, so the 3500-base flanking region, which exhibited the second-best gene-level sensitivity (0.29), was selected to continue with. Example commands for this exercise were as follows:

gff2gbSmallDNA.pl DANRE_trainees_final_noFasta.gff defMod.fasta 3500 gbOutTrainees_3500.gb

(where "DANRE_trainees_final_noFasta.gff" is the gff file with the genomic coordinates of the trainer models, "defMod.fasta" is the file of genomic sequences in fasta format, and "gbOutTrainees_3500.gb" is the output file in GenBank format)

mv gbOutTrainees_3500.gb gbOut_3500.gb

(The "mv" command simply renamed the output gb file from the previous step.)

The 310 trainers were then split into training and test sets using Augustus’s randomSplit script.


/data/progs/augustus-3.2.2/scripts/randomSplit.pl gbOut_3500.gb 100

The previous command resulted in two files, gbOut_3500.train and gbOut_3500.test. The former consisted of 210 models for training, and the latter of 100 models for testing. The next two steps were to run the training command with the 210 trainers and then to run Augustus on the test set with the newly trained parameter files.

/data/progs/augustus-3.2.2/bin/etraining --species=fathead2 gbOut_3500.train

/data/progs/augustus-3.2.2/bin/augustus --species=fathead2 gbOut_3500.test >aug_gbOut_3500.test

The gene-level result line, extracted from the output file "aug_gbOut_3500.test", was:

gene level | 224 | 100 | 29 | 195 | 71 | 0.29 | 0.129 |
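For reference, the columns in such a "gene level" row can be read as predicted genes, annotated genes, true positives, false positives, false negatives, sensitivity (TP/annotated), and specificity (TP/predicted); this column interpretation is inferred from the numbers in this section, so treat it as an assumption. A small sketch that recomputes both values from the counts:

```python
def parse_gene_level(row: str):
    """Parse an Augustus 'gene level' accuracy row and recompute
    sensitivity (TP/annotated) and specificity (TP/predicted).
    Column interpretation is an assumption inferred from this document."""
    fields = [f.strip() for f in row.split("|") if f.strip()]
    pred, anno, tp, fp, fn = (int(x) for x in fields[1:6])
    return {"sensitivity": tp / anno, "specificity": tp / pred}

row = "gene level | 224 | 100 | 29 | 195 | 71 | 0.29 | 0.129 |"
acc = parse_gene_level(row)
print(round(acc["sensitivity"], 2), round(acc["specificity"], 3))  # 0.29 0.129
```

The recomputed values match the 0.29 and 0.129 quoted in the row, which is what motivated the column interpretation above.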

The next step was to examine how the addition of "hints" files would affect prediction. The combined BBDuk-trimmed adult and juvenile RNA fastq files, /media/f0609a36-47ab-468e-a96d-2d4196d0d1fa/STAR/r1.fastq and /media/f0609a36-47ab-468e-a96d-2d4196d0d1fa/STAR/r2.fastq, were broken into six pieces each and two-pass mapped with STAR to produce bam alignment files. STAR logs and output are in six subdirectories of /data/trainingTmp named "first4M", "0.5", "5b", "2", "3", and "4". Following completion of the mapping runs, the merge command from the program "bamtools" was used to prepare a consolidated bam file from all the individual bam files in the subdirectories. "wig" output was also generated during the STAR alignment runs, and all of the wig files (...strand1 .wig and ...strand2 .wig files) were concatenated. The Augustus utility script wig.pl was then used to produce combined wig files for the two strands. Commands found at http://gensoft.pasteur.fr/docs/augustus/3.2.1/tutorial2015/ittrain.html were then used to do hint generation (that page starts with the use of STAR to perform the mapping).

In this case the STAR mapping was done using only the 310 Maker training genes (+/- 3500 bases) as the mapping target, so that the "genomic" coordinates of the hints generated were compatible with the Augustus run done on the .gb file of the 310 genes +/- 3500 flanking bases. If mapped to the whole genome, the hint coordinates would have been of no use, since they would be based on the overall genome coordinates. The file /media/7c341871-b79f-4db9-9630-ea5ea1875b7f/makerFullRepeatModelerFix/maskedTraineeSeqs_3500.fas was what was mapped to (the STAR index for the sequence was created first, in directory /data/trainingTmp/traineeScaffsMaskedIndex). The STAR alignment described above was then done, and the subsequent commands (most of which were also discussed above) were:

/opt/smrtanalysis/install/smrtanalysis_2.3.0.140936/analysis/bin/bamtools merge -in first4M/Aligned.sortedByCoord.out.bam -in 0.5/Aligned.sortedByCoord.out.bam -in 5b/Aligned.sortedByCoord.out.bam -in 2/Aligned.sortedByCoord.out.bam -in 3/Aligned.sortedByCoord.out.bam -in 4/Aligned.sortedByCoord.out.bam -out all.Aligned.sortedByCoord.out.bam

/opt/smrtanalysis/install/smrtanalysis_2.3.0.140936/analysis/bin/bamtools header -in all.Aligned.sortedByCoord.out.bam >header

(The Augustus training materials said to obtain a bam header.)

/data/progs/augustus-3.2.2/bin/bam2hints --intronsonly --in=all.Aligned.sortedByCoord.out.bam --out=hints_1-29-18.intron.gff

cat */Signal.Unique.str2.out.wig >all.str2.out.wig

cat */Signal.Unique.str1.out.wig >all.str1.out.wig

perl wig.pl

cat all.str1.total.wig | /data/progs/augustus-3.2.2/scripts/wig2hints.pl --width=10 --margin=10 --minthresh=2 --minscore=4 --prune=0.1 --src=W --type=ep --radius=4.5 --pri=4 --strand="-" > hintsNeg.ep.gff

cat all.str2.total.wig | /data/progs/augustus-3.2.2/scripts/wig2hints.pl --width=10 --margin=10 --minthresh=2 --minscore=4 --prune=0.1 --src=W --type=ep --radius=4.5 --pri=4 --strand="+" > hintsPos.ep.gff

cat hints_1-29-18.intron.gff hintsPos.ep.gff hintsNeg.ep.gff > hints.gff

All hints were now in the hints.gff file. The next step was to try some testing employing the hints. The following Augustus runs were tried with the hints file.

/data/progs/augustus-3.2.2/bin/augustus --species=fathead2 gbOut_3500.test --softmasking=on --hintsfile=hints.gff --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg >3500_trainingWithHints.gff

/data/progs/augustus-3.2.2/bin/augustus --species=fathead2 gbOut_3500.test --softmasking=on --UTR=on --hintsfile=hints.gff --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg >3500_trainingWithHints_withUTR.gff

/data/progs/augustus-3.2.2/bin/augustus --species=fathead2 gbOut_3500.test --softmasking=on --UTR=on --hintsfile=hints.gff --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg --alternatives-from-evidence=true --allow_hinted_splicesites=atac >3500_trainingWithHints_withUTR_atac.gff

Gene-level sensitivity results for these Augustus runs are shown below:

############# +/- 3500 bp test set run with hints, no UTR
gene level | 386 | 100 | 39 | 347 | 61 | 0.39 | 0.101 |

############# +/- 3500 bp test set run with hints, and UTR option on
gene level | 215 | 100 | 70 | 145 | 30 | 0.7 | 0.326 |

######### +/- 3500 bp test set run with hints, UTR on, atac splice junctions on, and alternatives-from-evidence flag set
gene level | 231 | 100 | 74 | 157 | 26 | 0.74 | 0.32 |

The last of the three commands executed, with hints, UTR on, atac splices allowed, and the alternatives-from-evidence flag set, did the best job, producing a gene-level sensitivity of 0.74. The gene-level sensitivity results overall indicated that we had a well-trained parameter set.

At this point, five copies of the fathead2 parameter files in the Augustus/config/species directory were made and placed in five new directories under the Augustus/config/species directory (named fathead2_1, fathead2_2, ..., fathead2_5). The files in each new directory were then renamed and edited appropriately; for example, the file fathead2_utr_probs.pbl in directory fathead2 was copied to each of the five new directories and renamed there (fathead2_1_utr_probs.pbl in directory fathead2_1, fathead2_2_utr_probs.pbl in directory fathead2_2, fathead2_3_utr_probs.pbl in directory fathead2_3, etc.). The files were then edited to change any references to "fathead2" to reflect the new directory they were in (e.g., fathead2_1). A custom perl script, Random.pl, was used to make five different training and test sets from the 310 trainers (210 training and 100 test models in each; gbOut_3500_1-29.1.train/test through gbOut_3500_1-29.5.train/test). Augustus's etraining command was then run on each of the five training files, followed by running Augustus on the corresponding five test sets with each of the newly generated parameter sets. Example training and testing commands for the set named fathead2_1 were:


etraining --species=fathead2_1 gbOut_3500_1-29.1.train

/data/progs/augustus-3.2.2/bin/augustus --species=fathead2_1 gbOut_3500_1-29.1.test --softmasking=on --UTR=on --hintsfile=hints.gff --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg --alternatives-from-evidence=true --allow_hinted_splicesites=atac > gbOut_3500_1-29.1.test.gff

The gene-level sensitivity reported for the five sets ranged from 0.72 (fathead2_1) to 0.80 (fathead2_3). The Augustus script optimize_augustus.pl was then run using all 310 final training set sequences (the +/- 3500 bp set), focusing on the parameter set in /data/progs/augustus-3.2.2/config/species/fathead2_3. That command was:

optimize_augustus.pl --species=fathead2_3 --metapars=$AUGUSTUS_CONFIG_PATH/species/fathead2_3/fathead2_3_metapars.cfg --UTR=on --cpus=48 gbOutTrainees_3500.gb &>optimize_3500_1-31-18.out &

To augment the training set, the previously generated blast results of the Maker model proteins vs. nr were reassessed, allowing slightly more latitude in the coverage to the nr proteins (still excluding predicted (XP) proteins). The IDs of 553 new trainer candidates are in finalVsNrBest.ods; these new trainers were unique transcripts with approximately +/- 2% coverage to target proteins, based on alignment length/target length. (Although probably not ideal, six pairs of the 553 represented alternate transcripts from six genes, though the transcripts did return different best blast hits.) Combined with the 310 original trainers, there were now 853 trainers from Maker. Finally, to acquire additional trainers, BUSCO was run on the draft genome employing the newly optimized fathead2_3 parameters, and the best single-copy models from the Augustus gene predictions produced as part of the BUSCO run were harvested from the BUSCO results. Initially, 55 BUSCOs were added to the 853 Maker models, but additional BUSCOs were subsequently added, and after final overlap checking there were 1,145 non-overlapping trainers (825 from Maker, 320 from BUSCO) in the file /media/f0609a36-47ab-468e-a96d-2d4196d0d1fa/snap/all.trainees.final_3-21-18.noOverlaps.withFasta.gff3. In converting the original 908 and subsequent 1,145 models to "gb" format there was some loss of genes; the GenBank-formatted files contained 902 and 1,142 genes, respectively. Fathead2_2 was initially etrained with a "connected" set (meaning the models, rather than being stand-alone .gb file entries, were embedded within the genomic scaffold sequences) of 649 of the 902 models (maker.busco.trainees.connect.gb.train) and compared to the performance of fathead2_3. Fathead2_3 still appeared to perform better, but shortly after that test the fathead2_3 parameter folder became corrupted during a subsequent optimization attempt using all 902 trainers, so a switch was made to the fathead2_2 parameter folder, which had had the second-best gene-level sensitivity (0.77) during the earlier training/testing with the original 310 Maker models. At this point some uncertainty enters the picture: either fathead2_2 was etrained with some subset (or all) of the 1,142 augmented training sequences, or the optimization script was run with all 1,142. Most likely the optimization was run, which would have been typical, but unfortunately the records of this final training step were lost when a hard drive crashed before the results folder (finalAugustusTraining) was backed up, and enough time passed before this writing that I am unable to say for certain which path I took. In any case, the final HMM parameter files in "fathead2_2" are available, which is the most important thing, as they represent the parameter set used by Augustus for all predictions in the annotation runs.
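The coverage criterion used when screening blast hits for trainer candidates (alignment length within a small tolerance of the target protein length) can be sketched as follows; the function name and example values are illustrative, not from the original custom scripts:

```python
def coverage_ok(aln_len: int, target_len: int, tol: float = 0.02) -> bool:
    """True when the alignment length is within +/- tol of the target (nr)
    protein length -- the ~+/-2% coverage criterion described above."""
    return abs(aln_len / target_len - 1.0) <= tol

# A 403-residue alignment to a 400-residue target passes (0.75% over),
# while a 380-residue alignment fails (5% short).
print(coverage_ok(403, 400), coverage_ok(380, 400))  # True False
```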

S.2b.2. GeneMark-ES Training

GeneMark training was much simpler than Augustus training, and occurred in the folder /run/media/jmartins/999c9aa6-acfc-44e8-aabb-307e14fb6604/progs/gm_et_linux_64/gmes_petap. The command (below) can be found in the file run.cfg in the directory where the training occurred:

gmes_petap.pl --ET ../SJ.out.gff --min_contig 10000 --cores 46 --max_intron 200000 --sequence /data/clean.bbmap/defMod.fasta

where SJ.out.gff is a modified version of the splice junction output from STAR, produced by aligning the RNA reads to the dbg2olc genome assembly (/data/clean.bbmap/defMod.fasta). The SJ.out.gff file was produced by running the auxiliary GeneMark script star_to_gff.pl on the SJ.out.tab file produced by STAR. The input RNA reads to STAR were those that had been trimmed with BBDuk. The final hmm parameter file produced was named gmhmm.mod, and this was the GeneMark-ES parameter file employed for the Maker production runs.
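The star_to_gff.pl conversion takes STAR's SJ.out.tab splice-junction table and rewrites each junction as a GFF intron record. A minimal sketch of the same idea (a hypothetical reimplementation; the actual GeneMark script may differ in fields and attributes):

```python
def sj_tab_to_gff(lines):
    """Convert STAR SJ.out.tab records to GFF intron lines, in the spirit
    of GeneMark's star_to_gff.pl (hypothetical sketch, not the real script).
    SJ.out.tab columns: chrom, start, end, strand (0/1/2), intron motif,
    annotated flag, unique-mapping reads, multi-mapping reads, max overhang."""
    strand_map = {"0": ".", "1": "+", "2": "-"}
    out = []
    for line in lines:
        chrom, start, end, strand, _motif, _annot, uniq, _multi, _ovh = line.split()
        out.append("\t".join([
            chrom, "STAR", "intron", start, end,
            uniq,                  # score = unique-read support for the junction
            strand_map[strand], ".", "."
        ]))
    return out

gff = sj_tab_to_gff(["scaffold_1\t1000\t1200\t1\t1\t0\t42\t7\t38"])
print(gff[0])
```

The SJ.out.tab column meanings above follow the STAR manual; the choice to carry the unique-read count into the GFF score column is an assumption of this sketch.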

S.2b.3. SNAP training

The SNAP training was also considerably simpler than the Augustus training, and was accomplished in the manner described in the online Maker tutorial. The input file all.trainees.final_3-21-18.noOverlaps.withFasta.gff3, containing the 1,145 combined Maker and BUSCO gene models identified for use as trainers during Augustus training, along with the corresponding 794 genomic scaffolds, was fed as input to the Maker auxiliary script maker2zff to create files compatible with the SNAP training regimen. The command run on the gff file with fasta sequences was:

/data/progs/maker/bin/maker2zff -n all.trainees.final_3-21-18.noOverlaps.withFasta.gff3

This created the output files genome.dna and genome.ann. The training commands as outlined in SNAP's "00README" file (which are the same as outlined in the Maker tutorial) were then executed:

fathom genome.ann genome.dna -categorize 2000
fathom uni.ann uni.dna -export 2000 -plus
mkdir params (# create directory called "params")
cd params (# go into params directory)
forge ../export.ann ../export.dna
cd .. (# move back up one level out of params directory)
hmm-assembler.pl my-genome params > my-genome.hmm

The "params" folder was renamed to "fathead_params", and the final *.hmm file (my-genome.hmm) was renamed "fathead_3-21-18.hmm". This file was used as the input SNAP parameter file during the preliminary Maker runs described in section S.2c immediately below.

S.2c. Preliminary MAKER runs using the “c15dcphs.fa” assembly

Prior to doing the annotation "production runs" that make up the bulk of what is presented in Chapter 3, we used one of the last assemblies we produced, one that was not chosen as our "final" assembly, in two Maker runs. The assembly ("c15dcphs.fa") was one of the many we produced using the different approaches discussed in section 2.2b; the specific approach used to produce it is not critical. These preliminary runs are important for two reasons. The first is that the second of these runs was the first in which the use of rCorrector and TrimGalore was introduced into the overall annotation process. The potential of adding these two steps was considered after reading the recommendations at https://informatics.fas.harvard.edu/best-practices-for-de-novo-transcriptome-assembly-with-trinity.html.

The two Maker runs discussed here were done with identical inputs except for the Trinity (genome-guided) assembled reads used as the RNA transcript input to Maker. The first run used Trinity-assembled reads produced from our ~320M "raw" paired-end RNA reads. For the second run, prior to assembling the RNA reads with Trinity, the raw reads were first corrected with rCorrector and then trimmed with TrimGalore. The result of the two Trinity runs was telling: Trinity produced 1,014,577 output transcripts when the raw reads were used as input, but only 757,658 when the corrected and trimmed ("cleaned") reads were used. The real number of transcripts produced by FHM is expected to be significantly lower (likely <100,000), so the lower number of transcripts produced from the cleaned reads was seen as a positive.

A single iteration of Maker was run with each of the Trinity transcript sets. The run using the raw-read transcripts resulted in 32,210 gene models, while the run using the "clean"-read transcripts resulted in 31,397 gene models. The lower number of gene models was also viewed as a positive outcome, since it was closer to the number of known protein-coding genes in the closely related zebrafish. Based on the outcome of this evaluation, a decision was made to employ rCorrector and TrimGalore in the annotation production runs described in Chapter 3.

The second reason these Maker runs were important is that the output models from the run done with the "clean" input reads were used to train SNAP another time prior to the annotation production runs. SNAP was trained in the standard manner described in section S.2b.3, using all 31,397 Maker models, as recommended in the Maker tutorial. SNAP was then used to make predictions with both the originally trained parameter file described in section S.2b.3 (fathead_3-21-18.hmm) and the new parameter file. Predictions were made on a set of 800 test sequences ("snapTestSeqs.fas") extracted from the c15dcphs.fa assembly in regions where Maker had predicted high-quality gene models (assessed by blasting all 31,397 Maker models against our EBI reference proteins).

The model protein sequences output from each SNAP run were then blasted against the EBI reference proteins and assessed for coverage to the reference proteins. The proteins from the SNAP run that used the new parameters had higher-quality (coverage) predictions (data not shown), and thus the new SNAP parameter file (named "all.maker.snap2.hmm") was selected for use in the Maker production run.

As noted in the body of the paper, SNAP was retrained yet again between the two iterations of the Maker production runs, using the same steps outlined in section S.2b.3 above, with the input training models being all the Maker models output from the first production-run iteration of Maker. The new hmm parameter file generated by the retraining ("snap.retrained.hmm") was then used as input during the second production-run iteration of Maker.

During the preliminary Maker runs discussed in this section, retraining of Augustus and GeneMark-ES was also evaluated, using the same 800 high-quality Maker models extracted from the 31,397 total Maker models. In both cases the retrained parameter files did not produce output models with better coverage to the reference proteins than the models produced using the parameter files developed during the earlier species-specific training described in sections S.2b.1 and S.2b.2, so the parameters developed as described in section S.2b were used for the annotation production runs.

S.2d. Maker zebrafish run for qualitative comparison to Maker FHM process

Given that this was the first time our group had done a genome annotation, and correspondingly the first time we (I) had used Maker, we decided to do a comparative evaluation of the Maker pipeline using another species, attempting to the extent possible to mirror the entire process used for the Maker FHM run. Zebrafish, a member of the same family (Cyprinidae) as FHM with an extremely well-studied genome and much publicly available sequence data, seemed an ideal choice of organism for such a comparison. Given that this exercise was essentially a qualitative assessment, intended to give us some indication that the results we were seeing from Maker for FHM were reasonable, and because details of the steps associated with running Maker appear elsewhere in this dissertation, minimal details of this Maker run are provided.

To do the Maker run, NCBI's Short Read Archive was searched to gather RNA reads from different tissues and life stages of zebrafish. Table S.3.1 lists the reads that were downloaded as input to the annotation process. One of the downloaded sets (SRR5022640) had some unmatched reads in the paired-end files; a perl script was used to remove such reads. There were 323,482,000 pairs of reads used in the end. It should be noted that for FHM we had approximately 320M pairs of reads that were 150 bases long, whereas the downloaded zebrafish reads consisted of various shorter-length sets. Given the qualitative nature of the assessment we were doing and the large amount of data present in what was downloaded, the difference in read lengths wasn't considered critical.

Table S.3.1. Summary of downloaded zebrafish reads for the Maker annotation run

SRA Accession | Tissue | Spots | Bases (Gb) | Read length
SRR5378555 | juvenile male testis | 23,581,077 | 4.7 | 101
SRR5022640 | embryo 10 hpf | 21,629,414 | 4 | 100
SRR1609753 | muscle | 18,159,812 | 3.6 | 100
SRR5893042 | pooled male and female embryo | 22,375,051 | 3.4 | 76
SRR5893166 | pooled male and female embryo | 20,266,392 | 3.1 | 76
SRR891504 | liver | 28,442,972 | 2.9 | 50
SRR6308293 | blood | 20,461,146 | 3.8 | 98/89
SRR5893071 | pooled male and female embryo | 18,148,267 | 2.8 | 76
SRR891495 | heart | 27,037,328 | 2.8 | 51
SRR630469 | embryo 5 dpf | 20,109,108 | 2.1 | 51
SRR5893069 | pooled male and female embryo | 15,220,175 | 2.3 | 76
SRR5342780 | ovary | 27,797,691 | 3.9 | 70
SRR6652897 | brain | 14,200,064 | 2.9 | 101
SRR4375303 | pooled male and female embryo | 11,840,646 | 1.8 | 76
SRR5599699 | mixed tissues (muscle, ovary, kidney, gill, liver, intestines, heart, brain) | 12,699,919 | 2.5 | 100
SRR891511 | brain | 15,662,615 | 1.6 | 51
SRR4375302 | pooled male and female embryo | 8,641,032 | 1.3 | 76
Total paired-end reads | | 326,272,709 | |

All of the paired-end files (once the unpaired reads were removed) were concatenated into two files, SRR_1.fq and SRR_2.fq. A genome-guided assembly of the reads was then performed using Trinity (version 2.4.0). A fasta file of the most recent version of the zebrafish genome assembly, GCF_000002035.6_GRCz11_genomic.fna, was downloaded from NCBI to use as the guide for the assembly process. The fasta file consisted of 1,923 chromosome/scaffold/contig sequences. Prior to running Trinity, the reads first had to be aligned to the genome. STAR (version 2.6.0a) was used to align the reads to the STAR-indexed genome and produce the coordinate-sorted bam alignment file required by Trinity for the genome-guided assembly.

Another input needed for the Maker run (if we wanted to mirror the process we used for FHM) was a species-specific library of repeats. RepeatModeler (version open-1-0.10) was run on the Trinity assembled reads to produce such a library. The instructions to accomplish this can be found here: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic.

Regarding the gene predictors to be used in Maker, Augustus has an existing parameter set for zebrafish, so the Maker control file maker_opts.ctl was altered to use the zebrafish parameters. For GeneMark and SNAP, primarily because of the (relatively) "quick and dirty" nature of the assessment we were trying to perform, we used the hmm parameter files we had developed for FHM during the species-specific training described in section S.2b above, rather than custom-training those predictors for zebrafish.

Once the Trinity-assembled reads and the custom repeat library were available and the decision about the predictor parameter files had been made, the Maker run was attempted. The maker_opts.ctl file was the same as that used for the initial FHM run, other than being altered to point to the appropriate input files and to point Augustus to its zebrafish parameter files. The other two Maker control files were unaltered relative to the respective FHM control files.

Results

99 of the 1,923 input zebrafish genomic sequences were shorter than the cut-off of 5,000 bases employed in the Maker run. Of the remaining 1,824 input genomic sequences, 1,352 (74.1%) had predicted gene models on them, and the total number of model predictions was 34,857. The published number of protein coding genes in 2013 (Howe, Clark, et al.) was 26,206, so using that figure as a basis, Maker “overpredicted” by 8,651 genes (33%). As only a single iteration of Maker was run on zebrafish, for comparison I looked at the results from the first iteration of the Maker production runs on FHM. 89 of the 910 FHM scaffolds were shorter than 5,000 bases and were not processed by Maker. 591 of the remaining 821 scaffolds (72.0%) had predicted gene models on them. The total number of predicted FHM models was 32,189. Given that the true number of genes in FHM is not known, we instead compared the predicted total to the total reported for the closely related zebrafish, 26,206, in which case the predicted FHM total reported by Maker was high by 23%. As a final comparison, we used STAR to compare the mapping rates of ~16M randomly selected paired-end reads from the respective input read sets to the respective transcript models for each species. Table S.3.2 shows the mapping rates reported for each species.

Table S.3.2. Comparison of mapping rates of ~16M paired-end reads to transcript models produced by one iteration of Maker.

                                        zebrafish    FHM
  Average input read length             151          300
  Average mapped length                 149.62       287.37
  Uniquely mapped reads %               28.76%       39.15%
  % of reads mapped to multiple loci    21.80%       20.39%
  % reads mapped total                  50.56%       59.54%
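The overprediction percentages quoted above, together with the internal consistency of the mapping rates in Table S.3.2, can be verified with a few lines of arithmetic. A minimal sketch (all counts are taken from the text; the helper function name is illustrative):

```python
# Published zebrafish protein-coding gene count (Howe, Clark, et al. 2013),
# used here as the reference point for both species' Maker totals.
ZEBRAFISH_PUBLISHED = 26_206

def overprediction(predicted, reference=ZEBRAFISH_PUBLISHED):
    """Return (excess model count, excess as a percentage of the reference)."""
    excess = predicted - reference
    return excess, 100.0 * excess / reference

zf_excess, zf_pct = overprediction(34_857)    # zebrafish, one Maker iteration
fhm_excess, fhm_pct = overprediction(32_189)  # FHM, first Maker production iteration
print(f"zebrafish: +{zf_excess} models ({zf_pct:.0f}%)")   # +8651 models (33%)
print(f"FHM:       +{fhm_excess} models ({fhm_pct:.0f}%)")  # +5983 models (23%)

# Table S.3.2 sanity check: unique % + multi-mapped % should equal total mapped %.
for species, unique, multi, total in [("zebrafish", 28.76, 21.80, 50.56),
                                      ("FHM", 39.15, 20.39, 59.54)]:
    assert abs((unique + multi) - total) < 1e-6, species
```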

The lower number of predicted models and the better mapping rates exhibited by FHM may have been due to FHM having approximately twice the number of input bases (as reflected in the average input read length in Table S.3.2); having more input data may have resulted in the production of better models.

Another factor influencing the output was that the zebrafish genome employed contained not only the primary assembly scaffolds but also alternate assemblies for some sections of the genome (~15% of the size of the primary assembly). This very likely increased the number of gene predictions, as duplicated genomic regions could produce duplicated genes. As this exercise was meant to be a simple comparative analysis, further investigation of how this influenced the results was not undertaken. The results in their existing form gave us confidence that the Maker pipeline, as we were employing it, was generating reasonable results for FHM, which was really what we wanted to know.


S.3. Additional Tables and Figures referred to in the text

Figure S.1. World market share, "Next Gen" sequencers (from https://cen.acs.org/articles/92/i33/Next-Gen-Sequencing-Numbers-Game.html). ABI is now a part of Thermo Fisher.


Figure S.2. Dramatic Decrease in Cost of Sequencing with the emergence of NGS/HTS


[Chart: "Eukaryotic Species Genomes Released Each Year as found in NCBI Genome," 1992–2015; y-axis 0–700 genomes per year]

Figure S.3. Genomes released each year (data from ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/)


Figure S.4. PacBio SMRT read lengths (from http://www.pacb.com/smrt-science/smrt-sequencing/read-lengths/)


Table 4.S.1. Final probabilities of class membership assigned to all samples by glmnet and random forest algorithms when trained with copper (B = bifenthrin, C = copper, M = control). For copper and controls, results come from the five-fold cross-validation "by sample" performance evaluation. For bifenthrin, results come from applying the final classifier trained with all copper and control samples after five-fold CV tuning.

               glmnet             rf
  Sample      M       C        M       C
  B1        0.452   0.548    0.598   0.402
  B10       0.745   0.255    0.602   0.398
  B11       0.389   0.611    0.434   0.566
  B13       0.661   0.339    0.516   0.484
  B14       0.914   0.086    0.564   0.436
  B2        0.088   0.912    0.474   0.526
  B4        0.020   0.980    0.206   0.794
  B5        0.768   0.232    0.498   0.502
  B7        0.926   0.074    0.502   0.498
  B8        0.978   0.022    0.560   0.440
  C1        0.417   0.583    0.330   0.670
  C10       0.014   0.986    0.008   0.922
  C11       0.002   0.998    0.000   1.000
  C13       0.009   0.991    0.008   0.992
  C14       0.002   0.998    0.000   1.000
  C2        0.001   0.999    0.008   0.992
  C4        0.005   0.995    0.022   0.978
  C5        0.174   0.826    0.326   0.674
  C6        0.001   0.999    0.004   0.996
  C7        0.000   1.000    0.000   1.000
  C8        0.043   0.957    0.078   0.922
  M1        0.997   0.003    1.000   0.000
  M10       0.992   0.008    0.994   0.006
  M11       0.999   0.001    1.000   0.000
  M13       0.999   0.001    1.000   0.000
  M14       0.942   0.058    0.818   0.182
  M2        0.997   0.003    0.990   0.010
  M3        0.995   0.005    1.000   0.000
  M4        0.998   0.002    0.898   0.102
  M5        0.906   0.094    0.874   0.126
  M7        0.991   0.009    1.000   0.000
  M8        0.999   0.001    0.908   0.092
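To illustrate how the probabilities in Table 4.S.1 translate into hard class calls, the sketch below applies a conventional 0.5 decision threshold to the glmnet copper-class probabilities of the bifenthrin samples. The values are copied from the table; the thresholding rule itself is the standard default, not a detail taken from the dissertation's pipeline:

```python
# glmnet P(class = copper) for each bifenthrin-treated sample, from Table 4.S.1.
glmnet_copper_prob = {
    "B1": 0.548, "B10": 0.255, "B11": 0.611, "B13": 0.339, "B14": 0.086,
    "B2": 0.912, "B4": 0.980, "B5": 0.232, "B7": 0.074, "B8": 0.022,
}

def call_class(p_copper, threshold=0.5):
    """Assign the copper label when its probability exceeds the threshold."""
    return "C" if p_copper > threshold else "M"

calls = {sample: call_class(p) for sample, p in glmnet_copper_prob.items()}
called_copper = sorted(s for s, c in calls.items() if c == "C")
print(called_copper)  # ['B1', 'B11', 'B2', 'B4']
```

At this threshold, the copper-trained glmnet classifier assigns the copper label to 4 of the 10 bifenthrin-treated samples.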


[Bar chart, one group of bars per sample (B, C, and M series): total Reads, mappedTotal, mappedReadsR, mappedReadsF; y-axis 0–25,000,000 reads]

Figure 4.S.1. Total and mapped reads per sample. Gray bars indicate the number of reads that mapped to transcript models in the expected strand orientation; yellow bars display mappings to the opposite strand. Only reads mapping to the expected (reverse) strand were used for downstream analysis.


Figure 4.S.2. Density plots of raw count data for all 32 samples. All graphs except the lower right represent six samples. The plots indicate a uniform distribution of the count data across all samples. Humps at 0 represent features with no counts.


S.4. List of Publications

The following publications, which are currently in review or in preparation, are directly linked to the work described in this dissertation. The first is the publication in which the new genome assembly and annotations will be released, and the second through fifth all involve mapping RNA-seq data to the new gene models developed in aim one. All of these publications are planned for release during the current year.

1) Martinson J, Huang W, Toth G, Bencic D, Flick R, See MJ, Lattier D, Biales A, Kostich M. 2020. An improved annotated assembly of the Pimephales promelas genome. (in preparation; for submission to Genome Research, tentative target).

This publication will describe the new FHM genome assembly and annotations. It represents the culmination of our team’s multi-year effort to produce a significantly improved FHM genome that will finally allow omics-based research to be fully leveraged in this primary model organism central to aquatic toxicology research.

2) Kostich, MS, Bencic, D, Flick R, See, MJ, Martinson, JW, Huang, W, Toth, GP and Biales A. 2020. Comparing RNA-seq to microarrays for detecting chemical exposures using whole Pimephales promelas larvae. (in review).

The work in this publication was the first to employ the new genome and gene models in an RNA-seq experiment. It explores several factors, including mapping programs, sequencing depth, read lengths, and biological replicates. It also evaluates six different classifier algorithms, with the primary goal of assessing the utility of RNA-seq (using the new gene models) compared to microarrays in expression studies.

3) Martinson J., Bencic D., Flick R., Kostich M., Huang W., Lattier D., Biales A. 2020. Developing predictors of exposure for the pyrethroid pesticide bifenthrin and copper, chemicals with different modes of toxic action, in Pimephales promelas larvae using RNA-seq and a new annotated genome assembly. For submission to Environmental Toxicology and Chemistry (in review).

This publication, representing the work described in Chapter 4, will be the second to use the new genome and gene models in an RNA-seq expression experiment. A goal of our group at the EPA is to develop RNA-seq-based classifiers that are specific for different classes of chemicals. This paper evaluates the sensitivity and precision of predictors of exposure developed from copper- and bifenthrin-treated FHM. Bifenthrin and copper are chemicals with different modes of toxic action. Functional analysis based on the annotated gene models was used to find supporting evidence that the classifiers leveraged toxin-specific responses.

4) Toth, G.P., Lattier, D., Bencic, D., Kostich, M., Martinson, J., Flick, R. and Biales, A. 2020. Differential expression of coding and non-coding RNA in fathead minnow larvae exposed to ethinylestradiol. For submission to Journal of Environmental Toxicology and Chemistry


As part of a series of studies to find chemical-class-specific RNA classifiers for groups of chemicals, larval fathead minnows were treated with a potent synthetic estrogen, ethinyl estradiol (EE2), followed by RNA sequencing of coding (mRNA) and non-coding sequences (microRNA). Various classification algorithms were applied to these data and assessed for their potential to predict treated versus non-treated organisms. Several algorithms were successful at prediction with mRNA data from the high-dose and control organisms, with the metric "area under the receiver operating characteristic curve" approaching the ideal value of 1. Study of four lower EE2 doses is underway.

5) Fetke, J., Biales, A., Pilgrim, E., Martinson, J., See, M. J., Huang, W., Flick, R. 2020. Estrogen receptor alpha (ERα) is upregulated in male fathead minnows when fish are exposed to EE2. Author order after Fetke TBD. To be submitted to Aquatic Toxicology.

DNA methylation in the promoter region of genes is known to be associated with active transcription. This study examines the methylation profile of estrogen receptor alpha when an upregulation of ERα is induced by exposure to EE2.

The following publications, in preparation or already published, are those to which the author of this dissertation has contributed bioinformatic analyses since he began learning bioinformatics at the University of Cincinnati. For any listed but not yet published, it is anticipated that the publication will be released during the current year.

Darling JA, Martinson J, Pagenkopp-Lohan K, Carney KJ, Pilgrim E, Banerji A, Ruiz GM. 2020. Metabarcoding reveals high levels of ballast water borne biodiversity entering three major United States ports. For submission to Marine Pollution Bulletin.

Meredith C, Hoffman J, Trebitz A, Martinson J, et al. 2020. Evaluating the potential for DNA metabarcoding to replace morphological ID for estimating zooplankton composition in Lake Superior. For submission to Limnology and Oceanography (full author list and author order currently unknown).

Hoffman J, Meredith C, Pilgrim E, Trebitz A, Hatzenbuhler C, Kelly J, Peterson G, Lietz J, Okum S, Martinson, J. 2020. Comparison Between Morphology- and High Throughput Sequence-Based Taxonomic Identification of Fish Larvae Assemblages. For submission to TBD.

Darling, J, Martinson, J, Gong, Y, Okum, S, Pilgrim, E, et al. 2018. Ballast Water Exchange and Invasion Risk Posed by Intracoastal Vessel Traffic: An Evaluation Using High Throughput Sequencing. Environ Sci Technol. Sep 4;52(17):9926-9936. DOI: 10.1021/acs.est.8b02108.

Banerji, A, Bagley, M, Elk, M, Pilgrim, E, Martinson, J, Santo Domingo, J. 2018. Spatial and temporal dynamics of a freshwater eukaryotic plankton community revealed via 18S rRNA gene metabarcoding. Hydrobiologia 818(1): 71-86. DOI: 10.1007/s10750-018-3593-0.


Hatzenbuhler, C, Kelly, JR, Martinson, J, Okum, S, Pilgrim, E. 2017. Sensitivity and accuracy of high-throughput metabarcoding methods for early detection of invasive fish species. Sci Rep. 2017 Apr;13(7):46393. DOI: 10.1038/srep46393.

Nacci, D, Proestou, D, Champlin, D, Martinson, J, Waits, E. 2016. Genetic basis for evolved tolerance to dioxin-like pollutants in wild Atlantic killifish: more than the aryl hydrocarbon receptor. Molecular Ecology 2016 Nov;25(21):5467-5482. DOI: 10.1111/mec

Waits, E, Martinson, J, Rinner B, Morris S, Proestou D, Champlin D, Nacci D. 2016. Genetic Linkage Map and Comparative Genome Analysis for the Atlantic Killifish (Fundulus heteroclitus). Open Journal of Genetics, 6, 28-38. DOI: 10.4236/ojgen.2016.61004.

Kostich, M, Flick, R, Martinson, J, 2013. Comparing predicted estrogen concentrations with measurements in US waters. Environ. Pollut. 178C, 271-277

Maddaloni, M, Santella, D, Itkin, C, Kahn, H, Stephansen, S, Chang, M, Borst, M, Bourbon, J, Elsen, F, Martinson, J, O’Neil, M, Henning, C. 2013. Fish Tissue Analysis for Mercury and PCBs from a New York City Commercial Fish/Seafood Market, EPA/600/R-11/066F; U.S. Environmental Protection Agency.

Lamendella, R, Domingo, JW, Ghosh, S, Martinson, J, Oerther, DB. 2011. Comparative fecal metagenomics unveils unique functional capacity of the swine gut. BMC Microbiol. 11:103.

Maki, N, Martinson, J, Nishimura O, et al. 2010. Expression profiles during dedifferentiation in newt lens regeneration revealed by expressed sequence tags. Molecular Vision, (16):72–78
