Westphalian Wilhelms-University Münster

Master’s Thesis

Genome Annotation of Pogonomyrmex californicus

Author: Jonas Bohn

Examiner: Prof. Dr. Juergen Gadau, Prof. Dr. Wojciech Makalowski

A thesis submitted in fulfillment of the requirements for the degree of Master of Science

at the

Institute of Bioinformatics

April 2019

Declaration of Academic Integrity

I hereby confirm that this thesis on Genome Annotation of Pogonomyrmex californicus is solely my own work and that I have used no sources or aids other than the ones stated. All passages in my thesis for which other sources, including electronic media, have been used, be it direct quotes or content references, have been acknowledged as such and the sources cited.

(date and signature of student)

I agree to have my thesis checked in order to rule out potential similarities with other works and to have my thesis stored in a database for this purpose.

(date and signature of student)

Acknowledgements

I thank the whole team of Prof. Gadau and Prof. Makalowski for the excellent support and the friendly interaction. Thank you for always providing a contact person who was able to help me solve problems. I want to thank PhD student Reza Halabian for his very helpful support and educational conversations. In addition, I would like to thank the computer specialist Norbert Grundmann for his persistent and instructive support with several PC problems.

I dedicate this thesis to my mother and my deceased father who has always supported me.

Abstract

Background

During the last four decades, sequencing technologies developed over three generations, driven by changes in the scope of genome projects. In order to analyze this increasing volume of data, new genome annotation methods continue to be developed. Ants like Pogonomyrmex californicus belong to Myrmicinae, a very divergent subfamily. Unfortunately, just a few genomes from Myrmicinae have been sequenced so far. In order to increase the genetic knowledge about ants from this subfamily, P. californicus was annotated from a high-quality draft genome assembly.

Results

The analysis of the P. californicus sequence data resulted in the identification of 394,064 repeats, of which 30,292 are unclassified. These repeats mask about 22 % of the whole genome assembly. The structural annotation detected 22,844 unique proteins, which are encoded by 23,874 transcripts. About 9,000 proteins have detected Pfam domains. Of these, about 8,500 additionally have InterPro domains, and of those, about 5,500 additionally have gene ontology annotations. A function was detected for 49 % of the characterized unique proteins. In total, 1,572 ncRNAs were detected. These include 89 rRNA genes as well as 1,146 tRNA genes, which code for 20 amino acids. Additionally, 182 pseudogenes, 21 undetermined genes, and one suppressor gene were identified in this pool of RNAs.

Conclusion

This thesis resulted in an annotation-directed improvement and annotation of the high-quality draft genome assembly. A high number of duplications was registered in the genome assembly, which led to a larger assembly than expected. Additional sequence data as well as further processing of the assemblies would lead to better quality. As about 56 % of proteins were identified, the detection of functions of predictions as well as manual annotation of uncharacterized high confidence proteins are required for a more complete annotation. However, the agreement of the described protein functions with published proteins for P. californicus as well as the high number of identified high confidence proteins indicate good quality of the annotation. Improvement of the genome and transcript assemblies is necessary in order to complete the annotation and to remove potential error sources.

Contents

Declaration of Authorship
Acknowledgements
Abstract
Contents
List of Figures
List of Tables
Abbreviations

1 Introduction
  1.1 DNA Sequencing
    1.1.1 Next generation Sequencing (NGS)
    1.1.2 Third Generation Sequencing (TGS)
  1.2 Genome annotation
    1.2.1 Quality assessment
    1.2.2 Repeat identification and annotation
    1.2.3 Structural genome annotation
    1.2.4 Measuring the accuracy of gene prediction and gene annotation
    1.2.5 Functional genome annotation
  1.3 Pogonomyrmex californicus

2 Material and Methods
  2.1 Material
    2.1.1 genomic DNA data
    2.1.2 transcriptomic RNA data
  2.2 Methods
    2.2.1 Transcript assembly construction
    2.2.2 Genome annotation

3 Results
  3.1 Transcript assembly construction
  3.2 Repeat annotation
  3.3 Structural annotation
    3.3.1 GeneModelMapper (GeMoMa)
    3.3.2 MAKER
  3.4 Functional annotation
  3.5 non-coding RNA annotation

4 Discussion
  4.1 gDNA sequencing
  4.2 Genome assembly
  4.3 Transcript assembly
  4.4 Repeat annotation
  4.5 Structural annotation
  4.6 Functional annotation

5 Conclusion

6 Availability

A Quality of gDNA-sequencing
B Quality of RNA-sequencing
C Detection of duplicates
D Programs used for the Genome annotation
E Data from relative species
F Assessment of Annotation
G Repeat annotation
H Functional annotation

Bibliography

List of Figures

1 Sanger sequencing principle
2 Genome cost development
3 Development of sequencing
4 Gene prediction and gene annotation
5 Gene prediction statistic
6 Genome size estimations of ants
7 Cladogram of relative species
8 Karyotypes of P. californicus
9 Genome annotation work flow
10 Transcript assembly construction
11 GeMoMa annotation work flow
12 MAKER annotation procedure
13 AED distribution and functional classification
14 AED distribution of unknown proteins
15 Genome assembly sequence length distribution
16 BUSCO analysis of different P. californicus genome assembly versions
17 Missing genes in annotation
18 Source species of functional annotation
19 10x sequencing quality per base for forward reads
20 10x sequencing quality per base for reverse reads
21 RNA sequencing quality per base from forward RNA read done by Helmkampf et al. (2016)
22 RNA sequencing quality per base from reverse RNA read done by Helmkampf et al. (2016)
23 Duplicated transcripts in the genome
24 Assembly self-comparison
25 Similarity distribution for protein predictions
26 BUSCO assessment on genome assemblies
27 BUSCO assessment on transcripts
28 BUSCO assessment on proteins
29 tRNA classification distribution

List of Tables

2 Quality parameter of P. californicus genome assemblies
3 Classification of collected ants from Helmkampf et al. (2016)
4 Summary of transcript assemblies
5 Repeat annotation summary
6 Summary of GeMoMa predictions
7 Summary of MAKER predictions
8 Summary of functional predictions
9 Structural ncRNA predictions
10 Regulatory ncRNA predictions
11 Signals from ncRNA
12 DOGMA result summary
13 Summary of used programs
14 Summary of data sources from used relative species
15 Summary of genome assembly quality values
16 Summary of redundancy in published proteins and transcripts of relative species. This data is based on data sources from table 14
17 Comparison of genome assemblies
18 Ortho-DB v. 9: Species in DB
19 RepeatMasker result table from repeat annotation
20 Comparison of functional annotations from P. californicus. Public Genbank annotations and functional annotation of this thesis were compared.

Abbreviations

General:
cDNA    complementary DeoxyriboNucleic Acid
gDNA    genomic DNA
mRNA    messenger RiboNucleic Acid
ncRNA   non-coding RNA
miRNA   micro RNA
snRNA   small nuclear RNA
snRNP   small nuclear RiboNucleoprotein Particles
snoRNA  small nucleolar RNA
tRNA    transfer RNA
rRNA    ribosomal RNA
dNTP    deoxyriboNucleoside TriPhosphate
ddNTP   dideoxyriboNucleoside TriPhosphate
CDS     Coding Sequence
EST     Expressed Sequence Tag
Nt      Nucleotide
NCBI    National Center for Biotechnology Information
Contigs Contiguous sequence
NGS     Next Generation Sequencing
TGS     Third Generation Sequencing
SMS     Single-Molecule-Sequencing
SBS     Sequencing By Synthesis
SBL     Sequencing By Ligation
SLR     Synthetic Long Reads
SSR     Simple Sequence Repeat
TE      Transposable Element
LINE    Long Interspersed Nuclear Elements
SINE    Short Interspersed Nuclear Elements
SMRT    Single-Molecule Real-Time
LSV     Large Structural Variation
FPKM    Fragments Per Kilobase of exon per Million reads mapped
HMM     Hidden Markov Model
GHMM    Generalized Hidden Markov Model
MSP     Maximal Segment Pair
AED     Annotation Edit Distance
SC      Splice Complexity
CDA     Conserved Domain Arrangements
EGASP   ENCODE Genome Annotation ASsessment Project

File formats:
SAM     Sequence Alignment Map
BAM     Binary Alignment Map
GFF     General Feature Format
GTF     General Transfer Format

Tools:
BLAST   Basic Local Alignment Search Tool
CEGMA   Core Eukaryotic Genes Mapping Approach
BUSCO   Benchmarking Universal Single-Copy Orthologs
SNAP    Semi-HMM-based Nucleic Acid Parser

Statistics:
TP      True Positive
FP      False Positive
TN      True Negative
FN      False Negative
SP      SPecificity
SN      SeNsitivity
ACP     Average Conditional Probability
AC      ACcuracy

Chapter 1

Introduction

All organisms have a blueprint. This blueprint is encoded in deoxyribonucleic acid (DNA). It directs the structure of the cell, which is specific to the cell’s function. In eukaryotic cells, most of the DNA is located inside the nucleus. DNA was first recognized in 1869 by the physician Friedrich Miescher (Gerabek et al., 2005), who discovered nuclein as a substance from the nuclei of leukocytes (white blood cells). In 1896, the German scientist Albrecht Kossel detected nuclein in yeast (Felix, 1955). He described nuclein as a polymer with four different nucleotides, which he called adenine (A), guanine (G), cytosine (C), and thymine (T) (Felix, 1955). About five decades later, Oswald Avery proved that DNA is the material which contains the cell’s genetic information (Avery et al., 1944). He killed pathogenic pneumococci by heat and extracted different parts of the cells (proteins, lipids, polysaccharides, and DNA). After that, he applied each of the extracted substances to harmless pneumococci. Only the cells to which the DNA had been added took on the characteristics of the pathogen. Based on this result, he concluded that DNA is responsible for the characteristics of the pathogen (Avery et al., 1944). In 1949, Erwin Chargaff noticed that the molar ratios of total purines (adenine and guanine) to total pyrimidines (cytosine and thymine) in extracted DNA are very similar (Chargaff, 1950). This observation led to the nucleotide pairing system of DNA: adenine pairs with thymine and guanine pairs with cytosine. In the following years, several researchers became interested in the structure of DNA, which led to the discovery of the double helix conformation by James Watson and Francis Crick in 1953 (Watson and Crick, 1953). Their model replaced an incorrect structural interpretation by Linus Pauling and Robert Corey, who assumed that the bases pointed to the outside of the helix (Watson and Crick, 1953). The double helix conformation fulfilled the pairing prediction of Chargaff (1950). The model of Watson and Crick is a milestone on which further genetic research is based.

At the time, the true purpose of DNA was unclear. Five years later, Francis Crick published a concept of the general transfer of genetic information, which he called “the central dogma of molecular biology” (Crick, 1958). This publication deals with the information flow between three biopolymers: DNA, ribonucleic acid (RNA), and proteins. DNA and RNA are polymers of nucleotides, whereas proteins consist of amino acids. Crick hypothesized that once information has passed into a protein, it cannot be transferred back into RNA or DNA.

“Information means here the precise determination of sequences, either of bases in the nucleic acid or of amino acid residues in the protein.” (Crick, 1958)

Because of misunderstandings, he revisited his theory in 1970 (Crick, 1970). He described the information flow and classified the transfers of biological information into three groups. The first group, the general transfers, comprises three paths. One path transfers information from DNA to DNA (hereinafter referred to as DNA replication). Another path transfers information from DNA to messenger RNA (mRNA) (hereinafter referred to as transcription). The third path transfers information from mRNA to synthesized proteins (hereinafter referred to as translation) (Crick, 1970). These three information paths are called general transfers because they can occur in all cells (Crick, 1970). The second group, the special transfers, comprises the transfer from RNA to RNA (RNA replication), from RNA to DNA (reverse transcription), and directly from DNA to protein (direct translation) (McCarthy and Holland, 1965; Mark S. Bretscher, 1968; Crick, 1970). The third group was called unknown transfers because it contains transfers that have never been detected: protein to protein, protein to DNA, and protein to RNA (Crick, 1970). Subsequently, scientists focused on finding out what information is contained in DNA sequences. They needed to determine the order of nucleotides in the whole DNA of an organism in order to find the complete set of genes. This task is generally called genome annotation. Genome annotation can therefore begin only after sequencing data has been obtained from the DNA of an organism.

1.1 DNA Sequencing

In order to start annotating genomes, new technologies were needed. Robert Holley sequenced a yeast alanine transfer ribonucleic acid (tRNA) for the first time in 1964. It consists of 77 nucleotides (R. W. Holley et al., 1965). He used enzymes such as pancreatic ribonuclease and Taka-Diastase ribonuclease T1 to digest the RNA and analyzed the products by chromatography.

In December 1977, Frederick Sanger published a method to determine the sequence of nucleic acids in DNA with a chain-terminating approach (also called the dideoxy method) based on the principles of Wu and Kaiser (1968) (Sanger et al., 1977). He used modified nucleotides developed by Atkinson et al. (1969) to terminate the copy process of the DNA. These modified nucleotides are analogs of natural nucleotides and act as alternative substrates for the DNA polymerase. The enzyme is unable to continue the copy process because the modified nucleotides lack the 3’-hydroxyl group. These modified nucleotides are called 2’,3’-dideoxyribonucleotides (ddN) (also called dideoxyribonucleoside triphosphates (ddNTPs)). Sanger added each type of these modified nucleotides (ddA, ddG, ddC, and ddT) to the copy process (see Figure 1).

Figure 1: Sanger’s sequencing principle. The target sequence ’CGATGCT’ was passed to four different DNA polymerase experiments. Each of these includes one specific type of modified nucleotide (ddNTP), which was added to the normal deoxyribonucleoside triphosphates (dNTPs). This results in the sequences displayed on the left side. The synthesis stopped at the specific modified nucleotide. The visualization of each batch is displayed as a schematic polyacrylamide gel on the right side of the image. The resulting sequence is the complementary sequence to the target sequence. The direction of movement is indicated by the arrow on the left side of the gel.

The DNA polymerase reaction was performed once for each modified nucleotide type (Sanger et al., 1977). The idea was to stop the DNA chain extension whenever a modified nucleotide was inserted at a position where the unmodified nucleotide would normally be incorporated (e.g. ddA replaces A) (Sanger et al., 1977). The result is a set of DNA fragments of different lengths. The last nucleotide of each fragment is the modified nucleotide that stopped the reaction (Sanger et al., 1977). Thus, the position of the nucleotide in the sequence can be recognized from the length of the fragments in the batches (see Figure 1). The fragments were separated by electrophoresis on denaturing acrylamide gels, with each band representing sequences of a specific length (Sanger et al., 1977). Discontinuous acrylamide disc gel electrophoresis separates the DNA fragments by length in an electric field (Ornstein, 1962). The DNA is negatively charged and moves from the negative pole to the positive pole; shorter DNA segments move faster than longer DNA segments (Ornstein, 1962). The pattern of the bands in the gel shows the distribution of nucleotides in the synthesized DNA strand (Sanger et al., 1977). One consequence of this technique is that the last nucleotide would remain unknown, since the chain reaction terminates at the last nucleotide in all four batches. To fix this problem, known sequences were added to the end of the unknown sequences (Sanger et al., 1977). Sanger sequencing revolutionized DNA sequencing research based on the absence of radioactivity and the possibility of sequencing longer DNA segments through nucleotide-specific chain termination. Finally, the Polymerase Chain Reaction (PCR) was invented in 1986, when Kary Mullis developed a standardized enzymatic approach for copying long DNA fragments (Mullis et al., 1986).
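The chain-termination principle described above can be sketched in a few lines of Python. This is a minimal, idealized illustration and not part of the original method description: the function names are hypothetical, strand orientation is ignored (the synthesized strand is taken as the position-wise complement of the template, as in Figure 1), and it is assumed that every possible termination fragment is observed on the gel.

```python
# Complement lookup for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesized_strand(template):
    """Return the position-wise complement of the template strand."""
    return "".join(COMPLEMENT[base] for base in template)


def termination_lengths(template):
    """Map each ddNTP batch (A, C, G, T) to the fragment lengths at which
    chain extension stops in that batch (idealized: all stops occur)."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized_strand(template), start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence from the gel, from the shortest fragment
    (which migrates farthest) to the longest one."""
    base_by_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_by_length[length] for length in sorted(base_by_length))


lanes = termination_lengths("CGATGCT")
print(lanes)
print(read_gel(lanes))
```

Each fragment length occurs in exactly one batch, which is why sorting the bands by length across the four lanes recovers the synthesized sequence.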

1.1.1 Next generation Sequencing (NGS)

After the initial sequencing approaches, projects for sequencing genomes emerged (Pareek et al., 2011). Different groups started to sequence entire genomes from a variety of species, including humans (Pareek et al., 2011). One of the biggest genome projects was the Human Genome Project. It was an international project which started in 1990 with the aim of annotating the human genome (International Human Genome Sequencing Consortium et al., 2001). It ended in 2003 with 20,500 identified genes in a genome of three billion base pairs (International Human Genome Sequencing Consortium et al., 2001). This project was carried out by combining PCR technology and fluorescently labeled nucleotides with Sanger sequencing technology (International Human Genome Sequencing Consortium et al., 2001). Sequencing this genome cost about three billion US dollars. The Human Genome Project was expensive because it was labor-intensive and time-consuming. In parallel to the public Human Genome Project, Venter et al. (2001) sequenced the genome with a shotgun sequencing approach. This technology is based on sequencing fragments and subsequently reconstructing the genome. Venter et al. (2001) sequenced about 95 % of the human genome and found 26,000 to 38,000 genes. The further development of several sequencing technologies is based on shotgun sequencing. Sequencing technologies became more precise and faster (Pareek et al., 2011). As a result of this development, in 2004 the National Human Genome Research Institute (NHGRI) set the aim of reducing the cost of sequencing a human genome to $ 1,000 within ten years (Schloss, 2008). To reach this aim, the next generation of sequencing methods (NGS) was introduced (see Figure 2). Figure 2 displays the sequencing costs per genome over time. Moore’s law describes the trend of computing power (number of transistors per integrated circuit) in the computer hardware industry based on observations in research and development (see Figure 2) (Moore, 1998). This law describes a long-term trend of computational power doubling every two years, with an associated decrease in the price per transistor (Brock, 2006). Following this law, the price for the technical execution of sequencing a genome would decrease exponentially (see Figure 2). The real costs of sequencing a genome stayed very close to the trend described by Moore’s law until January 2008, when the sequencing centers transitioned from Sanger-based sequencing to next generation sequencing; from then on, the costs fell faster than this trend. This change in sequencing costs occurred four years after the first NGS method came up.

Figure 2: Genome sequencing cost development from 2001 to 2017. The logarithmically scaled y axis displays the costs of sequencing a genome, and the x axis shows the time scale.¹

¹ Source: www.genome.gov/27541954/dna-sequencing-costs-data/ (Date of access: 24th of January, 2019)

Sanger sequencing approaches and NGS methods have some similarities, since both rely on fragmented DNA and on the incorporation of modified nucleotides (Mardis, 2017). However, there are many differences in technology, time efficiency, cost, and sequencing volume. NGS methods have been developed in response to changing interests in genome data science, namely the study of genome variations and the annotation of personal genomes (van Dijk et al., 2018). Genome sequencing aims to represent all chromosomes of a species as a physical map of its genetic content (Ekblom and Wolf, 2014). Genome sequencing projects already deal with a consensus model of several individuals from one species. The final genome assembly would represent genetic variations such as insertion/deletion (InDel) polymorphisms, copy number variations, small-scale rearrangements, or tissue-dependent mutations (Ekblom and Wolf, 2014). Due to the limitations of Sanger sequencing, genetic variation was analyzed only on a small scale in the Human Genome Project. Projects like the 1000 Genomes Project came up in 2008 to find genetic variants in human genomes from different continents (The 1000 Genomes Project Consortium, 2015; Sudmant et al., 2015).

Figure 3: Illustration of breakpoints in the development of sequencing methods.²

However, NGS methods were introduced to achieve the goals of high efficiency, low costs, and high throughput. Characteristic of NGS methods are highly parallel sequencing and low time consumption in comparison to Sanger sequencing (Mardis, 2017). These techniques are divided into two classes, called sequencing by synthesis (SBS) and sequencing by ligation (SBL) (Goodwin et al., 2016). NGS procedures are known as short-read sequencing methods (Goodwin et al., 2016). The first NGS method to be developed was the pyrosequencing approach (SBS) from Roche in 2004 (Mardis, 2017).

² Sources:
www.nature.com/scitable/content/diagram-of-dna-double-helix-as-proposed-4453
www.yourgenome.org/stories/the-dawn-of-dna-sequencing
www.support.illumina.com/sequencing/sequencing instruments/hiseq-4000/ jcr content/right-par/image.img.jpg/1537121125165.jpg
www.forbes.com/sites/janetwburns/2018/01/29/handheld-device-gives-clearest-ever-view-of-human-genome-for-1000/#270f0e2f36a5
www.community.10xgenomics.com/t5/10x-Blog/Everything-you-wanted-to-know-about-Linked-Reads/ba-p/263
(Date of access: 15th of February, 2019)

Different strategies exist to replace the in-vivo cloning and amplification used in Sanger sequencing. Bead-based, solid-state, and DNA nanoball methods are the best known (Goodwin et al., 2016). The most popular SBS method is represented by Illumina, developed in 2006 (see Figure 3). This method results in reads of 150-300 base pairs (bp) in length. The Illumina approach uses in-situ sequencing, where DNA is linked by synthetic DNA primers to a flow cell (Mardis, 2013, 2017; Steven R. Head et al., 2018). The sequencing length is limited by technical challenges during the sequencing process and by the prior preparation of the DNA. Normally, a decreasing trend of the base-calling quality is noticeable along the read (see Figures 19 and 20). Nucleotide sequence detection is based on reversible dye-terminator nucleotides. These nucleotides emit specific light during the extension of the linked sequence from a single-stranded to a double-stranded sequence (Mardis, 2013). The emitted light is detected by a camera, and the polymerase can then continue the extension (Mardis, 2013). These reactions are highly parallelized, on the scale of thousands to many millions of sequencing reactions per run (van Dijk et al., 2018). Finally, the short sequences must be assembled into longer sequences. This is part of the computational analysis of SBS raw data. Assembling is basically the reconstruction of the fragmented template from overlapping short sequences (Ekblom and Wolf, 2014). The resulting assemblies are classified into different levels of completeness. The first level is the contiguous sequence (contig) level. Contigs are continuous linear stretches of overlapping short reads (Ekblom and Wolf, 2014). The next assembly level is the scaffold level, which consists of connected contigs (Ekblom and Wolf, 2014). The last level is the chromosome level, which requires joined scaffolds. Here, each sequence represents one complete chromosome of the organism (Allendorf et al., 2010). Nevertheless, short reads lead to strongly fragmented genomes, in which single nucleotide variations and short indels are detectable, but larger variations are not (van Dijk et al., 2018). De-novo whole genome assembly based on short Illumina reads could lead to a high rate of false positive motifs, especially in long repetitive sequences (Mardis, 2017; van Dijk et al., 2018). In order to improve this approach, 10x Genomics presented a new library preparation concept in 2015. They introduced an improvement of the Illumina shotgun sequencing technology, called the synthetic long reads (SLR) method (see Figure 3). This approach includes a new device for the fragmentation and library preparation step, which takes place before the sequences are added to the Illumina SBS machine (Mardis, 2017). The idea is to label fragments in order to reconstruct long sequences after short read sequencing. Barcodes are added to the fragments to generate libraries that are specific to individual long DNA fragments (Mardis, 2017). After Illumina sequencing, specific proprietary algorithms provide continuous read assemblies (Mardis, 2017). This method is able to capture haplotypic variations of diploid chromosomes.

1.1.2 Third Generation Sequencing (TGS)

Because of the systematic problems of short read assembly, a further generation of sequencing methods emerged in 2010. The current generation of sequencing methods is called third generation sequencing (TGS). These techniques are known as single molecule sequencing (SMS) methods. The advantages of the methods developed in this field are mainly the generation of long reads (higher completeness), real time sequencing (fast evaluation), and the absence of PCR amplification (saving time and money) (van Dijk et al., 2018). Long reads help to establish the order of a whole sequence, because the real sequence is covered by one read (fewer gaps in the assembly) (van Dijk et al., 2018). The first TGS method was the single-molecule real-time (SMRT) sequencing technique introduced by Pacific Biosciences (PacBio) (Eid et al., 2009). This introduced a new variant of sequencing with an immobilized DNA polymerase, which uses dye-labeled nucleotides and detects specific light emissions during the replication (Eid et al., 2009; Rhoads and Au, 2015). The light emissions are recorded as a movie and interpreted as nucleotide signals in the later evaluation of the movie (Eid et al., 2009; Rhoads and Au, 2015). This procedure can generate reads with a length of 10 kilobases (kb), unfortunately with an error rate of about 10 % - 15 % (Lu et al., 2016). This rate is significantly higher than the error rate of NGS methods (< 2 %) (Lu et al., 2016). Methods to decrease the base-calling error rate came up, e.g. building a consensus sequence of re-sequenced long reads (Lu et al., 2016). The sequencing machines were relatively big. In order to reduce the size of sequencing devices, Oxford Nanopore Technologies (ONT) introduced nanopore sequencing as a TGS method in 2014 (see Figure 3) (Jain et al., 2015). This technique is based on measuring changes in the electric current while DNA is passing through the pore. The differences between these measurements are interpreted as sequences of nucleotides. The read accuracy of the MinION platform is about 65 % - 88 % (Ashton et al., 2014; Laver et al., 2015; Ip et al., 2015), which corresponds to an error rate of about 12 % - 35 %. Its error rate is therefore significantly higher than that of PacBio sequencing. The high base-calling error rate is caused by the missing re-sequencing of the same template: one DNA strand can be sequenced only once. In order to compensate for high error rates in the final assembly, most TGS assemblers use overlap-layout-consensus (OLC) methods (Lu et al., 2016). These methods increase the confidence of the assembly by correcting errors and building long consensus sequences from long reads (Pop, 2009).
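How a consensus sequence lowers the error rate can be shown with a column-wise majority vote. The Python sketch below is an illustration only and rests on a strong assumption: the repeated reads are already aligned and have equal length, so only substitution errors are handled, whereas real long-read errors also include insertions and deletions. The function name is hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Return the column-wise majority base of pre-aligned reads that
    all have the same length (substitution errors only)."""
    if not aligned_reads:
        raise ValueError("at least one read is required")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all reads must have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        # most_common(1) returns the base with the highest count
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Three reads of the same template, two of them with one error each.
reads = ["ACGTAC", "ACGAAC", "TCGTAC"]
print(majority_consensus(reads))
```

As long as errors at a given position occur in fewer than half of the reads, the vote recovers the template base, which is why re-sequencing the same molecule several times reduces the effective error rate.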

1.2 Genome annotation

After the sequence assembly is constructed, the sequences must be analyzed to find out what information the data contains. The procedure of finding out the purpose of the sequences is called annotation. It is therefore necessary to detect the complete set of genes in the genome. For this detection, additional information from the transcriptome is necessary. The transcriptome is defined as the entirety of RNAs coming from a cell, tissue, or organism (Krebs et al., 2010, Ch. 4, p. 72). These RNAs are sometimes extracted under particular conditions in order to detect differences on the molecular level (e.g. differences in gene expression). RNAs do not only have a coding function for proteins; they also regulate processes such as transcription (Krebs et al., 2010, Ch. 13, p. 321). Genome annotation is divided into two processes. Structural genome annotation is the process of finding genes and/or their intron-exon structure in the sequence files (Yandell and Ence, 2012). This step uses the transcriptome to support gene structures. An incomplete transcriptome, caused by missing transcripts from different tissues or a low number of organisms, may lead to a lack of detected genes. The second process is called functional genome annotation; this is the process of describing the function of these genes (Yandell and Ence, 2012). Sequence similarity between detected proteins and known proteins from other species provides supporting evidence in this step. Incomplete functional annotations of proteins from related species in databases will lead to incomplete functional predictions for the detected proteins. Several such steps are included in the procedure of this thesis, which deals with the genome annotation of the harvester ant Pogonomyrmex californicus (see section 1.3).

One might well ask what the annotation of genomes actually yields after this great effort. This question is not easy to answer for ab-initio annotations of organisms without an existing genome annotation. We do not know which genes are present in these organisms, but we can make estimates based on observed characteristics such as a changing phenotype or behaviour. However, just as human genomes were annotated to detect genetic variations and genetic markers for different diseases, annotations of other organisms can lead to progress in the diagnostics of diseases or the detection of potential drug targets. For example, the genome annotation of the human gut microbiome led to the detection of three clusters (or enterotypes) which are independent of age, gender, or ancestry (Arumugam et al., 2011). This annotation result could be used for the diagnostics of colorectal cancer, diabetes, cardiovascular pathologies, and other diseases (Arumugam et al., 2011). Ultimately, the accurate functional annotation of different organisms could lead to medical applications such as diagnostics or drug discovery and build a basis for further research in genetics (Karp et al., 1999).

1.2.1 Quality assessment

Before annotation, quality assessment of the assembly is required. This step checks whether the assembly is ready for annotation. If the genome assembly appears complete, the genome annotation can start. Several parameters and programs are useful to estimate the quality and completeness of the assembled genome. Important parameters for evaluating the quality of the assembly are the N50 and N90 values. To obtain the N50 value, all contigs are sorted by length, starting with the longest, and their lengths are summed until half of the total assembly length is reached; the N50 is the length of the contig at which this happens (Ekblom and Wolf, 2014). The N90 value is obtained in the same way for 90 % of the assembly length. The N50 thus describes a kind of length-weighted median of contig lengths (Ekblom and Wolf, 2014). The larger the N50 value, the longer the sequences in the assembly are. Longer sequences indicate better assembly quality, because large structural variants (SVs) and other large sequence motifs can be detected more reliably (van Dijk et al., 2018). The median gene length is approximately proportional to the genome size, so an estimate of the minimum required N50 scaffold length can be made (Yandell and Ence, 2012). Unfortunately, there is no strict rule relating N50 values to the real quality of a genome assembly (Yandell and Ence, 2012). Other parameters are therefore also important. The number of "N" characters reflects the accuracy of the genome assembly, because "N" characters occur as gaps inside scaffolds (between contigs) or as sequencing errors inside contigs. Further quality parameters for a genome assembly are the average gap size and the average number of gaps per scaffold (Yandell and Ence, 2012). Another approach measures the quality of a genome assembly by searching for conserved genes that are specific to the phylogenetic clade of the target organism. Programs like Core Eukaryotic Genes Mapping Approach (CEGMA), Benchmarking Universal Single-Copy Orthologs (BUSCO), or DOGMA follow this approach (Parra et al., 2007; Simão et al., 2015; Dohmen et al., 2016). CEGMA is a pipeline for mapping the exon-intron structure of a set of conserved protein families in genomic sequences (Parra et al., 2007). BUSCO maps single-copy orthologous genes to assess genome, proteome or transcriptome completeness (Simão et al., 2015). DOGMA is a newer approach that assesses completeness at the transcriptome and proteome level by detecting specific Conserved Domain Arrangements (CDAs) (Dohmen et al., 2016). However, these tools estimate the completeness of genomes based on conserved motifs from annotations of related species, and missing annotations leave gaps in their databases. The resolution and informative value of these tools therefore depend on the number of available annotations from related species.
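The N50 and N90 definitions given above can be illustrated with a minimal Python sketch. The function name `nx_length` and the toy sequence lengths are assumptions made for illustration only; they are not part of the pipeline or the data used in this thesis.

```python
def nx_length(lengths, percent):
    """Return the Nx value of a list of sequence lengths.

    Sequences are sorted from longest to shortest and summed up; the Nx
    value is the length of the sequence at which the running sum first
    reaches `percent` % of the total length (percent=50 gives the N50).
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= percent * total:
            return length
    return min(lengths)  # not reached for 0 <= percent <= 100


# Hypothetical toy assembly (lengths in bp), not data from this thesis.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50 = nx_length(toy_lengths, 50)
n90 = nx_length(toy_lengths, 90)
print("N50:", n50, "N90:", n90)
```

Because the sequences are accumulated from the longest to the shortest, the N90 can never exceed the N50 of the same assembly.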

Because the quality of eukaryotic genome annotations was understood in different ways, levels were defined to classify assemblies and annotations of ready-to-publish quality (Chain et al., 2009). The first level of assembly quality is the "standard draft". Assemblies at this level consist of minimally filtered or unfiltered data, contain regions of poor quality, and are known to be relatively incomplete (Chain et al., 2009). The next level is the "high-quality draft" assembly. At least 90 % of the genome must be represented in the assembly, but sequencing errors and misassemblies are possible (Chain et al., 2009). The "improved high-quality draft" is the last level of assembly quality. In addition to the requirements of the "high-quality draft", it demands the absence of discernible misassemblies, a reduced number of gaps, and a low scaffold number (Chain et al., 2009). Annotations are also categorized in three levels: "annotation-directed improvement", "non-contiguous finished", and "finished" (Chain et al., 2009). "Annotation-directed improvement" is the first level of annotation; it presupposes verification and quality improvement of coding regions so that anomalies such as frame-shifts and stop codons are avoided (Chain et al., 2009). Repeat regions are not fully resolved, but gene models and annotations are biologically meaningful. "Non-contiguous finished" is the next level of annotation quality. This standard requires automated and manual improvement that resolves all gaps and reduces the low-quality regions of the genome, which are annotated and classified. The gold standard of annotation is the "finished" quality level (Chain et al., 2009). It requires less than 1 error per 100,000 base pairs (Chain et al., 2009). Additionally, each replicon is assembled into a single contig, and the annotation has been reviewed and edited (Chain et al., 2009). Remaining base-calling differences are only permitted where they reflect low-level biological variation; multiple sequencing platforms are used to reach this standard and to exclude systematic errors caused by differences in sequencing quality (Chain et al., 2009).

1.2.2 Repeat identification and annotation

The first step after assessing the quality of the source data is the identification and annotation of repetitive regions in the genome assembly. Repeats are sequences which occur at least twice in the genome. This procedure masks repetitive regions in the genome either as lower-case nucleotide characters (soft masking) or as 'N' characters (hard masking). This step is necessary before the identification and annotation of genes, because unidentified repeats can lead to wrong gene interpretations, e.g. by producing ambiguities in alignments (Treangen and Salzberg, 2011; Yandell and Ence, 2012). The functional prediction and phylogenetic development of repeats is a growing area of genomic research (Kapitonov and Jurka, 2008; Cordaux and Batzer, 2009; Witherspoon et al., 2009), as repeats contribute to genetic variation and genome instability (Yandell and Ence, 2012). About 50 % of the human genome consists of repeats (International Human Genome Sequencing Consortium et al., 2001), but only about 12 % of the genome of nematodes like Caenorhabditis elegans (The C. elegans Sequencing Consortium, 1998). The repeat coverage of the genome of a fire ant like Solenopsis invicta is about 33 %³. Repeat coverage thus varies considerably across the tree of life. For example, simple sequence repeats (SSRs), one class of repeats, show high variation in repeat number (Kashi and King, 2006). Several studies support the hypothesis that SSR variation plays a role in adaptive evolution, based on evidence from many molecular and phenotypic effects (Kashi and King, 2006). Repeats are classified into different groups, mainly low-complexity regions and transposable elements (TEs). TEs are themselves separated into two classes. Class I TEs ("retrotransposons") transpose via an RNA intermediate. Class II TEs ("DNA transposons") transpose by a cut-and-paste mechanism. TEs are known as "jumping genes" (McClintock, 1950) and seem to play a role in genome function and evolution (Bucher et al., 2012). Interspersed repeats (IRs) are DNA repeats of variable length that are separated by hundreds to millions of nucleotides. Long interspersed nuclear elements (LINEs) are IRs with a length of more than 300 bp. Short interspersed nuclear elements (SINEs) are IRs with a length of 100-300 bp (Kapitonov and Jurka, 2003, 2008; Treangen and Salzberg, 2011). Several types of TEs are known; for example, Alu elements are retrotransposon SINEs that occur 500,000 to 1,000,000 times in the human genome (Kazazian and Moran, 1998). Helitrons are DNA transposons that occur in eukaryotic cells of a wide range of species. These TEs are hypothesized to replicate by a rolling-circle mechanism (Thomas and Pritham, 2015). TEs have a complex classification, and not all of them have been discovered so far. Repeat identification and annotation is therefore a major challenge in the classification of genome assemblies and in genome annotation. Most repeats are poorly conserved, which makes them hard to detect. Tools for detecting and annotating repeats are either homology-based (McClure et al., 2005; Buisine et al., 2008; Han and Wessler, 2010) or de-novo tools (Bao and Eddy, 2002; Price et al., 2005; Smit and Hubley, 2008-2015). The repeat annotation performed in this project is described in section 2.2.2. The de-novo repeat identification in this thesis is mainly based on searching the genome for sequences from a repeat library (Smit et al., 2013-2015).
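The difference between soft masking and hard masking can be shown with a short Python sketch. The function `mask_repeats`, the toy sequence and the repeat coordinates are hypothetical; in practice, the coordinates would come from a repeat annotation tool, whose output format is not reproduced here.

```python
def mask_repeats(sequence, repeats, hard=False):
    """Mask repeat intervals in a nucleotide sequence.

    `repeats` holds 0-based, end-exclusive (start, end) pairs.
    Soft masking writes the repeat in lower case, hard masking
    replaces every repeat nucleotide with 'N'.
    """
    masked = list(sequence.upper())
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N" if hard else masked[i].lower()
    return "".join(masked)


# Hypothetical toy data: three tandem copies of TTAGG flanked by ACGT.
toy_sequence = "ACGTTTAGGTTAGGTTAGGACGT"
toy_repeats = [(4, 19)]

soft_masked = mask_repeats(toy_sequence, toy_repeats)
hard_masked = mask_repeats(toy_sequence, toy_repeats, hard=True)
print(soft_masked)
print(hard_masked)
```

Soft masking keeps the original nucleotides recoverable, whereas hard masking discards them; both variants leave the sequence length and thus all coordinates unchanged.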

1.2.3 Structural genome annotation

Structural genome annotation is the identification of potential gene structures in the genome. Several gene prediction and gene annotation programs were used to discover gene structures. Gene finders like the Semi-HMM-based Nucleic Acid Parser (SNAP) (Korf, 2004) or Augustus (Stanke and Morgenstern, 2005) are specialized in finding the most likely coding sequence (CDS) of a gene. Both gene finders use hidden Markov models to construct a gene model. Genome annotation software like MAKER also locates Expressed Sequence Tags (ESTs) in the CDS (Yandell and Ence, 2012). The main difference between gene predictors and annotation programs like MAKER is that annotation programs additionally report untranslated regions (UTRs) and alternatively spliced variants, both derived from EST evidence (see Figure 4) (Yandell and Ence, 2012).

³ Source: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Solenopsis_invicta/103/ (Release 103, date of access: 20 March 2019)

Figure 4: Differences between gene prediction and protein annotation. SNAP is a gene prediction program that finds the most likely CDS of a gene (shown in green). Exonerate is an alignment program that aligns sequences in a splice-aware fashion and detects UTRs and alternatively spliced variants (shown in orange). BLASTX is a program that searches for protein sequences using the BLAST algorithm (shown in yellow). This step provides protein evidence for the final annotation (shown in blue). [Source: Yandell and Ence (2012)].

Finally, an evidence alignment is useful to support the predicted gene model and to annotate the start and stop codons of all alternative splice forms (Yandell and Ence, 2012). This task is displayed in Figure 4 as the yellow part and is performed with BLASTX (Altschul et al., 1990), a program based on the BLAST algorithm. BLAST programs search for sequence similarities with a computationally less intensive method that directly approximates alignments optimizing the maximal segment pair (MSP) score (Altschul et al., 1990). BLAST-based programs handle sequences at the nucleotide, transcript, and protein level and mainly aim at finding similar sequences in a database. The process displayed in Figure 4 includes several additional steps, such as filtering of marginal alignments, clustering of redundant identical alignments, detection of overlapping alignments, and polishing, i.e. the realignment of predictions, all of which support greater accuracy in the final annotation (Yandell and Ence, 2012). The final annotation is a summary of filtered sequence evidence from different sources, such as mRNA or EST predictions and protein predictions. Sequence data from RNA sequencing projects are often used in addition to better delimit exons, splice sites and alternatively spliced exons (Yandell and Ence, 2012). The more evidence is available, the more complete and accurate the final annotation will be. With the increasing number of annotated species, some genome annotation pipelines support the use of annotations from related species. These annotations serve as supporting evidence for the annotation of the target species in ab-initio annotations.

RNA-seq data also help to verify the correctness of gene models in the final annotation. Several tools were developed to generate useful transcript assemblies for genome annotation (Langmead et al., 2009; Trapnell et al., 2009; Grabherr et al., 2011; Henschel et al., 2012; Trapnell et al., 2012; Kim et al., 2015; Pertea et al., 2015). These tools fall into two main classes: genome-independent and genome-dependent transcript assemblers. Genome-independent (also called de-novo) tools like Trinity reconstruct transcripts using de Bruijn graphs (Slater and Birney, 2005; Grabherr et al., 2011; Henschel et al., 2012). De Bruijn graphs are a common method for constructing assemblies from short NGS reads, because they group overlapping sequences. Nodes are k-mers (contiguous subsequences of length k taken from the reads), and an edge connects two k-mers that overlap by k-1 nucleotides (Chen et al., 2011; Grabherr et al., 2011; Henschel et al., 2012). Genome-dependent approaches like the Tuxedo pipeline map RNA-seq reads to the genome with respect to the intron-exon structure and construct transcript assemblies from these mappings (Langmead et al., 2009; Trapnell et al., 2009, 2012). Both strategies were used for transcript construction (see section 2.2.2).
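The idea behind a de Bruijn graph can be sketched in a few lines of Python. This is a simplified illustration with hypothetical toy reads and the assumed function name `de_bruijn_graph`; assemblers such as Trinity use far more elaborate data structures and error handling, which are not reproduced here.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a simple de Bruijn graph from a list of reads.

    Nodes are k-mers; an edge links two k-mers that follow each other
    in a read and therefore overlap by k-1 nucleotides.
    """
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    # Sorted successor lists make the result reproducible.
    return {node: sorted(successors) for node, successors in graph.items()}


# Hypothetical toy reads that overlap by four nucleotides.
toy_reads = ["ACGTA", "CGTAC"]
toy_graph = de_bruijn_graph(toy_reads, k=3)
for node, successors in sorted(toy_graph.items()):
    print(node, "->", ", ".join(successors))
```

Following the edges from ACG through CGT and GTA to TAC spells out ACGTAC, the sequence that merges both toy reads.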

Genome annotation projects sometimes have to estimate the total number of genes. It can be hard to say how many genes an organism has in total, because genes differ in their conservation and the RNA sequence data are rarely complete. Annotations may miss less conserved sequences, and not all RNA sequences that are needed as evidence are available, although they are the basis for transcript construction. Less conserved sequences such as long non-coding RNAs (lncRNAs) are one cause of these gaps in the annotation, but annotating them is also part of genome annotation. LncRNAs are non-coding RNAs (ncRNAs) of more than 200 bases (Khorkova et al., 2015). NcRNA genes do not code for proteins; they produce a functional RNA product (Mattick and Makunin, 2006). Different types of ncRNAs play central roles in regulation, replication, splicing, and protein synthesis (Mattick and Makunin, 2006). The weak conservation of ncRNAs at the sequence level may be explained by their structural and regulatory function, for which the 3-D structure is more important than the sequence (Mattick and Makunin, 2006). Because molecules like tRNAs have a well conserved secondary and tertiary structure, several approaches (e.g. tRNAscan-SE) search for structural motifs based on this conservation (Lowe and Eddy, 1997). One important group among the ncRNAs are pseudogenes. Pseudogenes are hypothesized to have arisen from duplicated genes which became inactive during evolution (Khorkova et al., 2015). Ribosomal RNAs (rRNAs) are also classified as ncRNAs (Parks et al., 2019). As part of the ribosome, rRNAs are responsible for the translation of mRNA sequences into proteins. Interestingly, rRNAs are conserved in a species-specific manner: they are highly similar within a species but divergent between species (Parks et al., 2019).

There are two main strategies of structural gene annotation: the homology-based strategy and the ab-initio strategy. The homology-based strategy makes use of DNA, RNA, and protein sequences from related organisms (Keilwagen et al., 2016), whereas the ab-initio strategy uses mathematical methods rather than external evidence to identify gene structures (Yandell and Ence, 2012). Annotation pipelines like MAKER feed alignment evidence from different sources to gene prediction programs. This procedure is known as evidence-driven gene prediction. Combining different sources of evidence (e.g. annotations from related species) and retaining the models that are supported across these sources is becoming a more popular method for de-novo structural gene annotation (Shields et al., 2018). This is mainly due to the increasing availability of high-quality genome annotations in databases like GenBank.

1.2.4 Measuring the accuracy of gene prediction and gene annotation

A big problem in genetics is incomplete genome annotation. Using such flawed data leads to wrong interpretations. For that reason, measuring annotation quality and completeness is an important part of genome annotation. The human ENCODE Genome Annotation Assessment Project (EGASP) was presented in 2006 to assess prediction methods for protein-coding genes and the completeness of the current annotations of the human genome (Guigó et al., 2006). Gene finders such as SNAP and Augustus predict genes in the genome (Cantarel et al., 2008). These programs can be run with precompiled parameters trained on related organisms (see section 2.2.2). However, it is necessary to consider the limits in specificity and sensitivity of these methods, because they are not primarily based on biological evidence. In order to measure this, the underlying values need to be defined. The definitions are mostly based on the evaluation approach of Moisès Burset and Roderic Guigó (1996), who evaluated gene prediction programs on a large set of vertebrate sequences and measured the accuracy of predictions at the nucleotide and exon level. Sensitivity and specificity are used here as defined by Yandell and Ence (2012).

"Sensitivity (SN) is the fraction of the reference feature that is predicted by the gene predictor. (...) By contrast, specificity (SP) is the fraction of the prediction overlapping the reference feature (...)"

For clarification, these values need to be defined in a biological context (see Figure 5). True positives (TP) are the nucleotides in exons that have been detected correctly (Moisès Burset and Roderic Guigó, 1996; Yandell and Ence, 2012). False negatives (FN) are the exonic nucleotides of the reference gene model that are missing from the prediction (Moisès Burset and Roderic Guigó, 1996; Yandell and Ence, 2012). Finally, false positives (FP) are the nucleotides in the prediction which are wrongly predicted as coding nucleotides (Moisès Burset and Roderic Guigó, 1996; Yandell and Ence, 2012).

Figure 5: On the left side, a 2x2 contingency table shows the relations of the values used in equation 1.1, relative to the prediction and the reference. On the right side, the meaning of TN, FN, TP, and FP is shown in a biological context. Reality here means the reference prediction, which is regarded as the true prediction. [Source: Moisès Burset and Roderic Guigó (1996)]

\[ SN = \frac{TP}{TP + FN} \qquad SP = \frac{TP}{TP + FP} \tag{1.1} \]

At the exon level, SN is the number of correct exons divided by the number of actual exons in the reference gene model (see equation 1.1) (Moisès Burset and Roderic Guigó, 1996). SP is defined as the number of correct exons divided by the number of predicted exons (see equation 1.1) (Moisès Burset and Roderic Guigó, 1996). The traditional definition of specificity (SP = TN/(TN + FP)) has been replaced by equation 1.1 because TN (true negatives: non-exons correctly predicted as non-coding) tends to be much larger than TP, which makes the value hard to interpret and non-informative (Moisès Burset and Roderic Guigó, 1996). These definitions are based on exons and follow an all-or-nothing principle, but the calculations can also be based on exonic nucleotides (see the right side of Figure 5) (Yandell and Ence, 2012). The accuracy (AC) presented in equation 1.2 is a simplification of the Average Conditional Probability (ACP) from Anderberg (2014) in which the terms containing TN are excluded (Moisès Burset and Roderic Guigó, 1996; Eilbeck et al., 2009).

\[ AC = \frac{SN + SP}{2} \tag{1.2} \]

The Annotation Edit Distance (AED) is a measure that quantifies the amount of change between annotations (Eilbeck et al., 2009). The higher the AC value, the lower the AED. A low AED value reflects small annotation changes (Yandell and Ence, 2012).

\[ AED = 1 - AC \tag{1.3} \]

The AED calculation is included in annotation pipelines such as MAKER 2. The program calculates the specificity and sensitivity of the prediction (see equation 1.1) based on the overlapping nucleotides between predictions and references (see Figure 5) (Moisès Burset and Roderic Guigó, 1996; Holt and Yandell, 2011). From these values it calculates the accuracy of the prediction (see equation 1.2) as the degree of confidence, which is finally reported as the AED (see equation 1.3).
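Equations 1.1 to 1.3 can be followed step by step at the nucleotide level with the Python sketch below. The function names and the toy exon coordinates are assumptions made for illustration; the sketch mirrors the formulas above and is not the implementation used by MAKER.

```python
def exon_positions(exons):
    """Expand 1-based, inclusive (start, end) exons into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def annotation_accuracy(reference_exons, predicted_exons):
    """Compute SN, SP, AC and AED from overlapping exonic nucleotides."""
    reference = exon_positions(reference_exons)
    predicted = exon_positions(predicted_exons)
    tp = len(reference & predicted)   # exonic in reference and prediction
    fn = len(reference - predicted)   # exonic in the reference only
    fp = len(predicted - reference)   # exonic in the prediction only
    sn = tp / (tp + fn) if tp + fn else 0.0   # equation 1.1
    sp = tp / (tp + fp) if tp + fp else 0.0   # equation 1.1
    ac = (sn + sp) / 2                        # equation 1.2
    return {"TP": tp, "FN": fn, "FP": fp,
            "SN": sn, "SP": sp, "AC": ac, "AED": 1 - ac}  # equation 1.3


# Hypothetical gene models with two exons each.
toy_reference = [(1, 100), (201, 300)]
toy_prediction = [(51, 100), (201, 350)]
result = annotation_accuracy(toy_reference, toy_prediction)
print(result)
```

In this toy case the prediction misses 50 reference nucleotides and adds 50 nucleotides that are not in the reference, so SN and SP are both 0.75 and the AED is 0.25. An AED of 0 would correspond to a prediction identical to the reference.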

1.2.5 Functional genome annotation

Functional annotation describes the function of the predicted features. This part of the genome annotation connects the predicted structures (mostly CDS or proteins) with biological information. Several resources are available that store functional descriptions of proteins. Gene Ontology (GO) terms, for example, describe the molecular function in which a protein is involved (Harris et al., 2004). The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database which connects protein predictions with functions in biochemical pathways (Kanehisa, 2000). Another way to discover the function of a predicted protein is to use the MAKER protocols. This procedure is based on BLAST searches in databases of highly conserved proteins such as UniProt Swiss-Prot (The Uniprot Consortium, 2018), RefSeq or the non-redundant (nr) BLAST database (O’Leary et al., 2016). In this project, protein functions were assigned based on similarity searches.

"The discovery of sequence homology to a known protein or family of proteins often provides the first clues about the function of a newly sequenced gene." (Altschul et al., 1990)

A common procedure for the detection of functions is a similarity search using the BLAST algorithm. The detailed characterization of proteins with unknown function requires manual annotation. Manual functional annotation relies on detailed similarity searches at different sequence levels (nucleotide, transcript, and protein) and on the detection of conserved sequence patterns (e.g. protein domains) in order to increase the confidence in an assigned function.
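A similarity-based function assignment can be sketched as selecting the best database hit for each predicted protein. The Python example below assumes BLAST results in the tabular format with the twelve standard columns (as written by BLAST+ with `-outfmt 6`). The sequence identifiers, the E-value cutoff of 1e-5 and the function name `best_hits` are hypothetical and do not reproduce the settings of the MAKER protocols.

```python
import csv
import io

# The twelve standard columns of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=1e-5):
    """Return the best-scoring subject per query below an E-value cutoff."""
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        evalue, bitscore = float(row["evalue"]), float(row["bitscore"])
        if evalue > max_evalue:
            continue  # hit is not significant enough
        query = row["qseqid"]
        if query not in best or bitscore > best[query][1]:
            best[query] = (row["sseqid"], bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Hypothetical hits: two for gene1, one weak hit for gene2.
toy_rows = [
    ("gene1", "protA", "95.0", "200", "10", "0", "1", "200", "1", "200", "1e-100", "380"),
    ("gene1", "protB", "60.0", "180", "72", "2", "5", "184", "3", "182", "1e-30", "120"),
    ("gene2", "protC", "35.0", "50", "32", "1", "1", "50", "10", "59", "0.5", "25"),
]
toy_table = "\n".join("\t".join(row) for row in toy_rows)
hits = best_hits(toy_table)
print(hits)
```

Here gene1 would inherit the description of protA, while gene2 remains without a functional assignment because its only hit exceeds the E-value cutoff. Such cases are candidates for the manual annotation described above.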

1.3 Pogonomyrmex californicus

Ants are insects and thus belong to the invertebrates. There are seventeen extant subfamilies of ants (family Formicidae). The most diverse ones are Dolichoderinae, Formicinae, Myrmicinae and Ponerinae (see Figure 6) (Brady et al., 2014; Ward et al., 2015; Bolton, 2019). The subfamily Myrmicinae includes 6,711 valid extant species.

Figure 6: Phylogenetic relationships of ant subfamilies with respect to the estimated mean genome size and the species diversity of each subfamily (displayed as triangles in the tree). The tree was constructed by Tsutsui et al. (2008) based on data from Brady et al. (2006) and Moreau et al. (2006) (pg = picograms). [Source: Tsutsui et al. (2008)]

Pogonomyrmex californicus, Pogonomyrmex barbatus, and Solenopsis invicta also belong to this group (see Figure 7) (Bolton, 2019). Currently, 23 ant species of this subfamily have been sequenced (based on filtering for Myrmicinae on GenBank), which is about 0.3 % of all species in this subfamily. This trend also exists in other subfamilies. Ants are particularly interesting targets for behavioral research because they have a hierarchical system, with the queen at the top and the workers at the bottom. Additionally, ants such as P. californicus belong to the eusocial insects. Eusociality occurs in many Hymenoptera such as ants, bees and wasps as well as in termites (Isoptera). Eusociality is characterized by overlapping generations in which adults take care of young individuals (Hölldobler et al., 1990). Other characteristics of eusocial insects are morphological differences as well as differences in reproduction (in ants: between workers and queens) (Hölldobler et al., 1990).

Figure 7: Cladogram of ants related to P. californicus (Camponotus floridanus, Pogonomyrmex barbatus, and Solenopsis invicta). Apis mellifera, the honey bee, represents an out-group of the ants. Data from the genome annotations of these species were used as reference data for the genome annotation of P. californicus. Based on data from Moreau et al. (2006).

Furthermore, P. californicus is a member of the "harvester ants" or "seed-harvester ants" because it collects seeds. Due to its colour, P. californicus is characterized as a red ant or fire ant. This ant lives in the desert of California, USA. Harvester ants collect food and material from their environment to build their nests (Gordon, 1986). Because food and plant material are scarce in the desert, the ant has adapted its behaviour to this environment (Gordon, 1986). These ants engage in two types of harvesting: total harvesting (all food items are considered) and seed harvesting (only seeds are considered) (de Vita, 1979). P. californicus is an interesting subject for behavioral ecology because of the nest-founding habits of its queens. In the usual case, the nest is founded by a single queen who has proven her supremacy over all the others (haplometrotic). P. californicus colonies, however, tend towards cooperative nest founding, in which multiple queens coexist in one colony (pleometrotic). In order to find connections between gene expression/regulation and the social behavior of queens during nest founding, Helmkampf et al. (2016) carried out a differential gene expression analysis on P. californicus queens. They found that aggressive queens, both haplometrotic and pleometrotic, show stronger transcriptional changes than non-aggressive ones. Additionally, they discovered differences in gene expression between the two types of founding queens (haplometrotic vs. pleometrotic). They also found differentially expressed genes between the behavioral types (aggressive vs. non-aggressive) of queens. This analysis showed an up-regulation of several genes for chemical communication in non-aggressive queens. Furthermore, using a systems genetics approach, Helmkampf et al. (2016) found gene interconnections and co-regulation underlying aggressive behaviour in P. californicus.

P. californicus belongs to the Hymenoptera, in which males carry a single set of chromosomes (haploid) and females carry two sets (diploid). P. californicus has 16 chromosomes in the haploid state and 32 chromosomes in the diploid state (see Figure 8) (Taber et al., 1988).

Figure 8: Karyotypes of 32 chromosomes (16 chromosome pairs) from female P. californicus. Collection six is based on thirteen female individuals collected in 1985 in Bakersfield, California, USA. Collection seven is based on four individuals of the subspecies P. californicus estebanius sensu Pergande collected in 1985 in San Bernardino, California, USA. [Source: Taber et al. (1988)]

Chromosomes are classified based on their appearance, size, and number (Levan et al., 1964). An important characteristic for the morphological identification of chromosomes is the location of the centromere (Levan et al., 1964). The centromere is the point of contact between two chromatids (Levan et al., 1964). Chromosome pairs 1-10 from collection 6 in Figure 8 are metacentric (X-shaped), because the centromere is located in the middle of the chromatids. Chromosome pairs 10 and 11 in collection 7 are acrocentric, i.e. the centromere lies close to one end of the chromatids (see Figure 8) (Taber et al., 1988). Acrocentric chromosomes can fuse to form metacentric chromosomes (Schubert, 2007), which may change the number of chromosomes during evolution. The chromosome number of P. californicus is shared with thirteen other Pogonomyrmex species (Taber et al., 1988). One approach to determine the number of chromosomes in a genome is to search for telomeres. Telomeres are terminal repeated sequences on the chromosomes. Repetition of the motif TTAGG is characteristic of most insect telomeres.
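A search for telomeric repeats can be sketched with a regular expression that reports tandem runs of the TTAGG motif. The function name `telomere_repeat_runs`, the minimum of three copies and the toy sequence are assumptions for illustration. The sketch only searches the given strand, so the reverse-complement motif (CCTAA) would have to be searched separately.

```python
import re


def telomere_repeat_runs(sequence, motif="TTAGG", min_copies=3):
    """Find tandem runs of `motif` with at least `min_copies` copies.

    Returns a list of (start, end, copies) tuples with 0-based,
    end-exclusive coordinates on the given strand.
    """
    pattern = re.compile("(?:%s){%d,}" % (re.escape(motif), min_copies))
    runs = []
    for match in pattern.finditer(sequence.upper()):
        copies = (match.end() - match.start()) // len(motif)
        runs.append((match.start(), match.end(), copies))
    return runs


# Hypothetical scaffold: four copies at the start, two copies at the end.
toy_scaffold = "TTAGG" * 4 + "ACGTACGT" + "TTAGG" * 2
runs = telomere_repeat_runs(toy_scaffold)
print(runs)
```

Only the run at the start of the toy scaffold is reported, because the two copies at the end stay below the chosen minimum. In a real assembly, runs located at scaffold ends would be the candidates for chromosome ends.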

Some related organisms have already been sequenced (see Figure 7). Genomes of related ants are important for genome annotation because they can be used in homology-based or evidence-driven approaches. The closest annotated relative of P. californicus is P. barbatus (Smith et al., 2011), also called the "red harvester ant", which belongs to the same genus, Pogonomyrmex. Its genome was sequenced using the 454 pyrosequencing approach (see Table 15), and 1,200 genes were manually annotated (Smith et al., 2011). The next closest relative is S. invicta, which also belongs to the fire ants. These three ant species belong to the subfamily Myrmicinae. The genome differences between these species may reflect the divergence within Myrmicinae. Camponotus floridanus is a member of the subfamily Formicinae, which contains 3,159 valid extant species (Bolton, 2019). This ant was used as an out-group for the Myrmicinae ants. Apis mellifera (the honey bee) is not an ant, but as it is well annotated and a member of the Hymenoptera, it was used as an out-group of the ants. Ants and other insects are very often involved in genome sequencing projects because of their relatively small genome size (see Figure 6). The mean genome size of ants was determined as 361.8 Mb, whereas the human genome is about ten times larger (Venter et al., 2001; Tsutsui et al., 2008). As mentioned above, subfamilies like Myrmicinae and Formicinae are very diverse, with about six thousand and three thousand species, respectively (Bolton, 2019). Tsutsui et al. (2008) estimated the haploid genome size of P. californicus as 249.5 Mb using flow cytometry.

Chapter 2

Material and Methods

This chapter contains detailed descriptions of the genome annotation pipeline and the data used for it. Sequence isolation and all sequencing steps, as well as the collection of ants and the assembly of the genome, were performed by collaborators before this project started. The execution of REPET as part of the repeat annotation was also done by collaborators. The assembly of the transcriptome (described in Section 2.2.1) as well as the whole genome annotation process was done by myself. Because the data and their preparation are nevertheless important for this project, they are described in this thesis.

2.1 Material

The genomic DNA (gDNA) of Pogonomyrmex californicus was sequenced with the 10x Genomics pipeline (see Section 1.1.1). This pipeline includes the preparation of the gDNA (barcoding and library construction) with the Chromium Genome approach and the polishing of the genome assembly. The pipeline uses a short-read sequencer such as the Illumina HiSeq 3000. The resulting reads were assembled with Supernova 1 and Supernova 2 to generate genome assemblies. See also Section 4.1 for more information.

Various data sources were used for transcript construction. Transcript data from Helmkampf et al. (2016) and transcript data from in-house MinION sequencing are included in the transcriptome construction pipeline (see Section 2.2.1). Both sources were used because the RNA was derived from different tissues: the published data are based on the head of the ant, and the in-house MinION data are based on the whole body.


Sequence data from related species used in the genome annotation pipeline were downloaded from GenBank, provided by the National Center for Biotechnology Information (NCBI). A summary of the data sources is given in Appendix E, Table 14.

2.1.1 Genomic DNA data

The genomic data are derived from different populations of P. californicus. Thirteen male ants were collected before the de-novo genome sequencing was done. The gDNA was extracted using the Qiagen MagAttract HMW DNA Kit with the protocol for tissue (P22). This resulted in 4,575 nucleotides (nt) of isolated gDNA. About 1.2 ng of gDNA was sequenced by Illumina shotgun sequencing. The paired-end sequencing yielded 269,953,173 reads with a length of 150 bp.

Table 2: Quality parameter from P. californicus genome assemblies, generated with two different versions of the Supernova assembler (Supernova 1 and Supernova 2) based on the 10x Genomics reads.

Parameter                      Assembly 1            Assembly 2
                               (from Supernova 1)    (from Supernova 2)
Assembly level                 Scaffold              Scaffold
Number of sequences            11,620                12,983
Sequence length [bp]           527 - 1,914,639       395 - 8,422,444
Median sequence length [bp]    2,151                 4,713
Mean sequence length [bp]      20,820                24,187
Total length [bp]              241,934,436           314,017,594
Percentage of N's              0.09 %                3.07 %
N50 [bp] (Scaffold)            270,734               1,497,864
N90 [bp] (Scaffold)            270,734               9,161

The genome assembly of P. californicus is based on 10x Genomics data, assembled with Supernova 1 and Supernova 2 (Weisenfeld et al., 2017) and polished after assembly with Pilon (Walker et al., 2014) (see Table 2). The genome assembly created by Supernova 2 was used (detailed discussion in Section 4.2). This genome assembly contains 12,983 sequences at scaffold level, with lengths from 395 to 8,422,444 nt. The N50 value is 1,497,864 nt. About 3 % of the assembly consists of "N" characters, which represent gaps between the contigs in the scaffolds or sequencing errors (see Table 2). The Supernova 2 assembly was chosen for genome annotation based on the average scaffold length and the other parameters listed in Table 2.
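The parameters in Table 2 can all be derived from the list of scaffold lengths. The following is a minimal sketch of such a calculation, assuming the lengths have already been read from the assembly file; the function names `nx` and `assembly_summary` are illustrative and are not part of Supernova or of the pipeline used in this thesis.

```python
from statistics import mean, median


def nx(lengths, fraction):
    """Length of the shortest sequence in the smallest set of longest
    sequences that together hold at least `fraction` of the total
    assembly length (fraction=0.5 gives N50, fraction=0.9 gives N90)."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return min(lengths)  # only reachable through floating-point rounding


def assembly_summary(lengths):
    """Collect the kind of quality parameters listed in Table 2."""
    return {
        "n_sequences": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "total": sum(lengths),
        "N50": nx(lengths, 0.5),
        "N90": nx(lengths, 0.9),
    }
```

For example, `assembly_summary([10, 8, 6, 4, 2])` reports a total length of 30, an N50 of 8 and an N90 of 4.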

2.1.2 Transcriptomic RNA data

The RNA data came from two sequencing approaches. In the first, RNA from the head of P. californicus was sequenced by Helmkampf et al. (2016) with the Illumina HiSeq 2000 sequencing system. In the second, RNA from the whole body of P. californicus was sequenced in-house with MinION nanopore sequencing (done by collaborators).

The RNA sequences from Helmkampf et al. (2016) were based on 42 specimens collected during the mating flight season in 2013. The specimens came from two different parts of California, USA. The collected insects were divided into six different categories depending on the mode of nest founding (haplometrotic = colonies are established by single, recently mated queens; pleometrotic = queens found colonies cooperatively) and the aggressive behavior of the queens (Helmkampf et al., 2016) (see Table 3).

Table 3: Classification of collected ants for differential expression analysis by Helmkampf et al.(2016).

Number of ants    Description
6                 haplometrotic singletons (Hs)
6                 pleometrotic singletons (Ps)
8                 haplometrotic pairs (Hp)
8                 pleometrotic pairs (Pp)
8                 haplometrotic and pleometrotic pairs (mixed pairs, Hp and Ph)
6                 aggressive haplometrotic (HA) with aggressive pleometrotic (PA)

After RNA was collected from the head, the material was sequenced using the Illumina HiSeq 2000 sequencing system. This resulted in 18.7 million reads containing 79.8 Gb. The raw RNA reads from the head of the ant were trimmed with Trim Galore (version 0.3.1) and assembled with the de novo assembler Trinity (Grabherr et al., 2011). The transcript assembly has a total length of 799 Mb, distributed over 311,726 sequences. Most of these sequences occurred at low abundance. Sequences with an expression level lower than 0.14 fragments per kilobase of exon per million reads mapped (FPKM) were filtered out. After filtering, 7,890 transcripts were left. These sequences were used for differential gene expression and coexpression analysis (Helmkampf et al., 2016).
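The FPKM threshold used in this filtering step can be illustrated with a small calculation. The sketch below uses the standard FPKM definition (fragments, divided by transcript length in kilobases and by mapped fragments in millions). It is not the procedure of Helmkampf et al. (2016), who relied on an expression estimation tool; the function names and the input layout are assumptions made for illustration.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def filter_by_fpkm(counts, total_mapped_fragments, min_fpkm=0.14):
    """Keep transcripts whose FPKM is at least `min_fpkm`.

    `counts` maps a transcript name to (fragments, length in bp);
    this layout is hypothetical and chosen only for the example.
    """
    kept = []
    for name, (fragments, length_bp) in counts.items():
        if fpkm(fragments, length_bp, total_mapped_fragments) >= min_fpkm:
            kept.append(name)
    return kept
```

With one million mapped fragments, a 1,000 bp transcript supported by 10 fragments has an FPKM of 10 and is kept, while a 10,000 bp transcript supported by a single fragment has an FPKM of 0.1 and is removed.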

The RNA sequences from third-generation MinION sequencing were based on the whole body of male ants. The target ants were collected in 2016 from a mixture of 5 polygamous colonies from Pine Valley, California, USA. The Oxford Nanopore Preparation Kit SQK-PCS108 was used to extract the RNA. The flow cell FLO-MIN107 R9 was used for sequencing. The TGS RNA sequencing resulted in 1,751,199 long reads, of which 394,085 reads containing 241 Mb passed the base-calling quality control. The sequence lengths ranged from 49 nucleotides (nt) to 6,182 nt. The collection of ants, the sequencing, and the preparation of the material were done by collaborators. All further analysis based on these data was done by myself.

After removing sequences shorter than 500 nt, the number of sequences decreased to 232,216, containing 180 Mb. The RNA assembly was calculated with Canu (v. 1.8). The resulting assembly contains 246 contigs with 196,563 nt (see Table 4). The sequence lengths in this assembly ranged from 501 nt to 1,775 nt, with a mean sequence length of about 799 nt.
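The length filter applied to the MinION reads can be sketched as follows. This is a minimal illustration that assumes the reads are available in FASTA format; the helper names are hypothetical, and whether reads of exactly 500 nt are kept is an assumption, since the text only states that shorter sequences were removed.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=500):
    """Keep records whose sequence has at least `min_length` nucleotides."""
    return [(header, seq) for header, seq in records if len(seq) >= min_length]
```

A call such as `filter_by_length(read_fasta(open("reads.fasta")))` would return only the long reads; the file name is a placeholder.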

2.2 Methods

The implemented genome annotation pipeline has different stages (see Figure 9). The programs used in this pipeline are summarized in the supplementary Table 13. The pipeline is separated into two main sections, the Transcript Assembly Construction section and the Genome Annotation section. Details of the Transcript Assembly Construction section are displayed in Figure 10. The input to this part of the pipeline came from different sequencing approaches, which are based on RNA extracted from various tissues of P. californicus. The main steps of the transcript assembly construction were trimming, aligning, and assembling. First, transcripts were trimmed or filtered. Afterwards, the trimmed or filtered sequences were aligned to the genome assembly in the aligning step. The whole transcript assembly was summarized and assembled in the assembling step. For annotation purposes, the transcript assembly was passed to genome annotation programs such as GeneModelMapper (GeMoMa) (Keilwagen et al., 2016) and MAKER 2 (Holt and Yandell, 2011). Before the genome annotation started, repeats in the genome were masked and annotated using RepeatMasker (Smit et al., 2013-2015) with a combined database. The repeat annotation (see Table 5) and the masked genome were passed to the genome annotation programs. GeMoMa was the first genome annotation program in the pipeline. It used data from previously annotated related species to generate gene models for annotating the target genome. Four annotated related insect genomes were used as reference genomes in the GeMoMa approach (see Figure 7). The transcript data from the transcript assembly construction pipeline were added to the annotation. The resulting prediction was handed to MAKER, which uses additional information to generate more accurate predictions. The MAKER pipeline is based on a self-training approach. It is intended to complete the P. californicus predictions and merge them with the GeMoMa predictions (Holt and Yandell, 2011).

[Figure 9 (flow chart). Transcript assembly construction: RNA, Trimming, Aligning, Assembling. Input: GENOME. Annotation pipeline: Repeat annotation, GeMoMa, MAKER 2, Genome annotation.]

Figure 9: Overview of the whole process for annotating the genome of P. californicus. The genome is displayed as a black box with unknown content. The RNA-seq data from P. californicus are also displayed as a black box with unknown content. Two work flows are shown in this graph: the upper one represents the Transcript Assembly Construction pipeline and the lower one represents the Genome Annotation pipeline. The constructed transcript assembly was passed to the genome annotation programs GeMoMa and MAKER so that the assembled transcripts could be used for the annotation of the genome.

2.2.1 Transcript assembly construction

The transcript assembly was generated with a set of programs that follow different gene-detection strategies (see Figure 10 and Table 13). The detailed work flow for the construction of the final transcript assembly includes several approaches to summarize results from different data sources. The RNA sequence data came from second- and third-generation sequencing approaches. A wide variety of programs was needed for the construction, because each program is suited either for short reads (NGS) or for long reads (TGS). Genome-independent transcript assemblies were aligned to the genome with LAST, which takes exon-intron structure into account (Kiełbasa et al., 2011). The results from two different versions of the Tuxedo suite were used to generate a basis transcript assembly (see Figure 10). This part of the pipeline was based on short RNA reads sequenced with the Illumina HiSeq 2000 sequencing system by Helmkampf et al. (2016). For the Tuxedo 1 suite, these raw RNA reads were separated by condition (see Table 3), following the recommendations of Trapnell et al. (2012). Assembling the transcripts separately increases the probability of detecting all available splice variants and reduces the chance of incorrect assembly (Trapnell et al., 2012). Subsequently, the separate assemblies were merged.

[Figure 10 (flow chart). MinION transcript construction (long reads): RNA (Body), Filtering (> 500 bp), Assembling (Canu), Aligning (LAST). Tuxedo Suite (1 & 2) transcript construction: Indexing (Bowtie2/HISAT2), Aligning (TopHat2/HISAT2), Assembling (Cufflinks/StringTie). Trinity transcript construction (short reads): RNA (Head), Trimming (Trim galore), Assembling (Trinity2), Aligning (LAST). M. Helmkampf et al. (2016): Assembling/filtering (Trinity1/RSEM), Aligning (LAST). Further elements: GENOME, Transcript assembly.]

Figure 10: Overview of the construction of the final transcript assembly based on data from different tissues and pipelines. Two RNA data sources were used. One is based on the whole body of P. californicus and comes from in-house TGS. The other is based on shotgun sequencing of the head of the ant, done by Helmkampf et al. (2016).

The transcript assembly from Helmkampf et al. (2016) is displayed in the lowest part of Figure 10. RNA without grouping into conditions (see Table 10) was used for the Tuxedo 2 suite and Trinity 2 (version 2.8.4). The raw RNA data were trimmed with Trim Galore (version 0.5.0) before they were handed to the Tuxedo and Trinity approaches. The published approach used Trim Galore (version 0.3.1) for trimming the raw data and subsequently assembled them with Trinity 1 (version r20131110) (Helmkampf et al., 2016). Tuxedo starts by building an index file from the genome, using Bowtie 2 (version 2.3.4.2) (Langmead and Salzberg, 2012) in Tuxedo 1 and HISAT2 (version 2.1.0) (Kim et al., 2015) in Tuxedo 2. The Tuxedo suites then align the RNA data to the genome assembly and identify splice junctions between exons (TopHat (version 2.1.2) with Bowtie (Trapnell et al., 2009) for Tuxedo 1 and HISAT2 (Kim et al., 2015) for Tuxedo 2). The transcripts were assembled and their expression was quantified (Cufflinks (Trapnell et al., 2012) for Tuxedo 1 and StringTie (version 1.3.4d) (Pertea et al., 2015) for Tuxedo 2). This transcript assembly construction depends on the genome assembly. Additionally, Trinity transcript assembly was carried out. One genome-independent assembly was generated by Helmkampf et al. (2016) with the first version of Trinity. Another genome-independent assembly was created with the current version of Trinity. These Trinity sequences were aligned to the genome using LAST (version 946) and merged with Cuffmerge (version 1.0.0). Additionally, the RNA reads from the whole body (MinION sequencing) were filtered by a minimum sequence length of 500 bp and assembled with Canu (version 1.8). The MinION sequences were aligned to the genome using LAST, in the same way as the Trinity assemblies. The merged Tuxedo assembly was used as the high-confidence model to which the other assemblies were added. Finally, the resulting assembly represented gene models from the merged Tuxedo assemblies 1 and 2, improved with data from the Trinity (1 and 2) and MinION assemblies. The final transcript assembly is therefore not based only on short RNA reads from the head of the ant; it has been extended with information from TGS data.

2.2.2 Genome annotation

The genome annotation begins with the repeat annotation of the genome (see Figure 9) and ends with the creation of a collection of predictions. The first step covers repeat annotation and masking. The final masked genome and the transcript assembly created in the transcript assembly work flow (see Figure 10) were the input files for the genome annotation pipelines. GeMoMa and MAKER are the two genome annotation pipelines that form the main part of this work flow (see Figure 9). First, GeMoMa was used to annotate the P. californicus genome based on previously annotated genomes from related species. Second, MAKER 2 extracted additional predictions, which may not occur in the related species. The result of this annotation pipeline consists of coding transcripts, predicted proteins, and detected tRNAs. Finally, functional annotation steps were performed to identify the functions of these predictions.

Repeat annotation

The first step in the genome annotation is the repeat annotation. The identification and masking of repeats in the genome includes several steps. RepeatModeler (version 1.0.11), a program for de novo repeat family identification (Smit and Hubley, 2008-2015), is the first program in the pipeline. It uses repeat-finding programs such as RECON (version 1.08) (Bao and Eddy, 2002) and RepeatScout (version 1.0.5) (Price et al., 2005) to find repeats, build consensus models, and classify them into repeat families. To increase the completeness of repeat finding, TEclass (version 2.1.3) (Abrusán et al., 2009) was used to classify unknown repeats coming from the consensus models of RepeatModeler. In parallel to the de novo approach, the REPET (version 2.5) pipeline was used to identify additional TEs in the genome assembly (Flutre et al., 2011; Quesneville et al., 2005). The REPET analysis was done by collaborators. These sequences were combined with the Hymenoptera-specific repetitive sequences from Repbase (version 22.07) (Bao et al., 2015) to build a library of classified repeats. These repeat libraries were passed to RepeatMasker (version 4.0.7) (Smit et al., 2013-2015) to find repeats in the genome assembly and mask them with lower-case letters (a, c, g, or t) in the genome (soft masking).
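Soft masking as described above only changes the case of the affected bases, so the masked share of a sequence can be read directly from the proportion of lower-case letters. The following sketch illustrates the idea; it is not RepeatMasker code, and the function names and the 0-based, half-open interval convention are assumptions made for the example.

```python
def soft_mask(sequence, intervals):
    """Return `sequence` with the given 0-based, half-open intervals
    written in lower case and all other bases in upper case."""
    bases = list(sequence.upper())
    for start, end in intervals:
        for i in range(start, min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


def masked_fraction(sequence):
    """Fraction of bases that are soft masked (lower case)."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return masked / len(sequence)
```

For instance, masking positions 2 to 4 of `ACGTACGTAC` gives `ACgtaCGTAC`, of which three tenths are masked.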

GeneModelMapper (GeMoMa)

GeneModelMapper (GeMoMa) is an annotation pipeline that finds protein-coding genes by using genome annotations from related species as reference annotations (Keilwagen et al., 2016). The reference annotations come from the GenBank database provided by NCBI (see Table 14 in Appendix E). First, GeMoMa extracts protein-coding exons from previously annotated reference genomes (Keilwagen et al., 2016) (see Figure 11).

Figure 11: Work flow of the GeMoMa annotation program. Items with a blue border represent input data, green boxes are GeMoMa modules, and grey boxes come from external modules. Multiple arrows indicate the use of multiple reference annotations, which were merged in the GeMoMa Annotation Filter. [Source: Keilwagen et al. (2018)]

Afterwards, the individual exons were mapped to locations on the target genome with tBLASTn (version 2.7.1) (Altschul et al., 1990). This step searches for coding sequences from the reference annotation in the target genome. The main module of the pipeline is the GeneModelMapper. This part of the pipeline tries to build gene models on the target genome from the tBLASTn hits (Keilwagen et al., 2016). Optionally, transcript assemblies from the target species can be used as evidence supporting the resulting models (Keilwagen et al., 2016). The transcript assembly of P. californicus was added to this pipeline as a binary alignment map (BAM) file for extracting introns. The models resulting from this step can vary greatly depending on the reference species used for the annotation. This depends on the quality of the reference annotation and the degree of relatedness between the reference and target species. The more closely related the reference species is to the target species, the better the gene models fit. Four species related to P. californicus (A. mellifera, C. floridanus, S. invicta, and P. barbatus) were used as references (see Figure 7). The predictions from these references were finally joined and filtered with the GeMoMa annotation filter (GAF) (Keilwagen et al., 2018) (see Figure 11). This is a new feature of GeMoMa that improves the final predictions and completes the resulting annotations (Keilwagen et al., 2018). The GAF removes spurious predictions based on a relative GeMoMa score, joins duplicated predictions, and filters for evidence from the reference annotations used (Keilwagen et al., 2018). In the end, the prediction with the highest score at each location in the genome was included in the final annotation (Keilwagen et al., 2018). Non-identical overlapping predictions with a high score were specified as alternative transcripts (Keilwagen et al., 2018).
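The selection logic of the filter step can be pictured with a strongly simplified sketch. It is not the GAF implementation: the relative score is taken here as score divided by protein length, the threshold of 0.75 is only an illustrative value, strand and alternative transcripts are ignored, and the `Prediction` record and `best_per_locus` function are hypothetical names introduced for this example.

```python
from collections import namedtuple

# Hypothetical record for one gene prediction on the target genome.
Prediction = namedtuple("Prediction", "name contig start end score aa_length")


def best_per_locus(predictions, min_relative_score=0.75):
    """Drop predictions with a low relative score, group the remaining
    ones into loci of overlapping predictions on the same contig, and
    keep the highest-scoring prediction of each locus."""
    kept = [p for p in predictions
            if p.score / p.aa_length >= min_relative_score]
    kept.sort(key=lambda p: (p.contig, p.start, p.end))

    loci = []
    for p in kept:
        last = loci[-1] if loci else None
        if last and last["contig"] == p.contig and p.start <= last["end"]:
            # Overlaps the current locus: extend it.
            last["members"].append(p)
            last["end"] = max(last["end"], p.end)
        else:
            loci.append({"contig": p.contig, "end": p.end, "members": [p]})
    return [max(locus["members"], key=lambda p: p.score) for locus in loci]
```

In this simplified picture, two overlapping predictions derived from different reference species compete for the same locus, and only the better-scoring one reaches the final annotation.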

MAKER 2

MAKER is an annotation pipeline for smaller eukaryotic and prokaryotic genome projects (Holt and Yandell, 2011). It aligns Expressed Sequence Tags (ESTs) and proteins to the genome and makes use of gene-prediction software such as Augustus and SNAP. Augustus predicts genes in eukaryotic genomic sequences using a Generalized Hidden Markov Model (GHMM) (Stanke and Morgenstern, 2005). SNAP produces specific hidden Markov models (HMMs) to model protein-coding sequences in gDNA (Korf, 2004). MAKER was used for homology-based prediction as well as for organism-specific annotation (Holt and Yandell, 2011).

The MAKER algorithm is mainly based on five steps (see Figure 12). The first step includes the analysis of low-complexity sequences (Cantarel et al., 2008). RepeatMasker (version 4.0.7) is used in this step to identify the repeats. Because the repeats had already been masked, this repeat masking step was skipped in the MAKER pipeline. Afterwards, the pipeline identifies ESTs, mRNAs, and proteins in the input genome assembly. BLAST (version 2.7.1) was used to align protein and RNA data to the genome assembly. This step produces protein and RNA evidence (Cantarel et al., 2008). MAKER allows the use of protein and RNA data from related species in this step to increase the completeness of the final predictions. Because the genome is soft masked, BLAST is not allowed to seed hits in masked regions, but it can extend hits into these low-complexity masked regions. The second step is about filtering and clustering. Filtering includes the identification and removal of marginal predictions and sequence alignments by taking scores, percent identities, etc. into account. After that, MAKER clusters overlapping alignments and predictions together and identifies redundant evidence. Subsequently, Exonerate (version 2.2.0 with glib version 2.47.3) (Slater and Birney, 2005) was used to polish BLAST hits by realigning them in a splice-aware fashion to the masked genome assembly, which might contain highly similar ESTs, mRNAs, and proteins (see Figure 4) (Cantarel et al., 2008).

Figure 12: Genome annotation algorithm of the MAKER pipeline. Starting with repeat masking of the genome assembly and ending with generating gene models by using protein and RNA evidence. [Source: Campbell et al.(2014)]

In parallel, ab initio gene predictors such as Augustus (version 3.3.1) and SNAP (version 2.34.2) were used as gene finders to identify potential genes in the masked genome (Cantarel et al., 2008). In this run the gene finders were trained on previously provided high-confidence gene models (see below). In the fourth step, MAKER produces evidence for annotations by merging information from the polished and clustered EST and protein alignments. These evidence-based hints were fed back to the gene finders to increase the accuracy of the final predictions (Cantarel et al., 2008). The last step is about filtering and annotating. The pipeline determines alternative splice forms and other features such as UTRs based on EST evidence from an entire pool of ab initio and evidence-supported gene predictions (Cantarel et al., 2008). MAKER recombines the high-evidence annotations from the previous step in order to find overlapping predictions and select the best matching model for the final annotation.

The MAKER annotation procedure for P. californicus consisted of four runs, which were done to train MAKER and thereby increase the accuracy of the annotation. The first run was performed to generate an annotation with all available sequence information from P. californicus and the related species (A. mellifera, C. floridanus, S. invicta, and P. barbatus). The annotations of these species had already been used in the GeMoMa run. The input data for the first MAKER run therefore contained the annotated repeats (the masked genome assembly and the repeat annotation), the transcript assembly, the transcript and protein data from related species, and the GeMoMa predictions. MAKER used the Augustus gene-prediction program. Nasonia was used as the Augustus reference species for the first round, because it belongs to the Hymenoptera, as does P. californicus. For the next run, BUSCO (version 3.0.2) trained Augustus on the P. californicus genome assembly (Waterhouse et al., 2017). This training resulted in a P. californicus-specific GHMM for creating more specific gene models, which are directly based on the genome assembly. This GHMM was used as the reference model in Augustus for the further MAKER runs. Additionally, SNAP was trained on confident genes. Confident genes are sequences from the first MAKER run with a length of at least 50 base pairs and a maximum Annotation Edit Distance (AED) of 0.25. The Augustus reference model and the HMMs produced by SNAP were used in the second MAKER run to increase the accuracy of the annotation. The input files of the third round were the output files of the second round. SNAP was trained again on the output files of the second run, in the same way as before. The resulting HMMs from SNAP, the specific gene models from Augustus, and the annotations from MAKER and GeMoMa were used to generate a merged final annotation in the third run of MAKER. This was done to produce a complete genome annotation. Finally, the last run was done to include protein isoforms in the prediction. The option alt_splice was set to 1 in order to find alternatively spliced isoforms. Further analysis considered the AED of the predictions. Normally, AED calculations are based on reference gene models. As ab initio genome annotations do not have a reference gene model, MAKER uses experimental evidence such as transcript evidence from ESTs or mRNA-seq or protein evidence from homologous species (Holt and Yandell, 2011). In further analysis, CD-HIT (version 4.7) was used to detect similar and duplicated sequences (Li and Godzik, 2006).
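The AED used to select confident genes can be illustrated at the nucleotide level: it is one minus the mean of sensitivity and specificity of a prediction with respect to its supporting evidence. The sketch below is a simplified illustration and not MAKER's implementation; exon coordinates are assumed to be 1-based and inclusive, the set-based overlap is only suitable for small examples, and the function names are hypothetical. The length criterion is interpreted in base pairs, as stated in the text.

```python
def covered_positions(exons):
    """Set of genomic positions covered by 1-based, inclusive exons."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def aed(prediction_exons, evidence_exons):
    """Annotation Edit Distance: 0 means perfect agreement with the
    evidence, 1 means no overlap at all."""
    pred = covered_positions(prediction_exons)
    evid = covered_positions(evidence_exons)
    if not pred or not evid:
        return 1.0
    overlap = len(pred & evid)
    sensitivity = overlap / len(evid)   # share of the evidence recovered
    specificity = overlap / len(pred)   # share of the prediction supported
    return 1.0 - (sensitivity + specificity) / 2.0


def is_confident(prediction_exons, evidence_exons, max_aed=0.25, min_length=50):
    """Selection rule for SNAP training genes as described in the text."""
    length = len(covered_positions(prediction_exons))
    return length >= min_length and aed(prediction_exons, evidence_exons) <= max_aed
```

A prediction that matches its evidence exactly has an AED of 0, while a prediction that shares only half of its bases with evidence of equal length has an AED of 0.5 and would not be used for training.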

Functional Annotation

The functional annotation describes and identifies the predictions from the structural annotation (see Section 1.2.5). The functional annotation of the detected genes includes the identification of proteins and the bio-medical pathways as well as their molecular function (GO annotation). This annotation was performed in three steps. Each step searches for similarity between the predicted proteins and annotated high-confidence proteins in published databases. If a new protein prediction shows similarity to a published protein, the functional description of the published protein is used for the protein prediction. After the first search, the proteins that remained uncharacterized were handed to the next step. The identification of similarities was based on BLAST searches (Altschul et al., 1990).

The first round was done with the UniProt database (The UniProt Consortium, 2018), which includes high-quality protein sequences with functional information. Tools from MAKER were used to identify similarities between the protein predictions from the P. californicus annotation and the described proteins in UniProt. Additionally, InterProScan (version 5.30-69.0) (Jones et al., 2014) was used to identify protein domains, which may give hints about the function of the proteins. As in the first round, the identification of similarities between proteins in the next steps was based on BLAST searches. The first round differed from the next two rounds of functional similarity searches in the databases and the identification methods used. The second search was based on the Pogonomyrmex-specific RefSeq database (O'Leary et al., 2016) and the last run was based on the non-redundant protein BLAST database (nr database). The nr BLAST database includes non-redundant protein sequences (no duplicated sequences) from several protein databases. Only characterized proteins were used for annotation in these two rounds (entries labelled "uncharacterized Proteins" or "Proteins with unknown function" were filtered out). The sizes of the databases differ: 12,578 Hymenoptera-specific proteins from the RefSeq database were used for the functional annotation in the second round (see Table 8) and the whole nr BLAST database for the final round (O'Leary et al., 2016). The nr database includes over 181 million classified protein sequences. After a BLAST search with an e-value of 10^-6 was performed, an in-house script was used to add the functional description of the proteins from the databases to the header of the predictions. The identification of functions in the second and third similarity searches was based on the alignment coverage of the proteins.
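The coverage-based labelling used in the second and third searches (the 70 % thresholds and the prefixes "Similar to" and "Part of" detailed in this section) can be sketched as follows. This is not the in-house script; the function name, its parameters and the handling of the e-value cut-off are assumptions made for illustration.

```python
def label_description(description, evalue,
                      aligned_on_query, query_length,
                      aligned_on_subject, subject_length,
                      max_evalue=1e-6, min_coverage=0.70):
    """Return the prefixed functional description for one BLAST hit,
    or None if the hit does not qualify.

    query   = predicted protein from the annotation pipeline
    subject = functionally described protein from the database
    """
    if evalue > max_evalue:
        return None
    query_coverage = aligned_on_query / query_length
    subject_coverage = aligned_on_subject / subject_length
    if subject_coverage > min_coverage:
        # The alignment spans most of the described database protein.
        return "Similar to " + description
    if query_coverage > min_coverage:
        # Most of the prediction aligns, but only to a part of the
        # database protein.
        return "Part of " + description
    return None
```

A prediction of 100 amino acids that aligns over 90 residues to a database protein of 300 amino acids would therefore be labelled "Part of", whereas the same alignment to a database protein of 110 amino acids would be labelled "Similar to".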
In these searches, predictions whose alignment covered more than 70 % of the whole protein were selected for functional annotation. If the alignment of a protein coming from the annotation pipeline covered more than 70 % of the functionally described protein in the database, the prefix "Similar to" was added to the functional description of the predicted protein. An additional calculation was carried out to detect proteins that are just parts of functionally described proteins. If the BLAST alignment covered more than 70 % of the predicted protein but less than 70 % of the functionally described protein in the database, the prefix "Part of" was added to the functional description of the predicted protein sequence. This situation may occur if the protein prediction is incomplete or if just a functional unit of the protein is described.

Chapter 3

Results

The whole annotation procedure for P. californicus resulted in 424,357 detected repeats, of which about 7 % remain unclassified. The structural annotation ended with 23,874 unique transcripts coding for 22,844 unique protein predictions and 1,350 identified tRNAs. The function of approximately 56 % of the predicted transcripts and proteins was identified.

3.1 Transcript assembly construction

The RNA data were aligned to the genome assembly by combining data from different sources, which include different sequencing methods as well as different pre-assembled transcripts (see Section 2.2.1). The RNA comes from the head of P. californicus as well as from the whole body. The RNA from the head was sequenced with Illumina short-read sequencing by Helmkampf et al. (2016). Whole-body sequencing was based on in-house TGS MinION long-read sequencing (done by collaborators). Because of the different sequencing methods, different assembly approaches were needed. To generate a final transcript assembly, long-read assemblies as well as genome-dependent and genome-independent short-read assemblies were merged (see Table 4). The Trinity assembler is a genome-independent assembler and the Tuxedo suite is a genome-dependent assembler. The procedure resulted in a transcript assembly with 51,604 sequences and transcript lengths from 105 bp to 30,234 bp. The sources were combined to increase the completeness of the transcript assembly. This transcript assembly was used for the genome annotation pipelines.


Table 4: Transcript assembly summary. The merged assembly represents the final resulting assembly from merging above listed assemblies (see section 2.2.1 for methods).

Assembly           Source          Number of     Assembly       Sequence        Median sequence   Mean sequence   N50 [nt]
                                   transcripts   size [nt]      lengths [nt]    length [nt]       length [nt]
Canu (v. 1.8)      Body            246           196,563        501 - 1,775     734               799             792
Tuxedo (v. 1)      Head            34,953        71,805,488     79 - 26,134     1,490             2,054           2,760
Tuxedo (v. 2)      Head            51,441        107,364,332    200 - 30,234    1,513             2,087           3,163
Trinity (v. 1)     Head            7,890         31,829,855     237 - 27,920    3,434             4,034           5,055
Trinity (v. 2)     Head            160,741       137,321,817    251 - 19,115    547               854             1,219
Merged Assembly    Head and Body   51,604        130,933,055    105 - 30,234    1,920             2,537           3,529

3.2 Repeat annotation

The identification and annotation of repeats in the genome assembly is the first step in genome annotation. The repeat annotation was done with a combination of repeat detection programs (see Section 2.2.2). The identified repeats cover 21.94 % (about 69 Mbp) of the genome assembly (see Table 5). Seven percent of the annotated repeats are still unclassified.

Table 5: Classification of repeats detected in the genome assembly. This Table shows summarized annotations from RepeatMasker (see section 2.2.2). For detailed annotation see Table 19.

Class                     Relevant types   Number of   Length           Percentage
                          (>1%)            elements    occupied [bp]    of sequence
DNA-elements                               84,097      27,798,513       8.85 %
                          Maverick         5,067       8,225,910        2.62 %
                          TcMar            14,969      4,153,558        1.33 %
LINE                                       10,024      3,975,615        1.27 %
SINE                                       689         84,031           0.03 %
LTR                                        16,156      11,449,544       3.65 %
                          Gypsy            7,517       7,856,582        2.5 %
Unclassified                               30,292      9,480,522        3.02 %
Interspersed repeats                       148,986     54,450,951       17.37 %
Low complexity repeats                     31,226      1,620,888        0.52 %
Simple repeats                             244,071     12,693,600       4.05 %
Total                                      424,357     68,780,482       21.94 %

Interspersed repeats form the largest group and cover 17.37 % of the genome assembly. The second largest group is simple repeats with about four percent. Based on the detailed report, 6 elements of the telomeric sequence TTAGG were detected by the repeat masking step. The sequence TTAGG occurs 117,113 times in the genome assembly.
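Counting a short motif such as TTAGG is a simple string search. The sketch below shows one possible way to count single occurrences and to locate tandem runs of the motif on the forward strand. It is not the method used to obtain the numbers above, and the function names are hypothetical.

```python
import re


def count_motif(sequence, motif="TTAGG"):
    """Number of non-overlapping occurrences of `motif` (case-insensitive)."""
    return sequence.upper().count(motif.upper())


def tandem_runs(sequence, motif="TTAGG", min_copies=2):
    """List of (start, end, copies) for runs of at least `min_copies`
    directly adjacent copies of `motif`; coordinates are 0-based and
    half-open."""
    motif = motif.upper()
    pattern = re.compile("(?:%s){%d,}" % (re.escape(motif), min_copies))
    return [(m.start(), m.end(), (m.end() - m.start()) // len(motif))
            for m in pattern.finditer(sequence.upper())]
```

Because TTAGG cannot overlap with itself, the non-overlapping count equals the total number of occurrences on the searched strand. Occurrences on the reverse strand (CCTAA) would have to be counted separately.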

3.3 Structural annotation

The structural annotation in this thesis is mainly based on two annotation pipelines that were executed one after the other (see Section 2.2.2). The whole procedure combines a homology-based approach and an ab initio pipeline.

3.3.1 GeneModelMapper (GeMoMa)

GeMoMa identifies protein-coding genes by using genome annotations from related species (reference species) (Keilwagen et al., 2016). The latest version of GeMoMa allows the use of several reference annotations to extend the scope of the predictions (Keilwagen et al., 2018), and the annotation pipeline makes use of this feature. Four related species were used as reference species for the P. californicus predictions (see Section 2.2.2). GeMoMa was executed four times, once per reference species, which resulted in a combined pool of protein predictions (see Table 6). The final prediction represents a consensus of the protein predictions obtained with these reference annotations. The number of P. californicus proteins contributed to the final annotation decreases across the four related species (see Table 6). At the end of the GeMoMa pipeline, the GAF joined all annotations coming from the different reference species and filtered them by a relative GeMoMa score. The final annotation should therefore contain no duplicates and only the best annotations for P. californicus from the different reference annotations. The final prediction contains 20,170 proteins (see Table 6). About 64 % of the final P. californicus GeMoMa predictions are based on the P. barbatus reference annotation (see Table 6).

Table 6: Summary of GeMoMa predictions for P. californicus, provided by different reference species. See table 14 for used data sources.

| Reference species | Number of predicted proteins of the reference species | Number of protein predictions in P. californicus | Number of protein predictions in the final annotation |
|---|---|---|---|
| Pogonomyrmex barbatus | 19,128 | 14,851 | 12,865 (63.7 %) |
| Solenopsis invicta | 21,118 | 14,160 | 3,697 (18.3 %) |
| Camponotus floridanus | 18,824 | 13,769 | 2,549 (12.9 %) |
| Apis mellifera | 22,451 | 10,752 | 1,095 (5.49 %) |
| Number of merged GeMoMa predictions | | | 20,170 |

The reference annotations of the related species come from the GenBank database. The ID number of each annotation version is listed in Table 14. Table 6 shows which reference species are the source of the final 20,170 predictions. The main part (63.7 %) of the final GeMoMa annotation is covered by P. barbatus, which is the closest relative of P. californicus in this set of reference species.

3.3.2 MAKER

The main part of the structural annotation took place in the MAKER annotation pipeline. Several runs of MAKER were executed to increase the quality and completeness of the final protein predictions (see section 2.2.2). The structural prediction ended with 27,263 predicted proteins and the same number of coding transcripts (see Table 7). Of these predictions, 4,419 (16.2 %) proteins and 3,389 (12.4 %) transcripts are identical copies of other predictions. Additionally, 1,350 non-coding RNAs (ncRNAs) were predicted as tRNAs, of which 608 (45.0 %) were duplicated (see Figure 29). The amino acid specificity was determined for 1,082 tRNAs covering the twenty amino acids; in addition, 182 pseudogenes, one ochre suppressor gene and 21 undetermined tRNAs were identified by the internal tRNAscan-SE (see Table 9).

Table 7: Summary of MAKER predictions for P. californicus, including the number of unique protein and coding transcript predictions.

| Prediction | Number of predictions |
|---|---|
| Proteins | 27,263 |
| unique Proteins | 22,844 |
| Transcripts | 27,263 |
| unique Transcripts | 23,874 |
| tRNA | 1,350 |
| unique tRNA | 742 |
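The numbers of identical predictions reported for MAKER follow directly from the totals and unique counts in Table 7. The short sketch below reproduces this arithmetic; the function name and the dictionary layout are illustrative assumptions, and "redundant" is taken to mean total minus unique predictions.

```python
def duplicate_summary(total, unique):
    """Return the number of redundant predictions and their share in percent."""
    redundant = total - unique
    return redundant, round(100.0 * redundant / total, 1)


# Totals and unique counts as listed in Table 7.
maker_counts = {
    "proteins": (27263, 22844),
    "transcripts": (27263, 23874),
    "tRNAs": (1350, 742),
}

for label, (total, unique) in maker_counts.items():
    redundant, percent = duplicate_summary(total, unique)
    print(f"{label}: {redundant} redundant predictions ({percent} %)")
```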

3.4 Functional annotation

The functional prediction is based on BLAST similarity searches of the protein predictions against high-confidence proteins in databases. Three BLAST searches took place with different databases in order to increase the completeness of functionally described proteins (see section 2.2.2 for further description of the procedure). The first round identified 12,615 protein similarities with the UniProt protein database out of a total of 27,264 predicted proteins from the final MAKER run (see Table 8). The 14,649 uncharacterized proteins from the first step were handed to the next round. This round identified 1,849 of the 14,649 proteins. Hence, 12,800 sequences remained uncharacterized after the second run and were handed to the third round, which identified a further 2,148 proteins. In total, 16,612 proteins are characterized but 10,652 proteins are still unknown. Altogether, 15,231 InterPro domains, 16,552 Pfam domains, and 15,577 GO annotations were detected in these proteins. The characterized proteins are classified into two classes depending on the coverage between the protein predictions and the protein sequences in the database (see section 2.2.2). The prefix "Similar to" means that the prediction covers at least 70 % of the protein sequence in the database. The prefix "Part of" means that the prediction covers only a part of the protein sequence from the database. Just 2.5 % of these functional annotations were described with "Part of". So, 97.4 % of the protein predictions cover at least 70 % of the matched protein. Additionally, 6,982 of the protein predictions contain domains and GO annotations.

Table 8: Summary of functional predictions from the final MAKER run for P. californicus.

| Runs | Detected protein functions | Unknown proteins | Database | Database size |
|---|---|---|---|---|
| 1 | 12,615 | 14,649 | uniprot sprot | 557,992 |
| 2 | 1,849 | 12,800 | Refseq Pogonomyrmex (characterized) | 12,578 |
| 3 | 2,148 | 10,653 | non redundant (nr) Database | 181,118,669 |
| Final: | 16,612 (60.09 %) | 10,652 (39.91 %) | | |
| Final unique: | 12,864 (56.31 %) | 9,980 (43.69 %) | | |
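The bookkeeping of the three consecutive BLAST rounds can be sketched as follows: each round only receives the proteins that were left uncharacterized by the previous one. This is a minimal illustration of the arithmetic using the counts given in the text; the function name and the database labels are assumptions and no BLAST search is executed here.

```python
def cascade(total, hits_per_round):
    """Track how many proteins remain uncharacterized after each round."""
    remaining = total
    rows = []
    for database, hits in hits_per_round:
        remaining -= hits  # only the leftovers go to the next round
        rows.append((database, hits, remaining))
    return rows


# Counts as given in the text (total and hits per round).
rounds = [
    ("UniProt/Swiss-Prot", 12615),
    ("RefSeq Pogonomyrmex", 1849),
    ("nr", 2148),
]
rows = cascade(27264, rounds)
characterized = sum(hits for _, hits, _ in rows)

for database, hits, remaining in rows:
    print(f"{database}: {hits} characterized, {remaining} still unknown")
print(f"characterized in total: {characterized}")
```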

The unknown proteins were split into two categories based on the AED of the prediction (see Figure 14). Unknown proteins with an AED below 0.5 are called "putative uncharacterized proteins" because they are similar to the reference and well supported by evidence but are not characterized. Unknown proteins with an AED above 0.5 are called "hypothetical proteins" because they may not be proteins at all and the evidence for these predictions is low.
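The naming rule for unknown proteins can be expressed as a simple threshold on the AED. The sketch below is only an illustration; the function name is hypothetical, and assigning predictions with an AED of exactly 0.5 to the putative class is an assumption based on the "equal to or under 0.5" wording used for Figure 14.

```python
def classify_unknown_protein(aed, threshold=0.5):
    """Name an uncharacterized prediction according to its AED (0 = best, 1 = worst)."""
    if not 0.0 <= aed <= 1.0:
        raise ValueError("AED must lie between 0 and 1")
    # Assumption: the boundary value itself counts as well supported.
    if aed <= threshold:
        return "putative uncharacterized protein"
    return "hypothetical protein"


for aed in (0.12, 0.5, 0.83):
    print(aed, classify_unknown_protein(aed))
```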

Figure 13: Description of AED distribution with the Cumulative distribution function (CDF) as well as description of structural classes inside AED quarters.

The number of characterized proteins seems to shrink with increasing AED (see Figure 13). Uncharacterized proteins seem to occur only in predictions above AED 0.5. About 87 % of all predictions have an AED above 0.5. Based on the cumulative distribution function (CDF) in Figure 14, about 87 % of the proteins with unknown function have an AED score equal to or below 0.5. Therefore, about 87 % of the proteins with unknown function are called putative uncharacterized proteins. Considering identical predictions, 56.31 % (see Table 8) of the final unique predicted proteins (see Table 7) have been functionally annotated. Of these unique proteins, 8,973 have Pfam domains, 8,569 have identified InterPro domains, and 5,576 include GO annotations. All sequences containing GO annotations also contain InterPro and Pfam domains.

Figure 14: The distribution of proteins without identified function. The y axis represents the cumulative distribution and the x axis shows the AED. Proteins with an AED less than 0.5 are called "putative uncharacterized proteins" (green area) and predictions above an AED of 0.5 are called "hypothetical proteins" (red area).

3.5 Non-coding RNA annotation

Genome annotation is not only about the detection of coding genes; it includes the detection and classification of ncRNA genes as well. NcRNAs fall into different classes with different functions. The detection of ncRNA genes, except tRNA genes, was done by collaborators. A more detailed classification of tRNAs is given in Appendix H, Figure 29. Table 9 shows predictions for structural ncRNAs and Table 10 shows predictions of regulatory ncRNAs. tRNA and ribosomal RNA (rRNA) are included in the structural ncRNAs based on their specific conserved secondary structure. Five small nuclear RNAs (snRNAs) are also known as structural ncRNAs (Krebs et al., 2010, Ch.28 p.701). They are essential for the spliceosome. These snRNAs are known as U1, U2, U4, U5, and U6. Small nuclear ribonucleoprotein particles (snRNPs) are complexes of proteins and snRNAs. The five snRNAs within snRNPs make up the major spliceosome. In addition, some eukaryotes have another spliceosome, the minor spliceosome. This spliceosome consists of U5, U11, U12, U4atac, and U6atac (Krebs et al., 2010, Ch.28 p.703). The corresponding U4 and U6 units in the minor spliceosome are called U4atac and U6atac because they process non-canonical splice sites, that is, introns flanked by AT at the 5' end and AC at the 3' end. Likewise, the counterparts of U1 and U2 in the minor spliceosome are called U11 and U12. These RNAs were included in the structural table because they are part of the spliceosome. The gene coding for U12 was not found in the genome assembly, but it was detected in the raw reads from Illumina gDNA sequencing.

Table 9: Summary and classification of structural ncRNA predictions. LSU = large subunit; SSU = small subunit; MRP = mitochondrial RNA processing; *detected from raw reads.

| Classification | Description | Detected gene number |
|---|---|---|
| spliceosomal snRNA | U1 | 5 |
| | U2 | 3 |
| | U4 | 1 |
| | U5 | 3 |
| | U6 | 10 |
| | U4atac | 1 |
| | U6atac | 1 |
| | U11 | 1 |
| | U12* | 1 |
| tRNA | amino acid specificity determined | 1,082 |
| | undetermined | 21 |
| | ochre suppressor | 1 |
| | pseudogenes | 182 |
| rRNA | 5.8S rRNA | 13 |
| | 5S rRNA | 23 |
| | LSU rRNA | 35 |
| | SSU rRNA | 18 |
| ribosome binding | SRP RNA | 1 |
| SPHINX | conserved region 1 | 1 |
| | conserved region 2 | 1 |
| endoribonuclease | mitochondrial RNase MRP | 1 |
| | RNase P nuc | 4 |

Small nucleolar RNAs (snoRNAs), e.g. U3, are mostly involved in pre-rRNA processing and other processing and modification events of rRNA (Krebs et al., 2010, Ch.28 p.717). For that reason they were included as regulatory RNAs in Table 10. In addition to those RNA molecules, cis-regulatory elements were annotated. Since these predictions are regulation signals, they were summarized in Table 11.

Table 10: Summary and classification of regulatory ncRNA predictions.

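| Classification | Description | Detected gene number |
|---|---|---|
| snRNA | 7SK | 1 |
| miRNA | mir-8 | 8 |
| | mir-9 | 4 |
| | mir-927 | 4 |
| | mir-190 | 4 |
| | mir-2 | 4 |
| | mir-10 | 4 |
| | other types | 26 |
| snoRNA | U3 | 3 |
| | other types | 13 |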

Table 11: Summary and classification of signals coming from ncRNA predictions.

| Classification | Description | Detected signal number |
|---|---|---|
| cis regulatory element | Histone 3' UTR stem-loop | 74 |
| | iron response element II | 1 |
| | potassium channel RNA | 8 |
| | R2 RNA element | 9 |

Chapter 4

Discussion

4.1 gDNA sequencing

The gDNA sequencing of male ants with the 10x Genomics approach is based on Illumina shotgun sequencing. Illumina sequencing results consist of FASTQ files containing each read sequence with a per-base quality score below it. The scores express the probability of base-calling errors. The sequencing data include 269,953,173 paired-end reads with a length of 150 bp (see Section 2.1.1). Figures 19 and 20 in Appendix A show the per-base quality of the forward and the reverse strand. The quality scores in these figures decline towards the end of the read. The average quality score ranges from 28 to 39 for the forward strand and from 20 to 38 for the reverse strand. This phenomenon is common for Illumina sequencing and is caused by an increasing number of erroneously synthesized molecules (Tan et al., 2019). This incorrect synthesis results in an increasing occurrence of mixed signals within clusters (Tan et al., 2019). However, several kinds of sequencing errors are known to be systematic rather than random (Dohm et al., 2008; Nakamura et al., 2011; Meacham et al., 2011; Minoche et al., 2011). Figures 19 and 20 also show a general difference: the forward strand seems to have better per-base quality than the reverse strand. The fragment length has been identified as the major cause of low quality in the reverse reads (Tan et al., 2019). Unfortunately, there is currently no error model which explains this phenomenon. As the average quality is mostly in the green area (completely for the forward strand and up to 105-109 bp for the reverse read), the sequencing data have acceptable per-base quality scores.
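The relation between a quality score and the base-calling error probability is the standard Phred definition, P = 10^(-Q/10). The following sketch shows this conversion for the score ranges mentioned above. The function names are illustrative, and the ASCII offset of 33 is an assumption that holds for the Sanger/Illumina 1.8+ FASTQ encoding.

```python
def phred_to_error_probability(quality):
    """Convert a Phred quality score Q into the error probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def fastq_quality_scores(quality_line, offset=33):
    """Decode the quality line of a FASTQ record into Phred scores.

    Assumption: Phred+33 encoding (Sanger / Illumina 1.8 and later).
    """
    return [ord(symbol) - offset for symbol in quality_line]


# Lowest and highest average scores reported for the two strands.
for quality in (20, 28, 38, 39):
    print(quality, phred_to_error_probability(quality))

print(fastq_quality_scores("I5"))  # 'I' encodes Q40 and '5' encodes Q20
```

A score of 20 thus corresponds to one expected error in 100 base calls and a score of 30 to one in 1,000.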


4.2 Genome assembly

The genome of P. californicus was assembled from sequencing data of the 10x Genomics shotgun sequencing approach. The genome assembly was made by collaborators. The sequences were assembled twice with two different versions of Supernova (Supernova 1 and Supernova 2), because the next version of the assembler (Supernova 2) was released after the reads had been assembled with Supernova 1. These assemblers assumed that the sequences come from a diploid organism. Since the male ant is haploid, the assembler produced the same assembly twice, and one of these assemblies was chosen at random. The quality parameters of both assemblies were compared (see Table 2). This comparison showed that the new assembly (generated by Supernova 2) has longer sequences on scaffold level. The median sequence length is twice as high as that of the Supernova 1 assembly. The median sequence length is very low in comparison to the mean length. This large difference indicates that most sequences are relatively short and just a few are very long (the sequence lengths are not normally distributed). This was confirmed by analyzing the sequence length distribution of the Supernova 2 assembly in Figure 15. Additionally, the N50 value is greater in the Supernova 2 assembly. The N50 value of the Supernova 2 assembly is represented by the 42nd scaffold with a length of about 1.5 Mbp, which means that the 42 longest scaffolds contain half of the assembly. The N90 value is represented by the 4,075th scaffold with a length of 9,161 bp. The large difference between L50 and L90 supports the assumption that there are very many small scaffolds and just a few long ones. An additional quality parameter is the coverage of sequenced reads. A read coverage of 50 - 60 times the genome assembly size is needed to generate sufficient coverage of reads that uniquely anchor the longest repeat regions in the genome assembly (Lu et al., 2016).
The genome sequence coverage was calculated as the product of read length (150 bp) and number of reads (269,953,173) divided by the genome assembly size (314 Mbp) (Sims et al., 2014). This calculation led to a sequencing coverage of about 129 times the assembly generated by Supernova 2. That means the sequenced bases amount to about 129 times the genome assembly size. This value is higher than the required coverage. The coverage is a kind of quality value because the more sequences are available, the more overlapping sequences there are in the assembly, which produces more reliable contigs/scaffolds (Sims et al., 2014).
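The coverage calculation described above can be reproduced in a few lines. This is a minimal sketch of the formula (read length times number of reads divided by assembly size) with the values from the text; the function name is illustrative.

```python
def sequencing_coverage(read_length, number_of_reads, assembly_size):
    """Coverage = total sequenced bases divided by the assembly size."""
    return read_length * number_of_reads / assembly_size


coverage = sequencing_coverage(
    read_length=150,              # bp
    number_of_reads=269_953_173,  # reads as reported for the gDNA sequencing
    assembly_size=314_000_000,    # bp, Supernova 2 assembly
)
print(f"{coverage:.1f}x")
```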

The sequence lengths of the P. californicus genome assembly generated by Supernova 2 range from 395 bp to 8.4 Mb (see Table 2). The sequence length distribution of the assembly is displayed in Figure 15. This figure shows that about 86 % of the sequences have a length from 395 bp to 20,000 bp. These shorter sequences make up just about 20 % of the whole assembly length. Ten percent of the sequences have a length between 20,000 bp and 40,000 bp and cover about twelve percent of the whole genome assembly length. This means that about 96 % of the sequences in the assembly cover just about 32 % of the whole assembly length. Therefore, the main part of the assembly (68 %) is contained in about 500 sequences (4 %) with a length above 40,000 bp (see Figure 15). Hence, the main part of the assembly is covered by a few very long sequences. One criterion for a good assembly is that the main part of the assembly is covered by a small number of long sequences. For those reasons we chose the Supernova 2 assembly with a size of about 314 Mb, distributed over 12,983 sequences.
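The N50/L50 and N90/L90 values used in this section summarize exactly this kind of skewed length distribution. The sketch below shows one common way to compute them: scaffolds are sorted by decreasing length and summed up until the requested fraction of the assembly is reached. The function name and the toy lengths are hypothetical and do not come from the assembly itself.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of sequence lengths.

    Nx is the length of the sequence at which the cumulative sum of the
    length-sorted sequences first reaches the given fraction of the total;
    Lx is the rank of that sequence.
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running_total = 0
    for rank, length in enumerate(ordered, start=1):
        running_total += length
        if running_total >= target:
            return length, rank
    raise ValueError("lengths must not be empty")


# Hypothetical toy assembly with five scaffolds (total length 300).
toy_lengths = [20, 100, 40, 80, 60]
print(nx_lx(toy_lengths, 0.5))  # N50 and L50
print(nx_lx(toy_lengths, 0.9))  # N90 and L90
```

Applied to the real scaffold lengths, a small L50 together with a very large L90 reflects the situation described above: a few long scaffolds carry most of the assembly, followed by a long tail of short sequences.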

Figure 15: This graph displays the frequency of sequences in the genome assembly (created by Supernova 2) with a length from 0 bp to 40 Kbp. The frequency is shown on the y-axis and the sequence length (in bp) on the x-axis. The bars are calculated in 500 bp ranges and the last bar at 40 Kbp represents the frequency of all sequences above 40 Kbp. The first line of the annotations describes the number of sequences and the second line describes the share of these sequences in the whole genome assembly length.

CEGMA and BUSCO were used to measure the completeness of the P. californicus genome assembly. CEGMA detects intron-exon structures of conserved proteins (Parra et al., 2007) and BUSCO is based on the detection of conserved single-copy orthologous genes which are expected to be present in the genome (see Section 1.2.1). Completeness estimations from CEGMA are based on the detection of 248 ultra-conserved CEGs in the genome (Parra et al., 2007). The CEGMA estimation resulted in 97.98 % (243 of 248 genes) completeness of the genome, where 96.37 % (239 of 248 genes) of the CEGMA genes were completely present in the genome assembly. The P. barbatus genome assessment resulted in a CEGMA estimation of 99 % completeness (245 of 248 genes), where 92 % (229 of 248 genes) of these CEGMA genes were complete (Smith et al., 2011). The CEGMA estimations for both ants are thus similar in overall completeness (98 % vs. 99 %) but differ slightly in the proportion of complete genes (96 % vs. 92 %). The conclusion of this analysis is that the P. barbatus genome assembly differs only slightly (1 %) in completeness from the P. californicus genome assembly. The difference between the CEGMA completeness estimations could be based on more frequent detection of partial genes in P. barbatus.

For the BUSCO analysis, the Hymenoptera-specific single-copy orthologous genes from OrthoDB (v.9) were used (Zdobnov et al., 2017). This section of the database contains 4,415 different consensus sequences from orthologous genes of 25 insects inside the Hymenoptera clade (see Table 18) (Zdobnov et al., 2017). For quality control and comparison, BUSCO was run on four related species as shown in Figure 26. The BUSCO assessment has three classifications: Complete (including single-copy and duplicated), Fragmented, and Missing. Complete means that the single-copy orthologous genes were detected completely. The Complete class is separated into the single-copy and duplicated classes. Single-copy means that the orthologous gene has been detected once in the genome. The duplicated class includes those which have been detected several times in the genome. The Fragmented class contains genes without complete detection, and the Missing class is for those orthologous genes which have not been detected in the genome. Comparing the completeness of the genome assemblies of all species in Figure 26 in Appendix F leads to the observation that Apis mellifera has the most complete genome assembly, with just 33 missing orthologous genes. This is based on the high-quality chromosome-level assembly of A. mellifera, which has the highest N50 (12 Mbp) of the related species in this thesis (see Table 15 in Appendix E). P. californicus has the third most complete genome assembly with 70 missing genes, after P. barbatus with 50 missing genes. Comparing the P. californicus genome assembly with all other assemblies in Figure 26, it is evident that P. californicus contains more duplicated single-copy orthologous genes than any of the other related genome assemblies. This means that there is more than one copy of these single-copy orthologous genes in P. californicus.
This was initially thought to be caused by small repetitive sequences, but the final annotation predictions followed the same trend of duplication. Based on the observation of duplicated predictions in the genome annotation (see Section 4.5), a self-comparison of the genome assembly was necessary to find the source of these duplications.
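How the BUSCO classes relate to each other can be shown with a small sketch that turns raw counts into percentages and a one-line summary in a BUSCO-style notation. The function name and the toy counts are hypothetical; they are not the results of this thesis, for which the reference set would contain 4,415 genes.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Summarize BUSCO classes; Complete is the sum of single-copy and duplicated."""
    total = single + duplicated + fragmented + missing
    complete = single + duplicated

    def percent(count):
        return f"{100.0 * count / total:.1f}%"

    return (
        f"C:{percent(complete)}[S:{percent(single)},D:{percent(duplicated)}],"
        f"F:{percent(fragmented)},M:{percent(missing)},n:{total}"
    )


# Hypothetical counts for a reference set of 100 orthologous genes.
print(busco_summary(single=80, duplicated=10, fragmented=6, missing=4))
```

A high share of duplicated genes, as observed for P. californicus, increases the Complete class without indicating a better assembly, which is why the single-copy and duplicated classes are reported separately.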

This was done by collaborators (see Figure 24). The analysis of duplicates consisted of a Mega-BLAST analysis of three different genome assemblies, each of which was compared to itself. The two versions of the P. californicus genome assembly (new assembly = generated by Supernova 2, old assembly = generated by Supernova 1) as well as the P. barbatus genome assembly were included (see Table 14). Figure 24 shows the results of the Mega-BLAST analysis of each genome assembly. The x-axis shows the number of alignments from one contig/scaffold in another contig/scaffold of the genome assemblies. Alignments with at least 99 % identity are included in this figure. The y-axis shows categories of alignment lengths. For example, Figure 24 displays that 432,289 alignments with a size between 0-50 nt in the P. californicus Supernova 2 genome assembly occur with 99 % identity in another scaffold of the same genome assembly. There are differences between the genome assemblies of P. californicus (Supernova 1 vs. Supernova 2). The P. californicus assembly generated by Supernova 2 shows the most duplicated sequences in every category of alignment length. A higher number of duplicated short sequences is expected because of small low-complexity regions in the assemblies. A decreasing trend in the frequency of duplications with increasing alignment length was observed, with few exceptions at the alignment lengths 1001-2000 nt and 2001-3000 nt. Every included assembly shows some exceptions to this duplication trend. It is also interesting that there is a very long duplication (more than 5001 nt) in the P. barbatus genome assembly. This means that long duplications also occur in published genome assemblies. This might need further analysis, considering that this is a draft genome. However, this assembly analysis shows that the P. californicus genome assembly generated by Supernova 2 has a very large number of long duplicated sequences (6,135 are longer than 5001 nt), whereas the genome assembly generated by Supernova 1 has far fewer. The comparison to the P. barbatus genome assembly shows that this number of duplicates is probably not related to biological causes. Figure 23 in Appendix C supports this assumption. This figure shows how often the annotated transcripts of P. californicus match the genome assembly (Supernova 2). The y-axis shows the number of transcripts and the x-axis the number of duplications of these transcripts. More than 10,000 transcripts occur just once in the genome, but there are 24 with more than 100 duplications in the genome. In conclusion, this graph shows that the annotation pipeline accounted for multiple occurrences of coding sequences in the genome; otherwise there would be more transcripts for each match in the assembly (see Section 2.2.2). The duplications may nevertheless have affected the transcript annotation, given the exact duplications observed in the annotated transcripts (see Figure 25). About 16.21 % of the transcripts were duplicated.
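The evaluation of such a self-comparison can be sketched as follows. The example assumes that the alignments are available in the tabular BLAST format (query, subject, percent identity, alignment length as the first four columns), keeps only alignments between different scaffolds with at least 99 % identity, and counts them per length category. It is not the script used by the collaborators; the bin borders, function names and example rows are assumptions for illustration.

```python
import csv
import io
from collections import Counter

# Upper borders of the alignment length categories (hypothetical binning).
BIN_EDGES = [50, 100, 500, 1000, 2000, 3000, 5000]


def length_bin(length):
    """Return the label of the length category an alignment falls into."""
    lower = 0
    for upper in BIN_EDGES:
        if length <= upper:
            return f"{lower}-{upper}"
        lower = upper + 1
    return f">{BIN_EDGES[-1]}"


def bin_duplicates(handle, min_identity=99.0):
    """Count alignments between different scaffolds per length category."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4 or row[0].startswith("#"):
            continue  # skip comment lines and incomplete rows
        query, subject = row[0], row[1]
        identity, length = float(row[2]), int(row[3])
        # A scaffold aligned to itself is not a duplication.
        if query != subject and identity >= min_identity:
            counts[length_bin(length)] += 1
    return counts


# Hypothetical alignments in tabular BLAST format.
example = io.StringIO(
    "scaf1\tscaf1\t100.00\t5000\t0\t0\t1\t5000\t1\t5000\t0.0\t9234\n"
    "scaf1\tscaf2\t99.50\t40\t0\t0\t10\t49\t200\t239\t1e-10\t70.0\n"
    "scaf2\tscaf3\t98.00\t800\t16\t0\t1\t800\t1\t800\t0.0\t1300\n"
    "scaf3\tscaf4\t99.90\t6200\t6\t0\t1\t6200\t1\t6200\t0.0\t11400\n"
)
counts = bin_duplicates(example)
print(dict(counts))
```

Note that in an all-against-all comparison of an assembly with itself, each pair of duplicated regions is usually reported twice (once in each direction), which has to be kept in mind when interpreting absolute numbers.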

Figure 16: BUSCO genome assessment results for different P. californicus genome assemblies made by different versions of the Supernova assembler (Supernova 1 and Supernova 2). Additionally, only the sequences with a length above 5 Kbp and above 10 Kbp from the P. californicus genome assembly generated by Supernova 2 have been analyzed. This result is based on the Hymenoptera-specific single-copy ortholog database from OrthoDB (v.9).

The BUSCO assessment results of both P. californicus assemblies were compared in order to analyze the completeness of both assemblies (see Figure 16). Additionally, the Supernova 2 assembly was filtered to sequences above 5,000 bp and above 10,000 bp in length to see the effect of the shorter sequences on the BUSCO results. The BUSCO results show large differences in the number of duplicated predictions (Supernova 1 with 22 vs. Supernova 2 with 451 duplicated predictions). The Supernova 1 assembly contains 160 more fragmented orthologous genes than the Supernova 2 assembly. Additionally, 41 more genes are missing in the Supernova 1 assembly than in the Supernova 2 assembly. The comparison of the Supernova 2 assembly under the different filter strategies shows a decreasing trend for duplicated and fragmented genes when shorter sequences (shorter than 5,000 bp and 10,000 bp) are excluded. That means that the additional missing genes in the filtered results of Supernova 2, compared with the whole Supernova 2 assembly, correspond to fragmented and duplicated genes in the shorter sequences of the assembly. However, the assembly with the most single-copy orthologous genes is the P. californicus genome assembly generated by Supernova 1.

The generated genome assembly still fulfills the criteria for a high-quality draft genome assembly, based on the high completeness scores from BUSCO and from CEGMA. These results indicate a completeness of more than 90 %. The high base-call quality of the Illumina sequencing supports this assumption (see Figures 19 and 20). A whole-genome BLAST search was done to see how much of the P. californicus (Supernova 2) assembly covers the genome assemblies of related species (see Table 17 in Appendix E). The target assemblies in Table 17 come from related species. The P. californicus assembly is the query assembly. The shared identity of the compared species decreases with decreasing relatedness. That means P. californicus shares more sequences with closely related species (e.g. P. barbatus) than with more distant species (e.g. A. mellifera). The distribution of identity scores from the alignments seems to change with decreasing relatedness. Small differences between the mean and the median are consistent with a roughly symmetric (e.g. normal) distribution of the values, whereas larger differences indicate outlier values. If the median is greater than the mean, some alignments with very low identity pull the mean down. Additionally, differences in the alignment length are noticeable. 87.7 % of the P. barbatus genome was aligned to the P. californicus (Supernova 2) genome assembly. Just about 43 % of S. invicta and just 12 % of A. mellifera were aligned to the P. californicus genome assembly. This also might be connected to the degree of relatedness to P. californicus. Interestingly, there are still about 12.3 % of the P. californicus genome which are not found in the P. barbatus genome assembly. This might depend on the quality of the assembly (less complete and/or a higher number of exactly duplicated scaffolds), or these sequences are P. californicus specific.
However, these results seem to reflect the phylogenetic relationships expected from Moreau et al. (2006). But as the BUSCO results show (Figures 26 and 16), there are still at least about 20 more missing genes in the P. californicus than in the P. barbatus genome assembly. The cause of this lack of completeness might be the quality of the gDNA raw material coming from the ant. The 10x Genomics approach has an improved library preparation approach as well as a barcoding system for query reconstruction. These steps do not guarantee good quality, but they might have a positive effect on it. For that reason, the quality of the DNA raw material coming from the ant might be the limiting factor for the resulting quality of the genome assembly. Long DNA strands are needed as a basis for a better genome assembly.

4.3 Transcript assembly

The transcript assembly is a mix of different sources from different sequencers, generated by different assembler programs (see Table 4). The main part of the transcript assembly is based on RNA sequences from Helmkampf et al. (2016). These sequences come from the head of the ant. Sequencing with Illumina resulted in 93,895,202 reads with a length of 100 bp. The mean quality values of this sequencing range from 29 to 39, which indicates good base-calling quality (see Figures 21 and 22). In the absence of a genome, Helmkampf et al. (2016) produced a genome-independent transcript assembly by using Trinity (v. 1). Since genome data is now available, a genome-dependent transcript assembly was built with the Tuxedo suite. A Splign search (Kapustin et al., 2008) was performed in order to align transcripts to the genome while taking the intron-exon structure into account. The Splign analysis was done for quality assessment. The analysis detected 7,031 (89 %) of the published transcripts in the genome, which indicates that 859 transcripts are missing. The transcripts resulting from the annotation procedure were analyzed as well. This resulted in 99.99 % matching, which means that 15 of the final annotated transcripts would not fit to the genome assembly. Based on these results, the missing transcripts from both data sets were used in a BLAST run against the genome with an e-value of 10^-5. This BLAST analysis ended with 13 published transcripts still completely missing, while all 15 missing annotated transcripts were detected in the genome. Because sequencing reads might be missing from the assembly, an additional BLAST search was performed with the 13 missing published transcripts using the same parameters as in the previous BLAST run. Five of these 13 published transcripts remained absent without any hit during the analysis.
That means 99.8 % of the published transcripts from Helmkampf et al. (2016) seem to be present in the genome assembly. The assembly size of the published data generated by Helmkampf et al. (2016) is less than one third of the Tuxedo 2 transcript assembly or of the final merged transcript assembly (see Table 4). A comparison of the published Trinity assembly with the annotated transcripts of P. californicus showed that 7,474 (92.19 %) transcripts from Helmkampf et al. (2016) match the annotated transcripts with BLASTn (e-value 10^-5). So, most of the published transcripts are matched, and about 8 % seem to be completely absent from the annotated transcripts.

The distribution of tissue sources in the final assembly is rather one-sided. Just 246 transcripts with a size of about 196 Kbp came from the whole body of the ant. This small amount of RNA sequences from the whole body is due to the low base-call quality of the in-house MinION sequencing (done by collaborators). 1,357,114 reads were characterized as failed after sequencing, while 394,085 long reads passed the quality control. That means 78 % of the resulting reads are not present in the resulting assembly because of low quality. This may be due to systematic problems in MinION sequencing or to the flow cell (see Section 1.1.2). The Tuxedo assemblies 1 and 2 were merged, and the Trinity assemblies as well as the MinION assembly were used to extend the merged Tuxedo assembly (see Section 2.2.2). The numbers of transcripts in the Tuxedo 2 assembly and the final assembly are very similar. The longest transcript sequence in the final transcript assembly also seems to come from the Tuxedo 2 assembly. However, based on the mean sequence length and the N50, the final assembly seems to contain more long sequences than the Tuxedo 2 assembly but more short sequences than the published assembly.

4.4 Repeat annotation

The detection of repetitive sequences is a challenging part of genome annotation due to the extremely high sequence divergence and fast evolution of interspersed elements. The repeat annotation of P. californicus masked about 22 % of the genome assembly and identified 424,357 repetitive elements. The P. barbatus repeat annotation masks about 19.5 % (about 36 Mb) of the genome assembly and identified 297,221 repetitive elements (Smith et al., 2011). An estimated 10-45 Mb are missing in the draft genome of P. barbatus (Smith et al., 2011). Based on this assumption, 0.8 - 3.7 % of annotated repeats are missing in the P. barbatus genome. The estimated coverage of the full repeat annotation of P. barbatus would then be between 20.3 % and 23.2 %, which is in the range of the repeat annotation of P. californicus. By this comparison, the repeat annotation of P. californicus looks relatively complete. However, there are major differences in the composition of some repeat classes: the repeat annotation of P. californicus has more than twice the proportion of interspersed repeats (17.37 % vs. 7.93 %) and simple repeats (4.05 % vs. 1.71 %) compared to P. barbatus. P. barbatus has more than three times as many low-complexity repeats (1.97 % vs. 0.52 %) and slightly more unclassified repeats in the repeat annotation (3.75 % vs. 3.02 %). However, there are similar percentages of detected SINEs (0.03 % vs. 0.04 %), which are known as lineage-specific, relatively long non-coding transposable elements (TEs). In order to estimate the completeness of the repeat annotation and the genome annotation, the detection of telomeres is helpful. The telomeres of insects consist of TTAGG repeats which flank the chromosomes. The P. barbatus repeat annotation detected 27 of 32 telomeres from a diploid karyotype. For P. californicus, six of 16 telomeric regions were identified. That means that the assembly may miss some telomeric regions or that the repeat annotation of P. californicus is not fully complete.
Based on the improvement of P. barbatus repeat annotation4 which Chapter 4: Discussion 53 covers 30.82 % of the assembly, the repeat annotation of P. californicus may miss some repeats.
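The projected coverage range quoted above can be reproduced with a short calculation. The following is a minimal sketch under one assumption that the text does not state explicitly: the share of missing repeats is taken as the masked fraction scaled by the ratio of missing sequence to assembled sequence. The 235 Mb assembly size is taken from Table 15, and the function name is hypothetical.

```python
# Assumption (not stated in the text): missing repeat share =
# masked fraction * missing sequence / assembled sequence.
MASKED_FRACTION = 0.195           # P. barbatus, Smith et al. (2011)
ASSEMBLY_MB = 235.0               # P. barbatus assembly size (Table 15)
MISSING_MB_RANGE = (10.0, 45.0)   # estimated missing sequence, Smith et al. (2011)


def projected_repeat_coverage(masked_fraction, assembly_mb, missing_mb):
    """Return (missing repeat share, projected total coverage), both in percent."""
    missing_share = 100 * masked_fraction * missing_mb / assembly_mb
    return missing_share, 100 * masked_fraction + missing_share


for missing_mb in MISSING_MB_RANGE:
    share, total = projected_repeat_coverage(MASKED_FRACTION, ASSEMBLY_MB, missing_mb)
    print(f"{missing_mb:.0f} Mb missing: +{share:.1f} % repeats, {total:.1f} % projected coverage")
```

Under this reading, the two bounds come out at roughly 20.3 % and 23.2 %, which agrees with the range given in the text.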

4.5 Structural annotation

The structural annotation was performed with GeMoMa and MAKER. All predicted genes are protein-coding genes, and the MAKER annotation contains a transcript for every protein sequence. The final MAKER annotation includes 2,263 isoforms, but also 4,420 exact protein duplicates, 3,392 exact transcript copies, and 608 tRNA duplicates (see Figure 25). Exact copies were also detected in the final GeMoMa annotation: of 20,170 GeMoMa predictions, 5,655 proteins and 5,363 transcripts were identical to other predictions. CD-HIT generates clusters of similar sequences in which the longest sequence represents all members of the cluster (Li and Godzik, 2006). Parts of proteins and functional units, for example of dimeric proteins, therefore fall into such a cluster as well. The exact protein copies in the final MAKER results may be caused by the high number of duplicated sequences in the Supernova 2 genome assembly of P. californicus (see Section 4.2), but they may also include highly conserved functional parts of proteins. Nevertheless, a difference of 1,030 predictions between unique proteins and unique transcripts is noticeable (see Table 7). This difference includes 558 splice variants of protein-coding genes. The remainder may consist of transcripts that use different codons for the same amino acid. The final unique protein set contains 1,471 isoforms, which means that about 35 % of the 2,263 protein isoforms are exact copies of other proteins or of parts of other proteins in the prediction. These exact-copy isoforms were therefore not included in the final protein predictions. The final unique transcript set contains 2,029 transcript isoforms, which indicates that about 10 % of the 2,263 isoforms are exact duplicates. Nevertheless, these exact-copy isoforms account for only about 18 % of all protein duplicates and about 5 % of all transcript duplicates. Detected isoforms therefore do not seem to be the root cause of the exact duplications, although they contribute to them.
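To illustrate what counting exact copies means, the sketch below groups sequences by identity. It is a simplified stand-in and not the CD-HIT procedure used in this thesis: it only finds fully identical sequences, whereas CD-HIT also clusters sequences that are contained in a longer representative. The function names and the toy records are hypothetical.

```python
from collections import Counter


def read_fasta(lines):
    """Yield sequences from an iterable of FASTA lines (headers are ignored)."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def count_exact_duplicates(sequences):
    """Return (number of unique sequences, number of redundant exact copies)."""
    counts = Counter(seq.upper() for seq in sequences)
    unique = len(counts)
    return unique, sum(counts.values()) - unique


# Toy records for illustration only (not thesis data).
toy = [">a", "MKT", "LLV", ">b", "MKTLLV", ">c", "MSTQ"]
unique, copies = count_exact_duplicates(read_fasta(toy))
print(unique, copies)  # 2 1
```

Because contained sequences are not detected, such a count gives a lower bound on the redundancy that a clustering tool would report.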

To detect differences in completeness between species, BUSCO analysis was performed on transcript and protein level (see Figures 27 and 28 in Appendix F) in the same way as described in Section 4.2. All species show a high number of duplicated genes (dark blue section of the bars) on transcript and protein level. The P. californicus transcripts and proteins have the fewest duplicates because the final unique annotations were passed to the BUSCO analysis, so exact sequence copies are excluded from these sequences. Duplicated single-copy orthologous genes are still included in the BUSCO results of P. californicus. Contigs of scaffolds that carry single-copy orthologous genes might be duplicated in this genome assembly. Further analysis of the source of these duplications with CD-HIT (see beginning of Section 4.5) revealed several exact copies or contained sequences among the proteins and transcripts of published assemblies (see Table 16). The differences between the numbers of duplicated transcripts and proteins in Table 16 would explain the differences in duplications in the BUSCO analysis of C. floridanus (see Figure 28). In the C. floridanus proteins, 5,665 exact copies of proteins or protein parts were found. This might explain the high number of duplicated genes in Figures 27 and 28.

4 Source: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Pogonomyrmex_barbatus/101/ (Release 101, Date of access: 18th of March, 2019)

On transcript level, more than three times as many single-copy orthologous genes are missing from the P. californicus transcripts as from the S. invicta transcripts. The higher percentage of missing genes in the P. californicus transcripts may be due to the source of the RNA (mostly head tissue only). Owing to a lack of quality, only a small number of long TGS MinION sequences were added to the genome assembly to extend existing gene models (see Section 4.3). Therefore, genes expressed elsewhere in the body of the ant are missing, which led to a lack of completeness on transcript level. To complete the analysis of annotation completeness, BUSCO assessment results based on proteins are displayed in Figure 28 in Appendix F. These results show the same trend between the species as the BUSCO assessment on transcript level. Figure 17 shows the differences in missing genes on genome, transcript, and protein level. The y-axis shows the percentage of missing genes relative to the database used, and absolute values are displayed above the bars. The number of missing genes for P. californicus increases drastically from the genome over the transcripts to the proteins. This trend is probably caused by missing transcripts from different tissues, which led to an incomplete transcript assembly (see above). Some related species in Figure 17 show a different trend: more genes are missing from their genomes than from their transcripts or proteins. C. floridanus and S. invicta have the most missing genes on genome level. This observation may be due to the different strategies that BUSCO applies on each level (Simão et al., 2015). Interestingly, C. floridanus is included in the Hymenoptera specific single-copy ortholog database from OrthoDB (v.9) that was used for the BUSCO analysis (see Table 18 in Appendix E). The large number of missing genes in the C. floridanus genome may therefore be caused by BUSCO failing to detect them on genome level. S. invicta shows the most missing genes on genome level of all published related species in Figure 17. A possible reason is that the S. invicta annotations are currently not included in the database used (see Table 18). As S. invicta is not part of the database, it was not considered when the consensus sequences were generated, so the resulting consensus sequences may not support gene models from S. invicta on genome level. Interestingly, S. invicta shows greater completeness on protein and transcript level.
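The comparison of missing genes across genome, transcript, and protein level can be sketched as a small parser for the one-line result notation that BUSCO writes into its short summary. This is an illustrative sketch only: the function names are hypothetical, and the example lines contain made-up values, not results from this thesis.

```python
import re

# One-line BUSCO notation, e.g. C:..%[S:..%,D:..%],F:..%,M:..%,n:..
BUSCO_PATTERN = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Parse one summary line into percentages plus the number of BUSCOs n."""
    match = BUSCO_PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    result = {key: float(value) for key, value in match.groupdict().items()}
    result["n"] = int(match.group("n"))
    return result


def missing_by_level(summaries):
    """Map each level name to its absolute number of missing BUSCOs."""
    counts = {}
    for level, line in summaries.items():
        parsed = parse_busco_summary(line)
        counts[level] = round(parsed["missing"] * parsed["n"] / 100)
    return counts


# Hypothetical summary lines (illustration only, not thesis results).
example = {
    "genome": "C:97.0%[S:96.0%,D:1.0%],F:2.0%,M:1.0%,n:1000",
    "proteins": "C:88.0%[S:80.0%,D:8.0%],F:4.0%,M:8.0%,n:1000",
}
print(missing_by_level(example))  # {'genome': 10, 'proteins': 80}
```

Converting the percentages back to absolute counts makes the levels directly comparable, which mirrors the absolute values shown above the bars in Figure 17.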

However, Figure 17 shows that 9 % of proteins are missing from the final protein prediction. The increase in missing genes from genome level to protein level implies that proteins encoded in the genome are missing from the predicted protein set. Therefore, the missing proteins were examined in detail in order to find the cause. Using tBLASTn (e-value: 10^-5), 31.6 % of the missing genes were found in the genome assembly with an alignment coverage of at least 75 %, and 58.2 % were found with a coverage of at least 50 %. This means that about 234 of the proteins missing on protein level were detected in the P. californicus genome with at least 50 % alignment coverage. BUSCO uses Augustus and HMMER in addition to tBLASTn for gene detection on genome level (Waterhouse et al., 2017). This approach is more sensitive because it includes gene prediction software such as Augustus. Most of the proteins that BUSCO reports as missing on protein level for P. californicus in Figure 17 are thus present in the genome.
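The coverage thresholds used in this search can be illustrated with a short sketch. It assumes tab-separated BLAST output in the standard 12-column tabular layout (query start and end in columns 7 and 8) and measures coverage per query from the single best-covering hit. How coverage was computed in this thesis is not specified, so this definition, the function names, and the toy hits are assumptions.

```python
def best_query_coverage(blast_lines, query_lengths):
    """Best single-hit coverage (percent) per query; queries without hits get 0."""
    best = {query_id: 0.0 for query_id in query_lengths}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query_id, q_start, q_end = fields[0], int(fields[6]), int(fields[7])
        aligned = abs(q_end - q_start) + 1
        coverage = 100 * aligned / query_lengths[query_id]
        best[query_id] = max(best[query_id], coverage)
    return best


def percent_at_least(best, threshold):
    """Percentage of all queries whose best coverage reaches the threshold."""
    return 100 * sum(cov >= threshold for cov in best.values()) / len(best)


# Toy hits and query lengths (hypothetical, not thesis data).
hits = [
    "q1\ts1\t90.0\t80\t8\t0\t1\t80\t100\t339\t1e-20\t150",
    "q2\ts2\t85.0\t100\t15\t0\t11\t110\t500\t799\t1e-30\t180",
    "q2\ts3\t70.0\t40\t12\t0\t1\t40\t20\t139\t1e-6\t50",
]
lengths = {"q1": 100, "q2": 200, "q3": 50}
best = best_query_coverage(hits, lengths)
print(round(percent_at_least(best, 75), 1), round(percent_at_least(best, 50), 1))
```

Merging several non-overlapping hits per query would give higher coverage values than this best-hit definition, so the choice of definition affects the reported percentages.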

Figure 17: Missing single-copy orthologous genes in genome, transcripts, and proteins from Apis mellifera, Camponotus floridanus, Pogonomyrmex barbatus, Pogonomyrmex californicus, and Solenopsis invicta. Data is based on BUSCO assessment with the Hymenoptera specific single-copy ortholog database from OrthoDB (v.9).

The missing proteins were also searched for among the annotated proteins. Using BLASTp (e-value: 10^-5), 148 (36.8 %) of the missing proteins were found in the protein predictions of P. californicus with at least 70 % coverage. Eighteen proteins were found with a coverage above 90 % and an identity above 70 %. Based on this result, the sensitivity of BUSCO seems to be lower on protein level than on genome level, so BUSCO assessments may need improvement for more sensitive detection on protein level. In total, probably about 5 % (254) of the proteins reported as missing are still present in the final P. californicus protein prediction but were not detected by BUSCO. In addition to the BUSCO analysis, conserved domain arrangements were analyzed on protein level with the DOGMA web service (see footnote 5). The insect domain core set from Pfam version 32 was used as the database for these searches. This analysis is based on unique sets of protein sequences from which all exact duplicates were removed (see Table 16). DOGMA estimated a completeness of 90.51 % for the unique protein predictions of P. californicus (see Table 12). Of 4,741 Hymenoptera specific protein domains, 450 are missing from the final protein set of P. californicus. This score is slightly below expectation, because related species such as P. barbatus contain about 98 % of the Pfam domains. Interestingly, P. californicus shows higher completeness for CDAs of size two or three than for CDAs of size one, whereas P. barbatus shows the opposite trend. Missing CDAs of all three sizes indicate a lack of completeness of about ten percent in the P. californicus protein predictions. The DOGMA estimate thus shows slightly lower completeness than the BUSCO estimate on protein level: BUSCO reports 9 % missing proteins for the P. californicus protein predictions (see Figure 17), whereas DOGMA estimates about 9.5 % missing. Table 12 also shows lower quality for the S. invicta proteins than for the other species. Overall, DOGMA appears to report slightly lower completeness on protein level than BUSCO for all related species.
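The DOGMA completeness value follows directly from the domain counts given in the text. A minimal sketch of this arithmetic, with a hypothetical function name:

```python
def completeness_percent(missing, total):
    """Completeness in percent, given missing and total domain counts."""
    return 100 * (total - missing) / total


# Domain counts from the text: 450 of 4,741 domains are missing.
dogma_completeness = completeness_percent(450, 4741)
dogma_missing = 100 - dogma_completeness
busco_missing = 9.0  # percent of missing proteins, Figure 17

print(f"DOGMA completeness: {dogma_completeness:.2f} %")
print(f"DOGMA missing: {dogma_missing:.2f} %, BUSCO missing: {busco_missing:.2f} %")
```

The result of about 90.51 % matches the total in Table 12, and the roughly 9.5 % of missing domains lies slightly above the 9 % of missing proteins reported by BUSCO.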

Table 12: DOGMA result summaries on unique protein data from related ants. Detection of Conserved Domain Arrangements (CDAs) with different sizes for the ants used. The colour code displays the relative comparison of completeness (red: < 90 %, orange: < 98 %, green: > 98 %).

Completeness [%]
Species           CDA size 1   CDA size 2   CDA size 3   Total
P. californicus   89.64        91.71        91.42        90.51
P. barbatus       98.24        97.83        97.45        98.02
S. invicta        98.80        96.91        93.98        97.64
C. floridanus     98.92        99.08        98.54        98.92
A. mellifera      98.35        98.68        98.18        98.44

5 Source: https://domainworld-services.uni-muenster.de/dogma (Date of access: 18th of March, 2019)

4.6 Functional annotation

The description of the function of structural predictions is called functional annotation. Annotated proteins are classified into three main classes: hypothetical proteins, characterized proteins, and uncharacterized proteins. Hypothetical proteins are assumed to be house-keeping proteins or to have low evidence. Uncharacterized proteins potentially have higher evidence but no described function in other species; they might also be house-keeping proteins. Characterized proteins have low or high evidence and have been identified as proteins in other species. The labeling of proteins in the P. californicus annotation depends on the AED value (see Section 2.2.2). The distribution of these three classes as a function of the AED value is displayed in Figure 13. Confidence grows as the AED value decreases. The percentage of hypothetical proteins increases with the AED value, whereas the percentage of characterized proteins decreases. Hypothetical proteins with an AED value below 0.5 matched hypothetical proteins from other species in the UniProt database.

Figure 18: Word cloud of the species from which the functional annotations of classified P. californicus proteins originate. Proteins from these species have the highest similarity to the detected P. californicus proteins. The word cloud was generated with the WordItOut online service.

The source organisms of the annotations of characterized proteins are displayed in Figure 18. Interestingly, the main sources of functional annotations for P. californicus proteins are not the related species: 17 % of functional annotations come from Drosophila melanogaster and 13 % from Homo sapiens. These species are not closely related, but as model organisms they have relatively complete and refined annotations. Annotated proteins from the related species used in the annotation pipeline may themselves carry functional annotations from these model organisms. However, 12 % of functional annotations come from P. barbatus, which means that these proteins fit P. barbatus proteins better than proteins from the less closely related model organisms. In addition, several P. barbatus annotations are based on Drosophila melanogaster proteins.

The domain annotation, as part of the functional annotation, identified 8,973 sequences with Pfam domains, 8,569 sequences with InterPro domains, and 5,576 sequences with GO annotations. All sequences with InterPro domains also have Pfam domains, and all sequences with GO annotations have both Pfam and InterPro domains. As domains are conserved regions and GO annotations describe identified molecular functions, these 5,576 proteins are considered high-confidence proteins. Of these, 98 % are characterized, 1 % are putative uncharacterized, and 1 % are hypothetical. In total, 96 high-confidence proteins are not characterized.
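The nesting of the three evidence types can be checked with plain set operations. The sketch below uses hypothetical sequence identifiers for illustration; only the idea of testing that the GO-annotated set lies within the InterPro set, which in turn lies within the Pfam set, is taken from the text.

```python
def is_nested(pfam_ids, interpro_ids, go_ids):
    """True if GO-annotated sequences are a subset of InterPro, and InterPro of Pfam."""
    return go_ids <= interpro_ids <= pfam_ids


def high_confidence(pfam_ids, interpro_ids, go_ids):
    """Sequences supported by all three evidence types."""
    return pfam_ids & interpro_ids & go_ids


# Hypothetical identifiers (illustration only, not thesis data).
pfam = {"seq1", "seq2", "seq3", "seq4"}
interpro = {"seq1", "seq2", "seq3"}
go = {"seq1", "seq2"}

print(is_nested(pfam, interpro, go))                # True
print(sorted(high_confidence(pfam, interpro, go)))  # ['seq1', 'seq2']
```

When the sets are nested in this way, the high-confidence set is identical to the GO-annotated set, which is why the count of high-confidence proteins equals the count of sequences with GO annotations.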

Annotations for ncRNAs were also included. tRNA genes were detected by tRNAscan-SE in the annotation pipeline. In total, 1,350 tRNAs were annotated. Of these, 1,082 were determined to code for 20 different amino acids; the annotation further includes one suppressor gene, 21 undetermined genes, and 182 pseudogenes (see Table 9 for a summary and Figure 29 in Appendix H for the detailed classification). The suppressor tRNA gene was identified as an ochre suppressor, which is known as a UAG to UAA coding triplet mutation (Krebs et al., 2010, Ch. 8, p. 202-204). It has a suppressing effect because UAA is a stop codon that is recognized by Release Factor 1 (RF1, or eRF1 in eukaryotes), which terminates the translation of polypeptides (Krebs et al., 2010, Ch. 8, p. 202). The suppressor gene might correspond to a mutation of tRNA(Tyr) or tRNA(Gln) at the third position of the codon (Beier, 2001). However, the number of predicted tRNAs is very high relative to the genome size of the ant. About 500 tRNA genes are distributed over the whole human genome, which is about ten times larger than the ant genome (3,234 Mb vs. 314 Mb) (International Human Genome Sequencing Consortium et al., 2001). Based on an updated annotation at GenBank, the P. barbatus genome has only about 201 annotated tRNA sequences (see footnote 4). The number of detected tRNA genes in the P. californicus genome is therefore much higher than expected. The large share of tRNAs classified as coding for threonine (771) and isoleucine (90) is noticeable. The number of pseudogenes (182) is slightly lower than in the current improved annotation of P. barbatus (200; see footnote 4). The tRNA predictions probably include false positives. In addition to tRNAs, other ncRNAs were detected (see Tables 9, 10, and 11). The detection of spliceosomal snRNAs may also be an indicator of annotation completeness, as these genes are necessary for the function of the spliceosome. All spliceosomal snRNA genes were identified. The U12 gene was identified in the raw sequencing reads because it was missing from the assembly. Compared with other insects, the U6 gene is highly overrepresented in the P. californicus genome. Other insects such as Drosophila melanogaster have only 3 copies of this gene, which is also the case for eight other insects (Mount et al., 2007). The copy numbers of all other spliceosomal snRNAs are very similar to those of other insects.
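The tRNA comparison becomes clearer when the counts are normalized by genome size. The sketch below uses the counts and genome sizes quoted in the text; the P. barbatus size of 235 Mb is taken from Table 15 and is an assumption for this comparison, and the function name is hypothetical.

```python
def trna_per_mb(n_trna, genome_mb):
    """Number of annotated tRNA genes per megabase of genome."""
    return n_trna / genome_mb


datasets = {
    "P. californicus": (1350, 314.0),
    "H. sapiens": (500, 3234.0),
    "P. barbatus": (201, 235.0),  # assumption: assembly size from Table 15
}

for species, (n_trna, genome_mb) in datasets.items():
    print(f"{species}: {trna_per_mb(n_trna, genome_mb):.2f} tRNA genes per Mb")
```

Under these figures, the density in P. californicus is about five times that of P. barbatus, which is consistent with the suspicion that the predictions include false positives.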

Finally, 37 published partial proteins (see footnote 6) from P. californicus were compared with the protein annotation of this thesis using BLASTp (e-value: 10^-5). Table 20 in Appendix H lists the functional descriptions of the published proteins together with those of the matching protein predictions from this thesis. The table displays only 20 of the 38 protein matches, because 18 proteins are mitochondrial proteins, which are not contained in the raw material on which the genome assembly is based. All matches displayed in Table 20 have at least 99 % identity with the protein annotation of this thesis. Comparing the descriptions shows one case in which the published description does not fully agree with the functional description in this thesis. The published sequence with the GI number 1033938453 has the same sequence as the proteins 428787291, 428787257, and 428787255, yet they are labeled differently, as elongation factor 1 and elongation factor 2. As all of them match elongation factor 1 among the annotated proteins of P. californicus, these four proteins may have a wrong functional description in GenBank.

6 Source: https://www.ncbi.nlm.nih.gov/protein/?term=Pogonomyrmex%20californicus (Date of access: 18th of March, 2019)

Chapter 5

Conclusion

This project comprises the genome annotation of Pogonomyrmex californicus based on a high-quality draft genome assembly. The genome was sequenced with the 10x Genomics approach and assembled with the Supernova 1 and Supernova 2 assemblers. Unfortunately, a high number of inter-scaffold duplications were detected in the assembly generated by Supernova 2, which led to a larger assembly than expected. Interestingly, the assembly generated by Supernova 1 from the same data shows less fragmentation and lies in the expected size range. As the annotation pipelines deal with duplications in the genome, duplications in the annotations were reduced for P. californicus. Many duplications were also observed in published annotations. These published duplications occur mostly on transcript and protein level, but in relatively high numbers there (see Figures 27 and 28). The resulting annotation of the high-quality draft genome of P. californicus fulfills the criteria of an annotation-directed improved annotation. About 394,064 repeats were annotated and cover about 22 % of the genome. Currently, about 1.7 % of annotations are missing in the genome, 4 % are missing in the transcripts, and about 10 % of proteins are missing. This lack of completeness might be due to missing RNA data from other tissues of the ant, because the lack of completeness increases at transcript level. However, the annotated proteins match published annotated proteins of P. californicus. Furthermore, about 5,500 proteins have high confidence because they contain protein domains as well as GO annotations. The genome assembly needs improvement for greater completeness. Pryszcz and Gabaldón (2016) published a method to address the problem of highly fragmented assemblies. Such an improvement would probably lead to a less redundant assembly. Additional high-quality gDNA would also lead to better sequencing results, and long-read sequencing would increase the completeness of the genome assembly.

Chapter 6

Availability

All data files used in this thesis as well as the results are available online at www.bioinformatics.uni-muenster.de/internal/projects/P.californicus/. To guarantee reproducibility of the results, a detailed description of the whole genome annotation and the further analyses, including the commands used, is available on the website. Please see Table 13 in Appendix D for the programs used in the annotation pipeline.

Appendix A

Quality of gDNA-sequencing

Figure 19: The per-base quality of the 150 bp Illumina forward reads used in the 10x Genomics sequencing approach. The x-axis shows the position in the read (bp) and the y-axis shows the Illumina 1.6 encoded quality values, calculated by FastQC (v. 0.11.5) (Andrews, 2016). The green area indicates very good quality, the orange area reasonable quality, and the red area poor quality (Andrews, 2016). The blue line represents the mean quality, and the red line in each box shows the median quality (Andrews, 2016). Yellow boxes represent the inter-quartile range from 25 to 75 %. The lower and upper whiskers represent the 10 % and 90 % points (Andrews, 2016).


Figure 20: The per-base quality of the 150 bp Illumina reverse reads used in the 10x Genomics sequencing approach. The x-axis shows the position in the read (bp) and the y-axis shows the Illumina 1.6 encoded quality values, calculated by FastQC (v. 0.11.5) (Andrews, 2016). The green area indicates very good quality, the orange area reasonable quality, and the red area poor quality (Andrews, 2016). The blue line represents the mean quality, and the red line in each box shows the median quality (Andrews, 2016). Yellow boxes represent the inter-quartile range from 25 to 75 %. The lower and upper whiskers represent the 10 % and 90 % points (Andrews, 2016).

Appendix B

Quality of RNA-sequencing

Figure 21: The per-base quality of the 100 bp Illumina RNA forward reads used in the sequencing approach of Helmkampf et al. (2016). The x-axis shows the position in the read (bp) and the y-axis shows the Illumina 1.6 encoded quality values, calculated by FastQC (v. 0.11.5) (Andrews, 2016). The green area indicates very good quality, the orange area reasonable quality, and the red area poor quality (Andrews, 2016). The blue line represents the mean quality, and the red line in each box shows the median quality (Andrews, 2016). Yellow boxes represent the inter-quartile range from 25 to 75 %. The lower and upper whiskers represent the 10 % and 90 % points (Andrews, 2016).


Figure 22: The per-base quality of the 100 bp Illumina RNA reverse reads used in the sequencing approach of Helmkampf et al. (2016). The x-axis shows the position in the read (bp) and the y-axis shows the Illumina 1.6 encoded quality values, calculated by FastQC (v. 0.11.5) (Andrews, 2016). The green area indicates very good quality, the orange area reasonable quality, and the red area poor quality (Andrews, 2016). The blue line represents the mean quality, and the red line in each box shows the median quality (Andrews, 2016). Yellow boxes represent the inter-quartile range from 25 to 75 %. The lower and upper whiskers represent the 10 % and 90 % points (Andrews, 2016).

Appendix C

Detection of duplicates

Figure 23: Transcripts mapping to the genome. Results of the Splign analysis. The y-axis shows the number of unique annotated transcripts matching the genome assembly (Supernova 2) on a logarithmic scale. The x-axis shows the number of occurrences in the genome assembly, i.e. the number of duplications (copies of transcripts in the genome).


Figure 24: Mega-BLAST was used to find duplicated sequences in the Pogonomyrmex californicus genome assemblies made by different versions of the Supernova assembler (new assembly = assembly generated by Supernova 2, old assembly = assembly generated by Supernova 1). These were compared to the self-comparison of the genome assembly of Pogonomyrmex barbatus. This plot was generated by collaborators.

Figure 25: Diversity of protein predictions. The y-axis represents the percentage of sequences clustered by CD-HIT (version 4.7). The x-axis represents the sequence similarity of the sequences inside each cluster (Li and Godzik, 2006). 16.21 % of the 27,264 protein sequences are clustered with 100 % similarity, and 7,949 sequences are clustered with a similarity of 40 %.

Appendix D

Programs used for the Genome annotation

Table 13: Summary of all programs that have been used in the genome annotation workflow.

Work-flow                Name           Version     Usage                                Reference
Transcriptome analysis   Trim galore    0.5.0       trimming of RNA reads                not published
                         Bowtie 2       2.3.4.2     building index and aligning          Langmead et al. (2009)
                         Tophat 2       2.1.2       splice junction identification       Trapnell et al. (2009)
                         Cufflinks      2.2.1       assemble exons                       Trapnell et al. (2012)
                         Cuffmerge      1.0.0       merge assemblies                     Trapnell et al. (2012)
                         HISAT 2        2.1.0       aligning reads to genome             Kim et al. (2015)
                         StringTie      1.3.4d      assemble and quantify transcripts    Pertea et al. (2015)
                         Trinity 1      r20131110   de-novo assemble                     Grabherr et al. (2011)
                         Trinity 2      2.8.4       de-novo assemble                     Henschel et al. (2012)
                         canu           1.8         assemble TGS sequences               Koren et al. (2017)
                         LAST           946         aligning sequences                   Kielbasa et al. (2011)
Repeat annotation        RepeatModeler  1.0.11      repeat identification                Smit and Hubley (2008-2015)
                         TE-class       2.1.3       classifies TEs                       Abrusán et al. (2009)
                         REPET          2.5         detection of TEs                     Flutre et al. (2011), Quesneville et al. (2005)
                         RepeatMasker   4.0.7       detection of repeats and masking     Smit et al. (2013-2015)
Genome annotation        GeMoMa         1.5.3       homology based genome annotation     Keilwagen et al. (2016)
                         MAKER 2        2.31.10     genome annotation pipeline           Holt and Yandell (2011)
                         SNAP           2.34.2      gene finding                         Korf (2004)
                         Augustus       3.3.1       gene prediction                      Stanke and Morgenstern (2005)
                         BUSCO          3.0.2       quality assessment                   Waterhouse et al. (2017)

Appendix E

Data from related species

Table 14: Summary of data sources for the related species used.

Species                 GenBank Assembly ID   Assembly Name   Date of access   Link
Pogonomyrmex barbatus   GCA_000187915.1       Pbar_UMD_V03    May 2018         www.ncbi.nlm.nih.gov/genome/?term=txid144034
Solenopsis invicta      GCA_000188075.2       Si_gnH          May 2018         www.ncbi.nlm.nih.gov/genome/?term=txid13686
Camponotus floridanus   GCA_003227725.1       Cflo_v7.5       July 2018        www.ncbi.nlm.nih.gov/genome/?term=txid104421
Apis mellifera          GCA_000002195.1       Amel_4.5        May 2018         www.ncbi.nlm.nih.gov/genome/?term=txid7460

Table 15: Summary of quality values from genome assemblies of related species. This data is based on the data sources from Table 14.

Species                 Assembly level   Sequencing method              Number of Sequences   Size      N50 [nt]     GC-content
Pogonomyrmex barbatus   Scaffold         Roche 454                      4,645                 235 Mbp   819,605      34 %
Camponotus floridanus   Scaffold         PacBio                         10,791                284 Mbp   1,585,631    26 %
Solenopsis invicta      Scaffold         Roche 454 and Illumina         69,511                396 Mbp   558,018      25 %
Apis mellifera          Chromosome       SOLiD, Sanger, and Roche 454   5,321                 250 Mbp   13,219,345   21.67 %

Table 16: Summary of redundancy in published proteins and transcripts of related species. This data is based on the data sources from Table 14.

Species         Number of all Proteins   Number of unique Proteins   Number of all Transcripts   Number of unique Transcripts
C. floridanus   23,971                   18,306                      25,910                      25,675
A. mellifera    22,451                   17,216                      28,739                      28,605
P. barbatus     19,128                   15,380                      20,672                      20,643
S. invicta      21,118                   17,567                      22,066                      21,975
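The redundancy implied by Table 16 can be derived by subtracting the unique counts from the total counts. The sketch below uses the values of Table 16; the data structure and function name are hypothetical.

```python
# (all proteins, unique proteins, all transcripts, unique transcripts), from Table 16
TABLE_16 = {
    "C. floridanus": (23971, 18306, 25910, 25675),
    "A. mellifera": (22451, 17216, 28739, 28605),
    "P. barbatus": (19128, 15380, 20672, 20643),
    "S. invicta": (21118, 17567, 22066, 21975),
}


def redundancy(all_count, unique_count):
    """Return (number of redundant sequences, redundant share in percent)."""
    redundant = all_count - unique_count
    return redundant, 100 * redundant / all_count


for species, (all_p, uniq_p, all_t, uniq_t) in TABLE_16.items():
    prot_n, prot_pct = redundancy(all_p, uniq_p)
    tran_n, tran_pct = redundancy(all_t, uniq_t)
    print(f"{species}: proteins {prot_n} ({prot_pct:.1f} %), "
          f"transcripts {tran_n} ({tran_pct:.1f} %)")
```

For C. floridanus this yields 5,665 redundant proteins, the number discussed in Section 4.5, while the transcript sets of all four species contain far fewer redundant sequences.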


Table 17: The P. californicus genome assembly from Supernova 2 was used as query in a whole genome BLAST search. The displayed values may give a hint of the genome sequence similarity between the related species and P. californicus.

Target: Assembly          Mean of        Median of      Identical aligned nucleotide   Aligned nucleotides   Number of gaps    Alignment
(Size of Assembly)        Identity [%]   Identity [%]   fraction [%]                   [%]                   [x 10^6]          length [Mb]
                                                        Query     Target               Query     Target      Query    Target
P. barbatus (235 Mb)      94.83          94.85          65.64     87.47                64.48     85.92       7.31     6.84     206.13
S. invicta (396 Mb)       68.75          69.48          54.35     43.10                46.24     36.66       28.72    27.91    170.68
C. floridanus (284 Mb)    66.53          67.93          39.15     43.29                32.33     35.74       18.85    21.31    122.95
A. mellifera (250 Mb)     65.30          69.23          9.56      12.00                7.37      9.25        3.38     4.17     30.04

Table 18: The OrthoDB for Hymenopterans consists of consensus sequences based on predictions from 25 species.

Species in Hymenoptera OrthoDB (v. 9)
Apis mellifera
Athalia rosae
Atta cephalotes
Bombus impatiens
Camponotus floridanus
Cephus cinctus
Cerapachys biroi
Copidosoma floridanum
Dufourea novaeangliae
Eufriesea mexicana
Fopius arisanus
Habropoda laboriosa
Harpegnathos saltator
Lasioglossum albipes
Linepithema humile
Megachile rotundata
Melipona quadrifasciata
Monomorium pharaonis
Nasonia vitripennis
Orussus abietinus
Pogonomyrmex barbatus
Polistes dominula
Trichogramma pretiosum
Vollenhovia emeryi
Wasmannia auropunctata

Appendix F

Assessment of Annotation

Figure 26: BUSCO genome assessment results for Apis mellifera, Camponotus floridanus, Pogonomyrmex barbatus, Pogonomyrmex californicus, and Solenopsis invicta genome assemblies based on the Hymenoptera specific single copy ortholog database from OrthoDB (v.9).


Figure 27: BUSCO transcript assessment results for A. mellifera, C. floridanus, P. barbatus, P. californicus (unique transcript), and S. invicta transcript assemblies based on the Hymenoptera specific single copy ortholog database from OrthoDB (v.9).

Figure 28: BUSCO protein assessment results for A. mellifera, C. floridanus, P. barbatus, P. californicus (unique proteins), and S. invicta protein predictions based on the Hymenoptera specific single copy ortholog database from OrthoDB (v.9).

Appendix G

Repeat annotation

Table 19: RepeatMasker result table from repeat annotation.

Class                No. elements   Masked [bp]   Masked [%]
DNA                  40955          8832185       2.82%
CMC-Chapaev-3        107            11645         0.00%
CMC-EnSpm            528            131519        0.04%
CMC-Transib          112            41872         0.01%
Crypton-V            249            48535         0.02%
Kolobok-Hydra        351            176587        0.06%
Kolobok-T2           2858           390566        0.12%
MULE-NOF             683            49970         0.02%
Maverick             5067           8225910       2.62%
Merlin               523            105514        0.03%
MuLE-NOF             12             6290          0.00%
P                    840            235940        0.08%
PIF-Harbinger        903            212400        0.07%
PIF-ISL2EU           22             8138          0.00%
PIF-Spy              965            72300         0.02%
PiggyBac             15             12435         0.00%
Sola-2               81             6942          0.00%
TcMar                314            40368         0.01%
TcMar-Fot1           20             1817          0.00%
TcMar-Mariner        4911           2503876       0.80%
TcMar-Tc1            8212           1148757       0.37%
TcMar-Tc4            1512           455723        0.15%
hAT                  293            69794         0.02%
hAT-Ac               102            39823         0.01%
hAT-Blackjack        1463           497071        0.16%
hAT-Charlie          446            306081        0.10%
hAT-hAT19            1              150           0.00%
DNA-ORFs:            12728          3965929       1.27%
LINE                 5609           1206387       0.38%
CR1                  2              105           0.00%
I                    52             46748         0.01%
Jockey               23             27342         0.01%
L2                   537            279951        0.09%
LOA                  36             24587         0.01%
Penelope             1831           772858        0.25%
R1                   881            1013047       0.32%
R1-LOA               59             36671         0.01%
R2                   34             26264         0.01%
R2-NeSL              47             79204         0.03%
RTE-BovB             11             2633          0.00%
RTE-X                902            446129        0.14%
LTR                  6054           1475811       0.47%
Copia                430            386351        0.12%
DIRS                 349            209765        0.07%
ERVK                 3              362           0.00%
Gypsy                7062           7468521       2.38%
Gypsy-Cigr           455            388061        0.12%
Pao                  952            910472        0.29%
LTR-ORFs:            1047           545614        0.17%
RC (Helitron)        2781           1197661       0.38%
Retro                4505           720609        0.23%
Retro-ORFs:          70             23655         0.01%
SINE                 689            83484         0.03%
Unknown              24916          8160936       2.60%
nonLTR               159            25421         0.01%
unclear              4601           728498        0.23%
unclear-ORFs:        616            565667        0.18%
total interspersed   148986         54450951      17.37%
Low complexity       31226          1620888       0.52%
Satellite            1              21            0.00%
Simple repeat        244071         12693600      4.05%
rRNA                 25             11620         0.00%
tRNA                 48             3402          0.00%
Total                424357         68780482      21.94%

Appendix H

Functional annotation

Table 20: Comparison of functional annotations from P. californicus. Public GenBank annotations and the functional annotation of this thesis were compared.

GI number    Published description (GenBank)        Annotation-ID      Annotation (Thesis)
1033938453   elongation factor 1 alpha, partial     Pcal_00003918-RA   Elongation factor 1-alpha
428787255    elongation factor 2 alpha, partial     Pcal_00003918-RA   Elongation factor 1-alpha
428787257    elongation factor 2 alpha, partial     Pcal_00003918-RA   Elongation factor 1-alpha
428787291    elongation factor 2 alpha, partial     Pcal_00003918-RA   Elongation factor 1-alpha
428787297    elongation factor 2 alpha, partial     Pcal_00003918-RA   Elongation factor 1-alpha
428787461    elongation factor 1 alpha, partial     Pcal_00008912-RA   Elongation factor 1-alpha
428787463    elongation factor 1 alpha, partial     Pcal_00008912-RA   Elongation factor 1-alpha
428787493    elongation factor 1 alpha, partial     Pcal_00008912-RA   Elongation factor 1-alpha
1033938659   wingless, partial                      Pcal_00001627-RA   WNT-1 Protein Wnt-1
428787311    wingless, partial                      Pcal_00001627-RA   WNT-1 Protein Wnt-1
428787313    wingless, partial                      Pcal_00001627-RA   WNT-1 Protein Wnt-1
428787341    wingless, partial                      Pcal_00001627-RA   WNT-1 Protein Wnt-1
428787361    carbomoylphosphate synthase, partial   Pcal_00002899-RA   r CAD protein
428787363    carbomoylphosphate synthase, partial   Pcal_00002899-RA   r CAD protein
428787385    carbomoylphosphate synthase, partial   Pcal_00002899-RA   r CAD protein
428787403    long wavelength rhodopsin, partial     Pcal_00009799-RA   Rhodopsin
428787405    long wavelength rhodopsin, partial     Pcal_00009799-RA   Rhodopsin
428787439    long wavelength rhodopsin, partial     Pcal_00009799-RA   Rhodopsin
428787447    long wavelength rhodopsin, partial     Pcal_00009799-RA   Rhodopsin


Figure 29: Distribution of tRNA classifications from the tRNAscan-SE run. The y-axis shows the frequency of each tRNA class on a logarithmic scale; the x-axis shows the tRNA classes. The amino acid isotype of each tRNA class is given as a three-letter code on the x-axis. "Pseudo" denotes pseudogenes, "Sup" denotes possible suppressor tRNA genes, and "Undet" denotes undetermined or unknown tRNA isotypes.

Bibliography

György Abrusán, Norbert Grundmann, Luc DeMester, and Wojciech Makalowski. Teclass–a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics (Oxford, England), 25(10):1329–1330, 2009. ISSN 1367-4811. doi: 10.1093/bioinformatics/btp084.

Fred W. Allendorf, Paul A. Hohenlohe, and Gordon Luikart. Genomics and the future of conservation genetics. Nature Reviews Genetics, 11:697, 2010. doi: 10.1038/nrg2844.

Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, 1990. ISSN 0022-2836. doi: 10.1016/S0022-2836(05)80360-2.

Michael R. Anderberg. Cluster Analysis for Applications: Probability and Mathematical Statistics: A Series of Monographs and Textbooks, volume 19 of Probability and mathematical statistics. Elsevier Science, Burlington, 2014. ISBN 9780120576500. URL http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&AN=931378.

Simon Andrews. Fastqc: A quality control tool for high throughput sequence data. 2010, 2016. URL https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

Manimozhiyan Arumugam, Jeroen Raes, Eric Pelletier, Denis Le Paslier, Takuji Yamada, Daniel R. Mende, Gabriel R. Fernandes, Julien Tap, Thomas Bruls, Jean-Michel Batto, Marcelo Bertalan, Natalia Borruel, Francesc Casellas, Leyden Fernandez, Laurent Gautier, Torben Hansen, Masahira Hattori, Tetsuya Hayashi, Michiel Kleerebezem, Ken Kurokawa, Marion Leclerc, Florence Levenez, Chaysavanh Manichanh, H. Bjørn Nielsen, Trine Nielsen, Nicolas Pons, Julie Poulain, Junjie Qin, Thomas Sicheritz-Ponten, Sebastian Tims, David Torrents, Edgardo Ugarte, Erwin G. Zoetendal, Jun Wang, Francisco Guarner, Oluf Pedersen, Willem M. de Vos, Søren Brunak, Joel Doré, MetaHIT Consortium, María Antolín, François Artiguenave, Hervé M. Blottiere, Mathieu Almeida, Christian Brechot, Carlos Cara, Christian Chervaux, Antonella Cultrone, Christine Delorme, Gérard Denariaz, Rozenn Dervyn, Konrad U. Foerstner, Carsten Friss, Maarten van de Guchte, Eric Guedon, Florence Haimet, Wolfgang Huber, Johan van Hylckama-Vlieg, Alexandre Jamet, Catherine Juste, Ghalia Kaci, Jan Knol, Karsten Kristiansen, Omar Lakhdari, Severine Layec, Karine Le Roux, Emmanuelle Maguin, Alexandre Mérieux, Raquel Melo Minardi, Christine M’rini, Jean Muller, Raish Oozeer, Julian Parkhill, Pierre Renault, Maria Rescigno, Nicolas Sanchez, Shinichi Sunagawa, Antonio Torrejon, Keith Turner, Gaetana Vandemeulebrouck, Encarna Varela, Yohanan Winogradsky, Georg Zeller, Jean Weissenbach, S. Dusko Ehrlich, and Peer Bork. Enterotypes of the human gut microbiome. Nature, 473:174, 2011. ISSN 1476-4687. doi: 10.1038/nature09944.

Philip M. Ashton, Satheesh Nair, Tim Dallman, Salvatore Rubino, Wolfgang Rabsch, Solomon Mwaigwisya, John Wain, and Justin O’Grady. Minion nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island. Nature Biotechnology, 33:296, 2014. ISSN 1546-1696. doi: 10.1038/nbt.3103.

Maurice R. Atkinson, Murray P. Deutscher, Arthur Kornberg, Alan F. Russell, and J. G. Moffatt. Enzymic synthesis of deoxyribonucleic acid. xxxiv. termination of chain growth by a 2’,3’-dideoxyribonucleotide. Biochemistry, 8(12):4897–4904, 1969. ISSN 0006-2960. doi: 10.1021/bi00840a037.

O. T. Avery, Colin M. MacLeod, and Maclyn McCarty. Studies on the chemical nature of the substance inducing transformation of pneumococcal types: Induction of transformation by a desoxyribonucleic acid fraction isolated from pneumococcus type III. Journal of Experimental Medicine, 79(2):137–158, 1944. ISSN 0022-1007. doi: 10.1084/jem.79.2.137.

Weidong Bao, Kenji K. Kojima, and Oleksiy Kohany. Repbase update, a database of repetitive elements in eukaryotic genomes. Mobile DNA, 6:11, 2015. ISSN 1759-8753. doi: 10.1186/s13100-015-0041-9.

Zhirong Bao and Sean R. Eddy. Automated de novo identification of repeat sequence families in sequenced genomes. Genome research, 12(8):1269–1276, 2002. ISSN 1088-9051. doi: 10.1101/gr.88502.

H. Beier. Misreading of termination codons in eukaryotes by natural nonsense suppressor trnas. Nucleic acids research, 29(23):4767–4782, 2001. ISSN 1362-4962. doi: 10.1093/nar/29.23.4767.

Barry Bolton. An online catalog of the ants of the world - antcat, 2019. URL http://antcat.org/.

Seán G. Brady, Ted R. Schultz, Brian L. Fisher, and Philip S. Ward. Evaluating alternative hypotheses for the early evolution and diversification of ants. Proceedings of the National Academy of Sciences of the United States of America, 103(48):18172–18177, 2006. ISSN 1091-6490. doi: 10.1073/pnas.0605858103.

Seán G. Brady, Brian L. Fisher, Ted R. Schultz, and Philip S. Ward. The rise of army ants and their relatives: Diversification of specialized predatory doryline ants. BMC evolutionary biology, 14:93, 2014. ISSN 1471-2148. doi: 10.1186/1471-2148-14-93.

David C. Brock, editor. Understanding Moore’s law: Four decades of innovation: Chapter 7: Moore’s law at 40. Chemical Heritage Press, Philadelphia, Pa., 2006. ISBN 978-0-941901-41-3. URL http://www.loc.gov/catdir/enhancements/fy0643/2006010387-b.html.

Etienne Bucher, Jon Reinders, and Marie Mirouze. Epigenetic control of transposon transcription and mobility in arabidopsis. Current opinion in plant biology, 15(5):503–510, 2012. doi: 10.1016/j.pbi.2012.08.006.

Nicolas Buisine, Hadi Quesneville, and Vincent Colot. Improved detection and annotation of transposable elements in sequenced genomes using multiple reference sequence sets. Genomics, 91(5):467–475, 2008. ISSN 0888-7543. doi: 10.1016/j.ygeno.2008.01.005.

Michael S. Campbell, Carson Holt, Barry Moore, and Mark Yandell. Genome annotation and curation using maker and maker-p. Current protocols in bioinformatics, 48:4.11.1–39, 2014. doi: 10.1002/0471250953.bi0411s48.

Brandi L. Cantarel, Ian Korf, Sofia M. C. Robb, Genis Parra, Eric Ross, Barry Moore, Carson Holt, Alejandro Sánchez Alvarado, and Mark Yandell. Maker: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome research, 18(1):188–196, 2008. ISSN 1088-9051. doi: 10.1101/gr.6743907.

P. S. G. Chain, D. V. Grafham, R. S. Fulton, M. G. Fitzgerald, J. Hostetler, D. Muzny, J. Ali, B. Birren, D. C. Bruce, C. Buhay, J. R. Cole, Y. Ding, S. Dugan, D. Field, G. M. Garrity, R. Gibbs, T. Graves, C. S. Han, S. H. Harrison, S. Highlander, P. Hugenholtz, H. M. Khouri, C. D. Kodira, E. Kolker, N. C. Kyrpides, D. Lang, A. Lapidus, S. A. Malfatti, V. Markowitz, T. Metha, K. E. Nelson, J. Parkhill, S. Pitluck, X. Qin, T. D. Read, J. Schmutz, S. Sozhamannan, P. Sterk, R. L. Strausberg, G. Sutton, N. R. Thomson, J. M. Tiedje, G. Weinstock, A. Wollam, and J. C. Detter. Genomics. genome project standards in a new era of sequencing. Science (New York, N.Y.), 326(5950):236–237, 2009. ISSN 1095-9203. doi: 10.1126/science.1180614.

Erwin Chargaff. Chemical specificity of nucleic acids and mechanism of their enzymatic degradation. Experientia, 6(6):201–209, 1950. ISSN 0014-4754. doi: 10.1007/BF02173653. URL https://link.springer.com/content/pdf/10.1007/BF02173653.pdf.

Geng Chen, KangPing Yin, Charles Wang, and TieLiu Shi. De novo transcriptome assembly of rna-seq reads with different strategies. Science China Life Sciences, 54(12):1129–1133, 2011. ISSN 1869-1889. doi: 10.1007/s11427-011-4256-9.

Richard Cordaux and Mark A. Batzer. The impact of retrotransposons on human genome evolution. Nature Reviews Genetics, 10:691, 2009. doi: 10.1038/nrg2640.

F. Crick. On protein synthesis. Medical Research Council Unit for the Study of Molecular Biology, Cavendish Laboratory, Cambridge, (12):138–163, 1958.

F. Crick. Central dogma of molecular biology. Nature, 227(5258):561–563, 1970. ISSN 1476-4687. doi: 10.1038/227561a0.

Joseph de Vita. Mechanisms of interference and foraging among colonies of the harvester ant pogonomyrmex californicus in the mojave desert. Ecology, 60(4):729–737, 1979. ISSN 00129658. doi: 10.2307/1936610.

Juliane C. Dohm, Claudio Lottaz, Tatiana Borodina, and Heinz Himmelbauer. Substan- tial biases in ultra-short read data sets from high-throughput dna sequencing. Nucleic acids research, 36(16):e105, 2008. ISSN 1362-4962. doi: 10.1093/nar/gkn425.

Elias Dohmen, Lukas P. M. Kremer, Erich Bornberg-Bauer, and Carsten Kemena. Dogma: Domain-based transcriptome and proteome quality assessment. Bioinformatics (Oxford, England), 32(17):2577–2581, 2016. ISSN 1367-4811. doi: 10.1093/bioinformatics/btw231.

John Eid, Adrian Fehr, Jeremy Gray, Khai Luong, John Lyle, Geoff Otto, Paul Peluso, David Rank, Primo Baybayan, Brad Bettman, Arkadiusz Bibillo, Keith Bjornson, Bidhan Chaudhuri, Frederick Christians, Ronald Cicero, Sonya Clark, Ravindra Dalal, Alex Dewinter, John Dixon, Mathieu Foquet, Alfred Gaertner, Paul Hardenbol, Cheryl Heiner, Kevin Hester, David Holden, Gregory Kearns, Xiangxu Kong, Ronald Kuse, Yves Lacroix, Steven Lin, Paul Lundquist, Congcong Ma, Patrick Marks, Mark Maxham, Devon Murphy, Insil Park, Thang Pham, Michael Phillips, Joy Roy, Robert Sebra, Gene Shen, Jon Sorenson, Austin Tomaney, Kevin Travers, Mark Trulson, John Vieceli, Jeffrey Wegener, Dawn Wu, Alicia Yang, Denis Zaccarin, Peter Zhao, Frank Zhong, Jonas Korlach, and Stephen Turner. Real-time dna sequencing from single polymerase molecules. Science (New York, N.Y.), 323(5910):133–138, 2009. ISSN 1095-9203. doi: 10.1126/science.1162986.

Karen Eilbeck, Barry Moore, Carson Holt, and Mark Yandell. Quantitative measures for the management and comparison of annotated genomes. BMC bioinformatics, 10:67, 2009. ISSN 1471-2105. doi: 10.1186/1471-2105-10-67.

Robert Ekblom and Jochen B. W. Wolf. A field guide to whole-genome sequencing, assembly and annotation. Evolutionary applications, 7(9):1026–1042, 2014. ISSN 1752-4571. doi: 10.1111/eva.12178.

Kurt Felix. Albrecht kossel. leben und werk. Naturwissenschaften, 42(17):473–478, 1955. ISSN 1432-1904. doi: 10.1007/BF00627952.

Timothée Flutre, Elodie Duprat, Catherine Feuillet, and Hadi Quesneville. Considering transposable element diversification in de novo annotation approaches. PloS one, 6(1):e16526, 2011. doi: 10.1371/journal.pone.0016526.

Werner E. Gerabek, Bernhard D. Haage, Gundolf Keil, and Wolfgang Wegner, editors. Enzyklopädie Medizingeschichte. de Gruyter, Berlin, 2005. ISBN 3110157144.

Sara Goodwin, John D. McPherson, and W. Richard McCombie. Coming of age: Ten years of next-generation sequencing technologies. Nature Reviews Genetics, 17:333, 2016. doi: 10.1038/nrg.2016.49.

Deborah M. Gordon. The dynamics of the daily round of the harvester ant colony (pogonomyrmex barbatus). Animal Behaviour, 34(5):1402–1419, 1986. ISSN 00033472. doi: 10.1016/S0003-3472(86)80211-1.

Manfred G. Grabherr, Brian J. Haas, Moran Yassour, Joshua Z. Levin, Dawn A. Thompson, Ido Amit, Xian Adiconis, Lin Fan, Raktima Raychowdhury, Qiandong Zeng, Zehua Chen, Evan Mauceli, Nir Hacohen, Andreas Gnirke, Nicholas Rhind, Federica Di Palma, Bruce W. Birren, Chad Nusbaum, Kerstin Lindblad-Toh, Nir Friedman, and Aviv Regev. Full-length transcriptome assembly from rna-seq data without a reference genome. Nature Biotechnology, 29(7):644, 2011. ISSN 1546-1696. doi: 10.1038/nbt.1883. URL https://www.nature.com/articles/nbt.1883.pdf.

Roderic Guigó, Paul Flicek, Josep F. Abril, Alexandre Reymond, Julien Lagarde, France Denoeud, Stylianos Antonarakis, Michael Ashburner, Vladimir B. Bajic, Ewan Birney, Robert Castelo, Eduardo Eyras, Catherine Ucla, Thomas R. Gingeras, Jennifer Harrow, Tim Hubbard, Suzanna E. Lewis, and Martin G. Reese. Egasp: The human encode genome annotation assessment project. Genome biology, 7 Suppl 1:S2.1–31, 2006. doi: 10.1186/gb-2006-7-s1-s2.

Yujun Han and Susan R. Wessler. Mite-hunter: A program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic acids research, 38(22):e199, 2010. ISSN 1362-4962. doi: 10.1093/nar/gkq862.

M. A. Harris, J. Clark, A. Ireland, J. Lomax, M. Ashburner, R. Foulger, K. Eilbeck, S. Lewis, B. Marshall, C. Mungall, J. Richter, G. M. Rubin, J. A. Blake, C. Bult, M. Dolan, H. Drabkin, J. T. Eppig, D. P. Hill, L. Ni, M. Ringwald, R. Balakrishnan, J. M. Cherry, K. R. Christie, M. C. Costanzo, S. S. Dwight, S. Engel, D. G. Fisk, J. E. Hirschman, E. L. Hong, R. S. Nash, A. Sethuraman, C. L. Theesfeld, D. Botstein, K. Dolinski, B. Feierbach, T. Berardini, S. Mundodi, S. Y. Rhee, R. Apweiler, D. Barrell, E. Camon, E. Dimmer, V. Lee, R. Chisholm, P. Gaudet, W. Kibbe, R. Kishore, E. M. Schwarz, P. Sternberg, M. Gwinn, L. Hannick, J. Wortman, M. Berriman, V. Wood, N. de La Cruz, P. Tonellato, P. Jaiswal, T. Seigfried, and R. White. The gene ontology (go) database and informatics resource. Nucleic acids research, 32(Database issue):D258–61, 2004. ISSN 1362-4962. doi: 10.1093/nar/gkh036.

Martin Helmkampf, Alexander S. Mikheyev, Yun Kang, Jennifer Fewell, and Jürgen Gadau. Gene expression and variation in social aggression by queens of the harvester ant pogonomyrmex californicus. Molecular ecology, 25(15):3716–3730, 2016. doi: 10.1111/mec.13700.

Robert Henschel, Phillip M. Nista, Matthias Lieber, Brian J. Haas, Le-Shin Wu, and Richard D. LeDuc. Trinity rna-seq assembler performance optimization. In Craig Stewart, editor, Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment Bridging from the eXtreme to the campus and beyond, page 1, New York, NY, 2012. ACM. ISBN 9781450316026. doi: 10.1145/2335755.2335842.

Bert Hölldobler and Edward O. Wilson. The Ants. Belknap Press of Harvard University Press, 1990. ISBN 9780674040755. URL https://books.google.de/books?id=ljxV4h61vhUC.

Carson Holt and Mark Yandell. Maker2: An annotation pipeline and genome-database management tool for second-generation genome projects. BMC bioinformatics, 12:491, 2011. ISSN 1471-2105. doi: 10.1186/1471-2105-12-491.

International Human Genome Sequencing Consortium, Eric S. Lander, Lauren M. Linton, Bruce Birren, Chad Nusbaum, R., Patrinos, A., and Morgan, M. J. Initial sequencing and analysis of the human genome. Nature, 409:860, 2001. ISSN 1476-4687. doi: 10.1038/35057062.

Camilla L. C. Ip, Matthew Loose, John R. Tyson, Mariateresa de Cesare, Bonnie L. Brown, Miten Jain, Richard M. Leggett, David A. Eccles, Vadim Zalunin, John M. Urban, Paolo Piazza, Rory J. Bowden, Benedict Paten, Solomon Mwaigwisya, Elizabeth M. Batty, Jared T. Simpson, Terrance P. Snutch, Ewan Birney, David Buck, Sara Goodwin, Hans J. Jansen, Justin O’Grady, and Hugh E. Olsen. Minion analysis and reference consortium: Phase 1 data release and analysis. F1000Research, 4:1075, 2015. ISSN 2046-1402. doi: 10.12688/f1000research.7201.1.

Miten Jain, Ian T. Fiddes, Karen H. Miga, Hugh E. Olsen, Benedict Paten, and Mark Akeson. Improved data analysis for the minion nanopore sequencer. Nature Methods, 12:351, 2015. ISSN 1548-7105. doi: 10.1038/nmeth.3290.

Philip Jones, David Binns, Hsin-Yu Chang, Matthew Fraser, Weizhong Li, Craig McAnulla, Hamish McWilliam, John Maslen, Alex Mitchell, Gift Nuka, Sebastien Pesseat, Antony F. Quinn, Amaia Sangrador-Vegas, Maxim Scheremetjew, Siew-Yit Yong, Rodrigo Lopez, and Sarah Hunter. Interproscan 5: genome-scale protein function classification. Bioinformatics (Oxford, England), 30(9):1236–1240, 2014. ISSN 1367-4811. doi: 10.1093/bioinformatics/btu031.

M. Kanehisa. Kegg: Kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1):27–30, 2000. ISSN 1362-4962. doi: 10.1093/nar/28.1.27.

Vladimir V. Kapitonov and Jerzy Jurka. A novel class of sine elements derived from 5s rrna. Molecular biology and evolution, 20(5):694–702, 2003. doi: 10.1093/molbev/msg075.

Vladimir V. Kapitonov and Jerzy Jurka. A universal classification of eukaryotic transposable elements implemented in repbase. Nature Reviews Genetics, 9:411, 2008. doi: 10.1038/nrg2165-c1.

Yuri Kapustin, Alexander Souvorov, Tatiana Tatusova, and David Lipman. Splign: algorithms for computing spliced alignments with identification of paralogs. Biology direct, 3:20, 2008. doi: 10.1186/1745-6150-3-20.

Peter D. Karp, Markus Krummenacker, Suzanne Paley, and Jonathan Wagg. Integrated pathway–genome databases and their role in drug discovery. Trends in Biotechnology, 17(7):275–281, 1999. ISSN 01677799. doi: 10.1016/S0167-7799(99)01316-5.

Yechezkel Kashi and David G. King. Simple sequence repeats as advantageous mutators in evolution. Trends in genetics : TIG, 22(5):253–259, 2006. ISSN 0168-9525. doi: 10.1016/j.tig.2006.03.005.

Haig H. Kazazian and John V. Moran. The impact of l1 retrotransposons on the human genome. Nature Genetics, 19(1):19–24, 1998. ISSN 1546-1718. doi: 10.1038/ng0598-19.

Jens Keilwagen, Michael Wenk, Jessica L. Erickson, Martin H. Schattat, Jan Grau, and Frank Hartung. Using intron position conservation for homology-based gene prediction. Nucleic acids research, 44(9):e89, 2016. ISSN 1362-4962. doi: 10.1093/nar/gkw092.

Jens Keilwagen, Frank Hartung, Michael Paulini, Sven O. Twardziok, and Jan Grau. Combining rna-seq data and homology-based gene prediction for plants, animals and fungi. BMC bioinformatics, 19(1):189, 2018. ISSN 1471-2105. doi: 10.1186/s12859-018-2203-5.

O. Khorkova, J. Hsiao, and C. Wahlestedt. Basic biology and therapeutic implications of lncrna. Advanced drug delivery reviews, 87:15–24, 2015. doi: 10.1016/j.addr.2015.05.012.

Szymon M. Kielbasa, Raymond Wan, Kengo Sato, Paul Horton, and Martin C. Frith. Adaptive seeds tame genomic sequence comparison. Genome research, 21(3):487–493, 2011. ISSN 1088-9051. doi: 10.1101/gr.113985.110.

Daehwan Kim, Ben Langmead, and Steven L. Salzberg. Hisat: A fast spliced aligner with low memory requirements. Nature Methods, 12(4):357, 2015. ISSN 1548-7105. doi: 10.1038/nmeth.3317. URL https://www.nature.com/articles/nmeth.3317.pdf.

Sergey Koren, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, and Adam M. Phillippy. Canu: Scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome research, 27(5):722–736, 2017. ISSN 1088-9051. doi: 10.1101/gr.215087.116.

Ian Korf. Gene finding in novel genomes. BMC bioinformatics, 5:59, 2004. ISSN 1471-2105. doi: 10.1186/1471-2105-5-59.

Jocelyn E. Krebs, Elliott S. Goldstein, Stephen T. Kilpatrick, and Benjamin Lewin. Lewin’s essential genes. Jones and Bartlett Publ, Sudbury, Mass., 2. ed., international ed. edition, 2010. ISBN 9780763774103.

Ben Langmead and Steven L. Salzberg. Fast gapped-read alignment with bowtie 2. Nature Methods, 9(4):357, 2012. ISSN 1548-7105. doi: 10.1038/nmeth.1923. URL https://www.nature.com/articles/nmeth.1923.pdf.

Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome biology, 10(3):R25, 2009. doi: 10.1186/gb-2009-10-3-r25.

T. Laver, J. Harrison, P. A. O’Neill, K. Moore, A. Farbos, K. Paszkiewicz, and D. J. Studholme. Assessing the performance of the oxford nanopore technologies minion. Biomolecular detection and quantification, 3:1–8, 2015. ISSN 2214-7535. doi: 10.1016/j.bdq.2015.02.001.

Albert Levan, Karl Fredga, and Avery A. Sandberg. Nomenclature for centromeric position on chromosomes. Hereditas, 52(2):201–220, 1964. ISSN 00180661. doi: 10.1111/j.1601-5223.1964.tb01953.x.

Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics (Oxford, England), 22(13):1658–1659, 2006. ISSN 1367-4811. doi: 10.1093/bioinformatics/btl158.

T. M. Lowe and S. R. Eddy. trnascan-se: A program for improved detection of transfer rna genes in genomic sequence. Nucleic acids research, 25(5):955–964, 1997. ISSN 1362-4962.

Hengyun Lu, Francesca Giordano, and Zemin Ning. Oxford nanopore minion sequencing and genome assembly. Genomics, proteomics & bioinformatics, 14(5):265–279, 2016. doi: 10.1016/j.gpb.2016.05.004.

Elaine R. Mardis. Next-generation sequencing platforms. Annual review of analytical chemistry (Palo Alto, Calif.), 6:287–303, 2013. doi: 10.1146/annurev-anchem-062012-092628.

Elaine R. Mardis. Dna sequencing technologies: 2006–2016. Nature protocols, 12(2):213, 2017. ISSN 1750-2799. doi: 10.1038/nprot.2016.182. URL https://www.nature.com/articles/nprot.2016.182.pdf.

Mark S. Bretscher. Direct translation of a circular messenger dna. Nature, (220):1088–1091, 1968. ISSN 1476-4687. URL https://link.springer.com/content/pdf/10.1038/2201088a0.pdf.

John S. Mattick and Igor V. Makunin. Non-coding rna. Human molecular genetics, 15 Spec No 1:R17–29, 2006. ISSN 0964-6906. doi: 10.1093/hmg/ddl046.

B. J. McCarthy and J. J. Holland. Denatured dna as a direct template for in vitro protein synthesis. Proceedings of the National Academy of Sciences, 54(3):880–886, 1965. ISSN 0027-8424. doi: 10.1073/pnas.54.3.880.

B. McClintock. The origin and behavior of mutable loci in maize. Proceedings of the National Academy of Sciences, 36(6):344–355, 1950. ISSN 0027-8424. doi: 10.1073/pnas.36.6.344.

Marcella A. McClure, Hugh S. Richardson, Rochelle A. Clinton, Crystal M. Hepp, Brad A. Crowther, and Eric F. Donaldson. Automated characterization of potentially active retroid agents in the human genome. Genomics, 85(4):512–523, 2005. ISSN 0888-7543. doi: 10.1016/j.ygeno.2004.12.006.

Frazer Meacham, Dario Boffelli, Joseph Dhahbi, David I. K. Martin, Meromit Singer, and Lior Pachter. Identification and correction of systematic error in high-throughput sequence data. BMC bioinformatics, 12:451, 2011. ISSN 1471-2105. doi: 10.1186/1471-2105-12-451.

André E. Minoche, Juliane C. Dohm, and Heinz Himmelbauer. Evaluation of genomic high-throughput sequencing data generated on illumina hiseq and genome analyzer systems. Genome biology, 12(11):R112, 2011. doi: 10.1186/gb-2011-12-11-r112.

Moisès Burset and Roderic Guigó. Evaluation of gene structure prediction programs. Genomics, 34(3):353–367, 1996. ISSN 0888-7543. doi: 10.1006/geno.1996.0298. URL https://www.sciencedirect.com/science/article/pii/S0888754396902980/pdf?md5=9ef5f903aa2d8e65e0767cf253329064&pid=1-s2.0-S0888754396902980-main.pdf.

G. E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82–85, 1998. ISSN 0018-9219. doi: 10.1109/JPROC.1998.658762. URL http://www.cs.utexas.edu/~fussell/courses/cs352h/papers/moore.pdf.

Corrie S. Moreau, Charles D. Bell, Roger Vila, S. Bruce Archibald, and Naomi E. Pierce. Phylogeny of the ants: Diversification in the age of angiosperms. Science (New York, N.Y.), 312(5770):101–104, 2006. ISSN 1095-9203. doi: 10.1126/science.1124891.

Stephen M. Mount, Valer Gotea, Chiao-Feng Lin, Kristina Hernandez, and Wojciech Makalowski. Spliceosomal small nuclear rna genes in 11 insect genomes. RNA (New York, N.Y.), 13(1):5–14, 2007. ISSN 1355-8382. doi: 10.1261/rna.259207.

K. Mullis, F. Faloona, S. Scharf, R. Saiki, G. Horn, and H. Erlich. Specific enzymatic amplification of dna in vitro: The polymerase chain reaction. Cold Spring Harbor Symposia on Quantitative Biology, 51(0):263–273, 1986. ISSN 0091-7451. doi: 10.1101/SQB.1986.051.01.032.

Kensuke Nakamura, Taku Oshima, Takuya Morimoto, Shun Ikeda, Hirofumi Yoshikawa, Yuh Shiwa, Shu Ishikawa, Margaret C. Linak, Aki Hirai, Hiroki Takahashi, Md Altaf-Ul-Amin, Naotake Ogasawara, and Shigehiko Kanaya. Sequence-specific error profile of illumina sequencers. Nucleic acids research, 39(13):e90, 2011. ISSN 1362-4962. doi: 10.1093/nar/gkr344.

Nuala A. O’Leary, Mathew W. Wright, J. Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei, Alexander Astashyn, Azat Badretdin, Yiming Bao, Olga Blinkova, Vyacheslav Brover, Vyacheslav Chetvernin, Jinna Choi, Eric Cox, Olga Ermolaeva, Catherine M. Farrell, Tamara Goldfarb, Tripti Gupta, Daniel Haft, Eneida Hatcher, Wratko Hlavina, Vinita S. Joardar, Vamsi K. Kodali, Wenjun Li, Donna Maglott, Patrick Masterson, Kelly M. McGarvey, Michael R. Murphy, Kathleen O’Neill, Shashikant Pujar, Sanjida H. Rangwala, Daniel Rausch, Lillian D. Riddick, Conrad Schoch, Andrei Shkeda, Susan S. Storz, Hanzhen Sun, Francoise Thibaud-Nissen, Igor Tolstoy, Raymond E. Tully, Anjana R. Vatsan, Craig Wallin, David Webb, Wendy Wu, Melissa J. Landrum, Avi Kimchi, Tatiana Tatusova, Michael DiCuccio, Paul Kitts, Terence D. Murphy, and Kim D. Pruitt. Reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation. Nucleic acids research, 44(D1):D733–45, 2016. ISSN 1362-4962. doi: 10.1093/nar/gkv1189.

Len Ornstein. Disc electrophoresis: Background and theory. Cell Research Laboratory, The Mount Sinai Hospital, New York, N. Y, 1962. URL https://pdfs.semanticscholar.org/7558/9937b2d4e631038b4263f9de1a0de15b881e.pdf.

Chandra Shekhar Pareek, Rafal Smoczynski, and Andrzej Tretyn. Sequencing technologies and genome sequencing. Journal of applied genetics, 52(4):413–435, 2011. doi: 10.1007/s13353-011-0057-x.

Matthew M. Parks, Chad M. Kurylo, Jake E. Batchelder, C. Theresa Vincent, and Scott C. Blanchard. Implications of sequence variation on the evolution of rrna. Chromosome Research, 27(1):89–93, 2019. ISSN 1573-6849. doi: 10.1007/s10577-018-09602-w. URL https://doi.org/10.1007/s10577-018-09602-w.

Genis Parra, Keith Bradnam, and Ian Korf. Cegma: A pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics (Oxford, England), 23(9):1061–1067, 2007. ISSN 1367-4811. doi: 10.1093/bioinformatics/btm071.

Mihaela Pertea, Geo M. Pertea, Corina M. Antonescu, Tsung-Cheng Chang, Joshua T. Mendell, and Steven L. Salzberg. Stringtie enables improved reconstruction of a transcriptome from rna-seq reads. Nature Biotechnology, 33(3):290, 2015. ISSN 1546-1696. doi: 10.1038/nbt.3122. URL https://www.nature.com/articles/nbt.3122.pdf.

Mihai Pop. Genome assembly reborn: recent computational challenges. Briefings in bioinformatics, 10(4):354–366, 2009. ISSN 1477-4054. doi: 10.1093/bib/bbp026.

Alkes L. Price, Neil C. Jones, and Pavel A. Pevzner. De novo identification of repeat families in large genomes. Bioinformatics (Oxford, England), 21 Suppl 1:i351–8, 2005. ISSN 1367-4811. doi: 10.1093/bioinformatics/bti1018.

Leszek P. Pryszcz and Toni Gabaldón. Redundans: an assembly pipeline for highly heterozygous genomes. Nucleic acids research, 44(12):e113, 2016. ISSN 1362-4962. doi: 10.1093/nar/gkw294.

Hadi Quesneville, Casey M. Bergman, Olivier Andrieu, Delphine Autard, Danielle Nouaud, Michael Ashburner, and Dominique Anxolabehere. Combined evidence annotation of transposable elements in genome sequences. PLoS computational biology, 1(2):166–175, 2005. doi: 10.1371/journal.pcbi.0010022.

R. W. Holley, G. A. Everett, J. T. Madison, and A. Zamir. Nucleotide sequences in the yeast alanine transfer ribonucleic acid. Journal of Biological Chemistry, 240:2122–2128, 1965. ISSN 0021-9258.

Anthony Rhoads and Kin Fai Au. Pacbio sequencing and its applications. Genomics, proteomics & bioinformatics, 13(5):278–289, 2015. doi: 10.1016/j.gpb.2015.08.002.

F. Sanger, S. Nicklen, and A. R. Coulson. Dna sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74(12):5463–5467, 1977. ISSN 0027-8424. doi: 10.1073/pnas.74.12.5463.

Jeffery A. Schloss. How to get genomes at one ten-thousandth the cost. Nature Biotechnology, 26:1113, 2008. ISSN 1546-1696. doi: 10.1038/nbt1008-1113.

Ingo Schubert. Chromosome evolution. Current opinion in plant biology, 10(2):109–115, 2007. doi: 10.1016/j.pbi.2007.01.001.

Emily J. Shields, Lihong Sheng, Amber K. Weiner, Benjamin A. Garcia, and Roberto Bonasio. High-quality genome assemblies reveal long non-coding rnas expressed in ant brains. Cell reports, 23(10):3078–3090, 2018. doi: 10.1016/j.celrep.2018.05.014.

Felipe A. Simão, Robert M. Waterhouse, Panagiotis Ioannidis, Evgenia V. Kriventseva, and Evgeny M. Zdobnov. Busco: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics (Oxford, England), 31(19):3210–3212, 2015. ISSN 1367-4811. doi: 10.1093/bioinformatics/btv351.

David Sims, Ian Sudbery, Nicholas E. Ilott, Andreas Heger, and Chris P. Ponting. Sequencing depth and coverage: key considerations in genomic analyses. Nature reviews. Genetics, 15(2):121–132, 2014. ISSN 1471-0064. doi: 10.1038/nrg3642.

Guy St C. Slater and Ewan Birney. Automated generation of heuristics for biological sequence comparison. BMC bioinformatics, 6:31, 2005. ISSN 1471-2105. doi: 10.1186/1471-2105-6-31.

A.F.A. Smit and R. Hubley. Repeatmodeler open-1.0. 2008-2015. URL http://www.repeatmasker.org.

A.F.A. Smit, R. Hubley, and P. Green. Repeatmasker open-4.0. 2013-2015. URL http://www.repeatmasker.org.

Chris R. Smith, Christopher D. Smith, Hugh M. Robertson, Martin Helmkampf, Aleksey Zimin, Mark Yandell, Carson Holt, Hao Hu, Ehab Abouheif, Richard Benton, Elizabeth Cash, Vincent Croset, Cameron R. Currie, Eran Elhaik, Christine G. Elsik, Marie-Julie Favé, Vilaiwan Fernandes, Joshua D. Gibson, Dan Graur, Wulfila Gronenberg, Kirk J. Grubbs, Darren E. Hagen, Ana Sofia Ibarraran Viniegra, Brian R. Johnson, Reed M. Johnson, Abderrahman Khila, Jay W. Kim, Kaitlyn A. Mathis, Monica C. Munoz-Torres, Marguerite C. Murphy, Julie A. Mustard, Rin Nakamura, Oliver Niehuis, Surabhi Nigam, Rick P. Overson, Jennifer E. Placek, Rajendhran Rajakumar, Justin T. Reese, Garret Suen, Shu Tao, Candice W. Torres, Neil D. Tsutsui, Lumi Viljakainen, Florian Wolschin, and Jürgen Gadau. Draft genome of the red harvester ant pogonomyrmex barbatus. Proceedings of the National Academy of Sciences of the United States of America, 108(14):5667–5672, 2011. ISSN 1091-6490. doi: 10.1073/pnas.1007901108.

Mario Stanke and Burkhard Morgenstern. Augustus: A web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic acids research, 33(Web Server issue):W465–7, 2005. ISSN 1362-4962. doi: 10.1093/nar/gki458.

Steven R. Head, H. Kiyomi Komori, Sarah A. LaMere, Thomas Whisenant, Filip Van Nieuwerburgh, Daniel R. Salomon, and Phillip Ordoukhanian. Library construction for next-generation sequencing: Overviews and challenges. Biotechniques, (Vol. 56 No. 2), 2018. URL https://www.future-science.com/doi/full/10.2144/000114133.

Peter H. Sudmant, Tobias Rausch, Eugene J. Gardner, Robert E. Handsaker, Alexej Abyzov, John Huddleston, Yan Zhang, Kai Ye, Goo Jun, Markus Hsi-Yang Fritz, Miriam K. Konkel, Ankit Malhotra, Adrian M. Stütz, Xinghua Shi, Francesco Paolo Casale, Jieming Chen, Fereydoun Hormozdiari, Gargi Dayama, Ken Chen, Maika Malig, Mark J. P. Chaisson, Klaudia Walter, Sascha Meiers, Seva Kashin, Erik Garrison, Adam Auton, Hugo Y. K. Lam, Xinmeng Jasmine Mu, Can Alkan, Danny Antaki, Taejeong Bae, Eliza Cerveira, Peter Chines, Zechen Chong, Laura Clarke, Elif Dal, Li Ding, Sarah Emery, Xian Fan, Madhusudan Gujral, Fatma Kahveci, Jeffrey M. Kidd, Yu Kong, Eric-Wubbo Lameijer, Shane McCarthy, Paul Flicek, Richard A. Gibbs, Gabor Marth, Christopher E. Mason, Androniki Menelaou, Donna M. Muzny, Bradley J. Nelson, Amina Noor, Nicholas F. Parrish, Matthew Pendleton, Andrew Quitadamo, Benjamin Raeder, Eric E. Schadt, Mallory Romanovitch, Andreas Schlattl, Robert Sebra, Andrey A. Shabalin, Andreas Untergasser, Jerilyn A. Walker, Min Wang, Fuli Yu, Chengsheng Zhang, Jing Zhang, Xiangqun Zheng-Bradley, Wanding Zhou, Thomas Zichner, Jonathan Sebat, Mark A. Batzer, Steven A. McCarroll, The 1000 Genomes Project Consortium, Ryan E. Mills, Mark B. Gerstein, Ali Bashir, Oliver Stegle, Scott E. Devine, Charles Lee, Evan E. Eichler, and Jan O. Korbel. An integrated map of structural variation in 2,504 human genomes. Nature, 526(7571):75, 2015. ISSN 1476-4687. doi: 10.1038/nature15394. URL https://www.nature.com/articles/nature15394.pdf.

S. W. Taber, J. C. Cokendolpher, and O. F. Francke. Karyological study of North American Pogonomyrmex (Hymenoptera: Formicidae). Insectes Sociaux, 35(1):47–60, 1988. ISSN 1420-9098. doi: 10.1007/BF02224137.

Ge Tan, Lennart Opitz, Ralph Schlapbach, and Hubert Rehrauer. Long fragments achieve lower base quality in Illumina paired-end sequencing. Scientific Reports, 9(1):2856, 2019. ISSN 2045-2322. doi: 10.1038/s41598-019-39076-7.

The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526(7571):68, 2015. ISSN 1476-4687. doi: 10.1038/nature15393. URL https://www.nature.com/articles/nature15393.pdf.

The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: A platform for investigating biology. Science (New York, N.Y.), 282(5396):2012–2018, 1998. ISSN 1095-9203. URL http://www.jstor.org/stable/2897605.

The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic acids research, 47(D1):D506–D515, 2018. ISSN 1362-4962. doi: 10.1093/nar/gky1049.

Jainy Thomas and Ellen J. Pritham. Helitrons, the eukaryotic rolling-circle transposable elements. Microbiology spectrum, 3(4), 2015. doi: 10.1128/microbiolspec.MDNA3-0049-2014.

Cole Trapnell, Lior Pachter, and Steven L. Salzberg. TopHat: Discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9):1105–1111, 2009. ISSN 1367-4811. doi: 10.1093/bioinformatics/btp120.

Cole Trapnell, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David R. Kelley, Harold Pimentel, Steven L. Salzberg, John L. Rinn, and Lior Pachter. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols, 7(3):562–578, 2012. ISSN 1750-2799. doi: 10.1038/nprot.2012.016.

Todd J. Treangen and Steven L. Salzberg. Repetitive DNA and next-generation sequencing: Computational challenges and solutions. Nature reviews. Genetics, 13(1):36–46, 2011. ISSN 1471-0064. doi: 10.1038/nrg3117.

Neil D. Tsutsui, Andrew V. Suarez, Joseph C. Spagna, and J. Spencer Johnston. The evolution of genome size in ants. BMC evolutionary biology, 8:64, 2008. ISSN 1471-2148. doi: 10.1186/1471-2148-8-64.

Erwin L. van Dijk, Yan Jaszczyszyn, Delphine Naquin, and Claude Thermes. The third revolution in sequencing technology. Trends in genetics : TIG, 34(9):666–681, 2018. ISSN 0168-9525. doi: 10.1016/j.tig.2018.05.008.

J. C. Venter et al. The sequence of the human genome. Science (New York, N.Y.), 291(5507):1304–1351, 2001. ISSN 1095-9203. doi: 10.1126/science.1058040.

Bruce J. Walker, Thomas Abeel, Terrance Shea, Margaret Priest, Amr Abouelliel, Sharadha Sakthikumar, Christina A. Cuomo, Qiandong Zeng, Jennifer Wortman, Sarah K. Young, and Ashlee M. Earl. Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS one, 9(11):e112963, 2014. doi: 10.1371/journal.pone.0112963.

Philip S. Ward, Seán G. Brady, Brian L. Fisher, and Ted R. Schultz. The evolution of myrmicine ants: Phylogeny and biogeography of a hyperdiverse ant clade (Hymenoptera: Formicidae). Systematic Entomology, 40(1):61–81, 2015. ISSN 0307-6970. doi: 10.1111/syen.12090.

Robert M. Waterhouse, Mathieu Seppey, Felipe A. Simão, Mosè Manni, Panagiotis Ioannidis, Guennadi Klioutchnikov, Evgenia V. Kriventseva, and Evgeny M. Zdobnov. BUSCO applications from quality assessments to gene prediction and phylogenomics. Molecular biology and evolution, 2017. doi: 10.1093/molbev/msx319.

J. D. Watson and F. H. C. Crick. The structure of DNA. Cold Spring Harbor Symposia on Quantitative Biology, 18(0):123–131, 1953. ISSN 0091-7451. doi: 10.1101/SQB.1953.018.01.020.

Neil I. Weisenfeld, Vijay Kumar, Preyas Shah, Deanna M. Church, and David B. Jaffe. Direct determination of diploid genome sequences. Genome research, 27(5):757–767, 2017. ISSN 1088-9051. doi: 10.1101/gr.214874.116.

David J. Witherspoon, W. Scott Watkins, Yuhua Zhang, Jinchuan Xing, Whitney L. Tolpinrud, Dale J. Hedges, Mark A. Batzer, and Lynn B. Jorde. Alu repeats increase local recombination rates. BMC genomics, 10:530, 2009. doi: 10.1186/1471-2164-10-530.

Ray Wu and A. D. Kaiser. Structure and base sequence in the cohesive ends of bacteriophage lambda DNA. Journal of molecular biology, 35(3):523–537, 1968. ISSN 0022-2836. doi: 10.1016/S0022-2836(68)80012-9.

Mark Yandell and Daniel Ence. A beginner’s guide to eukaryotic genome annotation. Nature reviews. Genetics, 13(5):329–342, 2012. ISSN 1471-0064. doi: 10.1038/nrg3174.

Evgeny M. Zdobnov, Fredrik Tegenfeldt, Dmitry Kuznetsov, Robert M. Waterhouse, Felipe A. Simão, Panagiotis Ioannidis, Mathieu Seppey, Alexis Loetscher, and Evgenia V. Kriventseva. OrthoDB v9.1: cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs. Nucleic acids research, 45(D1):D744–D749, 2017. ISSN 1362-4962. doi: 10.1093/nar/gkw1119.