The Utility of Long Read Sequencing for the Discovery of Genomic Retroviral Insertions and for Hybrid Assembly

by

Ryan Salinas

A thesis submitted as fulfilment of the requirement for the degree of

Masters of Science

School of Biotechnology and Biomolecular Sciences

University of New South Wales

August, 2017

THE UNIVERSITY OF NEW SOUTH WALES Thesis/Dissertation Sheet

Surname or Family name: Salinas

First name: Ryan Other name/s: Matthew

Abbreviation for degree as given in the University calendar: MSC

School: School of Biotechnology and Biomolecular Sciences Faculty: Faculty of Science

Title: The Utility of Long Read Sequencing for the Discovery of Genomic Retroviral Insertions and for Hybrid Genome Assembly

Abstract 350 words maximum:

Long sequence reads from PacBio technology show promise in resolving hard-to-assemble regions of genomes and should be of utility in hybrid genome assemblies. However, the benefits of such approaches are not yet fully understood. To help address this, two investigations were undertaken on the koala genome: the first sought to find and characterise koala retrovirus (KoRV) insertions through progressive use of targeted assemblies, and the second used increasing amounts of long read sequence data in hybrid genome assemblies. For the analysis of KoRV insertions, targeted short read-based assemblies generated limited insights into the KoRV insertion points and were unable to reconstruct adjacent gene structures. The use of PacBio long read technology, by contrast, allowed KoRV inserts to be fully assembled and characterised and adjoining genomic koala genes to be discovered. Within the full long read genome assembly, no cases of KoRV insertion into coding regions of annotated genes were observed. The investigation into the use of differing amounts of long reads in hybrid assemblies generated some unexpected results, as measured by commonly used assembly statistics and by the presence of conserved gene sets and of repeat regions. Within the range of assemblies generated, using 20X to 57X short read data with up to 20X PacBio long read data, there was almost no change in the presence of conserved gene sets and thus no improvement in the assembly of protein gene-containing regions in hybrid assembly with long reads. However, the hybrid assemblies saw increases in N50 and maximum scaffold size, and in the assembly of repeat-containing regions. It can be concluded that long sequence reads are of use in the assembly and annotation of regions that contain repeated sequences but their utility in hybrid assemblies, at least for high content regions containing protein coding genes, may be limited.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis.

I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.’

AUTHENTICITY STATEMENT

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

ABSTRACT

Long sequence reads from PacBio technology show promise in resolving hard-to-assemble regions of genomes and should be of utility in hybrid genome assemblies. However, the benefits of such approaches are not yet fully understood. To help address this, two investigations were undertaken on the koala genome: the first sought to find and characterise koala retrovirus (KoRV) insertions through progressive use of targeted assemblies, and the second used increasing amounts of long read sequence data in hybrid genome assemblies. For the analysis of KoRV insertions, targeted short read-based assemblies generated limited insights into the KoRV insertion points and were unable to reconstruct adjacent gene structures. The use of PacBio long read technology, by contrast, allowed KoRV inserts to be fully assembled and characterised and adjoining genomic koala genes to be discovered. Within the full long read genome assembly, no cases of KoRV insertion into coding regions of annotated genes were observed. The investigation into the use of differing amounts of long reads in hybrid assemblies generated some unexpected results, as measured by commonly used assembly statistics and by the presence of conserved gene sets and of repeat regions. Within the range of assemblies generated, using 20X to 57X short read data with up to 20X PacBio long read data, there was almost no change in the presence of conserved gene sets and thus no improvement in the assembly of protein gene-containing regions in hybrid assembly with long reads. However, the hybrid assemblies saw increases in N50 and maximum scaffold size, and in the assembly of repeat-containing regions. It can be concluded that long sequence reads are of use in the assembly and annotation of regions that contain repeated sequences but their utility in hybrid assemblies, at least for high content regions containing protein coding genes, may be limited.

PUBLICATIONS

Work undertaken in this thesis has been incorporated into two publications.

1. Long-read genome provides insight into ongoing retroviral invasion of the koala germline. Hobbs M, King A, Salinas R, Chen Z, Tsangaras K, Greenwood AD, Johnson RN, Belov K, Wilkins MR, Timms P. Scientific Reports 2017. 7(1):15838.

2. Adaptation and conservation insights from the koala genome. Rebecca N. Johnson, Denis O’Meally, Zhiliang Chen, Graham J. Etherington, Simon Y. W. Ho, Will J. Nash, Catherine E. Grueber, Yuanyuan Cheng, Camilla M. Whittington, Siobhan Dennison, Emma Peel, Wilfried Haerty, Rachel J. O’Neill, Don Colgan, Tonia L. Russell, David E. Alquezar-Planas, Val Attenbrow, Jason G. Bragg, Parice A. Brandies, Amanda Yoon-Yee Chong, Janine E. Deakin, Federica Di Palma, Zachary Duda, Mark D. B. Eldridge, Kyle M. Ewart, Carolyn J. Hogg, Greta J. Frankham, Arthur Georges, Amber K. Gillett, Merran Govendir, Alex D. Greenwood, Takashi Hayakawa, Kristofer M. Helgen, Matthew Hobbs, Clare E. Holleley, Thomas N. Heider, Elizabeth A. Jones, Andrew King, Danielle Madden, Jennifer A. Marshall Graves, Katrina M. Morris, Linda E. Neaves, Hardip R. Patel, Adam Polkinghorne, Marilyn B. Renfree, Charles Robin, Ryan Salinas, Kyriakos Tsangaras, Paul D. Waters, Shafagh A. Waters, Belinda Wright, Marc R. Wilkins, Peter Timms, Katherine Belov. Nature Genetics, in press.

TABLE OF CONTENTS

1 INTRODUCTION
1.1 A BRIEF HISTORY OF GENOME SEQUENCING
1.2 SHORT-READ SEQUENCING AND ASSEMBLY
1.2.1 Sequencing methods
1.2.2 Assembly algorithm methods
1.3 OPTICAL MAPPING METHODS
1.4 LONG READ SEQUENCING AND ASSEMBLY
1.4.1 Sequencing methods
1.4.2 Long read-based genome assembly
1.5 HYBRID ASSEMBLY
1.6 EVALUATION OF ASSEMBLY QUALITY
1.7 THE KOALA GENOME
1.8 KOALA RETROVIRUS (KORV)
1.9 AIMS OF THIS THESIS

2 INVESTIGATING THE PRESENCE OF KOALA RETROVIRUS IN AN INDIVIDUAL’S GENOME
2.1 INTRODUCTION
2.2 MATERIALS AND METHODS
2.2.1 Short read genome sequence data
2.2.2 Koala whole genome assembly with short reads
2.2.3 Targeted short-read assembly of genomic regions adjacent to KoRV insertions
2.2.4 Long read genome sequence data
2.2.5 Whole koala genome assembly with long-read data and analysis of KoRV insertions and flanking genes
2.2.6 Targeted long-read assembly of genomic regions adjacent to KoRV insertions
2.2.7 Analysis for sequence motifs at KoRV insertion points in the long-read genome
2.3 RESULTS
2.3.1 KoRV was not well represented in the short read-based whole genome assembly
2.3.2 Motif analysis in short reads that contain KoRV
2.3.3 KoRV was well-represented in the long read based assembly
2.3.4 Proximity of KoRV to genes in the koala genome
2.3.5 Motif analysis in KoRV insertions present in the koala PacBio long read genome assembly
2.3.6 Are KoRV insertions homozygous or heterozygous in the koala genome?
2.4 DISCUSSION
2.4.1 Short read
2.4.2 Long read

3 THE UTILITY OF LONG READS IN THE HYBRID ASSEMBLY OF THE KOALA GENOME
3.1 INTRODUCTION
3.2 METHODS
3.2.1 Bioinformatics methods
3.3 RESULTS
3.3.1 PacBio data sampling and hybrid assembly run times
3.3.2 General statistics of hybrid assemblies
3.3.3 BUSCO results
3.3.4 Repeat Masker results
3.4 DISCUSSION
3.4.1 Hybrid assemblers and their use to date
3.4.2 Advantages and disadvantages of hybrid assembly versus short read assemblies
3.4.3 Limitations of the current study
3.4.4 Cost analysis

4 CONCLUSIONS

5 REFERENCES

ACKNOWLEDGMENTS

This thesis is dedicated first and foremost to my supervisor Professor Marc Wilkins whose patience and kindness were the driving force behind my research during my whole time spent with the Wilkins lab. Thank you so much for believing in me and my research and guiding me to succeed as much as I have.

Thank you to Dr. Zhiliang Chen and Dr. Nandan Despande for their guidance throughout my research.

Thank you to all of the Wilkins lab who made my time during my degree the most enjoyable time I have ever had at university.

Thanks to all of my friends throughout the School of Biotechnology and Biomolecular Sciences, especially William, Long and Jon, for all the support and friendship you’ve given during my time.

Finally I’d like to thank my family and other friends especially my parents, Manuel and Helen, and cousins, Nathan, Sarah, Cathryn, and Michael, who supported all my successes with my research.

ABBREVIATIONS

bp      base pairs
d       day(s)
DNA     deoxyribonucleic acid
Gb      gigabyte
gDNA    genomic DNA
GHz     gigahertz
h       hour(s)
kbp     kilobase pairs
KoRV    koala retrovirus
min     minute(s)
ORF     open reading frame
PC      Pacific Chocolate (name of koala)
PCR     polymerase chain reaction
RNA     ribonucleic acid

1 INTRODUCTION

1.1 A Brief History of Genome Sequencing

In the late 19th century, a cellular substance that determined cell function was discovered. This would later be the foundation of the discovery of deoxyribonucleic acid, or DNA.[1, 2] By the 1950s, the concept of heredity and the role of DNA as the code for building organisms had been established, as had the composition of DNA from four nucleotide bases: adenine, cytosine, guanine and thymine.[3, 4] The 1977 breakthrough of Frederick Sanger, his method for DNA sequencing, established the foundation for future DNA sequencing methods and is still in use today. The method utilised di-deoxy nucleotides, which are incorporated during DNA strand synthesis but terminate DNA elongation because they lack the 3’-OH group required for the formation of a phosphodiester bond.[5-7] This reaction was performed for each of the four nucleotide bases, and the DNA sequence could be determined from the different sized fragments produced.

The plans for sequencing of the human genome started in 1994 and eventually led to the publishing of a draft human genome assembly on the 12th of February, 2001.[8] Difficulties arose quickly after the project’s inception, however, as its scale was incompatible with the previously used sequencing methods, which employed bacterial artificial chromosomes (BACs). Prior to the human genome project, the use of BACs was limited to short gene sequences or viral genomes, and the utilisation of such a method for a genome the size of the human genome was found to be too time consuming and expensive.

As a consequence, Craig Venter and Gene Myers pioneered the method of ‘shotgun sequencing’, where the whole genome is randomly fragmented and the fragments are directly sequenced and assembled. This was the method eventually used by Venter’s team.[8]

Although this method allowed higher throughput, by letting banks of sequencers work concurrently, it exposed a problem which still remains for modern sequencers. The human genome and many other large eukaryotic genomes contain large sections of repeats,[9] and shotgun sequencing has difficulty assembling these repeat regions. The method assumes that the sequenced fragments can be assembled because each fragment overlaps uniquely with others to form contigs. In repeat regions, however, overlaps are not necessarily unique. The repeat regions, depending on their size, would go on to remain largely unresolved.

The human genome project, and a demand for other genome sequences, saw the need for higher throughput sequencing methods. As a result, newer approaches were established that are now known as ‘next generation sequencing’ (NGS) methods. These methods utilised the fundamentals established in the Sanger method of DNA sequencing, and the concepts of shotgun sequencing, but employed new chemistries, new means of sequencing on a massively parallel scale, and means of sequencing single molecules of nucleic acids. Table 1 is a summary of some different ‘next generation sequencing’ platforms, describing the read length, the throughput of each instrument run, and the cost per Gbp of data produced.[1]

Table 1: Summary of next generation sequencing technologies.

| Platform | Read length (bp) | Throughput | Cost per Gbp ($USD)[1] |
|---|---|---|---|
| Illumina HiSeq 2500 | 125 paired end †; 36 single end | 450-500Gbp; 64-72Gbp single end | $150 single end ‡; $30 paired end |
| SOLiD 5500 | 75 single end | 240Gbp | $70 |
| 454 GS | Upwards of 600, single end | 450Mbp | $9,500 |
| MinION | Up to 35kbp | Up to 1.5Gbp | $750 |
| PacBio RSII | ~100kbp max | 3-7Gbp | $1000 |

† Paired end = where sequence is generated from both ends of a fragment of DNA.
‡ Single end = where sequence is generated only from one end of a fragment.

1.2 Short-read sequencing and assembly

1.2.1 Sequencing methods

The first innovation of next generation sequencing started with pyrosequencing, resulting in the commercial 454 Life Sciences instrument (later acquired by Roche Diagnostics).[10] Pyrosequencing is known as a ‘sequencing during synthesis’ method, differing from the Sanger sequencing method, which essentially determined sequence from a series of pre-synthesised sequence fragments.[11] The pyrosequencing method employed the shotgun sequencing strategy, producing reads from DNA fragments attached to a series of microscopic wells. The attachment to the wells was achieved by having microscopic beads, themselves chemically attached to the wells, covered in primers that could capture DNA fragments. After fragments attach to the beads, the sequencing method systematically adds a single dNTP (A, T, G, or C) at a time whilst generating the complementary strand to the fragment. Because pyrophosphate is released at each base addition, the method utilises sulfurylase and luciferase to react with the released pyrophosphate and produce a detectable burst of light. This light is detected by sensors below each well in the sequencing machine, allowing base incorporation to be determined. A series of washes then removes the previous dNTP solution in preparation for the next stage of the cycle.[12, 13]

The next method to be commercialised was from Solexa, which was ultimately acquired and further developed by Illumina. The first Solexa sequencing machine was named the Genome Analyser (GA),[2] which utilised a series of plates known as flow cells. These flow cells are key to the Illumina technology and are used in all of its sequencing instruments. The flow cells contain thousands of copies of two primers situated in clusters, allowing a physical region to contain copies of the same fragment. A nearby secondary primer then attaches to the other end of the fragment, allowing the fragment to be attached at both ends to the flow cell. The method then sequences the fragment using a polymerase that targets one of the primers attached to the flow cell, in the presence of the 4 dNTPs. The dNTPs in this sequencing method are fluorescently labelled to emit different wavelengths of light when incorporated during chain elongation. This process continues along the fragment to generate a predetermined but short length of sequence; the read length for the Solexa GA machine was initially 36bp, whereas more modern machines from Illumina have improved the read length to 300bp.[14] Single end sequencing occurs when the sequencing takes place from one end of the DNA fragment.

However, the Solexa sequencing method also allows for sequencing from both ends, enabling paired end sequencing, summarised in Figure 1. The paired end method utilises the sequence reads from the two ends in conjunction with the determined size of the fragment, and reconstructs the fragment with a series of unknowns between the two paired ends. The distance between the two reads ranges from no gap to a few hundred base pairs, depending on the library preparation used. The advantage of the paired end method is that it increases the reach of the sequence fragments whilst still maintaining cost and accuracy. The later Illumina systems can also generate sequences from the ends of longer DNA fragments. Called mate pair libraries, these are sequenced in the same way as paired end libraries; however, the distance between the two reads is much greater (e.g. more than 10kbp). Algorithmically, this is useful for overcoming some of the smaller repeat regions during assembly of the reads. Due to their low cost and very low error rate, Illumina sequencing methods have become dominant, with 90% of sequencing data being generated on the different Illumina sequencing platforms.[15, 16]

Figure 1: Simplified drawing of Illumina cluster generation. Primers are attached to each flow cell, with multiple copies of the same primer attached to the same area (A). DNA templates to be sequenced are introduced with the attached adaptors, attaching to both adaptor sequences on both ends of the DNA fragment (B). The complementary sequence is produced with a DNA polymerase, making the DNA double stranded (C and D). The double stranded DNA is denatured to produce two single strands of DNA (E) and the process is repeated to create areas of clustered strands of the same sequence, with the forward and reverse sequences in close proximity to each other (F). After the cluster preparation, sequencing takes place with fluorescently labelled dNTPs to determine the sequence.

The Applied Biosystems SOLiD sequencing machine was introduced in 2005 with methods that are similar to the 454 Life Sciences pyrosequencing platform. The library preparation starts with an emulsion of DNA fragments, to allow them to be attached to a primer-covered microscopic bead.[14] The DNA-covered beads are then attached to a solid plate, similar to the Solexa method. The method differs in that chain elongation is achieved with a DNA ligase rather than a DNA polymerase, because it uses fluorescently labelled nucleotide pentamers or octamers, depending on the number of different fluorescent labels used, rather than the single nucleotides of the Solexa and 454 pyrosequencing methods. After integration of the oligomer in complement to the DNA fragment attached to the bead, the unique fluorescent label emits a pulse detectable by the SOLiD sequencing machine.[17, 18]

1.2.2 Assembly algorithm methods

The advent of shotgun sequencing required the generation of new methods for the assembly of genomes, or of de novo transcriptomes if the data have been generated by RNA-seq. The two main methods described for short read-based assembly are the overlap consensus method and the de Bruijn graph assembly method.[19] The two methods are similar in that the reads are first aligned against each other, in a manner that differs between assembly tools, and any sequences with identical regions are then consolidated into longer sequences known as contigs (as seen in Figure 2). The alignment also depends on the length of the read subsections used, each of which is known as a k-mer. The two methods differ, however, in that the overlap consensus method is typically applied to longer sequence reads (e.g. from Sanger or 454 methods) whereas the de Bruijn graph assembler is typically applied to Illumina-derived short sequence read data.

Figure 2: A simplified summary of a contig assembly method. In this method only single end short reads are used and are noted in the middle of the figure. The assembler would then go on to utilise a set k-mer size, in this case 4, and produce subsets of the reads represented as the substrings displayed in boxes at the top of the figure. In this case, where two reads produce the same stretch of k-mers, the two reads are then consolidated into a single contig finalised and represented at the bottom of the figure.[20]
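The idea of splitting reads into k-mers and consolidating reads that share them can be illustrated with a minimal sketch. The function names and the two example reads below are hypothetical and are not taken from any assembler; real tools also handle sequencing errors, reverse complements and far larger data.

```python
def kmers(read, k):
    """Return every substring of length k in the read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def merge_reads(left, right, k):
    """Join two reads when a suffix of `left` of at least k bases
    exactly matches a prefix of `right`; return None otherwise."""
    # Try the longest possible overlap first, down to the minimum of k bases.
    for overlap in range(min(len(left), len(right)), k - 1, -1):
        if left[-overlap:] == right[:overlap]:
            return left + right[overlap:]
    return None


# Hypothetical single end reads, with k = 4 as in the figure.
reads = ["ATGGCGTG", "GCGTGCAA"]
print(kmers(reads[0], 4))                 # k-mers of the first read
print(merge_reads(reads[0], reads[1], 4))  # the consolidated contig
```

Here the two reads share the stretch GCGTG, which contains the k-mers GCGT and CGTG, so they are consolidated into a single contig.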

The overlap consensus method utilises the contigs and remaining reads to produce contiguous but error-containing sequences known as scaffolds. During the production of scaffolds, also known as scaffolding, any mismatches are resolved through multiple mappings, and the “mismatched” base is replaced by whichever base was found to align most often at that position. Another level of ambiguity can also be added to scaffolds at this stage, as the paired-end and mate-pair data are now utilised; specifically, the unknown inserts between the two reads are integrated into the assembly as large gaps of unknown nucleotides (N’s). Examples of assembly tools that utilise this method are the Celera and Arachne tools.[21-23]

The de Bruijn graph method is similar; however, it takes note of the alignments made along the way and produces a directed graph to determine how the different reads match, align and extend each other. In terms of the graph, each node represents a single k-mer and, after contig generation, a series of nodes can be consolidated into a single contig node. The graph produces branches in areas where multiple reads or contigs align ambiguously, and the assembler attempts either to produce a consensus or to report the branches separately. Examples of de Bruijn graph based tools are Velvet, ABySS and SOAPdenovo.[20, 22, 24]
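A simplified sketch of such a graph is given below, following the description above in which each node is a single k-mer. It is an illustration only, under the assumption of error-free reads on one strand; the function names and example reads are hypothetical and do not reflect the internals of Velvet, ABySS or SOAPdenovo.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Build a directed graph in which each node is a k-mer and an edge
    links a k-mer to the k-mer that follows it (shifted by one base) in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def branch_nodes(graph):
    """Return the k-mers that have more than one successor, i.e. branch points."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Two hypothetical reads that agree for five bases and then differ by one base.
reads = ["ACGTTGCA", "ACGTTACA"]
graph = build_graph(reads, 4)
print(branch_nodes(graph))  # ['CGTT']
```

The node CGTT is followed by GTTG in one read and GTTA in the other, so the graph branches there. An assembler must then decide whether to collapse the branch into a consensus or to report both paths.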

An extension of the de Bruijn method is the algorithm employed by the DISCOVAR assembler (shortened form of “discover variant”), which first generates a directed assembly graph but then treats branching paths as areas of variation. This variation is attributed to features such as single nucleotide polymorphisms (SNPs) or heterozygosity. In other de Bruijn based assembly methods, branches in the directed graph are sometimes excluded due to uncertainties in the sequence consensus. The DISCOVAR assembler incorporates the majority of the branching paths in order to produce a longer scaffold, whilst denoting the areas of variation either as lower case nucleotides or as ambiguous nucleotides (N’s), as determined by the parameters set at assembly.[25, 26]

1.3 Optical mapping methods

The introduction of optical mapping as a supplement to assembly methods has enabled assemblies to recreate sequence synteny. Synteny is the physical localisation of sequences on the same chromosome but, in the context of assembly methods, it is the placing of two sequences on the same scaffold or contig. BioNano accomplishes this through unique barcoding of sequences, using fluorescent probes on large DNA molecules extended through enzyme-based means. The fluorescent probes are analysed by the BioNano optical mapper and the synteny data are recorded for later use at the assembly stages.[27]

1.4 Long read sequencing and assembly

1.4.1 Sequencing methods

In the evolution of sequencing technologies, the need for methods to address repeat regions and other regions of low complexity was the next step in producing more complete genome assemblies. Technology that generates longer reads, able to span the full length of repeats and low complexity regions, has been developed to address this, with one of the first sequencers to employ this approach being the PacBio RS from Pacific Biosciences (PacBio) in late 2010.[28, 29]

Figure 3: Summary of fragment preparation for SMRT cell sequencing. Figure 3A denotes the DNA fragment prior to adaptor attachment; the fragment sizes that can be used depend on the chemistry used during sequencing, and current chemistries support sizes of 70kb or more. Figure 3B denotes the attachment of adapter sequences, made of DNA, to the fragment to be sequenced. This is achieved during fragment preparation through the use of DNA ligases. Figure 3C shows the overall structure of the circular fragment after it is unzipped from its double stranded structure.

The PacBio RS, and later the RS II and Sequel instruments, utilise small sequencing chips called Single Molecule Real Time sequencing cells, or SMRT Cells. These are solid plates containing thousands of nanoscopic wells called zero-mode waveguides (ZMWs). For sequencing, genomic DNA is first purified, fragmented and size selected, to yield fragments between 10 and 20kb or larger than 20kb. These double stranded fragments have two adapter sequences ligated on either end, which allow the double stranded fragment to be unzipped, producing a single circular piece of DNA [Figure 3]. After attachment of the DNA to the ZMW, achieved through targeting of the adaptor sequences, the complementary strand is synthesised with fluorescently labelled dNTPs. This is done by a single DNA polymerase molecule, chemically adhered to the bottom of each ZMW. The use of fluorescently labelled dNTPs is similar to Illumina sequencing; however, in contrast to the Illumina method, which relies on clusters of the same DNA fragment to produce a detectable pulse, the ZMWs rely on a beam of light at the bottom of the well to amplify the pulse produced during incorporation to a detectable level. The pulse is recorded by optical sensors, which determine the base being incorporated from the pulse produced.

Fluorescent dNTP incorporation at any position can occur multiple times due to the circular nature of the fragment being sequenced, so multiple copies of the same sequence and its complement can be produced. These multiple passes can be compared against one another, and the most common base at each position can be determined in an effort to reduce errors. As a result of this process, very long sequence reads can be produced. Originally these were upwards of 10kb with the original RS sequencing machine, but with the RS II and Sequel machines read lengths of over 70kb have now been achieved [Table 1]. Despite this, error rates with PacBio are up to 15%; although these errors are stochastic, the rate is much higher than that of Illumina methods, which show a ~1% error rate.[28, 29]
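The comparison of multiple passes over the same circular fragment can be sketched as a simple majority vote. This is an illustration only and not the PacBio consensus algorithm: it assumes, for simplicity, that the passes have already been aligned to one another, are of equal length and contain substitution errors only, whereas real data also contain insertions and deletions. The function name and example passes are hypothetical.

```python
from collections import Counter


def consensus(passes):
    """Return the most common base at each position of pre-aligned,
    equal-length passes over the same fragment."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("passes must be non-empty and aligned to the same length")
    # zip(*passes) yields one column of bases per position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Four hypothetical passes, two of which carry a single random error.
passes = ["ACGTTGCA", "ACGATGCA", "ACGTTGCA", "ACCTTGCA"]
print(consensus(passes))  # ACGTTGCA
```

Because the errors are stochastic, they rarely fall at the same position in different passes, so the majority base recovers the underlying sequence. Where two bases tie, this sketch simply keeps the one seen first.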

In late 2015 Oxford Nanopore Technologies commercially released a long read sequencer that was unique in that the sequencing module itself was a USB-attached device small enough to fit in the palm of a hand. The aim of the technology was to provide an alternative to large scale sequencing machines, with the possibility of utilising the technology in mobile or in-field situations. The MinION sequencer utilises flow cells containing protein nanopores set in a membrane, through which a single strand of the fragment to be sequenced is passed. The sequence is determined from minuscule changes in the electrical current across each pore as the bases pass through it. The main pitfall of the technology is the variable and sometimes high error rate of the sequence produced, with a recent study finding the error rate varying from ~14-32%, which is much higher than other currently available sequencing platforms.[30, 31]

1.4.2 Long read-based genome assembly

Long read assembly brings challenges that are different to those of assemblies done with short reads. The majority of short-read based assemblers utilise RAM to store the reads during the assembly, which meant that long reads could not be assembled without modifications to the assembly programs. Currently the only program to have been refactored is the Celera assembler, later reworked to become Canu, which utilises overlap consensus in the same way as for short reads.[32]

Figure 4: Summary of FALCON assembly method. The minimum seed length is determined from parameters set for the assembly, from there all reads larger than that given length are set to be seed reads. Reads below the set length are then used to map against the seed reads. Multiple reads are used to produce a consensus and in some cases are used to join two seed reads together, as seen in subfigure C.

Two popular methods for long-read only assembly are the HGAP and FALCON programs.[33, 34] These methods first require the definition of a set of seed reads; these are determined from the distribution of read lengths, and all reads longer than a cutoff length become seed reads. The length can be determined automatically either through the assembly program itself or through supplementary programs that are packaged with the assemblers.

Next, all reads shorter than the minimum seed read length are mapped to the seed reads, in a similar fashion to the overlap consensus method as seen in Figure 4. This is to produce a set of error-corrected seed reads by substituting errors, inherent to PacBio data, with the “correct” base, i.e. the nucleotide that appears at the highest frequency at that position. At this point another stage of overlap consensus mapping is done to produce contigs, utilising the error-corrected seed reads. A final stage is the mapping of reads to the contigs from the previous two stages, in conjunction with quality data generated at the time of sequencing; this is to remove regions of contigs generated from reads with high error rates.[35]
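The seed read selection and consensus steps described above can be illustrated with a short Python sketch. This is not the HGAP or FALCON implementation: the function names, the "-" gap symbol and the toy reads are assumptions made purely for illustration, and the reads are assumed to have already been aligned to the seed read.

```python
from collections import Counter

def select_seed_reads(reads, min_seed_length):
    """Split reads into seed reads (at or above the cutoff) and shorter reads."""
    seeds = [r for r in reads if len(r) >= min_seed_length]
    shorter = [r for r in reads if len(r) < min_seed_length]
    return seeds, shorter

def majority_consensus(aligned_rows, gap="-"):
    """Return the most frequent non-gap base in each alignment column.

    aligned_rows: equal-length strings that are assumed to be already
    aligned to the seed read; the gap symbol marks uncovered positions.
    """
    consensus = []
    for column in zip(*aligned_rows):
        counts = Counter(base for base in column if base != gap)
        if counts:
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)

# Hypothetical toy data: one seed read carrying an error (A at position 5)
# and three shorter reads that cover it.
reads = ["ACGTAGCA", "ACGTT", "GTTGCA", "CGTTGC"]
seeds, shorter = select_seed_reads(reads, min_seed_length=8)

# Hypothetical alignment of the shorter reads against the seed read.
rows = [
    "ACGTAGCA",  # seed read
    "ACGTT---",
    "--GTTGCA",
    "-CGTTGC-",
]
corrected = majority_consensus(rows)
print(corrected)  # ACGTTGCA
```

In this toy case the erroneous base in the seed read is outvoted by the three shorter reads covering the same position.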

1.5 Hybrid assembly

There are methods to utilise a mixture of both long read and short read data in a single assembly. The aim of these methods is to exploit the strengths of both data types, specifically the length of long read data and the exceptionally low error rate of short read data, and to take advantage of the lower cost of the short read data. The current methods of hybrid assembly can be split into general assembly, read correction and gap filling.

Table 2: Summary of current hybrid assembly tools.

| Tool Name | Assembly Approach | Largest reported genome made (size) | Reference |
| --- | --- | --- | --- |
| pacBioToCA | Read correction | E. coli K12 strain (4.6Mbp) | [29, 36, 37] |
| ECTools | Read correction | E. coli K12 strain (4.6Mbp) | [38] |
| HybridSpades | de Bruijn graph and gap closure | E. coli K12 strain (4.6Mbp) | [39] |
| Cerulean | General (ABySS based) | E. coli K12 strain (4.6Mbp) | [40] |
| Dbg2olc | Read correction | E. coli K12 strain (4.6Mbp) | [41] |
| PBJelly | Gap filling | Cercocebus atys (2.8Gbp) | [42] |

Table 2 lists the currently available tools, the method of hybrid assembly each tool utilises and the largest reported genome that the tool has generated. The majority of the largest reported genomes are of the E. coli K12 strain, due to the model data being made readily available by PacBio.[43]

General assembly methods attempt to employ the assembly methods used for short read assembly but allow the integration of both short reads and long reads as inputs. The current methods only use overlap layout consensus algorithms for assembly. The current methods have reportedly only been successful in assembling genomes <100Mbp in size due to the computational limitations of employing algorithms originally developed for short reads. Examples of this hybrid assembly method are SPAdes and Cerulean.[44, 45]

Read correction methods generally use differing alignment methods to create a consensus of the long read data by aligning the short read data against the long read data. This is mostly done to utilise the long read data’s ability to span repeat regions but to then correct the erroneous sequence of the long read data with the higher quality short read data.

The output after read correction will not be any longer, as measured by N50 for example, than the original long read input data, with the exception of the addition of a short read at either end of the original long read. For this reason, read correction methods are typically used in conjunction with other long read assembly methods. Unassembled, corrected reads are only used where sequencing is focused on specific targeted sequences. Examples of current hybrid read correction methods include pacBioToCA, ECTools and dbg2olc.[34-37]

Gap filling methods utilise the long read data to bridge the gaps between short read-based scaffolds or reads. They take short read scaffolds and attempt to align them against the long reads, in a similar way to the overlap layout consensus method that has been used for short read data. Consensus for the error-prone long read data is determined through compounding of long reads after aligning the start and end of the long read to the short read data. Consensus of the regions joining the long read region is established through alignment of short read data against the long reads, acting as the read correction of this hybrid assembly method. PBJelly is the primary example of the gap filling hybrid assembly method.[42]

1.6 Evaluation of assembly quality

After the assembly process, due to the large amount of data produced, large scale heuristics and tools are used to evaluate the quality of one or more resulting assemblies.

A number of numerical heuristics are used to determine an assembly’s contiguity. These include the number of contigs or scaffolds, the minimum and maximum length of contigs or scaffolds, the total size of the assembly and the N50 of the assembly. The N50 is the length of the shortest scaffold or contig in the set of largest scaffolds/contigs whose lengths sum to at least 50% of the total assembly size. Due to this calculation method, it is similar to the median scaffold size but is weighted more heavily towards larger scaffolds. Other heuristics that employ the same calculation method are N20 and N80, which use 20% and 80% of the total assembly size respectively. Many assembly tools, such as ABySS, have built-in tools for the calculation of these heuristics.[46]
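The N50, N20 and N80 calculations can be sketched in a few lines of Python. This is a minimal illustration assuming the scaffold lengths are available as a list of integers; the function name nx and the toy lengths are hypothetical and are not taken from ABySS or any other assembler.

```python
def nx(lengths, percent=50):
    """Return the Nx statistic for a list of scaffold or contig lengths.

    Scaffolds are taken from largest to smallest until their summed
    length reaches `percent` of the total; the length of the last
    scaffold added is the Nx value.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating point rounding at the threshold.
        if running * 100 >= percent * total:
            return length

# Hypothetical toy assembly of five scaffolds (total size 300).
toy_lengths = [100, 80, 60, 40, 20]
print(nx(toy_lengths, 50))  # 80
print(nx(toy_lengths, 20))  # 100
print(nx(toy_lengths, 80))  # 60
```

For the toy assembly the median scaffold length is 60, whereas the N50 is 80, which shows how the statistic is weighted towards larger scaffolds.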

For the purposes of evaluating the correctness of an assembly, generalised methods have been developed: the CEGMA (Core Eukaryotic Genes Mapping Approach) tool and its successor BUSCO (Benchmarking Universal Single-Copy Orthologs). The original CEGMA tool utilised a reduced gene set of ~100 genes found to be present in 98% of all eukaryotic organisms without variation, alongside a generalised annotation pipeline to search for this reduced gene set.[47] Its successor, BUSCO, utilises the same method as CEGMA but expands the gene set depending on the known taxonomy of the organism that the assembly is based on.[48] BUSCO is used in the expectation that the proportion of universal single-copy orthologs found in an assembly will give an estimate of the total genes likely to have been well assembled. It will not, however, provide any estimate of the assembly quality of non-protein coding regions of a genome.

1.7 The Koala genome

The koala, Phascolarctos cinereus, is an arboreal marsupial native to Australia known for its diet of eucalypt leaves and for sleeping for the majority of the day. Prior to the genome sequencing efforts by the Koala Genome Consortium, the koala genome had been poorly characterised, with only a small number of targeted studies of genes of interest. At a genome level, only basics such as the karyotype of the koala (2n=16) had been established.[49] However, other marsupial genomes have been sequenced, such as those from Monodelphis domestica (grey short-tailed opossum)[50], Sarcophilus harrisii (Tasmanian devil)[51], and Macropus eugenii (tammar wallaby).[52] Table 3 states the name, scientific name, genome size, number of scaffolds and N50 of each of the assemblies.[48-50] However, whilst these have been generated, they have a wide range in quality. There is a clear opportunity to generate higher quality assemblies, using new sequencing technology and assembly approaches.

Table 3: Existing marsupial genome assemblies.

| Common Name | Scientific Name | Genome Size (Gbp) | Scaffold count | N50 |
| --- | --- | --- | --- | --- |
| Grey short-tailed opossum | Monodelphis domestica | 3.59 | 5,223 | 59,509,810 |
| Tasmanian devil | Sarcophilus harrisii | 3.17 | 35,974 | 1,847,106 |
| Tammar wallaby | Macropus eugenii | 3.08 | 277,711 | 36,602 |

The most recent genetic resource generated for the koala was the full transcriptome assembly.[53] The transcriptome was compiled from samples taken from nine tissues from the female specimen (liver, heart, lung, brain, kidney, adrenal gland, spleen, uterus and pancreas) and seven tissues from the male specimen (bone marrow, kidney, liver, lymph node, salivary gland, spleen and testes). The compiled transcriptome comprises 370,030 transcripts averaging 1,688bp in length from the female and 381,958 coding transcripts averaging 1,124bp in length from the male. At the draft de novo assembly stage of the transcriptome, the predicted number of genes was 15,500. There was an initial attempt at a short-read genome assembly after the transcriptome assembly; however, the resulting assembly was poor in quality, with 256,745 scaffolds and an N50 of 13,492bp.[53]

1.8 Koala retrovirus (KoRV)

The koala retrovirus (KoRV) is an enveloped gammaretrovirus most closely related to the gibbon ape leukaemia virus (GALV).[54, 55] Prior studies have observed a link between high KoRV viral load and the propagation of various koala diseases such as chlamydiosis, lymphoma and hematopoietic neoplasia. Whilst there is evidence that the presence of viral proteins causes reduced interleukin (specifically IL-10) responses, analysis of total disease response was not conducted for the individuals.[55, 56] Whether immunodeficiency is solely due to this reduced interleukin response or is in combination with insertional mutagenesis is unclear due to the lack of in-depth genomic analysis.[51]

Figure 5: General gene structure of KoRV. The long terminal repeat region (LTR) flanks both ends of the total genome. The gag (group-specific antigen) gene is responsible for different structural proteins of the viral capsid. The pol (polymerase) gene is responsible for the reverse transcriptase that facilitates the integration of the retrovirus’ genome into the host genome. The env (envelope) gene is responsible for the coding of the envelope proteins of the virus. The three coding regions are transferred into the host genome as single exons. Total genome is 8,431bp in size.

The main strains of KoRV are characterised as KoRV-A[57], KoRV-B[58] and KoRV-J.[59] They all have the same genome structure (Figure 5). KoRV has been of growing interest due to the observation of endogenisation of the virus into the koala’s germline.[60] Studies of KoRV have been few, with many involving virus capture sequencing or simply viral load studies. The difficulty with such studies is that, despite the implication of endogenisation of the virus, there has been no proper characterisation of the insertion method of the virus nor of the regions of the genome into which the virus is inserting itself.[56]

Past efforts to sequence koala genomic regions adjacent to KoRV have primarily utilised primer-based capture and small scale Sanger sequencing.[54] This is due to the difficulties in using short read next-generation sequencing to assemble regions containing KoRV, in that most assembly methods have difficulty with repeat regions. Since KoRV is a retrovirus, inserted into multiple regions in the koala genome, it behaves essentially as a repeat element during the assembly process. KoRV also contains two long terminal repeats within its sequence, which leads to additional difficulties in assembly.

1.9 Aims of this thesis

The first aim of this thesis is to utilise the assembly of the koala genome to identify the KoRV insertions present within the genome of a single koala and to characterise these inserts. Characterisation of many aspects of the inserts was done to determine any trends in insert motifs. The flanking genes of all inserts present within the genome were also examined.

The second aim of this thesis is to undertake hybrid assemblies of the koala genomes, using different fold coverage of short read and long read data. This will help determine an optimal strategy for the hybrid assembly of large genomes.

2 INVESTIGATING THE PRESENCE OF KOALA RETROVIRUS IN AN INDIVIDUAL’S GENOME

Contributions to this chapter

All work done in this chapter was done by me, with the below exceptions.

- The koala tissue sample preparation and gDNA extractions were done by the Koala Genome Consortium.

- All PacBio based assemblies used in this chapter were made by Dr. Zhiliang Chen.

o These are the assemblies using the FALCON tool for the full 57X coverage obtained from the PacBio RSII sequencing.

- The initial short read full genome assembly was also done by Dr. Zhiliang Chen.

- Annotation of the complete PacBio long read based assembly was done by Dr. Denis O’Meally with the use of the MAKER/Apollo annotation pipeline.

2.1 Introduction

The koala retrovirus (KoRV) is an enveloped gammaretrovirus most closely related to the gibbon ape leukaemia virus (GALV).[54-56] The virus genome has a general retroviral genome structure, with group-specific antigen (gag), reverse transcriptase (pol) and envelope (env) genes in single exons flanked by long terminal repeats (LTRs), as detailed in Figure 5. Prior to the present study, three strains of the retrovirus had been found, named KoRV-A[57], KoRV-B[58] and KoRV-J[59], with KoRV-A being the most abundant in the wild koala population.[57] Transcriptomic analyses of koala tissues have shown that KoRV is present, with evidence of KoRV-A and a B-like strain being actively transcribed.[53] Due to observed viral loads in koalas with lowered immune response (specifically IL-10 responses), it has been hypothesised to be a factor in immunodeficiency in individuals.[56]

This study aims to provide a better understanding of the koala retrovirus (KoRV) in the whole koala genome, particularly its presence and variety within individual animals. Studies prior to this have utilised targeted PCR approaches[61, 62] and have found there to be an association between the observed retroviral load and disease within the individuals tested.[58] KoRV was observed to be present in individuals, with KoRV-A viral loads found to be high.[56] However, because previous studies utilised targeted genomic capture methods, the expected number and type of KoRV insertions in the entire koala genome are unknown. This study will give a more detailed understanding of the number, type and position of KoRV insertions within the genome of a single koala, and will check for possible instances of insertional mutagenesis. This will be achieved through the utilisation of both the short and long read genome sequencing and assembly efforts of The Koala Genome Consortium.

2.2 Materials and Methods

2.2.1 Short read genome sequence data

Tissue samples were obtained from two individuals, a female koala and a male koala. The female koala, named “Pacific Chocolate”, was euthanised at the Port Macquarie Koala Hospital following severe chlamydial infection. Liver samples were used for gDNA extraction[53], which was done using a Qiagen DNeasy Blood and Tissue Spin-column Kit (standard protocol with an on-column RNaseA treatment). The male koala, named “Birke”, was a wild koala; he was euthanised under veterinary care at Australia Zoo Wildlife Hospital following a dog attack. The tissue sampled was the same as for the female (liver) and the methods for DNA extraction were the same.

The genome sequencing for both individuals was conducted at the Ramaciotti Centre for Genomics at the University of New South Wales utilising the Illumina HiSeq 2000 sequencing platform, according to the manufacturer’s instructions. Fragments of 260bp in size were paired-end sequenced with 100bp reads. For each individual, ~8 billion paired-end reads were sequenced in 7 flow cell lanes. Mate pair libraries were also prepared using the TruSeq sample preparation, but this was only done for koala Pacific Chocolate (the female individual). The sequencing generated ~157x coverage for Pacific Chocolate (PC), on the assumption of an approximately 3.4Gbp sized genome.

2.2.2 Koala whole genome assembly with short reads

Utilising the 157x coverage, an assembly was generated utilising ABySS (version 1.9.0 and openmpi version 1.8.8[46]) with multiple kmer sizes from 31-81. After analysis of the multiple assemblies produced, the optimal kmer size, and that of the assembly analysed here, was k=69.

2.2.3 Targeted short-read assembly of genomic regions adjacent to KoRV insertions

To generate assemblies containing KoRV insertion sites, short reads containing between 30-70% KoRV sequence, and thus 70-30% koala genome sequence, were first extracted from all short reads. To find gDNA reads containing or part-containing KoRV, all genomic (non-assembled) reads were mapped to the genomes of three major strains of KoRV (KoRV-A, KoRV-B and KoRV-J). Mapping was done utilising the standard local alignment method of the Bowtie2 alignment tool, and the SAM file output was filtered utilising the standard samtools view command (Bowtie2 version 2.1.0 and samtools version 0.1.19). The set of reads containing KoRV was then compiled.
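The 30-70% criterion can be illustrated with a small Python sketch that works on the CIGAR string of a local alignment. This is not the samtools-based filtering used in this work; it is a hypothetical illustration that assumes soft-clipped bases in a local alignment against KoRV represent the koala portion of the read. The function names and example CIGAR strings are invented for the example.

```python
import re

# One CIGAR operation: a count followed by an operation letter (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def korv_fraction(cigar):
    """Fraction of a read's bases that lie inside its alignment to KoRV.

    M, =, X and I operations consume read bases within the alignment;
    soft-clipped (S) bases are assumed to be the non-KoRV (koala) part.
    """
    aligned = clipped = 0
    for count, op in CIGAR_OP.findall(cigar):
        if op in "M=XI":
            aligned += int(count)
        elif op == "S":
            clipped += int(count)
    read_length = aligned + clipped
    return aligned / read_length if read_length else 0.0

def is_junction_read(cigar, low=0.30, high=0.70):
    """True if between 30% and 70% of the read aligns to KoRV."""
    return low <= korv_fraction(cigar) <= high

# Hypothetical CIGAR strings for 100bp reads.
for cigar in ["60M40S", "100M", "10M90S", "25S50M25S"]:
    print(cigar, is_junction_read(cigar))
```

Under these assumptions a read aligned along its full length (100M) is rejected as pure KoRV, while a read with 60 aligned and 40 clipped bases is kept as a candidate junction read.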

To generate assemblies that spanned KoRV insertion sites in the koala genome, all KoRV-containing reads were assembled de novo utilising velvet (version 1.2.08[19]) or ABySS (version 1.9.0[46]), each utilising the differing kmer sizes recommended for the two assembly programs: kmer sizes of 15-31 for velvet and kmer sizes of 21-79 for ABySS.

Contigs and scaffolds from all assemblies were collated and subjected to deduplication and removal of substrings through cd-hit (version 4.5.4, standard parameters for 100% substring similarity).[63] This step was to remove all contigs or scaffolds that were already entirely present in the assembly in another contig or scaffold. All remaining contigs/scaffolds were run through BLASTn (version 2.2.30 with standard parameters used against nr/nt databases[64]) to confirm that they contained regions aligning with known isolates of KoRV. Scaffolds containing no KoRV content and those comprised of only KoRV were removed as they would not allow analysis of KoRV insert junctions. All koala genome sequences at KoRV insertion sites were examined for possible flanking genes, by use of BLASTn[64] against NCBI nr/nt databases and against annotation that had been done by the Koala Genome Consortium for the short-read whole genome assemblies of Pacific Chocolate (through GMAP version 2014-12-28, using standard parameters[65], and Augustus version 3.02, using standard mammalian prediction parameters[66]). Finally, three distinct data sets were generated: the complete hybrid junction contig, the KoRV component from the contig and the koala component. These three data sets were then run separately through MEME-suite in both command line and web application versions (version 4.10.1 with standard parameters).[67] This was to understand if there were sequence motifs present at the insertion sites.
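The deduplication and substring removal step can be pictured with the following Python sketch. It is a naive stand-in for the idea only and is not cd-hit: it assumes exact (100%) containment, checks both strands, and uses a quadratic search that would be too slow for large contig sets. The function names and toy contigs are hypothetical.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def remove_contained(contigs):
    """Drop exact duplicates and contigs wholly contained in a longer one.

    Contigs are visited from longest to shortest, so any contig that
    could contain the current one has already been kept.
    """
    kept = []
    for seq in sorted(set(contigs), key=lambda s: (-len(s), s)):
        rc = reverse_complement(seq)
        if any(seq in longer or rc in longer for longer in kept):
            continue
        kept.append(seq)
    return kept

# Hypothetical toy contigs: one duplicate, one forward-strand substring
# (GTTGC) and one reverse-strand substring (TTGCAACG).
toy_contigs = ["ACGTTGCAAT", "GTTGC", "ACGTTGCAAT", "TTGCAACG", "GGGCCC"]
print(remove_contained(toy_contigs))  # ['ACGTTGCAAT', 'GGGCCC']
```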

2.2.4 Long read genome sequence data

A second female koala, named “Bilbo”, was obtained from the wild population in south-east Queensland following severe chlamydiosis. The individual was euthanised at the Australia Zoo Wildlife Hospital in August 2015. For the long read sequencing, samples of spleen were used. Genomic DNA extraction was conducted using the Qiagen Genomic-Tip 100/G columns and Qiagen DNA Buffer set. SMRTbell libraries were prepared using the PacBio 20kb template preparation protocol. Size selection was done using a BluePippin™ device, with a minimum size cutoff of either 15 or 20kb. The SMRTbell libraries were single-molecule sequenced using the PacBio RS II, employing the P6 C4 chemistry with either 4 hour or 6 hour movie lengths. The sequencing was conducted at the Ramaciotti Centre for Genomics at the University of New South Wales. A total of 272 SMRT cells were used to generate a total of ~182Gbp of sequence data; this equates to ~57X coverage under the assumption of a 3.2Gbp sized genome.
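The coverage figure quoted above follows from dividing the total sequence yield by the assumed genome size. A minimal Python sketch of that arithmetic is given below; the function name is illustrative only, and the inputs are the approximate values stated in the text.

```python
def fold_coverage(total_bases, genome_size):
    """Average sequencing depth: total sequenced bases over genome size."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size

# Approximate values from the text: ~182 Gbp of PacBio sequence data
# and an assumed genome size of 3.2 Gbp.
pacbio_coverage = fold_coverage(182e9, 3.2e9)
print(f"~{pacbio_coverage:.0f}X")  # ~57X
```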

2.2.5 Whole koala genome assembly with long-read data and analysis of KoRV insertions and flanking genes

Sequence data was as described above. The reads were assembled in FALCON (v0.3.0 using standard parameters) by Dr. Zhiliang Chen. The MAKER annotation pipeline was used to produce annotation of gene models and repeats within the assembly; this step was conducted by Dr. Denis O’Meally. The scaffolds of the final assembly were BLASTed against the KoRV-A reference genome to determine scaffolds containing KoRV. Annotations produced for KoRV-containing scaffolds were then utilised to identify the closest gene to each KoRV insert.

2.2.6 Targeted long-read assembly of genomic regions adjacent to KoRV insertions

To find long reads containing KoRV, the KoRV-A genome was matched against all PacBio reads using BLASR[68] and KoRV-containing reads were kept. The genomic koala components flanking the KoRV inserts were isolated and BLASR was used to match them against nr/nt databases to identify possible flanking genes.

For a targeted KoRV-seeded assembly, 145 SMRT cells of long read sequences (totalling ~96Gbp or 30X coverage under the assumption of a 3.2Gbp sized genome) were first examined. All reads were aligned against the KoRV-A reference genome using BLASR[68] and reads containing regions with 85% identity to KoRV were kept; this level of homology accounted for the error rate of PacBio reads and any possible KoRV sequence variants. The reads were assembled in FALCON (v0.3.0 using standard parameters) by Dr. Zhiliang Chen. Flanking regions were then isolated and queried using BLAST[64] against nr/nt databases (v2.2.30 standard parameters used) for possible identification of flanking genes.

2.2.7 Analysis for sequence motifs at KoRV insertion points in the long-read genome

The insertion points of KoRV were analysed as three separate components: the koala sequence upstream of the KoRV insertion, the KoRV sequence downstream of the insertion point, and the complete hybrid junction. The three data sets were then run separately through MEME-suite in both command line and web application versions (version 4.10.1 with standard parameters[67]), as was also conducted for the short read data.
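The three components described above can be pictured with a short Python sketch. It is illustrative only: it assumes that the koala sequence lies 5' of the insertion point and that the insertion point is known as a 0-based index of the first KoRV base, neither of which is specified in this form in the methods. The function name and toy sequence are hypothetical.

```python
def split_junction(contig, insertion_point):
    """Split a junction contig into the three data sets used for motif analysis.

    insertion_point is assumed to be the 0-based index of the first KoRV
    base, with the koala sequence lying upstream (5') of it.
    Returns (koala_upstream, korv_downstream, full_junction).
    """
    if not 0 < insertion_point < len(contig):
        raise ValueError("insertion point must lie inside the contig")
    return contig[:insertion_point], contig[insertion_point:], contig

# Hypothetical toy junction: lower case marks the koala part for readability.
koala, korv, junction = split_junction("gagagaggaTGAAAGACCCC", 9)
print(koala)  # gagagagga
print(korv)   # TGAAAGACCCC
```

Each of the three returned sequences would then be written to its own FASTA file before being passed to the motif search.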

2.3 Results

2.3.1 KoRV was not well represented in the short read-based whole genome assembly

A major aim of this work was to discover the insertion sites of KoRV in the genome of a single koala. To do this, we first examined a genome assembly produced from short read data. Table 4 gives details of this assembly, which was relatively poor in quality as evidenced from the large number of scaffolds and their small N50. Whilst KoRV was present in the raw short read data the only KoRV content present in the short-read assembly was a single scaffold containing a full length KoRV-A genome (Table 4). This was likely to be an assembly artefact, as KoRV is well-known to exist as multiple insertions in the koala genome.[69] There are a number of reasons why KoRV was not well-assembled in the short-read assembly. De novo short-read assemblers are known to face difficulty in assembling repeat regions[70], and the multiple insertions of KoRV essentially act as a repeat. The lack of KoRV within the short-read assembly is also likely due to low consensus at insertion points, leading to an inability to accurately assemble the regions at which KoRV is inserted into the genome. Since the short-read assembly did not provide any insights into the number or position of KoRV insertion sites, we subsequently explored a targeted, KoRV-seeded assembly.

Table 4 – Koala genome assembly quality and presence of KoRV: short read assembly from the female “PC” and long read assembly from the female “Bilbo”

|  | ABYSS ASSEMBLY OF SHORT READS | FALCON ASSEMBLY OF LONG READS |
| --- | --- | --- |
| Number of scaffolds / contigs | 800,663 | 1,803 |
| Scaffold N50 (>2kb) | 286,159 | 11,587,828 |
| Genome size | 3,192,246,348 | 3,396,189,353 |
| KoRV-containing scaffolds | 1 | 71 |
| Complete Single-Copy BUSCOs | 1,714 | 2,094 |
| Complete Duplicated BUSCOs | 37 | 50 |
| Fragmented BUSCOs | 636 | 587 |
| Missing BUSCOs | 673 | 292 |

The KoRV-seeded assembly produced scaffolds containing a full-length KoRV genome, as expected, but also 261 further scaffolds. These had a minimum size of 100bp, which represented orphans, and a maximum of 343bp. The minimum scaffolds were essentially chimeric reads, containing some KoRV and some koala genome, and represent insertion sites. The larger scaffolds also represented KoRV insertion sites but with greater degrees of koala or KoRV sequence. The production of these scaffolds was an important advance; however, the scaffolds were still too short for identification of any flanking genes: a BLAST analysis of all the genomic koala components adjacent to insertion sites produced no significant hits for any genes.

Figure 6: Sequence Motifs associated with KoRV insertion sites. (A) A 21-base motif was present in the genomic koala sequences adjacent to the KoRV insertion sites. Here, the KoRV insertion site was 3’ to the motif. (B) A 21-base motif was present in the KoRV sequence at the insertion site. This motif corresponds to the long terminal repeat (LTR) region of the KoRV genome. Here, the koala genomic sequence was 5’ to this motif.

2.3.2 Motif analysis in short reads that contain KoRV

The short scaffolds arising from targeted assembly were not useful for the investigation of flanking genes; however, they were of a length that could be examined for insertion motifs. Indeed, the suspected insertion method for gammaretroviruses is through utilisation of an integrase protein which targets a specific sequence.[71] Accordingly, the scaffolds from the targeted assembly were analysed for the presence of sequence motifs (see Materials and Methods).

The entire scaffolds were analysed, and the genomic koala portions and the KoRV portions of the scaffolds were also analysed separately. These analyses revealed a statistically significant 21-base GA-rich motif in the koala genome that was adjacent to KoRV insertion sites (Figure 6A). However, we noted that the overall site count for this motif was low, being present in only 64 out of the 261 scaffolds. This suggests there is a range of different insert sites, as this was the only motif produced for the genomic koala component.

For the KoRV sequence, as expected, the motif analysis also revealed a statistically significant 21-base motif, which was part of the KoRV long terminal repeat (LTR) that is present on both ends of the retrovirus (Figure 6B). There was some sequence variability within the KoRV motif. BLAST analysis of the KoRV component of the insertion sites had some of the LTR regions aligning to different sub-strains of the KoRV-A subtype (data not shown), which may explain the nucleotide differences present.

Figure 7: The GA-rich and KoRV LTR motifs are of variable distances to each other. The genomic koala motif and the KoRV motifs were aligned onto the original scaffolds. (A) MAST mapping of the koala genomic motif (corresponding to Figure 6A) visualised as blue and the KoRV LTR insert motif (corresponding to Figure 6B) visualised as red. A subset (21 contigs) of the total data is shown. There are variable distances between the two motifs. (B) Frequency histogram of the distance between the two motifs, for all contigs.

To better understand the relationship between the GA-rich motif adjacent to many KoRV insertions and the KoRV LTR sequence itself, we investigated the distance between them on the contigs. The observed distance between the two motifs from Figure 6 was found to be variable in the produced scaffolds (see Figure 7A), ranging from 0 to 262 bases. A frequency histogram in Figure 7B shows a bimodal distribution with two observable peaks, at 0 to 20 bases and at 100 to 120 bases.
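A frequency histogram of this kind can be produced by grouping the inter-motif distances into fixed-width bins. The Python sketch below assumes 20-base bins, chosen here only to match the peak ranges quoted above; the function name and the distance values are hypothetical and are not the data behind Figure 7B.

```python
from collections import Counter

def bin_distances(distances, bin_width=20):
    """Count distances in half-open bins [start, start + bin_width)."""
    counts = Counter((d // bin_width) * bin_width for d in distances)
    return {(start, start + bin_width): counts[start] for start in sorted(counts)}

# Hypothetical distances (in bases) between the two motifs on seven contigs.
toy_distances = [0, 5, 19, 20, 105, 119, 262]
histogram = bin_distances(toy_distances)
for (start, end), count in histogram.items():
    print(f"{start}-{end}: {count}")
```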

2.3.3 KoRV was well-represented in the long read based assembly

The koala genome resulting from the PacBio long reads was of high quality. It comprised only 1,803 contigs, not including the alternate contigs, compared to the 800,663 scaffolds produced by the short read assembly (Table 4). The high contiguity of the PacBio assembly is further highlighted by the much higher N50 of ~11Mbp compared to the ~286kbp N50 of the initial short read assembly. BUSCO analysis, as expected, showed a greater number of complete genes present. One of the most significant advantages of the PacBio assembly over the short read assembly was the assembly of KoRV in multiple scaffolds; there were a total of 71 scaffolds that contained KoRV as full length or as fragments. Even more importantly, there was a large degree of koala genome found to be present adjacent to each KoRV insert.

To better understand the diversity of the KoRV insertions present in the long read assembly of the koala genome, the 71 KoRV insertions were analysed (Figure 8). This revealed that there were 49 full-length insertions of KoRV-A, five insertions that had deletions in either the pol or the pol and env genes, and two solo LTRs. Interestingly, there were also 15 other instances where there were pairs of LTRs, separated by ~7,500 bases of non-KoRV sequences. These have been described elsewhere as recKoRV (recombinant KoRV). In the alternate assembly generated by FALCON, there were also instances of other KoRV strains present. The most common of these alternate strains was KoRV-B (data not shown).[72]

Figure 8: Types of KoRV found in the assembly of the koala genome: Figure 8A is the characterised unique KoRV insert types observed from the PacBio FALCON assembly mapped against the KoRV-A reference genome. Highlighted scaffold components with an adjoining thin, arrowed line show that the two fragments exist on the same scaffold but are some distance from each other. Figure 8B is the phylogenetic tree of the KoRV types observed against previously described types of KoRV. The gibbon ape leukaemia virus (GALV) is used as a reference.[72]

2.3.4 Proximity of KoRV to genes in the koala genome

We annotated protein-coding genes adjacent to KoRV inserts successfully on the PacBio assembly of the koala genome. The genes are listed in Tables 5 and 6. The distances of genes from KoRV inserts varied, ranging from ~1kbp (for PNPLA4: Patatin-like phospholipase domain-containing protein 4) to over 600kbp (for RPS10: 40S ribosomal protein S10), raising the possibility of some inserts having a higher impact than others (Tables 5 and 6). It was notable that there were no cases of insertion into genes in the annotation produced by the MAKER pipeline. However, it should be noted that regions are annotated with the gene structures of highest homology. Due to this, mis-annotation is possible and insertional events could have been missed.

Table 5: Genes found 3’ to KoRV insertions in the koala PacBio genome assembly

a The listed distance of the gene from the KoRV insert is given. b The gene annotation is stated, with the organism of closest homology for the given annotation denoted in parentheses.

Table 6: Genes found 5’ to the KoRV insertions in the koala PacBio genome assembly

a The listed distance of the gene from the KoRV insert is given. b The gene annotation is stated, with the organism of closest homology for the given annotation denoted in parentheses.

To further investigate the possibility of KoRV insertions associated with protein coding genes, a KoRV-targeted assembly was produced with PacBio long reads. This was done to examine regions of potentially lower sequence coverage due to polymorphism. The resulting sequences adjacent to the KoRV insertions were queried. This analysis revealed a potential insertion of a full-length KoRV into a gene structure, specifically an annotated exonic region of HMBOX1 (Figure 9). To validate this result, Illumina short reads generated from the genome of the same animal (‘Bilbo’) were mapped against the 1000 bases containing the junction of the KoRV and the HMBOX1 gene. The short reads were able to span the KoRV insert junction, as seen in Figure 9B, validating the insertion in this region despite it not being present in the full PacBio assembly. Given the possibility that this was in a region of polymorphism, the gene disruption was also investigated in the animal Pacific Chocolate (PC), on the expectation that it might be absent. Illumina short reads from PC were mapped to the same 1000 bases. The resulting mapping, seen in Figure 9C, shows that the disruption of HMBOX1 was not present in PC, as no reads were able to map across the junction.
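The junction test described above can be illustrated with a minimal sketch. It assumes, hypothetically, that each read alignment has been reduced to a 0-based start and an end-exclusive coordinate on the 1000-base reference, that the junction lies at position 500, and that a read must cover at least 20 bases on either side to count as spanning. The function names, the flank threshold and the example coordinates are illustrative only and are not taken from the mapping pipeline used here.

```python
JUNCTION = 500   # approximate junction position in the 1000-base reference
MIN_FLANK = 20   # hypothetical minimum aligned bases required on each side


def spans_junction(start, end, junction=JUNCTION, min_flank=MIN_FLANK):
    """True if the alignment [start, end) covers the junction with at least
    min_flank aligned bases on both sides of it."""
    return start <= junction - min_flank and end >= junction + min_flank


def count_spanning(alignments, junction=JUNCTION, min_flank=MIN_FLANK):
    """Count alignments, given as (start, end) pairs, that span the junction."""
    return sum(spans_junction(s, e, junction, min_flank) for s, e in alignments)


# Made-up alignments for illustration only (not real read data).
with_insert = [(380, 530), (450, 600), (10, 160), (495, 645)]
without_insert = [(350, 500), (500, 650), (10, 160)]

print(count_spanning(with_insert))     # reads spanning the junction
print(count_spanning(without_insert))  # reads stop at, or start from, the junction
```

Requiring a flank on both sides, rather than any overlap with position 500, guards against reads that merely abut the junction being counted as support for the insertion.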

Figure 9: Insertion of KoRV in the koala HMBOX1 gene, discovered from a targeted assembly using PacBio long reads. Evidence of gene disruption found within a scaffold from Bilbo’s KoRV-based sub-assembly. This gene disruption was not previously detected in the short read based assemblies, and thus highlights the value of using long read and short read data together in hybrid assemblies. Figure 9A is the BLAST result of the given scaffold. The query sequence is the PacBio contig. The green component represents a full length KoRV insertion. The orange component corresponds to the predicted gene structure of HMBOX1, with introns shown as grey lines between the orange. Figure 9B is the mapping of the Illumina short reads of Koala Bilbo against the 3’ end of the KoRV insert and adjacent koala genome, showing that short reads span the gene disruption site. Figure 9C is the mapping of the Illumina reads from the Koala Pacific Chocolate against the same region. In both Figures 9B and 9C, the disruption junction occurs at ~500bp in the reference sequence.

2.3.5 Motif analysis in KoRV insertions present in the koala PacBio long read genome assembly

From analysis of Illumina short reads, above, it was shown that there were motifs associated with KoRV insertion events. These were present in the koala genome, and in the KoRV retrovirus itself. To investigate whether these were also seen in the final, PacBio long read assembly of the genome, motif analysis was again carried out at KoRV insertion sites. No statistically significant motif was found in the koala genome sequences adjacent to the KoRV insertion sites. This may be due to the fact that only 71 insertion sites were examined (Table 4). However, a significant motif was present in the KoRV sequences from the PacBio assembly (Figure 10A). This was of higher consensus than the motif produced in the short read assembly (in Figure 6B) which is consistent with the KoRV insertions in the PacBio long read assembly mapping to the KoRV-A strain and not others (Figure 8). In fact the motif generated matched perfectly with a sequence inside the KoRV-A long terminal repeat (LTR).

Figure 10: KoRV motif present at insertion points of the retrovirus in the koala genome, in the PacBio assembly: Figure 10A is the KoRV insertion motif produced by MEME-suite utilising the PacBio final assembly scaffolds. Figure 10B is a mapping of the motif observed in subfigure A against the KoRV-A reference genome. The motif is found at position 107 to 135 in the KoRV-A long terminal repeat (LTR).

2.3.6 Are KoRV insertions homozygous or heterozygous in the koala genome?

As noted above, KoRV insertions in the genome can be either homozygous or heterozygous. We explored whether this could be determined from our PacBio long read genome assembly and long read mapping. At the same time we explored whether the KoRV was on primary or alternate contigs, given that the Falcon assembler generates a primary assembly and also alternate assemblies which can represent regions of chromosomal heterozygosity. The frequency of the read mapping against KoRV-containing contigs is shown in Figure 11. The mapping was done to the KoRV-insert region of each KoRV-containing contig. PacBio long reads that mapped to multiple contigs were to be discarded from the dataset; however, no such reads were found. We had expected that primary scaffolds would be homozygous and would thus have twice the number of reads mapping to them as alternate contigs, which represent heterozygous scaffolds. However, this trend was difficult to see. Whilst most alternate contigs showed the expected, similarly low read mapping (~200 reads), some alternate contigs showed much higher mapping (500 to 700 reads). Similarly, some primary scaffolds showed higher degrees of read mapping (400 to 1200 reads), yet many others showed the same low degree of read mapping as seen in most alternate contigs (~200 reads). Given this variation, it was concluded that a read mapping approach alone cannot be used to determine homo- or heterozygosity, and thus the status of KoRV insertions in this regard remains unknown.
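The expectation stated above, that a homozygous insertion should attract roughly twice the reads of a heterozygous one, can be written down as a small sketch. It is only an illustration of that expectation and not the analysis performed here: the haploid baseline of 200 reads is taken from the approximate count seen on most alternate contigs, and the function names, labels and the tolerance of 0.25 fold are hypothetical choices.

```python
def fold_coverage(read_count, haploid_baseline):
    """Read count expressed as a multiple of the assumed single-copy baseline."""
    return read_count / haploid_baseline


def expected_class(read_count, haploid_baseline, tolerance=0.25):
    """Label a contig by how close its fold coverage lies to 1 or 2.

    The tolerance is an arbitrary, illustrative threshold.
    """
    fold = fold_coverage(read_count, haploid_baseline)
    if abs(fold - 1.0) <= tolerance:
        return "heterozygous-like"
    if abs(fold - 2.0) <= tolerance:
        return "homozygous-like"
    return "ambiguous"


BASELINE = 200  # assumed single-copy read count (approximate, from the text)

# Read counts spanning the ranges reported above.
for count in (200, 400, 700, 1200):
    print(count, fold_coverage(count, BASELINE), expected_class(count, BASELINE))
```

Under these assumptions, counts such as 700 or 1200 reads fall well outside both expected levels, which is one way of seeing why the two-level division could not be recovered from the observed mapping.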

Figure 11: Homozygous/heterozygous read mapping: Figure 11A is the given read mapping counts against all KoRV-containing contigs. The blue data points refer to the primary contigs, which denote areas of homozygosity, and the orange data points are alternate contigs which denote areas of polymorphisms and likely heterozygosity. Figure 11B is the expected results of the given mapping with the y-axis instead referring to the comparative fold coverage of mapped reads, which would be dependent on the total sequencing depth. The primary scaffolds at ~2 fold coverage would denote homozygous scaffolds whereas all scaffolds at ~1 fold coverage denote heterozygous scaffolds. This trend of division at two levels was unable to be seen in the graphed data in Figure 11A.

2.4 Discussion

The results from this investigation have allowed the first full characterization of the insertion of KoRV within a single Koala genome. Through the utilization of the PacBio long read-based assembly, the results show that the genome of Bilbo predominantly contains the KoRV-A strain. Most significantly, relationships of adjacent protein coding genes to KoRV inserts could be determined. Evidence of insert motifs for the KoRV inserts was also established for both Koalas Pacific Chocolate and Bilbo.

The short read based assemblies were unable to produce definitive results for the investigation of coding genes adjacent to KoRV inserts. This was due to the lack of any assembled KoRV inserts that had adjacent koala genomic sequence in the same scaffold, and the fact that only a single scaffold contained a fully assembled KoRV-A genome. This issue is probably due to the retroviral element acting essentially as a repeat element, which short read assembly technology has difficulty overcoming.[70] For this reason, the short read assembly methods utilized were unable to produce the expected KoRV inserts due to lack of consensus. The use of long read technology, however, has been noted to overcome the fragmentation that is normally associated with short read based technologies, primarily because the larger read length reduces or removes ambiguities in repeat regions.[73] The ability of long read technologies to overcome the issue of repeats was confirmed in that multiple, fully assembled KoRV inserts were found in the long read based assemblies, along with adjoining genomic koala content.

After the observation of the presence of KoRV in the long read based assemblies, the strains present within Bilbo were characterised. This showed that the Bilbo Koala genome contained only KoRV that mapped to the KoRV-A reference genome. This result implies that KoRV-A has been endogenized and is now in the germline. This result is in contrast to those from Oliveria et al. [60] who observed KoRV endogenisation of a different strain of KoRV, specifically the CETTAG motif strain. However, our observation of KoRV insertion in the HMBOX1 gene, in one Koala but not another, implies that infection patterns between individuals can vary. Aside from the KoRV-A insertions, the only other KoRV types observed were recombinant KoRV (recKoRV) and KoRV-B, but these were not present in the main long read-based assembly. recKoRV was only present in the alternate contigs and KoRV-B only existed in a single PacBio read.[72] In the case of the single read containing KoRV-B, this low representation amongst the data (for both reads and the assemblies) implies that this KoRV-B may not be a germline insertion but part of a current infection present in Bilbo.

For the short read based assemblies, the establishment of an insert motif for both KoRV and the corresponding genomic koala component was of interest. However, this was not recreated with the long read data set. The successful characterisation of the protein coding genes adjacent to KoRV inserts is an important observation arising from the long read-based assemblies. An alternate method would have been to take the KoRV-containing scaffolds, mask the KoRV content, and then BLAST the sequence against high quality annotated genome sets. However, this method was not undertaken with the short read dataset due to the low scaffold size, which could not support this type of mapping. The biological impact of these insertion events is likely to be affected by whether the insertion is actually within exons, introns or UTRs of a gene, or whether regulatory elements such as promoters or enhancers have been affected. The HMBOX1 disruption might affect this gene, and it is known that its suppression can suppress natural killer (NK) cell activity.[74] This may affect immune responses in the affected individual. Further work on this result would require transcriptomic analysis of Bilbo. Likewise, the KoRV insert adjacent to the VTCN1 (V-set domain-containing T-cell activation inhibitor 1) gene may be of biological significance, with disruption to the gene having a documented association with autoimmune diabetes in humans.[75] However, the possibility of VTCN1 disruption is low, with the KoRV insert being ~230kbp away from the gene and most promoter regions occurring within less than 50kbp of any coding sequence.[76] As with the HMBOX1 disruption, further transcriptomic analysis of the individual would be required to establish a possible disruption of this gene.

One of the advantages of the FALCON assembler for long reads is its ability to produce a diploid assembly; this allows us to determine areas of heterozygosity within an individual.[34] The implication of this, in terms of the KoRV analysis for this thesis, is the ability to determine whether an insertion is homozygous or heterozygous. The read based mapping, which aimed to establish homo- or heterozygosity, however remained inconclusive due to inconsistencies in the results. This is likely due to inaccurate mapping methods, a problem which is difficult to overcome due to the error rate present in PacBio reads.

2.4.1 Short read

Here we showed that standard short read-based sequencing is unsuitable for the analysis of KoRV retroviral insertions. This is because KoRV behaves as a repeat element in the genome, leading to assembly methods not producing KoRV insert points accurately.

The possibility of differing strains being present, as was suggested by the strong but not perfect motif consensus, presents further challenges for short read assemblers, because regions of low consensus are lost and assumed to be sequencing errors. For these same reasons, only KoRV-A sequences could be assembled throughout the short read analysis, despite the KoRV-B strain being found during transcriptome analysis.[53]

The very small difference between the two strain types, there being a 6 amino acid (18bp) deletion in KoRV-B found in the envelope protein, made any attempt to produce KoRV-B assemblies near impossible.

Despite the targeted KoRV assembly producing better results than the general genomic assembly, the goal of finding Koala genes adjacent to KoRV insertions was not met. Very small assemblies were generated next to KoRV insertions, highlighted in Figure 7A, which reduced the chance of mapping any possible annotated genes.

2.4.2 Long read

The aim of annotating genes adjacent to KoRV inserts was achieved with the PacBio whole genome assembly, using the annotations from the annotation pipeline. Given the higher quality of the PacBio-based assembly, seen in the higher BUSCO results and lower assembly fragmentation, this was the expected outcome.

Gene annotations lay at varying distances from KoRV inserts, ranging from ~1kbp to over 600kbp away, suggesting that some insertions may have a higher impact than others. Whilst no protein coding genes, apart from HMBOX1 (discussed below), were interrupted by KoRV insertions, there remains a possibility of mis-annotations in the genome. Most gene structures were inferred from H. sapiens, which is of high evolutionary distance from the koala. This was due to the high gene annotation quality of human and the low annotation quality of marsupial genes.

Prior to the production of the full assembly, a KoRV-targeted assembly was produced to observe the effects of KoRV inserts within the koala genome. The most significant result was an insertion of KoRV into the HMBOX1 gene. The absence of this event in the full PacBio assembly was notable and suggests it was missed because it was present only in a subset of reads within the data. This was later hypothesized to possibly be a somatic cell insertion event, which was detected owing to the capacity of the PacBio platform to sequence single molecules of DNA. In order to validate this result, the X-ten short reads were mapped against the junctions of KoRV and the Koala genome. The short reads from the same individual were able to span the KoRV insert junction, validating the insertion in this region despite it not being present in the final assembly. Due to the significance of the result, the gene disruption was also queried with the short reads from a second koala (PC, Pacific Chocolate) to observe whether the same gene disruption occurred across individuals. The resulting mapping showed that the KoRV insertion in HMBOX1 was not present in PC, as no reads were able to map across the junction.

Interpretation of the HMBOX1 result is complicated by the disruption lying in an exon that is not present in previously annotated homologues of the gene. The low quality of annotation for HMBOX1 in other marsupial genomes makes it possible that the koala HMBOX1 annotation may be a pseudogene.

To conclude, the long read-assembled genome provided a unique opportunity to establish the insertion positions and types of KoRV in the koala genome. It also has potential to help understand the differences between homo- and heterozygous insertion events and distinguish germline from somatic cell KoRV insertions.

3 THE UTILITY OF LONG READS IN THE HYBRID ASSEMBLY OF THE KOALA GENOME

Contributions to this chapter

All work done in this chapter was done by me, with the below exceptions.

- All Illumina based assemblies (20X, 40X, 57X, 57X with BioNano data) were assembled by Dr. Graham Etherington.

- Hybrid assemblies utilising 57X Illumina coverage with BioNano data alongside 5X, 10X, 15X PacBio coverage were conducted by Dr. Zhiliang Chen.

- Selection of the 5X, 10X, 15X and 20X coverage PacBio sequencing sets was done by Dr. Zhiliang Chen.

3.1 Introduction

With the advent of long read technology, the need for newer assembly methods was quickly realised. This was due to incompatibilities between short read assemblers and the long read lengths. The much higher cost of long read technology, however, has given rise to projects utilising both short and long read technologies as a cost saving measure, thus also leading to the development of hybrid assembly methods designed to use both technologies.

Whilst hybrid assemblies have been undertaken for smaller genomes[39, 40], due to the emerging nature of long-read sequencing, the value of hybrid assembly methods for larger genomes remains poorly understood.[33]

Benchmarking studies have shown that hybrid assembly methods are able to produce assemblies of higher quality than assemblies using only short read technology.[77] However, the assessment in these studies only utilised the generalised assembly quality heuristics of N50 and scaffold number to determine the level of fragmentation of the assembly.[42]

The purpose of this study is to observe the effect of changing the amount of long and short read data on the assembly quality of the Koala genome. The expected outcome is an increase in assembly quality as the amount of data introduced is increased. The number of scaffolds or contigs is expected to decrease with the introduction of long reads, or BioNano data, as the long reads or BioNano data should be able to span areas of repeats, which are known to affect short read-based assembly methods. However, due to the much higher error rate of the long read technology, it is expected that hybrid assemblies utilising low amounts of long read data will have errors in assembly and in sequence. In situations where small amounts of short read data are used in a hybrid assembly, sequence error may also be common, as the short read data is also utilised as an error-correcting method in hybrid assembly algorithms.[42]

The assemblies here will be done by use of a hybrid method, namely the PBJelly tool,[42] with sequencing data from the Koala Genome Consortium. This tool uses assemblies built from short read data, and then PacBio long reads, as input. The resulting hybrid assemblies will be analysed for assembly statistics, gene content using BUSCO analysis and Repeat Masker for repeats to determine the degree of assembly improvement by use of different amounts of short and long read data.

3.2 Methods

3.2.1 Bioinformatics methods

Four short read assemblies, one of which included BioNano optical mapping, were made by Dr. Graham Etherington of The Earlham Institute (formerly TGAC/The Genome Analysis Centre). The datasets used 20X, 40X and 57X coverage of Illumina HiSeq reads, under the assumption of a 3.2Gbp genome, with an extra assembly being made from the 57X coverage dataset with supplementary BioNano optical mapping data. Sampling of the PacBio reads was done utilising the sequencing data from the Koala Genome Consortium, with randomised selection of SMRT cells to total 5X, 10X, 15X or 20X coverage. This sampling was done before the hybrid assemblies were undertaken. After sampling, frequency histograms were used to assay the length distributions of the resulting datasets.
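The coverage-based sampling can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure actually run: coverage is taken as total sequenced bases divided by the assumed 3.2Gbp genome size, SMRT cells are shuffled once, and each target is filled from the front of the shuffled list so that lower coverage sets are subsets of higher ones. The cell identifiers, the yield of 1.6Gbp per cell and the function names are hypothetical.

```python
import random

GENOME_SIZE_BP = 3_200_000_000  # assumed genome size used for all coverage values


def coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Fold coverage: sequenced bases divided by the assumed genome size."""
    return total_bases / genome_size


def sample_cells(cell_bases, targets, seed=0, genome_size=GENOME_SIZE_BP):
    """Pick SMRT cells in one random order until each target coverage is met.

    cell_bases maps a cell identifier to the bases it yielded. Using a single
    shuffled order makes each lower-coverage set a subset of the higher ones.
    """
    order = sorted(cell_bases)
    random.Random(seed).shuffle(order)
    subsets = {}
    for target in sorted(targets):
        chosen, total = [], 0
        for cell in order:
            if total >= target * genome_size:
                break
            chosen.append(cell)
            total += cell_bases[cell]
        subsets[target] = chosen
    return subsets


# Hypothetical input: 40 cells of 1.6 Gbp each (0.5X per cell).
cells = {f"cell_{i:02d}": 1_600_000_000 for i in range(40)}
subsets = sample_cells(cells, targets=[5, 10, 15, 20])

for target, chosen in subsets.items():
    bases = sum(cells[c] for c in chosen)
    print(f"{target}X target: {len(chosen)} cells, {coverage(bases):.1f}X achieved")
```

Sampling whole SMRT cells rather than individual reads keeps each subset representative of complete sequencing runs, at the cost of only being able to hit a target coverage to within one cell's yield.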

The four short read assemblies were used with the PBJelly tool (v15.8.24 using standard BLASR parameters [42, 68]) and the four long read datasets to undertake a total of 15 hybrid assemblies. The assemblies were conducted on one of two computing cluster configurations, with either 6 x 3.5GHz 128Gb RAM 16-core nodes (with a maximum of 64 nodes to be occupied at a single time) or 6 x 3.2GHz 256Gb RAM 24-core nodes (with a maximum of 96 nodes to be occupied at a single time). The inputs were thus the four short read assemblies (with coverage from 20X to 57X, including a 57X assembly with BioNano optical mapping) and a range of PacBio long read data (5X, 10X, 15X, 20X). The wall time for each hybrid assembly, on these cluster configurations, is given in Table 7.

General statistics for hybrid assemblies were calculated using ABySS tools (v2.0.2 using standard parameters).[46] The general statistics used were the maximum scaffold size, the N50 (the scaffold length such that scaffolds of this length or longer contain at least 50% of the assembled bases, a length-weighted analogue of the median) and the number of scaffolds.
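To make the three statistics concrete, the following is a minimal sketch of how they can be computed from a list of scaffold lengths. It is not a re-implementation of the ABySS tools and is not guaranteed to match their output in every edge case; the function name, the optional minimum-length filter and the example lengths are illustrative assumptions.

```python
def assembly_stats(lengths, min_length=0):
    """Scaffold count, maximum scaffold size and N50 for a set of lengths.

    Scaffolds shorter than min_length are removed first. N50 is the length of
    the scaffold at which the running total, taken from longest to shortest,
    first reaches half of the total retained bases.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        return {"count": 0, "max": 0, "n50": 0}
    half = sum(kept) / 2
    running = 0
    for n in kept:
        running += n
        if running >= half:
            return {"count": len(kept), "max": kept[0], "n50": n}


# Made-up scaffold lengths (bp) for illustration.
example = [100, 80, 50, 30, 20, 10, 10]
print(assembly_stats(example))                 # all scaffolds
print(assembly_stats(example, min_length=20))  # after removing short scaffolds
```

For real assemblies the same function would be called with `min_length=2000`, reflecting the removal of scaffolds smaller than 2kbp before N50 and scaffold count are reported.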

For establishing annotation of the conserved gene sets, the hybrid assemblies were analysed with BUSCO (v1.1b1 using standard parameters and the “Vertebrata” training set).[48] For the annotation of repeat regions within the assemblies, the 20X, 40X, and 57X Illumina short read-based assemblies, as well as their hybrid assembly counterparts from use of 20X PacBio coverage, were run through Repeat Masker (v4.0.1 using standard parameters and “Phascolarctos cinereus” as the training species).[78]

3.3 Results

3.3.1 PacBio data sampling and hybrid assembly run times

Figure 12 shows that the length distributions of the data sets were consistent, with the majority of reads being up to 20kb in length but with reads being present of up to 40kb.

Six Kolmogorov–Smirnov tests were conducted to determine if the 4 datasets were of the same distribution. All tests produced P-values < 0.05, signifying the datasets were statistically similarly distributed.

It was found that the assembly time ranged from 84 to 513 hours (Table 7). The major influence on assembly time was the coverage of long read data used: assembly time rose with every increase in long read data, and rose more steeply at higher coverage. The different short read assemblies, used in the hybrid assemblies, did not greatly influence the assembly time.

Figure 12: Frequency histogram of PacBio long reads used for input of hybrid assemblies. The distribution of the 4 datasets shown with the 5X coverage shown in panel A, 10X in panel B, 15X in panel C, and 20X in panel D. The coverage of each dataset is established under the assumption of a 3.2Gbp sized genome. Each data set is a subset of the higher coverage sets i.e. 5X is a subset of all others, 10X is a subset of 15X and 20X, and 15X is a subset of 20X.

Table 7: The time taken for each hybrid assembly to run.

Illumina coverage    PacBio coverage    Assembly time (hours)
20X                  5X                 88
20X                  10X                123
20X                  15X                208
20X                  20X                450
40X                  5X                 84
40X                  10X                117
40X                  15X                218
40X                  20X                467
57X                  5X                 89
57X                  10X                120
57X                  20X                507
57X w/ BioNano       5X                 92
57X w/ BioNano       10X                118
57X w/ BioNano       15X                260
57X w/ BioNano       20X                513

3.3.2 General statistics of hybrid assemblies

With all of the hybrid assembly summary statistics, the general trend was that there were improvements in assembly length and quality by use of long read PacBio data. The maximum scaffold size of each assembly saw consistent increases with the introduction of increasing amounts of PacBio data, as seen in Tables 8 and 9, and Figure 13. The largest change of any of the general assembly statistics was for the 57X Illumina / BioNano with 20X PacBio hybrid assembly, which produced an almost tripling in maximum scaffold size, i.e. a 197% increase (Table 9). This large increase suggests that the higher quality short read based assembly was able to translate into larger increases in maximum scaffold size. Large increases in scaffold size were also seen for the 20X Illumina with 20X PacBio (67% increase in maximum scaffold) and the 57X Illumina with 20X PacBio (85% increase) assemblies. Even the smallest overall final increase in maximum scaffold, with the 40X Illumina-seeded hybrid assembly, was able to produce a 27% increase in this metric. This is an expected outcome, with the increase in PacBio data allowing the level of consensus required for a scaffold assembly algorithm to be more easily reached.
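The relationship between the percentage increases reported here and fold changes such as "almost tripling" can be shown with a short sketch. The function names are illustrative, and the scaffold sizes in the first call are made-up round numbers chosen only to reproduce a 197% increase; they are not values from Table 8.

```python
def percent_change(baseline, new):
    """Percentage change of a hybrid-assembly statistic relative to the
    corresponding short read assembly (0X PacBio coverage)."""
    return (new - baseline) * 100.0 / baseline


def fold_change(percent):
    """Convert a percentage change into a multiplicative fold change."""
    return 1 + percent / 100.0


# Hypothetical sizes: a scaffold growing from 100 to 297 units.
print(percent_change(100, 297))        # percentage increase
print(round(fold_change(197.0), 2))    # fold change for a 197% increase
print(round(fold_change(-1.28), 4))    # fold change for a 1.28% decrease
```

A 197% increase therefore corresponds to a final value just under three times the baseline, while a 1.28% decrease leaves the statistic at a little under 99% of its starting value.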

In the case of N50, which is a measure of the contiguity of the assembly, it was seen to increase in almost all hybrid assemblies (see Tables 8 and 9 and Figure 14). The greatest increases in N50 were in the 20X Illumina with 20X PacBio (57% increase) and in the 57X Illumina / BioNano with 20X PacBio (112% increase). One case in which the N50 did not increase was seen in the hybrid assembly containing 20X Illumina data and 5X PacBio data, as this actually resulted in a 1.28% decrease in N50. Another anomaly for the N50 results was the hybrid assembly containing 57X Illumina and 5X PacBio data, which saw an abnormally large spike in N50, as opposed to the generally linear trend observed. This assembly was re-run but showed no overall change in results.

Table 8: Statistics of hybrid assemblies of the koala genome.

a The given coverage is under the assumption of 3.2Gbp genome. b N50 calculated for scaffolds larger than 2kbp.

Table 9: Table of hybrid assembly statistics as percentage increases of the original Illumina based assembly.

a Values are functions of the Illumina based assembly thus there is no increase of this value in the respective rows.

Figure 13: Maximum scaffold sizes generated by the hybrid assemblies. Figure 13A displays the increase in maximum scaffold size for each hybrid assembly whilst Figure 13B shows the percentage increase in maximum scaffold size. The percentage increase graphed in Figure 13B is a comparison of each hybrid assembly against the original short read assembly (with 0X coverage of PacBio long reads). Long read coverage based on an estimated 3.2Gbp sized genome.

Figure 14: N50 of scaffold sizes generated by the hybrid assemblies. Figure 14A displays the N50 for the hybrid assemblies graphed against each other whilst Figure 14B shows the percentage increase of N50 for each hybrid assembly. The percentage increase graphed in Figure 14B is a comparison of each hybrid assembly against the original short read assembly (0X PacBio coverage). Long read coverage based on a 3.2Gbp sized genome. The N50 values were calculated after removal of scaffolds smaller than 2kbp.

The number of scaffolds larger than 2kb was used to measure the fragmentation of the hybrid assemblies whilst excluding the smaller scaffolds that would be left over from the short read-based assemblies (see Tables 8 and 9 and Figure 15). We had expected a decrease in the scaffold count, with increasing amounts of PacBio data, as scaffolds from short read assemblies were joined by the introduction of long read data. This was seen for the hybrid assembly with 57X Illumina data, whereby there was up to a 33% decrease in overall scaffold count as long read data was added. The trend observed elsewhere, however, was different. The two assembly sets that used either 20X or 40X Illumina coverage saw an increase in this assembly statistic. In the case of the 57X Illumina with BioNano based hybrid assemblies, the observed scaffold count saw the most dramatic increase, with a 32-42% increase in scaffolds larger than 2kb. These increases may be due to two things. The first is the introduction of hybrid scaffolds utilising the long and smaller reads. In cases where smaller reads were not assembled to scaffolds >2kb, the co-assembly of these with long reads may have generated new scaffolds. The second is the generation of new scaffolds from PacBio data alone. However, the PBJelly tool does not support this at low sequence coverage.

Figure 15: Scaffold count resulting from hybrid assemblies. Figure 15A displays the raw values graphed against each other whilst Figure 15B shows the percentage change of each hybrid assembly. The percentage change graphed in Figure 15B is a comparison of each hybrid assembly against the original short read assembly. Long read coverage based on a 3.2Gbp sized genome. The given scaffold count was calculated after removal of scaffolds smaller than 2kb.

3.3.3 BUSCO results

To observe the effects of using increasing degrees of PacBio data on genome completeness in each assembly, BUSCO analysis[48] was undertaken on the hybrid assemblies. This searches for a previously defined conserved gene set in each assembly, and provides a measure of assembly for the high content gene containing regions of a genome.

We expected that the BUSCO analysis would reveal an increase in the number of the conserved gene set that were present, when hybrid assemblies used increasing amounts of long-read data. However, our results showed that there was no substantial increase in gene content (as seen in Table 10). The complete BUSCO genes only varied by a maximum of 15 out of the 3,072 tested genes for any set of hybrid assemblies (the 57X Illumina and BioNano, with long reads). There was also little change in the numbers of fragmented or missing BUSCO genes in all the hybrid assemblies. These results suggest that the majority of increases in assembly quality were outside of gene structures, at least for those examined by the BUSCO tool, and were primarily in intergenic regions of the genome assembled.

To confirm this observation with a broader analytical approach, we used Augustus to predict all protein coding sequences in each of the hybrid assemblies. The Augustus predictions were taken from the BUSCO run for each assembly and are the predicted gene transcripts from the given annotations searching all open reading frames in the 6 translation frames. The output is given as a protein FASTA file and so the results are reported as an amino acid count. The results of the Augustus predictions were consistent with the BUSCO analysis in that we saw a maximum of 3% increase in the number of coding amino acids in the hybrid assemblies that used greater amounts of long read data (Table 10). This suggests that any increases seen in overall N50 and in maximum scaffold length must be associated with or due to the better assembly of non-coding regions of the koala genome.
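Since the Augustus output is a protein FASTA file and the results are reported as an amino acid count, the counting step can be sketched as below. This is an assumed, minimal way of obtaining such a count and not necessarily how the reported figures were produced; the function name and the example records are hypothetical, and a trailing `*` on a sequence line is treated as a stop symbol rather than a residue.

```python
import io


def count_amino_acids(handle):
    """Return (number of sequences, total residues) for a protein FASTA stream.

    Header lines start with '>'. Blank lines are skipped, and any trailing '*'
    on a sequence line is ignored rather than counted as a residue.
    """
    n_sequences = 0
    n_residues = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_sequences += 1
        else:
            n_residues += len(line.rstrip("*"))
    return n_sequences, n_residues


# Made-up records for illustration only.
example_fasta = ">g1.t1\nMKV\nLLA*\n\n>g2.t1\nMA\n"
print(count_amino_acids(io.StringIO(example_fasta)))
```

For a file on disk, the same function can be applied to an open file handle, for example `with open(path) as handle: count_amino_acids(handle)`, where `path` points to the predicted protein FASTA for one assembly.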

Table 10: Table of annotation statistics of hybrid assemblies using the BUSCO tool.

a amino acids

3.3.4 Repeat Masker results

The above analyses suggested that high content regions of the genomes were well-assembled with short read data. However, the use of long read data in hybrid assemblies did not dramatically improve the assembly of coding regions. To then see if there was a significant increase in non-coding regions within the assemblies, especially those that contain repeats, the hybrid assemblies were analysed with the Repeat Masker tool.[78] This should measure genomic elements not seen through the previous annotation methods. Only the Illumina assemblies with 0X PacBio data and the corresponding hybrid assemblies with 20X PacBio data were run through the Repeat Masker tool, as these were expected to highlight the greatest differences. As seen in Table 11, the use of PacBio data in hybrid assemblies did not significantly increase the repeat regions. The repeat percentage of the genomes remained mostly the same at ~33% for all Illumina short read assemblies and hybrid assemblies. However, there was a dramatic increase in the number of bases carrying low complexity sequences in each hybrid assembly. The hybrid assemblies showed between 27% (57X Illumina) and 33% (40X Illumina coverage) increase in low complexity base pair count. There was also a notable increase observed in the number of low complexity insert elements for all hybrid assemblies, with the largest increase observed in the 40X hybrid assemblies showing a ~10,000 increase (4%) in low complexity repeat element count (Table 11). These observations, along with the BUSCO results, suggest that the improvements in assembly quality are not due to increases in the assembly of high complexity regions but are due to improvements in the assembly of low complexity regions.

Table 11: Summary of Repeat Masker results.

3.4 Discussion

This study has tested the effects of changing the amounts of long and short read data inputs for the hybrid assembly tool, PBJelly, when assembling the koala genome. To do this, a series of hybrid assemblies were produced to establish the optimum combination of the two input datasets. The results showed that at lower amounts of short read data, an increase in assembly quality could be more easily established from the introduction of more short read data as opposed to the introduction of long read data and hybrid assembly.

Interestingly, the introduction of BioNano data at the short read assembly stage significantly increased the quality of assemblies, including for the hybrid assemblies.

3.4.1 Hybrid assemblers and their use to date

PBJelly’s utilisation of long reads as a ‘gap filler’ is a hybrid assembly method that remains mostly unique in the emerging field of hybrid assembly tools, due mostly to the first party support of the tool by PacBio. The other comparable tool is hybridSpades[39], which utilises the same input but, in terms of algorithm, treats the short read scaffolds and long reads as effectively the same dataset. The initial development of the PBJelly tool undertook benchmarking with 4 different organisms (Cercocebus atys, Drosophila melanogaster, Drosophila pseudoobscura and Melopsittacus undulatus); however, only a single data set was used for each organism.[42] For hybridSpades, the tool was benchmarked on a series of bacterial genomes but, similar to PBJelly, benchmarking was only done with single datasets.[39] Whilst the benchmarking of the PBJelly tool is more comprehensive, due to its testing of larger eukaryotic genomes (C. atys with a 2.8Gbp genome and M. undulatus with a 1.1Gbp genome) as well as the assembly of multiple datasets, the effect of varying inputs has not been established. As a result, the significance of this investigation is to provide insight in terms of assembly inputs and optimisation.

3.4.2 Advantages and disadvantages of hybrid assembly versus short read assemblies

In terms of advantages of the hybrid assembly tool PBJelly, the introduction of long read data to short read assemblies led to significant increases in both N50 and maximum scaffold size. Some of the most drastic changes resulted in more than doubling of N50 and almost tripling of maximum scaffold size (in the case of the hybrid assembly containing 57X Illumina coverage with BioNano and 20X PacBio coverage). The advantages of long reads for the better hybrid assembly of repeat regions were also seen, as observed in the ~30% increase in low complexity repeat regions when 20X PacBio data was introduced.
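For readers less familiar with the two size statistics discussed above, the following is a minimal sketch of how N50 and maximum scaffold size can be computed from a list of scaffold lengths. The function name `n50` and the toy scaffold lengths are hypothetical and are not taken from the koala assemblies; the sketch only illustrates why joining scaffolds, as a gap-filling step may do, tends to raise both statistics.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (arbitrary units), before any scaffolds are joined.
before = [80, 70, 50, 40, 30, 20, 10]
# The same toy assembly after the two largest scaffolds are joined into one.
after = [150, 50, 40, 30, 20, 10]

if __name__ == "__main__":
    print("N50 before:", n50(before), "max:", max(before))
    print("N50 after: ", n50(after), "max:", max(after))
```

Because both statistics respond strongly to a small number of joins between large scaffolds, they can rise sharply without any corresponding change in gene content.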

The major disadvantage of the PBJelly hybrid assemblies was that, in the Augustus predictions, the addition of PacBio data produced essentially no increase in protein coding regions. There was also no dramatic increase in scaffolds larger than 2kb after introduction of PacBio data, with the exception of the 57X Illumina with BioNano assembly.

3.4.3 Limitations of the current study

BUSCO conserved gene set analysis was used to approximate the level of completeness within a given assembly. However, the use of a conserved gene set means that some genes are themselves of low complexity, due to the BUSCO set being an iteration of the CEGMA set.[47, 48] The lower complexity of some of the genes means that any increase in quality from the introduction of PacBio data might not be detectable. The observation of elements outside of basic repeat analysis and the conserved gene set with BUSCO was not conducted, with the major exception of our KoRV analysis (from the previous chapter), which demonstrated the advantages of long read technology for the detection of a retrovirus. The assemblies presented here were only conducted once, due to their long run time, yet a single assembly that was rerun to test conditions on a newer computing cluster had results which varied by <1% for most assembly statistics. In terms of the PBJelly algorithm, the utilisation of a differing dataset may produce differing results due to potential bias with this dataset. It should also be noted that the observed trends are incomplete for most assembly statistics, and it remains possible that higher PacBio coverages could greatly increase quality, especially at 40X coverage. However, this degree of coverage could instead be used for assemblies with just PacBio long reads, with a tool such as FALCON. The correctness of the produced assemblies was not tested, as this would require whole assembly annotation methods to determine the proper synteny of genes and gene families.

3.4.4 Cost analysis

One purpose of this study was to provide a simulation of a genome assembly project and to determine the merits of hybrid assembly. As part of this, how much sequencing should be done for a project is usually one of the first considerations. This is usually determined by the budget of the project. The two sequencing technologies used here were the Illumina HiSeq 2500 for the short reads and the PacBio RSII for long reads. However, since this sequencing was completed, new and cheaper alternatives have emerged that can produce similar outputs at a lower cost per base. The new platforms are the Illumina HiSeq XTen or NovaSeq 6000, for short reads, and the PacBio Sequel for long reads. At time of writing, the Australian Dollar costs for library and instrument run consumables, normalised against the amount of data produced, were:

• HiSeq XTen: $13.00/Gb
• HiSeq 2500: $65.00/Gb
• PacBio RSII: $833.00/Gb
• PacBio Sequel: $432.00/Gb


As our hybrid assembly results saw insignificant increases in most assembly statistics after the introduction of PacBio data, the indication is that the best method to maximise genome assembly quality is to utilise short read technology. This is supported by the cost analysis, as the introduction of 20X PacBio data costs approximately $53,312 on the RSII platform or $27,648 on the Sequel platform in consumables alone, and the resulting increase in quality, across the assembly statistics, does not approach that obtained from the introduction of an additional 20X of short read data.
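The consumable costs quoted above follow from multiplying the per-Gb price by the amount of data needed for a given fold coverage. The sketch below reproduces that arithmetic. The genome size of 3.2 Gb is an assumption, chosen because 20X of 3.2 Gb (64 Gb) reproduces the quoted $53,312 and $27,648 figures exactly; the function and variable names are illustrative only.

```python
# Consumable costs in AUD per Gb, as listed in the cost analysis above.
COST_PER_GB_AUD = {
    "HiSeq XTen": 13.00,
    "HiSeq 2500": 65.00,
    "PacBio RSII": 833.00,
    "PacBio Sequel": 432.00,
}

# Assumption: a 3.2 Gb genome. This value is inferred from the quoted
# 20X PacBio costs and is not stated explicitly in the text.
GENOME_SIZE_GB = 3.2


def consumable_cost(platform, coverage, genome_size_gb=GENOME_SIZE_GB):
    """Return the consumable cost (AUD) of sequencing to a given fold coverage."""
    gigabases = coverage * genome_size_gb
    return round(COST_PER_GB_AUD[platform] * gigabases, 2)


if __name__ == "__main__":
    for platform in COST_PER_GB_AUD:
        cost = consumable_cost(platform, 20)
        print(f"{platform}: ${cost:,.2f} for 20X coverage")
```

Under the same assumption, the function can be used to compare the cost of an additional 20X of short read data against 20X of long read data on any of the listed platforms.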

Notwithstanding the above, we have shown that some regions of genomes are not well addressed by short read sequencing. For KoRV, it was simply not possible to use short read technology to investigate its presence in a single genome, and no significant results could be obtained until the final PacBio-based assembly of the Koala genome was used. This suggests that whilst the short read methods are useful for many aspects of genome assembly and annotation, the use of long read sequence data will help answer specific questions and provide opportunities for more in-depth annotation of regions which contain repeats or multiple insertions of retroviruses.


4 CONCLUSIONS

This thesis has investigated the utility of PacBio long sequence reads for the discovery of genomic retroviral insertions and for hybrid genome assembly. Whilst it was anticipated that long reads would be of great use in both applications, the degree to which the long reads improved the results was different in each case.

The characterisation of KoRV inserts in two individual Koalas was undertaken, to discover areas of insertional mutagenesis [Chapter 2]. For Pacific Chocolate, the genome assembly efforts prior to this investigation, with short reads, lacked KoRV-containing scaffolds. The targeted assemblies undertaken here generated short but useful scaffolds that provided some insight into regions in which the retrovirus inserts into the koala genome, and revealed a possible GA-rich insertion motif. Despite this preliminary result, the use of short read assemblies could not provide insight into adjacent genes. The use of long read technology, specifically from the individual Koala named Bilbo, generated genome assemblies that contained full length KoRV insertions in addition to adjoining genomic koala sequences containing protein coding genes. This was a substantial improvement over results from the short read-based efforts and allows us to conclude that long read technology is essential for the characterisation of KoRV inserts in the Koala genome. Similar approaches should also be useful for the analysis of retrovirus insertions in the genomes of other species.

The utility of long read technology in hybrid assembly methods was then investigated, to understand whether PacBio sequences at up to 20X coverage could improve short read-based genome assemblies [Chapter 3]. Genome assembly quality was determined through assembly size statistics, along with the annotation of conserved gene sets (through the use of the BUSCO tool) and repeats (through the Repeat Masker tool). The results from the hybrid assembly investigation showed that the introduction of long read data provided no tangible increase in the annotation of the conserved gene sets tested, nor any increase in protein coding regions. However, the assembly size statistics saw increases in both maximum scaffold size and N50 (the scaffold length at which scaffolds of that length or longer contain half of the assembled bases), and these were greatest in hybrid assemblies with higher fold coverage of short reads. For example, the 57X and 57X with BioNano short read assemblies saw the greatest increases in these statistics, compared to assemblies with 20X and 40X short read coverage. The capacity of long reads to improve the resolution of repeat regions was also seen in the hybrid assemblies, as observed in the increase in low complexity repeats. In order to fully evaluate the benefits of the use of long read data in hybrid assemblies, a more in-depth annotation analysis could also be conducted.

From the above investigations, it can be concluded that PacBio long reads are of particular utility for the characterisation of genomic elements that contain repeats – whether these be multiple insertions of a retrovirus or smaller genomic elements. These are poorly addressed by standard short read-based approaches.


5 REFERENCES

1. Goodwin, S., J.D. McPherson, and W.R. McCombie, Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet, 2016. 17(6): p. 333-51.
2. Mardis, E.R., Next-generation sequencing platforms. Annu Rev Anal Chem (Palo Alto Calif), 2013. 6: p. 287-303.
3. Watson, J.D. and F.H. Crick, Genetical implications of the structure of deoxyribonucleic acid. 1953. JAMA, 1993. 269(15): p. 1967-9.
4. Watson, J.D. and F.H. Crick, Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 1953. 171(4356): p. 737-8.
5. Gilbert, W. and A. Maxam, The nucleotide sequence of the lac operator. Proc Natl Acad Sci U S A, 1973. 70(12): p. 3581-4.
6. Sanger, F., et al., Nucleotide sequence of bacteriophage phi X174 DNA. Nature, 1977. 265(5596): p. 687-95.
7. Sanger, F., Sequences, sequences, and sequences. Annu Rev Biochem, 1988. 57: p. 1-28.
8. Venter, J.C., et al., The sequence of the human genome. Science, 2001. 291(5507): p. 1304-51.
9. de Koning, A.P., et al., Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet, 2011. 7(12): p. e1002384.
10. Rothberg, J.M. and J.H. Leamon, The development and impact of 454 sequencing. Nat Biotechnol, 2008. 26(10): p. 1117-24.
11. Hudson, T.J., et al., An STS-based map of the human genome. Science, 1995. 270(5244): p. 1945-54.
12. Nyren, P., Enzymatic method for continuous monitoring of DNA polymerase activity. Anal Biochem, 1987. 167(2): p. 235-8.
13. Ronaghi, M., et al., Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem, 1996. 242(1): p. 84-9.
14. Liu, L., et al., Comparison of next-generation sequencing systems. J Biomed Biotechnol, 2012. 2012: p. 251364.
15. Franca, L.T., E. Carrilho, and T.B. Kist, A review of DNA sequencing techniques. Q Rev Biophys, 2002. 35(2): p. 169-200.
16. Fedurco, M., et al., BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Res, 2006. 34(3): p. e22.
17. Turcatti, G., et al., A new class of cleavable fluorescent nucleotides: synthesis and optimization as reversible terminators for DNA sequencing by synthesis. Nucleic Acids Res, 2008. 36(4): p. e25.
18. Mardis, E.R., The impact of next-generation sequencing technology on genetics. Trends Genet, 2008. 24(3): p. 133-41.
19. Zerbino, D.R. and E. Birney, Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 2008. 18(5): p. 821-829.
20. Schulz, M.H., et al., Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics, 2012. 28(8): p. 1086-92.


21. Margolin, A.A., et al., ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 2006. 7 Suppl 1: p. S7.
22. Schatz, M.C., A.L. Delcher, and S.L. Salzberg, Assembly of large genomes using second-generation sequencing. Genome Res, 2010. 20(9): p. 1165-73.
23. Myers, E.W., et al., A whole-genome assembly of Drosophila. Science, 2000. 287(5461): p. 2196-204.
24. Li, R., et al., SOAP: short oligonucleotide alignment program. Bioinformatics, 2008. 24(5): p. 713-4.
25. Institute, B. DISCOVAR: Assemble genomes, find variants. 2017 [cited 2017 03/08]; Available from: https://www.broadinstitute.org/software/discovar/blog.
26. Weisenfeld, N.I., et al., Comprehensive variation discovery in single human genomes. Nat Genet, 2014. 46(12): p. 1350-5.
27. Das, S.K., et al., Single molecule linear analysis of DNA in nano-channel labeled with sequence specific fluorescent probes. Nucleic Acids Res, 2010. 38(18): p. e177.
28. Rhoads, A. and K.F. Au, PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics, 2015. 13(5): p. 278-89.
29. Berlin, K., et al., Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol, 2015. 33(6): p. 623-30.
30. Jain, M., et al., MinION Analysis and Reference Consortium: Phase 2 data release and analysis of R9.0 chemistry. F1000Res, 2017. 6: p. 760.
31. Lu, H., F. Giordano, and Z. Ning, Oxford Nanopore MinION Sequencing and Genome Assembly. Genomics Proteomics Bioinformatics, 2016. 14(5): p. 265-279.
32. Koren, S., et al., Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res, 2017. 27(5): p. 722-736.
33. Chin, C.S., et al., Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods, 2013. 10(6): p. 563-9.
34. Chin, C.S., et al., Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods, 2016. 13(12): p. 1050-1054.
35. Gordon, D., et al., Long-read sequence assembly of the gorilla genome. Science, 2016. 352(6281): p. aae0344.
36. Koren, S., et al., Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol, 2013. 14(9): p. R101.
37. Koren, S., et al., Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol, 2012. 30(7): p. 693-700.
38. Lee, H., et al., Error correction and assembly complexity of single molecule sequencing reads. bioRxiv, 2014.
39. Antipov, D., et al., hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics, 2016. 32(7): p. 1009-15.
40. Viraj Deshpande, E.D.F., Son Pham, Vineet Bafna, Cerulean: A hybrid assembly using high throughput short and long reads. arXiv:1307.7933, 2013.
41. Ye, C., et al., DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies. Scientific Reports, 2016. 6: p. 31900.
42. English, A.C., et al., Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One, 2012. 7(11): p. e47768.


43. Biosciences, P. Large Genome Assembly with PacBio Long Reads. 2016 [cited 2017 18/03]; Available from: https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads.
44. Lin, H.H. and Y.C. Liao, Evaluation and Validation of Assembling Corrected PacBio Long Reads for Microbial Genome Completion via Hybrid Approaches. PLoS One, 2015. 10(12): p. e0144305.
45. Bankevich, A., et al., SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol, 2012. 19(5): p. 455-77.
46. Simpson, J.T., et al., ABySS: A parallel assembler for short read sequence data. Genome Research, 2009. 19(6): p. 1117-1123.
47. Parra, G., K. Bradnam, and I. Korf, CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 2007. 23(9): p. 1061-7.
48. Simao, F.A., et al., BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 2015. 31(19): p. 3210-2.
49. Westerman, M., R.W. Meredith, and M.S. Springer, Cytogenetics meets phylogenetics: a review of karyotype evolution in diprotodontian marsupials. J Hered, 2010. 101(6): p. 690-702.
50. Mikkelsen, T.S., et al., Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature, 2007. 447(7141): p. 167-77.
51. Murchison, E.P., et al., Genome sequencing and analysis of the Tasmanian devil and its transmissible cancer. Cell, 2012. 148(4): p. 780-91.
52. Renfree, M.B., et al., Genome sequence of an Australian kangaroo, Macropus eugenii, provides insight into the evolution of mammalian reproduction and development. Genome Biol, 2011. 12(8): p. R81.
53. Hobbs, M., et al., A transcriptome resource for the koala (Phascolarctos cinereus): insights into koala retrovirus transcription and sequence diversity. BMC Genomics, 2014. 15: p. 786.
54. Hanger, J.J., et al., The nucleotide sequence of koala (Phascolarctos cinereus) retrovirus: a novel type C endogenous virus related to Gibbon ape leukemia virus. J Virol, 2000. 74(9): p. 4264-72.
55. Denner, J., Transspecies Transmission of Gammaretroviruses and the Origin of the Gibbon Ape Leukaemia Virus (GaLV) and the Koala Retrovirus (KoRV). Viruses, 2016. 8(12).
56. Denner, J. and P.R. Young, Koala retroviruses: characterization and impact on the life of koalas. Retrovirology, 2013. 10: p. 108.
57. Tarlinton, R.E., J. Meers, and P.R. Young, Retroviral invasion of the koala genome. Nature, 2006. 442(7098): p. 79-81.
58. Xu, W., et al., An exogenous retrovirus isolated from koalas with malignant neoplasias in a US zoo. Proc Natl Acad Sci U S A, 2013. 110(28): p. 11547-52.
59. Shojima, T., et al., Identification of a novel subgroup of Koala retrovirus from Koalas in Japanese zoos. J Virol, 2013. 87(17): p. 9943-8.
60. Oliveira, N.M., et al., Changes in viral protein function that accompany retroviral endogenization. Proceedings of the National Academy of Sciences, 2007. 104(44): p. 17506-17511.
61. Waugh, C.A., et al., Infection with koala retrovirus subgroup B (KoRV-B), but not KoRV-A, is associated with chlamydial disease in free-ranging koalas (Phascolarctos cinereus). Sci Rep, 2017. 7(1): p. 134.


62. Legione, A.R., et al., Koala retrovirus (KoRV) genotyping analyses reveal a low prevalence of KoRV-A in Victorian koalas and an association with clinical disease. J Med Microbiol, 2016.
63. Huang, Y., et al., CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics, 2010. 26(5): p. 680-682.
64. Altschul, S.F., et al., Basic local alignment search tool. J Mol Biol, 1990. 215(3): p. 403-10.
65. Wu, T.D. and C.K. Watanabe, GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics, 2005. 21(9): p. 1859-1875.
66. Stanke, M. and B. Morgenstern, AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Research, 2005. 33(Web Server issue): p. W465-W467.
67. Bailey, T.L., et al., MEME Suite: tools for motif discovery and searching. Nucleic Acids Research, 2009. 37(Web Server issue): p. W202-W208.
68. Chaisson, M.J. and G. Tesler, Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics, 2012. 13: p. 238-238.
69. Greenwood, A.D., et al., Transmission, Evolution, and Endogenization: Lessons Learned from Recent Retroviral Invasions. Microbiol Mol Biol Rev, 2018. 82(1).
70. Treangen, T.J. and S.L. Salzberg, Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews. Genetics, 2011. 13(1): p. 36-46.
71. Krishnan, L. and A. Engelman, Retroviral Integrase Proteins and HIV-1 DNA Integration. The Journal of Biological Chemistry, 2012. 287(49): p. 40858-40866.
72. Matthew Hobbs, A.K., Ryan Salinas, Zhiliang Chen, Kyriakos Tsangaras, Alex D. Greenwood, Rebecca N. Johnson, Katherine Belov, Marc R. Wilkins, Peter Timms, Long-read genome sequence assembly provides insight into ongoing retroviral invasion of the koala germline. Sci Rep, 2017.
73. Chaisson, M.J.P., et al., Resolving the complexity of the human genome using single-molecule sequencing. Nature, 2015. 517(7536): p. 608-611.
74. Wu, L., C. Zhang, and J. Zhang, HMBOX1 negatively regulates NK cell functions by suppressing the NKG2D/DAP10 signaling pathway. Cellular and Molecular Immunology, 2011. 8(5): p. 433-440.
75. Radichev, I.A., et al., Loss of peripheral protection in pancreatic islets by proteolysis-driven impairment of VTCN1 (B7-H4) presentation is associated with the development of autoimmune diabetes. Journal of immunology (Baltimore, Md. : 1950), 2016. 196(4): p. 1495-1506.
76. Griffiths AJF, M.J., Suzuki DT, et al., Transcription: an overview of gene regulation in eukaryotes. An Introduction to Genetic Analysis. 2000: New York: W. H. Freeman.
77. Sovic, I., et al., Evaluation of hybrid and non-hybrid methods for de novo assembly of nanopore reads. Bioinformatics, 2016. 32(17): p. 2582-9.
78. Smit, A., Hubley, R & Green, P. RepeatMasker Open-4.0. 2013-2015; Available from: http://www.repeatmasker.org.
