Optical Maps in Guided Genome Assembly
Total Page:16
File Type:pdf, Size:1020Kb
Optical maps in guided genome assembly Miika Leinonen Helsinki September 9, 2019 UNIVERSITY OF HELSINKI Master’s Programme in Computer Science HELSINGIN YLIOPISTO — HELSINGFORS UNIVERSITET — UNIVERSITY OF HELSINKI Tiedekunta — Fakultet — Faculty Koulutusohjelma — Studieprogram — Study Programme Faculty of Science Study Programme in Computer Science Tekijä — Författare — Author Miika Leinonen Työn nimi — Arbetets titel — Title Optical maps in guided genome assembly Ohjaajat — Handledare — Supervisors Leena Salmela and Veli Mäkinen Työn laji — Arbetets art — Level Aika — Datum — Month and year Sivumäärä — Sidoantal — Number of pages Master’s thesis September 9, 2019 41 pages + 0 appendices Tiivistelmä — Referat — Abstract With the introduction of DNA sequencing over 40 years ago, we have been able to take a peek at our genetic material. Even though we have had a long time to develop sequencing strategies further, we are still unable to read the whole genome in one go. Instead, we are able to gather smaller pieces of the genetic material, which we can then use to reconstruct the original genome with a process called genome assembly. As a result of the genome assembly we often obtain multiple long sequences representing different regions of the genome, which are called contigs. Even though a genome often consists of a few separate DNA molecules (chromosomes), the number of obtained contigs outnumbers them substantially, meaning our reconstruction of the genome is not perfect. The resulting contigs can afterwards be refined by ordering, orienting and scaffolding them using additional information about the genome, which is often done manually by hand. The assembly process can also be guided automatically with the additional information, and in this thesis we are introducing a method that utilizes optical maps to aid us assemble the genome more accurately. A noticeable improvement of this method is the unification of the contigs, i.e. we are left with fewer but longer contigs. We are using an existing genome assembler called Kermit, which is designed to accept genetic maps as auxiliary long range information. Our contribution is the development of an assembly pipeline that provides Kermit with similar kind of information via optical maps. The initial results of our experiments show that the proposed genome assembly scheme can take advantage of optical maps effectively already during the assembly process to guide the reconstruction of a genome. ACM Computing Classification System (CCS): Applied computing ! Molecular sequence analysis Theory of computing ! Pattern matching Avainsanat — Nyckelord — Keywords genome assembly, optical map Säilytyspaikka — Förvaringsställe — Where deposited Muita tietoja — övriga uppgifter — Additional information Thesis for the Algorithms study track ii Contents 1 Introduction 1 2 Background 2 2.1 Genome . .2 2.2 DNA sequencing . .3 2.3 Genome assembly strategies . .4 2.4 Supplementary long range data . .7 3 Optical map guided genome assembly 9 3.1 Data: reads and optical maps . .9 3.2 Error correction . 11 3.3 Pre-coloring contig assembly . 14 3.3.1 Read aligning . 14 3.3.2 Assembly graph and contig generation . 17 3.4 Optical map mapping . 20 3.5 Contig coloring . 23 3.6 Read coloring . 24 3.7 Post-coloring contig assembly . 27 4 Experiments and results 29 5 Conclusions 35 References 40 1 1 Introduction Inspecting the genome can bring us a great number of benefits. For example, it gives us insight into different diseases allowing us to develop ways to combat them. Tak- ing a look at a person’s genetic material can help doctors to personalize medication for them. We can also explore the genetic diversity of a population of endangered species, which can help us come up with an effective conservation plan. The pos- sibility to inspect and modify the genome also provides a host of applications in biological engineering. All in all, there are many ways to improve our, and other species’ lives by exploring the contents of our genomes. The first methods to read small parts of the genome were developed in the 1970s. Since then, more advanced technologies have emerged and we are able to read even longer sections of the genome faster than before. Unfortunately, we are unable to read the whole genome in one go; we only have the ability to collect shorter subsections of it, called reads. This leads to the problem of genome assembly, a way to automatically rebuild the genome from the smaller parts of it that are available to us. There already exists a variety of genome assembler programs which take in reads and try to reconstruct the genome as well as possible. When working with a completely new genome, you might not have a relevant reference genome on hand. This kind of assembly is called de novo assembly. It is still possible to guide a de novo assembly through other means besides a ready genome we can refer to. In this research project we are utilizing one of such assemblers, called Kermit [WRS18]. Kermit is designed to accept long range information about the reads provided to it, so it can produce a more accurate representation of the genome. The long range information is given in the form of genetic maps. Other kinds of auxiliary data are also available, and in this thesis we will examine how optical maps can be used to guide genome assembly process. Typically optical maps have been used after the initial genome assembly is complete to refine the final reconstruction of the genome. Commonly this is done manually, which takes a lot of time and effort, and is susceptible to human error. Automated systems that involve optical maps already during the assembly process have been researched and developed to some extent, but the results are still quite slim. As an example, AGORA [LGM+12] is an assembler program that can utilize optical maps, but it was only tested with error free reads of very small bacterial 2 genomes. Nevertheless, AGORA got positive results from using optical maps. An- other example program called KOOTA [ASP+17] also takes advantage of optical maps during assembly. While KOOTA did not perform competitively compared to the other current assemblers, it demonstrated that optical maps can be used to improve specific phases of the assembly process. In this thesis we propose a genome assembly pipeline that provides long range infor- mation through optical maps to guide the long read genome assembly of Kermit. We determined that the usage of optical maps during the assembly can produce higher quality contigs compared to a similar unguided de novo assembly. Our method is also applicable with larger genomes than bacterial ones. While the initial results suggest our pipeline can use optical maps efficiently during the assembly, there is still room for possible improvements which could present themselves after more ex- perimentation. The results and ideas for future work are discussed further in Section 5. 2 Background Before diving into our genome assembly strategy, some background information about the subject is required. First we will introduce the genome; what it is and what kind of structure it has. Next we will discuss how we can get information about the genome, and why genome assembly is needed. At the end, some genome assembly strategies utilized in our assembly pipeline are introduced. 2.1 Genome The goal of genome assembly is to figure out the structure of the genetic material of an organism. The genome contains all the information on how the organism is built and maintained on cellular level. The genome can consist of a single or multiple chromosomes depending on the complexity of the species. A chromosome is a deoxyribonucleic acid (DNA) molecule. Some viruses have their genetic infor- mation encoded in ribonucleic acid (RNA) molecules instead of DNA molecules. Often chromosomes are linear molecules, but some species can also have circular chromosomes. Regardless of the number of chromosomes or their shape, a DNA molecule is com- posed of two strands of small building blocks called nucleotides. The strands are 3 intertwined together forming the characteristic double helix structure of a DNA molecule. There exists four different nucleotides, which are differentiated by their base components; adenine, cytosine, guanine and thymine. The bases, and the cor- responding nucleotides, are simply denoted by the first letters, A, C, T and G. A strand of a DNA can thus be represented by a string of these letters. Each base of a strand is paired to another base of the other strand. Generally, there exists two kinds of pairings, adenine pairs with thymine and cytosine pairs with guanine. Because of this redundant structure, we can deduce the base sequence of a strand if the other strand is already known. Two paired strands of a chromosome are also called complementary strands. A DNA sequence length is measured in base pairs, meaning the number of base pairs in two complementary sequences. As a side note, an RNA molecule is similar to a DNA molecule, except it has only one strand, and thymine nucleotide is substituted with another one, called uracil, U. However, we are only focusing on assembling genomes with DNA in this research. 2.2 DNA sequencing Figuring out the structure of the genome can give us very useful insight into different species and even individuals. DNA sequencing, the methods for discovering the nucleotide sequences of DNA molecules, got its start over 40 years ago in the 1970s. Two main methods defined this first generation of sequencing, Sanger sequencing [SC75] and Maxam-Gilbert sequencing [MG77]. These first generation technologies were very slow, and sequencing DNA molecules to collect data took a long time. The sequenced DNA segments known as reads were also quite short, less than 1000 bp (base pairs) long. For comparison, the human genome is longer than three billion base pairs, and even a simple single chromosome bacteria like Escherichia coli has a genome that is around five million bases long.