Assembly and annotation Oskar Karlsson Lindsjö Hadrien Gourlé Spring 2017 License and references All material within is modified from lectures at COMWA by Jose Blanca, bioinf.comav.upv.es and modified under the terms of the Creative Commons Attribution 3.0 Unported License

This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Sequencing project

Unknown sequence

read 1 read 4

read 2 read 5 experimental evidence read 3 read 6

read 7

result Computational requirements

mykl_roventine@flickr

Main requirements: • Memory. • Time. The kmer-based assemblers requirements depend on the genome complexity and not so much on the reads. Read cleaning

Some assemblers choke with: bad quality stretches, adapters, low complexity regions, contamination.

Bad quality vector Unknown sequence

Trim Trim or mask The assembly problem

Only one read is usually not capable of producing the complete sequence. Even if the problem sequence is short one read might have sequencing errors. repeats

Unknown sequence

Reads

contig5 (repeat)

contig1 contig2 contig3 contig4 Contigs / scaffolds

Repeated region no coverage Reasons for the gaps Long reads Read size influence Mate pairs and paired-ends Read length is critical, but constrained by sequencing technology, we can sequence molecule ends. Clone (template)

reads

Useful for dealing with repetitive genomic DNA and with complex transcriptome structure.

Paired-ends: Illumina can sequence from both ends of the molecules. (150-500 bp)

Mate-pairs: Can be generated from libraries with different lengths (2-20 Kb).

BAC-ends: Usually sequenced with Sanger. Mate-pairs Library types

Single reads

Illumina Pair Ends (150-500 pb)

Mate Pairs (2-10 kb) Illumina Illumina mate-pair chimeras Long reads

Third-generation sequencing and the future of genomics. Lee et al. Short reads are harder to assemble

Taken from Jason Miller The need for paired reads

Taken from Jason Miller Algorithm I Overlap – Layout - Consensus Overlap – Layout - Consensus

Algorithm: • All-against-all, pair-wise read comparison. • Construction of an overlap graph with approximate read layout. • Multiple sequence alignment determines the precise layout and the consensus sequence.

clean reads Overlaps detection Contigs All vs all alignment

Consensus Overlap problems Two reads are similar if they have a good overlap. We assume that overlap implies common origin in the genome. This goodness depends on the overlap quality and similarity.

But several things might go wrong

Vector contamination or repetitive repetitive Too many sequencing errors

Chimeric clone Short overlap false false positive false false negative Overlap problems

False positives will be induced by chance and repeats Avoid false positives with stringent criteria: • Overlaps must be long enough • Sequence similarity must be high (identity threshold) • Overlaps must reach the ends of both reads • Ignore high-frequency overlaps (repetitive regions in genomic?)

But stringency will induce false negatives. Assembly terminology

Consensus ACTGATTAGCTGACGNNNNNNNNNNTCGTATCTAGCATT

Scaffold = Contigs + Parings (gap lengths derived from pairings)

Contigs = Reads + Overlaps

Overlaps

Reads & Pairing

Taken from Jason Miller Unitigs

Contigs: • High-confidence overlaps • Maximal contigs with no contradiction (or almost no contradiction) in the data • Ungapped • Contigs capture the unique stretches of the genome. • Usually where unitigs end, repeats begin

Contig example. The ideal Contig is A, B, C. C, D and E have a repetitive sequence. Scaffolds

Build with mate pairs • Mate pairs help resolve ambiguity in overlap patterns

Scaffolds • Every contig is a scaffold • Every scaffold contains one or more contigs • Scaffolds can contain sequence gaps of any size (including negative) Overlap-Consensus-Layout limitation

30 Million 5Kb (long) reads Number of alignments: 30 M reads x 30 M reads = 900 trillion alignments It does not scale Algorithm II kmer-based K-mer based assembly

K-mer graphs are de Bruijn graph.

Nodes represent all K-mers.

Edges represent fixed-length overlaps between consecutive K- mers.

Assembly is reduced to a graph reduction problem.

Read 1 AACCGG Read 2 CCGGTT

Graph: AACC -> ACCG -> CCGG -> CGGT -> GGTT

K-mer length = 4 K-mer based assembly K-mer bubbles K-mer repetitive Graph problems

Repeats, sequencing errors and polymorphisms increase graph complexity, leading to tangles difficult to resolve K-mer graphs are more sensitive to repeats and sequencing errors than overlap based methods. Optimal graph reductions algorithms are NP-hard, so assemblers use heuristic algorithms Polymorphism Sequencing error

Repeat K-mer analysis

21-mer distribution of sea urchin (Strongylocentrotus purpuratus) genome (900 Mbase long) unique

repetitive K-mer analysis Genome K-mer distribution Simulated reads. 50X coverage No errors

reads unique

Repetitive (duplication) K-mer analysis Genome K-mer distribution real reads. 50X coverage With errors

Real reads (with errors) noise Simulated reads (no error)

signal Final remarks N50

N50 is defined as the contig length such that using equal or longer contigs produces half the bases of the genome.

The N50 statistic is a measure of the average length of a set of sequences, with greater weight given to longer sequences.

It is widely used in genome assembly

N50: 10829 N95: 20 minimum: 100 maximum: 146,370 average: 1618.01 num. seqs.: 141503

[ 100 , 7414[ (131545): ************************************************************************ [ 7414 , 14728[ ( 6247): *** [ 14728 , 22042[ ( 2353): * [ 22042 , 29356[ ( 842): [ 29356 , 36670[ ( 324): [ 36670 , 43984[ ( 118): [ 43984 , 51298[ ( 42): [ 51298 , 58612[ ( 13): [ 58612 , 65926[ ( 10): [ 65926 , 73240[ ( 4): [ 73240 , 80554[ ( 0): [ 80554 , 87868[ ( 1): [ 87868 , 95182[ ( 0): Taken[ 95182 from , 146380[ Wikipedia ( 4): Common assembly errors

Collapsed repeats • Align reads from distinct (polymorphic) repeat copies • Fewer repeats in assembly than in genome • Fewer tandem repeat copies in assembly than genome

Missed joins • Missed overlaps due to sequencing error • Contradictory evidence from overlaps • Contradictory or insufficient evidence from pairs

Chimera • Enter repeat at copy 1 but exit repeat at copy 2 • Assembly joins unrelated sequences Ingredients for Good Assembly Coverage and read length • 20X for (corrected) PacBio, 150X for Illumina Paired reads • Read lengths long enough to place uniquely • Inserts longer than long repeats • Pair density sufficient to traverse repeat clusters • Tight insert size variance • Diversity of insert sizes Read Quality: • Sequencing errors • No vector contamination • Contamination: mitochondrial or chloroplastic The requirements depend greatly on the quality: it is not the same a draft that high quality genome. Common assemblers

Genomic

• SOAPdenovo, Illumina

• CABOG: Celera assembler, 454 and few Illumina

Staden for Sanger reads.

Transcriptomic assemblers are specialized software From scafflods to chromosomes Genetic maps Mapping platforms

Third-generation sequencing and the future of genomics. Lee et al. Bionano Bionano

1

2 Bionano Bionano Genome Annotation RNA Genes RNA Genes The most universal genes, such as tRNA and rRNA, are very conserved and thus easy to detect. Finding them first removes some areas of the genome from further consideration. One easy approach to finding common RNA genes is just looking for sequence homology with related species: a BLAST search will find most of them quite easily

Functional RNAs are characterized by secondary structure caused by base pairing within the molecule. Determining the folding pattern is a matter of testing many possibilities to find the one with the minimum free energy, which is the most stable structure. The free energy calculations are in turn based on experiments where short synthetic RNA molecules are melted

Related to this is the concept that paired regions (stems) will be conserved across species lines even if the individual bases aren’t conserved. That is, if there is an A-U pairing on one species, the same position might be occupied by a G-C in another species. This is an example of concerted evolution: a deleterious mutation at one site is cancelled by a compensating mutation at another site. Prokaryotic Genes

Gene finding in prokaryotes is relatively simple compared to eukaryotes: no introns, so all genes are in open reading frames starting at a start codon and ending at a stop codon most of the DNA is involved in coding from proteins.

•Thus, you can achieve 100% accuracy, if you don’t mind false positives, by simply listing all possible ORFs above a certain size. There is a problem in that it is not clear how many short ORFs (say less than 100 bp) are real genes.

•If you compare predicted genes with actual genes, you can classify each base according to whether it is: true positive: predicted correctly to be in a gene true negative: predicted correctly to not be in a gene Sn = TP / (TP + FN) false positive: predicted to be in a gene but actually not false negative: predicted to not be in a gene but actually is within a gene. Sp = TP / (TP + FP)

•The sensitivity (Sn) of a prediction is the fraction of bases in real genes that are predicted to be within genes. •The specificity (Sp) is the fraction of bases predicted to be in a gene that actually are. •Both of these parameters need to be optimized. General Considerations

Bacteria use ATG as their main start codon, but GTG and TTG are also fairly common, and a few others are occasionally used. Remember that start codons are also used internally: the actual start codon may not be the first one in the ORF.

•The stop codons are the same as in eukaryotes: TGA, TAA, TAG stop codons are (almost) absolute: except for a few cases of programmed frameshifts and the use of TGA for selenocysteine, the stop codon at the end of an ORF is the end of protein translation.

•Genes can overlap by a small amount. Not much, but a few codons of overlap is common enough so that you can’t just eliminate overlaps as impossible. •Cross-species homology works well for many genes. It is very unlikely that non-coding sequence will be conserved. But, a significant minority of genes (say 20%) are unique to a given species.

•Translation start signals (ribosome binding sites; Shine-Dalgarno sequences) are often found just upstream from the start codon however, some aren’t recognizable genes in operons sometimes don’t always have a separate ribosome binding site for each gene GeneMark

GeneMark uses fifth order Markov chains to examine dicodons. That is, every base is evaluated in terms of its probability given the previous 5 bases. P(a|x1x2x3x4x5), the probability that the sixth base in the sequence is a given that the bases preceding it are x1x2x3x4x5, so the final sequence is x1x2x3x4x5a. The necessary parameters are obtained by looking at pentamers (5-mers) within known genes and counting the number of times each base appears in the sixth position. This is the training set. Possible use of pseudocounts here.

•GeneMark pays attention to reading frame also. Each reading frame gets its own set of statistics. Thus there is a separate P1(a|x1x2x3x4x5), P2(a|x1x2x3x4x5), and P3(a|x1x2x3x4x5), where 1, 2, and 3 are the reading frames. Based on the position of the stop codon, each base in an ORF has a unique codon position. Non-coding regions are assumed to have the same statistics for all frames. The final probability is given as the probability that it is coding for a specified reading frame.

•A 96 base sliding window is moved across the genome, scoring all possible reading frames. Start and stop codons are not accurately predicted, especially with overlapping genes--they need to be identified separately. GLIMMER

GLIMMER also uses Markov chains, but they vary from zeroth order (i.e. GC content) to eighth order (what is the probability of a base given the previous 8 bases?). The point of this is to help get around the need for huge sets of training data while avoiding pseudocounts. Called “interpolated Markov models”

•GLIMMER selects training data from a genome sequence by picking non-overlapping long ORFs, which are almost all genes. Note that high GC-content genomes need “long” defined differently than low GC genomes, since random stop codons are rarer.

•GLIMMER builds its Markov models from the lowest order up. At each step, there must be at least 400 observations to accept the model as valid. If there are too few observations, the model is compared to each of the next order down model, using a chi-square test. If the new model isn’t significantly different from the lower order model, it is discarded. If the new model is significantly different, it is weighted based on the number of observations and the significance level.

For example, if there are less than 400 observations of x1x2x3x4x5, the P(a|x1x2x3x4x5) for each base a is tested against P(a|x2x3x4x5) probabilities.

•After all parameters are obtained, only the highest order model is used for any given subsequence. •Each ORF longer than a minimum is scored (as opposed to using a sliding window that ignores ORFs) •New versions don’t require that the “given” bases be adjacent to the base they are scored with. Eukaryotic

Some fundamental differences between prokaryotes and eukaryotes: •There is lots of non-coding DNA in eukaryotes. First step: find repeated sequences and RNA genes Note that eukaryotes have 3 main RNA polymerases. RNA polymerase 2 (pol2) transcribes all protein-coding genes, while pol1 and pol3 transcribe various RNA-only genes.

•most eukaryotic genes are split into exons and introns. •Only 1 gene per transcript in eukaryotes. •No ribosome binding sites: translation starts at the first ATG in the mRNA thus, in eukaryotic genomes, searching for the transcription start site (TSS) makes sense.

•Many fewer eukaryotic genomes have been sequenced Annotation

Once genes have been identified, we need to assign them names and functions. In well-studied genomes, such as Drosophila, there are many already-named genes, some of which are quite whimisical. They often reflect the mutant phenotype, e.g. white eyes. A mutant whose wings are held at an unusual angle: Frodo (“lowered of the wings”). But in general, gene names from genome project tend to be descriptions of function. For example, the gene for glucose 6-phosphate dehydrogenase is just called that in , but it is “Zwischenferment” (a German word) in Drosophila.

•Who is going to do the annotations? There are a lot of genes, and no one is an expert in all of them.

One approach: use amateurs who are trained to follow certain guidelines and have easy access to as much useful information as possible. Problem: inconsistent results

Another approach: have experts in specific genes annotate all examples of that gene. Problem: getting experts for all genes and keeping them interested.

Yet another: do as much automated annotation as possible, with trained personnel examining only the hard cases. Problem: identifying the hard cases. More Annotation

Need for experimental evidence. All gene identification is based on experimental work: biochemistry, genetics, etc. Most annotation is thus based on logic like “Gene X in my organism is similar to gene Y in another organism that has been experimentally determined to have such-and-such a function.” How similar is “similar”? Are there other functions that might use similar proteins?

•Gene function predictions vary in their reliability: how well does the current gene match previously discovered genes? •We need gene names that are computer-recognizable. This means using a controlled vocabulary: only certain words and punctuation is used, and standard genes are named the same way in all organisms. Gene Ontology descriptions are useful, but they are not detailed enough, and they tend to focus on human genes at the expense of bacterial genes. Enzyme Commission (E.C.) numbers are very useful or enzymes because they describe a function precisely. Otherwise, you either follow the conventions of the group you are working with or try to mimic the best BLAST hits