Ultrafast and Sensitive Sequence Search and Clustering Methods in the Era of Next Generation Sequencing

TECHNISCHE UNIVERSITAT¨ MUNCHEN¨ Lehrstuhl fur¨ Bioinformatik Ultrafast and sensitive sequence search and clustering methods in the era of next generation sequencing Martin Steinegger Vollstandiger¨ Abdruck der von der Fakultat¨ fur¨ Informatik der Technischen Universitat¨ Munchen¨ zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation. Vorsitzender: Prof. Dr. Michael Georg Bader Prufer¨ der Dissertation: 1. Prof. Dr. Burkhard Rost 2. Priv.-Doz. Dr. Hanjo Taubig¨ Die Dissertation wurde am 07.05.2018 bei der Technischen Universitat¨ Munchen¨ eingere- icht und durch die Fakultat¨ fur¨ Informatik am 13.08.2018 angenommen. Abstract Environmental metagenomic studies are a rich source for unraveling genetic diversity of yet un- known and mostly unculturable microbes, with terabytes of short read sequencing data already publicly available. The growth rate of such data has been increasing steadily, driven by the expo- nential increase in throughput, which has been outperforming Moore’s law by two-fold in the last decade. To date, the main bottleneck is not the generation but rather the analysis of sequence data. To address this analysis bottleneck we propose three novel methods: (1) MMseqs, a fast and sensitive clustering method, (2) MMseqs2, a fast and sensitive homology search method, and (3) Linclust, the first sequence clustering algorithm whose runtime scales linearly with the number of sequences and independently of the number of clusters obtained. Clustering is the process of assigning sequences to distinct groups based on similarity. Thus, clustering serves as a means to discover biological connections or define families of homologous sequences. Furthermore, clustering can speed up downstream analysis considerably by reducing highly similar sequences to a single sequence. MMseqs is a sensitive all-against-all search-based clustering tool that can cluster huge protein sets with high sensitivity. Current fast state-of-the-art clustering methods lack sensitivity below 50% while MMseqs detects similarities at 30% sequence identity. We show that our method is not only more sensitive but also faster than other state-of- the-art methods. Homology search is widely used in life science research to infer the functions and structures of the novel protein sequences by identifying significant similarities to known sequences. Sensitive methods such as BLAST/PSI-BLAST are too slow to handle huge data sets. Fast methods trade sensitivity for runtime speed, leading to many missed annotations. To address this dilemma, we developed MMseqs2, a successor to MMseqs, which is as sensitive as BLAST but 35 times faster. Moreover, MMseqs2 is the first fast iterative profile search tool with even higher sensitivity than that of PSI-BLAST at 400 times the speed. MMseqs2 also supports fast reverse profile searches, which we used to annotate 1:1 billion metagenomic protein sequences in 8:3 hours on a single 28 core computer. The speed and sensitivity of MMseqs2 makes it a powerful tool to annotate protein sequences in the era of big data. State-of-the-art sequence clustering methods scale in O(NK) where N is the input set size and K is the amount of clusters. Since K is often close to N, these methods follow a nearly quadratic runtime complexity. Linclust is the first algorithm that overcomes this quadratic runtime and runs in linear time O(N). Linclust can cluster protein sets down to 50% sequence identity three to four orders of magnitude faster than other methods. We clustered 1.6 billion protein sequences from about ∼2200 metagenome and metatranscriptome assemblies down to 50% sequence identity using Linclust, which took 10 hours on a single server with 28 cores. The near quadratic run time of state-of-the-art methods would make this analysis nearly infeasible. Zusammenfassung Daten aus metagenomischen Studien sind eine reichhaltige Quelle fur¨ Informationen uber¨ genetis- che Diversitat.¨ Viele Terabyte an Sequenzdaten dieser Studien sind o¨ffentlich frei verfugbar¨ und wachsen stetig an. Der Grund fur¨ das schnelle Wachstum sind die exponentiell abnehmenden Kosten und der stetig steigende Durchsatz der Sequenzierungsverfahren. Technische Innovatio- nen in diesem Feld haben das Moore’sche Gesetz in den letzten zehn Jahren um ein Zweifaches ubertro¨ ffen. Aktuell ist nicht mehr die Erzeugung, sondern die Analyse der Sequenzdaten der großte¨ Zeit- und Kostenfaktor. In dieser Arbeit stellen wir drei neue Methoden der Analyse metagenomisher Daten vor, die schneller und somit kostengunstiger¨ sind als bisherige Methoden: (1) MMseqs, eine schnelle und sensitive Clustering-Methode, (2) MMseqs2, eine schnelle und sensitive Methode zur Suche nach homologen (verwandten) Sequenzen und (3) Linclust, den ersten Sequenz-Clustering- Algorithmus, dessen Laufzeit linear mit der Anzahl der Sequenzen und unabhangig¨ von der An- zahl der Cluster wachst.¨ Clustering teilt Sequenzen in biologisch verwandte Gruppen ein. MMseqs ist ein sensitives Clustering-Tool, dass hierfur¨ alle Sequenzen der gegebenen Menge miteinander vergleicht und diese dann auf Grundlage der ermittelten paarweisen Sequenzidentitat¨ in Gruppen einteilt. MM- seqs kann Sequenzahnlichkeiten¨ von bis zu 30% der paarweisen Sequenzidentitat¨ erkennen, wobei es bis zu 2500 Mal schneller ist, als vergleichbare Clustering-Methoden. Suche nach homologen Sequenzen ist in den Biowissenschaften ein fundamentaler Baustein zur Vorhersage der Funktion eines Proteins. Viel genutzte sensitive Such-Methoden, wie PSI-BLAST und BLAST, sind zu langsam, um viele Suchen in einer großen Datenbank durchfuhren¨ zu konnen.¨ Gleichzeitig sind andere schnelle Methoden nicht sensitiv genug und konnen¨ viele Sequenzen somit nicht zuordnen. MMseqs2 ist 35 Mal schneller als BLAST und weist eine ahnliche¨ Empfind- lichkeit der Suche auf. Die Profil-basierte Suche in MMseqs2 ist ahnlich¨ empfindlich wie PSI- BLAST und etwa 400 Mal schneller. Wir konnten damit 1,1 Milliarden metagenomischer Protein- sequenzen in 8,3 Stunden auf einem Computer mit 28 Kernen zu bekannten Sequenzen zuordnen. Ein großes Problem fur¨ das Clustering der metagenomischen Sequenzdaten stellt die Laufzeit der derzeit verfugbaren¨ Clustering-Algorithmen dar. Aktuelle Methoden skalieren mit einer Laufzeitkomplexitat¨ von O(NK), wobei N die Anzahl der analysierten Sequenzen und K die re- sultierende Anzahl der Cluster darstellt. Hiermit ergeben sich fast quadratische Laufzeiten, da K oft eine ahnliche¨ Großenordnung¨ wie N hat. Linclust ist der erste Clustering-Algorithmus, dessen Laufzeit linear in N und unabhangig¨ von K ist. Linclust kann Proteinsequenzen mit bis zu 50% Sequenzidentitat¨ einem Cluster zuweisen und ist dabei drei bis vier Großenordnun-¨ gen schneller als andere vergleichbare Methoden. Wir haben 1,6 Milliarden Proteinsequenzen von ∼2200 Metagenom- und Metatranscriptom-Assemblierungen mit Linclust in 10 Stunden auf einem einzelnen Server mit 28 Kernen auf 50% Sequenzidentitat¨ geclustert. Mit anderen Metho- den ware¨ diese Analyse aufgrund der quadratischen Laufzeit nahezu unmoglich¨ gewesen. List of Publications This work constitutes a cumulative dissertation based on peer-reviewed publications. Chap- ters 2, 3, and 4 describe the published and submitted articles included in this dissertation: • Maria Hauser#, Martin Steinegger∗#, Johannes Soding¨ ∗, MMseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics 2016. doi: 10.1093/bioinformatics/btw006 (#Equal contributions.) (∗Corresponding authors.) • Martin Steinegger, Johannes Soding,¨ MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 2017. doi: 10.1038/nbt.3988. • Martin Steinegger, Johannes Soding,¨ Clustering huge protein sequence sets in linear time. Nature Communications, 2018. accepted During the dissertation work, I was co-/corresponding author of other peer-reviewed publications: • Milot Mirdita#, Lars von den Driesch#, L., Galiez, G., Maria Martin, Johannes Soding¨ ∗, and Martin Steinegger∗ (2017) Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017, doi: 10.1093/nar/gkw1081. (#Equal contributions.) (∗Corresponding authors.) • Yannick Mahlich∗, Martin Steinegger, Burkhard Rost, Yana Bromberg∗ HFSP: High speed homology-driven function annotation of proteins. Bioinformatics, 2018. accepted (∗Corresponding authors.) Acknowledgments First and foremost, I want to thank Dr. Johannes Soding¨ for guiding me through academia and life in general. I am very thankful for all the freedom and guidance you gave me. Our scientific discussions helped me shape my own thinking and are now the basis for my future academic career. All the talented people at the Soding¨ Lab, for supporting my projects and accepting my high usage of CPU/hard disk of the compute cluster. Special thanks to Milot Mirdita for many hours of reviewing my texts, discussing ideas and helping me when I was feeling down. Not only this thesis, but even my life, would not be the same without your help. To many more discussions about crazy ideas. Thank you to Maria Hauser for the great collaboration on MMseqs. Nikos Papadopoulos for helping me to review this thesis. Susann Vorberg for presenting in my name at the CASP12 meeting. Eli Levy Karin for the great discussion about benchmarking and her offer to show me around the beautiful Tel Aviv. Clovis Galiez for the great collaboration and all the work you did for MMseqs2. Thank you

Ultrafast and Sensitive Sequence Search and Clustering Methods in the Era of Next Generation Sequencing

Parallel and Scalable Precise Clustering for Homologous Protein Discovery

Sequence-Based Microrna Clustering

Representative Based Protein Sequence Clustering

De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm

VIRMOTIF: a User-Friendly Tool for Viral Sequence Analysis

Spclust: Towards a Fast and Reliable Clustering for Potentially Divergent

De Novo Clustering of Long Reads by Gene from Transcriptomics Data

Application of Subspace Clustering in DNA Sequence Analysis

Downloaded Without User Conclusions Registration At: and Additional Informations in Supplementary Material

Scalable Clustering for Immune Repertoire Sequence Analysis

The Genexpress IMAGE Knowledge Base of the Human Brain Transcriptome: a Prototype Integrated Resource for Functional and Computational Genomics

State-Of-The-Art Bioinformatics Protein Structure Prediction Tools (Review)