Composition-Based Classification of Short Metagenomic Sequences
Total Page:16
File Type:pdf, Size:1020Kb
Published online 31 August 2012 Nucleic Acids Research, 2013, Vol. 41, No. 1 e3 doi:10.1093/nar/gks828 Composition-based classification of short metagenomic sequences elucidates the landscapes of taxonomic and functional enrichment of microorganisms Jiemeng Liu1,2,3, Haifeng Wang1,4, Hongxing Yang1,4, Yizhe Zhang5, Jinfeng Wang6, Fangqing Zhao6,* and Ji Qi1,4,* 1State Key Laboratory of Genetic Engineering, 2State Key Laboratory of Surface Physics, 3The T-Life Research Center, 4Institute of Plant Biology, School of Life Sciences, Fudan University, 5School of Life Sciences, Shanghai Jiaotong University, Shanghai 200433, 6Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, People’s Republic of China Received March 20, 2012; Revised July 27, 2012; Accepted August 9, 2012 ABSTRACT environmental samples. The application of NGS technologies on studies of microorganisms in human gut Compared with traditional algorithms for long meta- (6–9), oral cavity and skin, has greatly revealed the genomic sequence classification, characterizing relationship of microorganisms to human disease, like microorganisms’ taxonomic and functional abun- adiposity, high blood pressure and dental cavity. NGS dance based on tens of millions of very short technologies also benefit the studies of bacterial reads are much more challenging. We describe an communities in soil (1,2), deep ocean (3,4) and even efficient composition and phylogeny-based algo- ancient specimens (5). The fast expansion of NGS-based rithm [Metagenome Composition Vector (MetaCV)] metagenomics calls for new bioinformatic algorithms to classify very short metagenomic reads which can handle the vast amount of sequence data (75–100 bp) into specific taxonomic and functional more efficiently and effectively. Yet, the biggest concern groups. We applied MetaCV to the Meta-HIT data for bioinformaticians is not only on the amount of the sequencing data but also on their read length. As the (371-Gb 75-bp reads of 109 human gut metage- most popular NGS platforms, Illumina and Roche/454 nomes), and this single-read-based, instead of output sequences as short as 75–400 bp, which bring assembly-based, classification has a high resolution more challenges and difficulties to environmental to characterize the composition and structure of biologists. human gut microbiota, especially for low abundance There are currently two types of approaches to deal species. Most strikingly, it only took MetaCV 10 days with metagenomic sequences. One is alignment based, to do all the computation work on a server with five like Megan (10), which compares short reads against 24-core nodes. To our knowledge, MetaCV, bene- coding sequences in public databases of coding genes fited from the strategy of composition comparison, using BlastX and then assigns them to their latest is the first algorithm that can classify millions of very common ancestors (LCA) of targeted organisms. BlastX- short reads within affordable time. based comparisons are commonly adopted to obtain fairly similar alignment between query sequences and reference proteins. Although many efforts have been made to improve the alignment speed (11,12), this process is still INTRODUCTION computationally expensive when working on tens of Recent advances in next-generation sequencing (NGS) millions of very short reads. A possible solution is to technologies have opened a new era in the field of meta- first assemble the short reads into long contigs, and all genomics (1–5), by providing much higher throughput and subsequent analyses are based on these assembled lower cost to sequence DNA directly taken from contigs. Yet, almost all currently available assembly *To whom correspondence should be addressed. Tel: +86 21 5566 5635; Fax: +86 21 5566 4187; Email:[email protected] Correspondence may also be addressed to Fangqing Zhao. Tel/Fax: +86 10 6486 9325; Email: [email protected] The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. ß The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. e3 Nucleic Acids Research, 2013, Vol. 41, No. 1 PAGE 2 OF 10 algorithms are designed for single genomes and require an among far related species. Given a protein sequence with even coverage distribution across chromosome. Applying length L, we count the number of appearances of strings them to assemble mass sequences from multiple species of a fixed length K in the sequence, and there are totally can result in heterozygous and chimeric contigs contri- N ¼ 20K possible types of such strings. The collection of buted by closely related species/strains and repetitive such frequencies is also known as ‘term frequency’ (TF) in elements. Most importantly, assembly-based approaches the field of text mining. We also adopt ‘inverse document may fail to assemble a certain amount of data, especially frequency’ (IDF) to weight different types of strings by for those low abundance species. different signal strength for inferring composition The other type of approaches, i.e. Phymm (13), TETRA similarities between gene pairs. Given a k-string and (14) and PhyloPythia (15) and CD-hit (16), uses informa- an organism tree fgd of the protein set, IDF of the string jjfgd tion of oligonucleotide/oligopeptide composition for is defined as IDFð Þ¼log jjfgd: 2d , while jjfgd represents microorganism identification, because different organisms the total number of nodes in the organism tree and tend to have different composition patterns (17), e.g. GC jjfgd : 2 d stands for the number of nodes including or- content, codon usage or recognition sites of restriction ganisms containing this string and internal nodes on the endonuclease, which could be used to recognize taxo- paths to their LCA. nomic source of sequenced environmental reads. Composition-based strategy greatly facilitates the fast (2) Building of a k-string composition database for classification of microbial communities, compared with reference genes the BlastX-based algorithms. However, most of these al- MetaCV calculates TF–IDF values for all possible strings gorithms (14,15) aim to work on sequence >1 kb, which as components to form a composition vector for each require a pre-assembly of short reads into contigs and thus gene. A reference database contains a collection of all meet the same problem from misassembly. Furthermore, gene vectors and a correlation matrix for each pair of due to the complexity of microbial communities and genes. The correlation CðA, BÞ between any two sequences limited number of sequenced genomes, distant related or- A and B is calculated as the cosine similarity of the two ganisms are less likely to be detected by nucleotide-based representative vectors in the N-dimensional composition algorithms comparing with those by BlastX. space as described in (18). MetaCV calculates correlations In this study, we present a new composition-based for any pair of proteins with at least three 6-mers shared approach, Metagenome Composition Vector (MetaCV), by them, and the final correlation matrix includes 3 Â 109 which directly works on raw short sequencing reads. values, which used 12 Gb space by using strategy of com- Unlike any other composition-based methods, MetaCV pressed sparse row. In this analysis, which contains first translates a nucleotide sequence into six-frame 5 million proteins from 1691 organisms, a memory of peptides, and these translated six-frame peptides are 26 Gb (almost identical to the sum of db files) is further decomposed to k-strings (oligopeptides of fixed required for downstream read binning. length K), which are weighted and selected for further taxonomical classification based on their frequency in a (3) Comparing query reads with reference genes pre-built reference protein database. Benefiting from this Query nucleotide sequences are translated into coding se- new composition-based strategy, MetaCV performs nearly quences by considering six-frame translations. Here as good as BlastX (even better at the Genus level), but MetaCV adopts an open reading frame (ORF) searching significantly reduces the computing time (300 Â faster) strategy similar to OrfPredictor (19), which is designed for on a huge amount of metagenomic sequences. More im- expressed sequence tags (ESTs). Each frame is evaluated portantly, besides focusing on taxonomic classification, individually to give complete/partial ORF candidates in MetaCV can also functionally annotate those cases of having/lacking of start/stop codons. These candi- unassembled short reads, to address the question ‘what dates are then compared with all proteins in the database, they are doing’ after investigating ‘who they are’, and and only the one who has the optimal correlation score, is allows researchers to study the metabolic activities in mi- considered as the correct translation and kept for further crobial communities. analysis. Users could also use a third party gene finders, i.e. MetaGene (20) to translate short reads into amino acid MATERIALS AND METHODS sequences as an alternative input for MetaCV. For paired-end reads which are resulted from DNA libraries The nature of this algorithm is to classify short reads to with short insert size, both two ends of a given