Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities
Recommended publications
-
A Clone-Array Pooled Shotgun Strategy for Sequencing Large Genomes
Downloaded from genome.cshlp.org on September 24, 2021 - Published by Cold Spring Harbor Laboratory Press Perspective A Clone-Array Pooled Shotgun Strategy for Sequencing Large Genomes Wei-Wen Cai,1,2 Rui Chen,1,2 Richard A. Gibbs,1,2,5 and Allan Bradley1,3,4 1Department of Molecular and Human Genetics, 2Human Genome Sequencing Center, and 3Howard Hughes Medical Institute, Baylor College of Medicine, Houston, Texas 77030, USA A simplified strategy for sequencing large genomes is proposed. Clone-Array Pooled Shotgun Sequencing (CAPSS) is based on pooling rows and columns of arrayed genomic clones, for shotgun library construction. Random sequences are accumulated, and the data are processed by sequential comparison of rows and columns to assemble the sequence of clones at points of intersection. Compared with either a clone-by-clone approach or whole-genome shotgun sequencing, CAPSS requires relatively few library constructions and only minimal computational power for a complete genome assembly. The strategy is suitable for sequencing large genomes for which there are no sequence-ready maps, but for which relatively high resolution STS maps and highly redundant BAC libraries are available. It is immediately applicable to the sequencing of mouse, rat, zebrafish, and other important genomes, and can be managed in a cooperative fashion to take advantage of a distributed international DNA sequencing capacity. Advances in DNA sequencing technology in recent years have greatly increased the throughput and reduced the cost of genome sequencing. Sequencing of a complex genome the size … Drosophila genome, and the computational requirements to perform the necessary pairwise comparisons increase approximately as a square of the size of the genome (see Appendix). -
Probabilities and Statistics in Shotgun Sequencing Shotgun Sequencing
Probabilities and Statistics in Shotgun Sequencing Shotgun Sequencing. Before any analysis of a DNA sequence can take place it is first necessary to determine the actual sequence itself, at least as accurately as is reasonably possible. Unfortunately, technical considerations make it impossible to sequence very long pieces of DNA all at once. Current sequencing technologies allow accurate reading of no more than 500 to 800 bp of contiguous DNA sequence. This means that the sequence of an entire genome must be assembled from collections of comparatively short subsequences. This process is called DNA sequence “assembly”. One approach to sequence assembly is to produce the sequence of a DNA segment (called a “contig”, or perhaps a genome) from a large number of randomly chosen sequence reads (many overlapping small pieces, each on the order of 500-800 bases). One difficulty of this process is that the locations of the fragments within the genome and with respect to each other are not generally known. However, if enough fragments are sequenced so that there will be many overlaps between them, the fragments can be matched up and assembled. This method is called “shotgun sequencing.” Shotgun sequencing approaches, including the whole-genome shotgun approach, are currently a central part of all genome-sequencing efforts. These methods require a high level of automation in sample preparation and analysis and are heavily reliant on the power of modern computers. There is an interplay between substrates to be sequenced (genomes and their representation in clone libraries), the analytical tools for generating a DNA sequence, the sequencing strategies, and the computational methods. -
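The idea that overlapping fragments "can be matched up and assembled" can be illustrated with a toy greedy merge. This is only a minimal sketch under stated assumptions: the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example reads are all hypothetical, reads are assumed error-free and from one strand, and real assemblers use far more sophisticated methods. The all-against-all comparison in the inner loops is also why the cost of naive pairwise matching grows roughly with the square of the number of reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    `min_len` is assumed to be >= 1.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap.

    Toy illustration only: assumes error-free reads from a single
    strand and ignores repeats. Stops when no pair overlaps by at
    least `min_len` bases and returns the remaining contigs.
    """
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        # All-against-all comparison: quadratic in the number of reads.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps; contigs stay separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads sampled from the made-up sequence ATGGCGTACCTGA.
example_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(example_reads)
print(contigs)
```

With these three made-up reads the merge should recover a single contig; reads that share no sufficiently long overlap would simply remain as separate contigs.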
BIOGRAPHICAL SKETCH NAME: Berger
BIOGRAPHICAL SKETCH NAME: Berger, Bonnie eRA COMMONS USER NAME (credential, e.g., agency login): BABERGER POSITION TITLE: Simons Professor of Mathematics and Professor of Electrical Engineering and Computer Science EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, include postdoctoral training and residency training if applicable. Add/delete rows as necessary.) EDUCATION/TRAINING (institution and location; degree, if applicable; completion date MM/YYYY; field of study): Brandeis University, Waltham, MA; AB; 06/1983; Computer Science. Massachusetts Institute of Technology; SM; 01/1986; Computer Science. Massachusetts Institute of Technology; Ph.D.; 06/1990; Computer Science. Massachusetts Institute of Technology; Postdoc; 06/1992; Applied Mathematics. A. Personal Statement Advances in modern biology revolve around automated data collection and sharing of the large resulting datasets. I am considered a pioneer in the area of bringing computer algorithms to the study of biological data, and a founder in this community that I have witnessed grow so profoundly over the last 26 years. I have made major contributions to many areas of computational biology and biomedicine, largely, though not exclusively, through algorithmic innovations, as demonstrated by nearly twenty thousand citations to my scientific papers and widely-used software. In recognition of my success, I have just been elected to the National Academy of Sciences and in 2019 received the ISCB Senior Scientist Award, the pinnacle award in computational biology. My research group works on diverse challenges, including Computational Genomics, High-throughput Technology Analysis and Design, Biological Networks, Structural Bioinformatics, Population Genetics and Biomedical Privacy. I spearheaded research on analyzing large and complex biological data sets through topological and machine learning approaches; e.g. -
Metagenomics Analysis of Microbiota by Next Generation Shotgun Sequencing
THE SWISS DNA COMPANY Application Note · Next Generation Sequencing Metagenomics Analysis of Microbiota by Next Generation Shotgun Sequencing Understand the genetic potential of your community samples Provides you with hypothesis-free taxonomic analysis Introduction Microbiome studies are often based on the sequencing of specific marker genes as for instance the prokaryotic 16S rRNA gene. Such amplicon-based approaches are well established and widely used. However, they have limitations such as the dependency on a single gene to analyze a whole community, the introduction of PCR bias and the restriction to describe only the taxonomic composition and diversity. Shotgun metagenomics is a cutting edge technique for microbiome analysis overcoming said limitations. Whole genomic DNA of a sample is isolated, fragmented and finally sequenced. This allows a detailed analysis of the taxonomic and functional composition of a microbial community. Microsynth Competences and Services Microsynth offers a full shotgun metagenomics service for taxonomic and functional profiling of clinical, environmental or engineered microbiomes. The service covers the entire process from experimental design, DNA isolation, tailored sequencing to detailed and customized bioinformatics analysis. … to your project requirements, thus providing you with just the right amount of data to answer your questions. Bioinformatics: Taxonomic and functional analysis of metagenomic datasets is challenging. Alignment, binning and annotation of the large amounts of sequencing reads require expertise … requirements to guarantee scientifically reliable results for your project. After quality processing the reads are aligned against a protein reference database (e.g. NCBI nr). Taxonomic and functional binning and annotation are performed. The analysis is not restricted to prokaryotes but also includes eukaryotes and … -
Lior Pachter Genome Informatics 2013 Keynote
Stories from the Supplement Lior Pachter, Department of Mathematics and Molecular & Cell Biology, UC Berkeley. November 1, 2013. Genome Informatics, CSHL The Cufflinks supplements • C. Trapnell, B.A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M.J. van Baren, S.L. Salzberg, B.J. Wold and L. Pachter, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nature Biotechnology 28, (2010), p 511–515. Supplementary Material: 42 pages. • C. Trapnell, D.G. Hendrickson, M. Sauvageau, L. Goff, J.L. Rinn and L. Pachter, Differential analysis of gene regulation at transcript resolution with RNA-seq, Nature Biotechnology, 31 (2012), p 46–53. Supplementary Material: 70 pages. A supplementary arithmetic progression? • A. Roberts and L. Pachter, Streaming algorithms for fragment assignment, Nature Methods 10 (2013), p 71–73. Supplementary Material: 98 pages? The nature methods manuscript checklist Emperor Joseph II: My dear young man, don't take it too hard. Your work is ingenious. It's quality work. And there are simply too many notes, that's all. -
Modeling and Analysis of RNA-Seq Data: a Review from a Statistical Perspective
Modeling and analysis of RNA-seq data: a review from a statistical perspective Wei Vivian Li 1 and Jingyi Jessica Li 1,2,* arXiv:1804.06050v3 [q-bio.GN] 1 May 2018 Abstract Background: Since the invention of next-generation RNA sequencing (RNA-seq) technologies, they have become a powerful tool to study the presence and quantity of RNA molecules in biological samples and have revolutionized transcriptomic studies. The analysis of RNA-seq data at four different levels (samples, genes, transcripts, and exons) involves multiple statistical and computational questions, some of which remain challenging to date. Results: We review RNA-seq analysis tools at the sample, gene, transcript, and exon levels from a statistical perspective. We also highlight the biological and statistical questions of most practical considerations. Conclusion: The development of statistical and computational methods for analyzing RNA-seq data has made significant advances in the past decade. However, methods developed to answer the same biological question often rely on diverse statistical models and exhibit different performance under different scenarios. This review discusses and compares multiple commonly used statistical models regarding their assumptions, in the hope of helping users select appropriate methods as needed, as well as assisting developers for future method development. 1 Introduction RNA sequencing (RNA-seq) uses the next generation sequencing (NGS) technologies to reveal the presence and quantity of RNA molecules in biological samples. Since its invention, RNA-seq has revolutionized transcriptome analysis in biological research. RNA-seq does not require any prior knowledge on RNA sequences, and its high-throughput manner allows for genome-wide profiling of transcriptome landscapes [1,2]. -
Randomness Versus Order
MILESTONES DOI: 10.1038/nrg2245 Milestone 10 Randomness versus order Whereas randomness is avoided in most experimental techniques, it is fundamental to sequencing approaches. In the race to sequence the human genome, research groups had to choose between the random whole-genome shotgun sequencing approach or the more ordered map-based sequencing approach. When Frederick Sanger and colleagues sequenced the 48-kb bacteriophage λ genome in 1982, the community was still undecided as to whether directed or random sequencing strategies were better. With directed strategies, DNA sequences were broken down into ordered and overlapping fragments to build a map of the genome, and these fragments were then cloned and sequenced. … used, in their opinion mistakenly, direct sequencing strategies to finish the last 10% of the bacteriophage λ sequence. In 1991, Al Edwards and Thomas Caskey proposed a method to maximize efficiency by minimizing gap formation and redundancy: sequence both ends (but not the middle) of a long clone, rather than the entirety of a short clone. Although the shotgun approach was now accepted for sequencing short stretches of DNA, map-based techniques were still considered necessary for large genomes. Like the directed strategies, map-based sequencing subdivided the genome into ordered 40-kb fragments, which were then sequenced using the … The H. influenzae genome, however, was a mere DNA fragment compared with the 1,500-fold longer ~3 billion base-pair human genome. In 1996, Craig Venter and colleagues proposed that the whole-genome shotgun approach could be used to sequence the human genome owing to two factors: its past successes in assembling genomes and the development of bacterial artificial chromosomes (BAC) libraries, which allowed large fragments of DNA to be cloned. A showdown ensued, with the biotechnology firm Celera Genomics wielding whole-genome shotgun sequencing and the International Human Genome Sequencing Consortium wielding map-based … -
Shotgun Metagenomics, from Sampling to Sequencing and Analysis Christopher Quince1,^, Alan W
Shotgun metagenomics, from sampling to sequencing and analysis Christopher Quince1,^, Alan W. Walker2,^, Jared T. Simpson3,4, Nicholas J. Loman5, Nicola Segata6,* 1 Warwick Medical School, University of Warwick, Warwick, UK. 2 Microbiology Group, The Rowett Institute, University of Aberdeen, Aberdeen, UK. 3 Ontario Institute for Cancer Research, Toronto, Canada 4 Department of Computer Science, University of Toronto, Toronto, Canada. 5 Institute for Microbiology and Infection, University of Birmingham, Birmingham, UK. 6 Centre for Integrative Biology, University of Trento, Trento, Italy. ^ These authors contributed equally * Corresponding author: Nicola Segata Diverse microbial communities of bacteria, archaea, viruses and single-celled eukaryotes have crucial roles in the environment and human health. However, microbes are frequently difficult to culture in the laboratory, which can confound cataloging members and understanding how communities function. Cheap, high-throughput sequencing technologies and a suite of computational pipelines have been combined into shotgun metagenomics methods that have transformed microbiology. Still, computational approaches to overcome challenges that affect both assembly-based and mapping-based metagenomic profiling, particularly of high-complexity samples, or environments containing organisms with limited similarity to sequenced genomes, are needed. Understanding the functions and characterizing specific strains of these communities offer biotechnological promise in therapeutic discovery, or innovative ways to synthesize products using microbial factories, but can also pinpoint the contributions of microorganisms to planetary, animal and human health. Introduction High throughput sequencing approaches enable genomic analyses of ideally all microbes in a sample, not just those that are more amenable to cultivation. 
One such method, shotgun metagenomics, is the untargeted (“shotgun”) sequencing of all (“meta”) of the microbial genomes (“genomics”) present in a sample. -
Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction
Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press Letter Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction Serafim Batzoglou,1,4,7 Lior Pachter,2,7 Jill P. Mesirov,3 Bonnie Berger,1,4,6 and Eric S. Lander3,5,6 1Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139 USA; 2Department of Mathematics, University of California Berkeley, Berkeley, California 94720 USA; 3Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142 USA; 4Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139 USA; 5Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139 USA We describe a novel analytical approach to gene recognition based on cross-species comparison. We first undertook a comparison of orthologous genomic loci from human and mouse, studying the extent of similarity in the number, size and sequence of exons and introns. We then developed an approach for recognizing genes within such orthologous regions by first aligning the regions using an iterative global alignment system and then identifying genes based on conservation of exonic features at aligned positions in both species. The alignment and gene recognition are performed by new programs called GLASS and ROSETTA, respectively. ROSETTA performed well at exact identification of coding exons in 117 orthologous pairs tested. A fundamental task in analyzing genomes is to identify the genes. This is relatively straightforward for organisms with compact genomes (such as bacteria, yeast, flies and worms) because exons tend to be large and the introns are either non-existent or tend to be short. … by comparison of syntenic human and mouse genomic sequences. It is well known that cross-species sequence comparison can help highlight important functional ele- … -
Eric S. Lander
Chancellor's Distinguished Fellows Program 2004-2005 Selective Bibliography UC Irvine Libraries Eric S. Lander February 25, 2005 Prepared by: John E. Sisson Biological Sciences Librarian Journal Articles (Selected from over 220 articles he has authored or co-authored) Poirier, C., Y. J. Qin, C.P. Adams, Y. Anaya, J. B. Singer, A. E. Hill, Eric S. Lander, J.H. Nadeau, and C. E. Bishop. "A complex interaction of imprinted and maternal-effect genes modifies sex determination in odd sex (Ods) mice." Genetics 168, no. 3 (2004): 1557-1562. Skuse, D. H., S. Purcell, M. J. Daly, R. J. Dolan, J. S. Morris, K. Lawrence, Eric S. Lander, and P. Sklar. "What can studies on Turner syndrome tell us about the role of X-linked genes in social cognition?." American Journal of Medical Genetics Part B-Neuropsychiatric Genetics 130B, no. 1 (2004): 8-9. Rioux, J. D., H. Karinen, K. Kocher, S. G. McMahon, P. Karkkainen, E. Janatuinen, M. Heikkinen, R. Julkunen, J. Pihlajamaki, A. Naukkarinen, V. M. Kosma, M. J. Daly, Eric S. Lander, and M. Laakso. "Genomewide search and association studies in a Finnish celiac disease population: Identification of a novel locus and replication of the HLA and CTLA4 loci." American Journal of Medical Genetics Part A 130A, no. 4 (2004): 345-350. Michalkiewicz, M., T. Michalkiewicz, R. A. Ettinger, E. A. Rutledge, J. M. Fuller, D. H. Moralejo, B. Van Yserloo, A. J. MacMurray, A. E. Kwitek, H. J. Jacob, Eric S. Lander, and A. Lernmark. "Transgenic rescue demonstrates involvement of the Ian5 gene in T cell development in the rat." Physiological Genomics 19, no. -
Random Shotgun Fire
Downloaded from genome.cshlp.org on September 28, 2021 - Published by Cold Spring Harbor Laboratory Press EDITORIAL Random Shotgun Fire Craig Venter’s and Perkin-Elmer’s May 9th announcement of a new joint venture to complete the sequence of the human genome in just 3 years set off a furor among the scientific community. The uproar, however, was unsurprising given that the earliest newspaper articles presented the plan as if it were a fait accompli and accused the publicly funded Human Genome Project of being a “waste” of money. The announcement, made just prior to the Genome Mapping, Sequencing, and Biology Meeting, held May 13–17 at Cold Spring Harbor Laboratory, was discussed, at least briefly, at the sequencing center director’s meeting that preceded the CSHL meeting, in a very well-attended session during the CSHL meeting, and in numerous speculative debates over meals and beers among attending scientists. … lengths. The sequencing machine will be accompanied by an automated workstation that does colony picking, extraction, PCR, and sequencing reactions. The company is committed to publicly releasing data on a quarterly basis on contigs greater than 2 kb, though the exact details for this release are still under discussion. The general business plan of the company includes the intention to patent between 100 and 300 interesting gene systems. Additionally, they plan to position themselves as a supplier of a sequence database and analysis tools. Finally, they intend to exploit the discovered single nucleotide polymorphisms (SNPs), probably through a new genotyping service. With their approach, they claim they can sequence the human genome in just 3 years at a cost of roughly $200 … -
Whole Genome Shotgun Sequencing
Whole Genome Shotgun Sequencing Before we begin, some thought experiments: Thought experiment: How big are genomes of phages, Bacteria, Archaea, Eukarya?1 Thought experiment: Suppose that you want to sequence the genome of a bacterium that is 1,000,000 bp and you are using a sequencing technology that reads 1,000 bp at a time. What is the least number of reads you could use to sequence that genome?2 Thought experiment: The human genome was “finished” in 2000 (to put that in perspective, President Bill Clinton and Prime Minister Tony Blair made the announcement!). How many gaps remain in the human genome?3 A long time ago, when sequencing was expensive, these questions kept people awake. For example, Lander and Waterman published a paper describing the number of clones that need to be mapped (sequenced) to achieve representative coverage of the genome. Part of this theoretical paper discusses how many clones would be needed to cover the whole genome. In those days, the clones were broken down into smaller fragments, and so on and so on, and then the fewest possible fragments were sequenced. Because the order of those clones was known (from genetics and restriction mapping), it was easy to put them back together. In 1995, a breakthrough paper was published in Science in which the whole genome was just randomly sheared, lots and lots of fragments were sequenced, and then big computers (big for the time; your cell phone is probably computationally more powerful!) were used to assemble the genome. This breakthrough really unleashed the genomics era, and opened the door for genome sequencing including the data that we are going to discuss here! As we discussed in the databases class, the NCBI GenBank database is a central repository for all the microbial genomes.
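The read-count thought experiment above can be worked through numerically. The sketch below is an illustration only, not taken from the course text: the function names are hypothetical, and the two expectation formulas are the standard idealized Lander-Waterman approximations (reads placed uniformly at random, and the minimum overlap needed to detect a join is ignored). With coverage c = N·L/G for N reads of length L on a genome of length G, the expected fraction of the genome covered is about 1 − e^(−c), and the expected number of contigs is about N·e^(−c).

```python
import math


def min_reads(genome_len, read_len):
    """Fewest reads that could tile the genome end to end (no overlaps)."""
    return math.ceil(genome_len / read_len)


def coverage(n_reads, genome_len, read_len):
    """Average depth of coverage c = N * L / G."""
    return n_reads * read_len / genome_len


def expected_fraction_covered(n_reads, genome_len, read_len):
    """Idealized expectation: fraction of bases hit by at least one read."""
    return 1.0 - math.exp(-coverage(n_reads, genome_len, read_len))


def expected_contigs(n_reads, genome_len, read_len):
    """Idealized expectation, ignoring any minimum detectable overlap."""
    c = coverage(n_reads, genome_len, read_len)
    return n_reads * math.exp(-c)


G, L = 1_000_000, 1_000  # genome and read length from the thought experiment
print(min_reads(G, L))   # → 1000

# Random shotgun reads need far more than the perfectly tiled minimum.
for n in (1_000, 4_000, 8_000):
    print(n,
          round(coverage(n, G, L), 1),
          round(expected_fraction_covered(n, G, L), 4),
          round(expected_contigs(n, G, L), 1))
```

The perfectly tiled minimum assumes each read starts exactly where the previous one ends, which random shearing cannot guarantee; under the idealized model, sequencing only that minimum number of random reads would be expected to leave roughly a third of the genome uncovered.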