Out of the Sequencer and Into the Wiki As We Face New Challenges in Genome Informatics Zemin Ning1* and Stephen B Montgomery2
Ning and Montgomery Genome Biology 2010, 11:308
http://genomebiology.com/2010/11/10/308

MEETING REPORT

*Correspondence: [email protected]
1Sequencing Informatics, The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK
Full list of author information is available at the end of the article

© 2010 BioMed Central Ltd

Abstract

A report on the joint Cold Spring Harbor Laboratory/Wellcome Trust Conference ‘Genome Informatics’, 15-19 September 2010, Hinxton, Cambridge, UK.

Next-generation sequencing (NGS) analysis, open-source software, cloud computing and wiki-style genomics were among the hot topics and discussions at the recent Genome Informatics meeting at the Wellcome Trust Genome Campus, Cambridge, UK. Here we summarize some highlights of the meeting.

Accuracy of polymorphism detection

Comparison of related genomes can generate a wealth of knowledge about genome evolution and function. Recent advances in NGS technologies have greatly increased the scale and scope with which we can interrogate novel genomes and uncover genetic variation. However, variation detection and statistical analysis suffer from false-positive errors for various reasons, notably incompleteness of reference genomes, read mapping errors or limitations, and sequencing-induced features. Benjamin Dickins (Penn State University, University Park, USA) discussed an approach to estimate polymorphism accuracy from NGS data by deeply sequencing a small plasmid genome and comparing it with Sanger sequencing.

Elliott Margulies (National Human Genome Research Institute, Bethesda, USA) gave an enticing presentation on this topic entitled ‘Analysis of identical twins’ genomes reveals sources of false-positive variation detection’. With 55X and 50X depth of read coverage from each twin’s sample, they initially identified 83,538 discordant genotype calls across 97.6% of the human reference genome. Through inspection of a random set of discordantly genotyped positions, he revealed that a majority occurred in regions with poorly aligned reads. Margulies noted that he would be highly suspicious of genotype calls in regions with high coverage but with low mapping scores. When these events were filtered out, the number dropped to 13,140, a reduction of 84%. By then further introducing other filtering mechanisms, such as incorrect alignments of short reads across indels, Q20 (99% confidence) evidence in the other twin and 10% allele frequency, Margulies’ final number of discordant genotype calls was only in the range of 500 to 1,000.

Margulies’ suspicion was certainly shared by Richard Durbin (Wellcome Trust Sanger Institute, Cambridge, UK), who told the audience, “If someone tells me that his accuracy on variant detection is 99%, I should be cautious”. Durbin reported that in the pilot phase of the 1000 Genomes Project, the low-coverage sequencing strategy (2X to 4X) worked well and can efficiently find shared sequence variation with good accuracy. The consortium has now started sequencing 2,500 people in five different ethnic groups. Furthermore, Durbin highlighted the recently launched UK 10K project, which, in collaboration with clinical investigators, will involve sequencing 4,000 cohort samples with rich phenotypes. With massive sample collections under way, further advances in informatics pipelines to reduce errors in variation detection will be indispensable.

Genome assembly and read alignment

NGS data provide challenging but exciting prospects for de novo assembly. Several presentations focused on the development and evaluation of genome assemblers using such data. David Jaffe (Broad Institute, Cambridge, USA) outlined a practical and general laboratory/computational method for generating high-quality de novo genome assemblies. Three types of data were produced with the Illumina platform: 180 bp short-insert reads; 3,000 bp mate-pair reads; and 40,000 bp fosmid ends. The short-insert data were mainly intended to ensure contig base accuracy, whereas the longer jumping fragments provide potential for long-range connectivity. Jaffe generated data for mouse C57BL/6 and human Yoruban NA18507 and then compared the short-read assembler, ALLPATHS, with the reference assembly that used the capillary long-read assembler, Arachne. From the comparison of the results, the quality of short-read assemblies was found to be close to that of capillary assemblies, in terms of both base accuracy and contig connectivity.

Transcriptome data also offer great value and unique challenges in helping to identify and assemble gene sets for multiple species. Daniel Zerbino (University of California, Santa Cruz, USA) presented a new methodology, Oasis, which is built in with the popular short-read assembler Velvet. This new program takes in a preliminary assembly produced by Velvet and exploits the read sequence and pairing information to produce transcript isoforms. When possible, it also detects and reports standard alternative splicing events. It is specifically designed to get around the issues of unequal expression levels and alternative splicing breakpoints.

Short-read alignment saw further new developments at the meeting. New tools include: SMALT by Hannes Ponstingl (Wellcome Trust Sanger Institute, Cambridge, UK); NextGenMap by Fritz Sedlazeck (Max F Perutz Laboratories, Vienna, Austria); and YASRAT by Masahiro Kasahara (The University of Tokyo, Kashiwa, Japan). One distinguishing feature of these new alignment tools is that they all try to address the cases in which there is a higher level (over 2%) of base-pair errors in the raw reads.

Databases and data visualization

As sequence availability has increased, data access, representation, analysis and visualization pose significant challenges for the community. Enis Afgan (Emory University, Atlanta, USA) has developed a solution that allows experimentalists to perform large-scale analysis using cloud-computing resources (http://usegalaxy.org/cloud). Because the solution is built on the open-source Galaxy framework, analyses using this service are accessible, transparent and reproducible. Furthermore, popular tools and workflows for sequence analysis from various types of experiment are already built in and ready to run. Marc Fiume (University of Toronto, Canada) presented the Sequence Annotation Visualization and ANalysis Tool (Savant), a fast and interactive genome browser that can display sequence, read alignment, single nucleotide polymorphism and other genomic datasets stored in standard file formats. Harminder Sehra (European Bioinformatics Institute, Hinxton, UK) described how the well established UniProt Knowledgebase (UniProtKB) has developed from manual to automatic annotation. Marcin Piechota (Institute of Pharmacology, Krakow, Poland) described [...] psychoactive drugs on transcriptional alterations of about 20,000 genes in the mouse brain. And Maximillian Hauessler (University of Manchester, UK) presented text2genome, a method for annotating the genome using sequences that are reported in scientific publications.

Although NGS has required the development and improvement of many novel visualization tools, Martin Krzywinski (British Columbia Cancer Agency, Vancouver, Canada), the author of the Circos tool, took us back to basics by redefining network diagrams. In his compelling talk he introduced a technique for turning network diagrams from incomprehensible ‘hairballs’ into information-rich schematics.

New insights from sequencing repetitive elements

Repetitive element sequencing is revealing some interesting new biology. Karen Hayden (Duke University, Durham, USA) demonstrated unique patterns of repeats in human centromeres, which could facilitate completion and understanding of specific biology in these hard-to-assemble regions. Kateryna Makova (Penn State University) highlighted the lifecycle of microsatellites and their potential functional impact near genes. And Mark Batzer (Louisiana State University, Baton Rouge, USA) showed from 37 orangutans clear differences in retrotransposon polymorphism, which could support the separation of the Sumatran from the Bornean orangutan.

Repeats are not always a good thing. There were also some presentations of genome analysis on large plant genomes. The highly repetitive wheat genome, with a size of 17 Gb, poses significant challenges for the current sequencing platforms, in physical mapping as well as de novo assembly. Frederic Choulet (Institut National de la Recherche Agronomique, Clermont-Ferrand, France) described the effort to sequence and analyze wheat chromosome 3B. Contiguous bacterial artificial chromosome pools were sequenced by combining Roche 454 and Illumina platform sequencing. Shiran Pasternak (Cold Spring Harbor Laboratory, Cold Spring Harbor, USA) described the current progress on assembling the wheat D-genome, and Rachel Brenchley (University of Liverpool, UK) highlighted the advantages of comparative genome analysis in distinguishing genic loci from the abundance of repetitive content in the wheat genome.

RNA-seq and epigenomics

There was one dedicated session on RNA sequencing (RNA-Seq) and its application to genome annotation, in which various issues were discussed. Specifically, identifying the underlying biases in the RNA-Seq presents a significant analytical