Ensembl Tools

EBI is an Outstation of the European Molecular Biology Laboratory.

Questions?
• We’ve muted all the mics
• Ask questions in the Chat box in the webinar interface
• I will check the Chat box periodically for questions
• There’s no threading so please respond with @name

Objectives
• What is Ensembl?
• What tools are available in Ensembl?
• How to use the online tools in Ensembl
• Where to go for help and documentation

Overview
• Introduction to Ensembl
• BLAST/BLAT
  • Sequence searching
• Assembly Converter
  • Convert files between genome assemblies
• Data Slicer
  • Pull out sections of VCF and BAM files
• File Chameleon
  • Custom download of reference files for NGS analysis
• Variant Effect Predictor (VEP)
  • Analyse your own variants

Introduction

Why do we need genome browsers?
• 1977: 1st genome to be sequenced (5 kb)
• 2004: finished human sequence (3 Gb)
[Slide image: a full screen of raw, unannotated genomic sequence]
We need to make the data mean something…
• http://www.ncbi.nlm.nih.gov/mapview
• http://www.ensembl.org
• http://genome.ucsc.edu

Ensembl Features
• Gene builds for ~70 species
• Gene trees
• Regulatory build (ENCODE)
• Variation display and VEP
• Display of user data
• BioMart (data export)
• Programmatic access via the APIs
• Completely Open Source

Access scales
• Scale: One by one, Groups, Whole genome
• Access routes: Main browser, Mobile site, BioMart, REST API, Perl API, VEP, MySQL, FTP

Vertebrate species on Ensembl
Image obtained using Dendroscope: D.H. Huson and C. Scornavacca, "Dendroscope 3: An interactive tool for rooted phylogenetic trees and networks", Systematic Biology, 2012.

Non-vertebrates on Ensembl Genomes
• Bacteria
• Protists
• Fungi
• Metazoa
• Plants
www.ensemblgenomes.org

Ensembl and Ensembl Genomes
• Released: 2000 (Ensembl); 2009 (Ensembl Genomes)
• Species: vertebrates, with fly, worm and yeast as outgroups (Ensembl); non-vertebrates: protists, plants, fungi, metazoa, bacteria (Ensembl Genomes)
• Annotation: by Ensembl (Ensembl); in collaboration with the scientific communities (Ensembl Genomes)
• URL: www.ensembl.org (Ensembl); www.ensemblgenomes.org (Ensembl Genomes)

Release cycle
• Release 89 (May 2017) to release 90 (July 2017): 2-3 months
• New/updated interfaces
• Updated regulation data
• New genome assemblies
• Updated variation data
• Underlying software updates
• Compara on new genes and genomes
• Updated gene sets

Ensembl Tools
Tools allow:
• Interpretation and processing of your own data
• Custom download of Ensembl data for further analysis

BLAST/BLAT for sequence searching
• Find Ensembl sequences that match your sequence using BLAST/BLAT
• Search:
  • Nucleotide sequences
  • Protein sequences
  • Short sequences (e.g. primers, morpholinos, siRNAs)
• Search against:
  • Genomic sequences
  • cDNA sequences
  • Protein sequences

Hands on: BLAST/BLAT
• I’ve designed a pair of primers for RT-PCR against human BRCA2
• I want to make sure they don’t have any non-specific hits that will mess up my RT-PCR results
• The sequences are:
>fwd
GAGGACTCCTTATGTCCAAATTT
>rev
GAGAATCAGCTTCTGGGGTAATAA

Assembly converter
• You have data mapped to an old genome assembly
• You want to update your data to map it to a new one

What is a genome assembly?
Sequence reads:
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGA
CAGCTGTCCCAGATGAC
ACTTAACTTCCCTCCCAGCTGTCC
GGGCTCCGCCTTCAGCTC
TCCCAGCTGTCCCAGATGACGCCATC
AACTTCCCTCCCAGCT
CGGCCTTTGGGCTCC
TCCGCCTTCAGCTCAAGACTTAACTTC
CAGATGACGCC

Match up overlaps:
[Slide diagram: the same reads, laid out so that their overlapping ends line up]

Genome assembly:
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGACTTAACTTCCCTCCCAGCTGTCCCAGATGACGCCATC

Genome contigs
[Slide diagram: contigs BL102, AL476, CM553 and IM768]

Reference alleles
[Slide diagram: contigs BL102, AL476, CM553 and IM768, with the sequence AGTCGTAGCTAGC TAGGCCATAGGCGA]
• Frequency T = 0.05, frequency G = 0.95
• G is the allele in all primates
• T causes disease susceptibility
• Perhaps G should be the reference allele?
• We can replace the region with a new contig

Genome gaps
Gap in the genome caused by:
• Poor sequencing at this region
• No contig was ever cloned
We can fill in the gap with a new contig.

Incorrectly assembled contigs
[Slide diagram: two orderings of contigs BL102, AL476, CM553 and IM768]

New genome assemblies
• Fixing errors in the genome produces a new genome assembly
• New genome assemblies mean re-mapping of all genome features
• Ensembl will stop updating the old assembly when a new one is brought in
• You’ve got data mapped to the old assembly and you want to compare it to the up-to-date Ensembl annotation

Assembly converter
• Converts genome coordinates to a different genome assembly
• Works with:
  • BED (simple coordinates)
  • GFF (gene, transcript and exon coordinates)
  • GTF (gene, transcript and exon coordinates)
  • WIG (values plotted against the genome)
  • VCF (variants)

Hands on: Assembly converter
• We’re going to convert a small BED file from the human genome assembly GRCh37 to the more recent GRCh38
• BED is a simple feature format which lists the chromosome, start coordinate and end coordinate of each feature, optionally followed by a name:
5 36821734 37091336 P1
5 36731578 36978408 P2
5 36908654 37108773 P3

Data Slicer for variants
• Whole genome VCF files are unwieldy
  • They contain all variants in the genome
  • They contain all genotypes from all individuals studied
• Sometimes you just want to analyse a small region and one population
• The Data Slicer allows you to take a slice of a VCF and narrow down to only individuals and populations of interest
• Data Slicer currently only accesses the 1000 Genomes data
• It is only available for human and only on GRCh37

Hands on: Data Slicer
• I want to get a VCF of the region containing the MC1R gene for the British population
• MC1R is found at 16:89978527-89987385 in GRCh37
• The three-letter code for the British population in 1000 Genomes is GBR

FTP
• Files of our complete database:
  • Genomic, cDNA, CDS, ncRNA and protein sequence (FASTA)
  • Annotated sequence (EMBL, GenBank)
  • Gene sets (GTF, GFF)
  • Whole-genome multiple and gene-based multiple alignments (MAF)
  • Variants (VCF, GVF)
  • Constrained elements (BED)
  • Regulatory features (BED, BigWig)
  • RNA-Seq files (BAM, BigWig)
  • MySQL database

Access FTP
• Your favourite FTP client
• FTP site: ftp://ftp.ensembl.org/pub/
• FTP downloads page: http://www.ensembl.org/info/data/ftp/index.html

FTP files are big
• Multiple MB or GB
• Lots of time to download/unzip
• Do you really need this data?
• Make sure it’s the right file before you download

File Chameleon for NGS analysis
• Although files on the Ensembl FTP site are in a standard format, different tools define the standards differently (sigh!)
• Your NGS analysis tool might need files that are slightly different to the Ensembl formats
• File Chameleon allows you to download files with these adjustments

Hands on: File Chameleon
• I need a GFF3 file of cat for my RNA-seq analysis
• My tool requires:
  • UCSC-style chromosome naming like chr1
  • Only genes shorter than 4 Mb
  • Transcript IDs in every line
• We will use File Chameleon to download this customised file
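
To make the File Chameleon requirements above concrete, here is a minimal Python sketch of two of the adjustments applied to GFF3 text: adding a chr prefix to sequence names and dropping genes of 4 Mb or more together with their child features. It is not how File Chameleon itself is implemented. The function names and the sample records are invented for illustration, the sketch assumes that parents are listed before their children, and it does not attempt the "transcript IDs in every line" requirement.

```python
MAX_GENE_LENGTH = 4_000_000  # "only genes shorter than 4 Mb"


def parse_attributes(column9):
    """Turn 'ID=gene:G1;Name=x' into {'ID': 'gene:G1', 'Name': 'x'}."""
    pairs = (item.split("=", 1) for item in column9.split(";") if "=" in item)
    return {key: value for key, value in pairs}


def customise_gff3(lines, max_gene_length=MAX_GENE_LENGTH):
    """Yield GFF3 lines with chr-prefixed sequence names and long genes removed.

    Assumption: a parent feature always appears before its children.
    """
    dropped_ids = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            # Comments and directives pass through unchanged, so names inside
            # '##sequence-region' directives are NOT renamed by this sketch.
            yield line
            continue
        fields = line.split("\t")
        feature_type = fields[2]
        start, end = int(fields[3]), int(fields[4])
        attributes = parse_attributes(fields[8])

        # GFF3 coordinates are 1-based and inclusive, hence the +1.
        too_long = feature_type == "gene" and (end - start + 1) >= max_gene_length
        parents = attributes.get("Parent", "").split(",")
        parent_dropped = any(parent in dropped_ids for parent in parents)
        if too_long or parent_dropped:
            if "ID" in attributes:
                dropped_ids.add(attributes["ID"])  # so grandchildren go too
            continue

        if not fields[0].startswith("chr"):
            fields[0] = "chr" + fields[0]
        yield "\t".join(fields)


# Hypothetical records, not real cat annotation.
example = [
    "##gff-version 3",
    "A1\tdemo\tgene\t1000\t5000\t.\t+\t.\tID=gene:G1",
    "A1\tdemo\tmRNA\t1000\t5000\t.\t+\t.\tID=transcript:T1;Parent=gene:G1",
    "B2\tdemo\tgene\t1\t4500000\t.\t+\t.\tID=gene:G2",
    "B2\tdemo\tmRNA\t1\t4500000\t.\t+\t.\tID=transcript:T2;Parent=gene:G2",
]

for record in customise_gff3(example):
    print(record)
```

In this sketch the 4.5 Mb gene G2 and its transcript are removed, and the remaining features are reported on chrA1. Using the online tool avoids having to maintain this kind of script for each analysis package.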
Analyse your own variants with the VEP
• Find out the effects of your own variants on Ensembl genes
• Analyse whole genome variant calls
• Filter variants to find those that might be interesting

Your own variant data
Variant coordinates:
1 881907 881906 -/C +
5 140532 140532
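
The first complete row of variant coordinates above follows what the VEP documentation describes as its default input format: whitespace-separated chromosome, start, end, alleles (reference/alternate) and strand, with an insertion written as start = end + 1. A minimal Python sketch of reading such a row follows; `VariantRow`, `parse_vep_line` and `variant_kind` are illustrative names, not part of any Ensembl API, and the classification is deliberately simplified.

```python
from typing import NamedTuple


class VariantRow(NamedTuple):
    chromosome: str
    start: int
    end: int
    ref: str
    alt: str
    strand: str


def parse_vep_line(line):
    """Parse one row of chromosome, start, end, ref/alt alleles and strand."""
    chromosome, start, end, alleles, strand = line.split()[:5]
    ref, alt = alleles.split("/", 1)
    return VariantRow(chromosome, int(start), int(end), ref, alt, strand)


def variant_kind(row):
    """Classify a row from its coordinates and alleles (simplified)."""
    if row.start == row.end + 1 and row.ref == "-":
        return "insertion"  # start = end + 1 marks an insertion
    if row.alt == "-":
        return "deletion"
    if len(row.ref) == 1 and len(row.alt) == 1:
        return "SNV"
    return "other"


row = parse_vep_line("1 881907 881906 -/C +")
print(row, variant_kind(row))
```

Only the complete row from the slide is used here; the second row is cut off in the source and is left out.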