
Genomics for today

• Cancer genomics

• Reproductive health

• Forensic genomics

• Agrigenomics

• Complex disease genomics

• Microbial genomics

• Genomics in drug development

• and more …omics

Data/File formats

• File format: a standardized way of encoding data for storage in a computer file

• Storage, access, sharing, interpretation, security, etc.

http://en.wikipedia.org/wiki/Data_format

for dummies

http://www.dummies.com/how-to/content/bioinformatics-data-formats.html

Scientific data formats

• 23andMe
• AB1 (Chromatogram files used by DNA instruments from Applied Biosystems)
• ABCD (Access to Biological Collection Data)
• ABCDDNA (Access to Biological Collection Data DNA extension)
• ABCDEFG (Access to Biological Collection Data Extension For Geosciences)
• ACE (Sequence assembly format)
• Affymetrix Raw Intensity Format
• ARLEQUIN Project Format
• Axt Alignment Format
• BAM (Binary compressed SAM format)
• BED (Browser extensible display format describing genes and other features of DNA sequences)
• BEDgraph
• Big Browser Extensible Data Format
• Big Wiggle Format
• Binary Alignment Map Format
• Binary Probe Map Format
• Binary sequence information Format
• Biological Pathway eXchange
• BLAT alignment Format
• BRIX generated O Format
• CAF (Common Assembly Format for sequence assembly)
• CellML
• CHADO XML interchange Format
• Chain Format for pairwise alignment
• CHARMM Card File Format
• CLUSTAL-W Alignment Format
• CLUSTAL-W Dendrogram Guide File Format
• Clustered Data Table Format
• Complete Genomics
• DELTA (DEscription Language for TAxonomy)
• DAS (Distributed Sequence Annotation System)
• DBN (Dot Bracket Notation (DBN) - Vienna Format)
• EMBL (Flatfile format used by the EMBL for nucleotide and peptide sequences)
• EML (Environmental Markup Language), not to be confused with EML (Ecological Metadata Language)
• ENCODE (Peak information Format)
• FASTA and FASTQ (File format for sequence data, FASTQ with quality)
• FuGEFlow
• FuGE-ML (Functional Genomics Experiment Markup Language)
• Gating-ML
• GCDML (Genomic Contextual Data Markup Language)
• GelML (Gel electrophoresis Markup Language)
• GenBank (Flatfile format used by NCBI for nucleotide and peptide sequences)
• Gene Feature File (Versions 1 and 3)
• GFF (format for describing genes and other features of DNA, RNA and protein sequences)
• Gene Prediction File Format
• GenePattern GeneSet Table Format
• Genome Annotation File (version 1 and 2)
• GTF (Gene transfer format holds information about gene structure)
• HMMER
• ICB (ICM binary file Format)
• Image Cytometry Standard (ICS)
• imzML (imaging mz Markup Language)
• ISA-Tab (Investigation Study Assay Tabular)
• ISND sequence record XML
• KGML (KEGG Mark-up Language)
• MAGE-Tab (MicroArray Gene Expression Tabular)
• MCL (Microbiological Common Language)
• MIARE-TAB (Minimum Information About a RNAi Experiment Tabular)
• microarray track data Browser Extensible Data Format
• MINiML (MIAME Notation in Markup Language)
• mini Format
• MIQAS-TAB (Minimal Information for QTLs and Association Studies Tabular)
• MITAB
• mmCIF (macromolecular Crystallographic Information File)
• Multiple Alignment Format
• mzData (deprecated)
• mzIdentML
• mzML
• mzQuantML
• mzXML (deprecated)
• NCD (Natural Collections Descriptions)
• NDTF (Neurophysiology Data Translation Format)
• net alignment annotation Format
• NeuroML (Neuroscience eXtensible Markup Language)
• New Hampshire eXtended Format
• Newick tree Format
• NEXUS (Encodes mixed information about genetic sequence data in a block structured format)
• Nimblegen Design File Format
• Nimblegen Gene Data Format
• NMR-STAR (NMR Self-defining Text Archive and Retrieval format)
• nucleotide information binary Format
• ODM (Operational Data Model)
• Open Biomedical Ontology Flat File Format
• Personal Genome SNP Format
• PHD (Output from the basecalling software Phred)
• phyloXML (XML for evolutionary biology and comparative genomics)
• Pre-Clustering File Format
• Protein Data Bank (PDB; Structures of biomolecules deposited in Protein Data Bank)
• Protein Information Resource Format
• PRM (Protocol Representation Model (Medical Research))
• PSI-MI XML
• PSI-PAR
• RDML (Real-time PCR Data Markup Language)
• SAM (Sequence Alignment/Map format)
• SCF (Staden chromatogram files used to store data from DNA sequencing)
• SBML (Systems Biology Markup Language used to store biochemical network computational models)
• SDD (Structured Descriptive Data)
• SED-ML (Simulation Experiment Description Markup Language)
• Sequence Alignment Map Format
• SOFT (Simple Omnibus Format in Text)
• spML (Separation Markup Language)
• SRA-XML (Short Read Archive eXtensible Markup Language)
• Standard Flowgram Format
• Stockholm Multiple Alignment Format (Representing multiple sequence alignments)
• SBML (System Biology Markup Language)
• SBGN (Systems Biology Graphical Notation)
• SBRML (Systems Biology Results Markup Language)
• Swiss-Prot (Flatfile format used for protein sequences from the Swiss-Prot database)
• TAIR annotation data Format
• TAPIR (TDWG Access Protocol for Information Retrieval)
• TCS (Taxonomic Concept transfer Schema)
• TraML (Transition Markup Language)
• UniProtKB XML Format
• VCF (Variant Call Format)
• Wiggle Format

http://en.wikipedia.org/wiki/List_of_file_formats#Biology
http://fileformats.archiveteam.org/wiki/Scientific_Data_formats

Fasta format

>Description line
Sequence

>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGF
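The FASTA layout above (a '>' description line followed by one or more sequence lines) can be read with a few lines of Python. This is a minimal sketch, not a replacement for an established parser: the function names `parse_fasta` and `split_defline` are made up for illustration, and `split_defline` assumes the identifiers come as simple tag|value pairs, as they do in the example record. Three-field forms need extra handling.

```python
from typing import Dict, Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)  # sequences may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def split_defline(description: str) -> Tuple[Dict[str, str], str]:
    """Split 'tag|value|tag|value| free text' into ({tag: value}, free text).

    Assumption: identifiers are plain tag|value pairs (hypothetical helper).
    """
    ids_part, _, title = description.partition(" ")
    fields = ids_part.strip("|").split("|")
    return dict(zip(fields[0::2], fields[1::2])), title


# The example record from the slide, wrapped over two sequence lines.
EXAMPLE = """>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKI
IRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGF
"""

records = list(parse_fasta(EXAMPLE))
ids, title = split_defline(records[0][0])
print(ids)
print(title)
print(records[0][1][:10] + "...")
```

Joining the wrapped lines back into one string is the main point: line breaks inside a FASTA sequence carry no meaning.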

Sequence identifiers

Database                        Identifier syntax
GenBank                         gb|accession|locus
EMBL Data Library               emb|accession|locus
DDBJ, DNA Database of Japan     dbj|accession|locus
NBRF PIR                        pir||entry
Protein Research Foundation     prf||name
SWISS-PROT                      sp|accession|entry name
Brookhaven Protein Data Bank    pdb|entry|chain
Patents                         pat|country|number
GenInfo Backbone Id             bbs|number
General database identifier     gnl|database|identifier
NCBI Reference Sequence         ref|accession|locus
Local Sequence identifier       lcl|identifier

File extension

Extension | Meaning | Notes
faa | FASTA amino acid | Contains amino acids. A multiple protein FASTA file can have the more specific extension mpfa.
fasta (.fas) | generic FASTA | Any generic FASTA file. Other extensions can be fa, seq, fsa.
ffn | FASTA nucleotide coding regions | Contains coding regions for a genome.
fna | FASTA nucleic acid | Used to generically specify nucleic acids.
frn | FASTA non-coding RNA | Contains non-coding RNA regions for a genome, in DNA alphabet, e.g. tRNA, rRNA.

BioXSD: the common data-exchange format for everyday bioinformatics web services

• for basic bioinformatics data

• XML schema

• syntax for biological sequences, annotations, alignments, and references to resources

http://bioinformatics.oxfordjournals.org/content/26/18/i540.full

8. How to identify disease biomarkers

Balaji Rajashekar

20.11.14

Human

By mass, human cells consist of 65–90% water (H2O). Oxygen therefore contributes a majority of a human body's mass. Almost 99% of the mass of the human body is made up of the six elements oxygen, carbon, hydrogen, nitrogen, calcium, and phosphorus. About 0.75% of the remainder is composed of only five elements: sodium, magnesium, potassium, sulfur, and chlorine. The remaining elements are trace elements. Note that not all elements found in the human body play a role in life.

Elemental composition

The average 70 kg adult human body contains approximately 6.7 × 10^27 atoms and is composed of 60 chemical elements. The elements needed for life are relatively common in the Earth's crust, and conversely most of the common elements are necessary for life. An exception is aluminium, which is the third most common element in the Earth's crust (after oxygen and silicon).

Contents : Flesh, Blood,

http://en.wikipedia.org/wiki/Composition_of_the_human_body

Composition

The composition can also be expressed in terms of chemicals, such as:

Water

Proteins – including those of hair, connective tissue, etc.

Fats (or lipids)

Apatite in bones

Carbohydrates such as glycogen and glucose

DNA

Dissolved inorganic ions such as sodium, potassium, chloride, bicarbonate, phosphate

Gases such as oxygen, carbon dioxide, nitrogen oxide, hydrogen, carbon monoxide, methanethiol. These may be dissolved or present in the gases in the lungs or intestines.

Many other small molecules, such as amino acids, fatty acids, nucleobases, nucleosides, nucleotides, vitamins, cofactors.

Free radicals such as superoxide, hydroxyl, and hydroperoxyl.

Materials

Body composition can also be expressed in terms of various types of material, such as:

Muscle

Fat

Bone and teeth

Brain and nerves

Connective tissue

Blood – 7% of body weight.

Lymph

Contents of digestive tract, including intestinal gas

Urine

Air in lungs

Composition by cell type

There are many species of bacteria and other microorganisms that live on or inside the healthy human body. In fact, 90% of the cells in (or on) a human body are microbes, by number (much less by mass or volume). Some of these symbionts are necessary for our health. Those that neither help nor harm us are called commensal organisms.

Data and analysis

Formats : NetCDF

Population growth in cities in corresponding year

City          1900    1940    1970    2000
Los Angeles   0.102   1.504   2.812   3.695
Washington    0.279   0.663   0.757   0.572
New York      3.437   7.455   7.896   8.008
Seattle       0.081   0.368   0.531   0.563
London        6.528   8.197   7.452   7.322

netcdf out {
dimensions:
    __string = 11 ;
    n = 4 ;
    m = 5 ;
variables:
    char empty(__string) ;
    int year(n) ;
    char city(m, __string) ;
    float population(m, n) ;

// global attributes:
    :__str_len = 11 ;
data:

    empty = "" ;

    year = 1900, 1940, 1970, 2000 ;

    city =
        "Los Angeles", "Washington", "New York", "Seattle", "London" ;

    population =
        0.102, 1.504, 2.812, 3.695,
        0.279, 0.663, 0.757, 0.572,
        3.437, 7.455, 7.896, 8.008,
        0.081, 0.368, 0.531, 0.563,
        6.528, 8.197, 7.452, 7.322 ;
}

Genomic data
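In these CDL listings, a two-dimensional variable such as population(m, n) is written as one flat list of values in row-major order: the last dimension (n, the year) varies fastest. The sketch below rebuilds the city-by-year table from that flat list in plain Python. It deliberately avoids any NetCDF library, and the helper name `reshape_row_major` is made up for illustration.

```python
from typing import List, Sequence


def reshape_row_major(flat: Sequence[float], rows: int, cols: int) -> List[List[float]]:
    """Cut a flat row-major list into `rows` lists of `cols` values each."""
    if len(flat) != rows * cols:
        raise ValueError(f"expected {rows * cols} values, got {len(flat)}")
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


# Values copied from the data: section of the CDL listing.
years = [1900, 1940, 1970, 2000]
cities = ["Los Angeles", "Washington", "New York", "Seattle", "London"]
population_flat = [
    0.102, 1.504, 2.812, 3.695,
    0.279, 0.663, 0.757, 0.572,
    3.437, 7.455, 7.896, 8.008,
    0.081, 0.368, 0.531, 0.563,
    6.528, 8.197, 7.452, 7.322,
]

# population(m, n): m = 5 cities (rows), n = 4 years (columns)
population = reshape_row_major(population_flat, len(cities), len(years))
by_city = dict(zip(cities, population))

for city in cities:
    row = "  ".join(f"{value:6.3f}" for value in by_city[city])
    print(f"{city:<12} {row}")
```

The same indexing applies to an expression matrix laid out as expression(m, n), with genes as rows and samples as columns.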

Expression matrix

netcdf out {
dimensions:
    __string = 11 ;
    n = 4 ;
    m = 5 ;
variables:
    char empty(__string) ;
    int sample(n) ;
    char gene(m, __string) ;
    float expression(m, n) ;

// global attributes:
    :__str_len = 11 ;
data:

    Organism = "Homo sapiens" ;
    datatype = "microarray" ;

    sample = HumanH1, HumanH2, Disease1, Disease2 ;

    genes = "ENSG00001", "ENSG00002", "ENSG00005", "ENSG00009", "ENSG00011" ;

    expression =
        0.102, 1.504, 2.812, 3.695,
        0.279, 0.663, 0.757, 0.572,
        3.437, 7.455, 7.896, 8.008,
        0.081, 0.368, 0.531, 0.563,
        6.528, 8.197, 7.452, 7.322 ;
}

Callouts on the slide:

• Global attributes: Organism, DataType, Email, Date....

• Annotations (track ids): Sample IDs, Gene ids

• Numeric values: expression

Databases

• Arrayexpress : http://www.ebi.ac.uk/arrayexpress/

• http://www.ebi.ac.uk/gxa/home

• GEO : http://www.ncbi.nlm.nih.gov/gds

BIIT tools

• MEM : http://biit.cs.ut.ee/mem/

• gProfiler : http://biit.cs.ut.ee/gprofiler

• GOSummaries : http://www.bioconductor.org/packages/release/bioc/html/GOsummaries.html

• KEGGanim : http://biit.cs.ut.ee/kegganim/

• DiffExp : http://biit.cs.ut.ee/diffexp/

• ExpressView : http://biit.cs.ut.ee/expressview/

• pHeatmap : http://cran.r-project.org/web/packages/pheatmap/

• gPathways : http://biit.cs.ut.ee/gpathways/

• ProjectBrowser : http://biit.cs.ut.ee/projectbrowser/?project=affy

Major datatypes

• Microarray and Next-generation sequencing

• Both for DNA, RNA and epigenetic applications

• Arrays measure a fixed set of genes or genomic locations; sequencing, in contrast, can also reveal novel data

• NGS had the greatest detection sensitivity, largest dynamic range of detection and highest accuracy in differential expression analysis when compared with gold-standard quantitative real-time PCR.

• Technical reproducibility was high, with intrasample correlations in all cases.

• Expression profiles between paired frozen and FFPE samples were similar

• These results show the superior sensitivity, accuracy and robustness of NGS for the study.

http://www.nature.com/labinvest/journal/v94/n3/full/labinvest2013157a.html

Microarrays

• A DNA microarray (also commonly known as DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome. Each DNA spot contains picomoles (10^−12 moles) of a specific DNA sequence, known as probes (or reporters or oligos). These can be a short section of a gene or other DNA element that are used to hybridize a cDNA or cRNA (also called anti-sense RNA) sample (called target) under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detection of fluorophore-, silver-, or chemiluminescence-labeled targets to determine relative abundance of nucleic acid sequences in the target.

http://en.wikipedia.org/wiki/DNA_microarray

Microarrays

• Microarray technology is a complex mixture of numerous technologies and research fields such as mechanics, microfabrication, chemistry, DNA behaviour, microfluidics, enzymology, optics and bioinformatics. (http://www.ncbi.nlm.nih.gov/pubmed/19381982)

• Microarrays slides - 1, 2, 3

• R scripts - 1, 2, 3

• Online material : http://www.wiley-vch.de/books/sample/3527318224_c01.pdf

Tasks

• Further reading :

• http://genomebiology.com/2014/15/11/519/abstract

Projects

• https://docs.google.com/document/d/1w5rg_EiPjdyhpJtFqBCVvQ9C6VrNr99rJyFfzDfVOxw/edit?usp=sharing

• Early Discussion

• Deadline : January

• Exam : December 18th

RNA-sequencing

Application to genomic medicine

http://bgiamericas.com/wp-content/uploads/2013/12/nlcrna_homepage.jpg

Case

• Identical twins are widely used to study the contributions of genetics and environment to human disease.

• Case: a study in which one twin had multiple sclerosis and the other did not.

• A possible solution for such studies lies in the latest techniques of genome sequencing and analysis.

http://www.msgenes.ucsf.edu/press-new06_10.html
http://bgiamericas.com/applications/drug-rd/rna-seq-100-ng/

FastQ format

Raw data. Each record holds, in order: sequence identifier, nucleotide sequence, identifier (repeated after '+'), PHRED scores.

@HWUSI-EAS582_160:4:1:1:1866/1
NGACTCTTAGCGGTGGATCACTCGGCTCGTATGCCGTCTTCTGC
+HWUSI-EAS582_160:4:1:1:1866/1
DNWSSWUWUUPUWUTSTTPTBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS582_160:4:1:1:185/1
NTATGCCGTCTTCTGCTTGAAAAAAAAAACACCGGCCCGCGCTC
+HWUSI-EAS582_160:4:1:1:185/1
DKVQXYVWUWYWUUWUSTVSBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS582_160:4:1:1:532/1
NTCTGTGTAGAAGAGAAGTTCGTATGCCGTCTTCTGCTTGAAAA
+HWUSI-EAS582_160:4:1:1:532/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

A FASTQ file normally uses four lines per sequence. Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). Line 2 is the raw sequence letters. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

Wednesday, November 23, 11

FastQ to fasta
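The four-line FASTQ record described above, and its conversion to FASTA, can be sketched in Python as follows. This is an illustrative sketch rather than a production parser: the names `FastqRecord`, `parse_fastq`, `phred_scores` and `to_fasta` are made up, and the default quality offset of 64 is an assumption based on the character range seen in the example reads (older Illumina encoding). Many current files use an offset of 33 instead.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text that uses exactly four lines per read."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")
    for i in range(0, len(lines), 4):
        head, seq, plus, qual = lines[i:i + 4]
        if not head.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at non-blank line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {head}")
        yield FastqRecord(head[1:], seq, qual)


def phred_scores(quality: str, offset: int = 64) -> List[int]:
    """Decode a quality string; offset=64 is an assumption, see the text above."""
    return [ord(ch) - offset for ch in quality]


def to_fasta(record: FastqRecord) -> str:
    """Drop the quality line and emit the read as a FASTA entry."""
    return f">{record.identifier}\n{record.sequence}\n"


# A shortened version of the first read on the slide (first four bases only).
TOY_FASTQ = """@HWUSI-EAS582_160:4:1:1:1866/1
NGAC
+HWUSI-EAS582_160:4:1:1:1866/1
DNWS
"""

reads = list(parse_fastq(TOY_FASTQ))
print(phred_scores(reads[0].quality))
print(to_fasta(reads[0]), end="")
```

The length check between sequence and quality lines mirrors the one hard rule stated in the format description, so truncated files fail early instead of producing shifted records.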

Each FASTA header encodes: unique sequence id, trinucleotide score (t), width of sequence (w), and the read depth of the sequence (x), i.e. how many identical reads were found for this sequence in the FASTQ file.

>9_t6_w29_x44197
TGGTGGTCTAGTGGTTAGGAT
>17_t6_w30_x25765
TGGTGGTTCAGTGGTAGAATTC
>33_t7_w28_x11707
TGGTGGTATAGTGGTGAGCA
>35_t5_w28_x11233
TGTGGTCTAGTGGTTAGGAT
>37_t6_w29_x10577
TGGTGGTTCAGTGGTAGAATT
>44_t6_w30_x8367
TGGTGGTCTAGTGGTTAGGATT
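Headers such as >9_t6_w29_x44197 pack several numbers into one string, so downstream scripts usually split them back into fields. The sketch below does that with a regular expression. The field names, and the reading of the t, w and x prefixes as trinucleotide score, width and read depth, are assumptions taken from the labels on the slide, and `parse_collapsed_header` is a made-up helper name.

```python
import re
from typing import NamedTuple


class CollapsedHeader(NamedTuple):
    seq_id: int               # unique sequence id
    trinucleotide_score: int  # value after "t" (assumed meaning)
    width: int                # value after "w" (assumed meaning)
    read_depth: int           # value after "x": number of identical reads


# Matches e.g. ">9_t6_w29_x44197"; the leading ">" is optional.
_HEADER_RE = re.compile(r"^>?(\d+)_t(\d+)_w(\d+)_x(\d+)$")


def parse_collapsed_header(header: str) -> CollapsedHeader:
    """Split a collapsed-read FASTA header into its numeric fields."""
    match = _HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    return CollapsedHeader(*(int(group) for group in match.groups()))


headers = [">9_t6_w29_x44197", ">17_t6_w30_x25765", ">44_t6_w30_x8367"]
parsed = [parse_collapsed_header(h) for h in headers]
for item in parsed:
    print(item.seq_id, item.read_depth)
print(sum(item.read_depth for item in parsed))
```

Summing `read_depth` over all headers gives the number of original reads that the collapsed file represents, which is a handy sanity check after collapsing.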

Data to analysis

Short reads → Preliminary QC → Alignment to ref. genome (cpu/memory intensive) → Raw alignments → QC/filtering → Postprocessing → Final output

Sequencing analysis schema

http://www.thetcr.org/article/view/2651/html

RNA-seq videos

• RNA-seq vs Microarray : http://youtu.be/2c3t3tDEmsU?list=PLTTs13_4Ig_N8MGNTUGkxYWVTtPXkQSsw

• RNA-seq basics : http://youtu.be/V_4n8n5Z6I8?list=PLTTs13_4Ig_N8MGNTUGkxYWVTtPXkQSsw

• RNA-seq analysis : http://youtu.be/hlbbtgauwpg?list=PLTTs13_4Ig_N8MGNTUGkxYWVTtPXkQSsw

• Popular RNA-seq videos 200+ : http://www.youtube.com/playlist?list=PLTTs13_4Ig_N8MGNTUGkxYWVTtPXkQSsw

Article

• Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks

• Nature Protocols 7, 562–578 (2012) doi:10.1038/nprot.2012.016

• Published online 01 March 2012; corrected online 07 August 2014

• http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

Abstract

• Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions. This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol's execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ~1 h of hands-on time.

RNA-seq pipeline

http://www.nature.com/nprot/journal/v7/n3/images_article/nprot.2012.016-F1.jpg

Assembly

Analysing genes

Transcript difference

BIIT (Tõnis)

• Automatic Pipeline for processing RNA-Seq experiments from SRA (FastQ) with metadata

• Final output : TabCDF

• which can be used directly in BIIT tools