Introduction to Bio(Medical)- Informatics
Kun Huang, PhD
Department of Biomedical Informatics

Raghu Machiraju, PhD
Department of Computer Science and Engineering

The Ohio State University

Outline
• Overview of bioinformatics
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics

Technical areas of bioinformatics

Databases and Query
• IMMPORT: https://immport.niaid.nih.gov/immportWeb/home/home.do?loginType=full#

Visualization
• http://supramap.osu.edu

Network

Visualization and Visual Analytics
§ Visualizing genome data annotation

Network Mining
Hopkins 2007, Nature Biotechnology, Network pharmacology

Role of Bioinformatics – Hypothesis Generation
[Diagram labels: Hypothesis generation, Bioinformatics, Hypothesis testing]

Training for Bioinformatics
§ Biology / medicine
§ Computing – programming, algorithms, pattern recognition, machine learning, data mining, network analysis, visualization, software engineering, databases, etc.
§ Statistics – hypothesis testing, permutation methods, multivariate analysis, Bayesian methods, etc.

Beyond Bioinformatics
• Translational Bioinformatics
• Biomedical Informatics
• Medical Informatics
• Nursing Informatics
• Imaging Informatics
• Bioimage Informatics
• Clinical Research Informatics
• Pathology Informatics
• …
• Health Analytics
• …
• Legal Informatics
• Business Informatics
Source: Department of Biomedical Informatics

Beyond Bioinformatics
[Diagram labels, from basic science to applied science]
• Biomedical Informatics Theories & Methods
• Application domains, each with its own Technologies & Tools:
  § Bioinformatics
  § Imaging & Structural Informatics
  § Clinical Informatics
  § Public Health Informatics
• Cross-domain areas: Translational Bioinformatics, Clinical Research Informatics
Source: Department of Biomedical Informatics

Beyond Bioinformatics
[Diagram labels]
• Goals: Driving Biological Discovery; Generating & Translating Knowledge; Advancing Personalized Healthcare
• Innovation Focus Areas: Bioinformatics; Translational Bioinformatics; Clinical Informatics; Computational Biology; Imaging Informatics; Clinical Research Informatics
• Cross-Cutting Competencies: Human Factors; Knowledge Engineering; High Performance Computing; Data Science
Source: Department of Biomedical Informatics

Outline
• Overview of bioinformatics and BMI
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics

Data

Public Data
• Clinical Data or Health Records
  § Local cancer registries and SEER
  § Osteoarthritis Initiative (http://www.oai.ucsf.edu/datarelease/)
  § Framingham Heart Study (http://www.framinghamheartstudy.org/research/index.html)
  § WHO clinical trials data registry (http://apps.who.int/trialsearch/)
  § dbGaP: http://www.ncbi.nlm.nih.gov/gap
  § Women's Health Initiative (http://www.whiscience.org/data/)

Public Data
• Molecular Data
  § NCBI Gene Expression Omnibus (GEO)
  § The Cancer Genome Atlas (TCGA). Goal: 500 patients for each type of cancer, more than 20 types of cancer. Data types: genotype, gene expression, microRNA, CGH, DNA methylation, histological images, clinical*
  § 1000 Genomes / HapMap
  § International cancer consortium (Goal: 1000 patients for each cancer)
  § CCLE – Broad Institute/Novartis
  § ENCODE

Workflow to Mine Frequent Co-expression Network
Zhang J, Lu K, Xiang Y, Islam M, et al. (2012) PLoS Comput Biol 8(8): e1002656.
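Note on co-expression networks: the workflow slide above gives only a title and a citation, so the following is a generic illustration and not the method of Zhang et al. (2012). A common first step, assumed here, is to connect two genes when the Pearson correlation of their expression profiles across samples exceeds a threshold. The gene names, expression values, and the 0.9 cutoff are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sd_x * sd_y)


def coexpression_edges(expr, threshold=0.9):
    """Return (gene1, gene2, r) for every gene pair with |r| >= threshold.

    expr maps a gene name to its expression values across the same samples.
    The threshold is an illustrative choice, not a recommended value.
    """
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical toy data: three genes measured in four samples.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 2.0],
}

edges = coexpression_edges(toy_expression)
for g1, g2, r in edges:
    print(f"{g1} -- {g2}  r = {r:.3f}")
```

With real data (for example a GEO expression matrix) the all-pairs loop grows quadratically with the number of genes, so a vectorized correlation matrix would normally replace it.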
Outline
• Overview of bioinformatics and BMI
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics

[Figure] Source: Lodish et al, Molecular Cell Biology

Central Dogma

Eukaryotic Nuclear Gene Structure
[Figure labels: TSS (+1), TES, Splicing]
TSS: Transcription Start Site; TES: Transcription “Ending” Site
Source: Searle and Hopkins, BJA, 2009

Alternative Splicing
"the discovery that genes in eukaryotes are not contiguous strings but contain introns, and that the splicing of messenger RNA to delete those introns can occur in different ways, yielding different proteins from the same DNA sequence".
http://en.wikipedia.org/wiki/Alternative_splicing
[Figure label: Splicing] Source: Searle and Hopkins, BJA, 2009

Post-genomic Era: Genome Annotation
• The value of a genome is in its annotations
[Figure] Source: Lodish et al, Molecular Cell Biology

Protein folding and structure

Sequencing
• Sanger sequencing
Fred Sanger
• Nobel prize in chemistry in 1958 "for his work on the structure of proteins, especially that of insulin"
• Nobel prize in chemistry in 1980 "for their contributions concerning the determination of base sequences in nucleic acids"
Source: Winnick, The Scientist, 18(18), 2004

Sequencing Technology
• Automatic sequencers
Leroy Hood
Source: Winnick, The Scientist, 18(18), 2004

Initial Analysis of the Human Genome

Next Generation Sequencing (NGS)
Sequencers and capacity
• Sequencers: Roche/454, Illumina (Solexa), Life Technologies (ABI) SOLiD
• Both DNA and RNA
• Short sequences
  § 32-150bp for Illumina and SOLiD
  § 400-1000bp for 454
• Ultra-high throughput
  § Up to 600G bases per run (Illumina and SOLiD) (a human genome is about 3G bases)
  § 1G bases per run (454)

Major Sequencing Platforms
Illumina
• Leading platform
• Bridge amplification
• Sequencing-by-synthesis
SOLiD (Life Technologies)
• Emulsion PCR (beads)
• Sequencing-by-ligation
• Color space
454 (Roche)
• Pyrosequencing
• Long(er) reads, 400bp
• First platform
Pacific Biosciences
• Single molecule
• Nano hole
• Very long reads (<5K)
Ion Torrent
• Personal genome analyzer
• Ion-sensitive semiconductor
• Cheap equipment, lower throughput, higher error rates (now)

Main Applications of NGS
Sequence DNA
• De novo sequencing
• Reference-based re-sequencing
• SNP, CNV, Indels
• Metagenomics
  § Identify “who is there?” in a mixture of microbes
Sequence RNA
• RNA-seq (transcriptome-wide sequencing)
• miRNA-seq
• novel ncRNAs
Study Protein-DNA/RNA interaction
• ChIP-seq (for TF, Pol II binding)
• CLIP-seq (for RNA binding proteins)
Epigenetics
• DNA methylation
• Histone modification (ChIP-seq)
• Nucleosome positioning
• Chromosome looping
[Platforms pictured: Illumina – Solexa, Roche – 454, Life Technologies – SOLiD]

Illumina sequencing
• Sequencing by synthesis

Quality Scores

Sequence Files

Date     Cost per Mb   Cost per Genome     Date     Cost per Mb   Cost per Genome
Sep-01   $5,292.39     $95,263,072         Jul-07   $495.96       $8,927,342
Mar-02   $3,898.64     $70,175,437         Oct-07   $397.09       $7,147,571
Sep-02   $3,413.80     $61,448,422         Jan-08   $102.13       $3,063,820
Mar-03   $2,986.20     $53,751,684         Apr-08   $15.03        $1,352,982
Oct-03   $2,230.98     $40,157,554         Jul-08   $8.36         $752,080
Jan-04   $1,598.91     $28,780,376         Oct-08   $3.81         $342,502
Apr-04   $1,135.70     $20,442,576         Jan-09   $2.59         $232,735
Jul-04   $1,107.46     $19,934,346         Apr-09   $1.72         $154,714
Oct-04   $1,028.85     $18,519,312         Jul-09   $1.20         $108,065
Jan-05   $974.16       $17,534,970         Oct-09   $0.78         $70,333
Apr-05   $897.76       $16,159,699         Jan-10   $0.52         $46,774
Jul-05   $898.90       $16,180,224         Apr-10   $0.35         $31,512
Oct-05   $766.73       $13,801,124         Jul-10   $0.35         $31,125
Jan-06   $699.20       $12,585,659         Oct-10   $0.32         $29,092
Apr-06   $651.81       $11,732,535         Jan-11   $0.23         $20,963
Jul-06   $636.41       $11,455,315         Apr-11   $0.19         $16,712
Oct-06   $581.92       $10,474,556         Jul-11   $0.12         $10,497
Jan-07   $522.71       $9,408,739          Oct-11   $0.09         $7,743
Apr-07   $502.61       $9,047,003          Jan-12   $0.09         $7,666

NGS vs Moore’s Law

NGS – Things keep Moving

NGS Processing Pipeline

Outline
• Overview of bioinformatics and BMI
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics

Database Search
• PubMed / Entrez
• Gene information (e.g., Entrez Gene database, GeneCards)
• GenBank (accession number, GI number, version number, etc.)
• File formats (e.g., FASTA, SAM, BAM)
• ID conversion (e.g., DAVID)

Visualization of Genomics Data
• UCSC Genome Browser (http://genome.ucsc.edu/)
• Ensembl (http://www.ensembl.org/index.html)
• Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/)
• VEGA – VErtebrate Genome Annotation database (http://vega.sanger.ac.uk/index.html)
• Integrative Genomics Viewer (IGV) – by Broad Institute
• …
• Check out http://www.openhelix.com/cgi/freeTutorials.cgi

Sequence Alignment
• Dynamic programming
• Global alignment – Needleman-Wunsch algorithm
• Local alignment – Smith-Waterman algorithm
• BLAST
• Entrez BLAST tool
• Multiple sequence alignment

Outline
• Overview of bioinformatics and BMI
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics

Outline
• http://web.cse.ohio-state.edu/~raghu/teaching/CSE5599-BMI7830/
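
Note on global alignment: the Sequence Alignment slide lists dynamic programming and the Needleman-Wunsch algorithm without details. The following is a minimal sketch of that algorithm under assumed scoring parameters (match +1, mismatch -1, linear gap penalty -1); the slides do not specify a scoring scheme, and the function name and test sequences are illustrative only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b by dynamic programming.

    Returns (score, aligned_a, aligned_b). Scoring values are assumptions
    chosen for illustration; real tools use substitution matrices and
    often affine gap penalties.
    """
    n, m = len(a), len(b)

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    def sub(i, j):
        # substitution score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell takes the best of three moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT")
print(result)
```

Smith-Waterman local alignment uses the same table with two changes: cell values are floored at zero, and the traceback starts from the highest-scoring cell and stops at a zero.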