Introduction to Bio(medical)-Informatics
Kun Huang, PhD; Raghu Machiraju, PhD
Department of Biomedical Informatics; Department of Computer Science and Engineering
The Ohio State University
Outline
• Overview of bioinformatics
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics
Technical areas of bioinformatics
IMMPORT https://immport.niaid.nih.gov/immportWeb/home/home.do?loginType=full#
Databases and Query
Visualization
http://supramap.osu.edu
Network
Visualization and Visual Analytics
§ Visualizing genome data annotation
Network Mining
Hopkins 2007, Nature Biotechnology, Network pharmacology
Role of Bioinformatics – Hypothesis Generation
Hypothesis generation
Bioinformatics
Hypothesis testing
Training for Bioinformatics
§ Biology / medicine
§ Computing – programming, algorithm, pattern recognition, machine learning, data mining, network analysis, visualization, software engineering, database, etc.
§ Statistics – hypothesis testing, permutation method, multivariate analysis, Bayesian method, etc.
Beyond Bioinformatics
• Translational Bioinformatics
• Biomedical Informatics
• Medical Informatics
• Nursing Informatics
• Imaging Informatics
• Bioimage Informatics
• Clinical Research Informatics
• Pathology Informatics
• …
• Health Analytics
• …
• Legal Informatics
• Business Informatics
Source: Department of Biomedical Informatics
Beyond Bioinformatics
[Figure labels: Basic Science: Biomedical Informatics Theories & Methods. Applied Science: Bioinformatics, Imaging & Structural Informatics, Clinical Informatics, Public Health Informatics, each with Technologies & Tools. Also shown: Clinical Research Informatics, Translational Bioinformatics.]
Source: Department of Biomedical Informatics
Beyond Bioinformatics
[Figure labels: Driving Biological Discovery; Generating & Translating Knowledge; Advancing Personalized Healthcare. Innovation Focus Areas: Translational Bioinformatics, Clinical Informatics, Bioinformatics, Clinical Research Informatics, Computational Biology, Imaging Informatics. Cross-Cutting Competencies: Human Factors, Knowledge Engineering, High Performance Computing, Data Science.]
Source: Department of Biomedical Informatics
Outline
• Overview of bioinformatics and BMI
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics
Public Data
• Clinical Data or Health Records
  § Local cancer registries and SEER
  § Osteoarthritis Initiative (http://www.oai.ucsf.edu/datarelease/)
  § Framingham Heart Study (http://www.framinghamheartstudy.org/research/index.html)
  § WHO clinical trials data registry (http://apps.who.int/trialsearch/)
  § dbGaP: http://www.ncbi.nlm.nih.gov/gap
  § Women's Health Initiative (http://www.whiscience.org/data/)
Public Data
• Molecular Data
  § NCBI Gene Expression Omnibus (GEO)
  § Cancer Genome Atlas – TCGA (Goal: 500 patients for each type of cancer, more than 20 types of cancer, genotype, gene expression, microRNA, CGH, DNA methylation, histological images, clinical*)
  § 1000 Genomes/HapMap
  § International cancer consortium (Goal: 1000 patients for each cancer)
  § CCLE – Broad Institute/Novartis
  § ENCODE
Workflow to Mine Frequent Co-expression Network
Zhang J, Lu K, Xiang Y, Islam M, et al. (2012) PLoS Comput Biol 8(8): e1002656.
Public Data
• Molecular Data
Outline
• Overview of bioinformatics and BMI
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics
Lodish et al, Molecular Cell Biology
Central Dogma
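The flow of information named by the central dogma (DNA to RNA to protein) can be illustrated with a minimal Python sketch. It is a simplification, not a model of the real cellular machinery: it assumes the input is the coding strand read 5' to 3', treats transcription as replacing T with U, and translates with the standard genetic code starting at the first base. The function names and the example sequence are made up for illustration.

```python
# Standard genetic code, codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna):
    """DNA -> mRNA, assuming the coding strand is given (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein, reading codons from position 0 until a stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTGGTAA"  # hypothetical toy gene: Met-Ala-Trp-stop
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

Splicing, reading-frame detection, and the template strand are deliberately left out; the eukaryotic gene structure that follows shows why real transcripts need more than this.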
Eukaryotic Nuclear Gene Structure
TSS (+1) TES
Splicing
Searle and Hopkins, BJA, 2009
TSS: Transcription Start Site; TES: Transcription “Ending” Site
Alternative Splicing
"the discovery that genes in eukaryotes are not contiguous strings but contain introns, and that the splicing of messenger RNA to delete those introns can occur in different ways, yielding different proteins from the same DNA sequence".
Splicing
Searle and Hopkins, BJA, 2009
http://en.wikipedia.org/wiki/Alternative_splicing
Post-genomic Era: Genome Annotation
• The value of a genome is in its annotations
Lodish et al, Molecular Cell Biology
Protein folding and structure
Sequencing
• Sanger sequencing
Fred Sanger
• Nobel prize in chemistry in 1958 "for his work on the structure of proteins, especially that of insulin"
• Nobel prize in chemistry in 1980 "for their contributions concerning the determination of base sequences in nucleic acids"
Winnick, The Scientist, 18(18), 2004
Sequencing Technology
• Automatic sequencers
Leroy Hood
Winnick, The Scientist, 18(18), 2004
Initial Analysis of the Human Genome
Next Generation Sequencing (NGS)
Sequencers and capacity
• Sequencers
  § Roche/454
  § Illumina (Solexa)
  § Life Technologies (ABI) SOLiD
• Both DNA and RNA
• Short sequences
  § 32-150bp for Illumina and SOLiD
  § 400-1000bp for 454
• Ultra-high throughput
  § Up to 600G bases per run (Illumina and SOLiD); a human genome is about 3G bases
  § 1G bases per run (454)
Major Sequencing Platforms
Illumina
• Leading platform
• Bridge amplification
• Sequencing-by-synthesis
SOLiD (Life Technologies)
• Emulsion PCR (beads)
• Sequencing-by-ligation
• Color space
454 (Roche)
• Pyrosequencing
• Long(er) reads, 400bp
• First platform
Pacific Biosciences
• Single molecule
• Nano hole
• Very long reads (<5K)
Ion Torrent
• Personal genome analyzer
• Ion-sensitive semi-conductor
• Cheap equipment, lower throughput, higher error rates (now)
Main Applications of NGS
Sequence DNA
• De novo sequencing
• Reference-based re-sequencing
• SNP, CNV, Indels
• Metagenomics
  § Identify “who is there?” in a mixture of microbes
Sequence RNA
• RNA-seq (transcriptome-wide sequencing)
• miRNA-seq
• novel ncRNAs
Study Protein-DNA/RNA interaction
• ChIP-seq (for TF, Pol II binding)
• CLIP-seq (for RNA binding proteins)
Epigenetics
• DNA methylation
• Histone modification (ChIP-seq)
• Nucleosome positioning
• Chromosome looping
Platforms shown: Illumina – Solexa; Roche – 454; Life Technology – SOLiD
Illumina sequencing
• Sequencing by synthesis
Quality Scores
Sequence Files
Date     Cost per Mb   Cost per Genome
Sep-01   $5,292.39     $95,263,072
Mar-02   $3,898.64     $70,175,437
Sep-02   $3,413.80     $61,448,422
Mar-03   $2,986.20     $53,751,684
Oct-03   $2,230.98     $40,157,554
Jan-04   $1,598.91     $28,780,376
Apr-04   $1,135.70     $20,442,576
Jul-04   $1,107.46     $19,934,346
Oct-04   $1,028.85     $18,519,312
Jan-05   $974.16       $17,534,970
Apr-05   $897.76       $16,159,699
Jul-05   $898.90       $16,180,224
Oct-05   $766.73       $13,801,124
Jan-06   $699.20       $12,585,659
Apr-06   $651.81       $11,732,535
Jul-06   $636.41       $11,455,315
Oct-06   $581.92       $10,474,556
Jan-07   $522.71       $9,408,739
Apr-07   $502.61       $9,047,003
Jul-07   $495.96       $8,927,342
Oct-07   $397.09       $7,147,571
Jan-08   $102.13       $3,063,820
Apr-08   $15.03        $1,352,982
Jul-08   $8.36         $752,080
Oct-08   $3.81         $342,502
Jan-09   $2.59         $232,735
Apr-09   $1.72         $154,714
Jul-09   $1.20         $108,065
Oct-09   $0.78         $70,333
Jan-10   $0.52         $46,774
Apr-10   $0.35         $31,512
Jul-10   $0.35         $31,125
Oct-10   $0.32         $29,092
Jan-11   $0.23         $20,963
Apr-11   $0.19         $16,712
Jul-11   $0.12         $10,497
Oct-11   $0.09         $7,743
Jan-12   $0.09         $7,666
NGS vs Moore’s Law
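To make the comparison concrete, the sketch below estimates how many months it took for the cost per Mb to halve, using three values copied from the cost table above. It assumes a steady exponential decline between the two chosen dates and takes Moore's law as a doubling every 24 months; both are simplifying assumptions, and the choice of Oct-07 as the dividing point is ours, not the slide's.

```python
import math

MONTHS = {name: index for index, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}


def month_index(label):
    """Convert a label such as 'Sep-01' to a running month count since 2000."""
    month, year = label.split("-")
    return int(year) * 12 + MONTHS[month]


def halving_time_months(start, end):
    """Months per halving of cost, assuming exponential decline between points."""
    (date0, cost0), (date1, cost1) = start, end
    elapsed = month_index(date1) - month_index(date0)
    return elapsed / math.log2(cost0 / cost1)


# Cost per Mb in USD, copied from the table.
sep01 = ("Sep-01", 5292.39)
oct07 = ("Oct-07", 397.09)
jan12 = ("Jan-12", 0.09)

MOORE_MONTHS = 24.0  # assumed Moore's-law doubling period

early = halving_time_months(sep01, oct07)
late = halving_time_months(oct07, jan12)
print(f"Sep-01 to Oct-07: cost halves every {early:.1f} months")
print(f"Oct-07 to Jan-12: cost halves every {late:.1f} months")
print(f"Moore's law reference: {MOORE_MONTHS:.0f} months")
```

Under these assumptions the earlier period roughly tracks the Moore's-law pace, while the later period falls several times faster.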
NGS – Things Keep Moving
NGS Processing Pipeline
Outline
• Overview of bioinformatics and BMI
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics
Database Search
• PubMed / Entrez
• Gene information (e.g., Entrez gene database, genecards)
• GenBank (accession number, GI number, version number, etc.)
• File format (e.g., FASTA, SAM, BAM)
• ID conversion (e.g., DAVID)
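Of the file formats listed, FASTA is the simplest: each record is a header line starting with ">" followed by one or more lines of sequence. The following is a minimal sketch of a reader, assuming the whole file fits in memory and that the first word of each header is a unique identifier. The function name and the two example records are hypothetical; for real work an established library is a better choice.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping record ID -> sequence."""
    chunks = {}
    current_id = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            chunks[current_id] = []
        elif current_id is not None:
            chunks[current_id].append(line)  # sequence may span several lines
    return {record_id: "".join(parts) for record_id, parts in chunks.items()}


# Two made-up records for illustration.
example = """>seq1 hypothetical example record
ATGGCC
TGGTAA
>seq2
GATTACA
"""

records = parse_fasta(example)
for record_id, sequence in records.items():
    print(record_id, len(sequence), sequence)
```

SAM and BAM carry alignments rather than bare sequences, and BAM is a compressed binary format, so neither can be read this way.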
Visualization of Genomics Data
• UCSC Genome Browser (http://genome.ucsc.edu/)
• Ensembl (http://www.ensembl.org/index.html)
• Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/)
• VEGA – VErtebrate Genome Annotation database (http://vega.sanger.ac.uk/index.html)
• Integrative Genomics Viewer (IGV) – by Broad Institute
• …
• Check out http://www.openhelix.com/cgi/freeTutorials.cgi
Sequence Alignment
• Dynamic programming
• Global alignment – Needleman-Wunsch algorithm
• Local alignment – Smith-Waterman algorithm
• BLAST
• Entrez Blast tool
• Multiple sequence alignment
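As an illustration of the dynamic-programming idea behind global alignment, here is a minimal Needleman-Wunsch sketch. The scoring values (match +1, mismatch -1, linear gap penalty -2) are arbitrary choices for the example, not values from the slides, and the traceback returns just one of possibly several optimal alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


best, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(best)
print(top)
print(bottom)
```

Smith-Waterman local alignment uses the same recurrence but never lets a cell drop below zero and starts the traceback from the highest-scoring cell instead of the corner.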
Outline
• Overview of bioinformatics and BMI
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics
Outline
• http://web.cse.ohio-state.edu/~raghu/teaching/CSE5599-BMI7830/