
Introduction to Bio(medical)-Informatics

Kun Huang, PhD Raghu Machiraju, PhD

Department of Biomedical Informatics
Department of Computer Science and Engineering
The Ohio State University

Outline

• Overview of bioinformatics and BMI
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics

Technical areas of bioinformatics

Databases and Query

IMMPORT https://immport.niaid.nih.gov/immportWeb/home/home.do?loginType=full#


Visualization

http://supramap.osu.edu

Network

Visualization and Visual Analytics
§ Visualizing genome data annotation

Network Mining

Hopkins 2007, Nature Biotechnology, Network pharmacology

Role of Bioinformatics – Hypothesis Generation

Hypothesis generation

Bioinformatics

Hypothesis testing

Training for Bioinformatics

§ Biology / medicine
§ … – programming, …, machine learning, …, network analysis, visualization, …, etc
§ Statistics – hypothesis testing, permutation method, multivariate analysis, Bayesian method, etc

Beyond Bioinformatics

• Translational Bioinformatics
• Biomedical Informatics
• Medical Informatics
• Nursing Informatics
• Imaging Informatics
• Bioimage Informatics
• Clinical Research Informatics
• Pathology Informatics
• …
• Health Analytics
• …

Source: Department of Biomedical Informatics

Beyond Bioinformatics

[Figure labels: Biomedical Informatics Theories & Methods (Basic Science); Bioinformatics, Imaging & Structural Informatics, Clinical Informatics, Public Health Informatics, each with Technologies & Tools (Applied Science); Clinical Research Informatics; Translational Bioinformatics]

Source: Department of Biomedical Informatics

Beyond Bioinformatics

[Figure labels: Driving Biological Discovery; Generating & Translating Knowledge; Advancing Personalized Healthcare. Innovation Focus Areas: Translational Bioinformatics, Clinical Informatics, Bioinformatics, Clinical Research Informatics, Computational Biology, Imaging Informatics. Cross-Cutting Competencies: Human Factors, Knowledge Engineering, High Performance Computing, Data Science]

Source: Department of Biomedical Informatics

Outline

• Overview of bioinformatics and BMI
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics

Public Data

• Clinical Data or Health Records
  § Local cancer registries and SEER
  § Osteoarthritis Initiative (http://www.oai.ucsf.edu/datarelease/)
  § Framingham Heart Study (http://www.framinghamheartstudy.org/research/index.html)
  § WHO clinical trials data registry (http://apps.who.int/trialsearch/)
  § dbGaP: http://www.ncbi.nlm.nih.gov/gap
  § Women's Health Initiative (http://www.whiscience.org/data/)

Public Data

• Molecular Data
  § NCBI Gene Expression Omnibus (GEO)
  § The Cancer Genome Atlas (TCGA). Goal: 500 patients for each type of cancer, more than 20 types of cancer; genotype, gene expression, microRNA, CGH, DNA methylation, histological images, clinical*
  § 1000 Genomes/HapMap
  § International cancer consortium (Goal: 1000 patients for each cancer)
  § CCLE – Broad Institute/Novartis
  § ENCODE

Workflow to Mine Frequent Co-expression Network

Zhang J, Lu K, Xiang Y, Islam M, et al. (2012) PLoS Comput Biol 8(8): e1002656.
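As a rough illustration of what a co-expression network is, the sketch below links two genes when their expression profiles are strongly correlated across samples. It is not the workflow of the cited paper: the Pearson correlation, the 0.9 threshold, and the gene names and values are all assumptions chosen for a toy example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.9):
    """Return (gene1, gene2, r) for gene pairs with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges


# Toy expression profiles: hypothetical genes, four samples each.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls as geneA rises
    "geneD": [1.0, 3.0, 1.0, 3.0],  # only weakly related to the others
}
print(coexpression_edges(toy))
```

In this toy data geneA, geneB and geneC end up pairwise connected, while geneD stays isolated because its correlations fall below the threshold.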

Public Data

• Molecular Data

Outline

• Overview of bioinformatics and BMI
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics

Lodish et al, Molecular Cell Biology

Central Dogma
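The central dogma (DNA to RNA to protein) can be sketched in a few lines. This is a minimal illustration only: the DNA string is made up, and the codon table is deliberately partial (only the codons needed for the example), which is an assumption of this sketch rather than a full genetic code.

```python
# Partial standard codon table: only the codons used in the toy example.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the start codon
    "GCC": "A",  # alanine
    "AAA": "K",  # lysine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop codon
}


def transcribe(coding_dna):
    """Coding-strand DNA -> mRNA: same sequence with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCAAATGGTAA"  # hypothetical coding sequence
mrna = transcribe(dna)
print(mrna)             # AUGGCCAAAUGGUAA
print(translate(mrna))  # MAKW
```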

Eukaryotic Nuclear Gene Structure

[Figure labels: TSS (+1), TES, Splicing]

Searle and Hopkins, BJA, 2009

TSS: Transcription Starting Site; TES: Transcription “Ending” Site

Alternative Splicing

"the discovery that genes in eukaryotes are not contiguous strings but contain introns, and that the splicing of messenger RNA to delete those introns can occur in different ways, yielding different proteins from the same DNA sequence".

Splicing

Searle and Hopkins, BJA, 2009
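A toy sketch of alternative splicing: the same pre-mRNA, spliced with different exon choices, gives different mature mRNAs. The sequence is invented, and marking exons in uppercase and introns in lowercase is a convention assumed here purely for illustration.

```python
import re


def find_exons(pre_mrna):
    """Return (start, end) spans of uppercase runs, treated here as exons."""
    return [match.span() for match in re.finditer(r"[A-Z]+", pre_mrna)]


def splice(pre_mrna, exon_spans):
    """Join the selected exons in order, dropping everything in between."""
    return "".join(pre_mrna[start:end] for start, end in exon_spans)


# Hypothetical pre-mRNA: exon 1, intron, exon 2, intron, exon 3.
pre_mrna = "AUGGCC" + "guaag" + "AAAUGG" + "guccag" + "GAUUAA"
exons = find_exons(pre_mrna)

full_isoform = splice(pre_mrna, exons)                     # exons 1 + 2 + 3
skipped_isoform = splice(pre_mrna, [exons[0], exons[2]])   # exon 2 skipped

print(full_isoform)     # AUGGCCAAAUGGGAUUAA
print(skipped_isoform)  # AUGGCCGAUUAA
```

The two isoforms differ in length and sequence, so translating them would give different proteins from the same gene.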

http://en.wikipedia.org/wiki/Alternative_splicing

Post-genomic Era: Genome Annotation
• The value of a genome is in its annotations

Lodish et al, Molecular Cell Biology

Protein folding and structure

Sequencing
• Sanger sequencing

Fred Sanger
• Nobel prize in chemistry in 1958 "for his work on the structure of proteins, especially that of insulin"
• Nobel prize in chemistry in 1980 "for their contributions concerning the determination of base sequences in nucleic acids"

Winnick, The Scientist, 18(18), 2004

Sequencing Technology
• Automatic sequencers

Leroy Hood

Winnick, The Scientist, 18(18), 2004

Initial Analysis of the Human Genome

Next Generation Sequencing (NGS)

Sequencers and capacity
• Sequencers
  § Roche/454
  § Illumina (Solexa)
  § Life Technologies (ABI) SOLiD
• Both DNA and RNA
• Short sequences
  § 32-150bp for Illumina and SOLiD
  § 400-1000bp for 454
• Ultra-high throughput
  § Up to 600G bases per run (Illumina and SOLiD) (a human genome is about 3G bases)
  § 1G bases per run (454)

Major Sequencing Platforms

Illumina
• Leading platform
• Bridge amplification
• Sequencing-by-synthesis

SOLiD (Life Technologies)
• Emulsion PCR (beads)
• Sequencing-by-ligation
• Color space

454 (Roche)
• Pyrosequencing
• Long(er) reads, 400bp
• First platform

Pacific Biosciences
• Single molecule
• Nano hole
• Very long reads (<5K)

Ion Torrent
• Personal genome analyzer
• Ion-sensitive semi-conductor
• Cheap equipment, lower throughput, higher error rates (now)
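The per-run capacities quoted for these platforms can be turned into average fold coverage of a human genome with simple arithmetic (bases sequenced divided by genome size). The sketch below uses the slide's figures of 600G and 1G bases per run and a 3G-base genome; the 30x target and 100 bp read length in the last call are hypothetical values chosen only to show the calculation.

```python
HUMAN_GENOME_BASES = 3_000_000_000  # "about 3G bases", as quoted on the slide


def fold_coverage(bases_sequenced, genome_size=HUMAN_GENOME_BASES):
    """Average depth C = (total bases sequenced) / (genome size)."""
    return bases_sequenced / genome_size


def reads_needed(target_coverage, read_length, genome_size=HUMAN_GENOME_BASES):
    """Smallest read count N with N * read_length >= target_coverage * genome_size."""
    total_bases = target_coverage * genome_size
    return -(-total_bases // read_length)  # ceiling division on integers


print(fold_coverage(600_000_000_000))  # one Illumina/SOLiD run: 200.0
print(fold_coverage(1_000_000_000))    # one 454 run: roughly a third of a genome
print(reads_needed(30, 100))           # hypothetical 30x with 100 bp reads: 900000000
```

This is an idealized average: it ignores duplicate reads, unmappable regions, and uneven coverage.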

Main Applications of NGS

Sequence DNA
• De novo sequencing
• Reference-based re-sequencing
• SNP, CNV, Indels
• Metagenomics
  § Identify “who is there?” in a mixture of microbes

Sequence RNA
• RNA-seq (transcriptome-wide sequencing)
• miRNA-seq
• novel ncRNAs

Study Protein-DNA/RNA interaction
• ChIP-seq (for TF, Pol II binding)
• CLIP-seq (for RNA binding proteins)

Epigenetics
• DNA methylation
• Histone modification (ChIP-seq)
• Nucleosome positioning
• Chromosome looping

[Figure labels: Illumina – Solexa; Roche – 454; Life Technology – SOLiD]

Illumina sequencing

• Sequencing by synthesis

Quality Scores

Sequence Files

Date     Cost per Mb   Cost per Genome
Sep-01   $5,292.39     $95,263,072
Mar-02   $3,898.64     $70,175,437
Sep-02   $3,413.80     $61,448,422
Mar-03   $2,986.20     $53,751,684
Oct-03   $2,230.98     $40,157,554
Jan-04   $1,598.91     $28,780,376
Apr-04   $1,135.70     $20,442,576
Jul-04   $1,107.46     $19,934,346
Oct-04   $1,028.85     $18,519,312
Jan-05   $974.16       $17,534,970
Apr-05   $897.76       $16,159,699
Jul-05   $898.90       $16,180,224
Oct-05   $766.73       $13,801,124
Jan-06   $699.20       $12,585,659
Apr-06   $651.81       $11,732,535
Jul-06   $636.41       $11,455,315
Oct-06   $581.92       $10,474,556
Jan-07   $522.71       $9,408,739
Apr-07   $502.61       $9,047,003
Jul-07   $495.96       $8,927,342
Oct-07   $397.09       $7,147,571
Jan-08   $102.13       $3,063,820
Apr-08   $15.03        $1,352,982
Jul-08   $8.36         $752,080
Oct-08   $3.81         $342,502
Jan-09   $2.59         $232,735
Apr-09   $1.72         $154,714
Jul-09   $1.20         $108,065
Oct-09   $0.78         $70,333
Jan-10   $0.52         $46,774
Apr-10   $0.35         $31,512
Jul-10   $0.35         $31,125
Oct-10   $0.32         $29,092
Jan-11   $0.23         $20,963
Apr-11   $0.19         $16,712
Jul-11   $0.12         $10,497
Oct-11   $0.09         $7,743
Jan-12   $0.09         $7,666

NGS vs Moore’s Law
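One way to make the comparison concrete is to estimate how many months it took the cost per genome to halve, using three rows of the cost table (Sep-01, Oct-07, Jan-12). The 24-month halving period used for Moore's law is the commonly quoted figure and is an assumption of this sketch, as is the choice of Oct-07 as the split point.

```python
from math import log2

# Cost-per-genome figures (USD) copied from three rows of the table.
COST_PER_GENOME = {
    (2001, 9): 95_263_072,
    (2007, 10): 7_147_571,
    (2012, 1): 7_666,
}

MOORE_HALVING_MONTHS = 24  # assumption: the commonly quoted two-year period


def months_between(start, end):
    """Number of months from one (year, month) pair to another."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])


def halving_time_months(start, end, costs=COST_PER_GENOME):
    """Average number of months for the cost to halve between two dates."""
    halvings = log2(costs[start] / costs[end])
    return months_between(start, end) / halvings


periods = [
    ((2001, 9), (2007, 10)),  # before NGS platforms became widespread
    ((2007, 10), (2012, 1)),  # after
    ((2001, 9), (2012, 1)),   # whole table
]
for start, end in periods:
    t = halving_time_months(start, end)
    print(f"{start} -> {end}: cost halved about every {t:.1f} months "
          f"(Moore's law assumption: {MOORE_HALVING_MONTHS})")
```

Under these assumptions the earlier period declines at a pace comparable to Moore's law, while the later period declines several times faster.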

NGS – Things keep Moving

NGS Processing Pipeline

Outline

• Overview of bioinformatics and BMI
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics

Database Search
• PubMed / Entrez
• Gene (e.g., Entrez gene database, GeneCards)
• GenBank (accession number, GI number, version number, etc)
• File format (e.g., FASTA, SAM, BAM)
• ID conversion (e.g., DAVID)
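Of the file formats listed, FASTA is simple enough to read by hand: each record is a header line starting with ">" followed by one or more sequence lines. The parser below is a minimal sketch using only the standard library; the function name and the two example records are hypothetical.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)  # emit the last record


example = """\
>seq1 hypothetical record
ACGTACGT
ACGT
>seq2 another hypothetical record
GGGCCC
""".splitlines()

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

For real work an established library such as Biopython is usually a better choice, since it also handles formats like SAM and BAM that are not practical to parse by hand.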

Visualization of Genomics Data
• UCSC Genome Browser (http://genome.ucsc.edu/)
• Ensembl (http://www.ensembl.org/index.html)
• Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/)
• VEGA – VErtebrate Genome Annotation database (http://vega.sanger.ac.uk/index.html)
• Integrative Genomics Viewer (IGV) – by Broad Institute
• …

• Check out http://www.openhelix.com/cgi/freeTutorials.cgi

Sequence Alignment
• Dynamic programming
• Global alignment – Needleman-Wunsch algorithm
• Local alignment – Smith-Waterman algorithm
• BLAST
• Entrez Blast tool
• Multiple sequence alignment
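The two dynamic programming algorithms named above differ mainly in how the score table is initialized and whether scores may drop below zero. The sketch below computes only the optimal score (no traceback) with a simple linear gap penalty; the scoring values (match +1, mismatch -1, gap -1) are assumptions for illustration, not parameters given in the slides.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score: both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: b aligned to gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a aligned to gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]  # bottom-right cell


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)  # borders are 0: alignments can start anywhere
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # floor at 0
            curr.append(cell)
            best = max(best, cell)  # the answer can be anywhere in the table
        prev = curr
    return best


print(needleman_wunsch_score("GATTACA", "GATTACA"))  # 7
print(smith_waterman_score("TTTACGTTT", "GGACGGG"))  # 3 (shared "ACG")
```

Both functions keep only two rows of the table, which is enough for the score; recovering the alignment itself would require storing the full table or traceback pointers.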

Outline

• Overview of bioinformatics and BMI
• Other people’s (public) data
• Basic biology review
• Basic bioinformatics techniques
• Logistics

Outline
• http://web.cse.ohio-state.edu/~raghu/teaching/CSE5599-BMI7830/