
… and Biomedical Informatics

Tu Bao Ho
School of Knowledge Science
Japan Advanced Institute of Science and Technology

Content
• Computational science
• Bioinformatics and biomedical informatics
• Examples in hepatitis study

A number of slides are adapted from:
– American Medical Informatics Association: www.amia.org
– Talk on biomedical informatics, Kun Huang
– Talk of Dr. T. V. Rao, MD
– Talk of Eleftherios P. Diamandis


PITAC report: “Computational Science: Ensuring America’s Competitiveness”

1. A Wake-up Call: The Challenges to U.S. Science Preeminence and Competitiveness
2. Medieval or Modern? Research and Education Structures for the 21st Century
3. Multi-decade Roadmap for Computational Science
4. Sustained Infrastructure for Discovery and Competitiveness
5. Research and Development Challenges
6. Appendices

PITAC: President’s Information Technology Advisory Committee (24 leaders in industry and academia, 6 of them in the Computational Science Subcommittee).

Science: Two, three or four legs?

[Figure: the legs of science: experimentation, computational science, data-intensive science. Sources: CACM, Sep. 2010; CACM, Dec. 2010.]

Data, Information and Knowledge

Metaphor: data is rock; knowledge is ore. Who is the miner?

• Knowledge: Are the size and staff of Chiba University hospital appropriate for such a number of patients?
• Information: Average number of patients each hour, each day, each week, … at Chiba Univ. hospital.
• Data: Number of patients counted at Chiba Univ. hospital by hours, by days of the week, by months.

Computer science vs. computational science

• Computer science:
– Also commonly understood as informatics or information technology, e.g., processing of information by computer.
– Science about producing hardware (computers) and software (programs) for different purposes.
• Computational science:
– Science about using information and mathematics to do research in other sciences, e.g.
– Computational …
– Computational linguistics
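The step from data to information to knowledge in the hospital example can be sketched in a few lines of Python. This is only an illustration: the patient counts, the `average_by` helper, and the capacity threshold are all invented assumptions, not figures from Chiba University hospital.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw data: (day, hour, number of patients counted).
# All numbers are invented for illustration only.
counts = [
    ("Mon", 9, 42), ("Mon", 10, 55), ("Mon", 11, 48),
    ("Tue", 9, 38), ("Tue", 10, 61), ("Tue", 11, 50),
]

def average_by(records, key_index):
    """Group patient counts by one field and average them (data -> information)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[2])
    return {key: mean(values) for key, values in groups.items()}

avg_per_hour = average_by(counts, key_index=1)  # information: average per hour
avg_per_day = average_by(counts, key_index=0)   # information: average per day

# Knowledge-level question: is the hospital sized for this load?
ASSUMED_CAPACITY_PER_HOUR = 50  # hypothetical staffing capacity
overloaded_hours = sorted(
    hour for hour, avg in avg_per_hour.items() if avg > ASSUMED_CAPACITY_PER_HOUR
)
print(avg_per_hour, avg_per_day, overloaded_hours)
```

The raw counts are the data, the averages are the information, and the comparison against capacity is a first step toward the knowledge question of whether the hospital's size and staff are appropriate.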

Computational Science

Computational science “is a rapidly growing multidisciplinary field that uses advanced computing capabilities to understand and solve complex problems”. It fuses three distinct elements:

• Algorithms and modeling and simulation software developed to solve science, engineering, and humanities problems
• Computer and information science that develops and optimizes the system hardware, software, networking and data management components needed to solve computationally demanding problems
• The computing infrastructure that supports both the science and engineering problem solving and the developmental computer and information science.

High performance computing competition

• Japan national key project (2007-2012)
• Our lab PC cluster: 16 nodes, dual Intel Xeon 2.4 GHz CPU/512 KB cache
• Fujitsu K-rack

Book: Computational Science: Ensuring America’s Competitiveness (PITAC: President’s Information Technology Advisory Committee)

Tianhe-1A: 7,168 NVIDIA Tesla M2050 GPUs and 14,336 CPUs (2.507 petaflops)

High performance computing competition

• Japan’s “K computer” (K stands for 10^16 and for “large gateway”): 800 computer racks with Fujitsu ultrafast CPUs, targeting 10 petaflops by 2012 (RIKEN’s Advanced Institute for Computational Science).
• IBM’s BlueGene and BlueWaters computers, targeting 20 petaflops by 2012 (Lawrence Livermore National Laboratory).

Sources:
http://www.fujitsu.com/global/news/pr/archives/month/2010/20100928‐01.html (28.9.2010)
http://www.hightechnewstoday.com/nov‐2010‐high‐tech‐news/38‐nov‐23‐2010‐high‐tech‐news.shtml (Nov. 2010)
Kennichi Miura, DEISA Symposium, 5.2007

Doctors are very accepting of technology!

Computer usage in medicine

• Earliest broad recognition of statistical issues in diagnosis and the potential role of computers occurred in the late 1950s
– “Reasoning foundations of medical diagnosis”: classic article by Ledley and Lusted, which appeared in Science in 1959.
• Computers began to be applied in biomedicine in the 1960s
– Most applications dealt with clinical issues, including diagnostic systems.

Computer usage in medicine

• “Computers in medicine” in the 1960s
– First Federal grant review group
– Most applications dealt with clinical issues
• No consistency in naming the field for many years
– “Computer applications in medicine”
– “Medical information sciences”
– “Medical computer science”
• Emergence in the 1980s of a single, consistent name, derived from the European (French) term for computer science, informatique (informatics)
– Medical Informatics

“Fundamental theorem” for using computers in medicine

Charles P. Friedman. J Am Med Inform Assoc. 2009;16:169-170.

Related disciplines

• Informatics (computer science)
• Computational science
• Medical informatics (clinical informatics)
• Public …
• Biomedical informatics
• Molecular medicine
• Bioinformatics (computational biology)

Medical Informatics

Medical informatics is the intersection of computer science, computational science and health care. It deals with the resources, devices, and methods required to optimize the acquisition, storage, retrieval, and use of information in health and biomedicine.

Medical informatics is rapidly developing

“Medical informatics is the rapidly developing scientific field that deals with resources, devices and formalized methods for optimizing the storage, retrieval and management of biomedical information for problem solving and decision making.”
Edward Shortliffe, M.D., Ph.D. What is medical informatics? Stanford University, 1995.

Examples of medical informatics areas

• Hospital information systems
– Electronic medical records & medical vocabularies
– Laboratory information systems
– Pharmaceutical information systems
– Radiological (imaging) information systems
– Patient monitoring systems
• Clinical decision-support systems
– Diagnosis/interpretation
– Therapy/management

Patient data is the most important resource

As medical knowledge continues to expand rapidly, demands for more efficient coordination of patient data become paramount, and the pressure for improved practice and application of evidence-based medicine increases, medical informatics will have increasing influence in our working lives as clinicians.

Are too slow adopting the change

Medical schools have long recognized the need to revise their teaching methodology, but have been slow to change.

Clinical data vs. omics data!

What is biomedical informatics?

Biomedical informatics (BMI) is the interdisciplinary field that studies and pursues the effective uses of biomedical data, information, and knowledge for scientific inquiry, problem solving, and decision making, motivated by efforts to improve human health.

Biomedical informatics: Corollaries to the definition

1. BMI develops, studies and applies theories, methods and processes for the generation, storage, retrieval, use, and sharing of biomedical data, information, and knowledge.
2. BMI builds on computing, communication and information sciences and technologies and their application in biomedicine.

Source: www.amia.org

Biomedical informatics: Corollaries to the definition

3. BMI investigates and supports reasoning, modeling, simulation, experimentation and translation across the spectrum from molecules to populations, dealing with a variety of biological systems, bridging basic and clinical research and practice, and the healthcare enterprise.
4. BMI, recognizing that people are the ultimate users of biomedical information, draws upon the social and behavioral sciences to inform the design and evaluation of technical solutions and the evolution of complex economic, ethical, social, educational, and organizational systems.

Biomedical informatics in perspective

[Figure: basic research (biomedical informatics methods, techniques, and theories) above applied research and practice in four areas: bioinformatics (molecular and cellular processes), imaging informatics (tissues and organs), clinical informatics (individuals/patients), and public health informatics (populations and society). The areas form a continuum with “fuzzy” boundaries. Labels shown across the areas: biomolecular imaging, pharmacogenomics, consumer health, health informatics.]

Biomedical Informatics ≠ Bioinformatics
Biomedical Informatics ≠ Health Informatics

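KDD: Knowledge Discovery and Data Mining

“Data-driven discovery of models and patterns from massive observational data sets”

[Figure labels: Statistics, Inference; Languages, Representations; Data Management; Applications]

Example: mining associations in market data
Market basket analysis (IBM)

Supermarket data: “Young men buy diapers and beer together” (men who buy disposable diapers often buy canned beer as well).

Sales data → data mining → men aged 20-30, disposable diapers, beer

Interpretation (customer profile): men who were asked to buy diapers also picked up canned beer for themselves while they were at it. This is knowledge that can be put to use in future product placement.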
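The diaper-and-beer pattern is usually quantified with the support, confidence, and lift of an association rule. The sketch below is a minimal illustration on invented baskets; the data and function names are assumptions, and it does not reproduce the IBM analysis mentioned on the slide.

```python
# Hypothetical supermarket baskets, invented for illustration only.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence relative to the baseline frequency of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
rule_lift = lift({"diapers"}, {"beer"}, baskets)
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 suggests that diapers and beer appear together more often than independent purchases would predict, which is the kind of pattern that can inform product placement.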

Text mining: a real example (Swanson, 1997)

Extract pieces of evidence from article titles in the biomedical literature:
• “stress is associated with migraines”
• “stress can lead to loss of magnesium”
• “calcium channel blockers prevent some migraines”
• “magnesium is a natural calcium channel blocker”

Induce a new hypothesis not in the literature by combining culled text fragments with human medical expertise:
• Magnesium deficiency may play a role in some kinds of migraine headache.

Data schemas vs. mining methods

Types of data
• Flat data tables
• Relational databases
• Temporal & spatial data
• Transactional databases
• Multimedia data
• Genome databases
• Materials data
• Textual data
• Web data
• etc.

Mining tasks and methods
• Classification/Prediction
– Decision trees
– Neural networks
– Rule induction
– Support vector machines (SVM)
– Hidden Markov models
– etc.
• Description
– Association analysis
– Clustering
– Summarization
– etc.
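Swanson's example of combining title fragments can be sketched as a search for bridging terms: two concepts that are never linked directly but share intermediate concepts. The code below is only an illustration under strong assumptions: the four titles are hand-encoded as term pairs, and `bridging_terms` is a hypothetical helper, not Swanson's actual procedure. Judging whether a bridge is medically meaningful still requires human expertise.

```python
from collections import defaultdict

# The four title fragments, hand-encoded as undirected (term, term) links.
# This encoding is an assumption; a real system would extract terms from text.
links = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("calcium channel blocker", "migraine"),
    ("magnesium", "calcium channel blocker"),
]

def build_graph(pairs):
    """Build an undirected co-mention graph from term pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def bridging_terms(graph, a, c):
    """Terms linked to both a and c, provided a and c are not linked directly."""
    if c in graph[a]:
        return set()  # already connected in the literature: nothing new
    return graph[a] & graph[c]

graph = build_graph(links)
bridges = bridging_terms(graph, "magnesium", "migraine")
print(bridges)
```

Here magnesium and migraine are never mentioned together, yet both connect to stress and to calcium channel blockers, which is what makes a magnesium-migraine link a candidate hypothesis.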

The KDD process (the process of knowledge discovery and data mining)

1. Understand the domain and define problems
2. Collect and preprocess data (maybe 70-90% of the effort and cost in KDD)
3. Data mining: extract patterns/models (a step in the KDD process consisting of methods that produce useful patterns or models from the data)
4. Interpret and evaluate discovered knowledge
5. Put the results into practical use

KDD is inherently interactive and iterative.
(In many cases, KDD is viewed as data mining.)

Case study: Hepatitis

LC: liver cirrhosis
HCC: hepatocellular carcinoma

The natural course of hepatitis

[Figure: fibrosis stage (F0 to F4) plotted against time since onset of infection, in two panels, one of them marked IFN; hepatitis progresses to LC and HCC over 20-30 years.]

Example of the hepatitis dataset

The hepatitis dataset

• Temporal relational data (Chiba Univ. Hospital)
• Patients’ data contain about 983 tests taken over different periods, varying from several weeks to twenty years
• Irregular time-stamped points

Research problems

P1. What are the differences in temporal patterns between hepatitis B and C (HBV, HCV)?
P2. Can laboratory examinations be used to estimate the stage of liver fibrosis (F0, F1, F2, F3, F4)?
P3. Is the interferon therapy effective or not (Response, Partial response, Aggravation, No response)?

Our solution: temporal abstraction

Example: “ZTT first was increasingly high, then changed to the normal region and stable” is abstracted as ZTT: H>NS.
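The idea of temporal abstraction can be sketched as follows: irregularly time-stamped test values are mapped to qualitative states and consecutive equal states are merged into episodes. This is a simplified illustration, not the APE or TRE method itself; the measurements, the reference range, and the three-state alphabet (L, N, H) are assumptions, and the stability qualifier in “NS” is not modeled.

```python
def to_state(value, low, high):
    """Map a numeric test value to a qualitative state: L (low), N (normal), H (high)."""
    if value > high:
        return "H"
    if value < low:
        return "L"
    return "N"

def abstract_states(series, low, high):
    """Merge consecutive measurements with the same state into (state, start, end) episodes."""
    episodes = []
    for time, value in sorted(series):
        state = to_state(value, low, high)
        if episodes and episodes[-1][0] == state:
            # Same state as the previous measurement: extend the current episode.
            episodes[-1] = (state, episodes[-1][1], time)
        else:
            episodes.append((state, time, time))
    return episodes

def pattern(episodes):
    """Render the episode sequence as a compact string such as 'H>N'."""
    return ">".join(state for state, _, _ in episodes)

# Hypothetical ZTT measurements: (days since first visit, value).
ztt = [(0, 14.5), (35, 16.0), (90, 15.2), (200, 11.0), (410, 9.5), (800, 10.1)]
ZTT_NORMAL = (4.0, 12.0)  # assumed reference range, for illustration only

episodes = abstract_states(ztt, *ZTT_NORMAL)
print(pattern(episodes), episodes)
```

Because the abstraction works on episodes rather than on raw time stamps, patients with different visit schedules and observation lengths can be compared through their patterns.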

Two methods: APE (abstraction pattern extraction) and TRE (temporal relation extraction)

System D2MS

Lessons learned

1. Understanding the domain and determining the target are the primary factors in failure or success.
2. Pre-processing takes more than 90% of the time and effort. “It took 40 years to collect data, months to preprocess data, minutes to learn from data, and hours/days to evaluate the obtained results”.
3. The collaboration between data miners and domain experts is the most decisive factor.
4. Different views on the interestingness of discovered patterns are the main reason for dissatisfaction.
5. Model selection requires the active participation of users, and domain knowledge in mining is crucial.


What is bioinformatics?

• The collection, classification, storage, and analysis of biochemical and biological information using computers, especially as applied in molecular genetics and genomics*.
• The application of math and computing to solve problems in biology. The goal is to understand how the biological system works.

Bioinformatics, computational biology

* Merriam-Webster's Medical Dictionary, © 2002 Merriam-Webster, Inc.

Bioinformatics is about biological data

• Nucleotide – DNA, RNA, …
• Genome – Sequences, chromosomes, expressed data, …
• Protein – Sequences, 3-D structure, interaction, …
• System – Gene network, protein network, TFs, …
• Other – Microarray, images, lab records, journals, literature, …

Explosion of biological data

10,267,507,282 bases in 9,092,760 records.

…TACATTAGTTATTACATTGAGAAACTTTATAA
TTAAAAAAGATTCATGTAAATTTCTTATTTGTTT
ATTTAGAGGTTTTAAATTTAATTTCTAAGGGTT
TGCTGGTTTCATTGTTAGAATATTTAACTTAAT
CAAATTATTTGAATTTTTGAAAATTAGGATTAAT
TAGGTAAGTAAATAAAATTTCTCTAACAAATAA
GTTAAATTTTTAAATTTAAGGAGATAAAAATAC
TACTCTGTTTTATTATGGAAAGAAAGATTTAAA
TACTAAAG…

• Genomics: 25,000 genes
• Proteomics: 2,000,000 proteins
• Metabolomics: 3000 metabolites

From data to knowledge

• Data
– Nucleotide – DNA, RNA, …
– Genome – Sequences, chromosomes, expressed data, …
– Protein – Sequences, 3-D structure, interaction, …
– System – Gene network, protein network, TFs, …
– Other – Mass spec, microarray, images, lab records, journals, literature, …
• Knowledge
– Genotype
– Phenotype
– Genotype-phenotype relationship
– SNPs
– Pathways
– Drug targets

Getting data is “easy”, extracting knowledge is hard!

Computer is intelligent

• Pros
– Repeated work
– Accurate storage
– Precise computation
– Fast communication
– …
• Cons
– Cannot generalize
– No real intelligence
– …

The results must be reviewed and validated by biologists. In addition, biologists must have some understanding of how the computer processes data (algorithms).

The … omics

• Genome
– Definition: Complete set of genes of an organism or its organelles
– Status: Context-dependent (modifications to the yeast genome may be made with exquisite precision)
– Method of analysis: Systematic DNA sequencing
• Transcriptome
– Definition: Complete set of messenger RNA molecules present in a cell, tissue or organ
– Status: Context-dependent (the complement of messenger RNAs varies with changes in physiology, development or pathology)
– Method of analysis: Hybridization arrays, SAGE, high-throughput Northern analysis
• Proteome
– Definition: Complete set of protein molecules present in a cell, tissue or organ
– Status: Context-dependent
– Method of analysis: Two-dimensional gel electrophoresis, peptide mass fingerprinting
• Metabolome
– Definition: Complete set of metabolites (low-molecular-weight intermediates) in a cell, tissue or organ
– Status: Context-dependent
– Method of analysis: Infrared spectroscopy, mass spectroscopy, NMR spectrometry

Elements of bioinformatics

• Genomics is a discipline in genetics concerning the study of the genomes of organisms.
• Transcriptomics (functional genomics) is the study of the transcriptome, the set of RNA molecules produced by transcription, which is the process of creating a complementary RNA copy of a sequence of DNA.
• Proteomics is the large-scale study of proteins, and particularly their structures and functions.
• Metabolomics is the scientific study of chemical processes involving metabolites.

How to extract knowledge?

Biological information → computational tools → knowledge

Computational tools
• Building the databases
• Perform analysis/extract features
• Data fusion/integration
• Data mining/statistical learning
• …/representation

How to extract knowledge? Statistics vs. data mining

• Statistics provides principles and methodology for designing the process of
– Data collection
– Summarizing and interpreting the data
– Drawing conclusions or generalities
• Data mining
– Finding knowledge from data
– Strongly based on statistics, especially modern multivariate statistics
– Also based on other disciplines
– Motivated by real-world problems

Future work

Barabasi A-L, Network medicine – from obesity to “Diseasome”, NEJM, 357(4): 404-407, 2007.

Molecular Medicine

“The branch of medicine that deals with the influence of gene expression on disease processes and with genetically based treatments, such as gene therapy.” (American Heritage Dictionary)

Molecular medicine

• Molecular medicine is a broad field, where physical, chemical, biological and medical techniques are used to describe molecular structures and mechanisms, identify fundamental molecular and genetic errors of disease, and develop molecular interventions to correct them.
• The molecular medicine perspective emphasizes cellular and molecular phenomena and interventions rather than the previous conceptual and observational focus on patients and their organs.

[Figure: Trends in Molecular Medicine]

Our work in biomedicine

• Computational medicine
– Mining stomach cancer data, Tokyo Cancer Center (1999-2003)
– Mining hepatitis data, Chiba University hospital (2001-2005)
– Hepatitis study (2007-)
• Computational biology
– Transcriptional regulation
– Epigenetics
– Protein-protein interactions
– miRNA
– Metabolomics

4 PhD graduated, 4 in progress

HCV NS5A and IFN/RBV therapy

HBV, HCV
• 300 million and 170 million carriers of HBV and HCV worldwide, respectively.
• Lead to liver cirrhosis & HCC
• Molecular mechanisms of hepatitis pathogenesis and hepatitis therapies
• 50% SVR for IFN/RBV
• Why do hepatitis viruses disappear, or not disappear, after the treatment?

RNA interference (RNAi) and hepatitis

• A mechanism wherein small RNAs, esp. miRNA and siRNA, control the expression of genes.
• RNAi can target HBV and HCV genes to inhibit their replication, or host genes required for their replication.
• How to select appropriate siRNA molecules that have satisfactory silencing capabilities or minimum off-target effect and maximum knockdown efficiency?

Fire, A., Mello, C., Nature 391, 1998 (Nobel Prize 2006)

Problem of siRNA selection

• Given:
– A set of siRNAs with scores of knockdown efficacy (about 5000 siRNAs with known scores).
– A number of experimental design rules (obtained from experiments).
• Find methods:
– To predict the score of any siRNA
– To artificially create siRNAs with high scores of knockdown efficacy.

HCV NS5A and IFN/RBV therapy

• NS5A is the protein most reported in interferon resistance (Gao, Nature 465, 2010).
• What is the still enigmatic role of domains II and III (Lemon, 2010)?
• Is V3 a more accurate biomarker than the ISDR region (AlHefnawi, 2010)?
• Distinguish the responders and non-responders between subtypes 1a-c (448 aa) and 1b (447 aa).

The data and two methods for motif finding

The data
• Labeled data: Los Alamos HCV database (134 RVR and 93 NS5A non-SVR sequences).
• Unlabeled data: 5000 NS5A sequences belonging to 6 genotypes, mostly in 1a-c and 1b, taken from Genbank and Nagoya City University.

Two methods for motif finding
• Given
– SVR and non-SVR samples
– unlabeled samples
• Find
1. All strong motifs of type DOOPS (discriminative one occurrence per sequence) for each class, SVR or non-SVR.
2. Strong motifs of type DMOPS (discriminative multi occurrence per sequence) that satisfy two conditions, completeness and discrimination.

Strong motifs are defined in terms of coverage and discrimination ability.
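Coverage and discrimination ability can be illustrated with a naive k-mer search. This is a hedged sketch only: the slides do not give the actual DOOPS or DMOPS algorithms, so the fixed-length exact k-mers, the thresholds `min_cov` and `max_other`, and the toy sequences below are all assumptions made for illustration.

```python
def kmers(sequence, k):
    """All distinct substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def coverage(motif, sequences):
    """Fraction of sequences that contain the motif at least once."""
    return sum(motif in seq for seq in sequences) / len(sequences)

def discriminative_motifs(target, other, k, min_cov=0.75, max_other=0.25):
    """k-mers frequent in `target` (coverage) and rare in `other` (discrimination)."""
    candidates = set().union(*(kmers(seq, k) for seq in target))
    found = {}
    for motif in candidates:
        cov_target = coverage(motif, target)
        cov_other = coverage(motif, other)
        if cov_target >= min_cov and cov_other <= max_other:
            found[motif] = (cov_target, cov_other)
    return found

# Toy amino-acid strings, invented for illustration (not real NS5A data).
svr = ["MKTAVLP", "GGTAVQR", "TAVLLLS", "PQTAVAA"]
non_svr = ["MKSAVLP", "GGSSVQR", "TLVLLLS", "PQAAVAA"]

motifs = discriminative_motifs(svr, non_svr, k=3)
print(motifs)
```

In this toy setting only the 3-mer TAV covers all SVR sequences while being absent from the non-SVR ones; real motif finders would typically also allow gaps or mismatches and would assess statistical significance.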

DOOPS motifs found

DMOPS motifs found

What makes proteomics important?

• There are more than 160,000 genes in each cell, only a handful of which actually determine that cell’s structure.
• Many of the interesting things about a given cell’s current state can be deduced from the type and structure of the proteins it expresses.
• Changes in, for example, tissue types, carbon sources, temperature, and stage in life of the cell can be observed in its proteins.

Conclusion

• Computer science and computational science play an increasingly important role in medicine.
• Might biomedical informatics be to public health in the 21st century what infectious diseases were to public health in the previous centuries?
• Learn and use more molecular biology in medicine through informatics.

Thanks