Basic Biogrid

Basic BioGrid Rick Stevens Argonne National Laboratory University of Chicago Outline • Biology: The Science for the 21st Century • Sizing up the opportunities for BioGrid • Biology is a different way of thinking • Systems biology and whole cell modeling • Requirements for the BioGrid • A modest proposal • Some recommended reading • Conclusions Rick Stevens Argonne z Chicago Many BioGrid Projects • EUROGRID BioGRID • Asia Pacific BioGRID • NC BioGrid • Bioinformatics Research Network • Osaka University Biogrid • Indiana University BioArchive BioGrid Rick Stevens Argonne z Chicago The New Biology • Genomics • High-throughput methods • Functional Genomics • Low cost • Robotics • Proteomics • Bioinformatics driven • Structural Biology • Quantitative • Gene Expression • Enables a systems view • Metabolomics • Basis for integrative understanding • Advanced Imaging • Global state • Time dependent Rick Stevens Argonne z Chicago Predicting Life Processes: Reverse Engineering Living Systems DNA (storage) Transcription Gene Expression Translation Proteins Proteomics Biochemical Circuitry Environment Metabolomics Phenotypes (Traits) From Bruno Sobral VBI 24 Orders Magnitude of Spatial and Temporal Range 6 10106 DiscreteDiscrete AutomataAutomata 3 modelsmodels 10103 FiniteFinite elementelement EvolutionaryEvolutionary modelsmodels 00 Processes Organisms Processes Organisms 1010 EcosystemsEcosystems andand EpidemiologyEpidemiology 66 1010 OrganOrgan functionfunction 33 ElectrostaticElectrostatic 10 continuum models 10 continuum models CellCell signalling signalling Cells Cells 0 10100 66 DNADNA 1010 replicationreplication Enzyme 33 Enzyme Mechanisms 1010 AbAb initioinitio Mechanisms Size Scale Size Size Scale Size QuantumQuantum ChemistryChemistry 1000 ProteinProtein Biopolymers 10 Biopolymers FoldingFolding 1066 10 EmpiricalEmpirical forceforce fieldfield MolecularMolecular DynamicsDynamics 33 1010 Homology-basedHomology-based FirstFirst PrinciplesPrinciples Atoms Protein modeling Atoms Protein modeling MolecularMolecular DynamicsDynamics 0 10100 Geologic & -15-15 -12-12 -9-9 -6-6 -3-3 00 33 66 99 Geologic & 1010 1010 1010 1010 1010 1010 1010 1010 1010 EvolutionaryEvolutionary TimescaleTimescale (seconds)(seconds) TimescalesTimescales Genes → Cell Networks → Organisms → Populations → Ecosystems Rick Stevens Argonne z Chicago Computational analysis and simulation have important roles in the study of each step in the hierarchy of biological function DNA Protein sequence Protein Protein/enzyme sequence and regulation structure function T A T Promoter A C A Q Sequence G Homology based Molecular Annotation T A Y protein structure simulations Message C prediction C G R T Expt. data Network integration Organism Pathway analysis simulations simulations Bacterial communities Bacteria and cells Metabolic pathways Multi-protein & multicellular organisms & regulatory networks machines The New Biology in Action Hypothetical example: Fill in component in New data showing that a gene forms DNA repair pathway complex with DNA repair enzyme Predict chemical Update network inhibitors model of microbe Predict fold and assign behavior function ATP ADP Adducted DNA-enzyme DNA complex Add function to + genome database Identify regulatory sequence associated with DNA repair ACTGGCATTCA TATAGCT From Mike Colvin, LLNL An Integrated View of Simulation, Experiment, and Bioinformatics Problem Browsing & Simulation Specification Visualization SIMS* Analysis Tools Database LIMS Experimental Experiment Browsing & Design Visualization *Simulation Information Management System Rick Stevens Argonne z Chicago Genomics is Powering the New Biology, but Computing is in the Drivers Seat Ecological Processes and Populations Computation Tissue and Organismal Physiology Cellular & Developmental Morphogenesis and Development Processes Simulation of Metabolic and Signal Transduction Pathways Biochemical Path ways Predicting & Catalysis, Molecular Processes Dynamics Structures of Multi- molecular Predicting Effects of Experiments complexes Variation Gene s, Prot eins, RNAs, an d other Predicting Three-Dimensional Predicting Functions Biomolecules Structures of Proteins and RNAs Predicting Protein Simulating and Understanding Gene Sequence Expression Networks Reconstructing Phylogeny, Genes and Gene Structures Homology, and Comparitive Genome Large-scale Approaches Sequences Genome Sequencing Assembled Genomes Sequence Variation of Rick Stevens PopulationsArgonne z Chicago Systems Biology • Integrative (synthetic) understanding of a biological system • Cell, organism, community and ecosystem • Counterpoint to reductionism • Requires synthesizing knowledge from multiple levels of the system • Discovery oriented not necessarily hypothesis driven • Data mining vs theorem proving Rick Stevens Argonne z Chicago Biology is BIG!! • 3500 Millions years of evolution • ~10M extant species (distinct genomes) • 200 genomes sequenced so far many to go!! • >100M extinct species • 108 average genome size (coding region) • ~108 bp x ~107 sp = 1015 bp of genetic diversity • 1011-1013 total genes and gene products • ~2,000 protein structures determined • typical bacteria has about 3,000 types of proteins Rick Stevens Argonne z Chicago • Overview of Life on Earth • Basic Microbiology Introduction • Excellent Summary From Five Kingdoms by Lynn Margulis and Karlene Schwartz From John Wooley Biomedical Data: High Complexity and Large Scale billions Protein-Protein Physiology Interactions Cellular biology metabolism Biochemistry pathways Polymorphism Neurobiology millions receptor-ligand Endocrinology 4º structure and Variants genetic variants etc. millions Proteins individual patients sequence epidemiology Hundred 2º structure thousands 3º structure ESTs Genetics and Maps Expression patterns MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT... Linkage Large-scale screens DNA sequences Cytogenetic billions alignments Clone-based ...atcgaattccaggcgtcacattctcaattcca... millions Will Biology Dominate the Grid? • The largest science discipline: • The most scientists (Globally ~500,000-1,000,000) • The most research funding (Globally ~$50 Billion/year) • The most graduate students (>20,000 year) • Strong couplings to: • Medicine and human health • Agriculture and food supplies • Energy and ecology • Future industrial processes (bio-nano) • Consumer of other scientific technologies Rick Stevens Argonne z Chicago A Diverse Bacterial Community • Pocket in the hindgut wall of the Sonoran desert termite Pterotermes occidentis • 10 billion bacteria per milliliter • Anoxic environment • ~30 strains are facultative aerobes From Five Kingdoms • Many/most are unknown Lynn Margulis and Karlene Schwartz Rick Stevens Argonne z Chicago Human Microbial Ecology: Sorting Out Cause, Effect + Cure Helicobacter pylori Stomach cancer Cytomegalovirus Stomach ulcers Porphyromonas gingivalis Coronary artery disease Alzheimer's Chlamydia pneumoniae Cervical cancer Human papilloma virus Liver cancer Hepatitis B, C virus Breast cancer EpsteinBarr Virus Nasopharyngeal carcinoma = suspected linkages Rick Stevens Argonne z Chicago An Initial Focus on Prokaryotic life • ~5,000 known species of prokaryotic life • It is estimated that we have identified only 1%-5% of extant species ⇒ 80,000-400,000 species • More diversity than Eukaryotic life forms • More diverse metabolisms, more diverse environments • A Human contains 1012 cells and 1013 bacterial cells • We have little current understanding of human micro ecology Rick Stevens Argonne z Chicago Biological CAD: Enabling BioDesign • Understanding and manipulating biological systems from an information systems standpoint [e.g. organization, communication, transformation] • Genotype + Environment = Phenotype • Strong analogs to VLSI design tools • Ultimately goal is designing new biological structures and systems • Custom microbe design • Metabolic Engineering • Synthetic model organisms Rick Stevens Argonne z Chicago SynechocystisSynechocystis 33563356 GenesGenes 925925 PathwaysPathways Rick Stevens Argonne z Chicago Degree of whole-genome structure & function assignment varies by organism (and required confidence level) Results for three organisms using Genequiz (EBI) Mycoplasma Genitalium Synechocystis sp. Saccharomyces cer. (468 genes) (3168 genes) (6284 genes) High-confidence structure Function assigned No homologues & function by homology by weak homology Function assigned Homologue exists, by strong homology but function unknown Genome Features Annotated Genome Maps Gene Annotations Genomes Comparisons Annotated Data Sets Visualization Visualization Whole genome Analysis Sequence Analysis And Architecture Module Module Predictions of Regulation Predictions of New pathways Conserved Chromosomal Gene Clusters Functions of Hypotheticals Gene Functions Assignments Networks Comparisons Networks Analysis Module Proteomics Experimentation Operons, regulons networks Metabolic Reconstructions (Annotated stoichiometric Matricies) Conjectures about Gene Functions Metabolic Phenotypes Simulation Module Experimentation Metabolic Engineering Reconstructing and Modeling a Prokaryotic Cell? • Typical bacteria [E. coli] • 1000nm x 300nm x 300nm volume • ~4000 genes and gene products • 1/4 genes ⇒ protein synthesis • 1/4 genes ⇒ glycolysis • 1/4 genes ⇒ citric acid cycle • rest genes involved in regulation, synthesis and degrading tasks • Relatively few genes related to sensing and motility • 1,000’s of small molecular species, not tracked individually • ~3 million total large molecules to track Rick Stevens Argonne

Basic Biogrid

Analysis of the Impact of Sequencing Errors on Blast Using Fault Injection

Advancing Solutions to the Carbohydrate Sequencing Challenge † † † ‡ § Christopher J

The ELIXIR Core Data Resources: Fundamental Infrastructure for The

"Phylogenetic Analysis of Protein Sequence Data Using The

EMBL-EBI Powerpoint Presentation

A SARS-Cov-2 Sequence Submission Tool for the European Nucleotide

Webnetcoffee

Genomic Sequencing of SARS-Cov-2: a Guide to Implementation for Maximum Impact on Public Health

Impact of the Protein Data Bank Across Scientific Disciplines.Data Science Journal, 19: 25, Pp

The Biogrid Interaction Database

PINOT: an Intuitive Resource for Integrating Protein-Protein Interactions James E

Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases