Bioinformatics for Cancer Research

Bing Zhang, Ph.D.
Professor of Molecular and Human Genetics
Lester & Sue Smith Breast Center
Baylor College of Medicine

What is bioinformatics
- Bio: hypotheses, questions, samples, experiments
- Informatics: storage/retrieval, visualization, computational methods, statistical methods
- Data: DNA, RNA, protein, metabolite, phenotype; sequence, expression, structure, interaction

Translational Breast Cancer Research, 2016

Why now?
[Diagram repeated from the previous slide: Bio, Informatics, Data]

Roles for different investigators in bioinformatics
- Algorithm developer
  - Statisticians
  - Mathematicians
  - Computer scientists
- Tool developer
  - Bioinformaticians
- Data provider/consumer
  - Biologists
[Figure: iPGDAC team]

Comprehensive list of bioinformatics resources
- October 2016
  - 176 resources
  - 621 databases
  - 1548 tools
http://bioinformatics.ca/links_directory/

Sequence and structure databases
- GenBank: http://www.ncbi.nlm.nih.gov/genbank/
  - Annotated collection of all publicly available DNA sequences
  - 220,731,315,250 bases in 197,390,691 sequences as of October 2016
  - Whole Genome Shotgun (WGS) data: ftp://ftp.ncbi.nih.gov/ncbi-asn1/wgs and ftp://ftp.ncbi.nih.gov/genbank/wgs
  - WGS: 1,676,238,489,250 bases in 363,213,315 sequences as of October 2016
- UniProt: http://www.uniprot.org/
  - Comprehensive resource for protein sequences and functional information
  - 552,259 reviewed entries as of October 2016
- PDB: http://www.rcsb.org/
  - 3D structures of large biological molecules, including proteins, nucleic acids, and complex assemblies
  - 123,870 structures as of October 2016
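Sequence records from databases such as GenBank are commonly exchanged as FASTA text. The sketch below is one possible way to work with such records, not something taken from the slides: it builds a request URL for the NCBI E-utilities `efetch` service (an assumption; the slides only link to the GenBank web page), parses FASTA text into a dictionary, and computes GC content. The helper names, the `ACCESSION` placeholder, and the toy sequences are all hypothetical.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service (assumed access route;
# the slides themselves only reference the GenBank web site).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str) -> str:
    """Build a URL requesting one nucleotide record as plain-text FASTA."""
    query = urlencode(
        {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {record id: sequence}.

    The record id is the first whitespace-delimited token after '>'.
    """
    chunks: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


# Toy FASTA text for illustration only; these are not real database records.
TOY_FASTA = """>toy_seq_1 illustrative record
ATGC
GGCC
>toy_seq_2
ATAT
"""

records = parse_fasta(TOY_FASTA)
url = efetch_fasta_url("ACCESSION")  # "ACCESSION" is a placeholder, not a real id
```

Parsing is kept separate from downloading so the same functions work on a file saved from the GenBank web page.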
- Pfam: http://pfam.xfam.org/
  - Collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
  - 16,306 families as of October 2016

Genome browsers
Graphical interface for browsing and visualizing genome-wide sequence and annotation data.
- UCSC genome browser
  - http://genome.ucsc.edu/cgi-bin/hgGateway
- Integrative Genomics Viewer (IGV)
  - http://software.broadinstitute.org/software/igv/
- Ensembl genome browser
  - http://www.ensembl.org/index.html

UCSC genome browser screenshot
[Screenshot. Tracks shown: base position, GenCode annotation, RefSeq annotation, OMIM alleles, human mRNAs, tissue expression, regulatory elements, comparative genomics]

Genome browsers
[Page image: Robinson et al., "Integrative genomics viewer", Nature Biotechnology, volume 29, number 1, January 2011, p. 24. Text of the letter as shown on the slide:]

Integrative genomics viewer

To the Editor:
Rapid improvements in sequencing and array-based platforms are resulting in a flood of diverse genome-wide data, including data from exome and whole-genome sequencing, epigenetic surveys, expression profiling of coding and noncoding RNAs, single nucleotide polymorphism (SNP) and copy number profiling, and functional assays. Analysis of these large, diverse data sets holds the promise of a more comprehensive understanding of the genome and its relation to human disease. Experienced and knowledgeable human review is an essential component of this process, complementing computational approaches. This calls for efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data. However, the sheer volume and scope of data pose a significant challenge to the development of such tools.

To address this challenge, we have developed the Integrative Genomics Viewer (IGV), a lightweight visualization tool that enables intuitive real-time exploration of diverse, large-scale genomic data sets on standard desktop computers. It supports flexible integration of a wide range of genomic data types including aligned sequence reads, mutations, copy number, RNA interference screens, gene expression, methylation and genomic annotations (Supplementary Fig. 1). The IGV makes use of efficient, multi-resolution file formats to enable real-time exploration of arbitrarily large data sets over all resolution scales, while consuming minimal resources on the client computer (Supplementary Notes). Navigation through a data set is similar to that of Google Maps, allowing the user to zoom and pan seamlessly across the genome at any level of detail from whole genome to base pair (Supplementary Fig. 2). Data sets can be loaded from local or remote sources, including cloud-based resources, enabling investigators to view their own genomic data sets alongside publicly available data from, for example, The Cancer Genome Atlas, 1000 Genomes (http://www.1000genomes.org/) and ENCODE (http://www.genome.gov/10005107) projects. In addition, IGV allows collaborators to load and share data locally or remotely over the internet. IGV supports concurrent visualization of diverse data types across hundreds, ...

IGV: copy number, expression and mutation data grouped by tumor subtype
[Figure panels: Classical, Neural, Proneural, Mesenchymal; locus shown: EGFR]
Figure 1. Copy number, expression and mutation data grouped by tumor subtype. This figure illustrates an integrated, multi-modal view of 202 glioblastoma multiforme samples from The Cancer Genome Atlas (TCGA). Copy number data are segmented values from Affymetrix (Santa Clara, CA, USA) SNP6.0 arrays. Expression data are limited to genes represented on all TCGA-employed platforms and displayed across the entire gene locus. Red shading indicates relative upregulation of a gene and the degree of copy gain of a region; blue shading indicates relative downregulation and copy loss. Small black squares indicate the position of point missense mutations. Samples are grouped by tumor subtype (2nd annotation column) and data type (1st sample annotation column) and sorted by copy number of the EGFR locus. Linking by sample attributes ensures that the order of sample tracks is consistent across data types within their respective tumor subtypes.
Robinson et al. Nat Biotechnol, 2011

Genome browsers
IGV: view of aligned reads at 20 kb resolution
[Screenshot]
Robinson et al. Nat Biotechnol, 2011

Gene-centric databases
- Entrez Gene
  - http://www.ncbi.nlm.nih.gov/gene
  - NCBI/NIH
  - All completely sequenced genomes
  - One gene per page
- Ensembl BioMart
  - http://www.ensembl.org/biomart/martview
  - EMBL-EBI and Sanger Institute
  - Vertebrates and other selected eukaryotic species
  - Batch information retrieval

Gene/protein expression data repositories
- Gene Expression Omnibus (GEO)
  - http://www.ncbi.nlm.nih.gov/geo/
- ArrayExpress
  - http://www.ebi.ac.uk/arrayexpress/
- PRIDE
  - https://www.ebi.ac.uk/pride/archive/

Pathway and network databases
- Gene Ontology (GO): http://www.geneontology.org/
- Pathway databases
  - KEGG: http://www.genome.jp/kegg/pathway.html
  - Reactome: http://www.reactome.org/
  - WikiPathways: http://www.wikipathways.org/
- Protein-protein interaction databases
  - DIP: http://dip.doe-mbi.ucla.edu/
  - MINT: http://mint.bio.uniroma2.it/mint/
  - BioGRID: http://www.thebiogrid.org/
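Protein-protein interaction databases such as those listed above essentially provide lists of interacting pairs. As a minimal sketch of how such a list might be used, the code below loads pairs into an undirected network and ranks proteins by number of partners. The edge list is a small hand-written toy example, not an export from DIP, MINT, or BioGRID, and the function names are hypothetical.

```python
from collections import defaultdict


def build_network(pairs):
    """Turn (protein_a, protein_b) pairs into an undirected adjacency map."""
    network = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue  # ignore self-interactions in this sketch
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def degree_ranking(network):
    """Proteins ordered by number of partners (ties broken alphabetically)."""
    return sorted(network, key=lambda protein: (-len(network[protein]), protein))


# Toy edge list for illustration only (not downloaded from any database).
TOY_PAIRS = [
    ("EGFR", "GRB2"),
    ("GRB2", "SOS1"),
    ("EGFR", "SHC1"),
    ("SHC1", "GRB2"),
]

network = build_network(TOY_PAIRS)
ranking = degree_ranking(network)
```

Using sets for neighbors means a pair reported more than once, or in both orientations, is counted only once.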
