Bioinformatics to Integrate Protein and Gene Information in a Relational Context, Application to Human Proteomic and Transcriptomic Data
Total Page:16
File Type:pdf, Size:1020Kb
TESIS DOCTORAL BIOINFORMATICS TO INTEGRATE PROTEIN AND GENE INFORMATION IN A RELATIONAL CONTEXT, APPLICATION TO HUMAN PROTEOMIC AND TRANSCRIPTOMIC DATA CONRAD FRIEDRICH DROSTE DIRECTOR DR. JAVIER DE LAS RIVAS SANZ SALAMANCA, JULIO DE 2017 El Dr. Javier De Las Rivas Sanz, con D.N.I. 15949000H, Investigador Científico del Consejo Superior de Investigaciones Científicas (CSIC), director del grupo de Bioinformática y Genómica Funcional en el Instituto de Biología Molecular y Celular del Cáncer (CiC-IBMCC), y profesor del Programa de Doctorado y del Máster de Biología y Clínica del Cáncer de dicho Instituto y la Universidad de Salamanca (USAL). CERTIFICA Que ha dirigido esta Tesis Doctoral titulada "BIOINFORMATICS TO INTEGRATE PROTEIN AND GENE INFORMATION IN A RELATIONAL CONTEXT, APPLICATION TO HUMAN PROTEOMIC AND TRANSCRIPTOMIC DATA" realizada por D. Conrad Friedrich Droste, alumno del Programa de Doctorado de 2012/2013 de la Universidad de Salamanca. y AUTORIZA La presentación de la misma, considerando que reúne las condiciones de originalidad y contenidos requeridos para optar al grado de Doctor por la Universidad de Salamanca. En Salamanca, a 7 de julio de 2017 Dr. Javier De Las Rivas Sanz To my parents, Heike and Friedel and my loved ones. I have to apologize to Pigena for all the time I missed to enjoy with her. ACKNOWLEDGEMENTS & APPRECIATIONS I am deeply grateful to my Ph.D. director Dr. Javier De Las Rivas for the opportunity to realize this work in his research group. His guidance, encouragement and support during these time were always appreciated and needed. I owe a very important debt to Dr. Manuel Fuentes and Paula Díez which allowed me to support their proteomic research. Without their tremendous efforts to generate the proteomic data and interpret the results a big part of this Ph.D. project would have not been achieved. My deepest heartfelt appreciations go to all the current and previous members of the Bioinformatics and Functional Genomics group of the Cancer Research Center Salamanca (CIC- IBMCC). I only can say thank you, for the countless support and encouragement during this Ph.D. time. I also thank all members of the Cancer Research Center from Salamanca, without this precious scientific environment this research could not have been done. I would also like to express my gratitude to the Junta de Castilla y León for their financial support during my predoctoral time. TABLE OF CONTENTS ACKNOWLEDGEMENTS & APPRECIATIONS..............................................................................7 TABLE OF CONTENTS................................................................................................................9 1 INTRODUCTION ...............................................................................................................13 1.2 Status of biological network analysis .......................................................................21 1.2.1 Graph Theory.....................................................................................................21 1.2.2 Biological network datasets ..............................................................................22 1.2.2.1 Protein-Protein Interaction networks ............................................................22 1.2.2.2 Biological Pathways ........................................................................................24 1.2.3 Bioinformatic tools available.............................................................................25 2 OBJECTIVES......................................................................................................................31 2.1 Problem position and hypothesis.............................................................................31 2.2 Objectives.................................................................................................................31 3 MATERIAL AND METHODS ..............................................................................................33 3.1 Data to build pathway-derived networks in an expression specific context............33 3.1.1 Network specific data........................................................................................33 3.1.1.1 Kyoto Encyclopedia of Genes and Genomes (KEGG)..................................33 3.1.1.2 Protein-Protein Interaction data (APID) .....................................................35 3.1.2 Expression data .................................................................................................35 3.1.2.1 Expression Sequence Tags (ESTs) ...............................................................36 3.1.2.2 Microarray expression datasets .................................................................36 3.1.2.2.1 Pre-processed datasets of Gene Expression Barcode 3.0....................36 3.1.2.2.2 B- and T-lymphocytes datasets............................................................36 3.1.2.2.3 Ramos cell line dataset........................................................................37 3.1.2.3 RNA-Seq dataset.........................................................................................37 3.2 Data used for the qualitative proteomics analyses ..................................................37 3.2.1 Qualitative analysis of Ramos Burkitt's lymphoma-derived B-cell line .............38 9 / 144 3.2.1.1 Microarray expression dataset...................................................................38 3.2.1.2 Mass Spectrometry derived proteomics dataset .......................................38 3.2.2 Integrated analysis of MS/MS, RNA-Seq and Afiinity Proteomics of Ramos B- Cell data .........................................................................................................................38 3.2.2.1 RNA-Seq dataset.........................................................................................39 3.2.2.2 LC-MS/MS dataset......................................................................................39 3.2.2.3 Antibody dataset ........................................................................................39 3.3 Quantitative analysis of proteome and phosphoproteome of B-cell lymphocytosis of patients’ samples...........................................................................................................39 3.4 Methodoly and technical environment....................................................................40 3.4.1 Technical environment of the Path2enet tool ..................................................40 3.4.1.1 The general environment...........................................................................40 3.4.1.2 R and Bioconductor environment ..............................................................40 3.4.1.3 The basic R-libraries and R-packages of the Path2enet tool ......................41 3.4.1.4 MySQL ........................................................................................................41 3.4.1.5 Network parameters ..................................................................................42 3.4.2 Processing of the microarray datasets ..............................................................43 3.4.3 Processing of the RNA-Seq datasets..................................................................43 4 RESULTS...........................................................................................................................45 4.1 Processing of biological data to build the biomolecular networks ..........................45 4.1.1 ID mapping ........................................................................................................45 4.1.1.1 Generating ID mapping for KeggXML2SQLDatabase and EST dataset .......46 4.1.1.2 Generating ID mapping for the transcriptomic datasets............................47 4.1.2 Setting up the network data included in Path2enet tool ..................................48 4.1.2.1 Setting up the MySQL database .................................................................48 4.1.2.2 APID............................................................................................................48 4.1.2.3 KEGG...........................................................................................................49 4.1.2.3.1 Structure of the data KEGG PATHWAY provides .................................49 4.1.2.3.2 Generating KEGG database inside Path2enet .....................................50 4.1.2.4 Access to the database via R ......................................................................54 4.1.3 Processing the transcriptomic datasets ............................................................56 4.1.3.1 EST data......................................................................................................56 4.1.3.2 Microarray Gene Expression Barcode ........................................................57 4.1.3.3 RNA-Seq data..............................................................................................57 4.2 Path2enet tool to generate, analyse and visualize the pathway-driven biomolecular networks ............................................................................................................................58 4.2.1 Generating the biomolecular networks ............................................................58 10 / 144 4.2.1.1 Graphs of the Path2enet tool.....................................................................59 4.2.1.2 Attributes of the graphs in Path2enet........................................................62 4.2.1.3 Visualization of