Novel Bioinformatics Applications for Protein Allergology

AND ! "#$% &'()* +% + ,-.,-/,0 + 121,..0-10- ! 3 4 33!!3 ,,,1/ !"# $% # $# &'()$ $*+,'-./ $ "Por la ciencia, como por el arte, se va al mismo sitio: a la verdad" Gregorio Marañón Madrid, 19-05-1887 - Madrid, 27-03-1960 List of Papers This thesis is based on the following papers, which are referred to in the text by their Roman numerals. I Martínez Barrio, Á., Soeria-Atmadja, D., Nister, A., Gustafsson, M.G., Hammerling, U., Bongcam-Rudloff, E. (2007) EVALLER: a web server for in silico assessment of potential protein allergenicity. Nucleic Acids Research, 35(Web Server issue):W694-700. II Martínez Barrio, Á.∗, Lagercrantz, E.∗, Sperber, G.O., Blomberg, J., Bongcam-Rudloff, E. (2009) Annotation and visualization of endogenous retroviral sequences using the Distributed Annotation System (DAS) and eBioX. BMC Bioinformatics, 10(Suppl 6):S18. III Martínez Barrio, Á., Xu, F., Lagercrantz, E., Bongcam-Rudloff, E. (2009) GeneFinder: In silico positional cloning of trait genes. Manuscript. IV Martínez Barrio, Á., Ekerljung, M., Jern, P., Benachenhou, F., Sperber, G.O., Bongcam-Rudloff, E., Blomberg, J., Andersson, G. (2009) Data mining of the dog genome reveals novel Canine Endogenous Retro- viruses (CfERVs). Manuscript. ∗ Authors contributed equally. I is an open access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by- nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. II is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. List of Additional Papers I Martínez Barrio, Á.∗, Eriksson, O.∗, Badhai, J., Fröjmark, A.-S., Bongcam-Rudloff, E., Dahl, N., Schuster, J. (2009) Resequencing and Analysis of the Diamond-Blackfan Anemia Disease Locus RPS19. PLoS ONE 4(7): e6172. ∗ Authors contributed equally. Contents 1 Introduction . ........................................ 11 2 Background . ........................................ 13 2.1 The immune system ................................. 16 2.2 The central dogma of molecular biology . ................ 20 2.3 Phenotypic traits and association studies . ................ 27 2.4 Retroviruses and their endogenization . ................ 31 2.5 Bioinformatics . .................................. 40 2.6 Statistics and Biology in a complex interplay scenario ....... 41 2.7 The -omics age.................................... 43 3 Aims of this thesis . ................................... 45 4 Methods . ............................................ 47 4.1 Sequence alignments . ............................. 47 4.2 Classification, hypothesis testing and evaluation . ........... 53 4.3 Phylogenetic analysis . ............................. 58 4.4 Databases . ...................................... 60 4.5 Information sources, annotations and ontologies ........... 62 4.6 Web services in bioinformatics ........................ 66 4.7 RetroTectorc ..................................... 69 5 Present investigations ................................... 71 5.1 Paper I: EVALLER - a web server for in silico assessment of potential protein allergenicity. ......................... 71 5.1.1 Future prospects . ............................ 73 5.2 Paper II: Annotation and visualization of endogenous retroviral sequences using the Distributed Annotation System (DAS) and eBioX ........................................... 73 5.2.1 Future prospects . ............................ 74 5.3 Paper III: GeneFinder - in silico positional cloning of trait genes 75 5.3.1 Future prospects . ............................ 76 5.4 Paper IV: Data mining of the dog genome reveals novel Canine Endogenous Retroviruses (CfERVs) ..................... 77 5.4.1 Future prospects . ............................ 79 6 Sammanfattning på Svenska .............................. 81 6.1 Backgrund . ...................................... 81 6.2 Sammanfattning av studierna . ......................... 81 7 Acknowledgements . ................................... 83 References . ............................................ 87 Abbreviations API Application Programming Interface DAS Distributed Annotation System FLAP Filtered Length-adjusted Allergen Peptides DNA Deoxyribonucleic acid ERV Endogenous RetroVirus FN False Negative FP False Positive FPR False Positive Rate GO Gene Ontology HGNC HUGO Gene Nomenclature Committee LINE Long INterspersed Element LTR Long Terminal Repeat mRNA messenger RNA OMIM Online Mendelian Inheritance in Man Pol Polymerase protein REST Representational State Transfer RNA Ribonucleic acid RT Reverse Transcriptase SINE Short INterspersed Element SOA Service Oriented Architectures SOAP Simple Object Access Protocol SQL Structured Query Language TFR Transposon Free Region TFBS Transcription Factor Binding Site WSDL Web Services Description Language XML Extensible Markup Language XSD XML Schema Definition 9 1. Introduction The lapse of time between the almost manually assembled C. elegans [42] and the first human genome draft [118] became the provenance of what today is a common nexus among the Life Sciences fields, a newborn Science called Bioinformatics. Bioinformatics is a highly inter-disciplinary field which is evolving at a exponential pace due to the recent advent of a plethora of new technologies bringing an equally exponential quantity of data. Unfortunately, although the majority of this research data is publicly available, its analysis becomes un- feasible because it may be placed in legacy systems or it may belong to small groups without the skills to publicly offer it. At the same time, the sole analysis of experimental data has turned out not to be significant anymore unless it is collated with the available knowledge sources interconnected across this abstraction we call Internet. These different data sources are enriching our analyses until a turning point where research can be based on expensive public data that it is generated at big biomedical institutions. However, integration of tangled knowledge sources is not an easy task and can be quite cumbersome due to the large number of resources to be dedicated for the efficient accom- plishment of the information technology (IT) management. Recently, several Networks of Excellence led by a combination of technology and biology experts were launched to contribute and develop a pan- European distributed infrastructure for Bioinformatics. Their success has been measured by the establishment of many public and openly released projects that have matured in order to provide means of data integration, annotation, distribution as well as analysis by enacting remotely synchronized and high- performance services. By collaborating in one of these European consortiums called EMBRACE [4], I had the opportunity to participate in several research projects and learn some of the latest information technologies to help accelerate distributed Bioinformatics. The creation of new tools and data sources was driven by an expanding set of test problems and the goal was that they would be easily distributed and integrable as well as possess a biomedical significance. Some of them are explained in this thesis. The following chapters of this thesis introduce the reader to the common and varied Bioinformatics terminology by describing the background of the study as well as the methods utilized. In every paper, a different information 11 architecture is devised to achieve the presented results. Specifically, the first paper is a machine learning approach to allergenicity assessment implemented in a web server which interacts with a specialized trained vector classifier to recognize linear allergen descriptors in proteins. The second paper creates a data resource for human endogenous retroviruses. Thereafter, these proviruses are visualized and correlated with previously reported human genome regions deemed to be transposons-free. The third paper presents GeneFinder, a dedicated pipeline application which queries a mixture of biological data providers and ultimately ranks the collated information to anlyze regions of association to genetic diseases. We made the application available through a web service and tested it with the ridgeless phenotype in dogs mapped by a genome-wide association study [232]. Finally, a fourth paper, presents a prototype pipeline used to characterize previously unknown groups of endogenous retroviruses in dogs and their integrational landscape. 12 2. Background Life in all organisms on Earth is characterised by the capability of respond- ing to stimuli, autonomous reproduction, growth, the ability to show some form of development, and maintenance of homeostasis in order to regulate its internal environment while maintaining a stable, constant condition. The term organism describes any living entity composed

Novel Bioinformatics Applications for Protein Allergology

Lecture 5: Sequence Alignment – Global Alignment

Information-Theoretic Bounds of Evolutionary Processes Modeled As a Protein Communication System

Testing the Independence Hypothesis of Accepted Mutations for Pairs Of

A Thesis Entitled Homology-Based Structural Prediction of the Binding

Bioinformatics Scoring Matrices

Oxidising Bacteria (SAOB)

Lecture 10: Local Alignment and Substitution Matrices 10.1

2-PAM Matrices

Pairwise Alignment

PHAT: a Transmembrane-Specific Substitution Matrix

Substitution Matrices E S V U

Mutational Effects on Protein Structure and Function