Master Thesis
Transcriptome progression in community acquired pneumonia

Faculty of Applied Computer Sciences and Biosciences

Author: B.Sc. Martin Zerbst
Study Programme: Molecular Biology/Bioinformatics
Seminar Group: Mo15
First Referee: Prof. Dr. rer. nat. Dirk Labudde
Second Referee: PD Dr. rer. nat. habil. Hans Binder

Mittweida, January 2018

Bibliographic Information

Zerbst, Martin: Transcriptome progression in community acquired pneumonia, 75 pages, 13 figures, Hochschule Mittweida, University of Applied Sciences, Faculty of Applied Computer Sciences and Biosciences, Master Thesis, 2018

Abstract

Community acquired pneumonia (CAP) is a very common infectious and sometimes lethal disease. It is therefore associated with high costs of diagnosis and treatment. To reduce these health care costs, diagnosis and treatment must become cheaper without any loss in predictive accuracy. One effective way to achieve this would be the identification of easily detectable and highly specific transcriptomic markers, which would reduce the amount of laboratory work required while possibly improving diagnostic capability. Transcriptomic whole blood data derived from the PROGRESS study were combined with several documented features such as age, smoking status or the SOFA score. The analysis pipeline comprised self organizing maps for dimensionality and noise reduction, as well as diffusion pseudotime (DPT). Pseudotime made it possible to model a disease course of CAP, in which each sample represents a state/time point of the modelled course. Combining both methods resulted in a proposed disease course of CAP, described by 1476 marker genes. An additional geneset analysis provided information about the immune-related functions of these marker genes.
Contents

Contents
List of Figures
List of Tables
Nomenclature
1 Introduction
2 Methods & Materials
    2.1 Origin of data
    2.2 SOFA score
    2.3 Used software
    2.4 Self organizing maps
    2.5 Geneset analysis
    2.6 Correlation of documented features with metagenes
    2.7 Pseudotime analysis
    2.8 Identification of markers and features
    2.9 Determination of transcriptomic changes
    2.10 Describing progression of CAP
3 Results
    3.1 Relation of features and pneumonia
    3.2 Pseudotime analysis
    3.3 Identification of possible marker spots
    3.4 Describing progression of CAP
4 Discussion
    4.1 Relations of features and CAP
    4.2 Functional analysis of spots and marker genes
    4.3 Description of CAP progression
5 Summary
A Supplementary data

List of Figures

3.1 DC of DPT analysis, using 2500 metagene expression profiles, with descriptive colour overlays
3.2 DC of DPT analysis, performed using all 48,107 gene expression profiles, with descriptive colour overlays
3.3 DC of DPT analysis, performed only with 1476 marker gene expression profiles, with descriptive coloured overlays
3.4 Sample-wise expression profiles of metagenes with respective significance, ordered by pseudotime
3.5 Significant, sufficient variant metagenes and their positions within the SOM
3.6 Spreading of CAP to other organs with increasing pulmonal subSOFA score by plotting mean subSOFA scores of each organ against the respective pulmonal subSOFA score
3.7 Modelled, pseudotime dependent likelihood of metagene expression, sorted spotwise by turning point of likelihood distribution model
3.8 Modelled, pseudotime dependent likelihood for marker gene expression, sorted spotwise and by switch point
A.1 Starting parameters and sorting of preprocessed/normalized data
A.2 Definition of function: fetch (sub-)SOFA scores from files
A.3 Calculation and visualization of pseudotime
A.4 Function definition for the generalized additive model describing relations between pseudotime and meta-/genes
A.5 Function for distributing metagenes into separate spots

List of Tables

1.1 Most common pathogens causing CAP
2.1 Determination of subSOFA scores
2.2 Packages used within R/Rstudio, without dependencies
3.1 Additional documented features for each patient correlated with metagene expressions
3.2 Selection of Chaussabel's genesets and their enrichment in spot A and B
3.3 Frequencies of most relevant keywords from geneset titles by 'oposSOM'
3.4 Results of the geneset analysis performed with Chaussabel's genesets for the 1476 marker genes determined
A.1 Transcriptional switches of meta-/genes

Nomenclature

CAP     Community acquired pneumonia
CNS     Central nervous system
DC      Diffusion components
DN      Downregulation/Downregulated at
DPT     Diffusion pseudotime
GPCR    G-protein coupled receptor
GTP     Guanosine triphosphate
ICU     Intensive care unit
LTa     Lymphotoxin alpha
MAP     Mean arterial pressure
mmHg    Millimetres of mercury
ROS     Reactive oxygen species
SOFA    Sequential organ failure assessment
SOM     Self organizing maps
TNF     Tumor necrosis factor
UP      Upregulated/Upregulation at
WHO     World Health Organization

1 Introduction

Community acquired pneumonia (CAP) is one of the most common infectious and potentially serious diseases worldwide. Children and elderly persons are affected most often. With rising life expectancy, the incidence of CAP will increase in the future [1], leading to higher treatment and health care costs ($8.4 billion annually in the USA) [2]. In Germany, 16,000 people died of it in 2014, and the number increased to 20,000 in 2015 [3]. CAP was thereby the eighth leading cause of death with a frequency of 2.1%, compared to chronic ischemic heart failure, the leading cause of death with a frequency of 8.2%.
Acute myocardial infarction, categorized according to the ICD-10 of the WHO [4], is the second leading cause of death with a frequency of 5.3%. To reduce mortality, several aetiologic studies were conducted, showing that CAP is usually caused by a bacterial infection, a viral infection or a combination of both. The most common pathogens are listed in Table 1.1 [5] [6] [1]. Typically, S. pneumoniae causes CAP in children and seniors, whereas young adults suffer more from atypical infections (e.g. with Mycoplasma pneumoniae [6]) including other pathogens [2].

Table 1.1: Most common pathogens causing CAP

Pathogen                          Frequency [%]
Streptococcus pneumoniae          25 - 43.6
Coxiella burnetii                 6 - 18.5
Haemophilus influenzae            5 - 11
Virus in general                  10 - 14.4
Legionella sp.                    8
Chlamydia spp.                    7 - 10.6
Gram negative enteric bacilli     6
Pseudomonas aeruginosa            5
Mycoplasma pneumoniae             15.9

The diagnosis of CAP consists of three steps: physical examination (auscultation), radiological examination (the actual establishment of the diagnosis) and finally laboratory tests (e.g. leukocyte count, sputum Gram stain, blood cultures and urine antigens) [2]. With the identification of specific transcriptomic markers, the diagnosis of CAP would become more reliable and cheaper. Such markers would greatly help in choosing the right treatment, for example in deciding whether antibiotics are required or whether an enhanced risk of death exists. Knowing which genes are expressed during which phase of the disease would greatly reduce the costs of tests that usually check a broad range of criteria [7]. Most studies so far focus on the influence of single nucleotide polymorphisms in CAP regarding TNF and LTa gene polymorphisms, pattern recognition molecules (esp. mannose binding lectins), inflammatory molecules and the coagulation system [8] [9].

The main objective of this thesis is to order the blood transcriptome data of 392 CAP patients, taken from the PROGRESS study, by the degree of CAP progression in order to find relations between disease progression and gene expression. Mainly anamnestic features and the SOFA score, a common scoring system for sepsis and organ failure, are taken into account as well. The marker genes resulting from this procedure constitute a proposed model that could serve as a tool for diagnosing stages of CAP. With the help of diffusion pseudotime, this model will also contain markers for different courses of CAP, if they exist.

2 Methods & Materials

2.1 Origin of data

Expression data, including additional anamnestic patient data, were obtained from the PROGRESS study. This study covers adults who suffer from CAP and receive invasive or non-invasive artificial respiration. The 796 samples contained RNA from stabilized whole blood serum, some of them taken at different time points during the treatment of the 392 patients [10]. Expression values were determined with Illumina Human HT12v4 chips. The preprocessed data were provided by Dr. Holger Kirsten and Prof. Markus Scholz from the Institute for Medical Informatics, Statistics and Epidemiology (IMISE) at Leipzig University. The methods of the PROGRESS study are described in the NIH database for clinical trials [11] and in the respective paper [10].

2.2 SOFA score

The SOFA score (sequential organ failure assessment score) was introduced to assess organ failure, sepsis and organ dysfunction in general. It is one of the most relevant scoring systems for estimating stages of pneumonia [12] and is usually applied to patients in the intensive care unit (ICU). To obtain the SOFA score of a patient, several criteria are tested, depending on the organ or organ system (see Table 2.1). For every organ, a subSOFA score is determined, and the subSOFA scores add up to a total SOFA score [13].
This score, as well as the single