Evaluating differential expression using RNA-sequencing data: a case study in host-pathogen interaction upon Listeria monocytogenes infection

Joana Rita Gonçalves da Cruz

Thesis to obtain the Master of Science Degree in Biomedical Engineering

Examination Committee

Chairperson: Profa. Dra. Cláudia Alexandra Martins Lobato da Silva
Supervisor: Profa. Dra. Ana Teresa Correia de Freitas
Co-Supervisor: Dr. Luis Pedro Coelho
Members of the Committee: Dr. João André Nogueira Custódio Carriço
Members of the Committee: Dr. Daniel Sobral

November 2013

Source of the cover figure: http://www.ploughshareinnovations.com/uploads/images/dna-binary-big.jpg

Science, my boy, is composed of errors, but errors that it is right to make, for they lead step by step towards the truth.

Jules Verne, Journey to the Centre of the Earth (1872)

Acknowledgments

I start by thanking my supervisors, Professor Ana Teresa Freitas and Luis Pedro Coelho, for the guidance they provided throughout this work, and especially Luis Pedro Coelho for his precious support, patience and motivation. I also thank Musa Mhlanga for giving me the opportunity to participate in this exciting project, and Youtaro Shibayama and Loretta Magagula, from the Mhlanga laboratory at CSRI, for helping me to put the computational results into a biological context. My sincere thanks also go to Daniel Neves for supporting me at the beginning of my path in bioinformatics. I express my deep gratitude to my family, especially my parents, for all their support and understanding. Thank you for always encouraging me to give my best in every situation. To Telma, for her companionship not only during the development of this thesis but throughout the whole Biomedical Engineering journey. Finally, to my boyfriend and best friend, Pedro, with whom I shared my joys and frustrations. Thank you for being my solid anchor and for always believing in me, even when I didn't.

This work was supported by Fundação para a Ciência e Tecnologia (FCT) through the Super resolution Imaging of Gene Expression & Nuclear Architecture project (PTDC/SAU-GMG/115652/2009).


Abstract

Unlike the genome, the cell transcriptome is dynamic and specific to a given developmental stage or physiological condition. Understanding the transcriptome is essential for interpreting the functional elements of the genome and revealing the molecular constituents of cells. Recent developments in high-throughput DNA sequencing have provided a new method to sequence RNA at unprecedented resolution. This method, termed RNA-Seq, has been emerging as the preferred technology for both the characterization and the quantification of cell transcripts. Bearing this in mind, in the first part of this thesis I propose a bioinformatics pipeline to compare two RNA-Seq samples. This pipeline provides biological insight into the analysed samples by extracting the main biological processes that are differentially active between them. Building on this pipeline, I developed a novel methodology to inspect the activation of a given cellular pathway in a time-course RNA-Seq dataset. This methodology was implemented in an HTML interface. Finally, I formulated a novel approach to statistically reinforce the inference of differentially expressed genes between two RNA-Seq samples, using publicly available RNA-Seq datasets.

The developed tools were evaluated on a Listeria monocytogenes RNA-Seq dataset, which attested to their proper functioning. This dataset was obtained from a population of cells subjected to two different conditions: 1) infection with wild-type Listeria monocytogenes and 2) infection with mutant Listeria monocytogenes lacking a gene that encodes a virulence factor. Employing the RNA-Seq analysis pipeline, I concluded that upon infection with mutant Listeria monocytogenes the cell immunological response was not activated and that, therefore, the deleted gene and its product are essential for the virulence of this bacterium. For the wild-type infected population, a strong immunological response was observed and it was possible to identify specific points of the Listeria monocytogenes life-cycle. Next, I studied how well the Listeria monocytogenes dataset is modelled by two gene regulatory networks, one describing the cellular response upon infection with Escherichia coli and another upon infection with Listeria monocytogenes. The results confirmed the existence of the gene relationships described in the latter network. Finally, I tested the hypothesis of using public RNA-Seq data, extracted from cell populations in conditions similar to those of the control of the Listeria monocytogenes dataset, to improve the statistical confidence in the inference of differentially expressed genes. The results showed that this methodology is not reliable.

Keywords

RNA-Sequencing, gene expression, gene networks, RNA-Seq analysis pipeline, RNA-Seq validation, Listeria monocytogenes.


Resumo (Summary)

Unlike the genome, the cell transcriptome is dynamic and specific to a given developmental stage and physiological condition. Understanding the transcriptome is essential for interpreting the functional elements of the genome and for revealing the molecular constituents of cells. The recent development of new high-throughput DNA sequencing methods has led to the emergence of a new methodology that allows RNA molecules to be sequenced at unprecedented resolution. This method is called RNA-Seq and has been emerging as the preferred technology for the characterization and quantification of cell transcripts. With this in mind, I developed a bioinformatics pipeline that compares two RNA-Seq samples, extracting the main biological processes that are differentially active between the two samples under analysis. Following this pipeline, I developed a new methodology that inspects the activation of a given cellular pathway in a set of RNA-Seq reads acquired over a time course. This tool was implemented in an HTML interface. Finally, I formulated a new approach to statistically reinforce the inference of differentially expressed genes between two RNA-Seq samples, using publicly available RNA-Seq data. The evaluation of an RNA-Seq dataset with the developed tools was used to verify that they work properly.

In particular, this dataset was obtained from a population of cells subjected to two conditions: 1) infected with wild-type Listeria monocytogenes and 2) infected with mutant Listeria monocytogenes, from which a gene that synthesises a virulence factor was removed. Processing the data with the developed pipeline, it was possible to conclude that, after infection with mutant Listeria monocytogenes, the cell's immune response was not activated and, therefore, the deleted gene and its product are essential for its virulence. In contrast, for the population infected with the wild-type version, a strong immune response was observed and, in addition, it was possible to identify specific points of the bacterium's life-cycle. Next, I studied how well the dataset under analysis is modelled by two gene networks, one describing the cellular response after infection with Escherichia coli and the other after infection with Listeria monocytogenes. The results confirmed the existence of the relationships depicted by the latter network. Finally, I tested the hypothesis of using public RNA-Seq data, obtained in conditions similar to those of the control in the Listeria monocytogenes experiment, to increase the statistical confidence in the inference of differentially expressed genes. The results showed that this methodology is not viable.

Palavras Chave (Keywords)

RNA-Sequencing, gene expression, gene network, RNA-Seq analysis pipeline, RNA-Seq validation, Listeria monocytogenes.


Contents

1 Introduction
  1.1 Context and motivation
  1.2 Problem formulation
  1.3 Original Contributions
  1.4 Thesis Outline

2 Gene Expression
  2.1 Introductory note
  2.2 The concept of gene expression
  2.3 Gene expression regulation
    2.3.1 Transcriptional regulation
    2.3.2 Post-transcriptional regulation
    2.3.3 Translational regulation
    2.3.4 degradation
  2.4 Gene expression analysis using RNA-Sequencing data
    2.4.1 Approaches for genome-wide expression analysis
    2.4.2 RNA-Sequencing experiment workflow
    2.4.3 Quality control of sequence reads
    2.4.4 Mapping reads
    2.4.5 Expression quantification and normalization
    2.4.6 Differential expression
    2.4.7 Pathway analysis

3 Listeria monocytogenes
  3.1 Intracellular infectious cycle
  3.2 Listeria monocytogenes as a model organism
  3.3 Listeria monocytogenes case study: dataset description

4 RNA-Sequencing analysis pipeline
  4.1 Methods
  4.2 Listeria monocytogenes case study
    4.2.1 Quality Control – FastQC
    4.2.2 Mapping reads – Bowtie 2
    4.2.3 Expression quantification – HTSeq-count
    4.2.4 Differential expression – DESeq
    4.2.5 Gene Ontology enrichment – GOStats
      4.2.5.A Biological Processes
      4.2.5.B Cellular Components
      4.2.5.C Molecular Functions
  4.3 Discussion
    4.3.1 State-of-the-art: RNA-Seq processing tools
    4.3.2 Limitations of the developed pipeline
    4.3.3 Limitation of the experimental data
    4.3.4 Case study: congruency between expected and obtained results
  4.4 Future work
    4.4.1 Laboratory approach: new data acquisition
    4.4.2 Computational approach: improvement of the pipeline

5 Gene networks to prove the existence of a given biological response in RNA-Sequencing data
  5.1 Methods
  5.2 Listeria monocytogenes case study
    5.2.1 Network 1
    5.2.2 Network 2
  5.3 Discussion
  5.4 Interface
  5.5 Future work

6 Using publicly available data as RNA-seq replicates
  6.1 Methods
  6.2 Listeria monocytogenes case study
    6.2.1 Comparison between Listeria monocytogenes control and the publicly available data
      6.2.1.A Dataset 1
      6.2.1.B Dataset 2
      6.2.1.C Dataset 3
      6.2.1.D Dataset 4
    6.2.2 Comparison between publicly available data
  6.3 Discussion
  6.4 Future work

7 Conclusions

Bibliography

Appendix A Supplementary material for chapter 4
  A.1 Similarity between samples
  A.2 Supplementary Tables
  A.3 Scripts
    A.3.1 Makefile
    A.3.2 trim.py
    A.3.3 paired end.py
    A.3.4 DESeq.R
    A.3.5 GOstats.R

Appendix B Supplementary material for chapter 5
  B.1 Scripts
    B.1.1 network.R
    B.1.2 network-LM1.R


List of Figures

2.1 Central dogma of molecular biology.
2.2 Overview of gene expression process in metazoans.
2.3 Gene expression regulation points of eukaryotic organisms.
2.4 Schematic overview of a eukaryotic gene and the regulatory regions that control transcription initiation.
2.5 First steps in the conversion of total RNA into a library of template molecules suitable for high throughput DNA sequencing.
2.6 Cluster generation in Illumina sequencing process.
2.7 Sequencing-by-synthesis in Illumina sequencing process.
2.8 Mapping of paired-end RNA-Seq reads with different reference files: genome and transcriptome.
2.9 Transcripts of different lengths with different read coverage levels, total read counts observed for each transcript and FPKM-normalized read counts.

3.1 Schematic representation and electron micrographs of Listeria monocytogenes life-cycle.

4.1 Flowchart of the developed RNA-Seq analysis pipeline.
4.2 Modules with inadvertences in the quality control analysis of non-mutagenic Listeria monocytogenes data, for time-point 20 and for the first strand of the paired-end read.
4.3 Normalized counts mean versus log2 fold change for the contrast Listeria monocytogenes' non-infected versus infected for each time-point at the differential expression analysis.
4.4 P-value distribution for the statistically significant differentially expressed genes (p-value < 0.1).

5.1 Methodology used to infer if a gene regulatory network is present in an RNA-Seq dataset.
5.2 Gene network extracted from Guthke et al., 2005 [1], describing the genetic relations upon bacterial infection by Escherichia coli.
5.3 Measured and simulated expression kinetics in log-ratios of the genes in the nodes of network 1, for both wild-type and mutant infected datasets.
5.4 Distribution of the ratio between the mean square error and the variance in the permutation tests, regarding network 1.
5.5 Network constructed from literature review on L. monocytogenes infection.
5.6 Measured and simulated expression kinetics in log-ratios of the genes in the nodes of network 2, for both wild-type and mutant infected datasets.
5.7 Distribution of the ratio between the mean square error and the variance in the permutation tests, regarding network 2.
5.8 Interface of methodology described in chapter 5: analysis tab interface with all the Listeria monocytogenes data information to perform the network congruency analysis.
5.9 Interface of methodology described in chapter 5: results tab interface after the evaluation of the Listeria monocytogenes data information.

6.1 Flowchart of the methodology developed to test the congruency between public RNA-Seq samples and the sample in analysis.
6.2 Plot of normalized counts mean versus log2 fold change for the contrast between Listeria monocytogenes' control sample versus dataset's 1 sample.
6.3 Plot of normalized counts mean versus log2 fold change for the contrast between Listeria monocytogenes' control sample versus dataset's 2 samples.
6.4 Plot of normalized counts mean versus log2 fold change for the contrast between Listeria monocytogenes' control sample versus dataset's 3 samples.
6.5 Plot of normalized counts mean versus log2 fold change for the contrast between Listeria monocytogenes' control sample versus dataset's 4 samples.
6.5 Plot of normalized counts mean versus log2 fold change for the contrast between the public datasets.

A.1 Heatmap showing the Euclidean distances between the Listeria monocytogenes' samples.
A.2 Samples cluster by strain. Plot of the first three principal components for all 9 samples.

List of Tables

4.1 Alignment rate for each one of the RNA-Seq reads against the ...
4.2 Highly differentially expressed genes for the comparison between control and sample acquired at time-point 20 from the wild-type infected cell population.
4.3 Summary of the gene ontology terms for the biological processes ontology and respective p-value associated with the set of differentially expressed genes from non-infected versus Listeria monocytogenes infected samples, for all the experimental time-points.
4.3 Summary of the gene ontology terms for the cellular components ontology and respective p-value associated with the set of differentially expressed genes from non-infected versus Listeria monocytogenes infected samples, for all the experimental time-points.
4.4 Summary of the gene ontology terms for the molecular function ontology and respective p-value associated with the set of differentially expressed genes from non-infected versus Listeria monocytogenes infected samples, for all the experimental time-points.
4.5 Summary of the available tools for each step of the RNA-Seq data processing.
4.6 Pre-built tools that allow users without programming or informatics expertise to process RNA-Seq data.
4.7 Micrographs illustrating the evolution of HeLa cells upon infection with wild-type and mutant Listeria monocytogenes.
4.8 First ten genes with lower coefficient of variation on the Listeria monocytogenes dataset.

5.1 Statistical values associated with the set of genes in the nodes of network 1 on both analysed conditions.
5.2 Index of each gene in the nodes of network 2 and its respective Ensembl identification.
5.3 Statistical values associated with the set of genes in the nodes of network 2 on both analysed conditions.

6.1 Summary of the publicly available datasets used in the methodology evaluation.

A.1 Statistically significant differentially expressed genes among samples infected with wild-type and mutant Listeria monocytogenes for time-point 20.
A.2 Statistically significant differentially expressed genes among samples infected with wild-type and mutant Listeria monocytogenes for time-point 240.
A.3 Statistically significant differentially expressed genes among samples infected with wild-type and mutant Listeria monocytogenes for time-point 60.
A.4 Statistically significant differentially expressed genes among samples infected with wild-type and mutant Listeria monocytogenes for time-point 120.
A.5 Summary of the GO terms for the biological processes ontology and respective p-value associated with the set of differentially expressed genes from wild-type versus mutant infected analysis, for all the experimental time-points.
A.6 Summary of the GO terms for the molecular function ontology and respective p-value associated with the set of differentially expressed genes from wild-type versus mutant infected analysis, for all the experimental time-points.
A.7 Summary of the GO terms for the cellular components ontology and respective p-value associated with the set of differentially expressed genes from wild-type versus mutant infected analysis, for all the experimental time-points.

Abbreviations

RNA-Seq RNA-Sequencing

NGS Next-Generation Sequencing

L. monocytogenes Listeria monocytogenes

LLO Listeriolysin O

DNA Deoxyribonucleic Acid

A Adenine

G Guanine

T Thymine

C Cytosine

RNA Ribonucleic Acid

mRNA Messenger Ribonucleic Acid

ncRNA Non-Coding Ribonucleic Acid

RNA pol RNA Polymerase

rRNA Ribosomal Ribonucleic Acid

tRNA Transfer Ribonucleic Acid

TF Transcription Factor

siRNA Small Interfering Ribonucleic Acid

miRNA Micro Ribonucleic Acid

GEA Gene Expression Analysis

qPCR Quantitative Polymerase Chain Reaction

cDNA Complementary Deoxyribonucleic Acid

SAGE Serial Analysis of Gene Expression

CAGE Cap Analysis of Gene Expression

MPSS Massively Parallel Signature Sequencing

QC Quality Control

SAM Sequence Alignment/Map

RPKM Reads Per Kilobase of Transcript per Million Mapped

FPKM Fragments Per Kilobase of Transcript per Million Mapped

DE Differential Expression

FDR False Discovery Rate

NB Negative Binomial

GO Gene Ontology

KEGG Kyoto Encyclopedia of Genes and Genomes

ORA Over-Representation Analysis

PI3K Phosphatidylinositol 3-Kinase

MAPK Ras-Mitogen-Activated Protein Kinase

BP Biological Processes

CC Cellular Components

MF Molecular Functions

qRT-PCR Real Time Quantitative Reverse Transcription Polymerase Chain Reaction

CV Coefficient of Variation

ChIP Chromatin ImmunoPrecipitation

NG Node Genes

MSE/Var Ratio between the Mean Square Error and the Variance

1 Introduction

Contents

1.1 Context and motivation
1.2 Problem formulation
1.3 Original Contributions
1.4 Thesis Outline

1.1 Context and motivation

Since the Nobel Prize in Chemistry was awarded to Fred Sanger and Walter Gilbert in 1980 for their crucial contributions to the determination of base sequences in nucleic acids [2, 3], DNA sequencing technologies have undergone many developments [4–12]. The improvement of these techniques revolutionized the approach to biological questions and, nowadays, the use of deep sequencing technologies such as RNA-Sequencing (RNA-Seq) to assess the cell transcriptome has become an integral part of biological research [8, 13]. However, these technologies output a massive amount of data and it is not always straightforward to extract meaningful biological information from it.

In this context, bioinformatics and computational biology are interdisciplinary fields that combine computer technology, mathematics and molecular biology to answer fundamental questions in the life sciences [14–16]. In particular, they provide powerful tools suitable for analysing data generated by high-throughput sequencing platforms. Consequently, the success of next-generation sequencing (NGS) technologies is tightly linked to the creation of efficient computational tools that are able to process the data. In fact, without the means provided by computational tools, the processing of NGS data would be almost impossible [17, 18].

Until the mid-1990s, studies of gene expression were limited to measuring transcription from one or a few genes. Microarray technology changed this, allowing the study of hundreds or thousands of transcripts at a time. At that time, this technology revolutionized many areas of biology, from basic research to the understanding and treatment of human disease [19–23]. In an analogous way, the combination of RNA-Seq data with the means of analysis provided by bioinformatics tools has the potential to revolutionize the way many biological problems are approached, helping to overcome existing obstacles and limitations and leading to significant leaps in our understanding of biological processes [24].

An important biological topic is the understanding of the complex and sophisticated mechanisms by which diverse pathogenic agents evade host defense mechanisms and subvert their host networks to suit their lifestyle needs [25]. Infection of eukaryotic cells may be induced by a wide variety of agents, from viruses to eukaryotic parasites. A global understanding of host and pathogen transcriptomes can provide new insights into the infection process, leading to the identification of unknown virulence factors in the pathogen, new pathways in the host response to the pathogen, or even pathogen-associated molecular patterns [24].

Even though the majority of bacteria are harmless, pathogenic bacteria such as Mycobacterium tuberculosis, which is responsible for tuberculosis, remain an important public health problem [26]. Bacterial pathogens are able to exploit an enormous range of niches inside their host. Most pathogens exploit non-pathogenic cells, being able to survive in a membrane-bound compartment. Only a small portion of bacteria are capable of accessing the host cell cytosol and proliferating within it [27]. Listeria monocytogenes (L. monocytogenes) is a model bacterial pathogen which, after internalization, is capable of disrupting a double-membrane vacuole, replicating in the host cytosol and manipulating the innate response triggered in the cell cytosol [27]. Its intracellular life-cycle in the human host provides insight into the dynamics of general host-pathogen interactions, including modifications in host gene expression levels before and after infection. The study of these variations can provide the foundation to better understand how bacterial infection influences human cells.

1.2 Problem formulation

This thesis has two underlying goals that are complementary to each other: the first is related to computational methodologies and the second to biological knowledge.

• Computational goal
Firstly, I aim to integrate available bioinformatics tools into a congruent pipeline that is able to process RNA-Seq data and extract reliable biological conclusions from it. Following this task, I intend to identify limitations in the already implemented tools and improve them, or even develop new methodologies.

• Biological goal
The second main goal is to use the developed tools to understand how L. monocytogenes influences human host cells, by identifying global changes in the host transcriptome and characterizing the alterations in host nuclear architecture. Furthermore, I aim to associate these changes with different stages of the L. monocytogenes infection life-cycle, including cell entry, phagosomal escape, cytosolic replication, actin recruitment, actin tail formation and cell escape.

1.3 Original Contributions

The main contributions resulting from this work are the following:

• Computational contributions

1. Implementation of a congruent sequence of tools (pipeline) that is suitable to extract the cellular processes that are differentially active among RNA-Seq samples;

2. Development and implementation of a novel methodology to validate RNA-Seq data by investigating the activation of a given biological pathway in an RNA-Seq time-course dataset, using known gene interactions;

3. Refutation of a hypothesised methodology in which public RNA-Seq reads are used as replicates of the control sample, which usually corresponds to cells growing on a plate, in the statistical test that finds differentially expressed genes among samples. It was shown that the use of public RNA-Seq data is not valid even when the acquired and public transcriptome samples were obtained in similar conditions and from the same cell line;

• Biological contributions

4. Association of the different L. monocytogenes life-cycle stages with transcriptome alterations on human host cells;

5. Evidence of the loss of virulence of L. monocytogenes when it lacks the hly gene, which encodes listeriolysin O (LLO), a virulence factor.

1.4 Thesis Outline

Apart from this introduction, this thesis is structured in six chapters.

In chapter 2 I introduce the concept of gene expression, its main regulation points and the several approaches used to gain insight into it. In particular, in section 2.4, I give special attention to RNA-Seq data and to the current methods used to assess the gene expression profile and extract novel biological knowledge from this type of data.

Chapter 3 presents a review of the L. monocytogenes life-cycle and of how this bacterium is commonly used as a model organism to study the immunological response upon bacterial infection. This chapter also describes the L. monocytogenes RNA-Seq dataset that is used in this thesis as a case study to test the developed methodologies.

Then, in chapter 4, I describe a pipeline developed to analyse a time-course RNA-Seq dataset. As a case study, I used the implemented pipeline to process an RNA-Seq dataset extracted from human HeLa cells infected with L. monocytogenes. From its output I determine which cell processes are differentially active under distinct cell growth environments and along a given time-course. This analysis is described in section 4.2.

Afterwards, in chapter 5, I formulate a new methodology that uses known gene networks to investigate whether a certain biological process is active in an RNA-Seq dataset. To verify that the methodology works properly, in section 5.2 I use it to study the presence of two gene networks in the previously mentioned RNA-Seq dataset: one network describing the immunological response of human cells upon Escherichia coli infection and another describing the immunological response of human cells upon L. monocytogenes internalization.

Lastly, in chapter 6 I hypothesise the use of publicly available data as replicates of the RNA-Seq control sample, which usually corresponds to the transcriptome extracted from cells growing in a healthy medium. Bearing this in mind, in this chapter I describe the results of using published RNA-Seq reads from four different datasets as replicates of the control sample in the L. monocytogenes dataset.

2 Gene Expression

Contents

2.1 Introductory note
2.2 The concept of gene expression
2.3 Gene expression regulation
2.4 Gene expression analysis using RNA-Sequencing data

2.1 Introductory note

Eukaryotic organisms have their hereditary information encoded in molecules of deoxyribonucleic acid (DNA), which are packed and organized in the cell nucleus in structures called chromosomes. The DNA monomers are called nucleotides and are organized in a double-stranded helix [28–30]. Each nucleotide is composed of a phosphate group, a deoxyribose and a nitrogenous base, also called a nucleobase. The genetic information in a DNA molecule is represented by the sequence of nucleotides, each containing one of four types of nucleobases: adenine (A), guanine (G), cytosine (C) and thymine (T). Following the Watson-Crick model [30], the two strands that constitute the DNA molecule are held together by hydrogen bonds that can only be established between specific pairs of nucleobases: A with T and G with C. Because of this restriction, both strands are complementary to one another and, therefore, contain the same genetic information.

Over the course of embryonic development, a fertilized egg cell gives rise to all the cell types of the organism. However, the genetic information of each cell is an almost exact copy of the DNA that was in the fertilized egg cell and from which the whole organism developed. The distinct cell phenotypes are possible because different cell types make use of different stretches of the DNA molecule, called genes, as templates to build functional cellular products in a process called gene expression.
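The pairing rule described above (A with T, G with C) can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical, chosen only for illustration. The reversal step reflects the fact that the two strands of the double helix run antiparallel, a standard property of DNA that is assumed here rather than stated in the text above.

```python
# Watson-Crick pairing rules: A pairs with T, G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRING[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in its own 5'->3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    strand = "ATGCGT"  # hypothetical sequence, for illustration only
    print(complement(strand))          # TACGCA
    print(reverse_complement(strand))  # ACGCAT
```

Applying `reverse_complement` twice returns the original sequence, which is one way to see that either strand is sufficient to recover the other.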

2.2 The concept of gene expression

The central dogma of molecular biology was first proposed by Francis Crick in 1958 [31, 32] and "deals with the detailed residue-by-residue transfer of sequential information". In particular, it states that information in nucleic acids can be perpetuated or transferred, but the transfer of information into protein is irreversible (figure 2.1). Bearing this in mind, gene expression is the process by which a particular segment of DNA is copied into a ribonucleic acid (RNA) molecule which, in turn, is used in the synthesis of functional gene products. Some RNA molecules are the end product in themselves, while others are used as a template for the creation of other molecules, proteins, in a process called translation. According to this distinction, RNAs are classified as either messenger RNAs (mRNAs) or non-coding RNAs (ncRNAs).

Figure 2.1: Central dogma of molecular biology. Solid arrows represent the usual flow of biological information. Dotted arrows represent special transfers that occur only in specific cases. Adapted from Crick, 1970 [32].

The process by which the DNA genetic information is transferred into a RNA molecule is designated by transcription and performed in the cell nucleus by an enzyme called RNA polymerase (RNA pol). RNA

6 pol catalyses the formation of the phosphodiester bonds that link the nucleotides together and form the sugar-phosphate backbone of the RNA chain. In eukaryotes, there are multiple types of RNA pol that synthesize various types of RNA. The first one, RNA pol I, transcribes ribosomal RNAs (rRNAs) which combined with ribosomal proteins constitute ribosomes, structures on which mRNA is translated into protein. RNA pol II transcribes, mainly, protein-coding genes (mRNAs). Lastly, RNA pol III catalyses the transcription of transfer RNAs (tRNAs), which function as adaptors selecting amino acids and holding them in place on a ribosome for their incorporation into protein. Furthermore, all these polymerases synthesise small RNAs that play structural and catalytic roles in the cell. The classical model of gene transcription includes three steps: initiation, elongation and termination. Initiation process begins when the RNA pol molecule binds to the DNA upstream (5’ end) of the gene at a specialized sequence called promoter. For that binding to occur, RNA pol requires the assistance of a large number of accessory proteins. These include the general transcription factors (TFs) which must assemble on promoter along with the polymerase before the polymerase can begin transcription. Once transcription is initiated, most of the TFs are released from the DNA. Then, the DNA double helix unwinds and RNA polymerase reads the template strand, adding nucleotides to the 3’ end of the growing RNA chain – Elongation. Finally, transcription termination occurs after RNA pol reaches a termination site. At this point, RNA pol is released from the DNA and RNA is cleaved and released from the transcriptional complex. Simultaneously to the transcription process, the novel RNA molecule is submitted to several pro- cessing steps. The way in which this molecule is processed depends on which type of RNA it is. 
Polyadenylation, for instance, is an important processing step for transcripts destined to become mRNA molecules. In this process, a series of repeated A nucleotides, the poly-(A) tail, is added to the RNA molecule being produced. Among other effects, this processing increases the stability of the RNA molecule and aids its export from the nucleus to the cell cytosol. Furthermore, still in the cell nucleus, the newly synthesised RNA molecules require an extensive processing step to become functional RNAs. Precursor mRNA molecules consist of alternating segments of coding portions of the gene, called exons, and non-coding portions, called introns. In a process called splicing, introns are removed from the mRNA molecule and the neighbouring exons are stitched together. This process, in particular, is exclusive to eukaryotic organisms. The mature RNA is then selectively transported from the nucleus to the cytoplasm.

Once in the cell cytosol, mRNAs are translated: the information in an mRNA molecule is converted into a protein. In this process, an mRNA molecule is used as a template by a ribosome, which matches each sequence of three nucleotides (codon) on the template mRNA chain with a sequence of three complementary nucleotides (anticodon) on a tRNA molecule. Each tRNA carries the amino acid that its anticodon sequence calls for, so this molecule recognizes and binds a codon at one site and an amino acid at another site of its surface. Thus, tRNAs function as translators between nucleotide sequences in RNAs and amino acid sequences in proteins. As the mRNA moves through it, the ribosome covalently links each amino acid to the end of the growing polypeptide chain by peptide bonds. When translation reaches a stop codon, denoting the end of the protein, the completed protein chain and the mRNA molecule are released and the ribosome dissociates into its two separate subunits.

Gene expression can therefore be seen as a mediator that interprets the genetic information of the organism (genotype), giving rise to an outward physical manifestation (phenotype), via gene transcription and mRNA processing.

Figure 2.2: Overview of gene expression process in metazoans. Adapted from [33].
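The flow of information described in this section (transcription, splicing, translation) can be illustrated with a toy Python sketch. This is only a didactic simplification under stated assumptions: the function names, the exon coordinates and the deliberately partial codon table are hypothetical and introduced here for illustration, and none of the regulatory or enzymatic detail is modelled.

```python
# Toy model of the flow DNA -> pre-mRNA -> mature mRNA -> protein.
# The codon table is deliberately partial (illustrative subset of the genetic code).
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "GCU": "A", "AAA": "K",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_strand):
    """Return the RNA that carries the same sequence as the DNA coding strand (T -> U)."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive), dropping the introns between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """Read codons from the first AUG until a stop codon; unknown codons become 'X'."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon: release the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical gene: exon 1, an 8-base intron, exon 2.
dna = "ATGTTT" + "GTAAGTAG" + "GGCAAATAA"
mature = splice(transcribe(dna), [(0, 6), (14, 23)])
print(mature, translate(mature))
```

In this sketch the intron is removed before translation, so the protein is read only from the joined exons.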

2.3 Gene expression regulation

Given that genes encode proteins and proteins dictate the functional and structural properties of the cell, each step in the flow of information from DNA to RNA to protein provides the cell with a potential control point for regulating its own functioning. This system allows cells to respond to environmental changes and maintain their cell-type-specific expression patterns. Thus, a cell can adjust the amount and type of proteins that it manufactures by:

1. controlling when and how often a given gene is transcribed;

2. controlling how an RNA transcript is processed;

3. selecting which mRNAs are exported from the nucleus to the cytosol;

4. selectively degrading certain mRNA molecules;

5. selecting which mRNAs are translated by ribosomes;

6. selectively activating or inactivating proteins after they have been synthesized;

7. selectively degrading proteins.

Figure 2.3: Gene expression regulation points of eukaryotic organisms [33].

2.3.1 Transcriptional regulation

Regulation at the transcriptional level plays a paramount role in the control of gene expression. Transcription is the first step in the gene expression process and, consequently, the only regulation point at which it can be ensured that no unnecessary intermediates are synthesized. This regulation can be performed at the promoter level by controlling the binding of TFs to the gene promoter region. As described in section 2.2, this binding influences the recruitment of the RNA pol enzyme and, subsequently, the initiation of transcription. In addition to the promoter, nearly all genes are controlled by regulatory DNA sequences that may increase or decrease the transcriptional activity of a given gene. Sequence-specific TFs bind to these regulatory DNA regions, called enhancers or silencers, switching a gene on or off, respectively. A gene can have several regulatory DNA regions, which can lie inside or outside the gene region, sometimes thousands of nucleotides away from it. Often, the sequence-specific factors and the general TFs assembled at the promoter interact via additional proteins. These proteins are called co-factors, and the most important is a large protein complex known as Mediator. The rate of gene transcription is regulated by aiding or preventing the assembly of the general TFs and RNA pol at the promoter [33–36]. Furthermore, for the TFs to bind the regulatory regions of the gene and for RNA pol to start polymerizing RNA, the DNA chain needs to be accessible. Hence, transcriptional activity is also influenced by the level of DNA packaging. DNA is usually densely packed with histones, forming a structure called chromatin. The first and most fundamental level of chromatin packing is the nucleosome. At this level, transcription can be inhibited by physically blocking the gene promoter region. Chromatin structure is therefore also tightly linked to the efficiency of transcription initiation [37–39]. After transcription initiation, RNA pol II slows down and pauses at a promoter-proximal position. From this stage, depending on the type of transcription elongation factor that interacts with RNA pol II, transcription may halt or enter the elongation phase [40–42].

Figure 2.4: Schematic overview of a eukaryotic gene and the regulatory regions that control transcription initiation. Adapted from [33].

2.3.2 Post-transcriptional regulation

During RNA synthesis, the cell has further control points at which it can manage the amount, lifetime and final function of the new RNA molecule. Polyadenylation, introduced in section 2.2, strongly influences the lifetime of transcripts, protecting them from degradation and aiding their export to the cell cytosol. Similarly, capping, in which a modified guanine nucleotide cap is added to the 5’ end of pre-mRNA molecules, is crucial for the new transcript to exit the cell nucleus. Both processes are therefore essential for keeping the mRNA molecule stable over an appropriate time window. Splicing, also introduced in section 2.2, enables the production of distinct proteins from transcripts of the same source gene. The alternative inclusion or exclusion of parts of the gene allows a single coding gene to give rise to several distinct transcripts, expanding the number of functional gene products that the genome can code for. The mature mRNAs that result from this phenomenon are called isoforms. Additionally, targeted degradation of mRNA can be used not only as a defence mechanism against foreign RNA, such as that of viruses, but also as a route for controlling the stability and translation of mRNAs. The cell can, for instance, destroy transcripts through the RNA interference pathway. In this degradation mechanism, a small interfering RNA (siRNA) is assembled with a set of proteins into a complex called RISC. This complex patrols the cytoplasm, searching for mRNAs that are complementary to the siRNA it carries. If an mRNA molecule has a sequence complementary to the siRNA incorporated in the RISC complex, it is destroyed. siRNAs are closely related to microRNAs (miRNAs), which are small non-coding RNA molecules of approximately 22 nucleotides that are also loaded into RISC. These small RNAs are therefore able to efficiently block protein production by eliminating the mRNA that encodes the protein.

2.3.3 Translational regulation

Direct regulation of translation is less prevalent than control of transcription or mRNA stability, but it is occasionally used. This is usually achieved by binding a repressor to the 5’ untranslated region of the mRNA, the region that helps to guide the ribosome to the mRNA start codon. The ribosome is thereby kept from finding the translation start site. When conditions change, the cell can inactivate the repressor and increase translation of the mRNA.

2.3.4 Protein degradation

Once protein synthesis is complete, the level of that protein can be reduced by protein degradation. Cells possess specialized pathways to degrade proteins, using enzymes called proteases. In these pathways, proteins whose lifetime must be short, or which are damaged or misfolded, are marked by the attachment of a small protein called ubiquitin. Ubiquitylated proteins are then recognized and destroyed.

2.4 Gene expression analysis using RNA-Sequencing data

Gene expression analysis (GEA) gives insight into the transcriptional behaviour of biological systems and is among the most commonly used methods in modern biology. It is an exceptionally powerful tool that is used in many areas of biology and medicine. For example, it can be employed to identify genes differentially expressed among tissues or experimental conditions [19], to perform classification or discrimination analysis in heterogeneous diseases such as cancer [43], to understand the relationship between gene profiles and covariates such as survival or tumor aggressiveness [44, 45], to discover new drugs or optimize their production [46–48], to diagnose diseases [49], to tailor therapeutics to specific pathologies [50] or to generate databases with information about living processes [51]. GEA can be roughly split into two approaches, depending on the aim of the study: genome-wide and targeted. When the key genes of interest are not known, data is acquired at the level of the whole biological system, using genome-wide approaches such as microarrays [52] or RNA-Seq [53]. These studies measure the abundance of RNA species in a given biological system by profiling the cell transcriptome, defined as the complete set of transcripts in a cell and their quantities at a specific point of acquisition. Recalling the considerations in section 2.2, knowledge of the cell transcriptome can be invaluable in providing the link between the information encoded in the genome and the phenotype presented by the cell [13]. On the other hand, when the genes of interest are already known, more targeted methodologies can be employed. These approaches are usually based on quantitative polymerase chain reaction (qPCR) techniques [54]. In this chapter I will focus on genome-wide GEA, particularly using RNA-Seq data, and explain the pros and cons associated with this analysis. Furthermore, I will introduce the RNA-Seq experimental protocol and review the current methodologies designed to infer, from the raw nucleotide sequences, the cellular processes that were active when the transcriptome was collected.

2.4.1 Approaches for genome-wide expression analysis

Over the years, many methodologies have been developed for profiling the transcriptome. Two of these revolutionized expression profiling by enabling the measurement of thousands of genes simultaneously: the older hybridization-based microarray technology and the modern high-throughput sequencing-based RNA-Seq technology. Since the mid-1990s, DNA microarrays have been the technology of choice for genome-wide studies. The ability of these arrays to simultaneously interrogate thousands of transcripts has led to important advances in a wide range of biological problems. Nowadays, DNA microarrays are a relatively inexpensive and mature technology. In this technology, a large set of short single-stranded DNA molecules, called probes, is attached to fixed locations on a solid substrate. The RNA molecules (the transcriptome) are then extracted from the sample, copied into complementary DNA (cDNA) by reverse transcriptase and labelled with a fluorescent dye. Finally, the cDNA is passed over the solid substrate carrying the probes, and complementary sequences tend to hybridize. Expression is then estimated with a fluorescence scanner, which measures the amount of fluorescence coming from each probe on the slide [55]. The importance of this technology is illustrated by a simple search for the term "microarray" in the PubMed database, which returned nearly 57000 citations (website accessed on 13 October 2013; see footnote 1). Nevertheless, this technology has several limitations. Expression measurements have high background levels due to cross-hybridization; the probe sequences must be specified in advance, so the sequences to be interrogated must be known a priori; the accuracy of expression measurements is limited by both background and saturation of signals [56]; and probes can differ in their hybridization properties [57, 58]. Comparison between different transcripts on the same array is therefore unreliable, and the use of microarrays is restricted to detecting differential expression of the same probe target between samples. Finally, the array contains a limited number of probes, which limits the number of transcripts that can be interrogated. In contrast to microarray technology, sequence-based approaches directly determine the cDNA sequence. The oldest sequencing technology of this type is Sanger sequencing [59]. This approach is expensive, has relatively low throughput and is generally not quantitative [13]. To overcome these limitations, tag-based methods were developed. Examples include serial analysis of gene expression (SAGE) [60], cap analysis of gene expression (CAGE) [61] and massively parallel signature sequencing (MPSS) [62]. These technologies provide a digital readout of gene expression levels using DNA sequencing and are able to report the expression of genes at levels below the sensitivity of microarrays. However, most are based on Sanger sequencing technology and thus have high associated costs. Moreover, a significant fraction of the short tags can be mapped to more than one place in the reference genome. These disadvantages have limited the application of these technologies. The development of novel high-throughput DNA sequencing methodologies has provided tools to

1 Source: http://www.ncbi.nlm.nih.gov/pubmed/?term=microarray

overcome the limitations of both microarrays and tag-based methodologies [63, 64]. In particular, these new methods can be employed for both mapping and quantifying transcriptomes. This methodology, termed RNA-Seq, in theory enables the acquisition of the transcriptome across all cell types, perturbations and states [58]. Owing to its low cost and high throughput, this technique has increasingly become an integral part of microbiological research [8, 13]. A further advantage is that RNA-Seq can be used not only for gene expression profiling but also for the detection of gene fusion events [65], the discovery of single nucleotide polymorphisms [66], the investigation of post-transcriptional RNA mutations [67], the study of alternative splicing events [68], the discovery of novel transcripts [69] and the investigation of allele-specific expression [70].

2.4.2 RNA-Sequencing experiment workflow

There are several technologies available to perform high-throughput sequencing of DNA molecules. Currently, the ones that dominate the field are the 454 GS-FLX from Roche Applied Science, the Genome Analyzer II from Illumina, Inc. and the AB SOLiD from Applied Biosystems. Other technologies with potentially even higher quality and throughput are being developed (e.g. Pacific Biosciences, Helicos). The different technologies require distinct experimental protocols, although the essence of these systems is the same: to miniaturize individual sequencing reactions. Of the techniques mentioned above, the most commonly used is the one employed by Illumina’s machines [71, 72]. For this reason, the tools used and developed in this thesis are designed for NGS data from the Illumina platform. The approach of these machines usually comprises the following fundamental steps:

1. Informative RNA enrichment

A typical RNA-Seq experiment starts by purifying a subset of RNAs from the bulk of the total RNA. For mRNAs, this enrichment is usually performed by selecting poly-(A) molecules using poly-(T) oligo-attached magnetic beads. In addition, given that the vast majority of the RNA present in cells (>90%) consists of rRNA [73], the RNAs of interest can be enriched by rRNA depletion [74, 75]. This enrichment step ensures that a strong signal is obtained for the RNA population of interest.

2. RNA fragmentation

Following purification, the RNA is fragmented into small pieces (100 to 300 bases long) via RNA hydrolysis or nebulisation. Alternatively, fragmentation can be performed on the cDNA, whose synthesis is explained in the next step, by DNase I treatment or sonication. Each of these approaches creates a different bias in the resulting molecules. Fragmentation of cDNA is more biased towards the 3’ end of the transcript. In contrast, RNA fragmentation provides more even coverage along the gene body, but is relatively depleted at both the 5’ and 3’ ends.

3. Synthesis of double stranded cDNA

The cleaved RNA fragments are then converted into DNA using reverse transcriptase. Reverse transcription requires the hybridization of primers to the RNA chain. These primers can be either short sequences of Ts (oligo dT primers) or random oligonucleotides (random primers). For long mRNAs, the use of oligo dT primers is not advisable because the frequency of reads diminishes towards the 5’ end of the chain, resulting in a bias towards the 3’ end of transcripts [73]. In these cases random primers, which can hybridize to random sites along the RNA molecule, are preferred. Once the first strand of cDNA is synthesised, the RNA template is removed and a second cDNA strand is created using DNA pol I and ribonuclease H, generating a double-stranded cDNA molecule.

4. Adapters ligation

In this process, cDNA 3’ overhangs are converted into blunt ends by specialized enzymes. Subsequently, an A base is added to the 3’ end of each blunted strand, preparing the cDNA fragments for ligation to the adapters, which carry a single overhanging T base at their 3’ end. Distinct adapters are then ligated to the 3’ ends of each strand of the double-stranded cDNA molecule.

Figure 2.5: First steps in the conversion of total RNA into a library of template molecules suitable for high throughput DNA sequencing: mRNA is polyA-selected from total RNA and fragmented – step 1; mRNA is then converted into DNA by reverse transcriptase – step 2 and 3; the ends of the cDNA chain are blunted and an A base is added to them – step 4a and 4b; the final product is created when the T of the adaptor hybridizes with the A added to the cDNA chain – step 4c. Adapted from [76].

5. Size selection and PCR amplification

In the fragmentation step, the molecules are split into fragments of different sizes. To ensure that all molecules are of similar length, a desired range of DNA lengths is purified by gel extraction. Moreover, this process removes unligated adapters as well as adapters that ligated to one another. After this step, two primers are annealed to the adapter tails and the purified cDNA molecules are amplified by PCR.

6. Cluster generation

Prior to sequencing, single-stranded DNA templates are bridge-amplified to form clonal clusters inside the flow cell. To do so, the double-stranded molecules that resulted from the PCR amplification step need to be denatured into single strands. First, the DNA templates are hybridized to a slide with a high density of immobilized forward and reverse primers. The templates are copied from the hybridized primers by 3’ extension using DNA polymerase. The original templates are then denatured, leaving the copies immobilized on the flow cell surface. After fixation, the immobilized copies loop over to hybridize to adjacent primers, and DNA polymerase copies the templates, forming double-stranded DNA bridges which, in turn, are denatured to form two single-stranded molecules carrying the same information as the first DNA template. This procedure is repeated on each template through cycles of isothermal denaturation and amplification to create dense clonal clusters containing at least 1000 molecules per cluster. Finally, the reverse DNA strands are removed by specific base cleavage, and the 3’ ends of the immobilized forward strands are blocked to prevent interference in the sequencing process.

Figure 2.6: Cluster generation process: the double-stranded DNA that resulted from the PCR amplification is denatured and hybridizes to adaptors immobilized in substrate surface – step 6a; the fixed single-stranded DNA molecules are amplified by bridge amplification, forming millions of unique clusters – step 6b; the reverse strands are then cleaved and washed away and, finally, sequencing primers are hybridized to the DNA templates – step 6c. Adapted from [77].

7. Sequencing-by-synthesis

This process starts with the hybridization of sequencing primers to each single-stranded molecule in the clonal clusters. A strand complementary to each DNA template is then synthesised, simultaneously in all clusters and one base per cycle, using fluorescently labelled nucleotides that carry a blocking group. After a nucleotide is added, the clusters are excited by a laser, which causes the last incorporated base to fluoresce. The fluorescent dye and the blocking group are then removed from the new nucleotide and the process is repeated, typically 30 to 200 times. The output of this process is a sequence of images, one for each newly incorporated nucleotide, in which the fluorescence signal of each cluster is captured and the colour of each spot identifies the base type. By combining the acquired sequence of images, it is possible to obtain the nucleotide sequence of each cluster. Finally, this information is saved in a text file, usually in the FASTQ format, which contains for each read a unique ID, the raw sequence of nucleotides and a quality value for each base in the read. The quality values are usually expressed as Phred scores. Denoting the quality score of a given base by Q and the estimated probability that the base call is incorrect by e, the Phred quality score is defined as $Q = -10\log_{10}(e)$, so that Q = 30 corresponds to an estimated error probability of 1 in 1000. Furthermore, it is important to note that the DNA fragments can be sequenced from only one end (single-end sequencing) or from both ends (paired-end sequencing).

Figure 2.7: Sequencing-by-synthesis process: after the hybridization of the sequencing primers, four labelled reversible terminators and DNA polymerase are added to the environment – step 7b; one nucleotide is added to the nucleic acid chain and, after laser excitation, the colour of its fluorescent label is registered – step 7c and 7d. The fluorescent label and the blocking group are then washed away and the described process is repeated. Adapted from [78].
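The relation between Phred scores and error probabilities, and the way quality values are stored in a FASTQ record, can be sketched in Python as follows. This is a minimal illustration, not the parser used in this work: the function names are hypothetical, the input is assumed to be well formed with exactly four lines per record, and the quality characters are assumed to use the Phred+33 ASCII offset (older Illumina pipelines used an offset of 64).

```python
import math


def phred_from_error(e):
    """Phred score Q for an estimated base-call error probability e."""
    return -10.0 * math.log10(e)


def error_from_phred(q):
    """Inverse relation: estimated error probability for a Phred score Q."""
    return 10.0 ** (-q / 10.0)


def decode_qualities(quality_line, offset=33):
    """Convert a FASTQ quality string into Phred scores (assumed Phred+33 encoding)."""
    return [ord(character) - offset for character in quality_line]


def parse_fastq(lines):
    """Yield (read_id, sequence, qualities) from an iterable of FASTQ lines.

    Assumes exactly four lines per record: '@id', sequence, '+', qualities.
    """
    stripped = (line.rstrip("\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped)
        next(stripped)  # separator line starting with '+'
        quality_line = next(stripped)
        yield header[1:], sequence, decode_qualities(quality_line)


record = ["@read_1\n", "ACGT\n", "+\n", "II5!\n"]
for read_id, sequence, qualities in parse_fastq(record):
    print(read_id, sequence, qualities)
```

With this encoding the character `I` stands for Q = 40 and `!` for Q = 0, which is why the last base of the example read would be considered unreliable.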

2.4.3 Quality control of sequence reads

As previously mentioned, high-throughput sequencers are able to generate huge amounts of data in a single run. Before analysing the acquired sequences and drawing biological conclusions from them, it is critical to evaluate the quality of the library as well as the sequencing performance. Base calling on any of the commonly used NGS platforms can be compromised by low-quality starting material, errors in chemistry or faults in the instrument [79–81]. Therefore, low-quality bases must be removed prior to any alignment or assembly of the reads. Quality Control (QC) of the library usually takes into account: the duplication rate, as it is expected that most sequences occur only once when the universe is greater than the sample size [82]; rRNA abundance, which should be low; strand specificity, defined as the number of reads mapping to known transcribed regions on the expected strand; coverage continuity along annotated transcripts; and performance at the 5’ and 3’ ends, defined as agreement with known end annotation [79]. To evaluate sequencing quality, valuable information is extracted from: the variation of the Phred score across the sequenced bases, which usually reflects the progressive decrease in base-call confidence from the 5’ to the 3’ end of the read, a consequence of the disturbance of the cluster’s fluorescence signal caused by accidental stops of elongation [83]; the base content, which is expected to show little to no difference between the different bases of a sequencing run; the number of N bases, which are emitted when the sequencer is unable to make a base call with sufficient confidence; and the length of the sequenced reads, which is expected to be uniform [84, 85]. Based on this type of analysis, bases with low sequencing quality should be trimmed, ensuring the quality of the high-throughput data that characterizes cellular activity and from which biological conclusions will be drawn.
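The trimming of low-quality bases can be sketched as below. This is a deliberately simple strategy chosen for illustration, not the algorithm of any particular QC tool: bases are removed from the 3’ end while their Phred score is below a threshold, and reads that end up too short or contain too many N bases are discarded. The function names and the thresholds (minimum quality 20, minimum length 20, at most 10% of N bases) are assumptions made for this example.

```python
def trim_3prime(sequence, qualities, min_quality=20):
    """Remove bases from the 3' end while their Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def passes_filters(sequence, min_length=20, max_n_fraction=0.1):
    """Keep a read only if it is long enough and has few uncalled (N) bases."""
    if len(sequence) < min_length:
        return False
    n_fraction = sequence.upper().count("N") / len(sequence)
    return n_fraction <= max_n_fraction


# Quality drops towards the 3' end, as is typically observed.
sequence, qualities = trim_3prime("ACGTAC", [30, 30, 30, 25, 10, 5])
print(sequence, qualities, passes_filters(sequence, min_length=4))
```

Trimming only from the 3’ end reflects the progressive loss of confidence along the read described above; dedicated tools usually apply more elaborate criteria, such as sliding windows.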

2.4.4 Mapping reads

Once abnormal reads have been filtered out of the raw cDNA sequence reads, the remaining short reads must be mapped to a reference genome or transcriptome. The main goal of this processing step is, therefore, to find the true location of each acquired RNA sequence on a given reference. Computationally, this problem was formalized by Fonseca et al. [86] as follows: given a set of sequences Q (produced by an NGS technology), a set of reference sequences R, a possible set of constraints and a distance threshold k, find all substrings m of R that are within a distance k of a sequence q in Q. Matching the reads to the reference can be challenging, given that millions of short reads need to be mapped to genomes that are usually very large. This means that mapping algorithms need to be extraordinarily efficient and use processors and memory optimally. Moreover, given that for complex organisms (such as human or mouse) repetitive sequences represent nearly 50% of the genome [87, 88], the mapping tools need to be able to handle multiple mapping locations. Read alignment against the genome is slower than alignment against the transcriptome because it must take into account all non-coding positions (which is a much larger problem than just introns); in particular, reads that span exon junctions must be split across introns (spliced alignment). Currently, there are several alignment programs capable of performing spliced alignments, including TopHat2 [89], SOAPSplice [90], Blat [91] and Exonerate [92]. On the other hand, aligners such as Bowtie 2 [93], BWA [94], MAQ [95] and SOAP2 [96] specialise in aligning short reads contiguously to a reference.

(a) Alignment of a paired-end read to the genome. The read needs to be mapped across introns (spliced alignment).

(b) Alignment of a paired-end read to the transcriptome.

Figure 2.8: Mapping of a paired-end read with different reference files: genome and transcriptome. Adapted from Trapnell and Salzberg, 2009 [97].
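The formal statement of the mapping problem given above can be turned into a brute-force Python sketch. It assumes, for illustration only, that the distance is the Hamming distance (mismatches only, no insertions or deletions) and that there are no additional constraints; the function names are hypothetical. Its cost grows with the product of the number of reads, the read length and the reference length, which is precisely why practical mappers rely on indexing instead.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def naive_map(reads, references, k):
    """For each read q, list every (reference name, position, distance) such that
    the substring of the reference at that position is within Hamming distance k of q."""
    hits = {}
    for q in reads:
        found = []
        for name, reference in references.items():
            for position in range(len(reference) - len(q) + 1):
                distance = hamming(q, reference[position:position + len(q)])
                if distance <= k:
                    found.append((name, position, distance))
        hits[q] = found
    return hits


references = {"chr1": "ACGTACGTTT"}
print(naive_map(["ACGT", "CGTT"], references, k=1))
```

The read `ACGT` occurs twice in this toy reference, which illustrates the multiple mapping locations that repetitive sequences cause in real genomes.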

Short-read mappers like Bowtie 2 use a computational strategy known as indexing to speed up the mapping process and to search efficiently with a relatively small memory footprint [93, 98]. In particular, Bowtie 2 indices are based on the Burrows-Wheeler Transform (BWT) [99] and on the Full-text index in Minute space (FM-index) [98, 100, 101]. These transformations allow the human genome to fit in 3.5 gigabytes. The alignment proceeds one character at a time, comparing seed substrings (short segments extracted from the raw NGS read which are likely to have unique matches in the genome) with the BW-transformed genome. Each successfully aligned character narrows the list of possible mapping positions. If Bowtie 2 cannot find a location where the read aligns perfectly, seed placements in the genome are prioritized to find the most likely mapping locations. Seeds are then extended into full alignments (allowing gaps) with a Single Instruction Multiple Data (SIMD)-accelerated dynamic programming algorithm until all seed hits are examined, until a sufficient number of alignments are examined, or until the dynamic programming effort limit is reached [93, 97, 98]. The output of these mappers is a file in SAM format, which contains information on both aligned and non-aligned reads. For the aligned reads, this file contains, among other fields, the portion of the genome to which the read was mapped and the respective alignment score [102].
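The idea that each successfully aligned character narrows the list of candidate positions can be illustrated with a small, unoptimised Python sketch of the BWT and of exact-match backward search on an FM-index. It is not Bowtie 2's implementation: the transform is built by sorting all rotations (feasible only for short strings), the occurrence table is stored in full rather than sampled, only exact matches are counted, and all function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations; '$' marks the end of the text."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def fm_index(bw):
    """Build the two FM-index tables from the transformed text.

    C[c]      : number of characters in the text that are smaller than c
    occ[c][i] : number of occurrences of c in bw[:i]
    """
    alphabet = sorted(set(bw))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bw.count(c)
    occ = {c: [0] * (len(bw) + 1) for c in alphabet}
    for i, character in enumerate(bw):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (character == c)
    return C, occ


def count_occurrences(pattern, C, occ, n):
    """Backward search: process the pattern from its last character to its first,
    narrowing the half-open interval [lo, hi) of sorted rotations that match."""
    lo, hi = 0, n
    for character in reversed(pattern):
        if character not in C:
            return 0
        lo = C[character] + occ[character][lo]
        hi = C[character] + occ[character][hi]
        if lo >= hi:
            return 0
    return hi - lo


bw = bwt("ACGTACGTTT")
C, occ = fm_index(bw)
print(bw, count_occurrences("ACGT", C, occ, len(bw)))
```

Each step of the loop costs the same regardless of the length of the reference, which is what makes this kind of index attractive for genomes with billions of bases.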

2.4.5 Expression quantification and normalization

Following the mapping of RNA-Seq reads, the data needs to be converted into a quantitative measure of gene expression. The simplest approach is to count the reads that fall within the coordinates of each element [73, 80]. For reads aligned to the transcriptome, extracting this information is straightforward. For reads aligned to the genome, gene expression can be measured using tools such as those available in the HTSeq package [103]. Given that RNA-Seq protocols fragment the RNA prior to sequencing in order to gain sequence coverage of the whole transcript, long transcripts will have more reads mapping to them than short transcripts of similar expression level [53, 104]. As a consequence, there is more power to detect differential expression for longer genes, and read counts need to be properly normalized to extract meaningful expression estimates [53]. Moreover, sequencing depth varies from run to run, which influences the number of fragments mapped in each sample. Hence, it is also necessary to normalize for each sequencing run in order to avoid genes appearing to be differentially expressed only because more sequences are present in one condition than in another [58, 104, 105]. One approach to normalizing RNA-Seq data is the reads per kilobase of transcript per million mapped reads (RPKM) metric, which normalizes the read count of a transcript by both its length and the total number of reads mapped in the sample [53]. Similarly, the fragments per kilobase of transcript per million mapped fragments (FPKM) metric normalizes paired-end data.

Figure 2.9: Transcripts of different lengths with different read coverage levels (left), total read counts observed for each transcript (middle) and FPKM-normalized read counts (right). Extracted from Garber et al., 2011 [106].
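The RPKM normalization can be written as a short Python sketch. The function name and the toy counts are hypothetical, and the total number of mapped reads is assumed here to be the sum of the counts over the listed transcripts; in practice it is the number of mapped reads in the whole library.

```python
def rpkm(counts, lengths):
    """Reads per kilobase of transcript per million mapped reads.

    counts  : transcript -> number of reads mapped to it
    lengths : transcript -> transcript length in bases
    """
    total_mapped = sum(counts.values())
    return {
        transcript: counts[transcript] * 1e9 / (lengths[transcript] * total_mapped)
        for transcript in counts
    }


# Transcripts A and B have different raw counts only because B is three times longer.
counts = {"A": 100, "B": 300, "C": 600}
lengths = {"A": 1000, "B": 3000, "C": 2000}
print(rpkm(counts, lengths))
```

The factor 1e9 combines the two scalings of the metric: 1e3 for "per kilobase" and 1e6 for "per million mapped reads". In the toy data, A and B receive the same RPKM despite their different raw counts, while C is three times higher.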

However, several studies have shown that the normalization approaches described above do not account for the composition of the RNA population [105] or for the fact that a small number of highly expressed genes can consume a significant share of the total sequence [107]. A possible way to take these features into account is to estimate scaling factors from the NGS data and use them within the models that test for differential expression (DE) [107, 108]. This methodology has the advantage that the raw count data is preserved for subsequent analysis. Finally, quantile normalization can be used as an alternative to the methodologies described above [109]. Nonetheless, the non-linearity of this approach leads to the loss of the raw count data and does not seem to improve the detection of differentially expressed genes when compared with an appropriate scaling-factor methodology [107, 110].
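One way of estimating such scaling factors is a median-of-ratios scheme, sketched below in Python. It is offered as an illustration of the general idea under stated assumptions, not as the exact procedure of the cited methods: each sample is compared with a pseudo-reference built from the per-gene geometric mean, and the median ratio is taken as the factor of that sample. The function name is hypothetical, and genes with a zero count in any sample are simply ignored.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one scaling factor per sample by the median of ratios.

    counts : gene -> list of raw read counts, one entry per sample
    """
    n_samples = len(next(iter(counts.values())))
    # Keep genes with non-zero counts in every sample so the geometric mean is defined.
    usable = [row for row in counts.values() if all(c > 0 for c in row)]
    log_geometric_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geometric_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# The second sample was sequenced twice as deeply; g5 is a single highly expressed gene.
counts = {
    "g1": [10, 20], "g2": [100, 200], "g3": [5, 10],
    "g4": [0, 3], "g5": [10000, 20],
}
print(size_factors(counts))
```

Because the median is used, the single highly expressed gene `g5` does not distort the factors, and the raw counts are left untouched for the subsequent statistical model.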

2.4.6 Differential expression

Once gene expression has been quantified and normalized, statistical testing between conditions is usually performed. Given the count-based nature of RNA-Seq data, initial models fitted the observed read counts with count-based distributions such as the Poisson distribution, which provides a good fit for counts arising from technical replicates [58, 111]. However, studies such as those by Leek et al. [112] and Smyth et al. [113] concluded that these distributions do not account for biological variability across samples. This misfit is explained by the fact that the variance of a Poisson distribution is equal to its mean, whereas the variance across RNA-Seq biological replicates is likely to exceed the mean for a considerable number of genes. Hence, for datasets with biological replicates, Poisson-based analysis is prone to a high false discovery rate (FDR) resulting from the underestimation of sampling error [108, 112, 114]. To overcome this limitation, many methods have been developed in an attempt to model biological variability and provide a measure of statistical significance in datasets with a small number of biological replicates. Robinson and Oshlack [114] were the first to propose the use of a negative binomial (NB) distribution to model the counts across samples and implemented it in an R/Bioconductor package [115] called edgeR [114, 116]. This methodology makes it possible to account for the variability between biological replicates that the Poisson distribution is unable to model, known as over-dispersion. The NB distribution is an extension of the Poisson distribution with an additional dispersion parameter, which accommodates the extra variability. Thus, the number of reads in sample j that are assigned to gene i, denoted $K_{ij}$, is modelled by equation 2.1, where $\mu_{ij}$ is the mean and $\sigma_{ij}^{2}$ the variance of the distribution.

$$K_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma_{ij}^{2}) \qquad (2.1)$$

However, the number of replicates in a typical RNA-Seq experiment is too small to estimate both of these parameters with confidence independently for each gene. To solve this problem, Bioconductor’s edgeR package estimates only a single factor and assumes a relationship between the variance and the mean of the defined negative binomial distribution:

$$\sigma_{ij}^{2} = \mu_{ij} + \alpha\,\mu_{ij}^{2} \qquad (2.2)$$

with α being a proportionality constant that is estimated from the data and is the same throughout the experiment. The edgeR approach uses, therefore, a common dispersion model for the whole population [110]. Another approach was used by Anders and Huber [108], who still use an NB distribution but estimate the mean-variance relationship within each sample by local regression. The dispersion is, thus, locally estimated, which allows a more flexible relationship between the mean and the variance [108]. When the experiment contains a low number of replicates, this methodology avoids inaccuracies in the dispersion estimation, and the locally estimated dispersions are used instead of the raw estimates in an NB exact test [114]. There is another variation of NB-based DE analysis called baySeq [117]. This methodology uses an empirical Bayesian approach to detect patterns of differential expression within a set of sequencing samples. Moreover, extensions to the Poisson model in which over-dispersion is included have also been proposed [118]. Several surveys compare all these methodologies in the discovery of differentially expressed genes [119–121]. Overall, these studies concluded that no single method among those evaluated is optimal under all circumstances and hence the method of choice in a particular situation depends on the experimental conditions.

Finally, it is important to note that the biological conclusions extracted from the described methodologies must be interpreted with care. In fact, differences in library construction [79] and variability intrinsic to the biological samples can greatly influence the number of false positives. With this in mind, it is imperative to have biological replicates in the RNA-Seq dataset, since these are essential to measure the intrinsic variability of the samples [106].
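To make the mean-variance relationship of equation 2.2 concrete, the sketch below evaluates the NB variance for a given dispersion and recovers a dispersion value from replicate counts with a simple method-of-moments estimate. This is only an illustration under stated assumptions: the moment estimator is a deliberate simplification and is not the estimator implemented by edgeR or DESeq, and the function names and replicate counts are hypothetical.

```python
from statistics import mean, variance


def nb_variance(mu, alpha):
    """Variance implied by equation 2.2: sigma^2 = mu + alpha * mu^2."""
    return mu + alpha * mu ** 2


def moment_dispersion(replicate_counts):
    """Method-of-moments dispersion for one gene (simplified sketch).

    Solves s^2 = m + alpha * m^2 for alpha, using the sample mean m and
    the sample variance s^2, and floors the result at 0 (Poisson case).
    Needs at least two replicates.
    """
    m = mean(replicate_counts)
    s2 = variance(replicate_counts)
    if m == 0:
        return 0.0
    return max((s2 - m) / m ** 2, 0.0)


# Hypothetical counts of one gene in four biological replicates.
replicates = [80, 100, 120, 140]
alpha = moment_dispersion(replicates)

# With alpha = 0 the NB variance collapses to the Poisson variance (= mean).
poisson_like = nb_variance(100, 0.0)
over_dispersed = nb_variance(100, 0.1)
```

With only a handful of replicates such a per-gene estimate is very noisy, which is the reason the methods above share information across genes instead of estimating each dispersion independently.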

2.4.7 Pathway analysis

As a final step, the list of differentially expressed genes can be grouped into common pathways. This analysis identifies differentially active pathways and, ultimately, enables the connection of molecular information (transcriptome) with the phenotype of the organism under study (active pathways) [122]. Emmert-Streib and Glazko [122] describe the main reasons why this type of analysis is so important:

• reducing the dimensionality of the dataset lowers the number of statistical hypotheses that need to be tested;

• conclusions about pathways are more interesting, given that genes are correlated with each other in gene networks;

• considering sets of genes with similar behaviour reduces the importance of false positives in the list of differentially expressed genes;

• for individual genes, the expression variation in a certain disease could be only moderate or even negligible. Analysing the variation of sets of genes can reveal differences between disease conditions. One example is type II diabetes, in which no differentially expressed genes are found between positive and negative patients but, in fact, the expression of a set of genes related to oxidative phosphorylation is decreased in human diabetic muscle [123].

The methodologies that evaluate the behaviour of differentially expressed genes and group them into cellular pathways exploit pathway knowledge available in public repositories such as the Gene Ontology (GO) [124] or the Kyoto Encyclopedia of Genes and Genomes (KEGG) [125, 126]. This is performed by relating the pathway information in these databases with the gene expression patterns, which transforms the list of individual genes into a set of pathways. Following the classification by Khatri et al. [127], this type of pathway analysis is designated over-representation analysis (ORA) and usually follows this strategy: it receives as input a pre-selected gene list [128], the list of individual DE genes. From this list, the genes with the highest rates of under- or over-expression at a certain FDR are selected. Then, a test is performed to check whether the genes in this list have any biological function in common. This search is, therefore, targeted at associating sets of genes involved in the same cellular process [129]. The most commonly used tests are based on Fisher’s exact test [130] or on the hypergeometric [131], chi-square [132] or binomial distributions. An extensive list of tools designed to perform this type of analysis is introduced by Lempick et al. [128].
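The over-representation test described above can be sketched with the hypergeometric distribution. The example is a minimal, self-contained illustration and not the implementation of any particular ORA tool; the function name and the numbers (a universe of 20 genes, a term annotating 5 of them, 4 selected DE genes of which 3 carry the term) are hypothetical.

```python
from math import comb


def hypergeom_over_pvalue(universe_size, term_size, selected_size, overlap):
    """P(X >= overlap) when drawing `selected_size` genes without replacement
    from a universe in which `term_size` genes are annotated with the term."""
    max_overlap = min(term_size, selected_size)
    total = comb(universe_size, selected_size)
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, selected_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical example: is the term over-represented among the DE genes?
p_value = hypergeom_over_pvalue(
    universe_size=20, term_size=5, selected_size=4, overlap=3
)
```

A small p-value indicates that the selected list contains more genes annotated with the term than expected by chance. In practice this test is repeated for every term, so the resulting p-values are themselves subject to a multiple-testing problem.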

3 Listeria monocytogenes

Contents

3.1 Intracellular infectious cycle
3.2 Listeria monocytogenes as a model organism
3.3 Listeria monocytogenes case study: dataset description

3.1 Intracellular infectious cycle

The L. monocytogenes cell invasion process begins with its internalization into the host cell, which is mediated by internalin A displayed on the bacterial surface. This protein is specialized in making contact with E-cadherin, a surface epithelial protein in the human intestines, triggering the movement of host cell surface proteins to the binding site [133]. A second protein on the L. monocytogenes surface, internalin B, binds to the Met receptor tyrosine kinase on the host cell which, in turn, induces the activation of the protein-tyrosine-kinase activity of Met [134], as well as of the phosphatidylinositol 3-kinase (PI3K) [135, 136] and the Ras-mitogen-activated protein kinase (MAPK) pathways [137, 138].

The activation of these signalling pathways leads to a series of events in the host cells, which eventually promote the cytoskeleton changes needed to internalize the attached bacterium [139]. Upon entering the cell, L. monocytogenes uses the cell’s machinery to replicate. For that to occur, L. monocytogenes must escape from the internalization vacuole into the cytosol [140]. To that end, after internalization, the Listeria-containing phagosome is disrupted by a dynamic multistep process which involves two phospholipases (PlcA and PlcB) and a pore-forming toxin, LLO, that are under the control of one transcriptional regulator, prfA [138, 141, 142]. The bacterial secretion of LLO must be tightly regulated in order to balance efficient escape and, at the same time, prevent damage to the host cell [143]. The perforation process occurs when, at the acidic pH of the vacuole, LLO monomers oligomerize into large complexes that penetrate the membrane by extending β-hairpins [144]. At the neutral pH of the cell, acidic amino acids initiate irreversible denaturation of the β-hairpins of LLO, leading to the inactivation of this pore-forming hemolysin [138, 143].

Moreover, LLO also has the ability to induce signalling in the host cell by activating, for example, MAPK [145], calcium [146] and protein-kinase-C [147] signalling pathways. Shortly after invading the cell cytosol, L. monocytogenes induces polymerization of host actin filaments and uses the force generated by actin polymerization to move first intra-cellularly and then from cell to cell [140, 148]. Actin filaments are organized into a comet-shaped tail at one bacterial pole, and their sustained nucleation and elongation provide the propulsive force to push the bacteria through the cytosol at a relatively high speed [149]. Upon finding the plasma membrane of the infected cell on their path, the bacteria push this membrane forward, creating an invagination in the membrane of the neighbouring cell. This protrusion, with the bacteria at the tip, is taken up by phagocytosis. By this process, the bacteria are capable of invading neighbouring cells while escaping the humoral immune system of the host [149]. This process is illustrated in figure 3.1.

3.2 Listeria monocytogenes as a model organism

L. monocytogenes was discovered by Murray et al. in 1926, during an epidemic that affected laboratory rabbits and guinea pigs [150]. Later, in 1929, it was found to also infect wild animals [151] and humans [152], and it was recognized as a food pathogen in 1986 [153]. L. monocytogenes causes an infection named listeriosis and is responsible for gastroenteritis in healthy individuals, meningitis in immunocompromised individuals, and abortions in pregnant women, with a high mortality rate (20-30%) [154]. Since the late 1980s, cell biology approaches combined with molecular biology and genomics have unveiled the elegant strategies used by L. monocytogenes [154] to enter non-phagocytic cells, escape from the internalization vacuole, replicate in the host cytosol and manipulate the innate response triggered in the cytosol. Its intracellular life-cycle in the human host provides insight into the dynamics of general host-pathogen interactions and enables the study of the early steps of the infection in vivo. Bearing this in mind, L. monocytogenes has been used over these two decades as an important model of bacterial adaptation for understanding how pathogens engineer host cellular environments [138, 154–158].

More recently, deep sequencing studies have also been widely used to decipher the molecular behaviour of L. monocytogenes [159–164].

Figure 3.1: Schematic representation and electron micrographs of the L. monocytogenes life-cycle. (a) L. monocytogenes induces its internalization into the host cell; (b) the bacteria enter a phagosome; (c) the internalization vacuole is disrupted and the bacteria are released into the cytoplasm; (d) polymerization of host actin filaments (actin tail); (e) formation of protrusions in the plasma membrane; (f) invasion of the neighbouring cell, after which the previously described processes are repeated. Extracted from Cossart et al. [138].

3.3 Listeria monocytogenes case study: dataset description

In this thesis I aim to understand how L. monocytogenes influences human host cells. To that end, I analysed an RNA-Seq dataset that was acquired from populations of human cells infected with this bacterium. The experimental data is composed of three populations of HeLa cells. These cells belong to a human cell line derived from cervical cancer cells, which has a remarkable resistance to aggressive agents [165]. The three populations are characterized as follows: Control, non-infected HeLa cells growing in a healthy medium; LM1, HeLa cells infected with the wild-type L. monocytogenes strain EGDe; LM2, HeLa cells infected with a mutant L. monocytogenes strain EGDe, from which the hly gene that encodes LLO was removed. Total RNA was extracted from the cells of each population at four time-points (20, 60, 120 and 240 minutes) with the purpose of representing specific stages of the bacterium life-cycle. The extracted RNA was sequenced using the Illumina platform. This procedure resulted in a paired-end dataset in which each read is 90 base pairs long.

4 RNA-Sequencing analysis pipeline

Contents

4.1 Methods
4.2 Listeria monocytogenes case study
4.3 Discussion
4.4 Future work

Over the past few years, a large number of methods have been developed to cope with different aspects of RNA-Seq data analysis. However, combining them into a coherent analysis pipeline is not a simple challenge, in the sense that their combination must address the needs and specificities of each problem. Therefore, awareness of each available tool is crucial to prevent erroneous biological conclusions, given that the configuration of upstream tools will influence the results of downstream tools and, ultimately, determine the extracted biological conclusions.

With these considerations in mind, I propose a sequence of tools (pipeline) which is suitable for performing a comparative study between two RNA-Seq samples. Starting from raw RNA-Seq reads, this pipeline is able to extract the main biological processes that differ between the analysed conditions. The development of this pipeline is of broad interest, given that it is not unusual for biologists to perform a comparative analysis after subjecting cells, on the bench, to different environments and pathogenic agents. For that, the chosen approach is frequently to sequence the cell transcriptome using NGS techniques [8, 17, 18]. However, the output of these technologies is, routinely, dozens of gigabytes of data, and computational power is essential to extract relevant information from it. This pipeline provided the basis for the results described in the subsequent chapters of this thesis.

4.1 Methods

A diagram illustrating the conceived pipeline is shown in Figure 4.1. This pipeline was implemented in a Makefile [166] and incorporates both publicly available tools and scripts developed by me to perform the biological evaluation of the RNA-Seq data.

Data analysis begins with the input of the raw read files and the reference files. Once this data is gathered, the reads are processed with FastQC [167]. FastQC is a Java program which provides tools to perform a QC study on raw high-throughput sequence data. The analysis performed by this tool checks that the data is of good quality and that there are no problems or biases in it, reporting the following measurements: per base sequence quality, per sequence quality scores, per base sequence content, per base GC content, per base N content, sequence length distribution, sequence duplication levels, overrepresented sequences and Kmer content. Problems detected by a QC analysis may derive from the sequencer or from the starting library material. Some abnormalities may be resolved by trimming base pairs from the raw read. The pipeline supports this through a script that is able to trim a given number of base pairs. By default, the pipeline does not trim any bases from the raw nucleotide sequence. This option can be modified in the Makefile, depending on the QC results.

Afterwards, the clean reads are aligned to a reference sequence using Bowtie 2, a well-established mapper [93]. This is a fast and memory-efficient mapping tool that is particularly suitable for the alignment of short reads, such as the ones from RNA-Seq technology, to long reference sequences, such as the human genome reference. Before the mapping process, the reference file must be indexed to be used by Bowtie 2. To perform this task, bowtie2-build is used. This tool constructs a Bowtie index from the set of DNA sequences in the reference file, which is usually in the FASTA format [168]. Once the index is built, Bowtie no longer uses the original sequence FASTA file. At this point, a set of options associated with the type of search performed by the Bowtie 2 algorithm needs to be defined (see appendix A, Section A.3.1). The output of the mapping process is a Sequence Alignment/Map (SAM) file which stores the information about the read alignments against the reference sequence. In particular, for paired-end reads two records are printed (i.e. two lines of output) describing the mapping properties of each mate [102].

Typically, after mapping RNA-Seq reads to a reference, the number of reads that map to a certain gene or transcript is measured. For RNA-Seq, this read count has been found to be roughly linearly related to the abundance of the target transcript [53]. In order to access the gene expression information, the pipeline uses HTSeq-count, a script integrated in the HTSeq Python package that counts how many reads map to a certain feature [103, 169]. A feature is defined as an interval or a union of intervals on a chromosome. Due to the nature of this study, where high-level pathway analysis was the goal, it was not considered relevant to distinguish multiple isoforms of the same gene. Therefore, features in this analysis are equivalent to genes and each gene is considered as the union of all its exons. This approach considerably simplifies the data processing. Moreover, to count the mapped reads, a reference file containing the information about the features is needed. The default counting approach, the Union mode, is defined as follows: if a read maps to only one gene, the read is counted for that gene; if it maps to more than one gene, the read is counted as ambiguous; and if no gene is matched at the read position, the read is counted as no feature. HTSeq-count offers two further modes. The user can change the counting mode in the Makefile depending on the intended analysis.
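The Union counting rule described above can be sketched as follows. This is a deliberately simplified illustration and not HTSeq-count itself: it assumes a single chromosome, ignores strand and read pairing, and represents each gene as a list of half-open exon intervals. The gene names, coordinates and function names are hypothetical.

```python
from collections import Counter


def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def count_union(reads, genes):
    """Assign each aligned read to a gene, to 'ambiguous' or to 'no_feature'.

    reads: list of (start, end) alignment intervals.
    genes: dict mapping gene name to a list of (start, end) exon intervals.
    """
    counts = Counter()
    for read_start, read_end in reads:
        # Union of all genes touched by any position of the read.
        hit = {
            gene
            for gene, exons in genes.items()
            if any(overlaps(read_start, read_end, s, e) for s, e in exons)
        }
        if len(hit) == 1:
            counts[next(iter(hit))] += 1
        elif hit:
            counts["ambiguous"] += 1
        else:
            counts["no_feature"] += 1
    return counts


# Hypothetical annotation: two genes whose exons partially overlap.
genes = {"geneA": [(100, 200), (300, 400)], "geneB": [(380, 500)]}
reads = [(120, 195), (350, 425), (600, 675), (310, 370)]
counts = count_union(reads, genes)
```

Here two reads are assigned to `geneA`, the read spanning both genes is counted as ambiguous, and the read outside any exon is counted as no feature.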
The output is a table in which the entry in the i-th row and the j-th column contains the number of reads that have been mapped to gene i in sample j.

After the gene expression level measurements, a DE test is performed between the RNA-Seq samples. The main goal of this test is the detection of DE genes among the conditions under study. The inference of differential signal in such data is done using DESeq [108, 170], an R/Bioconductor package [115], which uses a negative binomial distribution to model the gene expression distribution. In order to do so, the gene expression measurements of both conditions being compared need to be concatenated into one table. From this concatenated table, DESeq estimates the dispersion for each gene and analyses whether there is differential expression among them. The final output of this step is a table in which the entries correspond to the genes that are DE between the two conditions being compared. For each gene, the output table contains the mean expression level, both as a joint estimate from both conditions and estimated separately for each condition, the fold change from the first to the second condition, the logarithm (to base 2) of the fold change, the p-value for the statistical significance of this change and the p-value adjusted with the Benjamini-Hochberg procedure [171], which controls the expected proportion of false positives among all the rejected hypotheses (FDR).

To select the most significant entries in this analysis, the pipeline is pre-set to trim the raw output table by selecting only the entries with a p-value less than 0.1. In order to understand this trimming, it is necessary to have a clear understanding of which statistical measure the p-value expresses. The p-value is defined as the probability of obtaining a value at least as extreme as the one actually observed, assuming that the null hypothesis is true. This null hypothesis refers to a general or default position and it is rejected if the p-value is less than a threshold (significance level). In the DE analysis context, the null hypothesis corresponds to a scenario in which the genes are not differentially expressed. This means that, for low p-values, it is very unlikely that the observed difference occurred by chance and, thereby, the 0.1 p-value cut-off ensures that only the statistically significant entries will be considered in further analysis. In conclusion, the output of this step consists of three tables whose entries were previously selected by statistical significance: one with the genes sorted by p-value, another with the most down-regulated genes and another with the most up-regulated genes.

Lastly, the genes found to be differentially expressed are associated with GO terms using a Bioconductor package called GOStats [131, 172, 173]. This software uses a standard hypergeometric test in order to relate a given gene list to the controlled vocabulary in the GO database. The GO project provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data. In particular, it consists of three structured controlled vocabularies: biological processes (BP), sets of molecular events which are essential to the functioning of integrated living units; cellular components (CC), parts of cells or of their extracellular environment; and molecular functions (MF), elemental activities of a gene product at the molecular level [124]. Therefore, in this final step, it is possible to gain biological insight into the samples being compared. Specifically, from the obtained table one can identify the most differentially active processes when the cell is subjected to distinct growth circumstances.
To perform an analysis using the hypergeometric test implemented in the GOStats package, it is necessary to define a gene universe. The pre-set universe is contained in the org.Hs.eg.db [174] package from Bioconductor. This is a genome-wide annotation database for Homo sapiens, mapped with Entrez Gene identifiers. Therefore, the pipeline is set up to analyse human RNA-Seq data. Secondly, it is necessary to define a list of genes from that universe. In this analysis, this list corresponds to the collection of genes in the final table of the DE analysis step. For the entries generated through the hypergeometric test, a p-value cut-off of 0.1 is defined. Additionally, the test direction is set as over, so the final result of this step will be a table with the over-represented GO terms associated with the differentially expressed genes found in the previous pipeline step, for each one of the ontologies. In other words, GOStats will identify the most important ontology terms that differ between the conditions being compared in the DE analysis.
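The Benjamini-Hochberg adjustment applied to the DE p-values in this section, together with a 0.1 cut-off, can be sketched as below. This is an illustrative re-implementation in Python and not the code used by DESeq; the p-values are hypothetical, and applying the cut-off to the adjusted values (rather than to the raw p-values) is an assumption made for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.20]
padj = benjamini_hochberg(pvals)

# Assumed selection rule for the example: keep genes with padj < 0.1.
selected = [i for i, q in enumerate(padj) if q < 0.1]
```

Filtering on the adjusted value bounds the expected fraction of false positives in the selected list, whereas a cut-off on the raw p-value bounds only the per-gene error rate.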

4.2 Listeria monocytogenes case study

The pipeline described above was used to process the experimental data characterized in section 3.3. In the subsequent sections I describe the results from each processing step, interpret them and, ultimately, draw conclusions about which pathways are up- and down-regulated in human HeLa cells upon L. monocytogenes infection.

Figure 4.1: Flowchart of the developed RNA-Seq analysis pipeline: a QC study is performed on the raw RNA-Seq reads using FastQC; the filtered reads are then mapped to a reference file using Bowtie 2; from this data, the gene expression level is measured and a differential expression test is performed by DESeq; the genes are then grouped into GO terms using GOStats. In the end, this pipeline gives insight into the biological processes that are differentially active between the conditions in comparison. Boxes with no outline represent the input/output files. Boxes with a yellow outline contain the task and the tool that performs it.

4.2.1 Quality Control – FastQC

Overall, the quality of the L. monocytogenes RNA-Seq data was high. Nevertheless, 3 of the 10 examined quality modules presented warnings. Those warnings appeared in the Per base sequence content, the Per base GC content and the Sequence Duplication levels sections. The first of these modules plots the proportion of the four DNA bases at each base position in a sequence file. For a random library, little to no difference is expected between the different bases along a sequence run. This property was not verified for any of the analysed read files. In fact, all of them had very high variability in the first 15 bases, which points to the presence of an overrepresented sequence in the library. This may be related to a problem in the library generation or can be a consequence of an abnormal sequencing process. However, the most plausible explanation is that the use of random hexamer priming introduces biases at the start of sequencing reads, as described by Hansen et al. [175]. Regarding the other two flagged sections, the Per base GC content module performs an analysis similar to the one explained previously but, instead of considering all four nucleotides, focuses only on the G and C content. The Sequence Duplication levels module refers to the degree of duplication of every sequence in the dataset which, for the processed data, pointed to a large number of sequences with very high levels of duplication (approximately 25% of the analysed sequences presented a degree of duplication higher than 10). For this module, only the first 50 base pairs are used to assess the duplication plot. Knowing this and the duplication rate, it is possible to conclude that nearly 12 base pairs of the total analysed 50 base pairs were repeated. This fact is consistent with the hypothesis of the existence of some abnormality among the first 15 base pairs.

Based on this study, 15 bases were trimmed from the beginning of each original sequence. Given that the reads are long (90 base pairs), the loss of information inherent in this trimming is not significant. Therefore, after this step, each read is composed of 75 base pairs.
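The trimming step can be illustrated with a minimal sketch that removes a fixed number of bases from the start of every read in a FASTQ file, assuming the standard four-line record layout (header, sequence, separator, quality). It is not the script used in the pipeline, which is not reproduced here; the function names and the example read are hypothetical.

```python
def trim_read(header, sequence, separator, quality, n_bases=15):
    """Drop the first n_bases from the sequence and from its quality string."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    return header, sequence[n_bases:], separator, quality[n_bases:]


def trim_fastq(lines, n_bases=15):
    """Trim every record of a FASTQ file given as an iterable of lines."""
    records = [line.rstrip("\n") for line in lines]
    if len(records) % 4 != 0:
        raise ValueError("a FASTQ file has four lines per read")
    trimmed = []
    for i in range(0, len(records), 4):
        trimmed.extend(trim_read(*records[i:i + 4], n_bases=n_bases))
    return trimmed


# Hypothetical 90 bp read: 15 biased leading bases followed by 75 bases.
example = ["@read1", "A" * 15 + "C" * 75, "+", "I" * 90]
trimmed = trim_fastq(example)
```

The quality string must be trimmed together with the sequence, otherwise the record becomes invalid for the downstream mapper. With `n_bases=0` the reads are returned unchanged, which corresponds to the pipeline default of not trimming.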

(a) Per base sequence content. (b) Per base GC content. (c) Sequence Duplication levels.

Figure 4.2: Modules with warnings in the QC analysis of the non-mutant (wild-type) L. monocytogenes data, for time-point 20 and for the first mate of the paired-end reads.

4.2.2 Mapping reads – Bowtie 2

The filtered L. monocytogenes reads were then mapped against a reference file. The reference file was downloaded from the Ensembl [176] website and contains the reference human DNA sequence in FASTA format. In particular, the file used corresponds to release number 71, assembly 32 [177]. The reference file was indexed by bowtie2-build and used by Bowtie 2 to map the L. monocytogenes RNA-Seq reads to the human genome. On average, the alignment rate was approximately 87.58% for the LM1 samples and 88.24% for the LM2 samples (table 4.1), which is a good alignment rate.

Table 4.1: Alignment rate of the RNA-Seq reads of each sample against the human genome.

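| Sample | Time-point (min) | Alignment rate |
|---|---|---|
| Control | 0 | 88.66% |
| LM1 | 20 | 88.24% |
| LM1 | 60 | 87.11% |
| LM1 | 120 | 87.75% |
| LM1 | 240 | 87.20% |
| LM2 | 20 | 89.70% |
| LM2 | 60 | 87.36% |
| LM2 | 120 | 87.50% |
| LM2 | 240 | 88.39% |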
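As a quick consistency check, the overall rates quoted in the text for LM1 and LM2 can be compared with the per-time-point values of table 4.1. The sketch assumes that the quoted figures are unweighted means over the four time-points, which the text does not state explicitly.

```python
from statistics import mean

# Alignment rates (%) from table 4.1, ordered by time-point (20, 60, 120, 240 min).
alignment_rates = {
    "LM1": [88.24, 87.11, 87.75, 87.20],
    "LM2": [89.70, 87.36, 87.50, 88.39],
}

# Unweighted mean per condition (assumption: every time-point counts equally).
mean_rates = {sample: mean(rates) for sample, rates in alignment_rates.items()}
```

Under this assumption the means are 87.575% for LM1 and 88.2375% for LM2, which agree with the values quoted in the text at two decimal places.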

4.2.3 Expression quantification – HTSeq-count

Following the alignment of the RNA-Seq reads, the data needs to be translated into a quantitative measure of gene expression. This task is performed by HTSeq-count, which counts the number of reads that map to a given gene. To do this, a reference file that contains all the annotated protein-coding genes in the human genome release 71 is needed. This information is contained in a GTF file that was downloaded from Ensembl’s website [178].

4.2.4 Differential expression – DESeq

Given the samples and the respective conditions that constitute the L. monocytogenes dataset, it was decided to compare the infected samples, with both mutant and non-mutant L. monocytogenes, to the non-infected control sample. In order to do so, each one of the tables from the previous step was concatenated with the table containing the gene expression levels in the control sample. From these concatenated tables, DESeq estimates the dispersion of each gene and analyses whether there is differential expression between the defined conditions (e.g. comparison between non-infected – Control – and infected non-mutant data at time-point 20 – LM1 20). The final output of this step is a table in which the entries correspond to the genes that are differentially expressed between the two conditions being compared (table 4.2). This data can be represented in an MA-plot, which illustrates the dependence between the log ratio of the fold change of each gene and its mean count. More specifically, each dot in the MA-plot corresponds to a gene. The x-axis represents the baseMean, which corresponds to the mean of the number of reads divided by the size factor (normalization constant) of the sample; the y-axis represents the log2 of the fold change, which describes how much a quantity changes from one condition to another. Therefore, in this context, the analysis of this type of plot elucidates the dispersion associated with each gene.
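The coordinates of one gene in the MA-plot can be computed from its raw counts and the size factors of the two samples, as in the following sketch. It assumes one sample per condition (as in this dataset) and non-zero counts; the function name, the counts and the size factors are hypothetical.

```python
import math


def ma_point(count_a, count_b, size_factor_a, size_factor_b):
    """Return (baseMean, foldChange, log2FoldChange) for one gene.

    count_a / count_b: raw read counts in the first / second condition.
    size_factor_a / size_factor_b: normalization constants of the samples.
    """
    norm_a = count_a / size_factor_a
    norm_b = count_b / size_factor_b
    if norm_a <= 0 or norm_b <= 0:
        raise ValueError("this sketch assumes non-zero counts in both samples")
    base_mean = (norm_a + norm_b) / 2      # x-axis of the MA-plot
    fold_change = norm_b / norm_a          # change from first to second condition
    return base_mean, fold_change, math.log2(fold_change)  # last value: y-axis


# Hypothetical gene: 100 reads in the control and 400 reads in the infected
# sample, which was sequenced twice as deeply (size factor 2.0).
base_mean, fold_change, log2_fold_change = ma_point(100, 400, 1.0, 2.0)
```

In this example the fourfold difference in raw counts shrinks to a twofold change once sequencing depth is accounted for, which is why the plot is drawn on normalized rather than raw counts.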

Table 4.2: Highly differentially expressed genes for the comparison between the control and the sample acquired at time-point 20 from the wild-type infected cell population (control versus LM1 20).

| id | baseMean | foldChange | log2FoldChange | pval | padj |
|---|---|---|---|---|---|
| ENSG00000120738 | 627 | 49.16 | 5.619 | 4.764e-62 | 1.186e-57 |
| ENSG00000118523 | 499.5 | 14.609 | 3.869 | 1.408e-36 | 1.753e-32 |
| ENSG00000142871 | 783 | 8.491 | 3.086 | 3.822e-32 | 3.170e-28 |
| ENSG00000123358 | 548 | 3.871 | 1.953 | 1.389e-13 | 6.999e-10 |
| ENSG00000170345 | 104.5 | 15.077 | 3.914 | 1.406e-13 | 6.998e-10 |
| ENSG00000120129 | 303 | 3.488 | 1.802 | 3.885e-09 | 1.611e-05 |
| ENSG00000259884 | 192 | 4.12 | 2.042 | 1.389e-08 | 4.941e-05 |
| ENSG00000137801 | 1287 | 2.12 | 1.084 | 9.191e-07 | 0.003 |
| ENSG00000160888 | 265 | 2.868 | 1.520 | 1.387e-06 | 0.003 |

Figure 4.3 shows the MA-plots for all the time-points of the non-infected versus infected analysis. Genes classified as differentially expressed with an FDR less than 0.1 are represented in red and non-outlier genes in black. By visually comparing the MA-plots of condition LM1, an evolution in the number of DE genes is noticeable. In the beginning (time-point 20) a small set of genes is differentially expressed; this evolves into a large set of differentially expressed genes at time-points 60 and 120 and, finally, only a small number of genes are again found to be differentially expressed after 240 minutes. This indicates the existence of a cellular response at both the 60 and 120 time-points. Indeed, this evolution is consistent with expectations: after 20 minutes L. monocytogenes is starting its internalization into the cell, and the cell is not yet responding to the invasion. Upon internalization, the immunological response of the host cell is promoted, which increases the number of differentially expressed genes relative to the non-infected sample. After 240 minutes, the innate immunological response of the cell is already reduced, hence the lower number of differentially expressed genes. This reduction of the immunological response might be related to the decrease in the number of L. monocytogenes free in the cell cytosol. Probably, after 240 minutes, the bacteria have already left the first infected cell and are starting to infect neighbouring cells. The described events are not observed for the LM2 condition. Across all the experimental time-points, the number of differentially expressed genes is similarly low. This indicates that, in these samples, the cell does not differ much from the control and, therefore, no immunological response was active. These observations are also supported by the difference in fold change between the two conditions: for LM1 this value is, overall, higher than for the LM2 condition.

(a) Non-infected (control) versus infected wild-type (LM1). (b) Non-infected (control) versus infected mutant (LM2).

Figure 4.3: Normalized counts mean versus log2 fold change for the contrast non-infected versus infected for each time-point. Each dot corresponds to a gene. Genes classified as differentially expressed with a FDR less than 0.1 are represented in red and non-outliers genes in black.

For the genes found to be differentially expressed, with a p-value smaller than 0.1, the distribution of the entries is concentrated near 0, as can be seen in figure 4.4. This implies that the selected data is well structured for further analysis, since a large portion of it is concentrated near 0 and, hence, not dispersed through the bins along the considered range (p-value between 0 and 0.1).

(a) Non-infected (control) versus infected wild-type (LM1). (b) Non-infected (control) versus infected mutant (LM2).

Figure 4.4: P-value distribution for the statistically significant DE genes (p-value < 0.1).

However, it is important to always keep in mind that this analysis was performed without any replicates, which required several assumptions that might introduce uncertainty in the results. In fact, as stated in section 2.4, if the comparison of two conditions reveals differences between them, it is not possible to say for sure whether that discrepancy arises from experimental or biological noise. The first assumption needed for this kind of analysis is that the mean is a good predictor for the dispersion; thus, given two samples from different conditions and a number of genes with comparable expression levels, the dispersion can be estimated by comparing their counts across conditions. Moreover, for DESeq to estimate the dispersion, it needs to treat the samples as if they were replicates of the same condition. Therefore, the absence of replicates reduces the power of DE inference among RNA-Seq samples.

In addition to this analysis, I performed the comparison between samples infected with wild-type and mutant L. monocytogenes – LM1 versus LM2 for the same time-point (see tables in appendix A, Section A.2). Furthermore, I also compared consecutive time-points for the same condition (e.g. LM1 20 versus LM1 60). Neither of these analyses improved the conclusions extracted from the previously described study.

An important limitation of the L. monocytogenes dataset is that it does not contain any replicates. This restriction served as motivation for the development of the tools described in chapters 5 and 6.

4.2.5 Gene Ontology enrichment – GOStats

Using the pre-set options defined in the developed pipeline, GOStats identified the statistically significant ontology terms that differ between non-infected and L. monocytogenes infected cells. The following subsections summarize the obtained results and compare them with what was expected, taking into account the L. monocytogenes life-cycle.
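The p-value attached to each ontology term comes from an over-representation test, which is commonly based on the hypergeometric distribution. The following is a minimal sketch of that test for a single term. It ignores the structure of the GO graph and any multiple-testing considerations, and the function name, argument names and numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(universe_size, term_size, selected_size, overlap):
    """P(X >= overlap) when `selected_size` genes are drawn without replacement
    from a universe of `universe_size` genes, `term_size` of which carry the term."""
    upper = min(term_size, selected_size)
    total = comb(universe_size, selected_size)
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, selected_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20 genes in the universe, 5 annotated with the term,
# 4 differentially expressed genes, 2 of which carry the annotation.
p = hypergeom_enrichment_pvalue(20, 5, 4, 2)
print(p)
```

A small p-value indicates that the term is annotated to more DE genes than would be expected if the DE genes were a random draw from the universe.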

4.2.5.A Biological Processes

Table 4.3 summarizes the BP ontology terms that resulted from both analyses: control versus LM1 and control versus LM2. This ontology describes operations or sets of molecular events pertinent to the functioning of integrated living units. The differential biological processes analysis shows that at an early stage (time-point 20) the cell is reacting to the binding of an extracellular ligand to a receptor on its surface (enzyme linked receptor protein signalling pathway), where the receptor is closely associated with an enzyme such as a protein kinase (transmembrane receptor protein serine/threonine kinase signalling pathway), ending with the regulation of a downstream cellular process, such as transcription [179, 180]. This is evidence of the binding of L. monocytogenes receptors to the host cell surface proteins. In particular, L. monocytogenes has been documented to synthesize a serine/threonine kinase, named Stp, which has been found to be membrane associated and involved in this bacterium's virulence [154, 181, 182]. In addition, the response to calcium signalling has also been reported to be implicated in the cytoskeletal rearrangements required for cell binding or internalisation of pathogenic microorganisms [183]. At this early acquisition point the bacteria are being internalized into the host HeLa cells. After 60 minutes, the GO terms of the control versus LM1 and control versus LM2 analyses start to differ. For the first comparison, the terms imply that the cell is already responding to the bacterial invasion by reducing the frequency of its biological processes and promoting apoptosis. This process of cell suicide works as a natural mechanism to prevent the dissemination of infection to healthy neighbouring cells, functioning as a host defence against the pathogen invasion [139]. Conversely, for the second analysis, the cell continues its proliferation.
Both evaluations contain a term related to an anatomical structure (anatomical structure formation involved in morphogenesis), which is associated with the development of a biological entity that occupies space and is distinguished from its surroundings. Both sets of differentially expressed genes thus suggest that at this stage L. monocytogenes has already completed its internalization into the host cell cytosol, which is congruent with the L. monocytogenes life-cycle. At time-point 120, the apoptotic processes are intensified in cells infected with wild-type L. monocytogenes, whereas the cells infected with the mutant bacteria continue their proliferation. Finally, after 240 minutes, the most noteworthy terms for the control versus LM1 analysis are related to cell communication, which is congruent with what we were expecting, given that, after using the cell machinery, L. monocytogenes invades the neighbouring cells. This is supported by the presence at this time-point of a term that also appears at time-point 60 (anatomical structure formation involved in morphogenesis). In addition, L. monocytogenes uses host actin filaments to move from cell to cell, and a term related to this process also appears (positive regulation of actin filament bundle assembly). Moreover, these terms can also be related to the communication between cells due to the immunological response observed at the previous time-points. In fact, at this point, we may be picking up what happens in non-infected cells. For control versus LM2, the population of cells continues its proliferation.

From this evaluation it is possible to conclude that the main differences between the two analyses appear at minutes 60 and 120. In the first case, the cell population responds to the L. monocytogenes infection. In contrast, in the second case the cells did not respond to the presence of the pathogen and continued their proliferation processes across the four acquisition points. This may be explained by the hypothesis that in the latter case (LM2 samples) the bacteria could not become free in the host cell cytosol. Because the mutant bacteria lack the hly gene, which encodes a toxin that is important for the bacteria to escape the internalization vacuole, L. monocytogenes is incapable of disrupting its internalization phagosome and, therefore, of using the host cell machinery.

Table 4.3: Summary of the BP ontology terms and respective p-value associated with the set of differentially expressed genes from control versus LM1 and control versus LM2 analysis, for all the experimental time-points.

Time-point | P-value (control versus LM1) | Definition (control versus LM1) | P-value (control versus LM2) | Definition (control versus LM2)
-----------|------------------------------|---------------------------------|------------------------------|--------------------------------
20 | 1,82E-07 | transmembrane receptor protein serine/threonine kinase signaling pathway | 2,72E-08 | response to calcium ion
20 | 2,06E-07 | enzyme linked receptor protein signaling pathway | 1,42E-05 | transcription from RNA polymerase II promoter
20 | 3,18E-07 | response to calcium ion | 3,51E-05 | cell proliferation
20 | 4,14E-06 | transcription from RNA polymerase II promoter | 6,17E-05 | transmembrane receptor protein serine/threonine kinase signaling pathway
20 | 1,80E-05 | cellular response to endogenous stimulus | 0,000112056 | enzyme linked receptor protein signaling pathway
20 | 5,12E-05 | positive regulation of biosynthetic process | 0,000182139 | cellular response to endogenous stimulus
60 | 2,10E-13 | anatomical structure formation involved in morphogenesis | 1,26E-11 | anatomical structure formation involved in morphogenesis
60 | 1,05E-11 | programmed cell death | 3,67E-08 | response to endogenous stimulus
60 | 1,29E-10 | response to organic substance | 3,69E-07 | cell proliferation
60 | 1,51E-10 | negative regulation of cellular process | 1,96E-06 | regulation of endothelial cell proliferation
60 | 2,03E-09 | response to external stimulus | 3,71E-06 | endothelial cell proliferation
60 | 1,57E-08 | response to stress | 3,90E-06 | regulation of cell proliferation
120 | 2,11E-20 | negative regulation of cellular process | 2,67E-09 | transcription from RNA polymerase II promoter
120 | 1,15E-15 | apoptotic process | 2,15E-06 | cell proliferation
120 | 1,56E-15 | programmed cell death | 4,62E-06 | positive regulation of cell proliferation
120 | 2,89E-14 | cellular response to stimulus | 5,59E-06 | transmembrane receptor protein serine/threonine kinase signaling pathway
120 | 3,46E-14 | cell death | 7,03E-06 | positive regulation of cell cycle
120 | 4,64E-14 | regulation of cellular metabolic process | 1,25E-05 | positive regulation of gene expression
240 | 9,35E-14 | anatomical structure formation involved in morphogenesis | 8,31E-06 | regulation of cell proliferation
240 | 2,18E-11 | regulation of intracellular protein kinase cascade | 1,44E-05 | positive regulation of protein phosphorylation
240 | 4,00E-11 | regulation of signal transduction | 2,93E-05 | positive regulation of cell proliferation
240 | 5,05E-11 | regulation of cell communication | 5,86E-05 | positive regulation of protein modification process
240 | 6,37E-11 | regulation of protein modification process | 5,96E-05 | positive regulation of signal transduction
240 | 6,92E-11 | regulation of protein phosphorylation | 6,40E-05 | cell proliferation
240 | 0,00124917 | positive regulation of actin filament bundle assembly | 0,000266735 | regulation of developmental process

4.2.5.B Cellular Components

The most significant CC ontology terms are specified in table 4.3. This ontology concerns the parts of a cell or its extracellular environment. Focusing on the control versus LM1 analysis, for the first time-point the terms are related to organelle lumen and transcription. Given that the BP terms found for this time-point evidenced the binding of an extracellular ligand to the cell, the terms found are congruent. Moreover, in the previous ontology we found that at this point the cell is still functioning normally, which explains the terms related to transcription. After 60 minutes, in the BP ontology, the cell is initiating a response to the pathogenic invasion. For the CC ontology, at this point, the terms are associated with extracellular space, nucleus, transcription and interleukin-6 receptor complex, which has been documented as playing an important role in the immune response against L. monocytogenes [184]. At the end of 120 minutes, the relevant terms are still associated with extracellular space, nucleus and interleukin-6 receptor complex. In addition, the Bcl-2 family protein and I-kappaB/NF-kappaB complexes also appear as statistically significant. The Bcl-2 family protein complex regulates the apoptotic process, responding to cues from various forms of intracellular stress and interacting with opposing family members to determine whether or not the apoptotic cascade should be unleashed [185]. The I-kappaB/NF-kappaB complex has also been reported as being active upon L. monocytogenes infection [186–188]. This complex is extremely important in the regulation of the innate immunological response to infection [189]. The term extracellular space, which appears at both the 60 and 120 time-points, suggests that the cell is signalling through secretion. This phenomenon is congruent with the cell-cell communication in response to innate immune activation described by Hornef et al. [190].

Lastly, after 240 minutes, the terms with the lowest p-values are related to cell surface, extracellular space, focal adhesion and contractile fiber. These terms can be correlated with both the BP terms and the L. monocytogenes life-cycle, given that at this point the bacteria are expected to be infecting neighbouring cells. Accordingly, the focal adhesion term evidences the existence of mechanical linkages between the cells and the extracellular matrix, and the contractile fiber term can be related to the bacteria's use of actin tails to move along the cytosol.

Concerning the control versus LM2 analysis, for acquisition time-point 20 the most important terms are nucleoplasm, transcription factor complex, cis-Golgi network and intracellular organelle lumen. These terms are similar to the ones found at this time-point for the control versus LM1 analysis, which indicates that a similar response is occurring in both samples. The differences appear at the next time-points. In fact, for the control versus LM2 analysis, the terms are similar across the acquisition points, being related to the synthesis and processing of proteins. The exception occurs at time-point 240, at which the terms are identical to the ones found for the LM1 samples and suggest communication between cells. In particular, a laminin-10/11 mixture is documented as a strong adhesive for human epithelial cells [191]. Moreover, the contractile fiber and myofibril terms indicate the promotion of actin mobilization, which is associated with the last step of the L. monocytogenes infection process. This suggests that even though L. monocytogenes could not exit the phagosome, it was capable of promoting actin polymerization to perform its extrusion. Another possibility is that the cell itself is performing the extrusion of the foreign organism by phagocytic pathways. In fact, actin has been described as a key factor at several stages along this pathway, by which pathogenic agents can be eliminated [192].

Table 4.3: Summary of the CC ontology terms and respective p-value associated with the set of differentially expressed genes from control versus LM1 and control versus LM2 analysis, for all the experimental time-points.

Time-point | P-value (control versus LM1) | Definition (control versus LM1) | P-value (control versus LM2) | Definition (control versus LM2)
-----------|------------------------------|---------------------------------|------------------------------|--------------------------------
20 | 0,00297523 | organelle lumen | 0,002200524 | nucleoplasm
20 | 0,00325545 | membrane-enclosed lumen | 0,002600755 | transcription factor complex
20 | 0,00663199 | nuclear lumen | 0,026306428 | cis-Golgi network
20 | 0,01371974 | transcription factor complex | 0,032416673 | intracellular organelle lumen
60 | 0,00031877 | nucleoplasm | 0,00420261 | transcription factor complex
60 | 0,00086888 | transcription factor complex | 0,005504891 | nucleoplasm
60 | 0,00267336 | organelle | 0,005608743 | intracellular organelle
60 | 0,0081105 | interleukin-6 receptor complex | 0,007509073 | membrane-enclosed lumen
60 | 0,01124903 | intracellular membrane-bounded organelle | 0,013310909 | extracellular matrix
120 | 2,56E-06 | extracellular space | 0,000190336 | transcription factor complex
120 | 0,00036116 | nucleus | 0,003062138 | nucleoplasm
120 | 0,00243576 | transcription factor complex | 0,008643145 | fibrinogen complex
120 | 0,01474619 | interleukin-6 receptor complex | 0,009573543 | nucleoplasm part
120 | 0,0196137 | Bcl-2 family protein complex | 0,027832962 | cis-Golgi network
120 | 0,02445744 | I-kappaB/NF-kappaB complex | 0,033512174 | nuclear chromosome
240 | 1,94E-05 | extracellular space | 1,69E-05 | extracellular region
240 | 0,00460143 | cell surface | 5,27E-05 | extracellular space
240 | 0,01198667 | fibrinogen complex | 0,000214572 | extracellular region part
240 | 0,01416727 | focal adhesion | 0,002708395 | laminin-10 complex
240 | 0,01529717 | cell-substrate adherens junction | 0,002708395 | laminin-11 complex
240 | 0,01766962 | contractile fiber part | 0,006523323 | contractile fiber part
240 | 0,01967318 | myofibril | 0,007285939 | myofibril
240 | 0,02447605 | extracellular matrix | 0,023246443 | cis-Golgi network

4.2.5.C Molecular Functions

The most significant terms related to MF are presented in table 4.4. These terms describe the elemental activities of a gene product at the molecular level. For the first time-point, the terms are expected to be associated with protein synthesis and with binding factors that are involved in the internalization of L. monocytogenes. Focusing on the control versus LM1 analysis, for time-point 20, terms like integrin binding, fibronectin binding, heparin binding, glycosaminoglycan binding or extracellular matrix binding indeed evidence communication with the cell's exterior. All these proteins are able to mediate cellular interactions. For instance, fibronectin is an extracellular glycoprotein that can bind to both cell surface integrins and heparins, playing a key role in cell communication, adhesion, migration, growth and differentiation [193]. L. monocytogenes has been reported to bind to human fibronectin in the invasion process [194], which supports these terms as statistically significant. Furthermore, protein tyrosine/threonine phosphatase activity [195–197] and insulin-like growth factor [198, 199] have also been described as related to the internalization of L. monocytogenes into the host cell. On the other hand, the sequence-specific DNA binding transcription factor activity term evidences the regulation of transcription activity.

Regarding time-points 60 and 120, the terms at both these points are similar. First, for time-point 60, the analysis of the previous gene ontologies associates it with the start of the cell's response to the pathogen invasion by promoting apoptosis. The first term (sequence-specific DNA binding transcription factor activity) indicates the regulation of the transcription process. The cell's immunological response occurs upon L. monocytogenes vacuole disruption, a process mediated by the LLO toxin, which induces signalling in the host cell by activating several pathways, namely the MAP kinase pathway [145, 200–202]. The last three terms evidence precisely the start of the apoptotic process: ubiquitin is a small protein that can mark proteins for degradation [203]; a transcription corepressor inhibits the transcription process; and cytokines are a family of proteins that modulate the immunological response [204]. For time-point 120, the terms are likewise related to the shut-down of the cell.

Finally, after 240 minutes, the terms are expected to be related to both actin polymerization and communication between cells, due to L. monocytogenes' migration by phagocytosis, escaping the host humoral immune system. The terms found are, in fact, related to the binding between cells, with most terms similar to the ones described for time-point 20.

Regarding the control versus LM2 analysis, for time-points 20 and 240 the terms are very similar to the ones found for the control versus LM1 analysis. As previously mentioned, the main differences appear at acquisition times 60 and 120. For this analysis, the terms imply continuous cellular proliferation as well as constant communication between neighbouring cells. Across all time-points, no immunology-related term was found for this analysis. This is congruent with what was already stated for the previous ontologies: in these samples the cells did not react to the presence of the pathogen. This indicates that without the hly gene, which encodes the virulence factor LLO, L. monocytogenes is no longer pathogenic [156].

Table 4.4: Summary of the MF ontology terms and respective p-value associated with the set of differentially expressed genes from control versus LM1 and control versus LM2 analysis, for all the experimental time-points.

4.3 Discussion

In this section, I discuss the methodology and the results described in this chapter. Firstly, I focus on the developed pipeline: I introduce tools that are analogous to the ones used, justify each tool choice and explain the main reasons that led to the development of this methodology when pre-built pipelines were already available. Secondly, I examine the fundamental limitations of the developed pipeline. Next, I discuss the constraints associated with RNA-Seq experimental data. Finally, I discuss the results from the analysis of the L. monocytogenes RNA-Seq dataset with the developed pipeline: I comment on their congruency with the documented bacterial life-cycle, compare them with electron micrographs obtained at each important acquisition time-point and, ultimately, review the main conclusions drawn about the influence that the deletion of the L. monocytogenes hly gene has on this bacterium's virulence.

4.3.1 State-of-the-art: RNA-Seq processing tools

As indicated in section 2.4, a large number of bioinformatic tools are designed to handle RNA-Seq data. With this in mind, the tools integrated in the developed pipeline are just one alternative for performing a given task. In fact, for each of the pipeline steps there is analogous software available, which may or may not use different methodologies to achieve the same objective (table 4.5).

Concerning the sequencing quality assessment, the methodology used by alternative applications is very similar to the one used by FastQC [167]. However, the FastQC output has significant added value for its clarity and simplicity, combined with the display of all important information. This allows the user to easily draw conclusions about the sequencing quality of the data.

With respect to the alignment tool, Bowtie 2 is a state-of-the-art mapper that is specialized in aligning short reads (from 50-100bp) to long reference genomes [93]. With respect to Bowtie 2 alignment quality, as stated in Friedel et al. [205], "Bowtie 2 alignment algorithm shows a remarkable tolerance to sequencing errors (...) making hash-based aligners obsolete" (like MAQ or SOAP2). When the Bowtie 2 algorithm is compared with BWA, BWA-SW or the old Bowtie, results confirm that Bowtie 2 has a higher ratio of correct to incorrect alignments [93].

Regarding gene expression profiling, HTSeq-count [169] is a simple script that takes advantage of the tools contained in the Python package HTSeq [103]. The output of this tool is ideal for the next pipeline step.

For the differential expression test, the intended method should, firstly, take a conservative approach to samples that contain a high number of outliers and, secondly, be capable of performing the comparative analysis without replicates in the dataset. Bearing in mind the considerations stated in Soneson et al. [121], DESeq, edgeR and NBPSeq are the best methods to process datasets with a low number of replicate samples, as in the L. monocytogenes case study. Soneson et al. concluded that these methods give similar results, with DESeq showing the most conservative behaviour. A similar study was performed by Kvam et al. [120], in which DESeq is acknowledged as the only method tested that does not require replicates to perform DE analysis. All these reasons support DESeq as the package best suited to the proposed pipeline.

Lastly, to determine which pathways are overrepresented in the set of DE genes, the aim was to perform an ORA, following the classification stated in Khatri et al. [127]. As mentioned in Emmert-Streib et al. [122], GOStats [131] is the principal method for performing ORA and, hence, it was chosen for integration into the developed pipeline.

Table 4.5: Alternative tools for each pipeline step.

Pipeline step | Used tool | Alternative tools
--------------|-----------|------------------
Quality control | FastQC [167] | RSeqQC [206], htSeqTools [207], HTQC [83], QC-Chain [81]
Alignment | Bowtie2 [93] | BWA [208], SOAP2 [96], MAQ [95], Bowtie [98]
Measure of gene expression level | HTSeq-count [169] | bedtools multicov [209]
Gene expression profiling | DESeq [108] | edgeR [116], TSPM [210], baySeq [117], NOISeq [211]
Gene ontology association | GOStats [131] | Onto-Express [212], GenMAPP [213], GoMiner [214], GO:TermFinder [215]

Moreover, pre-built pipelines are also available, which take the raw RNA-Seq reads and process them to extract biological conclusions (table 4.6). Processing RNA-Seq data using these pipelines can be advantageous to, for instance, biologists with little computational knowledge. When reviewing the tools provided by Galaxy, Taylor et al. [216] state precisely this as "the most important feature of Galaxy's analysis workspace". This tool in particular is not a pipeline by itself but an engine on which the user can easily build pipelines.

However, this characteristic can also be a disadvantage. For users with computational knowledge, analysing the data with such tools restricts the way in which the processing can be performed and, moreover, is frequently harder and less effective. Pre-built pipelines may not contain the tool that best suits the data under analysis and are less versatile. Building a pipeline from scratch means that it is designed for the dataset under analysis and, furthermore, even if the dataset changes, it is easy to adapt the same pipeline to perform its analysis.

4.3.2 Limitations of the developed pipeline

Although the developed pipeline is well suited to the case study dataset, its application has some limitations:

• Bowtie 2 does not perform spliced alignment, although it permits gapped alignments. Given the nature of this study, in which distinguishing multiple isoforms of the same gene was not considered relevant, using a mapper that performs spliced alignment would not improve our results;

Table 4.6: Pre-built tools that allow users without programming or informatics expertise to process RNA-Seq data.

Tool | Goal | Reference
-----|------|----------
ArrayExpressHTS | QC analysis, alignment to a reference and expression estimation. | [217]
Galaxy | Contains a workflow editor that provides a graphical user interface for creating and modifying workflows. | [216, 218–220]
DDBJ | Alignment to a reference, de novo assembly and subsequent high-level analysis of structural and functional annotations. | [221]
Grape | QC analysis, alignment to a reference, expression estimation, calculation of exon inclusion levels and identification of novel transcripts. | [222]
RseqFlow | QC analysis, alignment to a reference, expression estimation and identification of differentially expressed genes. | [223]
R-SAP | Alignment to a reference, identification of RNA isoforms and chimeric transcripts. | [224]
Myrna | Alignment to a reference, expression estimation and identification of differentially expressed genes. | [112]
Rnnotator | QC analysis, de novo assembly and generation of transcript models (without the need of a reference genome). | [225]
GenePattern | Short-read mapping, identification of splice junctions, transcript and isoform detection, quantification, differential expression, quality control metrics and visualization. | [226]

• If a read maps to multiple places on the genome with equal score, Bowtie 2 randomly chooses the location where that read is aligned. Such multi-mapping occurs because RNA-Seq reads are generally shorter than the transcripts from which they are derived and, therefore, a single read may map to multiple genes;

• The Bowtie 2 mapping process does not ensure that the reported alignment is the best possible in terms of alignment score. Eventually, the algorithm stops searching, either because it exceeded a limit placed on search effort or because it already knows all it needs to know to report an alignment;

• The pipeline takes a conservative approach to paired-end reads for which only one mate could be aligned to the genome, discarding those alignments;

• The union mode pre-set in the HTSeq-count tool discards reads that map to overlapping genes, counting those reads as ambiguous. Another approach would be to randomly choose the gene to count, as Bowtie 2 does in the multi-mapping case explained above. However, the union mode was chosen to avoid false positives;

• DESeq was reported by Soneson et al. [121] to have poor FDR control with 2 samples. However, they also found that, among the methods compared, DESeq had the most conservative behaviour and a lower type I error rate;

• The expected read count for a transcript is proportional to the gene's expression level multiplied by its transcript length [13, 227]. Therefore, differences in transcript length will yield differing numbers of total reads. One consequence of this is that longer or more highly expressed transcripts have more statistical power for the detection of DE between samples [104]. The fact that statistical power increases with the number of reads is an unavoidable property of count data, which cannot be completely removed by normalization or re-scaling. Thus, when comparing two RNA-Seq samples, long or highly expressed transcripts are more likely to be detected as differentially expressed than their short and/or lowly expressed counterparts [228];

• The pipeline is pre-defined to receive an RNA-Seq dataset without replicates. DE analysis without replicates greatly limits the conclusions that can be drawn about the discrepancy in gene expression among samples;

• The developed pipeline is only suitable to process data from Illumina’s sequencing protocol.
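The union counting rule mentioned in the list above can be illustrated with a minimal sketch. It assumes genes are simple half-open intervals on a single chromosome and ignores strand, exon structure and paired-end handling; the function name, the gene coordinates and the returned labels are hypothetical and only mimic the behaviour described for HTSeq-count.

```python
def assign_read_union(read_start, read_end, genes):
    """Assign a read to a gene following a union-style overlap rule.

    `genes` maps a gene name to a half-open interval (start, end).
    A read is counted only if it overlaps exactly one gene.
    """
    hits = {
        name
        for name, (g_start, g_end) in genes.items()
        if read_start < g_end and g_start < read_end
    }
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"  # overlapping genes: the read is not counted
    return next(iter(hits))


# Hypothetical annotation with two overlapping genes
genes = {"geneA": (100, 200), "geneB": (180, 300)}
print(assign_read_union(110, 160, genes))  # overlaps geneA only
print(assign_read_union(170, 220, genes))  # overlaps both genes
print(assign_read_union(400, 450, genes))  # overlaps no gene
```

Discarding the ambiguous read loses some signal for both genes, but it avoids crediting a gene with reads that may have originated from its neighbour.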

4.3.3 Limitation of the experimental data

Although RNA-Seq technology is an increasingly attractive method for whole-genome expression studies in many biological systems, it presents several limitations:

• RNA-seq strategies often involve a poly-(A) mRNA-enrichment step. This strategy enriches for RNA polymerase II transcripts, removing highly abundant RNAs such as rRNA. Nevertheless, it also discards mRNAs lacking poly-(A) tails or precursor transcripts processed into fragments that have lost their poly-(A) tails [229]. Furthermore, polyadenylation of transcripts also takes place during transcript degradation steps, and thus poly-(A) enrichment may also enrich for RNA degradation products of RNA polymerase I transcripts [230, 231];

• During the construction of the cDNA library, the nascent cDNA that is being synthesized can dissociate from the template RNA and re-anneal to a different stretch of RNA with a sequence similar to the initial template, in a process called template switching [232–234]. This phenomenon may cause problems in the identification of exon-intron boundaries and true chimeric transcripts [231];

• RNA-seq signal across transcripts tends to show non-uniformity of coverage, which may be a result of biases introduced during various steps, such as priming with random hexamers [175, 235], cDNA synthesis, ligation [236], amplification [237] and sequencing [237–239]. For instance, PCR amplification can result in a low sequencing coverage for transcripts or regions within a transcript that have a high GC content. This can, in turn, cause gaps in the assembled transcripts and can cause other transcripts to be missing from the assembly altogether [237]. Amplification-free protocols have been developed to overcome this problem [240];

• Multiple fragmentation and RNA or cDNA size-selection steps in the RNA-Seq protocol can result in transcript-length bias [104]. This bias may result in complications for downstream analyses [228];

• Quantification of transcripts with RNA-seq requires consideration of read mapping uncertainty (owing to sequencing error rates, repetitive elements, incomplete genome sequence and inaccuracies in transcript annotations) [241];

• Particularly, for the L. monocytogenes case study, the data is unreplicated. The fundamental problem with generalizing results gathered from unreplicated data is the complete lack of knowledge about biological variation. As Fisher [242] noted, without an estimate of variability there is no basis for inference (between treatment groups). Although we can test for DE between treatment groups from unreplicated data, the results of the analysis only apply to the specific subjects included in the study (i.e. the results cannot be generalized) [243];

• Still regarding the L. monocytogenes case study, the RNA-Seq data is extracted from a pool of infected and non-infected cells. The L. monocytogenes infection was performed on a population of cells: some cells get infected and some do not. The transcriptional responses of these two types of cells may vary considerably. For instance, L. monocytogenes may suppress the transcription of some inflammatory genes only in the infected cells, while signals are sent to neighbouring uninfected cells to express these genes. The result is that, at the level of the population (which is what RNA-Seq measures), the gene may look highly up-regulated. This phenomenon is called the bystander effect [244].

4.3.4 Case study: congruency between expected and obtained results

Subsection 4.2.5 describes the main cellular processes that differ between the non-infected and the infected cells. From this analysis it is possible to gain insight into the cellular processes occurring at the four acquisition time-points, relate them to the known bacterial life-cycle and, ultimately, draw conclusions about the bacterium's virulence when it is not able to synthesise LLO. A very interesting conclusion is that, for cells infected with mutant L. monocytogenes, the immunological response was not activated. In fact, across all acquisition time-points the active processes were related to transcription and protein synthesis, which indicates the normal functioning of this cell population under such a condition. Although there is no evidence of activation of immunological pathways, processes related to cell-bacteria interaction during internalization, as well as the promotion of actin polymerization related to the bacterium's escape, were registered. Therefore, for mutant infected samples (LM2 condition) only the first and last steps of the L. monocytogenes life-cycle were verified. In contrast, among the samples infected with wild-type L. monocytogenes (LM1 condition) the data showed a strong immunological response from the host cell. Furthermore, the processes related to each acquisition time-point can easily be correlated with the known life-cycle of the bacterium. This supports the good functioning of the developed pipeline, which outputs reasonable results that are congruent with what we were expecting: for wild-type infected cells, the over-expressed processes were similar to the ones already documented, and for mutant infected cells no immunological response was verified. Hence, this analysis indicates that without the hly gene the cell does not activate its immunological pathways. From this data it was possible to formulate the following hypothesis: when LLO, a protein encoded by the hly gene, is not produced, L. monocytogenes loses its virulence. In particular, without LLO the bacterium is not able to disrupt the internalization vacuole and use the cell machinery to replicate. Nevertheless, actin polymerization is still verified. These conclusions are congruent with what Portnoy et al. [156] describe. Moreover, because internalization takes place in a membrane-bound phagosome, by inducing local cytoskeletal rearrangements in the host cell, without disruption of this vacuole the cell cannot detect any foreign body. In summary, the results obtained from the analysis of LM1 samples show the congruency between the published bacterial life-cycle and the processes that were active, and this supports the results obtained for LM2 samples.

Besides the bioinformatics evaluation described here, micrographs were acquired to monitor the L. monocytogenes infection along the acquired time-points. These micrographs are shown in table 4.7. When analysing these images, no conclusions can be drawn about differences in the process of infection between LM1 and LM2 samples. In fact, the infection process seems similar. Nevertheless, given the results obtained by the developed pipeline, this was shown not to be true. This shows that the conclusions extracted from traditional tools such as microscopy can be limited. From these images it is not possible to determine whether the bacterium was able to invade the host cell cytosol and use the cell machinery. However, it is possible to confirm internalization and actin polymerization promoted by the L. monocytogenes infectious process, phenomena that were also found in the analysis of the L. monocytogenes RNA-Seq data.

Table 4.7: Micrographs illustrating the evolution of HeLa cells upon infection with wild-type (LM1) and mutant (LM2) L. monocytogenes. The nucleus is coloured in blue, actin in red, L. monocytogenes in green and actin-coated L. monocytogenes in yellow. These micrographs were kindly provided by the Mhlanga laboratory at CSIR, in South Africa [245].

[Micrograph panels: rows LM1 and LM2; columns 30 → 60 minutes, 60 → 120 minutes, 120 → 180 minutes and 180 → 240 minutes.]

4.4 Future work

The analysis described in this chapter produced biologically plausible conclusions that are in accordance with the expected results. Nevertheless, the obtained results need to be supported. This can be done in two different ways: confirming the expression variation of the genes that are thought to have a greater influence on the immunological response to the L. monocytogenes invasion (those with greater fold-change), or improving the confidence in the pipeline results. These procedures are complementary and, therefore, ideally both should be performed in order to validate the results described in section 4.2.

4.4.1 Laboratory approach: new data acquisition

Regarding the laboratory approach, qRT-PCR (real-time quantitative reverse transcription PCR) [246] is a technique that has been widely used for microarray validation, providing a reliable way to detect and measure the products generated during each PCR cycle [247]. Similarly, qRT-PCR has also been used for RNA-Seq validation [248–250]. Hence, this technique can be used to confirm the results obtained from the developed pipeline. Performing qRT-PCR makes it possible to investigate the induction of a given set of genes in the samples infected with the wild-type and the absence of that response in the mutant infected samples and, therefore, to confirm the existence of an immunological response in the first case and its absence in the second. To perform this study, it is necessary to identify the genes with the lowest variation, called housekeeping genes, and the target genes whose expression variation is to be confirmed. To find the genes whose expression is maintained along the four acquisition time-points for both samples, the coefficient of variation (CV) was calculated for each gene according to equation 4.1, where the array x contains the normalized counts of a given gene at all N acquisition time-points, xi being its i-th entry. Table 4.8 contains the genes with the lowest CV, which are hence suitable for use as housekeeping genes in the qRT-PCR analysis.

$$ cv = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}}{\mu}, \quad \text{where } \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \tag{4.1} $$

Table 4.8: First ten genes with lowest coefficient of variation.

Ensembl gene id    Gene name    Coefficient of variation
ENSG00000134684    YARS         0.0333446472
ENSG00000180398    MCFD2        0.0350105611
ENSG00000141934    PPAP2C       0.0386257753
ENSG00000170275    CRTAP        0.0391815915
ENSG00000215492    HNRNPA1P7    0.0396114942
ENSG00000122566    HNRNPA2B1    0.0400763555
ENSG00000198804    MT-CO1       0.0413109887
ENSG00000254449    SF3A3P2      0.0433278457
ENSG00000260032    LINC00657    0.0435468832

To find the genes whose expression is to be validated, a DE analysis could be performed between the wild-type infected samples and the mutant infected ones. The genes found to be differentially expressed should be analysed by qRT-PCR to confirm that expression discrepancy. Appendix A, section A.2, describes the statistically significant differentially expressed genes between both conditions. These are the candidate genes for further analysis by qRT-PCR. Note that the genes present in these tables are highly similar to the ones defined by Cohen et al. [251] as having their expression modulated by L. monocytogenes infection.

Another approach could be to redo part or all of the L. monocytogenes experiment and sequence it once again. A good experimental design is defined by a set of experiments that maximizes the information collected to answer a main question and, simultaneously, minimizes experimental cost. Therefore, it is important to understand for which sample of the L. monocytogenes dataset it would be most reasonable to obtain replicates. To suggest that, it is necessary to know how the samples are related to each other. This study is described in appendix A, section A.1. Nonetheless, in addition to being expensive, the acquisition of new samples can solve the validation problem but is not infallible. This new data can provide evidence that the biological variations were real and not just an artefact. However, due to its strong dependency on the biological material, such as the cell populations, as well as on the sequencing protocol, a very different sample can result from this re-sequencing process. Therefore, validating the RNA-Seq results in the laboratory is much more efficient, in both time and money, by qRT-PCR than by re-sequencing the cell population. However, if re-sequencing were performed, bearing in mind the results shown in section A.1, I would advise the replication of the control sample, since this sample is extremely important in the DE analysis. This solution would be the best one if both technical and biological replicates were acquired together with the case study RNA-Seq data.

Finally, to go further in the study of how the depletion of the hly gene influences the L. monocytogenes invasion process and, consequently, the host cell immunological response, chromatin immunoprecipitation (ChIP) based methods can be employed to study protein-DNA interactions. ChIA-PET [252, 253], for instance, is a reasonable choice to perform this analysis.
This technique gives insight about functional chromatin interactions between distal and proximal regulatory transcription-factor binding sites and the promoters of the genes they interact with [253, 254].
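As an illustration of the housekeeping-gene selection described earlier in this subsection, the following Python sketch computes the coefficient of variation of equation 4.1 for each gene and ranks the genes by it. The gene labels and count values are invented for the example and the function name is hypothetical; the population standard deviation (division by N) is assumed, as written in equation 4.1.

```python
import numpy as np

def coefficient_of_variation(counts):
    """Row-wise CV: population standard deviation divided by the mean.

    Rows are genes, columns are acquisition time-points. Genes with a
    zero mean must be filtered out beforehand to avoid division by zero.
    """
    counts = np.asarray(counts, dtype=float)
    mu = counts.mean(axis=1)
    sigma = counts.std(axis=1)  # ddof=0, i.e. division by N
    return sigma / mu

# Hypothetical normalized counts (not taken from the thesis dataset).
genes = ["geneA", "geneB", "geneC"]
counts = [
    [100.0, 102.0, 98.0, 100.0],  # nearly constant expression
    [10.0, 40.0, 5.0, 80.0],      # strongly varying expression
    [50.0, 55.0, 45.0, 50.0],
]

cv = coefficient_of_variation(counts)
# Genes sorted from lowest to highest CV; the first ones are the
# housekeeping candidates.
ranking = [genes[i] for i in np.argsort(cv)]
```

In practice the ranking would be computed over all expressed genes and only the top entries retained, as in table 4.8.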

4.4.2 Computational approach: improvement of the pipeline

To strengthen the reliability of the pipeline described above, a new dataset with more robust information should be processed. This new case study should have replicates for each collected condition. Its analysis should be performed in two ways: one including the reads of all replicates and another considering only one replicate from each condition. From this analysis, we will be able to compare the results with and without replicates and, subsequently, confirm or refute the similarity between the cellular processes found to be differentially active in both analyses. This conclusion will support (or not) the reliability of the results from the developed pipeline. Moreover, because the latter analysis is analogous to the one described in section 4.2 for the L. monocytogenes dataset, it will help to understand whether the biological conclusions described in that section are valid.

In addition, the analogous tools described in the previous section should be tested as an integral part of the developed pipeline. The approach can be to replace one or more tools along the pipeline. Moreover, the pre-built pipelines referred to there can also be used to process the L. monocytogenes dataset. The result of each substitution should be compared with the one obtained by the proposed pipeline. With this approach it is possible to show whether the results using different bioinformatic tools (with comparable goals) are similar to the ones obtained with the proposed pipeline. If the results are in fact similar, this will support the obtained results and the accuracy of the developed pipeline.

Finally, it would be interesting to create an HTML interface where the user would be able to manipulate the developed RNA-Seq analysis pipeline and, furthermore, choose between the analogous available bioinformatic tools without the need to change the commands that call the software.

5 Gene networks to prove the existence of a given biological response in RNA-Sequencing data

Contents

5.1 Methods
5.2 Listeria monocytogenes case study
5.3 Discussion
5.4 Interface
5.5 Future work

Understanding the complex network of interactions between genes implicated in a given biological response is extremely important to contextualize and validate the results from the pipeline described in chapter 4. In fact, while sequencing techniques increase the level of biological detail, integrative data analysis is critical to contextualize those details. Statistical methods such as the ones provided by the DESeq and GOStats packages are employed in the developed pipeline to extract from the RNA-Seq samples the cellular processes that are differentially active. However, these methods are not able to aggregate the data along a time-course; they only perform an individual analysis for each acquired time-point. In other words, they are not capable of integrating the data in a systems biology view. Systems biology does not investigate individual genes but rather focuses on the behaviour and relationships of all the elements in a particular biological system while it is functioning, along a time course [255]. To address this constraint, I developed an algorithm that is capable of inspecting the activation of a given biological pathway in an RNA-Seq dataset. In this chapter I describe the methodology and test its performance on the L. monocytogenes RNA-Seq dataset. Following this analysis, I discuss the importance and limitations of this novel tool. Finally, I introduce an HTML interface that I developed to facilitate the use of this new methodology.

5.1 Methods

The methodology introduced in this chapter is intended to investigate the activation of a given cellular pathway in an RNA-Seq dataset. To do so, the information about the cellular pathway is loaded from known gene interactions, which are usually compiled in gene regulatory networks. Gene regulatory networks are graphical diagrams used to visualize the regulatory relationships between genes upon a certain stimulus or simply along the cell life-cycle. These networks are composed of nodes, the genes and their regulators, joined together by edges, which represent physical and/or regulatory interactions [256]. The control of these interactions between genes is essential for the maintenance of the cell's function, fitness, and survival. Bearing this in mind, the methodology described in this section can be summarized, very roughly, as searching the RNA-Seq dataset under analysis for the gene interactions described in a given gene network. Having collected an RNA-Seq dataset along a time-course, the first approach is usually to map it against a reference and measure the gene expression level. This is the starting point of the methodology proposed here. The next steps are performed as follows:

1. Search for a published gene regulatory network describing the gene interactions in the cellular response to be studied (the apoptotic process or the immunological response, for instance). Alternatively, a relevant gene network can be constructed by performing a literature review on the response to be studied;

2. Inspect how the expression level of the genes in the network nodes varies in the RNA-Seq dataset under analysis;

3. Calculate the statistical significance of the connections described by that gene network in the RNA-Seq dataset.

Figure 5.1: Methodology used to infer if a gene regulatory network is present on a RNA-Seq dataset. First two pictures extracted from Karlebach and Shamir, 2008 [257].

To perform this inspection, the developed methodology assumes that the networks of interest represent the relations that best model the gene interactions in the RNA-Seq dataset. Therefore, the number of genes, as well as the interactions between them, are taken as described in the literature and, thus, are known a priori. As input, the developed algorithm receives a file containing the gene expression level information (a file similar to the output of the HTSeq-count tool referred to in Section 4.1), the experimental metadata (such as the library type and the time-points at which the reads were acquired) and, finally, the network of interest, which describes how a certain set of genes is related in the biological process under study. This set of genes corresponds to the genes in the network nodes and will be referred to in this document as node genes (NG). The developed algorithm begins by extracting the NG expression level information from the input count table. This data is then normalized in two steps: the first normalization is performed using DESeq's approach, which is considered a robust method [258]; then, the normalized expression level of the control sample is subtracted from this gene expression level and the result is transformed to a logarithmic scale (equation 5.1). After this pre-processing, the kinetics of these genes is determined taking into account the expression level information. To do so, and following the model proposed in several surveys, the gene regulatory network is modelled by a system of linear differential equations, whose general mathematical form is described by equation 5.2 [1, 259, 260].

$$ x_i(t) = \log_2\left(1 + y_i(t) - c_i\right) \tag{5.1} $$

where

$x_i(t)$ : Log-ratio of the gene i expression level at time t;

$y_i(t)$ : Normalized expression level of gene i at time t in a given sample;

$c_i$ : Normalized expression level of gene i in the Control sample.
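The first normalization step mentioned above relies on DESeq. As a rough illustration only, the sketch below implements the median-of-ratios size-factor idea that DESeq is generally described as using. It is a NumPy re-implementation written for this example, not a call to the DESeq package; the function names and the count matrix are hypothetical. The subsequent log-ratio transformation of equation 5.1 is not included.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors; rows are genes, columns are samples."""
    counts = np.asarray(counts, dtype=float)
    # Only genes with non-zero counts in every sample have a usable
    # geometric mean across samples.
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Per sample: median over genes of count / geometric mean.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

def normalize(counts):
    """Divide each sample (column) by its size factor."""
    counts = np.asarray(counts, dtype=float)
    return counts / size_factors(counts)

# Hypothetical counts: the second sample was sequenced twice as deeply.
raw = [[10, 20], [100, 200], [5, 10], [0, 3]]
factors = size_factors(raw)
norm = normalize(raw)
```

After this scaling, genes whose counts differ between samples only because of sequencing depth end up with equal normalized values, which is the property the later log-ratio step depends on.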

$$ \frac{dx_i(t)}{dt} = \sum_{j=1}^{C} w_{i,j}\, x_j(t) + b_i\, u(t) \tag{5.2} $$

where

$x_i(t)$ : Log-ratio of the gene i expression level at time t;

$w_{i,j}$ : Entry of the gene-gene interaction matrix in row i and column j;

$b_i$ : Entry of the external stimulus response vector for gene i;

$u(t)$ : Heaviside step function: $u(t < 0) = 0$ and $u(t \geq 0) = 1$.

The entry $b_i$ of the column vector b numerically translates the influence that an external stimulus, such as infection, has on gene i. The Heaviside step function represents the influence of the external stimulus as constant over time. Each entry $w_{i,j}$ of the gene-gene interaction matrix W describes the influence that gene j has on gene i. Knowing the network a priori, it is possible to determine which entries of the interaction matrix W are not null, as well as their expected sign. Therefore, the linear differential equations are solved for $w_{i,j}$ by the least squares method. For that, an overdetermined system of equations is constructed. This system takes into account the normalized expression values and the time derivatives for node gene i, both along the several acquisition time-points. The time derivative values are estimated from the experimental data points by linear interpolation, as done in Guthke et al. [1]. The system is assumed to be at equilibrium prior to stimulation, i.e. $dx_i(t < 0)/dt = x_i(t < 0) = 0$. Moreover, in order to solve the system, the expression level of a given gene i must not be null; otherwise the differential equation is impossible to solve. The approach chosen to cope with this situation was to exclude the non-expressed genes from the analysis. After solving this system, the linear differential equations that model the variation of the NG along time are completely defined. By numerically integrating those differential equations, the algorithm finds the kinetics associated with the NG along the acquisition time-points.

At this point, to test the statistical significance of the network of interest in the data under analysis, the algorithm performs a permutation test. To do so, the algorithm randomly chooses, from a trimmed gene universe, a set of n genes, where n is the number of genes in the nodes of the known network. This trimmed gene universe corresponds to the set of genes in the raw count table that do not have a null expression level at any of the acquisition time-points. Afterwards, the ratio between the mean square error calculated from the simulated and the experimental expression values and the variance of the experimental values along time (MSE/Var) is measured. This procedure is repeated t times (with t greater than 1000) and the values obtained are compared with the same statistic for the genes in the network nodes (NG). The p-value associated with the existence of the network's gene interactions in the RNA-Seq dataset is calculated by counting the number of random sets of genes that have a lower MSE/Var than the NG (equation 5.4).

$$ \frac{MSE}{Var}(set_x) = \sum_{i=1}^{n} \frac{MSE(\widehat{array}_x, array_x)}{Var(\widehat{array}_x)} \tag{5.3} $$

$$ p\text{-value} = \frac{\#\left[\frac{MSE}{Var}(set_i) < \frac{MSE}{Var}(set_{NG})\right]}{t + 1} \tag{5.4} $$

where

$set_x$ : Set of genes with length equal to the number of NG;

$\widehat{array}_x$ : Array containing the estimated gene expression of the genes in $set_x$ at the acquisition time-points;

$array_x$ : Array containing the measured gene expression of the genes in $set_x$ along the acquisition time-points;

$MSE(\widehat{array}_x, array_x)$ : Mean square error between the estimated and the measured gene expression of the genes in $set_x$;

$Var(\widehat{array}_x)$ : Variance of the estimated gene expression of the genes in $set_x$;

$n$ : Number of NG;

$set_i$ : Set of genes chosen randomly from the gene universe;

$set_{NG}$ : Set of NG;

$t$ : Number of re-samplings performed in the permutation test.

The MSE/Var statistic measures how far the calculated kinetics is from the experimental values. For clusters of genes with low variance, the absolute MSE will be low even if they are described by a bad model. Therefore, the MSE value is normalized by the variance, so that the error is expressed as a fraction of the variance. In this way, we are able to measure properly how well a set of genes is modelled by the gene network of interest, regardless of whether that set of genes has a high or a low variance. Thus, defining the statistic of the permutation test as MSE/Var prevents sets of genes with low variance from ranking ahead of sets of biologically significant genes.
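The effect of this normalization can be seen in a small numerical sketch. The function below sums, over the genes of a set, the ratio between the mean square error and the variance. The variance is taken here over the measured values, following the description in the text; this choice, like the function name and the toy arrays, is an assumption of the example.

```python
import numpy as np

def mse_over_var(simulated, measured):
    """Sum over genes of MSE(simulated, measured) / Var(measured).

    Both inputs have shape (n_genes, n_timepoints).
    """
    simulated = np.asarray(simulated, dtype=float)
    measured = np.asarray(measured, dtype=float)
    mse = np.mean((simulated - measured) ** 2, axis=1)
    var = np.var(measured, axis=1)  # assumption: variance of measured values
    return float(np.sum(mse / var))

# Two toy genes with the same absolute error (0.1 at every time-point)
# but very different variance along time.
measured = np.array([[0.0, 1.0, 2.0, 3.0],
                     [0.0, 0.1, 0.2, 0.3]])
simulated = measured + 0.1

per_gene = [mse_over_var(simulated[i:i + 1], measured[i:i + 1]) for i in range(2)]
statistic = mse_over_var(simulated, measured)
```

Both genes have the same MSE, yet the low-variance gene contributes a ratio a hundred times larger, so a poor fit to nearly flat genes is no longer hidden by a small absolute error.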

5.2 Listeria monocytogenes case study

Considering that regulatory networks controlling gene expression function as decision-making circuits within the cell [249], the following analysis is intended to understand which pathways of the cell circuits are activated when the cell is infected by L. monocytogenes and, consequently, to support the results obtained in section 4.2. Given the characteristics of the case study under analysis, the networks extracted from published data correspond to a simple model of the cell immunological response during Escherichia coli infection (figure 5.2) or, more specifically, during L. monocytogenes infection (figure 5.5). In this section I describe both networks in detail and use the developed methodology to test their activation in the L. monocytogenes RNA-Seq data.

5.2.1 Network 1

The first tested network was extracted from Guthke et al. and describes the immune response of human blood cells to bacterial infection, particularly to Escherichia coli infection [261]. It has six NG: IL1A, HLA-DMA, NFKBIE, CD59, STAT1 and STAT5A (figure 5.2).

Figure 5.2: Gene network extracted from Guthke et al., 2005 [1], describing the genetic relations upon bacterial infection by Escherichia coli. The arrows represent stimuli and the T-shaped links represent inhibition.

To test the existence of this gene network in the L. monocytogenes dataset, the developed methodology starts by extracting the gene expression level from the raw counts table. These counts are then normalized and converted to the logarithmic scale following equation 5.1. From this data, the algorithm models the variation of the NG along the acquisition time-points. The first step is to calculate dxi/dt from the measured gene expression level along the time-points for gene i. These values are calculated from the measured gene expression values by linear interpolation. In particular, the L. monocytogenes dataset contains five acquisition time-points and, therefore, we obtain four time derivatives (0 → 20; 20 → 60; 60 → 120 and 120 → 240). Simultaneously, the algorithm extracts the relations between genes from the gene network. Having all this data, the interaction matrix W is calculated. Finally, by numerical integration, the kinetics of the six NG are modelled along the five time-points. The models obtained for both L. monocytogenes conditions are represented by equations 5.5 and 5.6 and illustrated in figures 5.3(a) and 5.3(b) for wild-type infected (LM1) and mutant infected (LM2) samples, respectively. At this point, the permutation test is performed. It begins by selecting six genes randomly from the trimmed gene universe. The selected genes are then assumed to have the same relations as the NG. The time derivatives are used, as done for the NG, to solve for W and model the kinetics associated with the random set of genes. Finally, MSE/Var is measured for this set of genes. After performing this procedure 5000 times, the algorithm counts the number of times that a set of randomly chosen genes had a lower MSE/Var than the NG. From this analysis, the histograms illustrated in figures 5.4(a) and 5.4(b) are obtained.

To better understand the developed algorithm, the task and the result of each step it performs are demonstrated below for the LM1 sample:

Step 1. Extract directly from the count table the gene expression level for each NG:

           IL1A   HLA-DMA   NFKBIE   CD59   STAT1   STAT5A
Control       1         1       11   1726     169       18
LM1 20        0         3       12   1926     207       26
LM1 60        3         0       11   1778     167       40
LM1 120       6         1       31   1574     147       40
LM1 240       7         0       20   1897     142       52

Step 2. Normalize the matrix obtained in step 1:

           IL1A          HLA-DMA        NFKBIE        CD59          STAT1          STAT5A
Control     0             0              0             0             0              0
LM1 20     −0.9533947     0.99014468     0.13229453    0.17705593    0.30983054     0.5236767
LM1 60      1.0284119    −0.95339471     0.06435944    0.11321280    0.05295365     1.1755336
LM1 120     1.9147139     0.08235509     1.57014399    0.03238489   −0.03557783     1.2681978
LM1 240     2.0977635    −0.95339471     0.94957845    0.28921232   −0.09740817     1.6268553

Step 3. Obtain the time derivative for each NG by linear interpolation:

                     IL1A           HLA-DMA        NFKBIE         CD59           STAT1           STAT5A
Control → LM1 20    −0.047669736    0.049507234    0.006614726    0.008852797    0.0154915271    0.026183835
LM1 20 → LM1 60      0.049545165   −0.048588485   −0.001698377   −0.001596078   −0.0064219222    0.016296423
LM1 60 → LM1 120     0.014771700    0.017262497    0.025096409   −0.001347132   −0.0014755248    0.001544404
LM1 120 → LM1 240    0.001525414   −0.008631248   −0.005171379    0.002140229   −0.0005152528    0.002988813
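The derivatives of step 3 are the slopes of the straight lines joining consecutive measurements. A minimal Python sketch of this calculation for IL1A is given below, using the log-ratios of step 2 and taking the control as t = 0 minutes; the function and variable names are illustrative only.

```python
import numpy as np

def time_derivatives(times, values):
    """Slope of the straight line joining consecutive measurements."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    return np.diff(values) / np.diff(times)

# Acquisition time-points in minutes and the IL1A log-ratios of step 2.
t = [0, 20, 60, 120, 240]
il1a = [0.0, -0.9533947, 1.0284119, 1.9147139, 2.0977635]

d_il1a = time_derivatives(t, il1a)
```

The four values obtained correspond to the IL1A column of the step 3 matrix; the same call applied to the other columns of step 2 gives the remaining derivatives.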

Step 4. Attribute to each NG an index:

Gene       Index   Ensembl gene id
IL1A       x1      ENSG00000115008
HLA-DMA    x2      ENSG00000204257
NFKBIE     x3      ENSG00000146232
CD59       x4      ENSG00000085063
STAT1      x5      ENSG00000115415
STAT5A     x6      ENSG00000126561

Step 5. Use the provided gene network to extract connections between the NG:

Index of source gene   Connections
1                      1 → 1
2                      1 → 2, 2 → 2
3                      1 → 3, 3 → 3
4                      1 → 4, 4 → 4
5                      3 → 5, 5 → 5
6                      5 → 6, 6 → 6
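One possible way to hold these connections in memory is a boolean matrix marking which entries of W are allowed to be non-null. The sketch below assumes that "a → b" means that gene a acts on gene b and that rows index the regulated gene, which matches the layout of the interaction matrix W shown further on; the data structure itself is an illustration, not necessarily the one used in the developed algorithm.

```python
import numpy as np

genes = ["IL1A", "HLA-DMA", "NFKBIE", "CD59", "STAT1", "STAT5A"]

# (source, target) pairs with the 1-based indices of step 4, copied from
# the connections listed in step 5.
edges = [(1, 1), (1, 2), (2, 2), (1, 3), (3, 3), (1, 4), (4, 4),
         (3, 5), (5, 5), (5, 6), (6, 6)]

def connection_mask(edges, n_genes):
    """Boolean matrix: mask[i, j] is True when gene j+1 acts on gene i+1."""
    mask = np.zeros((n_genes, n_genes), dtype=bool)
    for source, target in edges:
        mask[target - 1, source - 1] = True
    return mask

mask = connection_mask(edges, len(genes))
# Regulators of NFKBIE (row index 2): the genes whose log-ratios enter
# its differential equation.
regulators_of_nfkbie = [genes[j] for j in np.flatnonzero(mask[2])]
```

For each node gene, the row of the mask selects the columns of expression data that enter the over-determined system built in step 6.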

For the sake of brevity, in steps 6 and 7 I will only demonstrate the calculations for gene NFKBIE, with index x3. Nevertheless, the methodology for the other five NG is similar.

Step 6. Extract the gene expression level of the genes that interact with NFKBIE and construct the over-determined system of equations:

$$
\begin{pmatrix} 0.006614726 \\ -0.001698377 \\ 0.025096409 \\ -0.005171379 \end{pmatrix}
=
\begin{pmatrix} 0 & 0 & 1 \\ 0.13229453 & -0.9533947 & 1 \\ 0.06435944 & 1.0284119 & 1 \\ 1.57014399 & 1.9147139 & 1 \end{pmatrix}
\begin{pmatrix} w_{13} \\ w_{33} \\ b_3 \end{pmatrix}
$$

Step 7. Solve the system by performing a QR decomposition of the matrix from step 6:

$$
\begin{pmatrix} w_{13} \\ w_{33} \\ b_3 \end{pmatrix}
=
\begin{pmatrix} -0.02545411 \\ 0.01244857 \\ 0.01126109 \end{pmatrix}
$$
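Steps 6 and 7 can be reproduced with a few lines of Python. The sketch below uses the numbers of the NFKBIE example; the column order follows the matrix of step 6 (NFKBIE log-ratios, IL1A log-ratios, constant term). NumPy is used here purely for illustration and is not necessarily the software used in the thesis.

```python
import numpy as np

# Time derivatives of NFKBIE (step 3), one per time interval.
dx3_dt = np.array([0.006614726, -0.001698377, 0.025096409, -0.005171379])

# Matrix of step 6; columns: NFKBIE log-ratio, IL1A log-ratio, constant.
design = np.array([
    [0.0,        0.0,        1.0],
    [0.13229453, -0.9533947, 1.0],
    [0.06435944, 1.0284119,  1.0],
    [1.57014399, 1.9147139,  1.0],
])

# Least-squares solution of the over-determined system
# (4 equations, 3 unknowns).
solution, _, rank, _ = np.linalg.lstsq(design, dx3_dt, rcond=None)

# The same solution through an explicit QR decomposition:
# design = Q R, so R @ theta = Q.T @ dx3_dt.
q, r = np.linalg.qr(design)
solution_qr = np.linalg.solve(r, q.T @ dx3_dt)
```

Both routes should return, up to rounding, the three values listed in step 7, in the same order as the columns of the matrix.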

By applying steps 6 and 7 to all the NG we obtain the interaction matrix W:

W =
           IL1A           HLA-DMA        NFKBIE         CD59           STAT1           STAT5A
IL1A      −0.008219661    0              0              0              0               0
HLA-DMA   −0.007100533   −0.041492831    0              0              0               0
NFKBIE     0.01244857     0             −0.02545411     0              0               0
CD59      −0.002242316    0              0             −0.070532798    0               0
STAT1      0              0             −0.007619770    0             −0.052077156     0
STAT5A     0              0              0              0             −0.0008217503   −0.0195558641

And the external stimulus response vector b:

$$
b = \begin{pmatrix} 0.008631864 \\ 0.007155039 \\ 0.01126109 \\ 0.008817271 \\ 0.009395349 \\ 0.0263281468 \end{pmatrix}
$$

Following the methodology described so far, we obtain a system of linear differential equations for each of the LM1 and LM2 samples:

$$
\begin{cases}
\frac{dx_1}{dt} = -8.2\,x_1 + 8.6\,u(t) \\
\frac{dx_2}{dt} = -7.1\,x_1 - 41\,x_2 + 7.2\,u(t) \\
\frac{dx_3}{dt} = 12.4\,x_1 - 25.5\,x_3 + 11.3\,u(t) \\
\frac{dx_4}{dt} = -2.2\,x_1 - 70.5\,x_4 + 8.8\,u(t) \\
\frac{dx_5}{dt} = -7.6\,x_3 - 52.1\,x_5 + 9.4\,u(t) \\
\frac{dx_6}{dt} = -0.8\,x_5 - 19.6\,x_6 + 26.3\,u(t)
\end{cases}
\tag{5.5}
$$

Equation 5.5: Differential equations describing gene kinetics for the sample infected with wild-type L. monocytogenes regarding the network from figure 5.2. Each entry is multiplied by a factor of 1000.

$$
\begin{cases}
\frac{dx_1}{dt} = -28.1\,x_1 + 43.9\,u(t) \\
\frac{dx_2}{dt} = -43.8\,x_1 - 36.2\,x_2 + 57.8\,u(t) \\
\frac{dx_3}{dt} = -7.8\,x_1 - 40.5\,x_3 + 23.8\,u(t) \\
\frac{dx_4}{dt} = -6.6\,x_1 - 33.7\,x_4 + 4.9\,u(t) \\
\frac{dx_5}{dt} = -9.3\,x_3 - 12.4\,x_5 + 3.3\,u(t) \\
\frac{dx_6}{dt} = -86.1\,x_5 + 0.6\,x_6 + 5.3\,u(t)
\end{cases}
\tag{5.6}
$$

Equation 5.6: Differential equations describing gene kinetics for the sample infected with mutant L. monocytogenes regarding the network from figure 5.2. Each entry is multiplied by a factor of 1000.

Analysing equations 5.5 and 5.6 we can immediately check whether the interactions between genes described in the network of figure 5.2 are modelled as expected. Genes HLA-DMA and CD59 are expected to be inhibited by IL1A. The entries of the interaction matrix W that translate these two relationships are negative for both L. monocytogenes samples and, hence, the relations are modelled as described in the network. NFKBIE expression, in turn, is described as being stimulated by IL1A. However, the connection is only modelled in this way for the LM1 condition. Moreover, this gene is supposed to stimulate the expression of STAT1, which in turn should stimulate STAT5A. These relations were not verified in the modelled gene kinetics. Concerning the influence of the infection on HLA-DMA, CD59 and STAT5A, the parameter b is positive for these genes, which is congruent with what was expected. However, for NFKBIE the negative link was not modelled as described in the network.

Step 8. Numerically integrate the systems of linear differential equations represented in 5.5 and 5.6 to obtain the model of the NG kinetics along the 240 minutes of the experiment.

(a) Model of the gene kinetics for the samples infected with wild-type L. monocytogenes.
(b) Model of the gene kinetics for the samples infected with mutant L. monocytogenes.

Figure 5.3: Measured (blue dots) and simulated expression kinetics in log-ratios (black lines) of the NG, regarding the network from figure 5.2.

Performing an overall analysis of the plots in figure 5.3, it is possible to conclude that there are no major differences between the kinetics obtained for the LM1 and LM2 conditions. Generally, the genes of the LM1 sample are characterized by a higher expression rate. The only exception is STAT5A, which shows a significant increase in expression along the 240 minutes of the experiment. It can also be noted that genes HLA-DMA and CD59 undergo a stronger down-regulation after 20 minutes in the LM2 condition.
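The numerical integration of step 8 can be sketched as follows for the LM1 system, using the interaction matrix W and the vector b obtained above (values per minute). The integration scheme is not specified in the text, so a classical fourth-order Runge-Kutta method with a fixed step is assumed here; the function name and the step size are likewise choices made for this example.

```python
import numpy as np

# Interaction matrix W and stimulus vector b obtained for LM1.
# Row/column order: IL1A, HLA-DMA, NFKBIE, CD59, STAT1, STAT5A.
W = np.array([
    [-0.008219661, 0.0, 0.0, 0.0, 0.0, 0.0],
    [-0.007100533, -0.041492831, 0.0, 0.0, 0.0, 0.0],
    [0.01244857, 0.0, -0.02545411, 0.0, 0.0, 0.0],
    [-0.002242316, 0.0, 0.0, -0.070532798, 0.0, 0.0],
    [0.0, 0.0, -0.007619770, 0.0, -0.052077156, 0.0],
    [0.0, 0.0, 0.0, 0.0, -0.0008217503, -0.0195558641],
])
b = np.array([0.008631864, 0.007155039, 0.01126109,
              0.008817271, 0.009395349, 0.0263281468])

def integrate(W, b, t_end, dt=0.1):
    """Integrate dx/dt = W x + b for t >= 0, where u(t) = 1 (Runge-Kutta 4)."""
    def f(x):
        return W @ x + b
    n_steps = int(round(t_end / dt))
    x = np.zeros(len(b))  # equilibrium before stimulation: x(0) = 0
    trajectory = [x.copy()]
    for _ in range(n_steps):
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        x = x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        trajectory.append(x.copy())
    times = np.arange(n_steps + 1) * dt
    return times, np.array(trajectory)

times, x = integrate(W, b, t_end=240.0)
```

Each column of `x` holds the simulated log-ratio of one node gene over the 240 minutes. Since IL1A is regulated only by itself in this network, its curve can be checked against the closed-form solution of a single linear equation, which is a convenient sanity test for the integrator.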

Step 9. Perform a permutation test. This test consists of repeating steps 3 to 7 for at least 1000 sets of six genes randomly chosen from the normalized count table. Given that the number of valid genes is 23054 for the LM1 condition and 23019 for the LM2 condition, it was decided to re-sample the set of six genes 5000 times.

Step 10. Measure the MSE/Var for all the tested sets of genes, including the NG. Compare all the measured values with the one obtained for the set of NG and count how many times this value was lower for a random set of genes. Plot these results in a histogram.
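
Steps 9 and 10 can be sketched as follows. This is a minimal illustration, not the thesis implementation: `score_fn` is a hypothetical placeholder for "fit the model to a gene set (steps 3 to 7) and return its MSE/Var", the use of the population variance is an assumption, and the denominator n_resamples + 1 is inferred from the fractions reported with the permutation tests (for example 3/5001 for 5000 re-samplings).

```python
import random
from statistics import pvariance


def mse_over_var(measured, simulated):
    """Sum over genes of MSE(measured, simulated) / Var(measured).

    Both arguments map a gene name to its values along the time-course.
    A gene with constant measured expression would give a zero variance
    and must be filtered out beforehand.
    """
    total = 0.0
    for gene, observed in measured.items():
        fitted = simulated[gene]
        mse = sum((o - f) ** 2 for o, f in zip(observed, fitted)) / len(observed)
        total += mse / pvariance(observed)
    return total


def permutation_p_value(score_fn, network_genes, all_genes, n_resamples, seed=0):
    """Fraction of random gene sets that score strictly lower than the NG."""
    rng = random.Random(seed)
    reference = score_fn(network_genes)
    lower = sum(
        score_fn(rng.sample(all_genes, len(network_genes))) < reference
        for _ in range(n_resamples)
    )
    return lower / (n_resamples + 1)


# Toy demonstration with a hypothetical per-gene score instead of a model fit
toy_scores = {f"g{i}": float(i) for i in range(10)}


def toy_score(genes):
    return sum(toy_scores[g] for g in genes)


p_best = permutation_p_value(toy_score, ["g0", "g1"], list(toy_scores), 200)
p_worst = permutation_p_value(toy_score, ["g8", "g9"], list(toy_scores), 200)
```

In the toy example no random pair can score below the pair with the two lowest scores, so `p_best` is zero, whereas almost every random pair beats the pair with the two highest scores.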

Table 5.1: Statistical values associated with the set of NG from the network in figure 5.2 for both analysed conditions: sum of the NG mean square error (MSE), sum of the NG variance (Var) and sum of the ratio between the NG's mean square error and its variance (MSE/Var).

        MSE      Var      MSE/Var   p-value
LM1     1.760    3.278    3.338     ≈ 39.1%
LM2     1.653    4.710    3.658     ≈ 67.1%

(a) Permutation test with respect to the network of figure 5.2 for the wild-type L. monocytogenes infected sample. The associated p-value is 39.08 % (1954/5001).

(b) Permutation test with respect to the network of figure 5.2 for the mutant L. monocytogenes infected sample. The associated p-value is 67.06 % (3353/5001).

Figure 5.4: Distribution of the ratio between the mean square error and the variance in the permutation tests. The red line represents this value for the NG. The gene set re-sampling was performed 5000 times.

Finally, the permutation test makes it possible to assess the statistical significance of the network in analysis. The p-value translates the probability that the network in analysis occurs by chance in the dataset. Bearing these considerations in mind, the performed analysis does not support the activation of the network of figure 5.2 in the L. monocytogenes RNA-Seq data. For both LM1 and LM2 conditions the calculated p-value is very high, showing that the occurrence of this network is not statistically significant and, therefore, that this network is not a good model of the response registered in the L. monocytogenes samples. The absence of the expected interactions between the genes in the NGS data can be a consequence of these relationships having been extracted from an infection with Escherichia coli, a different pathogen from the one in the case study analysed in this thesis. In fact, differences in the signalling mechanisms involved in the invasion of human brain microvascular endothelial cells were documented between these two bacteria [262].

5.2.2 Network 2

The network analysed in this subsection results from a literature review on L. monocytogenes infection. This review was performed by Loretta Magagula, a collaborator from the Mhlanga laboratory at CSIR, in South Africa. Initially this network was presented with thirteen NG: NDRG1, ATF3, IL8, IL4, CCL2, IFNG, IL5, DDIT3, TNF, IL13, IL12A, IKBKB, NFKB. After assessing the measured gene expression levels, only seven of these genes presented non-null expression and thereby the initial network was trimmed to the one illustrated in figure 5.5(b). Therefore, following this pre-processing, the NG with which the analysis will proceed are: NDRG1, ATF3, IL8, CCL2, DDIT3, IKBKB, NFKB.

(a) Raw gene network. (b) Network after trimming of the non-expressed genes in the Listeria monocytogenes dataset.

Figure 5.5: Network constructed from a literature review on L. monocytogenes infection. The arrows in green represent stimuli and the arrows in red represent inhibitions. This network was kindly provided by Loretta Magagula from the Mhlanga laboratory at CSIR, in South Africa [245].

Table 5.2: Index of each NG and its respective Ensembl identification.

Gene     Index   Ensembl gene id
NDRG1    x1      ENSG00000104419
ATF3     x2      ENSG00000162772
IL8      x3      ENSG00000169429
CCL2     x4      ENSG00000108691
DDIT3    x5      ENSG00000175197
IKBKB    x6      ENSG00000104365
NFKB     x7      ENSG00000109320

The simulation of how the kinetics of these NG vary in the L. monocytogenes dataset is represented, for both LM1 and LM2 samples, in figures 5.6(a) and 5.6(b), according to equations 5.7 and 5.8, respectively. From the network illustrated in figure 5.5(b) we expect certain types of relation between the genes. In particular, NDRG1 is expected to have an inhibitory effect on both the ATF3 and IKBKB genes. For ATF3 this was confirmed for both samples, with the parameter w21 in the dx2/dt function being negative. However, for IKBKB this relation was found to be congruent only for the LM2 sample. Regarding the genes that ATF3 influences, a positive sign is expected for w32, w42 and w52. Nevertheless, this was only confirmed for the interaction with the IL8 gene. For both CCL2 and DDIT3 the modelled relations were inhibitory. With respect to the relation between IKBKB and NFKB, the modelled interaction also contradicted the influence described in the gene network. Finally, the negative relation between NFKB and IL8 was also modelled as a stimulation. Thus, the agreement between the modelled relations and those described in the gene network of figure 5.5(b) is poor, being limited to the interactions NDRG1 → ATF3, NDRG1 → IKBKB (in the LM2 sample) and ATF3 → IL8.
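
The comparison between expected and modelled signs described above can be written as a small check. In this sketch the coefficients are transcribed from the system of equations for this network in which dx6/dt has a negative x1 coefficient, which the text associates with the LM2 sample. The expected sign of the IKBKB → NFKB edge is an inference from the statement that the modelled (positive) interaction contradicts the network. The dictionary layout and function name are hypothetical.

```python
# Expected sign of each edge (source, target): -1 inhibition, +1 stimulation.
expected = {
    ("NDRG1", "ATF3"): -1,
    ("NDRG1", "IKBKB"): -1,
    ("ATF3", "IL8"): +1,
    ("ATF3", "CCL2"): +1,
    ("ATF3", "DDIT3"): +1,
    ("IKBKB", "NFKB"): -1,  # inferred from the text, see lead-in
    ("NFKB", "IL8"): -1,
}

# Modelled interaction coefficients (entries of W, multiplied by 1000).
modelled = {
    ("NDRG1", "ATF3"): -14.8,   # w21
    ("NDRG1", "IKBKB"): -7.7,   # w61
    ("ATF3", "IL8"): 32.2,      # w32
    ("ATF3", "CCL2"): -2.6,     # w42
    ("ATF3", "DDIT3"): -13.9,   # w52
    ("IKBKB", "NFKB"): 6.6,     # w76
    ("NFKB", "IL8"): 22.8,      # w37
}


def congruent_edges(expected, modelled):
    """Return the edges whose modelled coefficient has the expected sign."""
    return [edge for edge, sign in expected.items() if modelled[edge] * sign > 0]


agree = congruent_edges(expected, modelled)
```

For these coefficients only three of the seven edges agree with the network, which matches the poor agreement reported in the text.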

$$
\begin{cases}
\dfrac{dx_1}{dt} = -6.9\,x_1 + 10.3\,u(t) \\
\dfrac{dx_2}{dt} = -29.9\,x_1 - 20.4\,x_2 + 11.02\,u(t) \\
\dfrac{dx_3}{dt} = 44.1\,x_2 - 49.7\,x_3 + 24.9\,x_7 - 19.7\,u(t) \\
\dfrac{dx_4}{dt} = -3.8\,x_2 - 8.5\,x_4 + 4.1\,u(t) \\
\dfrac{dx_5}{dt} = -3.6\,x_2 - 21.7\,x_5 - 15.8\,u(t) \\
\dfrac{dx_6}{dt} = 2.2\,x_1 - 39.1\,x_6 + 5.2\,u(t) \\
\dfrac{dx_7}{dt} = 88.7\,x_6 - 11.7\,x_7 + 3.4\,u(t)
\end{cases}
\tag{5.7}
$$

Equation 5.7: Differential equations describing the gene kinetics for the sample infected with wild-type L. monocytogenes, relative to the network from figure 5.5(b). Each entry is multiplied by a factor of 1000.

$$
\begin{cases}
\dfrac{dx_1}{dt} = -8.0\,x_1 + 0.8\,u(t) \\
\dfrac{dx_2}{dt} = -14.8\,x_1 - 50.3\,x_2 + 121.9\,u(t) \\
\dfrac{dx_3}{dt} = 32.2\,x_2 - 28.3\,x_3 + 22.8\,x_7 - 47.8\,u(t) \\
\dfrac{dx_4}{dt} = -2.6\,x_2 - 7.1\,x_4 + 21.9\,u(t) \\
\dfrac{dx_5}{dt} = -13.9\,x_2 - 33.5\,x_5 - 28.2\,u(t) \\
\dfrac{dx_6}{dt} = -7.7\,x_1 - 37.2\,x_6 - 3.2\,u(t) \\
\dfrac{dx_7}{dt} = 6.6\,x_6 - 4.8\,x_7 + 2.0\,u(t)
\end{cases}
\tag{5.8}
$$

Equation 5.8: Differential equations describing the gene kinetics for the sample infected with mutant L. monocytogenes, relative to the network from figure 5.5(b). Each entry is multiplied by a factor of 1000.

(a) Model of the gene kinetics for the samples infected with wild-type L. monocytogenes.

(b) Model of the gene kinetics for the samples infected with mutant L. monocytogenes.

Figure 5.6: Measured (blue dots) and simulated expression kinetics in log-ratios (black lines) of the NG, regarding the network from figure 5.5(b).

By comparing empirically the gene kinetics of the LM1 and LM2 samples, it is easy to conclude that all the genes have higher expression levels in the LM1 sample. This could mean that a more intense cellular response occurred in the LM1 sample than in the LM2 sample. In order to confirm this statistically, a permutation test was performed. For this specific case, the result of the test shows that the NG set is well suited to the L. monocytogenes dataset, with a considerably low p-value. In particular, for the LM1 sample only three sets of randomly chosen genes had a lower ratio between the mean square error and the variance (MSE/Var) and, hence, the p-value is 0.059 % (figure 5.7(a)). Concerning the LM2 sample, the existence of the network of interest in the data has an associated p-value of 10.137 % (figure 5.7(b)). The difference between the two p-values might arise from the fact that in the LM1 sample the bacterium is able to disrupt its internalization vacuole and use the cell machinery to replicate. In contrast, in the LM2 sample the bacterium lacks the gene for LLO and is not able to invade the cell cytosol. Therefore, a stronger immunological cell response is observed in the first sample. This is reflected in the difference between the p-values, which describe statistically how well the known network fits the dataset.

Table 5.3: Statistical values associated with the analysis of the NG from the network in figure 5.5(b).

        MSE      Var       MSE/Var   p-value
LM1     2.084    12.975    1.253     ≈ 0.059%
LM2     1.653    4.710     3.658     ≈ 10.137%

(a) Permutation test with respect to the network of figure 5.5(b) for the wild-type L. monocytogenes infected sample. The associated p-value is 0.059 % (3/5001).

(b) Permutation test with respect to the network of figure 5.5(b) for the mutant L. monocytogenes infected sample. The associated p-value is 10.137 % (507/5001).

Figure 5.7: Distribution of the ratio between the mean square error and the variance in the permutation tests. The red line represents this value for the NG.

Therefore, considering the results above, it is possible to conclude that the data being analysed is consistent with the published network. The p-value associated with the measured statistic (MSE/Var) is quite low for the NG, particularly for the LM1 sample. Hence, the described set of genes is not random and it is present in the analysed dataset in the same way as it is described in the analysed gene network. Furthermore, this analysis also supports the idea that a stronger immunological response occurred in the LM1 sample than in the LM2 sample. These results were evidenced in the analysis of the L. monocytogenes dataset by the developed pipeline and are congruent with what we were expecting.

5.3 Discussion

The analysis of RNA-Seq data in a systems biology context gives insight into the cell pathways that are active upon a certain stimulus and along a time-course. The methodology proposed here takes advantage of this information to validate the RNA-Seq data. The validation performed by this method does not intend to find biases in the data; instead, it aims to show that the genes in the RNA-Seq dataset are well modelled by a published gene network. The network used can describe a pathway that is expected to be active (knowing which stimulus was applied to the cell before transcriptome sequencing). This is done when the intention is to investigate whether the data describes well the biological process under study. However, this methodology can also be applied when a pathway is thought not to be active: it makes it possible to investigate whether an adjacent pathway is activated when the cell is subjected to different conditions.

Moreover, this methodology may prove particularly useful for RNA-Seq datasets that do not have replicates. Without replicates it is hard to understand whether the RNA-Seq reads translate the cell transcriptome well or are just a consequence of sampling noise. This method provides a way to test whether a certain model of a biological pathway is, in fact, a good model for the RNA-Seq dataset. By using all the single samples in the time-course dataset, the developed methodology is able to extract a general conclusion about the gene expression levels along the several time-points, that is, the gene kinetics. Ultimately, the gene kinetics information will show whether a certain cellular pathway is described in the NGS data and, moreover, will support the comparative analysis between the single time-course samples performed by the pipeline proposed in section 4.2 of chapter 4.

This methodology can also be useful when the RNA-Seq dataset contains replicate reads.
The available tools that statistically compare RNA-Seq reads are able to investigate the congruency between single samples, but not the biological processes that they describe from a systems biology perspective [108, 116, 117, 263]. The use of this novel methodology not only makes it possible to investigate whether a process is described along a time-course but also, by investigating how well a certain network models a dataset, clarifies the similarity between the NG variation among the samples. This new methodology intends to complement the available tools that statistically compare RNA-Seq reads by consolidating their conclusions.

The assessment of how well a network models an RNA-Seq dataset can also give insight into the pathways that are activated by the different stimuli and environments to which the cell is subjected. This elucidates how a condition influences a certain cellular response. An example of this situation is shown in subsection 5.2.2, where it is found that the immunological response network fits the RNA-Seq dataset extracted from cells infected with wild-type L. monocytogenes better than the dataset extracted from cells infected with mutant L. monocytogenes.

Nonetheless, this methodology is based on several assumptions that may limit the obtained results. For instance, if a gene that is described in the network in analysis is not expressed, the methodology trims it from the network and, consequently, from the analysis. In addition, the time derivatives dxi/dt need to be determined by linear interpolation and, therefore, their values are only estimates. However, this approach was adopted by several researchers to model their data in the gene network context and proved to be effective [1, 260, 264, 265]. This fact motivated its use in the developed methodology.

Finally, the ratio between the mean square error and the variance (MSE/Var), which translates the statistical significance of a given gene network in an NGS dataset, might not be accurate in cases where the variance associated with the NG is very low. This occurs because sets of random genes with high variances will appear in the permutation test. These high values imply that the measured statistic will be low, even if the MSE is high. Apart from this restriction, the approach is effective because it discards sets of genes with low variance that, even when tested against a bad model, will have a low associated MSE. Moreover, only a small set of genes is expected to have a large variance. These correspond to the genes that were influenced by the stimulus applied to the sequenced cells. With these considerations in mind, MSE/Var can be a limitation in the cases referred to above, but it also makes it possible to discard genes that are not statistically significant.
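
The estimation of the time derivatives dxi/dt mentioned above can be illustrated with a short sketch. It assumes that "linear interpolation" means taking the slope of the straight line between consecutive time-points. The sampling times, the expression values and the function name are hypothetical and only serve to show why the result is an estimate.

```python
def finite_difference_derivatives(times, values):
    """Slope between consecutive measurements of one gene.

    Returns one estimate per interval, so the result has one element
    fewer than the input series.
    """
    return [
        (values[k + 1] - values[k]) / (times[k + 1] - times[k])
        for k in range(len(times) - 1)
    ]


# Hypothetical time-course (minutes) and log-ratio expression of one gene
times = [0, 20, 60, 120, 240]
values = [0.0, 1.0, 1.5, 1.2, 0.6]
slopes = finite_difference_derivatives(times, values)
```

With few and unevenly spaced time-points, each slope averages the true derivative over a long interval, which is why these values should be treated as rough estimates.
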
Concerning the existence of available tools that perform a task similar to the one described in this chapter, to my knowledge there is no tool which performs it in an identical way. The existing tools take the gene expression levels of a given RNA-Seq sample and try to group them in order to extract biological processes that are common to a wide range of genes (final step of the pipeline described in chapter 4, performed by GOStats) [266–269]. This type of analysis gives insight into the cellular processes that are active when the data is collected and might show the occurrence of a given response, but it is not able to combine this information into an inter-connected system, as happens with gene networks. Another frequently used approach is to model gene networks from the RNA-Seq data, in an attempt to understand the relationships between genes under a certain stimulus [1, 264, 265, 270–273]. The methodology that I propose in this chapter does neither of these tasks. Instead, it complements both of these analyses and was created in an effort to support the conclusions extracted from these usual approaches. For the first type of analysis, this tool can be used to provide evidence of a given biological response in an NGS dataset. For the second, this methodology can be useful to confirm the existence of the inferred network in the data being modelled. Therefore, the methodology presented here intends to have a supporting role for already well-established approaches, aiding in the discovery of the biological "secrets" that revolutionary tools such as RNA-Seq are starting to reveal by profiling the cell transcriptome with unprecedented quality.

5.4 Interface

The methodology described in this chapter was developed with the intention of being used not only by bioinformaticians but also by biologists. Thus, it was crucial to develop an interactive interface where users are able to enter their data and perform the validation analysis without needing to use the command line. To this end, I designed an HTML interface that makes it easy for users with no computational know-how to gather the data necessary to carry out the analysis.

The HTML interface has a page header with two tabs: Analysis and Results. In the Analysis tab it is possible to enter all the data needed to perform the RNA-Seq consistency evaluation. In the Results tab the user is able to access the results and draw conclusions about the congruency between the input data and the known gene network.

The Analysis tab has four main sections. The first section, Enter Count File, provides a place to upload the file which contains the information about the gene expression levels. In this section, the user is also asked to input the RNA-Seq library type and the time-course over which the data was acquired. In the second section, Enter Network, the user inserts the known gene regulatory network. After entering the NG in the first text box, the user is required to connect those genes in the second box according to the known network, connecting the green balls if the source gene stimulates the target gene or the red balls if the source gene inhibits the target gene. The Permutation Test Properties section requires the statistical test parameters, such as the number of re-samplings and a variance limit. This limit allows the user to require the variance of the random sets of genes to be greater than a cut-off value. Finally, in the Analysis section, after entering all the necessary data, the user only needs to click the GO! button to perform the intended analysis.

After the analysis is complete, the Results tab displays the plots with the gene kinetics and the respective differential equations in a primary area, the Gene Kinetics section. The second section, named Permutation Test, contains the histograms that resulted from the permutation test and, additionally, some statistical values related to it.

Figures 5.8 and 5.9 show how the developed interface can be used to evaluate the L. monocytogenes RNA-Seq data for network 2, described in section 5.2.2. After the information entered about the L. monocytogenes dataset and the respective published gene network (figure 5.8) has been analysed, the results are presented in the Results tab (figure 5.9). As expected, the outcome is identical to the one obtained in section 5.2.2 using the command line.
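
The variance limit of the Permutation Test Properties section can be pictured as a rejection step applied while drawing random gene sets. The sketch below is only one possible reading: it assumes that the variance of a gene set is the sum of the per-gene variances along the time-course (as in the "sum of the NG variance" reported in the tables), uses the population variance, and all names and values are hypothetical.

```python
import random
from statistics import pvariance


def sample_gene_set(counts, size, min_variance, rng, max_tries=10_000):
    """Draw a random gene set whose summed variance exceeds the cut-off.

    counts maps a gene name to its expression values along the time-course.
    """
    genes = list(counts)
    for _ in range(max_tries):
        candidate = rng.sample(genes, size)
        total_var = sum(pvariance(counts[g]) for g in candidate)
        if total_var > min_variance:
            return candidate
    raise ValueError("no gene set exceeded the variance cut-off")


# Hypothetical count table: two flat genes and two variable genes
counts = {
    "flat1": [1.0, 1.0, 1.0],
    "flat2": [2.0, 2.0, 2.0],
    "varA": [0.0, 3.0, 6.0],
    "varB": [0.0, 6.0, 12.0],
}
chosen = sample_gene_set(counts, size=2, min_variance=25.0, rng=random.Random(1))
```

With a cut-off of 25 only the pair formed by the two variable genes is accepted, since their variances (6 and 24) are the only ones that add up to more than the limit.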

Figure 5.8: Analysis tab interface with all the required L. monocytogenes data information (panels (a), (b) and (c)).

Figure 5.9: Results tab interface after the evaluation of the L. monocytogenes information entered in figure 5.8.

5.5 Future work

In order to strengthen the reliability of the new tool described in this chapter, a new dataset with more robust information should be processed with this method. It would be important to analyse an RNA-Seq dataset that studies a cell process that is not related to the immunological response. Additionally, the active cell pathways in this dataset should be known a priori. With this information, inspecting the p-value gives insight into the reliability of the algorithm. The methodology should output a low p-value if the RNA-Seq dataset is tested against a gene network describing the process known a priori to be active. On the other hand, a high p-value should be obtained if the new NGS data is analysed against a network describing a biological pathway that is known not to be active. Finally, if the methodology is found not to be as robust as expected, the developed algorithm needs to be improved. This can be achieved, for instance, by introducing a restriction on the sign of the interaction between the genes, forcing the connections among them to be as described in the gene network under test. Another approach would be to improve the MSE/Var formula that measures the error between the modelled kinetics and the experimentally measured values from which the model was calculated.

6 Using publicly available data as RNA-seq replicates

Contents

6.1 Methods
6.2 Listeria monocytogenes case study
6.3 Discussion
6.4 Future work

When applying statistical methods, such as the ones provided by the DESeq package, it is extremely important to have proper replicates in order to validate the obtained results. In fact, without replicates, it is not possible to assert whether differences between conditions are just a consequence of experimental and biological noise. As previously mentioned, in order to perform differential analysis without replicates, DESeq needs to make several assumptions. Firstly, to estimate the genes' dispersion, it considers all the different conditions as replicates of one single condition. Secondly, it assumes that most genes behave similarly across conditions and, hence, that the estimated variance should not be highly affected by the differentially expressed genes. Therefore, the software expects only a minority of genes to be influenced by the different applied stimuli.

Nevertheless, the probability of the compared conditions being extremely different is relatively high and, without replicates, DESeq takes a conservative approach to cope with this kind of data. As a consequence, the estimated dispersion will be too high and, since the variance value plays a prominent role in the determination of differentially expressed genes, a very low number of genes, or even none, will appear as differentially expressed when, in fact, the opposite is happening.

In this chapter I test the hypothesis that publicly available data can serve as calibration in the DE analysis of a given RNA-Seq dataset, and show that this methodology is not reliable. The approach is based only on publicly available data and bioinformatics tools. To test the formulated methodology, I used the L. monocytogenes control sample, which corresponds to the transcriptome of HeLa cells growing in a healthy medium without any stimulus. RNA-Seq reads acquired from HeLa cells that played the role of control in other published studies were extracted from public databases. Then, I investigated whether this public data can be used as a replicate in the inference of differentially expressed genes in the L. monocytogenes RNA-Seq dataset. This was shown not to be reliable, with a high variation between the samples that were supposed to be replicates. In order to test whether the poor results were related to biases in the L. monocytogenes control sample, I used the same methodology to test whether a given public dataset could be used as a replicate of another public RNA-Seq control sample. The output of this approach was also poor, although better than that obtained for the L. monocytogenes control sample. Finally, I discuss why this methodology fails to support the RNA-Seq dataset in analysis. Although the results from this methodology were poor, this analysis is not worthless: it allows the conclusion that the approach defined in this chapter is not a reliable way to reinforce the DE inference among RNA-Seq samples.

6.1 Methods

To perform a DE analysis with confidence, it is necessary to have replicates. Otherwise, there is no way to infer with confidence the biological variation associated with each gene of the samples in analysis. The methodology proposed here intends to support the DE analysis by using publicly available data and bioinformatics tools. This idea arose from the challenge of validating biological conclusions from RNA-Seq datasets without replicates. Nevertheless, it is important to keep in mind that, ultimately, this approach was shown not to be feasible. In this section I describe the methodology used to reach these negative results.

This methodology starts by extracting RNA-Seq samples similar to the ones in analysis from public databases, such as Gene Expression Omnibus (GEO) [274] or ArrayExpress [275]. Then, to assess the similarity between the samples, the pipeline described in chapter 4 is used up to the DESeq step. At this point, it is possible to understand whether the public sample is suitable to be used as a replicate of the RNA-Seq sample in analysis, both by analysing the output table of differentially expressed genes and, mainly, by observing the MA-plots illustrating the distribution of the genes' log fold change along the several expression levels. If the sample downloaded from a public database is similar to the RNA-Seq sample that it is intended to support, the MA-plot will have the shape of a narrow funnel.
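
The quantities plotted in such an MA-plot can be sketched as follows. This is a simplified illustration and not the DESeq implementation: it assumes that the two count vectors are already normalized, skips genes with a zero count in either sample (where the log ratio is undefined), and uses hypothetical function names.

```python
import math


def ma_values(counts_a, counts_b):
    """Return (mean expression, log2 fold change) for each usable gene."""
    points = []
    for a, b in zip(counts_a, counts_b):
        if a > 0 and b > 0:  # the log ratio is undefined for zero counts
            points.append(((a + b) / 2, math.log2(b / a)))
    return points


def fraction_beyond(points, threshold):
    """Fraction of genes whose absolute log2 fold change exceeds threshold."""
    return sum(abs(m) > threshold for _, m in points) / len(points)


# Hypothetical normalized counts for three genes in two samples
points = ma_values([10, 8, 0], [10, 32, 5])
spread = fraction_beyond(points, 1.0)
```

For two samples that behave as replicates, most log2 fold changes stay close to 0 and narrow further as the mean expression grows, which gives the funnel shape. A large fraction of genes beyond a fixed threshold is a simple numerical symptom of the opposite situation.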

Figure 6.1: Flowchart of the methodology developed to test the congruency between public RNA-Seq samples and the sample in analysis.

6.2 Listeria monocytogenes case study

To understand whether the methodology presented here is valid, the control sample from the L. monocytogenes dataset was used to evaluate it. The control sample was generated by deep sequencing all poly-(A) tailed mRNAs of a HeLa cell population growing on a plate. Bearing this in mind, the first step of this methodology was to search the available databases for RNA-Seq samples obtained from a population of HeLa cells growing in a healthy medium. After investigating the publicly available data, four different scientific works were found which collected HeLa cell transcriptomes and made them publicly available (table 6.1).

Table 6.1: Summary of the publicly available datasets used.

            Cell line   Data format   Data type    Nb. biological replicates   Nb. technical replicates   Nb. of files   Reference
Dataset 1   HeLa        BAM           Paired-end   0                           0                          1              [276]
Dataset 2   HeLa        FASTQ         Single-end   1                           0                          2              [277]
Dataset 3   HeLa        FASTQ         Paired-end   1                           1                          4              [278]
Dataset 4   HeLa S3     FASTQ         Paired-end   1                           0                          2              [279]

In order to understand whether the previously explained approach is valid, two sorts of analysis were performed: firstly, the control sample from the L. monocytogenes dataset was evaluated against the public data; then, the several public samples were compared with each other in order to understand whether the obtained results were similar to the results of the first analysis.

Regarding the bioinformatics methodology, the pipeline described in chapter 4 was employed in an approach entirely similar to the one described in section 4.2. However, given that in this analysis there is no interest in assessing the differentially active cellular processes between the samples, the GOStats step was not performed. Very roughly, the computational methodology was as follows: first, the public reads were mapped against the human genome using bowtie 2; then, the number of reads that mapped to each gene was counted using the HTSeq-count script; and, finally, a statistical analysis was performed between the control sample of the L. monocytogenes dataset and the several public RNA-Seq samples using DESeq. The reference files used in this analysis, such as the human genome, for bowtie 2, and the file containing the information about the coding portions of the genome, for HTSeq-count, were the pre-defined files of the pipeline (both obtained from the Ensembl website).

6.2.1 Comparison between Listeria monocytogenes control and the publicly available data

6.2.1.A Dataset 1

Dataset 1 was downloaded from the GEO website [280] and resulted from a study of large intergenic non-coding RNA (lincRNA) based on RNA-Seq data from several human tissues. The final objective of this study was to define a reference catalog of human lincRNAs [276]. Given that the downloaded sample had already been mapped against the Hg19 human reference genome [281] using TopHat 1.1.4 [282], it was only necessary to count how many reads mapped to each gene using HTSeq-count (with the UCSC hg19 gene annotation file [283]), convert the RefSeq gene ids to Ensembl gene ids using the biomaRt database [284] and perform the statistical analysis using DESeq.

Figure 6.2 represents the log2 fold change as a function of the mean expression values for the comparative analysis between the L. monocytogenes control sample and the public data described above. More specifically, each dot in the plot of figure 6.2 corresponds to a gene. The x-axis represents the baseMean, which corresponds to the number of reads divided by the size factor (normalization constant) of the sample; the y-axis represents the log2 of the fold change, which describes how much a quantity changes from one condition to another. Finally, the red dots are outliers, which are judged to be differentially expressed. Analysing this plot, it is possible to conclude that the two conditions are not similar. Focusing, for example, on genes with relatively low expression (with base mean near 10), the log2 fold change varies, for most of the genes, between -5 and 5. This is equivalent to a variation of up to 32-fold (2^5) between the conditions being compared. Thereby, the difference between conditions is about three times higher than the genes' mean expression. This discrepancy invalidates the use of this sample as a replicate of the control sample of the L. monocytogenes dataset.
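
The size factor and the two plotted quantities can be illustrated with the sketch below. It follows the median-of-ratios idea used by DESeq to estimate size factors, written here in plain Python for two samples. The function names and the toy counts are hypothetical, and this is a simplified illustration rather than the package's code.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts is a list of per-gene rows, each holding one raw count per sample.
    Genes with a zero count in any sample are ignored.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the geometric mean of each usable gene across samples
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    return [
        math.exp(median(math.log(row[j]) - lg
                        for row, lg in zip(usable, log_geo_means)))
        for j in range(n_samples)
    ]


def base_mean_and_log2fc(row, factors):
    """Mean normalized count and log2 fold change (sample 2 over sample 1).

    Both normalized counts must be non-zero for the log ratio to be defined.
    """
    normalized = [c / f for c, f in zip(row, factors)]
    return sum(normalized) / len(normalized), math.log2(normalized[1] / normalized[0])


# Hypothetical raw counts: the second sample was sequenced twice as deeply
counts = [[10, 20], [100, 200], [50, 100], [0, 7]]
factors = size_factors(counts)
base_mean, log2fc = base_mean_and_log2fc(counts[0], factors)
```

In this toy table the second size factor is twice the first, so after normalization the first gene has a log2 fold change of 0. For reading the y-axis, a log2 fold change of 5 corresponds to a factor of 2**5 = 32 and a value of -10 to a factor of 2**10 = 1024 in the opposite direction.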

Figure 6.2: Plot of the mean of normalized counts versus log2 fold change for the contrast between the L. monocytogenes control sample and the sample from dataset 1. Outlier genes are represented in red.

6.2.1.B Dataset 2

The second dataset is composed of two single-end samples of HeLa cells [285]. The main goal of this data was to study the role of RNA molecules in the maintenance of a decondensed and biologically active interphase chromatin conformation [277]. This dataset was processed according to the methodology explained in section 6.1. Comparing both replicates with the L. monocytogenes control (figures 6.3(a) to 6.3(c)), it is possible to conclude that there is a high variance between these datasets, which is most pronounced for the down-regulated genes. In fact, the samples are so different that no gene is found to be differentially expressed. This can be explained by recalling how DESeq estimates differential signal between samples: without replicates, DESeq assumes that the samples in analysis are replicates and, thereby, expects most genes to behave similarly across conditions. This is not the case for these samples and, as a consequence, the estimated dispersion will be too high. Since this value plays a prominent role in the determination of differentially expressed genes, a very low number of genes, or even none, will appear as differentially expressed when, in fact, the opposite is happening. For the genes with log2 fold change below 0, the dispersion is highly different from the dispersion observed between the two biological replicates. Finally, analysing figure 6.3(d), it is possible to conclude that the public replicates are not perfect but are much more similar to each other than to the control sample. In conclusion, it is highly imprudent to use this public data as a replicate of the L. monocytogenes control condition.

(a) L. monocytogenes dataset control versus the first replicate from dataset 2.

(b) L. monocytogenes dataset control versus the second replicate from dataset 2.

(c) L. monocytogenes dataset control versus the samples from dataset 2 (considered as replicates).

(d) First versus second replicate from dataset 2.

Figure 6.3: Plot of the mean of normalized counts versus log2 fold change for the contrast between the L. monocytogenes control sample and the samples from dataset 2. Outlier genes are represented in red.

6.2.1.C Dataset 3

The third dataset was downloaded from the ArrayExpress website [286] and resulted from a study which aimed to understand how cellular mechanisms prevent Alu elements [287] from being incorporated into mature transcripts [278]. In particular, Luscombe et al. [278] first found that hnRNPC, an RNA-binding protein, competes with the splicing factor U2AF65 at many cryptic splice sites. With this in mind, the authors performed an RNA-seq experiment to investigate how this competition influences the splicing process, using two gene knockdowns as well as control HeLa cells [278]. The samples of interest are, therefore, the ones used as control. This dataset contains two biological replicates and one technical replicate for each biological replicate.

To understand how these samples are correlated with our control data, reads were processed as previously explained in section 6.1. Results from the comparative analysis between the four replicates and our control data, and between the two biological replicates, are illustrated in figure 6.4. The gene dispersion in figures 6.4(a) to 6.4(c) is very high, with a portion of the genes having a log2 fold change of approximately -10, which corresponds to a 1024-fold (2^10) difference in expression between the two conditions being analysed. Compared with the results obtained in subsections 6.2.1.A and 6.2.1.B, DESeq was able to find many more outlier genes (red dots in figures 6.4(a) to 6.4(c)). This is because this dataset is much more reliable than the previous ones, containing 4 replicated samples. With this data, DESeq is able to infer a variance threshold with more confidence and, hence, to determine the differential signal among samples. In contrast, the gene fold changes in the comparison between the two replicates (figure 6.4(e)) are concentrated near 0, as expected, and show a very low dispersion.
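The relation between a log2 fold change and the underlying expression ratio, together with the two coordinates plotted in the MA-plots of figure 6.4, can be sketched as follows. The function names, the pseudocount and the example counts are assumptions introduced for illustration. They are not part of the DESeq analysis itself.

```python
import math


def ma_point(norm_a, norm_b, pseudocount=1.0):
    """Return (mean of normalized counts, log2 fold change of b over a).

    The pseudocount is a hypothetical choice that keeps the ratio finite
    when one of the two counts is zero.
    """
    mean_count = (norm_a + norm_b) / 2
    log2_fc = math.log2((norm_b + pseudocount) / (norm_a + pseudocount))
    return mean_count, log2_fc


def fold_ratio(log2_fc):
    """Convert a log2 fold change into the magnitude of the expression ratio."""
    return 2 ** abs(log2_fc)


# Hypothetical gene: 1023 normalized counts in sample A, none in sample B.
mean_count, log2_fc = ma_point(1023, 0)
print(mean_count, log2_fc)

# A log2 fold change of -10 corresponds to a 1024-fold difference.
print(fold_ratio(-10))
```

A gene that is essentially absent in one sample and moderately expressed in the other is enough to produce a point near -10 on the vertical axis, so such points indicate near on/off differences rather than subtle shifts.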

(a) L. monocytogenes dataset control versus the first replicate from dataset 3.
(b) L. monocytogenes dataset control versus the second replicate from dataset 3.
(c) L. monocytogenes dataset control versus the samples from dataset 3 (considered as replicates).
(d) First versus second replicate from dataset 3.
(e) Comparison between the technical replicates of the first sample of dataset 3.

Figure 6.4: Mean of normalized counts versus log2 fold change for the contrast between the L. monocytogenes control sample and the samples of dataset 3. Outlier genes are represented in red.

6.2.1.D Dataset 4

In September 2003, the National Human Genome Research Institute (NHGRI) launched a public research consortium named ENCODE to carry out a project to identify all functional elements in the human genome sequence [279]. The mission was to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology [288]. In the context of this project, RNA-Seq data from HeLa S3 cells were acquired [289–292]. The methodology used to compare this dataset with our control was the same as in the previous subsections. The gene fold changes show a large variance, meaning that the two conditions are highly different (figures 6.5(a) to 6.5(c)). When comparing the two replicates (figure 6.5(d)), the gene fold changes are much more concentrated and vary around 0, implying that the two samples are similar.

(a) L. monocytogenes dataset control versus the first replicate from dataset 4.
(b) L. monocytogenes dataset control versus the second replicate from dataset 4.
(c) L. monocytogenes dataset control versus the samples from dataset 4 (considered as replicates).
(d) First versus second replicate from dataset 4.

Figure 6.5: Mean of normalized counts versus log2 fold change for the contrast between the L. monocytogenes control sample and the samples of dataset 4. Outlier genes are represented in red.

6.2.2 Comparison between publicly available data

To evaluate whether the variance between the control sample of the L. monocytogenes dataset and the public data was related to biases introduced by our control sample, the same differential analysis was performed among the publicly available datasets used in the previous section. This section describes that analysis.

Comparing the MA-plots in figure 6.6 with those obtained from the comparison between the L. monocytogenes control sample and the public data (figures 6.2, 6.3(c), 6.4(c) and 6.5(c)), a different dispersion of the gene fold changes is visible. Specifically, in figure 6.6 the scattering of the genes is similar to that observed when comparing two replicate samples (see figures 6.3(d), 6.4(e) and 6.5(d)). This means that, although the samples being compared show considerable dispersion, the difference is not as pronounced as that observed in the comparison between these datasets and our control data. In particular, a large number of DE genes was found in every analysis involving dataset 3. As explained previously, this is due to the high statistical confidence of the variance estimated for dataset 3: any gene whose fold change exceeds that threshold is considered an outlier.

(a) Samples from dataset 1 versus the samples from dataset 2.
(b) Samples from dataset 1 versus the samples from dataset 3.
(c) Samples from dataset 1 versus the samples from dataset 4.
(d) Samples from dataset 2 versus the samples from dataset 3.
(e) Samples from dataset 2 versus the samples from dataset 4.
(f) Samples from dataset 3 versus the samples from dataset 4.

Figure 6.6: Mean of normalized counts versus log2 fold change for the contrast between the public datasets. Outlier genes are represented in red.

6.3 Discussion

Even though this methodology could improve RNA-Seq DE inference in datasets without replicates, the results of the analysis between the control sample of the L. monocytogenes dataset and the public transcriptome data acquired from the HeLa cell line were not good. For all four datasets tested, the variance associated with the comparison was very high and, therefore, this methodology proved not to be valid, at least for this dataset. In every case, a strong batch effect was found between the public samples and the L. monocytogenes control. In fact, this methodology has a major limitation: it is highly dependent on the data acquisition protocol. Even when the cells are simply growing on a plate, cultured in a healthy medium, the transcriptome can be deeply influenced by the external environment or the chosen growth medium. Moreover, variability can also be introduced by the library construction from the mature cell population or even by the sequencing process. All these factors can contribute to the failure of this methodology.

Nevertheless, when the public datasets were compared with each other, the results were more satisfactory. The MA-plots of the DE analysis between public RNA-Seq HeLa data showed a shape similar to those obtained in the first analysis among replicates of the same dataset. It is therefore possible to conclude that the similarity between the public datasets is not perfect, but it is still much better than the resemblance between these datasets and our control data: the fold change variance is high, but the dispersion of the data along the plot has a shape similar to the one found in the analysis of replicates. This similarity between the public samples hints that, in other settings, this methodology could have worked. The hypothesised methodology failed on the L. monocytogenes dataset, but this does not mean that it will necessarily fail on other datasets.
Therefore, this methodology should be further explored. If already acquired data could be used in, for instance, DE analysis, the number of replicates needed to infer a DE signal would be lower. In this sense, this approach could save both precious time and funds in an RNA-Seq study. Furthermore, with the exponential growth of NGS technologies, the number of publicly available RNA-Seq reads is also increasing very rapidly and, hence, data availability would not be a drawback of this methodology. The strong point of the approach described in this chapter is that it aims to improve a novel RNA-Seq dataset with data that was already acquired and published which, in addition to the advantages mentioned above, gives "new life" to published data by reusing it. To my knowledge, there is no published study that focuses on the use of public RNA-Seq samples as replicates of a novel dataset in a DE analysis.

Finally, if the dataset under analysis has more than two replicates, the use of this methodology would be inadvisable and could even reduce the quality of the dataset. In such datasets, the variance associated with gene expression can be estimated with confidence. Introducing public samples as replicates in that analysis would bias the variance and reduce its reliability. An example of this situation is dataset 3, which contains four replicates. In figures 6.4(a), 6.4(b), 6.4(e), 6.6(b), 6.6(d) and 6.6(f), many genes are considered outliers (red dots). This is due to the high confidence of the estimated gene variance: genes whose variance exceeds that threshold are considered differentially expressed.

6.4 Future work

Although the results of this analysis were not positive, the analysis between publicly available data hints that in other settings this methodology could work. Therefore, this idea should be further tested and developed. One approach could be to compare datasets that were acquired in different laboratories but under identical culture conditions, a situation that does not occur for any of the tested datasets. In addition, they should belong to the same cell line. Furthermore, it may be worthwhile to revisit this approach in the future, when the technology has improved.

7 Conclusions

Recent technological advances in genomics and proteomics are generating data at unprecedentedly high resolution [127]. One example is high-throughput sequencing of RNA (RNA-Seq), which for the first time allowed the simultaneous measurement of RNA sequence and expression at the whole-cell level [8, 13, 17]. With the introduction of these new technologies, new bioinformatic approaches are required. In this thesis I have developed methods for the analysis of RNA-Seq data and made contributions to the understanding of L. monocytogenes infection mechanisms.

First, I implemented a complete pipeline to analyse RNA-Seq data. This pipeline begins by performing a data quality assessment; next, it aligns the cleaned reads to a reference, measures gene expression levels, tests for differential gene expression and, finally, aggregates the results into GO terms. The final outcome of this pipeline is a table containing the cellular processes that are differentially active between the RNA-Seq samples being processed. This enables the user to draw conclusions about the influence that a certain cell environment or stimulus has on the active processes of the cell. The pipeline therefore provides, from dozens of gigabytes of data, insight into the cellular response to a given stimulus.

Following the development of the RNA-Seq analysis pipeline, an innovative tool was developed that is able to investigate whether a certain biological phenomenon is active in a given RNA-Seq dataset. To perform that analysis, the algorithm takes advantage of previous knowledge and checks whether the interactions present in a published gene network are confirmed in the dataset under analysis. A major problem in RNA-Seq studies is the limited confidence in the conclusions when only a small number of replicate samples was acquired.
Confirmation of the relations modelled by a given gene network in an RNA-Seq dataset with few or no replicates provides evidence that the cell response described by that network is activated in the sequenced cells, supporting the conclusions extracted from that dataset despite its low statistical significance. This analysis can also be useful for RNA-Seq datasets with a high number of replicates. Because it allows the RNA-Seq dataset to be compared with previous knowledge, the dataset can be tested against a gene network that is expected to be active or against a gene network that is not supposed to be represented in the data. In both cases, this analysis strengthens the statistical confidence in the data and reinforces the conclusions extracted from the analysis of that dataset. Regardless of the characteristics of the RNA-Seq dataset, this methodology complements the already available tools that detect a given cellular response in RNA-Seq data and can be extremely useful in validating the conclusions extracted from it. This novel methodology was implemented with an HTML interface, in an effort to facilitate data gathering and to enable the validation analysis without the need to use the command line. With this interface, the tool can be useful not only for bioinformaticians but also for biologists.

Finally, along the same line of thought, a novel methodology was proposed to overcome the limitations of RNA-Seq datasets with few or no replicated samples. This methodology is based on the use of already published RNA-Seq samples as replicates of the dataset with low statistical significance. With the exponential development of high-throughput sequencing technologies, the number of RNA-Seq reads available in databases has also increased sharply. This methodology intends to take advantage of samples stored in those databases that come from similar cell sources and growth environments and whose transcriptome was sequenced. This methodology could be particularly useful for control samples, since every DE study needs a control sample to determine which genes are differentially expressed between the stimulated and non-stimulated cells.

In order to evaluate the developed tools, a L. monocytogenes RNA-Seq dataset was studied. It comprises two conditions: 1) transcriptome collected from a population of cells infected with wild-type L. monocytogenes and 2) transcriptome collected from a population of cells infected with mutant L. monocytogenes, from which a gene encoding a virulence factor essential for the disruption of the vacuole in which the bacterium enters the host cell was removed. For each condition, four time points were acquired, intended to represent the bacterial life cycle. In addition, this dataset contains a control sample, which was acquired from a population of cells simply growing on a plate.

Regarding the analysis of this dataset with the developed pipeline, it was possible to extract biologically meaningful observations. Namely, it was possible to distinguish the cell response upon infection with wild-type and with mutant L. monocytogenes. In the first case we observed a very typical response, with the activation of pathways related to the innate immunological response of the cell and evidence of cell shut-down through the promotion of the apoptotic process. On the other hand, for cells infected with L. monocytogenes incapable of synthesising LLO, no immunological response was registered. On the contrary, the cell promoted its proliferation. From this analysis it was possible to formulate the following hypothesis: when LLO, a protein encoded by the hly gene, is not produced, L. monocytogenes loses its virulence.
In particular, without LLO the bacterium is not able to disrupt the internalization vacuole and use the cell machinery to replicate. Moreover, because internalization takes place in a membrane-bound phagosome, through the induction of local cytoskeletal rearrangements in the host cell, the cell cannot detect any foreign body if this vacuole is not disrupted. In summary, the results obtained from the analysis of the wild-type infected samples are congruent with the published bacterial life cycle, and this supports the results obtained for the mutant infected samples.

To understand which cell pathways were activated in the L. monocytogenes data, I used the methodology developed to test for the presence of a given gene network in an RNA-Seq dataset. Given that the processes expected to be active are related to the immunological response of the cell, the gene networks extracted from the literature were related to this pathway. Specifically, the first network tested described the gene interactions upon infection with Escherichia coli. The data was not well modelled by this network: its activation in the L. monocytogenes dataset had low statistical significance for both conditions. The second network describes the interactions of the genes that mediate the immunological response upon L. monocytogenes infection. In contrast to the first network, the p-value associated with this gene network was 0.059 % for the first condition (samples infected with wild-type bacteria), supporting both the quality of this dataset and the conclusions from the pipeline analysis. Furthermore, with a p-value of 10.137 % for the second condition (samples infected with mutant bacteria), this analysis also supports the existence of a much more pronounced immunological response in the first condition.

Given that the L. monocytogenes RNA-Seq dataset does not contain any replicates, I attempted to improve its statistical confidence by using public RNA-Seq data from HeLa cells as replicates of the control sample. However, this methodology proved not to be reliable, with the variance between the four public datasets tested and our control found to be very high. These results invalidate this approach for enhancing DE signal inference among the L. monocytogenes samples.

In order to strengthen the reliability of the new tools described above, a new dataset with more robust information should be processed with the newly developed methodologies. It would be important to analyse an RNA-Seq dataset which has replicates for each condition and, moreover, which studies a cell process that is not related to the immunological response. Additionally, in order to investigate the accuracy of the RNA-Seq validation algorithm described in chapter 5, a dataset should be analysed which has already been tested with other tools and for which the active cell pathways are known. With this information, inspecting the p-value gives insight into the reliability of the algorithm. The methodology should output a low p-value if the RNA-Seq dataset is tested against a gene network describing the process known a priori to be active. On the other hand, a high p-value should be obtained if the new NGS data is analysed against a network describing a biological pathway that is known not to be occurring.

In summary, the tools and methodologies described in this thesis contribute to the advancement of the field of bioinformatics. Furthermore, the conclusions extracted from the L. monocytogenes dataset using these methodologies also bring new biological knowledge.

Bibliography

[1] R. Guthke, U. Möller, M. Hoffmann, F. Thies, and S. Töpfer, “Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection.” Bioinformatics, vol. 21, no. 8, pp. 1626–34, Apr. 2005.

[2] F. Sanger, G. M. Air, B. G. Barrell, N. L. Brown, A. R. Coulson, J. C. Fiddes, C. A. Hutchison, P. M. Slocombe, and M. Smith, “Nucleotide sequence of bacteriophage φX174 DNA,” Nature, vol. 265, no. 5596, pp. 687–695, Feb. 1977.

[3] W. Gilbert and A. Maxam, “The nucleotide sequence of the lac operator.” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 12, pp. 3581–4, Dec. 1973.

[4] W. Fiers, R. Contreras, F. Duerinck, G. Haegeman, D. Iserentant, J. Merregaert, W. Min Jou, F. Molemans, A. Raeymaekers, A. Van den Berghe, G. Volckaert, and M. Ysebaert, “Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene,” Nature, vol. 260, no. 5551, pp. 500–507, Apr. 1976.

[5] A. M. Maxam, “A New Method for Sequencing DNA,” Proceedings of the National Academy of Sciences, vol. 74, no. 2, pp. 560–564, Feb. 1977.

[6] R. Fleischmann, M. Adams, O. White, R. Clayton, E. Kirkness, A. Kerlavage, C. Bult, J. Tomb, B. Dougherty, J. Merrick, et al., “Whole-genome random sequencing and assembly of Haemophilus influenzae Rd,” Science, vol. 269, no. 5223, pp. 496–512, Jul. 1995.

[7] J. Shendure, R. D. Mitra, C. Varma, and G. M. Church, “Advanced sequencing technologies: methods and goals.” Nature reviews. Genetics, vol. 5, no. 5, pp. 335–44, May 2004.

[8] E. R. Mardis, “The impact of next-generation sequencing technology on genetics.” Trends in genetics, vol. 24, no. 3, pp. 133–41, Mar. 2008.

[9] M. L. Metzker, “Sequencing technologies - the next generation.” Nature reviews. Genetics, vol. 11, no. 1, pp. 31–46, Jan. 2010.

[10] L. D. Stein, “The case for cloud computing in genome informatics.” Genome biology, vol. 11, no. 5, p. 207, Jan. 2010.

[11] E. R. Mardis, “A decade’s perspective on DNA sequencing technology.” Nature, vol. 470, no. 7333, pp. 198–203, Feb. 2011.

[12] K. A. Wetterstrand, “DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program,” http://www.genome.gov/sequencingcosts/.

[13] Z. Wang, M. Gerstein, and M. Snyder, “RNA-Seq: a revolutionary tool for transcriptomics,” Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.

[14] P. Hogeweg and B. Hesper, “Interactive instruction on population interactions,” Computers in Biology and Medicine, vol. 8, no. 4, pp. 319–327, Jan. 1978.

[15] J. B. Hagen, “The origins of bioinformatics.” Nature reviews. Genetics, vol. 1, no. 3, pp. 231–6, Dec. 2000.

[16] P. Hogeweg, “The roots of bioinformatics in theoretical biology.” PLoS computational biology, vol. 7, no. 3, p. e1002021, Mar. 2011.

[17] J. Shendure and H. Ji, “Next-generation DNA sequencing.” Nature biotechnology, vol. 26, no. 10, pp. 1135–45, Oct. 2008.

[18] M. Pop and S. L. Salzberg, “Bioinformatics challenges of new sequencing technology.” Trends in genetics : TIG, vol. 24, no. 3, pp. 142–9, Mar. 2008.

[19] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,” Science, vol. 270, pp. 467–470, 1995.

[20] V. G. Cheung, M. Morley, F. Aguilar, A. Massimi, R. Kucherlapati, and G. Childs, “Making and reading microarrays,” Nat. Genet., vol. 11, no. 1 Suppl, pp. 15–19, 1999.

[21] J. DeRisi, L. Penland, P. O. Brown, M. L. Bittner, P. S. Meltzer, M. Ray, Y. Chen, Y. A. Su, and J. M. Trent, “Use of a cDNA microarray to analyse gene expression patterns in human cancer,” Nature Genetics, vol. 14, pp. 457–460, 1996.

[22] D. D. Bowtell, “Options available - from start to finish - for obtaining expression data by microarray,” Nature Genetics, vol. 21, pp. 25 – 32, 1999.

[23] U. Maskos and E. M. Southern, “Oligonucleotide hybridizations on glass supports: a novel linker for oligonucleotide synthesis and hybridization properties of oligonucleotides synthesised in situ.” Nucleic acids research, vol. 20, no. 7, pp. 1679–84, Apr. 1992.

[24] A. J. Westermann, S. A. Gorski, and J. Vogel, “Dual RNA-seq of pathogen and host,” Nature Reviews Microbiology, vol. 10, no. 9, pp. 618–630, 2012.

[25] C.-C. Khor and M. L. Hibberd, “Revealing the molecular signatures of host-pathogen interactions.” Genome biology, vol. 12, no. 10, p. 229, Jan. 2011.

[26] D. R. Silva, D. M. Menegotto, L. F. Schulz, M. B. Gazzana, and P. T. Dalcin, “Mortality among patients with tuberculosis requiring intensive care: a retrospective cohort study.” BMC infectious diseases, vol. 10, no. 1, p. 54, Jan. 2010.

[27] K. Ray, B. Marteyn, P. J. Sansonetti, and C. M. Tang, “Life on the inside: the intracellular lifestyle of cytosolic bacteria.” Nature reviews. Microbiology, vol. 7, no. 5, pp. 333–40, May 2009.

[28] J. Watson and F. Crick, “Genetical Implications of the structure of Deoxyribonucleic Acid,” Nature, vol. 171, pp. 964–967, 1953.

[29] M. Wilkins, A. Stokes, and H. Wilson, “Molecular Structure of Deoxypentose Nucleic Acids,” Nature, vol. 171, pp. 738–740, 1953.

[30] J. Watson and F. Crick, “A Structure for Deoxyribose Nucleic Acid,” Nature, vol. 171, pp. 737–738, 1953.

[31] F. Crick, “On protein synthesis,” Symp. Soc. Exp. Biol, vol. 12, pp. 138–163, 1958.

[32] ——, “Central dogma of molecular biology,” Nature, vol. 227, no. 5258, pp. 561–563, 1970.

[33] B. Alberts, D. Bray, K. Hopkin, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter, Essential cell biology, 3rd ed. Garland Science, 2009.

[34] C. Ling, P. Poulsen, E. Carlsson, M. Ridderstråle, P. Almgren, J. Wojtaszewski, H. Beck-Nielsen, L. Groop, and A. Vaag, “Multiple environmental and genetic factors influence skeletal muscle PGC-1alpha and PGC-1beta gene expression in twins.” The Journal of clinical investigation, vol. 114, no. 10, pp. 1518–26, Nov. 2004.

[35] A. Kreimer and I. Pe’er, “Variants in exons and in transcription factors affect gene expression in trans.” Genome biology, vol. 14, no. 7, p. R71, Jul. 2013.

[36] C. Sen and L. Packer, “Antioxidant and redox regulation of gene transcription,” FASEB J, vol. 10, no. 7, pp. 709–720, May 1996.

[37] P. Oberdoerffer, S. Michan, M. McVay, R. Mostoslavsky, J. Vann, S.-K. Park, A. Hartlerode, J. Stegmuller, A. Hafner, P. Loerch, S. M. Wright, K. D. Mills, A. Bonni, B. A. Yankner, R. Scully, T. A. Prolla, F. W. Alt, and D. A. Sinclair, “SIRT1 redistribution on chromatin promotes genomic stability but alters gene expression during aging.” Cell, vol. 135, no. 5, pp. 907–18, Nov. 2008.

[38] W. Akhtar, J. de Jong, A. V. Pindyurin, L. Pagie, W. Meuleman, J. de Ridder, A. Berns, L. F. A. Wessels, M. van Lohuizen, and B. van Steensel, “Chromatin position effects assayed by thousands of reporters integrated in parallel.” Cell, vol. 154, no. 4, pp. 914–27, Aug. 2013.

[39] T. D. Southall, K. S. Gold, B. Egger, C. M. Davidson, E. E. Caygill, O. J. Marshall, and A. H. Brand, “Cell-type-specific profiling of gene expression and chromatin binding without cell isolation: assaying RNA Pol II occupancy in neural stem cells.” Developmental cell, vol. 26, no. 1, pp. 101–12, Jul. 2013.

[40] A. Dvir, R. C. Conaway, and J. W. Conaway, “A role for TFIIH in controlling the activity of early RNA polymerase II elongation complexes,” Proceedings of the National Academy of Sciences, vol. 94, no. 17, pp. 9006–9010, Aug. 1997.

[41] A. C. Seila, L. J. Core, J. T. Lis, and P. a. Sharp, “Divergent transcription: a new feature of active promoters.” Cell cycle, vol. 8, no. 16, pp. 2557–64, Aug. 2009.

[42] Q. Zhou, T. Li, and D. H. Price, “RNA polymerase II elongation control.” Annual review of biochemistry, vol. 81, pp. 119–43, Jan. 2012.

[43] A. Bhattacharjee, W. G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti, R. Bueno, M. Gillette, M. Loda, G. Weber, E. J. Mark, E. S. Lander, W. Wong, B. E. Johnson, T. R. Golub, D. J. Sugarbaker, and M. Meyerson, “Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses.” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 24, pp. 13 790–5, Nov. 2001.

[44] L. J. van ’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend, “Gene expression profiling predicts clinical outcome of breast cancer.” Nature, vol. 415, no. 6871, pp. 530–6, Jan. 2002.

[45] D. G. Beer, S. L. R. Kardia, C.-C. Huang, T. J. Giordano, A. M. Levin, D. E. Misek, L. Lin, G. Chen, T. G. Gharib, D. G. Thomas, M. L. Lizyness, R. Kuick, S. Hayasaka, J. M. G. Taylor, M. D. Iannettoni, M. B. Orringer, and S. Hanash, “Gene-expression profiles predict survival of patients with lung adenocarcinoma.” Nature medicine, vol. 8, no. 8, pp. 816–24, Aug. 2002.

[46] V. Pagliarulo, R. H. Datar, and R. J. Cote, “Role of genetic and expression profiling in pharmacogenomics: the changing face of patient management.” Current issues in molecular biology, vol. 4, no. 4, pp. 101–10, Oct. 2002.

[47] P. H. Johnson, R. P. Walker, S. W. Jones, K. Stephens, J. Meurer, D. A. Zajchowski, M. M. Luke, F. Eeckman, Y. Tan, L. Wong, G. Parry, T. K. Morgan, M. A. McCarrick, and J. Monforte, “Multiplex gene expression analysis for high-throughput drug discovery: screening and analysis of compounds affecting genes overexpressed in cancer cells.” Molecular cancer therapeutics, vol. 1, no. 14, pp. 1293–304, Dec. 2002.

[48] P. A. Clarke, R. te Poele, R. Wooster, and P. Workman, “Gene expression microarray analysis in cancer biology, pharmacology, and drug development: progress and potential,” Biochemical Pharmacology, vol. 62, no. 10, pp. 1311–1336, Dec. 2001.

[49] R. a. Heller, M. Schena, a. Chai, D. Shalon, T. Bedilion, J. Gilmore, D. E. Woolley, and R. W. Davis, “Discovery and analysis of inflammatory disease-related genes using cDNA microarrays.” Proceedings of the National Academy of Sciences of the United States of America, vol. 94, no. 6, pp. 2150–5, Mar. 1997.

[50] J.-P. Thiery, X. Sastre-Garau, B. Vincent-Salomon, X. Sigal-Zafrani, J. Y. Pierga, C. Decraene, J. P. Meyniel, E. Gravier, B. Asselain, Y. De Rycke, P. Hupe, E. Barillot, S. Ajaz, M. Faraldo, M. A. Deugnier, M. Glukhova, and D. Medina, “Challenges in the stratification of breast tumors for tailored therapies.” Bulletin du cancer, vol. 93, no. 8, pp. E81–9, Aug. 2006.

[51] J. D. McPherson, M. Marra, L. Hillier, R. H. Waterston, A. Chinwalla, J. Wallis, M. Sekhon, K. Wylie, E. R. Mardis, R. K. Wilson, R. Fulton, T. A. Kucaba, C. Wagner-McPherson, W. B. Barbazuk, S. G. Gregory, S. J. Humphray, L. French, R. S. Evans, G. Bethel, A. Whittaker, J. L. Holden, O. T. McCann, A. Dunham, C. Soderlund, C. E. Scott, D. R. Bentley, G. Schuler, H. C. Chen, W. Jang, E. D. Green, J. R. Idol, V. V. Maduro, K. T. Montgomery, E. Lee, A. Miller, S. Emerling, Kucherlapati, R. Gibbs, S. Scherer, J. H. Gorrell, E. Sodergren, K. Clerc-Blankenburg, P. Tabor, S. Naylor, D. Garcia, P. J. de Jong, J. J. Catanese, N. Nowak, K. Osoegawa, S. Qin, L. Rowen, A. Madan, M. Dors, L. Hood, B. Trask, C. Friedman, H. Massa, V. G. Cheung, I. R. Kirsch, T. Reid, R. Yonescu, J. Weissenbach, T. Bruls, R. Heilig, E. Branscomb, A. Olsen, N. Doggett, J. F. Cheng, T. Hawkins, R. M. Myers, J. Shang, L. Ramirez, J. Schmutz, O. Velasquez, K. Dixon, N. E. Stone, D. R. Cox, D. Haussler, W. J. Kent, T. Furey, S. Rogic, S. Kennedy, S. Jones, A. Rosenthal, G. Wen, M. Schilhabel, G. Gloeckner, G. Nyakatura, R. Siebert, B. Schlegelberger, J. Korenberg, X. N. Chen, A. Fujiyama, M. Hattori, A. Toyoda, T. Yada, H. S. Park, Y. Sakaki, N. Shimizu, S. Asakawa, K. Kawasaki, T. Sasaki, A. Shintani, A. Shimizu, K. Shibuya, J. Kudoh, S. Minoshima, J. Ramser, P. Seranski, C. Hoff, A. Poustka, R. Reinhardt, and H. Lehrach, “A physical map of the human genome.” Nature, vol. 409, no. 6822, pp. 934–41, Feb. 2001.

[52] L. H. Augenlicht and D. Kobrin, “Cloning and screening of sequences expressed in a mouse colon tumor.” Cancer research, vol. 42, no. 3, pp. 1088–93, Mar. 1982.

[53] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, “Mapping and quantifying mammalian transcriptomes by RNA-Seq.” Nature methods, vol. 5, no. 7, pp. 621–8, Jul. 2008.

[54] L. D. Murphy, C. E. Herzog, J. B. Rudick, A. T. Fojo, and S. E. Bates, “Use of the polymerase chain reaction in the quantitation of mdr-1 gene expression.” Biochemistry, vol. 29, no. 45, pp. 10 351–6, Nov. 1990.

[55] J. D. Hoheisel, “Microarray technology: beyond transcript profiling and genotype analysis.” Nature reviews. Genetics, vol. 7, no. 3, pp. 200–10, Mar. 2006.

[56] M. J. Okoniewski and C. J. Miller, “Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations.” BMC bioinformatics, vol. 7, p. 276, Jan. 2006.

[57] L. Gautier, L. Cope, B. M. Bolstad, and R. a. Irizarry, “affy–analysis of Affymetrix GeneChip data at the probe level.” Bioinformatics, vol. 20, no. 3, pp. 307–15, Feb. 2004.

[58] J. C. Marioni, C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad, “RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays.” Genome research, vol. 18, no. 9, pp. 1509–17, Sep. 2008.

[59] F. Sanger, S. Nicklen, and A. R. Coulson, “DNA sequencing with chain-terminating inhibitors.” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 12, pp. 5463–7, Dec. 1977.

[60] V. E. Velculescu, L. Zhang, B. Vogelstein, and K. W. Kinzler, “Serial analysis of gene expression.” Science, vol. 270, no. 5235, pp. 484–7, Oct. 1995.

[61] T. Shiraki, S. Kondo, S. Katayama, K. Waki, T. Kasukawa, H. Kawaji, R. Kodzius, A. Watahiki, M. Nakamura, T. Arakawa, S. Fukuda, D. Sasaki, A. Podhajska, M. Harbers, J. Kawai, P. Carninci, and Y. Hayashizaki, “Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage.” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 26, pp. 15 776–81, Dec. 2003.

[62] S. Brenner, M. Johnson, J. Bridgham, G. Golda, D. H. Lloyd, D. Johnson, S. Luo, S. McCurdy, M. Foy, M. Ewan, R. Roth, D. George, S. Eletr, G. Albrecht, E. Vermaas, S. R. Williams, K. Moon, T. Burcham, M. Pallas, R. B. DuBridge, J. Kirchner, K. Fearon, J. Mao, and K. Corcoran, “Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays.” Nature biotechnology, vol. 18, no. 6, pp. 630–4, Jun. 2000.

[63] G. M. Church, “Genomes for All,” Scientific American, vol. 294, no. 1, pp. 46–54, Jan. 2006.

[64] N. Hall, “Advanced sequencing technologies and their wider impact in microbiology.” The Journal of experimental biology, vol. 210, no. Pt 9, pp. 1518–25, May 2007.

[65] C. A. Maher, C. Kumar-Sinha, X. Cao, S. Kalyana-Sundaram, B. Han, X. Jing, L. Sam, T. Barrette, N. Palanisamy, and A. M. Chinnaiyan, “Transcriptome sequencing to detect gene fusions in cancer.” Nature, vol. 458, no. 7234, pp. 97–101, Mar. 2009.

[66] E. Lalonde, K. C. H. Ha, Z. Wang, A. Bemmo, C. L. Kleinman, T. Kwan, T. Pastinen, and J. Majewski, “RNA sequencing reveals the role of splicing polymorphisms in regulating human gene expression.” Genome research, vol. 21, no. 4, pp. 545–54, Apr. 2011.

[67] E. Garcion, B. Wallace, L. Pelletier, and D. Wion, “RNA mutagenesis and sporadic prion diseases.” Journal of theoretical biology, vol. 230, no. 2, pp. 271–4, Sep. 2004.

[68] Q. Pan, O. Shai, L. J. Lee, B. J. Frey, and B. J. Blencowe, “Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing.” Nature genetics, vol. 40, no. 12, pp. 1413–5, Dec. 2008.

[69] M. Guttman, M. Garber, J. Z. Levin, J. Donaghey, J. Robinson, X. Adiconis, L. Fan, M. J. Koziol, A. Gnirke, C. Nusbaum, J. L. Rinn, E. S. Lander, and A. Regev, “Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs.” Nature biotechnology, vol. 28, no. 5, pp. 503–10, May 2010.

[70] J. F. Degner, J. C. Marioni, A. A. Pai, J. K. Pickrell, E. Nkadori, Y. Gilad, and J. K. Pritchard, “Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data.” Bioinformatics, vol. 25, no. 24, pp. 3207–12, Dec. 2009.

[71] S. Bennett, “Solexa Ltd.” Pharmacogenomics, vol. 5, no. 4, pp. 433–8, Jun. 2004.

[72] D. R. Bentley, S. Balasubramanian, H. P. Swerdlow, G. P. Smith, J. Milton, C. G. Brown, K. P. Hall, D. J. Evers, C. L. Barnes, H. R. Bignell, J. M. Boutell, J. Bryant, R. J. Carter, R. Keira Cheetham, A. J. Cox, D. J. Ellis, M. R. Flatbush, N. A. Gormley, S. J. Humphray, L. J. Irving, M. S. Karbelashvili, S. M. Kirk, H. Li, X. Liu, K. S. Maisinger, L. J. Murray, B. Obradovic, T. Ost, M. L. Parkinson, M. R. Pratt, I. M. J. Rasolonjatovo, M. T. Reed, R. Rigatti, C. Rodighiero, M. T. Ross, A. Sabot, S. V. Sankar, A. Scally, G. P. Schroth, M. E. Smith, V. P. Smith, A. Spiridou, P. E. Torrance, S. S. Tzonev, E. H. Vermaas, K. Walter, X. Wu, L. Zhang, M. D. Alam, C. Anastasi, I. C. Aniebo, D. M. D. Bailey, I. R. Bancarz, S. Banerjee, S. G. Barbour, P. A. Baybayan, V. A. Benoit, K. F. Benson, C. Bevis, P. J. Black, A. Boodhun, J. S. Brennan, J. A. Bridgham, R. C. Brown, A. A. Brown, D. H. Buermann, A. A. Bundu, J. C. Burrows, N. P. Carter, N. Castillo, M. Chiara E Catenazzi, S. Chang, R. Neil Cooley, N. R. Crake, O. O. Dada, K. D. Diakoumakos, B. Dominguez-Fernandez, D. J. Earnshaw, U. C. Egbujor, D. W. Elmore, S. S. Etchin, M. R. Ewan, M. Fedurco, L. J. Fraser, K. V. Fuentes Fajardo, W. Scott Furey, D. George, K. J. Gietzen, C. P. Goddard, G. S. Golda, P. A. Granieri, D. E. Green, D. L. Gustafson, N. F. Hansen, K. Harnish, C. D. Haudenschild, N. I. Heyer, M. M. Hims, J. T. Ho, A. M. Horgan, K. Hoschler, S. Hurwitz, D. V. Ivanov, M. Q. Johnson, T. James, T. A. Huw Jones, G.-D. Kang, T. H. Kerelska, A. D. Kersey, I. Khrebtukova, A. P. Kindwall, Z. Kingsbury, P. I. Kokko-Gonzales, A. Kumar, M. A. Laurent, C. T. Lawley, S. E. Lee, X. Lee, A. K. Liao, J. A. Loch, M. Lok, S. Luo, R. M. Mammen, J. W. Martin, P. G. McCauley, P. McNitt, P. Mehta, K. W. Moon, J. W. Mullens, T. Newington, Z. Ning, B. Ling Ng, S. M. Novo, M. J. O’Neill, M. A. Osborne, A. Osnowski, O. Ostadan, L. L. Paraschos, L. Pickering, A. C. Pike, A. C. Pike, D. Chris Pinkard, D. P. Pliskin, J. Podhasky, V. 
J. Quijano, C. Raczy, V. H. Rae, S. R. Rawlings, A. Chiva Rodriguez, P. M. Roe, J. Rogers, M. C. Rogert Bacigalupo, N. Romanov, A. Romieu, R. K. Roth, N. J. Rourke, S. T. Ruediger, E. Rusman, R. M. Sanches-Kuiper, M. R. Schenker, J. M. Seoane, R. J. Shaw, M. K. Shiver, S. W. Short, N. L. Sizto, J. P. Sluis, M. A. Smith, J. Ernest Sohna Sohna, E. J. Spence, K. Stevens, N. Sutton, L. Szajkowski, C. L. Tregidgo, G. Turcatti, S. Vandevondele, Y. Verhovsky, S. M. Virk, S. Wakelin, G. C. Walcott, J. Wang, G. J. Worsley, J. Yan, L. Yau, M. Zuerlein, J. Rogers, J. C. Mullikin, M. E. Hurles, N. J. McCooke, J. S. West, F. L. Oaks, P. L. Lundberg, D. Klenerman, R. Durbin, and A. J. Smith, “Accurate whole human genome sequencing using reversible terminator chemistry.” Nature, vol. 456, no. 7218, pp. 53–9, Nov. 2008.

[73] B. T. Wilhelm and J.-R. Landry, “RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing.” Methods, vol. 48, no. 3, pp. 249–57, Jul. 2009.

[74] M. A. Tariq, H. J. Kim, O. Jejelowo, and N. Pourmand, “Whole-transcriptome RNAseq analysis from minute amount of total RNA.” Nucleic acids research, vol. 39, no. 18, p. e120, Oct. 2011.

[75] R. Sooknanan, J. Pease, and K. Doyle, “Novel methods for rRNA removal and directional, ligation-free RNA-seq library preparation,” Nature methods, vol. 7, no. 10, pp. 1548–7091, Oct. 2010.

[76] Illumina, “Data sheet: TruSeq RNA and DNA sample preparation kits v2,” http://res.illumina.com/documents/products/datasheets/datasheet_truseq_sample_prep_kits.pdf.

[77] ——, “Specification sheet: cBot - Fully automated clonal cluster generation for Illumina sequencing,” http://res.illumina.com/documents/products/datasheets/datasheet_cbot.pdf.

[78] ——, “Technology spotlight: Illumina sequencing technology - Highest data accuracy, simple workflow and a broad range of applications,” http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf.

[79] J. Z. Levin, M. Yassour, X. Adiconis, C. Nusbaum, D. A. Thompson, N. Friedman, A. Gnirke, and A. Regev, “Comprehensive comparative analysis of strand-specific RNA sequencing methods.” Nature methods, vol. 7, no. 9, pp. 709–15, Sep. 2010.

[80] M. C. Van Verk, R. Hickman, C. M. J. Pieterse, and S. C. M. Van Wees, “RNA-Seq: revelation of the messengers.” Trends in plant science, vol. 18, no. 4, pp. 175–9, Apr. 2013.

[81] Q. Zhou, X. Su, A. Wang, J. Xu, and K. Ning, “QC-Chain: fast and holistic quality control method for next-generation sequencing data.” PloS one, vol. 8, no. 4, p. e60234, Jan. 2013.

[82] S. Andrews, “FastQC: Duplicate sequences,” http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3AnalysisModules/9DuplicateSequences.html.

[83] X. Yang, D. Liu, F. Liu, J. Wu, J. Zou, X. Xiao, F. Zhao, and B. Zhu, “HTQC: a fast quality control toolkit for Illumina sequencing data.” BMC bioinformatics, vol. 14, no. 1, p. 33, Jan. 2013.

[84] B. Ewing, L. Hillier, M. C. Wendl, and P. Green, “Base-calling of automated sequencer traces using phred. I. Accuracy assessment.” Genome research, vol. 8, no. 3, pp. 175–85, Mar. 1998.

[85] B. Ewing and P. Green, “Base-calling of automated sequencer traces using phred. II. Error probabilities.” Genome research, vol. 8, no. 3, pp. 186–94, Mar. 1998.

[86] N. A. Fonseca, J. Rung, A. Brazma, and J. C. Marioni, “Tools for mapping high-throughput sequencing data.” Bioinformatics, vol. 28, no. 24, pp. 3169–77, Dec. 2012.

[87] E. S. Lander, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J. Howland, L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J. P. Mesirov, C. Miranda, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A. Sheridan, C. Sougnez, N. Stange-Thomann, N. Stojanovic, A. Subramanian, D. Wyman, J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley, J. Burton, C. Clee, N. Carter, A. Coulson, R. Deadman, P. Deloukas, A. Dunham, I. Dunham, R. Durbin, L. French, D. Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M. Jones, C. Lloyd, A. McMurray, L. Matthews, S. Mercer, S. Milne, J. C. Mullikin, A. Mungall, R. Plumb, M. Ross, R. Shownkeen, S. Sims, R. H. Waterston, R. K. Wilson, L. W. Hillier, J. D. McPherson, M. A. Marra, E. R. Mardis, L. A. Fulton, A. T. Chinwalla, K. H. Pepin, W. R. Gish, S. L. Chissoe, M. C. Wendl, K. D. Delehaunty, T. L. Miner, A. Delehaunty, J. B. Kramer, L. L. Cook, R. S. Fulton, D. L. Johnson, P. J. Minx, S. W. Clifton, T. Hawkins, E. Branscomb, P. Predki, P. Richardson, S. Wenning, T. Slezak, N. Doggett, J. F. Cheng, A. Olsen, S. Lucas, C. Elkin, E. Uberbacher, M. Frazier, R. A. Gibbs, D. M. Muzny, S. E. Scherer, J. B. Bouck, E. J. Sodergren, K. C. Worley, C. M. Rives, J. H. Gorrell, M. L. Metzker, S. L. Naylor, R. S. Kucherlapati, D. L. Nelson, G. M. Weinstock, Y. Sakaki, A. Fujiyama, M. Hattori, T. Yada, A. Toyoda, T. Itoh, C. Kawagoe, H. Watanabe, Y. Totoki, T. Taylor, J. Weissenbach, R. Heilig, W. Saurin, F. Artiguenave, P. Brottier, T. Bruls, E. Pelletier, C. Robert, P. Wincker, D. R. Smith, L. Doucette-Stamm, M. Rubenfield, K. Weinstock, H. M. Lee, J. Dubois, A. Rosenthal, M. Platzer, G. Nyakatura, S. Taudien, A. Rump, H. Yang, J. Yu, J. Wang, G. Huang, J. Gu, L. Hood, L. Rowen, A. Madan, S. Qin, R. W. Davis, N. A. Federspiel, A. P. Abola, M. J. Proctor, R. M. Myers, J. Schmutz, M. Dickson, J. Grimwood, D. R. Cox, M. V. Olson, R. Kaul, N. Shimizu, K. Kawasaki, S. Minoshima, G. A. Evans, M. Athanasiou, R. Schultz, B. A. Roe, F. Chen, H. Pan, J. Ramser, H. Lehrach, R. Reinhardt, W. R. McCombie, M. de la Bastide, N. Dedhia, H. Blöcker, K. Hornischer, G. Nordsiek, R.
Agarwala, L. Aravind, J. A. Bailey, A. Bateman, S. Batzoglou, E. Birney, P. Bork, D. G. Brown, C. B. Burge, L. Cerutti, H. C. Chen, D. Church, M. Clamp, R. R. Copley, T. Doerks, S. R. Eddy, E. E. Eichler, T. S. Furey, J. Galagan, J. G. Gilbert, C. Harmon, Y. Hayashizaki, D. Haussler, H. Hermjakob, K. Hokamp, W. Jang, L. S. Johnson, T. A. Jones, S. Kasif, A. Kaspryzk, S. Kennedy, W. J. Kent, P. Kitts, E. V. Koonin, I. Korf, D. Kulp, D. Lancet, T. M. Lowe, A. McLysaght, T. Mikkelsen, J. V. Moran, N. Mulder, V. J. Pollara, C. P. Ponting, G. Schuler, J. Schultz, G. Slater, A. F. Smit, E. Stupka, J. Szustakowski, D. Thierry-Mieg, J. Thierry-Mieg, L. Wagner, J. Wallis, R. Wheeler, A. Williams, Y. I. Wolf, K. H. Wolfe, S. P. Yang, R. F. Yeh, F. Collins, M. S. Guyer, J. Peterson, A. Felsenfeld, K. A. Wetterstrand, A. Patrinos, M. J. Morgan, P. de Jong, J. J. Catanese, K. Osoegawa, H. Shizuya, S. Choi, Y. J. Chen, and J. Szustakowki, “Initial sequencing and analysis of the human genome.” Nature, vol. 409, no. 6822, pp. 860–921, Feb. 2001.

[88] R. H. Waterston, K. Lindblad-Toh, E. Birney, J. Rogers, J. F. Abril, P. Agarwal, R. Agarwala, R. Ainscough, M. Alexandersson, P. An, S. E. Antonarakis, J. Attwood, R. Baertsch, J. Bailey, K. Barlow, S. Beck, E. Berry, B. Birren, T. Bloom, P. Bork, M. Botcherby, N. Bray, M. R. Brent, D. G. Brown, S. D. Brown, C. Bult, J. Burton, J. Butler, R. D. Campbell, P. Carninci, S. Cawley, F. Chiaromonte, A. T. Chinwalla, D. M. Church, M. Clamp, C. Clee, F. S. Collins, L. L. Cook, R. R. Copley, A. Coulson, O. Couronne, J. Cuff, V. Curwen, T. Cutts, M. Daly, R. David, J. Davies, K. D. Delehaunty, J. Deri, E. T. Dermitzakis, C. Dewey, N. J. Dickens, M. Diekhans, S. Dodge, I. Dubchak, D. M. Dunn, S. R. Eddy, L. Elnitski, R. D. Emes, P. Eswara, E. Eyras, A. Felsenfeld, G. A. Fewell, P. Flicek, K. Foley, W. N. Frankel, L. A. Fulton, R. S. Fulton, T. S. Furey, D. Gage, R. A. Gibbs, G. Glusman, S. Gnerre, N. Goldman, L. Goodstadt, D. Grafham, T. A. Graves, E. D. Green, S. Gregory, R. Guigó, M. Guyer, R. C. Hardison, D. Haussler, Y. Hayashizaki, L. W. Hillier, A. Hinrichs, W. Hlavina, T. Holzer, F. Hsu, A. Hua, T. Hubbard, A. Hunt, I. Jackson, D. B. Jaffe, L. S. Johnson, M. Jones, T. A. Jones, A. Joy, M. Kamal, E. K. Karlsson, D. Karolchik, A. Kasprzyk, J. Kawai, E. Keibler, C. Kells, W. J. Kent, A. Kirby, D. L. Kolbe, I. Korf, R. S. Kucherlapati, E. J. Kulbokas, D. Kulp, T. Landers, J. P. Leger, S. Leonard, I. Letunic, R. Levine, J. Li, M. Li, C. Lloyd, S. Lucas, B. Ma, D. R. Maglott, E. R. Mardis, L. Matthews, E. Mauceli, J. H. Mayer, M. McCarthy, W. R. McCombie, S. McLaren, K. McLay, J. D. McPherson, J. Meldrim, B. Meredith, J. P. Mesirov, W. Miller, T. L. Miner, E. Mongin, K. T. Montgomery, M. Morgan, R. Mott, J. C. Mullikin, D. M. Muzny, W. E. Nash, J. O. Nelson, M. N. Nhan, R. Nicol, Z. Ning, C. Nusbaum, M. J. O’Connor, Y. Okazaki, K. Oliver, E. Overton-Larty, L. Pachter, G. Parra, K. H. Pepin, J. Peterson, P. Pevzner, R. Plumb, C. S. Pohl, A. Poliakov, T. C. Ponce, C. P. Ponting, S. Potter, M. Quail, A. Reymond, B. A. Roe, K. M. Roskin, E. M. Rubin, A. G. Rust, R. Santos, V. Sapojnikov, B. Schultz, J. Schultz, M. S. Schwartz, S. Schwartz, C. Scott, S. Seaman, S. Searle, T. Sharpe, A. Sheridan, R. Shownkeen, S. Sims, J. B. Singer, G. Slater, A. Smit, D. R. Smith, B. Spencer, A. Stabenau, N. Stange-Thomann, C. Sugnet, M. Suyama, G. Tesler, J. Thompson, D. Torrents, E. Trevaskis, J. Tromp, C. Ucla, A. Ureta-Vidal, J. P. Vinson, A. C.
Von Niederhausern, C. M. Wade, M. Wall, R. J. Weber, R. B. Weiss, M. C. Wendl, A. P. West, K. Wetterstrand, R. Wheeler, S. Whelan, J. Wierzbowski, D. Willey, S. Williams, R. K. Wilson, E. Winter, K. C. Worley, D. Wyman, S. Yang, S.-P. Yang, E. M. Zdobnov, M. C. Zody, and E. S. Lander, “Initial sequencing and comparative analysis of the mouse genome.” Nature, vol. 420, no. 6915, pp. 520–62, Dec. 2002.

[89] D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg, “TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions.” Genome biology, vol. 14, no. 4, p. R36, Apr. 2013.

[90] S. Huang, J. Zhang, R. Li, W. Zhang, Z. He, T.-W. Lam, Z. Peng, and S.-M. Yiu, “SOAPsplice: Genome-Wide ab initio Detection of Splice Junctions from RNA-Seq Data.” Frontiers in genetics, vol. 2, p. 46, Jan. 2011.

[91] W. J. Kent, “BLAT–the BLAST-like alignment tool.” Genome research, vol. 12, no. 4, pp. 656–64, Apr. 2002.

[92] G. S. C. Slater and E. Birney, “Automated generation of heuristics for biological sequence comparison.” BMC bioinformatics, vol. 6, no. 1, p. 31, Jan. 2005.

[93] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with Bowtie 2.” Nature methods, vol. 9, no. 4, pp. 357–9, Apr. 2012.

[94] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler transform.” Bioinformatics, vol. 25, no. 14, pp. 1754–60, Jul. 2009.

[95] H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping quality scores.” Genome research, vol. 18, no. 11, pp. 1851–8, Nov. 2008.

[96] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, and J. Wang, “SOAP2: an improved ultrafast tool for short read alignment.” Bioinformatics (Oxford, England), vol. 25, no. 15, pp. 1966–7, Aug. 2009.

[97] C. Trapnell and S. L. Salzberg, “How to map billions of short reads onto genomes.” Nature biotechnology, vol. 27, no. 5, pp. 455–7, May 2009.

[98] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.” Genome biology, vol. 10, no. 3, p. R25, Jan. 2009.

[99] M. Burrows and D. J. Wheeler, “A block-sorting lossless data compression algorithm,” Digital SRC Research Report 124, 1994.

[100] P. Ferragina and G. Manzini, “Opportunistic data structures with applications,” in Proceedings 41st Annual Symposium on Foundations of Computer Science. IEEE Comput. Soc, 2000, pp. 390–398.

[101] ——, “An experimental study of an opportunistic index,” in SODA, 2001, pp. 269–278.

[102] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin, “The Sequence Alignment/Map format and SAMtools.” Bioinformatics, vol. 25, no. 16, pp. 2078–9, Aug. 2009.

[103] S. Anders, “HTSeq: Analysing high-throughput sequencing data with Python,” http://www-huber.embl.de/users/anders/HTSeq/.

[104] A. Oshlack and M. J. Wakefield, “Transcript length bias in RNA-seq data confounds systems biology.” Biology direct, vol. 4, p. 14, Jan. 2009.

[105] M. D. Robinson and A. Oshlack, “A scaling normalization method for differential expression analysis of RNA-seq data,” Genome biology, vol. 11, no. 3, p. R25, 2010.

[106] M. Garber, M. G. Grabherr, M. Guttman, and C. Trapnell, “Computational methods for transcriptome annotation and quantification using RNA-seq.” Nature methods, vol. 8, no. 6, pp. 469–77, Jun. 2011.

[107] J. H. Bullard, E. Purdom, K. D. Hansen, and S. Dudoit, “Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments.” BMC bioinformatics, vol. 11, no. 1, p. 94, Jan. 2010.

[108] S. Anders and W. Huber, “Differential expression analysis for sequence count data.” Genome biology, vol. 11, no. 10, p. R106, Jan. 2010.

[109] P. J. Balwierz, P. Carninci, C. O. Daub, J. Kawai, Y. Hayashizaki, W. Van Belle, C. Beisel, and E. van Nimwegen, “Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data.” Genome biology, vol. 10, no. 7, p. R79, Jan. 2009.

[110] A. Oshlack, M. D. Robinson, and M. D. Young, “From RNA-seq reads to differential expression results.” Genome biology, vol. 11, no. 12, p. 220, Jan. 2010.

[111] H. Jiang and W. H. Wong, “Statistical inferences for isoform expression in RNA-Seq.” Bioinformatics, vol. 25, no. 8, pp. 1026–32, Apr. 2009.

[112] B. Langmead, K. D. Hansen, and J. T. Leek, “Cloud-scale RNA-sequencing differential expression analysis with Myrna.” Genome biology, vol. 11, no. 8, p. R83, Jan. 2010.

[113] M. D. Robinson and G. K. Smyth, “Moderated statistical tests for assessing differences in tag abundance.” Bioinformatics, vol. 23, no. 21, pp. 2881–7, Nov. 2007.

[114] ——, “Small-sample estimation of negative binomial dispersion, with applications to SAGE data.” Biostatistics, vol. 9, no. 2, pp. 321–32, Apr. 2008.

[115] R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. J. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Y. H. Yang, and J. Zhang, “Bioconductor: open software development for computational biology and bioinformatics.” Genome biology, vol. 5, no. 10, p. R80, Jan. 2004.

[116] M. D. Robinson, D. J. McCarthy, and G. K. Smyth, “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, vol. 26, no. 1, pp. 139–40, Jan. 2010.

[117] T. J. Hardcastle and K. A. Kelly, “baySeq: empirical Bayesian methods for identifying differential expression in sequence count data.” BMC bioinformatics, vol. 11, no. 1, p. 422, Jan. 2010.

[118] S. Srivastava and L. Chen, “A two-parameter generalized Poisson model to improve the analysis of RNA-seq data.” Nucleic acids research, vol. 38, no. 17, p. e170, Sep. 2010.

[119] P. Glaus, A. Honkela, and M. Rattray, “Identifying differentially expressed transcripts from RNA-seq data with biological variation.” Bioinformatics, vol. 28, no. 13, pp. 1721–8, Jul. 2012.

[120] V. M. Kvam, P. Liu, and Y. Si, “A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data.” American journal of botany, vol. 99, no. 2, pp. 248–56, Feb. 2012.

[121] C. Soneson and M. Delorenzi, “A comparison of methods for differential expression analysis of RNA-seq data.” BMC bioinformatics, vol. 14, no. 1, p. 91, Jan. 2013.

[122] F. Emmert-Streib and G. V. Glazko, “Pathway analysis of expression data: deciphering functional building blocks of complex diseases.” PLoS computational biology, vol. 7, no. 5, p. e1002053, May 2011.

[123] V. K. Mootha, C. M. Lindgren, K.-F. Eriksson, A. Subramanian, S. Sihag, J. Lehar, P. Puigserver, E. Carlsson, M. Ridderstråle, E. Laurila, N. Houstis, M. J. Daly, N. Patterson, J. P. Mesirov, T. R. Golub, P. Tamayo, B. Spiegelman, E. S. Lander, J. N. Hirschhorn, D. Altshuler, and L. C. Groop, “PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes.” Nature genetics, vol. 34, no. 3, pp. 267–73, Jul. 2003.

[124] The Gene Ontology Consortium, “Gene ontology: tool for the unification of biology,” Nat. Genet., vol. 25, no. 1, pp. 25–29, 2000.

[125] M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, and M. Tanabe, “KEGG for integration and interpretation of large-scale molecular data sets.” Nucleic acids research, vol. 40, no. Database issue, pp. D109–D114, Jan. 2012.

[126] M. Kanehisa, “KEGG: Kyoto Encyclopedia of Genes and Genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, Jan. 2000.

[127] P. Khatri, M. Sirota, and A. J. Butte, “Ten years of pathway analysis: current approaches and outstanding challenges.” PLoS computational biology, vol. 8, no. 2, p. e1002375, Jan. 2012.

[128] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists.” Nucleic acids research, vol. 37, no. 1, pp. 1–13, Jan. 2009.

[129] M. Evangelou, A. Rendon, W. H. Ouwehand, L. Wernisch, and F. Dudbridge, “Comparison of methods for competitive tests of pathway analysis.” PloS one, vol. 7, no. 7, p. e41018, Jan. 2012.

[130] B. Zeeberg, W. Feng, G. Wang, M. Wang, A. Fojo, M. Sunshine, S. Narasimhan, D. Kane, W. Reinhold, S. Lababidi, K. Bussey, J. Riss, J. Barrett, and J. Weinstein, “GoMiner: a resource for biological interpretation of genomic and proteomic data,” Genome Biology, vol. 4, no. 4, p. R28, 2003.

[131] S. Falcon and R. Gentleman, “Using GOstats to test gene lists for GO term association,” Bioinformatics, vol. 23, no. 2, pp. 257–258, Jan. 2007.

[132] S. Zhong, K.-F. Storch, O. Lipan, M.-C. J. Kao, C. J. Weitz, and W. H. Wong, “GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in Gene Ontology space.” Applied bioinformatics, vol. 3, no. 4, pp. 261–4, Jan. 2004.

[133] J. Pizarro-Cerdá, R. Jonquières, E. Gouin, J. Vandekerckhove, J. Garin, and P. Cossart, “Distinct protein patterns associated with Listeria monocytogenes InlA- or InlB-phagosomes.” Cellular microbiology, vol. 4, no. 2, pp. 101–15, Feb. 2002.

[134] Y. Shen, M. Naujokas, M. Park, and K. Ireton, “InlB-dependent internalization of Listeria is mediated by the Met receptor tyrosine kinase.” Cell, vol. 103, no. 3, pp. 501–10, Oct. 2000.

[135] K. Ireton, B. Payrastre, H. Chap, W. Ogawa, H. Sakaue, M. Kasuga, and P. Cossart, “A role for phosphoinositide 3-kinase in bacterial invasion.” Science, vol. 274, no. 5288, pp. 780–2, Nov. 1996.

[136] P. Cossart, “Met, the HGF-SF receptor: another receptor for Listeria monocytogenes.” Trends in microbiology, vol. 9, no. 3, pp. 105–7, Mar. 2001.

[137] P. Tang, C. L. Sutherland, M. R. Gold, and B. B. Finlay, “Listeria monocytogenes invasion of epithelial cells requires the MEK-1/ERK-2 mitogen-activated protein kinase pathway.” Infection and immunity, vol. 66, no. 3, pp. 1106–12, Mar. 1998.

[138] M. Hamon, H. Bierne, and P. Cossart, “Listeria monocytogenes: a multifaceted model.” Nature reviews. Microbiology, vol. 4, no. 6, pp. 423–34, Jun. 2006.

[139] E. Gulbins and F. Lang, “Pathogens, Host-Cell Invasion and Disease,” American scientist, vol. 89, no. 5, p. 406, 2001.

[140] D. A. Portnoy, V. Auerbuch, and I. J. Glomski, “The cell biology of Listeria monocytogenes infection: the intersection of bacterial pathogenesis and cell-mediated immunity.” The Journal of cell biology, vol. 158, no. 3, pp. 409–14, Aug. 2002.

[141] N. E. Freitag, L. Rong, and D. A. Portnoy, “Regulation of the prfA transcriptional activator of Listeria monocytogenes: multiple promoter elements contribute to intracellular growth and cell-to-cell spread.” Infection and immunity, vol. 61, no. 6, pp. 2537–44, Jun. 1993.

[142] A. L. Radtke, K. L. Anderson, M. J. Davis, M. J. DiMagno, J. A. Swanson, and M. X. O’Riordan, “Listeria monocytogenes exploits cystic fibrosis transmembrane conductance regulator (CFTR) to escape the phagosome.” Proceedings of the National Academy of Sciences of the United States of America, vol. 108, no. 4, pp. 1633–8, Jan. 2011.

[143] D. W. Schuerch, E. M. Wilson-Kubalek, and R. K. Tweten, “Molecular basis of listeriolysin O pH dependence.” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 35, pp. 12 537–42, Aug. 2005.

[144] O. Shatursky, A. P. Heuck, L. A. Shepard, J. Rossjohn, M. W. Parker, A. E. Johnson, and R. K. Tweten, “The mechanism of membrane insertion for a cholesterol-dependent cytolysin: a novel paradigm for pore-forming toxins.” Cell, vol. 99, no. 3, pp. 293–9, Oct. 1999.

[145] P. Tang, I. Rosenshine, P. Cossart, and B. B. Finlay, “Listeriolysin O activates mitogen-activated protein kinase in eucaryotic cells.” Infection and immunity, vol. 64, no. 6, pp. 2359–61, Jun. 1996.

[146] S. Dramsi and P. Cossart, “Listeriolysin O-mediated calcium influx potentiates entry of Listeria monocytogenes into the human Hep-2 epithelial cell line.” Infection and immunity, vol. 71, no. 6, pp. 3614–8, Jun. 2003.

[147] S. J. Wadsworth and H. Goldfine, “Mobilization of protein kinase C in macrophages induced by Listeria monocytogenes affects its internalization and escape from the phagosome.” Infection and immunity, vol. 70, no. 8, pp. 4650–60, Aug. 2002.

[148] M. B. Goldberg, “Actin-based motility of intracellular microbial pathogens.” Microbiology and molecular biology reviews, vol. 65, no. 4, pp. 595–626, Dec. 2001.

[149] A. Lambrechts, K. Gevaert, P. Cossart, J. Vandekerckhove, and M. Van Troys, “Listeria comet tails: the actin-based motility machinery at work.” Trends in cell biology, vol. 18, no. 5, pp. 220–7, May 2008.

[150] E. G. D. Murray, R. A. Webb, and M. B. R. Swann, “A disease of rabbits characterised by a large mononuclear leucocytosis, caused by a hitherto undescribed bacillus Bacterium monocytogenes (n.sp.),” The Journal of Pathology and Bacteriology, vol. 29, no. 4, pp. 407–439, 1926.

[151] D. A. Gill, “Ovine Bacterial Encephalitis (Circling Disease) and the Bacterial Genus Listerella,” Australian Veterinary Journal, vol. 13, no. 2, pp. 46–56, Apr. 1937.

[152] A. Nyfeldt, “Etiologie de la Mononucleose infectieuse.” Compt. Rend. Soc. Biol., vol. 101, pp. 590–1, 1929.

[153] W. F. Schlech, P. M. Lavigne, R. A. Bortolussi, A. C. Allen, E. V. Haldane, A. J. Wort, A. W. Hightower, S. E. Johnson, S. H. King, E. S. Nicholls, and C. V. Broome, “Epidemic listeriosis-evidence for transmission by food.” The New England journal of medicine, vol. 308, no. 4, pp. 203–6, Jan. 1983.

[154] P. Cossart, “Illuminating the landscape of host-pathogen interactions with the bacterium Listeria monocytogenes.” Proceedings of the National Academy of Sciences of the United States of America, vol. 108, no. 49, pp. 1–8, 2011.

[155] P. Cossart and J. Mengaud, “Listeria monocytogenes. A model system for the molecular study of intracellular parasitism.” Molecular biology & medicine, vol. 6, no. 5, pp. 463–74, Oct. 1989.

[156] D. A. Portnoy, V. Auerbuch, and I. J. Glomski, “The cell biology of Listeria monocytogenes infection: the intersection of bacterial pathogenesis and cell-mediated immunity.” The Journal of cell biology, vol. 158, no. 3, pp. 409–14, Aug. 2002.

[157] M. Lecuit, “Human listeriosis and animal models.” Microbes and infection / Institut Pasteur, vol. 9, no. 10, pp. 1216–25, Aug. 2007.

[158] F. Stavru, C. Archambaud, and P. Cossart, “Cell biology and immunology of Listeria monocytogenes infections: novel insights.” Immunological reviews, vol. 240, no. 1, pp. 160–84, Mar. 2011.

[159] S. Raengpradub, M. Wiedmann, and K. J. Boor, “Comparative analysis of the sigma B-dependent stress responses in Listeria monocytogenes and Listeria innocua strains exposed to selected stress conditions.” Applied and environmental microbiology, vol. 74, no. 1, pp. 158–71, Jan. 2008.

[160] T. Hain, H. Hossain, S. S. Chatterjee, S. Machata, U. Volk, S. Wagner, B. Brors, S. Haas, C. T. Kuenne, A. Billion, S. Otten, J. Pane-Farre, S. Engelmann, and T. Chakraborty, “Temporal transcriptomic analysis of the Listeria monocytogenes EGD-e sigmaB regulon.” BMC microbiology, vol. 8, no. 1, p. 20, Jan. 2008.

[161] A. Toledo-Arana, O. Dussurget, G. Nikitas, N. Sesto, H. Guet-Revillet, D. Balestrino, E. Loh, J. Gripenland, T. Tiensuu, K. Vaitkevicius, M. Barthelemy, M. Vergassola, M.-A. Nahori, G. Soubigou, B. Régnault, J.-Y. Coppée, M. Lecuit, J. Johansson, and P. Cossart, “The Listeria transcriptional landscape from saprophytism to virulence.” Nature, vol. 459, no. 7249, pp. 950–6, Jun. 2009.

[162] H. F. Oliver, R. H. Orsi, L. Ponnala, U. Keich, W. Wang, Q. Sun, S. W. Cartinhour, M. J. Filiatrault, M. Wiedmann, and K. J. Boor, “Deep RNA sequencing of L. monocytogenes reveals overlapping and extensive stationary phase and sigma B-dependent transcriptomes, including multiple highly transcribed noncoding RNAs.” BMC genomics, vol. 10, no. 1, p. 641, Jan. 2009.

[163] J. R. Mellin and P. Cossart, “The non-coding RNA world of the bacterial pathogen Listeria monocytogenes.” RNA biology, vol. 9, no. 4, pp. 372–8, Apr. 2012.

[164] M. A. Mraheil, A. Billion, W. Mohamed, K. Mukherjee, C. Kuenne, J. Pischimarov, C. Krawitz, J. Retey, T. Hartsch, T. Chakraborty, and T. Hain, “The intracellular sRNA transcriptome of Listeria monocytogenes during growth in macrophages.” Nucleic acids research, vol. 39, no. 10, pp. 4235–48, May 2011.

[165] W. F. Scherer, J. T. Syverton, and G. O. Gey, “Studies on the propagation in vitro of poliomyelitis viruses. IV. Viral multiplication in a stable strain of human malignant epithelial cells (strain HeLa) derived from an epidermoid carcinoma of the cervix.” The Journal of experimental medicine, vol. 97, no. 5, pp. 695–710, May 1953.

[166] “GNU ‘make’,” http://www.gnu.org/software/make/manual/make.html.

[167] S. Andrews, “Babraham Bioinformatics - FastQC,” 2010, http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

[168] D. Lipman and W. Pearson, “Rapid and sensitive protein similarity searches,” Science, vol. 227, no. 4693, pp. 1435–1441, Mar. 1985.

[169] S. Anders, “HTSeq-count,” http://www-huber.embl.de/users/anders/HTSeq/doc/count.html.

[170] S. Anders and W. Huber, “Differential expression of RNA-Seq data at the gene level: DESeq package,” 2012, http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf.

[171] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” J. Roy. Statist. Soc. Ser. B, vol. 57, no. 1, pp. 289–300, 1995.

[172] R. Gentleman and S. Falcon, “Package GOstats: Reference manual,” 2013, http://www.bioconductor.org/packages/2.14/bioc/manuals/GOstats/man/GOstats.pdf.

[173] S. Falcon and R. Gentleman, “Package GOStats: How To Use GOstats Testing Gene Lists for GO Term Association,” pp. 1–10, 2013, http://www.bioconductor.org/packages/2.14/bioc/vignettes/ GOstats/inst/doc/GOstatsHyperG.pdf.

[174] Bioconductor, “Org.Hs.eg.db: Genome wide annotation for human,” http://www.bioconductor.org/ packages/2.7/data/annotation/html/org.Hs.eg.db.html.

[175] K. D. Hansen, S. E. Brenner, and S. Dudoit, “Biases in Illumina transcriptome sequencing caused by random hexamer priming.” Nucleic acids research, vol. 38, no. 12, p. e131, Jul. 2010.

[176] P. Flicek, I. Ahmed, M. R. Amode, D. Barrell, K. Beal, S. Brent, D. Carvalho-Silva, P. Clapham, G. Coates, S. Fairley, S. Fitzgerald, L. Gil, C. García-Girón, L. Gordon, T. Hourlier, S. Hunt, T. Juettemann, A. K. Kähäri, S. Keenan, M. Komorowska, E. Kulesha, I. Longden, T. Maurel, W. M. McLaren, M. Muffato, R. Nag, B. Overduin, M. Pignatelli, B. Pritchard, E. Pritchard, H. S. Riat, G. R. S. Ritchie, M. Ruffier, M. Schuster, D. Sheppard, D. Sobral, K. Taylor, A. Thormann, S. Trevanion, S. White, S. P. Wilder, B. L. Aken, E. Birney, F. Cunningham, I. Dunham, J. Harrow, J. Herrero, T. J. P. Hubbard, N. Johnson, R. Kinsella, A. Parker, G. Spudich, A. Yates, A. Zadissa, and S. M. J. Searle, “Ensembl 2013.” Nucleic acids research, vol. 41, no. Database issue, pp. D48–55, Jan. 2013.

[177] Ensembl, “Human DNA (FASTA),” ftp://ftp.ensembl.org/pub/release-71/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.71.dna.toplevel.fa.gz.

[178] ——, “GTF file,” ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz.

[179] EMBL-EBI, “Quick GO: A fast browser for Gene Ontology terms and annotations - GO:0007167,” http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0007167.

[180] B. Alberts, A. Johnson, and J. Lewis, Molecular Biology of the Cell: signaling through Enzyme-Linked Cell-Surface Receptors, 4th ed. Garland Science, 2002.

[181] C. Archambaud, E. Gouin, J. Pizarro-Cerda, P. Cossart, and O. Dussurget, “Translation elongation factor EF-Tu is a target for Stp, a serine-threonine phosphatase involved in virulence of Listeria monocytogenes.” Molecular microbiology, vol. 56, no. 2, pp. 383–96, Apr. 2005.

[182] A. Lima, R. Durán, G. E. Schujman, M. J. Marchissio, M. M. Portela, G. Obal, O. Pritsch, D. de Mendoza, and C. Cerveñansky, “Serine/threonine protein kinase PrkA of the human pathogen Listeria monocytogenes: biochemical characterization and identification of interacting partners through proteomic approaches.” Journal of proteomics, vol. 74, no. 9, pp. 1720–34, Aug. 2011.

[183] G. Tran Van Nhieu, C. Clair, G. Grompone, and P. Sansonetti, “Calcium signalling during cell interactions with bacterial pathogens.” Biology of the cell / under the auspices of the European Cell Biology Organization, vol. 96, no. 1, pp. 93–101, Feb. 2004.

[184] J. Hoge, I. Yan, N. Jänner, V. Schumacher, A. Chalaris, O. M. Steinmetz, D. R. Engel, J. Scheller, S. Rose-John, and H.-W. Mittrücker, “IL-6 controls the innate immune response against Listeria monocytogenes via classical IL-6 signaling.” Journal of immunology (Baltimore, Md. : 1950), vol. 190, no. 2, pp. 703–11, Jan. 2013.

[185] Y. Weinrauch and A. Zychlinsky, “The induction of apoptosis by bacterial pathogens.” Annual review of microbiology, vol. 53, pp. 155–87, Jan. 1999.

[186] N. Hauf, W. Goebel, E. Serfling, and M. Kuhn, “Listeria monocytogenes infection enhances transcription factor NF-kappa B in P388D1 macrophage-like cells.” Infection and immunity, vol. 62, no. 7, pp. 2740–7, Jul. 1994.

[187] M. Kuhn and W. Goebel, “Induction of cytokines in phagocytic mammalian cells infected with virulent and avirulent Listeria strains.” Infection and immunity, vol. 62, no. 2, pp. 348–56, Feb. 1994.

[188] S. Kayal, A. Lilienbaum, O. Join-Lambert, X. Li, A. Israël, and P. Berche, “Listeriolysin O secreted by Listeria monocytogenes induces NF-kappaB signalling by activating the IkappaB kinase complex.” Molecular microbiology, vol. 44, no. 5, pp. 1407–19, Jun. 2002.

[189] E. N. Hatada, D. Krappmann, and C. Scheidereit, “NF-κB and the innate immune response,” Current Opinion in Immunology, vol. 12, no. 1, pp. 52–58, 2000.

[190] T. Dolowschiak, C. Chassin, S. Ben Mkaddem, T. M. Fuchs, S. Weiss, A. Vandewalle, and M. W. Hornef, “Potentiation of epithelial innate host responses by intercellular communication.” PLoS pathogens, vol. 6, no. 11, p. e1001194, Jan. 2010.

[191] M. Ferletta and P. Ekblom, “Identification of laminin-10/11 as a strong cell adhesive complex for a normal and a malignant human epithelial cell line.” Journal of cell science, vol. 112, pp. 1–10, Jan. 1999.

[192] W. Stockinger, S. C. Zhang, V. Trivedi, L. A. Jarzylo, E. C. Shieh, W. S. Lane, A. B. Castoreno, and A. Nohturfft, “Differential requirements for actin polymerization, calmodulin, and Ca2+ define distinct stages of lysosome/phagosome targeting.” Molecular biology of the cell, vol. 17, no. 4, pp. 1697–710, Apr. 2006.

[193] R. Pankov and K. M. Yamada, “Fibronectin at a glance.” Journal of cell science, vol. 115, no. Pt 20, pp. 3861–3, Oct. 2002.

[194] P. Gilot, P. André, and J. Content, “Listeria monocytogenes possesses adhesins for fibronectin.” Infection and immunity, vol. 67, no. 12, pp. 6698–701, Dec. 1999.

[195] N. Van Langendonck, P. Velge, and E. Bottreau, “Host cell protein tyrosine kinases are activated during the entry of Listeria monocytogenes,” FEMS Microbiology Letters, vol. 162, no. 1, pp. 169–176, 1998.

[196] S. K. Misra, E. Milohanic, F. Aké, I. Mijakovic, J. Deutscher, V. Monnet, and C. Henry, “Analysis of the serine/threonine/tyrosine phosphoproteome of the pathogenic bacterium Listeria monocytogenes reveals phosphorylated proteins related to virulence.” Proteomics, vol. 11, no. 21, pp. 4155–65, Nov. 2011.

[197] R. Kastner, O. Dussurget, C. Archambaud, E. Kernbauer, D. Soulat, P. Cossart, and T. Decker, “LipA, a tyrosine and lipid phosphatase involved in the virulence of Listeria monocytogenes.” Infection and immunity, vol. 79, no. 6, pp. 2489–98, Jun. 2011.

[198] T. Shintani and D. J. Klionsky, “Autophagy in health and disease: a double-edged sword.” Science, vol. 306, no. 5698, pp. 990–5, Nov. 2004.

[199] U. Gasanov, C. Koina, K. W. Beagley, R. J. Aitken, and P. M. Hansbro, “Identification of the insulin-like growth factor II receptor as a novel receptor for binding and invasion by Listeria monocytogenes.” Infection and immunity, vol. 74, no. 1, pp. 566–77, Jan. 2006.

[200] P. Tang, I. Rosenshine, and B. B. Finlay, “Listeria monocytogenes, an invasive bacterium, stimulates MAP kinase upon attachment to epithelial cells.” Molecular biology of the cell, vol. 5, no. 4, pp. 455–64, Apr. 1994.

[201] S. Kügler, S. Schüller, and W. Goebel, “Involvement of MAP-kinases and -phosphatases in uptake and intracellular replication of Listeria monocytogenes in J774 macrophage cells.” FEMS microbiology letters, vol. 157, no. 1, pp. 131–6, Dec. 1997.

[202] I. Weiglein, W. Goebel, J. Troppmair, U. R. Rapp, A. Demuth, and M. Kuhn, “Listeria monocytogenes infection of HeLa cells results in listeriolysin O-mediated transient activation of the Raf-MEK-MAP kinase pathway.” FEMS microbiology letters, vol. 148, no. 2, pp. 189–95, Mar. 1997.

[203] M. H. Glickman and A. Ciechanover, “The ubiquitin-proteasome proteolytic pathway: destruction for the sake of construction.” Physiological reviews, vol. 82, no. 2, pp. 373–428, Apr. 2002.

[204] C. A. Dinarello, “Proinflammatory Cytokines,” CHEST Journal, vol. 118, no. 2, p. 503, Aug. 2000.

[205] R. Lindner and C. C. Friedel, “A comprehensive evaluation of alignment algorithms in the context of RNA-seq.” PloS one, vol. 7, no. 12, p. e52403, Jan. 2012.

[206] L. Wang, S. Wang, and W. Li, “RSeQC: quality control of RNA-seq experiments.” Bioinformatics, vol. 28, no. 16, pp. 2184–5, Aug. 2012.

[207] E. Planet, C. S.-O. Attolini, O. Reina, O. Flores, and D. Rossell, “htSeqTools: high-throughput sequencing quality control, processing and visualization in R.” Bioinformatics, vol. 28, no. 4, pp. 589–90, Mar. 2012.

[208] M. Griffith, O. L. Griffith, J. Mwenifumbo, R. Goya, A. S. Morrissy, R. D. Morin, R. Corbett, M. J. Tang, Y.-C. Hou, T. J. Pugh, G. Robertson, S. Chittaranjan, A. Ally, J. K. Asano, S. Y. Chan, H. I. Li, H. McDonald, K. Teague, Y. Zhao, T. Zeng, A. Delaney, M. Hirst, G. B. Morin, S. J. M. Jones, I. T. Tai, and M. A. Marra, “Alternative expression analysis by RNA sequencing.” Nature methods, vol. 7, no. 10, pp. 843–7, Oct. 2010.

[209] Bedtools, “bedtools multicov,” http://bedtools.readthedocs.org/en/latest/content/tools/multicov.html.

[210] P. L. Auer and R. W. Doerge, “A Two-Stage Poisson Model for Testing RNA-Seq Data,” Statistical Applications in Genetics and Molecular Biology, vol. 10, no. 1, pp. 1–26, 2011.

[211] Y. Di, D. W. Schafer, J. S. Cumbie, and J. H. Chang, “The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq,” Statistical Applications in Genetics and Molecular Biology, vol. 10, no. 1, pp. 1–28, 2011.

[212] P. Khatri, S. Draghici, G. C. Ostermeier, and S. A. Krawetz, “Profiling gene expression using onto-express.” Genomics, vol. 79, no. 2, pp. 266–70, Mar. 2002.

[213] K. D. Dahlquist, N. Salomonis, K. Vranizan, S. C. Lawlor, and B. R. Conklin, “GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways.” Nature genetics, vol. 31, no. 1, pp. 19–20, May 2002.

[214] B. R. Zeeberg, W. Feng, G. Wang, M. D. Wang, A. T. Fojo, M. Sunshine, S. Narasimhan, D. W. Kane, W. C. Reinhold, S. Lababidi, K. J. Bussey, J. Riss, J. C. Barrett, and J. N. Weinstein, “GoMiner: a resource for biological interpretation of genomic and proteomic data.” Genome biology, vol. 4, no. 4, p. R28, Jan. 2003.

[215] E. I. Boyle, S. Weng, J. Gollub, H. Jin, D. Botstein, J. M. Cherry, and G. Sherlock, “GO::TermFinder–open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes.” Bioinformatics (Oxford, England), vol. 20, no. 18, pp. 3710–5, Dec. 2004.

[216] J. Goecks, A. Nekrutenko, and J. Taylor, “Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.” Genome biology, vol. 11, no. 8, p. R86, Jan. 2010.

[217] A. Goncalves, A. Tikhonov, A. Brazma, and M. Kapushesky, “A pipeline for RNA-seq data processing and quality assessment.” Bioinformatics (Oxford, England), vol. 27, no. 6, pp. 867–9, Mar. 2011.

[218] P. J. Cock, B. A. Grüning, K. Paszkiewicz, and L. Pritchard, “Galaxy tools and workflows for sequence analysis with applications in molecular plant pathology,” PeerJ, vol. 1, p. e167, Sep. 2013.

[219] B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, J. Taylor, W. Miller, W. J. Kent, and A. Nekrutenko, “Galaxy: a platform for interactive large-scale genome analysis.” Genome research, vol. 15, no. 10, pp. 1451–5, Oct. 2005.

[220] D. Blankenberg, G. Von Kuster, N. Coraor, G. Ananda, R. Lazarus, M. Mangan, A. Nekrutenko, and J. Taylor, “Galaxy: a web-based genome analysis tool for experimentalists.” Current protocols in molecular biology, vol. Unit 19, no. Unit 19.10, pp. 1–21, Jan. 2010.

[221] H. Nagasaki, T. Mochizuki, Y. Kodama, S. Saruhashi, S. Morizaki, H. Sugawara, H. Ohyanagi, N. Kurata, K. Okubo, T. Takagi, E. Kaminuma, and Y. Nakamura, “DDBJ read annotation pipeline: a cloud computing-based pipeline for high-throughput analysis of next-generation sequencing data.” DNA research : an international journal for rapid publication of reports on genes and genomes, vol. 20, no. 4, pp. 383–90, Aug. 2013.

[222] D. G. Knowles, M. Röder, A. Merkel, and R. Guigó, “Grape RNA-Seq analysis pipeline environment.” Bioinformatics, vol. 29, no. 5, pp. 614–21, Mar. 2013.

[223] Y. Wang, G. Mehta, R. Mayani, J. Lu, T. Souaiaia, Y. Chen, A. Clark, H. J. Yoon, L. Wan, O. V. Evgrafov, J. A. Knowles, E. Deelman, and T. Chen, “RseqFlow: workflows for RNA-Seq data analysis.” Bioinformatics, vol. 27, no. 18, pp. 2598–600, Sep. 2011.

[224] V. K. Mittal and J. F. McDonald, “R-SAP: a multi-threading computational pipeline for the characterization of high-throughput RNA-sequencing data.” Nucleic acids research, vol. 40, no. 9, p. e67, May 2012.

[225] J. Martin, V. M. Bruno, Z. Fang, X. Meng, M. Blow, T. Zhang, G. Sherlock, M. Snyder, and Z. Wang, “Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads.” BMC genomics, vol. 11, no. 1, p. 663, Jan. 2010.

[226] M. Reich, T. Liefeld, J. Gould, J. Lerner, P. Tamayo, and J. P. Mesirov, “GenePattern 2.0.” Nature genetics, vol. 38, no. 5, pp. 500–1, May 2006.

[227] R. Morin, M. Bainbridge, A. Fejes, M. Hirst, M. Krzywinski, T. Pugh, H. McDonald, R. Varhol, S. Jones, and M. Marra, “Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing.” BioTechniques, vol. 45, no. 1, pp. 81–94, Jul. 2008.

[228] M. D. Young, M. J. Wakefield, G. K. Smyth, and A. Oshlack, “Gene ontology analysis for RNA-seq: accounting for selection bias.” Genome biology, vol. 11, no. 2, p. R14, Jan. 2010.

[229] Z. Zhang, W. E. Theurkauf, Z. Weng, and P. D. Zamore, “Strand-specific libraries for high throughput RNA sequencing (RNA-Seq) prepared without poly(A) selection.” Silence, vol. 3, no. 1, p. 9, Jan. 2012.

[230] N. Shcherbik, M. Wang, Y. R. Lapik, L. Srivastava, and D. G. Pestov, “Polyadenylation and degradation of incomplete RNA polymerase I transcripts in mammalian cells.” EMBO reports, vol. 11, no. 2, pp. 106–11, Feb. 2010.

[231] F. Ozsolak and P. M. Milos, “RNA sequencing: advances, challenges and opportunities.” Nature reviews. Genetics, vol. 12, no. 2, pp. 87–98, Feb. 2011.

[232] S. W. Roy and M. Irimia, “When good transcripts go bad: artifactual RT-PCR ’splicing’ and genome analysis.” BioEssays, vol. 30, no. 6, pp. 601–5, Jun. 2008.

[233] R. M. Mader, W. M. Schmidt, R. Sedivy, B. Rizovski, J. Braun, M. Kalipciyan, M. Exner, G. G. Steger, and M. W. Mueller, “Reverse transcriptase template switching during reverse transcriptase-polymerase chain reaction: artificial generation of deletions in ribonucleotide reductase mRNA.” The Journal of laboratory and clinical medicine, vol. 137, no. 6, pp. 422–8, Jun. 2001.

[234] J. Cocquet, A. Chong, G. Zhang, and R. A. Veitia, “Reverse transcriptase template switching and false alternative transcripts.” Genomics, vol. 88, no. 1, pp. 127–31, Jul. 2006.

[235] C. D. Armour, J. C. Castle, R. Chen, T. Babak, P. Loerch, S. Jackson, J. K. Shah, J. Dey, C. A. Rohl, J. M. Johnson, and C. K. Raymond, “Digital transcriptome profiling using selective hexamer priming for cDNA synthesis.” Nature methods, vol. 6, no. 9, pp. 647–9, Sep. 2009.

[236] D. Faulhammer, R. J. Lipton, and L. F. Landweber, “Fidelity of enzymatic ligation for DNA computing.” Journal of computational biology, vol. 7, no. 6, pp. 839–48, Jan. 2000.

[237] I. Kozarewa, Z. Ning, M. A. Quail, M. J. Sanders, M. Berriman, and D. J. Turner, “Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes.” Nature methods, vol. 6, no. 4, pp. 291–5, Apr. 2009.

[238] R. Rosenkranz, T. Borodina, H. Lehrach, and H. Himmelbauer, “Characterizing the mouse ES cell transcriptome with Illumina sequencing.” Genomics, vol. 92, no. 4, pp. 187–94, Oct. 2008.

[239] J. C. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer, “Substantial biases in ultra-short read data sets from high-throughput DNA sequencing.” Nucleic acids research, vol. 36, no. 16, p. e105, Sep. 2008.

[240] L. Mamanova, R. M. Andrews, K. D. James, E. M. Sheridan, P. D. Ellis, C. F. Langford, T. W. B. Ost, J. E. Collins, and D. J. Turner, “FRT-seq: amplification-free, strand-specific transcriptome sequencing.” Nature methods, vol. 7, no. 2, pp. 130–2, Feb. 2010.

[241] B. Li, V. Ruotti, R. M. Stewart, J. A. Thomson, and C. N. Dewey, “RNA-Seq gene expression estimation with read mapping uncertainty.” Bioinformatics (Oxford, England), vol. 26, no. 4, pp. 493–500, Feb. 2010.

[242] R. A. Fisher, The design of experiments. Oxford, England: Oliver & Boyd, 1935.

[243] P. L. Auer and R. W. Doerge, “Statistical design and analysis of RNA sequencing data.” Genetics, vol. 185, no. 2, pp. 405–16, Jun. 2010.

[244] C. A. Kasper, I. Sorg, C. Schmutz, T. Tschon, H. Wischnewski, M. L. Kim, and C. Arrieumerlou, “Cell-cell propagation of NF-κB transcription factor and MAP kinase activation amplifies innate immunity against bacterial infection.” Immunity, vol. 33, no. 5, pp. 804–16, Nov. 2010.

[245] M. Mhlanga, “Mhlanga Laboratory Homepage at CSIR South Africa,” http://mhlangalab.synbio.csir.co.za/.

[246] C. A. Heid, J. Stevens, K. J. Livak, and P. M. Williams, “Real time quantitative PCR.” Genome research, vol. 6, no. 10, pp. 986–94, Oct. 1996.

[247] Z. Fang and X. Cui, “Design and validation issues in RNA-seq experiments.” Briefings in bioinformatics, vol. 12, no. 3, pp. 280–7, May 2011.

[248] L. Feng, H. Liu, Y. Liu, Z. Lu, G. Guo, S. Guo, H. Zheng, Y. Gao, S. Cheng, J. Wang, K. Zhang, and Y. Zhang, “Power of deep sequencing and agilent microarray for gene expression profiling study.” Molecular biotechnology, vol. 45, no. 2, pp. 101–10, Jun. 2010.

[249] U. Nagalakshmi, Z. Wang, K. Waern, C. Shou, D. Raha, M. Gerstein, and M. Snyder, “The transcriptional landscape of the yeast genome defined by RNA sequencing.” Science, vol. 320, no. 5881, pp. 1344–9, Jun. 2008.

[250] L. Camarena, V. Bruno, G. Euskirchen, S. Poggio, and M. Snyder, “Molecular mechanisms of ethanol-induced pathogenesis revealed by RNA-sequencing.” PLoS pathogens, vol. 6, no. 4, p. e1000834, Apr. 2010.

[251] P. Cohen, “Monitoring Cellular Responses to Listeria monocytogenes with Oligonucleotide Arrays,” Journal of Biological Chemistry, vol. 275, no. 15, pp. 11 181–11 190, Apr. 2000.

[252] M. J. Fullwood, C.-L. Wei, E. T. Liu, and Y. Ruan, “Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses.” Genome research, vol. 19, no. 4, pp. 521–32, Apr. 2009.

[253] M. J. Fullwood and Y. Ruan, “ChIP-based methods for the identification of long-range chromatin interactions.” Journal of cellular biochemistry, vol. 107, no. 1, pp. 30–9, May 2009.

[254] S. Fanucchi, Y. Shibayama, S. Burd, M. S. Weinberg, and M. M. Mhlanga, “Chromosomal Contact Permits Transcription between Coregulated Genes,” Cell, vol. 155, no. 3, pp. 606–620, Oct. 2013.

[255] T. Ideker, T. Galitski, and L. Hood, “A new approach decoding life: systems biology,” Annu. Rev. Genom. Human Genet., vol. 2, pp. 343–372, 2001.

[256] L. T. Macneil and A. J. M. Walhout, “Gene regulatory networks and the role of robustness and stochasticity in the control of gene expression.” Genome research, vol. 21, no. 5, pp. 645–57, May 2011.

[257] G. Karlebach and R. Shamir, “Modelling and analysis of gene regulatory networks.” Nature reviews. Molecular cell biology, vol. 9, no. 10, pp. 770–80, Oct. 2008.

[258] M.-A. Dillies, A. Rau, J. Aubert, C. Hennequet-Antier, M. Jeanmougin, N. Servant, C. Keime, G. Marot, D. Castel, J. Estelle, G. Guernec, B. Jagla, L. Jouneau, D. Laloë, C. Le Gall, B. Schaëffer, S. Le Crom, M. Guedj, and F. Jaffrézic, “A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.” Briefings in bioinformatics, Sep. 2012.

[259] H. Zhu, R. S. P. Rao, T. Zeng, and L. Chen, “Reconstructing dynamic gene regulatory networks from sample-based transcriptional data.” Nucleic acids research, vol. 40, no. 21, pp. 10 657–67, Nov. 2012.

[260] M. K. S. Yeung, J. Tegnér, and J. J. Collins, “Reverse engineering gene networks using singular value decomposition and robust regression.” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 9, pp. 6163–8, Apr. 2002.

[261] J. C. Boldrick, A. A. Alizadeh, M. Diehn, S. Dudoit, C. L. Liu, C. E. Belcher, D. Botstein, L. M. Staudt, P. O. Brown, and D. A. Relman, “Stereotyped and specific gene expression programs in human innate immune responses to bacteria.” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 2, pp. 972–7, Jan. 2002.

[262] K. S. Kim, “Strategy of Escherichia coli for crossing the blood-brain barrier.” The Journal of infectious diseases, vol. 186 Suppl, no. Supplement 2, pp. S220–4, Dec. 2002.

[263] S. Tarazona, F. García-Alcalde, J. Dopazo, A. Ferrer, and A. Conesa, “Differential expression in RNA-seq: a matter of depth.” Genome research, vol. 21, no. 12, pp. 2213–23, Dec. 2011.

[264] P. D’haeseleer, X. Wen, S. Fuhrman, and R. Somogyi, “Linear modeling of mRNA expression levels during CNS development and injury.” Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, pp. 41–52, Jan. 1999.

[265] P. D’haeseleer, S. Liang, and R. Somogyi, “Genetic network inference: from co-expression clustering to reverse engineering.” Bioinformatics, vol. 16, no. 8, pp. 707–26, Aug. 2000.

[266] M. Takahashi and N. Saitou, “Identification and characterization of lineage-specific highly conserved noncoding sequences in Mammalian genomes.” Genome biology and evolution, vol. 4, no. 5, pp. 641–57, Jan. 2012.

[267] M. L. Mayer, C. J. Blohmke, R. Falsafi, C. D. Fjell, L. Madera, S. E. Turvey, and R. E. W. Hancock, “Rescue of dysfunctional autophagy attenuates hyperinflammatory responses from cystic fibrosis cells.” Journal of immunology, vol. 190, no. 3, pp. 1227–38, Feb. 2013.

[268] D. Nickles, H. P. Chen, M. M. Li, P. Khankhanian, L. Madireddy, S. J. Caillier, A. Santaniello, B. A. C. Cree, D. Pelletier, S. L. Hauser, J. R. Oksenberg, and S. E. Baranzini, “Blood RNA profiling in a large cohort of multiple sclerosis patients and healthy controls.” Human molecular genetics, vol. 22, no. 20, pp. 4194–4205, Jun. 2013.

[269] S. C. Isom, J. R. Stevens, R. Li, W. G. Spollen, L. Cox, L. D. Spate, C. N. Murphy, and R. S. Prather, “Transcriptional profiling by RNA-Seq of peri-attachment porcine embryos generated by a variety of assisted reproductive technologies,” Physiological Genomics, vol. 45, no. 14, pp. 577–589, May 2013.

[270] M. Zhu, J. L. Dahmen, G. Stacey, and J. Cheng, “Predicting gene regulatory networks of soybean nodulation from RNA-Seq transcriptome data,” BMC Bioinformatics, vol. 14, no. 1, p. 278, 2013.

[271] N. S. Holter, A. Maritan, M. Cieplak, N. V. Fedoroff, and J. R. Banavar, “Dynamic modeling of gene expression data.” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 4, pp. 1693–8, Feb. 2001.

[272] O. D. Iancu, S. Kawane, D. Bottomly, R. Searles, R. Hitzemann, and S. McWeeney, “Utilizing RNA-Seq data for de novo coexpression network inference.” Bioinformatics, vol. 28, no. 12, pp. 1592–7, Jun. 2012.

[273] Y. Cai, B. Fendler, and G. S. Atwal, “Utilizing RNA-Seq Data for Cancer Network Inference,” p. 4, Nov. 2012.

[274] NCBI, “Gene Expression Omnibus,” http://www.ncbi.nlm.nih.gov/geo/.

[275] EMBL-EBI, “ArrayExpress - functional genomics data,” http://www.ebi.ac.uk/arrayexpress/.

[276] M. N. Cabili, C. Trapnell, L. Goff, M. Koziol, B. Tazon-Vega, A. Regev, and J. L. Rinn, “Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses.” Genes & development, vol. 25, no. 18, pp. 1915–27, Sep. 2011.

[277] M. Caudron-Herger, K. Müller-Ott, J.-P. Mallm, C. Marth, U. Schmidt, K. Fejes-Tóth, and K. Rippe, “Coding RNAs with a non-coding function: maintenance of open chromatin structure.” Nucleus, vol. 2, no. 5, pp. 410–24, 2011.

[278] K. Zarnack, J. König, M. Tajnik, I. Martincorena, S. Eustermann, I. Stévant, A. Reyes, S. Anders, N. M. Luscombe, and J. Ule, “Direct Competition between hnRNP C and U2AF65 Protects the Transcriptome from the Exonization of Alu Elements,” Cell, vol. 152, no. 3, pp. 453–466, 2013.

[279] B. E. Bernstein, E. Birney, I. Dunham, E. D. Green, C. Gunter, and M. Snyder, “An integrated encyclopedia of DNA elements in the human genome.” Nature, vol. 489, no. 7414, pp. 57–74, Sep. 2012.

[280] “Gene Expression Omnibus: Sample GSM759888,” 2011, http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759888.

[281] “UCSC hg19 assembly,” http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.2bit.

[282] C. Trapnell, L. Pachter, and S. L. Salzberg, “TopHat: discovering splice junctions with RNA-Seq.” Bioinformatics, vol. 25, no. 9, pp. 1105–11, May 2009.

[283] “UCSC hg19 gene annotation,” ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/GFF/ref_GRCh37.p13_top_level.gff3.gz.

[284] D. Smedley, S. Haider, B. Ballester, R. Holland, D. London, G. Thorisson, and A. Kasprzyk, “BioMart–biological queries made easy.” BMC genomics, vol. 10, no. 1, p. 22, Jan. 2009.

[285] “ArrayExpress: E-MTAB-582,” 2012, http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-582/.

[286] “ArrayExpress: E-MTAB-1147,” 2013, http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-1147/.

[287] C. W. Schmid and P. L. Deininger, “Sequence organization of the human genome,” Cell, vol. 6, no. 3, pp. 345–358, 1975.

[288] The ENCODE Project Consortium, “A user’s guide to the encyclopedia of DNA elements (ENCODE).” PLoS biology, vol. 9, no. 4, p. e1001046, Apr. 2011.

[289] “UCSC genome bioinformatics,” http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/releaseLatest/wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd1Rep1.fastq.gz.

[290] “UCSC genome bioinformatics,” http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/releaseLatest/wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd1Rep2.fastq.gz.

[291] “UCSC genome bioinformatics,” http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/releaseLatest/wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd2Rep1.fastq.gz.

[292] “UCSC genome bioinformatics,” http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/releaseLatest/wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd2Rep2.fastq.gz.

A Supplementary material for chapter 4

Contents
A.1 Similarity between samples
A.2 Supplementary Tables
A.3 Scripts

A.1 Similarity between samples

To suggest for which samples it would be most reasonable to obtain replicates, the relation between samples was studied. First, a distance matrix was calculated from the genes’ normalized expression values. In clustering, the similarity between two samples is given by the Euclidean distance between the two vectors that hold the expression values of all genes. All genes should therefore have roughly equal influence on the distance and, hence, similar variance, which makes the distances independent of overall expression strength. For this reason, the variance associated with the number of counts was stabilized using the variance-stabilizing transformation (VST) function implemented in the DESeq package [108]. From these data, two plots were produced: a heatmap describing the distance between samples, and a PCA plot which, after a dimension reduction that considers the 500 most variable genes, illustrates the relation between samples along the first three principal components of the data.

Regarding the Euclidean distance between samples, four main clusters can be extracted from the dendrogram in figure A.1: cluster 1A - LM2_240; cluster 2A - LM1_20, LM2_20, control; cluster 3A - LM1_240; cluster 4A - LM1_60, LM1_120, LM2_60, LM2_120. Ordering the clusters by time (cluster 2A, cluster 4A, cluster 3A, cluster 1A), an immediate conclusion is that the similarity between samples follows the time-line, as expected: control is similar to the samples from time-point 20, and the samples from time-point 60 are similar to the samples from time-point 120. Regarding sample conditions (cell populations infected with wild-type L. monocytogenes, LM1, or with the mutant, LM2), at the early time-point the samples from both conditions are clustered together with control (cluster 2A). This means that, at this time, what is happening in the cell does not differ considerably between LM1 and LM2. For the mid time-points (60 and 120 minutes) a similar state is observed. Nevertheless, for the last time-point (after 240 minutes), the wild-type sample is clustered with the mid-time samples and the mutant sample is clustered with the early samples. This could imply that at this time-point the cell infected with wild-type L. monocytogenes is reacting differently from the cell infected with mutant L. monocytogenes.

Analysing the spatial distribution of the samples in the PCA plot (see figure A.2), four main clusters can be defined: cluster 1B - control, LM1_20, LM2_20; cluster 2B - LM2_60, LM2_120; cluster 3B - LM1_60, LM1_120, LM1_240; cluster 4B - LM2_240. Hence, the previous conclusions are, overall, confirmed by the principal component analysis. In fact, control is clustered with the two 20-minute samples (equivalent to cluster 2A), indicating that at this time-point the difference between samples is not yet significant. For the other time-points, samples can be separated by condition and, thereby, the main differences arise later than 20 minutes after infection. Cluster 3B contains the samples infected with wild-type L. monocytogenes (LM1) from 60 minutes onwards. For the samples infected with mutant L. monocytogenes (LM2), the last time-point (240 minutes) is extremely different from the mid time-points (60 and 120 minutes) and, consequently, these samples are not clustered together.
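The two numerical steps described above, ranking genes by variance and computing Euclidean distances between samples, can be illustrated with a minimal Python sketch. It is not the analysis code of this work, which relied on the DESeq package; the function names `top_variable_genes` and `euclidean_distance_matrix` and the toy expression values are hypothetical, and the input is assumed to be already variance-stabilized.

```python
import math
from statistics import pvariance


def top_variable_genes(expr, n):
    """Return the names of the n genes with the largest variance across samples.

    expr maps a gene name to a list of (already variance-stabilized)
    expression values, one per sample, in a fixed sample order.
    """
    ranked = sorted(expr, key=lambda gene: pvariance(expr[gene]), reverse=True)
    return ranked[:n]


def euclidean_distance_matrix(expr, genes, n_samples):
    """Pairwise Euclidean distance between samples over the given genes."""
    dist = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            d = math.sqrt(sum((expr[g][i] - expr[g][j]) ** 2 for g in genes))
            dist[i][j] = dist[j][i] = d  # the matrix is symmetric
    return dist


# Toy, made-up values for three samples (not data from this work).
samples = ["control", "LM1_20", "LM1_240"]
toy_expr = {
    "geneA": [5.0, 5.0, 9.0],
    "geneB": [7.0, 7.0, 7.0],
    "geneC": [2.0, 5.0, 6.0],
}

# Distances over all genes, as described for the heatmap.
dist = euclidean_distance_matrix(toy_expr, list(toy_expr), len(samples))

# Most variable genes, as described for the PCA input (500 in the text, 2 here).
genes = top_variable_genes(toy_expr, 2)
```

In this toy input, geneB is constant across samples and therefore contributes nothing to any distance, which is why a variance-based gene filter leaves the distances unchanged here.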

Figure A.1: Heatmap showing the Euclidean distances between the L. monocytogenes samples. (Color key and histogram: distance value from 0 to 200. Sample order on both axes: LM2_240, LM2_20, LM1_20, Control, LM1_240, LM2_60, LM2_120, LM1_120, LM1_60.)

Figure A.2: Samples cluster by strain. Plot of the first three principal components for all the 9 samples that compose the Listeria monocytogenes dataset.

A.2 Supplementary Tables

• Differentially expressed genes between samples infected with wild-type L. monocytogenes and samples infected with mutant L. monocytogenes, with a p-value cut-off of 10%, along the four acquisition time-points.
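As a minimal sketch of how such a cut-off could be applied to rows like those in the tables of this section, the Python fragment below parses p-values written with a decimal comma and keeps the rows below 10%. The helper names `parse_decimal_comma` and `significant_genes` are hypothetical, the use of a strict inequality is an assumption, and the third input row is made up for illustration; the first two rows are taken from Table A.1.

```python
def parse_decimal_comma(text):
    """Convert a number written with a decimal comma (e.g. '4,27E-17') to float."""
    return float(text.replace(",", "."))


def significant_genes(rows, cutoff=0.10):
    """Keep (ensembl_id, gene_name, p_value) rows with p-value below the cutoff.

    Assumption: the comparison is strict (p < cutoff).
    The result is sorted by increasing p-value.
    """
    kept = []
    for ensembl_id, gene_name, p_text in rows:
        p = parse_decimal_comma(p_text)
        if p < cutoff:
            kept.append((ensembl_id, gene_name, p))
    return sorted(kept, key=lambda row: row[2])


rows = [
    ("ENSG00000124177", "CHD6", "0,082339"),
    ("ENSG00000163453", "IGFBP7", "7,55E-12"),
    ("HYPOTHETICAL_ID", "HYPOTHETICAL_GENE", "0,25"),  # made-up row
]
hits = significant_genes(rows)
```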

Table A.1: Statistically significant DE genes among samples infected with wild-type and mutant L. monocytogenes for time-point 20.

Time-point 20
Ensembl id       Gene name        p-value
ENSG00000205771  CATSPER2P1       4,27E-17
ENSG00000227220  RP11-69I8.3      2,80E-15
ENSG00000178741  COX5A            5,04E-14
ENSG00000163453  IGFBP7           7,55E-12
ENSG00000213935  AC092610.12      3,99E-06
ENSG00000130748  TMEM160          0,001864
ENSG00000267449  RP11-264B14.2    0,001864
ENSG00000175602  CCDC85B          0,001864
ENSG00000101084  C20orf24         0,002497
ENSG00000051523  CYBA             0,004447
ENSG00000204055  RP11-247A12.2    0,004964
ENSG00000245067  RP11-12A1.1      0,006969
ENSG00000175390  EIF3F            0,009726
ENSG00000227694  AC005884.1       0,011436
ENSG00000142530  FAM71E1          0,011436
ENSG00000265830  AL592188.7       0,012126
ENSG00000240463  RPS19P3          0,023303
ENSG00000231884  NDUFB1P1         0,024617
ENSG00000069849  ATP1B3           0,030075
ENSG00000140988  RPS2             0,032168
ENSG00000266658  RNA28S5          0,032168
ENSG00000212664  RP11-592N21.1    0,038825
ENSG00000231744  RP11-164H5.1     0,038825
ENSG00000221869  CEBPD            0,039292
ENSG00000219928  RP11-40C6.2      0,052698
ENSG00000183779  ZNF703           0,056735
ENSG00000128951  DUT              0,0576
ENSG00000268592  RP11-244K5.8     0,060286
ENSG00000124177  CHD6             0,082339
ENSG00000269888  RP11-3P17.5      0,082339

Table A.2: Statistically significant DE genes among samples infected with wild-type and mutant L. monocytogenes for time-point 240.

Time-point 240
Ensembl id       Gene name        p-value
ENSG00000261150  EPPK1            1,25E-08
ENSG00000186940  CHCHD2P9         2,16E-07
ENSG00000236358  RP5-827C21.2     0,000109016
ENSG00000235605  RP5-827C21.1     0,000243901
ENSG00000227184  EPPK1            0,000327065
ENSG00000233554  RP11-326F20.5    0,00043398
ENSG00000163453  IGFBP7           0,002456719
ENSG00000125534  PPDPF            0,003308859
ENSG00000107984  DKK1             0,003308859
ENSG00000126934  MAP2K2           0,003550374
ENSG00000101096  NFATC2           0,006697975
ENSG00000245067  RP11-12A1.1      0,011805389
ENSG00000267598  CTC-250I14.6     0,012292872
ENSG00000219928  RP11-40C6.2      0,013822224
ENSG00000249014  HMGN2P4          0,016077411
ENSG00000148677  ANKRD1           0,016077411
ENSG00000205746  RP11-1212A22.1   0,0215832
ENSG00000125740  FOSB             0,023069666
ENSG00000224888  RP5-1142A6.2     0,023895793
ENSG00000259031  CTD-2062F14.3    0,023895793
ENSG00000261691  LA16c-366D1.3    0,023895793
ENSG00000258048  RP11-530C5.1     0,028507658
ENSG00000218537  AP000350.4       0,032237968
ENSG00000177606  JUN              0,032516999
ENSG00000240972  MIF              0,040550233
ENSG00000230777  RP11-318C24.1    0,056175652
ENSG00000128422  KRT17            0,057335182
ENSG00000250251  PKD1P6           0,076396902
ENSG00000119922  IFIT2            0,076803869
ENSG00000095752  IL11             0,078552054
ENSG00000262814  MRPL12           0,078552054
ENSG00000204055  RP11-247A12.2    0,078552054

Table A.3: Statistically significant DE genes among samples infected with wild-type and mutant L. monocytogenes for time-point 60.

| Ensembl id | Gene name | p-value |
|---|---|---|
| ENSG00000170345 | FOS | 1,70E-32 |
| ENSG00000198576 | ARC | 3,47E-20 |
| ENSG00000163453 | IGFBP7 | 1,16E-17 |
| ENSG00000125740 | FOSB | 1,23E-17 |
| ENSG00000159388 | BTG2 | 2,93E-17 |
| ENSG00000158050 | DUSP2 | 4,41E-15 |
| ENSG00000179388 | EGR3 | 4,88E-15 |
| ENSG00000128016 | ZFP36 | 6,81E-15 |
| ENSG00000177606 | JUN | 3,35E-11 |
| ENSG00000122877 | EGR2 | 7,26E-10 |
| ENSG00000245067 | RP11-12A1.1 | 2,40E-09 |
| ENSG00000162772 | ATF3 | 2,89E-09 |
| ENSG00000081041 | CXCL2 | 3,22E-09 |
| ENSG00000153234 | NR4A2 | 3,22E-09 |
| ENSG00000267598 | CTC-250I14.6 | 2,65E-07 |
| ENSG00000120738 | EGR1 | 2,65E-07 |
| ENSG00000073756 | PTGS2 | 1,47E-06 |
| ENSG00000087074 | PPP1R15A | 3,78E-06 |
| ENSG00000237155 | IER3 | 4,47E-06 |
| ENSG00000120129 | DUSP1 | 1,17E-05 |
| ENSG00000095066 | HOOK2 | 3,97E-05 |
| ENSG00000160888 | IER2 | 4,84E-05 |
| ENSG00000184557 | SOCS3 | 5,01E-05 |
| ENSG00000164949 | GEM | 6,73E-05 |
| ENSG00000059804 | SLC2A3 | 0,000310216 |
| ENSG00000136244 | IL6 | 0,000423045 |
| ENSG00000139289 | PHLDA1 | 0,000653468 |
| ENSG00000215221 | UBA52P6 | 0,000685436 |
| ENSG00000138166 | DUSP5 | 0,00081519 |
| ENSG00000235030 | IER3 | 0,000820566 |
| ENSG00000132510 | KDM6B | 0,001801084 |
| ENSG00000130066 | SAT1 | 0,002076015 |
| ENSG00000230128 | IER3 | 0,002076015 |
| ENSG00000230777 | RP11-318C24.1 | 0,002527815 |
| ENSG00000099860 | GADD45B | 0,0036138 |
| ENSG00000169429 | IL8 | 0,004171842 |
| ENSG00000233554 | RP11-326F20.5 | 0,004171842 |
| ENSG00000173334 | TRIB1 | 0,004171842 |
| ENSG00000171223 | JUNB | 0,006188937 |
| ENSG00000108106 | UBE2S | 0,008451176 |
| ENSG00000100664 | EIF5 | 0,009987431 |
| ENSG00000244485 | RPL18P13 | 0,010007102 |
| ENSG00000259884 | RP11-1100L3.8 | 0,011264331 |
| ENSG00000135625 | EGR4 | 0,011541487 |
| ENSG00000257453 | RP11-290L1.3 | 0,012060934 |
| ENSG00000137331 | IER3 | 0,012060934 |
| ENSG00000184545 | DUSP8 | 0,012602806 |
| ENSG00000078401 | EDN1 | 0,023833927 |
| ENSG00000118503 | TNFAIP3 | 0,028494482 |
| ENSG00000250317 | SMIM20 | 0,02926386 |
| ENSG00000224598 | RPS5P2 | 0,039067697 |
| ENSG00000163874 | ZC3H12A | 0,039485476 |
| ENSG00000227231 | IER3 | 0,043432041 |
| ENSG00000130513 | GDF15 | 0,043657599 |
| ENSG00000187479 | C11orf96 | 0,049014354 |
| ENSG00000182836 | PLCXD3 | 0,060104685 |
| ENSG00000175592 | FOSL1 | 0,060104685 |
| ENSG00000159200 | RCAN1 | 0,061669555 |
| ENSG00000105649 | RAB3A | 0,068448599 |
| ENSG00000136235 | GPNMB | 0,068448599 |
| ENSG00000237596 | RP13-143G15.4 | 0,073500243 |
| ENSG00000243023 | UBA52P3 | 0,07374847 |
| ENSG00000225770 | AC092933.3 | 0,079202593 |
| ENSG00000180389 | ATP5EP2 | 0,097420433 |

Table A.4: Statistically significant DE genes among samples infected with wild-type and mutant L. monocytogenes for time-point 120.

| Ensembl id | Gene name | p-value |
|---|---|---|
| ENSG00000125740 | FOSB | 1,90E-30 |
| ENSG00000179388 | EGR3 | 5,78E-27 |
| ENSG00000170345 | FOS | 9,63E-25 |
| ENSG00000153234 | NR4A2 | 3,38E-21 |
| ENSG00000159388 | BTG2 | 4,51E-21 |
| ENSG00000120738 | EGR1 | 1,51E-19 |
| ENSG00000198576 | ARC | 4,38E-16 |
| ENSG00000164949 | GEM | 6,91E-13 |
| ENSG00000118503 | TNFAIP3 | 1,41E-12 |
| ENSG00000059804 | SLC2A3 | 2,28E-12 |
| ENSG00000138166 | DUSP5 | 2,34E-12 |
| ENSG00000162772 | ATF3 | 8,45E-11 |
| ENSG00000173334 | TRIB1 | 1,56E-09 |
| ENSG00000158050 | DUSP2 | 4,12E-09 |
| ENSG00000267598 | CTC-250I14.6 | 1,78E-08 |
| ENSG00000169429 | IL8 | 2,48E-08 |
| ENSG00000122877 | EGR2 | 4,26E-08 |
| ENSG00000132510 | KDM6B | 2,36E-07 |
| ENSG00000177606 | JUN | 3,47E-07 |
| ENSG00000073756 | PTGS2 | 4,30E-07 |
| ENSG00000160888 | IER2 | 5,75E-07 |
| ENSG00000259884 | RP11-1100L3.8 | 1,46E-06 |
| ENSG00000253125 | RP11-459E5.1 | 6,15E-06 |
| ENSG00000128016 | ZFP36 | 2,22E-05 |
| ENSG00000175592 | FOSL1 | 2,54E-05 |
| ENSG00000119508 | NR4A3 | 4,70E-05 |
| ENSG00000087074 | PPP1R15A | 9,71E-05 |
| ENSG00000100906 | NFKBIA | 0,000113744 |
| ENSG00000081041 | CXCL2 | 0,000130903 |
| ENSG00000163874 | ZC3H12A | 0,000142545 |
| ENSG00000136244 | IL6 | 0,000167016 |
| ENSG00000248394 | FOSL1P1 | 0,000223467 |
| ENSG00000257453 | RP11-290L1.3 | 0,000223467 |
| ENSG00000163734 | CXCL3 | 0,000241034 |
| ENSG00000144655 | CSRNP1 | 0,000276638 |
| ENSG00000139289 | PHLDA1 | 0,00037995 |
| ENSG00000124882 | EREG | 0,00038989 |
| ENSG00000235030 | IER3 | 0,00038989 |
| ENSG00000134107 | BHLHE40 | 0,000397781 |
| ENSG00000179094 | PER1 | 0,000470311 |
| ENSG00000130066 | SAT1 | 0,000533196 |
| ENSG00000067082 | KLF6 | 0,000562736 |
| ENSG00000254088 | SLC2A3P4 | 0,000711163 |
| ENSG00000123358 | NR4A1 | 0,000716046 |
| ENSG00000237155 | IER3 | 0,001246679 |
| ENSG00000230128 | IER3 | 0,002584242 |
| ENSG00000148926 | ADM | 0,002584242 |
| ENSG00000171223 | JUNB | 0,004296906 |
| ENSG00000175602 | CCDC85B | 0,006185018 |
| ENSG00000134531 | EMP1 | 0,00633825 |
| ENSG00000142627 | EPHA2 | 0,00633825 |
| ENSG00000163660 | CCNL1 | 0,00633825 |
| ENSG00000198355 | PIM3 | 0,007592045 |
| ENSG00000128342 | LIF | 0,008737392 |
| ENSG00000267519 | MIR24-2 | 0,00884609 |
| ENSG00000136158 | SPRY2 | 0,009518016 |
| ENSG00000078401 | EDN1 | 0,009666298 |
| ENSG00000165655 | ZNF503 | 0,009856944 |
| ENSG00000173276 | ZBTB21 | 0,010795448 |
| ENSG00000163435 | ELF3 | 0,010795448 |
| ENSG00000135625 | EGR4 | 0,011780796 |
| ENSG00000167604 | NFKBID | 0,012874546 |
| ENSG00000164442 | CITED2 | 0,014745968 |
| ENSG00000130844 | ZNF331 | 0,017869565 |
| ENSG00000179361 | ARID3B | 0,018798707 |
| ENSG00000227231 | IER3 | 0,019437204 |
| ENSG00000185031 | SLC2A3P2 | 0,019911147 |
| ENSG00000120129 | DUSP1 | 0,019911147 |
| ENSG00000215319 | RP11-98L4.1 | 0,021834172 |
| ENSG00000137193 | PIM1 | 0,028586058 |
| ENSG00000206478 | IER3 | 0,029226174 |
| ENSG00000130513 | GDF15 | 0,041282395 |
| ENSG00000111912 | NCOA7 | 0,042243325 |
| ENSG00000107984 | DKK1 | 0,043707924 |
| ENSG00000187678 | SPRY4 | 0,043707924 |
| ENSG00000130522 | JUND | 0,046187869 |
| ENSG00000198142 | SOWAHC | 0,046983521 |
| ENSG00000163739 | CXCL1 | 0,052362558 |
| ENSG00000162840 | MT2P1 | 0,063144735 |
| ENSG00000137331 | IER3 | 0,064654793 |
| ENSG00000143507 | DUSP10 | 0,082427333 |
| ENSG00000100664 | EIF5 | 0,085445932 |
| ENSG00000184545 | DUSP8 | 0,086856866 |
| ENSG00000019549 | SNAI2 | 0,088257729 |
| ENSG00000154734 | ADAMTS1 | 0,088591038 |
| ENSG00000104147 | OIP5 | 0,088591038 |

• GO enrichment results for wild-type versus mutant L. monocytogenes infected samples.

Table A.5: Summary of the GO terms for the biological processes ontology and respective p-value associated with the set of DE genes from LM1 versus LM2 analysis, for all the experimental time-points.

| Time-point | P-value | Definition |
|---|---|---|
| 20 | 0,000562 | intracellular sequestering of iron ion |
| 20 | 0,001685 | negative regulation of necrotic cell death |
| 20 | 0,003087 | regulation of necrotic cell death |
| 20 | 0,004208 | necrotic cell death |
| 20 | 0,005607 | negative regulation of fibroblast proliferation |
| 20 | 0,006934 | negative regulation of cell proliferation |
| 20 | 0,018691 | cellular iron ion homeostasis |
| 60 | 3,21E-09 | response to external stimulus |
| 60 | 1,43E-08 | regulation of apoptotic process |
| 60 | 1,58E-08 | regulation of programmed cell death |
| 60 | 2,08E-08 | response to organic substance |
| 60 | 2,16E-08 | regulation of cell death |
| 60 | 3,09E-08 | apoptotic process |
| 60 | 3,51E-08 | programmed cell death |
| 60 | 4,70E-08 | response to endogenous stimulus |
| 60 | 1,30E-07 | cell death |
| 60 | 2,29E-07 | response to chemical stimulus |
| 60 | 2,75E-07 | cellular response to calcium ion |
| 120 | 2,20E-13 | apoptotic process |
| 120 | 2,74E-13 | programmed cell death |
| 120 | 2,39E-12 | negative regulation of cellular process |
| 120 | 2,65E-12 | cell death |
| 120 | 4,75E-12 | response to organic substance |
| 120 | 5,98E-12 | transcription from RNA polymerase II promoter |
| 120 | 9,35E-12 | regulation of primary metabolic process |
| 120 | 1,36E-11 | regulation of cellular process |
| 120 | 1,99E-11 | regulation of cell death |
| 120 | 3,15E-11 | negative regulation of biological process |
| 120 | 1,06E-10 | regulation of programmed cell death |
| 120 | 1,09E-10 | regulation of transcription from RNA polymerase II promoter |
| 120 | 1,28E-10 | regulation of cellular metabolic process |
| 120 | 5,58E-10 | cell proliferation |
| 120 | 6,92E-10 | positive regulation of biological process |
| 240 | No terms | |

Table A.6: Summary of the GO terms for the molecular function ontology and respective p-value associated with the set of DE genes from LM1 versus LM2 analysis, for all the experimental time-points.

| Time-point | P-value | Definition |
|---|---|---|
| 20 | 0,000289 | growth factor binding |
| 20 | 0,001078 | ferroxidase activity |
| 20 | 0,001078 | oxidoreductase activity, oxidizing metal ions, oxygen as acceptor |
| 20 | 0,002962 | ferric iron binding |
| 20 | 0,004843 | fibroblast growth factor binding |
| 20 | 0,006721 | insulin-like growth factor binding |
| 60 | 6,95E-08 | sequence-specific DNA binding transcription factor activity |
| 60 | 1,70E-06 | MAP kinase tyrosine/serine/threonine phosphatase activity |
| 60 | 1,35E-05 | RNA polymerase II activating transcription factor binding |
| 60 | 3,42E-05 | protein tyrosine/threonine phosphatase activity |
| 60 | 0,001002 | insulin-like growth factor binding |
| 60 | 0,002757 | transcription regulatory region DNA binding |
| 60 | 0,003017 | regulatory region DNA binding |
| 60 | 0,011268 | cAMP response element binding |
| 60 | 0,011583 | phosphatase activity |
| 60 | 0,013134 | C-C chemokine binding |
| 60 | 0,022413 | chemokine binding |
| 120 | 6,13E-09 | sequence-specific DNA binding transcription factor activity |
| 120 | 6,32E-09 | nucleic acid binding transcription factor activity |
| 120 | 1,26E-06 | RNA polymerase II activating transcription factor binding |
| 120 | 9,45E-06 | MAP kinase tyrosine/serine/threonine phosphatase activity |
| 120 | 1,20E-05 | MAP kinase phosphatase activity |
| 120 | 3,21E-05 | protein dimerization activity |
| 120 | 0,000106 | protein tyrosine/threonine phosphatase activity |
| 120 | 0,000355 | protein tyrosine/serine/threonine phosphatase activity |
| 120 | 0,000435 | RNA polymerase II core promoter proximal region sequence-specific DNA binding transcription factor activity involved in positive regulation of transcription |
| 120 | 0,000918 | SMAD binding |
| 120 | 0,00134 | kinase binding |
| 120 | 0,003052 | insulin-like growth factor binding |
| 120 | 0,004807 | protein kinase binding |
| 120 | 0,006592 | interleukin-8 receptor binding |
| 120 | 0,011029 | ubiquitin protein ligase binding |
| 120 | 0,019649 | cAMP response element binding |
| 240 | 0,040493 | structural molecule activity |

Table A.7: Summary of the GO terms for the cellular components ontology and respective p-value associated with the set of DE genes from LM1 versus LM2 analysis, for all the experimental time-points.

| Time-point | P-value | Definition |
|---|---|---|
| 20 | 0,000482 | intracellular ferritin complex |
| 20 | 0,000482 | ferritin complex |
| 20 | 0,013903 | small ribosomal subunit |
| 20 | 0,021512 | cytosolic ribosome |
| 60 | 0,000529 | nucleoplasm |
| 60 | 0,002209 | intracellular organelle lumen |
| 60 | 0,002938 | membrane-enclosed lumen |
| 60 | 0,01207 | transcription factor complex |
| 60 | 0,036072 | cytoplasmic stress granule |
| 60 | 0,049335 | intracellular membrane-bounded organelle |
| 120 | 6,96E-06 | nucleus |
| 120 | 0,002333 | intracellular membrane-bounded organelle |
| 120 | 0,002424 | organelle lumen |
| 120 | 0,002897 | membrane-enclosed lumen |
| 120 | 0,009453 | transcription factor complex |
| 120 | 0,011129 | intracellular organelle |
| 120 | 0,011994 | Bcl-2 family protein complex |
| 120 | 0,014971 | I-kappaB/NF-kappaB complex |
| 240 | No terms | |

A.3 Scripts

A.3.1 Makefile

Implementation of the pipeline described in Chapter 4.

## Reference folder
## (comments sit on their own lines so that no trailing whitespace ends up in
## the variable values)
# path for genome
AREF=Genome
# path for reads
READS_DIR=reads
## Main directories
# output folder
HUMAN_DIR=.
# path for bowtie
BOWTIE_DIR=bowtie2-2.0.0-beta7
# path for scripts
SCRIPT_DIR=Scripts
## Auxiliary directories
# Results
# QC analysis
FASTQC_DIR=$(HUMAN_DIR)/FastQC
# cleaned reads
TRIM_DIR=$(HUMAN_DIR)/trim
# align with genome and sort data by read name
SORTED_DIR=$(HUMAN_DIR)/alignsort
# remove reads for which one mate of the pair did not align
PAIRED_DIR=$(HUMAN_DIR)/paired_reads
# gene expression measurement
COUNTS_DIR=$(HUMAN_DIR)/counts_ensembl
# DE analysis
DESEQ_DIR=$(HUMAN_DIR)/DESeq
# GO enrichment
GO_DIR=$(HUMAN_DIR)/GOStats
# NOTE: these two variables point to release 71, whereas the download rules
# below fetch release 72
GENOME_FILE := $(AREF)/Homo_sapiens.GRCh37.71.dna.toplevel
GTF_FILE := $(AREF)/Homo_sapiens.GRCh37.71.gtf

## Rules
all: gostats
gostats-adjacent-tp: $(subst $(DESEQ_DIR)/DESeq_, $(GO_DIR)/GOsummary_BP_, $(subst _pvalue.csv.gz,.csv.gz, $(wildcard $(DESEQ_DIR)/DESeq_*.vs.*_pvalue.csv.gz)))
gostats-control-vs-LM: $(addprefix $(GO_DIR)/GOsummary_,$(subst HeLa,MF, $(notdir $(subst _L1_1.fq.gz,.csv.gz, $(wildcard $(READS_DIR)/LM*_1.fq.gz)))))
gostats-LM1-vs-LM2: $(addprefix $(GO_DIR)/,$(subst LM1_HeLa, GOsummary, $(notdir $(subst _L1_1.fq.gz,.csv.gz, $(wildcard $(READS_DIR)/LM1_*_1.fq.gz)))))
deseq-adjacent-tp: $(subst $(COUNTS_DIR)/,$(DESEQ_DIR)/DESeq_, $(subst _tp.txt.gz,_pvalue.csv.gz, $(wildcard $(COUNTS_DIR)/LM*.vs.*.txt.gz)))
deseq-control-vs-LM: $(addprefix $(DESEQ_DIR)/DESeq_, $(notdir $(subst L1_1.fq.gz,pvalue.csv.gz, $(wildcard $(READS_DIR)/LM*_L1_1.fq.gz))))
deseq-LM1-vs-LM2: $(addprefix $(DESEQ_DIR)/,$(subst LM1, DESeq_LM, $(notdir $(subst L1_1.fq.gz,pvalue.csv.gz, $(wildcard $(READS_DIR)/LM1_*_1.fq.gz)))))
concatenate-adjacent-tp: $(subst $(READS_DIR)/,$(COUNTS_DIR)/, $(subst _HeLa_20_L1_1.fq.gz,_20.vs.60_tp.txt.gz, $(wildcard $(READS_DIR)/LM*_HeLa_20_L1_1.fq.gz)))
concatenate-control-vs-LM: $(addprefix $(COUNTS_CONTROL_DIR)/,$(notdir $(subst _L1_1.fq.gz,_control.txt.gz, $(wildcard $(READS_DIR)/LM*_L1_1.fq.gz))))
concatenate-LM1-vs-LM2: $(addprefix $(COUNTS_DIR)/,$(subst LM1, LM, $(notdir $(subst _L1_1.fq.gz,.txt.gz, $(wildcard $(READS_DIR)/LM1_*_1.fq.gz)))))
counts: $(addprefix $(COUNTS_DIR)/,$(addsuffix .txt.gz, $(notdir $(subst _1.fq.gz,, $(wildcard $(READS_DIR)/*_1.fq.gz)))))
paired: $(addprefix $(PAIRED_DIR)/,$(addsuffix _paired.sam.gz, $(notdir $(subst _1.fq.gz,, $(wildcard $(READS_DIR)/*_1.fq.gz)))))
sorted: $(addprefix $(SORTED_DIR)/,$(addsuffix _sorted.sam.gz, $(notdir $(subst _1.fq.gz,, $(wildcard $(READS_DIR)/*_1.fq.gz)))))
qc : $(addprefix $(QC_DIR)/, $(notdir $(subst fq,txt, $(wildcard $(READS_DIR)/LM*.fq))))

$(AREF)/Homo_sapiens.GRCh37.72.dna.toplevel :
	@ wget "ftp://ftp.ensembl.org/pub/release-72/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.72.dna.toplevel.fa.gz" -P $(AREF)/
	@ gunzip $(AREF)/Homo_sapiens.GRCh37.72.dna.toplevel.fa.gz
	@ $(BOWTIE_DIR)/bowtie2-build $(AREF)/Homo_sapiens.GRCh37.72.dna.toplevel.fa Homo_sapiens.GRCh37.72

$(AREF)/Homo_sapiens.GRCh37.72.gtf:
	@ wget "ftp://ftp.ensembl.org/pub/release-72/gtf/homo_sapiens/Homo_sapiens.GRCh37.72.gtf.gz" -P $(AREF)/
	@ gunzip $(AREF)/Homo_sapiens.GRCh37.72.gtf.gz

# Perform quality control on fastq reads
$(QC_DIR)/%.txt : $(READS_DIR)/%.fq | $(QC_DIR)/.d
	@ gunzip $(READS_DIR)/$*.fq.gz
	@ $(FASTQC_DIR)/fastqc $*.fq --noextract --outdir=$(QC_DIR)

## Sort the SAM file by read name (to use in HTSeq)
# Convert sorted .bam to .sam
$(SORTED_DIR)/%_sorted.sam.gz : $(SORTED_DIR)/%_sorted.bam
	@ samtools view $(SORTED_DIR)/$*_sorted.bam -h > $(SORTED_DIR)/$*_sorted.sam
	@ gzip $(SORTED_DIR)/$*_sorted.sam
	@ echo "SAM file sorted."

# Convert .sam to .bam and sort
$(SORTED_DIR)/%_sorted.bam : $(SORTED_DIR)/%.sam.gz
	@ samtools view -bS $(SORTED_DIR)/$*.sam.gz | samtools sort -n - $(SORTED_DIR)/$*_sorted

# Align the sequences with the human genome (using Bowtie2 with the
# --sensitive option; all paired ends that aligned are written to the .sam).
# The --sensitive preset is defined as follows:
#   -D 15        make up to 15 consecutive seed extension attempts that fail
#                to yield a new best alignment; after that, Bowtie2 moves on
#                with the alignments found so far;
#   -R 2         re-seed up to 2 times, choosing a new set of seeds, if the
#                total number of seed hits divided by the number of seeds
#                that aligned at least once is greater than 300;
#   -L 22        set the length of the seed substrings to 22;
#   -i S,1,1.15  set the interval between seed substrings to
#                $f(x) = 1 + 1.15\sqrt{x}$, where $x$ is the read length.
#                Defining this parameter as a function lets it adjust to the
#                read length.
$(SORTED_DIR)/%.sam.gz : $(TRIM_DIR)/%_1_trimmed.gz $(TRIM_DIR)/%_2_trimmed.gz | $(SORTED_DIR)/.d
	$(BOWTIE_DIR)/bowtie2 -x $(GENOME_FILE) --sensitive -1 $(TRIM_DIR)/$*_1_trimmed.gz -2 $(TRIM_DIR)/$*_2_trimmed.gz | gzip > $(SORTED_DIR)/$*.sam.gz
	@ echo "Alignment done."

# Filter out paired-end reads for which only one mate aligned
$(PAIRED_DIR)/%_paired.sam.gz : $(SORTED_DIR)/%_sorted.sam.gz | $(PAIRED_DIR)/.d
	@ python $(SCRIPT_DIR)/paired_end.py $(SORTED_DIR)/ $* $(PAIRED_DIR)/
	@ echo "Reads filtered."

# Use the HTSeq-count script to count how many reads map to each feature
# (a feature being a range of positions on a chromosome)
$(COUNTS_DIR)/%.txt.gz : $(PAIRED_DIR)/%_paired.sam.gz $(GTF_FILE) | $(COUNTS_DIR)/.d
	@ gunzip $(PAIRED_DIR)/*.sam.gz
	@ python -m HTSeq.scripts.count -m union -s no -i gene_id $(PAIRED_DIR)/$*_paired.sam $(GTF_FILE) | gzip > $@
	@ gzip $(PAIRED_DIR)/*.sam

## Concatenate files from HTSeq-count
# control vs (LM1 or LM2)
$(COUNTS_CONTROL_DIR)/%_control.txt.gz : $(COUNTS_DIR)/%_L1.txt.gz $(COUNTS_CONTROL_DIR)/.d
	@ gunzip $(COUNTS_DIR)/$*_L1.txt.gz $(COUNTS_DIR)/Control_HeLa_L1.txt.gz
	@ paste $(COUNTS_DIR)/Control_HeLa_L1.txt $(COUNTS_DIR)/$*_L1.txt | cut -f1,2,4- | sed '1 i\Control_HeLa\tLM1_HeLa_'$* > $(COUNTS_CONTROL_DIR)/$*_control.txt
	@ gzip $(COUNTS_CONTROL_DIR)/$*_control.txt $(COUNTS_DIR)/$*_L1.txt $(COUNTS_DIR)/Control_HeLa_L1.txt

# LM1 versus LM2
$(COUNTS_DIR)/LM_HeLa_%.txt.gz : $(COUNTS_DIR)/LM1_HeLa_%_L1.txt.gz $(COUNTS_DIR)/LM2_HeLa_%_L1.txt.gz
	@ gunzip $(COUNTS_DIR)/LM*_HeLa_$*_L1.txt.gz
	@ join -t"`/bin/echo -e '\t'`" --header $(COUNTS_DIR)/LM1_HeLa_$*_L1.txt $(COUNTS_DIR)/LM2_HeLa_$*_L1.txt > $(COUNTS_DIR)/LM_HeLa_$*.txt
	@ sed -i '1s/^/LM1_HeLa_'$*'\tLM2_HeLa_'$*'\n/' $(COUNTS_DIR)/LM_HeLa_$*.txt
	@ gzip $(COUNTS_DIR)/*.txt

# Adjacent time-points
$(COUNTS_DIR)/LM%_20.vs.60_tp.txt.gz : $(COUNTS_DIR)/Control_HeLa_L1.txt.gz $(COUNTS_DIR)/LM%_HeLa_20_L1.txt.gz $(COUNTS_DIR)/LM%_HeLa_60_L1.txt.gz $(COUNTS_DIR)/LM%_HeLa_120_L1.txt.gz $(COUNTS_DIR)/LM%_HeLa_240_L1.txt.gz
	@ gunzip $(COUNTS_DIR)/LM$*_HeLa_*_L1.txt.gz $(COUNTS_DIR)/Control_HeLa_L1.txt.gz
	@ paste $(COUNTS_DIR)/Control_HeLa_L1.txt $(COUNTS_DIR)/LM$*_HeLa_20_L1.txt | cut -f1,2,4- | sed '1 i\Control_HeLa\tLM'$*'_HeLa_20' > $(COUNTS_DIR)/LM$*_ctrl.vs.20_tp.txt
	@ paste $(COUNTS_DIR)/LM$*_HeLa_20_L1.txt $(COUNTS_DIR)/LM$*_HeLa_60_L1.txt | cut -f1,2,4- | sed '1 i\LM'$*'_HeLa_20\tLM'$*'_HeLa_60' > $(COUNTS_DIR)/LM$*_20.vs.60_tp.txt
	@ paste $(COUNTS_DIR)/LM$*_HeLa_60_L1.txt $(COUNTS_DIR)/LM$*_HeLa_120_L1.txt | cut -f1,2,4- | sed '1 i\LM'$*'_HeLa_60\tLM'$*'_HeLa_120' > $(COUNTS_DIR)/LM$*_60.vs.120_tp.txt
	@ paste $(COUNTS_DIR)/LM$*_HeLa_120_L1.txt $(COUNTS_DIR)/LM$*_HeLa_240_L1.txt | cut -f1,2,4- | sed '1 i\LM'$*'_HeLa_120\tLM'$*'_HeLa_240' > $(COUNTS_DIR)/LM$*_120.vs.240_tp.txt
	@ gzip $(COUNTS_DIR)/LM$*_*.txt

## Differential expression analysis (using DESeq)
# control versus (LM1 or LM2)
$(DESEQ_DIR)/DESeq_%_pvalue.csv.gz : $(COUNTS_CONTROL_DIR)/%_control.txt.gz $(DESEQ_DIR)/.d
	@ gunzip $(COUNTS_CONTROL_DIR)/$*_control.txt.gz
	@ Rscript $(SCRIPT_DIR)/DEseq.R $(COUNTS_CONTROL_DIR)/$*_control.txt $* $(DESEQ_DIR)/
	@ gzip $(DESEQ_DIR)/DESeq_$*_*.csv $(COUNTS_CONTROL_DIR)/$*_control.txt

# LM1 versus LM2
$(DESEQ_DIR)/DESeq_%_pvalue.csv.gz: $(COUNTS_DIR)/%.txt.gz | $(DESEQ_DIR)/.d
	@ gunzip $(COUNTS_DIR)/$*.txt.gz
	@ Rscript $(SCRIPT_DIR)/DEseq.R $(COUNTS_DIR)/$*.txt $* $(DESEQ_DIR)/
	@ gzip $(DESEQ_DIR)/*.csv

# Adjacent time-points (e.g. T20 vs T60 from the LM1 condition)
$(DESEQ_DIR)/DESeq_%_pvalue.csv.gz : $(COUNTS_DIR)/%_tp.txt.gz concatenate-adjacent-tp
	@ gunzip $(COUNTS_DIR)/$*_tp.txt.gz
	@ Rscript $(SCRIPT_DIR)/DEseq.R $(COUNTS_DIR)/$*_tp.txt $* $(DESEQ_DIR)/
	@ gzip $(DESEQ_DIR)/DESeq_$*_*.csv

## GO enrichment (using GOstats)
# control vs (LM1 or LM2)
$(GO_DIR)/GOsummary_LM1_MF_%.csv.gz $(GO_DIR)/GOsummary_LM1_CC_%.csv.gz $(GO_DIR)/GOsummary_LM1_BP_%.csv.gz: $(DESEQ_DIR)/DESeq_LM1_HeLa_%_pvalue.csv.gz $(GO_DIR)/.d
	@ gunzip $(DESEQ_DIR)/DESeq_LM1_HeLa_$*_pvalue.csv.gz
	@ Rscript $(SCRIPT_DIR)/GOStats_ensembl.R $(DESEQ_DIR)/DESeq_LM1_HeLa_$*_pvalue.csv LM1_$* $(GO_DIR)/
	@ gzip $(GO_DIR)/GOsummary_BP_LM1_$*.csv $(GO_DIR)/GOsummary_CC_LM1_$*.csv $(GO_DIR)/GOsummary_MF_LM1_$*.csv $(DESEQ_DIR)/DESeq_LM1_HeLa_$*_pvalue.csv

$(GO_DIR)/GOsummary_LM2_MF_%.csv.gz $(GO_DIR)/GOsummary_LM2_CC_%.csv.gz $(GO_DIR)/GOsummary_LM2_BP_%.csv.gz: $(DESEQ_DIR)/DESeq_LM2_HeLa_%_pvalue.csv.gz $(GO_DIR)/.d
	@ gunzip $(DESEQ_DIR)/DESeq_LM2_HeLa_$*_pvalue.csv.gz
	@ Rscript $(SCRIPT_DIR)/GOStats_ensembl.R $(DESEQ_DIR)/DESeq_LM2_HeLa_$*_pvalue.csv LM2_$* $(GO_DIR)/
	@ gzip $(GO_DIR)/GOsummary_BP_LM2_$*.csv $(GO_DIR)/GOsummary_CC_LM2_$*.csv $(GO_DIR)/GOsummary_MF_LM2_$*.csv $(DESEQ_DIR)/DESeq_LM2_HeLa_$*_pvalue.csv

# LM1 versus LM2
$(GO_DIR)/GOsummary_%.csv.gz: $(DESEQ_DIR)/DESeq_LM_HeLa_%_pvalue.csv.gz | $(GO_DIR)/.d
	@ gunzip $(DESEQ_DIR)/DESeq_LM_HeLa_$*_pvalue.csv.gz
	@ Rscript $(SCRIPT_DIR)/GOStats.R $(DESEQ_DIR)/DESeq_LM_HeLa_$*_pvalue.csv $* $(GO_DIR)/
	@ gzip $(GO_DIR)/*.csv
	@ gzip $(DESEQ_DIR)/DESeq_LM_HeLa_$*_pvalue.csv

# Adjacent time-points
$(GO_DIR)/GOsummary_BP_%.csv.gz $(GO_DIR)/GOsummary_MF_%.csv.gz $(GO_DIR)/GOsummary_CC_%.csv.gz: $(DESEQ_DIR)/DESeq_%_pvalue.csv.gz $(GO_DIR)/.d
	@ gunzip $(DESEQ_DIR)/DESeq_$*_pvalue.csv.gz
	@ Rscript $(SCRIPT_DIR)/GOStats_ensembl.R $(DESEQ_DIR)/DESeq_$*_pvalue.csv $* $(GO_DIR)/
	@ gzip $(GO_DIR)/GOsummary_*_$*.csv
	@ gzip $(DESEQ_DIR)/DESeq_$*_pvalue.csv

# Create a directory (use DIR/.d)
%/.d:
	@ mkdir -p $(@D)
	@ touch $@
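
The seed interval quoted in the alignment comment of the Makefile, f(x) = 1 + 1.15·sqrt(x), can be evaluated for a few read lengths to see how it scales. The following is a minimal sketch in Python; the function name `seed_interval` is hypothetical, and any rounding that Bowtie2 applies internally is not modelled here.

```python
import math


def seed_interval(read_length, constant=1.0, coefficient=1.15):
    """Evaluate f(x) = constant + coefficient * sqrt(x) for a read of length x.

    The defaults mirror the "S,1,1.15" setting quoted in the Makefile comment.
    """
    return constant + coefficient * math.sqrt(read_length)


if __name__ == "__main__":
    # Longer reads get a larger interval between consecutive seeds.
    for length in (36, 50, 100):
        print(length, round(seed_interval(length), 2))
```

Because the interval grows with the square root of the read length, the number of seeds extracted from a read increases more slowly than the read length itself.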

A.3.2 trim.py

Script to trim a given number of bases from the start of each read, removing positions that introduce bias in raw NGS data.

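from sys import argv
import gzip

# Arguments: input directory, file name without extension,
# number of bases to delete, output directory
script, in_directory, fast_file, delete, out_directory = argv
fastq_file = in_directory + fast_file + ".fq.gz"
delete = int(delete)
inf = gzip.open(fastq_file, mode='rb')
out_file = gzip.open(out_directory + fast_file + "_trimmed.gz", mode='wb', compresslevel=9)
for (i, line) in enumerate(inf):
    if i % 2 != 0:
        # Odd-indexed lines hold the sequence and the quality string:
        # drop the first `delete` characters of both
        out_file.write(line[delete:])
        if i in range(0, 10):
            # Echo the first lines before and after trimming as a visual check
            print(line[delete:])
            print(line)
    else:
        # Header and '+' separator lines are copied unchanged
        out_file.write(line)
        if i in range(0, 10):
            print(line)
inf.close()
out_file.close()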
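
The trimming rule used by trim.py can be illustrated on a single in-memory record. This is a minimal sketch, assuming uncompressed text lines and a four-line FASTQ record layout; the helper name `trim_fastq_lines` is hypothetical and is not part of the pipeline.

```python
def trim_fastq_lines(lines, n_bases):
    """Remove the first n_bases characters from sequence and quality lines.

    A FASTQ record has four lines: header, sequence, '+' separator, quality.
    With 0-based numbering, the odd-indexed lines (1 and 3) are trimmed and
    the even-indexed lines are kept unchanged.
    """
    trimmed = []
    for index, line in enumerate(lines):
        if index % 2 != 0:
            trimmed.append(line[n_bases:])
        else:
            trimmed.append(line)
    return trimmed


if __name__ == "__main__":
    record = ["@read1", "ACGTACGTAC", "+", "IIIIHHHHGG"]
    print(trim_fastq_lines(record, 3))
```

Trimming the sequence and the quality string by the same amount keeps the two lines the same length, which downstream aligners require.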

A.3.3 paired_end.py

Script that invalidates alignments where only one mate of the paired-end read was mapped.

from sys import argv
import gzip

# Arguments: input directory, sample name, output directory
script, in_directory, sam_file, out_directory, = argv
inf = gzip.open(in_directory + sam_file + "_sorted.sam.gz", mode='rb')
out_file = gzip.open(out_directory + sam_file + "_paired.sam.gz", mode='wb', compresslevel=9)
j = 0  # number of header lines
y = 0  # number of alignment lines written
o = 0  # number of alignment lines read
for (i, line) in enumerate(inf):
    if line.startswith(b"@"):
        # Header lines are copied unchanged
        out_file.write(line)
        j = j + 1
    else:
        o = o + 1
        if (i + j) % 2 == 0:
            # First mate of the pair: remember its CIGAR field and the line
            l1 = line.split()[5]
            name = line
        else:
            # Second mate: discard the pair if exactly one mate is unaligned
            l2 = line.split()[5]
            if (l1 != l2) and (l1 == b"*" or l2 == b"*"):
                pass
            else:
                out_file.write(name)
                out_file.write(line)
                y = y + 2
print("Original file (number of lines): %d" % o)
print("Filtered file (number of lines): %d" % y)
inf.close()
out_file.close()
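
The decision rule in paired_end.py depends only on the CIGAR field (sixth column) of two adjacent SAM lines. The sketch below isolates that rule on in-memory text lines; the names `keep_pair` and `filter_pairs` are hypothetical, and the input is assumed to hold alignment lines only (no `@` header lines), sorted by read name so that mates are adjacent.

```python
def keep_pair(cigar_first, cigar_second):
    """Return False when exactly one mate is unaligned (CIGAR field is '*')."""
    first_unaligned = cigar_first == "*"
    second_unaligned = cigar_second == "*"
    # Pairs where both mates aligned, or both failed to align, are kept.
    return first_unaligned == second_unaligned


def filter_pairs(alignment_lines):
    """Yield both lines of every mate pair that passes keep_pair."""
    firsts = alignment_lines[0::2]
    seconds = alignment_lines[1::2]
    for first, second in zip(firsts, seconds):
        if keep_pair(first.split()[5], second.split()[5]):
            yield first
            yield second


if __name__ == "__main__":
    toy = [
        "r1\t99\tchr1\t100\t42\t50M",
        "r1\t147\tchr1\t200\t42\t50M",
        "r2\t73\tchr1\t300\t42\t50M",
        "r2\t133\tchr1\t300\t0\t*",
    ]
    for line in filter_pairs(toy):
        print(line)
```

As in the original script, a pair in which neither mate aligned is passed through unchanged; only pairs with a single aligned mate are removed.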

A.3.4 DESeq.R

Script to infer differentially expressed genes between two RNA-Seq samples.

args <- commandArgs(TRUE)
library("DESeq")
# Load count tables
count_table = read.table(file=args[1], header=TRUE, row.names=1)
# Define data design
dataDesign = data.frame(
  row.names = colnames(count_table),
  condition = colnames(count_table),
  libType = rep("paired-end", ncol(count_table)))
conditions = dataDesign$condition
libraryType = dataDesign$libType
# Drop the last five rows (the summary counters appended by htseq-count)
n = nrow(count_table)
count_table_trimmed = count_table[-((n-4):n),]
# Normalization
data = newCountDataSet(count_table_trimmed, conditions)
data = estimateSizeFactors(data)
normalized = counts(data, normalized=TRUE)
# Variance estimation
data = estimateDispersions(data, method="blind", sharingMode="fit-only", fitType="parametric")
# Find differential expression
res = nbinomTest(data, colnames(count_table)[1], colnames(count_table)[2])
resp = subset(res, res$padj < 0.1)
res_pvalue = resp[order(resp$pval),]
res_down = resp[order(resp$foldChange, -resp$baseMean),]
res_up = resp[order(-resp$foldChange, -resp$baseMean),]
# Write results into tables
write.csv(res_pvalue, file = sprintf("%sDESeq_%s_pvalue.csv", args[3], args[2]))
write.csv(res_up, file = sprintf("%sDESeq_%s_up.csv", args[3], args[2]))
write.csv(res_down, file = sprintf("%sDESeq_%s_down.csv", args[3], args[2]))
write.csv(res, file = sprintf("%sDESeq_%s_ALL.csv", args[3], args[2]))
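
The normalization step in the script relies on DESeq's `estimateSizeFactors`, which is based on a median-of-ratios scheme. The sketch below illustrates that idea in plain Python under simplifying assumptions: genes with a zero count in any sample are skipped, and the function name `size_factors` is hypothetical. It is an illustration of the principle, not the package's implementation.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of per-gene rows, each a list of per-sample read counts.
    For each sample, the size factor is the median over genes of the ratio
    between the gene's count and its geometric mean across samples.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(value > 0 for value in row)]
    # Log of the geometric mean of each usable gene across samples
    log_geo_means = [sum(math.log(v) for v in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


if __name__ == "__main__":
    # The second sample was sequenced twice as deeply as the first.
    toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
    print(size_factors(toy_counts))
```

Dividing each sample's counts by its size factor puts the samples on a common scale, which is what `counts(data, normalized=TRUE)` returns in the R script.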

A.3.5 GOstats.R

Script to test a given list of genes for over-represented GO terms.

args <- commandArgs(TRUE)
# Setup
library("org.Hs.eg.db")
library("GOstats")
# Load data
data <- read.csv(file=args[1], head=TRUE, sep=",", row.names=1)
if(length(data$id) > 0){
  genes_ensembl = as.character(data$id)
  # Filtering: no Entrez id
  ## Remove genes that have no Entrez Gene id
  entrezIds <- mget(genes_ensembl, envir=org.Hs.egENSEMBL2EG, ifnotfound=NA)
  genes_ensembl <- names(entrezIds)[sapply(entrezIds, function(x) unique(!is.na(x)))]
  # Filtering: no GO mapping
  ## Convert data to Entrez id
  genes_entrez = c()
  for(i in 1:length(genes_ensembl)){
    genes_entrez = c(genes_entrez, get(genes_ensembl[i], org.Hs.egENSEMBL2EG))}
  ## Remove genes with no GO mapping
  haveGo <- sapply(mget(genes_entrez, org.Hs.egGO),
                   function(x) {
                     if(length(x) == 1 && is.na(x))
                       FALSE
                     else TRUE})
  numNoGO <- sum(!haveGo)
  genes_entrez <- genes_entrez[haveGo]
  if(length(genes_entrez) > 0){
    # Define gene universe
    ## (relation between all Ensembl and Entrez ids)
    entrezUniverse <- as.list(org.Hs.egENSEMBL2EG)
    # Hypergeometric test
    hgCutoff <- 0.1
    ontologies = c('MF', 'BP', 'CC')
    for (ont in ontologies) {
      params <- new("GOHyperGParams",
                    geneIds=genes_entrez,
                    universeGeneIds=entrezUniverse,
                    annotation="org.Hs.eg.db",
                    ontology=ont,
                    pvalueCutoff=hgCutoff,
                    conditional=FALSE,
                    testDirection="over")
      paramsCond <- params
      # Run the test
      hgOver <- hyperGTest(params)
      # Summary
      df <- summary(hgOver)
      ## Get Entrez genes related with each GO term
      goterms <- df[,1]
      i = 1
      genes_ids <- c()
      while (i <= length(goterms)){
        genes_ids <- c(genes_ids, geneIdsByCategory(hgOver, goterms[i]))
        i <- i+1}
      ## Select only Entrez genes that are in the data
      for(i in 1:length(genes_ids)){
        genes_tested = c()
        if(length(genes_ids[[i]]) > 0){
          genes_to_test = genes_ids[[i]]
          for(j in 1:length(genes_to_test)){
            if(genes_to_test[j] %in% genes_entrez)
              genes_tested = c(genes_tested, genes_to_test[j])}
          genes_ids[[i]] = genes_tested }}
      ## Convert data to Ensembl id
      for(i in 1:length(genes_ids)){
        genes_ensembl = c()
        if(length(genes_ids[[i]]) > 0){
          for(j in 1:length(genes_ids[[i]])){
            genes_ensembl = c(genes_ensembl, get(genes_ids[[i]][j], org.Hs.egENSEMBL))}}
        genes_ids[[i]] = genes_ensembl}
      ## Add column with genes (Ensembl annotation)
      df$genes_Ensembl <- genes_ids
      df$genes_Ensembl <- sapply(df$genes_Ensembl, FUN = paste, collapse = ",")
      ## Write table
      write.csv(df, file = sprintf("%sGOsummary_%s_%s.csv", args[3], ont, args[2]))}}}
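
The over-representation test requested through `testDirection="over"` is a one-sided hypergeometric test. As a minimal sketch of the underlying calculation for a single GO term (the function name `hypergeom_over_pvalue` and its argument names are hypothetical, and the annotation handling that GOstats performs is not reproduced):

```python
from math import comb


def hypergeom_over_pvalue(universe_size, annotated_in_universe,
                          selected_size, annotated_in_selected):
    """P(X >= annotated_in_selected) under the hypergeometric distribution.

    universe_size:         number of genes in the gene universe
    annotated_in_universe: universe genes annotated with the GO term
    selected_size:         number of DE genes tested
    annotated_in_selected: DE genes annotated with the GO term
    """
    total = comb(universe_size, selected_size)
    upper = min(annotated_in_universe, selected_size)
    tail = 0
    for k in range(annotated_in_selected, upper + 1):
        tail += (comb(annotated_in_universe, k)
                 * comb(universe_size - annotated_in_universe, selected_size - k))
    return tail / total


if __name__ == "__main__":
    # 10 genes in the universe, 3 carry the term, 4 DE genes, 2 of them annotated
    print(hypergeom_over_pvalue(10, 3, 4, 2))
```

A small p-value indicates that the term is carried by more DE genes than expected if the DE set were drawn at random from the universe.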

B Supplementary material for chapter 5

Contents
B.1 Scripts

B.1 Scripts

B.1.1 network.R

Implementation of the methodology described in Chapter 5.
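
Two numerical steps recur in the network.R listing that follows: `get_X_dot` estimates time derivatives as the slope between adjacent time-points, and `find_kinetics` integrates the fitted model step by step. The sketch below shows both in Python under stated assumptions: the model is taken to be linear, dx_i/dt = sum_k W[i][k]·x_k + b_i, the integration is forward Euler with a fixed step, and the names `adjacent_slopes` and `euler_simulate` as well as the orientation of the weight matrix are hypothetical.

```python
def adjacent_slopes(values, time_points):
    """Slope of the line through each pair of adjacent (time, value) points."""
    slopes = []
    for j in range(len(values) - 1):
        rise = values[j + 1] - values[j]
        run = time_points[j + 1] - time_points[j]
        slopes.append(rise / run)
    return slopes


def euler_simulate(initial, weights, bias, step, n_steps):
    """Forward-Euler integration of dx_i/dt = sum_k weights[i][k] * x_k + bias[i].

    Returns the list of states, starting with the initial state.
    """
    state = list(initial)
    trajectory = [list(state)]
    for _ in range(n_steps):
        derivative = [
            sum(w * x for w, x in zip(weights[i], state)) + bias[i]
            for i in range(len(state))
        ]
        state = [x + d * step for x, d in zip(state, derivative)]
        trajectory.append(list(state))
    return trajectory


if __name__ == "__main__":
    print(adjacent_slopes([0.0, 40.0, 40.0], [0, 20, 60]))
    # One gene with no regulators and a constant production term of 2
    print(euler_simulate([1.0], [[0.0]], [2.0], 0.5, 2))
```

A smaller step size reduces the integration error at the cost of more iterations, which is the role played by the `div` argument in the R code.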

1 ## Library 2 library ("DESeq") 3 library(biomaRt) 4 ## Get genes of interest(goi) 5 # Map gene id to ensembl id 6 get_ensembl_genes = function(gene_array){ 7 mart<- useDataset("hsapiens_gene_ensembl", useMart("ensembl")) 8 chromossomes=c(1:22) 9 chromossomes=c(chromossomes, ’X’, ’Y’, ’MT’) 10 id <- getBM(filters="hgnc_symbol", attributes= c("hgnc_symbol","ensembl_gene_id"," chromosome_name"), values= gene_array, mart= mart) 11 select_genes=which(id$chromosome_name %in% chromossomes) 12 valid_genes=id[select_genes ,] 13 id_new1=c() 14 id_new2=c() 15 for(i in 1:length(gene_array)){ 16 index=grep(gene_array[i], valid_genes[,1]) 17 id_new1=c(id_new1, valid_genes[index,1]) 18 id_new2=c(id_new2, valid_genes[index,2]) } 19 ensembl_array=id_new2 20 return(ensembl_array)} 21 # Get counts for goi 22 get_goi = function(count_table, gene_array, is_ensembl){ 23 tab_goi=c() 24 genes_col=rownames(count_table) 25 for( j in 1:length(goi) ){ 26 pos_gene = goi[j] == row.names(count_table) 27 line_count_table = as.matrix(count_table[pos_gene,]) 28 tab_goi = rbind(tab_goi,line_count_table) } 29 return(t(tab_goi))} 30 # Obtain X_dot- time derivative matrix 31 get_X_dot = function(count_table, tp){ 32 X_dot = c() 33 for (i in 1:nrow(count_table)){ 34 row = t(as.matrix(count_table[i,])) 35 colnames(row) = NULL 36 linear_interpol_vec = c() 37 for (j in 1:(length(row)-1)){ 38 adj_tp = c(tp[j], tp[j+1]) 39 adj_exp = as.vector(c(row[1,j],row[1,j+1])) 40 reg = lm( adj_exp ~ adj_tp )$coefficients[2] 41 linear_interpol_vec=c(linear_interpol_vec, reg)} 42 linear_interpol_vec=as.matrix(linear_interpol_vec) 43 row.names(linear_interpol_vec) = NULL

B-2 44 X_dot = rbind(X_dot, t(linear_interpol_vec))} 45 row.names(X_dot)=row.names(count_table) 46 colnames(X_dot)=colnames(count_table)[-1] 47 return(t(X_dot))} 48 ## Get conections from html interface 49 get_conections = function (html_line, gene_array){ 50 tab_line = strsplit(html_line, ’’) 51 relation_list = as.list(gene_array) 52 for( i in 2:length(tab_line[[1]])){ 53 relation_line = strsplit(tab_line[[1]][i], ’’) 54 type_relation = relation_line[[1]][1] 55 source_gene = relation_line[[1]][2] 56 target_gene = strsplit(relation_line[[1]][3], ’’)[[1]][1] 57 for( j in 1:length(relation_list)){ 58 if(relation_list[[j]][1]==target_gene){ 59 relation_list[[j]] = c(relation_list[[j]], source_gene) 60 var=TRUE}}} 61 return(relation_list)} 62 ## Get interaction matrixW 63 get_w = function(goi_120, X_dot_mat, relation_list, goi){ 64 w = c() 65 col_name=c() 66 for( i in 1:length(relation_list)){ 67 # focusing the interactions of the ith gene 68 relation_gene = relation_list[[i]] 69 # select genes that interact with ith gene 70 X_select=c() 71 for( j in 1:length(relation_gene) ){ 72 # convert to ensembl id 73 index_ensembl_goi = grep(relation_gene[j], gene_array) 74 # get expression for gene 75 index_gene = which(colnames(goi_120)==goi[index_ensembl_goi]) 76 X_select=cbind(X_select, goi_120[,index_gene]) } 77 X_select=cbind(X_select, goi_120[,ncol(goi_120)]) 78 index_ensembl = grep(relation_gene[1], gene_array) 79 index_dot_matrix = which(colnames(goi_120)==goi[index_ensembl]) 80 X_dot_select = X_dot_mat[,index_dot_matrix] 81 w_selected = qr.coef(qr(X_select), X_dot_select) 82 w_bind=rep(0, ncol(goi_120)) 83 for( j in 1:length(relation_gene) ){ 84 index_ensembl_goi = grep(relation_gene[j], gene_array) 85 index_gene = which(colnames(goi_120)==goi[index_ensembl_goi]) 86 w_bind[index_gene] = w_selected[j] } 87 # add parameterb 88 w_bind[ncol(goi_120)] = w_selected[j+1] 89 w=cbind(w, w_bind)} 90 colnames(w)=goi 91 return (w)} 92 # search order of genes(from the gene with less 
connection to the one with most)

B-3 93 search_order = function(w){ 94 order_id=c() 95 nb_b_col=nrow(w) 96 id_array=1:ncol(w) 97 while ( length(id_array)>0 ){ 98 i=id_array[1] 99 non_zero_w=which(w[,i]!=0) 100 if(isTRUE(unique(non_zero_w==c(i,nb_b_col)))){ 101 order_id=c(order_id, i) 102 id_array=id_array[-1] 103 } else if (isTRUE(unique(non_zero_w%in%c(order_id,i,nb_b_col)))){ 104 order_id=c(order_id,i) 105 id_array=id_array[-1] 106 } else { 107 not_found_id=id_array[1] 108 id_array=id_array[-1] 109 id_array[length(id_array)+1]=not_found_id 110 }} 111 return(order_id)} 112 ## Simulate kinetics 113 find_kinetics = function(div, goi_mat, begin_tp, end_tp, w, order){ 114 xvec=seq(begin_tp, end_tp, div) 115 expression_list=list() 116 colnames(w)=NULL 117 # for each gene 118 for( i in 1:ncol(w) ){ 119 # for each weigthw 120 # inicialization 121 func = goi_mat[1,order[i]] 122 expression_list[[order[i]]]=func 123 non_zero_w = which(w[,order[i]]!=0) 124 # do numerical integration. Each integral is calculated from div to div. 125 for(t in 2:length(xvec)){ 126 integral = 0 127 # calcule incrementation 128 for( var in 1:length(non_zero_w) ){ 129 if(non_zero_w[var]==nrow(w)){ 130 integral = integral + w[non_zero_w[var], order[i]] 131 } else { 132 integral = integral + w[non_zero_w[var], order[i]]* expression_list[[non_zero_w[var]]][t-1] }} 133 # sum incrementation with the previous value 134 func = func + integral*div 135 expression_list[[order[i]]]=c(expression_list[[order[i]]], func) }} 136 expression_list[[i+1]]=xvec 137 return(expression_list)} 138 ## Functions to perform the permutation test 139 mse = function(sim, obs) {mean( (sim - obs)^2, na.rm = FALSE)} 140 # Randomly selecta set of genes from the gene universe

find_goi = function(genes_sol, used_index, cutoff, nb_genes) {
  found = TRUE
  while (found == TRUE) {
    # get genes whose variance is greater than acept_var
    # investigate if each column has solution
    acept_var = cutoff
    not_found = TRUE
    while (not_found == TRUE) {
      genes_index = sample(1:ncol(genes_sol), nb_genes)
      goi_mat = genes_sol[, genes_index]
      inc = 0
      for (i in 1:(ncol(goi_mat) - 1)) {
        array = goi_mat[, i]
        if (length(unique(goi_mat[2:(nrow(goi_mat) - 1), i])) == 1) {
          inc = inc + 1
        }
        if ((i + 1) > ncol(goi_mat)) { break }
        for (j in (i + 1):ncol(goi_mat)) {
          if (length(which((array == goi_mat[, j]) == 'TRUE')) > (nrow(goi_mat) - 2)) {
            inc = inc + 1
          }
        }
      }
      var_array = c()
      for (i in 1:ncol(goi_mat)) {
        var_array = c(var_array, var(goi_mat[, i]))
      }
      if (sum(var_array) > acept_var && inc == 0) { not_found = FALSE }
    }
    # Investigate if this gene set was already tested
    if (length(used_index) == 0) {
      found = FALSE
      used_index = cbind(used_index, genes_index)
    } else {
      # TRUE if the sampled set matches any previously used set
      found_array = FALSE
      for (i in 1:ncol(used_index)) {
        if (identical(used_index[, i], genes_index)) { found_array = TRUE }
      }
      if (found_array == FALSE) {
        found = FALSE
        used_index = cbind(used_index, genes_index)
      }
    }
  }
  return(list(goi_mat, used_index))
}
## Get interaction matrix W for the random set of genes
get_w_permutation = function(goi_120, X_dot_mat, relation_list_index, gene_names) {
  w = c()
  col_name = c()
  for (i in 1:length(relation_list_index)) {
    # focus on the interactions of the ith gene
    relation_gene = gene_names[as.numeric(relation_list_index[[i]])]
    # select genes that interact with the ith gene
    X_select = c()
    for (j in 1:length(relation_gene)) {
      # get expression for gene

      index_gene = which(colnames(goi_120) == relation_gene[j])
      X_select = cbind(X_select, goi_120[, index_gene])
    }
    # append b column
    X_select = cbind(X_select, goi_120[, ncol(goi_120)])
    # create b vector (in Ax=b)
    index_dot_matrix = which(colnames(X_dot_mat) == relation_gene[1])
    X_dot_select = X_dot_mat[, index_dot_matrix]
    # solve Ax=b
    w_selected = qr.coef(qr(X_select), X_dot_select)
    w_selected = qr.solve(X_select, X_dot_select, tol = 1e-12)
    # create w column for gene relation_gene[1]
    w_bind = rep(0, ncol(goi_120))
    for (j in 1:length(relation_gene)) {
      index_gene = match(relation_gene[j], gene_names)
      w_bind[index_gene] = w_selected[j]
    }
    # add parameter b
    w_bind[ncol(goi_120)] = w_selected[j + 1]
    w = cbind(w, w_bind)
  }
  colnames(w) = gene_names
  return(w)
}
permutation_test = function(tp, count_mat, nb_resample, div, cutoff, relation_list_index, order) {
  mse_sample = c()
  var_sample = c()
  mse1_sample = c()
  used_index = c()
  index = c()
  # get genes whose entries are all non-zero
  for (i in 1:ncol(count_mat)) {
    have_zero = length(which(count_mat[2:nrow(count_mat), i] == 0)) > 0
    if (have_zero == 0) {
      index = c(index, i)
    }
  }
  genes_sol = count_mat[, index]
  colnames(genes_sol) = colnames(count_mat)[index]
  nb_genes = length(relation_list_index)

  # for each re-sampling perform this
  for (j in 1:nb_resample) {
    print(j)
    # get random set of genes (goi)
    list = find_goi(genes_sol, used_index, cutoff, nb_genes)
    goi_mat = cbind(list[[1]], rep(1, 5))
    used_index = list[[2]]
    gene_names = colnames(genes_sol)[used_index[, ncol(used_index)]]
    colnames(goi_mat) = c(gene_names, 'b')
    # get X_dot
    X_dot = get_X_dot(t(goi_mat), tp)
    X_dot_mat = as.matrix(X_dot)
    # get W
    goi_120 = goi_mat[1:4, ]

    W = get_w_permutation(goi_120, X_dot_mat, relation_list_index, gene_names)
    # Integrate
    expression_list = find_kinetics(div, goi_mat, begin_tp, end_tp, W, order)
    # measure mse
    known_expression_id = c()
    # Extract measured gene expression for each tp
    for (i in 1:length(tp)) {
      known_expression_id = c(known_expression_id, match(tp[i], expression_list[[length(expression_list)]]))
    }
    mse_matrix = c()
    mse_array = c()
    var_array = c()
    for (i in 1:length(gene_names)) {
      mse_matrix = rbind(mse_matrix, expression_list[[i]][known_expression_id])
      # mse between the calculated expression and the measured expression for a certain gene i
      mse_array = c(mse_array, mse(goi_mat[, i], mse_matrix[i, ]))
      # variation associated with gene i
      var_array = c(var_array, var(goi_mat[, i]))
    }
    mse_sample = c(mse_sample, sum(mse_array / var_array))
    var_sample = c(var_sample, sum(var_array))
    mse1_sample = c(mse1_sample, sum(mse_array))
  }
  return(list(mse_sample, var_sample, mse1_sample, used_index))
}
measure_goi_mse = function(tp, w, goi_mat, begin_tp, end_tp, div, order) {
  mse_sample = c()
  var_sample = c()
  mse1_sample = c()
  # Integrate
  expression_list = find_kinetics(div, goi_mat, begin_tp, end_tp, w, order)
  # measure mse
  known_expression_id = c()
  for (i in 1:length(tp)) {
    known_expression_id = c(known_expression_id, match(tp[i], expression_list[[length(expression_list)]]))
  }
  mse_matrix = c()
  mse_array = c()
  var_array = c()
  for (i in 1:(ncol(goi_mat) - 1)) {
    mse_matrix = rbind(mse_matrix, expression_list[[i]][known_expression_id])
    mse_array = c(mse_array, mse(goi_mat[, i], mse_matrix[i, ]))
    var_array = c(var_array, var(goi_mat[, i]))
  }
  mse_sample = sum(mse_array / var_array)
  var_sample = sum(var_array)
  mse1_sample = sum(mse_array)
  return(list(mse_sample, var_sample, mse1_sample))
}
## Plots
# Kinetics
plot_kinetics = function(expression_list, goi_mat, tp, gene_array) {
  nb_ng = length(expression_list)

  for (i in 1:(nb_ng - 1)) {
    png(filename = sprintf('plots/%s.png', gene_array[i]))
    exp = goi_mat[, i]
    plot(tp, exp, main = gene_array[i], col = "blue", pch = 19, xlab = NA, ylab = NA,
         ylim = c(min(min(expression_list[[i]]), min(exp)), max(max(expression_list[[i]]), max(exp))))
    lines(expression_list[[nb_ng]], expression_list[[i]])
    mtext("Time[min]", side = 1, outer = TRUE, line = -2)
    mtext("log2(expression)", side = 2, outer = TRUE, line = -1.5)
    dev.off()
  }
}
# Histogram
plot_permutationtest = function(mse_goi, mse_sample) {
  # mse/var
  png('plots/histogram_msevar.png')
  hist(c(mse_goi[[1]], mse_sample[[1]][mse_sample[[1]] < 10]), breaks = 60, main = NA,
       xlim = c(0, 10), xlab = "MSE/Var", ylab = "Frequency",
       panel.first = {points(0, 0, pch = 16, cex = 1e6, col = "white")})
  # mtext("MSE/Var", side=1, outer=TRUE, line=0)
  # mtext("Frequency", side=2, outer=TRUE, line=0.7)
  abline(v = mse_goi[[1]], col = "red", coef = )
  axis(3, at = signif(mse_goi[[1]], digits = 4), col = "red", col.axis = "red")
  dev.off()
  # variance
  png('plots/histogram_var.png')
  limit = max(mse_goi[[2]], mse_sample[[2]][mse_sample[[2]] < 20]) + 0.5
  hist(c(mse_goi[[2]],mse_sample[[2]][mse_sample[[2]]
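The listing above fits a linear model of the form dx/dt = W x + b by least squares (get_w) and then integrates it in fixed steps of size div (find_kinetics). The following is a minimal, self-contained sketch of that idea for two genes. It is not the thesis code: the toy data and the names toy_tp, toy_x and simulate_euler are assumptions made for illustration, simple finite differences stand in for get_X_dot, and every gene is regressed on every other gene instead of only on its regulators.

```r
# Toy data (assumed): log2 expression of two genes at five time points.
toy_tp <- c(0, 20, 60, 120, 240)
toy_x  <- cbind(g1 = c(0, 0.5, 1.0, 1.2, 1.3),
                g2 = c(0, 0.1, 0.4, 0.9, 1.5))

# Finite-difference estimate of dx/dt between consecutive time points
# (row i is divided by the i-th time interval).
x_dot <- apply(toy_x, 2, diff) / diff(toy_tp)

# Design matrix: expression at the first four time points plus a constant column b.
design <- cbind(toy_x[-nrow(toy_x), ], b = 1)

# Least-squares fit; one column of coefficients (weights and b) per gene.
coefs <- qr.coef(qr(design), x_dot)
w <- t(coefs[1:2, ])   # row i holds the weights acting on gene i
b <- coefs[3, ]

# Forward Euler integration of dx/dt = w %*% x + b with a fixed step.
simulate_euler <- function(x0, w, b, t_end, step = 1) {
  times <- seq(0, t_end, by = step)
  x <- matrix(NA_real_, nrow = length(times), ncol = length(x0))
  x[1, ] <- x0
  for (k in seq_along(times)[-1]) {
    x[k, ] <- x[k - 1, ] + step * (as.vector(w %*% x[k - 1, ]) + b)
  }
  list(time = times, x = x)
}

sim <- simulate_euler(toy_x[1, ], w, b, t_end = 240)
```

The simulated values at the measured time points (rows of sim$x where sim$time matches toy_tp) can then be compared with toy_x, which is the comparison the permutation test relies on.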

B.1.2 network-LM1.R

This script assesses how well the network described in Section 5.2.2 (Chapter 5) models the RNA-Seq data from cells infected with wild-type L. monocytogenes.

source('net.R')
args <- c('documents/LM1_HeLa_all_timep.txt', 'FALSE', 'paired-end', 'FALSE', '0,20,60,120,240',
          'NDRG1,ATF3,IL8,CCL2,DDIT3,IKBKB,NFKB1',
          'Connections
           Type of relationship  Source  Target
           Down-regulated  NDRG1  ATF3
           Down-regulated  NDRG1  IKBKB
           Down-regulated  ATF3   CCL2
           Down-regulated  ATF3   DDIT3
           Down-regulated  ATF3   IL8
           Down-regulated  IKBKB  NFKB1
           Down-regulated  NFKB1  IL8
          ', 5, 0)
counts_file = args[1]
is_ensembl = args[2]
librarytype = args[3]
normalized_bol = args[4]
tp = as.numeric(unlist(strsplit(args[5], split = ",")))
begin_tp = tp[1]
end_tp = tp[length(tp)]
if (is_ensembl == TRUE) {
  goi = unlist(strsplit(args[6], split = ","))
} else {
  gene_array = unlist(strsplit(args[6], split = ","))
}
html_line = strsplit(strsplit(args[7], '')[[1]][2], '')[[1]][1]
nb_resample = as.numeric(args[8])
# args is a character vector, so convert the cutoff before comparing it with variances
cutoff = as.numeric(args[9])
div = 1
if (normalized_bol == TRUE) {
  normalized = read.table(file = counts_file, header = TRUE, row.names = 1)
} else {
  count_table = read.table(file = counts_file, header = TRUE, row.names = 1)
  ## Define data design
  dataDesign = data.frame(
    row.names = colnames(count_table),
    condition = colnames(count_table),
    libType = rep(librarytype, ncol(count_table)))
  conditions = dataDesign$condition
  libraryType = dataDesign$libType
  data = newCountDataSet(count_table, conditions)
  data = estimateSizeFactors(data)
  normalized = counts(data, normalized = TRUE)
}
if (is_ensembl == FALSE) {
  goi = get_ensembl_genes(gene_array)
} else { goi = gene_array }
goi_mat = as.matrix(get_goi(as.data.frame(normalized), goi, is_ensembl))
goi_mat = log2(as.matrix(goi_mat + 1))
# subtract control
zero = c()
for (i in 1:ncol(goi_mat)) {
  goi_mat[, i] = goi_mat[, i] - goi_mat[1, i]

  if (goi_mat[2, i] == 0) {
    zero = c(zero, i)
  }
}
if (length(zero) > 0) {
  goi_mat = goi_mat[, -zero]
  goi = goi[-zero]
  gene_array = gene_array[-zero]
}
goi_mat = cbind(goi_mat, rep(1, 5))
# remove last row
goi_120 = goi_mat[1:(nrow(goi_mat) - 1), ]
# get X_dot from linear interpolation (Matrix X_dot)
X_dot = get_X_dot(t(goi_mat), tp)
X_dot_mat = as.matrix(X_dot)
# extract relationships between genes from html page
relation_list = get_conections(html_line, gene_array)
# convert gene id in relation_list to index
relation_list_index = relation_list
for (i in 1:length(relation_list)) {
  for (j in 1:length(gene_array)) {
    gene_interaction = relation_list[[i]]
    pos = match(gene_array[[j]], gene_interaction)
    if (!is.na(pos)) { relation_list_index[[i]][pos] = j }
  }
}
# Solve interaction matrix W
W = get_w(goi_120, X_dot_mat, relation_list, goi)
# find order
options(warn = -1)
order = search_order(W)
options(warn = 1)
## Simulate kinetics
expression_list = find_kinetics(1, goi_mat, begin_tp, end_tp, W, order)
plot_kinetics(expression_list, goi_mat, tp, gene_array)
## Permutation test
# Set all genes as log and subtract control value
count_mat = as.matrix(normalized)
count_mat = t(log2(count_mat + 1))
for (i in 1:ncol(count_mat)) {
  count_mat[, i] = count_mat[, i] - count_mat[1, i]
}
mse_sample = permutation_test(tp, count_mat, nb_resample, div, cutoff, relation_list_index, order)
mse_goi = measure_goi_mse(tp, W, goi_mat, begin_tp, end_tp, div, order)
ratio = length(mse_sample[[1]][mse_sample[[1]] < mse_goi[[1]]])
p_value = ratio / (nb_resample + 1)
plot_permutationtest(mse_goi, mse_sample)
# Write file with analysis properties
# differential expression
exp = c()
for (i in 1:ncol(W)) {
  if (length(gene_array) == 0) {

    gene_list = goi
  } else { gene_list = gene_array }
  line = W[, i]
  non_null = which(line != 0)
  for (j in 1:(length(non_null) - 1)) {
    var1 = paste('x_', as.character(non_null[j]), sep = "")
    var2 = paste(var1, '+ ')
    var3 = paste(as.character(signif(line[non_null[j]], digits = 5)), var2, sep = "")
    if (j > 1) {
      dif_e = paste(dif_e, var3)
    } else {
      dif_e = var3
    }
  }
  var4 = paste(as.character(signif(line[non_null[j + 1]], digits = 5)), 'u(t)', sep = "")
  dif_e = paste(dif_e, var4, sep = "")
  frac = '\\frac{dx_'
  exp = c(exp, paste(frac, paste(i, paste('}{dt} =', dif_e), sep = ""), sep = ""))
}
dir.create('documents')
write(exp, 'documents/matrix_w.txt')
stats = c(mse_goi[[1]], ratio, signif(p_value, digits = 10))
write(stats, 'documents/permutation_test_stats.txt')
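The permutation test in this script reduces to two quantities: a variance-normalised error (the sum over genes of MSE divided by variance, as in measure_goi_mse) and the fraction of random gene sets that score below the genes of interest (ratio/(nb_resample+1)). The sketch below isolates both under stated assumptions: the function names normalised_error and empirical_p_value and the example scores are hypothetical, chosen only for illustration.

```r
# Variance-normalised error: per-gene MSE divided by the variance of the
# measured values, summed over genes. sim and obs are matrices with one
# column per gene and one row per time point.
normalised_error <- function(sim, obs) {
  mse_per_gene <- colMeans((sim - obs)^2)
  var_per_gene <- apply(obs, 2, var)
  sum(mse_per_gene / var_per_gene)
}

# Empirical p-value: number of random gene sets with a lower error than the
# genes of interest, divided by the number of resamples plus one.
empirical_p_value <- function(observed, null_scores) {
  sum(null_scores < observed) / (length(null_scores) + 1)
}

# Hypothetical scores, for illustration only.
observed_score <- 2
null_scores <- c(1, 3, 5, 0.5)
p <- empirical_p_value(observed_score, null_scores)
```

With this definition the p-value is zero whenever no random set scores better than the observed one. A commonly used alternative adds one to the numerator as well, which keeps the estimate strictly positive.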
