Deciphering the ontogeny

of unmutated and

mutated subsets of

Chronic Lymphocytic

Leukemia

Master Degree Project in Systems Biology Two years Level 120 ECTS Spring term 2019

Ahmed Mohamed Email: [email protected]

Supervisor: Ola Grimsholm (University of Gothenburg) Email: [email protected]

Co-supervisor: Andreas Tilevik and Magnus Fagerlind (University of Skövde) Emails: [email protected] & [email protected]

Examiner: Zelmina Lubovac Email: [email protected] Abstract

Chronic Lymphocytic Leukemia (CLL) is a type of cancer that affects the B cells of the immune system causing problems in the process of producing antibodies. It can be sorted into mutated and unmutated CLL based on the percentage of somatic mutations in the Immunoglobulin Heavy chain Variable region (IgHV). The B cells of healthy individuals can be sorted into three groups; CD27dull memory B cells (MBCs), CD27bright MBCs and naïve B cells. The hypothesis for the project was that the unmutated CLL subset originates from CD27dull MBCs and the mutated CLL subset originates from CD27bright MBCs. RNA- sequencing data from healthy individuals were acquired from a collaboration partner in Rome and CLL- patients were collected from public datasets available online. Several bioinformatic tools were used to analyze the data. First, the quality of the data files was checked, then adapter sequence from the sequencing process and low-quality bases were removed (trimming). Good quality of the files was confirmed after the trimming. Secondly, these files were mapped against the human reference genome (GRCh38/hg38) for alignment, then the resulted data was used to check for that showed differential expression between the different groups. Results were analyzed and visualized using Venn diagrams, Principal Component Analysis (PCA) and heatmap plots and random forest. A list of 85 genes was generated based on the different comparisons and was used in one PCA plot that showed clear separation between the different groups. The SWAP70 was analyzed for single nucleotide polymorphisms (SNPs). The study concluded five genes that could be used as biomarkers for CLL and the diagnosis of its subtypes where some of them were discussed in previous studies. Also, the mutated CLL subset showed a similar behavior to the healthy individuals and this could validate the original hypothesis and justifies the better disease prognosis for this subtype.

i

Popular scientific summary

DNA consists of billions of a smaller unit called nucleotides which are arranged in a certain order for every individual. This order is maintained by the cells for healthy survival, but when this order is modified, diseases and even cancer can occur. Every sum of nucleotides form something called gene and each gene has distinctive function, but malfunction occurs when the arrangement of the nucleotides in this gene is changed. Studies have been focusing on figuring out the arrangement of these nucleotides to help sorting out which modifications causing the diseases. This have been approached by a new technology called next generation sequencing. This technology gives information about the order of the nucleotides when DNA is fed into the machine. The results are generated as files on the computer to be analyzed in different ways and most of the resulted data are available online with public access. RNA is another genetic material with the same starting unit as DNA. However, it can provide more details about the expression of the genes which control the that body makes and decides how the system of the body behaves. When the gene is modified, its expression is changed too, hence the is modified and diseases could show up.

Chronic lymphocytic leukemia (CLL) is a type of Cancer that causes disruptions in this tight system and specifically the part of the system responsible for the immunity. It can be divided into two subtypes that behave differently as one subtype called the mutated CLL tends to have better prognosis than the other subtype which is called unmutated CLL. RNA sequencing data of both CLL subtypes were collected from online sources. Also, RNA sequencing data of healthy individuals were collected for comparisons against the CLL-patients.

The comparisons were performed using different statistical tools and visualization plots. This resulted in 85 genes that showed different levels of expressions between the healthy individuals and CLL- patients. Those genes also showed different expression levels between the subtypes of CLL where they can be used to classify the patients to which subtype, they belonged to. This would provide a more accurate personalized treatment.

The expression patterns of the genes were also used to compare the healthy expression patterns with the expression patterns of the CLL-patients to find the cause and the origin of the disease. Those tools check the expression patterns of the genes and try to group them together, so if an expression pattern of a certain gene of a patient is too similar to the one of a healthy individual, they group together, hence providing possible information about the certain gene/type of cell that started the disease progression. The mutated CLL was concluded to be more similar to the healthy individuals and hence explaining the better disease prognosis for this subtype of CLL. A generated gene list of 85 important genes was summarized into five potentially significant genes that could be used instead of the whole list, but needs to be validated. Some of the specific nucleotide changes in the region of SWAP70 gene were also recorded which could be useful for further investigations.

ii

List of abbreviations

AP4B1 AP-4 complex subunit beta-1

BCRs B cell receptors

CH Constant Heavy chain

CL Constant Light chain

CLL Chronic Lymphocytic Leukemia

CRC CLL Research Consortium

CSR Class switch recombination

DEGs Differentially expressed genes

Fab Fragment of antigen binding region

Fc Fragment crystallizable region

FDR False Discovery Rate

FISH Fluorescence in situ hybridization

GC Germinal center

GEO Gene Expression Omnibus

GPSM3 G-protein-signaling modulator 3

IGHA2 Immunoglobulin heavy constant alpha 2

IgHV Immunoglobulin Heavy chain Variable region

IGV Integrative Genomics Viewer

IRP Identity Registration Protocol

MBCs Memory B cells

MHCII Major histocompatibility complex II

NK Natural killer cells

OSU Ohio State University

PAMPs Pathogen-associated molecular patterns

PCA Principal Component Analysis

PRRs Pathogen-recognition receptors

Rlog Regularized-logarithm

SH Somatic hypermutation

iii

SNPs Single nucleotide polymorphisms

SRA Sequence Read Archive

SWAP70 Switching-associated protein 70

TCRs T cell receptors

TLRs Toll-like receptors

TMM Trimmed Mean of M-values

TNFR Tumor necrosis factor receptor

VDJ Variable, Diversity and Joining subregions

VH Variable Heavy chain

VL Variable Light chain

WES Whole-exome sequencing

WGS Whole-genome sequencing

Xist X-inactive specific transcript

ZNF471 Zinc finger protein 471

iv

Table of Contents

Introduction ...... 1

The immune system ...... 1

Immune response ...... 1

The adaptive immune system ...... 2

Germinal center ...... 4

Chronic Lymphocytic Leukemia ...... 4

Next generation sequencing and CLL ...... 5

CD27 gene ...... 5

Aims ...... 6

Techniques and methods ...... 7

Quality control for FASTQ files ...... 7

Alignment and differential expression ...... 8

Correlation check and random forest ...... 9

Checking single nucleotide polymorphisms (SNPs) ...... 9

Results ...... 10

Quality control for FASTQ files and correlation check...... 10

Alignment and differential expression ...... 12

Random forest ...... 16

Checking single nucleotide polymorphisms (SNPs) ...... 17

Discussion...... 19

Ethical aspects and impact on the society ...... 23

Future perspectives...... 24

Acknowledgements ...... 25

References ...... 26

Appendices ...... 33

Appendix I ...... 33

v

Introduction

The immune system

The human immune system is a vital complex system that plays a very important role in protecting the body from different kinds of threats. These threats can be microorganisms trying to enter the body through injuries or intact skin to initiate an infection or disease (Janewayb Jr et al., 2001). The immune system is divided into two arms; innate and adaptive defence systems (Abbas et al., 1994; Gangemi et al., 2005). There is a thin line that defines and separates between the two different systems that needs to be further investigated (Hoebe et al., 2004). The innate immune system is the first line of defence and consists of several types of cells such as neutrophils, macrophages and natural killer cells (NK) that can kill bacteria and viruses through different mechanisms for example, phagocytosis and cell lysis and also involves the physical barrier of the skin while the adaptive immune system that acts as the second line of defence, involves T lymphocytes and B lymphocytes (Janewayb Jr et al., 2001; Abbas et al., 2014). The innate immunity is unspecific and cannot induce memory when fighting different pathogens as the system recognizes shared pathogen-associated molecular patterns (PAMPs) through the use of pathogen-recognition receptors (PRRs) like Toll-like receptors (TLRs) (Carruthers et al., 2007). The innate immune system can also activate the adaptive immune system through different mechanism such as cytokine secretion and antigen presentation (Janewayb Jr et al., 2001).

The adaptive immune system, on the other hand, can induce immunological memory, so it is more sophisticated than the innate immune system as it acts more specifically against the different pathogens since both B and T cells express a unique antigen receptor. B and T cells differ from each other in the surface receptors which are called B cell receptors (BCRs) in case of the B cells and T cell receptors (TCRs) in case of the T cells (Market and Papavasiliou, 2003). B cells develop in the bone marrow and complete their maturation in the spleen to become a mature naïve B cell. Memory B cells (MBCs) and plasma cells are formed after the antigen activation of the naïve B cells (Agematsu et al., 2000). On the other hand, the T cells consists of several subsets, depending on the function, such as cytotoxic T cells, helper T cells, regulatory T cells and gamma delta T cells (γδ T cells) (Mosmann and Sad, 1996; Abbas et al., 2017). The cytotoxic T cells can attack the pathogen directly while the helper T cells provide aid to the macrophages to destroy microbes or help the B cells to produce antibodies (Mosmann and Sad, 1996). The response time for the adaptive immune system is longer than the innate immune system as it needs time to be activated for the first time. However, this lag time is much shorter during a secondary response and pathogen can be neutralized directly (Abbas et al., 1994).

Humoral and cell-mediated immunity represent the two arms of the adaptive immunity. Cell-mediated immunity involves cytotoxic T cells, phagocytosis and cytokines. Humoral immunity is antibody- dependent and manifested by macromolecules present in the extracellular fluid such as complement molecules and antibodies that try to destroy toxins and microbes before entering the cells. T helper cell is a type of T cell and is responsible for the humoral immunity regulation by activating the B cells to develop into plasma cells to produce antibodies against the foreign body and also involved in cell- mediated immunity by the activation of cytotoxic T cells (Janewaya et al., 2001). The harmony between those cells is crucial for defending the body against invaders (Parish, 1972).

Immune response

The primary immune response is the immune system reacting to an antigen for the first time while the secondary immune response is the reaction to the same antigen, but for the second time. The secondary immune response is faster to be launched, more vigorous and expresses higher serum levels of antibodies. This occurs because of the MBCs that help initiating this reaction much faster, unlike the

1 primary immune response that needs a latent period at the start to produce the antibodies and their serum level is not as high as their level during the secondary immune response (Welliver et al., 1980). Some vaccination methods imply the concept of the long-acting more vigorous secondary immune response by using a two-dose vaccination programs that trigger both primary and secondary immune response to prolong and ensure the protection against the desired antigen (Peltola et al., 1994).

In general, the immune response to different antigens varies significantly between individuals. One of the factors that can affect the strength and the speed of the immune response is the route of administration of the antigen as different responses are observed in case of intramuscular, subcutaneous or intradermal routes (Mohanan et al., 2010). Also, if the antigen was introduced through an oral route, in most of the cases, the immune system would not be activated as most antigens are proteins and would get digested. However, oral immunization does also exist for certain antigens. Another factor that can affect the immune response is the sex of the individual because sex hormones can increase or decrease the immune response (Paavonen et al., 1981; Beagley and Gockel, 2003). Age is a third factor that interferes with the immune response as elderly tend to have a poorer immune response compared to the young and their response to new infections is decreased (Goodwin et al., 2006; Colonna-Romano et al., 2006; Pawelec, 2006). Other factors affecting the immune response are the presence of specific diseases such as diabetes, the use of specific medications such as immunosuppressant agents which have distinctive cytokines profiles affecting the immune reaction or the dose of the antigen introduced into the body (Graves and Kayala, 2008; Wallner et al., 2018; Liu et al., 2009; Goidl et al., 1968).

The adaptive immune system

Adaptive immunity can be subclassified into two types; active and passive adaptive immunity which imply the different mechanisms for vaccinations (Baxter, 2007). The active adaptive immunity is delayed immune response with longer duration (Baxter, 2007). It can also be natural or artificial depending on how the antigen was introduced to the body. If the antigen was introduced into the body through an active infection, the body shows natural active adaptive immune response. However, if the antigen was introduced into the body through a vaccine that is live-attenuated, killed or toxoid, the body shows artificial active adaptive immune response (Giannasca and Warny, 2004; Baxter, 2007). The passive adaptive immunity is sudden onset reaction with shorter duration by getting the immune response directly into the system using already-produced antibodies (Baxter, 2007). Needless for the body to exhibit immune response or produce antibodies, the passive adaptive immunity can be also divided into natural or artificial according to the way by which the antibodies get into the system. If the antibodies were introduced spontaneously into the body through different ways such as the placenta or mother’s lactation, it can be called passive natural adaptive immunity. On the other hand, if the antibodies were produced in a model, then isolated to be transferred into another body, it can be called passive artificial adaptive immunity (Giannasca and Warny, 2004; Baxter, 2007).

Antibodies are proteins produced by the B cells after being activated and differentiated into plasma cells. Antibodies can be present in a secreted form or a membrane-bound form (BCRs). The antibody is Y-shaped protein formed by four polypeptide chains; two light chains and two heavy chains. The heavy chains are attached to each other with disulfide linkage and each light chain connects to one heavy chain with another disulfide linkage (Figure 1; Abbas et al., 2014). Two main regions are shown on the antibody structure; a variable region and a constant region. The variable region is present at the Fab region (Fragment of antigen binding region) while the constant region is at the bottom of the Y-shape forming the larger part of the antibody, including both; the lower part of the Fab and the complete Fc region (fragment crystallizable region) (Figure 1; Abbas et al, 2017). The variable region is responsible for the identification and binding to the antigen. On the other hand, the constant region

2 is responsible for the interaction between the antibody and the receptors on other cells of the immune system (Schroeder Jr and Cavacini, 2010).

Figure 1. Schematic structure of the antibody showing the heavy chains in blue and the light chains in orange, the variable heavy chain (VH) in light blue and the constant heavy chain (CH) in dark blue while the variable light chain (VL) in light orange and the constant light chain (CL) in dark orange.

The heavy chain variable region consists of three subregions called variable (V), diversity (D) and joining (J) and each one of them has different segments where the recombination occurs (Figure 2). Different ways to recombine these segments and regions result in more than 1015 different antibodies (Schroeder Jr and Cavacini, 2010). The heavy chain constant region that is present directly after the junction and the VDJ region, consists of five subregions (µ, δ, γ, ε and α) which undergo alternative splicing to express one of the five antibody isotypes (Figure 2; Janewayc Jr et al., 2001; Schroeder Jr and Cavacini, 2010). Also, γ and α of the heavy chain constant subregions can express different subtypes of the same antibody isotype such as IgG antibodies that can be separated into four subclasses (Janewayc Jr et al., 2001). Somatic hypermutation (SHM) and class switch recombination (CSR) are two modification processes that occur on the DNA level during the activation and affinity maturation of the B cells (Figure 2; Market and Papavasiliou, 2003). SHM makes point mutations into the variable region of the heavy and light chains and leads to BCRs with higher affinity to the antigen while cells with lower-affinity BCRs are removed by apoptosis (Maul and Gearhart, 2010). Additionally, CSR that occurs on the heavy chain constant region switches the isotype of the antibody from IgM to IgG, IgE or IgA (Figure 2; Schroeder Jr and Cavacini, 2010; Stavnezer and Schrader, 2014). The mutations on the heavy chain variable region are highly active and mainly deletions. This whole process allows the B cells to modify both the antigen binding site and the part responsible for binding to different immune cells to be more specific and have higher affinity in recognizing a specific antigen and destroy it (Schroeder Jr and Cavacini, 2010).

3

Figure 2. Showing the exons for the heavy chain variable and constant regions with their subregions separated by a junction where somatic recombination occurs to form different antibodies with different isoforms.

Germinal center

During the antigen activation of B cells in secondary lymphoid organs such as the lymph nodes and the spleen, an area called the germinal center (GC) is formed where B cells undergo affinity maturation. It is divided into a dark zone and a light zone. Mature B cells reach the mantle zone outside the GC, after being activated by an antigen, where they will undergo cognate interaction with T helper cells. Subsequently, they will move to the dark zone of the GC where they will start the clonal expansion and proliferation along with somatic hypermutation in the dark zone to form centroblasts with different antigen affinities (Wollenberg et al., 2011). The centroblasts will then move to the light zone, develop into centrocytes and pick up the antigen from the follicular dendritic cells that are stromal cells playing the role as antigen deposit in the GC. The centrocyte will then present the antigen to a follicular helper T cell; Major histocompatibility complex II (MHCII) and then receive a signal from the follicular helper T cell for survival, recycling back to the dark zone or apoptosis. Thus, if the centrocyte had disadvantageous mutations that led to decreased affinity to the antigen, apoptosis occurs and the centrocyte is destroyed. However, if the affinity has been improved, the centrocyte survives and undergoes CSR to switch the antibody class from IgM/IgD to e.g. IgE. After CSR, the centrocyte will differentiate into plasma cells to secrete high affinity antibodies or MBCs that help to defend the body against a secondary infection (Klein et al., 2003).

Chronic Lymphocytic Leukemia

Chronic Lymphocytic Leukemia (CLL) is a common form of cancer that occurs in the B cells of the immune system causing abnormal changes and irregularities in the normal process of producing antibodies (Blachly et al., 2015). There are two subtypes of CLL; mutated and unmutated CLL, depending on the somatic mutations’ percentage in the Immunoglobulin Heavy chain Variable (IgHV) genes and the cutoff is at 2% mutation (Kulis et al., 2012). Mutations in the IgHV genes are almost inevitable, so the unmutated CLL still shows some mutations, but with less frequency. Also, it has been hypothesized that the unmutated CLL develops from naïve B cells and the mutated CLL develops from the MBCs in the GC (Hamblin et al., 1999). The abnormalities can be manifested through abnormal apoptosis process causing specific B cells to be accumulated. The prognosis of the disease differs between the two subtypes as the unmutated subtype tends to have a worse prognosis, probably because the less mutated B cells have higher capacity to proliferate (Hamblin et al., 1999). Anemia, thrombocytopenia, weight loss and enlargement of lymph nodes are some of the symptoms that may appear in a diseased person. Flow cytometry can be used for the diagnosis of CLL based on CD19/CD5/CD20/CD79b expression (Rawstron et al., 2016). Fluorescence in situ hybridization (FISH) is also used for the diagnosis of the disease, but when a patient is diagnosed with early-stage of the

4 disease, immediate treatment is not always preferred as it has been proven that it does not extend the survival period (Döhner et al., 2000; Hallek et al., 2008; CLL Trialists' CollaborativeGroup, 1999). Moreover, a study showed that treatment of an early-stage patient resulted in a higher risk of fatal epithelial cancer when compared to patients who did not receive treatment (Dighiero et al., 1998). CLL treatment has evolved over the years from rituximab, antibody targeting CD20 cell surface protein on B cells, to ibrutinib (Bruton tyrosine kinase inhibitor), idelalisib (phosphatidylinositol-3-kinase δ inhibitor), and venetoclax (BCL-2 inhibitor) (Hainsworth et al., 2003; Furman et al., 2014; Parikh, 2018).

Next generation sequencing and CLL

Whole-genome sequencing (WGS) and whole-exome sequencing (WES) have been applied recently to differentiate between the healthy and malignant B cells through the analysis of significant mutated genes and depending on the mutation percentage it can be interpreted whether the patient has a mutated or an unmutated CLL subtype (Rodríguez-Vicente et al., 2017). Using IgBLAST to compare against the references, if the matching with the IgHV region is more than or equal 98%, it can be classified as unmutated CLL subtype, otherwise (less than 98%), it can be classified as mutated CLL subtype. This cut-off value is not so accurate and needs attention during the classification process (Ye et al., 2013; Rodríguez-Vicente et al., 2017). RNA-sequencing data can also be used to identify whether the patient has the unmutated CLL subtype or not as previously-mentioned depending on the mutation percentage of the IgHV region (Blachly et al., 2015). It can provide a deeper insight about the disease at the gene level since both mutations and aberrant gene expression can be investigated and give information about single nucleotide polymorphisms (SNPs), gene expression profile and the status of IgHV region mutations instead of using specialized clinical procedures or sanger sequencing that are being used now (Blachly et al., 2015). RNA-sequencing is a preferred technique to be applied to a lot of diseases for a faster, more informative and more accurate diagnosis.

CD27 gene

The CD27 gene is a member of the tumor necrosis factor receptor (TNFR) superfamily and it can regulate B cell responses (Guikema et al., 2003). CD27 is expressed on MBCs and peripheral blood T cells at different levels throughout life and it can also be found in a soluble from in the serum when B cell malignancies occur, mainly CLL (Agematsu, 2000; Klein et al., 1998; Agematsu et al., 2000; Van Oers et al., 1993). The differentiation process of B cells depends on the CD27 ligand (CD70) which activates CD27 on the peripheral blood B cells (Agematsu et al., 1998). Recently, it has been shown that MBCs can be separated into CD27dull and CD27bright MBCs with different functions and ability to differentiate into plasma cells. CD27dull MBCs tend to keep their memory abilities and proliferate while the CD27bright MBCs tend to develop to plasma cells.

5

Aims

1- Analyze similarity in gene expression between the CD27 expression levels on B cells to mutated and unmutated subtypes of CLL as the hypothesis is that CD27dull MBCs that have few somatic mutations are the origin of unmutated subset of CLL whereas CD27bright MBCs that are highly mutated are the origin of the mutated subset of CLL.

2- Identify potential biomarkers to stratify the CLL-patients into mutated and unmutated CLL subtypes to provide a more accurate prognosis and treatment for the patients.

6

Techniques and methods

Quality control for FASTQ files

RNA-sequencing data files of four healthy individuals were already divided to three subsets each; naïve B cells, CD27dull MBCs and CD27bright MBCs and the sequencing process was performed on four sequencing runs, so four FASTQ files were available for each subset. Those files were merged into one big FASTQ file for each subset using the cat function in Cygwin which is a UNIX emulator on Windows (Racine, 2000). Along with the downloaded public data files of CLL-patients to be compared against, RNA-sequencing data files of seventeen CLL-patients were downloaded from GEO database with accession number GSE66228 using SRA toolkit and Cygwin. The sequencing process was pair-ended for all the files and all the groups had two files; forward and reverse FASTQ files.

RNA-sequencing data files from both healthy individuals and CLL-patients were analyzed using different tools. First, all the FASTQ files were checked for the quality using FastQC tool (Andrews, 2010; Bolger et al., 2014). This tool provides detailed information about different quality parameters with an overall estimation for each parameter to be marked with a green sign for quality check passing, yellow sign for quality warnings or a red sign for quality check fail for this parameter. The parameters that were mostly focused on were the per base sequence quality, per base sequence content and sequence duplication levels as they show the quality score for each base in the read, the bias of the sequencing machine and PCR bias, respectively (Li et al., 2015; Guo et al., 2013). In Illumina machines, the quality scores for the bases are represented by PHRED quality score which usually ranges from 0 to 40 with each number representing how accurate the base calling was during the sequencing with 40 being the highest quality and lowest chance (0.0001) for errors (Cock et al., 2009).

The Illumina TruSeq universal adapter was checked in all the FASTQ files from the CLL-patients using the first eleven bases of the reverse complementary sequence of the adapter (Appendix I). Then, the adapter was removed from the files of the CLL-patients using the reverse complementary sequence of the adapter using a Java-based tool called Trimmomatic with the “ILLUMINACLIP” command in Cygwin with the following parameters ”ILLUMINACLIP: adaptor_seq.fa1:30:5” where ”1” implies the maximum number of allowed mismatched bases to be one base, ”30” for paired-end mode and ”5” is for the alignment score for the accuracy of the matching (Bolger et al., 2014). The quality of the bases of the reads was improved and low-quality bases were filtered through Trimmomatic tool in Cygwin using the command ”SLIDINGWIMDOW:4:20 MINLEN:30” for FASTQ files of both healthy individuals and CLL-patients. This command with the specified parameters allows the tool to check the mean quality of every four consecutive bases on each read and if it is below 20 which is the normal cutoff, corresponds to an error probability of 0.01, for the base quality, the read is trimmed (Cock et al., 2009).

The quality of all the trimmed and filtered FASTQ files was checked again using FastQC tool to confirm that the adapter sequence was removed and the low-quality bases were trimmed to confirm that the data showed better quality and to go on with the next step. US-1422302 forward FASTQ file from the CLL dataset was checked before and after the trimming and quality control process as a representative for the CLL dataset quality while the reverse file of the CD27bright MBCs of the third healthy person was used as a representative of the quality of the healthy dataset before and after the trimming and quality control process.

7

Alignment and differential expression

Secondly, the alignment was conducted using the trimmed FASTQ files against the human reference genome (GRCh38/hg38) which was accessed from Ensembl genome database (Ensembl release 96). No filtration was applied to be able to check all the genes and compare them first with the help of Rsubread package in R program in Linux operating system. This package has been proven to be computationally-efficient and time-saving when compared to other tools (Liao et al., 2013) (Liao et al., 2018). It also offers the advantage of providing an output that is compatible with the R programming language to be used directly without further complications (Liao et al., 2013). The mappability percentage of each file was also recorded. Count files for both healthy and CLL datasets were generated by applying the function “featureCounts” of Rsubread package to the resulted BAM files from the alignments (Liao et al., 2013). The healthy and CLL count files were then uploaded to R program and genes with a count less than 10 were removed. About 50% of the genes were removed from each file due to filtering. The differentially expressed genes (DEGs) were identified based on the filtered count data using DESeq2 package in R as this package fits the negative-binomial model to the count data to identify DEGs, finds a good cutoff value to remove genes with low counts and mark outliers (Love1 et al., 2014; Love2 et al., 2014; Love et al., 2015). Also, DESeq2 package has showed good consistency when the replicates’ number was limited and also showed low frequencies of false positives when analyzing diverse human data (Seyednasrollah et al., 2013).

To get the list of DEGs, four pairwise comparisons were conducted in R program after fitting the variables into two models using DESeq2 package (Love1 et al., 2014). The comparisons were; CD27dull MBCs versus CD27bright MBCs, CD27dull MBCs versus naive B cells, CD27bright MBCs versus naive B cells and the mutated CLL subtype versus the unmutated CLL subtype and DEGs were listed in those comparisons using an adjusted p-value/False Discovery Rate (FDR) of 0.05 to record the genes. It was calculated based on Benjamini-Hochberg procedure for multiple comparisons. Venn diagram was created in R program with the help of VennDiagram package using the DEGs from those four pairwise comparisons to check the common DEGs between all the groups. Principal component analysis (PCA) plot was also visualized in R program using all the samples and genes with the help of ggplot2 package to check if the different groups would cluster separately.

Another PCA plot for all the samples was generated using the generated list of the DEGs between the mutated and unmutated CLL subtypes. A heatmap for all the samples was also generated with the help of gplots package in R program using the same list of DEGs to check the expression pattern of those genes in the different individuals. The linkage of the heatmap was based on the mean values and the distance was based on the Pearson correlation. A volcano plot was also used to visualize the previous list of the DEGs based on the log2 (fold-change) to mark the significant genes that show high/low log2 (fold-change) and hence the important genes in this list. As a result, PCA and heatmap plots were regenerated using the high/low log2 (fold-change) genes only (cutoff values of 1.5 and -1.5) from the previously-mentioned list of DEGs between the mutated and unmutated CLL subtypes to check if the clustering improved or not.

The same procedure of analysis was applied to the list of DEGs between CD27dull MBCs and CD27bright MBCs where PCA and heatmap plots were generated using the full list of the DEGs between those two groups. The linkage of the heatmap was also based on the mean values and the distance was based on Pearson correlation. A volcano plot was used to visualize the DEGs based on the log2 (fold-change) and mark the genes with high/low log2 (fold-change) which are considered important. Afterwards, PCA and heatmap plots were regenerated using the high/low log2 (fold-change) genes only (cutoff values of 1.5 and -1.5) from the list of DEGs between CD27dull MBCs and CD27bright MBCs to check if the improvement in the clustering.

8

Correlation check and random forest

Correlation was calculated in R program based on Pearson’s correlation coefficient since the variables were continuous and showed linear relationship as well as they had a normal distribution. To check if the distributions of the gene expressions were similar between the samples of the normalized logged counts, boxplot was generated for both; healthy and CLL datasets. This was performed to check the correlation between the CD27 expression levels of the mutated and unmutated subtypes of CLL (Hauke and Kossowski, 2011; Liao et al., 2013).

Random forest was also used with the help of randomForest package in R program using the filtered gene counts over 600 after merging the two datasets, ending up using around half of the genes and 1000 trees to create the forest. The top ten decision-making genes in the random forest were recorded as they were anticipated to have a big impact in the separation between the five groups. Those genes were plotted individually based on the count to check how well they could separate between the groups. The random forest was rerun a few times and the top ten genes changed during every run due to stochastic processes. However, they were recorded every time and plotted individually to extract the best of them for the separation process. Four of the decision-making genes showed a good separation between the groups and were extracted to be further used in the analysis. A PCA plot was generated using those four genes.

Venn diagram was created for the four important decision-making genes in the random forest, the genes with high/low log2 (fold-change) from the list of DEGs between the mutated and unmutated CLL subtypes and also the genes with high/low log2 (fold-change) from the list of DEGs between CD27dull MBCs and CD27bright MBCs. Then, all those genes were combined and used to generate a final PCA plot and a heatmap where it was based on average linkage and Euclidean distance instead of Pearson correlation.

Checking single nucleotide polymorphisms (SNPs)

Rsubread-generated BAM files from the alignment were sorted and indexed using samtools in Cygwin. The sorted BAM files were then used to check for and visualize possible splice junctions and single nucleotide polymorphisms (SNPs) in certain regions of the genome by using the Java-based tool called Integrative Genomics Viewer (IGV) which is a kind of advanced web browser (Robinson et al., 2011). The region of interest that was checked, was the SWAP70 gene region which was located on 11: 9,664,077-9,752,993 (Ensembl1).

9

Results

Quality control for FASTQ files and correlation check

The FASTQ files from the CLL-patients, both forward and reverse files, were analyzed using FASTQC tool. All files had the same length across the reads which was 90 bases meaning that no trimming or adapter removal had been applied to the files beforehand. They had good quality for these parameters; per base sequence quality, per sequence quality scores, per base N content, sequence length distribution and adapter content (Figure 3A, B, C, D, E & F). Also, overrepresented sequences were shown in all the files which was expected in RNA-sequencing library as genes show different expression levels and so different mRNA levels, hence highly expressed genes would have more RNA to be sequenced and bias would be introduced. The quality scores of the bases at the end of the reads (below 30) was not as good as the quality of the bases at the start of the reads (over 30) before the trimming (Figure 3A). After the trimming, the quality of the bases at the end of the reads improved to be as good as the ones at the start of the reads to be over 30 (Figure 3B). All the files had warnings for per base sequence content and sequence duplication levels which persisted even after the trimming process which was expected. However, the per base sequence content was improved at the end of the reads as the lines became more parallel after the trimming (Figure 3C & D). Also, the sequence duplication levels improved after trimming where the percentage of unique sequences was below 50% before trimming and rose over 50% after the trimming (Figure 3E & F). Some of the files showed warnings for GC content and per tile sequence quality. When checking for the Illumina TruSeq universal adapter sequence, it was present in all the files which was removed later along with the low-quality bases. Over 90% of the reads in all the files were maintained after the trimming and filtration process.

The FASTQ files from the healthy individuals, both forward and reverse files, had the reads ranging from 35 to 76 bases long which means that adapter removal had already been applied to those files. Most of the sequences had a length of 76 bases. All the files showed good quality for the per base sequence quality, per tile sequence quality, per sequence quality scores, per sequence GC content and per base N content (Figure 3G, H, I, J, K & L). However, they all failed the per base sequence content and the sequence duplication levels and also showed a warning for sequence length distribution before and after the trimming. This was expected due to the removal of adapters. The quality scores of the last few bases were below 30 before trimming which later improved to be over 30 after trimming (Figure 3G & H). The per base sequence content did not show noticeable changes before and after the trimming (Figure 3I & J). Also, no changes were observed in the sequence duplication levels before after trimming (Figure 3K & L). Overrepresented sequences were present in all the files too. Over 90% of the reads in all the files were maintained after the trimming and filtering process. Also, the fluctuations in the first eleven bases in the per base sequence content for all the FASTQ files from healthy individuals and CLL-patients was normal and they were not removed due to the fact of using random hexamers as primers to get the cDNA for sequencing (Hansen et al., 2010).

10

Figure 3. Showing the quality parameters using FastQC tool. 1- for the US-1422302 forward file (CLL), per base sequence quality before (A) and after (B) the quality trimming and adapter removal, per base sequence content before (C) and after (D) the quality trimming and adapter removal and sequence duplication levels before (E) and after (F) the quality trimming and adapter removal. 2- for the the reverse file of the CD27bright of the third healthy individual, per base sequence before (G) and after (H) the quality trimming and adapter removal, per base sequence content before (I) and after (J) the quality trimming and adapter removal and sequence duplication levels before (K) and after (L) the quality trimming and adapter removal.

Boxplots for both the healthy and CLL datasets were generated to check the distribution of the logged counts. Both boxplots showed similar logged counts across all the samples (Figure 4A & 4B). As a result, Pearson’s correlation was calculated based on the mean values (Table 1). The correlation coefficient has a value ranging from -1 to 1 indicating how strong two factors/groups are affected by each other. A correlation coefficient of 1 means that a change in one factor/group strongly affects the other

11 factor/group (directly proportional) while a correlation coefficient of -1 means that a strong inversely proportional relationship exists between the two factors/groups. Correlation coefficient values > 0.8 or < -0.8 means that a strong relationship exists between the two factors/groups. The three healthy B- cell subsets showed higher correlation with the mutated CLL subtype than the unmutated CLL subtype (Table 1).

Figure 4. Showing (A) boxplot for the logged counts of the healthy dataset, (B) boxplot for the logged counts of the CLL dataset.

Table 1. Showing the Correlation matrix for the mean counts of the five groups based on Pearson’s coefficient.

Groups CD27bright MBCs CD27dull MBCs Naïve B cells Mutated CLL Unmutated CLL subtype subtype

CD27bright MBCs 1.0000000 0.9917679 0.9704828 0.8486202 0.8217922

CD27dull MBCs 0.9917679 1.0000000 0.9814602 0.8267748 0.8011212

Naive B cells 0.9704828 0.9814602 1.0000000 0.8252373 0.7998319

Mutated CLL 0.8486202 0.8267748 0.8252373 1.0000000 0.9864814 subtype

Unmutated CLL 0.8217922 0.8011212 0.7998319 0.9864814 1.0000000 subtype

Alignment and differential expression

The BAM files from the CLL-patients showed a mappability percentage over 94% after the alignment that was performed using Rsubread package while for the BAM files from the healthy individuals, the mappability percentage after the alignment was over 84% in all files except for one file that had only 78,55% which was the CD27bright MBCs of the third healthy person. The files containing the counts of the genes were generated based on the BAM files from the alignment using the function “featureCounts” from Rsubread package based on the filtered counts of the genes. When the counts of the genes were filtered, about 49.8% of the genes were removed from the file of the CLL-patients and about 50.9 % of the genes were removed from the file of the healthy individuals. DEGs were generated through four separate-pairwise comparisons (Table 2). Then, a Venn diagram was generated using the list DEGs from the four pairwise comparisons where 10 genes were shared between all the comparisons (Figure 5).

12

Table 2. Showing DEGs from the four pairwise comparisons.

Groups/Results DEGs

Top 3 upregulated genes Top 3 downregulated genes

CD27dull MBCs VS naive B cells 5824 genes

ALPL, CD70 and KLK1 TUNAR, MAML3 and SNTG2- AS1

CD27bright MBCs VS naive B 7928 genes cells KLK1, FA2H and CD27 SNTG2-AS1, MAML3 and TCL1A

CD27dull MBCs VS CD27bright 1738 genes MBCs ZNF667-AS1, DACT1 and NNMT IGHA2, GPR15 and LINC02541

Mutated CLL subset VS 113 genes unmutated CLL subset ZNF471, RNA gene (Novel MTND4P12, IGHA2 and JAM3 transcript; ENSG00000237429) and DBN1

Figure 5. Showing Venn diagram for the list of DEGs from the four pairwise comparisons.

A PCA plot was created using all the samples and all the genes (Figure 6A). The PCA reduces the dimensionality of the data and can visualize the combined genes in a two-dimensional plot which illustrates the similarity in the gene expression profiles between different samples (Figure 6A). Based on the first principal component (PC1) with 62% variance, the CLL-patients were clearly separated from the healthy individuals, whereas PC2 (5% variance) mainly separated the groups within the healthy individuals and CLL-patients. However, with all genes being used, it was hard to identify which one of the healthy B-cell subsets (bright, dull or naïve) showed similar expression to either the mutated or unmutated CLL-patients. Therefore, a new PCA analysis was generated based only on the 113 DEGs that were identified between the mutated and unmutated CLL-patients (Figure 6B). As expected, a much better separation between the unmutated and mutated CLL-patients could be seen.

13

Interestingly, based on PC1, the mutated CLL-patients showed a more similar expression profile to the healthy individuals as they got closer to each other on the plot (Figure 6B). Analysing the same samples using the same 113 DEGs, but with a different approach, using a heatmap with clustering based on the log2 (fold-change), a similar pattern was shown (Figure 6C). Based on the dendrogram on the top of the heatmap, all the unmutated CLL-patients clustered in one group (Figure 6C, marked with green), whereas the mutated CLL-patients clustered together (Figure 6C). Interestingly, the mutated CLL- patients showed a much more similar expression profile to healthy individuals based on those 113 DEGs.

A volcano plot was used for the previously-mentioned 113 DEGs to check for the genes with too high or too low log2 (fold-change) (Figure 6D). Cutoff values of 1.5 and -1.5 for the log2 (fold-change) were used to extract the genes with high/low expression to be used to create another PCA and heatmap plots. This process resulted in 51 genes to be considered of high/low expression and were labeled on the volcano plot (Figure 6D).

The new PCA plot using the 51 genes showed the same results as the previous PCA plot that used the 113 DEGs (Figure 6E). Interestingly, the new heatmap using the 51 genes showed new information regarding two CLL-patients; US-1422311_2% and US-1422321_0.7%. They belonged to the unmutated CLL-patients. However, they clustered with the five mutated CLL-patients (Figure 6F, marked with green). The seven CLL-patients also clustered with the healthy individuals as shown on the dendrogram on the heatmap which means that the seven CLL-patients have similar expression profile to the healthy individuals (Figure 6F).

14

Figure 6. Showing (A) PCA plot for all the samples using all the genes, (B) PCA plot for all the samples using the 113 DEGs between the mutated CLL subtype and the unmutated CLL subtype, (C) heatmap for all the samples using 113 DEGs between the mutated CLL subtype and the unmutated CLL subtype, (D) volcano plot for the 113 DEGs between the mutated CLL subtype and the unmutated CLL subtype and labelled genes. 51 genes, are based on the log2 (fold-change) which is > 1.5 or < -1.5, in red lines, (E) PCA plot for all the samples using the 51 genes with high/low log2 (fold-change) out of the 113 previously-mentioned DEGs & (F) heatmap for all the samples using the 51 genes with high/low log2 (fold-change) out of the 113 previously-mentioned DEGs.

The second step of the differential expression analysis was to use different set of genes and visualize the data from a different perspective. Using the 1738 DEGs between CD27dull and CD27bright MBCs, a PCA plot was created where PC1 (61% variance) separated clearly between the three healthy B-cell subsets and the two CLL subtypes, whereas PC2 (8% variance) mainly separated the groups within the healthy individuals and CLL patients (Figure 7A). The heatmap using the same gene list showed the same results with only the two CLL subtypes clustering together in one group and the healthy B-cell subsets in another group as shown on the dendrogram (Figure 7B).

The volcano plot was also used for the 1738 DEGs to check the genes that show too high or too low log2 (fold-change) which would provide more insight about the important genes that were highly expressed (Figure 7C). Cutoff values of 1.5 and -1.5 for the log2 (fold-change) were used to extract the genes that showed fairly high/low expression to be used to recreate the PCA and heatmap plots. This process resulted in 32 genes to be considered of high/low expression that were labeled on the volcano plot (Figure 7C).

The regenerated PCA plot using the 32 genes showed the same results as the previous PCA plot (Figure 7D). Interestingly, based on PC1 (44% variance), the CD27bright MBCs showed a more similar expression profile to the CLL-patients (Figure 7D), whereas the regenerated heatmap with the same 32 genes showed that the CD27bright MBCs clustered with CLL-patients as shown on the dendrogram on the top of the heatmap (Figure 7E, marked in green).

15

Figure 7. Showing (A) PCA plot for all the samples using the 1738 DEGs between CD27dull and CD27bright MBCs, (B) heatmap for all the samples using 1738 DEGs between CD27dull and CD27bright MBCs, (C) volcano plot for the 1738 DEGs between CD27dull and CD27bright MBCs and labelled genes, 32 genes, are based on the log2 (fold-change) which is > 1.5 or < -1.5 in red lines, (D) PCA plot for all the samples using the 32 genes with high/low log2 (fold- change) out of the previously-mentioned 1738 DEGs & (E) heatmap for all the samples using the 32 genes with high/low log2 (fold-change) out of the previously-mentioned 1738 DEGs.

Random forest

When running the random forest for filtered genes with counts over 600, 16378 genes out of 33393 genes, multiple times using 1000 trees, the top four genes that reduced the Gini index were considered interesting because they were highly involved in separating between the groups. The top four genes were SWAP70 and ZNF471, GPSM3 and AP4B1. Those four genes were used to create a PCA plot that showed a different way for the clustering of the groups where the three healthy B-cell subsets clustered in the middle separating between the two CLL subtypes with no clear separation between the three healthy B-cell subsets themselves (Figure 8A). The Venn diagram that was created for the important genes from the three comparisons, showed the four genes from the random forest, the 51 genes with high/low log2 (fold-change) out of the 113 DEGs between the mutated and unmutated CLL subtypes and the 32 genes with high/low log2 (fold-change) out of the 1738 DEGs between the CD27dull and CD27bright MBCs (Figure 8B). The Venn diagram showed one gene (ZNF471) that was shared between the four genes from the random forest and the 51 genes with high/low log2 (fold-change)

16 out of the 113 DEGs between the mutated and unmutated CLL subtypes. Another gene (IGHA2) was shared between the 51 genes with high/low log2 (fold-change) out of the 113 DEGs between the mutated and unmutated CLL subtypes and the 32 genes with high/low log2 (fold-change) out of the 1738 DEGs between the CD27dull and CD27bright MBCs. Then, those 85 important genes were combined and used to create a final PCA plot (Figure 8C). The PCA plot showed good separation between the healthy dataset, the mutated and unmutated CLL subtypes using PC1 which explained 55% of the variance and the mutated CLL subtype clustered closer to the healthy individuals too as before (Figure 8C). Using PC2 with 12% variance, good separation between the three healthy B-cell subsets was observed (Figure 8C). The generated heatmap using Euclidean distance and average linkage based on the 85 combined genes did not show much information as it showed only separation between the healthy and CLL datasets without clear clustering of the subgroups (Figure 8D).

Figure 8. Showing (A) PCA plot for all the samples using the four most important genes from the random forest (SWAP70 and ZNF471, GPSM3 and AP4B1), (B) Venn diagram for the four important genes in the random forest, the 51 genes with high/low log2 (fold-change) out of the 113 DEGs between the mutated and unmutated CLL subtypes and the 32 genes with high/low log2 (fold-change) out of the 1738 DEGs between the CD27dull and CD27bright MBCs, (C) PCA plot for all the samples using the combined 85 genes & (D) heatmap using Euclidean distance for all the samples using the combined 85 genes.

Checking single nucleotide polymorphisms (SNPs)

After sorting and indexing the resulted BAM files from the alignment, the sorted BAM files for all the individuals were imported to the IGV-browser and compared to each other to find single nucleotide polymorphisms. The area of main focus was the SWAP70 gene area which was located on : 9,664,077-9,752,993 forward strand (GRCh38/hg38) (Ensembl1). Nine possible important SNPs were observed and recorded with their locations and three of them were found on exons (Table 3).

17

Table 3. Showing important SNPs in the SWAP70 region with their locations and occurrence in the five groups.

Locations and Mutated Unmutated CD27bright CD27dull Naïve B cells Substitutions/Groups CLL CLL subtype MBCs ”4 MBCs ”4 ”4 subtype ”5 ”12 individuals” individuals” individuals” patients” patients” chr11:9,725,695 60% 83.33% 100% 100% 100% ”C instead of T” (intron) chr11:9,728,098 60% 66.67% 75% 75% 75% ”C instead of A ” (R230R) (R230R) (R230R) (R230R) (R230R) (Exon 5) chr11:9,732,674 100% 91.67% 80% 80% 80% ”T instead of C” (N348N) (N348N) (N348N) (N348N) (N348N) (Exon 7) chr11:9,714,534 100% 91.67% 100% 100% 100% ”C instead of T” (intron) chr11:9,682,330 80% 75% 100% 50% 25% ”T instead of G” The rest had The rest had (intron) no coverage no coverage chr11:9,748,015 100% 91.67% 100% 100% 100% ”G instead of C” (Q505E) (Q505E) (Q505E) (Q505E) (Q505E) (Exon 10) chr11:9,698,100 60% 66.67% 75% 75% 25% ”T instead of G” (intron) chr11:9,751,970 100% 91.67% 100% 100% 100% ”A instead of C” (intron) chr11:9,752,021 100% 91.67% 100% 100% 100% ”C instead of T” (intron)

18

Discussion

The main focus of the study was to analyze the RNA-sequencing data of the two CLL subtypes and the three healthy B-cell subsets and check the similarities in gene expression profiles between those groups in order to anticipate the original cause of the two CLL subtypes. Also, extracting a certain set of genes that can be used to easily distinguish between the two CLL subtypes for a better prognosis and personalized treatment.

Some warnings were shown during the quality check of the FASTQ files from the CLL-patients. Per base sequence content showed warnings for the first eleven bases of the reads in all the files where fluctuations in the percentage of each base was observed and was predicted to happen due to the use of random hexamers as primers to get the cDNA for the sequencing process (Blachly et al., 2015; Hansen et al., 2010). This method is usually preferred and it does not affect the overall quality (Hansen et al., 2010). Removing those bases would result in about 2% decrease in the numbers of DEGs (Williams et al., 2016).

Sequence duplication levels showed warnings for all the files with around 10% of the reads to have more than ten duplicates. However, most of the reads of the forward and reverse FASTQ files from the CLL-patients were unique, average of 58% and 60%, respectively. This was expected to show up during the analysis due to the fact that some genes exhibit different levels of expression with different mRNA levels, so highly expressed genes have more RNA and thus more duplication levels. Also, the fact of working with RNA-sequencing library where PCR was used for amplification and reverse transcription of RNA to cDNA before the sequencing which may have contributed to the duplication levels (Blachly et al., 2015; Li et al., 2015). A new approach for transcriptome sequencing was proposed where reverse transcription is integrated into the sequencing machine, so RNA could be loaded directly to the sequencing machine without the need for PCR. This would reduce the duplicates and the bias in the per base sequence content (Mamanova et al., 2010).

Also, for the FASTQ files from the healthy individuals, the per base sequence content showed some bias at the end of the reads toward certain bases at certain positions which could have happened because of the over-trimming of the adapter sequence exposing the 5’ ends of the sequences to be more biased (Babraham, 2019). A lot of trimming tools were available to use like Cutadapt, FASTX quality trimmer and Trimmomatic, but the later was preferred in this analysis due to the ease of access on Windows platform and can work on paired-end files (Del Fabbro et al., 2013). In general, all the files showed good quality after the trimming process to go further down the analysis. However, vigorous quality trimming could have an impact on the accuracy of the end-analysis (Williams et al., 2016).

The correlation check process did not add much value to the process of anticipating the origin of the mutated and unmutated CLL subtypes. Both the CD27dull and CD27bright MBCs had higher correlation with the mutated CLL subtype, so no solid conclusion was reached. It is also possible that the unmutated CLL subtype has a different origin. One other possibility is that these cells become transformed in the spleen since that is a big reservoir for B cells and unmutated MBCs can be found there. People born without the spleen lack the MBCs with none or few mutations (Aranburu et al., 2016).

Rusbread package was used for the alignment as it pays attention to the splice junctions during the mapping process which serves the nature of the RNA data being analyzed that is susceptible to splicing (Liao et al., 2019). It also provides results as R-friendly files. After the two alignments, 58676 genes were listed in the generated count files of both; the healthy individuals and CLL-patients despite the fact of the presence of only 20465 human protein coding genes (Ensembl2). This occurred because the is not only limited to the protein coding DNA which forms only a small

19 part of the genome, but also including other functional and non-functional parts (Table 4). Non- coding DNA including; small, long and miscellaneous non-coding DNA form a larger portion than the coding DNA (Table 4). Additionally, a lot of the genes that were included in the count files had a gene count equal to zero, so filtering step was included where only the genes with a total count over 10 were included. The number of the genes went down to 28852 genes for the healthy dataset and 29458 genes for the CLL dataset.

Table 4. Showing human gene counts for primary and alternative assembly for different types of DNA according to Genome Reference Consortium Human Build 38.

DNA types/Counts Primary assembly Alternative assembly

Coding DNA 20465 2960

Non-coding DNA 22229 1429

Small non-coding DNA 4871 278

Long non-coding DNA 15137 974

Misc non-coding DNA 2221 177

Pseudogenes 15171 1754

Gene transcripts 208689 20661

Using all the genes to create a PCA plot to try to separate between all the groups did not work as only separation between the healthy B-cell subsets and the CLL subtypes was observed (Figure 6A). The negative binomial model was favored over the Poisson distribution as the count data showed over- dispersion (Gardner et al., 1995). As a result, the negative binomial regression model was applied to the data through the DESeq2 package when analyzing for DEGs (Love1 et al., 2014). There were different packages available for differential expression analysis like DESeq2, Limma and edgeR packages which differ from each other in the applied normalization methods. Limma and edgeR packages use Trimmed Mean of M-values (TMM) normalization while DESeq2 uses geometric normalization (Seyednasrollah et al., 2013). Limma package follows the classic linear regression unlike edgeR and DESeq2 which follow negative binomial distribution (Seyednasrollah et al., 2013). Since DESeq2 package is the best in minimizing the FDR, hence more trustworthy results, it was chosen for the analysis. Although using different methods is applicable, but would give slightly different results. DESeq2 package also showed good consistency regarding the results and false positive rate regardless the number of replicates (Schurch et al., 2016). Additionally, rlog-transformation (regularized- logarithm transformation) was applied to the data before creating the PCA plots to minimize the dependence of the variance on the mean counts and get less biased results (Love2 et al., 2014)

To improve the separation, the 113 DEGs between the CLL subtypes were used where a better separation was achieved between the CLL subtypes as expected and the mutated CLL subtype clustered more closer to the healthy B-cell subsets (Figure 6B). This could mean the possibility that the mutated CLL subtype would behave more like the healthy individuals which in turn also could justify the better prognosis and lower risk of the mutated CLL subtype because it follows the healthy expression profile more (Hamblin et al., 1999). This was also shown on the heatmap where the five patients of the mutated CLL subtype clustered with the healthy individuals on the dendrogram (Figure 6C). To get more insight about the genes that were highly expressed and of high importance out of the 113 DEGs between the CLL subtypes, a volcano plot was used (Figure 6D). A list of 51 genes that

20 showed one and half times the normal expression was created and two unmutated CLL-patients clustered with the mutated CLL-patients and the healthy individuals on the dendrogram of the heatmap (Figure 6F). Those two unmutated CLL-patients had a mutation percentage of 2% and 0.7%. However, they clustered with the mutated CLL-patients and this may raise the question whether the 2% mutation cut off is accurate enough or need to be further validated.

When the process was reversed and the 1738 DEGs between the CD27dull and CD27bright MBCs instead of the other genes, the clustering was reversed on the PCA and heatmap plots where a clear separation between the healthy B-cell subsets was shown and no clear separation between the CLL subtypes (Figure 7A & 7B). Although using the genes that only showed one and half times the normal expression from the 1738 DEGs through a volcano plot, a slightly better clustering was generated on the PCA plot (Figure 7C & 7D). On the heatmap generated using the 32 high/low log2 (fold-change) genes out of the 1738 DEGs, the CD27bright MBCs clustered with the CLL dataset on the dendrogram (Figure 7E). This clustering behavior could mean the possibility that the CD27bright MBCs could behave more like the CLL dataset which in turn could mean that those cells are more related to the origin of CLL.

Running the random forest for all the filtered genes was computationally-problematic, so the number of the genes had to be reduced. When the random forest was run multiple times, four shared decision- making genes between the different forests were extracted. SWAP70 gene was one of the important genes and its expressed protein is called Switching-associated protein 70 (UniProt Consortium1). SWAP70 protein is a human protein that is expressed in mature naïve B cells and can be found in monocytes and macrophages too. It is believed to play an important role in the signaling in B cell activation and isotype class switching (UniProt Consortium1).

The second important gene was ZNF471 that corresponds to a human protein called Zinc finger protein 471 (UniProt Consortium2). This protein is believed to be affiliated with DNA binding and the regulation of the transcription process (UniProt Consortium2; Cao et al., 2018). A recent study proved that this exact protein is associated with gastric cancer and to be a potential cause of the cancer progression (Cao et al., 2018). Another type of ZNF was affiliated with another type of cancer where ZNF185 gene played a significant role in causing prostate cancer when its expression was suppressed (Vanaja et al., 2003). Promoter hypermethylation was a common cause to suppress and silence this gene (Vanaja et al., 2003; Cao et al., 2018). This can confirm that ZNF gene plays a critical part in causing different types of cancer. This gene was also found to be one of the 51 high/low log2 (fold-change) DEGs between the mutated and unmutated CLL subtypes (Figure 8B).

The third important gene was GPSM3 and its expressed protein is called G-protein-signaling modulator 3 (UniProt Consortium3). This protein is involved in inflammatory responses and its regulation through modulating the production of cytokines and leukocyte chemotaxis (UniProt Consortium3). It is expressed in heart, placenta and liver. It also controls GTPase enzyme activity which is a hydrolase enzyme that is important for the biosynthesis of proteins and signal transduction. It also exhibits an important role in the regulation of immune responses and can cause some immune conflicts which in turn can manifest through autoimmune diseases (Giguère et al., 2013).

The fourth gene was AP4B1 and its expressed protein is called AP-4 complex subunit beta-1 (UniProt Consortium4). This protein is found on Golgi apparatus where it manages the process of transporting other proteins (Dell’Angelica et al., 1999). It forms a complex that consists of four chains; two large subunits called epsilon and beta subunits, a medium one called mu-subunit and a small one called sigma subunit. This complex serves an important part in the development of the brain as some defects in this complex were proven to cause intellectual disabilities and character changes (Jamra et al., 2011).

IGHA2 gene was also found to be important as it was shared between the 51 high/low log2 (fold- change) DEGs between the mutated and unmutated CLL subtypes and the 32 high/low log2 (fold- change) DEGs between the CD27dull and CD27bright MBCs. Its expressed protein is IgA2, thus the constant

21 part of Ig heavy chain A2 switched MBC. It forms antigen binding site on the immunoglobulin and control the general isotype of the antibody (UniProt Consortium5). It is also one of the most abundant secreted antibodies in the body especially at mucosal surfaces (Schroeder Jr and Cavacini, 2010). It was also found as a potential biomarker for breast cancer and its metastasis in the lymph nodes (Kang et al., 2012).

SWAP70 was considered as the most important of them all as it showed the best separation between all the groups when an individual count plot was created (Figure 9). Hence, the SNPs were checked for this gene in all the individuals using the IGV-browser (Table 3). Two of these observed SNPs; R230R and Q505E were previously recorded as natural variants (UniProt Consortium1). Three of the SNPs were found on the exons which are the protein coding parts. R230R, N348N and Q505E mutations on exons 5, 7 and 10, respectively, were found to be already-proven mutations (Rapalus et al., 2001). R230R is a mutation where a (A) nucleotide was substituted with a (C) nucleotide giving a different codon for the same amino acid, Arginine, at position 230 of the polypeptide chain of SWAP70 protein. N348N is a mutation where a (C) nucleotide was substituted with a (T) nucleotide giving a different codon for the same amino acid, Asparagine, at position 348 of the polypeptide chain of SWAP70 protein. Q505E is a mutation where a (C) nucleotide was substituted with a (G) nucleotide resulting in switching the amino acid at position 505 in the polypeptide chain from Glutamine to Glutamate. Both amino acids are polar, but Glutamate is a negatively-charged amino acid which would give it the edge to interact better with positive cations like Zinc (Betts and Russell, 2003). It was also proven that this mutation did not have major effects in the pathogenesis of diseases (Rapalus et al., 2001).

Figure 9. Showing SWAP70 gene counts for the five groups.

In conclusion, the five previously-mentioned genes can be considered potential biomarkers for CLL and some of them could be further investigated in the pathogenesis of the two subtypes of CLL. The expression levels of those genes could also be used to differentiate between the CLL subtypes. The mutated CLL-patients showed very similar expression profiles to those of the healthy individuals based on the 113 DEGs between the mutated and unmutated CLL subtypes, whereas the CD27bright MBCs showed a similar expression profile to the CLL-patients. Those similarities in expression profiles could explain why the mutated CLL subtype has a better prognosis. Two unmutated CLL-patients with 0.3 and 2% mutations were actually separated from other unmutated CLL-patients, thus indicating these had been to the GC. Additionally, the availability of multiple tools to analyze RNA-sequencing data is increasing every day, so choosing one optimal tool for the analysis could not be achieved. Every tool has its own advantages and disadvantages and would give slightly different results from each other, so cross-validation with different tools is a good approach to obtain more trust-worthy results (Yendrek et al., 2012).

22

Ethical aspects and impact on the society

RNA-sequencing data files of four healthy individuals were accessed from a previous experiment conducted by the supervisor and approved by Bambino Gesu Children Hospital, Rome, Italy. The samples were collected anonymously and the individuals were informed about the flow of the experiment and how important it is for the development of new ways to diagnose and treat CLL. This can lead to early diagnosis of the disease and accurate stratification of its subtype which would result in using the accurate treatment protocols for a faster and more personalized treatment. This allows the application of the concept of precision medicine which is an individualized treatment of the disease based on the genetic composition of that specific individual allowing the treatment to be more efficient and for shorter duration. No personal information from the individuals was obtained and no informed consent was given. RNA-sequencing data files of CLL-patients were downloaded from Gene Expression Omnibus public database (GEO) with accession number GSE66228 to be compared against the data files of healthy individuals (Blachly et al., 2015). The data files of CLL-patients could also be downloaded from Sequence Read Archive public database (SRA) with accession number SRP055444. Since the data was publicly available, no ethical approval was required. However, when the experiment was conducted, the samples were collected from the CLL-patients registered in clinical trials at The Ohio State University (OSU) James Comprehensive Cancer Center after signing the informed consent and applying Identity Registration Protocol (IRP). Also, one sample was collected from one patient at OSU along with some patients registered in a CLL Research Consortium (CRC) research protocol after signing the CRC informed consent (Hertlein et al., 2013). Also, no personal information from the source of the samples was collected for this study as it was not needed for conducting the analysis.

23

Future perspectives

The results using this dataset of CLL-patients should be validated with another dataset of CLL-patients, then, the resulted genes can be validated by using qPCR to confirm the differential expression between the groups and the role of the resulted genes in the process of diagnosing and sorting the subtypes of CLL (Everaert et al., 2017). The raw RNA-sequencing data can also be analyzed using other statistical tools to check for DEGs and compare the results with the previous work and select the genes that are common between the different tools (Love et al., 2015; Yendrek et al., 2012). More focus and experimenting should be directed to elicit the exact role of the Zinc-finger proteins in causing cancers and specifically studying its epigenetic changes, why it occurs and how to suppress those changes (Vanaja et al., 2003).

Analysis of the splicing variants of the rest of the DEGs could be of importance and show more insight about the CLL pathogenesis. The mutation percentage cut off of 2% to classify CLL-patients into mutated and unmutated subtypes should be further investigated. Sorting splenic B cell subsets both; naïve and MBCs and comparing the transcriptome of those with the unmutated CLL subtype is another valid option to check the origin of the unmutated CLL subtype.

Also, the ability to distinguish between protein coding RNAs and non-coding ones is of critical importance. It is not an easy mission to accomplish. However, more attention needs to be directed towards it as the non-coding RNAs constitute the majority of RNAs and play important role in gene and transcription regulation which can be even more important than the coding RNAs themselves (Ferreira and Esteller, 2018; Mercer et al., 2009; Eddy, 2001). The regulation process can be through small non- coding transcripts that can block some mRNAs by being complementary to the mRNA sequence and therefore no production of that specific protein such as Xist (X-inactive specific transcript) RNA (Mercer, 2009). It can be also through inducing certain processes or functioning as RNA only without the need to express a protein such as p53 mRNA (Eddy, 2001; Mercer, 2009). This can establish a new era of personalized medicine based on the genetic fingerprint of each patient.

24

Acknowledgements

This project was conducted as a collaboration work between the University of Skövde and the University of Gothenburg under the supervision of Dr. Ola Grimsholm at University of Gothenburg, Department of Rheumatology and Inflammation Research, Andreas Tilevik, PhD, at University of Skövde, School of Bioscience and Magnus Fagerlind, PhD, at University of Skövde, School of Bioscience. Thanks to the Swedish Institute scholarship.

25

References

Abbas, A.K., Lichtman, A.H. and Pillai, S., 1994. Cellular and molecular immunology. Elsevier Health Sciences.

Abbas, A.K., Lichtman, A.H. and Pillai, S., 2014. Cellular and molecular immunology E-book. Elsevier Health Sciences.

Abbas, A.K., Lichtman, A.H. and Pillai, S., 2017. Cellular and Molecular Immunology E-Book. Elsevier Health Sciences.

Agematsu, K., 2000. lnvited Review Memory B cells and CD27. Histology and Histopathology, 15, pp.573-576.

Agematsu, K., Hokibara, S., Nagumo, H. and Komiyama, A., 2000. CD27: a memory B-cell marker. Immunology today, 21(5), pp.204-206.

Agematsu, K., Nagumo, H., Oguchi, Y., Nakazawa, T., Fukushima, K., Yasui, K., Ito, S., Kobata, T., Morimoto, C. and Komiyama, A., 1998. Generation of plasma cells from peripheral blood memory B cells: synergistic effect of interleukin-10 and CD27/CD70 interaction. Blood, 91(1), pp.173-180.

Andrews, S., 2010. FastQC: a quality control tool for high throughput sequence data. Available at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [collected 2018.12.20]

Aranburu, A., Piano Mortari, E., Baban, A., Giorda, E., Cascioli, S., Marcellini, V., Scarsella, M., Ceccarelli, S., Corbelli, S., Cantarutti, N. and De Vito, R., 2017. Human B‐cell memory is shaped by age‐and tissue‐ specific T‐independent and GC‐dependent events. European journal of immunology, 47(2), pp.327- 344.

Babraham Bioinformatics Institute. Available at: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/4%20 Per%20Base%20Sequence%20Content.html [collected 2019.05.15].

Baxter, D., 2007. Active and passive immunity, vaccine types, excipients and licensing. Occupational Medicine, 57(8), pp.552-556.

Beagley, K.W. and Gockel, C.M., 2003. Regulation of innate and adaptive immunity by the female sex hormones oestradiol and progesterone. Federation of European Microbiological Societies, Immunology & Medical Microbiology, 38(1), pp.13-22.

Betts, M.J. and Russell, R.B., 2003. Amino acid properties and consequences of substitutions. Bioinformatics for geneticists, pp.289-316.

Blachly, J.S., Ruppert, A.S., Zhao, W., Long, S., Flynn, J., Flinn, I., Jones, J., Maddocks, K., Andritsos, L., Ghia, E.M. and Rassenti, L.Z., 2015. Immunoglobulin transcript sequence and somatic hypermutation computation from unselected RNA-seq reads in chronic lymphocytic leukemia. Proceedings of the National Academy of Sciences, 112(14), pp.4322-4327.

Bolger, A.M., Lohse, M. and Usadel, B., 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), pp.2114-2120.

Cao, L., Wang, S., Zhang, Y., Wong, K.C., Nakatsu, G., Wang, X., Wong, S., Ji, J. and Yu, J., 2018. Zinc- finger protein 471 suppresses gastric cancer through transcriptionally repressing downstream oncogenic PLS3 and TFAP2A. Oncogene, p.1.

26

Carruthers, V.B., Cotter, P.A. and Kumamoto, C.A., 2007. Microbial pathogenesis: mechanisms of infectious disease. Cell host & microbe, 2(4), pp.214-219.

CLL Trialists' CollaborativeGroup, 1999. Chemotherapeutic options in chronic lymphocytic leukemia: a meta-analysis of the randomized trials. Journal of the National Cancer Institute, 91(10), pp.861-868.

Cock, P.J., Fields, C.J., Goto, N., Heuer, M.L. and Rice, P.M., 2009. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic acids research, 38(6), pp.1767-1771.

Colonna-Romano, G., Aquino, A., Bulati, M., Lorenzo, G.D., Listì, F., Vitello, S., Lio, D., Candore, G., Clesi, G. and Caruso, C., 2006. Memory B cell subpopulations in the aged. Rejuvenation research, 9(1), pp.149-152.

Del Fabbro, C., Scalabrin, S., Morgante, M. and Giorgi, F.M., 2013. An extensive evaluation of read trimming effects on Illumina NGS data analysis. PloS one, 8(12), p.e85024.

Dell’Angelica, E.C., Mullins, C. and Bonifacino, J.S., 1999. AP-4, a novel protein complex related to clathrin adaptors. Journal of biological chemistry, 274(11), pp.7278-7285.

Dighiero, G., Maloum, K., Desablens, B., Cazin, B., Navarro, M., Leblay, R., Leporrier, M., Jaubert, J., Lepeu, G., Dreyfus, B. and Binet, J.L., 1998. Chlorambucil in indolent chronic lymphocytic leukemia. New England Journal of Medicine, 338(21), pp.1506-1514.

Dohner, H., Stilgenbauer, S., Benner, A., Leupolt, E., Krober, A., Bullinger, L., Dohner, K., Bentz, M. and Lichter, P., 2000. Genomic aberrations and survival in chronic lymphocytic leukemia. New England Journal of Medicine, 343(26), pp.1910-1916.

Eddy, S.R., 2001. Non–coding RNA genes and the modern RNA world. Nature Reviews Genetics, 2(12), p.919.

Ensembl release 96, EMBL-EBI. Available at: https://www.ensembl.org/index.html [collected 2019.05.15].

Ensembl1 SWAP70 gene (ENSG00000133789) – Ensembl release 96, EMBL-EBI. Available at: http://www.ensembl.org/Homo_sapiens/Gene/ExpressionAtlas?g=ENSG00000133789;r=11:9664077 -9752993 [collected 2019.05.05].

Ensembl2 (Genome Reference Consortium Human Build 38). Available at: http://www.ensembl.org/Homo_sapiens/Location/Genome?r=Y:1-1000 [collected 2019.05.16].

Everaert, C., Luypaert, M., Maag, J.L., Cheng, Q.X., Dinger, M.E., Hellemans, J. and Mestdagh, P., 2017. Benchmarking of RNA-sequencing analysis workflows using whole-transcriptome RT-qPCR expression data. Scientific reports, 7(1), p.1559.

Ferreira, H.J. and Esteller, M., 2018. Non-coding RNAs, epigenetics, and cancer: tying it all together. Cancer and Metastasis Reviews, 37(1), pp.55-73.

Furman, R.R., Sharman, J.P., Coutre, S.E., Cheson, B.D., Pagel, J.M., Hillmen, P., Barrientos, J.C., Zelenetz, A.D., Kipps, T.J., Flinn, I. and Ghia, P., 2014. Idelalisib and rituximab in relapsed chronic lymphocytic leukemia. New England Journal of Medicine, 370(11), pp.997-1007.

Gangemi, S., Allegra, A. and Musolino, C., 2015. Lymphoproliferative disease and cancer among patients with common variable immunodeficiency. Leukemia research, 39(4), pp.389-396.

27

Gardner, W., Mulvey, E.P. and Shaw, E.C., 1995. Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological bulletin, 118(3), p.392.

Giannasca, P.J. and Warny, M., 2004. Active and passive immunization against Clostridium difficile diarrhea and colitis. Vaccine, 22(7), pp.848-856.

Giguère, P.M., Billard, M.J., Laroche, G., Buckley, B.K., Timoshchenko, R.G., McGinnis, M.W., Esserman, D., Foreman, O., Liu, P., Siderovski, D.P. and Tarrant, T.K., 2013. G-protein signaling modulator-3, a gene linked to autoimmune diseases, regulates monocyte function and its deficiency protects from inflammatory arthritis. Molecular immunology, 54(2), pp.193-198.

Goidl, E.A., Paul, W.E., Siskind, G.W. and Benacerraf, B., 1968. The effect of antigen dose and time after immunization on the amount and affinity of anti-hapten antibody. The Journal of Immunology, 100(2), pp.371-375.

Goodwin, K., Viboud, C. and Simonsen, L., 2006. Antibody response to influenza vaccination in the elderly: a quantitative review. Vaccine, 24(8), pp.1159-1169.

Graves, D.T. and Kayal, R.A., 2008. Diabetic complications and dysregulated innate immunity. Frontiers in bioscience: a journal and virtual library, 13, p.1227.

Guikema, J.E., Hovenga, S., Vellenga, E., Conradie, J.J., Abdulahad, W.H., Bekkema, R., Smit, J.W., Zhan, F., Shaughnessy Jr, J. and Bos, N.A., 2003. CD27 is heterogeneously expressed in multiple myeloma: low CD27 expression in patients with high‐risk disease. British journal of haematology, 121(1), pp.36- 43.

Guo, Y., Ye, F., Sheng, Q., Clark, T. and Samuels, D.C., 2013. Three-stage quality control strategies for DNA re-sequencing data. Briefings in bioinformatics, 15(6), pp.879-889.

Hainsworth, J.D., Litchy, S., Barton, J.H., Houston, G.A., Hermann, R.C., Bradof, J.E. and Greco, F.A., 2003. Single-agent rituximab as first-line and maintenance treatment for patients with chronic lymphocytic leukemia or small lymphocytic lymphoma: a phase II trial of the Minnie Pearl Cancer Research Network. Journal of Clinical Oncology, 21(9), pp.1746-1751.

Hallek, M., Cheson, B.D., Catovsky, D., Caligaris-Cappio, F., Dighiero, G., Döhner, H., Hillmen, P., Keating, M.J., Montserrat, E., Rai, K.R. and Kipps, T.J., 2008. Guidelines for the diagnosis and treatment of chronic lymphocytic leukemia: a report from the International Workshop on Chronic Lymphocytic Leukemia updating the National Cancer Institute–Working Group 1996 guidelines. Blood, 111(12), pp.5446-5456.

Hamblin, T.J., Davis, Z., Gardiner, A., Oscier, D.G. and Stevenson, F.K., 1999. Unmutated Ig VH genes are associated with a more aggressive form of chronic lymphocytic leukemia. Blood, 94(6), pp.1848- 1854.

Hansen, K.D., Brenner, S.E. and Dudoit, S., 2010. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic acids research, 38(12), pp.e131-e131.

Hauke, J. and Kossowski, T., 2011. Comparison of values of Pearson's and Spearman's correlation coefficients on the same sets of data. Quaestiones geographicae, 30(2), pp.87-93.

Hertlein, E., Beckwith, K.A., Lozanski, G., Chen, T.L., Towns, W.H., Johnson, A.J., Lehman, A., Ruppert, A.S., Bolon, B., Andritsos, L. and Lozanski, A., 2013. Characterization of a new chronic lymphocytic leukemia cell line for mechanistic in vitro and in vivo studies relevant to disease. PloS one, 8(10), p.e76607.

28

Hoebe, K., Janssen, E. and Beutler, B., 2004. The interface between innate and adaptive immunity. Nature immunology, 5(10), p.971.

Jamra, R.A., Philippe, O., Raas-Rothschild, A., Eck, S.H., Graf, E., Buchert, R., Borck, G., Ekici, A., Brockschmidt, F.F., Nöthen, M.M. and Munnich, A., 2011. Adaptor protein complex 4 deficiency causes severe autosomal-recessive intellectual disability, progressive spastic paraplegia, shy character, and short stature. The American Journal of Human Genetics, 88(6), pp.788-795.

Janewaya CA Jr, Travers P, Walport M, et al. Immunobiology: The Immune System in Health and Disease. 5th edition. New York: Garland Science; 2001. Chapter 8, T Cell-Mediated Immunity. Available at: https://www.ncbi.nlm.nih.gov/books/NBK10762/ [collected 2018.12.18].

Janewayb Jr, C.A., Travers, P., Walport, M. and Shlomchik, M.J., 2001. Principles of innate and adaptive immunity.

Janewayc Jr, C.A., Travers, P., Walport, M. and Shlomchik, M.J., 2001. Structural variation in immunoglobulin constant regions.

Kang, S., Maeng, H., Kim, B.G., Qing, G.M., Choi, Y.P., Kim, H.Y., Kim, P.S., Kim, Y., Kim, Y.H., Choi, Y.D. and Cho, N.H., 2012. In situ identification and localization of IGHA2 in the breast tumor microenvironment by mass spectrometry. Journal of proteome research, 11(9), pp.4567-4574.

Klein, U., Rajewsky, K. and Küppers, R., 1998. Human immunoglobulin (Ig) M+ IgD+ peripheral blood B cells expressing the CD27 cell surface antigen carry somatically mutated variable region genes: CD27 as a general marker for somatically mutated (memory) B cells. Journal of Experimental Medicine, 188(9), pp.1679-1689.

Klein, U., Tu, Y., Stolovitzky, G.A., Keller, J.L., Haddad, J., Miljkovic, V., Cattoretti, G., Califano, A. and Dalla-Favera, R., 2003. Transcriptional analysis of the B cell germinal center reaction. Proceedings of the National Academy of Sciences, 100(5), pp.2639-2644.

Kulis, M., Heath, S., Bibikova, M., Queirós, A.C., Navarro, A., Clot, G., Martínez-Trillos, A., Castellano, G., Brun-Heath, I., Pinyol, M. and Barberán-Soler, S., 2012. Epigenomic analysis detects widespread gene-body DNA hypomethylation in chronic lymphocytic leukemia. Nature genetics, 44(11), p.1236.

Li, X., Nair, A., Wang, S. and Wang, L., 2015. Quality control of RNA-seq experiments. In RNA Bioinformatics (pp. 137-146). Humana Press, New York, NY.

Liao, Y., Smyth, G.K. and Shi, W., 2013. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7), pp.923-930.

Liao, Y., Smyth, G.K. and Shi, W., 2018. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. bioRxiv, p.377762.

Liao, Y., Smyth, G.K. and Shi, W., 2019. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic acids research, 47(8), pp.e47-e47.

Liu, Z., Yuan, X., Luo, Y., He, Y., Jiang, Y., Chen, Z.K. and Sun, E., 2009. Evaluating the effects of immunosuppressants on human immunity using cytokine profiles of whole blood. Cytokine, 45(2), pp.141-147.

Love, M.I., Anders, S., Kim, V. and Huber, W., 2015. RNA-Seq workflow: gene-level exploratory analysis and differential expression. F1000Research, 4.

29

Love1, M., Anders, S. and Huber, W., 2014. Differential analysis of count data–the DESeq2 package. Genome Biol, 15(550), pp.10-1186.

Love2, M.I., Huber, W. and Anders, S., 2014. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology, 15(12), p.550.

Mamanova, L., Andrews, R.M., James, K.D., Sheridan, E.M., Ellis, P.D., Langford, C.F., Ost, T.W., Collins, J.E. and Turner, D.J., 2010. FRT-seq: amplification-free, strand-specific transcriptome sequencing. Nature methods, 7(2), p.130.

Market, E. and Papavasiliou, F.N., 2003. V (D) J recombination and the evolution of the adaptive immune system. PLoS biology, 1(1), p.e16.

Maul, R.W. and Gearhart, P.J., 2010. AID and somatic hypermutation. In Advances in immunology (Vol. 105, pp. 159-191). Academic Press.

Mercer, T.R., Dinger, M.E. and Mattick, J.S., 2009. Long non-coding RNAs: insights into functions. Nature reviews genetics, 10(3), p.155.

Mohanan, D., Slütter, B., Henriksen-Lacey, M., Jiskoot, W., Bouwstra, J.A., Perrie, Y., Kündig, T.M., Gander, B. and Johansen, P., 2010. Administration routes affect the quality of immune responses: a cross-sectional evaluation of particulate antigen-delivery systems. Journal of Controlled Release, 147(3), pp.342-349.

Mosmann, T.R. and Sad, S., 1996. The expanding universe of T-cell subsets: Th1, Th2 and more. Immunology today, 17(3), pp.138-146.

Paavonen, T., Andersson, L.C. and Adlercreutz, H.E.R.M.A.N., 1981. Sex hormone regulation of in vitro immune response. Estradiol enhances human B cell maturation via inhibition of suppressor T cells in pokeweed mitogen-stimulated cultures. Journal of Experimental Medicine, 154(6), pp.1935-1945.

Parikh, S.A., 2018. Chronic lymphocytic leukemia treatment algorithm 2018. Blood cancer journal, 8(10), p.93.

Parish, C.R., 1972. The relationship between humoral and cell‐mediated immunity. Immunological Reviews, 13(1), pp.35-66.

Pawelec, G., 2006. Immunity and ageing in man. Experimental gerontology, 41(12), pp.1239-1242.

Peltola, H., Heinonen, O.P., Valle, M., Paunio, M., Virtanen, M., Karanko, V. and Cantell, K., 1994. The elimination of indigenous measles, mumps, and rubella from Finland by a 12-year, two-dose vaccination program. New England journal of medicine, 331(21), pp.1397-1402.

Racine, J., 2000. The Cygwin tools: a GNU toolkit for Windows. Journal of Applied Econometrics, 15(3), pp.331-341.

Rapalus, L., Minegishi, Y., Lavoie, A., Cunningham-Rundles, C. and Conley, M.E., 2001. Analysis of SWAP-70 as a candidate gene for non-X-linked hyper IgM syndrome and common variable immunodeficiency. Clinical Immunology, 101(3), pp.270-275.

Rawstron, A.C., Fazi, C., Agathangelidis, A., Villamor, N., Letestu, R., Nomdedeu, J., Palacio, C., Stehlikova, O., Kreuzer, K.A., Liptrot, S. and O'brien, D., 2016. A complementary role of multiparameter flow cytometry and high-throughput sequencing for minimal residual disease detection in chronic lymphocytic leukemia: an European Research Initiative on CLL study. Leukemia, 30(4), p.929.

30

Robinson, J.T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E.S., Getz, G. and Mesirov, J.P., 2011. Integrative genomics viewer. Nature biotechnology, 29(1), p.24.

Rodríguez-Vicente, A.E., Bikos, V., Hernández-Sánchez, M., Malcikova, J., Hernández-Rivas, J.M. and Pospisilova, S., 2017. Next-generation sequencing in chronic lymphocytic leukemia: recent findings and new horizons. Oncotarget, 8(41), p.71234.

Schroeder Jr, H.W. and Cavacini, L., 2010. Structure and function of immunoglobulins. Journal of Allergy and Clinical Immunology, 125(2), pp.S41-S52.

Schurch, N.J., Schofield, P., Gierliński, M., Cole, C., Sherstnev, A., Singh, V., Wrobel, N., Gharbi, K., Simpson, G.G., Owen-Hughes, T. and Blaxter, M., 2016. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? Rna, 22(6), pp.839-851.

Seyednasrollah, F., Laiho, A. and Elo, L.L., 2013. Comparison of software packages for detecting differential expression in RNA-seq studies. Briefings in bioinformatics, 16(1), pp.59-70.

Stavnezer, J. and Schrader, C.E., 2014. IgH chain class switch recombination: mechanism and regulation. The Journal of Immunology, 193(11), pp.5370-5378.

UniProt Consortium1 SWAP70 – UniProt Protein Knowledgebase. Available at: https://www.uniprot.org/uniprot/Q9UH65 [collected 2019.05.16].

UniProt Consortium2 ZNF471 – UniProt Protein Knowledgebase. Available at: https://www.uniprot.org/uniprot/Q9BX82 [collected 2019.05.16].

UniProt Consortium3 GPSM3 – UniProt Protein Knowledgebase. Available at: https://www.uniprot.org/uniprot/Q9Y4H4 [collected 2019.05.16].

UniProt Consortium4 AP4B1 – UniProt Protein Knowledgebase. Available at: https://www.uniprot.org/uniprot/Q9Y6B7 [collected 2019.05.16].

UniProt Consortium5 IGHA2 – UniProt Protein Knowledgebase. Available at: https://www.uniprot.org/uniprot/P01877 [collected 2019.05.16].

Wallner, F.K., Hultquist Hopkins, M., Woodworth, N., Lindvall Bark, T., Olofsson, P. and Tilevik, A., 2018. Correlation and cluster analysis of immunomodulatory drugs based on cytokine profiles. Pharmacological Research, 128, pp.244-251.

Van Oers, M.H., Pals, S.T., Evers, L.M., Van der Schoot, C.E., Koopman, G., Bonfrer, J.M., Hintzen, R.Q., Von dem Borne, A.E. and Van Lier, R.A., 1993. Expression and release of CD27 in human B-cell malignancies. Blood, 82(11), pp.3430-3436.

Vanaja, D.K., Cheville, J.C., Iturria, S.J. and Young, C.Y., 2003. Transcriptional silencing of zinc finger protein 185 identified by expression profiling is associated with prostate cancer progression. Cancer research, 63(14), pp.3877-3882.

Welliver, R.C., Kaul, T.N., Putnam, T.I., Sun, M., Riddlesberger, K. and Ogra, P.L., 1980. The antibody response to primary and secondary infection with respiratory syncytial virus: kinetics of class-specific responses. The Journal of pediatrics, 96(5), pp.808-813.

Williams, C.R., Baccarella, A., Parrish, J.Z. and Kim, C.C., 2016. Trimming of sequence reads alters RNA- Seq gene expression estimates. BMC bioinformatics, 17(1), p.103.

31

Wollenberg, I., Agua-Doce, A., Hernández, A., Almeida, C., Oliveira, V.G., Faro, J. and Graca, L., 2011. Regulation of the germinal center reaction by Foxp3+ follicular regulatory T cells. The Journal of Immunology, p.1101328.

Ye, J., Ma, N., Madden, T.L. and Ostell, J.M., 2013. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic acids research, 41(W1), pp.W34-W40.

Yendrek, C.R., Ainsworth, E.A. and Thimmapuram, J., 2012. The bench scientist's guide to statistical analysis of RNA-Seq data. BMC research notes, 5(1), p.506.

32

Appendices

Appendix I

Illumina TruSeq universal adapter: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

Reverse complementary of Illumina TruSeq universal adapter: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

33