A Thesis

entitled

Identifying common genes from Rheumatoid Arthritis, Systemic Lupus, Multiple Sclerosis and Sjögren’s Syndrome by pooling existing microarray data.

by

Eric Haynes

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the

Master of Science Degree in Biomedical Sciences

______Dr. Sadik Khuder, Committee Chair

______Dr. Alexei Fedorov, Committee Member

______Dr. Robert Trumbly, Committee Member

______Dr. Patricia R. Komuniecki, Dean College of Graduate Studies

The University of Toledo

June 28th, 2013

Copyright 2013, Eric E. Haynes

This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.

An Abstract of

Identifying common genes from Rheumatoid Arthritis, Systemic Lupus, Multiple Sclerosis and Sjögren’s Syndrome by pooling existing microarray data.

by

Eric Haynes

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science Degree in Biomedical Sciences

The University of Toledo

June 28th, 2013

Our study design examined whether blood-based expression profiles of normal individuals contrast substantially enough with those of patients with four autoimmune diseases: Rheumatoid Arthritis (RA), Systemic Lupus Erythematosus (SLE), Multiple Sclerosis (MS) and Sjögren’s Syndrome (PSS). The goal was to obtain common differentially expressed genes for each of these diseases using Affymetrix microarray data obtained from the GEO and ArrayExpress repositories and analyzed in the R programming environment.

These gene sets enabled us to identify ten common genes as possible biomarkers to correctly and adequately identify diseased patients. These genes have been associated with the Interferon pathway, interacting additionally as components of the ISGF3 complex, the Ubiquitin-like pathway (ISGylation system) and the Toll-Like Receptors (3, 7 and 9) in the cytoplasm. The ten genes are as follows: IFI44- Interferon-induced protein 44, IFI44L- Interferon-induced protein 44-like, IFIH1- Interferon induced with helicase C domain 1, IFIT1- Interferon induced protein with tetratricopeptide repeats, IRF7- Interferon regulatory factor 7, LAP3- Leucine aminopeptidase 3, RABGAP1L- RAB GTPase activating protein 1-like, RSAD2- Radical S-adenosyl methionine domain containing 2, STAT1- Signal transducer and activator of transcription 1, XAF1- XIAP (X-linked inhibitor of apoptosis) associated factor 1. Other known genes of the IFN pathway were also found; they include DDX58- DEAD box polypeptide 58 (Asp-Glu-Ala-Asp), ISG15- ISG15 ubiquitin-like modifier (IFIT5), ISG20- Interferon stimulated exonuclease gene 20kDa, OAS (2, 3)- 2'-5'-oligoadenylate synthetase 2/3 and USP18- Ubiquitin specific peptidase 18. It is our hope that a definitive biomarker, detecting RA, SLE, MS and PSS together and individually, can be introduced to diagnosticians; at this time, however, we caution that more samples need to be evaluated and the results proven reproducible in the research setting.

Acknowledgements

I would like to personally thank Dr. S. Khuder, for his time, dedication, patience, knowledge and understanding. I would like to also thank Dr. R. Trumbly, Dr. A. Fedorov and Dr. B. Lecka-Czernik for being committee members, my instructors and additionally mentoring me through this process.

Additional thanks to Jo Anne Gray, BIPG Department Secretary, and Michelle Arbogast at the Graduate School for helping me through these three years.

Thank you Arnab, Shuhao, and Ahmed, my University of Toledo lab mates, for the hours we have spent together studying concepts, discussing programming, and attending the Great Lakes Bioinformatics Conference.

I would like to express my gratitude to Dr. D. Stachowiak, my spouse, for her hard work, tolerance, insight, support and contributions that helped me in more ways than can be expressed here.

Finally, I am most indebted to my parents and brother, Jason for their love and their life-long support of my academic career.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures

The Introduction, Chapter 1
1.1 General Statement
1.2 Need for this Study
1.3 Objectives

The Methods, Chapter 2
2.1 Identification of Microarray Studies
2.2 Running Data in GEO2R
2.3 Downloading and Collecting Data
2.4 Merging by p-values

The Results, Chapter 3
3.1 Gene List Evolution

The Discussion, Chapter 4
4.1 Genes of the Interferon Pathway
4.2 Our Ten Common Shared Autoimmune Genes
4.3 Limitations

The Conclusion, Chapter 5

References

Appendices
A The Merge R Code, Appendix A
B Additional Tables from Gene Lists, Appendix B
C The Heatmap R Code, Appendix C
D Additional Resources and Affymetrix data/studies used
E Author Information/QR Code

List of Tables

Table 3.1 Ten Common Genes
Table 3.2 3-Way Merge
Table 3.3 3-Way Limited
Table 3.4 Exclusion Genes
Table 3.5 Shared Common/Consensus
Table 4.1 Cellular Location
Table 4.2 Top 10 per Disease
Appendix B.1.1 Gene Lists from Pairing
Appendix B.1.2 2-Way merge, MS-PSS, complete
Appendix B.1.3 3-Way merge of MS-RA-PSS, raw

List of Figures

Figure 3.1 Venn diagram
Figure 3.2 RA Heatmap
Figure 3.3 SLE Heatmap
Figure 3.4 Common Genes Heatmap
Figure 4.1 INF from Nature
Figure 4.2 25 Genes of the INF Network
Figure 4.3 Gene Network Interaction, 25 genes
Figure 4.4 Gene Network, 10 genes

Chapter 1

Introduction

1.1 General Statement

Autoimmune diseases result from a dysfunction of the immune system in which the body, responding to the presence of foreign entities, attacks its own organs, tissues, and cells, resulting in an immune response that leads to pathogenesis. These disorders range from well-known diseases such as Rheumatoid Arthritis (RA), Systemic Lupus Erythematosus (SLE), and Multiple Sclerosis (MS) to the less familiar Sjögren’s Syndrome (PSS). In the United States, they affect 3-8% of the population; estimates range from about 23.5 million to 50 million people because of problems with under-reporting and diagnosis of patients. (1) Together, these diseases afflict people from widely different social, racial, ethnic and economic groups and reduce their quality of life. (2) Of the 8% with some autoimmune disease, 78% are women, making this class of ailments the third most prevalent set of diseases in the United States behind cancer and heart disease. It is also among the top ten leading causes of death among U.S. women age 65 and under and the fourth largest cause of disability nationally. (3) The joint disorder RA affects over 1% of the Caucasian population (~2.1 million U.S. cases, including 30,000 to 50,000 children), leading to the progressive destruction of cartilage, bone deformities and severe disability. Among these cases, women are twice as likely as men to develop 80% of the joint limitations associated with function. (4) The genetic risk for developing RA is 50%, with 60% of U.S. patients carrying inherited epitopes that raise their risk for the disease 2-3 times. (5) The genetic associations shared by HLA genes explain 23% of the risk for RA through multiple genes, variants, biological pathways and/or cumulative effects, leaving many genes to be examined. (6)

Moreover, the number of people diagnosed with skin, joint, organ and brain tissue manifestations of SLE has increased from 240,000 in 1998 to 1.5 million Americans and 5 million people worldwide. [9] SLE development can begin as early as age 10 and as late as age 50, but generally begins between ages 15 and 44 for either sex. The majority of the women diagnosed (90%) are non-Caucasian, and they are 2-3 times more prone to developing SLE than their Caucasian female counterparts. The third disease, MS, causes destruction of the myelin surrounding nerve fibers, resulting in 400,000 U.S. cases with 10,000 new U.S. cases annually and 2.5 million cases worldwide. The treatment of MS costs about $2.5 billion, and the disease continues to increase disproportionately among women, ranking in the top 10 causes of death in every age group up to 64 years of age. The risk of developing MS in the United States is 1 in 750; familial genetics increases this risk to 1 in 40, almost a nineteen-fold increase. Like RA, MS is 2-3 times more likely to occur in women than in men, a difference believed to be caused by hormones; the risk is amplified in Caucasians of Northern European ethnicity and is a further 1.6 times higher for those who smoke. [15], [16] One of the most prevalent but lesser known autoimmune diseases is Sjögren’s Syndrome, which affects the moisture-producing glands of the body in 4 million people, 75% of them adults, usually 45 to 55 years old. [10] Surprisingly, women are nine times more likely to present symptoms than men, with approximately 50% of these PSS cases ending in organ damage or co-morbidity involving RA or SLE. (7)

1.2 Need for Study

Today, with the improved technology available to conduct research, we have the ability to identify the fundamental genetic, environmental, and infectious causes of these autoimmune diseases. While they are still poorly understood and it remains difficult to distinguish their dormant and active disease states, they are rarely studied together for a common causation or gene regulation. Each of these autoimmune diseases has been studied as a separate entity for specifics such as chemical signatures, exons, introns, SNPs and chromosomal locations. Much of the previous research on autoimmune diseases has focused on a limited number of genes (e.g., HLA, the Interleukin family (IL-n), and leukocyte-related genes for neutrophils, macrophages, B cells and T cells). It was our belief that we should pursue a broader scope, as many of these diseases share a common mechanism of induction and pathogenesis. One of the current problems revolves around the existence of differentially expressed genes in autoimmune diseases, many of them still unidentified or loosely understood. Our approach was to combine the diseases in one study and search for a common gene set that might lead to biomarker development. We hope this approach will gain merit with both practitioners and patients through the identification of these common genes. Such biomarkers could serve as a tool for definitive (positive) identification of diseased patients, accurate diagnostics, better prognostics with novel preventative and treatment methods, and finally an end to conflicting patient test results. (8)

Moreover, most of the public data and published research has been tissue-related (brain, glandular and other non-blood-based samples), which creates problems when comparing different cell or tissue sources from different individuals across studies. When microarray studies are conducted at the molecular level of disease progression, analysis becomes complicated by multi-cellular searches, even for expression levels in blood and tissue samples.

This wide-angle research approach could help to identify triggers for genetically susceptible individuals exposed to environmental and/or infectious agents, involving multiple and overlapping gene regions that contribute to the development of these autoimmune diseases. New technologies enable researchers to peer into the volumes of data in microarray studies, discovering new models, completing signaling pathways and building new algorithms. The technology and the data housed in public repositories provide great opportunities for pooling data, meta-analysis, multi-discipline approaches and more advanced comprehension. However, authors have submitted data in a wide variety of formats, making direct equivalent assessment impossible. For example, many of the studies we excluded from the model used chipsets from one of the three main manufacturers (Agilent, Illumina and Affymetrix). Each platform had its own plethora of older and subsequently different versions of microarray chipsets. This broad variation in formats complicates statistical analysis. This study was conducted using only the Affymetrix chipsets HGU133 plus 2 and HGU133A, to identify common genes among the four autoimmune diseases: RA, SLE, MS and PSS.

1.3 Objectives of Study

1. Identify common genes between two autoimmune diseases

A. RA and SLE

B. MS and PSS

C. RA and MS

D. SLE and PSS

E. RA and PSS

F. SLE and MS

2. Identify common genes between three diseases

A. RA-SLE-MS

B. RA-SLE-PSS

C. RA-MS-PSS

D. SLE-MS-PSS

3. Identify common genes between all four autoimmune diseases

A. RA-SLE-MS-PSS

4. Develop a predictive model using the differentially expressed genes for each disease.

5. Compare our list of differentially expressed genes to those published in individual studies.

6. Find a genomic biomarker (gene or genes) from the pathway and regulatory gene networks.
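The groupings named in objectives 1 to 3 can be enumerated mechanically. The following is a minimal Python sketch, offered only as an illustration and not as part of the analysis code used in this study; the function name disease_combinations is hypothetical.

```python
from itertools import combinations

# The four diseases in this study, abbreviated as in the text.
DISEASES = ["RA", "SLE", "MS", "PSS"]


def disease_combinations(diseases, size):
    """Return every unordered group of `size` diseases as a hyphen-joined label."""
    return ["-".join(group) for group in combinations(diseases, size)]


pairs = disease_combinations(DISEASES, 2)    # objective 1: 2-way groups
triples = disease_combinations(DISEASES, 3)  # objective 2: 3-way groups
quads = disease_combinations(DISEASES, 4)    # objective 3: the single 4-way group

print(len(pairs), pairs)
print(len(triples), triples)
print(len(quads), quads)
```

The counts (six pairs, four triples, one 4-way group) match the lists in objectives 1 to 3, although the labels may order the diseases differently than the text does.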


Chapter 2

Methods section

2.1 Identification of Microarray Studies

Our initial selection of study (series) data began with web searches, first at NCBI GEO DataSets and then cross-referenced with EBI’s ArrayExpress database. The aim was to obtain useful microarray studies for the four autoimmune diseases. We used different key search terms and retained only those studies that used human blood samples. The data were downloaded from either the NCBI or EBI sites using one of the following methods: .zip archives of .CEL files, individual Affymetrix .CEL files, or the Bioconductor package GEOquery. The intent was to encompass as large a group of these autoimmune diseases as possible, with human blood cell types specific to each disease, and to combine them with the changes from the effector and regulatory white blood cell components (leukocytes), regardless of the specific leukocyte types. It was assumed initially that the ratio for each disease would be different and yet have similarities as a result of the status of the disease (disease versus control), with attention paid to disease status: active vs. non-active vs. normal (control). The four autoimmune pathologies each alter the basal level of resident leukocytes, namely T-cells and B-cells, differently. Together with other sub-components, their expression could be evaluated for regulation in both disease and control samples; defining these changes would help in identifying genes as biomarkers.

2.2 Running Data in GEO2R

To get a better perspective of each study we utilized the “Analyze with GEO2R” web tool at NCBI; samples from EBI therefore could not be assessed this way. Our first glimpse of the potential gene candidates became clearer when we used GEO2R on the submitting authors’ data. After deciding which studies met our criteria, each series (e.g., GSE13732) was preliminarily assessed using GEO2R, which was newly developed at the time. This open source web tool allows user-defined groups to be identified and compared statistically across the various chipset data stored in the NCBI database. It reports the top 250 genes, providing gene information as well as p-values. GEO2R provides interaction with the R environment, differential expression results, additional data transformation capabilities and graphical visualization.

Our assumption was that shared common genes between the four diseases might exist but that their p-values would not likely be at the top of each list, with the exception of the Sjögren’s Syndrome data. The scope of our experiment was broader and more encompassing than GEO2R; because of these differences, GEO2R served as a guide and was not used for further gene list development. A GEO2R example from the SLE data from GSE10325 has been included in Appendix B.1.0 as the file GEO2R-13732.txt. After running sample data per disease, we abandoned GEO2R because samples > 256 could not be assessed, and we began exploring other packages in R for new methods of analysis.

2.3 Downloading and Collecting Data

We used the R package “GEOquery” to download the data. This package enabled downloading of the GEO data and the attached metadata directly from the NCBI portal, using its functions and classes. This method was applied to each of the NCBI series datasets. Since the two ArrayExpress files (E-MIMR341 and E-MEXP1242) were not stored in NCBI, they were downloaded manually. After the initial assessment search we discovered that most of our target data was stored in the Affymetrix chip platform, and most of it specifically in the HGU-133plus2 format. Some of the studies included both chipsets, which was rarely stated clearly; however, we used GEO2R to sort the data by chipset by defining them (generally selecting Affymetrix GPL570, 571 and not GPL97 or 96). The gene data had been stored in .CEL files: (12) Affymetrix HGU133plus2 and (3) Affymetrix HGU133A GEO DataSets. The volume of signal intensities used for heatmaps was determined by the Probeset_IDs: 54676 rows for each Affymetrix HGU133plus2 array and 22277 rows for each Affymetrix HGU133A array. This potential was multiplied by the 588 (GSM) samples, so our initial pooled data set had conceivably over 22 million data points. The number of data sets was broken down as follows: RA (2), SLE (6), MS (5) and PSS (1).
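The size of the pooled data set can be bounded from the figures quoted above. The short Python sketch below is only a worked arithmetic illustration; because the split of the 588 samples between the two chip types is not given here, it computes a lower bound (all samples on HGU133A) and an upper bound (all samples on HGU133plus2) rather than an exact total.

```python
# Probeset (row) counts and sample total as stated in the text above.
ROWS_HGU133PLUS2 = 54676
ROWS_HGU133A = 22277
N_SAMPLES = 588


def data_points(rows_per_chip, n_samples):
    """Number of signal intensities if every sample used a chip with this many rows."""
    return rows_per_chip * n_samples


# Bounds only: the per-chip sample split is an unknown in this sketch.
lower_bound = data_points(ROWS_HGU133A, N_SAMPLES)
upper_bound = data_points(ROWS_HGU133PLUS2, N_SAMPLES)

print(f"lower bound: {lower_bound:,}")
print(f"upper bound: {upper_bound:,}")
```

The estimate of over 22 million data points quoted in the text falls between these two bounds.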

We used the oneChannelGUI package for statistical analysis of the data. The GUI was able to run analysis on 54,676 Probeset_IDs for an HGU133plus2 chipset and 22,277 Probeset_IDs for an HGU133A chipset, conduct normalization and apply the cut-off p-value of 0.05, along with other features. We pre-processed and analyzed the Affymetrix chipset data using the statistical software R-2.15.2 [64-bit], R-3.0.1 [64-bit], R Studio 0.97.551, the R/Bioconductor packages GEOquery, limma, affylmGUI, oneChannelGUI, heatmap.plus and RColorBrewer, and Microsoft Excel. We began by running the 588 microarray samples from 14 GSE data series, containing the contrasts for diseased and control, through the GUI. The oneChannelGUI enabled us to import our data easily using the interface and to load our pre-designed target file, which identified our contrasts. Preprocessing of the data included background correction and normalization using the Robust Multi-array Average (RMA) method. (9)

From these steps we produced the linear modeling and computed the contrasts in the GUI in order to find the differentially expressed genes (DEGs). To obtain the DEGs we used the lmFit() function from the “limma” package and adjusted for multiple testing with the False Discovery Rate (FDR). The GUI constructed the linear model, which was fitted over the series of arrays with the log2-transformed chip summary intensities as the dependent variable and the biological and technological factors as the independent variables. We exported the expression data and viewed our output in a variety of graphics, including boxplots, volcano plots, hierarchical clustering and principal component analysis (PCA). The initial DEG list was defined as those genes with p<0.05. We kept the default settings for correction, fold-change and adjusted p-value and chose not to alter the data using any other adjustment methods, because we wanted to develop the largest list possible.

Microsoft Excel was used to sort, match and remove duplicate gene names in columns of data, sometimes employing formulas in individual cells. We also used it to eliminate unnamed rows, retaining the best (lowest) p-value for each duplicated gene name in the process. After processing, each of our 14 files had normalized p-values and gene names aligned to their respective Probeset_IDs. Next we made a list of all of the genes to be used for the Merge program (see Appendix E, please contact author for more information) to combine these genes into common genes per disease. Because we wanted to be inclusive of genes, we chose to conduct an experiment in recombination of our data files.
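The spreadsheet clean-up described above (dropping unnamed rows and keeping the lowest p-value per duplicated gene name) can be expressed compactly in code. The Python sketch below is a hypothetical re-statement of that step, not the Excel workflow actually used; the function names and the example rows (including the placeholder gene GENE_X and its p-values) are invented for illustration.

```python
def keep_lowest_p_value(rows):
    """Collapse (gene, p_value) rows to one entry per gene.

    Rows with an empty gene name are dropped; for duplicated gene names
    the lowest p-value is retained.
    """
    best = {}
    for gene, p_value in rows:
        if not gene:
            continue  # unnamed row: no gene symbol mapped to the probeset
        if gene not in best or p_value < best[gene]:
            best[gene] = p_value
    return best


def significant(genes, cutoff=0.05):
    """Keep only genes whose p-value is below the cutoff (p < 0.05 in the text)."""
    return {gene: p for gene, p in genes.items() if p < cutoff}


# Hypothetical rows: one gene measured by two probesets, one unnamed row.
rows = [
    ("STAT1", 8.06e-04),
    ("STAT1", 3.57e-09),
    ("", 1.00e-02),
    ("GENE_X", 0.20),
    ("XAF1", 2.48e-07),
]

deduped = keep_lowest_p_value(rows)
deg_list = significant(deduped)
print(deg_list)
```

A dictionary keyed by gene symbol is used here because the later merging step matches lists by gene name.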


2.4 Merging by p-values

In order to combine our data we chose to simplify the process by including only the Affymetrix chipsets with human blood components, using the normalized p-values. The lists of differentially expressed genes (p-value ≤ 0.05) were combined two diseases at a time. From this merging by p-value we obtained a new 2-way list. This list of common genes was then combined with that of a third disease to find the common genes for three diseases, our 3-way combination. Finally, this list was combined with the list of the fourth disease to find the differentially expressed genes shared among the four diseases in this study. These genes were then researched for their basic functions, locations and complexes. Afterwards they were submitted to BioGPS in order to better understand their inter-connected relationships and to build a network pathway from our genes. Additionally, our expression data were used to construct heatmaps for each of the expression data sets, covering the common genes with significant p-values as well as genes specific to associated pathways.
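The 2-way, 3-way and 4-way merging described above can be sketched as a repeated intersection of gene lists. The Python example below is an illustrative sketch under stated assumptions, not the Merge R code of Appendix A: it assumes each disease list is a mapping from gene name to p-value and that the lower p-value is kept for each shared gene. All gene names and p-values shown are placeholders.

```python
def merge_by_p_value(list_a, list_b):
    """Keep genes present in both lists, retaining the lower of the two p-values."""
    shared = list_a.keys() & list_b.keys()
    return {gene: min(list_a[gene], list_b[gene]) for gene in shared}


# Placeholder DEG lists (gene -> p-value); not data from this study.
ra = {"GENE_A": 1e-3, "GENE_B": 2e-2, "GENE_C": 4e-2}
sle = {"GENE_A": 1e-4, "GENE_B": 3e-2, "GENE_D": 1e-2}
ms = {"GENE_A": 1e-2, "GENE_B": 1e-3, "GENE_C": 2e-2}
pss = {"GENE_A": 3e-2, "GENE_D": 2e-2}

two_way = merge_by_p_value(ra, sle)          # RA-SLE
three_way = merge_by_p_value(two_way, ms)    # RA-SLE-MS
four_way = merge_by_p_value(three_way, pss)  # RA-SLE-MS-PSS

print(two_way)
print(three_way)
print(four_way)
```

Because each step only ever removes genes, the shortest input list caps the size of every combination that includes it.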


Chapter 3

Results section

3.1 Gene List Evolution

The initial list of differentially expressed genes for Rheumatoid Arthritis contained a large number of DE genes (10251) compared to Systemic Lupus (1045), Multiple Sclerosis (1851), and Sjögren’s Syndrome (99). The potential candidate genes were merged using our Merge R code (Appendix A), which combined them by their gene names, creating two columns of p-values. We eliminated the higher p-value for each gene between the two columns; the columns were then combined into one column and saved. These new (2-way) lists could then be combined with other data sets. The initial 2-way combinations were RA-SLE, MS-PSS, RA-MS, SLE-PSS, RA-PSS and SLE-MS. This process was repeated until we had combined all of the possible combinations, yielding the DE genes found in each of the diseases and, together, our shared common genes. The PSS list, with a maximum of 99 differentially expressed genes, acted as an unintentional filter that limited the number of common genes possible when it was combined with any of the other three autoimmune diseases. The numbers of resultant non-duplicated genes for the paired combinations with PSS were MS-PSS (14), SLE-PSS (45), and RA-PSS (50), as seen in Figure 3.1 (10). The remaining groups were distributed as follows: RA-SLE (652), RA-MS (1147) and SLE-MS (296). The p-values for our ten common genes appear in Table 3.1. The smaller lists formed by combining each of the other diseases with the PSS gene list were almost identical to one another (Table 3.2), in contrast to the larger 3-way combination of the non-PSS (RA-SLE-MS) group of genes. The list of common genes was distilled by merging any 2-way combination with a single disease list to form a new 3-way gene list (Table 3.3). Each of the PSS-limited 3-way combinations resulted in only several genes, filtering out valuable information about the number of genes per list and the common genes between these diseases. Additionally, there were genes that appeared in only one gene list per disease, against the groups represented in Table 3.2. If a biomarker were to be manufactured from this gene list, it would be important for a researcher or practitioner to be able to distinguish between these diseases. Such a test should provide not only a positive indication for these autoimmune diseases, but also identify which disease(s) are present.

Table 3.1 - Ten common genes per paired grouping with significant p-values <0.05.

Gene Symbol   Merge A    Merge B    Merge C    Merge D    Merge E    Merge F
              RA:SLE     MS:PSS     RA:MS      SLE:PSS    RA:PSS     SLE:MS
IFI44         5.53E-07   3.85E-04   3.54E-05   5.53E-07   3.54E-05   5.53E-07
IFI44L        9.43E-08   9.08E-04   7.39E-04   9.43E-08   7.39E-04   9.43E-08
IFIH1         5.07E-06   2.24E-02   1.20E-02   5.07E-06   1.20E-02   5.07E-06
IFIT1         3.05E-05   2.95E-04   5.31E-04   3.05E-05   2.95E-04   3.05E-05
IRF7          2.53E-06   1.14E-03   5.86E-03   2.53E-06   1.14E-03   2.53E-06
LAP3          8.70E-05   2.61E-03   1.08E-02   8.70E-05   2.61E-03   8.70E-05
RSAD2         2.79E-07   1.16E-04   8.83E-03   2.79E-07   7.15E-05   2.79E-07
RABGAP1L      4.31E-03   8.94E-03   4.31E-03   5.82E-03   4.31E-03   5.82E-03
STAT1         3.57E-09   1.63E-02   8.06E-04   3.57E-09   8.06E-04   3.57E-09
XAF1          2.48E-07   9.51E-05   9.51E-05   2.48E-07   2.34E-03   2.48E-07

Table 3.2 - The 3-way merge combinations with significant p-values <0.05.

Gene Symbol   RA-MS-PSS   RA-PSS-SLE   SLE-PSS-MS   RA-SLE-MS
              (No SLE)    (No MS)      (No RA)      (No PSS)
DDX60         --          2.98E-06     --           --
IFI44         1.16E-05    5.53E-07     5.53E-07     5.53E-07
IFI44L        2.95E-04    9.43E-08     9.43E-08     9.43E-08
IFIH1         1.14E-03    5.07E-06     5.07E-06     5.07E-06
IFIT1         7.39E-04    3.05E-05     3.05E-05     3.05E-05
IFITM1        --          --           1.33E-04     --
IRF7          2.61E-03    2.53E-06     2.53E-06     2.53E-06
LAP3          3.63E-04    8.70E-05     8.70E-05     8.70E-05
RABGAP1L      3.54E-05    4.31E-03     5.82E-03     4.31E-03
RSAD2         4.31E-03    2.79E-07     2.79E-07     2.79E-07
SAMD9L        8.06E-04    --           --           --
STAT1         1.20E-04    3.57E-09     3.57E-09     3.57E-09
XAF1          9.51E-05    2.48E-07     2.48E-07     2.48E-07
ZNF91         --          --           --           1.27E-03

Table 3.3 - 10 common genes and the limiting factor of adding Sjögren’s Syndrome.

Gene Symbol   Table B-MS:PSS   Table D-SLE:PSS   Table E-RA:PSS
IFI44         3.85E-04         5.53E-07          3.54E-05
IFI44L        9.08E-04         9.43E-08          7.39E-04
IFIH1         2.24E-02         5.07E-06          1.20E-02
IFIT1         2.95E-04         3.05E-05          2.95E-04
IRF7          1.14E-03         2.53E-06          1.14E-03
LAP3          2.61E-03         8.70E-05          8.70E-05
RSAD2         1.16E-04         2.79E-07          7.15E-05
RABGAP1L      8.94E-03         5.82E-03          4.31E-03
STAT1         1.63E-02         3.57E-09          8.06E-04
XAF1          9.51E-05         2.48E-07          2.34E-03

Each of the singular entries in Table 3.3 could help identify an individual disease. The per-disease gene lists could provide other exclusionary genes to distinguish which disease is present, based upon the presence or absence of the following genes: ABCF1- ATP-binding cassette sub-family F member 1; MOV10- Moloney leukemia virus 10; RTN3- Reticulon 3; STX11- Syntaxin 11; and ZNF91- Zinc finger protein 91, seen in Table 3.4. One application of this protocol would be the addition of ZNF91 in order to eliminate Sjögren’s Syndrome based upon positive results (p-values exist for each of the other diseases). Similarly, the gene ABCF1 could be used to check for RA, MOV10 for the presence of Sjögren’s Syndrome, RTN3 for MS and STX11 for SLE (Lupus).
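The presence-or-absence reasoning above can be written as a small rule. The Python sketch below is an illustration only: the p-values are transcribed from Table 3.4 with the column placement described in the text, and the function name classify and its output labels are hypothetical.

```python
DISEASES = ("RA", "SLE", "MS", "PSS")

# p-values from Table 3.4; a missing key means the gene was not in that disease's list.
TABLE_3_4 = {
    "ABCF1": {"RA": 1.43e-03},
    "MOV10": {"PSS": 9.43e-04},
    "RTN3": {"MS": 7.43e-09},
    "STX11": {"SLE": 1.46e-05},
    "ZNF91": {"RA": 1.27e-03, "SLE": 1.10e-02, "MS": 3.69e-02},
}


def classify(gene_p_values):
    """Label a gene by which diseases it appears in (presence/absence only)."""
    present = [d for d in DISEASES if d in gene_p_values]
    absent = [d for d in DISEASES if d not in gene_p_values]
    if len(present) == 1:
        return "marks " + present[0]      # seen in exactly one disease list
    if len(absent) == 1:
        return "excludes " + absent[0]    # seen in every list but one
    return "not exclusionary"


roles = {gene: classify(p_values) for gene, p_values in TABLE_3_4.items()}
for gene, role in roles.items():
    print(gene, "->", role)
```

This reproduces the assignments given in the text: ABCF1 for RA, MOV10 for PSS, RTN3 for MS, STX11 for SLE, and ZNF91 as the gene whose presence argues against PSS.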


Figure 3.1. Venn Diagram- Relationships between these (10) common genes limited by the PSS group when combined with the other diseases RA, SLE and MS. The overlapping areas between the 3-way combinations show the genes; SAMD9, SAMD9L, HIST1H2AC, IFITM1 and others.

These 3-way combinations are shown in Figure 3.1, a Venn diagram of the resultant non-duplicated genes from the combinations RA-MS-PSS (11), RA-PSS-SLE (30) and SLE-MS-PSS (12), but not RA-SLE-MS (226); partial lists have been included in Appendix B.1.1. In contrast to the four 3-way combinations produced, the common gene list from the 4-way combinations yielded more genes than anticipated: the ten (10) common genes seen in Table 3.5 below and at the center of the Venn diagram above. Each possible 4-way combination was assessed, checking for consensus among the genes and their p-values. These genes represent the shared or common genes between the autoimmune diseases.

Table 3.4 - Exclusionary Genes specific to an autoimmune disease, or excluding PSS

Gene Symbol   RA         SLE        MS         PSS
ABCF1         1.43E-03   --         --         --
MOV10         --         --         --         9.43E-04
RTN3          --         --         7.43E-09   --
STX11         --         1.46E-05   --         --
ZNF91         1.27E-03   1.10E-02   3.69E-02   --

Table 3.5 - Consensus of 4-way merged combinations by p-values

Gene Symbol   RA:SLE+MS:PSS (10)   SLE:PSS+RA:MS (10)   RA:PSS+SLE:MS (10)
IFI44         5.53E-07             5.53E-07             5.53E-07
IFI44L        9.43E-08             9.43E-08             9.43E-08
IFIH1         5.07E-06             5.07E-06             5.07E-06
IFIT1         3.05E-05             3.05E-05             3.05E-05
IRF7          2.53E-06             2.53E-06             2.53E-06
LAP3          8.70E-05             8.70E-05             8.70E-05
RABGAP1L      4.31E-03             4.31E-03             4.31E-03
RSAD2         2.79E-07             2.79E-07             2.79E-07
STAT1         3.57E-09             3.57E-09             3.57E-09
XAF1          2.48E-07             2.48E-07             2.48E-07
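The consensus in Table 3.5 can be checked against the pairwise values in Table 3.1. The Python sketch below assumes, as described for the merge step, that combining two lists keeps the lower p-value for each gene; it is an illustrative check on three of the ten genes, not the code used in this study, and the name four_way is hypothetical.

```python
# Pairwise p-values from Table 3.1 for three of the ten common genes.
TABLE_3_1 = {
    "IFI44": {"RA:SLE": 5.53e-07, "MS:PSS": 3.85e-04, "RA:MS": 3.54e-05,
              "SLE:PSS": 5.53e-07, "RA:PSS": 3.54e-05, "SLE:MS": 5.53e-07},
    "RABGAP1L": {"RA:SLE": 4.31e-03, "MS:PSS": 8.94e-03, "RA:MS": 4.31e-03,
                 "SLE:PSS": 5.82e-03, "RA:PSS": 4.31e-03, "SLE:MS": 5.82e-03},
    "STAT1": {"RA:SLE": 3.57e-09, "MS:PSS": 1.63e-02, "RA:MS": 8.06e-04,
              "SLE:PSS": 3.57e-09, "RA:PSS": 8.06e-04, "SLE:MS": 3.57e-09},
}

# The three ways of pairing up 2-way lists to cover all four diseases (Table 3.5).
PAIRINGS = [("RA:SLE", "MS:PSS"), ("SLE:PSS", "RA:MS"), ("RA:PSS", "SLE:MS")]


def four_way(gene_row, pairing):
    """Lower p-value of the two 2-way lists that make up one 4-way combination."""
    first, second = pairing
    return min(gene_row[first], gene_row[second])


# For each gene, collect the distinct 4-way p-values across the three pairings.
consensus = {
    gene: {four_way(row, pairing) for pairing in PAIRINGS}
    for gene, row in TABLE_3_1.items()
}
print(consensus)
```

Each gene yields a single value across the three pairings, and that value matches its row in Table 3.5, since every pairing contains the disease with the lowest p-value for that gene.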


The next steps involved assembling information on the genes themselves. The genes identified were IFI44, IFI44L, IFIH1, IFIT1, IRF7, LAP3, RABGAP1L, RSAD2, STAT1 and XAF1. The next logical step was researching their locations, possibly finding common functions and complexes, and mapping their network pathways, including production of graphics from DEG expression values. We used these Probeset_IDs for the ten genes: (IFI44) 214453_s_at, (IFI44L) 204443_at, (IFIH1) 1555464_at, 219209_at, (IFIT1) 203153_at, (IRF7) 208436_s_at, (LAP3) 217933_at, (RABGAP1L) 20302_at, 215342_s_at, 213982_s_at, 213313_at, 204028_s_at, (RSAD2) 213797_at, 242625_at, (STAT1) 200887_s_at, AFFX-HUMISGF3A/M97935_3_at, 209969_s_at, AFFX-HUMISGF3A/M97935_MA_at, AFFX-HUMISGF3A/M97935_MB_at, (XAF1) 228617_at. In some cases, the combination of the gene/Probeset_IDs and the number of GSM samples made it nearly impossible to read the labels on either axis (genes and disease state (disease or control)). Our original list had 78 Affymetrix Probeset_IDs matched with gene symbols. After running each data set through R to make heatmaps, the gene/Probeset_ID list was reduced because of issues with duplicated row names in R and to improve imaging. We settled upon a final set of 25 Probeset_IDs based upon the Type 1-Interferon pathway components and the ten shared common genes: (CXCL10) 204533_at, (DDX58) 222793_at, (DDX60) 218943_s_at, (GBP1) 202270_at, (HERC5) 219863_at, (HERC6) 219352_at, (HIST1H2AC) 215071_at, (IFI44) 214453_s_at, (IFI44L) 204443_at, (IFIH1) 219209_at, (IFIT1) 203153_at, (IFITM1) 201601_x_at, (IRF7) 208436_s_at, (ISG15) 205483_s_at, (LAP3) 217933_at, (MX2) 204994_at, (OAS2) 204972_at, (OAS3) 218400_at, (RABGAP1L) 215342_s_at, (RSAD2) 242625_at, (SAMD9) 228531_at, (SAMD9L) 219285_s_at, (STAT1) AFFX-HUMISGF3A/M97935_MB_at, (USP18) 219211_at, (XAF1) 228617_at. The heatmaps and dendrograms were produced using the expression values for each GEO DataSet (series) and the R code hm.2.thesis.txt listed in Appendix C. An example of our resulting genes, including both SAMD9 and SAMD9L, is shown in Figure 3.2, indicating up-regulation in RA samples and down-regulation in the controls.

Figure 3.2. Heatmap of data from Rheumatoid Arthritis sample, AE-MIMR-341

Figure 3.3. Heatmap of data from SLE sample, GSE13887


The unsupervised hierarchical cluster analysis of the expression levels of the set of ten genes shows how the genes are related based upon their expression. Each row (gene) is displayed relative to its mean: higher expression is shown in red, decreased expression in green, and little difference in black. The use of the dendogram=FALSE code for the disease status allowed the disease and control samples to be displayed in their original order. The heatmaps showed greater variation in expression levels with the larger 72-probeset group than with the 25 probesets, which were more similar to one another. The example heatmap for AE-MIMR-341 in Figure 3.2 demonstrates the clustering of Rheumatoid Arthritis genes. Figure 3.3 is a heatmap showing the same type of result, with up-regulated disease samples and down-regulated controls, from data set GSE13887. Our final heatmap (Figure 3.4) was generated using samples from each disease and their controls; the ten common genes appear to be up-regulated in many of their respective disease samples and generally down-regulated in the controls.
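The color scheme described above (each gene's row shown relative to its own mean) can be illustrated with a small calculation. The Python sketch below is not the heatmap R code of Appendix C; the expression values are hypothetical log2 intensities, and the tolerance that separates "little difference" from up- or down-regulation is an assumed parameter chosen only for the example.

```python
from statistics import mean


def center_row(values):
    """Express each sample's value relative to the row (gene) mean."""
    row_mean = mean(values)
    return [v - row_mean for v in values]


def color_for(delta, tolerance=0.5):
    """Map a centered value to the heatmap colors named in the text.

    `tolerance` is an assumption of this sketch, not a value from the study.
    """
    if delta > tolerance:
        return "red"    # higher than the gene's mean expression
    if delta < -tolerance:
        return "green"  # lower than the gene's mean expression
    return "black"      # little difference from the mean


# Hypothetical log2 intensities for one gene: two disease, two control, one borderline.
row = [9.0, 8.5, 6.0, 6.5, 7.5]
centered = center_row(row)
colors = [color_for(delta) for delta in centered]
print(centered)
print(colors)
```

Centering by row is what lets genes with very different absolute intensities share one color scale.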

Figure 3.4. Selected autoimmune disease samples (5) SLE, (5) MS, (4) PSS, and (4) RA with (11) Controls representing each disease.

Chapter 4

Discussion section

4.1 Genes of the Interferon Pathway

The variation in clustering resulted both from the contrasting mathematical expressions and from the strict method of labeling used in the target files, which contained only two contrasts, disease or control. We knew beforehand that some samples included disease states at various stages of development or onset; this could have prevented the outcomes from forming any set of common genes, just as applying the stringency of any statistical adjustment would have reduced our potential candidate genes to nearly zero.

In this study, ten common genes were identified across these four diseases, together with a list of 26 accessory genes that includes the common genes, partially shared genes and exclusionary genes with the potential to become biomarker candidates. Moreover, this study enabled identification of distinct gene lists for each individual disease. Without the filtering effect of the Sjögren’s data (the smallest and therefore limiting data set), the shared list could have been larger and might have had more accessory genes as well. Many of the 26 genes in the list demonstrated direct involvement in the Type I Interferon pathway, either because they take part in the activation, induction, inhibition or phosphorylation of other genes, or because they are included in literature reviews and signaling pathways. A unique interaction occurring in this IFN pathway is between the genes ISG15 and USP18 in the Ubiquitin-like pathway (ISGylation system). (11) Most of these top 10 genes are generally found in the cytoplasm, are up-regulated and play roles in several pathways or complexes. Table 4.1 provides the cellular locations of each of these genes, including locations outside the cytoplasm, and shows how some of them differ by location. Interestingly, the Type I IFN pathway encompasses the genes in the list in Table 4.2; the BOLD lettering indicates that the particular gene was one of the 26 probesets.

Table 4.1. Cellular location of top 10 genes

Gene        Cellular Location
IFI44       Cytoplasm
IFI44L      Cytoplasm
IFIH1       Cytoplasm, Nucleus, Cytosol
IFIT1       Cytoplasm, Cytosol
IRF7        Cytoplasm, Nucleus, Cytosol, Nucleoplasm, Endosome Membrane
LAP3        Cytoplasm, Nucleus, Nucleolus, Mitochondria, trans-Golgi network
RABGAP1L    Nucleus, early Endosome, Golgi apparatus
RSAD2       Mitochondria, Golgi apparatus
STAT1       Cytoplasm, Nucleus, Cytosol, Nucleoplasm, Nucleolus
XAF1        Nucleus, Cytosol, Mitochondria

Table 4.2 shows the top ten differentially expressed genes for each disease, ranked by significant p-value, taken from our MS Excel sheet. Of our ten common genes, only STAT1 (for Systemic Lupus) and RSAD2 (for Sjögren’s Syndrome) emerged in the top 10 genes of these lists.


Table 4.2. Top Ten genes by p-value/Disease

Rank  RA            p-value    SLE       p-value    MS         p-value    PSS        p-value
1     HBB           5.73E-16   NAPA      1.18E-11   HBB        7.11E-16   LGALS3BP   3.18E-08
2     HBA2/HBA1     2.62E-13   ARF4      4.67E-11   HINT3      1.86E-14   KLHDC7B    9.17E-07
3     ZFP57         1.45E-07   LY6E      5.78E-11   SERPINB9   1.56E-12   EPSTI1     1.13E-05
4     DERL1         1.76E-07   OASL      1.94E-10   STT3A      2.12E-12   IFI27      1.57E-05
5     LOC653739     2.07E-07   MX2       4.44E-10   STAT2      2.14E-12   SERPING1   2.50E-05
6     OSBPL11       2.17E-07   ISG15     1.63E-09   MS4A3      2.17E-12   ISG15      3.51E-05
7     C14orf118     2.24E-07   TLN1      2.66E-09   NEAT1      4.36E-12   USP18      5.38E-05
8     CCNY          3.52E-07   SIGLEC1   2.70E-09   PKN2       5.31E-12   HERC6      6.94E-05
9     XRN1          3.70E-07   STAT1     3.57E-09   EIF4A1     1.13E-11   RSAD2      7.15E-05
10    RAP2B/RAP2A   5.28E-07   YWHAH     1.01E-08   GLUD1      1.17E-11   IFI6       8.88E-05
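The comparison between the per-disease top ten lists and the ten common genes can be reproduced with a few lines of base R. This is only an illustrative check: the gene symbols are copied from Table 4.2 and from the list of ten common genes, while the object names (common, top10, overlap) are chosen here for the example.

```r
# The ten common genes identified in this study
common <- c("IFI44", "IFI44L", "IFIH1", "IFIT1", "IRF7",
            "LAP3", "RABGAP1L", "RSAD2", "STAT1", "XAF1")

# Top ten genes per disease, copied from Table 4.2
top10 <- list(
  RA  = c("HBB", "HBA2/HBA1", "ZFP57", "DERL1", "LOC653739",
          "OSBPL11", "C14orf118", "CCNY", "XRN1", "RAP2B/RAP2A"),
  SLE = c("NAPA", "ARF4", "LY6E", "OASL", "MX2",
          "ISG15", "TLN1", "SIGLEC1", "STAT1", "YWHAH"),
  MS  = c("HBB", "HINT3", "SERPINB9", "STT3A", "STAT2",
          "MS4A3", "NEAT1", "PKN2", "EIF4A1", "GLUD1"),
  PSS = c("LGALS3BP", "KLHDC7B", "EPSTI1", "IFI27", "SERPING1",
          "ISG15", "USP18", "HERC6", "RSAD2", "IFI6")
)

# For each disease, keep the top-ten genes that are also common genes
overlap <- lapply(top10, intersect, y = common)
overlap
```

The result contains STAT1 for SLE and RSAD2 for PSS, and is empty for RA and MS.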

The essential role of the ISGF3 complex is to regulate processes using both positive and negative feedback loops. Transcription factor binding sites (TFBS) for ISGF3 have been predicted in the promoter regions of the IFN pathway genes. The complex exerts control over the receptor genes IFIH1 and DDX58 and the two ISGs (CXCL10, ISG20), while the SOCS inhibitor and the transcription factor IRF7 control transcription through their DNA binding sites for the ISGF3 complex. (12) The main pathway of ‘our’ genes involves inducible IFN-beta transcription and is activated by IFN-beta. The expression of IFN-beta is assumed to be a response to the recognition of viral, bacterial or degraded endogenous nucleic acids by cellular receptors. They can be physiologically induced and lead to the activation of the two main pathways. (13) These genes are generally considered members of the Type I Interferon (β) signaling pathway, and some have also been described in the literature as participants in other pathways and complexes. The IFN genes involved in this pathway included CXCL10, DDX58, IFIH1, IFIT1, IRF7, ISG15, ISG20, RSAD2 and STAT1. (13) Some of these genes have been associated with other complexes within the pathway itself, namely the Toll-Like Receptor (TLR) and ISGF3 complexes. The TLR complexes mentioned in the literature relevant to our results are TLR3, TLR7 and TLR9, most of which are regulated by SOCS inhibition and controlled by either positive or negative feedback loops. (13) The Type I Interferon response genes (IRGs) LY6E, HERC5, IFI44L, ISG15, MX1/2, EPSTI1 and RSAD2, the up-regulated genes IFI27, IFI44 and IFI44L, and other genes including GBP1, IFITM1, IFI44L, OAS2/3 and STAT1 all have active roles within this pathway. (14), (15)

4.2 The Ten Shared Common Genes

The names of our common genes are IFI44- Interferon-induced protein 44, IFI44L- Interferon-induced protein 44-like, IFIH1- Interferon induced with helicase C domain 1, IFIT1- Interferon-induced protein with tetratricopeptide repeats 1, IRF7- Interferon regulatory factor 7, LAP3- leucine aminopeptidase 3, RABGAP1L- RAB GTPase activating protein 1-like, RSAD2- Radical S-adenosyl methionine domain containing 2, STAT1- Signal transducer and activator of transcription 1, and XAF1- XIAP associated factor 1. Other genes of mention are CXCL10- chemokine (C-X-C motif) ligand 10, GBP1- guanylate binding protein 1, DDX58/60- DEAD (Asp-Glu-Ala-Asp) box polypeptide 58/60, HERC5/6- probable E3 ubiquitin-protein ligase, ISG15- ISG15 ubiquitin-like modifier (IFIT5), ISG20- Interferon stimulated exonuclease, OAS2/3- 2'-5'-oligoadenylate synthetase 2/3, SAMD9/L- sterile alpha motif domain containing 9/9-like, and USP18- Ubiquitin specific peptidase 18. These common genes appear in both the literature and our results, as shown by Figures 4.1 and 4.2; the network pathways may not align exactly because of differences in their respective expression values and in which autoimmune disease data were analyzed from the source data.


Figure 4.1. Interferon Network Pathway, source http://www.nature.com/ni/journal/v10/n1/images/ni.1688-F1.jpg

Our 25 genes, based upon the list of IFN pathway components from page 16 and our own results, were submitted to the BioGPS site. Using the “String” plug-in layout, BioGPS produced our Interferon Network Pathway, providing both a differing network pathway image (Figure 4.2) and a map of gene interactions (Figure 4.3).

Figure 4.2. The 25 genes forming a network pathway based upon the 3-way merges of Rheumatoid Arthritis, Systemic Lupus and Multiple Sclerosis combined with PSS (16)

Figure 4.3. The Interferon network pathway of gene interactions from BioGPS, note some of the non-participating nodes were removed for improved viewing.

Our final network pathway, shown in Figure 4.4, consisted of just the ten genes uncovered by our methods, all of which are shared, with varying expression, across these four autoimmune diseases: RA, SLE, MS and PSS.

Figure 4.4 The BioGPS Network Pathway for the ten shared genes between these (4) autoimmune diseases.


4.3 Limitations

This study was limited by the quality and availability of the microarray data for these four autoimmune diseases. For example, only one data set was available for Sjögren’s Syndrome, which may not reflect the actual number of differentially expressed genes. Additionally, the older Affymetrix chipsets A.2 and B, which were included in some studies, were not separated out or labeled differently from the newer chipsets. Older BioConductor packages lacked compatibility with 64-bit Windows operating systems, and the same package could differ between 32-bit and 64-bit operating systems. In the future, the option to use parallel computing (T/F) will require the user to know the number of cores in order to improve computational speed. Lastly, we were unable to obtain consistent metadata from sample submissions to conduct a more in-depth analysis of gender and age, which is essential for female-dominated autoimmune diseases like those in this study.
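On the point about needing to know the number of cores, the sketch below shows one way this could be queried from within R using the parallel package that ships with R. It is not part of the analysis performed in this study; the function process_one is a hypothetical stand-in for a per-dataset task, and leaving one core free is only a common convention assumed here.

```r
library(parallel)

# detectCores() may return NA on some platforms, so fall back to 1
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L

# ASSUMPTION: leave one core free for the operating system
n_workers <- max(1L, n_cores - 1L)

# Hypothetical stand-in for the work done on one dataset
process_one <- function(i) i^2

# A socket cluster is used because it is available on Windows as well
cl <- makeCluster(n_workers)
res <- parLapply(cl, 1:4, process_one)
stopCluster(cl)

unlist(res)
```

With this approach the user does not have to supply the core count by hand, only to decide how many of the detected cores to use.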


Chapter 5

Conclusion

Our study identified ten genes associated with these four autoimmune diseases. These genes are specifically important to the Interferon pathway: the regulation of STAT1 in gene activation and phosphorylation of Janus kinases, the ISG56/P56 (IFIT1) family of genes, the involvement of LAP3 in TLR trafficking, and finally the control of apoptosis by XAF1, wherein inhibition leads to inflammation. Each of these genes has the potential to be selected as a biomarker for positively identifying these autoimmune diseases. Further studies will be necessary to test this common gene set and to elucidate the pathophysiological mechanism of each of these autoimmune diseases. One of the hopes of this study was the development of a biomarker that would help practitioners better diagnose autoimmunity from the responsible genes, as these would be negatively expressed except when the disease was present. The PSS data limited overall gene inclusion from the other three group lists, while it simplified the objectives.


References

1. AARDA, Inc. [Online] https://www.aarda.org/autoimmune_statistics.php.

2. Rau, et al. [Online] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1447637/.

3. https://en.wikipedia.org/wiki/Autoimmunity. [Online]

4. Primer in Rheumatic diseases. Arthritis, Foundation. 16:427, 1989, Journal of Rheumatology.

5. Arthritis Today. [Online] www.arthritistoday.org/about-arhtritis/signs-and-symptoms/.

6. http://hmg.oxfordjournals.org/content/20/17/3494.long. [Online]

7. Lymphoproliferative disorders in Sjögren's syndrome. Sugai, et al. March;3(3), 2004, Autoimmunology Review, pp. 175-82.

8. Sex differences and genomics in autoimmune diseases. Amur, et al. 2012, Journal of Autoimmunity, pp. 254-65.

9. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Irizarry, et al. 2003, Biostatistics, pp. 249-64.

10. Lucid Charts- Venn Diagrams. [Online] https://www.lucidchart.com/.

11. The ISG15/USP18 Ubiquitin-like pathway (ISGylation system) in hepatitis C virus infection. McGilvray, et al. 10/2011.

12. Hundeshagen, et al. 2012, Journal of Neuroinflammation.

13. Elevated type I interferon-like activity in a subset. Hundeshagen, et al. 2012, Journal of Neuroinflammation, Vol. 9, p. 140.


14. The interferon type I signature towards prediction of non-response to rituximab in rheumatoid arthritis patients. Raterman, et al. Arthritis Research & Therapy, 2012, Vol.

15. www.interferome.org. Interferome. [Online] http://www.interferome.org.

16. BioGPS- Your Gene Portal System. [Online] http://www.biogps.gnf.org/.

17. Noncanonical autophagy is required for type I interferon secretion in response to DNA-immune complexes. Henault, et al. 2012, Cell Press, Vol. 37(6), pp. 986-97.

18. Genetic Analyses of Interferon Pathway–Related Genes Reveal Multiple New Loci Associated With Systemic Lupus Erythematosus. Ramos, et al. 7, 2011, Arthritis & Rheumatism, Vol. 63, pp. 2049-57.

19. Deletion variants of RABGAP1L, 10q21.3, and C4 are associated with the risk of systemic lupus erythematosus in Korean women. Kim, et al. 2013, Arthritis and Rheumatism, Vol. 65(4), pp. 1055-63.

20. Interferon-γ Induces X-linked Inhibitor of Apoptosis-associated Factor-1 and Noxa Expression and Potentiates Human Vascular Smooth Muscle Cell Apoptosis by STAT3 Activation. Yalai, et al. 2008, Journal of Biological Chemistry, Vol. 283, pp. 6832-42.

21. Complement receptor 2/CD21- human naive B cells contain mostly autoreactive unresponsive clones. Isnardi, et al. 24, 2010, Blood, Vol. 115, pp. 5026-36.

22. Gene expression profile during monocytes to macrophage differentiation. Liu, et al. 1, 2008, Immunology, Vol. 117, pp. 70-80.

23. Platelet transcriptional profile and protein expression in patients with systemic lupus erythematosus: up-regulation of the type I interferon system is strongly associated with vascular disease. Lood et al. 11, 2010/2013, Blood, Vol. 116, pp. 1951-57.


24. Combined deficiency of proapoptotic regulators Bim and Fas results in the early onset of systemic autoimmunity. Hutcheson, et al. 2, 2008, Immunity, Vol. 28, pp. 206-217.

25. A modular analysis framework for blood genomics studies: application to systemic lupus erythematosus. Chaussabel, et al. 1, 2008/2013, Immunity, Vol. 29, pp. 150-164.

26. Abrogation of T cell quiescence characterizes patients at high risk for multiple sclerosis after the initial neurological event. Corvol, et al. 33, 2008, PNAS, Vol. 105, pp. 11839-44.

27. Activation of mammalian target of rapamycin controls the loss of TCRzeta in lupus T cells through HRES-1/Rab4-regulated lysosomal degradation. Fernandez, et al. 2009, Journal of Immunology, pp. 2063-73.

28. The trait of MS: Altered transcription regulation of nuclear receptors networks operate in the pre-disease state. Achiron, et al. 2, 2010, Neurobiology of Disease, Vol. 38, pp. 201-209.

29. Human TCR-alpha beta+ CD4- CD8- T cells can derive from CD8+ T cells and display an inflammatory effector phenotype. Crispin et al. 7, 2009, Journal of Immunology, Vol. 183, pp. 4675-81.

30. Systematic review of genome-wide expression studies in multiple sclerosis. Kemppinen, et al. 1, 2011, BMJ Open, Vol. 18, p. e00053.

31. Gender-associated differences of perforin polymorphisms in the susceptibility to multiple sclerosis. Camina-Tato, et al. 9, 2010, Journal of Immunology, Vol. 185, pp. 5392-404.


32. Elevation of Sema4A implicates Th cell skewing and the efficacy of IFN-β therapy in multiple sclerosis. Nakatsuji, et al. 10, 2011/2013, Journal of Immunology, Vol. 188, pp. 4585-65.

33. Netting neutrophils induce endothelial damage, infiltrate tissues, and expose immunostimulatory molecules in systemic lupus erythematosus. Villanueva et al. 1, 2011, Journal of Immunology, Vol. 187, pp. 538-552.

34. B cell signature during inactive systemic lupus is heterogeneous: toward a biological dissection of lupus. Garaud et al. 8, 2011, PLoS One, Vol. 6, p. e23900.

35. The multifaceted balance of TNF-α and type I/II interferon responses in SLE and RA: how monocytes manage the impact of cytokines. Smiljanovic et al. 11, 2012, Journal of Molecular Medicine, Vol. 90, pp. 1295-1309.

36. Brooks, et al. Cytokine stimulation of T lymphocytes regulates their capacity to induce monocyte of tumour necrosis factor-alpha but not in interleukin-10: possible relevance to pathophysiology of rheumatoid arthritis. 2011.

37. Systemic increase in type I interferon activity in Sjögrens syndrome: A putative role for plasmacytoid dendritic cells. Wildenberg, et al. 7, 2008, European Journal of Immunology, Vol. 38, pp. 2024-2033.

38. http://www.ncbi.nlm.nih.gov/pubmed/?term=common+genes+of+disease name. [Online]

39. Primary Sjögren's syndrome (pSS) is a chronic autoimmune disease with complex etiopathogenesis. Despite extensive studies to understand the disease process utilizing human and mouse models, the intersection between these species remains elusive. To addres. [Online]


40. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1809188/pdf/cei0138-0164.pdf. [Online]

41. http://www.arthritistoday.org/about-arthritis/signs-and-symptoms/arthritis-swelling-and-stiffness.php. [Online]

42. http://www.ncbi.nlm.nih.gov/gene/. [Online]

43. http://www.ncbi.nlm.nih.gov/pubmed/. [Online]

44. NIH - Funding/Categorical Spending. [Online] 2013. http://report.nih.gov/categorical_spending.aspx.

45. 1997, National Hospital Discharge Survey. s.l. : Vital Health and Statistics, Series 13, No.145.

46. 23andMe. [Online] https://www.23andme.com/health/Sjogrens-Syndrome/.

47. National MS Society. [Online] http://www.nationalmssociety.org/about-multiple-sclerosis/what-we-know-about-ms/who-gets-ms/index.aspx.

48. Overcoming Multiple Sclerosis. [Online] http://www.overcomingmultiplesclerosis.org/About-MS/Causes-of-MS/Smoking/.

49. Diagnosis of Multiple Sclerosis. [Online] http://ms.about.com/od/multiplesclerosis101/a/ms_diagnosis.htm.

50. Lupus Research Institute. [Online] http://lupusresearchinstitute.org/lupus-facts/lupus-diagnosis.

51. The action of IFN-beta induces IFITM1, IFIT3 and IFI44 in the monocytic cell. Wildenberg, et al. 2008, European Journal of Immunology, Vols. 38-7, pp. 2024-2033.


52. Hoffecker, et al, http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0063725, PLoS One.

53. Noncanonical autophagy is required for type I interferon secretion in response to DNA-immune complexes. Henault, et al. 2012, Cell Press, Vol. 37(6), pp. 986-97.

54. Zerhouni, Elias http://autoimmune.pathology.jhmi.edu/adrp.pdf. [Online]

55. Peripheral blood gene expression profiling in Sjögren’s syndrome. Emamian et al, 10(4), 2009, Genes and Immunity, pp. 258-296.

56. Kemppinen, et al, BMJ Open- Systematic review of genome-wide expression studies in multiple sclerosis. [Online] 2011. http://bmjopen.bmj.com/content/1/1/e000053.long.

57. Platelet transcriptional profile and protein expression in patients with systemic lupus erythematosus: up-regulation of the type I interferon system is strongly associated with vascular disease. Lood et al, 11, 2010-June, Blood, Vol. 116, pp. 1951-57.

58. IFIH1-GCA-KCNH7 : influence on multiple sclerosis risk. Martinez et al, 2008, Nature/European Journal of Human Genetics, Vol. 16, pp. 861-4.

59. McCall, et al, Assessing affymetrix GeneChip microarray quality. BMC Bioinformatics. 2011.

60. A Genomic Approach to Human Autoimmune Diseases. Pascual et al, 2011-April, Annual Review of Immunology, pp. 535-71.


Appendices

A.1- Merge R Code for combining genes by p-values

# Read the two gene lists (gene symbol and p-value) to be merged
r1 <- read.csv("25160c.csv")
r2 <- read.csv("29434c.csv")
# Print the first five rows of each data frame
r1[1:5, ]
r2[1:5, ]
# Give both data frames the same key column name
colnames(r1)[1] <- "Gene.Symbol"
colnames(r2)[1] <- "Gene.Symbol"
# Keep only genes present in both lists; the p-value columns become pvalue.x and pvalue.y
m12 <- merge(r1, r2, by = "Gene.Symbol")
# Rename the key column to id and the two p-value columns explicitly
colnames(m12)[1] <- "id"
colnames(m12)[2:3] <- c("pvalue.x", "pvalue.y")
# Print the first three rows of the merged data frame
m12[1:3, ]
# output
# id pvalue.x pvalue.y
#1 AARS 7.922160e-02 4.346300e-04
#2 AARS 3.139907e-02 4.346300e-04
write.csv(m12, file = "Merge12.csv")
summary(m12)
m12

Table B.1.0- GEO2R, SLE GSE13732

gene.symbol   p.value
HEATR7B2      6.67E-07
ATP2B3        1.38E-06
IRX3          8.11E-06
LOC285593     1.04E-05
DKK3          1.42E-05
LOC653160     2.66E-05
PRLR          3.15E-05
PTGFR         4.47E-05
ABTB1         4.60E-05
RELA          4.68E-05


Table B.1.1- Gene lists from combining genes by p-values

Gene Symbol Table B-MS:PSS Table D-SLE:PSS Table E-RA:PSS

ATF3
CXCL10*  --  --  6.99E-03
DDX60  --  2.98E-06  2.02E-03
EPSTI1*  1.13E-05
FUT4  --  2.11E-02  2.42E-03
GBP1  --  2.30E-05  6.32E-03
GMPR  --  7.34E-06  4.80E-04
HERC6  --  3.79E-05  6.94E-05
IFIT3  --  1.15E-05  1.42E-04
IFIT5  --  1.03E-02  8.13E-03
IFITM1  3.46E-02  1.33E-04  --
ISG15  --  1.63E-09  3.51E-05
ISG20*  3.83E-03  --  --
HIST1H2AC  2.94E-02  6.04E-03  --
MX2  --  4.44E-10  3.70E-04
OAS2  --  4.60E-08  3.03E-04
OAS3  --  2.22E-05  5.54E-04
SAMD9  --  2.10E-05  4.48E-04
SAMD9L  4.28E-05  --  3.63E-04
SCO2  --  1.36E-06  1.67E-03
TUBB2A  --  3.27E-03  1.54E-02
USP18  --  1.18E-07  5.38E-05
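To illustrate how genes in Table B.1.1 can be sorted into partially shared and exclusionary groups, the sketch below counts in how many of the three pairwise merge lists a gene appears. Only a subset of seven genes from Table B.1.1 is used, with membership taken from the columns in which the table lists a p-value; the object names are chosen for this example and the counting approach is an illustration, not the procedure used in the thesis.

```r
# Subset of Table B.1.1: a gene is listed under a merge if the table
# gives a p-value for it in that column
merges <- list(
  MS_PSS  = c("IFITM1", "ISG20", "SAMD9L", "HIST1H2AC"),
  SLE_PSS = c("IFITM1", "HIST1H2AC", "DDX60", "ISG15"),
  RA_PSS  = c("DDX60", "ISG15", "SAMD9L", "CXCL10")
)

# Number of merge lists in which each gene appears
n_lists <- table(unlist(lapply(merges, unique)))

exclusive <- names(n_lists)[n_lists == 1]   # found in one merge only
partial   <- names(n_lists)[n_lists == 2]   # shared by two merges
in_all    <- Reduce(intersect, merges)      # shared by all three merges

exclusive
partial
in_all
```

For this subset, CXCL10 and ISG20 appear in a single merge, five genes are shared by two merges, and none is shared by all three, since the genes common to every merge are the ten common genes reported separately.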

Table B.1.2 The 2-way merge of MS & PSS after combining p-values

gene.symbol   pval
SAMD9L        4.28E-05
XAF1          9.51E-05
RSAD2         1.16E-04
IFIT1         2.95E-04
IFI44         3.85E-04
IFI44L        9.08E-04
IRF7          1.14E-03
LAP3          2.61E-03
ISG20         3.83E-03
RABGAP1L      8.94E-03
STAT1         1.63E-02
IFIH1         2.24E-02
HIST1H2AC     2.94E-02
IFITM1        3.46E-02


Table B.1.3 The 3-way Merge of MS_RA_PSS = No SLE, before combining p-values with duplicates

      gene.symbol   pval.x     pval.y
1     IFI44         3.54E-05   0.015075
2     IFI44         3.54E-05   0.006721
3     IFI44L        0.000739   0.001845
4     IFIH1         0.00502    0.02243
5     IFIH1         0.00502    0.04243
6     IRF7          0.00586    0.001137
7     RSAD2         0.00883    7.15E-05
8     RSAD2         0.00883    0.000116
9     SAMD9L        0.000996   0.003572
10    SAMD9L        0.000996   0.008666
11    SAMD9L        0.000996   0.010269
12    STAT1         0.00577    0.027404
13    STAT1         0.00577    0.016272
14    STAT1         0.00577    0.004892
15    STAT1         0.00577    0.030224
16    XAF1          9.51E-05   0.022496
17    XAF1          9.51E-05   0.00237
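The duplicate rows in Table B.1.3 arise because a gene symbol can be represented by more than one probeset. The sketch below shows one possible way to collapse such duplicates to a single row per gene. Keeping the smallest p-value per gene is an assumption made for this illustration; the rule actually used to combine p-values in this study is not restated here, so the result should not be read as reproducing the combined tables.

```r
# Rows copied from Table B.1.3 (3-way merge before combining p-values)
dup <- data.frame(
  gene.symbol = c("IFI44", "IFI44", "IFI44L", "IFIH1", "IFIH1", "IRF7",
                  "RSAD2", "RSAD2", "SAMD9L", "SAMD9L", "SAMD9L",
                  "STAT1", "STAT1", "STAT1", "STAT1", "XAF1", "XAF1"),
  pval.x = c(3.54e-05, 3.54e-05, 0.000739, 0.00502, 0.00502, 0.00586,
             0.00883, 0.00883, 0.000996, 0.000996, 0.000996,
             0.00577, 0.00577, 0.00577, 0.00577, 9.51e-05, 9.51e-05),
  pval.y = c(0.015075, 0.006721, 0.001845, 0.02243, 0.04243, 0.001137,
             7.15e-05, 0.000116, 0.003572, 0.008666, 0.010269,
             0.027404, 0.016272, 0.004892, 0.030224, 0.022496, 0.00237)
)

# ASSUMED rule: keep the smallest p-value per gene in each column
collapsed <- aggregate(cbind(pval.x, pval.y) ~ gene.symbol,
                       data = dup, FUN = min)
collapsed
```

The 17 rows reduce to one row for each of the eight gene symbols present in the table.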

C- Heatmap, R code using expression level data

# heatmap.2() and greenred() are provided by the gplots package
library(gplots)
bb <- read.csv("ra38exphm1.csv")
# Use the GENE column (first column) as row names, then drop it
row.names(bb) <- bb$GENE
row.names(bb)
cc <- bb[, -1]
cc
x <- data.matrix(cc)
dim(x) # MATRIX DIMENSIONS
## cluster the genes on the numeric expression values
hc <- hclust(dist(x))
plot(hc) # dendrogram produced, view & save
# Rows (genes) are clustered; columns (samples) keep their original order
heatmap.2(x, Rowv = TRUE, Colv = FALSE, dendrogram = "row", trace = "none", col = greenred(25))

D- Additional Resources; either used as data or as reference information

LAP3- TLR9 (17)
RABGAP1L- SLE & IFN (18), (19)
XAF1- IFN (20)
STAT1 regulation (20)
GSE13917 (21)
GSE 8286 (22)
GSE22132 (23)

GEO DataSet Authors
GSE10325 (24)
GSE11909 (25)
GSE13732 (26)
GSE13887 (27)
GSE14895 (28)
GSE16130 (29)
GSE21942 (30)
GSE23205 (31)
GSE26484 (32)
GSE26975 (33)
GSE30153 (34)
GSE38351 (35)
AE-E-MIMR-341 (36)
AE-E-MEXP-1242 (37)

E- Author Information: If you require more information or would like to contact me please scan this QR Code and contact me through LinkedIn.
