<<

Regulator combinations identify systemic sclerosis patients with more severe disease

Yue Wang, … , Monique Hinchcliff, Michael L. Whitfield

JCI Insight. 2020. https://doi.org/10.1172/jci.insight.137567.

Research In-Press Preview Dermatology

Systemic sclerosis (SSc) is a heterogeneous autoimmune disorder that results in skin fibrosis, autoantibody production and internal organ dysfunction. We previously identified four ‘intrinsic’ subsets of SSc based upon skin expression that are found across organ systems. regulators that underlie the SSc intrinsic subsets, or are associated with clinical covariates, have not been systematically characterized. Here we present a computational framework to calculate the activity scores of gene expression regulators and identify their associations with SSc clinical outcomes. We find regulator activity scores can reproduce the intrinsic molecular subsets with distinct sets of regulators identified for inflammatory, fibroproliferative and normal-like samples. Regulators most highly correlated with modified Rodnan skin score (MRSS) also varied by intrinsic subset. We identify a subgroup of fibroproliferative/inflammatory SSc patients with more severe pathophenotypes. We further identify a subgroup of SSc patients that had higher MRSS and increased likelihood of interstitial lung disease. Using an independent cohort, we show this group was most likely to show forced vital capacity decline over a period of 36 – 54 months. Our results demonstrate an association between the activation of regulators, gene expression subsets and clinical variables that can identify SSc patients with more severe disease.

Find the latest version: https://jci.me/137567/pdf 1 Regulator combinations identify systemic sclerosis patients with more severe 2 disease 3 4 Yue Wang1, Jennifer M. Franks1, Monica Yang2, Diana M. Toledo1, Tammara A. Wood1, Monique 5 Hinchcliff2 and Michael L. Whitfield1,* 6 7 1Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Lebanon, New 8 Hampshire 03756, USA. 2Department of Internal Medicine, Northwestern University Feinberg School of 9 Medicine, Chicago, Illinois, 60611, USA 10 11 12 Corresponding Author: 13 Michael L. Whitfield, Ph.D. 14 Department of Biomedical Data Science 15 Geisel School of Medicine at Dartmouth 16 1 Medical Center Drive 17 Lebanon, NH 03756 18 [email protected] 19 Phone: 603-650-1109, Fax: 603-650-1188 20 21 Conflict of interest statement 22 The authors have declared that no conflict of interest exists. 23 Abstract

24 Systemic sclerosis (SSc) is a heterogeneous autoimmune disorder that results in skin fibrosis,

25 autoantibody production and internal organ dysfunction. We previously identified four ‘intrinsic’

26 subsets of SSc based upon skin gene expression that are found across organ systems. Gene

27 expression regulators that underlie the SSc intrinsic subsets, or are associated with clinical

28 covariates, have not been systematically characterized. Here we present a computational

29 framework to calculate the activity scores of gene expression regulators and identify their

30 associations with SSc clinical outcomes. We find regulator activity scores can reproduce the

31 intrinsic molecular subsets with distinct sets of regulators identified for inflammatory,

32 fibroproliferative, limited and normal-like samples. Regulators most highly correlated with

33 modified Rodnan skin score (MRSS) also varied by intrinsic subset. We identify subgroups of

34 fibroproliferative and inflammatory SSc patients with more severe pathophenotypes, such as

35 higher MRSS and increased likelihood of interstitial lung disease (ILD). Using an independent

36 cohort, we show the group with more severe ILD was more likely to show forced vital capacity

37 decline over a period of 36 – 54 months. Our results demonstrate an association between the

38 activation of regulators, gene expression subsets and clinical variables that can identify SSc

39 patients with more severe disease.

40

41

42 Keywords: Systemic sclerosis, Biomarkers, Pulmonary fibrosis, forced vital capacity decline,

43 Computational biology 44 Introduction

45 Systemic sclerosis (SSc) is a heterogeneous autoimmune disease that results in the production of

46 autoantibodies, skin fibrosis, and internal organ involvement (1); the pattern and severity of skin

47 and organ involvement varies across patients. Clinically, SSc is divided into two subtypes based

48 on the extent of skin involvement including limited cutaneous (lc) and diffuse cutaneous (dc) SSc

49 (2). The lungs, heart, kidney, and other organs may also be involved (1, 3, 4). We have previously

50 identified SSc ‘intrinsic’ subsets (fibroproliferative, inflammatory, limited, and normal-like) based

51 upon skin gene expression (5-7), and that may predict response to therapy (8, 9). Analysis of skin

52 gene expression across cohorts identified interactions between immune and stromal cells that may

53 act as key drivers of SSc pathogenesis in patients with a permissive genetic background (10, 11).

54 Regulators, either acting at the transcriptional or post-transcriptional level, control the

55 expression of their target gene networks, thus they contribute to different biological phenotypes

56 and can be proxies for tightly, coordinated and regulated pathways (12). However, the regulators

57 that underlie SSc and the association with clinical phenotypes have not been systematically

58 investigated. The goals of these analyses were two-fold. First, to identify regulators of gene

59 expression, such as transcription factors (TFs) and microRNAs (miRNAs), that are enriched in the

60 SSc intrinsic subsets, and second to identify regulators that could identify SSc patients with more

61 severe skin and lung disease. Our reasoning is that the regulators, and the network of that

62 they control, may be informative of pathological processes acting in SSc.

63 Herein, we constructed a computational framework to systematically examine the activity

64 of 836 regulators across 431 SSc skin and 35 SSc blood samples using publicly available gene

65 expression data (see Figure 1 for an overview). We characterized each regulator’s target genes and

66 their expression profile to infer the regulator activity scores for each SSc sample using the BASE 67 algorithm (13). Regulator activity scores were correlated with the modified Rodnan Skin Score

68 (MRSS), a common measure of SSc severity, to identify those with activity scores highly

69 associated with severity of skin disease, particularly in fibroproliferative and inflammatory

70 patients. We then built an interaction network using regulators that were significantly associated

71 with MRSS to give a comprehensive picture of the regulatory interactions within the SSc intrinsic

72 subsets. Then, novel subgroups within the fibroproliferative or inflammatory intrinsic subset were

73 identified by using a combination of the activity scores of two MRSS-correlated regulators.

74 Further, we found a novel group of SSc samples containing patients with higher MRSS and

75 significant decline in forced vital capacity (FVC) over 36 months of follow-up.

76

77 Results

78 Regulator activity scores reproduce the SSc intrinsic subsets

79 We analyzed the regulator activity scores in four independent, publicly available SSc datasets of

80 Milano et al., Pendergrass et al., Hinchcliff et al., and Assassi et al. (5-7, 14). For each dataset, we

81 first calculated sample-specific regulator activity scores using BASE (13) by integrating gene

82 expression profiles and regulator target gene sets (see Methods). We applied intrinsic gene analysis

83 to calculate a within-between score (15) to select regulators that showed the most consistent

84 activity scores within an SSc intrinsic subset, but had the most variable activity scores across all

85 subsets. The 270 regulators that had a false discovery rate (FDR) less than 2% in at least three

86 datasets were considered to be “intrinsic SSc regulators” and included activator ;

87 CCAAT/enhancer-binding proteins; members of the family, ETS family, STAT family, and

88 GATA family; glucocorticoid receptors; interferon regulatory factors; RUNX1-related core 89 binding factors; B- and T-cell-related transcription factors; and numerous miRNAs

90 (Supplementary Table S1).

91 We organized the samples and regulators in each dataset by hierarchical clustering of the

92 activity scores with the 270 regulators. Broadly, we found that samples in the fibroproliferative

93 subset clustered together and displayed activation of key regulators of cell proliferation, such as

94 members of the E2F, , and ETS families (Figure 2 and Supplementary Figures S1 to S4).

95 Target genes of these fibroproliferative cluster regulators are highly enriched in cell cycle and

96 DNA replication pathways (Supplementary Table S2 and Figure S5; corrected P-value < 0.05),

97 consistent with the activation of biological processes enriched in the fibroproliferative subset.

98 Immune-related proteins, such as those from the STAT family, Runx1-related core binding factors,

99 and nuclear factor of activated T cells (NFAT), are enriched in the inflammatory subset (Fig. 2

100 and Supplementary Figures. S1 to S4). Their target genes are significantly involved in immune-

101 related and signal transduction pathways such as the B-/T-cell signaling pathway, Th17-

102 cell differentiation, and the TGF-beta signaling pathway (Supplementary Table S2 and Figure S6;

103 corrected P-value < 0.05).

104 There was also a strong cluster of regulators for the Milano limited samples, which

105 included 52 regulators such as the (GR), , and androgen

106 receptor (Figure 2A and Supplementary Figure S1). Surprisingly, in the Milano dataset, although

107 the limited and inflammatory subsets were comprised of different sets of regulators, the pathways

108 to which these regulators mapped were highly shared (Jaccard score = 0.53; Supplementary Table

109 S2). For instance, the MAPK signaling pathway, PI3K-Akt signaling pathway, and Wnt signaling

110 pathway were all enriched in both clusters (Supplementary Table S2), suggesting that the pathways

111 driving these subsets may be similar but that the regulators that activate the pathways are different. 112 In the Assassi dataset (Figure 2B), 27 out of 42 enriched pathways in the fibroproliferative samples

113 were shared with those in the inflammatory samples and included the MAPK signaling pathway,

114 the PI3K-Akt signaling pathway, and the Wnt signaling pathway (Supplementary Table S2).

115 Similar results were found in the Pendergrass and Hinchcliff datasets (Supplementary Figure S3

116 and S4; Supplementary Table S2).

117

118 Regulator activity scores are correlated with MRSS

119 Next, we asked if regulator activity scores could be an additional index to identify patients with

120 more severe disease. Correlations were calculated between each regulator’s activity score and

121 patient’s MRSS in all four datasets. Mean correlations were calculated and ranked in decreasing

122 order (Supplementary Tables S3 and S4 for fibroproliferative and inflammatory samples,

123 respectively). We chose to focus on the 50 top- and bottom-ranked regulators. In the

124 fibroproliferative subset, regulators that had the highest (correlation > 0.17, Supplementary Table

125 S5) and lowest (correlation < -0.21, Supplementary Table S6) correlation were identified. In

126 particular, the activity scores of POU domain-related TFs (OCT1 and PIT1), forkhead box (FOX)

127 TFs, and many microRNAs were positively correlated with the MRSS of fibroproliferative SSc

128 samples (Figure 3A). POU domain-related TFs function in the cell cycle regulation of

129 immunoglobulins and are involved in viral infection (16). Regulators in the FOX family are well-

130 known regulators of cell proliferation, e.g., FOXC1 increases the proliferation of fibroblast-like

131 synoviocytes in autoimmune diseases (17). In contrast, activities of the tumor suppressor

132 , ETS-related proteins, and immune-related TFs (including STAT3, SMAD4, and macrophage

133 migration inhibitory factor) were negatively correlated with the MRSS (Figure 3A). 134 Similarly, for the inflammatory subset, as shown in Fig. 3B, we identified the most and

135 least correlated regulators, with the lowest positive correlation of 0.3 (Supplementary Table S7)

136 and the highest negative correlation of -0.26 (Supplementary Table S8). Immune-related

137 transcription factors, such as NFAT, nuclear factor kappa-light-chain-enhancer of activated B cells

138 (NFKB), RUNX1-related , STAT family, SRF, and interferon regulatory factor

139 (IRF) were positively correlated with MRSS (Figure 3B). In contrast, we found that macrophage

140 migration inhibitory factor, E2F family proteins, and several microRNAs were negatively

141 correlated with the MRSS.

142

143 Regulator interaction network in the context of SSc

144 Based on the clinically relevant regulators that were identified, a network-based analysis was

145 conducted to explore the interactions among those regulators in the context of SSc (Figure 1). We

146 defined that an interaction was established if a regulator’s target list contained another regulator

147 listed in the MSigDB C3 database. We began by investigating the regulator-target interactions in

148 the context of fibroproliferative (Figure 4A) and inflammatory (Figure 4B) samples. The two

149 colors of nodes represent the type of correlation with the MRSS (red, positive; cyan, negative).

150 The arrows indicate target directions and a circular arrow indicates a self-regulating feedback loop.

151 The size of a node indicates its relative degree of connection. Therefore, a larger node indicates

152 that a given regulator plays a more important role by regulating others in SSc. For example, the

153 SP3, which is negatively correlated with the MRSS of fibroproliferative

154 samples, interacts with five other regulators, including cell-cycle-related regulators (miR-485-3p

155 and TFAP2A) and immune-related regulators (STAT3 and ELK1) (Figure 4A). These results are

156 in line with previous findings that SP3 is able to control IL-10 gene expression and interact with 157 (18, 19). Though SP1 is well studied in SSc, this suggests that, as a collagen metabolism

158 regulator, SP3 may also play an important role in SSc etiology (20, 21).

159 Next, by combining the regulators associated with both fibroproliferative and

160 inflammatory samples, we built a more comprehensive interaction network in the context of SSc

161 (Figure 4C). Again, the size of a node indicates its degree of connection and different colors of

162 nodes represent different categories. Besides those regulators that have specific correlation

163 directions with MRSS (four corners in Figure 4C), we found that the activities of paired box 3

164 (PAX3) and sex-determining region Y (SRY), as shown in dark red, were positively correlated

165 with MRSS in both fibroproliferative and inflammatory samples. PAX3 contributes to the

166 embryotic development of the central nervous system and heart vasculature (22), which suggests

167 that there are underlying associations between autoimmunity and the nervous system (23).

168 Meanwhile, SRY has been shown to have a strong relationship with autoimmune diseases (24).

169 MiR485-3p’s activities are positively and negatively correlated with the MRSS of

170 fibroproliferative and inflammatory samples, respectively (purple node). Conversely, the activities

171 of STAT3, ETS1, RFX1, PEA3, and MADH4 are positively and negatively correlated with the

172 MRSS of inflammatory and fibroproliferative samples, respectively (light green nodes).

173 Additionally, E2F, miR142-5p, MIF1, and NFMUE1 were negatively correlated with both

174 intrinsic subsets of patients. Interestingly, we found that the activities of a subset of microRNAs

175 were positively correlated with the MRSS in fibroproliferative samples and were negatively

176 correlated with the MRSS in inflammatory samples.

177

178 Regulator activity scores identify subgroups that differ by disease severity 179 We then asked whether combinations of regulator pairs could identify novel subgroups of SSc

180 within the intrinsic subsets that had more severe MRSS. We used a combination of gene expression

181 data and clinical information collected from Milano et al., Pendergrass et al., and Hinchcliff et al.

182 to generate a larger cohort to increase the statistical power. The combined data from Franks et al.

183 was used (25). Regulator activity scores and their correlation with the MRSS were re-calculated

184 based on the combined data. We again focused on the 50 top- and bottom-ranked regulators, which

185 we reasoned were likely to be the most clinically relevant regulators in each intrinsic subset

186 (Supplementary Table S9 for fibroproliferative and S10 for inflammatory). Then, we performed

187 pairwise analyses of regulators that were positively correlated with the MRSS in both

188 fibroproliferative and inflammatory subsets.

189 For the pairwise combinations of regulator activity scores in the fibroproliferative subset,

190 a total of 1,225 pairs were divided into four subgroups, using 0 as cutoff for each activity score.

191 By comparing MRSS between groups, we found 910 pairs where samples that had positive activity

192 of both regulators (i.e. double positive) also had more severe skin disease compared to those in

193 other three groups (FDR<1%, Supplementary Table S11). For example, a top-ranked pair,

194 CART1_01 and NMYC_01, classified fibroproliferative samples into four subgroups: those that

195 were positive for both regulators (red, group 1); those positive for CART1_01 and negative for

196 NMYC_01 (blue, group 2); those that were negative for both regulators (brown, group 3); and

197 those that were negative for CART1_01 and positive for NMYC_01 (grey, group 4), as shown in

198 Figure 5A. CART1, known as TNF receptor associated factor 4, is able to target fibroblasts (26)

199 and regulate cell-cycle pathways (27, 28). NMYC, a member of the MYC family, plays an

200 important role in regulating cell cycle and metabolism in many human diseases (29). Similar 201 observations were found among the significant regulator pairs, suggesting that a potentially large

202 number of regulators are dysregulated in SSc (Supplementary Table S11).

203 We found that patients with samples in group 1 had more severe disease than patients with

204 samples in the other groups (all two-sided Wilcoxon rank test; P < 0.0005; Figure 5B). The fraction

205 of samples from dcSSc patients in group 1 was similar to that in group 4 (Supplementary Figure

206 S7), but a significant difference in MRSS persisted. The fraction of samples from dcSSc patients

207 in group 1 was 1.23-fold and 1.65-fold higher than groups 2 and 3, respectively (two-tailed Fisher’s

208 exact test; P = 0.04 and P = 0.0008). Patients with samples in group 4 had higher MRSS compared

209 to patients with samples in groups 2 and 3 (all two-sided Wilcoxon rank test; P < 0.05; Figure 5B),

210 likely due to the higher prevalence of dcSSc. These observations suggest that patients with higher

211 NMYC activity are more likely to be dcSSc and that patients with higher NMYC activity have a

212 worse disease phenotype than those that have lower NMYC activity. Patients with samples in

213 groups 1 and 2 had relatively higher fractions of early-stage disease than those in groups 3 and 4,

214 which might suggest that higher CART1 activity is associated with early-stage disease

215 (Supplementary Figure S7). Moreover, we found that samples in group 1 came from individuals

216 who were significantly younger at diagnosis and had notably longer disease durations at baseline

217 compared to other groups (all two-sided Wilcoxon rank test; P £ 0.05; Figure 5C). We stratified

218 patients by treatment to determine if the subgroups could derive from treatment with disease-

219 modifying anti-rheumatic drugs (DMARDs). We found that samples in group 1, 3 and 4 came

220 primarily from individuals who were not on immunosuppressive treatment at baseline

221 (Supplementary Table S12). A slightly larger proportion of baseline samples from group 2 had

222 treatment of Mycophenolate Mofetil (MMF) (47.4%) compared to those who had no treatments

223 (31.6%). Therefore, treatment does not appear to be a defining feature of the groups. 224 We repeated the pairwise analyses for patients in the inflammatory subset and identified

225 312 pairs where patients that were double-positive had a more severe disease phenotype

226 (Supplementary Table S13). For example, using a top-ranked combination of NFAT (NFAT_Q6)

227 and SMAD4 (SMAD4_Q6), samples from patients were divided into positive/positive (group 1,

228 purple), positive/negative (group 2, blue), negative/negative (group 3, brown), and

229 negative/positive (group 4, grey) groups (Figure 5D). NFAT is a well-known TF that has a

230 significant role in the immune system (30). SMAD4 plays important roles in the TGF-beta and

231 fibroblast growth factor signaling pathways (31). Again, patients with samples in group 1 had

232 significantly higher MRSS than those in other groups (all two-sided Wilcoxon rank test; P < 0.05;

233 Figure 5E). Similar to our results with the fibroproliferative subset, we found samples in group 1

234 were more likely to be from dcSSc patients compared to other groups (Figure 5F, 1.25-fold and

235 1.96-fold difference when comparing dcSSc fraction in group 1 to that in groups 2 and 3,

236 respectively; two-tailed Fisher’s exact test; P = 0.03 and P = 0.01). Group 1 samples had a

237 relatively higher fraction of samples coming from dcSSc patients than in group 4 (1.18-fold).

238 Interestingly, this observation was found in many immune system-related regulator pairs, as well

239 (Supplementary Table S12). We found that group 1 and group 2 samples were more likely to be

240 from early-stage patients (mean disease duration of 27.15 and 24 months since first onsest of non-

241 Raynaud’s symptoms, respectively), compared to group 3 (mean disease duration of 106 months)

242 and 4 (disease duration of 33.83 months) (Figure 5F). Unlike the fibroproliferative subset, we did

243 not find significant differences between the age and disease duration among patients grouped using

244 these novel subgroups among inflammatory subset. We also examined treatment impacts on these

245 subgroups. We found that the majority of samples in group 1 and 2 were from individuals that

246 were not on immunosuppressive therapies (61.7% and 62.5% respectively; Supplementary Table 247 S14). Samples in group 3 and 4 came from individuals for which 50% and 44%, respectively, were

248 not on an immunosuppressive therapy.

249

250 Novel subgroups associate with FVC decline, a surrogate for interstitial lung disease

251 Leveraging the interstitial lung disease (ILD) and non-ILD patients contained within the Assassi

252 dataset (14), we determined if novel subgroups were associated with the presence of ILD (FVC

253 reduction). By performing the pairwise analyses of the common regulators across four datasets in

254 both fibroproliferative and inflammatory subsets, we classified 23 SSc patients with ILD and 37

255 non-ILD SSc patients into four subgroups, where group 1 patients had the worst disease phenotype

256 compared to other groups (FDR < 5%). Additionally, for each single pair, we calculated the fold-

257 change of the fraction of ILD between group 1 and other groups. As a result, we identified 28 pairs

258 of regulators with a fold change greater than 1.5 (Supplementary Table S15). For example, using

259 the top-ranked pair of FOXO4_02 and COREBINDINGFACTOR_Q6, patients were classified

260 into four groups, as shown in Figure 6A. FOXO4 is known to decrease the activity of hypoxia-

261 inducible factor, which is a validated biomarker for lung diseases (32-34). The core binding factor,

262 which is within the RUNX family, has been shown to be highly involved in autoimmunity and

263 inflammation and several autoimmune diseases and contributes to pulmonary fibrosis (35-37).

264 Again, patients in group 1 had the highest MRSS compared to that in other groups (all two-sided

265 Wilcoxon rank test; P < 0.05; Figure 6B). The majority of patients in each group were not on

266 immunosuppressive therapies (Supplementary Table S16). We also found that patients in groups

267 1 and 4 were more likely to be dcSSc than those in groups 2 and 3 (Supplementary Table S14). In

268 contrast, patients in groups 2 and 3 were more likely to be lcSSc than those in groups 1 and 4. This

269 suggests that the activity of the core binding factor might be higher in patients with dcSSc versus 270 lcSSc. More patients in group 1 had ILD (50%) compared to other groups (25 – 29% ILD; Figure

271 6C), which is consistent with a previous study showing that patients with dcSSc are more likely to

272 develop fibrotic pulmonary complications (1). Patients in groups 2 and 3 were more likely to be

273 classified in the normal-like intrinsic subset compared to those in the rest of the groups

274 (Supplementary Table S16).

275 We repeated the analysis in the skin biopsy samples analyzed in Hinchcliff et al., as shown

276 in Figure 6D. Patients with unclear ILD diagnosis at baseline or with morphea were excluded. We

277 found that group 1 samples were from patients with more severe skin involvement than those in

278 the other three groups (Figure 6E). Again, the majority of samples in group 1 (85.7%) and group

279 4 (82.7%) were from patients with dcSSc compared to other groups (Supplementary Table S17);

280 samples in groups 2 and 3 were more likely to be from normal-like patients when compared to the

281 other groups. We did not find enrichment of ILD in group 1 at initial biopsy but did find those

282 patients had an increased rate of FVC decline over 36 months. ILD was present in group 1 (45%),

283 group 2 (67%), group 3 (70%) and group 4 (45%) (Supplementary Figure S8).

284 We hypothesized that the observed stratification might result from the majority of patients

285 in the Hinchcliff dataset being early stage (66.7% samples from patients in group1 had SSc disease

286 duration < 18 months at baseline), while only two patients in the Assassi dataset were early stage.

287 To test this, patients were stratified into groups based on their baseline biopsies. FVC predicted

288 values were compared between baseline and the last time point (> 36 months). We found that

289 patients in group 1 had a significant FVC decline compared to other groups (two-tail paired

290 samples T test p-value=0.02, Figure 6F) that was not observed in the other groups.

291 SScMH_06 was the only patient that lacked ILD at baseline but had developed ILD by the

292 last time point. This patient’s baseline biopsy was classified in group 1. Analysis of patient’s 293 pulmonary function measures showed that FVC, first second of forced expiration (FEV1) and

294 adjusted diffusing capacity of carbon monoxide (DLCOadj) decreased over time (Supplementary

295 Figure S8). These observations suggest that using the combination of core binding factor and

296 FOXO4 captures the features of FVC decline in SSc skin biopsies.

297 To further validate this finding, we applied an activity score calculation to an independent

298 peripheral blood mononuclear cell (PBMC) dataset (Cheadle et al. (38)). Using the same pair of

299 regulators, we again found that samples were able to be divided into four groups (Supplementary

300 Figure S9). Notably, half of the patients in group 1 had ILD which is the largest ILD fraction

301 compared to other groups (16.7 – 33.3% ILD, Supplementary Figure S9). These observations

302 suggest that the use of regulator pairs, such as core binding factor and FOXO4, is able to identify

303 a subgroup of SSc patients that have worse skin disease and are more likely to have ILD. These

304 findings indicate that our skin regulator signature is also applicable to blood samples.

305

306 Discussion

307 In order to understand the functions of regulators of gene expression in the context of SSc, our

308 computational framework utilized gene expression profiles and target gene lists of regulators to

309 calculate the activity scores from mRNA expression data. Our results are robust because we

310 included five independent SSc gene expression datasets (Table 1).

311 Our results demonstrate that intrinsic subsets were clustered together based on regulators’

312 activity scores (Supplementary Figures S1 to S4). Interestingly, we found that a small number of

313 SSc patients identified as “normal-like” by gene expression grouped with inflammatory or

314 fibroproliferative subset when ordered using regulator activity. This may have been the result of

315 using all genes in the genome that passed basic quality filters to infer regulator activity score rather 316 than a more limited intrinsic gene list. Activity scores could present a future opportunity to further

317 investigate patients classified as “normal-like” by gene expression.

318 The intrinsic subset classification system has been used to stratify patients in SSc clinical

319 trials (25) and may predict response to therapy (39-41). The TF signatures reported here may

320 assist in further stratification of SSc patients within the existing intrinsic subsets in both cross-

321 sectional and longitudinal studies. We hope the data we generate here will provide mechanistic

322 insight into these subsets for personalized medicine in SSc as they provide a comprehensive view

323 of the regulators that underlie the intrinsic gene expression subsets, and provides the opportunity

324 to mechanistically understand the regulatory networks that give rise to these different groups of

325 patients.

326 An intriguing result from this analysis was that although each intrinsic subset often

327 contained different regulators, the enriched pathways that were activated showed a high level of

328 consistency (Supplementary Table S2). For example, we found the MAPK signaling, P13K-Akt

329 signaling, and TGF-beta signaling pathways were consistently identified across subsets, which

330 suggest they are essential pathways for SSc pathogenesis. Four KEGG pathways, chronic myeloid

331 leukemia, endocrine resistance, the longevity regulating pathway, and transcriptional mis-

332 regulation in cancer were all shared by fibroproliferative, inflammatory, and limited samples in

333 the Milano dataset (the remaining three datasets lacked limited SSc patients). Several microbiome

334 related pathways were enriched in inflammatory subsets suggesting an important relationship

335 between SSc and microbiome (Supplementary Table S2) (42, 43).

336 The results of these analyses provide a catalogue of transcription factors and miRNAs that

337 are deregulated in SSc and provides information about their correlation with clinical covariates

338 associated with skin and lung disease. We hope this information can now be used to study the 339 mechanisms of SSc in these more severe patients. There were a number of immune-related

340 regulators that were associated with more severe disease (Figure 4). A regulators correlation of

341 MRSS across datasets provides one method by which regulators could be prioritized for further

342 study (Supplementary Table S5 to S8). As an example, the activity of FOX family TFs was

343 positively correlated with MRSS in the fibroproliferative subsets, while E2F family regulators

344 were negatively correlated with MRSS in the fibroproliferative and surprisingly also in the

345 inflammatory subsets (Figure 4C). Previous studies have shown that a dysfunction in E2F

346 signaling enhances the function of inflammatory cytokines and leads to autoimmunity (44-46).

347 We repeatedly identified the RUNX1-related core binding factor to be a regulator of the

348 inflammatory subset (Figures 2, 3 and 4). ILD patients with higher RUNX1-related core binding

349 factor activities were more likely to be dcSSc and were more likely to have inflammatory

350 signatures (Supplementary Tables S16). In contrast, samples from patients with lower RUNX1-

351 related core binding activities were more likely to be from normal-like patients (Supplementary

352 Tables S16). Additionally, we found that high expression of two other RUNX1-related core

353 binding factors (AML1_01 and AML1_Q6) identified inflammatory subsets with the highest

354 MRSS (mean PCC = 0.28). RUNX1-related core binding factors are essential immune regulators

355 (36, 37), and decreased expression has been reported in SSc T-regulatory cells (47). Our regulator

356 interaction network in inflammatory subsets suggests RUNX1 may be a central regulator of

357 inflammatory processes in SSc end target tissues (Supplementary Figure S10).

358 We also identified novel subgroups within fibroproliferative and inflammatory subsets that

359 may be at risk for more severe disease (Figure 5). Samples from patients assigned to a double-high

360 regulator group were more likely to be from patients have higher MRSS and with dcSSc. Analysis

361 of multiple datasets suggests patients in these double high groups were more likely to have FVC 362 decline. We first identified this in Assassi et al. (14) and validated it in independent skin and

363 PBMC datasets, which suggests that the regulator pairing is informative across multiple tissues.

364 The results suggested that DMARDs were not a major confounder or driver of these novel

365 subgroups (Supplementary Table S12, S14 and S16).

366 Limitations of our analyses include the small number of samples of limited intrinsic subset

367 across the cohorts and the dependence on published clinical data. The Milano and Hinchcliff

368 datasets contained small numbers of limited patients (n=7 and n=2, respectively). There were no

369 patient samples in Pendergrass or Assassi datasets classified as limited. Although, we did observe

370 a cluster of TFs for the limited patients in Milano, and we have listed the enriched pathways in

371 Supplementary Table S2, it was not possible to validate these results in an independent dataset. A

372 second limitation is that our definitions of ILD were variable across cohorts. In these analyses,

373 ILD was defined as in each of the original publications with the exception of the Hinchcliff cohort,

374 where ILD was defined using the criteria defined in the methods.

375 In summary, we applied computational approaches to investigate the function of regulators

376 associated with SSc pathogenesis. Though SSc is a complex and heterogeneous disease, taking

377 advantage of the previously well-defined intrinsic subsets, we identified the most significant and

378 highly correlated TFs for fibroproliferative and inflammatory patients. These observations might

379 provide a list of novel regulators for those distinct SSc patients. The prediction of ILD could

380 provide additional information to determine probability of ILD development over time.

381

382 383 Methods

384 Dataset collection

385 Two types of datasets were used, gene expression datasets from patients with SSc and publicly

386 available target gene lists for regulators. The regulator target gene lists and motif gene sets (C3)

387 were downloaded from MSigDB (48, 49) (Nov, 2017), which totaled 836 gene sets, of which 615

388 were TFs and 221 were miRNAs. Five independent SSc gene expression datasets were obtained

389 from the gene expression omnibus (GEO) (50), as shown in Table 1. Milano (GSE9285) (7),

390 Pendergrass (GSE32413) (6), and Hinchcliff datasets (GSE59787) (5) were generated by analysis

391 of skin biopsies on Agilent Technologies two- channel DNA microarrays in a common reference

392 design and included 75, 89, and 165 samples, respectively. For these three datasets, both lesional

393 (forearm) and non-lesional (lower back) skin biopsies were collected and processed using similar

394 protocols, as described in the original studies. These studies have shown the consistent gene

395 expression between lesional (forearm) and non-lesional (lower back) samples. Intrinsic subset

396 assignments were as defined in the original publications. Assassi (GSE58095) (14), contains gene

397 expression data collected from 102 SSc involved forearm skin biopsies using a single channel

398 DNA microarray and intrinsic subset assignments were determined in a separate study using a

399 previously trained classifier (25). Additionally, we downloaded an independent PBMC RNA gene

400 expression dataset, Cheadle et al. (GSE33463) (38), that contains blood from 27 patients with SSc,

401 8 of whom had ILD as previously defined. Clinical data, such as MRSS, disease duration, ILD

402 status for each dataset were either obtained from NCBI GEO or were requested from the authors

403 of each study. Early SSc was defined using criteria for each published dataset (usually as SSc

404 disease duration less than 18 months from the time of the first non-Raynaud symptom attributed

405 to SSc to the sample collection time). 406

407 SSc-ILD definition across datasets

408 In this meta-analysis of published cohorts, presence or absence of ILD used was as defined

409 in each of the original publications. These were as follows. For the Assassi et al.(14) dataset,

410 which was used as the training dataset, patients were classified as having ILD when the FVC%

411 predicted was less than 70%. For the Hinchcliff et al. dataset (GSE59787), which was used as

412 validation cohort, ILD was defined as the presence of radiographic findings consistent with ILD

413 in the opinion of an expert thoracic radiologist (5, 51). The Cheadle et al. dataset (GSE33463),

414 also used as a validation cohort, defined ILD according to pulmonary function tests (PFT) and

415 chest high-resolution computed tomography (HRCT) (38).

416

417 Regulator activity score calculation

418 To implement our computational pipeline, we first imputed missing gene expression values using

419 the mean value of a probe across samples. Then, probe-level DNA microarray data were collapsed

420 to genes using median values. Next, by integrating target genes and gene expression profiles, we

421 applied a statistical algorithm called BASE (13) to calculate a regulator activity score for each

422 regulator in each sample. This algorithm has been successfully applied in tumors to infer the

423 activity of regulators based on target gene expression (52, 53). By quantile normalizing the gene

424 expression profile, BASE ensures that samples have a comparable distribution at the genetic level.

425 Then, for single channel DNA microarray datasets, BASE normalizes its gene expression based

426 on the median expression of each gene. For two-channel DNA microarray data, since the data has

427 already been processed by calculating log ratios, no additional processing is needed. Afterwards,

428 for a regulator whose target gene list is � = {�, … , �, … , �} (if genei is target gene, gi is 1; 429 otherwise gi is 0) and a sample with gene expression profile is � = {�, … , �, … , �}, where n is

430 the number of genes, BASE sorts the gene expression in decreasing order and generates two

431 cumulative distribution functions. The first function calls the foreground function, which captures

432 the gene expression levels of target genes of this regulator. Then, the background function is

433 calculated to represent the gene expression levels of non-target genes. When target genes have a

434 higher gene expression level, the foreground function increases dramatically and the background

435 function increases slowly. When the target genes have lower levels of gene expression, the

436 foreground function increases slowly and the background function increases dramatically. The

437 maximum difference of these two functions was used as a preliminary regulator activity score in

438 this sample. A higher score indicates that the target genes of a given regulator are being more

439 highly expressed, which translates to a high regulator activity. Further, by performing a

440 permutation that randomly permutes g for 1,000 times, BASE recalculates the regulator activity

441 score to provide a score vector �� = {�, �, … , �}. Finally, by dividing the mean of absolute

442 values of sp, BASE normalizes the preliminary regulator activity score and infers a sample-specific

443 regulatory activity score. As a general approach, we considered TF activity scores as being

444 positively correlated and miRNA activity scores as being negatively correlated with the expression

445 of their target genes. We recognize that there are cases that violate these assumptions (i.e.

446 transcriptional and activating miRNAs) that will have to be considered on an individual

447 gene basis. Z-transformed activity scores for each dataset are listed in Supplementary Table S18.

448

449 Common regulator identification

450 Using the calculated regulatory activity scores, we calculated the correlations between scores and

451 the MRSS that were obtained at the time of the skin biopsy. Mean correlation was calculated across 452 all cohorts and ranked in decreasing order. The 50 most top- and bottom-ranked regulators were

453 considered to be clinically relevant regulators.

454

455 Network construction

456 To build the regulator interaction network, we applied our previous workflow (53). After mapping

457 regulators to the genetic level, an interaction could be identified in the case when the target gene

458 list of a regulator contains the genetic symbol of other regulators within the MSigDB C3 database.

459 To reduce the complexity of the network, duplicated regulators were amalgamated, and

460 unclassified motifs were excluded. In the network, different colors implied the correlation

461 directions (positively or negatively) with MRSS. The arrow on the edges showed the regulatory

462 direction and a circle with an arrow indicates that a regulator participates in a self-feedback loop.

463 The network was created using Cytoscape (54).

464

465 Statistics

466 We calculated “within-between” scores for each regulator in each dataset using the intrinsic

467 subsets as groups (15). False discovery rate (FDR) was provided to define the degree of consistent

468 activity in one intrinsic subset and the degree of the diverse activity between intrinsic subsets. An

469 FDR of 2% was used as a statistical significance cutoff since the results showed the most

470 consistency with the previously defined intrinsic subsets.

471 Pathway enrichment analysis was conducted via g:Profiler (55) with best per-parent group

472 setting. The KEGG pathway dataset was selected (56). A corrected P-value < 0.05 for multiple

473 testing using the default g:SCS method was applied to define the significant enriched pathways. 474 For the subgroup analysis, P-values between groups were calculated by two-tail Mann-

475 Whitney-Wilcoxon test. Then, the FDR was calculated with the p.adjust() function in R. For

476 subgroups in intrinsic subsets, an FDR<1% was used to define significance. Given their smaller

477 sample size, an FDR<5% was used to define significance for subgroups in ILD samples. For the

478 FVC comparisons between baseline and the last time point, two-tail paired samples T test was

479 used.

480 Heatmaps were plotted using the heatmap.2() function in R “gplots” package (57). The

481 activity scores were processed with the scale() function in R. 482 Author contributions

483 YW and MLW designed the study. YW, JMF, DMT and TAW collected the data. YW performed

484 the research and analyzed the data. JMF called intrinsic subsets. MH and MY collected the data

485 for pulmonary fibrosis. YW and DMT performed the analyses of pulmonary fibrosis. YW and

486 MLW wrote the manuscript. All authors discussed the results, and have read and edited the

487 manuscript.

488

489 Acknowledgments

490 This study was supported by grants to MLW from the Scleroderma Research Foundation, the

491 Ralph and Marian Falk Medical Research Trust, and the National Institutes of Health (P50

492 AR060780). YW was supported by the Scleroderma Research Foundation, the Ralph and Marian

493 Falk Medical Research Trust, the National Institutes of Health under P50 AR060780, and the

494 Centers of Biomedical Research Excellence under 1P20GM130454. DMT was supported by NIH

495 Diversity Supplement P50 AR060780-07S1 and T32 GM008704. JMF was supported by the

496 Burroughs-Wellcome Fund Big Data in the Life Sciences Training Program, and the NIH BD2K

497 T32 5T32LM012204-03. 498 REFERENCE 499 500 1. Allanore Y, Simms R, Distler O, Trojanowska M, Pope J, Denton CP, et al. Systemic sclerosis. 501 Nat Rev Dis Primers. 2015;1:15002. 502 2. LeRoy EC, Black C, Fleischmajer R, Jablonska S, Krieg T, Medsger TA, Jr., et al. Scleroderma 503 (systemic sclerosis): classification, subsets and pathogenesis. J Rheumatol. 504 1988;15(2):202-5. 505 3. Nihtyanova SI, Schreiber BE, Ong VH, Rosenberg D, Moinzadeh P, Coghlan JG, et al. 506 Prediction of pulmonary complications and long-term survival in systemic sclerosis. 507 Arthritis Rheumatol. 2014;66(6):1625-35. 508 4. Domsic RT, Rodriguez-Reyna T, Lucas M, Fertig N, and Medsger TA, Jr. Skin thickness 509 progression rate: a predictor of mortality and early internal organ involvement in diffuse 510 scleroderma. Ann Rheum Dis. 2011;70(1):104-9. 511 5. Hinchcliff M, Huang CC, Wood TA, Matthew Mahoney J, Martyanov V, Bhattacharyya S, 512 et al. Molecular signatures in skin associated with clinical improvement during 513 mycophenolate treatment in systemic sclerosis. J Invest Dermatol. 2013;133(8):1979-89. 514 6. Pendergrass SA, Lemaire R, Francis IP, Mahoney JM, Lafyatis R, and Whitfield ML. Intrinsic 515 gene expression subsets of diffuse cutaneous systemic sclerosis are stable in serial skin 516 biopsies. J Invest Dermatol. 2012;132(5):1363-73. 517 7. Milano A, Pendergrass SA, Sargent JL, George LK, McCalmont TH, Connolly MK, et al. 518 Molecular subsets in the gene expression signatures of scleroderma skin. PLoS One. 519 2008;3(7):e2696. 520 8. Chakravarty EF, Martyanov V, Fiorentino D, Wood TA, Haddon DJ, Jarrell JA, et al. Gene 521 expression changes reflect clinical response in a placebo-controlled randomized trial of 522 abatacept in patients with diffuse cutaneous systemic sclerosis. Arthritis Res Ther. 523 2015;17:159. 524 9. Gordon JK, Martyanov V, Franks JM, Bernstein EJ, Szymonifka J, Magro C, et al. Belimumab 525 for the Treatment of Early Diffuse Systemic Sclerosis: Results of a Randomized, Double- 526 Blind, Placebo-Controlled, Pilot Trial. Arthritis Rheumatol. 2018;70(2):308-16. 527 10. Mahoney JM, Taroni J, Martyanov V, Wood TA, Greene CS, Pioli PA, et al. Systems level 528 analysis of systemic sclerosis shows a network of immune and profibrotic pathways 529 connected with genetic polymorphisms. PLoS Comput Biol. 2015;11(1):e1004005. 530 11. Taroni JN, Mahoney JM, and Whitfield ML. The mechanistic implications of gene 531 expression studies in SSc: Insights from Systems Biology. Curr Treatm Opt Rheumatol. 532 2017;3(3):181-92. 533 12. Hobert O. Gene regulation by transcription factors and microRNAs. Science. 534 2008;319(5871):1785-6. 535 13. Cheng C, Yan X, Sun F, and Li LM. Inferring activity changes of transcription factors by 536 binding association with sorted expression profiles. BMC Bioinformatics. 2007;8:452. 537 14. Assassi S, Swindell WR, Wu M, Tan FD, Khanna D, Furst DE, et al. Dissecting the 538 heterogeneity of skin gene expression patterns in systemic sclerosis. Arthritis Rheumatol. 539 2015;67(11):3016-26. 540 15. Prat A, Adamo B, Cheang MC, Anders CK, Carey LA, and Perou CM. Molecular 541 characterization of basal-like and non-basal-like triple-negative breast cancer. Oncologist. 542 2013;18(2):123-33. 543 16. Segil N, Roberts SB, and Heintz N. Mitotic phosphorylation of the Oct-1 homeodomain 544 and regulation of Oct-1 DNA binding activity. Science. 1991;254(5039):1814-6. 545 17. Yu Z, Xu H, Wang H, and Wang Y. Foxc1 promotes the proliferation of fibroblast-like 546 synoviocytes in rheumatoid arthritis via PI3K/AKT signalling pathway. Tissue Cell. 547 2018;53:15-22. 548 18. Tone M, Powell MJ, Tone Y, Thompson SA, and Waldmann H. IL-10 gene expression is 549 controlled by the transcription factors Sp1 and Sp3. J Immunol. 2000;165(1):286-91. 550 19. Rotheneder H, Geymayer S, and Haidweger E. Transcription factors of the Sp1 family: 551 interaction with E2F and regulation of the murine thymidine kinase . J Mol Biol. 552 1999;293(5):1005-15. 553 20. Ihn H, and Tamaki K. Increased phosphorylation of transcription factor Sp1 in scleroderma 554 fibroblasts: association with increased expression of the type I collagen gene. Arthritis 555 Rheum. 2000;43(10):2240-7. 556 21. Strehlow D, and Korn JH. Biology of the scleroderma fibroblast. Curr Opin Rheumatol. 557 1998;10(6):572-8. 558 22. Blake JA, and Ziman MR. : regulators of lineage specification and progenitor cell 559 maintenance. Development. 2014;141(4):737-51. 560 23. Selmi C, Barin JG, and Rose NR. Current trends in autoimmunity and the nervous system. 561 J Autoimmun. 2016;75:20-9. 562 24. Lockshin MD. Sex differences in autoimmune disease. Lupus. 2006;15(11):753-6. 563 25. Franks JM, Martyanov V, Cai G, Wang Y, Li Z, Wood TA, et al. A Machine Learning Classifier 564 for Assigning Individual Patients With Systemic Sclerosis to Intrinsic Molecular Subsets. 565 Arthritis Rheumatol. 2019. 566 26. Kim E, Kim W, Lee S, Chun J, Kang J, Park G, et al. TRAF4 promotes lung cancer 567 aggressiveness by modulating tumor microenvironment in normal fibroblasts. Sci Rep. 568 2017;7(1):8923. 569 27. Liu K, Wu X, Zang X, Huang Z, Lin Z, Tan W, et al. TRAF4 Regulates Migration, Invasion, and 570 Epithelial-Mesenchymal Transition via PI3K/AKT Signaling in Hepatocellular Carcinoma. 571 Oncol Res. 2017;25(8):1329-40. 572 28. Yao W, Wang X, Cai Q, Gao S, Wang J, and Zhang P. TRAF4 enhances osteosarcoma cell 573 proliferation and invasion by Akt signaling pathway. Oncol Res. 2014;22(1):21-8. 574 29. Ruiz-Perez MV, Henley AB, and Arsenian-Henriksson M. The MYCN Protein in Health and 575 Disease. Genes (Basel). 2017;8(4). 576 30. Rao A, Luo C, and Hogan PG. Transcription factors of the NFAT family: regulation and 577 function. Annu Rev Immunol. 1997;15:707-47. 578 31. Demagny H, Araki T, and De Robertis EM. The tumor suppressor Smad4/DPC4 is regulated 579 by phosphorylations that integrate FGF, Wnt, and TGF-beta signaling. Cell Rep. 580 2014;9(2):688-700. 581 32. Shimoda LA, and Semenza GL. HIF and the lung: role of hypoxia-inducible factors in 582 pulmonary development and disease. Am J Respir Crit Care Med. 2011;183(2):152-6. 583 33. Ueno M, Maeno T, Nomura M, Aoyagi-Ikeda K, Matsui H, Hara K, et al. Hypoxia-inducible 584 factor-1alpha mediates TGF-beta-induced PAI-1 production in alveolar macrophages in 585 pulmonary fibrosis. Am J Physiol Lung Cell Mol Physiol. 2011;300(5):L740-52. 586 34. Tang TT, and Lasky LA. The forkhead transcription factor FOXO4 induces the down- 587 regulation of hypoxia-inducible factor 1 alpha by a von Hippel-Lindau protein- 588 independent mechanism. J Biol Chem. 2003;278(32):30125-35. 589 35. Mummler C, Burgy O, Hermann S, Mutze K, Gunther A, and Konigshoff M. Cell-specific 590 expression of runt-related transcription factor 2 contributes to pulmonary fibrosis. FASEB 591 J. 2018;32(2):703-16. 592 36. Alarcon-Riquelme ME. Role of RUNX in autoimmune diseases linking rheumatoid arthritis, 593 psoriasis and lupus. Arthritis Res Ther. 2004;6(4):169-73. 594 37. de Bruijn MF, and Speck NA. Core-binding factors in hematopoiesis and immune function. 595 Oncogene. 2004;23(24):4238-48. 596 38. Cheadle C, Berger AE, Mathai SC, Grigoryev DN, Watkins TN, Sugawara Y, et al. Erythroid- 597 specific transcriptional changes in PBMCs from pulmonary hypertension patients. PLoS 598 One. 2012;7(4):e34951. 599 39. Franks J MV, Wood TA, Crofford L, Keyes-Elstein L, Furst DE, Goldmuntz E, Mayes MD, 600 McSweeney P, Nash R, Sullivan K, Whitfield ML. Machine Learning Classification of 601 Peripheral Blood Gene Expression Identifies a Subset of Patients with Systemic Sclerosis 602 Most Likely to Show Clinical Improvement in Response to Hematopoietic Stem Cell 603 Transplant. Arthritis Rheumatol. 2018;70 (suppl 10). 604 40. Hinchcliff M, Toledo DM, Taroni JN, Wood TA, Franks JM, Ball MS, et al. Mycophenolate 605 Mofetil Treatment of Systemic Sclerosis Reduces Myeloid Cell Numbers and Attenuates 606 the Inflammatory Gene Signature in Skin. J Invest Dermatol. 2018;138(6):1301-10. 607 41. Khanna D, Spino C, Johnson S, Chung L, Whitfield ML, Denton CP, et al. Abatacept in Early 608 Diffuse Cutaneous Systemic Sclerosis: Results of a Phase II Investigator-Initiated, 609 Multicenter, Double-Blind, Randomized, Placebo-Controlled Trial. Arthritis Rheumatol. 610 2020;72(1):125-36. 611 42. Johnson ME, Franks JM, Cai G, Mehta BK, Wood TA, Archambault K, et al. Microbiome 612 dysbiosis is associated with disease duration and increased inflammatory gene expression 613 in systemic sclerosis skin. Arthritis Res Ther. 2019;21(1):49. 614 43. Volkmann ER. Intestinal microbiome in scleroderma: recent progress. Curr Opin 615 Rheumatol. 2017;29(6):553-60. 616 44. Zhang R, Wang L, Pan JH, and Han J. A critical role of E2F transcription factor 2 in 617 proinflammatory cytokines-dependent proliferation and invasiveness of fibroblast-like 618 synoviocytes in rheumatoid Arthritis. Sci Rep. 2018;8(1):2623. 619 45. Ankers JM, Awais R, Jones NA, Boyd J, Ryan S, Adamson AD, et al. Dynamic NF-kappaB 620 and E2F interactions control the priority and timing of inflammatory signalling and cell 621 proliferation. Elife. 2016;5. 622 46. Murga M, Fernandez-Capetillo O, Field SJ, Moreno B, Borlado LR, Fujiwara Y, et al. 623 Mutation of in mice causes enhanced T lymphocyte proliferation, leading to the 624 development of autoimmunity. Immunity. 2001;15(6):959-70. 625 47. Kataoka H, Yasuda S, Fukaya S, Oku K, Horita T, Atsumi T, et al. Decreased expression of 626 Runx1 and lowered proportion of Foxp3(+) CD25(+) CD4(+) regulatory T cells in systemic 627 sclerosis. Mod Rheumatol. 2015;25(1):90-5. 628 48. Liberzon A, Birger C, Thorvaldsdottir H, Ghandi M, Mesirov JP, and Tamayo P. The 629 Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 630 2015;1(6):417-25. 631 49. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene 632 set enrichment analysis: a knowledge-based approach for interpreting genome-wide 633 expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545-50. 634 50. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: 635 archive for functional genomics data sets--update. Nucleic Acids Res. 2013;41(Database 636 issue):D991-5. 637 51. Showalter K, Hoffmann A, Rouleau G, Aaby D, Lee J, Richardson C, et al. Performance of 638 Forced Vital Capacity and Lung Diffusion Cutpoints for Associated Radiographic Interstitial 639 Lung Disease in Systemic Sclerosis. J Rheumatol. 2018;45(11):1572-6. 640 52. Wang Y, Ung MH, Xia T, Cheng W, and Cheng C. Cancer cell line specific co-factors 641 modulate the FOXM1 cistrome. Oncotarget. 2017;8(44):76498-515. 642 53. Andrews E, Wang Y, Xia T, Cheng W, and Cheng C. Contextual Refinement of Regulatory 643 Targets Reveals Effects on Breast Cancer Prognosis of the Regulome. PLoS Comput Biol. 644 2017;13(1):e1005340. 645 54. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software 646 environment for integrated models of biomolecular interaction networks. Genome Res. 647 2003;13(11):2498-504. 648 55. Reimand J, Arak T, Adler P, Kolberg L, Reisberg S, Peterson H, et al. g:Profiler-a web server 649 for functional interpretation of gene lists (2016 update). Nucleic Acids Res. 650 2016;44(W1):W83-9. 651 56. Kanehisa M, and Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids 652 Res. 2000;28(1):27-30. 653 57. Gregory R. Warnes BB, Lodewijk Bonebakker, Robert Gentleman, Wolfgang Huber, Andy 654 Liaw, Thomas Lumley, Martin Maechler, Arni Magnusson, Steffen Moeller, Marc Schwartz, 655 Bill Venables and Tal Galili. gplots: Various R Programming Tools for Plotting Data. CRAN 656 R. 2020.

657

658 Target Gene Data Gene Expression Milano et al. Pendergrass et al. Hinchcliff et al. MSigDB Assassi et al. C3 BASE (836)

Calculation of sample-specific regulatory activity scores for each dataset

Pairwise Analysis Pairwise Analysis Novel subgroup has Novel subgroup has more severe disease more severe disease Calculation of correlations between activity scores and MRSS for each dataset dcSSc High High lcSSc unclear

Group4 Group1

0 Fibroproliferative Inflammatory Group3 Group2 dcSSc lcSSc Identification of clinically Combination of clinically Identification of clinically Group4 Group1 other 0 relevant regulators relevant regulators Group3 Group2

NMYC_01 activity relevant regulators SMAD4_Q6 activity

Low Low Low 0 High Low 0 High CART1_01 activity NFAT_Q6 activity Regulatory Network in SSc Pairwise Analysis PIT1 FOXL1 COREBINDINGFACTOR interacts with Novel subgroup has More like to have ILD or interacts with ETF

interacts with FOXJ1 interacts with FALZ PTF1A

interacts with XBP1 interacts with interacts with Novel subgroup has interacts with ZID interacts with AML

interacts with

interacts with GCMA CART1interacts with

interacts with interacts with FVC decline over interacts with

interacts with more severe disease

interacts with interacts with interacts with NFATC1 interacts with

interacts with

interacts with interacts with ESR1 interacts with interacts with TCF2 FOXD3 NFKB1 SRF interacts with

interacts with interacts with interacts with interacts with interacts with interacts with NKX3-1 interacts with

interacts with interacts with interacts with more severe disease interacts with NKX2-1

STAT interacts with

interacts with interacts with

interacts with interacts with interacts with interacts with interacts with interacts with ISRE IRF interacts with interacts with IRF7 interacts with interacts with interacts with 36 months of follow-up

interacts with POU2F1 CDP interacts with

interacts with CDPCR1 FOXC1 interacts with interacts with

interacts with interacts with

SRY interacts with interacts with ELF2 interacts with interacts with

interacts with PAX3 interacts with

interacts with FOXJ2 interacts with STAT5B interacts with ETS2 interacts with

interacts with High AMEF2 interacts with FOXO4 interacts with PU1 interacts with interacts with interacts with FOXF2 interacts with

interacts with FXR MIR-455 STATinteracts with 6 interacts with interacts with interacts with MIR-504 interacts with PAX4

interacts with T3R GFI1 interacts with STAT5A MIR-330 MIR-329 interacts with ATF1

interacts with

interacts with

interacts with MIR-128A/128B interacts with G4 G1 P=0.02

interacts with

interacts with MIR-512-3P interacts with interacts with MIR-365 interacts with 0 interacts with interacts with

interacts with interacts with

interacts with interacts with interacts with

interacts with

interacts with interacts with interacts with MIR-328 interacts with interacts with

interacts with

interacts with interacts with OR_Q6 interacts with interacts with interacts with G3 interacts with interacts with G2

interacts with interacts with interacts with interacts with interacts with interacts with STATinteracts3 with interacts with interacts with T

interacts with interacts with

interacts with interacts with

interacts with interacts with interacts with interacts with interacts with

interacts with

interacts with interacts with

interacts with interacts with

interacts with interacts with interacts with interacts with interacts with interacts with interacts with 100 interacts with

interacts with interacts with

MIR-485-3P C interacts with RFX1

interacts with interacts with interacts with interacts with interacts with

interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with

interacts with interacts with interacts with interacts with ETS1 A interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with

interacts with interacts with interacts with interacts with interacts with interacts with interacts with

interacts with interacts with interacts with interacts with

interacts with F interacts with interacts with PEA3

interacts with interacts with

interacts with interacts with

interacts with interacts with interacts with MADH4 interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with

interacts with interacts with

interacts with

interacts with interacts with interacts with

interacts with

interacts with interacts with interacts with interacts with interacts with interacts with interacts with

interacts with

interacts with interacts with interacts with interacts with

interacts with

interacts with interacts with interacts with interacts with

interacts with HSF activity

MAZ interacts with interacts with

interacts with

interacts with interacts with

TP53 interacts with 80 TCF11MAFG interacts with interacts with SP3 E2F1 interacts with ESRRA interacts with

interacts with interacts with FVC% interacts with interacts with

interacts with

interacts with TFAP2A interacts with interacts with interacts with interacts with interacts with NKX2-5 interacts with

interacts with

interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with

interacts with

interacts with interacts with

interacts with interacts with interacts with

interacts with ZNF238 SF1 interacts with E2F1DPinteracts with 1

interacts with JUN interacts with NR2F2 interacts with

interacts with

interacts with interacts with MI interactsR-142-5 with P interacts with interacts with NFYA interacts with interacts with ARNT interacts with

interacts with interacts with

interacts with interacts with E2F1DP2 interacts with LEF1 interacts with interacts with interacts with interacts with interacts with interacts with

interacts with

interacts with

interacts with interacts with MYC interacts with interacts with interacts with HNF4A interacts with interacts with MIR-34A/34C/449 interacts with

interacts with interacts with

interacts with interacts with

interacts with interacts with interacts with interacts with interacts with E2F4DP2 interacts with interacts with ZIC1 interacts with HMX1 interacts with ETS interacts with STAT1 interacts with E2F

interacts with ELF1 interacts with interacts with interacts with MIR-7 interacts with interacts with interacts with HSF2 MIR-29A/29B/29C P300 interacts with

interacts with 60 MIF1 NFMUE1 interacts with MIR-369-3P interacts with COREBINDING

interacts with MIR-199A/199B TFAP4 UBP1 SREBF1 CREB1 Low interacts with MIR-484 RREB1 MIR-19A/19B interacts with MIR-520G/520H GABPA interacts with TLX2 AP1FJ MIR-145 ELK1 MIR-126 interacts with ATF3 Low 0 High MIR-150 MIR-18A/18B MIR-27A/27B baseline last TFAP2G LET-7A,LET-7B,LET-7C,LET-7D,LET-7E,LET-7F/98,LET-7G,LET-7I FOXO4_02 activity 659 660 Figure 1. Overview of computational workflow in this study. Briefly, by integrating target gene lists and 661 SSc patients’ gene expression, we calculated sample-specific regulatory activity scores for each dataset. 662 Then, by calculating the correlations between activity scores and MRSS, fibroproliferative and 663 inflammatory associated regulators are identified. Using those regulators, we further identify novel 664 subgroups of patients in given intrinsic subset, build a regulatory network in the context of SSc and identify 665 novel subgroups of SSc patients who are more like to have ILD or FVC decline over 36 months of follow- 666 up. A B

1 2 1 Fibroproliferative 2 Inflammatory Limited Normal-like Activity Scores

Low High

Milano et al Assassi et al

Activator protein E2F MYC Estrogen receptor 1 ETS domain-containing protein Glucocortocoid receptor Nuclear respiratory factor Immunoglobulin protein Sterol regualtory element-binding protein MYC CCAAT enhancer binding protein 2 Specificity protein POU Class 1 protein Paired box protein 3 Helox-loop-helix protein FOXA2, FOXD3, FOXJ1 and FOXJ2 Activator protein Estrogen receptor STAT/STAT5A/STAT5B Early B cell factor SMAD4 Early growth response FOXO3 Enhancer factor C RUNX1-related core binding factor Fatty acid binding protein 1 Nuclear factor of activated T cells GATA1/GATA2/GATA3 Nuclear factor of activated B cells Helix-loop-helix protein Interferon regulatory factor Sterol regualtory element-binding protein ETS proto-oncogene protein SMAD/SMAD4 STAT3/STAT4 Sex determining region Y GATA/GATA1/GATA3 T cell translocation protein FOX/FOXO4/FOXQ1 FOXA1/FOXF1/FOXF2 2 Nuclear factor of activated T-cells Sex determining region Y TATA-box binding protein STAT/STAT3/STAT5A/STAT5B E2F RUNX1-related core binding factor E2F1 Nuclear factor of activated T cells Nuclear factor of activated B cells MYC Interferon regulatory factor YY1 ETS proto-oncogene protein B cell nuclear factor ETS domain-containing protein 667 Nuclear respiratory factor 668 Figure 2. Activity scores reproduce intrinsic subtypes in SSc datasets. Heat maps were plotted to show 669 the intrinsic subsets clusters in the A) Milano et al. and B) Assassi et al. datasets. In the heatmap, each row 670 is a regulator, each column is an SSc sample, and different colors infer intrinsic subtypes (red: 671 fibroproliferative; purple: inflammatory; yellow: limited; and green: normal-like). The cells in the heatmap 672 are the normalized activity scores, within which blue is a low score and red is a high score. Driven regulators 673 were listed for each cluster. A B OCT1_07 ETS2_B PIT_Q6 NFAT_O6 FREAC3_01 STAT_01 FREAC2_01 NFKB_Q6 FREAC7_01 STAT5A FOXD3_01 SRF_C FOXO4_02 COREBINDINGFACTOR_Q6 CART1_01 IRF_Q6 FOXJ2_02 SMAD4_Q6 GFI1_01 STAT3_02 ER_06_02 STAT6_02 microRNAs PU1_Q6 PCC in Inflammatory samples PCC in Fibroproliferative samples

SMAD4_Q6 NFMUE1_Q6 MYC_Q2 GATTGGY_NFY_Q6_01 E2F_Q2 E2F4DP_01 MIF1_01 ARNT_01 STAT3_02 SF1_Q6 P53_02 ERR1_Q2 RFX1_01 E2F1_Q6 AP2GAMMA_01 Let7 complex TGAYRTCA_ATF3_Q6 STAT1_03 ETS1_B RYTGCNNRGNAAC_MIF_01 P300_01 E2F_Q6 NKX25_02 microRNAs

rass rass Mean Mean Milano Assassi Milano Assassi Hinchcliff Hinchcliff enderg enderg 674 P P 675 Figure 3. Identification of clinically relevant regulators. Heat maps were plotted to show the Pearson 676 correlation coefficients (PCC) between activity scores and the MRSS of samples across datasets for A) 677 fibroproliferative; and B) inflammatory SSc samples. In the heatmap, each row is a regulator, each column 678 is a dataset, and the cells in the heatmap are PCC (within which blue is low PCC and red is high PCC). The 679 median PCC was calculated for each regulator across datasets to show the correlation power. The heatmap 680 was plotted by ranking the median PCC in decreasing order. The short, black, bold line is the cutoff we 681 used to select out the common regulators in all datasets. Significant regulators are listed. A B Fibroproliferative Inflammatory

RREB1 interacts with ISRE E2F interacts with STAT6 HNF4A interacts with ETF interacts with MIR-29A/29B/29C interacts with TCF2 interacts with ZID interacts with interacts with interacts with ETS interacts with

interacts with interacts with ATF1 MYC interacts with T3R interacts with interacts with STAT5A

interacts with interacts with interacts with Positive correlation interacts with interacts with interacts with interacts with

interacts with

interacts with CDP

interacts with interacts with interacts with

interacts with MIR-455 MIR-365 FOXD3 MIR-19A/19B ARNT interacts with NFMUE1 HSF ELF2 interacts with

interacts with

interacts with interacts with GFI1 interacts with POU2F1interacts with

interacts with

interacts with interacts with CDPCR1 interacts with interacts with

PIT1 interacts with NFKB1

interacts with interacts with interacts with PU1 interacts with interacts with interacts with

interacts with interacts with

interacts with

interacts with ZICinteracts with 1 PEA3 interacts with ETS1 interacts with interacts with MIR-7 PTF1A interacts with interacts with interacts with interacts with NFMUE1 Negative correlation interacts with LEF1 interacts with interacts with IRF interacts with interacts with interacts with interacts with interacts with

interacts with STAT5B interacts with interacts with

interacts with FALZ interacts with MIR-512-3P interacts with UBP1 ATF3 interacts with interacts with PAX3 interacts with NKX3-1 interacts with ESRRA interacts with interacts with interacts with interacts with SF1 interacts with interacts with TLX2

interacts with interacts with

interacts with interacts with

interacts with interacts with interacts with interacts with SRF NKX2-5 interacts with P300 interacts with interacts with FOXL1 interacts with MIR-328 interacts with interacts with PAX3 interacts with MAZ interacts with interacts with interacts with interacts with PEA3 interacts with interacts with RFX1 interacts with RFX1 interacts with FOXO4 interacts with interacts with interacts with interacts with NR2F2 interacts with interacts with interacts with interacts with FOXC1 ETS2 interacts with interacts with

interacts with interacts with interacts withCART1 interacts with XBP1

interacts with interacts with

NFATC1 interacts with interacts with

interacts with MIR-128A/128B interacts with ZNF238 interacts with JUN NKX2-1 interacts with ESR1 interacts with interacts with interacts with SP3 interacts with interacts with interacts with interacts with STAT3 interacts with interacts with interacts with interacts with interacts with

interacts with interacts with interacts with interacts with FXR interacts with interacts with interacts with SRY CREB1 interacts with interacts with interacts with interacts with STAT3 interacts with

interacts with interacts with MIR-485-3P interacts with FOXJ2 interacts with interacts with MIR-199A/199B TCF11MAFG interacts with interacts with ELKinteracts1 with interacts with PAX4 interacts with interacts with

interacts with

interacts with

interacts with interacts with

interacts with interacts with ETS1 AMEF2 interacts with

E2F1DP2 interacts with interacts with NFYA

interacts with FOXF2 interacts with MIR-150 interacts with

interacts with TFAP4 interacts with interacts with E2F1 TFAP2A MIR-330 GCMA interacts with ELF1 interacts with

interacts with E2F interacts with MIR-34A/34C/449 interacts with interacts with

interacts with interacts with interacts with MIR-142-5P SRY FOXJ1 HSF2 MIR-329 MIR-520G/520H MADH4 interacts with E2F4DP2 HMX1 MIR-142-5P TFAP2G SREBF1 E2F1DP1

interacts with MIF1 LET-7A,LET-7B,LET-7C,LET-7D,LET-7E,LET-7F/98,LET-7G,LET-7I

C Combined

PIT1 FOXL1 COREBINDINGFACTOR

interacts with

interacts with ETF

interacts with FOXJ1 interacts with FALZ PTF1A

interacts with XBP1 interacts with

interacts with interacts with ZID interacts with AML

interacts with

interacts with GCMA CART1interacts with

interacts with interacts with

interacts with

interacts with

interacts with interacts with interacts with NFATC1 interacts with

interacts with

interacts with interacts with ESR1 interacts with interacts with Positive correlation only in Fibroproliferative TCF2 FOXD3 NFKB1 SRF interacts with

interacts with interacts with interacts with interacts with interacts with interacts with NKX3-1 interacts with

interacts with interacts with interacts with

interacts with NKX2-1

STAT interacts with

interacts with interacts with

interacts with interacts with interacts with interacts with interacts with interacts with ISRE IRF interacts with interacts with IRF7 interacts with interacts with interacts with

interacts with POU2F1 CDP interacts with

interacts with CDPCR1 FOXC1 interacts with interacts with

interacts with interacts with

SRY interacts with interacts with ELF2 interacts with interacts with

interacts with PAX3 interacts with

interacts with FOXJ2 interacts with STAT5B Positive correlation only in Inflammatory interacts with ETS2 interacts with

interacts with

AMEF2 interacts with FOXO4 interacts with PU1 interacts with interacts with interacts with FOXF2 interacts with

interacts with FXR MIR-455 STATinteracts with 6 interacts with

interacts with

interacts with MIR-504 interacts with PAX4

interacts with T3R GFI1 interacts with STAT5A MIR-330 MIR-329 interacts with ATF1

interacts with interacts with Positive correlation in both Fibroproliferative

interacts with MIR-128A/128B interacts with

interacts with

interacts with MIR-512-3P interacts with interacts with MIR-365 interacts with interacts with interacts with and Inflammatory

interacts with interacts with

interacts with interacts with interacts with

interacts with

interacts with interacts with interacts with MIR-328 interacts with interacts with

interacts with

interacts with interacts with

interacts with interacts with interacts with

interacts with interacts with

interacts with interacts with interacts with interacts with interacts with interacts with STATinteracts3 with interacts with interacts with

interacts with interacts with

interacts with interacts with

interacts with interacts with interacts with interacts with interacts with Positive correlation in Fibroproliferative and

interacts with

interacts with interacts with

interacts with interacts with

interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with MIR-485-3P interacts with RFX1

interacts with interacts with interacts with interacts with interacts with

interacts with interacts with interacts with interacts with negative correlation in Inflammatory interacts with interacts with interacts with interacts with interacts with interacts with interacts with

interacts with interacts with interacts with interacts with ETS1

interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with

interacts with interacts with interacts with interacts with interacts with interacts with interacts with

interacts with interacts with interacts with interacts with interacts with interacts with interacts with PEA3

interacts with interacts with

interacts with interacts with

interacts with interacts with interacts with MADH4 interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with Positive correlation in Inflammatory and interacts with interacts with interacts with interacts with

interacts with interacts with

interacts with

interacts with interacts with interacts with

interacts with

interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with negative correlation in Fibroproliferative

interacts with interacts with interacts with interacts with

interacts with

interacts with interacts with interacts with interacts with HSF interacts with

MAZ interacts with interacts with

interacts with

interacts with interacts with TCF11MAFG TP53 interacts with interacts with interacts with SP3 E2F1 interacts with ESRRA interacts with Negative correlation only in Fibroproliferative interacts with interacts with interacts with interacts with

interacts with

interacts with TFAP2A interacts with interacts with interacts with interacts with interacts with NKX2-5 interacts with

interacts with

interacts with interacts with interacts with interacts with interacts with interacts with interacts with interacts with

interacts with

interacts with interacts with

interacts with interacts with interacts with

interacts with ZNF238 SF1 interacts with E2F1DPinteracts with 1

interacts with JUN interacts with NR2F2 interacts with

interacts with

interacts with interacts with MI interactsR-142-5 with P interacts with interacts with NFYA interacts with interacts with ARNT interacts with

interacts with interacts with

interacts with interacts with E2F1DP2 interacts with LEF1 interacts with interacts with interacts with interacts with interacts with interacts with

interacts with Negative correlation only in Inflammatory interacts with

interacts with interacts with MYC interacts with interacts with interacts with HNF4A interacts with interacts with MIR-34A/34C/449 interacts with

interacts with interacts with

interacts with interacts with

interacts with interacts with interacts with interacts with interacts with E2F4DP2 interacts with interacts with ZIC1 interacts with HMX1 interacts with ETS interacts with STAT1 interacts with E2F

interacts with ELF1 interacts with interacts with interacts with MIR-7 interacts with interacts with interacts with HSF2 MIR-29A/29B/29C P300 interacts with interacts with MIF1 NFMUE1 interacts with MIR-369-3P interacts with Negative correlation in both Fibroproliferative

interacts with MIR-199A/199B TFAP4 UBP1 SREBF1 CREB1 interacts with MIR-484 RREB1 MIR-19A/19B and Inflammatory interacts with MIR-520G/520H GABPA interacts with TLX2 AP1FJ MIR-145 ELK1 MIR-126

interacts with ATF3 MIR-150 MIR-18A/18B MIR-27A/27B TFAP2G 682 LET-7A,LET-7B,LET-7C,LET-7D,LET-7E,LET-7F/98,LET-7G,LET-7I 683 Figure 4. Regulator interaction networks in the context of SSc. Using the intrinsic subtype-specific 684 clinically relevant regulators, we created networks in A) only fibroproliferative samples; B) only 685 inflammatory samples; and C) both of them. In networks A) and B), red nodes represent regulators whose 686 activity scores are positively correlated with MRSS and cyan nodes represent regulators whose negatively 687 scores are positively correlated with MRSS. The size of the node is positively correlated to its degree of 688 connection. The arrow direction points from regulator to target and a circle means self-regulation. In the 689 network of C), different colors represent different clusters. A B C 35 P=0.0007 dcSSc 40 P=0.05 P=0.01 3 lcSSc P=0.02 P=0.03 unclear P=4e-05 P=1e-05 P=0.0001 30 P=0.007

2 60 30 25 1 P=0.04 P=0.03 Group4 Group1 20

0 50 Group3 Group2 20

MRSS 15 −1 Age at dignosis

NMYC_01 activity 40 10

−2 10 Duration at baseline Fibroproliferative

−3 5

0 30

−4 0 −2 −1 0 1 2 3 4

CART1_01 activity Group1 Group2 Group3 Group4 Group1Group2 Group3 Group4 Group1Group2 Group3 Group4

D E F 40 4 30

3 P=7e-05 P=0.01 P=0.007 97.9% 40 20 dcSSc

2 10 83.3% lcSSc #Samples 78.6% 50% 30 other dcSSc 0 1 lcSSc 40 MRSS Group4 Group1 other 20

0 30 SMAD4_Q6 activity

Inflammatory Group3 Group2 74.5% 20 Early −1 10 #Samples 10 50% Late 71.4% 50% −2 0 other −1 0 1 2 3 4 5 NFAT_Q6 activity Group1 Group2 Group3 Group4 690 Group1 Group2 Group3 Group4 691 Figure 5. Regulator pairs identify novel subgroups of samples in intrinsic subsets. Shown is the sample 692 distribution of A) fibroproliferative and D) inflammatory samples based on the activity scores of given 693 regulator pairs. Group 1 (red dots for fibroproliferative and purple dots for inflammatory) samples have 694 positive scores of both regulators. Group 3 (brown dots) samples have negative scores of both regulators. 695 Group 2 (blue dots) and group 4 (grey dots) samples have one positive and one negative score. The boxplot 696 of MRSS comparisons between groups is shown in B) for fibroproliferative and E) for inflammatory 697 samples. C) Boxplots for age at diagnosis and disease duration at baseline are shown for each group of 698 fibroproliferative samples. Two-tailed Mann-Whitney-Wilcoxon test P-values are listed. F) Clinical 699 subtypes and stage fractions for each group are listed in inflammatory samples. Fractions of dcSSc and 700 early-stage samples are listed. A Assassi B Assassi C Assassi

40 50% 30

4 NonILD P=0.003 P=0.0005 P=0.03 ILD 25 3 30 2 20 OR_Q6 activity T C

1 20 MRSS 15 A F 25% G4 G1 #Samples 27% 0 10 G3 G2 10 29% −1 5 COREBINDING −2 0

0 2 4 6 8 FOXO4_02 activity Group1 Group2 Group3 Group4 Group1 Group2 Group3 Group4

D E F Hinchcliff Hinchcliff Hinchcliff Group 1 Group 2 2 35 p=0.02 90 p=0.8 100 G4 G1 30 P=3e-07 P=2e-08 P=0.09

0 80 G3 G2 80 25 P=2e-04 P=7e-05 FVC% 60 OR_Q6 activity −2 20

T 65

C baseline last baseline last A F

MRSS 15 Group 3 Group 4 −4

10 p=0.14 p=0.7 80 100

−6 5 60 80 COREBINDING

0 FVC%

−8 40 60 −2 −1 0 1 2 3 4 5 baseline last baseline last FOXO4_02 activity Group1 Group2 Group3 Group4 701 702 Figure 6. Regulator pairs associated with SSc complicated by pulmonary fibrosis. A) The distribution 703 of samples distribution between ILD and non-ILD SSc samples; B) MRSS comparisons; C) Barplot of the 704 fraction of ILD samples in each group from the Assassi dataset with a given regulator pair; D) Validation 705 of sample distribution; E) MRSS comparisons from the Hinchcliff dataset with the same regulator pair; and 706 F) Validation with patients using their FVC, FEV1, DLCO, and TLC measures.

707

708 709 Table 1. Summarization of datasets used in the study. #Intrinsic subtypes #Clinical #Disease Dataset GEO ID Platform #Samples (proliferative/inflammatory/ subtypes stage limited/normal-like) (dcSSc/lcSSc) (early/late) Milano et Two-channel GSE9285 75 24/9/13/22 31/16 7/45 al. array Pendergrass Two-channel GSE32413 89 26/25/9/23 66/0 59/6 et al. array Hinchcliff Two-channel GSE59787 165 49/42/40/30 111/21 75/64 et al. array Assassi et One-channel GSE58095 102 27/22/0/53 43/18 5/5 al. array Cheadle et One-channel GSE33463 69 /// / / al. array 710