
Comprehensive Understanding of the Response Profiles

A. Background

The current project aims to collect datasets profiling expression patterns of human cytokine treatment response from the NCBI GEO and EBI ArrayExpress databases. The Framework for Data Curation already hosts a list of candidate datasets. You will read the study design and sample annotations to select the relevant datasets and label the sample conditions to enable automatic analysis.

If you want to build a new data collection project for your topic of interest instead of working on our existing cytokine project, please read section D. There, we explain the cytokine project’s configuration as an example of how to create your own curation task.

A.1. Cytokines are a broad category of small molecules that mediate signaling. Many cell types can release cytokines and receive cytokines from other producers through receptors on the cell surface. Despite some overlap in the literature terminology, we exclude hormones and growth factors, which are also essential molecules. Meanwhile, we count two cytokines in the same family as the same if they share the same receptors. In this project, we will focus on the following families and use the member symbols as standard names (Table 1).

Family                     Members (use these symbols as standard cytokine names)
Colony-stimulating factor  GCSF, GMCSF, MCSF
Interferon                 IFNA, IFNB, IFNG
Interleukin                IL1, IL1RA, IL2, IL3, IL4, IL5, IL6, IL7, IL9, IL10, IL11, IL12, IL13, IL15, IL16, IL17, IL18, IL19, IL20, IL21, IL22, IL23, IL24, IL25, IL26, IL27, IL28, IL29, IL30, IL31, IL32, IL33, IL34, IL35, IL36, IL36RA, IL37, TSLP, LIF, OSM
Tumor necrosis factor      TNFA, LTA, LTB, CD40L, FASL, CD27L, CD30L, 41BBL, TRAIL, OPGL, APRIL, LIGHT, TWEAK, BAFF
Unassigned                 TGFB, MIF

Table 1. Cytokine families and members from Appendix III of (Murphy and Weaver, 2016).

A.2. Annotation for Automatic Analysis

Each dataset submission in the GEO or ArrayExpress databases will contain sample annotations. However, the original information typically cannot enable automatic analysis by a computer program. For example, the dataset GSE72502 from the GEO contains six expression profiles with the following sample information (Table 2A). A biologist could interpret this dataset as a study of interferon alpha (IFNA in Table 1) response in monocytes derived from peripheral blood. There are three donors involved, and each donor includes a pair of control and IFNA treated conditions. To generate expression profiles of IFNA treatment, the biologist would first calculate the log fold change (logFC) between the IFNA and Control conditions for each donor, and then compute the median logFC among all three donors.

However, a computer program cannot automatically understand these conditions unless you label them in standardized vocabularies (Table 2B). For every dataset, you can define the columns Treatment, Condition, and Sub Condition. A computer program can first group the samples according to the sub conditions. Within each sub condition group, the program can compute the logFC between any cytokine and control condition. Finally, the program will merge values in different sub conditions into one median value for the condition.

ID          title           cell type                   treatment         donor
GSM1863325  1. PBMC         peripheral blood monocytes  none              A
GSM1863326  2. PBMC         peripheral blood monocytes  none              B
GSM1863327  3. PBMC         peripheral blood monocytes  none              C
GSM1863328  4. PBMC + IFNa  peripheral blood monocytes  interferon-alpha  B
GSM1863329  5. PBMC + IFNa  peripheral blood monocytes  interferon-alpha  A
GSM1863330  6. PBMC + IFNa  peripheral blood monocytes  interferon-alpha  C

A. Original.

ID          Treatment  Condition  Sub Condition
GSM1863325  Control    Monocyte   A
GSM1863326  Control    Monocyte   B
GSM1863327  Control    Monocyte   C
GSM1863328  IFNA       Monocyte   B
GSM1863329  IFNA       Monocyte   A
GSM1863330  IFNA       Monocyte   C

B. Standardized.

Table 2. Original and standardized sample annotations of GSE72502.
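The grouping logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the platform's analysis code: the function name condition_logfc is hypothetical, the expression values are made up for a single gene, and they are assumed to be on a log2 scale already, so that a logFC is a simple difference.

```python
from collections import defaultdict
from statistics import mean, median

# Standardized annotation of GSE72502 (Table 2B).
SAMPLES = [
    {"ID": "GSM1863325", "Treatment": "Control", "Condition": "Monocyte", "Sub Condition": "A"},
    {"ID": "GSM1863326", "Treatment": "Control", "Condition": "Monocyte", "Sub Condition": "B"},
    {"ID": "GSM1863327", "Treatment": "Control", "Condition": "Monocyte", "Sub Condition": "C"},
    {"ID": "GSM1863328", "Treatment": "IFNA", "Condition": "Monocyte", "Sub Condition": "B"},
    {"ID": "GSM1863329", "Treatment": "IFNA", "Condition": "Monocyte", "Sub Condition": "A"},
    {"ID": "GSM1863330", "Treatment": "IFNA", "Condition": "Monocyte", "Sub Condition": "C"},
]

# Hypothetical log2 expression values of one gene (NOT real data).
EXPRESSION = {
    "GSM1863325": 5.0, "GSM1863326": 5.5, "GSM1863327": 6.0,
    "GSM1863328": 8.0, "GSM1863329": 8.0, "GSM1863330": 10.0,
}


def condition_logfc(samples, expression, control="Control"):
    """Return {(condition, treatment): median logFC across sub conditions}."""
    # Group expression values by (condition, sub condition), then by treatment.
    groups = defaultdict(lambda: defaultdict(list))
    for sample in samples:
        key = (sample["Condition"], sample.get("Sub Condition", ""))
        groups[key][sample["Treatment"]].append(expression[sample["ID"]])

    # Within each sub condition group, compare every treatment with the control.
    per_sub = defaultdict(list)
    for (condition, _sub), by_treatment in groups.items():
        if control not in by_treatment:
            continue  # no control in this group, nothing to compare against
        baseline = mean(by_treatment[control])
        for treatment, values in by_treatment.items():
            if treatment != control:
                per_sub[(condition, treatment)].append(mean(values) - baseline)

    # Merge the sub condition values into one median value per condition.
    return {key: median(values) for key, values in per_sub.items()}


result = condition_logfc(SAMPLES, EXPRESSION)
```

With the made-up values above, the per-donor differences are 3.0 (A), 2.5 (B), and 4.0 (C), so the merged value for IFNA in Monocyte is their median.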

Another example is dataset GSE37624, studying the IL1 and IL33 treatment response in the HUVEC cell line (Table 3A). There is no sub condition (e.g., donor) in this study. The dose and duration of treatment are available. We can create the standardized annotation in Table 3B.

ID          protocol                                    cell type
GSM923359   stimulated with IL-1b 0.5ng/ml for 4 hours  HUVEC pool of 10 donors
GSM923360   stimulated with IL-33 50ng/ml for 4 hours   HUVEC pool of 10 donors
GSM923361   control                                     HUVEC pool of 10 donors
GSM923362   control                                     HUVEC pool of 10 donors
GSM923363   stimulated with IL-33 50ng/ml for 4 hours   HUVEC pool of 10 donors
GSM923364   stimulated with IL-1b 0.5ng/ml for 4 hours  HUVEC pool of 10 donors

A. Original.

ID          Treatment  Condition  Dose      Duration
GSM923359   IL1        HUVEC      0.5ng/ml  4h
GSM923360   IL33       HUVEC      50ng/ml   4h
GSM923361   Control    HUVEC                4h
GSM923362   Control    HUVEC                4h
GSM923363   IL33       HUVEC      50ng/ml   4h
GSM923364   IL1        HUVEC      0.5ng/ml  4h

B. Standardized.

Table 3. Original and standardized sample annotations of GSE37624.

In summary, annotations for automatic analysis will include predefined columns with content in standardized vocabularies that a computer program can analyze without human intervention. The role of human curation is to transform the original annotation into standardized forms.

B. Procedure

Step One Register a curator account at the Curation Platform. Email the administrator (Peng Jiang: [email protected]) your username and ask to join the Human Cytokine Response Project. The reviewer account has access to the Human Cytokine Response Project by default. Please use Google Chrome for the best performance. Firefox and Safari also work, although the regular expression functions will be compromised. This platform does NOT work on IE.

Step Two Read the help document of the web platform to understand the workflow and assist functions in the NCI Curation platform. Specifically, the current project has the following two stages.

● Stage one: Identify datasets profiling the treatment response of human cytokines following these criteria:
1. All datasets must have both cytokine treatment and control conditions. Some studies profiled by two-color microarrays may use the control samples as the background reference (Cy3 or Cy5) and thus may not explicitly provide control conditions. For simplicity, we ignore such datasets.
2. Each condition should have at least two biological replicates. For certain datasets, each cell model may not have replicates. However, if several cell models were profiled, you may group them as biological replicates by labeling the cell model names as Sub Conditions, whose values can then be merged into a single value (Table 4 Sub Condition).

● Stage two: Annotate the five standard columns in Table 4. Our later analysis will ignore any rows with blank values in the Treatment or Condition columns. For other columns, please leave the cell blank if the value is not available. For the Sub Condition column, if you enter a value in any row, our later analysis will ignore the rows where Sub Condition is blank. If you leave the entire Sub Condition column blank, we will ignore this column in the analysis.

Column         Content
Treatment      All cytokine names must follow the member symbols in Table 1. The control condition should be labeled as Control. We only focus on cytokine mono-treatment and control conditions. You may include combination cytokine treatment or other compounds, although these additional annotations are not required.
Condition      Please annotate the name of model systems for cytokine treatment study and keep the name consistent across all conditions. Our automatic program will group samples by different conditions and compute the differential expression between cytokine-treated and control samples within each condition.
Sub Condition  This column typically hosts cell line or donor names without experimental replicates. For each condition, an automatic program can merge the values among all sub conditions as biological replicates. More broadly, the curator can input any information here to merge values for each condition.
Dose           Dose of the cytokine
Duration       Duration length of the treatment

Table 4. Standard columns for sample annotation table.
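The row-filtering rules of stage two can be written down as a short Python sketch. It is an interpretation of the rules above, not the platform's code: the function name usable_rows is hypothetical, and the check for "any value in Sub Condition" is assumed to run over the whole sample table before any row is dropped.

```python
def is_blank(row, column):
    """True when the column is missing or contains only white space."""
    return not str(row.get(column, "") or "").strip()


def usable_rows(rows):
    """Apply the blank-value rules for Treatment, Condition, and Sub Condition."""
    # Assumption: the Sub Condition check looks at the whole table first.
    uses_sub_condition = any(not is_blank(row, "Sub Condition") for row in rows)

    kept = []
    for row in rows:
        # Rows without Treatment or Condition are always ignored.
        if is_blank(row, "Treatment") or is_blank(row, "Condition"):
            continue
        # If Sub Condition is used anywhere, rows leaving it blank are ignored.
        if uses_sub_condition and is_blank(row, "Sub Condition"):
            continue
        kept.append(row)
    return kept


# Hypothetical annotation rows for illustration.
rows = [
    {"ID": "S1", "Treatment": "Control", "Condition": "Monocyte", "Sub Condition": "A"},
    {"ID": "S2", "Treatment": "IFNA", "Condition": "Monocyte", "Sub Condition": "A"},
    {"ID": "S3", "Treatment": "IFNA", "Condition": "Monocyte", "Sub Condition": ""},
    {"ID": "S4", "Treatment": "", "Condition": "Monocyte", "Sub Condition": "B"},
]
kept_ids = [row["ID"] for row in usable_rows(rows)]
```

Dose and Duration are never used for filtering here, which matches the instruction to leave those cells blank when the value is not available.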

As an example, we show how to annotate the Table 2 dataset in steps:
1. In the dataset annotation table (main menu CURATION -> ANNOTATE), please find the ID GSE72502 and click on the sample count 6 to open the sample annotation table. You will find that the Treatment column is already populated with values from the source column treatment. However, the names do not comply with the standard in Table 4.
2. In the column selector, select Treatment as the destination column. Then, click the button Translate Vocabulary to standardize names. Note: the automatic vocabulary map does not always work perfectly. In most cases, curators need to adjust names with extra steps.
3. Type Monocyte in the first input box of the Condition column. Copy this word by Ctrl (Command for Mac) + c. In the column selector, select Condition as the destination column and select the Paste from Clipboard checkbox. Then, press Ctrl (Command for Mac) + v to paste values to the entire column.
4. Select donor id as the source column and Sub Condition as the destination column. Then, click the button Copy to transfer donor IDs to the sub condition fields.
5. The Dose and Duration information is not available; thus, keep these columns blank.
6. Click the button Submit to finalize.

Step Three The administrator should have assigned you a list of datasets as an exercise after step one. Finish the annotation and email the administrator. You will communicate with the administrator to review your curation results as a training process. After the training, the administrator will assign you the final task list to finish.

Step Four After data curation, you should download the data annotations and generate differential expression profiles automatically. Please click CURATION -> RESULTS, then go to the bottom of the page and click the Download button to download a file with URLs to sample annotations and data matrices (the matrices are available when FDC can automatically parse them from public repositories). After getting the result file with the URL list, you can download all files with "wget" or python "urllib", and develop an automatic analysis script. Here is an example python program to download files and process data.
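A minimal download sketch with Python's standard "urllib" could look as follows. The layout of the result file is an assumption: it is taken to be plain text with one URL per line, which may not match the actual file. The function names read_url_list, local_name, and download_all, the output folder name, and the example.org URLs are all hypothetical.

```python
import os
import urllib.request
from urllib.parse import urlparse


def read_url_list(text):
    """Return the URLs in a result file (assumed format: one URL per line)."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def local_name(url):
    """Derive a local file name from the last path component of the URL."""
    return os.path.basename(urlparse(url).path)


def download_all(urls, out_dir="downloads"):
    """Download every URL into out_dir and return the local file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for url in urls:
        path = os.path.join(out_dir, local_name(url))
        if not os.path.exists(path):  # skip files fetched by an earlier run
            urllib.request.urlretrieve(url, path)
        paths.append(path)
    return paths


# Placeholder content standing in for the downloaded result file.
example_text = """
https://example.org/files/GSE72502.annotation.tsv
https://example.org/files/GSE72502.matrix.tsv
"""
example_urls = read_url_list(example_text)

# Typical use, once the real result file has been saved locally:
#   with open("result.txt") as handle:
#       paths = download_all(read_url_list(handle.read()))
```

Each downloaded annotation table can then be combined with its data matrix to compute logFC values per condition.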

C. Regular Expression Usage

During curation, advanced users may find regular expressions convenient. There are many tutorials for beginners. Please install the Atom editor to practice. Besides the basic forms that many tutorials cover, a few advanced techniques can help you.

● Match strings that do not contain a pattern. Replace pattern with your input.
^((?!pattern).)*$

● Match several patterns together (logical AND). Replace pattern1, pattern2, and pattern3 with your input.
(?=.*pattern1)(?=.*pattern2)(?=.*pattern3)
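Both techniques can be tried outside the platform with Python's re module. The sketch below only illustrates how the two expressions behave on single-line strings; the helper names without and all_of are hypothetical, and the test strings are taken from the sample titles and protocols in Tables 2 and 3.

```python
import re


def without(pattern):
    """Regex matching a whole line only if the pattern occurs nowhere in it."""
    # At every position, the negative lookahead refuses to step past a match.
    return re.compile(r"^((?!" + pattern + r").)*$")


def all_of(*patterns):
    """Regex matching a line that contains every pattern, in any order."""
    # Each lookahead scans the line from the same starting position.
    return re.compile("".join(r"(?=.*" + p + r")" for p in patterns))


no_ifn = without("IFN")
il33_4h = all_of("IL-33", "4 hours")

untreated = bool(no_ifn.search("1. PBMC"))
treated = bool(no_ifn.search("4. PBMC + IFNa"))
both = bool(il33_4h.search("stimulated with IL-33 50ng/ml for 4 hours"))
only_one = bool(il33_4h.search("stimulated with IL-1b 0.5ng/ml for 4 hours"))
```

Here untreated and both are true, while treated and only_one are false. Note that "." does not match a line break by default, so both expressions work line by line.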

D. Configuration of Human Cytokine Response Project

In this section, we will explain how we created the cytokine response project. You can use the same procedure to build your curation task. Go to PROJECT -> CREATE, then you will see a few fields to fill in. After creating the project, the system administrator will review all publicly displayed information of your project for approval (or request you to change if anything is inappropriate).

Field Explanation

ID An automatically generated unique identifier for the new project

Title Title of your new project

Description Description of your new project. The system administrator will review your new project based on the Title and Description. As long as there is no inappropriate information, we will approve your project.

Fields Fields to be annotated in the sample meta information table. For example, we listed the following fields for the cytokine project: Treatment, Condition, Sub Condition, Dose, Duration

You can also insert new fields or delete existing fields for each individual sample table if you need to collect different fields for each dataset. For example, different anticancer therapy clinical studies in GEO might provide very different clinical information.

Keywords Keyword patterns in regular expression format. Our system will highlight all matched patterns in the study design descriptions and sample fields. You can use the # symbol to comment out a line. Different patterns can be separated by commas (avoid commas inside a regular expression) or new lines. A white space in a pattern also matches a few common separators in molecule names, such as ".", "_", and "-". The platform will only search for isolated keyword patterns, which start and end either at a blank space or at the boundary of a text segment.

Please see our keywords in the cytokine project.

# Colony stimulating factor
Colony stimulating factor
(G|GM|M) CSF
CSF [1-3]($|(?=[^0-9]))

# Interferon
Type ([1-3]|I+) (IFN|Interferon)
(IFN|Interferon) [αAβBγGλL]
IFN($|(?=[^a-z]))
Interferon

# Interleukin
(IL|Interleukin) [1-9][0-9]?($|(?=[^0-9]))
Interleukin
TSLP
LIF($|(?=[^a-z])), leukemia inhibitory factor
OSM($|(?=[^a-z])),

# TNF
, TNF [αA], cachectin, TNF($|(?=[^a-z]))
[αAβB]
LT [αAβB]
(CD27|CD30|CD40|Fas) (|L)
4 1BB L
Trail, AP0 2 L
OPG L, RANK L
#APRIL
#LIGHT
TWEAK($|(?=[^a-z]))
BAFF($|(?=[^a-z])), CD257($|(?=[^0-9]))

# Unassigned
(TGF|Transforming )[-_ ]?[aAβB], Transforming growth factor, TGF($|(?=[^a-z]))
migration inhibitory factor, MIF($|(?=[^a-z]))

Keywords_filter If you set this as True, the platform will search for the keywords (patterns defined above) in the dataset and sample tables, and filter out all datasets in which no keyword is present.

Processed_filter If you set this as True, the platform will only show datasets with processed gene expression matrices. If you are only working on genome-wide transcriptomic studies, please set it as True. Otherwise, please set it as False to include all possible datasets.

Vocabulary Controlled vocabulary for the sample table curation. Ideally, all field content should stay within your defined vocabulary. The controlled vocabulary for the cytokine project is listed below.

Control
GCSF, GMCSF, MCSF
IFNA, IFNB, IFNG, IFNL
IL1A, IL1B, IL1RA, IL2, IL3, IL4, IL5, IL6, IL7, IL9, IL10, IL11, IL12, IL13, IL15, IL16, IL17A, IL17F, IL18, IL19, IL20, IL21, IL22, IL23, IL24, IL25, IL26, IL27, IL28, IL29, IL30, IL31, IL32, IL33, IL34, IL35, IL36A, IL36B, IL36G, IL36RA, IL37, TSLP, LIF, OSM
TNFA, LTA, LTB, CD40L, FASL, CD27L, CD30L, 41BBL, TRAIL, OPGL, APRIL, LIGHT, TWEAK, BAFF
TGFB1, TGFB2, TGFB3, MIF
LPS
PBMC, Monocyte, Macrophage, , T CD4, T CD8, Dendritic, NK, ,

Vocabulary_map Automatic map to standardize terms in sample meta information to standard vocabularies in such format: Matching pattern : target vocabulary

Our platform will match for patterns and replace them with the target vocabulary. The map table for the cytokine project is listed below:

None|Vehicle|DMSO|PBS|Untreated|Untreat|unstimulated|Mock|No stimulation : Control

((G|) (CSF|Colony stimulating factor))|(CSF 3) : GCSF
((GM|Granulocyte macrophage) (CSF|Colony stimulating factor))|(CSF 2) : GMCSF
((M|Macrophage) (CSF|Colony stimulating factor))|(CSF 1) : MCSF

(IFN|Interferon) ([αA]|alpha|alfa)[1-9]? : IFNA
(IFN|Interferon) ([βB]|beta) : IFNB
(IFN|Interferon) ([γG]|gamma) : IFNG
(IFN|Interferon) ([λL]|lambda)[1-9]? : IFNL
(IL|Interleukin) 28[abαβ]? : IFNL
(IL|Interleukin) 29 : IFNL

(IL|Interleukin) 1 ([αA]|alpha|alfa) : IL1A
(IL|Interleukin) 1 ([βB]|beta) : IL1B
(IL|Interleukin) 1 R[αA] : IL1RA
(IL|Interleukin) 2 : IL2
(IL|Interleukin) 3 : IL3
(IL|Interleukin) 4 : IL4
(IL|Interleukin) 5 : IL5
(IL|Interleukin) 6 : IL6
(IL|Interleukin) 7 : IL7
(IL|Interleukin) 9 : IL9
(IL|Interleukin) 10 : IL10
(IL|Interleukin) 11 : IL11
(IL|Interleukin) 12 : IL12
(IL|Interleukin) 13 : IL13
(IL|Interleukin) 15 : IL15
(IL|Interleukin) 16 : IL16
(IL|Interleukin) 17 ([aα]|alpha|alfa) : IL17A
(IL|Interleukin) 17 F : IL17F
(IL|Interleukin) 18 : IL18
(IL|Interleukin) 19 : IL19
(IL|Interleukin) 20 : IL20
(IL|Interleukin) 21 : IL21
(IL|Interleukin) 22 : IL22
(IL|Interleukin) 23 : IL23
(IL|Interleukin) 24 : IL24
(IL|Interleukin) (25|17E) : IL25
(IL|Interleukin) 26 : IL26
(IL|Interleukin) 27 : IL27
(IL|Interleukin) 30 : IL27
(IL|Interleukin) 31 : IL31
(IL|Interleukin) 32 : IL32
(IL|Interleukin) 33 : IL33
(IL|Interleukin) 34 : IL34
(IL|Interleukin) 35 : IL35
(IL|Interleukin) 36 ([αA]|alpha|alfa) : IL36
(IL|Interleukin) 36 ([βB]|beta) : IL36
(IL|Interleukin) 36 ([γG]|gamma) : IL36
(IL|Interleukin) 36 R[αA] : IL36RA
(IL|Interleukin) 37 : IL37

Leukemia inhibitory factor : LIF
Oncostatin M : OSM

(TNF|Tumor necrosis factor) ([αA]|alpha|alfa) : TNFA
(LT|Lymphotoxin) ([αA]|alpha|alfa) : LTA
(LT|Lymphotoxin) ([βB]|beta) : LTB
CD27 (L|ligand) : CD27L
CD30 (L|ligand) : CD30L
CD40 (L|ligand) : CD40L
FAS (L|ligand) : FASL
4 1BB (L|ligand) : 41BBL
(AP0 2 (L|ligand))|(TNFSF 10) : TRAIL
(RANK (L|ligand))|(TNFSF 11) : OPGL

(TGF|Transforming growth factor) ([αA]|alpha|alfa)[1-9]? : TGFA
(TGF|Transforming growth factor) ([βB]|beta) 1? : TGFB1
(TGF|Transforming growth factor) ([βB]|beta) 2 : TGFB2
(TGF|Transforming growth factor) ([βB]|beta) 3 : TGFB3

lipo poly saccharide : LPS

nanogram per milliliter : ng/m
unit per milliliter : u/m
millimolar : mM
(μM|micromolar) : uM
nanomolar : nM
picomolar : pM
(μ|u|micro)mol/L : uM
(n|nano)mol/L : nM
(p|pico)mol/L : pM

peripheral blood mononuclear cell(s?) : PBMC
monocyte(s?) derived macrophage(s?) : Macrophage
monocyte(s?) derived (s?) : Dendritic
Human Umbilical Vein Endothelial Cell(s?) : HUVEC
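To make the Keywords and Vocabulary_map conventions concrete, here is a small Python sketch of how such patterns could be interpreted. It is not the platform's implementation, and several details are assumptions: a space in a pattern is taken to match zero or one separator character, matching is case-insensitive, "isolated" means bounded by white space or the text boundary, and map rules are applied in the listed order. All function names are hypothetical. The naive space replacement would also rewrite a space inside a character class such as [-_ ], so those patterns are left out of the excerpt.

```python
import re

# Assumption: a space in a pattern matches zero or one separator character.
SEPARATOR = r"[\s._-]?"


def compile_pattern(pattern):
    """Compile one keyword or map pattern as an isolated, case-insensitive regex."""
    body = pattern.strip().replace(" ", SEPARATOR)
    # Isolation: the match must be bounded by white space or the text boundary.
    return re.compile(r"(?<!\S)(?:" + body + r")(?!\S)", re.IGNORECASE)


def parse_keywords(text):
    """Split a Keywords field on commas and new lines, skipping # comment lines."""
    patterns = []
    for line in text.splitlines():
        if line.strip().startswith("#"):
            continue
        patterns.extend(p.strip() for p in line.split(",") if p.strip())
    return patterns


def parse_vocabulary_map(text):
    """Parse 'matching pattern : target vocabulary' lines into (regex, target) rules."""
    rules = []
    for line in text.splitlines():
        pattern, colon, target = line.rpartition(":")
        if colon and pattern.strip() and target.strip():
            rules.append((compile_pattern(pattern), target.strip()))
    return rules


def translate(value, rules):
    """Replace every matched pattern in a field value with its target vocabulary."""
    for regex, target in rules:
        value = regex.sub(lambda match, t=target: t, value)
    return value


# Excerpts of the cytokine project configuration shown above.
KEYWORDS = """
# Interleukin (excerpt)
(IL|Interleukin) [1-9][0-9]?($|(?=[^0-9]))
TSLP
"""
VOCABULARY_MAP = """
None|Vehicle|DMSO|PBS|Untreated|Untreat|unstimulated|Mock|No stimulation : Control
(IFN|Interferon) ([αA]|alpha|alfa)[1-9]?: IFNA
(IL|Interleukin) 1 ([βB]|beta) : IL1B
"""

keyword_regexes = [compile_pattern(p) for p in parse_keywords(KEYWORDS)]
rules = parse_vocabulary_map(VOCABULARY_MAP)
```

Under these assumptions, translate("interferon-alpha", rules) gives IFNA and translate("none", rules) gives Control, which corresponds to the Treatment values in Table 2. The isolation check matters here: without it, the alternative [αA] would match only the first letter of "alpha" and leave the rest of the word behind.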

References

Murphy, K. and Weaver, C. (2016) Janeway’s Immunobiology. Garland Science.
