Human Cytokine Response Profiles
Comprehensive Understanding of the Human Cytokine Response Profiles

A. Background

The current project aims to collect datasets profiling gene expression patterns of human cytokine treatment response from the NCBI GEO and EBI ArrayExpress databases. The Framework for Data Curation (FDC) already hosts a list of candidate datasets. You will read the study design and sample annotations to select the relevant datasets and label the sample conditions to enable automatic analysis.

If you want to build a new data collection project for your topic of interest instead of working on our existing cytokine project, please read section D. We will explain the cytokine project’s configurations to give you an example of creating your own curation task.

A.1. Cytokine

Cytokines are a broad category of small proteins that mediate cell signaling. Many cell types can release cytokines and receive cytokines from other producers through receptors on the cell surface. Despite some overlap in the literature terminology, we exclude chemokines, hormones, and growth factors, which are also essential cell signaling molecules. Meanwhile, we count two cytokines in the same family as the same if they share the same receptors. In this project, we will focus on the following families and use the member symbols as standard names (Table 1).

Family                     Members (use these symbols as standard cytokine names)
Colony-stimulating factor  GCSF, GMCSF, MCSF
Interferon                 IFNA, IFNB, IFNG
Interleukin                IL1, IL1RA, IL2, IL3, IL4, IL5, IL6, IL7, IL9, IL10, IL11, IL12, IL13, IL15, IL16, IL17, IL18, IL19, IL20, IL21, IL22, IL23, IL24, IL25, IL26, IL27, IL28, IL29, IL30, IL31, IL32, IL33, IL34, IL35, IL36, IL36RA, IL37, TSLP, LIF, OSM
Tumor necrosis factor      TNFA, LTA, LTB, CD40L, FASL, CD27L, CD30L, 41BBL, TRAIL, OPGL, APRIL, LIGHT, TWEAK, BAFF
Unassigned                 TGFB, MIF

Table 1. Cytokine families and members from Appendix III of (Murphy and Weaver, 2016).

A.2. Annotation for Automatic Analysis

Each dataset submission in the GEO or ArrayExpress databases contains sample annotations. However, the original information typically cannot support automatic analysis by a computer program. For example, the dataset GSE72502 from the GEO contains six expression profiles with the following sample information (Table 2A). A biologist could interpret this dataset as a study of the interferon alpha (IFNA in Table 1) response in monocytes derived from peripheral blood. Three donors are involved, and each donor contributes a pair of control and IFNA-treated samples. To generate expression profiles of IFNA treatment, the biologist would first calculate the log fold change (logFC) between the IFNA and Control conditions for each donor, and then compute the median logFC among all three donors.

However, a computer program cannot automatically understand these conditions unless you label them in standardized vocabularies (Table 2B). For every dataset, you can define the columns Treatment, Condition, and Sub Condition. A computer program can first group the samples according to the sub conditions. Within each sub condition group, the program can compute the logFC between any cytokine and the control condition. Finally, the program will merge the values from the different sub conditions into one median value for the condition.

ID          title           cell type                   treatment         donor
GSM1863325  1. PBMC         peripheral blood monocytes  none              A
GSM1863326  2. PBMC         peripheral blood monocytes  none              B
GSM1863327  3. PBMC         peripheral blood monocytes  none              C
GSM1863328  4. PBMC + IFNa  peripheral blood monocytes  interferon-alpha  B
GSM1863329  5. PBMC + IFNa  peripheral blood monocytes  interferon-alpha  A
GSM1863330  6. PBMC + IFNa  peripheral blood monocytes  interferon-alpha  C

A. Original.

ID          Treatment  Condition  Sub Condition
GSM1863325  Control    Monocyte   A
GSM1863326  Control    Monocyte   B
GSM1863327  Control    Monocyte   C
GSM1863328  IFNA       Monocyte   B
GSM1863329  IFNA       Monocyte   A
GSM1863330  IFNA       Monocyte   C

B. Standardized.

Table 2. Original and standardized sample annotations of GSE72502.

Another example is dataset GSE37624, which studies the IL1 and IL33 treatment response in the HUVEC cell line (Table 3A). There is no sub condition (e.g., donor) in this study. The dose and duration of treatment are available. We can create the standardized annotation in Table 3B.

ID         protocol                                    cell type
GSM923359  stimulated with IL-1b 0.5ng/ml for 4 hours  HUVEC pool of 10 donors
GSM923360  stimulated with IL-33 50ng/ml for 4 hours   HUVEC pool of 10 donors
GSM923361  control                                     HUVEC pool of 10 donors
GSM923362  control                                     HUVEC pool of 10 donors
GSM923363  stimulated with IL-33 50ng/ml for 4 hours   HUVEC pool of 10 donors
GSM923364  stimulated with IL-1b 0.5ng/ml for 4 hours  HUVEC pool of 10 donors

A. Original.

ID         Treatment  Condition  Dose      Duration
GSM923359  IL1        HUVEC      0.5ng/ml  4h
GSM923360  IL33       HUVEC      50ng/ml   4h
GSM923361  Control    HUVEC                4h
GSM923362  Control    HUVEC                4h
GSM923363  IL33       HUVEC      50ng/ml   4h
GSM923364  IL1        HUVEC      0.5ng/ml  4h

B. Standardized.

Table 3. Original and standardized sample annotations of GSE37624.

In summary, annotations for automatic analysis include predefined columns with content in standardized vocabularies that a computer program can analyze without human intervention. The role of human curation is to transform the original annotation into standardized forms.

B. Procedure

Step One

Register a curator account at the Curation Platform. Email the administrator (Peng Jiang: [email protected]) your username and ask to join the Human Cytokine Response Project. The reviewer account has access to the Human Cytokine Response Project by default. Please use Google Chrome for the best performance. Firefox and Safari also work, although the regular expression functions will be compromised. This platform does NOT work on IE.

Step Two

Read the help document of the web platform to understand the workflow and assist functions in the NCI Curation platform.
Specifically, the current project has the following two stages.

● Stage one: Identify gene expression datasets profiling the treatment response of human cytokines, following these criteria:
  1. All datasets must have both cytokine treatment and control conditions. Some studies profiled by two-color microarrays may use the control samples as the background reference (Cy3 or Cy5) and thus may not explicitly provide control conditions. For simplicity, we ignore such datasets.
  2. Each condition should have at least two biological replicates. In some datasets, an individual cell model may not have replicates. However, if several cell models were profiled, you may group them as biological replicates by labeling the cell model names as Sub Conditions, whose values can then be merged into one value for the condition (Table 4, Sub Condition).

● Stage two: Annotate the five standard columns in Table 4. Our later analysis will ignore any rows with blank values in the Treatment or Condition columns. For other columns, please leave the cell blank if the value is not available. For the Sub Condition column, if you input any value, our later analysis will ignore rows with blank content. Otherwise, if you leave the whole Sub Condition column blank, we will ignore this column in the analysis.

Column         Content
Treatment      All cytokine names must follow the member symbols in Table 1. The control condition should be labeled as Control. We only focus on cytokine mono-treatment and control conditions. You may include combination cytokine treatments or other compounds, although these additional annotations are not required.
Condition      Please annotate the name of the model system used for the cytokine treatment study and keep the name consistent across all conditions. Our automatic program will group samples by condition and compute the differential expression between cytokine-treated and control samples within each condition.
Sub Condition  This column typically hosts cell line or donor names without experimental replicates. For each condition, an automatic program can merge the values among all sub conditions as biological replicates. More broadly, the curator can input any information here to merge values for each condition.
Dose           Dose of the cytokine
Duration       Length of the treatment

Table 4. Standard columns for sample annotation table.

As an example, we show how to annotate the Table 2 dataset in steps:

1. In the dataset annotation table (main menu CURATION -> ANNOTATE), find the ID GSE72502 and click on the sample count 6 to open the sample annotation table. You will find that the Treatment column is already populated with values from the source column treatment. However, the names do not comply with the standard in Table 4.
2. In the column selector, select Treatment as the destination column. Then, click the button Translate Vocabulary to standardize names. Note: the automatic vocabulary map does not always work perfectly. In most cases, curators need to adjust names with extra steps.
3. Type Monocyte in the first input box of the Condition column. Copy this word with Ctrl (Command on Mac) + C. In the column selector, select Condition as the destination column and select the Paste from Clipboard checkbox. Then, press Ctrl (Command on Mac) + V to paste the value into the entire column.
4. Select donor id as the source column and Sub Condition as the destination column. Then, click the button Copy to transfer donor IDs to the sub condition fields.
5. The Dose and Duration information is not available; thus, keep these columns blank.
6. Click the button Submit to finalize.

Step Three

The administrator should have assigned you a list of datasets as an exercise after step one. Finish the annotation and email the administrator. You will communicate with the administrator to review your curation results as a training process. After the training, the administrator will assign you the final task list to finish.
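
Before emailing the administrator, it may help to check the Stage two rules locally. The sketch below is only an illustration and is not part of the Curation Platform: the function name check_annotation and the row layout (one dict per sample, keyed by the Table 4 column names) are assumptions. It flags rows that the later analysis would ignore and Treatment values that are neither Control nor a Table 1 symbol. Because combination treatments and other compounds are allowed, the second kind of flag is a prompt for review rather than an error.

```python
# Standard cytokine symbols transcribed from Table 1.
CYTOKINES = set("""
GCSF GMCSF MCSF IFNA IFNB IFNG
IL1 IL1RA IL2 IL3 IL4 IL5 IL6 IL7 IL9 IL10 IL11 IL12 IL13 IL15 IL16 IL17
IL18 IL19 IL20 IL21 IL22 IL23 IL24 IL25 IL26 IL27 IL28 IL29 IL30 IL31 IL32
IL33 IL34 IL35 IL36 IL36RA IL37 TSLP LIF OSM
TNFA LTA LTB CD40L FASL CD27L CD30L 41BBL TRAIL OPGL APRIL LIGHT TWEAK BAFF
TGFB MIF
""".split())


def check_annotation(rows):
    """Return (row index, message) pairs for rows that need attention.

    rows: list of dicts keyed by Table 4 column names (hypothetical layout).
    """
    problems = []
    # Blank Sub Condition cells only matter if the column is used at all.
    uses_sub = any(r.get("Sub Condition") for r in rows)
    for i, r in enumerate(rows):
        treatment = r.get("Treatment", "")
        if not treatment or not r.get("Condition"):
            problems.append((i, "blank Treatment or Condition; row is ignored"))
        elif treatment != "Control" and treatment not in CYTOKINES:
            problems.append((i, f"review Treatment {treatment!r}: "
                                "not Control or a Table 1 symbol"))
        if uses_sub and not r.get("Sub Condition"):
            problems.append((i, "blank Sub Condition; row is ignored"))
    return problems


# Two GSE72502 samples before and after standardization (Table 2).
raw = [
    {"Treatment": "none", "Condition": "Monocyte", "Sub Condition": "A"},
    {"Treatment": "interferon-alpha", "Condition": "Monocyte", "Sub Condition": "B"},
]
curated = [
    {"Treatment": "Control", "Condition": "Monocyte", "Sub Condition": "A"},
    {"Treatment": "IFNA", "Condition": "Monocyte", "Sub Condition": "B"},
]
for index, message in check_annotation(raw):
    print(index, message)
print(check_annotation(curated))  # → []
```

The raw rows are both flagged because "none" and "interferon-alpha" are not standardized names, while the curated rows pass.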
Step Four

After data curation, you should download the data annotations and generate differential expression profiles automatically. Click CURATION -> RESULTS, then go to the bottom of the page and click the Download button to download a file with URLs to sample annotations and data matrices (if available, i.e., when FDC can automatically parse them from public repositories). After getting the result file with the URL list, you can download all files with "wget" or Python "urllib" and develop an automatic analysis script.
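
The core of such an analysis script is the grouping logic described in section A.2. The following is a minimal sketch under stated assumptions, not the project's actual analysis code: the function name median_logfc, the input layout (one dict per sample with a value field holding the expression of a single gene), and the numeric values are all hypothetical. Expression values are assumed to be on a log2 scale already, so the logFC is a difference of means.

```python
from collections import defaultdict
from statistics import mean, median


def median_logfc(samples):
    """Return {(condition, treatment): median logFC across sub conditions}.

    samples: list of dicts with keys Treatment, Condition, Sub Condition,
    and value (assumed log2 expression of one gene; hypothetical layout).
    """
    # Rows with a blank Treatment or Condition are ignored.
    rows = [s for s in samples if s.get("Treatment") and s.get("Condition")]
    # If Sub Condition is used anywhere, rows where it is blank are ignored;
    # if it is blank everywhere, each condition forms a single group.
    if any(s.get("Sub Condition") for s in rows):
        rows = [s for s in rows if s.get("Sub Condition")]

    # Collect values per (condition, sub condition, treatment).
    groups = defaultdict(list)
    for s in rows:
        key = (s["Condition"], s.get("Sub Condition") or "", s["Treatment"])
        groups[key].append(s["value"])

    # logFC of each cytokine versus Control within every sub condition.
    per_sub = defaultdict(list)
    for (condition, sub, treatment), values in groups.items():
        if treatment == "Control":
            continue
        control = groups.get((condition, sub, "Control"))
        if control:  # skip groups that have no matching control
            per_sub[(condition, treatment)].append(mean(values) - mean(control))

    # Merge the sub conditions into one median value per condition.
    return {key: median(values) for key, values in per_sub.items()}


# Table 2B annotation of GSE72502 with made-up expression values.
samples = [
    {"ID": "GSM1863325", "Treatment": "Control", "Condition": "Monocyte", "Sub Condition": "A", "value": 5.0},
    {"ID": "GSM1863326", "Treatment": "Control", "Condition": "Monocyte", "Sub Condition": "B", "value": 6.0},
    {"ID": "GSM1863327", "Treatment": "Control", "Condition": "Monocyte", "Sub Condition": "C", "value": 5.5},
    {"ID": "GSM1863328", "Treatment": "IFNA", "Condition": "Monocyte", "Sub Condition": "B", "value": 8.0},
    {"ID": "GSM1863329", "Treatment": "IFNA", "Condition": "Monocyte", "Sub Condition": "A", "value": 8.0},
    {"ID": "GSM1863330", "Treatment": "IFNA", "Condition": "Monocyte", "Sub Condition": "C", "value": 7.0},
]
print(median_logfc(samples))  # → {('Monocyte', 'IFNA'): 2.0}
```

With these illustrative values, the per-donor logFCs are 3.0 (A), 2.0 (B), and 1.5 (C), and their median is 2.0. A real script would repeat this calculation for every gene in the data matrix.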