bioRxiv preprint doi: https://doi.org/10.1101/256784; this version posted February 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

clusterProfiler: An universal enrichment tool for functional and comparative study

Guangchuang Yu

State Key Laboratory of Emerging Infectious Diseases and Centre of Influenza Research, School of Public Health, The University of Hong Kong, 21 Sassoon Road, Pokfulam, Hong Kong SAR, China.

Email: [email protected]

I present the open-source software package clusterProfiler ( https://github.com/GuangchuangYu/clusterProfiler) for functional enrichment analysis. clusterProfiler provides a universal interface for functional enrichment analysis for internal supported ontologies/pathways as well as annotation data provided by users or obtained from online databases. It enables comparative analysis and offers comprehensive visualization tools for result interpretation.

Functional enrichment analysis is one of the most widely used technique for interpreting lists or genome-wide regions of interest (ROIs) derived from high-throughput sequencing (HTS). Although tools are proliferating to perform gene-centric or epigenomic enrichment analysis, most of them are designed for model organisms or specific domains (e.g. fungi1 or plant2 etc.) embedded with particular annotations (e.g. (GO) or Kyoto Encyclopaedia of and Genomes (KEGG)). Non-model organisms and functional annotations other than GO and KEGG are poorly supported. Moreover, an increasing concern upon the quality of gene annotation has raised an alarm in biomedical research. A previous study3 reported that about 42% of the tools were outdated by more than five years and functional significance were severely underestimated with only 26% of biological processes or pathways were captured compare to using up-to-date annotation. Reanalysing GTEx dataset4 published by ENCODE consortium using clusterProfiler uncovered a lot of new pathways that leads to generate new hypotheses (https://github.com/GuangchuangYu/enrichment4GTEx_clusterProfiler). Such negative impacts of outdated annotation can be propagated for years and hinder follow-up studies. bioRxiv preprint doi: https://doi.org/10.1101/256784; this version posted February 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

The clusterProfiler package was designed by considering the supports of multiple ontology/pathway annotations, up-to-date gene annotations, multiple organisms, user's annotation data and comparative analysis. The package employs modular design and supports disease analyses (Disease Ontology5, Network of Cancer Gene6, gene-disease and variant- disease associations7), Reactome pathway and Medical Subject Headings (MeSH) analysis via its sub-packages DOSE5, ReactomePA8 and meshes (https://github.com/GuangchuangYu/meshes). The clusterProfiler package internally supports GO and KEGG. GO annotation data can be obtained directly from Bioconductor OrgDb packages, or retrieved from web resources (e.g. AnnotationHub, biomaRt9 and UniProt10 etc.). KEGG Pathway and Module were directly obtained using KEGG API and supports more than five thousand genomes (http://www.genome.jp/kegg/catalog/org_list.html) as well as KEGG Orthology (KO) which is particular useful for metagenomic functional studies. In addition, clusterProfiler provides general functions, enricher and GSEA, to perform fisher's exact test and gene set enrichment analysis using user defined annotations, making it possible to support novel functional annotation of newly sequenced species (e.g. electronic annotation using Blast2GO11 and KAAS12), unsupported ontologies/pathways (e.g. InterPro Domain, Clusters of Orthologous Groups, Mouse phenotype ontology etc.) or customized annotation. clusterProfiler provides parser functions to import GMT file so that gene sets downloaded from Molecular Signatures Database can be directly supported as well as to retrieve whole genome GO annotations from UniProt database using taxonomic ID. Epigenomic enrichment analysis is also supported with user-provided ROIs in BED format by combining with in-house annotation package ChIPseeker13. GO semantic similarity measurement implemented in another in-house package GOSemSim14, which allows calculating GO term similarity using several methods based on information content and graph structure, was also incorporated in clusterProfiler to remove redundancy of enriched GO terms. This feature simplifies result and assists in interpretation as well as against annotation/interpretation bias.

Albeit functional enrichment analysis is commonly used in downstream analysis to decipher key biological processes, visualization of enriched result is mainly depicted by bar chart of most significant terms, which maybe dominated by redundant terms and cannot reveal the complex of gene-pathway associations. The clusterProfiler sub-package, enrichplot (https://github.com/GuangchuangYu/enrichplot), is designed to visualize enrichment result by integrating expression data (Fig. 1). These methods allow users without programming skills to generate effective visualization to explore and interpret results as well as to find patterns within bioRxiv preprint doi: https://doi.org/10.1101/256784; this version posted February 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

the data. In addition, all these visualization methods were implemented based on ggplot2, which allows customization using grammar of graphics.

Comparing functional profiles reflects functional consensus and difference among different experiments and helps identifying differential functional modules in omic datasets. With the infrastructure of clusterProfiler to support a wide range of ontology/pathway annotations and plenty of organisms, the comparison can be applied to many circumstances based on gene or ROI levels. Furthermore, the compareCluster function supports using formula, which is widely used in R language to express statistical model, making it easy to compare functional profiles for different conditions and at different time points.

AUTHOR CONTRIBUTIONS G. Y. conceived and designed this project, implemented the algorithms and software packages and wrote the manuscript.

COMPLETING FINANCIAL INTERESTS The author declares no competing financial interests.

1. Priebe, S., Kreisel, C., Horn, F., Guthke, R. & Linde, J. Bioinformatics 31, 445–446 (2015).

2. Yi, X., Du, Z. & Su, Z. Nucleic Acids Res. 41, W98–W103 (2013).

3. Wadi, L., Meyer, M., Weiser, J., Stein, L. D. & Reimand, J. Nat. Methods 13, 705 (2016).

4. Melé, M. et al. Science 348, 660–665 (2015).

5. Yu, G., Wang, L.-G., Yan, G.-R. & He, Q.-Y. Bioinformatics 31, 608–609 (2015).

6. An, O., Dall’Olio, G. M., Mourikis, T. P. & Ciccarelli, F. D. Nucleic Acids Res. 44, D992–D999 (2016).

7. Piñero, J. et al. Database J. Biol. Databases Curation 2015, (2015).

8. Yu, G. & He, Q.-Y. Mol. Biosyst. 12, 477–479 (2016).

9. Durinck, S., Spellman, P. T., Birney, E. & Huber, W. Nat. Protoc. 4, 1184–1191 (2009).

10. Huntley, R. P., Sawford, T., Martin, M. J. & O’Donovan, C. GigaScience 3, 4 (2014).

11. Conesa, A. et al. Bioinformatics 21, 3674–3676 (2005). bioRxiv preprint doi: https://doi.org/10.1101/256784; this version posted February 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

12. Moriya, Y., Itoh, M., Okuda, S., Yoshizawa, A. C. & Kanehisa, M. Nucleic Acids Res. 35, W182–W185

(2007).

13. Yu, G., Wang, L.-G. & He, Q.-Y. Bioinformatics 31, 2382–2383 (2015).

14. Yu, G. et al. Bioinformatics 26, 976–978 (2010).

HJURP mitotic nuclear division sister chromatid segregation HJURP ASPM a CDCA8 b nuclear division f mitotic nuclear division ERCC6L sister chromatid segregation BIRC5 CENPN segregation fold change MAD2L1 4 mitotic sister chromatid segregation CENPM regulation of nuclear division nuclear division CDT1 NEK2 SKA1 CDCA8 2 chromosome segregation PRC1 ERCC6L nuclear chromosome segregation KIF11 UBE2C fold change TOP2A CDC20 0 organelle fission p.adjust CDC20 4 TTK KIF18A CENPN cytoskeleton organization involved in mitosis BMP4 KIF23 −2 mitotic spindle organization 2.5e−09 2 NCAPG PTTG1 CENPE spindle organization UBE2S TPX2 5.0e−09 size regulation of nuclear division NDC80 CENPM 0 regulation of mitotic nuclear division 7.5e−09 chromosome segregation KIF4A MYBL2 18 −2 microtubule cytoskeleton organization CCNB1 NDC80 regulation of mitotic sister chromatid separation 1.0e−08 mitotic nuclear division 23 mitotic sister chromatid separation sister chromatid segregation size MYBL2 AURKA regulation of chromosome separation KIF4A NCAPH 29 nuclear division 18 chromosome separation regulation of nuclear division SKA1 regulation of chromosome segregation AURKA KIF18B 23 34 NCAPH KIF18B DLGAP5 regulation of mitotic sister chromatid segregation TACC3 UBE2C metaphase/anaphase transition of mitotic cell cycle 29 KIFC1 0 10 20 30 AURKB CCNB1 TOP2A PRC1 NUSAP1 category BMP4 34 chromosome segregation DLGAP5 KIF23 TRIP13 TPX2 TRIP13 mitotic nuclear division g organelle fission PTTG1 KIFC1 nuclear division NUSAP1 CHEK1 TACC3 nuclear division CHEK1 chromosome segregation regulation of nuclear division CENPE AURKB NEK2 mitotic nuclear division NCAPG UBE2S sister chromatid segregation nuclear chromosome segregation BP sister chromatid segregation ASPM TTK MAD2L1 mitotic sister chromatid segregation KIF11 BIRC5 CDT1 KIF18A microtubule cytoskeleton organization involved in mitosis spindle organization p.adjust antigen processing and presentation of exogenous peptide antigen via MHC class II mitotic spindle organization 0.001 G2/M transition of mitotic cell cycle c chronic inflammatory response spindle 0.002 meiotic cell cycle chromosomal region regulation of lymphocyte chemotaxis 0.003 regulation of microtubule cytoskeleton organization condensed chromosome regulation of microtubule−based process DNA replication chromosome, centromeric region 0.004 chemokine−mediated signaling pathway CC histone phosphorylation fold change 0.005 chromosome condensation 4.5 condensed chromosome, centromeric region chromosome localization microtubule associated complex establishment of chromosome localization localization to kinetochore 4.0 midbody Count mitotic spindle checkpoint condensed chromosome kinetochore mitotic spindle assembly checkpoint 3.5 10 lymphocyte chemotaxis spindle microtubule anaphase−promoting complex−dependent catabolic process 20 regulation of mitotic cell cycle phase transition 3.0 receptor ligand activity negative regulation of chromosome separation 30 positive regulation of mitotic cell cycle 2.5 tubulin binding regulation of chromosome organization microtubule binding regulation of chromosome segregation regulation of chromosome separation motor activity regulation of mitotic nuclear division microtubule motor activity MF regulation of nuclear division chromosome segregation chemokine receptor binding nuclear division chemokine activity sister chromatid segregation mitotic nuclear division histone kinase activity CXCR chemokine receptor binding RAGE receptor binding TTK DTL IDO1 CDT1 KIF11 KIF23 NEK2 PRC1 TPX2 SKA1 CDK1 E2F8EZH2 CCL8 LAG3 BIRC5BMP4 KIF4AKIFC1 ASPM GINS1GINS2 MCM5RRM2MAPT MELKDQA1 CDC20 MYBL2 NDC80 PTTG1TACC3 UBE2S TOP2A CDC45GATA3 FOXA1 CCL18 CXCL9 ACKR1 DACH1 − AURKAAURKB CCNB1 CDCA8 CENPECHEK1 KIF18AKIF18B NCAPGNCAPH TRIP13 UBE2C CENPMCENPN HJURP S100A7CCNA2 MCM10 TRIM58 S100A8CCNB2FOXM1 0.05 0.10 0.15 DLGAP5 MAD2L1 NUSAP1 ERCC6L CXCL10CXCL11CXCL13CXCL14 CX3CR1CHAF1B HLA GeneRatio Cell cycle d biological_process e h Proteasome protein complex subunit organization regulation of microtubule cytoskeleton organization DNA replication cellular process single−organism cellular process IL−17 signaling pathway Ribosome biogenesis in eukaryotes Primary immunodeficiency antigen processing and presentation of exogenous peptide antigen via MHC class II Biosynthesis of amino acids Rheumatoid arthritis chromosome segregation regulation of microtubule−based process p53 signaling pathway cellular component organization or biogenesis cell cycle NF−kappa B signaling pathway chromosome localization Antigen processing and presentation microtubule−based process chromosome segregation p.adjust establishment of chromosome localization Oocyte meiosis Cellular senescence 0.011 regulation of nuclear division TNF signaling pathway 0.012 sister chromatid segregation mitotic nuclear division RNA transport 0.013 Toll−like receptor signaling pathway 0.014 cell cycle process mitotic spindle checkpoint regulation of mitotic nuclear division Natural killer cell mediated cytotoxicity 0.015 mitotic cell cycle T cell receptor signaling pathway cellular component organization regulation of chromosome separation Chemokine signaling pathway 0.016 Measles regulation of chromosome segregation nuclear division Tuberculosis regulation of chromosome organization size Carbon metabolism mitotic spindle assembly checkpoint 10 Influenza A p.adjust negative regulation of chromosome separation Cytokine−cytokine receptor interaction mitotic cell cycle process lymphocyte chemotaxis 20 organelle organization 1e−10 meiotic cell cycle HTLV−I infection nuclear chromosome segregation 30 Viral carcinogenesis 2e−10 regulation of mitotic cell cycle phase transition Focal adhesion 3e−10 positive regulation of mitotic cell cycle ECM−receptor interaction 4e−10 p.adjust Drug metabolism − cytochrome P450 Protein digestion and absorption anaphase−promoting complex−dependent catabolic process 0.001 −2 0 2 4 organelle fission cytoskeleton organization chromosome organization relationship protein localization to kinetochore regulation of lymphocyte chemotaxis single−organism organelle organization is_a 0.002 Protein digestion and absorption part_of i j congenital nervous system abnormality 0.003 physical disorder 1 hypospadias nervous system benign neoplasm p.adjust chemokine−mediated signaling pathway 0 congenital heart disease chromosome condensation 0.01 mitotic nuclear division neuroma neurilemmoma G2/M transition of mitotic cell cycle 0.02 nuclear division microtubule cytoskeleton organizationsister chromatid segregation −1 prostate cancer histone phosphorylation male reproductive organ cancer 0.03 −2 autosomal dominant disease 0.04 Ranked list metric Ranked prostate carcinoma hereditary breast ovarian cancer central nervous system cancer GeneRatio microtubule cytoskeleton organization involved in mitosis chronic inflammatory response 0.0 gallbladder carcinoma mitotic sister chromatid segregation sensory system cancer 0.02 spindle organization peripheral nervous system neoplasm 0.04 −0.2 mood disorder musculoskeletal system cancer 0.06 connective tissue cancer −0.4 osteosarcoma pancreatic cancer mitotic spindle organization DNA replication CBX6_BF CBX7_BF HMF_CTCF AoAF_CTCF 0 4000 8000 12000 (377) (443) (5052) (5235) Running Enrichment Score Position in the Ranked List of Genes

Figure 1. Visualization methods for enrichment result. (a) Gene-concept network depicts the linkages of genes and biological concepts as a network. (b) Circular version of gene-concept network. (c) Displaying gene-concept relationship as a heatmap with gene expression data integrated. (d) Directed acyclic graph induced by most significant GO terms. (e) Enrichment map organizes enriched terms into a network with edges weighted by the ratio of overlapping gene sets. Mutually overlapping gene sets are tending to cluster together, making it easy to identify functional modules. (f) Bar chart to display gene count or ratio as bar height and coloured by enrichment scores (e.g. p.adjust). (g) Dot chart that similar to bar chart with capability to encode another score as dot size (also known as bubble plot). Both bar and dot charts support faceting to visualize sub-ontologies simultaneously. (h) Ridge line plot for expression distribution of GSEA result. By default, only display the distribution of core enriched genes (also known as leading edge) and can be switched to visualize the distribution of whole gene sets. (i) Running score and pre-ranked list of GSEA result. (j) Comparing enriched results across multiple experiments. bioRxiv preprint doi: https://doi.org/10.1101/256784; this version posted February 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

This example compared disease associations of multiple genome-wide ROIs (data from GSM1295076, GSM1295077, GSM749665, GSM749666).