
From Data to Knowledge

Fighting COVID-19 by Mining Insights from Heterogeneous Datasets

A case study on CORD-19 & Academic Graph

GO FAIR US Webinar 04/22/2020

Iris SHEN, Principal Data Scientist, Microsoft Research

OUTLINE

• CORD-19 Dataset [meta-data + full text; 50K+]

• Microsoft Academic Graph (MAG) [Academic Knowledge Graph; 230M+]

• Link CORD-19 and MAG

• Insights and More

CORD-19 Dataset

• WHO
  • AI2 / MSR / CZI / NIH / Georgetown U
• WHY
  • AI – fight COVID-19
• WHAT
  • Meta-data (52K)
  • Full text (41K)
  • 04-17 version
• HOW
• WHEN
  • Started Mar 16, 2020
  • Update weekly (Friday) / daily
• WHERE
  • https://pages.semanticscholar.org/coronavirus-research
  • https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

Microsoft Academic Graph (MAG)

• WHO
  • Microsoft Academic (MSR)
• WHY
  • AI – scholarly communications
• WHAT
  • Knowledge in the graph form
• HOW
• WHEN
  • Started in 2014
  • Update weekly
• WHERE
  • Project: https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/
  • Search portal: https://academic.microsoft.com/
  • Get the free graph: https://docs.microsoft.com/en-us/academic-services/graph/

CORD-19 + MAG

• WHY
  • Richer Semantics
  • Knowledge Graph for NLP tasks
  • Open more publications(?)
• WHAT
  • 91% of CORD-19 mapped to MAG (47K+)
  • CORD-19 Closure Graph (58M)
• HOW
  • Microsoft Academic Knowledge Exploration Service (MAKES) API (47K+, 91% of CORD-19)
  • Follow homogeneous citation links to create the closure graph (11 hops, <1% convergence, 58M+)
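As a rough illustration of the HOW step, the sketch below expands a seed set of mapped papers along citation links one hop at a time. It stops once a hop adds less than 1% new papers or a hop limit is reached. This is a minimal sketch under stated assumptions, not the implementation used to build the actual closure graph: the function name `closure_graph`, the `links` adjacency mapping, the direction of the links, and the reading of "<1% convergence" as a per-hop growth threshold are all hypothetical.

```python
from typing import Dict, Iterable, Set


def closure_graph(
    seeds: Iterable[int],
    links: Dict[int, Set[int]],
    max_hops: int = 11,        # hop limit, taken from the "11 hops" on the slide
    min_growth: float = 0.01,  # assumed meaning of "<1% convergence"
) -> Set[int]:
    """Expand a seed set of paper IDs along citation links until growth stalls."""
    closure = set(seeds)
    frontier = set(closure)
    for _ in range(max_hops):
        # Collect papers one hop away from the current frontier.
        new: Set[int] = set()
        for paper in frontier:
            new |= links.get(paper, set())
        new -= closure  # keep only papers not already in the closure
        if not new:
            break
        growth = len(new) / len(closure)
        closure |= new
        frontier = new
        # Stop once the latest hop added less than min_growth of the closure.
        if growth < min_growth:
            break
    return closure


# Toy example with hypothetical paper IDs (not real MAG identifiers).
toy_links = {1: {2, 3}, 2: {4}, 3: {4}, 4: {5}}
print(sorted(closure_graph([1], toy_links)))
```

In practice `links` would be derived from the citation table of the graph. Whether to follow outgoing references, incoming citations, or both is a design choice the slide does not specify.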

• WHEN
  • Update weekly
• WHERE
  • https://pages.semanticscholar.org/coronavirus-research
  • https://github.com/microsoft/mag-covid19-research-examples
  • https://www.microsoft.com/en-us/research/project/academic/articles/microsoft-academic-resources-and-their-application-to-covid-19-research

Enrich Semantics – Entity Stats Comparison

What are CORD-19 papers talking about?

MAG
• 700K+ topics / fields of study
  • 230K+ from Wikipedia
  • 420K+ from UMLS (NIH, Bio-Med)
  • 50K+ mined from MAG corpus
• 6-level auto-constructed taxonomy
  • 19 domains (manually-curated)
  • 270 sub-domains (manually-curated)

CORD-19 – 28K

Closure Graph – 600K

More interesting questions to be answered…

• Are the papers in CORD-19 enough? What else should be included? [With the help of the CORD-19 closure graph]

• Analytics questions / new insights – how topics evolve; information diffusion; team size vs. impact analysis (“disruptive index”), etc.

• Knowledge graph-assisted NLP / research

Summary

• CORD-19 Dataset [full text; 50K+]

• Microsoft Academic Graph (MAG) [Knowledge Graph; 230M+ publications]

• Link CORD-19 and MAG → richer semantics, closure graph, insights

THANK YOU & Questions?

Iris Shen, [email protected]
Principal Data Scientist, Microsoft Research

References

• Lo, Kyle, et al. “CORD-19: the COVID-19 Open Research Dataset.” ArXiv Preprint (to be released), 2020.
• Sinha, Arnab, et al. “An Overview of Microsoft Academic Service (MAS) and Applications.” WWW, 2015, pp. 243–246.
• Lo, Kyle, et al. “GORC: A Large Contextual Citation Graph of Academic Papers.” ArXiv Preprint ArXiv:1911.02782, 2019.
• Wang, Kuansan, et al. “A Review of Microsoft Academic Services for Science of Science Studies.” Frontiers in Big Data, vol. 2, 2019.
• Wang, Kuansan, et al. “Microsoft Academic Graph: When Experts Are Not Enough.” Quantitative Science Studies, vol. 1, no. 1, 2020, pp. 396–413.
• Shen, Zhihong, et al. “A Web-Scale System for Scientific Knowledge Exploration.” ACL, 2018, pp. 87–92.
• Wu, Lingfei, et al. “Large Teams Develop and Small Teams Disrupt Science and Technology.” Nature, vol. 566, no. 7744, 2019, pp. 378–382.