
Document Analysis at Scale

Miles Crawford, Director of Engineering

Outline

● Introduction to the Allen Institute for Artificial Intelligence and Semantic Scholar
● Research at Semantic Scholar
● Creating www.semanticscholar.org
● Other resources for researchers

Introduction to AI2 and S2

“AI for the common good”

Allen Institute for Artificial Intelligence

● Mosaic: Common Sense Knowledge and Reasoning
● Aristo: Machine Reading and Question Answering
● Euclid: Math and Geometry Comprehension
● AllenNLP: Deep Semantic NLP Platform
● PRIOR: Visual Reasoning
● Semantic Scholar: AI-Based Academic Knowledge

Semantic Scholar: Vision & Strategy

Semantic Scholar makes the world's scholarly knowledge easy to survey and consume.

Differentiation: S2 is dramatically better at surveying, extracting, and helping researchers consume the most relevant information from the world's research

Scale: Attract and retain a significant and sustainable share of academic search traffic

Impact on research with AI: Work towards a “Wright Brothers” moment for research through research on novel AI techniques that are prototyped with millions of active users

semanticscholar.org

Research at Semantic Scholar

Research at S2: Three Levels of Analysis

● Paper
● Relationships
● Macro

Paper: Extract meaningful structures

● Figures
● Tables
● Topics (e.g., Neural Networks, Omniglot)
● Relations (e.g., C-peptide [contraindicated with] Diabetes Mellitus)
● Results (e.g., “The addition of MbPA reaches a test perplexity of 29.2 which is, to the authors’ knowledge, state-of-the-art at time of writing.”)

Peters et al. ACL 2017 -- Semi-supervised Sequence Tagging with Bidirectional Langua…
Ammar et al. SemEval 2017 -- Semi-supervised End-to-end Entity and Relation Extrac…
Siegel et al. JCDL 2018 -- Extracting Scientific Figures with Distantly Supervised Neural…

Relationships: Establishing Connections

● Ontology Matching (UMLS, Discovered KB)
● Edge labels in the graph: “uses method”, “should cite”

Ammar et al. NAACL 2018 -- Construction of the Literature Graph in Semantic Scholar
Wang et al. BioNLP 2018 -- Ontology Alignment in the Biomedical Domain Using Entit…
Bhagavatula et al. NAACL 2018 -- Content-Based Citation Recommendation

Macro: Trends in research

● Pre-prints vs. citation rates
● Understanding peer reviews
● Quantify demographic bias in clinical trials

Kang et al. NAACL 2018 -- A Dataset of Peer Reviews (PeerRead): Collection, Insights...
Feldman et al. arXiv 2018 -- Citation Count Analysis for Papers with Preprints

2018 Publications

A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications
Kang, Ammar, Dalvi, van Zuylen, Kohlmeier, Hovy, Schwartz. NAACL 2018

Extracting Scientific Figures with Distantly Supervised Neural Networks
Siegel, Lourie, Power, Ammar. JCDL 2018

Content-Based Citation Recommendation
Bhagavatula, Feldman, Power, Ammar. NAACL 2018

Neural Relation Extraction using a Combination of Full Supervision and Distant Supervision
Beltagy, Lo, Ammar. arXiv 2018

Construction of the Literature Graph in Semantic Scholar
Ammar, Groeneveld, Bhagavatula, …, Etzioni. NAACL 2018

Citation Count Analysis for Papers with Preprints
Feldman, Lo, Ammar. arXiv preprint 2018

Ontology Alignment in the Biomedical Domain Using Entity Definitions and Context
Wang, Bhagavatula, Neumann, Lo, Wilhelm, Ammar. BioNLP 2018

Bringing it All to Production

semanticscholar.org: Scale and Reach

● Comprehensive literature graph to explore papers, authors, topics.

● View and search across 40MM+ computer science and biomedical academic papers.

● Expansion to all of science in 2019.

● Topic extraction covering 350k+ topics, 230MM+ mentions, and 6.7MM+ relationships.

● Used by over 2.2MM unique users per month.

● Global adoption.

Architecture Overview

[Architecture diagram; recoverable components:]

● Data Acquisition: Web Crawler, DBLP, PubMed, Springer, etc.
● Pipeline: Junk Filter → De-duplicate & Cluster → Create Citation Graph → Join Extracted Data → Create Search Index
● Models: Figures, Entities, etc.
● Web Application: Node Server, React Application, Metadata API, Elasticsearch; reached from the React Application in the user's browser over the Internet

Normalizing Paper Records

[Diagram: each Data Acquisition source produces Sourced Paper records with the same fields; the diagram also shows Amazon S3 as storage]

● Crawler: Title: N/A; Authors: N/A; PDF: a2ho-4.pdf; etc...
● PubMed: Title: Analyzing...; Authors: A, B, C; PDF: N/A; etc...
● Springer: Title: Findings of...; Authors: A, B, C; PDF: findings.pdf; etc...
● Etc...

Distributed Paper Resolution
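The "Normalizing Paper Records" slide shows three sources filling the same fields (Title, Authors, PDF), with "N/A" where a source has nothing. A minimal sketch of that idea follows. The `SourcedPaper` class, the `normalize` function, and the choice to map "N/A" to `None` are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SourcedPaper:
    """One record from one source (hypothetical shape); missing fields are None."""
    source: str
    title: Optional[str]
    authors: Tuple[str, ...]
    pdf: Optional[str]


def _clean(value: Optional[str]) -> Optional[str]:
    """Map empty strings and 'N/A' placeholders to None."""
    if value is None:
        return None
    value = value.strip()
    return None if value in ("", "N/A") else value


def normalize(source: str, raw: dict) -> SourcedPaper:
    """Convert a raw source dict into the common record shape."""
    authors_raw = _clean(raw.get("Authors"))
    authors = tuple(a.strip() for a in authors_raw.split(",")) if authors_raw else ()
    return SourcedPaper(source, _clean(raw.get("Title")), authors, _clean(raw.get("PDF")))


# The three example records from the slide.
records = [
    normalize("Crawler", {"Title": "N/A", "Authors": "N/A", "PDF": "a2ho-4.pdf"}),
    normalize("PubMed", {"Title": "Analyzing...", "Authors": "A, B, C", "PDF": "N/A"}),
    normalize("Springer", {"Title": "Findings of...", "Authors": "A, B, C", "PDF": "findings.pdf"}),
]
```

Using `None` rather than the string "N/A" lets later stages tell "no value" apart from a real title that happens to read "N/A".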

[Diagram, pipeline stage “De-duplicate & Cluster”: sourced records are grouped into title blocks, then resolved into papers within each block]

Title Block: analyzscientificdocument
● Records: “Analyzing Scientific Documents” (John Doe); “Analyzing scientific documents” (J. Doe); “Analyze Scientific Documents” (Jane Smith)
● Resolved papers: “Analyzing Scientific Documents” (John Doe); “Analyze Scientific Documents” (Jane Smith)

Title Block: deeplearningfornlp
● Records: “Deep Learning for NLP” (Jane Smith)
● Resolved papers: “Deep Learning for NLP” (Jane Smith)

Distributed Citation/Reference Matching
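The "Distributed Paper Resolution" slide groups records by a title block key and then resolves duplicates inside each block. The sketch below illustrates that two-step shape under loud assumptions: the suffix rule in `_stem` is invented solely so that the keys match the two examples on the slide, and exact matching on lowercased title plus author surname and first initial stands in for whatever similarity model the real pipeline uses.

```python
import re
from collections import defaultdict


def _stem(word: str) -> str:
    """Invented rule: drop a common suffix only when a stem of 6+ letters remains."""
    for suffix in ("ing", "s", "e"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 6:
            return stem
    return word


def title_block(title: str) -> str:
    """Blocking key: lowercase words, crudely stemmed, joined without spaces."""
    return "".join(_stem(w) for w in re.findall(r"[a-z0-9]+", title.lower()))


def author_key(name: str) -> tuple:
    """(surname, first initial), so 'John Doe' and 'J. Doe' compare equal."""
    parts = name.replace(".", "").split()
    return (parts[-1].lower(), parts[0][0].lower())


def resolve(records):
    """Group (title, author) records into blocks, then into clusters per block."""
    blocks = defaultdict(list)
    for title, author in records:
        blocks[title_block(title)].append((title, author))
    clusters = defaultdict(list)
    for key, members in blocks.items():  # each block could run on a separate worker
        for title, author in members:
            clusters[(key, title.lower(), author_key(author))].append((title, author))
    return list(clusters.values())


# Records from the slide.
RECORDS = [
    ("Analyzing Scientific Documents", "John Doe"),
    ("Analyzing scientific documents", "J. Doe"),
    ("Analyze Scientific Documents", "Jane Smith"),
    ("Deep Learning for NLP", "Jane Smith"),
]
CLUSTERS = resolve(RECORDS)
```

Blocking matters because records are only compared within a block, so the work splits cleanly across machines.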

[Diagram, pipeline stage “Create Citation Graph”: each reference of a paper is matched against the resolved papers, grouped by title block]

Citing paper: Title: Structured Content Extraction; Author: John Doe

References:
1. Analyzing Scientific Documents, Doe et al., SCIDOCA 2016
2. Deep Learning with NLP, Smith, NAACL 2017
3. Structural Heuristics in Academic Documents, Oscar, SCIDOCA 2016

Candidate papers:
● Title Block: analyzscientificdocument
  ○ Title: Analyzing Scientific Documents; Author: John Doe
  ○ Title: Analyze Scientific Documents; Author: Jane Smith
● Title Block: deeplearningfornlp
  ○ Title: Deep Learning for NLP; Author: Jane Smith
● ? (apparently a reference with no matching paper)

Scalable ML Models
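The "Distributed Citation/Reference Matching" slide pairs each reference string with a resolved paper, and one reference ends in a "?". A rough sketch of such matching follows, using fuzzy title comparison from Python's standard `difflib`. The paper ids, the 0.8 cutoff, and the rule "the title is the text before the first comma" are all assumptions for illustration. A real system would presumably restrict candidates by title block instead of scanning every paper.

```python
import difflib

# Resolved papers from the slide; the ids are hypothetical.
PAPERS = {
    "p1": "Analyzing Scientific Documents",
    "p2": "Analyze Scientific Documents",
    "p3": "Deep Learning for NLP",
}

# Reference strings of the citing paper, as shown on the slide.
REFERENCES = [
    "Analyzing Scientific Documents, Doe et al., SCIDOCA 2016",
    "Deep Learning with NLP, Smith, NAACL 2017",
    "Structural Heuristics in Academic Documents, Oscar, SCIDOCA 2016",
]


def match_reference(reference: str, papers: dict, cutoff: float = 0.8):
    """Return the id of the closest paper title, or None if nothing is close enough."""
    ref_title = reference.split(",")[0].strip().lower()
    by_title = {title.lower(): pid for pid, title in papers.items()}
    hits = difflib.get_close_matches(ref_title, list(by_title), n=1, cutoff=cutoff)
    return by_title[hits[0]] if hits else None


# One citation edge per reference; None plays the role of the slide's "?".
EDGES = [match_reference(ref, PAPERS) for ref in REFERENCES]
```

The second reference reads "with NLP" while the paper reads "for NLP", which is why an exact lookup would not be enough here.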

[Diagram, pipeline stage “Join Extracted Data”]

● Notifications for new or changed papers go to the Model Task System.
● The Model Task System keeps a table of extraction inputs and results per model, one row per paper (Paper 1: abc, Paper 2: def, Paper 3: hij, Paper 4: ...); some cells are still empty.
● Models run as Docker containers on autoscaling host groups: Figure Extraction, Key Results, Topic Location, etc., each at its own version (v1.0, v1.1, v1.3, ...).

Other Resources

Corpus
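The "Scalable ML Models" slide describes a task system driven by notifications for new or changed papers, with a table of inputs and results per model. One way to picture that bookkeeping is sketched below. The class name, the model names and versions, and the use of a content hash to detect changes are assumptions made for this example, not a description of the actual system.

```python
import hashlib

# Hypothetical current model versions.
MODEL_VERSIONS = {"figure_extraction": "v1.1", "key_results": "v1.0"}


class ModelTaskTable:
    """Tracks, per (paper, model), which input and model version produced the result."""

    def __init__(self, model_versions):
        self.model_versions = dict(model_versions)
        self.rows = {}  # (paper_id, model) -> (version, input_hash, result)

    def pending(self, paper_id: str, content: bytes):
        """Tasks to run after a 'new or changed paper' notification."""
        digest = hashlib.sha256(content).hexdigest()
        tasks = []
        for model, version in self.model_versions.items():
            row = self.rows.get((paper_id, model))
            # Re-run when there is no result, the model was upgraded, or the input changed.
            if row is None or row[0] != version or row[1] != digest:
                tasks.append((paper_id, model, version, digest))
        return tasks

    def record(self, task, result):
        """Store the result a worker (e.g. a container) produced for a task."""
        paper_id, model, version, digest = task
        self.rows[(paper_id, model)] = (version, digest, result)
```

With this shape, bumping one entry in `model_versions` makes only that model's rows stale, so an upgrade re-runs one extractor rather than all of them.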

● Full Corpus of Papers:
  ○ Some data restricted due to licensing conditions.
  ○ Record-per-line JSON archive, 36GB.
  ○ http://labs.semanticscholar.org/corpus/
● API:
  ○ Lightweight single paper or author access.
  ○ JSON over REST API.
  ○ Link resolver for easy linking to Semantic Scholar.
  ○ http://api.semanticscholar.org/

Use of Open Corpus as Training Set
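Because the corpus is a record-per-line JSON archive of about 36GB, it is best read as a stream, one line at a time, without loading the whole file. A small sketch of that pattern follows. The field names `id`, `title`, and `year` and the sample records are placeholders, since the slides do not list the corpus schema.

```python
import gzip
import io
import json


def open_corpus(path):
    """Open a plain or gzip-compressed corpus file as text."""
    opener = gzip.open if str(path).endswith(".gz") else open
    return opener(path, "rt", encoding="utf-8")


def iter_records(lines):
    """Yield one dict per non-empty line, without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_year(records):
    """Example aggregation: number of records per 'year' value (assumed field)."""
    counts = {}
    for record in records:
        year = record.get("year")
        counts[year] = counts.get(year, 0) + 1
    return counts


# In-memory stand-in for a corpus file; real use would pass open_corpus(path).
SAMPLE = io.StringIO(
    '{"id": "p1", "title": "Example paper 1", "year": 2016}\n'
    '{"id": "p2", "title": "Example paper 2", "year": 2017}\n'
    '\n'
    '{"id": "p3", "title": "Example paper 3", "year": 2017}\n'
)
COUNTS = count_by_year(iter_records(SAMPLE))
```

Memory use stays flat regardless of archive size because only one record is held at a time.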

An independent researcher created a system for embedding papers as vectors to support similarity measurements and recommendation generation.

https://github.com/Santosh-Gupta/Research2Vec

ArXiv Added a Bibliography via API
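To illustrate what "embedding papers as a vector to support similarity measurements and recommendation generation" means in practice, here is a toy sketch using cosine similarity. It is not taken from Research2Vec: the two-dimensional vectors, the paper ids, and the choice of cosine similarity are all assumptions for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(query_id, embeddings, k=2):
    """Return the ids of the k papers most similar to the query paper."""
    query = embeddings[query_id]
    scored = [
        (cosine(query, vec), pid)
        for pid, vec in embeddings.items()
        if pid != query_id
    ]
    return [pid for _, pid in sorted(scored, reverse=True)[:k]]


# Toy embeddings; a trained model would produce vectors with many more dimensions.
EMBEDDINGS = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
TOP = recommend("a", EMBEDDINGS, k=1)
```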

ArXiv used our API to fetch and re-display citation and reference lists on their pages.

Repositories

Cite-o-matic: Find relevant citations for drafts of papers. https://github.com/allenai/citeomatic

Science Parse v1: CRF-based extraction of metadata from PDFs. https://github.com/allenai/science-parse

Science Parse v2: LSTM-based extraction of metadata from PDFs. https://github.com/allenai/spv2

Deep Figures: Extract figures and tables from papers. https://github.com/allenai/deepfigures-open

What would you like to see?

Thank you!

http://allenai.org
https://semanticscholar.org

Appendix

1st dataset of peer reviews

Dongyeop Kang, Waleed Ammar, Bhavana Dalvi Mishra, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy, Roy Schwartz. NAACL 2018

Prepublishing before a paper gets accepted is associated with 65% more citations!

Sergey Feldman, Kyle Lo, Waleed Ammar. arXiv 2018

Gender bias in clinical trials

Cite-o-matic

Chandra Bhagavatula, Sergey Feldman, Russell Power, Waleed Ammar. NAACL 2018

Literature graph

Synthesize knowledge

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. NAACL 2018 (industry track)

Ontology alignment

Lucy Wang, Chandra Bhagavatula, Mark Neumann, Kyle Lo, Chris Wilhelm and Waleed Ammar. BioNLP 2018

Context-sensitive word representations

Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, Russell Power. ACL 2017
Waleed Ammar, Matthew E. Peters, Chandra Bhagavatula, Russell Power. SemEval 2017

Figure extraction

Noah Siegel, Nicholas Lourie, Russell Power, Waleed Ammar. JCDL 2018.

Next steps

1. Extract Meaningful Structures

Timeline: 2018 to 2019

● Address more paper FAQs (e.g., PICO elements).
● Model salience of extracted structures at the document level.
● Users’ implicit/explicit feedback.
● Adapt existing models to more domains (Ganin and Lempitsky, ICML’15).

2. Establish Connections

Timeline: 2018 to 2019

● More work on ontology matching and deduplication.
● More work on author disambiguation.
● Implement v1.0 of Knowledge Graph Querying. Example query: find researchers who have first-author publications in ACL or NIPS since 2017, have worked on knowledge completion or question answering since 2010, and are affiliated with a university in the US.
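The example query for Knowledge Graph Querying can be read as three conditions joined by "and". The sketch below spells them out over a tiny in-memory dataset. Everything in it is hypothetical: the `Researcher` and `Publication` shapes, the made-up names, and the reading of "worked on ... since 2010" as "has at least one paper on the topic from 2010 or later".

```python
from dataclasses import dataclass

VENUES = {"ACL", "NIPS"}
TOPICS = {"knowledge completion", "question answering"}


@dataclass
class Publication:
    venue: str
    year: int
    authors: list   # ordered; authors[0] is the first author
    topics: set


@dataclass
class Researcher:
    name: str
    affiliation_type: str     # e.g. "university" or "company"
    affiliation_country: str


def matches(researcher: Researcher, publications: list) -> bool:
    """True if the researcher satisfies all three parts of the example query."""
    own = [p for p in publications if researcher.name in p.authors]
    first_author_recent = any(
        p.authors[0] == researcher.name and p.venue in VENUES and p.year >= 2017
        for p in own
    )
    on_topic = any(p.year >= 2010 and bool(p.topics & TOPICS) for p in own)
    us_university = (
        researcher.affiliation_type == "university"
        and researcher.affiliation_country == "US"
    )
    return first_author_recent and on_topic and us_university


# Made-up data for illustration only.
RESEARCHERS = [
    Researcher("Ada Example", "university", "US"),
    Researcher("Bo Sample", "company", "US"),
    Researcher("Cy Placeholder", "university", "US"),
]
PUBLICATIONS = [
    Publication("ACL", 2018, ["Ada Example", "Bo Sample"], {"question answering"}),
    Publication("NIPS", 2017, ["Bo Sample"], {"knowledge completion"}),
    Publication("ACL", 2015, ["Cy Placeholder"], {"question answering"}),
]
RESULT = [r.name for r in RESEARCHERS if matches(r, PUBLICATIONS)]
```

The query touches authorship order, venue, topic, and affiliation at once, which is why it depends on the disambiguation and matching work listed in the other items.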

3. Macro Analysis

Timeline: 2018 to 2019

● Continue analyzing bias in clinical trials.
● Study influence of government vs. industry funding.
● Study evolution and influence of research communities.