Semantic Scholar Document Analysis at Scale
Miles Crawford, Director of Engineering

Outline
● Introduction to the Allen Institute for Artificial Intelligence and Semantic Scholar
● Research at Semantic Scholar
● Creating www.semanticscholar.org
● Other resources for researchers

Introduction to AI2 and S2

“AI for the common good”
Allen Institute for Artificial Intelligence
● Mosaic: Common Sense Knowledge and Reasoning
● Aristo: Machine Reading and Question Answering
● Euclid: Math and Geometry Comprehension
● AllenNLP: Deep Semantic NLP Platform
● PRIOR: Visual Reasoning
● Semantic Scholar: AI-Based Academic Knowledge

Semantic Scholar: Vision & Strategy
Semantic Scholar makes the world's scholarly knowledge easy to survey and consume.
● Differentiation: S2 is dramatically better at surveying, extracting, and helping researchers consume the most relevant information from the world's research
● Scale: Attract and retain a significant and sustainable share of academic search traffic
● Impact on research with AI: Work towards a “Wright Brothers” moment for research through research on novel AI techniques that are prototyped with millions of active users

semanticscholar.org

Research at Semantic Scholar

Research at S2: Three Levels of Analysis
● Paper
● Relationships
● Macro

Paper: Extract meaningful structures
● Figures
● Tables
● Topics: Neural Networks, Omniglot, Backpropagation
● Relations: C-peptide [contraindicated with] Diabetes Mellitus
● Results: “The addition of MbPA reaches a test perplexity of 29.2 which is, to the authors’ knowledge, state-of-the-art at time of writing.”
Peters et al. ACL 2017 -- Semi-supervised Sequence Tagging with Bidirectional Langua…
Ammar et al. SemEval 2017 -- Semi-supervised End-to-end Entity and Relation Extrac…
Siegel et al. JCDL 2018 -- Extracting Scientific Figures with Distantly Supervised Neural…

Relationships: Establishing Connections
[Diagram labels: Ontology Matching; uses method; should cite; UMLS; Discovered KB]
Ammar et al. NAACL 2018 -- Construction of the Literature Graph in Semantic Scholar
Wang et al. BioNLP 2018 -- Ontology Alignment in the Biomedical Domain Using Entit…
Bhagavatula et al. NAACL 2018 -- Content-Based Citation Recommendation

Macro: Trends in research
● pre-publishing vs. citation rates
● understanding peer reviews
● quantify demographic bias in clinical trials
Kang et al. NAACL 2018 -- A Dataset of Peer Reviews (PeerRead): Collection, Insights...
Feldman et al. arXiv 2018 -- Citation Count Analysis for Papers with Preprints

2018 Publications
● A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications. Kang, Ammar, Dalvi, van Zuylen, Kohlmeier, Hovy, Schwartz. NAACL 2018
● Extracting Scientific Figures with Distantly Supervised Neural Networks. Siegel, Lourie, Power, Ammar. JCDL 2018
● Content-Based Citation Recommendation. Bhagavatula, Feldman, Power, Ammar. NAACL 2018
● Neural Relation Extraction using a Combination of Full Supervision and Distant Supervision. Beltagy, Lo, Ammar. arXiv preprint 2018
● Construction of the Literature Graph in Semantic Scholar. Ammar, Groeneveld, Bhagavatula, …, Etzioni. NAACL 2018
● Citation Count Analysis for Papers with Preprints. Feldman, Lo, Ammar. arXiv preprint 2018
● Ontology Alignment in the Biomedical Domain Using Entity Definitions and Context. Wang, Bhagavatula, Neumann, Lo, Wilhelm, Ammar. BioNLP 2018

Bringing it All to Production

semanticscholar.org: Scale and Reach
● Comprehensive literature graph to explore papers, authors, topics.
● View and search across 40MM+ Computer Science & BioMedical academic papers.
● Expansion to all of science in 2019.
● Topic extraction covering 350k+ topics, 230MM+ mentions, and 6.7MM+ relationships.
● Used by over 2.2MM unique users per month.
● Global adoption.

Architecture Overview
[Diagram, four columns]
● Data Acquisition: Web Crawler, DBLP, PubMed, Springer, etc. (fed from the Internet)
● Pipeline: Junk Filter; De-duplicate & Cluster; Create Citation Graph; Join Extracted Data; Create Search Index
● Machine Learning Models: Figures, Entities, Metadata, etc.
● Web Application: Node Server, React Application, User’s Browser, API, Elasticsearch

Normalizing Paper Records
Data Acquisition: each source produces "Sourced Paper" records, which are stored in Amazon S3.
● Crawler: Title: N/A; Authors: N/A; PDF: a2ho-4.pdf; etc...
● PubMed: Title: Analyzing...; Authors: A, B, C; PDF: N/A; etc...
● Springer: Title: Findings of...; Authors: A, B, C; PDF: findings.pdf; etc...
● Etc...

Pipeline: Distributed Paper Resolution
● Title Block: analyzscientificdocument
○ Title: Analyzing Scientific Documents; Author: John Doe
○ Title: Analyzing Scientific Documents; Author: John Doe
○ Title: Analyzing scientific documents; Author: J. Doe
○ Title: Analyze Scientific Documents; Author: Jane Smith
○ Title: Analyze Scientific Documents; Author: Jane Smith
● Title Block: deeplearningfornlp
○ Title: Deep Learning for NLP; Author: Jane Smith
○ Title: Deep Learning for NLP; Author: Jane Smith

Pipeline: Distributed Citation/Reference Matching
Citing paper:
Title: Structured Content Extraction
Author: John Doe
References:
1. Analyzing Scientific Documents, Doe et al., SCIDOCA 2016
2. Deep Learning with NLP, Smith, NAACL 2017
3. Structural Heuristics in Academic Documents, Oscar, SCIDOCA 2016
Candidate papers:
● Title Block: analyzscientificdocument
○ Title: Analyzing Scientific Documents; Author: John Doe
○ Title: Analyze Scientific Documents; Author: Jane Smith
● Title Block: deeplearningfornlp
○ Title: Deep Learning for NLP; Author: Jane Smith
● ?

Pipeline: Scalable ML Models
● Notifications for new or changed papers feed the Model Task System.
● Table of extraction inputs and results per model, one row per paper (Paper 1: abc, Paper 2: def, Paper 3: hij, Paper 4, etc...), with some entries still empty.
● Docker containers on autoscaling host groups: Figure Extraction v1.0, Figure Extraction v1.1, Key Results v1.0, Topic Location v1.3, etc...

Other Resources

Open Research Corpus
● Full Corpus of Papers:
○ Some data is restricted due to licensing/copyright conditions.
○ Record-per-line JSON archive, 36GB.
○ http://labs.semanticscholar.org/corpus/
● API:
○ Lightweight single paper or author access.
○ JSON over REST API.
○ Link resolver for easy linking to Semantic Scholar.
○ http://api.semanticscholar.org/

Use of Open Corpus as Training Set
An independent researcher created a system for embedding papers as vectors to support similarity measurements and recommendation generation.
https://github.com/Santosh-Gupta/Research2Vec

ArXiv Added a Bibliography via API
ArXiv used our API to fetch and re-display citation and reference lists on their pages.

Open Source Repositories
● Cite-o-matic: Find relevant citations for drafts of papers. https://github.com/allenai/citeomatic
● Science Parse v1: CRF-based extraction of metadata from PDFs. https://github.com/allenai/science-parse
● Science Parse v2: LSTM-based extraction of metadata from PDFs. https://github.com/allenai/spv2
● Deep Figures: Extract figures and tables from papers. https://github.com/allenai/deepfigures-open

What would you like to see?

Thank you!
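
Sketch: title blocks for paper resolution and reference matching

The title-block examples on the two pipeline slides (Distributed Paper Resolution and Distributed Citation/Reference Matching) can be illustrated with a small Python sketch. It is not the production pipeline. The blocking function (`title_block`, which truncates each lowercased word to six characters), the cluster key (normalized title plus author last name), and the sample `RECORDS` are all assumptions made for illustration; the slides do not say how the real keys such as "analyzscientificdocument" are computed or how clusters are decided.

```python
import re
from collections import defaultdict

# Hypothetical sample records, modeled on the examples in the pipeline slides.
RECORDS = [
    {"title": "Analyzing Scientific Documents", "author": "John Doe"},
    {"title": "Analyzing scientific documents", "author": "J. Doe"},
    {"title": "Analyze Scientific Documents", "author": "Jane Smith"},
    {"title": "Deep Learning for NLP", "author": "Jane Smith"},
]


def words(title):
    """Lowercase a title and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", title.lower())


def title_block(title, prefix_len=6):
    """Coarse blocking key (assumed scheme): truncated words, concatenated."""
    return "".join(w[:prefix_len] for w in words(title))


def cluster_key(record):
    """Finer key used inside a block: normalized title plus author last name."""
    return (" ".join(words(record["title"])), record["author"].split()[-1].lower())


def build_blocks(records):
    """Group records by title block, then by cluster key within each block."""
    blocks = defaultdict(lambda: defaultdict(list))
    for record in records:
        blocks[title_block(record["title"])][cluster_key(record)].append(record)
    return blocks


def match_reference(ref_title, blocks):
    """Look only inside the reference's own block for an exact normalized title."""
    wanted = " ".join(words(ref_title))
    candidates = blocks.get(title_block(ref_title), {})
    return [key for key in candidates if key[0] == wanted]


blocks = build_blocks(RECORDS)
for block, clusters in blocks.items():
    print(block, {key: len(members) for key, members in clusters.items()})

print(match_reference("Analyzing Scientific Documents", blocks))
print(match_reference("Deep Learning with NLP", blocks))
```

In this sketch the two "Doe" records land in one cluster, while the "Analyze Scientific Documents" record shares their block but forms its own cluster. The reference "Deep Learning with NLP" finds no candidate because its block key differs from that of "Deep Learning for NLP"; a real system would presumably need a fuzzier comparison for such near misses. Because each record and each reference is compared only against its own block, the block key could plausibly serve as a partition key when the work is distributed.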
[email protected]
http://allenai.org
https://semanticscholar.org

Appendix

1st dataset of peer reviews
Dongyeop Kang, Waleed Ammar, Bhavana Dalvi Mishra, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy, Roy Schwartz. NAACL 2018

Prepublishing before paper gets accepted is associated with 65% more citations!
Sergey Feldman, Kyle Lo, Waleed Ammar. arXiv 2018

Gender bias in clinical trials

Cite-o-matic
Chandra Bhagavatula, Sergey Feldman, Russell Power, Waleed Ammar. NAACL 2018

Literature graph
Synthesize knowledge
Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. NAACL 2018 (industry track)

Ontology alignment
Lucy Wang, Chandra Bhagavatula, Mark Neumann, Kyle Lo, Chris Wilhelm and Waleed Ammar. BioNLP 2018

Context-sensitive word representations
Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, Russell Power. ACL 2017
Waleed Ammar, Matthew E. Peters, Chandra Bhagavatula, Russell Power. SemEval 2017

Figure extraction
Noah Siegel, Nicholas Lourie, Russell Power, Waleed Ammar. JCDL 2018.

Next steps
1. Extract Meaningful Structures
2018 2019
● Address more paper FAQs (e.g., PICO elements).
● Model salience of extracted structures at the document level.
● User’s implicit/explicit feedback.
● Adapt existing models to more domains. Ganin and Lempitsky. ICML’15

Next steps
2. Establish Connections
2018