Harnessing the Power of Content
EXTRACTING VALUE FROM SCIENTIFIC LITERATURE: THE POWER OF MINING FULL-TEXT ARTICLES

CONTENTS
1. Executive Summary
2. Understanding Biology in the Era of Big Data
3. The Challenge of Mining the Scientific Literature
4. Manual Curation is a Problematic Option
5. Text-mining Technology Provides a Better Solution
6. A Wide Net is Needed to Capture Relevant Results
7. More Facts are Found in Full-Text Articles
8. Co-occurrences are Often Mentioned in Full-text First
9. Conclusion

ELSEVIER - WHITE PAPER - HARNESSING THE POWER OF CONTENT • 2014

1. EXECUTIVE SUMMARY

Biological researchers face an overwhelming and constantly growing body of scientific literature that they must search through in order to keep up with developments in their field. Staying up to date with new findings is critical, yet seemingly impossible. With more than 1 million new citations added to PubMed each year, how can anyone possibly read through all of the relevant full-text articles to uncover important data?

For years, manual review and curation of scientific publications has been the gold standard, with computer-based solutions ranking far behind in terms of accuracy and completeness. Fortunately, times have changed: researchers now have ready access to significant computational power, and the capabilities of text-mining software have improved dramatically. Elsevier's proprietary text-mining technology has proven that well-designed software applications can effectively "read" full-text articles, extracting far more relevant biological information than is found in article abstracts alone, with accuracy comparable to that of human researchers and with much higher throughput.
This technology helps ensure researchers are not missing out on valuable insights from the literature, giving them a better understanding of the biology behind specific pathologies and drug responses, supporting the target discovery process, and offering the competitive advantage that they need to succeed in their work.

2. UNDERSTANDING BIOLOGY IN THE ERA OF BIG DATA

Biological research today can be summarized in one word – data. Massive quantities of information are generated daily by large and small laboratories, individuals, universities, and commercial organizations. Their findings, along with those of other individuals or groups, are published in multiple papers, across a wide range of scientific journals, over a period of years. On average, each year more than a million new records are added to PubMed – a service of the National Library of Medicine that at present indexes more than 23 million references comprising the MEDLINE bibliographic database. New findings are published, and yet it seems that each new result engenders more questions. The information needed to understand experimental results, describe fully an individual cellular pathway, or identify the interactions of multiple regulatory networks is scattered throughout dozens, if not hundreds or thousands, of articles and publications. The volume and complexity of the data have outstripped researchers' ability to keep up with the progress of science, even in their chosen fields. Researchers are left with a question that's impossible to answer – "What have I missed?"

What if you had a tool to dramatically reduce your odds of missing critically important research results? One way to speed up this process is to create a database of scientific knowledge (a "knowledgebase") from published literature that can then be used to quickly find and summarize information related to a particular disease or phenotype, as an alternative to reading hundreds of scientific articles. Biological knowledgebases can also be used to build detailed models of complex diseases to help uncover the underlying cause of a disease at the level of gene and protein networks, thereby helping to guide early target discovery, assessment, and prioritization.

Enter Pathway Studio: this integrated data mining and visualization software allows investigators to search for facts on proteins, genes, diseases, drugs, and other entities; analyze experimental results and interpret data in a biological context; and build and visualize disease models representing the underlying biology of a particular condition or response. Pathway Studio provides direct access to underlying evidence from the literature so that scientists can independently review and decide on the applicability of those facts to their research. Underpinning Pathway Studio, Elsevier's proprietary text-mining tool extracts facts from more than 2.5 million full-text articles in almost 1,500 biomedical journals, from Elsevier as well as other trusted publishers, along with more than 23 million PubMed abstracts. Having access to the information derived from all those publications gives researchers a measurable advantage in rendering a complete picture of the genes and proteins involved in the biology of a disease or response to a drug.

3. THE CHALLENGE OF MINING THE SCIENTIFIC LITERATURE

Researchers have three main choices for building knowledgebases derived from the burgeoning literature: reading the articles (or abstracts) themselves, manual curation by experts, or using a specialized automated text-mining tool.

Researchers are used to reading dozens if not hundreds of scientific articles to understand their area of research. But reading a full paper takes time – even very experienced researchers can only read and fully comprehend a few articles an hour. One option to save time is to simply read the abstract of the paper. Abstracts are short, most are available on PubMed, and many can be read quickly. However, authors cannot include everything important in an abstract. Overall, the process of deciding what goes into the abstract is often somewhat subjective. For example, when other researchers cite a particular paper, 20% of the keywords they mention are not present in the abstract, which means that it didn't contain all the important information1. Multiple studies comparing the full text and abstract from the same paper concluded that less than half of the key facts from the body of a paper are present in its abstract2–5. Authors may choose to exclude from the abstract some technical or secondary information5, or information that is less positive or less supportive of the main idea of the publication: a study examining randomized controlled trials published in high-profile journals showed that nearly half of the papers that mentioned harm in the body of the paper did not report it in the abstract6. Also, authors sometimes underestimate a finding's ultimate value; what may appear to be a relatively minor finding in one paper may lead to extensive research years later, resulting in hundreds of follow-up articles.

Based on this information, every researcher would agree that it is not enough to read just the abstract; one must read the full paper. But researchers' time is limited – how can they identify all the articles relevant to their research, much less gather all the critical information present in those articles?

Overall, the current version of Elsevier's text-mining technology has a 98% accuracy rate for entity detection and an 88% accuracy rate for relationship extraction – figures that are essentially identical to high-quality annotation by human experts.

4. MANUAL CURATION IS A PROBLEMATIC OPTION

The alternative approach to personally reviewing the body of scientific literature is to rely on PhD researchers trained as curators. Traditionally, manual curation of papers has been viewed as the gold standard and assumed to be more accurate than automated systems. Although this was certainly true in the past, current text analysis technology is highly accurate, rivaling that of even experienced curators.

Even the best human curators are not perfectly accurate, nor are they completely consistent. Numerous scientific publications highlight errors and inconsistencies of human curators. For example, in one study using manual annotation of PubMed articles with Gene Ontology Annotation terms, only 39% of the terms assigned by three different curators were identical7. In another study8, the average precision of annotating medical events in clinical narratives by three experts was reported to be 88%. Drews et al. reported 73% inter-annotator agreement in extracting clinical data from medical records9. And low inter-annotator agreement (precision among pairs of annotators ranged from 31% to 77%, and recall ranged from 49% to 71%) was also reported for phenotype annotation for the BioCreative workshop10.

5. TEXT-MINING TECHNOLOGY PROVIDES A BETTER SOLUTION

The ability to "machine-read" and "understand" published articles has been in existence for approximately 35 years, but only recently has the sophistication of those machine-reading applications aligned with the wide availability of sufficient computational power to make fully automated text extraction a viable alternative to reading and manually curating scientific literature. Elsevier

It then analyzes the sentence structure and "decomposes" each sentence into a set of Subject-Verb-Object triplets (e.g., insulin regulates glucose uptake). Using strict linguistic rules, the software analyzes each triplet to determine how these concepts are related to each other and extracts the relationship between the Subject and the Object, guided by a
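The Subject-Verb-Object decomposition described above can be illustrated with a deliberately simplified sketch. The verb list, regular-expression matching, and function name below are illustrative assumptions made for this paper only; Elsevier's actual technology relies on full linguistic parsing and a curated lexicon rather than pattern matching.

```python
import re

# Toy lexicon of relation verbs (an assumption for illustration; a real
# system would use a large curated verb lexicon and a syntactic parser).
RELATION_VERBS = ["activates", "inhibits", "regulates", "binds"]

def extract_triplets(sentence):
    """Split a simple declarative sentence into (Subject, Verb, Object) triplets."""
    triplets = []
    for verb in RELATION_VERBS:
        # Match "<subject> <verb> <object>" with the verb as a whole word,
        # allowing an optional trailing period.
        m = re.match(rf"^(.+?)\s+{verb}\s+(.+?)[.]?$", sentence)
        if m:
            triplets.append((m.group(1), verb, m.group(2)))
    return triplets

print(extract_triplets("insulin regulates glucose uptake"))
# → [('insulin', 'regulates', 'glucose uptake')]
```

Each triplet can then be checked against linguistic rules and entity dictionaries to decide whether the Subject and Object are biological entities and what kind of relationship the verb expresses.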