Commmon Factors for Parkinson's Disease and Crohn's Disease
Total Page:16
File Type:pdf, Size:1020Kb
© The MITRE Corporation. All rights reserved LITERATURE-RELATED DISCOVERY: COMMON FACTORS FOR PARKINSON’S DISEASE AND CROHN’S DISEASE Dr. Ronald N. Kostoff The MITRE Corporation 7515 Colshire Drive McLean, VA 22102 DISCLAIMER (The views contained in this paper are solely those of the author and do not reflect the views of the MITRE Corporation) ACKNOWLEDGEMENT This project was supported by MITRE internal research funding. KEYWORDS Literature-related discovery; Text mining; Scientometrics; Parkinson‟s disease; Crohn‟s disease; neurodegeneration; autoimmunity; inflammation © The MITRE Corporation. All rights reserved ABSTRACT Literature-related discovery (LRD) is the linking of two or more literature concepts that have heretofore not been linked (i.e., disjoint), in order to produce novel, interesting, and intelligible knowledge (i.e., potential discovery). The mainstream software for assisting LRD is Arrowsmith, and it generates intermediate linking literatures by matching title phrases from two disjoint literatures (literatures that do not share common records). Arrowsmith then prioritizes these linking phrases through a series of text-based filters. The present study examines another route to linking disjoint literatures (shared references) through a process called bibliographic coupling. Two disjoint literatures were selected: Parkinson‟s Disease (PD) (neurodegeneration) and Crohn‟s Disease (CD)(autoimmune). Three cases were examined: matching phrases in records with no shared references; shared references in records with no matching phrases; matching phrases in records with shared references. In addition, the main themes in the body of shared references were examined through grouping techniques to identify the myriad linkages between the two literatures. No concepts were identified in the matching phrases/no shared references records that could not be found in the matching phrases/shared references records. Some new concepts (at the sub-set level of the main themes) not found in the matching phrases/shared references records were identified in the shared references/no matching phrases records. The synergy of matching phrases and shared references provides a strong prioritization to the selection of promising matching phrases as discovery mechanisms. There were three major themes that unified the PD and CD literatures: Genetics; Neuroimmunology; Cell Death. However, these themes are not completely independent. For example, there are genetic determinants of the inflammatory response. Naturally occurring genetic variants in important inflammatory mediators such as TNF-alpha appear to alter inflammatory responses in numerous experimental and a few clinical models of inflammation. Additionally, there is a strong link between neuroimmunology and cell death. In PD, for example, neuroinflammatory processes that are mediated by activated glial and peripheral immune cells might eventually lead to dopaminergic cell death and subsequent disease progression. © The MITRE Corporation. All rights reserved 1. INTRODUCTION Discovery in science is the generation of novel, interesting, and intelligible knowledge about the objects of study. Literature-related discovery (LRD) is the linking of two or more literature concepts that have heretofore not been linked (i.e., disjoint), in order to produce novel, interesting, plausible, and intelligible knowledge (i.e., potential discovery). Two major variants of LRD are: 1) open discovery systems (ODS), where one starts with a problem and generates a potential solution (or vice versa); and closed discovery systems (CDS), where one starts with a problem and a potential solution (or starts with two problems), and generates linking mechanism(s). (Kostoff, 2008). A 2008 study surveyed the literature on ODS LRD (Kostoff et al, 2008a), and included some CDS techniques peripherally. Perhaps the most widely known and used CDS approach is expressed by the Arrowsmith software (Smalheiser, 2005), where the mechanisms that link two literatures are identified/discovered through common phrases shared by document titles in each of the literatures. These common phrases can be viewed as reflections of intermediate literatures that link the two primary literatures of interest. The author has used Arrowsmith for discovery experiments related to potential treatments and preventative measures in an ODS mode, and has found it a useful supplement to the author‟s variant of LRD (Kostoff et al, 2008b,c). One operational problem with Arrowsmith and similar text-based linking techniques is that thousands, or tens of thousands, of shared phrases may be identified. There is no guarantee that each shared phrase represents a relevant concept linking the two disparate literatures. To check for relevancy of a given phrase, each article in the two literatures (or a representative sample of articles if the number is large) must be examined by experts, and a judgment made as to whether there is a relevant linkage. Checking out each phrase for relevancy can be a very time consuming process. Therefore, for feasibility, some method of prioritizing phrases for relevancy is required. In the MS and PD LRD studies published by the author (Kostoff et al, 2008b,c), Arrowsmith was used sporadically to compare with the author‟s approach. Because the author‟s approach used abstracts, while Arrowsmith used titles, the author‟s LRD variant generated far more voluminous potential discovery. However, in using Arrowsmith, the author found the most efficient ranking approach for potential discovery purposes to be matching of the shared title phrases with dominant medical terms obtained from clustering the MS and PD © The MITRE Corporation. All rights reserved abstracts. The clustering algorithm used TF-IDF term weighting, and the highest weighted biomedical terms in each cluster were selected for comparison with the matching phrases. The developers of Arrowsmith have generated their own method for prioritizing/filtering the matching phrases (Torvik and Smalheiser, 2007; Smalheiser, 2005). They identified eight features that could be incorporated into a filtering model, which the user could select optionally. These include: features that capture various aspects of absolute and relative frequencies of the terms within each literature; recency of the term when first appearing in MEDLINE; cohesion (roughly, whether the term refers to a highly specific or a general concept); concepthood (i.e. whether the term maps to any UMLS semantic categories or not); shared MeSH headings between the two disjoint literatures, and presence on a stoplist of common words. All of these features focused on filtering of the text linkages. However, there are many ways in which intermediate literature linkages between two or more primary disjoint literatures can be established. One could search for common authors, common institutions, common journals, common citing papers, common MeSH terms, common keywords, etc. One promising alternative, or supplement, to matching title phrases is matching references. A reference in an article reflects one or more concepts upon which the article draws. Two articles that share a common reference (bibliographic coupling) would therefore have some linkage through the shared concept(s), even though the articles themselves might have vastly different terminology. So, searching for linkages among two or more articles through shared references offers a way to identify linking mechanisms or underlying joint themes that might not be accessible through text linking alone. The main objective of this paper is to demonstrate proof-of-principle that shared references between two disjoint literatures provide a strong prioritization mechanism for selecting shared title phrases for potential CSD. A secondary objective is to demonstrate that some mechanisms linking the two literatures through shared references were not identifiable through phrase matching alone. The demonstration testbed is the Parkinson‟s Disease literature (neurodegeneration) and the Crohn‟s Disease literature (autoimmunity). What are the common medical/biological themes that link/underlay these two seemingly disparate diseases? © The MITRE Corporation. All rights reserved 2. BACKGROUND 2.1. Text Mining 2.1.1. Closed Discovery Systems Over the past two decades, there have been numerous attempts to identify indirect linkages among two or more disjoint literatures. Starting in the mid-late 1980s, Swanson and colleagues (e.g., Swanson, 1987; Swanson, 1988; Swanson, 1990; Smalheiser and Swanson, 1996; Swanson et al, 2001) published a series of papers on mechanisms connecting two essentially disjoint literatures. The concepts linking the two disjoint literatures were obtained by identifying phrases common to titles in both literatures. For example, in Swanson‟s article linking magnesium to migraine, he found that the phrase “spreading depression” occurred in the titles of papers in both the magnesium and migraine literatures. He then read the papers in each literature containing “spreading depression” in their titles to ascertain whether a physiologically defendable linking mechanism existed. During this evolutionary process, Swanson and Smalheiser developed a software package (Arrowsmith) that would import the two disjoint literatures and identify all the common title phrases (Smalheiser, 2005). Wren et al (2004) also used a linking methodology based on phrase co- occurrences. They constructed “a network of tentative relationships between „objects‟ of biomedical research interest (e.g. genes,