Research Prioritization Through Prediction of Future Impact on Biomedical Science: a Position Paper on Inference-Analytics Ganapathiraju and Orii
Total Page:16
File Type:pdf, Size:1020Kb
Score Rank Re-rank by Predicted Impact on Future Science Score Rank Re-rank by Confidence Score Rank Re-rank by Resource Requirements or Reagent Availability Inference Analytics: A Necessary Next step to Big Data Analytics in Biology Research prioritization through prediction of future impact on biomedical science: a position paper on inference-analytics Ganapathiraju and Orii Ganapathiraju and Orii GigaScience 2013, 2:11 http://www.gigasciencejournal.com/content/2/1/11 Ganapathiraju and Orii GigaScience 2013, 2:11 http://www.gigasciencejournal.com/content/2/1/11 RESEARCH Open Access Research prioritization through prediction of future impact on biomedical science: a position paper on inference-analytics Madhavi K Ganapathiraju1,2* and Naoki Orii1,2 Abstract Background: Advances in biotechnology have created “big-data” situations in molecular and cellular biology. Several sophisticated algorithms have been developed that process big data to generate hundreds of biomedical hypotheses (or predictions). The bottleneck to translating this large number of biological hypotheses is that each of them needs to be studied by experimentation for interpreting its functional significance. Even when the predictions are estimated to be very accurate, from a biologist’s perspective, the choice of which of these predictions is to be studied further is made based on factors like availability of reagents and resources and the possibility of formulating some reasonable hypothesis about its biological relevance. When viewed from a global perspective, say from that of a federal funding agency, ideally the choice of which prediction should be studied would be made based on which of them can make the most translational impact. Results: We propose that algorithms be developed to identify which of the computationally generated hypotheses have potential for high translational impact; this way, funding agencies and scientific community can invest resources and drive the research based on a global view of biomedical impact without being deterred by local view of feasibility. In short, data-analytic algorithms analyze big-data and generate hypotheses; in contrast, the proposed inference-analytic algorithms analyze these hypotheses and rank them by predicted biological impact. We demonstrate this through the development of an algorithm to predict biomedical impact of protein-protein interactions (PPIs) which is estimated by the number of future publications that cite the paper which originally reported the PPI. Conclusions: This position paper describes a new computational problem that is relevant in the era of big-data and discusses the challenges that exist in studying this problem, highlighting the need for the scientific community to engage in this line of research. The proposed class of algorithms, namely inference-analytic algorithms, is necessary to ensure that resources are invested in translating those computational outcomes that promise maximum biological impact. Application of this concept to predict biomedical impact of PPIs illustrates not only the concept, but also the challenges in designing these algorithms. Keywords: Impact prediction, Data analytics, Inference analytics, Protein-protein interaction prediction, Big-data * Correspondence: [email protected] 1Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15206, USA 2Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA © 2013 Ganapathiraju and Orii; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Ganapathiraju and Orii GigaScience 2013, 2:11 Page 2 of 13 http://www.gigasciencejournal.com/content/2/1/11 Background Inference analytics for research prioritization Big data is everywhere today, be it data from publications In domains such as biology where hypothesis verification (newspapers, journals, internet pages or tweets), or be it and interpretation are resource-intensive, the large num- cosmological, climatic, or ecological data. This trend is ber of inferences drawn by data-analytic algorithms would facilitated primarily by the exponential increases in the have to be reanalyzed by various criteria such as availabil- capabilities of computing, storage and communication ity of resources (budget, reagents or scientific expertise). technologies. In biology, advances in biotechnology have We call such algorithms which re-analyze data analytic in- resulted in the creation of big data of many types: genomic, ferences as inference analytic algorithms (Figure 1); an in- proteomic, trascriptomic, epigenomic and metabolomic ference analytic algorithm that is essential in the field of characterizations of several species and their populations. biology is that of predicting the future impact of an infer- Large-scale data analytics can aid in discovering pat- ence on biomedicine. terns in these data to gain new scientific insights; how- ever, the domain of biology is very different from most Predicting future impact of inferences is necessary in other domains in this aspect. Validating a computa- biology and biomedicine tional result is cheap in domains like language trans- Just as grant proposals are evaluated for their “difference lation, and can be carried out by any average individual; making capability” to determine priority of funding, the when the number of results is in hundreds or thou- inferences drawn at a large scale may be evaluated by their sands, it is possible to crowdsource the validations potential for biomedical impact before investing resources using Amazon MechanicalTurk [1] or the like. Manual to experimentally study them (Figure 1). Impact predic- validators can often correct mistakes of the algorithm, tion, or ranking the computational inferences based on fu- and these corrections may in turn be used to improve ture impact is distinct from other methods of re-ranking the algorithm in future. Secondly, the computational computational outcomes, such as ranking by confidence results are often ready for direct use or interpretation. of prediction. Here, we assume that all computational out- For example, outcomes of algorithms for machine comes are equally accurate, and propose to re-rank them translation from one language to another, information by their predicted impact on future science. Impact pre- retrieval, weather prediction, etc. may be directly de- diction is also distinct from task prioritization, a well- ployed for use in real life. studied area in computer science. Task prioritization Compared to these domains, there are four fundamen- assumes that priorities and costs are known for the tasks, tal differences in large-scale data analytics for molecular and it optimizes allocation of resources to tasks to achieve and cellular biology: maximum yield. Here, our goal is to predict the biomed- ical impact (a type of priority) so that resource allocation (i) Conclusions drawn by algorithms cannot be (e.g., by a funding agency) can be carried out based on validated manually, and often require carrying out these priorities. experiments. (ii) Even when the computational inferences are Citation count can be used as a surrogate measure for accurate and do not require further validation, biomedical impact converting the inferences to meaningful insights The focus of this work is on estimating the biological requires experimentation. importance of a specific PPI (i.e., how central is a PPI to- (iii)Experimental methods require resources that are wards understanding other biological or disease related expensive (material and financial resources) or even factors). How do we measure this importance? Consider unavailable (suitable antibodies). the publications that report each of these PPIs; then (iv) Validation and interpretation of inferences requires consider the publications that cite these original publica- scientific expertise which may be scarce (for tions. In this narrow domain of publications reporting or example, there may not be any scientist with citing PPIs, we estimate that each article typically con- expertise in studying some of the proteins). tains one primary result; we measure the importance of a PPI in terms of how many papers cite the original pub- For these reasons validating or interpreting all the hun- lication reporting that PPI. For example, the PPI of dreds of computational inferences of data-analytic algo- EGFR with Actin was central in advancing our know- rithms is expensive and is not amenable to crowdsourcing. ledge of several other biological concepts discovered In this position paper, we introduce the concept called based on this PPI. For this reason, the paper reporting inference-analytics, and a specific inference-analytics this PPI [2] has been cited by 35 articles in PubMed. problem called impact prediction. We use the terms The impact of EGFR-Actin is therefore estimated to be predictions, computational inferences or hypotheses to at 35 “biomedical results”. Thus, we approximate impact refer to the outcomes of data-analytic algorithms. prediction with citation count prediction. Ganapathiraju and Orii GigaScience 2013, 2:11 Page 3 of 13 http://www.gigasciencejournal.com/content/2/1/11 Data (A) Use Analytics Inferences Crowdsourcing for Validation Inferences Data (B) Experiments Analytics