Correspondence

Curation of chemogenomics data

To the Editor: With the rapid accumulation of data in all areas of research, scientists rely increasingly on historical chemogenomics data and computational models to guide small-molecule bioactivity screens and chemical probe development. However, there is a growing public concern about the frequent irreproducibility of experimental data reported in peer-reviewed scientific publications1,2. An editorial in this journal3 emphasized a critical need to address this problem, an issue that has also received attention from the US National Institutes of Health (NIH) leadership4. Since the successful development of chemical probes and robust screening assays, one central objective of chemical biology, relies on the prior art in the field, it is critical that researchers establish the highest possible quality standards for data deposited in chemogenomics databases.

Concerning the impact of poor data in chemogenomics databases, we5 and others6 have shown that inaccurate and inconsistent representations of chemical structures in available molecular datasets result in models of poor accuracy, whereas data curation improves the modeling outcome. Researchers relying on non-curated historical data risk corrupting their results owing to the following 'five I's': data may be incomplete, inaccurate, imprecise, incompatible and/or irreproducible. These considerations emphasize the need for thorough curation as the first critical step of any data analysis study to ensure the stability and reliability of the models and to guide experimental follow-up5.

As one means of addressing the data quality problem, we propose a general chemical and biological data curation workflow (Fig. 1) that relies on existing cheminformatics approaches to flag, and in some cases correct, possibly erroneous entries in large chemogenomics datasets. This workflow begins with chemical data curation following a previously established protocol5 (step 1 in Fig. 1), resulting in the identification and correction of structural errors. Duplicate analysis (step 2) assesses data quality and removes duplicate chemical structures and contradictory records. Analysis of intra- and interlab experimental variability (step 3) and exclusion of unreliable data sources (step 4) help increase data quality and aid decision-making about combining data from different sources. Detection and verification of activity 'cliffs' (step 5) and calculation and tuning of the dataset modelability index7 (step 6), which estimates the feasibility of obtaining predictive quantitative structure-activity relationship (QSAR) models for a given dataset, serve as additional indicators of data quality. Consensus QSAR modeling (step 7), used for the identification and correction of potentially erroneous values or categories of compound bioactivities (step 8), concludes the workflow.

[Figure 1: flowchart leading from the original set to the curated set through eight steps: (1) chemical curation; (2) duplicate analysis; (3) analysis of intra- and interlab experimental variability; (4) exclusion of unreliable data sources; (5) detection and verification of activity cliffs; (6) calculation and tuning of dataset modelability index; (7) consensus QSAR predictions; (8) identification and correction of mislabeled compounds. Axes: error rate versus dataset size (number of records).]

Figure 1 | General workflow for comprehensive curation of chemogenomics datasets. Each step can be done using existing techniques and software tools. The workflow ensures the detection and elimination of the following: nonstandardized and duplicated chemical structures (steps 1 and 2); records associated with unreliable data sources or high experimental variability (steps 3 and 4); structural outliers and unverified activity cliffs (steps 5 and 6). Some mislabeled compounds can thereby be identified and corrected (steps 7 and 8).

As a community, we must take multifaceted approaches to ensure the quality and reproducibility of chemogenomics data through better data generation and reporting. The Nature family of journals8 has taken steps in this direction by removing space restrictions for method sections and having external statisticians verify the correctness of statistical tests reported in some manuscripts considered for publication. NIH is also developing plans to stimulate researchers to enhance reproducibility of their research results (http://grants.nih.gov/grants/guide/notice-files/NOT-OD-15-103.html). It is also crucial for journals to support and encourage the use of standardized electronic protocols and formats (such as MIABE9) for chemical data sharing and to require authors to upload their data electronically to public repositories at the time of manuscript submission.

Among other measures, the chemical biology community should adopt a culture of curation as a mandatory component of primary data processing and a prerequisite for data sharing. Chemical and biological data curation workflows can be developed further and utilized to flag (and where possible, fix) possibly erroneous records and ultimately improve the quality of data analysis and the prediction performance of modeling approaches. Experimental and computational scientists should convene to agree upon standards and best practices for the generation, reporting and curation of chemogenomics data, which will improve data reproducibility and accelerate the progression from data to knowledge in chemical biology research.

Acknowledgments
The authors are grateful for financial support from the US National Institutes of Health (GM 096967 and GM66940) and Environmental Protection Agency (RD 83499901). D.F. thanks the North Carolina State University Chancellor's Faculty Excellence Program.

Competing financial interests
The authors declare no competing financial interests.

References
1. Baker, M. Nature 521, 274–276 (2015).
2. Prinz, F., Schlange, T. & Asadullah, K. Nat. Rev. Drug Discov. 10, 712 (2011).
3. Anonymous. Nat. Chem. Biol. 9, 345 (2013).
4. Collins, F.S. & Tabak, L.A. Nature 505, 612–613 (2014).
5. Fourches, D., Muratov, E. & Tropsha, A. J. Chem. Inf. Model. 50, 1189–1204 (2010).
6. Young, D., Martin, D., Venkatapathy, R. & Harten, P. QSAR Comb. Sci. 27, 1337–1345 (2008).
7. Golbraikh, A., Muratov, E., Fourches, D. & Tropsha, A. J. Chem. Inf. Model. 54, 1–4 (2014).
8. Anonymous. Nature 496, 398 (2013).
9. Orchard, S. et al. Nat. Rev. Drug Discov. 10, 661–669 (2011).

Denis Fourches1, Eugene Muratov2 & Alexander Tropsha2

1Department of Chemistry, Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, USA. 2Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, The UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA.
e-mail: [email protected] or [email protected]
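As an illustration of how two of the workflow steps might look in practice, the following minimal Python sketch covers duplicate analysis (step 2) and a modelability index calculation (step 6). It is not the authors' implementation. It assumes that each record already carries a standardized structure key from step 1 (for example, a canonical SMILES string), that activities are numeric values on a log scale, and that duplicates differing by more than a hypothetical `max_spread` threshold count as contradictory. The modelability index is written as we understand its definition in ref. 7: the fraction of compounds whose nearest neighbor belongs to the same activity class, averaged over classes. The use of Euclidean distance on precomputed descriptor vectors is likewise an assumption, and the function names are illustrative only.

```python
from collections import defaultdict
from math import dist
from statistics import mean


def merge_duplicates(records, max_spread=0.5):
    """Group records by structure key, then merge or flag the duplicates.

    records: iterable of (structure_key, activity) pairs. The key is assumed
    to be a standardized identifier produced by step 1; max_spread is a
    hypothetical tolerance on the activity range within one group.
    """
    groups = defaultdict(list)
    for key, activity in records:
        groups[key].append(activity)

    curated, contradictory = {}, {}
    for key, values in groups.items():
        if max(values) - min(values) <= max_spread:
            curated[key] = mean(values)    # concordant: keep one averaged record
        else:
            contradictory[key] = values    # discordant: set aside for review
    return curated, contradictory


def modelability_index(descriptors, labels):
    """Per-class fraction of compounds whose nearest neighbor shares their
    class label, averaged over classes (assumed form of the index in ref. 7)."""
    same = defaultdict(int)
    total = defaultdict(int)
    for i, (x, label) in enumerate(zip(descriptors, labels)):
        # Nearest neighbor by Euclidean distance, excluding the compound itself.
        nearest = min(
            (j for j in range(len(descriptors)) if j != i),
            key=lambda j: dist(x, descriptors[j]),
        )
        total[label] += 1
        if labels[nearest] == label:
            same[label] += 1
    return mean(same[c] / total[c] for c in total)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    records = [("CCO", 5.0), ("CCO", 5.2), ("c1ccccc1", 6.0),
               ("c1ccccc1", 8.0), ("CC(=O)O", 4.1)]
    curated, contradictory = merge_duplicates(records)
    print(curated, contradictory)

    descriptors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.05, 0.1)]
    labels = [0, 0, 1, 1, 1]
    print(modelability_index(descriptors, labels))
```

In this sketch, contradictory duplicates are returned separately and not silently averaged, so that they can be traced back to their sources (steps 3 and 4) before a decision is made. A low index on a curated set would suggest that predictive QSAR models may be difficult to obtain for that dataset.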

nature chemical biology | VOL 11 | AUGUST 2015 | www.nature.com/naturechemicalbiology 535