Transitioning BioCyc to a Subscription Model

Peter D. Karp
SRI International
biocyc.org, ecocyc.org, metacyc.org

© 2014 SRI International

BioCyc.org Collection of 9,300 Pathway/Genome Databases

• Pathway/Genome Database (PGDB): combines information about
  – Pathways, reactions, substrates
  – Enzymes, transporters
  – , replicons
  – Transcription factors/sites, promoters, operons

• Tier 1: Highly curated PGDBs
  – MetaCyc, HumanCyc, YeastCyc
  – EcoCyc: Escherichia coli K-12
  – AraCyc: Arabidopsis thaliana

• Tier 2: Moderately curated (44 PGDBs)
  – Bacillus subtilis, Mycobacterium tuberculosis

• Tier 3: Computationally derived DBs

BioCyc Use Cases

• Access to an extremely wide range of curated and computationally predicted information:
  – Genes and
  – Metabolic pathways, reactions, metabolites
  – Regulatory networks
• expression data analysis
• Metabolomics data analysis
• Execute metabolic models
• Metabolic route searches
• Comparative analysis

Highly Curated Pathway/Genome Databases

Database    Organism        Organization                Publications Curated From
MetaCyc     Multiorganism   SRI                         51,000
EcoCyc      E. coli         SRI                         32,000
BsubCyc     B. subtilis     SRI                         4,000
HumanCyc    H. sapiens      SRI
AraCyc      A. thaliana     TAIR/Carnegie Institution   4,100
YeastCyc    S. cerevisiae   SGD/SRI                     980
MouseCyc    M. musculus     MGD/Jackson Laboratory

http://biocyc.org/otherpgdbs.shtml

Creation of BioCyc Databases

[Diagram labels: NIH, RefSeq, Curation, PGDB]

Computational Inferences
• Predict metabolic reactions
• Predict operons
• Predict transport reactions
• Compute orthologs
• Predict metabolic pathways
• Compute Pfam domains
• Predict pathway hole fillers

Data Import
• Regulatory data [regtransbase]
• Database links
• Organism phenotype data
• Subcellular locations [psortdb]
• Gene essentiality data
• Phenotype microarray data
• GO terms []
• features [uniprot]

BioCyc Curated Data

• Gene functions
• Metabolic pathways, reactions, metabolites
• Regulatory interactions

Current Funding Sources for Curation

• EcoCyc grant from NIH/NIGMS (3 FTE curators)
• MetaCyc grant from NIH/NIGMS (1 FTE curator)
• Support curation of two of our 9,600 databases

• Additional revenues will let us curate additional databases

BioCyc has Moved to a Subscription Model

• Gov’t supported databases remain free/open
• Other databases accessible via subscription
• Subscriptions available to individuals and institutions
• Institutional subscription price depends upon usage level

• Phoenix Bioinformatics provides us with sales, marketing, and paywall services

• Estimated cost/article for curation in EcoCyc project:
  – $219
  – 6-15% of an open-access publication fee
  – Slightly more than 10% of the cost of coffee breaks for an R01 project
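
The comparison above can be turned around to see what open-access fee range it implies. The sketch below is only illustrative arithmetic: it assumes the slide means that the $219 curation cost equals 6-15% of a typical open-access publication fee, and the function name `implied_fee_range` is hypothetical, not part of any BioCyc software.

```python
# Figures taken from the slide. Reading "$219" as 6-15% of an
# open-access fee is an assumption about what the slide means.
COST_PER_ARTICLE_USD = 219.0
SHARE_LOW, SHARE_HIGH = 0.06, 0.15


def implied_fee_range(cost, share_low, share_high):
    """Return (min_fee, max_fee) for which `cost` is between
    share_low and share_high of the fee."""
    if not 0 < share_low <= share_high <= 1:
        raise ValueError("shares must satisfy 0 < low <= high <= 1")
    # A larger share corresponds to a smaller fee, and vice versa.
    return cost / share_high, cost / share_low


fee_min, fee_max = implied_fee_range(COST_PER_ARTICLE_USD, SHARE_LOW, SHARE_HIGH)
print(f"Implied open-access fee range: ${fee_min:,.0f} to ${fee_max:,.0f}")
```

Under that reading, the slide's percentages correspond to publication fees of roughly $1,460 to $3,650 per article.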

• Randomly choose curated assertions from CGD and from EcoCyc
• Validate accuracy of those assertions in publications
• CGD error rate: 1.82%
• EcoCyc error rate: 1.40%

• No

• NL-understanding problem is 60 years old
• Lots of progress, but error rates are unacceptable (18%, 24%, 45%)
• Info extraction software typically extracts narrow slivers of info
• Cannot arbitrate among conflicts in the literature
• Some evidence that info-extraction software can speed curation

Curation Complexity Varies Among Databases

• Number of extracted datatypes
• Number of database fields
• Amount of meta-data (evidence codes)
• Amount of interpretation and synthesis
• Authoring of mini-reviews
• End-uses of information (metabolic modeling)

• Much evidence to date indicates crowd-sourced curation is not a successful model
• The author-curation model shows more promise for biocuration
