Ermrest: an Entity-Relationship Data Storage Service for Web-Based, Data-Oriented Collaboration
Total Page:16
File Type:pdf, Size:1020Kb
ERMrest: an entity-relationship data storage service for web-based, data-oriented collaboration. Karl Czajkowski, Carl Kesselman, Robert Schuler, Hongsuda Tangmunarunkit Information Sciences Institute Viterbi School of Engineering University of Southern California Marina del Rey, CA 90292 Email: fkarlcz,carl,schuler,[email protected] Abstract—Scientific discovery is increasingly dependent on a are hard to evolve over time, and make it difficult to search scientist’s ability to acquire, curate, integrate, analyze, and share for specific data values. large and diverse collections of data. While the details vary from domain to domain, these data often consist of diverse digital In previous work, we have argued for an alternative ap- assets (e.g. image files, sequence data, or simulation outputs) that proach based on scientific asset management [5]. We separate are organized with complex relationships and context which may the “science data” (e.g. microscope images, sequence data, evolve over the course of an investigation. In addition, discovery flow cytometry data) from the “metadata” (e.g. references, is often collaborative, such that sharing of the data and its provenance, properties, and contextual relationships). We have organizational context is highly desirable. Common systems for also defined a data-oriented architecture which expresses col- managing file or asset metadata hide their inherent relational laboration as the manipulation of shared data resources housed structures, while traditional relational database systems do not extend to the distributed collaborative environment often seen in complementary object (asset) and relational (metadata) in scientific investigations. To address these issues, we introduce stores [6]. The metadata encode not only properties and refer- ERMrest, a collaborative data management service which allows ences of individual assets, but relationships among assets and general entity-relationship modeling of metadata manipulated other domain-specific elements such as experiments, protocol by RESTful access methods. We present the design criteria, events, and materials. architecture, and service implementation, as well as describe an ecosystem of tools and services that we have created to A relational metadata model can be expressed as an entity- integrate metadata into an end-to-end scientific data life cycle. relationship model (ERM), defining entity types (tables) and ERMrest has been deployed to hundreds of users across multiple relationships (foreign keys and associations). Many projects scientific research communities and projects. We present two can be well-served by simple models with only a handful representative use cases: an international consortium and an of entities and relationships, and non-experts can easily think early-phase, multidisciplinary research project. about their domain in terms of the main concepts which they want to manipulate [14]. We introduce Entity-Relationship Models via Representational State Transfer (ERMrest), a web I. INTRODUCTION service where ER models can be created and maintained by clients—incrementally introducing, using, and refining do- Scientific discovery has undergone a profound transfor- main concepts in a collaborative, mutable metadata store. mation, driven by exponentially increasing amounts of data The service provides a full-featured RESTful [7] interface to generated by high-throughput instruments, pervasive sensor underlying relational data and models. networks, and large-scale computational analyses [1]. Scien- tific collaboration has always been data-oriented, enabled by Despite the popularity and utility of relational databases, arXiv:1610.06044v1 [cs.DB] 19 Oct 2016 the exchange of data through the lifetime of an investigation. and renewed focus on their Cloud hosting, their data often With todays data deluge, the traditional methods for data remain locked behind application-specific services and treated organization and exchange during a scientific investigation as an internal component. We argue that relational storage are inadequate. This results in significant investigator over- should be made directly accessible to web clients for scientific heads [2] and unreproducable data [3]. While much attention collaboration. Costly, application-specific services to control has been given to the publication, citation, and access of cu- data access and update are replaced with generic storage tier rated scientific data intended for publication [4], little attention rules for atomicity and data integrity, combined with fine- has been paid to the data management needs which show up grained authorization to enforce trust-based chains of custody in daily practice of data-rich scientific collaborations. within data-sharing communities. Thus, end users and user- agents can directly consume and contribute science data within In current practice, scientists often rely on directory struc- a flexible and adaptive collaboration environment. tures in shared file systems, with data characteristics (i.e. metadata) coded in file names, text files, or spreadsheets. If In this paper, we focus on the requirements, design and collaboration extends beyond one institutional boundary, the implementation of the ERMrest service. We draw upon our data may be stored in a cloud based storage system, such as as experience with several data-sharing projects to define the Dropbox or Google Drive. These approaches are error prone, problem space, but report on two representative applications: become fragile as the complexity or size of the data grows, first, a large complex, multinational consortium accelerating research on G-Coupled Protein Receptors; and second, a assets, e.g. recording information about events, protocols, and multidisciplinary synaptomics research project as an example materials. Depending on the level of formality in projects, of an early-phase, exploratory collaboration. different kinds of provenance and quality-control metadata may also be introduced. The rest of this paper is organized as follows. In Sec- tion II we discuss key characteristics of the data-oriented 2) Evolving data models: Not only should the technology collaboration problem domain. In Section III, we describe be configurable for different project-specific models, but the in detail the ERMrest service, followed in Section IV by models should be allowed to continue to evolve throughout a a brief discussion of the other components in the ERMrest project’s life-cycle, i.e. while the system is in use. Early-phase, software ecosystem. In Section V, we describe the GPCR and exploratory projects need quick setup and simple models while synaptomics use cases in more detail. We conclude with related users establish experiment protocols, collaboration methods, work in Section VI and conclusions in Section VII. metadata nomenclature, and coding standards. As projects mature, users may identify new use cases and new formalisms which early users would have never prioritized. Finally, as II. APPLICATION CHARACTERISTICS AND CHALLENGES projects transition to curating published data products, the Our perspective on scientific asset management, data- modeling goals can drift further from the lab or process- oriented collaboration, and the specifics of ERMrest as a management goals of an active, experimental project. metadata catalog service have been informed by a number of As an example, the GPCR scientific workflow is complex, scientific projects for which we provide bioinformatics support. involving many distinct experiment methods with different Rather than bespoke solutions for each project, we have sought data-handling pipelines. Our pilot model focused on core archetypal requirements and evidence of feature gaps where domain concepts and a proof-of-concept pipeline, introducing our model-driven, data-oriented tools could be enhanced to the system for early test users. This core model continues to support broad classes of collaboration. evolve with experience, similar to most of our other projects; Here, we outline the main application characteristics, illus- but, due to the breadth of activities within the consortium, trated by five ongoing projects: A) the hub for the FaceBase it also expands in bursts as more tasks and objectives are project, organizing a central repository for data generated by a incorporated into the data-management system. number of individually-funded spoke sites; B) the microscopy Typical of many repositories and analysis-based projects, core for the Center for Regenerative Medicine and Stem the FaceBase hub metadata was initially modeled after a Cell Research (CIRM), offering microscope slide-scanning bulk export from an ancestor project. This repository model as a service; C) the GUDMAP project, curating microscope expands as new spoke data is integrated. Likewise, the CIRM imagery with assessments and annotations by domain-experts; and GUDMAP models were initially informed by existing D) the GPCR Consortium, an international collaboration to image archives and microscopy idioms. In all projects, models discover and analyze G-Protein Coupled Receptor molecular continue to be refined in response to evolving user needs, structures; and finally, E) Mapping the Dynamic Synaptome, newly identified search goals, and various metadata curation a multidisciplinary effort to develop methods for in vivo objectives. measurement of the synaptome. The Synaptome project started from scratch with little 1) Heterogeneous metadata: A model-neutral metadata guiding data. We quickly found that very simple models could