CNSA: a Data Repository for Archiving Omics Data

CNSA: a Data Repository for Archiving Omics Data

bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.030833; this version posted April 9, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. CNSA: a data repository for archiving omics data Xueqin Guo1, Fengzhen Chen1, Fei Gao1, Ling Li1, Ke Liu1, Lijin You1, Cong Hua1, Fan Yang1, Wanliang Liu1, Chunhua Peng1, Lina Wang1, Xiaoxia Yang1, Feiyu Zhou1, Jiawei Tong1, Jia Cai1, Zhiyong Li1, Bo Wan1, Lei Zhang1, Tao Yang1, Minwen Zhang1, Linlin Yang1, Yawen Yang 1, Wenjun Zeng1, Bo Wang1, Xiaofeng Wei1*, Xun Xu1,2,3* 1. China National GeneBank, Shenzhen 518120, China 2. BGI-Shenzhen, Shenzhen 518083, China 3. Guangdong Provincial Key Laboratory of Genome Read and Write, Shenzhen 518120, China *Corresponding author: Xun Xu: [email protected] Xiaofeng Wei: [email protected] Xueqin Guo and Fengzhen Chen contributed equally to this work. Abstract With the application and development of high-throughput sequencing technology in life and health sciences, massive multi-dimensional biological data brings the problem of efficient management and utilization. Database development and biocuration are the prerequisites for the reuse of these big data. Here, relying on China National GeneBank (CNGB), we present CNGB Sequence Archive (CNSA) for archiving omics data, including raw sequencing data and its analytical data and related metadata which are organized into six objects, namely Project, Sample, Experiment, Run, Assembly, and Variation at present. Moreover, CNSA has created the correlation model of living samples, sample information, and analytical data on some projects, so that all data can be traced throughout the life cycle from the living sample to the sample information to the analytical data. Complying with the data standards commonly used in the life sciences, CNSA is committed to building a comprehensive and curated data repository for the storage, management and sharing of omics data, improving the data standards, and providing free access to open data resources for worldwide scientific communities to support academic research and the bio-industry. bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.030833; this version posted April 9, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Database URL: https://db.cngb.org/cnsa/ Introduction In data-intensive science era, life science research is seen as a data-driven, exploration-centered style of science. With the development of sequencing technology, the rapid increase of sequencing throughput and the dramatic drop of per-base cost of raw sequence have made large-scale population genomics research, precision medicine research, and biodiversity research possible, e.g., the UK's 100,000 Genomes Project (1), the International Cancer Genome Consortium (ICGC) (2), the Cancer Genome Atlas (TCGA) (3), the China Kadoorie Biobank (CKB : https://www.ckbiobank.org/site/) (4) and Earth BioGenome Project (EBP) (5). However, it poses great challenges in big data deposition, integration and sharing, where IT infrastructure and software tools are heavily used. Many organizations in the world have made great efforts for archiving and sharing of omics data. The International Nucleotide Sequence Database Collaboration (INSDC) (6) represents one of the most celebrated global initiatives in data and associated metadata sharing, which operates between DNA Data Bank of Japan (DDBJ) (7), the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) (8), and the National Center for Biotechnology Information (NCBI) (9). In order to facilitate exchange of information on genomic samples and their derived data, the Global Genome Biodiversity Network (GGBN) Data Standard (10) is intended to provide a platform to promote the efficient sharing and usage of genomic sample material and associated specimen information in a consistent way. Global Alliance for Genomics and Health (GA4GH) (11), an international, nonprofit alliance, already brings together more than 500 leading organizations to accelerate progress in genomic research and human health. In addition, DataCite (12) has developed tools and methods that make data more accessible and more useful. In China, many scientific institutions have also made great efforts and established multiple omics database systems such as the National Genomics Data Center (NGDC: bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.030833; this version posted April 9, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. https://bigd.big.ac.cn/), BIGBIM (https://www.biosino.org/bigbim/index), the National Center for Protein Science •Shanghai (NCPSS: http://www.sibcb-ncpss.org/index.action), etc. The CNGB (13) (https://www.cngb.org/home.html) based in Shenzhen was established in January 2011, which is committed to supporting public welfare, life science research, innovation and industry incubation, through effective bioresource conservation, digitalization and utilization. Based on this concept and relying on the CNGB, China National GeneBank DataBase (CNGBdb: https://db.cngb.org/) is built as a unified platform built for biological big data sharing and application services to the research community. Based on the big data and cloud computing technologies, it provides data services such as archive, analysis, knowledge search, management authorization, and visualization. CNSA is the data archiving system of CNGBdb and is built for archiving omics data including not only raw sequencing data and its analysis result data, but also related metadata. At present, CNSA follows the data standards and structures of INSDC, DataCite, GA4GH and GGBN for data compatibility and provides global users with data archival and sharing services of omics data such as data submission, data storage, data retrieval and data reference. All archived public data is freely available to worldwide scientific communities. Methods System structure Based on the Django framework (Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design, https://www.djangoproject.com/), CNSA is developed in Python. In order to provide more stable and fast services, CNSA server is built on the Centos-7 operating system with the following six servers: NGINX for providing static resource access, uWSGI for deploying services, PostgreSQL and MongoDB for supporting metadata storage, bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.030833; this version posted April 9, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. FTP server for uploading and storing data files and Elasticsearch for data retrieval. In addition, Redis-based caching system is used to help improve data verification speed. Data security Relying on the data room of CNGB and CNGBdb that have passed the three-level review of information security level protection and the protection capability review of trusted cloud service, CNSA adopts corresponding security technologies in user access, data room, firewall, application architecture, database, and data storage. The entire site of the CNSA uses https to encrypt user access requests to prevent stealing and tampering during data transmission. Django ORM is used to avoid SQL injection, and fields retrieved from the database are filtered before being displayed to prevent XSS attacks. All information submitted by users is submitted in Post mode, and Django's CsrfViewMiddleware is used to prevent CSRF attacks. The firewall security technology of the CNGB data room can ensure the legality of data access. Moreover, CNSA adopts high-performance distributed object storage for data archiving, and combines a high-availability and disaster tolerance backup storage system to ensure data storage security, and the database is backed up daily and can be restored in a timely manner. Results Data objects and structure At present, CNSA follows the data standards and structures of INSDC, DataCite, GA4GH and GGBN to ensure data compatibility, and all data are organized into six objects, i.e., Project, Sample, Experiment, Run, Assembly and Variation. The definitions of data objects and main fields used to describe the data objects are listed in Table 1. As is illustrated in Figure 1, project and sample can be submitted independently, a projects can also be associated with one or more projects, and a sample can also be associated with one or more of experiments, assemblies, or variations. An experiment can be associated with one or more runs. Moreover, each data is assigned a corresponding accession number that can be referenced. Also note bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.030833; this version posted April 9, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    15 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us