Published online 29 April 2015 Nucleic Acids Research, 2015, Vol. 43, No. 10 4823–4832 doi: 10.1093/nar/gkv272 GenoLIB: a database of biological parts derived from a library of common plasmid features Neil R. Adames1,†, Mandy L. Wilson1,†, Gang Fang1,2, Matthew W. Lux1,3, Benjamin S. Glick4,5 and Jean Peccoud1,*
1Virginia Bioinformatics Institute, Virginia Tech, 1015 Life Science Circle, Blacksburg, VA 24061, USA, 2School of Biological Technology, Xi’an University of Arts and Science, Xi’an, Shaanxi Province 710065, China, 3Biosciences Division, Edgewood Chemical Biological Center, 5183 Blackhawk Rd Aberdeen Proving Grounds MD 21010, USA, 4Molecular Genetics & Cell Biology, University of Chicago, 920 E. 58th St., Chicago, IL 60637, USA and 5GSL Biotech LLC, 5211 S. Kenwood Ave., Chicago, IL 60615, USA
Received November 21, 2014; Revised March 18, 2015; Accepted March 18, 2015
ABSTRACT cific assembly strategies such as the BioBrick3 ( ) or BglBrick (4) standards. Synthetic biologists have also recognized the Synthetic biologists rely on databases of biologi- need to use standard representations to describe parts. The cal parts to design genetic devices and systems. Registry of Standard Biological Parts (www.partsregistry. The sequences and descriptions of genetic parts org) supporting the iGEM competition was the first attempt are often derived from features of previously de- at standardizing parts data (5). This pioneering experiment scribed plasmids using ad hoc, error-prone and led to the idea of reporting data describing parts’ functions time-consuming curation processes because exist- as standardized datasheets (6). ing databases of plasmids and features are loosely The most essential piece of information associated with organized. These databases often lack consistency a biological part is its sequence. Getting quality sequence in the way they identify and describe sequences. information has proved more problematic than one might Furthermore, legacy bioinformatics file formats like expect. For instance, an early review entitled ‘Genetic parts to program bacteria’ included hardly any references to se- GenBank do not provide enough information about quence data (7). A frequent occurrence in the literature is the purpose of features. We have analyzed the anno- defining constructs by names of standard promoters and ∼ tations of a library of 2000 widely used plasmids CDSs, allowing the sequences to be deduced, but providing to build a non-redundant database of plasmid fea- no information about other key sequences, such as 5 non- tures. We looked at the variability of plasmid fea- coding regions. An assessment of the Registry of Standard tures, their usage statistics and their distributions Biological Parts uncovered several problems, such as miss- by feature type. We segmented the plasmid features ing part sequences or discrepancies between the published by expression hosts. We derived a library of biologi- and physical sequences of biological parts (8). More gen- cal parts from the database of plasmid features. The erally, many journals have sequence disclosure policies, but library was formatted using the Synthetic Biology these policies do not apply to plasmid sequences (9). A num- Open Language, an emerging standard developed ber of other reasons, such as tight budgets or limited access to a sequencing facility, also explain why so many plasmids to better organize libraries of genetic parts to facili- described in the literature lack complete sequences. Repos- tate synthetic biology workflows. As proof, the library itories such as Addgene (10,11) aim to address this issue was converted into GenoCAD grammar files to allow by documenting plasmids in their collections with sequence users to import and customize the library based on and annotation data, and also by associating plasmids with the needs of their research projects. phenotype data, but such efforts have covered only a small fraction of the plasmids in published papers. INTRODUCTION To overcome these limitations, we sought alternatives to peer-reviewed publications for obtaining quality sequence The concept of standard biological parts is central to syn- data that could provide a solid foundation for the develop- thetic biology. Biological parts are annotated DNA se- ment of a comprehensive database of genetic parts. Many quences that can be combined to make larger genetic sys- cloning and expression vectors have been used for decades tems (1,2). Initially, parts standardization focused on spe-
*To whom correspondence should be addressed. Tel: +1 540 231 0403; Fax: +1 540 231 2606; Email: [email protected] †These authors contributed equally to the paper as first authors.