An XML Application for Genomic Data Interoperation
Total Page:16
File Type:pdf, Size:1020Kb
An XML application for genomic data interoperation Kei-Hoi Cheung1, Yang Liu2, Anuj Kumar3, Michael Snyder2,3, Mark Gerstein2, Perry Miller1,3 1Center for Medical Informatics, Department of Anesthesiology 2 Department of Molecular Biophysics and Biochemistry, 3Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT 06520, USA {kei.cheung, anuj.kumar, michael.snyder, mark.gerstein, perry.miller}@yale.edu Abstract HTML documents resided on different Web servers to be linked to one another through hypertext links. Such As the eXtensible Markup Language (XML) hypertext links have revolutionized the way information becomes a popular or standard language for (stored in remote databases) can be linked over the exchanging data over the Internet/Web, there are a Internet. This level of database inter-connectivity has growing number of genome Web sites that make their already proven very useful since it allows the user to data available in XML format. Publishing genomic navigate data from one Web site to another related Web data in XML format alone would not be that useful if site very easily. According to Karp [1], however, this is there is a lack of development of software applications not the “Holy Grail” of genomic data interoperation. that could take advantage of the XML technology to There are two problems associated with this hypertext process these XML-formatted data. This paper linking approach. illustrates the usefulness of XML in representing and interoperating genomic data between two different 1. Item-by-item linking. One problem is that the user data sources (Snyder's laboratory at Yale and SGD at has to click on the link one at a time in order to Stanford). In particular, we compare the locations of retrieve related information. This will be a time- transposon insertions in the yeast DNA sequences that consuming and tedious method to collect related have been identified by BLAST searches with the information if the number of links involved is great, chromosomal locations of the yeast open reading thereby slowing data collation prior to analysis. frames (ORFs) stored in SGD. Such a comparison 2. Fixed linked fields. The other problem occurs when allows us to characterize the transposon insertions by we attempt to establish links to external data indicating whether they fall into any ORFs (which may sources. These external links restrict how we can potentially encode proteins that possess essential access related data. For example, some genome biological functions). To implement this XML-based databases allow their data entries to be linked via interoperation, we used NCBI's "blastall" (which gives accession numbers (unique object/record an XML output option) and SGD's yeast nucleotide identifiers). However, if these numbers are not sequence dataset to establish a local blast server. Also, available in the public interface, there will be no we converted the SGD's ORF location data file (which way to establish links using other fields (such as is available in tab-delimited format) into an XML gene names or gene symbols). document based on the BIOML (BIOpolymer Markup Language) standard. Despite its hyperlink capability, the HTML is designed mainly for data display purposes. It is not 1. Introduction suitable for large-scale machine processing. To address this, many genome sites have distributed large datasets With the growing use of the Web, large quantities as flat files (e.g., tab-delimited files). Researchers can of biological data have been made accessible to the then download these files and process them by custom scientific community through many genome Web sites programs. According to [2], this flat-file approach is such as the National Center for Biotechnology very limited, because it lacks such abilities as Information (NCBI) Web site referencing, controlled vocabulary, and constraints. (http://www.ncbi.nlm.nih.gov/) and the Protein Data Often fields are ambiguous and their contents are Bank (PDB) Web site (http://www.rcsb.org/pdb/). The contextual. In other words, the programmer has to Web has widely been accepted because it is an Internet- manually interpret the semantics. This hinders the use based standard that has been incorporated into multiple of flat files by programs without human interaction. To platforms (e.g., Unix, Windows, and Mac). Web fully automate genomic-scale data interoperation, we browsers such as Netscape and Internet Explorer (IE) need a data representation format that not only are platform-independent and are easy to use. They separates the semantic content from the display content, provide for the user the capability of browsing through but it also allows computer programs to process the a large set of data graphically on their local computers. semantic part efficiently. This graphical capability is made possible by the The eXtensible Markup Language (XML) has HyperText Markup Language (HTML). Another emerged as a popular format (both human and machine important feature of the Web is that it allows multiple readable) for exchanging information over the Web. XML was designed to overcome the limitations of These mutant strains are subsequently used in a HTML and flat files as described previously. It is variety of functional studies, enabling the analysis of derived from the Standard Generalized Markup gene expression, disruption phenotypes, and protein Language (SGML), the international standard for localization [6]. This strain collection is maintained in defining descriptions of the structure and content of 96-well format. Each strain is assigned a unique ID different types of electronic documents. XML based upon its position within a 96-well storage plate. documents are self-describing. A set of user-defined For example, strain "V108B6" is stored in plate 108 at tags can be created for one or many XML documents. position B6. The prefix "V" indicates that this strain Syntactic and semantic rules can be defined for these carries a transposon insertion within a region of the tags in the form of Document Type Definitions (DTD). yeast genome expressed during vegetative growth. In general, the XML tags are used to identify different Yeast cells propagate vegetatively when provided with types of hierarchically related elements in the sufficient nutrients; under appropriate conditions of document, with the possibility of referencing and starvation, however, yeast cells undergo meiosis and recursion. Besides its use in data publishing, XML spore formation. Strains carrying a transposon gives the means for defining strongly structured insertion affecting a gene whose expression is induced documents so that computer programs can easily during this sporulation process are named with an "M" navigate through them and access relevant pieces of (meiotic) prefix. This ID designation is useful in information. Another advantage of using XML is that tracking given strains during subsequent analysis steps. there is a large body of XML-related software tools and As described below, number of computer programs are technologies including Document Object Model used to identify a genomic region (e.g., a gene) (DOM) and eXtensible Stylesheet Language (XSL) that disrupted by a transposon insertion within each strain. are available in the public domain. There has been an increasing use of XML in the • Sequencing. An initial step of the project is to genome community. Recently, we have seen a growing sequence the DNA samples (yeast mutant strains) number of genome sites that distribute data in XML collected. Automatic DNA sequencers such as the format. Among these are NCBI, PDB, and Gene ones manufactured by Applied Biosystems support Ontology (http://www.geneontology.org/). In addition, high throughput sequencing. Following high- a number of XML-related standards have been throughput automated sequencing, DNA sequence proposed for representing different types of biological data is typically output as chromatograms (trace data. Among them are MAML signals) that must be subsequently converted into (http://www.ncbi.nlm.nih.gov/geo/maml/index.cgi) and nucleotide sequence. These chromatogram data sets GEML (http://www.geml.org/), which are XML-based are stored in binary files ("chromat" files). In our languages for describing gene expression data; BIOML case, each chromat file contains DNA sequence data (http://www.bioml.com/BIOML/) that describes for a given strain. These files are named according biopolymers including genes and proteins; and BSML to the strain ID described previously. We process (http://www.labbook.com/products/xmlbsml.asp) that these chromat files with the PHRED-PHRAP describes DNA sequence data. package [7, 8] to produce nucleotide sequences This paper describes how to use XML to represent (clone sequences). In addition, we have configured and interoperate the yeast data that have been produced PHRED-PHRAP to remove vector sequence as well at two different sites: Snyder’s Lab at Yale and the as the transposon sequence itself from each clone Saccharomyces Genome Database (SGD) [3] at sequence. Occasionally, transposon sequence are Stanford. The paper is organized as follows. Section 2 missed by PHRED-PHRAP due to sequencing will give an overview of the yeast genomic project to errors. These errors generate DNA sequence data which our XML approach is applied. This section will that imperfectly matches the pre-specified also describe the computer programs used to process transposon sequence. Often, these sequencing the data in different stages. In Section 3, we will errors are minor; manual inspection is usually describe our XML approach to interoperating the yeast sufficient to identify transposon sequence in these datasets of interest. Also, some examples will be given. cases. To address this problem, a script was written Section 4 will provide some discussion of how to to scan the output of PHRED-PHRAP for varying improve and extend our work. Finally, we will give the patterns of this transposon sequence. Specifically, conclusion in Section 5. the sequence data are scanned for a region of 10 nucleotides corresponding to the extreme 5’ end of 2.