An XML Format for Proteomics Database to Accelerate Collaboration Among Researchers
Total Page:16
File Type:pdf, Size:1020Kb
AO HUPO workshop in Tsukuba (Dec. 19, 2002) An XML Format For Proteomics Database to Accelerate Collaboration Among Researchers HUP-ML: Human Proteome Markup Language K. Kamijo, H. Mizuguchi, A. Kenmochi, M. Sato, Y. Takaki and A. Tsugita Proteomics Res. Center, NEC Corp. 1 Overview (HUP-ML) z Introduction z XML features z XMLs in life science z HUP-ML concept z Example of HUP-ML z Details of HUP-ML z Demonstration of “HUP-ML Editor” 2 Introduction z Download entries from public DBs as a flat-file z easy for a person to read z different formats for every DB z sometimes needs special access methods and special applications for each format z Needs machine-readable formats for software tools z To boost studies by exchanging data among researchers Activates standardization 3 XML format z XML (eXtensible Markup Language) W3C: World Wide Web Consortium (inception in 1996) Attribute Key Value Example Element Start Tag <tag_element_source attribute_growth=“8 weeks”> rice leaf </tag_element_source> End Tag z Highly readable for machine and person z Can represent information hierarchy and relationships z Details can be added right away z Convenient for exchanging data z Easy to translate to other formats 4 z Logical-check by a Document Type Definition (DTD) Overview (HUP-ML) z Introduction z XML features z XMLs in life science z HUP-ML concept z Example of HUP-ML z Details of HUP-ML z Demonstration of “HUP-ML Editor” 5 XML features z Exchangeable z Inter-operability: z OS (Operating System): Windows, Linux, etc. z Software Development Languages z Communication Protocols z Inter-national / Inter-business z W3C has supervised the XML specifications z XML is well prevailed internationally z XML supports multilingual character code sets W3C: World Wide Web Consortium (inception in 1996) 6 XML features z Extensible z W3C supervised ONLY the language specifications z Everyone could create a NEW language structure based on the specifications z Hierarchy structure z A change in a node does not affect data in other nodes z Easy to add new information z when they are available 7 XML features z Retrievable with a computer z Easy to find target information in an XML document z By using ‘Tag’ element structure z It is difficult to retrieve them in a flat text file and a binary file. z Possible to find more specific information with ‘Attribute’ keywords step by step XML document <procedure> Type 1: Whole Retrieval <process seq=‘1’ action=‘spin-down’> Ex) Pick up all <procedure> information ... </process > Type 2: Partial Retrieval < process seq =‘2’ action=‘stand’> Ex) Pick up the FIRST <process> information ... in <procedure>. </process > < process seq =‘3’ action=‘store’> ... Type 3: Retrieval by Attribute </process > Ex) Pick up <process> information which </procedure> 8 </procedure> action attribute is ‘store’ XML features z Easy to develop XML applications z A variety of XML tools are available for DEVELOPERs z Powerful XML parser / processor z A developer does not need to create an XML format parser z Little tools for original formats z Possible to validate a well-formatted document z By using tools and schema files z DTD (Document Type Definition) file z XML Schema file, Relax NG z A Developer could concentrate ONLY a function of applications (NOT XML document handling) 9 XML features z A variety of XML tools are available for USERS z Standardization Tools z Retrieval Protocols: XPath z Data Transformation: XSLT (Extensible Stylesheet Language Transformations) z Easy to handle XML document by using such tools 10 XML features z A variety of related tools z XML database management system z XML formatter, XML Editor z Communication Protocols (ex. SOAP) z Major applications and tools tend to support XML document handling z RDB: ORACLE, SQL Server .... z Web browser: Microsoft Internet Explorer ... z Applications: Microsoft Office ... 11 XML features z Easy to combine XML applications with other functions z Data processing tools z Embedding link information (XLink, XPointer) z Electric authentication, Encryption z Some tools automatically bring their functions to applications XML Secure simple link Encryption System XPointer XPath extended link integrity message authentication XSLT XML Signature signer authentication XML Link 12 XML features z Easy to combine with other XML format documents z By using ‘NameSpaces’ technique, it is possible to combine them, integrate them and divide into original XML formats z Example: Combination of HTML and the following XML z Figures (SVG: Scalable Vector Graphics,Web3D etc.) z Formula (MathML: Mathematical Markup Language) z Chemical formula (CML: Chemical Markup Language) <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" dir="ltr"> : <svg:svg xmlns:svg="http://www.w3.org/2000/svg" width=“10cm" height=“10cm" viewBox="0 0 30 30" version="1.1"> <svg:g transform="scale(1.25)"> <svg:polygon style="fill:lime; stroke:blue; stroke-width:10" points=“50,2 3,15 6,35 25,50 3,39.2 34,15"/> <svg:switch> <svg:foreignObject x="70" y="140" width="360" height="340"> <math xmlns="http://www.w3.org/1998/Math/MathML" display="block"> : MathML SVG XHTML </math> </svg:foreignObject> </svg:switch> </svg:g> </svg:svg> </body></html> 13 XML features z XML is a ‘reliable’ format to use in business z XML gets a major position of data description languages z Many software venders express XML support z Activity of standardization organization (W3C) z XML will be used in long time because XML is not any vender's format 14 XML workflow DB DTD or Stylesheet XML Text Editor Schema Document XML Editor XML Validate Transform Document HTML XML Application DB Postscript 15 Overview (HUP-ML) z Introduction z XML features z XMLs in life science z HUP-ML concept z Example of HUP-ML z Details of HUP-ML z Demonstration of “HUP-ML Editor” 16 XML in Bioinformatics "The Extensible Markup Language (XML) is the universal format for structured documents and data on the Web." -- GenBank, W3C XML Web site, 2000-07-06. EMBL, DDBJ, PIR, PDB, etc. User (Researcher) Local access Public DBs Wrapper Converter Easy to andle XML Easy to distribute XML XML Applications Internet Easy to re-use XML XML DB Application Security Gate XML XML Item selection Wrapper Easy to control Private DBs priority level User (Researcher) 17 Conventional XMLs in Life Science z Genomics z BSML, GAME, DAS, DDBJ-XML,… z Gene Expression z MAGE-ML(GEML, MAML), GeneXML, … z Proteomics z PSDML (PIR’s XML) z BioML: Functional Proteomics z ProML: Structural Proteomics z (BSML) 18 Conventional XMLs in Life Science z Genomics z BSML z Bioinformatic Sequence Markup Language z For exchanging genomic sequence and annotation z By LabBook developers under a 1997 grant from National Human Genome Research Institute in the US z One Main XML format used by the Interoperable Informatics Infrastructure Consortium (I3C) z GAME z Genome Annotation Markup Language z To represent information about specific regions of sequence z Supporter: EBI XEMBL Project 19 Conventional XMLs in Life Science z Gene Expression z MAGE-ML z Micro Array and Gene Expression Markup Language z To describe and communicate information on microarray- based experiments z Discuss in OMG LSR Meeting in 2001 z Supporter: EMBL-EBI, Rosetta Inpharmatics, etc. z Contains microarray designs, manufacturing info., experiment setup and execution information, as well as gene expression data and data analysis results z Purpose: To provide a framework where the researchers can exchange microarray data and compare the analysis results even if the maker of the array chip is different 20 Conventional XMLs in Life Science z Proteomics z PSDML z Protein Sequence Database Markup Language z An open-standard markup language used to store protein information in the PIR database z BioML z Biopolymer Markup Language z Developed by ProteoMetrics company z designed to be used for the general annotation of biopolymer sequence information z specifying all experiment information on molecular entities proteins, genes and other biopolymers 21 Advantage of Using A Markup Language For Life Science z An XML-based tree structure z Easily maps into the biopolymer information z hierarchical and nested at different levels of complexity z Bioinformatics data have numerous relationships z Extensible (Flexibility) z New data emerge regularly z Data are updated frequently 22 Overview (HUP-ML) z Introduction z XML features z XMLs in life science z HUP-ML concept z Example of HUP-ML z Details of HUP-ML z Demonstration of “HUP-ML Editor” 23 Analysis flow in Life Science Tissue disruption 2DE Mass Spectrometer Proteome Tissue disruption 2DE Mass Spectrometer Extraction Spot picking (Detector)(Detector) Analysis Extraction Spot picking ConcentrationConcentration (LC)(LC) (N-/C-terminal(N-/C-terminal seq.) seq.) Experiment Sample Experiment Data Design preparation (Analysis) Acquisition Result Data Knowledge Report Analysis Mining Discovery Protein identification Chromosome Protein identification Chromosome RelatedRelated proteins proteins (PMF, PST) Genome (PMF, PST) Genome BindingsBindings 24 Functions/StructureFunctions/Structure Conventional XMLs in Life Science XML DNADNA array array data data (MAGE-ML)(MAGE-ML) Experiment Sample Experiment Data Design preparation (Analysis) Acquisition Result Data KnowledgeGene/Protein Gene/ProteinReport Analysis Mining DiscoverySequenceSequence and and FeaturesFeatures (AGAVE,(AGAVE, BSML, BSML, XML PSDML,PSDML, BioML, BioML, ProML)ProML) 25 Our XML-based data model Our XML zz Proteome-analysisProteome-analysis orientedoriented zz