AO HUPO workshop in Tsukuba (Dec. 19, 2002) An XML Format For Proteomics Database to Accelerate Collaboration Among Researchers

HUP-ML: Human Proteome

K. Kamijo, H. Mizuguchi, A. Kenmochi, M. Sato, Y. Takaki and A. Tsugita Proteomics Res. Center, NEC Corp.

1

Overview (HUP-ML) z Introduction z XML features z XMLs in life science z HUP-ML concept z Example of HUP-ML z Details of HUP-ML z Demonstration of “HUP-ML Editor” 2 Introduction z Download entries from public DBs as a flat-file z easy for a person to read z different formats for every DB z sometimes needs special access methods and special applications for each format z Needs machine-readable formats for software tools z To boost studies by exchanging data among researchers Activates standardization

3

XML format

z XML (eXtensible Markup Language) W3C: World Wide Web Consortium (inception in 1996) Attribute Key Value Example Element Start Tag rice leaf End Tag

z Highly readable for machine and person z Can represent information hierarchy and relationships z Details can be added right away z Convenient for exchanging data z Easy to translate to other formats 4 z Logical-check by a (DTD) Overview (HUP-ML) z Introduction z XML features z XMLs in life science z HUP-ML concept z Example of HUP-ML z Details of HUP-ML z Demonstration of “HUP-ML Editor” 5

XML features z Exchangeable z Inter-operability: z OS (Operating System): Windows, Linux, etc. z Software Development Languages z Communication Protocols z Inter-national / Inter-business z W3C has supervised the XML specifications z XML is well prevailed internationally z XML supports multilingual character code sets W3C: World Wide Web Consortium (inception in 1996)

6 XML features z Extensible z W3C supervised ONLY the language specifications z Everyone could create a NEW language structure based on the specifications z Hierarchy structure z A change in a node does not affect data in other nodes z Easy to add new information z when they are available

7

XML features

z Retrievable with a computer z Easy to find target information in an XML document z By using ‘Tag’ element structure z It is difficult to retrieve them in a flat text file and a binary file. z Possible to find more specific information with ‘Attribute’ keywords step by step

XML document Type 1: Whole Retrieval Ex) Pick up all information ... Type 2: Partial Retrieval < process seq =‘2’ action=‘stand’> Ex) Pick up the FIRST information ... in . < process seq =‘3’ action=‘store’> ... Type 3: Retrieval by Attribute Ex) Pick up information which 8 action attribute is ‘store’ XML features z Easy to develop XML applications z A variety of XML tools are available for DEVELOPERs z Powerful XML parser / processor z A developer does not need to create an XML format parser z Little tools for original formats z Possible to validate a well-formatted document z By using tools and schema files z DTD (Document Type Definition) file z XML Schema file, Relax NG z A Developer could concentrate ONLY a function of applications (NOT XML document handling)

9

XML features z A variety of XML tools are available for USERS z Standardization Tools z Retrieval Protocols: XPath z Data Transformation: XSLT (Extensible Stylesheet Language Transformations) z Easy to handle XML document by using such tools

10 XML features z A variety of related tools z XML database management system z XML formatter, XML Editor z Communication Protocols (ex. SOAP) z Major applications and tools tend to support XML document handling z RDB: ORACLE, SQL Server .... z Web browser: Microsoft Internet Explorer ... z Applications: Microsoft Office ...

11

XML features z Easy to combine XML applications with other functions z Data processing tools z Embedding link information (XLink, XPointer) z Electric authentication, Encryption z Some tools automatically bring their functions to applications

XML Secure simple link Encryption System XPointer XPath extended link

integrity message authentication XSLT XML Signature signer authentication XML Link 12 XML features z Easy to combine with other XML format documents z By using ‘NameSpaces’ technique, it is possible to combine them, integrate them and divide into original XML formats z Example: Combination of HTML and the following XML z Figures (SVG: Scalable Vector Graphics,Web3D etc.) z Formula (MathML: Mathematical Markup Language) z Chemical formula (CML: Chemical Markup Language)

xml:lang="en" dir="ltr"> : : MathML SVG XHTML 13

XML features z XML is a ‘reliable’ format to use in business z XML gets a major position of data description languages z Many software venders express XML support z Activity of standardization organization (W3C) z XML will be used in long time because XML is not any vender's format

14 XML workflow

DB DTD or Stylesheet XML Schema Document XML Editor XML Validate Transform Document HTML XML Application DB Postscript

15

Overview (HUP-ML) z Introduction z XML features z XMLs in life science z HUP-ML concept z Example of HUP-ML z Details of HUP-ML z Demonstration of “HUP-ML Editor” 16 XML in Bioinformatics "The Extensible Markup Language (XML) is the universal format for structured documents and data on the Web." -- GenBank, W3C XML Web site, 2000-07-06. EMBL, DDBJ, PIR, PDB, etc. User (Researcher) Local access Public DBs

Wrapper Converter Easy to andle XML Easy to distribute XML

XML Applications Internet Easy to re-use XML XML DB Application Security Gate XML XML Item selection Wrapper Easy to control Private DBs priority level User (Researcher) 17

Conventional XMLs in Life Science z Genomics z BSML, GAME, DAS, DDBJ-XML,… z Gene Expression z MAGE-ML(GEML, MAML), GeneXML, … z Proteomics z PSDML (PIR’s XML) z BioML: Functional Proteomics z ProML: Structural Proteomics z (BSML)

18 Conventional XMLs in Life Science z Genomics z BSML z Bioinformatic Sequence Markup Language z For exchanging genomic sequence and annotation z By LabBook developers under a 1997 grant from National Human Genome Research Institute in the US z One Main XML format used by the Interoperable Informatics Infrastructure Consortium (I3C) z GAME z Genome Annotation Markup Language z To represent information about specific regions of sequence z Supporter: EBI XEMBL Project 19

Conventional XMLs in Life Science z Gene Expression z MAGE-ML z Micro Array and Gene Expression Markup Language z To describe and communicate information on microarray- based experiments z Discuss in OMG LSR Meeting in 2001 z Supporter: EMBL-EBI, Rosetta Inpharmatics, etc. z Contains microarray designs, manufacturing info., experiment setup and execution information, as well as gene expression data and data analysis results z Purpose: To provide a framework where the researchers can exchange microarray data and compare the analysis

results even if the maker of the array chip is different 20 Conventional XMLs in Life Science z Proteomics z PSDML z Protein Sequence Database Markup Language z An open-standard markup language used to store protein information in the PIR database z BioML z Biopolymer Markup Language z Developed by ProteoMetrics company z designed to be used for the general annotation of biopolymer sequence information z specifying all experiment information on molecular entities ƒ proteins, genes and other biopolymers 21

Advantage of Using A Markup Language For Life Science z An XML-based tree structure z Easily maps into the biopolymer information z hierarchical and nested at different levels of complexity z Bioinformatics data have numerous relationships z Extensible (Flexibility) z New data emerge regularly z Data are updated frequently

22 Overview (HUP-ML)

z Introduction z XML features z XMLs in life science z HUP-ML concept z Example of HUP-ML z Details of HUP-ML

z Demonstration of “HUP-ML Editor” 23

Analysis flow in Life Science

Tissue disruption 2DE Mass Spectrometer Proteome Tissue disruption 2DE Mass Spectrometer Extraction Spot picking (Detector)(Detector) Analysis Extraction Spot picking ConcentrationConcentration (LC)(LC) (N-/C-terminal(N-/C-terminal seq.) seq.)

Experiment Sample Experiment Data Design preparation (Analysis) Acquisition

Result Data Knowledge Report Analysis Mining Discovery

Protein identification Chromosome Protein identification Chromosome RelatedRelated proteins proteins (PMF, PST) Genome (PMF, PST) Genome BindingsBindings 24 Functions/StructureFunctions/Structure Conventional XMLs in Life Science

XML DNADNA array array data data (MAGE-ML)(MAGE-ML)

Experiment Sample Experiment Data Design preparation (Analysis) Acquisition

Result Data KnowledgeGene/Protein Gene/ProteinReport Analysis Mining DiscoverySequenceSequence and and FeaturesFeatures (AGAVE,(AGAVE, BSML, BSML, XML PSDML,PSDML, BioML, BioML, ProML)ProML) 25

Our XML-based data model

Our XML zz Proteome-analysisProteome-analysis orientedoriented zz DescribesDescribes Experiment Sample Experiment zDataz SampleSample preparation preparation Design preparation (Analysis) Acquisition zz MethodologyMethodology zz 2D2D gel gel image image / /LC LC results results zz SpotSpot information information zz SequenceSequence and and feature feature zz 3D3D structure structure z Includes other open XMLs used Result Data Knowledgez Includes other open XMLs used Report Analysis Mining Discoveryinin lifelife sciencescience

HUP-MLHUP-ML -- Human Human Proteome Proteome markup markup language language - - 26 HUP-ML

z Current version (version 0.43, beta) z source information z sample preparation information

z Target method: 2DE Simpson,R.J.,Tsugita,A.,Celis,J.E,Garrels,J.I.& Mewes,H.W., ”Workshop on two-dimensional z Gel information gel protein database,” Electrophoresis 13, 1055-1061(1992) z Spot information etc. z Discussion points for future version (tomorrow) z MS information description z Other method description (ex. LC) z etc. 27

HUP-ML (current version, 0.43) z Information Structure: Proteome Gel info. Source info. Sample preparation info. Methodology info. Gel Image Spot info.

28 HUP-ML document exchange scheme Researchers Researchers in Demonstration Experimenters the world

Demonstration Browser HUP-ML Network Editor Browser Internet CGI Database engine *.xml

HUP-ML document Database

29

Overview (HUP-ML)

z Introduction z XML features z XMLs in life science z HUP-ML concept z Example of HUP-ML z Details of HUP-ML

z Demonstration of “HUP-ML Editor” 30 By A. Tsugita et al.(2002) Example: Human Kidney Glomerulus Proteome

Extraglomerul Granule ar mesangial Macra densa cell cell cell Efferent arteriole Afferent arteriole

Glomerular epithelial cells Mesangial (podocyte) matrix

Mesangial cell

Bowman’s capsule epithelial cell Glomerular Glomerular endothelial basement cell Proximal membrane Nephron tubule Glomerulus31 epithelial cell

Sample of HUP-ML (1)

Source information

--"> HomoHomo sapiens sapiens HumanHuman /> /> /> KidneyKidney Glomerulus Glomerulus /> ">4848 /> /> NormalNormal

32 Sample of HUP-ML (2) Sample preparation

- - Standard sieving technique Standard sieving technique sample="sample="supernatantsupernatant" " using four stainless sieves. The glomeruli on using four stainless sieves. The glomeruli on temp="temp="-80-80" "temp_unit=" temp_unit="degreedegree in in C C" " the 150 micro m sieves were collected ice cold the 150 micro m sieves were collected ice cold time_unit="time_unit="minmin" "/> /> phosphate-buffered saline (PBS). disruption> disruption> /> target, condition ) lists - - -- - sample="collection" /> /> sample="collection" /> > name="Pharmalyte(pH3-10)" /> /> time="60" time_unit="min" /> time="60" time_unit="min" /> />/> /> sample="sample="suspensionsuspension" " time="time="2020" "time_unit=" time_unit="minmin">"> name="chymostain" /> 12000 /> 12000 /> 33

Sample of HUP-ML (3)

Gel condition

- -10T17:20:00"> Gel Information : - - - - Size, pH, ..... linear dry strip linear dry strip - - guiding_dye="PBP"> includingincluding standa standardrd proteins proteins /> - - /> /> Running : /> />

34 Sample of HUP-ML (4)

PIR data area

Spot information area

35

HUP-ML document exchange scheme Researchers Researchers in Demonstration Experimenters the world

Demonstration Browser HUP-ML Network Editor Browser Internet CGI Database engine *.xml

HUP-ML document Database

36 HUP-ML Editor for Proteomics Information

Our XML Document Gel Image

Gel Info.

37

HUP-ML Editor ( Example)

Spot list

38 HUP-ML Editor ( Browsing)

Click!

Spot Detail

Click!

39

HUP-ML Editor ( Source Information)

Source Information

It is possible to import form ‘templates’ or other XML documents.

40 Features of our data model Our proteomics XML: z describes sample preparations z Improves reliability of analysis results z can distribute experimental information z share know-how z improves skills z handle both gel-image and analysis results

z describes analysis information z image recognition

41

Overview (HUP-ML) z Introduction z XML features z XMLs in life science z HUP-ML concept z Example of HUP-ML z Details of HUP-ML z Demonstration of “HUP-ML Editor” 42 Description Items in HUP-ML z Related materials z An example of HUP-ML document z Current version of DTD file z Item list in DTD z CD-ROM (HUP-ML Editor version 0.43 beta)

43

Overview (HUP-ML) z Introduction z XML features z XMLs in life science z HUP-ML concept z Example of HUP-ML z Details of HUP-ML z Demonstration of “HUP-ML Editor” 44 Developments

z HUP-ML DTD z Collaboration with AOHUPO z Open current version of DTD z HUP-ML editor for free distribution z demonstrates later z Prototype WWW-based management system z for registration, viewing, and retrieval of entries

45

HUP-ML document exchange scheme Researchers Researchers in Demonstration Experimenters the world

Demonstration Browser HUP-ML Network Editor Browser Internet CGI Database engine *.xml

HUP-ML document Database

46 Example of WWW UI ( Search )

47

Future work z Refine HUP-ML specification and description items z Collaboration with AOHUPO z Version up HUP-ML editor z Convert from other XML formats z Related tools z XML schema z XML Stylesheet for HUP-ML z Relation to other analysis tools z image-analysis software z homology-analysis tools, etc. z Collaboration with other standardization groups 48 Our XML Workflows

DB MS

DTD or Stylesheet XML Schema XML Editor Document

XML Document Validate Transform XML Application DB

could be supported by AOHUPO.

could be developed by third party. 49

50