Introducing XML

An Eagle's Eye View of XML

This chapter introduces you to XML, the Extensible Markup Language. It explains, in general terms, what XML is and how it is used. It shows you how different XML technologies work together, and how to create an XML document and deliver it to readers.

What Is XML?

XML stands for Extensible Markup Language (often miscapitalized as eXtensible Markup Language to justify the acronym). XML is a set of rules for defining semantic tags that break a document into parts and identify the different parts of the document. It is a meta-markup language that defines a syntax in which other domain-specific markup languages can be written.

XML is a meta-markup language

The first thing you need to understand about XML is that it isn't just another markup language like HTML, TeX, or troff. These languages define a fixed set of tags that describe a fixed number of elements. If the markup language you use doesn't contain the tag you need, you're out of luck. You can wait for the next version of the markup language, hoping that it includes the tag you need; but then you're really at the mercy of whatever the vendor chooses to include.

XML, however, is a meta-markup language. It's a language that lets you make up the tags you need as you go along. These tags must be organized according to certain general principles, but they're quite flexible in their meaning. For example, if you're working on genealogy and need to describe family names, personal names, dates, births, adoptions, deaths, burial sites, marriages, divorces, and so on, you can create tags for each of these. You don't have to force your data to fit into paragraphs, list items, table cells, or other very general categories.

You can document the tags you create in a schema written in any of several languages, including document type definitions (DTDs) and the W3C XML Schema Language. You'll learn more about DTDs and schemas in Parts II and IV of this book.
For now, think of a schema as a vocabulary and a syntax for certain kinds of documents. For example, the schema for Peter Murray-Rust's Chemical Markup Language (CML) is a DTD that describes a vocabulary and a syntax for the molecular sciences: chemistry, crystallography, solid-state physics, and the like. It includes tags for atoms, molecules, bonds, spectra, and so on. Many different people in the field can share this schema. Other schemas are available for other fields, and you can create your own.

XML defines the meta syntax that domain-specific markup languages such as MusicXML, MathML, and CML must follow. It specifies the rules for the low-level syntax, saying how markup is distinguished from content, how attributes are attached to elements, and so forth, without saying what these tags, elements, and attributes are or what they mean. It gives the patterns that elements must follow without specifying the names of the elements. For example, XML says that tags begin with a < and end with a >. However, XML does not tell you what names must go between the < and the >.

If an application understands this meta syntax, it at least partially understands all the languages built from this meta syntax. A browser does not need to know in advance each and every tag that might be used by thousands of different markup languages. Instead, the browser discovers the tags used by any given document as it reads the document or its schema. The detailed instructions about how to display the content of these tags are provided in a separate style sheet that is attached to the document.

For example, consider the three-dimensional Schrödinger equation:

\[ i\hbar \frac{\partial \psi(\mathbf{r}, t)}{\partial t} = -\frac{\hbar^2}{2m} \nabla^2 \psi(\mathbf{r}, t) + V(\mathbf{r}) \psi(\mathbf{r}, t) \]

XML means you don't have to wait for browser vendors to catch up with your ideas. You can invent the tags you need, when you need them, and tell the browsers how to display these tags.
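The claim that an application which understands the meta syntax can discover the tags of any document as it reads it can be illustrated with a minimal Python sketch. It assumes the standard library's xml.etree.ElementTree parser; the function name discover_tags and both sample documents are made up for illustration (the chemistry sample is not real CML).

```python
import xml.etree.ElementTree as ET
from collections import Counter


def discover_tags(xml_text):
    """Count the element names used in any well-formed XML document.

    Nothing here knows the vocabulary in advance; the parser relies only
    on XML's low-level syntax rules.
    """
    root = ET.fromstring(xml_text)
    # iter() visits the root element and every descendant in document order.
    return Counter(element.tag for element in root.iter())


# Two hypothetical documents from unrelated, invented vocabularies.
chemistry = "<molecule><atom>H</atom><atom>H</atom><atom>O</atom><bond/><bond/></molecule>"
genealogy = "<person><name>Ada</name><birth>1815</birth><death>1852</death></person>"

for document in (chemistry, genealogy):
    print(sorted(discover_tags(document).items()))
```

The same function handles both documents without modification. Deciding what the tags mean, or how to display them, is left to a schema or style sheet.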
XML describes structure and semantics, not formatting

XML markup describes a document's structure and meaning. It does not describe the formatting of the elements on the page. You can add formatting to a document with a style sheet. The document itself only contains tags that say what is in the document, not what the document looks like.

By contrast, HTML encompasses formatting, structural, and semantic markup. <B> is a formatting tag that makes its content bold. <STRONG> is a semantic tag that means its contents are especially important. <TD> is a structural tag that indicates that the contents are a cell in a table. In fact, some tags can have all three kinds of meaning. An <H1> tag can simultaneously mean 20-point Helvetica bold, a level 1 heading, and the title of the page.

For example, in HTML, a song might be described using a definition title, definition data, an unordered list, and list items. But none of these elements actually has anything to do with music. The HTML might look something like this:

    <DT>Hot Cop
    <DD> by Jacques Morali, Henri Belolo, and Victor Willis
    <UL>
    <LI> Jacques Morali
    <LI> PolyGram Records
    <LI> 6:20
    <LI> 1978
    <LI> Village People
    </UL>

In XML, the same data could be marked up like this:

    <SONG>
      <TITLE>Hot Cop</TITLE>
      <COMPOSER>Jacques Morali</COMPOSER>
      <COMPOSER>Henri Belolo</COMPOSER>
      <COMPOSER>Victor Willis</COMPOSER>
      <PRODUCER>Jacques Morali</PRODUCER>
      <PUBLISHER>PolyGram Records</PUBLISHER>
      <LENGTH>6:20</LENGTH>
      <YEAR>1978</YEAR>
      <ARTIST>Village People</ARTIST>
    </SONG>

Instead of generic tags such as <DT> and <LI>, this example uses meaningful tags such as <SONG>, <TITLE>, <COMPOSER>, and <YEAR>. These tags didn't come from any preexisting standard or specification. I just made them up on the spot because they fit the information I was describing. Domain-specific tagging has a number of advantages, not the least of which is that it's easier for a human to read the source code to determine what the author intended.
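Because the element names in the SONG example carry meaning, a program can retrieve fields by name and has no need to guess from position. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the function describe_song and the shape of the dictionary it returns are illustrative choices, not part of XML.

```python
import xml.etree.ElementTree as ET

# The SONG document from the text above.
SONG_XML = """
<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
""".strip()


def describe_song(xml_text):
    """Extract a few fields from a SONG element by tag name."""
    song = ET.fromstring(xml_text)
    return {
        "title": song.findtext("TITLE"),
        # findall() returns every direct child with the given name.
        "composers": [c.text for c in song.findall("COMPOSER")],
        "year": int(song.findtext("YEAR")),
    }


print(describe_song(SONG_XML))
```

Doing the same with the HTML version would require guessing which LI holds the year and which holds the publisher, since the tags themselves do not say.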
XML markup also makes it easier for automated software to locate all of the songs in the document. A computer program reading HTML can't tell more than that an element is a DT. It cannot determine whether that DT represents a song title, a definition, or some designer's favorite means of indenting text. In fact, a single document might well contain DT elements with all three meanings.

XML element names can be chosen such that they have extra meaning in additional contexts. For example, they might be the field names of a database. XML is far more flexible and amenable to varied uses than HTML because a limited number of tags don't have to serve many different purposes. XML offers an infinite number of tags to fill an infinite number of needs.

Why Are Developers Excited About XML?

XML makes many web-development tasks easy that are extremely difficult with HTML, and it makes possible tasks that are impossible with HTML. Because XML is extensible, developers like it for many reasons. Which reasons most interest you depends on your individual needs, but once you learn XML, you're likely to discover that it's the solution to more than one problem you're already struggling with. This section investigates some of the generic uses of XML that excite developers. In Chapter 2, you'll see some of the specific applications that have already been developed with XML.

Domain-specific markup languages

XML enables individual professions (for example, music, chemistry, human resources) to develop their own domain-specific markup languages. Domain-specific markup languages enable practitioners in the field to trade notes, data, and information without worrying about whether or not the person on the receiving end has the particular proprietary payware that was used to create the data. They can even send documents to people outside the profession with a reasonable confidence that those who receive them will at least be able to view the documents.
Furthermore, creating separate markup languages for different domains does not lead to bloatware or unnecessary complexity for those outside the profession. You may not be interested in electrical engineering diagrams, but electrical engineers are. You may not need to include sheet music in your web pages, but composers do. XML lets the electrical engineers describe their circuits and the composers notate their scores, mostly without stepping on each other's toes. Neither field needs special support from browser manufacturers or complicated plug-ins, as is true today.

Self-describing data

Much computer data from the last 40 years is lost, not because of natural disaster or decaying backup media (though those are problems too, ones XML doesn't solve), but simply because no one bothered to document the data formats. A Lotus 1-2-3 file on a 15-year-old 5.25-inch floppy disk might be irretrievable in most corporations today without a huge investment of time and resources. Data in a less-known binary format such as Lotus Jazz may be gone forever.

XML is, at a low level, an incredibly simple data format. It can be written in 100 percent pure ASCII or Unicode text, as well as in a few other well-defined formats. Text is reasonably resistant to corruption. The removal of bytes or even large sequences of bytes does not noticeably corrupt the remaining text. This starkly contrasts with many other formats, such as compressed data or serialized Java objects, in which the corruption or loss of even a single byte can render the rest of the file unreadable.
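The contrast between plain text and compressed data can be made concrete with a small Python sketch. It assumes the standard zlib module as a stand-in for "compressed data" in general; the helper names drop_byte, damaged_text, and compressed_survives are hypothetical, and the sample is a shortened version of the song record used earlier.

```python
import zlib

# A shortened, ASCII-only version of the song record.
SONG_XML = b"<SONG><TITLE>Hot Cop</TITLE><ARTIST>Village People</ARTIST><YEAR>1978</YEAR></SONG>"


def drop_byte(data, index):
    """Simulate the loss of a single byte at the given position."""
    return data[:index] + data[index + 1:]


def damaged_text(data):
    """Lose one byte of the plain text and decode whatever remains."""
    # The result is no longer well-formed XML, but the rest stays legible.
    return drop_byte(data, 3).decode("ascii")


def compressed_survives(data):
    """Lose one byte of the compressed form and try to decompress it."""
    compressed = zlib.compress(data)
    damaged = drop_byte(compressed, len(compressed) // 2)
    try:
        zlib.decompress(damaged)
    except zlib.error:
        return False
    return True


print(damaged_text(SONG_XML))
print("compressed copy readable:", compressed_survives(SONG_XML))
```

In the text case only the opening tag is mangled, and a person (or a forgiving tool) can still recover the title, artist, and year. In the compressed case the decompressor rejects the stream, so none of the content comes back without specialized recovery work.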
