1628 Document Type Definition (DTD) Appendix B Appendix B Document Type Definition (DTD) 1629
Outline
B.1 Introduction
B.2 Parsers, Well-Formed and Valid XML Documents
B.3 Document Type Declaration
B.4 Element Type Declarations
B.4.1 Sequences, Pipe Characters and Occurrence Indicators
B.4.2 EMPTY, Mixed Content and ANY
B.5 Attribute Declarations
B.6 Attribute Types
B.6.1 Tokenized Attribute Type (ID, IDREF, ENTITY, NMTOKEN)
B.6.2 Enumerated Attribute Types
B.7 Conditional Sections
B.8 Whitespace Characters
B.9 Internet and World Wide Web Resources
Summary • Terminology • Self-Review Exercises • Answers to Self-Review Exercises

Objectives
• To understand what a DTD is.
• To be able to write DTDs.
• To be able to declare elements and attributes in a DTD.
• To understand the difference between general entities and parameter entities.
• To be able to use conditional sections with entities.
• To be able to use NOTATIONs.

To whom nothing is given, of him can nothing be required.
Henry Fielding

Like everything metaphysical, the harmony between thought and reality is to be found in the grammar of the language.
Ludwig Wittgenstein

Grammar, which knows how to control even kings.
Molière

B.1 Introduction
In this appendix, we discuss Document Type Definitions (DTDs), which define an XML document’s structure (e.g., what elements, attributes, etc. are permitted in the document). An XML document is not required to have a corresponding DTD. However, DTDs are often recommended to ensure document conformity, especially in business-to-business (B2B) transactions, where XML documents are exchanged. DTDs specify an XML document’s structure and are themselves defined using EBNF (Extended Backus-Naur Form) grammar, not the XML syntax introduced in Appendix A.

B.2 Parsers, Well-Formed and Valid XML Documents
Parsers are generally classified as validating or nonvalidating. A validating parser is able to read a DTD and determine whether the XML document conforms to it. If the document conforms to the DTD, it is referred to as valid. If the document fails to conform to the DTD but is syntactically correct, it is well formed, but not valid. By definition, a valid document is well formed.
A nonvalidating parser is able to read the DTD, but cannot check the document against the DTD for conformity. If the document is syntactically correct, it is well formed.
In this appendix, we use a Java program we created to check a document’s conformance. This program, named Validator.jar, is located in the Appendix B examples directory. Validator.jar uses the reference implementation for the Java API for XML Processing 1.1, which requires crimson.jar and jaxp.jar.

B.3 Document Type Declaration
DTDs are introduced into XML documents using the document type declaration (i.e., DOCTYPE). A document type declaration is placed in the XML document’s prolog (i.e., all lines preceding the root element), begins with <!DOCTYPE and ends with >. The document type declaration can point to declarations that are outside the XML document (called the external subset) or can contain the declaration inside the document (called the internal subset). For example, an internal subset might look like

   <!DOCTYPE myMessage [
      <!ELEMENT myMessage ( #PCDATA )>
   ]>

The first myMessage is the name of the document type declaration. Anything inside the square brackets ([]) constitutes the internal subset. As we will see momentarily, ELEMENT and #PCDATA are used in “element declarations.”
External subsets physically exist in a different file that typically ends with the .dtd extension, although this file extension is not required. External subsets are specified using either the keyword SYSTEM or the keyword PUBLIC. For example, a DOCTYPE that follows keyword SYSTEM with "myDTD.dtd" points to the myDTD.dtd document. The PUBLIC keyword indicates that the DTD is widely used (e.g., the DTD for HTML documents). The DTD may be made available in well-known locations for more efficient downloading. We used such a DTD in Chapters 9 and 10 when we created XHTML documents. The DOCTYPE in those documents uses the PUBLIC keyword to reference the well-known DTD for XHTML version 1.0. XML parsers that do not have a local copy of the DTD may use the URL provided to download the DTD to perform validation.
Both the internal and external subset may be specified at the same time. For example, a DOCTYPE that names myDTD.dtd with SYSTEM and also carries a declaration inside square brackets contains declarations from the myDTD.dtd document, as well as an internal declaration.

Software Engineering Observation B.1
The document type declaration’s internal subset plus its external subset form the DTD.

Software Engineering Observation B.2
The internal subset is visible only within the document in which it resides. Other external documents cannot be validated against it. DTDs that are used by many documents should be placed in the external subset.
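The distinction among valid, well-formed-but-not-valid and not-well-formed documents can be sketched with the JAXP parser that ships with the JDK. This is only an illustration, not the source of Validator.jar (which is not shown in this appendix). The class name DtdConformance, the method check and the element extra are hypothetical names introduced for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdConformance {
    // Internal subset: myMessage may contain only character data.
    static final String DOCTYPE =
        "<!DOCTYPE myMessage [ <!ELEMENT myMessage ( #PCDATA )> ]>";
    static final String VALID = DOCTYPE + "<myMessage>Welcome to XML!</myMessage>";
    // "extra" is a hypothetical, undeclared child element.
    static final String NOT_VALID = DOCTYPE + "<myMessage><extra/></myMessage>";
    // The end tag is missing, so the document is not well formed.
    static final String NOT_WELL_FORMED = DOCTYPE + "<myMessage>Welcome to XML!";

    static String check(String xml) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // request a validating parser
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                // Validity problems arrive as recoverable errors.
                @Override public void error(SAXParseException e) { valid[0] = false; }
                // DefaultHandler rethrows fatal (well-formedness) errors.
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            return "not well formed";
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return valid[0] ? "valid" : "well formed but not valid";
    }

    public static void main(String[] args) {
        System.out.println(check(VALID));           // valid
        System.out.println(check(NOT_VALID));       // well formed but not valid
        System.out.println(check(NOT_WELL_FORMED)); // not well formed
    }
}
```

Setting setValidating(false) in this sketch would give the behavior described for a nonvalidating parser: the DTD is still read, but conformity is not checked.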
© 2002 by Prentice-Hall, Inc.
B.4 Element Type Declarations
Elements are the primary building blocks used in XML documents and are declared in a DTD with element type declarations (ELEMENTs). For example, to declare element myMessage, we might write

   <!ELEMENT myMessage ( #PCDATA )>

The element name (e.g., myMessage) that follows ELEMENT is often called a generic identifier. The set of parentheses that follows the element name specifies the element’s allowed content and is called the content specification. Keyword PCDATA specifies that the element must contain parsable character data. These data will be parsed by the XML parser; therefore, any markup text (i.e., <, >, &, etc.) will be treated as markup. We will discuss the content specification in detail momentarily.

Common Programming Error B.1
Attempting to use the same element name in multiple element type declarations is an error.

Figure B.1 lists an XML document that contains a reference to an external DTD in the DOCTYPE. We use Validator.jar to check the document’s conformity against its DTD.

   C:\>java -jar Validator.jar welcome.xml
   Document is valid.

Fig. B.2 Validation by using an external DTD.

Occurrence Indicator    Description
Plus sign ( + )         An element can appear any number of times, but must appear at least once (i.e., the element appears one or more times).
Asterisk ( * )          An element is optional, and if used, the element can appear any number of times (i.e., the element appears zero or more times).
Question mark ( ? )     An element is optional, and if used, the element can appear only once (i.e., the element appears zero or one times).

Fig. B.4 Occurrence indicators.

A plus sign indicates one or more occurrences. For example,

   <!ELEMENT album ( song+ )>

specifies that element album contains one or more song elements.
The frequency of an element group (i.e., two or more elements that occur in some combination) is specified by enclosing the element names inside the content specification, followed by an occurrence indicator.
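The occurrence indicators in Fig. B.4 can be tried out with a short sketch that validates a few album elements against an internal subset using the JDK's JAXP parser. The album and song names come from the text. The title and note elements, the class name OccurrenceIndicators and the method isValid are assumptions made for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class OccurrenceIndicators {
    // An optional title (?), one or more songs (+), zero or more notes (*).
    static final String DOCTYPE =
        "<!DOCTYPE album ["
        + " <!ELEMENT album ( title?, song+, note* )>"
        + " <!ELEMENT title ( #PCDATA )>"
        + " <!ELEMENT song ( #PCDATA )>"
        + " <!ELEMENT note ( #PCDATA )>"
        + " ]>";

    // Wraps the given children in <album> and reports whether the result is valid.
    static boolean isValid(String albumContent) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override public void error(SAXParseException e) { valid[0] = false; }
            });
            String xml = DOCTYPE + "<album>" + albumContent + "</album>";
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        // One song, no title, no notes: allowed.
        System.out.println(isValid("<song>One</song>"));
        // Title, two songs and a note: allowed.
        System.out.println(isValid(
            "<title>T</title><song>One</song><song>Two</song><note>n</note>"));
        // No song at all: song+ requires at least one.
        System.out.println(isValid("<title>T</title>"));
        // Two titles: title? allows at most one.
        System.out.println(isValid("<title>A</title><title>B</title><song>One</song>"));
    }
}
```

The commas in the content specification also fix the order of the children, so placing a note before a song would be rejected as well.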
Line 7 declares element format as a mixed content element. According to the declaration, the format element may contain either parsed character data (PCDATA), element bold or element italic. The asterisk indicates that the content can occur zero or more times. Lines 8 and 9 specify that bold and italic elements only have PCDATA for their content specification; they cannot contain child elements. Despite the fact that elements with PCDATA content specification cannot contain child elements, they are still considered to have mixed content. The comma (,), plus sign (+) and question mark (?) occurrence indicators cannot be used with mixed-content elements that contain only PCDATA.
Figure B.6 shows the results of changing the first pipe character in line 7 of Fig. B.5 to a comma and the result of removing the asterisk. Both of these are illegal DTD syntax.

   C:\>java -jar Validator.jar invalid-mixed.xml
   fatal error: Mixed content model for "format" must end with ")*", not ",".

Fig. B.6 Changing a pipe character to a comma in a DTD (part 2 of 2).

Common Programming Error B.3
When declaring mixed content, not listing PCDATA as the first item is an error.

An element declared as type ANY can contain any content, including PCDATA, elements or a combination of elements and PCDATA. Elements with ANY content can also be empty elements.

Software Engineering Observation B.3
Elements with ANY content are commonly used in the early stages of DTD development. Document authors typically replace ANY content with more specific content as the DTD evolves.

B.5 Attribute Declarations
In this section, we discuss attribute declarations. An attribute declaration specifies an attribute list for an element by using the ATTLIST attribute list declaration. An element can have any number of attributes. For example,

   <!ELEMENT x EMPTY>
   <!ATTLIST x y CDATA #REQUIRED>

declares EMPTY element x. The attribute declaration specifies that y is an attribute of x. Keyword CDATA indicates that y can contain any character text except for the <, >, &, ' and " characters. Note that the CDATA keyword in an attribute declaration has a different meaning than the CDATA section in an XML document we introduced in Appendix A. Recall that in a CDATA section all characters are legal except the ]]> end tag. Keyword #REQUIRED specifies that the attribute must be provided for element x. We will say more about other keywords momentarily.
Figure B.7 demonstrates how to specify attribute declarations for an element. Line 9 declares attribute id for element message. Attribute id contains required CDATA. Attribute values are normalized (i.e., consecutive whitespace characters are combined into one whitespace character). We discuss normalization in detail in Section B.8. Line 13 assigns attribute id the value "6343070".

   C:\>java -jar Validator.jar welcome2.xml
   Document is valid.

Fig. B.7 Declaring attributes (part 2 of 2).

DTDs allow document authors to specify an attribute’s default value using attribute defaults, which we briefly touched upon in the previous section. Keywords #IMPLIED, #REQUIRED and #FIXED are attribute defaults. Keyword #IMPLIED specifies that if the attribute does not appear in the element, then the application using the XML document can use whatever value (if any) it chooses.
Keyword #REQUIRED indicates that the attribute must appear in the element. The XML document is not valid if the attribute is missing. For example, a message element that omits attribute number, when checked against a DTD attribute list declaration that marks number as #REQUIRED for message, does not conform to it because attribute number is missing from element message.
An attribute declaration with default value #FIXED specifies that the attribute value is constant and cannot be different in the XML document. For example, an attribute list declaration that gives attribute zip of element address the default #FIXED "02115" indicates that the value "02115" is the only value attribute zip can have. The XML document is not valid if attribute zip contains a value different from "02115". If element address does not contain attribute zip, the default value "02115" is passed to the application that is using the XML document’s data.

B.6 Attribute Types
Attribute types are classified as either string (CDATA), tokenized or enumerated. String attribute types do not impose any constraints on attribute values, other than disallowing the < and & characters. Entity references (e.g., &lt;, &amp;, etc.) must be used for these characters. Tokenized attribute types impose constraints on attribute values, such as which characters are permitted in an attribute name. We discuss tokenized attribute types in the next section.
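The #FIXED behavior described in Section B.5 (a fixed value is supplied to the application when the attribute is omitted, and any other value makes the document not valid) can be observed with a small sketch built on the JDK's JAXP DOM parser. The address element and zip attribute follow the text. Declaring address as EMPTY, the class name AttributeDefaults and the method zipSeenByApplication are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class AttributeDefaults {
    // zip is fixed to "02115"; address is assumed to be an EMPTY element.
    static final String DOCTYPE =
        "<!DOCTYPE address ["
        + " <!ELEMENT address EMPTY>"
        + " <!ATTLIST address zip CDATA #FIXED '02115'>"
        + " ]>";

    // Returns the zip value handed to the application,
    // or null if the document is not valid.
    static String zipSeenByApplication(String addressElement) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override public void error(SAXParseException e) { valid[0] = false; }
            });
            Document document = builder.parse(
                new InputSource(new StringReader(DOCTYPE + addressElement)));
            return valid[0] ? document.getDocumentElement().getAttribute("zip") : null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Attribute omitted: the parser supplies the fixed value.
        System.out.println(zipSeenByApplication("<address/>"));
        // Attribute present with the fixed value: still valid.
        System.out.println(zipSeenByApplication("<address zip='02115'/>"));
        // Attribute present with a different value: not valid.
        System.out.println(zipSeenByApplication("<address zip='90210'/>"));
    }
}
```

Replacing #FIXED '02115' with #REQUIRED in this sketch would make the first call report a document that is not valid, which matches the #REQUIRED rule in the text.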
Common Programming Error B.4
Using the same value for multiple ID attributes is a logic error: The document validated against the DTD is not valid.

   C:\>java -jar ParserTest.jar idexample.xml

and isbnCPP. The parser replaces the entity references with their values. These entities

Common Programming Error B.7
Declaring attributes of type ID as #FIXED is an error.

Fig. B.9 Error displayed when an invalid ID is referenced (part 2 of 2).
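The ID rules above (each ID value must be unique, and an invalid ID reference is reported as an error) can be checked with the following sketch, which uses the JDK's JAXP parser. The document structure is not the book's idexample.xml, which is not reproduced here. The elements books, book and order, the attributes code and item, and the class name IdAttributes are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class IdAttributes {
    // Hypothetical DTD: each book carries an ID, each order refers to one.
    static final String DOCTYPE =
        "<!DOCTYPE books ["
        + " <!ELEMENT books ( book+, order* )>"
        + " <!ELEMENT book ( #PCDATA )>"
        + " <!ATTLIST book code ID #REQUIRED>"
        + " <!ELEMENT order EMPTY>"
        + " <!ATTLIST order item IDREF #REQUIRED>"
        + " ]>";

    // Wraps the given children in <books> and reports whether the result is valid.
    static boolean isValid(String booksContent) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override public void error(SAXParseException e) { valid[0] = false; }
            });
            String xml = DOCTYPE + "<books>" + booksContent + "</books>";
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        // Unique IDs and a reference to an existing ID.
        System.out.println(isValid(
            "<book code='b1'>Java</book><book code='b2'>C++</book><order item='b2'/>"));
        // The same ID value used twice.
        System.out.println(isValid(
            "<book code='b1'>Java</book><book code='b1'>C++</book>"));
        // A reference to an ID that does not appear in the document.
        System.out.println(isValid(
            "<book code='b1'>Java</book><order item='b9'/>"));
    }
}
```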
which indicates sportsClub contains a required NMTOKEN phone attribute. An exam-
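A required NMTOKEN attribute such as phone on sportsClub can be exercised with the sketch below, again using the JDK's JAXP parser. Declaring sportsClub as EMPTY, the sample phone values, the class name NameTokens and the method isValid are assumptions for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class NameTokens {
    // sportsClub is assumed to be EMPTY; phone is a required name token.
    static final String DOCTYPE =
        "<!DOCTYPE sportsClub ["
        + " <!ELEMENT sportsClub EMPTY>"
        + " <!ATTLIST sportsClub phone NMTOKEN #REQUIRED>"
        + " ]>";

    // Reports whether the given sportsClub element is valid against the DTD.
    static boolean isValid(String sportsClubElement) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override public void error(SAXParseException e) { valid[0] = false; }
            });
            builder.parse(new InputSource(
                new StringReader(DOCTYPE + sportsClubElement)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        // Digits and hyphens form a legal name token.
        System.out.println(isValid("<sportsClub phone='555-111-2222'/>"));
        // Embedded spaces are not allowed in a single name token.
        System.out.println(isValid("<sportsClub phone='555 555 4902'/>"));
        // The attribute is #REQUIRED, so omitting it is not valid.
        System.out.println(isValid("<sportsClub/>"));
    }
}
```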
Fig. B.12 Conditional sections in a DTD (part 2 of 2).

Lines 4–5 declare entities reject and accept, with the values IGNORE and INCLUDE, respectively. Because each of these entities is preceded by a percent (%) character, they can be used only inside the DTD in which they are declared. These types of entities, called parameter entities, allow document authors to create entities specific to a DTD, not an XML document. Recall that the DTD is the combination of the internal subset and external subset. Parameter entities may be placed only in the external subset.
Lines 7–13 use the entities accept and reject, which represent the strings INCLUDE and IGNORE, respectively. Notice that the parameter entity references are pre-

Figure B.14 contains a DTD and markup that conforms to the DTD. Line 28 assigns a value containing multiple whitespace characters to attribute cdata. Attribute cdata (declared in line 11) is required and must contain CDATA. As mentioned earlier, CDATA can contain almost any text, including whitespace. As the output illustrates, spaces in CDATA are preserved and passed on to the application that is using the XML document.
Line 30 assigns a value to attribute id that contains leading whitespace. Attribute id is declared on line 14 with tokenized attribute type ID. Because this is not CDATA, it is normalized and the leading whitespace characters are removed. Similarly, lines 32 and 34 assign values that contain leading whitespace to attributes nmtoken and enumeration, which are declared in the DTD as an NMTOKEN and an enumeration, respectively. Both these attributes are normalized by the parser.

   C:\Documents and Settings\Administrator\My Documents\advanced java xml appendices\appBexamples>java -jar Validator.jar whitespace.xml
   Document is valid.

   C:\>java -jar ParserTest.jar ..\appBexamples\whitespace.xml
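The whitespace treatment described for Fig. B.14 can be reproduced in miniature with the JDK's JAXP DOM parser. The attribute names cdata, id, nmtoken and enumeration follow the text, but the element name item, the attribute values, the enumeration choices and the class name WhitespaceNormalization are hypothetical; this is not the book's whitespace.xml.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class WhitespaceNormalization {
    // One attribute of each kind; every value starts with extra spaces.
    static final String XML =
        "<!DOCTYPE item ["
        + " <!ELEMENT item EMPTY>"
        + " <!ATTLIST item cdata CDATA #REQUIRED"
        + "                id ID #REQUIRED"
        + "                nmtoken NMTOKEN #REQUIRED"
        + "                enumeration ( true | false ) #REQUIRED>"
        + " ]>"
        + "<item cdata='  This   is   text.  ' id='   i20'"
        + " nmtoken='   hello' enumeration='   true'/>";

    // Returns the value of the named attribute as the application receives it.
    static String attribute(String name) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler ignores warnings and recoverable errors.
            builder.setErrorHandler(new DefaultHandler());
            Document document = builder.parse(new InputSource(new StringReader(XML)));
            return document.getDocumentElement().getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Brackets make leading and trailing spaces visible in the output.
        for (String name : new String[] { "cdata", "id", "nmtoken", "enumeration" }) {
            System.out.println(name + " = [" + attribute(name) + "]");
        }
    }
}
```

With the parser bundled in current JDKs, the cdata value should come back with its spaces intact while the other three values should come back without the leading spaces, in line with the description above.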
SUMMARY
• Document Type Definitions (DTDs) define an XML document’s structure (e.g., what elements, attributes, etc. are permitted in the XML document). An XML document is not required to have a corresponding DTD. DTDs use EBNF (Extended Backus-Naur Form) grammar.
• Parsers are generally classified as validating or nonvalidating. A validating parser can read the DTD and determine whether or not the XML document conforms to it. If the document conforms to the DTD, it is referred to as valid. If the document fails to conform to the DTD but is syntactically correct, it is well formed but not valid. By definition, a valid document is well formed.
• A nonvalidating parser is able to read a DTD, but cannot check the document against the DTD for conformity. If the document is syntactically correct, it is well formed.
• DTDs are introduced into XML documents by using the document type declaration (i.e., DOCTYPE). The document type declaration can point to declarations that are outside the XML document (called the external subset) or can contain the declaration inside the document (called the internal subset).
• External subsets physically exist in a different file that typically ends with the .dtd extension, although this file extension is not required. External subsets are specified using keyword SYSTEM or PUBLIC. Both the internal and external subset may be specified at the same time.
• Elements are the primary building block used in XML documents and are declared in a DTD with element type declarations (ELEMENTs).
• The element name that follows ELEMENT is often called a generic identifier. The set of parentheses that follows the element name specifies the element’s allowed content and is called the content specification.
• Keyword PCDATA specifies that the element must contain parsable character data, that is, any text except the characters less than (<) and ampersand (&).
• An XML document is a standalone XML document if it does not reference an external DTD.
• An XML element that can have only another element for content is said to have element content.
• DTDs allow document authors to define the order and frequency of child elements. The comma (,), called a sequence, specifies the order in which the elements must occur. Choices are specified using the pipe character (|). The content specification may contain any number of pipe-character-separated choices.
• An element’s frequency (i.e., number of occurrences) is specified by using either the plus sign (+), asterisk (*) or question mark (?) occurrence indicator.
• The frequency of an element group (i.e., two or more elements that occur in some combination) is specified by enclosing the element names inside the content specification, followed by an occurrence indicator.
• Elements can be further refined by describing the content types they may contain. Content specification types (e.g., EMPTY, mixed content, ANY, etc.) describe nonelement content.
• An element can be declared as having mixed content (i.e., a combination of elements and PCDATA). The comma (,), plus sign (+) and question mark (?) occurrence indicators cannot be used with mixed content elements.
• An element declared as type ANY can contain any content including PCDATA, elements, or a combination of elements and PCDATA. Elements with ANY content can also be empty elements.
• An attribute list for an element is declared using the ATTLIST attribute list declaration.
• Attribute values are normalized (i.e., consecutive whitespace characters are combined into one whitespace character).
• DTDs allow document authors to specify an attribute’s default value using attribute defaults. Keywords #IMPLIED, #REQUIRED and #FIXED are attribute defaults.
• Keyword #IMPLIED specifies that if the attribute does not appear in the element, then the application using the XML document can apply whatever value (if any) it chooses.
• Keyword #REQUIRED indicates that the attribute must appear in the element. The XML document is not valid if the attribute is missing.
• An attribute declaration with default value #FIXED specifies that the attribute value is constant and cannot be different in the XML document.
• Attribute types are classified as either string (CDATA), tokenized or enumerated. String attribute types do not impose any constraints on attribute values, other than disallowing the < and & characters. Entity references (e.g., &lt;, &amp;, etc.) must be used for these characters. Tokenized attributes impose constraints on attribute values, such as which characters are permitted in an attribute name. Enumerated attributes are the most restrictive of the three types. They can take only one of the values listed in the attribute declaration.
• Four different tokenized attribute types exist: ID, IDREF, ENTITY and NMTOKEN. Tokenized attribute type ID uniquely identifies an element. Attributes with type IDREF point to elements with an ID attribute. A validating parser verifies that every ID attribute type referenced by IDREF is in the XML document.
• Entity attributes indicate that an attribute has an entity for its value and are specified using tokenized attribute type ENTITY. The primary constraint placed on ENTITY attribute types is that they must refer to external unparsed entities.
• Attribute type ENTITIES may also be used in a DTD to indicate that an attribute has multiple entities for its value. Each entity is separated by a space.
• A more restrictive attribute type is attribute type NMTOKEN (name token), whose value consists of letters, digits, periods, underscores, hyphens and colon characters.
• Attribute type NMTOKENS may contain multiple string tokens separated by spaces.
• Enumerated attribute types declare a list of possible values an attribute can have. The attribute must be assigned a value from this list to conform to the DTD. Enumerated type values are separated by pipe characters (|).
• NOTATION is also an enumerated attribute type. Notations provide information that an application using the XML document can use to handle unparsed entities.
• Keyword NDATA indicates that the content of an external entity is not XML. The name of the notation that handles this unparsed entity is placed to the right of NDATA.
• DTDs provide the ability to include or exclude declarations using conditional sections. Keyword INCLUDE specifies that declarations are included, while keyword IGNORE specifies that declarations are excluded. Conditional sections are often used with entities.
• Parameter entities are preceded by percent (%) characters and can be used only inside the DTD in which they are declared. Parameter entities allow document authors to create entities specific to a DTD, not an XML document.
• Whitespace is either preserved or normalized, depending on the context in which it is used. Spaces in CDATA are preserved. Attribute values with tokenized attribute types ID, NMTOKEN and enumeration are normalized.

TERMINOLOGY
#FIXED
#IMPLIED
#PCDATA
#REQUIRED
.dtd extension
ANY
asterisk (*)
ATTLIST statement
attribute content
attribute declaration
attribute default
attribute list
attribute name
attribute value
CDATA
character data type
child element
comma character
conditional section
content specification
content specification type
declaration
default value of an attribute
DOCTYPE (document type declaration)
document type declaration
double quote (")
DTD (Document Type Definition)
EBNF (Extended Backus-Naur Form) grammar
element
element content
element name
ELEMENT statement
element type declaration (!ELEMENT)
EMPTY
empty element
ENTITIES
entity attribute
ENTITY tokenized attribute type
enumerated attribute type
Extended Backus-Naur Form (EBNF) grammar
external subset
external unparsed entity
fixed value
general entity
generic identifier
hyphen (-)
ID tokenized attribute type
IDREF tokenized attribute type
IGNORE
INCLUDE
internal subset
mixed content
mixed-content element
mixed-content type
NDATA
NMTOKEN tokenized attribute type (name token)
nonvalid document
nonvalidating parser
normalization
NOTATION
notation type
occurrence indicator
optional element
parameter entity
parsed character data
parser
percent sign (%)
period
pipe character (|)
plus sign (+)
question mark (?)
quote (')
sequence (,)
standalone XML document
string attribute type
string token
structural definition
syntax
SYSTEM
text
tokenized attribute type
type
valid document
validating parser
validation
well-formed document
whitespace character

SELF-REVIEW EXERCISES
B.1 State whether the following are true or false. If the answer is false, explain why.
a) The document type declaration, DOCTYPE, introduces DTDs in XML documents.
b) External DTDs are specified by using the keyword EXTERNAL.
c) A DTD can contain either internal or external subsets of declarations, but not both.
d) Child elements are declared in parentheses, inside an element type declaration.
e) An element that appears any number of times is followed by an exclamation point (!).
f) A mixed-content element can contain text as well as other declared elements.
g) An attribute declared as type CDATA can contain all characters except for the asterisk (*) and pound sign (#).
h) Each element attribute of type ID must have a unique value.
i) Enumerated attribute types are the most restrictive category of attribute types.
j) An enumerated attribute type requires a default value.
B.2 Fill in the blanks in each of the following statements:
a) The set of document type declarations inside an XML document is called the ________.
b) Elements are declared with the ________ type declaration.
c) Keyword ________ indicates that an element contains parsable character data.
d) In an element type declaration, the pipe character (|) indicates that the element can contain ________ of the elements indicated.
e) Attributes are declared by using the ________ type.
f) Keyword ________ specifies that the attribute can take only a specific value that has been defined in the DTD.
g) ID, IDREF, ________ and NMTOKEN are all types of tokenized attributes.
h) The % character is used to declare a/an ________.
i) DTD is an acronym for ________.
j) Conditional sections of DTDs are often used with ________.
ANSWERS TO SELF-REVIEW EXERCISES
B.1 a) True. b) False. External DTDs are specified using keyword SYSTEM. c) False. A DTD contains both the internal and external subsets. d) True. e) False. An element that appears one or zero times is specified by a question mark (?). f) True. g) False. An attribute declared as type CDATA can contain all characters except for ampersand (&), less than (<), greater than (>), quote (') and double quotes ("). h) True. i) True. j) False. A default value is not required.
B.2 a) internal subset. b) ELEMENT. c) PCDATA. d) one. e) ATTLIST. f) #FIXED. g) ENTITY. h) parameter entity. i) Document Type Definition. j) entities.