1628 Document Type Definition (DTD) Appendix B Appendix B Document Type Definition (DTD) 1629

B.3 Document Type Declaration Outline DTDs are introduced into XML documents using the document type declaration (i.e., B B.1 Introduction DOCTYPE). A document type declaration is placed in the XML document’s prolog (i.e., all B.2 Parsers, Well-Formed and Valid XML Documents lines preceding the root element), begins with . The docu- B.3 Document Type Declaration ment type declaration can point to declarations that are outside the XML document (called the external subset) or can contain the declaration inside the document (called the internal Document Type B.4 Element Type Declarations subset). For example, an internal subset might look like B.4.1 Sequences, Pipe Characters and Occurrence Indicators Definition (DTD) B.4.2 EMPTY, Mixed Content and ANY B.5 Attribute Declarations ]> B.6 Attribute Types The first myMessage is the name of the document type declaration. Anything inside B.6.1 Tokenized Attribute Type (ID, IDREF, ENTITY, NMTOKEN) the square brackets ([]) constitutes the internal subset. As we will see momentarily, ELE- B.6.2 Enumerated Attribute Types MENT and #PCDATA are used in “element declarations.” B.7 Conditional Sections External subsets physically exist in a different file that typically ends with the.dtd extension, although this file extension is not required. External subsets are specified using B.8 Whitespace Characters Objectives either keyword the keyword SYSTEM or the keyword PUBLIC. For example, the DOC- B.9 Internet and World Wide Web Resources TYPE external subset might look like • To understand what a DTD is. Summary • Terminology • Self-Review Exercises • Answers to Self-Review Exercises • To be able to write DTDs. • To be able to declare elements and attributes in a DTD. which points to the myDTD.dtd document. The PUBLIC keyword indicates that the DTD • To understand the difference between general entities is widely used (e.g., the DTD for HTML documents). The DTD may be made available in B.1 Introduction well-known locations for more efficient downloading. We used such a DTD in Chapters 9 and parameter entities. and 10 when we created XHTML documents. The DOCTYPE • To be able to use conditional sections with entities. In this appendix, we discuss Document Type Definitions (DTDs), which define an XML document’s structure (e.g., what elements, attributes, etc. are permitted in the document). • To be able to use NOTATIONs. ten recommended to ensure document conformity, especially in business-to-business is processed. (B2B) transactions, where XML documents are exchanged. DTDs specify an XML docu- uses the PUBLIC keyword to reference the well-known DTD for XHTML version 1.0. To whom nothing is given, of him can nothing be required. ment’s structure and are themselves defined using EBNF (Extended Backus-Naur Form) XML parsers that do not have a local copy of the DTD may use the URL provided to down- Henry Fielding grammar—not the XML syntax introduced in Appendix A. load the DTD to perform validation. Like everything metaphysical, the harmony between thought Both the internal and external subset may be specified at the same time. For example, and reality is to be found in the grammar of the language. the DOCTYPE Ludwig Wittgenstein Grammar, which knows how to control even kings. B.2 Parsers, Well-Formed and Valid XML Documents Molière Parsers are generally classified as validating or nonvalidating. A validating parser is able ]> to read a DTD and determine whether the XML document conforms to it. If the document conforms to the DTD, it is referred to as valid. If the document fails to conform to the DTD contains declarations from the myDTD.dtd document, as well as an internal declaration. but is syntactically correct, it is well formed, but not valid. By definition, a valid document Software Engineering Observation B.1 is well formed. The document type declaration’s internal subset plus its external subset form the DTD. 0.0 A nonvalidating parser is able to read the DTD, but cannot check the document against the DTD for conformity. If the document is syntactically correct, it is well formed. In this appendix, we use a Java program we created to check a document conformance. Software Engineering Observation B.2 This program, named Validator.jar, is located in the Appendix B examples directory. The internal subset is visible only within the document in which it resides. Other external Validator.jar uses the reference implementation for the Java API for XML Pro- documents cannot be validated against it. DTDs that are used by many documents should be

cessing 1.1, which requires crimson.jar and jaxp.jar. placed in the external subset. 0.0

© 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc.

1630 Document Type Definition (DTD) Appendix B Appendix B Document Type Definition (DTD) 1631 1632 Document Type Definition (DTD) Appendix B

B.4 Element Type Declarations 1 Occurrence Indicator Description Elements are the primary building blocks used in XML documents and are declared in a 2 DTD with element type declarations (ELEMENTs). For example, to declare element 3 4 myMessage, we might write Plus sign ( + ) An element can appear any number of times, but must appear at least once (i.e., the element appears one or more times). C:\>java -jar Validator.jar welcome. Asterisk ( * ) An element is optional, and if used, the element can appear any Document is valid. number of times (i.e., the element appears zero or more times). The element name (e.g., myElement) that follows ELEMENT is often called a generic Question mark ( ? ) An element is optional, and if used, the element can appear only identifier. The set of parentheses that follow the element name specify the element’s al- Fig. B.2 Validation by using an external DTD. once (i.e., the element appears zero or one times). lowed content and is called the content specification. Keyword PCDATA specifies that the element must contain parsable character data. These data will be parsed by the XML pars- er, therefore any markup text (i.e., <, >, &, etc.) will be treated as markup. We will discuss Fig. B.4 Occurrence indicators. the content specification in detail momentarily. 1 2 A plus sign indicates one or more occurrences. For example, Common Programming Error B.1 3 4 Attempting to use the same element name in multiple element type declarations is an error. 0.0 5 6 specifies that element album contains one or more song elements. Figure B.1 lists an XML document that contains a reference to an external DTD in the 7 The frequency of an element group (i.e., two or more elements that occur in some com- DOCTYPE. We use Validator.jar to check the document’s conformity against its DTD. 8 9 bination) is specified by enclosing the element names inside the content specification with The document type declaration (line 6) specifies the name of the root element as 10 parentheses, followed by either the plus sign, asterisk or question mark. For example, MyMessage. The element myMessage (lines 8–10) contains a single child element named message (line 9). C:\>java -jar Validator.jar welcome-invalid.xml Line 3 of the DTD (Fig. B.2) declares element myMessage. Notice that the content error: Element "myMessage" requires additional elements. indicates that element album contains one title element followed by any number of specification contains the name message. This indicates that element myMessage con- songTitle/duration element groups. At least one songTitle/duration group tains exactly one child element named message. Because myMessage can have only an Fig. B.3 Invalid XML document. must follow title, and in each of these element groups, the songTitle must precede element as its content, it is said to have element content. Line 4, declares element message the duration. An example of markup that conforms to this is whose content is of type PCDATA. Common Programming Error B.2 B.4.1 Sequences, Pipe Characters and Occurrence Indicators XML Classical Hits Having a root element name other than the name specified in the document type declaration DTDs allow the document author to define the order and frequency of child elements. The XML Overture is an error. 0.0 comma (,)—called a sequence—specifies the order in which the elements must occur. For 10 example, If an XML document’s structure is inconsistent with its corresponding DTD, but is XML Symphony 1.0 syntactically correct, the document is only well formed—not valid. Figure B.3 shows the 54 messages generated when the required message element is omitted. specifies that element classroom must contain exactly one teacher element followed which contains one title element followed by two songTitle/duration groups. The asterisk (*) character indicates an optional element that, if used, can occur any number 1 by exactly one student element. The content specification can contain any number of 2 items in sequence. of times. For example, 3 Similarly, choices are specified using the pipe character (|), as in 4 5 indicates that element library contains any number of book elements, including the 6 possibility of none at all. Markup examples that conform to this mark up are 7 which specifies that element dessert must contain either one iceCream element or one 8 pastry element, but not both. The content specification may contain any number of pipe 9 Welcome to XML! The Wealth of Nations 10 character-separated choices. The Iliad An element’s frequency (i.e., number of occurrences) is specified by using either the The Jungle Fig. B.1 XML document declaring its associated DTD. plus sign (+), asterisk (*) or question mark (?) occurrence indicator (Fig. B.4).

© 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc.

Appendix B Document Type Definition (DTD) 1633 1634 Document Type Definition (DTD) Appendix B Appendix B Document Type Definition (DTD) 1635

and declares element oven to be an empty element. The markup for an oven element would grape appear as half-sour sour Optional elements that, if used, may occur only once are followed by a question mark half-sour (?). For example, chocolate or and in an XML document conforming to this declaration. indicates that element seat contains at most one person element. Examples of markup An element can also be declared as having mixed content. Such elements may contain that conform to this are semi-sweet any combination of elements and PCDATA. For example, the declaration whipped sweet Jane Doe indicates that element myMessage contains mixed content. Markup conforming to this The declaration declaration might look like and Here is some text, some ( goat | cow )?,( chicken+ | duck* ) )> other textand even more text. Now we consider three more element type declarations and provide a declaration for indicates that element farm can have one or more farmer elements, any number of op- each. The declaration tional dog elements or an optional cat element, any number of optional pig elements, an optional goat or cow element and one or more chicken elements or any number of op- Element myMessage contains two message elements and three instances of character Figure B.5 specifies the DTD as an internal subset (lines 6–10). In the prolog (line 1), we use the standalone attribute with a value of yes. An XML document is standalone specifies that a class element must contain a number element, either one instructor Jane Doe if it does not reference an external subset. This DTD defines three elements: one that con- John Doe element or any number of assistant elements and either one credit element or one tains mixed content and two that contain parsed character data. noCredit element. Markup examples that conform to this are Lucy Bo Jill 123 1 2 Dr. Harvey Deitel and 4 3 4 5 Red Green and 6 Billy 7 Sue 8 9 456 10 ]> Tem Nieto 11 Paul Deitel 12 3 B.4.2 EMPTY, Mixed Content and ANY 13 Book catalog entry: Elements must be further refined by specifying the types of content they contain. In the pre- 14 XML 15 XML How to Program The declaration vious section, we introduced element content, indicating that an element can contain one or more child elements as its content. In this section, we introduce content specification types 16 This book carefully explains XML-based systems development. 17 In addition to element content, three other types of content exist: EMPTY, mixed con- tent and ANY. Keyword EMPTY declares empty elements, which do not contain character C:\>java -jar Validator.jar mixed.xml specifies that element donutBox can have zero or one jelly elements, followed by zero data or child elements. For example, Document is valid. or more lemon elements, followed by one or more creme or sugar elements or exactly one glazed element. Markup examples that conform to this are Fig. B.5 Example of a mixed-content element.

© 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc. 1636 Document Type Definition (DTD) Appendix B Appendix B Document Type Definition (DTD) 1637 1638 Document Type Definition (DTD) Appendix B

Line 7 declares element format as a mixed content element. According to the dec- laration, the format element may contain either parsed character data (PCDATA), ele- C:>java -jar Validator.jar invalid-mixed.xml C:\>java -jar Validator.jar welcome2.xml ment bold or element italic. The asterisk indicates that the content can occur zero fatal error: Mixed content model for "format" must end with ")*", not Document is valid. ",". or more times. Lines 8 and 9 specify that bold and italic elements only have PCDATA for their content specification—they cannot contain child elements. Despite the Fig. B.7 Declaring attributes (part 2 of 2). Fig. B.6 Changing a pipe character to a comma in a DTD (part 2 of 2). fact that elements with PCDATA content specification cannot contain child elements, when checked against the DTD attribute list declaration they are still considered to have mixed content. The comma (,), plus sign (+) and ques- tion mark (?) occurrence indicators cannot be used with mixed-content elements that declares EMPTY element x. The attribute declaration specifies that y is an attribute of x. contain only PCDATA. Keyword CDATA indicates that y can contain any character text except for the <, >, &, ' and Figure B.6 shows the results of changing the first pipe character in line 7 of Fig. B.5 to " characters. Note that the CDATA keyword in an attribute declaration has a different mean- does not conform to it because attribute number is missing from element message. a comma and the result of removing the asterisk. Both of these are illegal DTD syntax. ing than the CDATA section in an XML document we introduced in Appendix A. Recall An attribute declaration with default value #FIXED specifies that the attribute value that in a CDATA section all characters are legal except the ]]> end tag. Keyword #RE- is constant and cannot be different in the XML document. For example, Common Programming Error B.3 QUIRED specifies that the attribute must be provided for element x. We will say more When declaring mixed content, not listing PCDATA as the first item is an error. B.3 about other keywords momentarily. Figure B.7 demonstrates how to specify attribute declarations for an element. Line 9 indicates that the value "02115" is the only value attribute zip can have. The XML doc- An element declared as type ANY can contain any content, including PCDATA, ele- declares attribute for element . Attribute contains required . id message id CDATA ument is not valid if attribute zip contains a value different from "02115". If element ments or a combination of elements and PCDATA. Elements with ANY content can also be Attribute values are normalized (i.e., consecutive whitespace characters are combined into empty elements. address does not contain attribute zip, the default value "02115" is passed to the ap- one whitespace character). We discuss normalization in detail in Section B.8. Line 13 plication that is using the XML document’s data. Software Engineering Observation B.3 assigns attribute id the value "6343070". DTDs allow document authors to specify an attribute’s default value using attribute Elements with ANY content are commonly used in the early stages of DTD development. Doc- defaults, which we briefly touched upon in the previous section. Keywords #IMPLIED, ument authors typically replace ANY content with more specific content as the DTD evolves. B.3 B.6 Attribute Types #REQUIRED and #FIXED are attribute defaults. Keyword #IMPLIED specifies that if the attribute does not appear in the element, then the application using the XML document can Attribute types are classified as either string (CDATA), tokenized or enumerated. String at- B.5 Attribute Declarations use whatever value (if any) it chooses. tribute types do not impose any constraints on attribute values, other than disallowing the In this section, we discuss attribute declarations. An attribute declaration specifies an at- Keyword #REQUIRED indicates that the attribute must appear in the element. The < and & characters. Entity references (e.g., <, &, etc.) must be used for these char- acters. Tokenized attribute types impose constraints on attribute values, such as which char- tribute list for an element by using the ATTLIST attribute list declaration. An element can XML document is not valid if the attribute is missing. For example, the markup have any number of attributes. For example, acters are permitted in an attribute name. We discuss tokenized attribute types in the next XML and DTDs section. Enumerated attribute types are the most restrictive of the three types. They can take only one of the values listed in the attribute declaration. We discuss enumerated attribute types in Section B.6.2. 1 1 2 2 3 B.6.1 Tokenized Attribute Type (ID, IDREF, ENTITY, NMTOKEN) 3 4 Tokenized attribute types allow a DTD author to restrict the values used for attributes. For 4 5 example, an author may want to have a unique ID for each element or allow an attribute to 5 6 have only one or two different values. Four different tokenized attribute types exist: ID, 7 8 IDREF, ENTITY and NMTOKEN. 8 9 Tokenized attribute type ID uniquely identifies an element. Attributes with type 9 10 ]> IDREF point to elements with an ID attribute. A validating parser verifies that every ID 10 ]> 11 attribute type referenced by IDREF is in the XML document. 11 12 12 13 Figure B.8 lists an XML document that uses ID and IDREF attribute types. Element 13 Book catalog entry: 14 bookstore consists of element shipping and element book. Each shipping ele- 14 XML 15 Welcome to XML! ment describes who shipped the book and how long it will take for the book to arrive. 15 XML How to Program 16 Line 9 declares attribute shipID as an ID type attribute (i.e., each shipping ele- 16 This book carefully explains XML-based systems development. 17 ment has a unique identifier). Lines 27–37 declare book elements with attribute 17 18 shippedBy (line 11) of type IDREF. Attribute shippedBy points to one of the ship- Fig. B.6 Changing a pipe character to a comma in a DTD (part 1 of 2). Fig. B.7 Declaring attributes (part 1 of 2). ping elements by matching its shipID attribute.

© 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc.

Appendix B Document Type Definition (DTD) 1639 1640 Document Type Definition (DTD) Appendix B Appendix B Document Type Definition (DTD) 1641

Common Programming Error B.4 C:\>java -jar ParserTest.jar idexample.xml 3 Using the same value for multiple ID attributes is a logic error: The document validated 4 against the DTD is not valid. B.4 5 6 7 and isbnCPP. The parser replaces the entity references with their values. These entities 8 are called general entities. 2 to 4 days 9 Figure B.9 is a variation of Fig. B.8 that assigns shippedBy (line 32) the value 10 "bug". No shipID attribute has a value "bug", which results in a invalid XML document. 11 12 1 day 13 ]> 1 14 2 15 3 16 4 Java How to Program 4th edition. 17 2 to 4 days 5 18 6 20 8 XML How to Program. 21 1 day 9 22 10 23 11 24 12 C++ How to Program 3rd edition. 25 Java How to Program 4th edition. 13 26 14 27 15 28 16 ]> 29 C How to Program 3rd edition. Fig. B.8 XML document with ID and IDREF attribute types (part 2 of 2). 17 30 18 Common Programming Error B.5 31 19 32 20 2 to 4 days Not beginning a type attribute ID ’s value with a letter, an underscore (_) or a colon (:) is 33 C++ How to Program 3rd edition.

21 an error. B.5 34 22 35 23 Common Programming Error B.6

24 1 day Providing more than one ID attribute type for an element is an error. B.6 25 C:\>java -jar Validator.jar invalid-IDExample.xml 26 error: No element has an ID attribute with value "bug". 27 Common Programming Error B.7

28 Java How to Program 4th edition. Declaring attributes of type ID as #FIXED is an error. B.7 Fig. B.9 Error displayed when an invalid ID is referenced (part 2 of 2). 29 30 31 Related to entities are entity attributes, which indicate that an attribute has an entity for 32 XML How to Program. its value. Entity attributes are specified by using tokenized attribute type ENTITY. The pri- Line 7 declares a notation named that refers to a SYSTEM identifier named 33 mary constraint placed on ENTITY attribute types is that they must refer to external "iexplorer". Notations provide information that an application using the XML docu- 34 unparsed entities. An external unparsed entity is defined in the external subset of a DTD ment can use to handle unparsed entities. For example, the application using this document 35 and consists of character data that will not be parsed by the XML parser. 36 C++ How to Program 3rd edition. may choose to open Internet Explorer and load the document tour.html (line 8). 37 Figure B.10 lists an XML document that demonstrates the use of entities and entity Line 8 declares an entity named city that refers to an external document 38 attribute types. (tour.html). Keyword NDATA indicates that the content of this external entity is not XML. The name of the notation (e.g., html) that handles this unparsed entity is placed to C:\>java -jar Validator.jar IDExample.xml 1 the right of NDATA. Document is valid. 2 Line 11 declares attribute tour for element company. Attribute tour specifies a required ENTITY attribute type. Line 16 assigns entity city to attribute tour. If we Fig. B.8 XML document with ID and IDREF attribute types (part 1 of 2). Fig. B.9 Error displayed when an invalid ID is referenced (part 1 of 2). replaced line 16 with

© 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc.

1642 Document Type Definition (DTD) Appendix B Appendix B Document Type Definition (DTD) 1643 1644 Document Type Definition (DTD) Appendix B

does not provide a default value for gender. This type of declaration might be used to val- C:\>java -jar Validator.jar invalid-entityexample.xml idate a marked-up mailing list that contains first names, last names, addresses, etc. The ap- the document fails to conform to the DTD because entity country does not exist. error: Attribute value "country" does not name an unparsed entity. plication that uses such a mailing list may want to precede each name by either Mr., Ms. or Figure B.11 shows the error message generated when the above replacement is made. Mrs. However, some first names are gender neutral (e.g., Chris, Sam, etc.), and the appli- Fig. B.11 Error generated when a DTD contains a reference to an undefined entity cation may not know the person’s gender. In this case, the application has the flexibility (part 2 of 2). 1 to process the name in a gender-neutral way. 2 NOTATION is also an enumerated attribute type. For example, the declaration 3 Common Programming Error B.8 4 5 Not assigning an unparsed external entity to an attribute with attribute type ENTITY results indicates that reference must be assigned either JAVA or C. If a value is not assigned, 6 C is specified as the default. The notation for C might be declared as 8 Attribute type ENTITIES may also be used in a DTD to indicate that an attribute has multiple entities for its value. Each entity is separated by a space. For example "http://www.deitel.com/books/2000/chtp3/chtp3_toc.htm"> 10 11 B.7 Conditional Sections 12 specifies that attribute file is required to contain multiple entities. An example of markup 13 ]> that conforms to this might look like DTDs provide the ability to include or exclude declarations using conditional sections. 14 Keyword INCLUDE specifies that declarations are included, while keyword IGNORE speci- 15 16 fies that declarations are excluded. For example, the conditional section 17 Deitel & Associates, Inc. where animations, graph1 and graph2 are entities declared in a DTD. A more restrictive attribute type is NMTOKEN (name token), whose value consists of let- 19 ters, digits, periods, underscores, hyphens and colon characters. For example, consider the ]]> declaration C:\>java -jar Validator.jar entityexample.xml directs the parser to include the declaration of element name. Document is valid. Similarly, the conditional section

which indicates sportsClub contains a required NMTOKEN phone attribute. An exam- ]]> 1 An example that does not conform to this is directs the parser to exclude the declaration of element message. Conditional sections are 2 often used with entities, as demonstrated in Fig. B.12. 3 4 5 because spaces are not allowed in an NMTOKEN attribute. 1 6 7 multiple string tokens separated by spaces. 3 8 4 9 5 10 B.6.2 Enumerated Attribute Types 6 11 7 Enumerated attribute types declare a list of possible values an attribute can have. The at- 8 13 ]> tribute must be assigned a value from this list to conform to the DTD. Enumerated type val- 9 ]]> 14 10 15 ues are separated by pipe characters (|). For example, the declaration 11 12 17 Deitel & Associates, Inc. 13 ]]> 18 contains an enumerated attribute type declaration that allows attribute gender to have ei- 14 19 ther the value M or the value F. A default value of "F" is specified to the right of the ele- 15 ment attribute type. Alternatively, a declaration such as 16 Fig. B.11 Error generated when a DTD contains a reference to an undefined entity (part 1 of 2). Fig. B.12 Conditional sections in a DTD (part 1 of 2).

© 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc. Appendix B Document Type Definition (DTD) 1645 1646 Document Type Definition (DTD) Appendix B Appendix B Document Type Definition (DTD) 1647

Figure B.14 contains a DTD and markup that conforms to the DTD. Line 28 assigns a 17 39 This is some additional text. 18 value containing multiple whitespace characters to attribute cdata. Attribute cdata 40 19 (declared in line 11) is required and must contain CDATA. As mentioned earlier, CDATA 41 can contain almost any text, including whitespace. As the output illustrates, spaces in 42 Fig. B.12 Conditional sections in a DTD (part 2 of 2). CDATA are preserved and passed on to the application that is using the XML document. Line 30 assigns a value to attribute id that contains leading whitespace. Attribute id C:\Documents and Settings\Administrator\My Documents\advanced java xml Lines 4–5 declare entities reject and accept, with the values IGNORE and is declared on line 14 with tokenized attribute type ID. Because this is not CDATA, it is nor- appendice s\appBexamples>java -jar Validator.jar whitespace.xml INCLUDE, respectively. Because each of these entities is preceded by a percent (%) char- malized and the leading whitespace characters are removed. Similarly, lines 32 and 34 Document is valid. acter, they can be used only inside the DTD in which they are declared. These types of enti- assign values that contain leading whitespace to attributes nmtoken and enumera- ties—called parameter entities—allow document authors to create entities specific to a tion—which are declared in the DTD as an NMTOKEN and an enumeration, respectively. DTD—not an XML document. Recall that the DTD is the combination of the internal subset Both these attributes are normalized by the parser. C:\>java -jar ParserTest.jar ..\appBexamples\whitespace.xml and external subset. Parameter entities may be placed only in the external subset. Lines 7–13 use the entities accept and reject, which represent the strings INCLUDE and IGNORE, respectively. Notice that the parameter entity references are pre- 1 ceded by %, whereas normal entity references are preceded by &. Line 7 represents the 2 3 beginning tag of an IGNORE section (the value of the accept entity is IGNORE), while line 11 represents the start tag of an INCLUDE section. By changing the values of the enti- 4 5 ties, we can easily choose which message element declaration to allow. 6 8 hasID, hasNMTOKEN, hasEnumeration, hasMixed )> 9 Software Engineering Observation B.4 10 Parameter entities allow document authors to use entity names in DTDs without conflicting 11 B.4 with entities names used in an XML document. 12 This is text. 13 14 This is some additional text. B.8 Whitespace Characters 15 16 In Appendix A, we briefly discussed whitespace characters. In this section, we discuss how 17 whitespace characters relate to DTDs. Depending on the application, insignificant 18 whitespace characters may be collapsed into a single whitespace character or even removed 19 entirely. This process is called normalization. Whitespace is either preserved or normal- 20 22 23 B.9 Internet and World Wide Web Resources 1 24 ]> www.wdvl.com/Authoring/HTML/Validation/DTD.html 2 25 Contains a description of the historical uses of DTDs, including a description of SGML and the 3 26 HTML DTD. 4 27 5 28 www.xml101.com/dtd 6 29 Contains tutorials and explanations on creating DTDs. 7 30 www.w3schools.com/dtd 8 31 Contains DTD tutorials and examples. 9 32 10 Chairman 33 wdvl.internet.com/Authoring/Languages/XML/Tutorials/Intro/ 11 34 index3.html 35 A DTD tutorial for Web developers. 36 www.schema.net C:\>java -jar Validator.jar conditional.xml 37 This is text. Document is valid. 38 A DTD repository with XML links and resources. www.networking.ibm.com/xml/XmlValidatorForm.html Fig. B.13 XML document that conforms to conditional.dtd. Fig. B.14 Processing whitespace in an XML document (part 1 of 2). IBM’s DOMit XML Validator.

© 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc.

1648 Document Type Definition (DTD) Appendix B Appendix B Document Type Definition (DTD) 1649 1650 Document Type Definition (DTD) Appendix B

SUMMARY • DTDs allow document authors to specify an attribute’s default value using attribute defaults. Key- .dtd extension ID tokenized attribute type • Document Type Definitions (DTDs) define an XML document’s structure (e.g., what elements, words #IMPLIED, #REQUIRED and #FIXED are attribute defaults. ANY IDREF tokenized attribute type attributes, etc. are permitted in the XML document). An XML document is not required to have a • Keyword #IMPLIED specifies that if the attribute does not appear in the element, then the appli- asterisk (*) IGNORE corresponding DTD. DTDs use EBNF (Extended Backus-Naur Form) grammar. cation using the XML document can apply whatever value (if any) it chooses. ATTLIST statement INCLUDE attribute content internal subset • Parsers are generally classified as validating or nonvalidating. A validating parser can read the • Keyword #REQUIRED indicates that the attribute must appear in the element. The XML docu- attribute declaration mixed content DTD and determine whether or not the XML document conforms to it. If the document conforms ment is not valid if the attribute is missing. attribute default mixed-content element to the DTD, it is referred to as valid. If the document fails to conform to the DTD but is syntacti- • An attribute declaration with default value #FIXED specifies that the attribute value is constant attribute list mixed-content type cally correct, it is well formed but not valid. By definition, a valid document is well formed. and cannot be different in the XML document. attribute name NDATA • A nonvalidating parser is able to read a DTD, but cannot check the document against the DTD for • Attribute types are classified as either string (CDATA), tokenized or enumerated. String attribute attribute value NMTOKEN tokenized attribute type (name token) conformity. If the document is syntactically correct, it is well formed. types do not impose any constraints on attribute values, other than disallowing the < and & char- CDATA nonvalid document • DTDs are introduced into XML documents by using the document type declaration (i.e., DOC- acters. Entity references (e.g., <, &, etc.) must be used for these characters. Tokenized character data type nonvalidating parser TYPE). The document type declaration can point to declarations that are outside the XML docu- attributes impose constraints on attribute values, such as which characters are permitted in an at- child element normalization ment (called the external subset) or can contain the declaration inside the document (called the tribute name. Enumerated attributes are the most restrictive of the three types. They can take only comma character NOTATION internal subset). one of the values listed in the attribute declaration. conditional section notation type content specification occurrence indicator • External subsets physically exist in a different file that typically ends with the .dtd extension, • Four different tokenized attribute types exist: ID, IDREF, ENTITY and NMTOKEN. Tokenized content specification type optional element although this file extension is not required. External subsets are specified using keyword SYSTEM attribute type ID uniquely identifies an element. Attributes with type IDREF point to elements declaration parameter entity or PUBLIC Both the internal and external subset may be specified at the same time. with an ID attribute. A validating parser verifies that every ID attribute type referenced by IDREF is in the XML document. default value of an attribute parsed character data • Elements are the primary building block used in XML documents and are declared in a DTD with DOCTYPE (document type declaration) parser • Entity attributes indicate that an attribute has an entity for its value and are specified using token- element type declarations (ELEMENTs). document type declaration percent sign (%) ized attribute type ENTITY. The primary constraint placed on ENTITY attribute types is that they • The element name that follows ELEMENT is often called a generic identifier. The set of parenthe- double quote (") period must refer to external unparsed entities. ses that follow the element name specify the element’s allowed content and is called the content DTD (Document Type Definition) pipe character (|) specification. • Attribute type ENTITIES may also be used in a DTD to indicate that an attribute has multiple EBNF (Extended Backus-Naur Form) grammar plus sign (+) entities for its value. Each entity is separated by a space. element question mark (?) • Keyword PCDATA specifies that the element must contain parsable character data—that is, any element content quote (') text except the characters less than (<) and ampersand (&). • A more restrictive attribute type is attribute type NMTOKEN (name token), whose value consists of letters, digits, periods, underscores, hyphens and colon characters. element name sequence (,) • An XML document is a standalone XML document if it does not reference an external DTD. ELEMENT statement standalone XML document • Attribute type NMTOKENS may contain multiple string tokens separated by spaces. • An XML element that can have only another element for content is said to have element content. element type declaration (!ELEMENT) string attribute type • Enumerated attribute types declare a list of possible values an attribute can have. The attribute EMPTY string token • DTDs allow document authors to define the order and frequency of child elements. The comma must be assigned a value from this list to conform to the DTD. Enumerated type values are sepa- empty element structural definition (,)—called a sequence—specifies the order in which the elements must occur. Choices are spec- rated by pipe characters (|). ENTITIES syntax ified using the pipe character (|). The content specification may contain any number of pipe-char- • NOTATION is also an enumerated attribute type. Notations provide information that an applica- entity attribute SYSTEM acter-separated choices. tion using the XML document can use to handle unparsed entities. ENTITY tokenized attribute type text • An element’s frequency (i.e., number of occurrences) is specified by using either the plus sign (+), • Keyword NDATA indicates that the content of an external entity is not XML. The name of the no- enumerated attribute type tokenized attribute type asterisk (*) or question mark (?) occurrence indicator. tation that handles this unparsed entity is placed to the right of NDATA. Extended Backus-Naur Form (EBNF) grammar type • The frequency of an element group (i.e., two or more elements that occur in some combination) is external subset valid document • DTDs provide the ability to include or exclude declarations using conditional sections. Keyword specified by enclosing the element names inside the content specification, followed by an occur- external unparsed entity validating parser INCLUDE specifies that declarations are included, while keyword IGNORE specifies that decla- rence indicator. fixed value validation rations are excluded. Conditional sections are often used with entities. general entity well-formed document • Elements can be further refined by describing the content types they may contain. Content speci- • Parameter entities are preceded by percent (%) characters and can be used only inside the DTD in generic identifier whitespace character fication types (e.g., EMPTY, mixed content, ANY, etc.) describe nonelement content. which they are declared. Parameter entities allow document authors to create entities specific to a hyphen (-) • An element can be declared as having mixed content (i.e., a combination of elements and DTD—not an XML document. PCDATA). The comma (,), plus sign (+) and question mark (?) occurrence indicators cannot be • Whitespace is either preserved or normalized, depending on the context in which it is used. Spaces SELF-REVIEW EXERCISES used with mixed content elements. in CDATA are preserved. Attribute values with tokenized attribute types ID, NMTOKEN and enu- B.1 State whether the following are true or false. If the answer is false, explain why. • An element declared as type ANY can contain any content including PCDATA, elements, or a com- meration are normalized. a) The document type declaration, , introduces DTDs in XML documents. bination of elements and PCDATA. Elements with ANY content can also be empty elements. DOCTYPE b) External DTDs are specified by using the keyword EXTERNAL. • An attribute list for an element is declared using the ATTLIST element type declaration. TERMINOLOGY c) A DTD can contain either internal or external subsets of declarations, but not both. • Attribute values are normalized (i.e., consecutive whitespace characters are combined into one #FIXED #PCDATA d) Child elements are declared in parentheses, inside an element type declaration. whitespace character). #IMPLIED #REQUIRED e) An element that appears any number of times is followed by an exclamation point (!).

© 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc. © 2002 by Prentice-Hall, Inc.

Appendix B Document Type Definition (DTD) 1651

f) A mixed-content element can contain text as well as other declared elements. g) An attribute declared as type CDATA can contain all characters except for the asterisk (*) and pound sign (#). h) Each element attribute of type ID must have a unique value. i) Enumerated attribute types are the most restrictive category of attribute types. j) An enumerated attribute type requires a default value. B.2 Fill in the blanks in each of the following statements: a) The set of document type declarations inside an XML document is called the . b) Elements are declared with the type declaration. c) Keyword indicates that an element contains parsable character data. d) In an element type declaration, the pipe character (|) indicates that the element can con- tain of the elements indicated. e) Attributes are declared by using the type. f) Keyword specifies that the attribute can take only a specific value that has been defined in the DTD. g) ID, IDREF, and NMTOKEN are all types of tokenized attributes. h) The % character is used to declare a/an . i) DTD is an acronym for . j) Conditional sections of DTDs are often used with .

ANSWERS TO SELF-REVIEW EXERCISES B.1 a) True. b) False. External DTDs are specified using keyword SYSTEM. c) False. A DTD contains both the internal and external subsets. d) True. e) False. An element that appears one or zero times is specified by a question mark (?). f) True. g) False. An attribute declared as type CDATA can contain all characters except for ampersand (&), less than (<), greater than (>), quote (') and double quotes ("). h) True. i) True. j) False. A default value is not required. B.2 a) internal subset. b) ELEMENT. c) PCDATA. d) one. e) ATTLIST. f) #FIXED. g) ENTITY. h) parameter entity. i) Document Type Definition. j) entities.

© 2002 by Prentice-Hall, Inc.