Processing XML with Java

Elliotte Rusty Harold


Dedication

In memory of all the victims of the attacks on September 11, 2001

Table of Contents

Preface
Who You Are
What You Need to Know
What You Need to Have
How to Use This Book
The Online Edition
Some grammatical notes
Contacting the Author
Acknowledgements
1. XML for Data
Motivating XML
A Thought Experiment
Robustness
Extensibility
Ease of Use
XML Syntax
XML Documents
XML Applications
Elements and Tags
Text
Attributes
XML Declaration
Comments
Processing Instructions
Entities
Namespaces
Validity
DTDs
Schemas
Schematron
The Last Mile
Style sheets
CSS
Associating Style Sheets with XML Documents
XSL
Summary
2. XML Protocols: XML-RPC and SOAP
XML as a Message Format
Envelopes
Data Representation
HTTP as a Transport Protocol
How HTTP Works
HTTP in Java
RSS
Customizing the Request
Query Strings
How POST Works
XML-RPC
Data Structures
Faults
Validating XML-RPC
SOAP
A SOAP Example
Posting SOAP documents
Faults
Encoding Styles
SOAP Headers
SOAP Limitations
Validating SOAP
Custom Protocols
Summary
3. Writing XML with Java
Fibonacci Numbers
Writing XML
Better Coding Practices
Attributes
Producing Valid XML
Namespaces
Output Streams, Writers, and Encodings
A Simple XML-RPC Client
A Simple SOAP Client
Servlets
Summary
4. Converting Flat Files to XML
The Budget
The Model
Input
Determining the Output Format
Validation
Attributes
Building Hierarchical Structures from Flat Data
Alternatives to Java
Imposing Hierarchy with XSLT
The XML Query Language
Relational Databases
Summary
5. Reading XML
InputStreams and Readers
XML Parsers
Choosing an XML API
Choosing an XML Parser
Available Parsers
SAX
DOM
JAXP
JDOM
dom4j
ElectricXML
XMLPULL
Summary
6. SAX
What is SAX?
Parsing
Callback Interfaces
Implementing ContentHandler
Using the ContentHandler
The DefaultHandler Adapter Class
Receiving Documents
Receiving Elements
Handling Attributes
Receiving Characters
Receiving Processing Instructions
Receiving Namespace Mappings
Ignorable White Space
Receiving Skipped Entities
Receiving Locators
What the ContentHandler Doesn’t Tell You
Summary
7. The XMLReader Interface
Building Parser Objects
Input
InputSource
EntityResolver
Exceptions and Errors
SAXExceptions
The ErrorHandler interface
Features and Properties
Getting and Setting Features
Getting and Setting Properties
Required Features
Standard Features
Standard Properties
Xerces Custom Features
Xerces Custom Properties
DTDHandler
Summary
8. SAX Filters
The Filter Architecture
The XMLFilter interface
Content Filters
Filtering Tags
Filtering Elements
Filtering attributes
Filters that add content
Filters vs. Transforms
The XMLFilterImpl Class
Parsing non-XML Documents
Multihandler adapters
Summary
9. The Document Object Model
The Evolution of DOM
DOM Modules
Application Specific DOMs
Trees
Document nodes
Element nodes
Attribute nodes
Leaf nodes
Non-tree nodes
What is and isn’t in the tree
DOM Parsers for Java
Parsing documents with a DOM Parser
JAXP DocumentBuilder and DocumentBuilderFactory
DOM3 Load and Save
The Node Interface
Node Types
Node Properties
Navigating the tree
Modifying the tree
Utility Methods
The NodeList interface
JAXP Serialization
DOMException
Choosing between SAX and DOM
Summary
10. Creating XML Documents with DOM
DOMImplementation
Locating a DOMImplementation
Implementation Specific Class
JAXP DocumentBuilder
DOM3 DOMImplementationRegistry
The Document Interface as an Abstract Factory
The Document Interface as a Node Type
Getter methods
Finding elements
Transferring nodes between documents
Normalization
Summary
11. The Document Object Model Core
The Element Interface
Extracting Elements
Attributes
The NamedNodeMap Interface
The CharacterData interface
The Text Interface
The CDATASection Interface
The EntityReference Interface
The Attr Interface
The ProcessingInstruction Interface
The Comment Interface
The DocumentType Interface
The Entity Interface
The Notation Interface
Summary
12. The DOM Traversal Module
NodeIterator
Constructing NodeIterators with DocumentTraversal
Liveness
Filtering by Node Type
NodeFilter
TreeWalker
Summary
13. Output from DOM
Xerces Serialization
OutputFormat
DOM Level 3
Creating DOMWriters
Serialization Features
Filtering Output
Summary
14. JDOM
What is JDOM?
Creating XML Elements with JDOM
Creating XML Documents with JDOM
Writing XML Documents with JDOM
Document Type Declarations
Namespaces
Reading XML Documents with JDOM
Navigating JDOM Trees
Talking to DOM Programs
Talking to SAX Programs
Configuring SAXBuilder
SAXOutputter
Java Integration
Serializing JDOM Objects
Synchronizing JDOM Objects
Testing Equality
Hash codes
String representations
Cloning
What JDOM doesn’t do
Summary
15. The JDOM Model
The Document Class
The Element Class
Constructors
Navigation and Search
Attributes
The Attribute Class
The Text Class
The CDATA Class
The ProcessingInstruction Class
The Comment Class
Namespaces
The DocType class
The EntityRef Class
Summary
16. XPath
Queries
The XPath Data Model
Location Paths
Axes
Node tests
Predicates
Compound Location Paths
Absolute Location Paths
Abbreviated Location paths
Combining location paths
Expressions
Literals
Operators
Functions
XPath Engines
XPath with Saxon
XPath with Xalan
DOM Level 3 XPath
Namespace Bindings
Snapshots
Compiled Expressions
Jaxen
Summary
17. XSLT
XSL Transformations
Template Rules
Stylesheets
Taking the Value of a Node
Applying Templates
The Default Template Rules
Selection
Calling Templates by Name
TrAX
Thread Safety
Locating Transformers
The xml-stylesheet processing instruction
Features
XSLT Processor Attributes
URI Resolution
Error Handling
Passing Parameters to Style Sheets
Output Properties
Sources and Results
Extending XSLT with Java
Extension Functions
Extension Elements
Summary
A. XML APIs Quick Reference
SAX
org.xml.sax
org.xml.sax.ext
org.xml.sax.helpers
DOM
The DOM Data Model
org.w3c.dom
org.w3c.dom.traversal
JAXP
javax.xml.parsers
TrAX
javax.xml.transform
javax.xml.transform.stream
javax.xml.transform.dom
javax.xml.transform.sax
JDOM Quick Reference
org.jdom
org.jdom.filter
org.jdom.input
org.jdom.output
org.jdom.transform
org.jdom.xpath
XMLPULL
org.xmlpull.v1
B. SOAP 1.1 Schemas
The SOAP 1.1 Envelope Schema
The SOAP 1.1 Encoding Schema
W3C® SOFTWARE NOTICE AND LICENSE
Recommended Reading
Index

List of Figures

1.1. The clock order styled by CSS
1.2. The PDF version of the clock order produced by XSL
2.1. Slashdot headlines in XML
2.2. NASDAQ Stock Data Retrieved via a Query String
4.1. The list of maps data structure for the budget
4.2. A UML diagram for the budget class hierarchy
6.1. The Swing Based Tree Viewer
8.1. The XML parsing process
8.2. XML parsing with a filter
8.3. XML parsing with multiple filters
8.4. How data flows through the RDDL stripper program
8.5. The end of the RDDL specification as normally presented
8.6. The end of the RDDL specification after rddl:resource elements are replaced by small tables
16.1. XPath Explorer
16.2. An XPath data model

List of Tables

2.1. Primitive Data Types defined in the W3C XML Schema Language
2.2. Simple Value Elements defined in SOAP
3.1. Standard Character Sets and Encodings
4.1. Budget Fields
9.1. Node properties
9.2. DOM Parser Features
16.1. XPath Expanded Names and String-values
16.2. Abbreviated syntax for XPath
17.1. Xalan Conversions from XSLT to Java
17.2. Xalan Conversions from Java to XSLT
A.1. Node properties

List of Examples

1.1. A plain text document indicating an order for 12 Birdsong Clocks, SKU 244
1.2. An XML document indicating an order for 12 Birdsong Clocks, SKU 244
1.3. A document indicating an order for 12 Birdsong Clocks, SKU 244?
1.4. Still an order for 12 Birdsong Clocks, SKU 244
1.5. An XML document indicating an order for multiple products shipped to multiple addresses
1.6. An XML document that uses a default namespace
1.7. An XML document that uses two default namespaces
1.8. A DTD for order documents
1.9. order.xsd: a schema for order documents
1.10. order.sct: a Schematron schema for order documents
1.11. A CSS stylesheet for order documents
1.12. An XSLT stylesheet for order documents
1.13. An XSL-FO document for the clock order
2.1. An XML document that labels elements with schema simple types
2.2. URLGrabber
2.3. URLGrabberTest
2.4. An RSS 0.91 document
2.5. An RSS 1.0 document
2.6. An XML-RPC request document
2.7. POSTing an XML-RPC request document
2.8. An XML-RPC response
2.9. An XML-RPC request that passes an array as an argument
2.10. An XML-RPC response document that returns an array
2.11. An XML-RPC Request that passes a struct as an argument
2.12. An XML-RPC fault
2.13. A DTD for XML-RPC
2.14. A Schema for XML-RPC
2.15. A SOAP document requesting the current stock price of Red Hat
2.16. A SOAP Response
2.17. A SOAP document requesting the current stock price of Red Hat
2.18. A SOAP document returning the current stock price of Red Hat
2.19. A SOAP fault response
2.20. A SOAP document that specifies the encoding style
2.21. A schema that assigns type to elements in the http://namespaces.cafeconleche.org/xmljava/ch2/ namespace
2.22. A SOAP Request with a digital signature in the header
2.23. A SOAP Request with two header entries
2.24. A SOAP Request with a mustUnderstand attribute
2.25. A Master Schema for SOAP Trading documents
3.1. A program that calculates the Fibonacci numbers
3.2. The first 10 Fibonacci numbers in an XML document
3.3. A program that outputs the Fibonacci numbers as an XML document
3.4. Using named constants for element names
3.5. A Java program that writes an XML document that uses attributes
3.6. A Java program that generates a valid document
3.7. A MathML document containing Fibonacci numbers
3.8. A Java program that generates a MathML document
3.9. A Java program that writes an XML file
3.10. Connecting an XML-RPC server with URLConnection
3.11. Connecting to a SOAP server with URLConnection
3.12. A servlet that generates XML
4.1. A class that parses comma separated values into a List of HashMaps
4.2. Naively reproducing the original table structure in XML
4.3. A schema for the XML budget data
4.4. Converting to XML with attributes
4.5. A hierarchical arrangement of the budget data
4.6. The Budget class
4.7. The Agency class
4.8. The Bureau Class
4.9. An Account Class
4.10. The Subfunction Class
4.11. The driver class that builds the data structure and writes it out again
4.12. An XSLT stylesheet that converts flat XML data to hierarchical XML data
4.13. An XQuery that converts flat data to hierarchical data
4.14. A program that connects to a relational database using JDBC and converts the table to hierarchical XML
5.1. A response from the Fibonacci XML-RPC server
5.2. Reading an XML-RPC Response
5.3. A SAX based client for the Fibonacci XML-RPC server
5.4. The ContentHandler for the SAX client for the Fibonacci XML-RPC server
5.5. A DOM based client for the Fibonacci XML-RPC server
5.6. A JAXP based client for the Fibonacci XML-RPC server
5.7. A JDOM based client for the Fibonacci XML-RPC server
5.8. A dom4j based client for the Fibonacci XML-RPC server
5.9. An ElectricXML based client for the Fibonacci XML-RPC server
5.10. An XMLPULL based client for the Fibonacci XML-RPC server
6.1. A SAX program that parses a document
6.2. The SAX ContentHandler interface
6.3. A SAX ContentHandler that writes all #PCDATA onto a java.io.Writer
6.4. The driver method for the text extractor program
6.5. A subclass of DefaultHandler that writes all #PCDATA onto a java.io.Writer
6.6. A ContentHandler interface that resets its data structures between documents
6.7. A ContentHandler class that builds a GUI representation of an XML document
6.8. The SAX Attributes interface
6.9. A ContentHandler class that spiders XLinks
6.10. A SAX client for the Fibonacci XML-RPC server
6.11. A ContentHandler that prints processing instruction targets and data on System.out
6.12. The NamespaceSupport class
6.13. A document that uses ignorable white space to prettify the XML
6.14. An XML document containing a potentially skipped entity reference
6.15. The SAX Locator interface
6.16. Determining the locations of events
7.1. The SAX InputSource class
7.2. The EntityResolver interface
7.3. An XHTML EntityResolver
7.4. The SAXException class
7.5. The SAXParseException class
7.6. A SAX program that parses a document and identifies the line numbers of any well-formedness errors
7.7. The ErrorHandler interface
7.8. A SAX program that reports all problems found in an XML document
7.9. A SAX program that validates documents
7.10. A SAX program that echoes the parsed document
7.11. The LexicalHandler interface
7.12. An implementation of the LexicalHandler interface
7.13. The DeclHandler interface
7.14. A program that prints out a complete DTD
7.15. Making maximal use of Xerces’s special abilities
7.16. The DTDHandler interface
7.17. A caching DTDHandler
7.18. A Notation utility class
7.19. An UnparsedEntity utility class
7.20. A program that lists the unparsed entities and notations used in an XML document
8.1. The XMLFilter interface
8.2. A filter that blocks all events
8.3. A filter that filters nothing
8.4. A filter that times all parsing
8.5. Parsing a document through a filter
8.6. A ContentHandler filter
8.7. A filter that substitutes its own ContentHandler
8.8. A program that filters documents
8.9. A ContentHandler filter that throws away non-XHTML elements
8.10. The AttributesImpl helper class
8.11. Changing one element into another
8.12. A subclass of XMLFilterImpl
8.13. Accessing databases through SAX
8.14. A very simple user interface for extracting XML data from a relational database
8.15. Attaching multiple handlers of the same type to a single parser
9.1. Which modules does Oracle support?
9.2. An XML-RPC request document
9.3. A program that uses Xerces to check documents for well-formedness
9.4. A program that uses the Oracle XML parser to check documents for well-formedness
9.5. A program that uses JAXP to check documents for well-formedness
9.6. Using JAXP to check documents for well-formedness
9.7. A program that uses DOM3 to check documents for well-formedness
9.8. The Node interface
9.9. Changing short type constants to strings
9.10. A class to inspect the properties of a node
9.11. Walking the tree with the Node interface
9.12. A method that changes a document by reordering nodes
9.13. The NodeList interface
9.14. Using JAXP to both read and write an XML document
9.15. The DOMException class
10.1. The DOMImplementation interface
10.2. The DOMImplementationRegistry class
10.3. The DOMImplementationSource interface
10.4. The Document interface
10.5. Building an SVG document in memory using DOM
10.6. A DOM program that outputs the Fibonacci numbers as an XML document
10.7. A valid MathML document containing Fibonacci numbers
10.8. A DOM program that outputs the Fibonacci numbers as a MathML document
10.9. A valid MathML document using prefixed names
10.10. The properties of a Document object
10.11. An XML-RPC request document
10.12. An XML-RPC response document
10.13. A DOM based XML-RPC servlet
10.14. A DOM based SOAP servlet
11.1. The Element interface
11.2. Extracting examples from DocBook
11.3. A document that uses attributes
11.4. A DOM program that adds attributes
11.5. The NamedNodeMap interface
11.6. An XLink spider that uses DOM
11.7. The CharacterData interface
11.8. ROT13 encoder for XML documents
11.9. The Text interface
11.10. Printing the text nodes in an XML document
11.11. The CDATASection interface
11.12. Merging CDATA sections with text nodes
11.13. The EntityReference interface
11.14. Inserting entity references into a document
11.15. The Attr interface
11.16. Specifying all attributes
11.17. The ProcessingInstruction interface
11.18. Reading PseudoAttributes from a ProcessingInstruction
11.19. The Comment interface
11.20. Printing comments
11.21. The DocumentType interface
11.22. The Entity interface
11.23. Listing parsed entities used in the document
11.24. The Notation interface
11.25. Listing the Notations declared in a DTD
12.1. The NodeIterator interface
12.2. The DocumentTraversal factory interface
12.3. Using a NodeIterator to extract all the comments from a document
12.4. Using a NodeIterator to retrieve the complete text content of an element
12.5. The NodeFilter interface
12.6. An implementation of the NodeFilter interface
12.7. The TreeWalker interface
12.8. The ExampleFilter class
12.9. Navigating a sub-tree with TreeWalker
13.1. Using Xerces’ OutputFormat class to pretty print XML
13.2. Using Xerces’ OutputFormat class to pretty print MathML
13.3. The DOM3 DOMWriter interface
13.4. The DOM3 DOMErrorHandler interface
13.5. Serializing with DOMWriter
13.6. The DOM3 DOMImplementationLS interface
13.7. An implementation independent DOM3 program to build and serialize an XML document
13.8. The DOMWriterFilter interface
13.9. Filtering everything that isn’t XHTML on output
13.10. Using a DOMWriterFilter
14.1. A JDOM program that produces an XML document containing Fibonacci numbers
14.2. A Fibonacci DTD
14.3. A JDOM program that produces an XML document containing Fibonacci numbers
14.4. A MathML document containing the first three Fibonacci numbers
14.5. A JDOM program that uses namespaces
14.6. A JDOM program that uses the default namespace
14.7. A JDOM program that checks XML documents for well-formedness
14.8. A JDOM program that validates XML documents
14.9. A JDOM program that lists the elements used in a document
14.10. A JDOM program that lists the nodes used in a document
14.11. A JDOM program that schema validates documents
14.12. A JDOM program that passes documents to a SAX ContentHandler
15.1. The JDOM Document class
15.2. Inspecting elements
15.3. An XML-RPC request document
15.4. The JDOM Filter interface
15.5. The ContentFilter class
15.6. The ElementFilter class
15.7. A filter for xml-stylesheet processing instructions in the prolog
15.8. Moving elements between documents
15.9. Searching for RDDL resources
15.10. The JDOM Attribute class
15.11. The JDOM Text class
15.12. JDOM based ROT13 encoder for XML documents
15.13. The JDOM CDATA class
15.14. The JDOM ProcessingInstruction class
15.15. The JDOM Comment class
15.16. Printing comments
15.17. The JDOM Namespace class
15.18. An XML document that uses namespace prefixes in attribute values
15.19. The JDOM DocType class
15.20. Validating XHTML with the DocType class
15.21. The JDOM EntityRef class
16.1. Weather data in XML
16.2. A SOAP response document
16.3. An XML-RPC request document
16.4. A SOAP request document
16.5. The Xalan XPathAPI class
16.6. The XPathEvaluator interface
16.7. The XPathResult interface
16.8. An XML document containing namespace bindings and an XPath search expression
16.9. The DOM3 XPathExpression interface
17.1. An XSLT stylesheet for XML-RPC request documents
17.2. An XSLT stylesheet that echoes XML-RPC requests
17.3. An XML-RPC request document
17.4. An XML-RPC response document
17.5. An XSLT stylesheet that calculates Fibonacci numbers
17.6. A servlet that uses TrAX and XSLT to respond to XML-RPC requests
17.7. Testing the availability of TrAX features
17.8. The TrAX URIResolver interface
17.9. A URIResolver class
17.10. The TrAX ErrorListener interface
17.11. An ErrorListener that uses the Logging API
17.12. The TrAX OutputKeys class
17.13. The TrAX DOMSource class
17.14. The TrAX DOMResult class
17.15. The TrAX SAXSource class
17.16. The TrAX SAXResult class
17.17. The TrAX StreamSource class
17.18. The TrAX StreamResult class
17.19. A Java class that calculates Fibonacci numbers
17.20. The Xalan ExpressionContext interface
17.21. A stylesheet that uses an extension element

Copyright 2001, 2002 Elliotte Rusty Harold
Last Modified September 25, 2002