5.12. XML
Java 1.4
and Java 5.0 have added powerful XML processing features to the Java
platform:
- org.xml.sax
  This package and its two subpackages define the de facto standard SAX
  API (SAX stands for Simple API for XML). SAX is an event-driven,
  XML-parsing API: a SAX parser invokes methods of a specified
  ContentHandler object (as well as some other related handler objects)
  as it parses an XML document. The structure and content of the
  document are fully described by the method calls. This is a streaming
  API that does not build any permanent representation of the document.
  It is up to the ContentHandler implementation to store any state or
  perform any actions that are appropriate. This package includes
  classes for the SAX 2 API and deprecated classes for SAX 1.
- org.w3c.dom
  This package defines interfaces that represent an XML document in
  tree form. The Document Object Model (DOM) is a recommendation
  (essentially a standard) of the World Wide Web Consortium (W3C). A DOM
  parser reads an XML document and converts it into a tree of nodes that
  represent the full content of the document. Once the tree
  representation of the document is created, a program can examine and
  manipulate it however it wants. Java 1.4 includes the core module of
  the Level 2 DOM, and Java 5.0 includes the core, events, and load/save
  modules of the Level 3 DOM.
- javax.xml.parsers
  This package provides high-level interfaces for instantiating SAX and
  DOM parsers for parsing XML documents.
- javax.xml.transform
  This package and its subpackages define a Java API for transforming
  XML document content and representation using the XSLT standard.
- javax.xml.validation
  This Java 5.0 package provides support for validating an XML document
  against a schema. Implementations are required to support the W3C XML
  Schema standard and may also support other schema types.
- javax.xml.xpath
  This package, also new in Java 5.0, supports the evaluation of XPath
  expressions for selecting nodes in an XML document.
Examples using each of these packages are presented in the following
sections.
5.12.1. Parsing XML with SAX
The first
step in parsing an XML document with SAX is to obtain a SAX parser.
If you have a SAX parser implementation of your own, you can simply
instantiate the appropriate parser class. It is usually simpler,
however, to use the javax.xml.parsers package to
instantiate whatever SAX parser is provided by the Java
implementation. The code looks like this:
import javax.xml.parsers.*;
// Obtain a factory object for creating SAX parsers
SAXParserFactory parserFactory = SAXParserFactory.newInstance();
// Configure the factory object to specify attributes of the parsers it creates
parserFactory.setValidating(true);
parserFactory.setNamespaceAware(true);
// Now create a SAXParser object
SAXParser parser = parserFactory.newSAXParser(); // May throw exceptions
The SAXParser
class is a simple wrapper around the
org.xml.sax.XMLReader interface. Once you have
obtained one, as shown in the previous code, you can parse a document
by simply calling one of the various parse()
methods. Some of these methods use the deprecated SAX 1
HandlerBase class, and others use the current SAX
2 org.xml.sax.helpers.DefaultHandler class. The
DefaultHandler class provides an empty
implementation of all the methods of the
ContentHandler, ErrorHandler,
DTDHandler, and EntityResolver
interfaces. These are all the methods that the SAX parser can call
while parsing an XML document. By subclassing
DefaultHandler and defining the methods you care
about, you can perform whatever actions are necessary in response to
the method calls generated by the parser. The following code shows a
method that uses SAX to parse an XML file and determine the number of
XML elements that appear in a document as well as the number of
characters of plain text (possibly excluding
"ignorable whitespace") that appear
within those elements:
import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
public class SAXCount {
    public static void main(String[] args)
        throws SAXException, IOException, ParserConfigurationException
    {
        // Create a parser factory and use it to create a parser
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        SAXParser parser = parserFactory.newSAXParser();
        // This is the name of the file you're parsing
        String filename = args[0];
        // Instantiate a DefaultHandler subclass to do your counting for you
        CountHandler handler = new CountHandler();
        // Start the parser. It reads the file and calls methods of the handler.
        parser.parse(new File(filename), handler);
        // When you're done, report the results stored by your handler object
        System.out.println(filename + " contains " + handler.numElements +
                           " elements and " + handler.numChars +
                           " other characters ");
    }
    // This nested class extends DefaultHandler to count elements and text in
    // the XML file and saves the results in public fields. There are many
    // other DefaultHandler methods you could override, but you need only
    // these.
    public static class CountHandler extends DefaultHandler {
        public int numElements = 0, numChars = 0; // Save counts here
        // This method is invoked when the parser encounters the opening tag
        // of any XML element. Ignore the arguments but count the element.
        public void startElement(String uri, String localname, String qname,
                                 Attributes attributes) {
            numElements++;
        }
        // This method is called for any plain text within an element.
        // Simply count the number of characters in that text.
        public void characters(char[] text, int start, int length) {
            numChars += length;
        }
    }
}
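Because SAX builds no representation of the document, any state you need must live in the handler itself. The following is a minimal sketch of that idea, not part of the original example: a handler that tracks how deeply elements are nested. The names SAXDepth, DepthHandler, and maxDepth() are hypothetical, chosen only for illustration, and the document is read from a string rather than a file so that the example is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SAXDepth {
    // Hypothetical handler: remembers the deepest element nesting seen so far
    static class DepthHandler extends DefaultHandler {
        int depth = 0, maxDepth = 0;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            depth++;                              // Entering an element
            if (depth > maxDepth) maxDepth = depth;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;                              // Leaving an element
        }
    }

    // Parse the XML text and return the maximum nesting depth of its elements
    static int maxDepth(String xml)
        throws SAXException, IOException, ParserConfigurationException
    {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DepthHandler handler = new DepthHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.maxDepth;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<book><chapter><title>SAX</title></chapter></book>";
        System.out.println("Maximum depth: " + maxDepth(xml));
    }
}
```

The handler pairs startElement() with endElement() to maintain a running depth; the same pattern (a counter or a stack held in handler fields) is a common way to recover document structure from a stream of SAX events.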
5.12.2. Parsing XML with DOM
The DOM API is much different from
the SAX API. While SAX is an efficient way to scan an XML document,
it is not well-suited for programs that want to modify documents.
Instead of converting an XML document into a series of method calls,
a DOM parser converts the document into an
org.w3c.dom.Document object, which is a tree of
org.w3c.dom.Node objects. The conversion of the
complete XML document to tree form allows random access to the entire
document but can consume substantial amounts of memory.
In the DOM API, each node in the document
tree implements the Node interface and a
type-specific subinterface. (The most common types of node in a DOM
document are Element and Text
nodes.) When the parser is done parsing the document, your program
can examine and manipulate that tree using the various methods of
Node and its subinterfaces. The following code
uses JAXP to obtain a DOM parser (which, in JAXP parlance, is called
a DocumentBuilder). It then parses an XML file and
builds a document tree from it. Next, it examines the
Document tree to search for
<sect1> elements and prints the contents of
the <title> of each.
import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
public class GetSectionTitles {
    public static void main(String[] args)
        throws IOException, ParserConfigurationException,
               org.xml.sax.SAXException
    {
        // Create a factory object for creating DOM parsers and configure it
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setIgnoringComments(true); // We want to ignore comments
        factory.setCoalescing(true);       // Convert CDATA to Text nodes
        factory.setNamespaceAware(false);  // No namespaces: this is default
        factory.setValidating(false);      // Don't validate DTD: also default
        // Now use the factory to create a DOM parser, a.k.a. DocumentBuilder
        DocumentBuilder parser = factory.newDocumentBuilder();
        // Parse the file and build a Document tree to represent its content
        Document document = parser.parse(new File(args[0]));
        // Ask the document for a list of all <sect1> elements it contains
        NodeList sections = document.getElementsByTagName("sect1");
        // Loop through those <sect1> elements one at a time
        int numSections = sections.getLength();
        for(int i = 0; i < numSections; i++) {
            Element section = (Element)sections.item(i); // A <sect1>
            // The first Element child of each <sect1> should be a <title>
            // element, but there may be some whitespace Text nodes first, so
            // loop through the children until you find the first element
            // child.
            Node title = section.getFirstChild();
            while(title != null && title.getNodeType() != Node.ELEMENT_NODE)
                title = title.getNextSibling();
            // Print the text contained in the Text node child of this element.
            // An empty <title> has no child node, so check for that too.
            if (title != null && title.getFirstChild() != null)
                System.out.println(title.getFirstChild().getNodeValue());
        }
    }
}
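Since the whole tree is held in memory, a program can modify it as well as read it. The following sketch, which is not from the original example, parses a document from a string, numbers each <sect1> element by adding an attribute, and collects the title text with getTextContent(), a DOM Level 3 method that therefore requires Java 5.0 or later. The class name DOMModify, its methods, and the attribute name "n" are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DOMModify {
    // Parse XML text into a DOM Document tree
    static Document parse(String xml)
        throws ParserConfigurationException, SAXException, IOException
    {
        DocumentBuilder parser =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return parser.parse(new InputSource(new StringReader(xml)));
    }

    // Add an "n" attribute (hypothetical name) to every <sect1> element
    // and return the text of the first <title> found inside each one
    static List<String> numberSections(Document document) {
        List<String> titles = new ArrayList<String>();
        NodeList sections = document.getElementsByTagName("sect1");
        for (int i = 0; i < sections.getLength(); i++) {
            Element section = (Element) sections.item(i);
            section.setAttribute("n", String.valueOf(i + 1)); // Modify the tree
            // Note: this searches all descendants, not just direct children
            NodeList found = section.getElementsByTagName("title");
            if (found.getLength() > 0)
                titles.add(found.item(0).getTextContent());
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<chapter><sect1><title>SAX</title></sect1>" +
                             "<sect1><title>DOM</title></sect1></chapter>");
        System.out.println(numberSections(doc));
    }
}
```

Unlike getFirstChild().getNodeValue(), getTextContent() concatenates the text of all descendants, so it also copes with titles that contain nested markup or no text at all.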
5.12.3. Transforming XML Documents
The javax.xml.transform
package defines a TransformerFactory class for
creating Transformer objects. A
Transformer can transform a document from its
Source representation into a new
Result representation and optionally apply an XSLT
transformation to the document content in the process. Three
subpackages define concrete implementations of the
Source and Result interfaces,
which allow documents to be transformed among three representations:
- javax.xml.transform.stream
  Represents documents as streams of XML text.
- javax.xml.transform.dom
  Represents documents as DOM Document trees.
- javax.xml.transform.sax
  Represents documents as sequences of SAX method calls.
The following code shows one use of these packages to transform the
representation of a document from a DOM Document
tree into a stream of XML text. An interesting feature of this code
is that it does not create the Document tree by
parsing a file; instead, it builds it up from scratch.
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
public class DOMToStream {
    public static void main(String[] args)
        throws ParserConfigurationException,
               TransformerConfigurationException,
               TransformerException
    {
        // Create a DocumentBuilderFactory and a DocumentBuilder
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        // Instead of parsing an XML document, however, just create an empty
        // document that you can build up yourself.
        Document document = db.newDocument();
        // Now build a document tree using DOM methods
        Element book = document.createElement("book"); // Create new element
        book.setAttribute("id", "javanut4");           // Give it an attribute
        document.appendChild(book);                    // Add to the document
        for(int i = 1; i <= 3; i++) {                  // Add more elements
            Element chapter = document.createElement("chapter");
            Element title = document.createElement("title");
            title.appendChild(document.createTextNode("Chapter " + i));
            chapter.appendChild(title);
            chapter.appendChild(document.createElement("para"));
            book.appendChild(chapter);
        }
        // Now create a TransformerFactory and use it to create a Transformer
        // object to transform our DOM document into a stream of XML text.
        // No arguments to newTransformer() means no XSLT stylesheet
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();
        // Create the Source and Result objects for the transformation
        DOMSource source = new DOMSource(document);         // DOM document
        StreamResult result = new StreamResult(System.out); // to XML text
        // Finally, do the transformation
        transformer.transform(source, result);
    }
}
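The same stylesheet-free Transformer works in the other direction too. As a rough sketch (not part of the original example), the following code converts a stream of XML text into a DOM tree by pairing a StreamSource with a DOMResult, and then converts the tree back to text with the XML declaration suppressed through an output property. The class name StreamToDOM and its two methods are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

class StreamToDOM {
    // Stream of XML text -> DOM tree, with no stylesheet applied
    static Document toDOM(String xml) throws TransformerException {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();   // Transformer creates the Document
        identity.transform(new StreamSource(new StringReader(xml)), result);
        return (Document) result.getNode();
    }

    // DOM tree -> XML text, omitting the <?xml ...?> declaration
    static String toText(Document document) throws TransformerException {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        Document doc = toDOM("<book id=\"javanut4\"><title>XML</title></book>");
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(toText(doc));
    }
}
```

Other standard output properties, such as OutputKeys.INDENT and OutputKeys.ENCODING, are set the same way; how much whitespace an implementation adds when indenting is not specified, so the exact text produced may vary.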
The most interesting uses of
javax.xml.transform involve XSLT stylesheets. XSLT
is a complex but powerful XML grammar that describes how XML document
content should be converted to another form (e.g., XML, HTML, or
plain text). A tutorial on XSLT stylesheets is beyond the scope of
this book, but the following code (which contains only six key lines)
shows how you can apply such a stylesheet (which is an XML document
itself) to another XML document and write the resulting document to a
stream:
import java.io.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;
public class Transform {
    public static void main(String[] args)
        throws TransformerConfigurationException,
               TransformerException
    {
        // Get Source and Result objects for input, stylesheet, and output
        StreamSource input = new StreamSource(new File(args[0]));
        StreamSource stylesheet = new StreamSource(new File(args[1]));
        StreamResult output = new StreamResult(new File(args[2]));
        // Create a transformer and perform the transformation
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer(stylesheet);
        transformer.transform(input, output);
    }
}
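To see a stylesheet at work without creating any files, you can supply both the stylesheet and the input document as strings. The sketch below is an illustration only: the class name XSLTInline, the apply() method, and the tiny stylesheet (which writes each chapter title followed by a semicolon as plain text) are all made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XSLTInline {
    // A hypothetical stylesheet: output the text of each chapter title,
    // followed by a semicolon, using the "text" output method
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\"" +
        " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">" +
        "<xsl:output method=\"text\"/>" +
        "<xsl:template match=\"/\">" +
        "<xsl:for-each select=\"book/chapter\">" +
        "<xsl:value-of select=\"title\"/><xsl:text>;</xsl:text>" +
        "</xsl:for-each>" +
        "</xsl:template>" +
        "</xsl:stylesheet>";

    // Apply the stylesheet text to the document text and return the result
    static String apply(String stylesheet, String xml)
        throws TransformerException
    {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String xml = "<book><chapter><title>One</title></chapter>" +
                     "<chapter><title>Two</title></chapter></book>";
        System.out.println(apply(STYLESHEET, xml)); // Prints One;Two;
    }
}
```

If you apply the same stylesheet to many documents, consider compiling it once with TransformerFactory.newTemplates() and creating a Transformer from the resulting Templates object for each transformation.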
5.12.4. Validating XML Documents
The
javax.xml.validation
package allows you to validate XML documents against a schema. SAX
and DOM parsers obtained from the
javax.xml.parsers package can perform validation
against a DTD during the parsing process, but this package separates
validation from parsing and also provides general support for
arbitrary schema types. All implementations must
support W3C
XML Schema and are allowed to support other schema types, such as
RELAX
NG.
To use this package, begin with a SchemaFactory
instance, a parser for a specific type of schema. Use this
parser to parse a schema file into a Schema
object. Obtain a Validator from the
Schema, and then use the
Validator to validate your XML document. The
document is specified as a
SAXSource or DOMSource object.
You may recall these classes from the subpackages of
javax.xml.transform.
If the document is valid, the validate() method
of the Validator object returns normally. If it is
not valid, validate() throws a
SAXException. You can install an
org.xml.sax.ErrorHandler object for the
Validator to provide some control over the kinds
of validation errors that cause exceptions.
import javax.xml.XMLConstants;
import javax.xml.validation.*;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.*;
import java.io.*;
public class Validate {
    public static void main(String[] args) throws IOException {
        File documentFile = new File(args[0]); // 1st arg is document
        File schemaFile = new File(args[1]);   // 2nd arg is schema
        // Get a parser to parse W3C schemas. Note use of javax.xml package
        // This package contains just one class of constants.
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // Now parse the schema file to create a Schema object
        Schema schema = null;
        try { schema = factory.newSchema(schemaFile); }
        catch(SAXException e) { fail(e); }
        // Get a Validator object from the Schema.
        Validator validator = schema.newValidator();
        // Get a SAXSource object for the document. Identify the document by
        // URI rather than by FileReader so that the parser can detect the
        // document's encoding and report its system ID in error messages.
        // We could use a DOMSource here as well
        SAXSource source =
            new SAXSource(new InputSource(documentFile.toURI().toString()));
        // Now validate the document
        try { validator.validate(source); }
        catch(SAXException e) { fail(e); }
        System.err.println("Document is valid");
    }
    static void fail(SAXException e) {
        if (e instanceof SAXParseException) {
            SAXParseException spe = (SAXParseException) e;
            System.err.printf("%s:%d:%d: %s%n",
                              spe.getSystemId(), spe.getLineNumber(),
                              spe.getColumnNumber(), spe.getMessage());
        }
        else {
            System.err.println(e.getMessage());
        }
        System.exit(1);
    }
}
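As noted earlier, installing an ErrorHandler gives you some control over which validation errors become exceptions. One possible approach, sketched here under the assumption that you want to see every problem rather than stop at the first, is a handler that records ordinary errors in a list and rethrows only fatal ones. The class name CollectErrors, the validate() helper, and the small inline schema are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class CollectErrors {
    // A hypothetical schema: a <book> must contain exactly one <title>
    static final String SCHEMA =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">" +
        "<xs:element name=\"book\"><xs:complexType><xs:sequence>" +
        "<xs:element name=\"title\" type=\"xs:string\"/>" +
        "</xs:sequence></xs:complexType></xs:element>" +
        "</xs:schema>";

    // Validate the document text and return a message for each error found
    static List<String> validate(String schemaText, String xml)
        throws SAXException, IOException
    {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema =
            factory.newSchema(new StreamSource(new StringReader(schemaText)));
        Validator validator = schema.newValidator();
        final List<String> problems = new ArrayList<String>();
        validator.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignore warnings */ }
            public void error(SAXParseException e) {
                // Record the error and let validation continue
                problems.add(e.getLineNumber() + ": " + e.getMessage());
            }
            public void fatalError(SAXParseException e) throws SAXException {
                throw e; // Not well-formed: give up
            }
        });
        validator.validate(new SAXSource(new InputSource(new StringReader(xml))));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate(SCHEMA, "<book><title>Java</title></book>"));
        System.out.println(validate(SCHEMA, "<book><chapter/></book>"));
    }
}
```

With this handler installed, validate() returns normally for a document that is well-formed but invalid, and the caller decides what to do with the collected messages. The wording of those messages is implementation-specific.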
5.12.5. Evaluating XPath Expressions
XPath is a language for referring to
specific nodes in an XML document. For example, the XPath expression
"//section/title/text( )" refers to
the text inside of a <title> element inside
a <section> element at any depth within the
document. A full description of the XPath language is beyond the
scope of this book. The javax.xml.xpath package,
new in Java 5.0, provides a way to find all nodes in a document that
match an XPath expression.
import javax.xml.xpath.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
public class XPathEvaluator {
    public static void main(String[] args)
        throws ParserConfigurationException, XPathExpressionException,
               org.xml.sax.SAXException, java.io.IOException
    {
        String documentName = args[0];
        String expression = args[1];
        // Parse the document to a DOM tree
        // XPath can also be used with a SAX InputSource
        DocumentBuilder parser =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = parser.parse(new java.io.File(documentName));
        // Get an XPath object to evaluate the expression
        XPath xpath = XPathFactory.newInstance().newXPath();
        System.out.println(xpath.evaluate(expression, doc));
        // Or evaluate the expression to obtain a DOM NodeList of all matching
        // nodes. Then loop through each of the resulting nodes
        NodeList nodes = (NodeList)xpath.evaluate(expression, doc,
                                                  XPathConstants.NODESET);
        for(int i = 0, n = nodes.getLength(); i < n; i++) {
            Node node = nodes.item(i);
            System.out.println(node);
        }
    }
}
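The two-argument form of evaluate() converts its result to a string, which for a node set is the string value of the first matching node only. To work with every match, or with a numeric result, request a specific return type. The following sketch is not from the original example; the class name XPathTypes and its methods are hypothetical. It compiles the title expression once with XPath.compile() and also evaluates a count() expression as a number.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XPathTypes {
    // Parse XML text into a DOM Document tree
    static Document parse(String xml)
        throws ParserConfigurationException, SAXException, IOException
    {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Return the text of every <title> that is a child of a <section>
    static List<String> titles(Document doc) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xpath.compile("//section/title/text()");
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<String>();
        for (int i = 0, n = nodes.getLength(); i < n; i++)
            result.add(nodes.item(i).getNodeValue()); // Text node content
        return result;
    }

    // Evaluate an expression whose result is a number rather than nodes
    static int countSections(Document doc) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Double count = (Double) xpath.evaluate("count(//section)", doc,
                                               XPathConstants.NUMBER);
        return count.intValue();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<doc><section><title>A</title>" +
                             "<section><title>B</title></section>" +
                             "</section></doc>");
        System.out.println(titles(doc));
        System.out.println(countSections(doc));
    }
}
```

Calling getNodeValue() on each Text node yields its content directly, whereas printing a Node relies on the parser implementation's toString() method, whose output format is not specified by the DOM API.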