5.12. XML

Java 1.4 and Java 5.0 have added powerful XML processing features to the Java platform:

org.xml.sax: This package and its two subpackages define the de facto standard SAX API (SAX stands for Simple API for XML). SAX is an event-driven, XML-parsing API: a SAX parser invokes methods of a specified ContentHandler object (as well as some other related handler objects) as it parses an XML document. The structure and content of the document are fully described by the method calls. This is a streaming API that does not build any permanent representation of the document. It is up to the ContentHandler implementation to store any state or perform any actions that are appropriate. This package includes classes for the SAX 2 API and deprecated classes for SAX 1.
org.w3c.dom: This package defines interfaces that represent an XML document in tree form. The Document Object Model (DOM) is a recommendation (essentially a standard) of the World Wide Web Consortium (W3C). A DOM parser reads an XML document and converts it into a tree of nodes that represent the full content of the document. Once the tree representation of the document is created, a program can examine and manipulate it however it wants. Java 1.4 includes the core module of the Level 2 DOM, and Java 5.0 includes the core, events, and load/save modules of the Level 3 DOM.
javax.xml.parsers: This package provides high-level interfaces for instantiating SAX and DOM parsers for parsing XML documents.
javax.xml.transform: This package and its subpackages define a Java API for transforming XML document content and representation using the XSLT standard.
javax.xml.validation: This Java 5.0 package provides support for validating an XML document against a schema. Implementations are required to support the W3C XML Schema standard and may support other schema types as well.
javax.xml.xpath: This package, also new in Java 5.0, supports the evaluation of XPath for selecting nodes in an XML document.

Examples using each of these packages are presented in the following sections.

5.12.1. Parsing XML with SAX

The first step in parsing an XML document with SAX is to obtain a SAX parser. If you have a SAX parser implementation of your own, you can simply instantiate the appropriate parser class. It is usually simpler, however, to use the javax.xml.parsers package to instantiate whatever SAX parser is provided by the Java implementation. The code looks like this:

import javax.xml.parsers.*;

// Obtain a factory object for creating SAX parsers
SAXParserFactory parserFactory = SAXParserFactory.newInstance();

// Configure the factory object to specify attributes of the parsers it creates
parserFactory.setValidating(true);
parserFactory.setNamespaceAware(true);

// Now create a SAXParser object
SAXParser parser = parserFactory.newSAXParser();   // May throw exceptions

The SAXParser class is a simple wrapper around an object that implements the org.xml.sax.XMLReader interface. Once you have obtained a SAXParser, as shown in the previous code, you can parse a document by calling one of its parse() methods. Some of these methods use the deprecated SAX 1 HandlerBase class, and others use the current SAX 2 org.xml.sax.helpers.DefaultHandler class. The DefaultHandler class provides an empty implementation of all the methods of the ContentHandler, ErrorHandler, DTDHandler, and EntityResolver interfaces. These are all the methods that the SAX parser can call while parsing an XML document. By subclassing DefaultHandler and overriding the methods you care about, you can perform whatever actions are necessary in response to the method calls generated by the parser. The following code shows a program that uses SAX to parse an XML file and determine the number of XML elements that appear in the document as well as the number of characters of plain text (possibly excluding "ignorable whitespace") that appear within those elements:

import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class SAXCount {
    public static void main(String[] args) 
        throws SAXException, IOException, ParserConfigurationException
    {
        // Create a parser factory and use it to create a parser
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        SAXParser parser = parserFactory.newSAXParser();
        // This is the name of the file you're parsing
        String filename = args[0];
        // Instantiate a DefaultHandler subclass to do your counting for you
        CountHandler handler = new CountHandler();
        // Start the parser. It reads the file and calls methods of the handler.
        parser.parse(new File(filename), handler);
        // When you're done, report the results stored by your handler object
        System.out.println(filename + " contains " + handler.numElements +
                           " elements and " + handler.numChars +
                           " other characters ");
    }

    // This inner class extends DefaultHandler to count elements and text in
    // the XML file and saves the results in public fields. There are many
    // other DefaultHandler methods you could override, but you need only 
    // these.
    public static class CountHandler extends DefaultHandler {
        public int numElements = 0, numChars = 0;  // Save counts here
        // This method is invoked when the parser encounters the opening tag
        // of any XML element. Ignore the arguments but count the element.
        public void startElement(String uri, String localname, String qname,
                                 Attributes attributes) {
            numElements++;
        }

        // This method is called for any plain text within an element
        // Simply count the number of characters in that text
        public void characters(char[] text, int start, int length) {
            numChars += length;
        }
    }
}
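
Because SAX keeps no representation of the document, any structural information has to be tracked by the handler itself. The following is a minimal sketch of that idea, not part of the original example: it parses XML held in a string and reports the deepest level of element nesting. The class name SAXDepth and the helper maxDepth() are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SAXDepth {
    // Hypothetical helper: returns the deepest level of element nesting
    static int maxDepth(String xml)
        throws SAXException, IOException, ParserConfigurationException
    {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DepthHandler handler = new DepthHandler();
        // Wrap the string in an InputSource so that no file is needed
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.max;
    }

    // The handler holds all the state: the parser itself remembers nothing
    static class DepthHandler extends DefaultHandler {
        int depth = 0, max = 0;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            depth++;                       // Entering an element
            if (depth > max) max = depth;  // Remember the deepest point
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;                       // Leaving an element
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<book><chapter><title>One</title></chapter></book>";
        System.out.println(maxDepth(xml));  // prints 3
    }
}
```

The pairing of startElement() and endElement() is what lets the handler reconstruct nesting from a flat stream of events.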

5.12.2. Parsing XML with DOM

The DOM API is very different from the SAX API. While SAX is an efficient way to scan an XML document, it is not well suited for programs that want to modify documents. Instead of converting an XML document into a series of method calls, a DOM parser converts the document into an org.w3c.dom.Document object, which is a tree of org.w3c.dom.Node objects. The conversion of the complete XML document to tree form allows random access to the entire document but can consume substantial amounts of memory.

In the DOM API, each node in the document tree implements the Node interface and a type-specific subinterface. (The most common types of node in a DOM document are Element and Text nodes.) When the parser is done parsing the document, your program can examine and manipulate that tree using the various methods of Node and its subinterfaces. The following code uses JAXP to obtain a DOM parser (which, in JAXP parlance, is called a DocumentBuilder). It then parses an XML file and builds a document tree from it. Next, it examines the Document tree to search for <sect1> elements and prints the contents of the <title> of each.

import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;

public class GetSectionTitles {
    public static void main(String[] args)
        throws IOException, ParserConfigurationException,
               org.xml.sax.SAXException
    {
        // Create a factory object for creating DOM parsers and configure it
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setIgnoringComments(true); // We want to ignore comments
        factory.setCoalescing(true);       // Convert CDATA to Text nodes
        factory.setNamespaceAware(false);  // No namespaces: this is default
        factory.setValidating(false);      // Don't validate DTD: also default

        // Now use the factory to create a DOM parser, a.k.a. DocumentBuilder
        DocumentBuilder parser = factory.newDocumentBuilder();

        // Parse the file and build a Document tree to represent its content
        Document document = parser.parse(new File(args[0]));

        // Ask the document for a list of all <sect1> elements it contains
        NodeList sections = document.getElementsByTagName("sect1");
        // Loop through those <sect1> elements one at a time
        int numSections = sections.getLength();
        for(int i = 0; i < numSections; i++) {
            Element section = (Element)sections.item(i);  // A <sect1>
            // The first Element child of each <sect1> should be a <title>
            // element, but there may be some whitespace Text nodes first, so 
            // loop through the children until you find the first element 
            // child.
            Node title = section.getFirstChild();
            while(title != null && title.getNodeType() != Node.ELEMENT_NODE) 
                title = title.getNextSibling();
            // Print the text contained in the Text node child of this element.
            // An empty <title/> has no children, so check for that first.
            if (title != null && title.getFirstChild() != null)
                System.out.println(title.getFirstChild().getNodeValue());
        }
    }
}
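
As a variation on the tree walk above, the sketch below parses a string instead of a file and collects the titles into a list. It uses the DOM Level 3 method getTextContent(), which gathers the text of all descendants of a node and so also copes with titles that contain nested markup. The class name SectionTitles and the helper titles() are hypothetical, and the sketch assumes the first element child of each <sect1> is the one of interest.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class SectionTitles {
    // Hypothetical helper: returns the title text of every <sect1> element
    static List<String> titles(String xml)
        throws SAXException, IOException, ParserConfigurationException
    {
        DocumentBuilder parser =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document =
            parser.parse(new InputSource(new StringReader(xml)));

        List<String> result = new ArrayList<String>();
        NodeList sections = document.getElementsByTagName("sect1");
        for (int i = 0; i < sections.getLength(); i++) {
            // Skip whitespace Text nodes to reach the first Element child
            Node child = sections.item(i).getFirstChild();
            while (child != null && child.getNodeType() != Node.ELEMENT_NODE)
                child = child.getNextSibling();
            // getTextContent() concatenates all descendant text and returns
            // an empty string, not null, for an element with no children
            if (child != null && "title".equals(child.getNodeName()))
                result.add(child.getTextContent().trim());
        }
        return result;
    }
}
```

A call such as SectionTitles.titles(xmlString) returns one entry per <sect1> whose first element child is a <title>; sections that start with some other element are skipped.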

5.12.3. Transforming XML Documents

The javax.xml.transform package defines a TransformerFactory class for creating Transformer objects. A transformer can transform a document from its Source representation into a new Result representation and optionally apply an XSLT transformation to the document content in the process. Three subpackages define concrete implementations of the Source and Result interfaces, which allow documents to be transformed among three representations:

javax.xml.transform.stream: Represents documents as streams of XML text.
javax.xml.transform.dom: Represents documents as DOM Document trees.
javax.xml.transform.sax: Represents documents as sequences of SAX method calls.

The following code shows one use of these packages to transform the representation of a document from a DOM Document tree into a stream of XML text. An interesting feature of this code is that it does not create the Document tree by parsing a file; instead, it builds it up from scratch.

import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;

public class DOMToStream {
    public static void main(String[] args) 
        throws ParserConfigurationException,
               TransformerConfigurationException,
               TransformerException
    {
        // Create a DocumentBuilderFactory and a DocumentBuilder
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        // Instead of parsing an XML document, however, just create an empty
        // document that you can build up yourself.
        Document document = db.newDocument();

        // Now build a document tree using DOM methods
        Element book = document.createElement("book"); // Create new element
        book.setAttribute("id", "javanut4");           // Give it an attribute
        document.appendChild(book);                    // Add to the document
        for(int i = 1; i <= 3; i++) {                  // Add more elements
            Element chapter = document.createElement("chapter");
            Element title = document.createElement("title");
            title.appendChild(document.createTextNode("Chapter " + i));
            chapter.appendChild(title);
            chapter.appendChild(document.createElement("para"));
            book.appendChild(chapter);
        }

        // Now create a TransformerFactory and use it to create a Transformer
        // object to transform our DOM document into a stream of XML text.
        // No arguments to newTransformer() means no XSLT stylesheet
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();

        // Create the Source and Result objects for the transformation
        DOMSource source = new DOMSource(document);          // DOM document
        StreamResult result = new StreamResult(System.out);  // to XML text

        // Finally, do the transformation
        transformer.transform(source, result);
    }
}
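
A StreamResult can wrap any Writer, so the same identity transformation can serialize a DOM tree to a string. The sketch below illustrates this and also shows how output properties from javax.xml.transform.OutputKeys adjust the serialized form. The class name DOMToString and its two helpers are hypothetical; exact whitespace in the output may vary between Transformer implementations.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DOMToString {
    // Hypothetical helper: serialize a DOM document to a String
    static String serialize(Document document) throws TransformerException {
        // No stylesheet argument: this is an identity transformation
        Transformer transformer =
            TransformerFactory.newInstance().newTransformer();
        // Leave out the <?xml ...?> declaration and do not add indentation
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, "no");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }

    // Hypothetical helper: build a small document tree from scratch
    static Document sample() throws ParserConfigurationException {
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element book = document.createElement("book");
        book.setAttribute("id", "javanut4");
        Element title = document.createElement("title");
        title.appendChild(document.createTextNode("XML"));
        book.appendChild(title);
        document.appendChild(book);
        return document;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(sample()));
    }
}
```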

The most interesting uses of javax.xml.transform involve XSLT stylesheets. XSLT is a complex but powerful XML grammar that describes how XML document content should be converted to another form (e.g., XML, HTML, or plain text). A tutorial on XSLT stylesheets is beyond the scope of this book, but the following code (which contains only six key lines) shows how you can apply such a stylesheet (which is an XML document itself) to another XML document and write the resulting document to a stream:

import java.io.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;

public class Transform {
    public static void main(String[] args)
        throws TransformerConfigurationException,
               TransformerException
    {
        // Get Source and Result objects for input, stylesheet, and output
        StreamSource input = new StreamSource(new File(args[0]));
        StreamSource stylesheet = new StreamSource(new File(args[1]));
        StreamResult output = new StreamResult(new File(args[2]));

        // Create a transformer and perform the transformation
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer(stylesheet);
        transformer.transform(input, output);
    }
}
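
To make the stylesheet step concrete without relying on external files, here is a minimal sketch that applies a small XSLT 1.0 stylesheet, held in a string, to an XML document that is also held in a string. The stylesheet, the class name InlineTransform, and the helper apply() are illustrative assumptions, not part of the original example. The stylesheet simply writes the text of each <title> element on its own line.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class InlineTransform {
    // An illustrative stylesheet: output each <title> as a line of plain text
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'" +
        "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" +
        "  <xsl:output method='text'/>" +
        "  <xsl:template match='/'>" +
        "    <xsl:for-each select='//title'>" +
        "      <xsl:value-of select='.'/><xsl:text>&#10;</xsl:text>" +
        "    </xsl:for-each>" +
        "  </xsl:template>" +
        "</xsl:stylesheet>";

    // Hypothetical helper: apply STYLESHEET to the given XML text
    static String apply(String xml) throws TransformerException {
        // Passing a Source to newTransformer() compiles it as a stylesheet
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(apply(
            "<book><chapter><title>One</title></chapter>" +
            "<chapter><title>Two</title></chapter></book>"));
    }
}
```

If the same stylesheet is applied to many documents, TransformerFactory.newTemplates() can be used to compile it once and create a new Transformer for each document.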

5.12.4. Validating XML Documents

The javax.xml.validation package allows you to validate XML documents against a schema. SAX and DOM parsers obtained from the javax.xml.parsers package can perform validation against a DTD during the parsing process, but this package separates validation from parsing and also provides general support for arbitrary schema types. All implementations must support W3C XML Schema and are allowed to support other schema types, such as RELAX NG.

To use this package, begin with a SchemaFactory instance, which is a parser for a specific type of schema. Use this parser to parse a schema file into a Schema object. Obtain a Validator from the Schema, and then use the Validator to validate your XML document. The document is specified as a SAXSource or DOMSource object. You may recall these classes from the subpackages of javax.xml.transform.

If the document is valid, the validate() method of the Validator object returns normally. If it is not valid, validate() throws a SAXException. You can install an org.xml.sax.ErrorHandler object for the Validator to provide some control over the kinds of validation errors that cause exceptions.

import javax.xml.XMLConstants;
import javax.xml.validation.*;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.*;
import java.io.*;

public class Validate {
    public static void main(String[] args) throws IOException {
        File documentFile = new File(args[0]);  // 1st arg is document
        File schemaFile = new File(args[1]);    // 2nd arg is schema 

        // Get a parser to parse W3C schemas.  Note use of javax.xml package
        // This package contains just one class of constants.
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        
        // Now parse the schema file to create a Schema object
        Schema schema = null;
        try { schema = factory.newSchema(schemaFile); }
        catch(SAXException e) { fail(e); }

        // Get a Validator object from the Schema.
        Validator validator = schema.newValidator();

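        // Get a SAXSource object for the document. Identify the document by
        // its URI so that the parser can detect the character encoding and
        // error messages can include the system ID.
        // We could use a DOMSource here as well
        SAXSource source =
            new SAXSource(new InputSource(documentFile.toURI().toString()));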
        
        // Now validate the document
        try { validator.validate(source); }
        catch(SAXException e) { fail(e); }

        System.err.println("Document is valid");
    }

    static void fail(SAXException e) {
        if (e instanceof SAXParseException) {
            SAXParseException spe = (SAXParseException) e;
            System.err.printf("%s:%d:%d: %s%n",
                              spe.getSystemId(), spe.getLineNumber(),
                              spe.getColumnNumber(), spe.getMessage());
        }
        else {
            System.err.println(e.getMessage());
        }
        System.exit(1);
    }
}
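
The ErrorHandler mechanism mentioned above can be sketched as follows. In this illustration, which is an assumption-laden sketch and not the original author's code, the handler records ordinary validation errors in a list instead of letting the first one abort validation, and it rethrows only fatal errors such as malformed XML. The schema, the class name CollectErrors, and the helper validate() are hypothetical. The wording and number of messages reported for a given problem depend on the validator implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class CollectErrors {
    // An illustrative schema: a <book> holds a <title> and a <pages> count
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        "  <xs:element name='book'>" +
        "    <xs:complexType><xs:sequence>" +
        "      <xs:element name='title' type='xs:string'/>" +
        "      <xs:element name='pages' type='xs:positiveInteger'/>" +
        "    </xs:sequence></xs:complexType>" +
        "  </xs:element>" +
        "</xs:schema>";

    // Hypothetical helper: returns the messages of all validation errors
    static List<String> validate(String xml) throws SAXException, IOException {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema =
            factory.newSchema(new StreamSource(new StringReader(SCHEMA)));
        Validator validator = schema.newValidator();

        final List<String> problems = new ArrayList<String>();
        validator.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) {
                problems.add("warning: " + e.getMessage());
            }
            public void error(SAXParseException e) {
                problems.add(e.getMessage());  // Record it and keep going
            }
            public void fatalError(SAXParseException e) throws SAXException {
                throw e;                       // Cannot continue: give up
            }
        });

        validator.validate(
            new SAXSource(new InputSource(new StringReader(xml))));
        return problems;  // An empty list means the document is valid
    }
}
```

With this handler installed, validate() returns normally for a document that is well-formed but invalid, and the caller inspects the returned list to decide what to do.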

5.12.5. Evaluating XPath Expressions

XPath is a language for referring to specific nodes in an XML document. For example, the XPath expression "//section/title/text()" refers to the text inside a <title> element that is a child of a <section> element at any depth within the document. A full description of the XPath language is beyond the scope of this book. The javax.xml.xpath package, new in Java 5.0, provides a way to find all nodes in a document that match an XPath expression.

import javax.xml.xpath.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;

public class XPathEvaluator {
    public static void main(String[] args)
        throws ParserConfigurationException, XPathExpressionException,
               org.xml.sax.SAXException, java.io.IOException
    {
        String documentName = args[0];
        String expression = args[1];

        // Parse the document to a DOM tree
        // XPath can also be used with a SAX InputSource
        DocumentBuilder parser =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = parser.parse(new java.io.File(documentName));
        
        // Get an XPath object to evaluate the expression
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Evaluate the expression and print its result converted to a string
        System.out.println(xpath.evaluate(expression, doc));

        // Or evaluate the expression to obtain a DOM NodeList of all matching
        // nodes.  Then loop through each of the resulting nodes
        NodeList nodes = (NodeList)xpath.evaluate(expression, doc,
                                                  XPathConstants.NODESET);
        for(int i = 0, n = nodes.getLength(); i < n; i++) {
            Node node = nodes.item(i);
            System.out.println(node);
        }
    }
}
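
The following sketch shows two further options the package offers: compiling an expression into a reusable XPathExpression object, and requesting a result type other than a node set. It evaluates expressions directly against an InputSource, so no DOM parser has to be created by the caller. The class name XPathTitles and the helpers select() and number() are hypothetical names used for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathTitles {
    // Hypothetical helper: returns the value of each node the expression
    // selects. Intended for expressions that select text or attribute nodes.
    static List<String> select(String xml, String expression)
        throws XPathExpressionException
    {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // A compiled expression can be evaluated against many documents
        XPathExpression compiled = xpath.compile(expression);
        NodeList nodes = (NodeList) compiled.evaluate(
            new InputSource(new StringReader(xml)), XPathConstants.NODESET);

        List<String> values = new ArrayList<String>();
        for (int i = 0, n = nodes.getLength(); i < n; i++)
            values.add(nodes.item(i).getNodeValue());
        return values;
    }

    // Hypothetical helper: evaluates a numeric expression such as count(...)
    static double number(String xml, String expression)
        throws XPathExpressionException
    {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (Double) xpath.evaluate(expression,
            new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><section><title>A</title>" +
                     "<section><title>B</title></section></section></doc>";
        System.out.println(select(xml, "//section/title/text()"));
        System.out.println(number(xml, "count(//title)"));
    }
}
```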