Parsing XML with the SAX API

by Matthew Phillips

What is SAX?

SAX (the Simple API for XML) is an event-driven API for parsing XML. A SAX parser uses the Observer design pattern to pass pieces of the document back to the client as it reads them. If you are already familiar with this pattern, you may skip ahead to the Setup section. Unlike other implementations of the Observer pattern, such as the AWT event model, only one Observer (one ContentHandler) can be registered with the parser at a time.

The Observer pattern

The Observer pattern allows you to register an event listener with an event generator. When the event generator creates an event, it passes control to the event listener to process that event. If you are familiar with AWT or Swing, you have seen this in the way they handle events. You simply add your event listener (such as an ActionListener) to your event generator (such as a Button, through its addActionListener method). When the button is pressed, an event is generated and control is passed to the ActionListener.
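
The registration and callback steps described above can be sketched in plain Java. This is only an illustration: PressListener, PressSource, and ObserverSketch are hypothetical names invented for this sketch, not AWT or SAX classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener interface: the "observer" side of the pattern.
interface PressListener {
    void pressed(String source);
}

// Hypothetical event generator, loosely modeled on a GUI button.
class PressSource {
    private final String name;
    private final List<PressListener> listeners = new ArrayList<>();

    PressSource(String name) {
        this.name = name;
    }

    // Registration: the generator remembers who wants to hear about events.
    void addPressListener(PressListener listener) {
        listeners.add(listener);
    }

    // When the event happens, control passes to each registered listener.
    void press() {
        for (PressListener listener : listeners) {
            listener.pressed(name);
        }
    }
}

class ObserverSketch {
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        PressSource ok = new PressSource("OK");
        ok.addPressListener(source -> log.add("pressed: " + source));
        ok.press();
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A SAX parser plays the role of PressSource here, except that it holds a single ContentHandler rather than a list of listeners.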

Setup

The only thing that you need (aside from a Java development environment, such as the SDK) is a parser implementation. J2SE 1.4 includes the Crimson parser, but for this tutorial I will be using the Xerces parser available from The Apache Software Foundation. As I go over running the examples, I will show you where to make changes for a parser other than Xerces. If the parser you choose is a standard implementation you should not need to change anything in the included code. The sample XML document that we will be parsing is located here. We will also be using a Book JavaBean.

The ContentHandler interface

The ContentHandler interface is where you tell the parser how you want a document to be handled. It is the event listener. You may view the sample implementation that I am using here.

I'll give a brief explanation of the methods that do nothing in this particular example and then move on to the methods that do the work. The setDocumentLocator method allows the parser to pass in a Locator object that knows the line and column of the document that the parser has reached. This can be helpful for debugging, but parsers are not required to supply a Locator, so verify that yours does before relying on it.
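
As a rough sketch of how a Locator might be used, the handler below stores the Locator it is given and records a line number for each start tag. The class name LocatorSketch and the inline XML are assumptions made for this illustration, and the sketch obtains its parser through JAXP's SAXParserFactory simply to stay self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that reports where each element starts.
class LocatorSketch extends DefaultHandler {
    private Locator locator;  // stays null if the parser supplies no Locator
    final List<String> seen = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Fall back gracefully when no Locator was provided.
        String where = (locator == null) ? "unknown" : "line " + locator.getLineNumber();
        seen.add(qName + " at " + where);
    }

    static List<String> demo() {
        String xml = "<library>\n  <book/>\n</library>";
        LocatorSketch handler = new LocatorSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.seen;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```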

The startDocument method is called by the parser when it begins to read the document. The endDocument method is called when the parser finishes with the document. If you have any start of document or end of document processing (such as opening a database connection) it should be done in these methods.

If you are using namespaces, the startPrefixMapping and endPrefixMapping methods tell you when a prefix-to-URI mapping comes into scope and when it goes out of scope, so you can do any processing that a specific namespace requires.
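
A minimal sketch of these two callbacks follows. The class name PrefixMappingSketch, the lib prefix, and the urn:example:library URI are made up for the illustration. The sketch turns on namespace awareness explicitly because a parser obtained from JAXP's SAXParserFactory has it off by default.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that logs prefix mappings as they enter and leave scope.
class PrefixMappingSketch extends DefaultHandler {
    final List<String> log = new ArrayList<>();

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        log.add("start " + prefix + " -> " + uri);
    }

    @Override
    public void endPrefixMapping(String prefix) {
        log.add("end " + prefix);
    }

    static List<String> demo() {
        // The prefix and URI below are placeholders, not part of the article's sample.
        String xml = "<lib:library xmlns:lib=\"urn:example:library\"/>";
        PrefixMappingSketch handler = new PrefixMappingSketch();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);  // required for prefix-mapping events
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```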

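The ignorableWhitespace method allows you to process white space in element content that the parser has determined to be ignorable, such as the indentation between tags. Validating parsers must report such white space through this method; a non-validating parser may deliver it through the characters method instead.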

The processingInstruction method is called when the parser reads a processing instruction from the XML document.

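The skippedEntity method is called when the parser skips an entity. This typically happens when a non-validating parser finds a reference to an entity whose declaration it has not read and therefore cannot resolve.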

Now we will turn to the meat of the ContentHandler implementation. Our startElement method receives four parameters. The namespaceURI is the URI of the namespace that the element belongs to. The localName is the element name without any prefix or colon. The qualifiedName is the full name of the element, including any prefix. The attributes parameter holds the attributes of the element. There are two things we are concerned with when a new element occurs. The first is to instantiate a new StringBuffer to hold the text content that follows the start tag. The second applies only when the element is a book element. In that case we need to get the value of the isbn attribute, instantiate a new Book, instantiate a new List of authors, and assign the isbn to the book. We keep a copy of the isbn in a member variable for later use when adding the book to the library.
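
To make the four parameters concrete, here is a small sketch that prints what a parser passes to startElement for a prefixed book element. The class name StartElementSketch, the lib prefix, the namespace URI, and the isbn value are placeholders invented for this example. The sample document for this tutorial may not use namespaces at all.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records the arguments of the first startElement call.
class StartElementSketch extends DefaultHandler {
    String summary = "";

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qualifiedName, Attributes attributes) {
        summary = "uri=" + namespaceURI
                + " local=" + localName
                + " qName=" + qualifiedName
                + " isbn=" + attributes.getValue("isbn");  // lookup by attribute name
    }

    static String demo() {
        // Placeholder document: prefix, URI, and isbn are illustrative only.
        String xml = "<lib:book xmlns:lib=\"urn:example:library\" isbn=\"1234567890\"/>";
        StartElementSketch handler = new StartElementSketch();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);  // so namespaceURI and localName are filled in
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.summary;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```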

The characters method is called by the parser when it reads the text content between the tags. The parser may call this method several times for a single run of text, so it passes a char array together with the start position of the content in question and the length of that content. The array itself may contain other content, depending on the parser implementation, so read only the range you are given. Our specific need in this example is to append the content to the StringBuffer we instantiated in startElement.
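
The buffering described above might look like the following sketch. CharactersSketch and its title text are hypothetical, and the sketch uses StringBuilder where the tutorial's code uses StringBuffer. The entity reference in the sample text is there because it is a common reason for a parser to deliver text in more than one call, though how text is split is up to the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that accumulates text across characters() calls.
class CharactersSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    int calls = 0;  // how many chunks the parser delivered

    @Override
    public void characters(char[] ch, int start, int length) {
        calls++;
        // Append only the slice we were given; the rest of the array is not ours.
        text.append(ch, start, length);
    }

    static String demo() {
        String xml = "<title>Cats &amp; Dogs</title>";  // placeholder content
        CharactersSketch handler = new CharactersSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Whatever the number of calls, the accumulated buffer holds the complete text once the end tag is reached.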

The final method, endElement, is called when the parser encounters an end tag. At that point we need to pass our content to the appropriate method of the book. Because the parser includes all white space when it calls the characters method, we trim it off when we convert the buffer to a String. Using the qualifiedName of the element, we can determine which method to call to assign the data to the Book. If the book tag is being closed, we also need to assign the authors to the book and place the book in our library Map.
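
Putting the three working methods together, a handler along these lines would build the library. This is a sketch under stated assumptions, not the tutorial's actual handler. BookSketch stands in for the Book JavaBean, and the document layout (a library element containing book elements with an isbn attribute and title and author children) is a guess at the sample document's structure.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the tutorial's Book JavaBean.
class BookSketch {
    String isbn;
    String title;
    List<String> authors = new ArrayList<>();
}

class LibraryHandlerSketch extends DefaultHandler {
    final Map<String, BookSketch> library = new LinkedHashMap<>();
    private StringBuilder text = new StringBuilder();
    private BookSketch book;
    private List<String> authors;
    private String isbn;  // kept so the book can be filed under it later

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text = new StringBuilder();  // fresh buffer for this element's text
        if ("book".equals(qName)) {
            isbn = atts.getValue("isbn");
            book = new BookSketch();
            book.isbn = isbn;
            authors = new ArrayList<>();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String content = text.toString().trim();  // drop surrounding white space
        if ("title".equals(qName)) {
            book.title = content;
        } else if ("author".equals(qName)) {
            authors.add(content);
        } else if ("book".equals(qName)) {
            book.authors = authors;
            library.put(isbn, book);
        }
    }

    static Map<String, BookSketch> demo() {
        // Assumed layout and placeholder values, not the tutorial's sample file.
        String xml = "<library>\n"
                + "  <book isbn=\"1234567890\">\n"
                + "    <title> An Example Title </title>\n"
                + "    <author>First Author</author>\n"
                + "    <author>Second Author</author>\n"
                + "  </book>\n"
                + "</library>";
        LibraryHandlerSketch handler = new LibraryHandlerSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.library;
    }

    public static void main(String[] args) {
        demo().forEach((key, b) -> System.out.println(key + ": " + b.title + " by " + b.authors));
    }
}
```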

Putting it together

The class that starts the parsing process is found here. The first line of interest is where we instantiate the XMLReader. The no-parameter createXMLReader method of the XMLReaderFactory creates the default XML parser (I will show you how to set this in the next section). You could also pass a String to the method with the fully qualified class name of the parser that you want to use. The next two lines instantiate the Map we are using to represent the library and the ContentHandler from the previous section. After instantiating the ContentHandler, we register it with the parser. As stated before, only one ContentHandler may be registered with the parser. After registering the content handler we parse the document. The rest of the code reads through the Map and prints the results.
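
The sequence of steps just described (create the XMLReader, register the handler, parse) can be sketched as follows. ReaderSetupSketch, its small anonymous handler, and the inline XML are assumptions made for this illustration. Note that XMLReaderFactory has been deprecated since Java 9 in favor of JAXP's SAXParserFactory, which is why the sketch suppresses the deprecation warning.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

class ReaderSetupSketch {
    @SuppressWarnings("deprecation")  // XMLReaderFactory is deprecated since Java 9
    static List<String> demo() {
        // Placeholder document; the isbn values are illustrative only.
        String xml = "<library><book isbn=\"111\"/><book isbn=\"222\"/></library>";
        List<String> isbns = new ArrayList<>();

        // A tiny stand-in handler that just collects isbn attributes.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("book".equals(localName)) {
                    isbns.add(atts.getValue("isbn"));
                }
            }
        };

        try {
            // No argument: use the default parser (or the org.xml.sax.driver property).
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.setContentHandler(handler);  // replaces any previously set handler
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return isbns;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Calling setContentHandler a second time would replace the first handler rather than add to it, which is the single-observer behavior mentioned earlier.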

Running the code

After you compile the code you will need to add some parameters to run it. Assuming that you are using the Xerces parser you will type the following on the command line to run the code:

        java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser PrintLibrary
        
You should see the books output to the command line.

Summary

There is a great deal more to the SAX API than what I have demonstrated here, but this should be enough to get you going. Xerces includes examples of its own, although I have not looked at them closely. An excellent book on parsing XML documents with Java is Processing XML with Java: A Guide to SAX, DOM, JDOM, JAXP, and TrAX by Elliotte Rusty Harold.