程式扎記: [ In Action ] Working with XML - Reading XML documents

Tuesday, August 18, 2015

Preface:
When working with XML, we have to somehow read it to begin with. This section will lead you through the many options available in Groovy for parsing XML: the normal DOM route, enhanced by Groovy; Groovy’s own XmlParser and XmlSlurper classes; SAX event-based parsing; and the recently introduced StAX pull-parsers.

Let’s suppose we have a little datastore in XML format for planning our Groovy self-education activities. In this datastore, we capture how many hours per week we can invest in this training, what tasks need to be done, and how many hours each task will eat up in total. To keep track of our progress, we will also store how many hours are “done” for each task. Listing 12.1 shows our XML datastore as it resides in a file named data/plan.xml.
- Listing 12.1 The example datastore data/plan.xml
We plan for two weeks, with eight hours for education each week. Three tasks are scheduled for the current week: reading this chapter (two hours for a quick reader), playing with the newly acquired knowledge (three hours of real fun), and using it in the real world (one hour done and one still left). This will be our running example for most of the chapter.
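
The listing body is not reproduced in this copy of the post. As a stand-in, the sketch below holds a datastore that is consistent with the description above and with the assertions used in the later listings (five tasks, six hours done in total). The title of the second task and the total values in week two are not stated anywhere in the text and are assumptions; the groovy.xml import assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlSlurper   // assumption: Groovy 3+; older versions use groovy.util.XmlSlurper

// Stand-in for data/plan.xml, reconstructed from the prose description.
// ASSUMPTIONS: the title 'try some reporting' and the week-two totals.
def planXml = '''<plan>
  <week capacity="8">
    <task done="2" total="2" title="read XML chapter"/>
    <task done="3" total="3" title="try some reporting"/>
    <task done="1" total="2" title="use in current project"/>
  </week>
  <week capacity="8">
    <task done="0" total="1" title="re-read DB chapter"/>
    <task done="0" total="3" title="use DB/XML combination"/>
  </week>
</plan>'''

def plan = new XmlSlurper().parseText(planXml)

// Two weeks, eight hours each
assert plan.week.size() == 2
assert plan.week.every { it.'@capacity'.text() == '8' }

// Three tasks in the current week, all tasks of week two untouched
assert plan.week[0].task.size() == 3
assert plan.week[1].task.every { it.'@done'.text() == '0' }

// Six hours done overall
assert plan.week.task.'@done'.collect { it.text().toInteger() }.sum() == 6
```

Saving the string as data/plan.xml should be enough to try the file-based listings that follow.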

For reading such a datastore, we will present several different approaches: first using technologies built into the JRE, and then using the Groovy parsers. We’ll start with the more familiar DOM parser.

Working with a DOM parser:
Why do we bother with Java’s classic DOM parsers? Shouldn’t we restrict ourselves to showing only Groovy specifics here? Well, first of all, even in Groovy code, we sometimes need DOM objects for further processing, for example when applying XPath expressions to an object, as we will explain in section 12.2.3. For that reason, we show the Groovy way of retrieving the DOM representation of our datastore with the help of Java’s DOM parsers. Second, there is basic Groovy support for dealing with DOM NodeLists, and Groovy also provides extra helper classes to simplify common tasks within DOM.

Finally, it’s much easier to appreciate how slick the Groovy parsers are after having seen the “old” way of reading XML. We start by loading a DOM tree into memory.
Getting the document
Not surprisingly, the Document Object Model is based around the central abstraction of a document, realized as the Java interface org.w3c.dom.Document. An object of this type will hold our datastore. The Java way of retrieving a document is through the parse method of a DocumentBuilder (= parser). This method takes an InputStream to read the XML from. So a first attempt at reading is:
    def doc = builder.parse(new FileInputStream('data/plan.xml'))
Now, where does builder come from? We are working slowly backward to find a solution. The builder must be of type DocumentBuilder. Instances of this type are delivered from a DocumentBuilderFactory, which has a factory method called newDocumentBuilder:
    import javax.xml.parsers.DocumentBuilderFactory as fac

    def builder = fac.newInstance().newDocumentBuilder()
    def doc     = builder.parse(new FileInputStream('data/plan.xml'))
Java’s XML handling API is designed with flexibility in mind. A downside of this flexibility is that for our simple example, we have a few hoops to jump through in order to retrieve our file. It’s not too bad, though, and now that we have it we can dive into the document.

Walking the DOM
The document object is not yet the root of our datastore. In order to get the top-level element, which is plan in our case, we have to ask the document for its documentElement property:
    def plan = doc.documentElement
We can now work with the plan variable. It’s of type org.w3c.dom.Node and so it can be asked for its nodeType and nodeName. The nodeType is Node.ELEMENT_NODE, and nodeName is plan. The design of such DOM nodes is a bit strange (to put it mildly). Every node has the same properties, such as nodeType, nodeName, nodeValue, childNodes, and attributes (to name only a few; see the API documentation for the full list). However, what is stored in these properties and how they behave depends on the value of the nodeType property. We will deal with types ELEMENT_NODE, ATTRIBUTE_NODE, and TEXT_NODE (see the API documentation for the exhaustive list).

It is not surprising that XML elements are stored in nodes of type ELEMENT_NODE, but it is surprising that attributes are also stored in node objects (of nodeType ATTRIBUTE_NODE). To make things even more complex, each value of an attribute is stored in an extra node object (with nodeType TEXT_NODE). This complexity is a large part of the reason why simpler APIs such as JDOM, dom4j, and XOM have become popular.

As an example, the nodes and their names, types, and values are depicted in figure 12.1 for the first week element in the datastore.
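
Since the figure is not available here, a small sketch may help make the three node types concrete. It parses a one-element fragment from a string (the fragment and variable names are illustrative, not taken from the original listing) and inspects the element, its attribute node, and the text node that holds the attribute value.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Node   // explicit import; Groovy would otherwise pick groovy.util.Node

// Illustrative fragment: a plan with a single, empty week element
def xml = '<plan><week capacity="8"/></plan>'
def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
def doc = builder.parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))

// The element node: has a name but no value
def week = doc.documentElement.firstChild
assert week.nodeType  == Node.ELEMENT_NODE
assert week.nodeName  == 'week'
assert week.nodeValue == null

// The attribute is a node of its own
def capacity = week.attributes.getNamedItem('capacity')
assert capacity.nodeType  == Node.ATTRIBUTE_NODE
assert capacity.nodeName  == 'capacity'
assert capacity.nodeValue == '8'

// The attribute value is held in a child text node
def text = capacity.firstChild
assert text.nodeType  == Node.TEXT_NODE
assert text.nodeValue == '8'
```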


The fact that node objects behave differently with respect to their nodeType leads to code that needs to work with this distinction. For example, when reading information from a node, we need a method such as this:
    import org.w3c.dom.Node   // needed: Groovy otherwise resolves Node to groovy.util.Node

    // Describe a DOM node according to its nodeType
    String info(node) {
        switch (node.nodeType) {
            case Node.ELEMENT_NODE:
                 return 'element: '+ node.nodeName
            case Node.ATTRIBUTE_NODE:
                 return "attribute: ${node.nodeName}=${node.nodeValue}"
            case Node.TEXT_NODE:
                 return 'text: '+ node.nodeValue
        }
        return 'some other type: '+ node.nodeType
    }
With this helper method, we have almost everything we need to read information from our datastore. Two pieces of information are not yet explained: the types of the childNodes and attributes properties. The childNodes property is of type org.w3c.dom.NodeList. Unfortunately, it doesn’t extend the java.util.List interface but provides its own methods, getLength and item(index). This makes it inconvenient to work with. However, as you saw in section 9.1.3, Groovy makes its object iteration methods (each, find, findAll, and so on) available on that type. The attributes property is of type org.w3c.dom.NamedNodeMap, which doesn’t extend java.util.Map either. We will use its getNamedItem(name) method.

Listing 12.2 puts all this together and reads our plan from the XML datastore, walking into the first task of the first week.
- Listing 12.2 Reading plan.xml with the classic DOM parser
    import javax.xml.parsers.DocumentBuilderFactory
    import org.w3c.dom.Node

    String info(node) {
        switch (node.nodeType) {
            case Node.ELEMENT_NODE:
                 return 'element: '+ node.nodeName
            case Node.ATTRIBUTE_NODE:
                 return "attribute: ${node.nodeName}=${node.nodeValue}"
            case Node.TEXT_NODE:
                 return 'text: '+ node.nodeValue
        }
        return 'some other type: '+ node.nodeType
    }

    def fac     = DocumentBuilderFactory.newInstance()
    def builder = fac.newDocumentBuilder()
    def doc     = builder.parse(new FileInputStream('data/plan.xml'))
    def plan    = doc.documentElement
    assert 'element: plan' == info(plan)

    // 1) Object iteration method
    def week =  plan.childNodes.find{'week' == it.nodeName}
    assert 'element: week' == info(week)

    // 2) Indexed access
    def task =  week.childNodes.item(1)
    assert 'element: task' == info(task)

    def title = task.attributes.getNamedItem('title')
    assert 'attribute: title=read XML chapter' == info(title)
Note how we use the object iteration method find (1) to access the first week element under plan. We use indexed access to the first task child node at (2). But why is the index one and not zero? Because in our XML document, there is a line break between week and task. The DOM parser generates a text node containing this line break (and surrounding whitespace) and adds it as the first child node of week (at index zero). The task node floats to the second position with index one.
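
The whitespace text nodes can be made visible with a short sketch. It parses an illustrative two-task fragment from a string (not the original file), checks that index zero is a whitespace-only text node, and then shows one possible way to sidestep index arithmetic by filtering on nodeType.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Node

// Illustrative fragment with line breaks and indentation between elements
def xml = '''<week capacity="8">
  <task title="read XML chapter"/>
  <task title="try some reporting"/>
</week>'''

def week = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))
        .documentElement

def children = week.childNodes

// text, task, text, task, text
assert children.length == 5
assert children.item(0).nodeType == Node.TEXT_NODE
assert children.item(0).nodeValue.trim() == ''   // only line break and indentation
assert children.item(1).nodeName == 'task'

// Keep element nodes only, so positions no longer depend on the file's layout
def tasks = children.findAll { it.nodeType == Node.ELEMENT_NODE }
assert tasks*.nodeName == ['task', 'task']
assert tasks[0].attributes.getNamedItem('title').nodeValue == 'read XML chapter'
```

Filtering by type is more robust than a fixed index if the file is ever reformatted, at the cost of one extra pass over the child list.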

Making DOM groovier
Groovy wouldn’t be groovy without a convenience method for the lengthy parsing prework:
    def doc  = groovy.xml.DOMBuilder.parse(new FileReader('data/plan.xml'))
    def plan = doc.documentElement
NOTE.
The DOMBuilder is not only for convenient parsing. As the name suggests, it is a builder and can be used like any other builder (see chapter 8). It returns a tree of org.w3c.dom.Node objects just as if they’d been parsed from an XML document. You can add it to another tree, write it to XML, or query it using XPath (see section 12.2.3).

Dealing with child nodes and attributes as in listing 12.2 doesn’t feel groovy either. Therefore, Groovy provides a DOMCategory that you can use for simplified access. With this, you can index child nodes via the subscript operator or via their node name. You can refer to attributes by getting the @attributeName property:
    use(groovy.xml.dom.DOMCategory){
        assert 'plan' == plan.nodeName
        assert 'week' == plan[1].nodeName
        assert 'week' == plan.week.nodeName
        assert '8'    == plan[1].'@capacity'
    }
Although not shown in the example, DOMCategory has recently been improved to provide additional syntax shortcuts such as name, text, children, iterator, parent, and attributes. We explain these shortcuts later in this chapter, because they originated in Groovy’s purpose-built XML parsing classes. Consult the online Groovy documentation for more details.

Reading with a Groovy parser:
The Groovy way of reading the plan datastore is so simple, we’ll dive headfirst into the solution as presented in listing 12.3.
- Listing 12.3 Reading plan.xml with Groovy’s XmlParser
    def plan = new XmlParser().parse(new File('data/plan.xml'))

    assert 'plan' == plan.name()
    assert 'week' == plan.week[0].name()
    assert 'task' == plan.week[0].task[0].name()
    assert 'read XML chapter' == plan.week[0].task[0].'@title'
The parser can work directly on File objects and other input sources, as you will see in table 12.2. The parser returns a groovy.util.Node . You already came across this type in section 8.2. That means we can easily use GPath expressions to walk through the tree, as shown with the assert statements.

Up to this point, you have seen that Groovy’s XmlParser provides all the functionality you first saw with the DOM parser. But there is more to come. In addition to the XmlParser, Groovy comes with the XmlSlurper. Let’s explore the commonalities and differences between those two before considering more advanced usages of each.

Commonalities between XmlParser and XmlSlurper
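Let’s start with the commonalities of XmlParser and XmlSlurper: They both reside in package groovy.util (from Groovy 3 onward they live in groovy.xml, and the groovy.util copies were removed in Groovy 4) and provide the constructors listed in table 12.1.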


Besides sharing constructors with the same parameter lists, the types share parsing methods with the same signatures. The only difference is that the parsing methods of XmlParser return objects of type groovy.util.Node whereas XmlSlurper returns GPathResult objects. Table 12.2 lists the uniform parse methods.


The result of the parse method is either a Node (for XmlParser) or a GPathResult (for XmlSlurper). Table 12.3 lists the common available methods for both result types. Note that because both types understand the iterator method, all object iteration methods are also instantly available.

GPathResult and groovy.util.Node provide additional shortcuts for method calls to the parent object and all descendent objects. Such shortcuts make reading a GPath expression more like other declarative path expressions such as XPath or Ant paths.


Objects of type Node and GPathResult can access both child elements and attributes as if they were properties of the current object. Table 12.4 shows the syntax and how the leading @ sign distinguishes attribute names from nested element names.
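
As the table is not reproduced here, a minimal sketch of the two access forms follows. It uses parseText on an illustrative fragment rather than the file; the groovy.xml imports assume Groovy 3 or later (on older versions the classes are found in groovy.util without an import).

```groovy
import groovy.xml.XmlParser    // assumption: Groovy 3+
import groovy.xml.XmlSlurper   // assumption: Groovy 3+

// Illustrative fragment, cut down from the plan datastore
def xml = '''<plan>
  <week capacity="8">
    <task done="2" total="2" title="read XML chapter"/>
  </week>
</plan>'''

def node = new XmlParser().parseText(xml)    // groovy.util.Node
def path = new XmlSlurper().parseText(xml)   // GPathResult

// A plain name selects child elements
assert node.week[0].name() == 'week'
assert path.week[0].name() == 'week'

// A leading @ selects an attribute of the current element
assert node.week[0].'@capacity' == '8'          // Node: the attribute value as a String
assert path.week[0].'@capacity'.text() == '8'   // GPathResult: ask for its text

// Both forms can be chained
assert node.week[0].task[0].'@title' == 'read XML chapter'
assert path.week[0].task[0].'@title'.text() == 'read XML chapter'
```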


Listing 12.4 plays with various method calls and uses GPath expressions to work on objects of type Node and GPathResult alike. It uses XmlParser to return Node objects and XmlSlurper to return a GPathResult. To make the similarities stand out, listing 12.4 shows doubled lines, one using Node, one using GPathResult.
- Listing 12.4 Using common methods of groovy.util.Node and GPathResult
    def node = new XmlParser().parse(new File('data/plan.xml'))
    def path = new XmlSlurper().parse(new File('data/plan.xml'))

    assert 'plan' == node.name()
    assert 'plan' == path.name()

    assert 2 == node.children().size()
    assert 2 == path.children().size()

    // 1) All tasks
    assert 5 == node.week.task.size()
    assert 5 == path.week.task.size()

    // 2) All hours done
    assert 6 == node.week.task.'@done'*.toInteger().sum()

    // 3) Second week
    assert path.week[1].task.every{ it.'@done' == '0' }
Note that the GPath expression node.week.task (1) first collects all child elements named week, and then, for each of those, collects all their child elements named task (compare the second row in table 12.4). In the case of node.week.task, we have a list of task nodes that we can ask for its size. In the case of path.week.task, we have a GPathResult that we can ask for its size. The interesting thing here is that the GPathResult can determine the size without collecting intermediate results (such as week and task nodes) in a temporary data structure such as a list. Instead, it stores whatever iteration logic is needed to determine the result and then executes that logic and returns the result (the size in this example).

At (2), you see that in GPath, attribute access has the same effect as access to child elements; node.week.task.'@done' results in a list of all values of the done attribute of all tasks of all weeks. We use the spread-dot operator (see section 7.5.1) to apply the toInteger method to all strings in that list, returning a list of integers. We finally use the GDK method sum on that list.

The line at (3) can be read as: “Assert that the done attribute in every task of week[1] is '0' .” What’s new here is using indexed access and the object iteration method every . Because indexing starts at zero, week[1] means the second week.

This example should serve as an appetizer for your own experiences with applying GPath expressions to XML documents. In addition to the convenient GPath notation, you might also wish to make use of traversal methods; for example, we could add the following lines to listing 12.4:
    assert 'plan->week->week->task->task->task->task->task' ==
            node.breadthFirst()*.name().join('->')
    assert 'plan->week->task->task->task->week->task->task' ==
            node.depthFirst()*.name().join('->')
So far, you have seen that XmlParser and XmlSlurper can be used in a similar fashion to produce similar results. But there would be no need for two separate classes if there wasn’t a difference. That’s what we cover next.

Differences between XmlParser and XmlSlurper
Despite the similarities between XmlParser and XmlSlurper when used for simple reading purposes, there are differences when it comes to more advanced reading tasks and when processing XML documents into other formats.

XmlParser uses the groovy.util.Node type and its GPath expressions result in lists of nodes. That makes working with XmlParser feel like there always is a tangible object representation of elements: something that we can inspect via toString, print, or change in-place. Because GPath expressions return lists of such elements, we can apply all our knowledge of the list datatype (see section 4.2).

This convenience comes at the expense of additional up-front processing and extra memory consumption. The GPath expression node.week.task.'@done' generates three lists: a temporary list of weeks (two entries), a temporary list of tasks (five entries), and a list of done attribute values (five strings) that is finally returned. This is reasonable for our small example but hampers processing large or deeply nested XML documents.

XmlSlurper in contrast does not store intermediate results when processing information after a document has been parsed. It avoids the extra memory hit when processing. Internally, XmlSlurper uses iterators instead of extra collections to reflect every step in the GPath. With this construction, it is possible to defer processing until the last possible moment.
NOTE.
This does not mean that XmlSlurper would work without storing the parsed information in memory. It still does, and the memory consumption rises with the size of the XML document. However, for processing that stored information via GPath, XmlSlurper does not need extra memory.
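
The difference in result types can be observed directly. The sketch below runs the same GPath expression against both parsers; the inline datastore is the assumed stand-in for data/plan.xml (second task title and week-two totals are assumptions), and the groovy.xml imports assume Groovy 3 or later.

```groovy
import groovy.xml.XmlParser    // assumption: Groovy 3+
import groovy.xml.XmlSlurper   // assumption: Groovy 3+

// Assumed stand-in for data/plan.xml
def xml = '''<plan>
  <week capacity="8">
    <task done="2" total="2" title="read XML chapter"/>
    <task done="3" total="3" title="try some reporting"/>
    <task done="1" total="2" title="use in current project"/>
  </week>
  <week capacity="8">
    <task done="0" total="1" title="re-read DB chapter"/>
    <task done="0" total="3" title="use DB/XML combination"/>
  </week>
</plan>'''

def node = new XmlParser().parseText(xml)
def path = new XmlSlurper().parseText(xml)

// XmlParser: every step yields a real list; the final one holds five strings
def doneList = node.week.task.'@done'
assert doneList instanceof List
assert doneList == ['2', '3', '1', '0', '0']

// XmlSlurper: the expression yields a GPathResult, not a collection;
// values are produced only when we iterate
def doneLazy = path.week.task.'@done'
assert !(doneLazy instanceof Collection)
assert doneLazy.collect { it.text() } == ['2', '3', '1', '0', '0']
```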

Table 12.5 lists the methods unique to Node . When using XmlParser, you can use these methods in your processing.


Table 12.6 lists the methods that are unique to or are optimized in GPathResult. As an example, we could add the following line to listing 12.4 to use the optimized findAll in GPathResult:
    assert 2 == path.week.task.findAll{ it.'@title' =~ 'XML' }.size()
Additionally, some classes may only work on one type or the other; for example, there is groovy.util.XmlNodePrinter with method print(Node) but no support for GPathResult. As the name suggests, XmlNodePrinter pretty-prints a Node tree in XML format to a PrintWriter (standard output by default).
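
A minimal sketch of XmlNodePrinter in use, printing into a StringWriter so the result can be inspected. The fragment is illustrative, and the groovy.xml imports assume Groovy 3 or later (older versions use the groovy.util classes named in the text).

```groovy
import groovy.xml.XmlParser        // assumption: Groovy 3+
import groovy.xml.XmlNodePrinter   // assumption: Groovy 3+

// Illustrative one-line fragment, parsed into a groovy.util.Node tree
def node = new XmlParser().parseText(
    '<plan><week capacity="8"><task title="read XML chapter"/></week></plan>')

// Print the tree into a string instead of standard output
def out = new StringWriter()
new XmlNodePrinter(new PrintWriter(out)).print(node)
def printed = out.toString()

// The output is XML again, now spread over several indented lines
assert printed.contains('capacity="8"')
assert printed.readLines().size() > 1

// Parsing the printed text gives back an equivalent tree
def again = new XmlParser().parseText(printed)
assert again.week[0].task[0].'@title' == 'read XML chapter'
```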


You have seen that there are a lot of similarities and some slight differences when reading XML via XmlParser or XmlSlurper. The real, fundamental differences become apparent when processing the parsed information. Coming up in section 12.2, we will look at these differences in more detail by exploring two examples: processing with direct in-place data manipulation and processing in a streaming scenario. However, first we are going to look at event style parsing and how it can be used with Groovy. This will help us better position some of Groovy’s powerful XML features in our forthcoming more-detailed examples.

Reading with a SAX parser:
In addition to the original Java DOM parsing you saw earlier, Java also supports what is known as event-based parsing. The original and most common form of event-based parsing is called SAX. SAX is a push-style event-based parser because the parser pushes events to your code.

When using this style of processing, no memory structure is constructed to store the parsed information; instead, the parser notifies a handler about parsing events. We implement such a handler interface in our program to perform processing relevant to our application’s needs whenever the parser notifies us.

Let’s explore this for our simple plan example. Suppose we wish to display a quick summary of the tasks that are underway and those that are upcoming; we aren’t interested in completed activities for the moment. Listing 12.5 shows how to receive start element events using SAX and perform our business logic of printing out the tasks of interest.
- Listing 12.5 Using a SAX parser with Groovy
    import javax.xml.parsers.SAXParserFactory
    import org.xml.sax.*
    import org.xml.sax.helpers.DefaultHandler

    class PlanHandler extends DefaultHandler {
        def underway = []
        def upcoming = []
        // Interested in element start events
        @Override
        void startElement(String namespace, String localName, String qName, Attributes atts) {
            if (qName != 'task') return
            def title = atts.getValue('title')
            def total = atts.getValue('total')
            switch (atts.getValue('done')) {
                case '0'             : upcoming << title ; break   // nothing done yet
                case { it != total } : underway << title ; break   // started but not finished
            }
        }
    }

    def handler = new PlanHandler()
    // Declare our SAX reader
    def reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader()
    reader.contentHandler = handler
    def inputStream = new FileInputStream('data/plan.xml')
    reader.parse(new InputSource(inputStream))
    inputStream.close()
    assert handler.underway == [
        'use in current project'
    ]
    assert handler.upcoming == [
        're-read DB chapter',
        'use DB/XML combination'
    ]
Note that with this style of processing, we have more work to do. When our startElement method is called, we are provided with SAX event information including the name of the element (along with a namespace, if provided) and all the attributes. It’s up to us to work out whether we need this information and process or store it as required during this method call. The parser won’t do any further storage for us. This minimizes memory overhead of the parser, but the implication is that we won’t be able to do GPath-style processing and we aren’t in a position to manipulate a tree-like data structure. We’ll have more to say about SAX event information when we explore XmlSlurper in more detail in section 12.2.

Reading with a StAX parser:
In addition to the push-style SAX parsers supported by Java, a recent trend in processing XML with Java is to use pull-style event-based parsers. The most common of these are called StAX-based parsers. With such a parser, you are still interested in events, but you ask the parser for events (you pull events as needed) during processing, instead of waiting to be informed by methods being called.

Listing 12.6 shows how you can use StAX with Groovy. You will need a StAX parser in your classpath to run this example; since Java 6, one ships with the JRE in the javax.xml.stream package. On older runtimes, if you have already set up Groovy-SOAP, which we explore further in section 12.3, you may already have everything you need.
- Listing 12.6 Using a StAX parser with Groovy
    import javax.xml.stream.*

    def input = 'file:data/plan.xml'.toURL()
    def underway = []
    def upcoming = []

    // Pull events one by one and hand every start element to the closure.
    // The closure parameter is named action because yield is reserved in newer Groovy versions.
    def eachStartElement(inputStream, Closure action) {
        def token = XMLInputFactory.newInstance()
            .createXMLStreamReader(inputStream)
        try {
            while (token.hasNext()) {
                if (token.startElement) action(token)
                token.next()
            }
        } finally {
            token?.close()
            inputStream?.close()
        }
    }

    // Lets us read attributes as properties, for example element.done
    class XMLStreamCategory {
        static Object get(XMLStreamReader self, String key) {
            return self.getAttributeValue(null, key)
        }
    }

    use (XMLStreamCategory) {
        eachStartElement(input.openStream()) { element ->
            if (element.name.toString() != 'task') return
            switch (element.done) {
                case '0' :
                    upcoming << element.title
                    break
                case { it != element.total } :
                    underway << element.title
            }
        }
    }
    assert underway == [
        'use in current project'
    ]
    assert upcoming == [
        're-read DB chapter',
        'use DB/XML combination'
    ]
Note that this style of parsing is similar to SAX-style parsing except that we are running the main control loop ourselves rather than having the parser do it. This style has advantages for certain kinds of processing where the code becomes simpler to write and understand.

Suppose you have to respond to many parts of the document differently. With push models, your code has to maintain extra state to know where you are and how to react. With a pull model, you can decide what parts of the document to process at any point within your business logic. The flow through the document is easier to follow, and the code feels more natural.
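
To illustrate that last point, here is a small sketch in which the caller stops pulling as soon as the first week has been read, something a push parser would not let us decide so directly. The inline datastore is the assumed stand-in for data/plan.xml (the second task title and the week-two totals are assumptions); only the JDK's javax.xml.stream API is used.

```groovy
import javax.xml.stream.XMLInputFactory
import javax.xml.stream.XMLStreamConstants

// Assumed stand-in for data/plan.xml
def xml = '''<plan>
  <week capacity="8">
    <task done="2" total="2" title="read XML chapter"/>
    <task done="3" total="3" title="try some reporting"/>
    <task done="1" total="2" title="use in current project"/>
  </week>
  <week capacity="8">
    <task done="0" total="1" title="re-read DB chapter"/>
    <task done="0" total="3" title="use DB/XML combination"/>
  </week>
</plan>'''

def reader = XMLInputFactory.newInstance()
        .createXMLStreamReader(new StringReader(xml))
def firstWeekTitles = []

try {
    while (reader.hasNext()) {
        def event = reader.next()   // we ask for the next event ourselves
        if (event == XMLStreamConstants.START_ELEMENT && reader.localName == 'task') {
            firstWeekTitles << reader.getAttributeValue(null, 'title')
        } else if (event == XMLStreamConstants.END_ELEMENT && reader.localName == 'week') {
            break                   // first week complete: the rest is never pulled
        }
    }
} finally {
    reader.close()
}

assert firstWeekTitles == [
    'read XML chapter',
    'try some reporting',
    'use in current project'
]
```

No state flag is needed to remember which week we are in; the position in the loop carries that information.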

Supplement:
Groovy Doc - Reading XML using Groovy's XmlParser
Java DOM Tutorial
Groovy Document - GPath
Lesson: Streaming API for XML
This lesson focuses on the Streaming API for XML (StAX), a streaming Java technology-based, event-driven, pull-parsing API for reading and writing XML documents. StAX enables you to create bidirectional XML parsers that are fast, relatively easy to program, and have a light memory footprint.

