Preface:
When working with XML, we have to somehow read it to begin with. This section will lead you through the many options available in Groovy for parsing XML: the normal DOM route, enhanced by Groovy; Groovy’s own XmlParser and XmlSlurper classes; SAX event-based parsing; and the recently introduced StAX pull-parsers.
Let’s suppose we have a little datastore in XML format for planning our Groovy self-education activities. In this datastore, we capture how many hours per week we can invest in this training, what tasks need to be done, and how many hours each task will eat up in total. To keep track of our progress, we will also store how many hours are “done” for each task. Listing 12.1 shows our XML datastore as it resides in a file named data/plan.xml.
- Listing 12.1 The example datastore data/plan.xml
We plan for two weeks, with eight hours for education each week. Three tasks are scheduled for the current week: reading this chapter (two hours for a quick reader), playing with the newly acquired knowledge (three hours of real fun), and using it in the real world (one hour done and one still left). This will be our running example for most of the chapter.
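The body of listing 12.1 is not reproduced here, so the following is only a reconstruction that is consistent with the description above and with the assertions in the later listings. The title of the second task and the total values of the two second-week tasks are assumptions; everything else follows from the text. The line breaks and indentation matter, because the DOM examples count whitespace text nodes when indexing child nodes.

```groovy
// Hypothetical reconstruction of data/plan.xml.
// Assumed values: the title "try some reporting" and the totals in week two.
def planXml = '''<plan>
  <week capacity="8">
    <task done="2" total="2" title="read XML chapter"/>
    <task done="3" total="3" title="try some reporting"/>
    <task done="1" total="2" title="use in current project"/>
  </week>
  <week capacity="8">
    <task done="0" total="1" title="re-read DB chapter"/>
    <task done="0" total="2" title="use DB/XML combination"/>
  </week>
</plan>
'''

// Write the datastore where the listings expect to find it.
def file = new File('data/plan.xml')
file.parentFile.mkdirs()
file.text = planXml
```

Running this script once creates the file that the remaining listings read.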
For reading such a datastore, we will present several different approaches: first using technologies built into the JRE, and then using the Groovy parsers. We’ll start with the more familiar DOM parser.
Working with a DOM parser:
Why do we bother with Java’s classic DOM parsers? Shouldn’t we restrict ourselves to show only Groovy specifics here? Well, first of all, even in Groovy code, we sometimes need DOM objects for further processing, for example when applying XPath expressions to an object as we will explain in section 12.2.3. For that reason, we show the Groovy way of retrieving the DOM representation of our datastore with the help of Java’s DOM parsers. Second, there is basic Groovy support for dealing with DOM NodeLists, and Groovy also provides extra helper classes to simplify common tasks within DOM.
Finally, it’s much easier to appreciate how slick the Groovy parsers are after having seen the “old” way of reading XML. We start by loading a DOM tree into memory.
Getting the document
Not surprisingly, the Document Object Model is based around the central abstraction of a document, realized as the Java interface org.w3c.dom.Document. An object of this type will hold our datastore. The Java way of retrieving a document is through the parse method of a DocumentBuilder (the parser). This method takes an InputStream to read the XML from. So a first attempt at reading is:
- def doc = builder.parse(new FileInputStream('data/plan.xml'))
The builder itself comes from a DocumentBuilderFactory, so the complete code is:
- import javax.xml.parsers.DocumentBuilderFactory as fac
- def builder = fac.newInstance().newDocumentBuilder()
- def doc = builder.parse(new FileInputStream('data/plan.xml'))
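The snippet above never closes the FileInputStream. One possible variation, sketched here with a temporary file standing in for data/plan.xml (the file and its content are assumptions made only to keep the example self-contained), uses the GDK method withInputStream, which closes the stream once the closure returns:

```groovy
import javax.xml.parsers.DocumentBuilderFactory

def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()

// Stand-in for data/plan.xml so the sketch runs on its own.
def file = File.createTempFile('plan', '.xml')
file.text = '<plan><week capacity="8"/></plan>'

// withInputStream opens the stream, passes it to the closure,
// and closes it afterwards, even if parsing throws.
def doc = file.withInputStream { stream -> builder.parse(stream) }

assert doc.documentElement.nodeName == 'plan'
file.delete()
```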
Walking the DOM
The document object is not yet the root of our datastore. In order to get the top-level element, which is plan in our case, we have to ask the document for its documentElement property:
- def plan = doc.documentElement
It is not surprising that XML elements are stored in nodes of type ELEMENT_NODE, but it is surprising that attributes are also stored in node objects (of nodeType ATTRIBUTE_NODE). To make things even more complex, each value of an attribute is stored in an extra node object (with nodeType TEXT_NODE). This complexity is a large part of the reason why simpler APIs such as JDOM, dom4j, and XOM have become popular.
As an example, the nodes and their names, types, and values are depicted in figure 12.1 for the first week element in the datastore.
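The node structure described above can be checked with a few assertions. This is a small sketch that parses an inline fragment (an assumption made for self-containment, mirroring the first week element) instead of the file, and relies on the JDK's default DOM implementation exposing the attribute value as a child text node:

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Node

def xml = '<week capacity="8"><task done="2" total="2" title="read XML chapter"/></week>'
def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
def doc = builder.parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))

// The element itself is an ELEMENT_NODE.
def week = doc.documentElement
assert week.nodeType == Node.ELEMENT_NODE
assert week.nodeName == 'week'

// Its attribute is a node of its own ...
def capacity = week.attributes.getNamedItem('capacity')
assert capacity.nodeType == Node.ATTRIBUTE_NODE
assert capacity.nodeValue == '8'

// ... and the attribute value sits in a further text node.
def valueNode = capacity.firstChild
assert valueNode.nodeType == Node.TEXT_NODE
assert valueNode.nodeValue == '8'
```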
The fact that node objects behave differently with respect to their nodeType leads to code that needs to work with this distinction. For example, when reading information from a node, we need a method such as this:
- String info(node) {
- switch (node.nodeType) {
- case Node.ELEMENT_NODE:
- return 'element: '+ node.nodeName
- case Node.ATTRIBUTE_NODE:
- return "attribute: ${node.nodeName}=${node.nodeValue}"
- case Node.TEXT_NODE:
- return 'text: '+ node.nodeValue
- }
- return 'some other type: '+ node.nodeType
- }
Listing 12.2 puts all this together and reads our plan from the XML datastore, walking into the first task of the first week.
- Listing 12.2 Reading plan.xml with the classic DOM parser
- import javax.xml.parsers.DocumentBuilderFactory
- import org.w3c.dom.Node
- String info(node) {
- switch (node.nodeType) {
- case Node.ELEMENT_NODE:
- return 'element: '+ node.nodeName
- case Node.ATTRIBUTE_NODE:
- return "attribute: ${node.nodeName}=${node.nodeValue}"
- case Node.TEXT_NODE:
- return 'text: '+ node.nodeValue
- }
- return 'some other type: '+ node.nodeType
- }
- def fac = DocumentBuilderFactory.newInstance()
- def builder = fac.newDocumentBuilder()
- def doc = builder.parse(new FileInputStream('data/plan.xml'))
- def plan = doc.documentElement
- assert 'element: plan' == info(plan)
- // 1) Object iteration method
- def week = plan.childNodes.find{'week' == it.nodeName}
- assert 'element: week' == info(week)
- // 2) Indexed access
- def task = week.childNodes.item(1)
- assert 'element: task' == info(task)
- def title = task.attributes.getNamedItem('title')
- assert 'attribute: title=read XML chapter' == info(title)
Making DOM groovier
Groovy wouldn’t be groovy without a convenience method for the lengthy parsing prework:
- def doc = groovy.xml.DOMBuilder.parse(new FileReader('data/plan.xml'))
- def plan = doc.documentElement
Dealing with child nodes and attributes as in listing 12.2 doesn’t feel groovy either. Therefore, Groovy provides a DOMCategory that you can use for simplified access. With this, you can index child nodes via the subscript operator or via their node name. You can refer to attributes by getting the @attributeName property:
- use(groovy.xml.dom.DOMCategory){
- assert 'plan' == plan.nodeName
- assert 'week' == plan[1].nodeName
- assert 'week' == plan.week.nodeName
- assert '8' == plan[1].'@capacity'
- }
Reading with a Groovy parser:
The Groovy way of reading the plan datastore is so simple, we’ll dive headfirst into the solution as presented in listing 12.3.
- Listing 12.3 Reading plan.xml with Groovy’s XmlParser
- def plan = new XmlParser().parse(new File('data/plan.xml'))
- assert 'plan' == plan.name()
- assert 'week' == plan.week[0].name()
- assert 'task' == plan.week[0].task[0].name()
- assert 'read XML chapter' == plan.week[0].task[0].'@title'
Up to this point, you have seen that Groovy’s XmlParser provides all the functionality you first saw with the DOM parser. But there is more to come. In addition to the XmlParser, Groovy comes with the XmlSlurper. Let’s explore the commonalities and differences between those two before considering more advanced usages of each.
Commonalities between XmlParser and XmlSlurper
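Let’s start with the commonalities of XmlParser and XmlSlurper: They both reside in package groovy.util and provide the constructors listed in table 12.1. (Since Groovy 3, both classes live in package groovy.xml; the groovy.util versions were deprecated there and removed in Groovy 4, so on newer versions the listings in this section need an explicit import groovy.xml.XmlParser or import groovy.xml.XmlSlurper.)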
Besides sharing constructors with the same parameter lists, the types share parsing methods with the same signatures. The only difference is that the parsing methods of XmlParser return objects of type groovy.util.Node whereas XmlSlurper returns GPathResult objects. Table 12.2 lists the uniform parse methods.
The result of the parse method is either a Node (for XmlParser) or a GPathResult (for XmlSlurper). Table 12.3 lists the methods that both result types have in common. Note that because both types understand the iterator method, all object iteration methods are also instantly available.
GPathResult and groovy.util.Node provide additional shortcuts for method calls to the parent object and all descendant objects. Such shortcuts make reading a GPath expression more like other declarative path expressions such as XPath or Ant paths.
Objects of type Node and GPathResult can access both child elements and attributes as if they were properties of the current object. Table 12.4 shows the syntax and how the leading @ sign distinguishes attribute names from nested element names.
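Since the tables are not reproduced here, the following sketch illustrates the property-style access just described, using only calls that exist on both result types (name, children, attributes) plus the quoted '@name' syntax. It assumes Groovy 3 or later for the groovy.xml imports (on older versions the classes are in groovy.util and need no import) and parses an inline fragment of the plan instead of the file:

```groovy
import groovy.xml.XmlParser
import groovy.xml.XmlSlurper

def xml = '''<plan>
  <week capacity="8">
    <task done="2" total="2" title="read XML chapter"/>
  </week>
</plan>'''

def node = new XmlParser().parseText(xml)    // groovy.util.Node
def path = new XmlSlurper().parseText(xml)   // GPathResult

// Child elements are reached as if they were properties.
assert node.week[0].name() == 'week'
assert path.week[0].name() == 'week'
assert node.week[0].children().size() == 1
assert path.week[0].children().size() == 1

// A leading @ marks an attribute name instead of a child element name.
assert node.week[0].'@capacity' == '8'
assert path.week[0].'@capacity'.text() == '8'

// attributes() returns all attributes of an element as a map.
assert node.week[0].task[0].attributes()['title'] == 'read XML chapter'
assert path.week[0].task[0].attributes()['title'] == 'read XML chapter'
```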
Listing 12.4 plays with various method calls and uses GPath expressions to work on objects of type Node and GPathResult alike. It uses XmlParser to return Node objects and XmlSlurper to return a GPathResult. To make the similarities stand out, listing 12.4 shows doubled lines, one using Node, one using GPathResult.
- Listing 12.4 Using common methods of groovy.util.Node and GPathResult
- def node = new XmlParser().parse(new File('data/plan.xml'))
- def path = new XmlSlurper().parse(new File('data/plan.xml'))
- assert 'plan' == node.name()
- assert 'plan' == path.name()
- assert 2 == node.children().size()
- assert 2 == path.children().size()
- // 1) All tasks
- assert 5 == node.week.task.size()
- assert 5 == path.week.task.size()
- // 2) All hours done
- assert 6 == node.week.task.'@done'*.toInteger().sum()
- // 3) Second week
- assert path.week[1].task.every{ it.'@done' == '0' }
At (2), you see that in GPath, attribute access has the same effect as access to child elements; node.week.task.'@done' results in a list of all values of the done attribute of all tasks of all weeks. We use the spread-dot operator (see section 7.5.1) to apply the toInteger method to all strings in that list, returning a list of integers. We finally use the GDK method sum on that list.
The line at (3) can be read as: “Assert that the done attribute in every task of week[1] is '0'.” What’s new here is using indexed access and the object iteration method every. Because indexing starts at zero, week[1] means the second week.
This example should serve as an appetizer for your own experiences with applying GPath expressions to XML documents. In addition to the convenient GPath notation, you might also wish to make use of traversal methods; for example, we could add the following lines to listing 12.4:
- assert 'plan->week->week->task->task->task->task->task' ==
- node.breadthFirst()*.name().join('->')
- assert 'plan->week->task->task->task->week->task->task' ==
- node.depthFirst()*.name().join('->')
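As one more illustration of combining GPath with the object iteration methods, the sketch below computes the hours still open in each week. It assumes Groovy 3 or later for the import and works on an inline copy of the plan; the second task title and the total values in the second week are assumptions, because the datastore listing is not shown in this text.

```groovy
import groovy.xml.XmlParser

// Inline stand-in for data/plan.xml (week-two totals are assumed values).
def xml = '''<plan>
  <week capacity="8">
    <task done="2" total="2" title="read XML chapter"/>
    <task done="3" total="3" title="try some reporting"/>
    <task done="1" total="2" title="use in current project"/>
  </week>
  <week capacity="8">
    <task done="0" total="1" title="re-read DB chapter"/>
    <task done="0" total="2" title="use DB/XML combination"/>
  </week>
</plan>'''

def plan = new XmlParser().parseText(xml)

// For every week, sum (total - done) over its tasks.
def remaining = plan.week.collect { week ->
    week.task.collect { it.'@total'.toInteger() - it.'@done'.toInteger() }.sum()
}

assert remaining == [1, 3]
```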
Differences between XmlParser and XmlSlurper
Despite the similarities between XmlParser and XmlSlurper when used for simple reading purposes, there are differences when it comes to more advanced reading tasks and when processing XML documents into other formats.
XmlParser uses the groovy.util.Node type and its GPath expressions result in lists of nodes. That makes working with XmlParser feel like there always is a tangible object representation of elements, something that we can inspect via toString, print, or change in-place. Because GPath expressions return lists of such elements, we can apply all our knowledge of the list datatype (see section 4.2).
This convenience comes at the expense of additional up-front processing and extra memory consumption. The GPath expression node.week.task.'@done' generates three lists: a temporary list of weeks (two entries), a temporary list of tasks (five entries), and a list of done attribute values (five strings) that is finally returned. This is reasonable for our small example but hampers processing large or deeply nested XML documents.
XmlSlurper in contrast does not store intermediate results when processing information after a document has been parsed. It avoids the extra memory hit when processing. Internally, XmlSlurper uses iterators instead of extra collections to reflect every step in the GPath. With this construction, it is possible to defer processing until the last possible moment.
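The difference in result types can be observed directly. The following is a minimal sketch, assuming Groovy 3 or later (where GPathResult is in groovy.xml.slurpersupport) and a shortened inline plan; it shows that the XmlParser path yields real lists, whereas the XmlSlurper path yields a GPathResult that is iterated on demand:

```groovy
import groovy.xml.XmlParser
import groovy.xml.XmlSlurper
import groovy.xml.slurpersupport.GPathResult

def xml = '''<plan>
  <week capacity="8">
    <task done="2" total="2" title="read XML chapter"/>
    <task done="1" total="2" title="use in current project"/>
  </week>
  <week capacity="8">
    <task done="0" total="1" title="re-read DB chapter"/>
  </week>
</plan>'''

def node = new XmlParser().parseText(xml)
def path = new XmlSlurper().parseText(xml)

// XmlParser: each GPath step produces a list we can treat like any other list.
def nodeTasks = node.week.task
assert nodeTasks instanceof List
assert nodeTasks.size() == 3
assert node.week.task.'@done' == ['2', '1', '0']

// XmlSlurper: the same expression yields a GPathResult, not a list.
def pathTasks = path.week.task
assert pathTasks instanceof GPathResult
assert !(pathTasks instanceof List)
assert pathTasks.size() == 3
assert pathTasks.collect { it.'@done'.toInteger() } == [2, 1, 0]
```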
NOTE.
Table 12.5 lists the methods unique to Node. When using XmlParser, you can use these methods in your processing.
Table 12.6 lists the methods that are unique to or are optimized in GPathResult. As an example, we could add the following line to listing 12.4 to use the optimized findAll in GPathResult:
- assert 2 == path.week.task.findAll{ it.'@title' =~ 'XML' }.size()
You have seen that there are a lot of similarities and some slight differences when reading XML via XmlParser or XmlSlurper. The real, fundamental differences become apparent when processing the parsed information. Coming up in section 12.2, we will look at these differences in more detail by exploring two examples: processing with direct in-place data manipulation and processing in a streaming scenario. However, first we are going to look at event style parsing and how it can be used with Groovy. This will help us better position some of Groovy’s powerful XML features in our forthcoming more-detailed examples.
Reading with a SAX parser:
In addition to the original Java DOM parsing you saw earlier, Java also supports what is known as event-based parsing. The original and most common form of event-based parsing is called SAX. SAX is a push-style event-based parser because the parser pushes events to your code.
When using this style of processing, no memory structure is constructed to store the parsed information; instead, the parser notifies a handler about parsing events. We implement such a handler interface in our program to perform processing relevant to our application’s needs whenever the parser notifies us.
Let’s explore this for our simple plan example. Suppose we wish to display a quick summary of the tasks that are underway and those that are upcoming; we aren’t interested in completed activities for the moment. Listing 12.5 shows how to receive start element events using SAX and perform our business logic of printing out the tasks of interest.
- Listing 12.5 Using a SAX parser with Groovy
- import javax.xml.parsers.SAXParserFactory
- import org.xml.sax.*
- import org.xml.sax.helpers.DefaultHandler
- class PlanHandler extends DefaultHandler {
- def underway = []
- def upcoming = []
- // Interested in element start events
- @Override
- void startElement(String namespace, String localName, String qName, Attributes atts) {
- if (qName != 'task') return
- def title = atts.getValue('title')
- def total = atts.getValue('total')
- switch (atts.getValue('done')) {
- case '0' : upcoming << title ; break
- case { it != total } : underway << title ; break
- }
- }
- }
- def handler = new PlanHandler()
- // Declare our SAX reader
- def reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader()
- reader.contentHandler = handler
- def inputStream = new FileInputStream('data/plan.xml')
- reader.parse(new InputSource(inputStream))
- inputStream.close()
- assert handler.underway == [
- 'use in current project'
- ]
- assert handler.upcoming == [
- 're-read DB chapter',
- 'use DB/XML combination'
- ]
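Because SAX only pushes one event at a time, a handler that needs context has to remember it. The sketch below is a hypothetical variation on PlanHandler (the class name WeekTrackingHandler and the inline XML are assumptions for illustration) that groups task titles by week and therefore keeps the index of the current week as a field:

```groovy
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

class WeekTrackingHandler extends DefaultHandler {
    int currentWeek = -1                            // state carried between events
    Map<Integer, List<String>> titlesByWeek = [:]

    @Override
    void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName == 'week') {
            currentWeek++                           // entering the next week element
            titlesByWeek[currentWeek] = []
        } else if (qName == 'task') {
            titlesByWeek[currentWeek] << atts.getValue('title')
        }
    }
}

def xml = '''<plan>
  <week capacity="8">
    <task done="2" total="2" title="read XML chapter"/>
    <task done="1" total="2" title="use in current project"/>
  </week>
  <week capacity="8">
    <task done="0" total="1" title="re-read DB chapter"/>
  </week>
</plan>'''

def handler = new WeekTrackingHandler()
def parser = SAXParserFactory.newInstance().newSAXParser()
parser.parse(new ByteArrayInputStream(xml.getBytes('UTF-8')), handler)

assert handler.titlesByWeek == [
    0: ['read XML chapter', 'use in current project'],
    1: ['re-read DB chapter']
]
```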
Reading with a StAX parser:
In addition to the push-style SAX parsers supported by Java, a recent trend in processing XML with Java is to use pull-style event-based parsers. The most common of these are called StAX-based parsers. With such a parser, you are still interested in events, but you ask the parser for events (you pull events as needed) during processing, instead of waiting to be informed by methods being called.
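Listing 12.6 shows how you can use StAX with Groovy. On Java 6 and later, the JDK ships a StAX implementation in javax.xml.stream, so nothing extra is required; on older runtimes you will need a StAX parser in your classpath to run this example. If you have already set up Groovy-SOAP, which we explore further in section 12.3, you may already have everything you need.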
- Listing 12.6 Using a StAX parser with Groovy
- import javax.xml.stream.*
- def input = 'file:data/plan.xml'.toURL()
- def underway = []
- def upcoming = []
- def eachStartElement(inputStream, Closure yield) {
- def token = XMLInputFactory.newInstance()
- .createXMLStreamReader(inputStream)
- try {
- while (token.hasNext()) {
- if (token.startElement) yield token
- token.next()
- }
- } finally {
- token?.close()
- inputStream?.close()
- }
- }
- class XMLStreamCategory {
- static Object get(XMLStreamReader self, String key) {
- return self.getAttributeValue(null, key)
- }
- }
- use (XMLStreamCategory) {
- eachStartElement(input.openStream()) { element ->
- if (element.name.toString() != 'task') return
- switch (element.done) {
- case '0' :
- upcoming << element.title
- break
- case { it != element.total } :
- underway << element.title
- }
- }
- }
- assert underway == [
- 'use in current project'
- ]
- assert upcoming == [
- 're-read DB chapter',
- 'use DB/XML combination'
- ]
Suppose you have to respond to many parts of the document differently. With push models, your code has to maintain extra state to know where you are and how to react. With a pull model, you can decide what parts of the document to process at any point within your business logic. The flow through the document is easier to follow, and the code feels more natural.
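To make this concrete, here is a small pull-style sketch that groups task titles by week. It is an illustration under assumptions (inline XML instead of the file, the StAX implementation bundled with the JDK), not part of listing 12.6. The position in the document is expressed by the nesting of the loops, so no extra state field is needed:

```groovy
import javax.xml.stream.XMLInputFactory
import javax.xml.stream.XMLStreamConstants

def xml = '''<plan>
  <week capacity="8">
    <task done="2" total="2" title="read XML chapter"/>
    <task done="1" total="2" title="use in current project"/>
  </week>
  <week capacity="8">
    <task done="0" total="1" title="re-read DB chapter"/>
  </week>
</plan>'''

def reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
def titlesByWeek = []
try {
    while (reader.hasNext()) {
        // Outer loop: look for the start of a week element.
        if (reader.next() == XMLStreamConstants.START_ELEMENT && reader.localName == 'week') {
            def titles = []
            // Inner loop: pull events until the matching end of that week.
            while (reader.hasNext()) {
                int event = reader.next()
                if (event == XMLStreamConstants.END_ELEMENT && reader.localName == 'week') break
                if (event == XMLStreamConstants.START_ELEMENT && reader.localName == 'task') {
                    titles << reader.getAttributeValue(null, 'title')
                }
            }
            titlesByWeek << titles
        }
    }
} finally {
    reader.close()
}

assert titlesByWeek == [
    ['read XML chapter', 'use in current project'],
    ['re-read DB chapter']
]
```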
Supplement:
* Groovy Doc - Reading XML using Groovy's XmlParser
* Java DOM Tutorial
* Groovy Document - GPath
* Lesson: Streaming API for XML