Computing Thoughts
Simplifying XML Manipulation
by Bruce Eckel
June 7, 2006

Summary
Every time someone creates a new XML-based language, God (performs some unspeakable act).

My initial strategy for XML was to recognize it for what it is -- a platform independent way to move data from one place to another -- and then, as much as possible, to ignore it and assume that I would only have to actually look at XML when I was debugging something.

Occasionally, XML is actually treated this way. For example, in XML-RPC, you make an ordinary function call, the mechanism converts it into XML and uses HTTP to transport the call across the network, the return value comes back as XML and the mechanism converts the XML back into useful data. It's the wet-dream of remote procedure calls that we've been trying, and failing, to achieve all these years (someone starts with an idea, then committees descend and you end up with things like CORBA and SOAP).
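
To make the "mechanism converts it into XML" step concrete, here is a small sketch using the XML-RPC marshalling functions from the Python standard library. It is written for Python 3, where the module is xmlrpc.client (it was xmlrpclib in the Python 2 line), and the remote function name add is made up for illustration. Nothing is sent over the network; the sketch only shows the XML that normally stays hidden.

```python
import xmlrpc.client  # named xmlrpclib in Python 2

# Marshal a call to a hypothetical remote function add(2, 3).
call_xml = xmlrpc.client.dumps((2, 3), methodname="add")
print(call_xml)

# The receiving side turns the XML back into ordinary values.
params, method = xmlrpc.client.loads(call_xml)
print(method, params)
```

In normal use you would never call dumps() or loads() yourself; a ServerProxy object does both behind an ordinary-looking method call.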

Alas, I haven't seen XML-RPC used yet in web-servicey things. I hope that it is used somewhere, but if you read my previous entry you'll see that the first two services I tried to use -- Fedex and the Post Office (and UPS apparently also works this way) -- use something that's almost XML-RPC, but isn't: you're expected to assemble your XML by hand, use HTTP to send it to the server, and then unbundle the XML that comes back. It's the worst of both worlds: an ordinary RESTful call, but instead of just handing the server plain HTTP arguments, you have to build and dissect the XML. I'm sure there's some reason they decided not to use XML-RPC in these cases, but I can't imagine what it is.
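
As a rough picture of what "assemble your XML by hand and then use HTTP to send it" involves, the following Python 3 sketch builds a request body with string formatting and wraps it in an HTTP POST. The endpoint URL and the function name build_rate_request are invented for illustration, and the element names are borrowed from the test program later in this entry; none of this is the actual Fedex or Post Office interface. The request object is only constructed, never sent.

```python
from urllib.request import Request
from xml.sax.saxutils import escape

def build_rate_request(account, meter):
    """Assemble the request body by hand, escaping the values."""
    return (
        "<RequestHeader>"
        "<AccountNumber>%s</AccountNumber>"
        "<MeterNumber>%s</MeterNumber>"
        "</RequestHeader>"
    ) % (escape(account), escape(meter))

body = build_rate_request("12345", "6789")

# Hypothetical endpoint; nothing goes over the wire until the
# request is passed to urllib.request.urlopen().
request = Request(
    "https://shipping.example.invalid/rates",
    data=body.encode("utf-8"),
    headers={"Content-Type": "text/xml"},
)
```

Note that even this tiny version has to remember to escape its values, which is exactly the kind of detail a library should be handling for you.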

The point being that here's a case where the XML should have been invisible, but you have to mess with it by hand.

Another example is Ant. The creator of this tool has since apologized for using XML, but it's what we're stuck with. And it was too late, anyway, since Maven also appears to use XML. So again, you're stuck creating XML by hand.

For Thinking in Java, I wanted to automatically create the Ant build files for all the chapters, so I created a tool in Python that I call AntBuilder (it's not a general-purpose tool; it's designed specifically for the code in Thinking in Java). In the process I used xml.dom.minidom from the standard Python library. I'm not exactly sure what the genesis of this library is, but the resulting code is terribly verbose and obtuse, and thus difficult to modify and maintain. Indeed, others have created libraries that are simpler than xml.dom.minidom; ElementTree is probably the most notable, but I still found it a bit more verbose than I want, and I prefer something more like py.xml (also look at the simplicity of py.test in the same package). It's worth noting, though, that ElementTree isn't all that bad and if it had been the original XML library in Python I probably wouldn't have gone to this trouble. And more importantly, a subset of ElementTree will be in Python 2.5.
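
For a sense of the difference in verbosity, here is a side-by-side sketch that builds the same two-element fragment with each library. It assumes Python 3 and the xml.etree.ElementTree module name under which ElementTree ships in the standard library from Python 2.5 onward; the element names are borrowed from the test program later in this entry.

```python
from xml.dom.minidom import getDOMImplementation
import xml.etree.ElementTree as ET

# xml.dom.minidom: every element and every text value is a separate
# object that has to be created and then attached explicitly.
doc = getDOMImplementation().createDocument(None, "RequestHeader", None)
root = doc.documentElement
account = doc.createElement("AccountNumber")
account.appendChild(doc.createTextNode("12345"))
root.appendChild(account)
meter = doc.createElement("MeterNumber")
meter.appendChild(doc.createTextNode("6789"))
root.appendChild(meter)
minidom_xml = root.toxml()

# ElementTree: text is an attribute of the element, and SubElement
# creates and attaches the child in one step.
header = ET.Element("RequestHeader")
ET.SubElement(header, "AccountNumber").text = "12345"
ET.SubElement(header, "MeterNumber").text = "6789"
etree_xml = ET.tostring(header, encoding="unicode")

print(minidom_xml == etree_xml)
```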

At the time I was creating AntBuilder, the messiness of xml.dom.minidom made the creation of the build.xml files far too difficult to manage, so I began creating a simplified approach for building XML trees. And when I tried to make some modifications to the code from http://opensource.pseudocode.net/files/shipping.tgz which also used xml.dom.minidom, I ran into the same over-complexity problem. So yesterday I began rewriting the XML code from the AntBuilder project to create a general purpose XML manipulation library called xmlnode.

My goal was to make the resulting code as minimal and (to my standards) readable as possible. To me, XML is a hierarchy of nodes, and each node has a tag, potentially some attributes, and either a value or subnodes. As it turns out, the whole of XML can be encapsulated into a single class which I call Node. The operator += has been overloaded to insert subnodes into a node, and the operator + has been overloaded to string together a group of subnodes.

You create a new Node using its constructor, and you give it a tag name, an optional value, and optionally one or more attributes. Thus, building an XML tree is about as simple as thinking about it -- the syntax doesn't get in the way (IMO).
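
One consequence of the operator design is worth seeing in isolation. The sketch below uses a stripped-down stand-in class (the name Tiny is hypothetical, and it carries only a tag and a child list) whose += and + are defined the same way as in the Node listing: + simply delegates to +=. That means + modifies and returns its left operand rather than producing a new object.

```python
class Tiny(object):
    """Hypothetical stand-in: just a tag and a list of children."""
    def __init__(self, tag):
        self.tag = tag
        self.children = []

    def __iadd__(self, other):
        self.children.append(other)
        return self

    def __add__(self, other):
        # Same choice as the Node listing: '+' delegates to '+='.
        return self.__iadd__(other)

root = Tiny("RequestHeader")
chained = root + Tiny("AccountNumber") + Tiny("MeterNumber")

# '+' returned the left operand itself, so 'chained' and 'root'
# are the same object and both children hang off it.
tags = [child.tag for child in root.children]
print(chained is root, tags)
```

This is what makes the chained form in the test program work, but it also means that a + b is not free of side effects on a.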

With XML web-servicey things, the results also come back in XML, so the Node class has a static method (written with the decorator syntax that arrived in Python 2.4) that will take a dom, or a string of XML, and produce a hierarchy of Node objects. You can read the value of the first node that has a particular tag name by using the "[]" operator, and fetch the node itself with node() (examples below). Both only find the first node with that tag name; if there are more, use nodeList() on the parent, or write the code to iterate through the nodes and select the appropriate one(s). That's not very hard, since each node just contains a Python list of other nodes.
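
To show what the conversion has to cope with, here is a sketch of the raw minidom view of a reply that repeats a tag. The Reply and Rate element names are invented for illustration. It is written for Python 3 and uses only xml.dom.minidom, so it runs without the xmlnode module.

```python
from xml.dom.minidom import parseString

reply = """<Reply>
  <Rate>10.50</Rate>
  <Rate>12.75</Rate>
</Reply>"""

top = parseString(reply).documentElement

# The indentation shows up as whitespace-only text nodes
# sitting between the elements.
kinds = [child.nodeName for child in top.childNodes]

# Collect the text of every immediate child with a given tag,
# not just the first one.
rates = [child.firstChild.data
         for child in top.childNodes
         if child.nodeType == child.ELEMENT_NODE
         and child.tagName == "Rate"]

print(kinds)
print(rates)
```

The whitespace-only "#text" entries are the reason a dom-to-Node conversion needs to filter text nodes, and the list comprehension is the same immediate-children idea that nodeList() applies to Node objects.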

You can download the code here. As I get feedback I will make changes. If you import the file, you'll get the single class that uses xml.dom.minidom, and if you run it as a standalone program it will execute all the test/demonstration code which provides examples so you can see how to use it.

Update: (6/19/06) By using and testing this code, I've significantly redesigned it to make it clearer and easier to use. Also note that the design of the class allows it to be easily retargeted to a different underlying XML library implementation, if desired.

#!/usr/bin/python2.4
"""
xmlnode.py -- Rapidly assemble XML using minimal coding.

By Bruce Eckel, (c)2006 MindView Inc. www.MindView.net
Permission is granted to use or modify without payment as 
long as this copyright notice is retained.

Everything is a Node, and each Node can either have a value 
or subnodes. Subnodes can be appended to Nodes using '+=', 
and a group of Nodes can be strung together using '+'.

Create a node containing a value by saying 
Node("tag", "value")
You can also give attributes to the node in the constructor:
Node("tag", "value", attr1 = "attr1", attr2 = "attr2")
or without a value:
Node("tag", attr1 = "attr1", attr2 = "attr2")

To produce xml from a finished Node n, say n.xml() (for 
nicely formatted output) or n.rawxml().

You can read and modify the attributes of an xml Node using 
getAttribute(), setAttribute(), or delAttribute().

You can find the value of the first subnode with tag == "tag"
by saying n["tag"]. If there are multiple instances of n["tag"],
this will only find the first one, so you should use node() or
nodeList() to narrow your search down to a Node that only has
one instance of n["tag"] first.

You can replace the value of the first subnode with tag == "tag"
by saying n["tag"] = newValue. The same issues exist as noted
in the above paragraph.

You can find the first node with tag == "tag" by saying 
node("tag"). If there are multiple nodes with the same tag 
at the same level, use nodeList("tag").

The Node class is also designed to create a kind of "domain 
specific language" by subclassing Node to create Node types 
specific to your problem domain.

This implementation uses xml.dom.minidom which is available
in the standard Python 2.4 library. However, it can be 
retargeted to use other XML libraries without much effort.
"""
from xml.dom.minidom import getDOMImplementation, parseString
import copy, re

class Node(object):
    """
    Everything is a Node. The XML is maintained as (very efficient)
    Python objects until an XML representation is needed.
    """
    def __init__(self, tag, value = None, **attributes):
        self.tag = tag.strip()
        self.attributes = attributes
        self.children = []
        self.value = value
        if self.value:
            self.value = self.value.strip()

    def getAttribute(self, name):
        """
        Read XML attribute of this node.
        """
        return self.attributes[name]

    def setAttribute(self, name, item):
        """
        Modify XML attribute of this node.
        """
        self.attributes[name] = item

    def delAttribute(self, name):
        """
        Remove XML attribute with this name.
        """
        del self.attributes[name]

    def node(self, tag):
        """ 
        Recursively find the first subnode with this tag. 
        """
        if self.tag == tag:
            return self
        for child in self.children:
            result = child.node(tag)
            if result:
                return result
        return False
        
    def nodeList(self, tag):
        """ 
        Produce a list of subnodes with the same tag. 
        Note:
        It only makes sense to do this for the immediate 
        children of a node. If you went another level down, 
        the results would be ambiguous, so the user must 
        choose the node to iterate over.
        """
        return [n for n in self.children if n.tag == tag]

    def __getitem__(self, tag):
        """ 
        Produce the value of a single subnode using operator[].
        Recursively find the first subnode with this tag. 
        If you want duplicate subnodes with this tag, use
        nodeList().
        """
        subnode = self.node(tag)
        if not subnode:
            raise KeyError
        return subnode.value

    def __setitem__(self, tag, newValue):
        """ 
        Replace the value of the first subnode containing "tag"
        with a new value, using operator[].
        """
        assert isinstance(newValue, str), "Value " + str(newValue) + " must be a string"
        subnode = self.node(tag)
        if not subnode:
            raise KeyError
        subnode.value = newValue

    def __iadd__(self, other):
        """
        Add child nodes using operator +=
        """
        assert isinstance(other, Node), "Tried to += " + str(other)
        self.children.append(other)
        return self

    def __add__(self, other):
        """
        Allow operator + to combine children
        """
        return self.__iadd__(other)

    def __str__(self):
        """
        Display this object (for debugging)
        """
        result = self.tag + "\n"
        for k, v in self.attributes.items():
            result += "    attribute: %s = %s\n" % (k, v)
        if self.value:
            result += "    value: [%s]" % self.value
        return result
        
    # The following are the only methods that rely on the underlying
    # implementation, and thus the only methods that need to change
    # in order to retarget to a different underlying implementation.

    # A static dom implementation object, used to create elements:        
    doc = getDOMImplementation().createDocument(None, None, None)

    def dom(self):
        """
        Lazily create a minidom from the information stored
        in this Node object.
        """
        element = Node.doc.createElement(self.tag)
        for key, val in self.attributes.items():
            element.setAttribute(key, val)
        if self.value:
            assert not self.children, "cannot have value and children: " + str(self)
            element.appendChild(Node.doc.createTextNode(self.value))
        else:
            for child in self.children:
                element.appendChild(child.dom()) # Generate children as well
        return element

    def xml(self, separator = '  '):
        return self.dom().toprettyxml(separator)

    def rawxml(self):
        return self.dom().toxml()

    @staticmethod
    def create(dom):
        """
        Create a Node representation, given either
        a string representation of an XML doc, or a dom.
        """
        if isinstance(dom, str):
            # Strip all extraneous whitespace so that
            # text input is handled consistently:
            dom = re.sub(r"\s+", " ", dom)
            dom = dom.replace("> ", ">")
            dom = dom.replace(" <", "<")
            return Node.create(parseString(dom))
        if dom.nodeType == dom.DOCUMENT_NODE:
            return Node.create(dom.childNodes[0])
        if dom.nodeName == "#text":
            return
        node = Node(dom.nodeName)
        if dom.attributes:
            for name, val in dom.attributes.items():
                node.setAttribute(name, val)
        for n in dom.childNodes:
            if n.nodeType == n.TEXT_NODE and n.wholeText.strip():
                node.value = n.wholeText
            else:
                subnode = Node.create(n)
                if subnode:
                    node += subnode
        return node

Here is the test program for the class, which also demonstrates how to use the various features:

#!python
"""
Test the xmlnode library, and demonstrate how to use it.
"""
from xmlnode import Node
import copy

# The XML we want to create:
request = """\
<RequestHeader>
  <AccountNumber>
    12345
  </AccountNumber>
  <MeterNumber>
    6789
  </MeterNumber>
  <CarrierCode>
    FDXE
  </CarrierCode>
  <Service>
    STANDARDOVERNIGHT
  </Service>
  <Packaging>
    FEDEXENVELOPE
  </Packaging>
</RequestHeader>
"""

# Create a Node from a string:
requestNode = Node.create(request)
# Create a Node from a minidom:
requestNode = Node.create(requestNode.dom())

# Assemble a Node programmatically:
root = Node('RequestHeader')
account = Node('AccountNumber', '12345')
assert account.value == '12345'
root += account # Insert account node as child of root
# Add new child nodes:
root += Node('MeterNumber', '6789')
root += Node('CarrierCode', 'FDXE') 
root += Node('Service', 'STANDARDOVERNIGHT') 
root += Node('Packaging', 'FEDEXENVELOPE')

assert root.xml() == requestNode.xml()
assert root.xml() == request
# Dom objects are different, not equivalent:
assert root.dom() != requestNode.dom()

# A more succinct approach. The '+' adds child nodes
# to the RequestHeader node. 
root2 = Node('RequestHeader') + \
        Node('AccountNumber', '12345') + \
        Node('MeterNumber', '6789') + \
        Node('CarrierCode', 'FDXE') + \
        Node('Service', 'STANDARDOVERNIGHT') + \
        Node('Packaging', 'FEDEXENVELOPE')

assert root2.xml() == root.xml()

# Reading a value using operator[]. 
assert root2['AccountNumber'] == '12345'

# Begin creating a DSL for Fedex:    
class RequestHeader(Node):
    def __init__(self, acct, meter, ccode, serv, packg):
        Node.__init__(self, 'RequestHeader')
        self += Node('AccountNumber', acct)
        self += Node('MeterNumber', meter)
        self += Node('CarrierCode', ccode) 
        self += Node('Service', serv) 
        self += Node('Packaging', packg)

header = RequestHeader('12345', '6789', 'FDXE', 
          'STANDARDOVERNIGHT', 'FEDEXENVELOPE')
assert header.xml() == root2.xml()

root3 = Node("Top")
two = Node("two")
two += Node("three", "wompity", x="42")
two += Node("four", "woo")
root3 += two
root3 += Node("five", "rebar", stim="pinch", attr="ouch")

# Conversion to string:
assert str(root3.node("five")) == """\
five
    attribute: stim = pinch
    attribute: attr = ouch
    value: [rebar]"""

assert root3.xml() == """\
<Top>
  <two>
    <three x="42">
      wompity
    </three>
    <four>
      woo
    </four>
  </two>
  <five attr="ouch" stim="pinch">
    rebar
  </five>
</Top>
"""

# Reassign values:
root3["four"] = "wimpozzle"
# Can easily extract a subtree:
assert root3.node("four").xml() == """\
<four>
  wimpozzle
</four>
"""    

# Change attribute:
root3.node("five").setAttribute("attr", "argh!")
assert root3.node("five").xml() == """\
<five attr="argh!" stim="pinch">
  rebar
</five>
"""    

# Remove attribute:
root3.node("five").delAttribute("attr")
assert root3.node("five").xml() == """\
<five stim="pinch">
  rebar
</five>
"""    

# Create a Node from a minidom:
nn = Node.create(root3.dom())
assert nn.xml() == root3.xml()

# Cannot insert values into anything but leaf nodes:
test = copy.deepcopy(root3)
test["two"] = "Replacement"
try:
    print test.xml()
except AssertionError:
    pass # Expected

# Handling block text values (imperfect but consistent):
beware = Node.create("""\
<SomeText>
    Beware the machines that confuse
    And
    The men behind them
</SomeText>
""")
assert beware.xml() == """\
<SomeText>
  Beware the machines that confuse And The men behind them
</SomeText>
"""

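The last test is labelled "imperfect but consistent", and the reason is easier to see when the whitespace handling from Node.create() is pulled out on its own. The sketch below repeats those three steps in a standalone helper (the name squeeze is hypothetical) and runs under Python 3 without the xmlnode module.

```python
import re
from xml.dom.minidom import parseString

def squeeze(text):
    """Hypothetical helper repeating the three steps from Node.create()."""
    text = re.sub(r"\s+", " ", text)   # any run of whitespace -> one space
    text = text.replace("> ", ">")     # drop a space that follows '>'
    text = text.replace(" <", "<")     # drop a space that precedes '<'
    return text

block = """\
<SomeText>
    Beware the machines that confuse
    And
    The men behind them
</SomeText>
"""
flat = squeeze(block)
value = parseString(flat).documentElement.firstChild.data

# Side effect: whitespace inside attribute values is squeezed as well.
attr = squeeze('<a x="1   2"/>')

print(flat)
print(attr)
```

The line breaks in the block text are gone for good, which is the "imperfect" part; the same input always produces the same single-line value, which is the "consistent" part. Because the substitution works on the raw string, it cannot tell markup from content, so attribute values are affected too.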

About the Blogger

Bruce Eckel (www.BruceEckel.com) provides development assistance in Python with user interfaces in Flex. He is the author of Thinking in Java (Prentice-Hall, 1998, 2nd Edition, 2000, 3rd Edition, 2003, 4th Edition, 2005), the Hands-On Java Seminar CD ROM (available on the Web site), Thinking in C++ (PH 1995; 2nd edition 2000, Volume 2 with Chuck Allison, 2003), C++ Inside & Out (Osborne/McGraw-Hill 1993), among others. He's given hundreds of presentations throughout the world, published over 150 articles in numerous magazines, was a founding member of the ANSI/ISO C++ committee and speaks regularly at conferences.

