The Artima Developer Community
Sponsored Link

Weblogs Forum
Simplifying XML Manipulation

29 replies on 2 pages. Most recent reply: Jun 23, 2006 6:30 PM by Andy Dent

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a threaded view of the topic  Threaded View   
Previous Topic   Next Topic
Flat View: This topic has 29 replies on 2 pages [ 1 2 | » ]
Bruce Eckel

Posts: 875
Nickname: beckel
Registered: Jun, 2003

Simplifying XML Manipulation (View in Weblogs)
Posted: Jun 7, 2006 11:42 AM
Reply to this message Reply
Summary
Every time someone creates a new XML-based language, God (performs some unspeakable act).
Advertisement

My initial strategy for XML was to recognize it for what it is -- a platform independent way to move data from one place to another -- and then, as much as possible, to ignore it and assume that I would only have to actually look at XML when I was debugging something.

Occasionally, XML is actually treated this way. For example, in XML-RPC, you make an ordinary function call, the mechanism converts it into XML and uses HTTP to transport the call across the network, the return value comes back as XML and the mechanism converts the XML back into useful data. It's the wet-dream of remote procedure calls that we've been trying, and failing, to achieve all these years (someone starts with an idea, then committees descend and you end up with things like CORBA and SOAP).

Alas, I haven't seen XML-RPC used yet in web-servicey things. I hope that it is used somewhere, but if you read my previous entry you'll see that the first two services I tried to use -- Fedex and the Post Office (and UPS apparently also works this way) -- uses something that's almost XML-RPC, but it's not, and you're expected to assemble your XML by hand and then use HTTP to send it to the server, then unbundle the XML that comes back. It's the worst of both worlds: an ordinary RESTful call, but instead of just handing the server plain HTTP arguments, you have to build and dissect the XML. I'm sure there's some reason they decided not to use XML-RPC in these cases, but I can't imagine what it is.

The point being that here's a case where the XML should have been invisible, but you have to mess with it by hand.

Another example is Ant. The creator of this tool has since apologized for using XML, but it's what we're stuck with. And it was too late, anyway, since Maven also appears to use XML. So again, you're stuck creating XML by hand.

For Thinking in Java, I wanted to automatically create the Ant build files for all the chapters, so I created a tool in Python that I call AntBuilder (it's not a general-purpose tool; it's designed specifically for the code in Thinking in Java). In the process I used xml.dom.minidom from the standard Python library. I'm not exactly sure what the genesis of this library is, but the resulting code is terribly verbose and obtuse, and thus difficult to modify and maintain. Indeed, others have created libraries that are simpler in comparison to xml.dom.minidom; elementree is probably the most notable, but I still found it a bit more verbose than I want, and I prefer something more like py.xml (also look at the simplicity of py.test in the same package). It's worth noting, though, that elementree isn't all that bad and if it had been the original XML library in Python I probably wouldn't have gone to this trouble. And more importantly, a subset of elementree will be in Python 2.5.

At the time I was creating AntBuilder, the messiness of xml.dom.minidom made the creation of the build.xml files far too difficult to manage, so I began creating a simplified approach for building XML trees. And when I tried to make some modifications to the code from http://opensource.pseudocode.net/files/shipping.tgz which also used xml.dom.minidom, I ran into the same over-complexity problem. So yesterday I began rewriting the XML code from the AntBuilder project to create a general purpose XML manipulation library called xmlnode.

My goal was to make the resulting code as minimal and (to my standards) readable as possible. To me, XML is a hierarchy of nodes, and each node has a tag, potentially some attributes, and either a value or subnodes. As it turns out, the whole of XML can be encapsulated into a single class which I call Node. The operator += has been overloaded to insert subnodes into a node, and the operator + has been overloaded to string together a group of subnodes.

You create a new Node using its constructor, and you give it a tag name, an optional value, and optionally one or more attributes. Thus, building an XML tree is about as simple as thinking about it -- the syntax doesn't get in the way (IMO).

With XML web-servicey things, the results also come back in XML, so the Node class has a static method (available in Python 2.4) that will take a dom and produce a hierarchy of Node objects. You can select a Node that has a particular tag name by using the "[]" operator (examples below). This only returns the first node with that tag name; if there are more then you must write the code to iterate through the nodes and select the appropriate one(s). But that's not very hard since each node just contains a Python list of other nodes.

You can download the code here (right-click to download). As I get feedback I will make changes. If you import the file, you'll get the single class that uses xml.dom.minidom, and if you run it as a standalone program it will execute all the test/demonstration code which provides examples so you can see how to use it.

Update: (6/19/06) By using and testing this code, I've significantly redesigned it to make it clearer and easier to use. Also note that the design of the class allows it to be easily retargeted to a different underly XML library implementation, if desired.

#!/usr/bin/python2.4
"""
xmlnode.py -- Rapidly assemble XML using minimal coding.

By Bruce Eckel, (c)2006 MindView Inc. www.MindView.net
Permission is granted to use or modify without payment as 
long as this copyright notice is retained.

Everything is a Node, and each Node can either have a value 
or subnodes. Subnodes can be appended to Nodes using '+=', 
and a group of Nodes can be strung together using '+'.

Create a node containing a value by saying 
Node("tag", "value")
You can also give attributes to the node in the constructor:
Node("tag", "value", attr1 = "attr1", attr2 = "attr2")
or without a value:
Node("tag", attr1 = "attr1", attr2 = "attr2")

To produce xml from a finished Node n, say n.xml() (for 
nicely formatted output) or n.rawxml().

You can read and modify the attributes of an xml Node using 
getAttribute(), setAttribute(), or delAttribute().

You can find the value of the first subnode with tag == "tag"
by saying n["tag"]. If there are multiple instances of n["tag"],
this will only find the first one, so you should use node() or
nodeList() to narrow your search down to a Node that only has
one instance of n["tag"] first.

You can replace the value of the first subnode with tag == "tag"
by saying n["tag"] = newValue. The same issues exist as noted
in the above paragraph.

You can find the first node with tag == "tag" by saying 
node("tag"). If there are multiple nodes with the same tag 
at the same level, use nodeList("tag").

The Node class is also designed to create a kind of "domain 
specific language" by subclassing Node to create Node types 
specific to your problem domain.

This implementation uses xml.dom.minidom which is available
in the standard Python 2.4 library. However, it can be 
retargeted to use other XML libraries without much effort.
"""
from xml.dom.minidom import getDOMImplementation, parseString
import copy, re

class Node(object):
    """
    Everything is a Node. The XML is maintained as (very efficient)
    Python objects until an XML representation is needed.
    """
    def __init__(self, tag, value = None, **attributes):
        self.tag = tag.strip()
        self.attributes = attributes
        self.children = []
        self.value = value
        if self.value:
            self.value = self.value.strip()

    def getAttribute(self, name):
        """
        Read XML attribute of this node.
        """
        return self.attributes[name]

    def setAttribute(self, name, item):
        """
        Modify XML attribute of this node.
        """
        self.attributes[name] = item

    def delAttribute(self, name):
        """
        Remove XML attribute with this name.
        """
        del self.attributes[name]

    def node(self, tag):
        """ 
        Recursively find the first subnode with this tag. 
        """
        if self.tag == tag:
            return self
        for child in self.children:
            result = child.node(tag)
            if result:
                return result
        return False
        
    def nodeList(self, tag):
        """ 
        Produce a list of subnodes with the same tag. 
        Note:
        It only makes sense to do this for the immediate 
        children of a node. If you went another level down, 
        the results would be ambiguous, so the user must 
        choose the node to iterate over.
        """
        return [n for n in self.children if n.tag == tag]

    def __getitem__(self, tag):
        """ 
        Produce the value of a single subnode using operator[].
        Recursively find the first subnode with this tag. 
        If you want duplicate subnodes with this tag, use
        nodeList().
        """
        subnode = self.node(tag)
        if not subnode:
            raise KeyError
        return subnode.value

    def __setitem__(self, tag, newValue):
        """ 
        Replace the value of the first subnode containing "tag"
        with a new value, using operator[].
        """
        assert isinstance(newValue, str), "Value " + str(newValue) + " must be a string"
        subnode = self.node(tag)
        if not subnode:
            raise KeyError
        subnode.value = newValue

    def __iadd__(self, other):
        """
        Add child nodes using operator +=
        """
        assert isinstance(other, Node), "Tried to += " + str(other)
        self.children.append(other)
        return self

    def __add__(self, other):
        """
        Allow operator + to combine children
        """
        return self.__iadd__(other)

    def __str__(self):
        """
        Display this object (for debugging)
        """
        result = self.tag + "\n"
        for k, v in self.attributes.items():
            result += "    attribute: %s = %s\n" % (k, v)
        if self.value:
            result += "    value: [%s]" % self.value
        return result
        
    # The following are the only methods that rely on the underlying
    # Implementation, and thus the only methods that need to change
    # in order to retarget to a different underlying implementation.

    # A static dom implementation object, used to create elements:        
    doc = getDOMImplementation().createDocument(None, None, None)

    def dom(self):
        """
        Lazily create a minidom from the information stored
        in this Node object.
        """
        element = Node.doc.createElement(self.tag)
        for key, val in self.attributes.items():
            element.setAttribute(key, val)
        if self.value:
            assert not self.children, "cannot have value and children: " + str(self)
            element.appendChild(Node.doc.createTextNode(self.value))
        else:
            for child in self.children:
                element.appendChild(child.dom()) # Generate children as well
        return element

    def xml(self, separator = '  '):
        return self.dom().toprettyxml(separator)

    def rawxml(self):
        return self.dom().toxml()

    @staticmethod
    def create(dom):
        """
        Create a Node representation, given either
        a string representation of an XML doc, or a dom.
        """
        if isinstance(dom, str):
            # Strip all extraneous whitespace so that
            # text input is handled consistently:
            dom = re.sub("\s+", " ", dom)
            dom = dom.replace("> ", ">")
            dom = dom.replace(" <", "<")
            return Node.create(parseString(dom))
        if dom.nodeType == dom.DOCUMENT_NODE:
            return Node.create(dom.childNodes[0])
        if dom.nodeName == "#text":
            return
        node = Node(dom.nodeName)
        if dom.attributes:
            for name, val in dom.attributes.items():
                node.setAttribute(name, val)
        for n in dom.childNodes:
            if n.nodeType == n.TEXT_NODE and n.wholeText.strip():
                node.value = n.wholeText
            else:
                subnode = Node.create(n)
                if subnode:
                    node += subnode
        return node

Here is the test program for the class, which also demonstrates how to use the various features:

#!python
"""
Test the xmlnode library, and demonstrate how to use it.
"""
from xmlnode import Node
import copy

# The XML we want to create:
request = """\
<RequestHeader>
  <AccountNumber>
    12345
  </AccountNumber>
  <MeterNumber>
    6789
  </MeterNumber>
  <CarrierCode>
    FDXE
  </CarrierCode>
  <Service>
    STANDARDOVERNIGHT
  </Service>
  <Packaging>
    FEDEXENVELOPE
  </Packaging>
</RequestHeader>
"""

# Create a Node from a string:
requestNode = Node.create(request)
# Create a Node from a minidom:
requestNode = Node.create(requestNode.dom())

# Assemble a Node programmatically:
root = Node('RequestHeader')
account = Node('AccountNumber', '12345')
assert account.value == '12345'
root += account # Insert account node as child of root
# Add new child nodes:
root += Node('MeterNumber', '6789')
root += Node('CarrierCode', 'FDXE') 
root += Node('Service', 'STANDARDOVERNIGHT') 
root += Node('Packaging', 'FEDEXENVELOPE')

assert root.xml() == requestNode.xml()
assert root.xml() == request
# Dom objects are different, not equivalent:
assert root.dom() != requestNode.dom()

# A more succinct approach. The '+' adds child nodes
# to the RequestHeader node. 
root2 = Node('RequestHeader') + \
        Node('AccountNumber', '12345') + \
        Node('MeterNumber', '6789') + \
        Node('CarrierCode', 'FDXE') + \
        Node('Service', 'STANDARDOVERNIGHT') + \
        Node('Packaging', 'FEDEXENVELOPE')

assert root2.xml() == root.xml()

# Reading a value using operator[]. 
assert root2['AccountNumber'] == '12345'

# Begin creating a DSL for Fedex:    
class RequestHeader(Node):
    def __init__(self, acct, meter, ccode, serv, packg):
        Node.__init__(self, 'RequestHeader')
        self += Node('AccountNumber', acct)
        self += Node('MeterNumber', meter)
        self += Node('CarrierCode', ccode) 
        self += Node('Service', serv) 
        self += Node('Packaging', packg)

header = RequestHeader('12345', '6789', 'FDXE', 
          'STANDARDOVERNIGHT', 'FEDEXENVELOPE')
assert header.xml() == root2.xml()

root3 = Node("Top")
two = Node("two")
two += Node("three", "wompity", x="42")
two += Node("four", "woo")
root3 += two
root3 += Node("five", "rebar", stim="pinch", attr="ouch")

# Conversion to string:
assert str(root3.node("five")) == """\
five
    attribute: stim = pinch
    attribute: attr = ouch
    value: [rebar]"""

assert root3.xml() == """\
<Top>
  <two>
    <three x="42">
      wompity
    </three>
    <four>
      woo
    </four>
  </two>
  <five attr="ouch" stim="pinch">
    rebar
  </five>
</Top>
"""

# Reassign values:
root3["four"] = "wimpozzle"
# Can easily extract a subtree:
assert root3.node("four").xml() == """\
<four>
  wimpozzle
</four>
"""    

# Change attribute:
root3.node("five").setAttribute("attr", "argh!")
assert root3.node("five").xml() == """\
<five attr="argh!" stim="pinch">
  rebar
</five>
"""    

# Remove attribute:
root3.node("five").delAttribute("attr")
assert root3.node("five").xml() == """\
<five stim="pinch">
  rebar
</five>
"""    

# Create a Node from a minidom:
nn = Node.create(root3.dom())
assert nn.xml() == root3.xml()

# Cannot insert values into anything but leaf nodes:
test = copy.deepcopy(root3)
test["two"] = "Replacement"
try:
    print test.xml()
except AssertionError:
    pass # Expected

# Handling block text values (imperfect but consistent):
beware = Node.create("""\
<SomeText>
    Beware the machines that confuse
    And
    The men behind them
</SomeText>
""")
assert beware.xml() == """\
<SomeText>
  Beware the machines that confuse And The men behind them
</SomeText>
"""


Nathan Moore

Posts: 1
Nickname: q7nate
Registered: Jun, 2006

Re: Simplifying XML Manipulation Posted: Jun 7, 2006 12:23 PM
Reply to this message Reply
For parsing xml in python BeautifulSoup makes it tolerable, this looks like it might make producing less of a pain.

old vb3guy

Posts: 1
Nickname: oldvb3guy
Registered: Jun, 2006

Re: Simplifying XML Manipulation Posted: Jun 7, 2006 12:27 PM
Reply to this message Reply
One word: Namespaces. Wouldn't this approach break down when XML with multiple namespaces is processed? Multiple namespaces imply a composition of XML vocabularies which overlays the simple hierachy of the document. Anyway, just a thought. I do enjoy reading your blog, good brain food...

Ian Bicking

Posts: 900
Nickname: ianb
Registered: Apr, 2003

Re: Simplifying XML Manipulation Posted: Jun 7, 2006 12:54 PM
Reply to this message Reply
xml.dom.minidom is a sad, sad library. Python has suffered from not enough NIH sometimes, where an API like the DOM gets brought in from elsewhere even though it's not a particularly good example of anything. I thought for a while I'd use it to get some symmetry between Javascript and Python (before I knew the DOM very well) and it was very disappointing... and maybe it's also that the DOM code in Python just isn't of particularly good quality. I don't know. It's abandoned, even if it is in the standard library; a sad state.

py.xml is very pleasantly simple. Why didn't you use that? Alternately, if you want ElementTree but want it to be easier to build XML, you might find formencode.htmlgen to be convenient (http://svn.formencode.org/FormEncode/trunk/formencode/htmlgen.py) -- it's not really round-trip (it uses ElementTree subclasses, and hence parsed XML won't have the extra building features that you'd otherwise get). At this point, I might be more inclined to add ElementTree serialization to py.xml, and get py.xml properly and independently packaged, instead of going further with that; but I still use that module often anyway.

For RPC I'd probably be inclined to use JSON these days, because the object model is slightly better than XML-RPC (includes null), and simpler (e.g., no method name, doesn't force an RPC feel). JSON seems to be gaining some traction over XML-RPC these days, in part because it can piggyback on Ajax into places where XML-RPC is seen as too old and ineligant.

Noam Tamim

Posts: 26
Nickname: noamtm
Registered: Jun, 2005

Namespaces are a headache Posted: Jun 7, 2006 1:12 PM
Reply to this message Reply
I think you forgot about namespaces. From my experience with XML, they are THE biggest headache about it. Their semantics are far from intuitive. It's not like they're wrong; you just have to understand them really well in order to use them correctly.

You avoid that problem by not including namespace support. This may be good enough for your needs, but it can't be a general purpose XML manipulation library.

My entire experience with XML was using Java. However I think Java's built-in XML APIs are terrible. The best library I found is Elliotte Rusty Harold's XOM - see http://www.xom.nu/. It has an excellent API.

-Noam.

Bruce Eckel

Posts: 875
Nickname: beckel
Registered: Jun, 2003

Re: Namespaces are a headache Posted: Jun 7, 2006 2:22 PM
Reply to this message Reply
I like XOM too. There's a small introduction to it in Thinking in Java 4e.

And you're right, I'm ignoring XML namespaces because they haven't appeared yet in what I've had to use XML for. Apparently py.xml does handle namespaces.

Evan Jones

Posts: 1
Nickname: evanjones
Registered: Jun, 2006

Re: Simplifying XML Manipulation Posted: Jun 7, 2006 5:50 PM
Reply to this message Reply
I might as well toss my hat into the ring. I've created a Python framework to simplify parsing and generating XML files that match a template. It converts XML to/from Python objects. I haven't *quite* finished the generation features, but I probably should soon.

http://evanjones.ca/software/simplexmlparse.html

Daniel Rinehart

Posts: 1
Nickname: danielr
Registered: Jun, 2006

Re: Simplifying XML Manipulation Posted: Jun 8, 2006 5:19 AM
Reply to this message Reply
You would need to modify the Python language itself for full support, but in general the E4X model of XML manipulation is one of the best I've worked with.

Andy Dent

Posts: 165
Nickname: andydent
Registered: Nov, 2005

Re: Simplifying XML Manipulation Posted: Jun 8, 2006 7:23 AM
Reply to this message Reply
I think this is publicly accessible

https://www.seegrid.csiro.au/twiki/bin/view/Xmml/AdX3GraphicalRendering
shows an example of a graphical rendering generated by some DOT code (see http://www.graphviz.org/)

It was generated by the following Python which uses ElementTree and shows namespaces being used. It also illustrates combining Python strengths with XPath.

The XML is a small but complex example, to give you an idea here's the first element showing how many (standardised) namespaces are being used!

This is typical of the kind of complex XML used to represent complex data like assay results of samples taken from boreholes, complete with spatial coordinates and attribution to various parties.



<Report
xmlns="http://xml.arrc.csiro.au/adx"
xmlns:al="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"
xmlns:cil="urn:oasis:names:tc:ciq:xsdschema:xCIL:2.0"
xmlns:gml="http://www.opengis.net/gml"
xmlns:nal="urn:oasis:names:tc:ciq:xsdschema:xNAL:2.0"
xmlns:nl="urn:oasis:names:tc:ciq:xsdschema:xNL:2.0"
xmlns:xmml="http://www.opengis.net/xmml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://xml.arrc.csiro.au/adx
..\..\XMML\adx.xsd" id="XX-R923364-12" version="2.0.1">


"""
adx2dot
"""
import pydot
#import pydotNoParser as pydot # Andy's hack so you don't need pyparsing installed
# import pdis.xpath # not using at present, just using elementree xpath support
from elementtree import ElementTree as ET

# namespace shortcuts
# easy way to assemble '//{http://www.opengis.net/gml}definitionMember/{http://www.opengis.net/om}ProcedureSequence'
# using ''.join( ('//', gml, 'definitionMember/', om, 'ProcedureSequence') )
# use these shortcuts throughout logic so we can reassign them, in case xmlns is different!!!
adx = "{http://www.seegrid.csiro.au/xml/adx/3}"
gml = "{http://www.opengis.net/gml}"
om = "{http://www.opengis.net/om}"
xlink = "{http://www.w3.org/1999/xlink}"


def buildActivitiesDict(tree):
"""
builds dict of all
<gml:definitionMember><adx:SpecimenPreparationActivity>
This is the sort of thing you'd use an xsl:key for in XSLT.
"""
activities = tree.findall( ''.join(( '//', gml, 'definitionMember/', adx, 'SpecimenPreparationActivity' )) )
d = {}
for a in activities:
d[ a.attrib['{http://www.opengis.net/gml}id'] ] = a # index elements by their id
return d


def graphSeqDefsCrossLinking(tree, actsD):
"""
returns a graph you can dump with .to_string() or graph with write_png
actsD should be a dict created with buildActivitiesDict

The nodes are shown
"""
graph = pydot.Dot(size='5.5,7')
seqsPath = ''.join( ('//', gml, 'definitionMember/', om, 'ProcedureSequence') )
seqs = tree.findall( seqsPath)
for seq in seqs:
seqName = seq.find(gml+'name').text
graph.add_node( pydot.Node(seqName) )
edgeFrom = seqName
for step in seq.findall(om+'step'):
activityDesc = actsD[ step.attrib[xlink+'href'][1:] ] # assume # in href[0:0], eg: #rec1
edgeTo = activityDesc.find(gml+'name').text
graph.add_edge( pydot.Edge(edgeFrom, edgeTo) )
return graph


def graphSeqDefs(tree, actsD):
"""
returns a graph you can dump with .to_string() or graph with write_png
actsD should be a dict created with buildActivitiesDict
"""
graph = pydot.Dot(size='5.5,7')
seqsPath = ''.join( ('//', gml, 'definitionMember/', om, 'ProcedureSequence') )
seqs = tree.findall( seqsPath)
uniqueStepNumber = 0
for seq in seqs:
seqName = seq.find(gml+'name').text
seqNode = pydot.Node(seqName)
seqNode.shape = "box"
graph.add_node( seqNode )
edgeFrom = seqName
for step in seq.findall(om+'step'):
activityDesc = actsD[ step.attrib[xlink+'href'][1:] ] # assume # in href[0:0], eg: #rec1
edgeTo = 'step'+str(uniqueStepNumber)
uniqueStepNumber += 1
destNode = pydot.Node(edgeTo)
destNode.label = activityDesc.find(gml+'name').text # unique name, possibly similar label
graph.add_node( destNode )
graph.add_edge( pydot.Edge(edgeFrom, edgeTo) )
edgeFrom = edgeTo
return graph

Toby Ho

Posts: 4
Nickname: airportyh
Registered: Jun, 2006

Re: Simplifying XML Manipulation Posted: Jun 8, 2006 8:39 AM
Reply to this message Reply
Have you seen the XML builder libraries in groovy and ruby? With the power of closures and the ability to write a method (method_missing in ruby) that responds to any message, those languages really give you the ability to create a library that is very succinct to use.

Here's a groovy link for a groovy example: http://onestepback.org/articles/groovy/xmlbuilder.html.
And here's the ruby example that should the same output as your test.

x.RequestHeader {
x.AccountNumber '12345'
x.MeterNumber '6789'
x.CarrierCode 'FDXE'
x.Service 'STANDARDOVERNIGHT'
x.Packaging 'FEDEXENVELOPE'
}

I think you can even remove the inner x's so it looks like:

x.RequestHeader {
AccountNumber '12345'
MeterNumber '6789'
CarrierCode 'FDXE'
Service 'STANDARDOVERNIGHT'
Packaging 'FEDEXENVELOPE'
}

How clean is that?

Ian Bicking

Posts: 900
Nickname: ianb
Registered: Apr, 2003

Re: Simplifying XML Manipulation Posted: Jun 8, 2006 8:50 AM
Reply to this message Reply
With a builder like py.xml this would look like:


node = x.RequestHeader(
x.AccountNumber('12345'),
x.MeterNumber('6789'),
x.CarrierCode('FDXE'),
x.Service('STANDARDOVERNIGHT'),
x.Packaging('FEDEXENVELOPE'),
)

Steve R. Hastings

Posts: 5
Nickname: steveha
Registered: Jun, 2006

Re: Simplifying XML Manipulation Posted: Jun 8, 2006 10:11 AM
Reply to this message Reply
I wrote a Python library called "xe". xe is designed to make it very easy to work with structured XML.

Most XML libraries are designed to work with arbitrary XML, so it's a lot of work to do even the simplest thing. xe can handle arbitrary XML, but it's really not what it was designed for.

I've announced this a few times on comp.lang.python and no one seemed excited by it, but I think this is really cool stuff.

http://home.avvanta.com/~steveha/xe


Using xe, I have written several libararies for working with syndication feeds. Take a look at the source for my RSS or Atom libraries to see just how little code they contain and how tidy it is.

http://home.blarg.net/~steveha/pyfeed.html

Currently xe is pretty stable, but there are two changes I intend to make: first, I need to make it automatically convert character entities into characters ("&" should become "&") one layer deep, and extract CDATA sections; second, I want to make it handle XML name spaces. I have been very busy lately and haven't been working on this, but I should have more time to work on it very soon.

If you have any questions, just send email to one of the addresses from one of the above web pages. And of course I'll accept patches if anyone ever offers some.

Steve R. Hastings

Posts: 5
Nickname: steveha
Registered: Jun, 2006

Re: Simplifying XML Manipulation Posted: Jun 8, 2006 10:58 AM
Reply to this message Reply
I went ahead and coded up an example of xe, using the unit test from the blog post. Note that xe pretty-prints the tags somewhat differently from the way Bruce Eckel's code does it; xe does give fine-grained control over how the pretty-printing works (see the xe.TFC class for details).


import xe

xe.set_indent_str(" ")

root = xe.NestElement("RequestHeader")

# binding names inside NestElement builds the tree implicitly

root.acct_num = xe.IntElement("AccountNumber", 12345)
root.meter_num = xe.IntElement("MeterNumber", 6789)
root.carr_code = xe.TextElement("CarrierCode", "FDXE")
root.service = xe.TextElement("Service", "STANDARDOVERNIGHT")
root.packaging = xe.TextElement("Packaging", "FEDEXENVELOPE")


# acct_num is an IntElement, so .value is an integer
print root.acct_num.value
assert root.acct_num.value == 12345

# str(root) yields the XML tags representation
print str(root)

assert str(root) == """\
<RequestHeader>
<AccountNumber>12345</AccountNumber>
<MeterNumber>6789</MeterNumber>
<CarrierCode>FDXE</CarrierCode>
<Service>STANDARDOVERNIGHT</Service>
<Packaging>FEDEXENVELOPE</Packaging>
</RequestHeader>"""

# .attrs is a dictionary of attribute names; by default attrs are strings

# You can subclass the attrs to make custom attributes with checked
# values, or of another type. See opml.py for a class that has a
# timestamp in the attributes.

root3 = xe.NestElement("Top")
root3.two = xe.NestElement("two")
root3.two.three = xe.TextElement("three", "wompity")
root3.two.three.attrs["x"] = "42"
root3.two.four = xe.TextElement("four", "woo")
root3.five = xe.TextElement("five", "rebar")
root3.five.attrs["attr"] = "ouch"
root3.five.attrs["stim"] = "pinch"

print str(root3)

assert str(root3) == """\
<Top>
<two>
<three x="42">wompity</three>
<four>woo</four>
</two>
<five
attr="ouch"
stim="pinch">rebar</five>
</Top>"""

# both these lines work to set the text value
root3.two.four.text = "wimpozzle"
root3.two.four = "wimpozzle"


# Can easily extract a subtree:
assert str(root3.two.four) == """<four>wimpozzle</four>"""

root3.five.attrs["attr"] = "argh!"
assert str(root3.five) == """\
<five
attr="argh!"
stim="pinch">rebar</five>"""


del(root3.five.attrs["attr"])
assert str(root3.five) == """<five stim="pinch">rebar</five>"""

Steve R. Hastings

Posts: 5
Nickname: steveha
Registered: Jun, 2006

Re: Simplifying XML Manipulation Posted: Jun 8, 2006 11:05 AM
Reply to this message Reply
Sorry, my example of an XML entity gets turned to a single character. What I said was:

"&amp;" should become "&"

Bruce Eckel

Posts: 875
Nickname: beckel
Registered: Jun, 2003

Re: Simplifying XML Manipulation Posted: Jun 9, 2006 9:32 AM
Reply to this message Reply
> With a builder like py.xml this would look like:
>
>

> node = x.RequestHeader(
> x.AccountNumber('12345'),
> x.MeterNumber('6789'),
> x.CarrierCode('FDXE'),
> x.Service('STANDARDOVERNIGHT'),
> x.Packaging('FEDEXENVELOPE'),
> )
>


I didn't imagine it being quite that clean when I looked at the docs, but that's pretty compelling. I still like my system (it's my C++ roots; I really like elegant applications of operator overloading), but if I started running into more complex problems -- primarily XML namespaces, I imagine -- then I'll probably use py.xml. So far, all I've had to do with XML is things like this, and I haven't yet run into any namespaces issues.

Small rant off topic: I don't understand why the py.test approach isn't being followed in the unit testing stuff that's been incorporated into the standard Python library. It seems like people are just blindly following the JUnit model, which was poorly designed to begin with. In general I really like the simple thinking that the "py." guy has.

Flat View: This topic has 29 replies on 2 pages [ 1  2 | » ]
Topic: Simplifying XML Manipulation Previous Topic   Next Topic Topic: XML Processors can't Ignore Namespaces

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use