This post originated from an RSS feed registered with Python Buzz
by Phillip Pearson.
Original Post: ElementTree returns normal strings if given a 7-bit document
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange
I'm parsing some XML with ElementTree and trying to handle character encodings properly, but was confused because ET was giving me plain strings (types.StringType) rather than the unicode strings (types.UnicodeType) I'm used to.
It turns out that ET returns plain strings when the text is 7-bit (pure ASCII) and unicode strings otherwise, so it should be safe to pass anything from ET through unicode() if you want to make sure that input to something else is a unicode string.
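A rough sketch of that normalisation step, in case it helps. The helper name text_of is hypothetical (it is not part of ElementTree), and the code assumes Python 2.6 or later for the b'' literals. On Python 3, where ET always returns str, the helper simply passes the text through.

```python
import sys
from xml.etree import ElementTree as ET

# Pick the unicode text type for the running interpreter.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # Python 2 only

def text_of(element):
    """Return element.text as a unicode string (hypothetical helper)."""
    text = element.text
    if text is None:
        # Empty elements have no text at all.
        return text_type()
    if isinstance(text, text_type):
        return text
    # Python 2 plain string: ET only returns these for 7-bit text, so an
    # ASCII decode is safe (same effect as unicode(text) with the default codec).
    return text.decode("ascii")

plain = ET.fromstring(b'<?xml version="1.0"?><test>plain text</test>')
accented = ET.fromstring(
    b'<?xml version="1.0" encoding="utf-8"?><test>\xc3\xa9clair</test>')

print(repr(text_of(plain)))
print(repr(text_of(accented)))
```

Either way the caller gets a unicode string back, whichever type ET chose internally.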
Test script:
#!/usr/bin/python2.5 -u
import sys
import traceback
from xml.etree import ElementTree as ET
from xml.etree import cElementTree as cET

def main():
    for eclair in (u'\xe9clair', u'plain text'):
        utf = eclair.encode("utf-8")
        iso = eclair.encode("iso-8859-1")
        for xml in (
            # expected to fail:
            u"""<?xml version="1.0"?><test>%s</test>""" % eclair, # unicode source - will fail
            """<?xml version="1.0"?><test>%s</test>""" % iso, # iso input specified as utf-8, will fail
            # expected to succeed:
            """<?xml version="1.0"?><test>%s</test>""" % utf, # correct utf-8 input, default encoding
            """<?xml version="1.0" encoding="utf-8"?><test>%s</test>""" % utf, # utf-8 specified as such
            """<?xml version="1.0" encoding="iso-8859-1"?><test>%s</test>""" % iso, # iso-8859-1 specified as such
            ):
            print "------ parsing %s" % `xml`
            try:
                tree = ET.fromstring(xml)
            except Exception, e:
                print "FAIL:",e
                continue
            print "to tree:",tree
            ctree = cET.fromstring(xml)
            print " cET:",ctree
            print "string:",`tree.text`
            print " cET:",`ctree.text`

main()
And the output:
$ ./et_utf8.py
------ parsing u'<?xml version="1.0"?><test>\xe9clair</test>'
FAIL: 'ascii' codec can't encode character u'\xe9' in position 27: ordinal not in range(128)
------ parsing '<?xml version="1.0"?><test>\xe9clair</test>'
FAIL: not well-formed (invalid token): line 1, column 27
------ parsing '<?xml version="1.0"?><test>\xc3\xa9clair</test>'
to tree: <Element test at b7d99a8c>
cET: <Element 'test' at 0xb7d9c908>
string: u'\xe9clair'
cET: u'\xe9clair'
------ parsing '<?xml version="1.0" encoding="utf-8"?><test>\xc3\xa9clair</test>'
to tree: <Element test at b7d99a0c>
cET: <Element 'test' at 0xb7d9c950>
string: u'\xe9clair'
cET: u'\xe9clair'
------ parsing '<?xml version="1.0" encoding="iso-8859-1"?><test>\xe9clair</test>'
to tree: <Element test at b7d9972c>
cET: <Element 'test' at 0xb7d9c938>
string: u'\xe9clair'
cET: u'\xe9clair'
------ parsing u'<?xml version="1.0"?><test>plain text</test>'
to tree: <Element test at b7d9992c>
cET: <Element 'test' at 0xb7d9c908>
string: 'plain text'
cET: 'plain text'
------ parsing '<?xml version="1.0"?><test>plain text</test>'
to tree: <Element test at b7d996cc>
cET: <Element 'test' at 0xb7d9c950>
string: 'plain text'
cET: 'plain text'
------ parsing '<?xml version="1.0"?><test>plain text</test>'
to tree: <Element test at b7d999ec>
cET: <Element 'test' at 0xb7d9c938>
string: 'plain text'
cET: 'plain text'
------ parsing '<?xml version="1.0" encoding="utf-8"?><test>plain text</test>'
to tree: <Element test at b7d9990c>
cET: <Element 'test' at 0xb7d9c908>
string: 'plain text'
cET: 'plain text'
------ parsing '<?xml version="1.0" encoding="iso-8859-1"?><test>plain text</test>'
to tree: <Element test at b7d99a0c>
cET: <Element 'test' at 0xb7d9c950>
string: 'plain text'
cET: 'plain text'
Some notes from the above:
Input to ET should be a byte string (types.StringType), with the encoding declared in the XML header if it isn't UTF-8. A unicode string is only accepted if it contains nothing but 7-bit characters.
cElementTree and ElementTree return consistent results.