The Artima Developer Community
Python Buzz Forum
ElementTree returns normal strings if given a 7-bit document

Phillip Pearson

Phillip Pearson is a Python hacker from New Zealand
ElementTree returns normal strings if given a 7-bit document Posted: Aug 9, 2007 7:02 PM

This post originated from an RSS feed registered with Python Buzz by Phillip Pearson.
Original Post: ElementTree returns normal strings if given a 7-bit document
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange

I'm parsing some XML with ElementTree and trying to handle character encodings properly, but I was confused because ET was giving me plain strings (types.StringType) rather than the unicode strings (types.UnicodeType) I'm used to.

I finally figured out that ET returns plain strings when the text is 7-bit (pure ASCII), so it should be safe to pass anything from ET through unicode() if you want to make sure the input to something else is a unicode string.
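As a rough sketch of that normalisation step, the helper below always hands back a unicode string. The names element_text and text_type are made up for this illustration and are not part of ElementTree. It decodes with the ascii codec, which is what unicode() does by default, and it assumes that any plain string coming out of ET is pure 7-bit. The version check is only there so the same snippet also runs on Python 3, where ET returns unicode text already.

```python
import sys
from xml.etree import ElementTree as ET

# Text type for the running interpreter: unicode on Python 2, str on Python 3.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # this branch only runs on Python 2


def element_text(elem):
    """Return elem.text as a unicode string (hypothetical helper)."""
    text = elem.text
    if text is None:
        # Empty elements have no text at all.
        return text_type()
    if not isinstance(text, text_type):
        # Python 2 only: ET returned a plain 7-bit string,
        # so decoding it as ASCII loses nothing.
        text = text.decode("ascii")
    return text


root = ET.fromstring(b'<?xml version="1.0"?><test>plain text</test>')
value = element_text(root)
```

After this, value is a unicode string whichever type ET originally returned.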

Test script:

#!/usr/bin/python2.5 -u

import sys
import traceback
from xml.etree import ElementTree as ET
from xml.etree import cElementTree as cET

def main():
    for eclair in (u'\xe9clair', u'plain text'):
        utf = eclair.encode("utf-8")
        iso = eclair.encode("iso-8859-1")
        for xml in (
            # expected to fail when the text is not 7-bit:
            u"""<?xml version="1.0"?><test>%s</test>""" % eclair, # unicode source - fails on non-ASCII characters
            """<?xml version="1.0"?><test>%s</test>""" % iso, # iso-8859-1 bytes with no encoding declared, so parsed as utf-8 - fails on non-ASCII bytes

            # expected to succeed:
            """<?xml version="1.0"?><test>%s</test>""" % utf, # correct utf-8 input, default encoding
            """<?xml version="1.0" encoding="utf-8"?><test>%s</test>""" % utf, # utf-8 specified as such
            """<?xml version="1.0" encoding="iso-8859-1"?><test>%s</test>""" % iso, # iso-8859-1 specified as such
            ):
            print "------ parsing %s" % repr(xml)
            try:
                tree = ET.fromstring(xml)
            except Exception, e:
                print "FAIL:",e
                continue
            print "to tree:",tree
            ctree = cET.fromstring(xml)
            print "    cET:",ctree
            print "string:",repr(tree.text)
            print "   cET:",repr(ctree.text)

main()

And the output:

$ ./et_utf8.py
------ parsing u'<?xml version="1.0"?><test>\xe9clair</test>'
FAIL: 'ascii' codec can't encode character u'\xe9' in position 27: ordinal not in range(128)
------ parsing '<?xml version="1.0"?><test>\xe9clair</test>'
FAIL: not well-formed (invalid token): line 1, column 27
------ parsing '<?xml version="1.0"?><test>\xc3\xa9clair</test>'
to tree: <Element test at b7d99a8c>
    cET: <Element 'test' at 0xb7d9c908>
string: u'\xe9clair'
   cET: u'\xe9clair'
------ parsing '<?xml version="1.0" encoding="utf-8"?><test>\xc3\xa9clair</test>'
to tree: <Element test at b7d99a0c>
    cET: <Element 'test' at 0xb7d9c950>
string: u'\xe9clair'
   cET: u'\xe9clair'
------ parsing '<?xml version="1.0" encoding="iso-8859-1"?><test>\xe9clair</test>'
to tree: <Element test at b7d9972c>
    cET: <Element 'test' at 0xb7d9c938>
string: u'\xe9clair'
   cET: u'\xe9clair'
------ parsing u'<?xml version="1.0"?><test>plain text</test>'
to tree: <Element test at b7d9992c>
    cET: <Element 'test' at 0xb7d9c908>
string: 'plain text'
   cET: 'plain text'
------ parsing '<?xml version="1.0"?><test>plain text</test>'
to tree: <Element test at b7d996cc>
    cET: <Element 'test' at 0xb7d9c950>
string: 'plain text'
   cET: 'plain text'
------ parsing '<?xml version="1.0"?><test>plain text</test>'
to tree: <Element test at b7d999ec>
    cET: <Element 'test' at 0xb7d9c938>
string: 'plain text'
   cET: 'plain text'
------ parsing '<?xml version="1.0" encoding="utf-8"?><test>plain text</test>'
to tree: <Element test at b7d9990c>
    cET: <Element 'test' at 0xb7d9c908>
string: 'plain text'
   cET: 'plain text'
------ parsing '<?xml version="1.0" encoding="iso-8859-1"?><test>plain text</test>'
to tree: <Element test at b7d99a0c>
    cET: <Element 'test' at 0xb7d9c950>
string: 'plain text'
   cET: 'plain text'

Some notes from the above:

  • Input to ET should be a byte string (types.StringType) with a proper encoding specification in the XML header; unicode input fails as soon as it contains a non-ASCII character.
  • cElementTree and ElementTree return consistent results.
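
If you start with a unicode string, one way to follow the first note is to encode the document to bytes yourself and declare the same encoding in the header. This is only a sketch: make_document is a hypothetical helper invented for this example, the <test> wrapper is borrowed from the test script, and the encodings tried are the two used above.

```python
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape


def make_document(body, encoding="utf-8"):
    """Wrap unicode text in <test> and return encoded bytes (hypothetical helper)."""
    # Declare the encoding in the header so the parser knows how to decode the bytes.
    header = u'<?xml version="1.0" encoding="%s"?>' % encoding
    # escape() protects &, < and > in the text content.
    document = header + u"<test>%s</test>" % escape(body)
    return document.encode(encoding)


eclair = u"\xe9clair"
utf8_doc = make_document(eclair, "utf-8")
latin1_doc = make_document(eclair, "iso-8859-1")

# Both documents parse back to the same unicode text.
utf8_text = ET.fromstring(utf8_doc).text
latin1_text = ET.fromstring(latin1_doc).text
```

The two byte strings differ, but because each one declares the encoding it was written in, both parse back to the original text.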
