This post originated from an RSS feed registered with Python Buzz
by Thomas Guest.
Original Post: Narrow Python
Feed Title: Word Aligned: Category Python
Feed URL: http://feeds.feedburner.com/WordAlignedCategoryPython
Feed Description: Dynamic languages in general. Python in particular. The adventures of a space sensitive programmer.
I needed to investigate character code points beyond the Unicode
basic multilingual plane. As usual Python was the tool I reached for first
– well, not quite first, since I’d already leafed through the
introductory sections of the Unicode book, in which I noticed the
following encouraging words from Python’s inventor:
“Modern programs must handle Unicode – Python has excellent support
for Unicode, and will keep getting better.” — Guido van Rossum
I’m not sure I can fully agree with the excellent support bit of
this quotation: in this case, I had to put in the batteries myself.
Legacy Systems
Incidentally, I agree with the BDFL and the many others who are
on record as saying that Unicode is both necessary and great. It’s
just a shame it didn’t happen sooner, because we now have any number
of legacy systems which make a poor fist of things: C++’s built-in
wchar_t is a typically half-baked solution.
(Questions:
Is a wchar_t suitable for Unicode characters?
Can a std::wstring help us write international applications in a portable way?
What’s the best way to handle text data in a C++ program?
Answers:
Maybe.
Probably not.
Watch this space.)
Narrow builds
As I write this, the Python installed on my machine – and indeed on
all the machines I have access to – behaves as follows:
narrow python problems
>>> help(unichr)
Help on built-in function unichr in module __builtin__:

unichr(...)
    unichr(i) -> Unicode character

    Return a Unicode string of one character with
    ordinal i; 0 <= i <= 0x10ffff.

>>> uc = unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
The error message suggested I’d need to rebuild Python to make it
behave. I didn’t really want to do this, so I tried a bit of googling
in case it found me a more favourable answer (it didn’t). I had a closer
look at what python -h had to tell me in case I could supply a -wide
option (I couldn’t). I even wondered if the pythonw which lives alongside
python might be just what I wanted (it wasn’t).
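For what it’s worth, a script can at least find out which kind of build it is running on, and work out by hand the surrogate pair a narrow build would use to represent a code point beyond the BMP. The sketch below is only an illustration: is_narrow_build and surrogate_pair are hypothetical helper names, and the check relies on sys.maxunicode, which is 0xFFFF on a narrow build and 0x10FFFF on a wide one.

```python
import sys


def is_narrow_build():
    # sys.maxunicode is the largest code point a single string element can hold:
    # 0xFFFF on a narrow build, 0x10FFFF on a wide one.
    return sys.maxunicode == 0xFFFF


def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point beyond the BMP."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000         # 20 significant bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x10000)
print("narrow build: %s" % is_narrow_build())
print("U+10000 as surrogates: %04X %04X" % (high, low))
```

This doesn’t make unichr(0x10000) work on a narrow build; it only shows what the two-element representation of such a character would be.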
Wide builds
Finally, I decided I’d have to rebuild Python after all. For the
record, here’s what you do:
Building a wide version of Python
$ tar xjf Python-2.5.tar.bz2
$ cd Python-2.5
$ ./configure --enable-unicode=ucs4 && make
No, it’s not that hard, once you’ve read PEP 261, which
explains the configure options. You’ll have to work out for yourself
if and where you want to install this new and flabby version of Python,
which – shock!, horror!! – doubles the memory used for most Unicode
strings.
Loss of power
For once, I’m disappointed in Python. The
default build provides weakened support for Unicode, which in some
ways is worse than no support for Unicode. Why? Because the language
appears to support Unicode, but is likely to let you down if you
ever venture past the safe region of the basic multilingual plane
– the kind of unwelcome surprise which experienced programmers
rightly fear. And because the behaviour you see on a wide build differs from
the behaviour you get on a narrow build. Worse again, Python is
perfectly able to support the full Unicode standard, if you’re
prepared to trade in a bit of memory for compliance. This is the kind
of trade-off Python users are usually more than happy to accept. If
and when they need to get closer to the silicon, they just use C.
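To make the difference concrete: a narrow build stores strings as UTF-16 code units, so len() counts a character beyond the BMP twice, whereas a wide build counts it once. The sketch below imitates both answers by encoding to UTF-16 and UTF-32 and counting units. It assumes Python 2.6 or later (the utf-32 codec is not in 2.5) and is most dependable on Python 3.3+, where strings always cover the full range; narrow_len and wide_len are hypothetical names.

```python
def narrow_len(text):
    # UTF-16 code units: the count a narrow build's len() would report.
    return len(text.encode("utf-16-le")) // 2


def wide_len(text):
    # UTF-32 code units, i.e. whole code points: a wide build's len().
    return len(text.encode("utf-32-le")) // 4


clef = u"a\U0001D11Eb"  # 'a', MUSICAL SYMBOL G CLEF (U+1D11E), 'b'
print(narrow_len(clef))  # 4
print(wide_len(clef))    # 3
```

The same string gives two different lengths, so slicing and indexing that happen to work on one build can split a character in half on the other.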
Hope
Of course Guido van Rossum did say:
“… Python has excellent support
for Unicode, and will keep getting better.”
(Emphasis on “will keep getting better” is mine.) This looks like one particular area where support could be
better. I’ve seen hints that C++0X (which may well end up becoming
C++1X) will place improved Unicode support into the standard language, but
I’d bet Python will stay ahead by a comfortable margin.