This post originated from an RSS feed registered with Python Buzz
by Phillip Pearson.
Original Post: Encoding of non-ascii characters in URLs
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange
Today I've been subjecting the PeopleAggregator API implementation to the 'Sam Ruby Iñtërnâtiônàlizætiøn test'. It went in and out just fine through XML-RPC, but the REST methods caused a bit more trouble. All sorted out now, but...
It turns out that Firefox, at least on my dev machine, encodes URLs as ISO-8859-1 (or perhaps Windows-1252), whereas Internet Explorer encodes them as UTF-8. I was trying to use PHP's mb_convert_encoding function to convert this, but it was just ignoring any non-ASCII chars.
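To make the difference concrete, here is a small sketch in Python (not the PHP used in PeopleAggregator) that percent-encodes the test string under both charsets. It only illustrates what the two byte sequences look like on the wire; it does not reproduce the browsers' actual behaviour, and the variable names are made up for this example.

```python
from urllib.parse import quote

text = "Iñtërnâtiônàlizætiøn"

# Percent-encode the same string under each charset.
# UTF-8 uses two bytes per accented character here; ISO-8859-1 uses one.
as_utf8 = quote(text, encoding="utf-8")
as_latin1 = quote(text, encoding="iso-8859-1")

print(as_utf8)    # I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n
print(as_latin1)  # I%F1t%EBrn%E2ti%F4n%E0liz%E6ti%F8n
```

The server sees only these percent-escapes, so the same character arrives as different bytes depending on which browser sent it.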
The interesting thing about non-ASCII chars in URLs and POSTDATA is that the browsers don't seem to send any indication of the charset used. Whether the content is UTF-8 or ISO-8859-1, all I get is "Content-Type: application/x-www-form-urlencoded". It would be nice to have "; charset=UTF-8" at the end, but it doesn't seem like I'm that lucky!
As a result of this, I've reduced the scope: PeopleAggregator will support UTF-8 and ISO-8859-1, with UTF-8 strongly preferred.
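One way to implement "UTF-8 preferred, ISO-8859-1 as a fallback" is to try a strict UTF-8 decode first and fall back only when it fails. The following is a minimal Python sketch of that idea, not the actual PeopleAggregator code; `decode_param` is a hypothetical helper name.

```python
from urllib.parse import unquote_to_bytes


def decode_param(raw: str) -> str:
    """Decode a percent-encoded value, preferring UTF-8 over ISO-8859-1."""
    data = unquote_to_bytes(raw)  # undo the %XX escapes, keep raw bytes
    try:
        # Strict decode: raises if the bytes are not valid UTF-8.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail.
        return data.decode("iso-8859-1")


print(decode_param("I%C3%B1t"))  # UTF-8 input
print(decode_param("I%F1t"))     # ISO-8859-1 input
```

This is a heuristic: some ISO-8859-1 byte sequences also happen to be valid UTF-8, and those will be read as UTF-8. That matches the stated preference but can misread unusual input.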
For Frontier's benefit, it will handle XML-RPC requests that pretend to be UTF-8 but are actually ISO-8859-1.
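A mislabelled XML-RPC body could be handled along the same lines. The sketch below, again in Python and purely illustrative, assumes the request either declares UTF-8 or declares nothing; if its bytes are not valid UTF-8, it treats them as ISO-8859-1 and re-encodes them before parsing. `parse_lenient` is a hypothetical name, and this is not necessarily how PeopleAggregator does it.

```python
import xml.etree.ElementTree as ET


def parse_lenient(body: bytes) -> ET.Element:
    """Parse XML that claims UTF-8 but may really be ISO-8859-1."""
    try:
        body.decode("utf-8")  # validation only; result is discarded
    except UnicodeDecodeError:
        # Assumption: bytes that fail as UTF-8 are ISO-8859-1.
        # Re-encode so the UTF-8 declaration becomes true.
        body = body.decode("iso-8859-1").encode("utf-8")
    return ET.fromstring(body)


# A body that declares UTF-8 but carries a raw ISO-8859-1 byte (0xF1 = ñ).
mislabelled = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<value><string>I\xf1t</string></value>"
)
print(parse_lenient(mislabelled).find("string").text)
```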