This post originated from an RSS feed registered with Ruby Buzz
by Eric Hodel.
Original Post: From Iconv#iconv to String#encode
Feed Title: Segment7
Feed URL: http://blog.segment7.net/articles.rss
In Ruby 1.8, if you wanted to transcode between two character sets, you needed to use Iconv. James has an excellent explanation of Encoding Conversion With iconv on his blog.
Ruby 1.9.3 will warn when iconv is required:
$ ruby19 -v -riconv -e 0
ruby 1.9.3dev (2010-12-17 trunk 30231) [x86_64-darwin10.5.0]
iconv will be deprecated in the future, use String#encode instead.
I don't wish to duplicate anything in James' or runpaint's documentation, so I'll just run through a simple example of porting from Iconv#iconv to String#encode.
We'll start with part of James' final example, which converts from UTF-8 to Latin 1 (ISO-8859-1), transliterates characters and ignores unknown sequences:
require "iconv"
utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8")
on_and_on = "On and on… and on…"
utf8_to_latin1.iconv(on_and_on) # => "On and on... and on..."
String#encode in Ruby 1.9.2 doesn't support transliteration. Ruby 1.9.3 supports it, but doesn't have the pre-built tables of Iconv.
For Ruby 1.9.2 using String#encode instead of Iconv looks like this:
on_and_on = "On and on… and on…"
on_and_on.encode Encoding::ISO_8859_1
However, the ellipsis character doesn't have an analog in ISO-8859-1, so we'll get an encoding error:
U+2026 from UTF-8 to ISO-8859-1 (Encoding::UndefinedConversionError)
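To see exactly which character failed, the exception can be rescued and inspected. This is a minimal sketch, not part of the original port; the variable name failure is only for illustration, and the ellipsis is written as the escape \u2026 so the snippet does not depend on the source file's encoding.

```ruby
# encoding: utf-8
on_and_on = "On and on\u2026 and on\u2026" # \u2026 is the ellipsis character

failure = nil
begin
  on_and_on.encode(Encoding::ISO_8859_1)
rescue Encoding::UndefinedConversionError => e
  # The exception records the offending character and both encoding names.
  failure = [e.error_char, e.source_encoding_name, e.destination_encoding_name]
end

# failure now holds the ellipsis, "UTF-8" and "ISO-8859-1", in that order.
p failure
```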
String#encode supports an options Hash for encoding which is described in runpaint's transcoding section. For 1.9.2 we tell String#encode to replace undefined characters with the replacement character:
on_and_on = "On and on… and on…"
result = on_and_on.encode Encoding::ISO_8859_1, :undef => :replace
p result # => "On and on? and on?"
This isn't as nice as Iconv's transliteration, but it's the best 1.9.2 can do without dropping down to the more verbose Encoding::Converter to recover from undefined conversions, and even that approach lacks the tables of Iconv.
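For the curious, here is a rough sketch of what recovering from undefined conversions with Encoding::Converter might look like. The helper utf8_to_latin1 and the TRANSLIT table are hypothetical names made up for this example; Ruby does not ship such a table, so you would have to fill it in by hand. Anything other than an undefined conversion (an invalid byte sequence, for instance) is simply re-raised.

```ruby
# encoding: utf-8

# Hypothetical hand-made table; Ruby has no built-in equivalent of Iconv's.
TRANSLIT = { "\u2026" => "..." }

# Hypothetical helper: converts UTF-8 to ISO-8859-1, looking up characters
# that have no ISO-8859-1 equivalent in +table+ and falling back to "?".
def utf8_to_latin1(string, table)
  converter = Encoding::Converter.new("UTF-8", "ISO-8859-1")
  source = string.dup   # primitive_convert consumes its source buffer
  output = String.new   # converted bytes are appended here

  loop do
    case converter.primitive_convert(source, output)
    when :finished
      return output
    when :undefined_conversion
      # primitive_errinfo returns
      # [result, source encoding, destination encoding, error bytes, readagain bytes]
      _result, source_enc, _dest_enc, error_bytes, _readagain = converter.primitive_errinfo
      char = error_bytes.dup.force_encoding(source_enc)
      # insert_output transcodes the replacement to the destination encoding.
      converter.insert_output(table.fetch(char, "?"))
    else
      # Invalid or incomplete input: give up with the converter's own error.
      raise converter.last_error
    end
  end
end

result = utf8_to_latin1("On and on\u2026 and on\u2026", TRANSLIT)
p result          # => "On and on... and on..."
p result.encoding # => #<Encoding:ISO-8859-1>
```

That is a lot of ceremony for one character, which is why the one-line :undef => :replace is usually good enough on 1.9.2.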
In 1.9.3 String#encode's options Hash will support a new key, :fallback, which can be either a Hash or a Proc that maps characters to replacement Strings. The replacement String that :fallback returns can be in any encoding, and String#encode will attempt to transcode it to the destination encoding. The same example with a current 1.9.3dev build:
on_and_on = "On and on… and on…"
result = on_and_on.encode Encoding::ISO_8859_1, :fallback => { '…' => '...' }
p result # => "On and on... and on..."
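A Hash fallback only covers the characters you list, and as far as I can tell a character missing from the Hash still raises Encoding::UndefinedConversionError. A Proc can supply a default instead. The sketch below assumes a small hand-written table (TRANSLIT_TABLE is a made-up name) and falls back to "?" for everything else:

```ruby
# encoding: utf-8

# Hypothetical table: ellipsis and curly double quotes to ASCII stand-ins.
TRANSLIT_TABLE = {
  "\u2026" => "...",
  "\u201C" => '"',
  "\u201D" => '"',
}

# The Proc receives each undefined character and must return a String.
fallback = lambda { |char| TRANSLIT_TABLE.fetch(char, "?") }

# \u2603 (snowman) is not in the table, so it becomes "?".
quoted = "\u201COn and on\u2026\u201D \u2603"
result = quoted.encode(Encoding::ISO_8859_1, :fallback => fallback)
p result # => "\"On and on...\" ?"
```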
Finally, the above examples all assume the input string is valid for its encoding. You can use String#valid_encoding? to check for invalid byte sequences. If there may be invalid byte sequences in your input string, you can add :invalid => :replace to the encode options Hash to map these sequences to the replacement character.
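A small sketch of both, using a made-up input with a stray \xFF byte, which can never appear in valid UTF-8:

```ruby
# encoding: utf-8

# Deliberately broken input: \xFF is not a valid byte in UTF-8.
input = "On and on\xFF".dup.force_encoding("UTF-8")

valid = input.valid_encoding?
p valid # => false

# Replace both invalid byte sequences and undefined characters.
cleaned = input.encode(Encoding::ISO_8859_1, :invalid => :replace, :undef => :replace)
p cleaned # => "On and on?"
```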