This post originated from an RSS feed registered with Ruby Buzz
by Eric Hodel.
Original Post: String Encoding Quick-Start
Feed Title: Segment7
Feed URL: http://blog.segment7.net/articles.rss
Feed Description: Posts about and around Ruby, MetaRuby, ruby2c, ZenTest and work at The Robot Co-op.
So you're using Ruby 1.9 and don't want to read JEG's Understanding M17N series or runpaint's Encoding documents. I understand: they're very long and very detailed, and you can go back and read them when you need the details. How about just the important stuff?
Files have encodings, and string literals in those files have matching encodings. You set the encoding with a magic comment, # coding: UTF-8, as the first line of the file (or the second line if the file starts with a shebang). If you don't set the encoding, your strings are assumed to be in US-ASCII.
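As a minimal sketch of the magic comment in action (the greeting method is a hypothetical name used only for illustration), the file below declares UTF-8 and then asks Ruby which encoding it used for a literal and for the file itself. Note that Ruby 2.0 and later already default to UTF-8 for source files, so there the comment is redundant but harmless.

```ruby
# coding: UTF-8
# The magic comment above must be the first line of the file
# (or the second, directly after a shebang line).

# Hypothetical helper that returns a literal containing a non-ASCII character.
def greeting
  "héllo"
end

puts greeting.encoding  # encoding of the string literal
puts __ENCODING__       # source encoding Ruby is using for this file
```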
Regular expression literals that contain only ASCII characters are in US-ASCII encoding by default. You can use /u to make one UTF-8, /e to make it EUC-JP, or /s to make it Windows-31J. You can also use #force_encoding. US-ASCII regular expressions will match strings in any ASCII-compatible encoding. See ri Regexp for more details.
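The flags can be checked directly with Regexp#encoding. This is only a sketch; regexp_encoding_names is a hypothetical helper name, and the pattern and sample string are made up for illustration.

```ruby
# coding: UTF-8

# Hypothetical helper: reports the encoding name for each regexp flag.
def regexp_encoding_names
  {
    :default => /llo/.encoding.name,
    :u       => /llo/u.encoding.name,
    :e       => /llo/e.encoding.name,
    :s       => /llo/s.encoding.name
  }
end

regexp_encoding_names.each { |flag, name| puts "#{flag}: #{name}" }

# A US-ASCII regexp still matches an ASCII-compatible (here UTF-8) string.
# String#index returns the character offset of the match.
puts "héllo".index(/llo/)
```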
String#gsub may not preserve the input encoding. If the match happens at the beginning of the string, the output encoding may not match the input encoding. You can work around this by forcing the encoding on the replacement string before replacement or by using the block form of gsub. (This behavior may be a bug.)
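Both workarounds can be combined in one small helper. This is a sketch, not a fix for the underlying behavior: gsub_keeping_encoding is a hypothetical name, and it assumes the replacement is ASCII-only or already holds bytes that are valid in the input's encoding, since force_encoding relabels bytes without converting them.

```ruby
# coding: UTF-8

# Hypothetical helper: relabels the replacement with the input's encoding
# and uses the block form of gsub, so the replacement is inserted as-is.
def gsub_keeping_encoding(input, pattern, replacement)
  same_encoding = replacement.dup.force_encoding(input.encoding)
  input.gsub(pattern) { same_encoding }
end

input = "hello wörld"
ascii_replacement = "j".encode("US-ASCII")

# The match is at the very start of the string, the case described above.
result = gsub_keeping_encoding(input, /\Ah/, ascii_replacement)
puts result
puts result.encoding == input.encoding
```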
IO objects have two encodings: the external_encoding, which is how the data is stored outside Ruby (on disk for files, the stream or packet encoding for sockets), and the internal_encoding, which causes Ruby to transcode the content if necessary. There's no provision for guessing the encoding of a document, so you'll need to know it ahead of time.
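A rough illustration of the two encodings, assuming a file whose bytes are known to be ISO-8859-1: the "r:EXTERNAL:INTERNAL" mode string tells Ruby what is on disk and what to hand back. The read_transcoded helper and the temporary file are hypothetical, made up for this example.

```ruby
# coding: UTF-8
require "tempfile"

# Hypothetical helper: opens a file with explicit external and internal
# encodings and reports both along with the transcoded content.
def read_transcoded(path, external, internal)
  File.open(path, "r:#{external}:#{internal}") do |io|
    {
      :external => io.external_encoding,
      :internal => io.internal_encoding,
      :text     => io.read
    }
  end
end

Tempfile.open("latin1") do |file|
  file.binmode
  file.write([0x63, 0x61, 0x66, 0xE9].pack("C*")) # "café" as ISO-8859-1 bytes
  file.flush

  result = read_transcoded(file.path, "ISO-8859-1", "UTF-8")
  puts result[:external]
  puts result[:internal]
  puts result[:text]
end
```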
Strings can have an encoding, but the bytes may not be valid in that encoding; use String#valid_encoding? to verify. To transcode, use String#encode. I have an example of using String#encode in my From Iconv#iconv to String#encode article, and there are more in JEG's and runpaint's articles linked above.
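Put together, checking and transcoding might look like the sketch below. The to_utf8 helper is a hypothetical name, and it assumes you already know the encoding the bytes arrived in, since Ruby will not guess it for you.

```ruby
# Hypothetical helper: labels raw bytes with their known encoding, verifies
# them with String#valid_encoding?, then transcodes with String#encode.
def to_utf8(bytes, source_encoding)
  text = bytes.dup.force_encoding(source_encoding)
  unless text.valid_encoding?
    raise ArgumentError, "bytes are not valid #{source_encoding}"
  end
  text.encode("UTF-8")
end

latin1_bytes = [0x63, 0x61, 0x66, 0xE9].pack("C*") # "café" in ISO-8859-1
puts to_utf8(latin1_bytes, "ISO-8859-1")

# The same bytes are not valid UTF-8, so the check rejects them.
begin
  to_utf8(latin1_bytes, "UTF-8")
rescue ArgumentError => e
  puts e.message
end
```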