The other day, someone brought up a UTF-8-related issue with RiCal.
RFC 2445 specifies that each line of an iCalendar data stream should be no longer than 75 bytes (octets), not counting the line break. Longer lines need to be folded: the line is broken into sections, and the second and following sections go on their own lines, each starting with a space character that marks it as a continuation line. As was pointed out to me, simply breaking a UTF-8 string at a byte offset in Ruby runs the risk of splitting a multi-byte character.
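To make the risk concrete, here is a minimal sketch of what a byte-offset cut does to a multi-byte character. It assumes Ruby 1.9.3 or later (for String#byteslice and String#valid_encoding?), which is not necessarily the Ruby version this post was written against:

```ruby
# Assumes Ruby 1.9.3+, where strings carry an encoding.
word = "Caf\u00e9"            # "Café": 4 characters, 5 bytes in UTF-8

head = word.byteslice(0, 4)   # cut after the 4th byte, inside the 2-byte "é"
tail = word.byteslice(4..-1)  # the single byte left over

puts word.bytesize            # 5
puts head.valid_encoding?     # false: head ends with a dangling lead byte
puts tail.valid_encoding?     # false: tail starts with a continuation byte
```

Neither half is valid UTF-8 on its own, which is exactly what a folded line must avoid.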
Here's a spec to show what I needed:
describe "String#utf8_safe_split" do
  context "For an all-ascii string" do
    before(:each) do
      @it = "abcdef"
    end
    it "should properly split an ascii string when n leaves 1 character" do
      @it.utf8_safe_split(5).should == ["abcde", "f"]
    end
    it "should return a nil remainder if the string has less than n characters" do
      @it.utf8_safe_split(7).should == ["abcdef", nil]
    end
    it "should return a nil remainder if the string has exactly n characters" do
      @it.utf8_safe_split(6).should == ["abcdef", nil]
    end
  end
  context "For a string containing a 2-byte UTF-8 character" do
    before(:each) do
      @it = "Café"
    end
    it "should split properly just before the 2-byte character" do
      @it.utf8_safe_split(3).should == ["Caf", "é"]
    end
    it "should split before when n is at the start of the 2-byte character" do
      @it.utf8_safe_split(4).should == ["Caf", "é"]
    end
    it "should split after when n is at the second byte of a 2-byte character" do
      @it.utf8_safe_split(5).should == ["Café", nil]
    end
  end
  context "For a string containing a 3-byte UTF-8 character" do
    before(:each) do
      @it = "Prix €200"
    end
    it "should split properly just before the 3-byte character" do
      @it.utf8_safe_split(5).should == ["Prix ", "€200"]
    end
    it "should split before when n is at the start of the 3-byte character" do
      @it.utf8_safe_split(6).should == ["Prix ", "€200"]
    end
    it "should split before when n is at the second byte of a 3-byte character" do
      @it.utf8_safe_split(7).should == ["Prix ", "€200"]
    end
    it "should split after when n is at the third byte of a 3-byte character" do
      @it.utf8_safe_split(8).should == ["Prix €", "200"]
    end
  end
end
To fix this I came up with a pretty simple idea: split the string and check whether the second part is valid UTF-8, backing up one byte at a time until it is:
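class String
  # nil when the string begins with a malformed UTF-8 sequence, truthy otherwise.
  # unpack("U") raises ArgumentError on malformed UTF-8; the rescue turns that into nil.
  def valid_utf8?
    unpack("U") rescue nil
  end
  # n is a byte count: this relies on String#length and String#[] working on
  # bytes, as they do in Ruby 1.8.
  def utf8_safe_split(n)
    if length <= n
      [self, nil]
    else
      before = self[0, n]
      after = self[n..-1]
      # Back up one byte at a time until the remainder starts on a character boundary.
      until after.valid_utf8?
        n = n - 1
        before = self[0, n]
        after = self[n..-1]
      end
      [before, after.empty? ? nil : after]
    end
  end
end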
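On Ruby 1.9 and later, String#length and String#[] count characters rather than bytes, so the code above no longer splits at byte offsets there. A rough equivalent for those versions might look like the sketch below. The standalone method name utf8_safe_byte_split is hypothetical and not part of RiCal, and the sketch assumes valid UTF-8 input and Ruby 1.9.3+ (for String#byteslice):

```ruby
# Hypothetical helper, not part of RiCal. Splits str so that the first part
# holds at most n bytes and no multi-byte character is cut in half.
# Assumes str is valid UTF-8.
def utf8_safe_byte_split(str, n)
  return [str, nil] if str.bytesize <= n

  # Same idea as above: back up a byte at a time until the remainder is valid.
  n -= 1 until n.zero? || str.byteslice(n..-1).valid_encoding?

  [str.byteslice(0, n), str.byteslice(n..-1)]
end

p utf8_safe_byte_split("Caf\u00e9", 4)        # cut would land inside "é"
p utf8_safe_byte_split("Prix \u20AC200", 8)   # cut lands right after "€"
```

For valid input the loop runs at most three times, since a UTF-8 character has at most three continuation bytes.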
In RiCal, I actually implemented this using functional methods in another object, since I didn't want to 'pollute' String's instance methods, but the code here illustrates the basic idea.
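For illustration only, here is one way such a non-polluting version could be tied back to line folding. The LineFolding module and its method names are hypothetical and are not RiCal's actual code. The sketch assumes valid UTF-8 input and Ruby 1.9.3+, and it finds character boundaries by testing for continuation bytes (bit pattern 10xxxxxx) rather than by the unpack-based check shown earlier:

```ruby
# Hypothetical module, not RiCal's actual implementation.
module LineFolding
  module_function

  # Largest offset <= limit that does not fall inside a multi-byte character.
  def safe_cut(str, limit)
    cut = limit
    cut -= 1 while cut > 0 && (str.getbyte(cut) & 0xC0) == 0x80
    cut
  end

  # Folds one content line into physical lines of at most `limit` bytes each
  # (not counting the CRLF); continuation lines start with a single space.
  def fold(line, limit = 75)
    # A UTF-8 character can take 4 bytes, plus 1 for the leading space.
    raise ArgumentError, "limit too small" if limit < 5

    pieces = []
    rest = line
    room = limit
    while rest.bytesize > room
      cut = safe_cut(rest, room)
      pieces << rest.byteslice(0, cut)
      rest = rest.byteslice(cut..-1)
      room = limit - 1 # the leading space uses one byte of every later line
    end
    pieces << rest
    pieces.join("\r\n ")
  end
end

folded = LineFolding.fold("DESCRIPTION:" + "\u00e9" * 50)
folded.split("\r\n").each do |physical_line|
  puts "#{physical_line.bytesize} bytes, valid: #{physical_line.valid_encoding?}"
end
```

Removing every CRLF-plus-space pair from the folded text gives back the original line, which is how a reader of the data stream would unfold it.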