The Artima Developer Community

Ruby Buzz Forum
Safely dividing a UTF-8 String in Ruby

Rick DeNatale

Rick DeNatale is a consultant with over three decades of experience in OO technology.
Safely dividing a UTF-8 String in Ruby Posted: May 28, 2009 10:18 AM

This post originated from an RSS feed registered with Ruby Buzz by Rick DeNatale.
Original Post: Safely dividing a UTF-8 String in Ruby
Feed Title: Talk Like A Duck
Feed URL: http://talklikeaduck.denhaven2.com/articles.atom
Feed Description: Musings on Ruby, Rails, and other topics by an experienced object technologist.

The other day, someone brought up a UTF-8 related issue with RiCal.

RFC 2445 specifies that each line of an iCalendar data stream should be no longer than 75 bytes, excluding the line break. Longer lines need to be folded: the line is broken into sections, and the second and following sections go on their own lines, each starting with a space character to mark it as a continuation line. As was pointed out to me, simply breaking a UTF-8 string at a byte offset in Ruby runs the risk of splitting a multi-byte character.
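
To make the risk concrete, here is a minimal sketch, assuming Ruby 2.0 or later (UTF-8 source by default, with String#byteslice and String#valid_encoding? available). The helper name clean_cut? is hypothetical and only serves the illustration.

```ruby
# Hypothetical helper: true when cutting text after n bytes
# does not land inside a multi-byte character.
def clean_cut?(text, n)
  text.byteslice(0, n).valid_encoding?
end

text = "Café"          # "é" is two bytes in UTF-8 (0xC3 0xA9)
text.length            # => 4 (characters)
text.bytesize          # => 5 (bytes)
clean_cut?(text, 3)    # => true, the cut falls between "f" and "é"
clean_cut?(text, 4)    # => false, the cut falls inside "é"
```

A 75-byte limit therefore cannot be applied by counting characters, and a cut at a fixed byte offset has to be checked before it is used.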

Here's a spec to show what I needed:

describe "String#utf8_safe_split" do
  context "For an all-ascii string" do
    before(:each) do
      @it = "abcdef"
    end

    it "should properly split an ascii string when n leaves 1 character" do
      @it.utf8_safe_split(5).should == ["abcde", "f"]
    end

    it "should return a nil remainder if the string has less than n characters" do
      @it.utf8_safe_split(7).should == ["abcdef", nil]
    end

    it "should return a nil remainder if the string has exactly n characters" do
      @it.utf8_safe_split(6).should == ["abcdef", nil]
    end
  end

  context "For a string containing a 2-byte UTF-8 character" do
    before(:each) do
      @it = "Café"
    end


    it "should split properly just before the 2-byte character" do
      @it.utf8_safe_split(3).should == ["Caf", "é"]
    end

    it "should split before when n is at the start of the 2-byte character" do
      @it.utf8_safe_split(4).should == ["Caf", "é"]
    end

    it "should split after when n is at the second byte of a 2-byte character" do
      @it.utf8_safe_split(5).should == ["Café", nil]
    end
  end

  context "For a string containing a 3-byte UTF-8 character" do
    before(:each) do
      @it = "Prix €200"
    end


    it "should split properly just before the 3-byte character" do
      @it.utf8_safe_split(5).should == ["Prix ", "€200"]
    end

    it "should split before when n is at the start of the 3-byte character" do
      @it.utf8_safe_split(6).should == ["Prix ", "€200"]
    end

    it "should split before when n is at the second byte of a 3-byte character" do
      @it.utf8_safe_split(7).should == ["Prix ", "€200"]
    end

    it "should split after when n is at the third byte of a 3-byte character" do
      @it.utf8_safe_split(8).should == ["Prix €", "200"]
    end
  end

end

To fix this, I came up with a pretty simple idea: split the string at byte n and check whether the second part starts with valid UTF-8. If it doesn't, back up one byte and try again:

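class String
  # Truthy if the string starts with a well-formed UTF-8 character.
  # unpack("U") raises ArgumentError on a malformed sequence, such as
  # a string that begins with a continuation byte.
  def valid_utf8?
    unpack("U") rescue nil
  end

  # Split into [before, after] with before at most n bytes long, backing
  # up as needed so that no multi-byte character is cut in two.
  # Assumes byte-indexed strings (length and [] count bytes), as in Ruby 1.8.
  def utf8_safe_split(n)
    if length <= n
      [self, nil]
    else
      before = self[0, n]
      after = self[n..-1]
      until after.valid_utf8?
        n -= 1
        before = self[0, n]
        after = self[n..-1]
      end
      [before, after.empty? ? nil : after]
    end
  end
end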

In RiCal, I actually implemented this using functional methods in another object, since I didn't want to 'pollute' String's instance methods, but the code here illustrates the basic idea.
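
In Ruby 1.9 and later, String#length and String#[] count characters rather than bytes, so the same idea needs the byte-oriented methods bytesize and byteslice, with valid_encoding? standing in for the unpack("U") check. The following is a rough sketch under those assumptions, not RiCal's actual code: the module name Utf8Split and the fold helper are hypothetical, the input is assumed to be valid UTF-8, and fold assumes that the leading space of a continuation line counts toward the 75-byte limit.

```ruby
# A sketch for Ruby 2.0 or later; Utf8Split and fold are hypothetical names.
module Utf8Split
  module_function

  # Returns [before, after], where before is at most n bytes long and
  # no multi-byte character is cut. after is nil when nothing remains.
  def utf8_safe_split(str, n)
    return [str, nil] if str.bytesize <= n

    # Back up until the remainder starts on a character boundary.
    n -= 1 until n.zero? || str.byteslice(n..-1).valid_encoding?
    [str.byteslice(0, n), str.byteslice(n..-1)]
  end

  # Folds a line into chunks of at most limit bytes, joined by CRLF,
  # with a leading space on each continuation line. Assumes the space
  # counts toward the limit and that limit is at least 5, so every
  # chunk can hold at least one character.
  def fold(line, limit = 75)
    first, rest = utf8_safe_split(line, limit)
    folded = [first]
    while rest
      chunk, rest = utf8_safe_split(rest, limit - 1)
      folded << " " + chunk
    end
    folded.join("\r\n")
  end
end

Utf8Split.utf8_safe_split("Café", 4)       # => ["Caf", "é"]
Utf8Split.utf8_safe_split("Prix €200", 8)  # => ["Prix €", "200"]
```

Keeping the methods in a module rather than reopening String leaves String's instance methods untouched, at the cost of passing the string as an explicit argument.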


Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use