The Artima Developer Community
Sponsored Link

Ruby Buzz Forum
sort_by{rand}'s bias makes BigDecimal cry

0 replies on 1 page.

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a threaded view of the topic  Threaded View   
Previous Topic   Next Topic
Flat View: This topic has 0 replies on 1 page
Eigen Class

Posts: 358
Nickname: eigenclass
Registered: Oct, 2005

Eigenclass is a hardcore Ruby blog.
sort_by{rand}'s bias makes BigDecimal cry Posted: Dec 5, 2005 1:23 PM
Reply to this message Reply

This post originated from an RSS feed registered with Ruby Buzz by Eigen Class.
Original Post: sort_by{rand}'s bias makes BigDecimal cry
Feed Title: Eigenclass
Feed URL: http://feeds.feedburner.com/eigenclass
Feed Description: Ruby stuff --- trying to stay away from triviality.
Latest Ruby Buzz Posts
Latest Ruby Buzz Posts by Eigen Class
Latest Posts From Eigenclass

Advertisement

sort_by{ rand } is a common idiom to shuffle an array in Ruby. It's short, looks good and is fairly efficient (despite being O ( n roman log n )) since Array#sort_by is done at C speed.

Unfortunately, it is biased: if the same rand() value is returned for two different elements, their relative order will be preserved.

%w[a b c d].sort_by{ 0 }                           # => ["a", "b", "c", "d"]
i = 0
%w[a b c d].sort_by{ 10 - (i += 1) / 2 }           # => ["d", "b", "c", "a"]

This means that permutations preserving the relative order of one (or more) pair of elements of the original array are a bit more probable.

How often will that happen? The Mersenne Twister PRNG used by Ruby has been proven to have a period of 2 sup 19937 - 1, so it would seem at first sight that the answer is "not before the sun turns into a nova". But Kernel#rand does not return the full state of the PRNG: you only get 53 bits of pseudo-randomness out of a call to rand(). This is becoming more interesting.

The birthday paradox

The birthday paradox states that given 23 people in a room, the odds are over 50% that two of them will have the same birthday. It is not actually a paradox (there's no self-contradiction), it merely defies common intuition.

It turns out that the problem with sort_by{ rand } is another instance of the birthday paradox. In this case, we have a room with N people (N being the size of the array to shuffle) and there are 2 sup 53 "days in a year".

Let's try to reproduce the 23-50% result, and apply the method to the other one.

There are

roman A(365,m) = { 365! } over {left ( 365 - m right ) ! }
ways to assign a birthday to m people without any repetition. Therefore, the probability that at least two people share a birthday is
1 - {roman A(365,m)} over { 365 sup m }
In Ruby terms, we first have to define factorial:

class Integer
  def fact0
    case self
    when 0..1
      1
    else 
      self * (self - 1).fact0
    end
  end
end  

This is fairly slow and will typically fail for 6000! or so, depending on the stack size. Stirling's approximation proves helpful here: we'll use

n! approx sqrt{ 2 * pi } n sup {n + 1 over 2} e sup{-n}

class Integer  
  def fact
    Math.sqrt(2*Math::PI*self) * Math.exp(-self) * Math.exp(self * Math.log(self))
  end
end

It compares rather well to the recursive factorial:

%w[fact0 fact].each do |m|
  t = Time.new
  eval <<-EOF
    200.times{|i| (i+1).#{m}}
  EOF
  "#{m}: #{Time.new - t}"     # => "fact0: 0.125991", "fact: 0.001527"
end

Time to define A(n,m):

def A(n,m)
  n.fact / (n-m).fact
end

def C(n,m)
  A(n,m) / m.fact
end

Given that definition, the well-known 23-50% figures should be easy to reproduce:

(21..23).each do |i|
  [i, "%5.3f" % (1 - A(365,i)/(365**i))] # => [21, "  nan"], [22, "  nan"], [23, "  nan"]
end

Something's wrong, and the culprit is not hard to locate:

A(365, 20)                                         # => NaN

BigDecimal to the rescue

Ruby's standard library includes an extension for arbitrary precision floats, BigDecimal, whose very name suggests that we can think of it as Bignum's cousin. BigDecimal objects have to be initialized with a String, but that's not much of a problem

require 'bigdecimal'
require 'bigdecimal/math'

BigDecimal.limit(100)
class Integer
  BigMath = Class.new{ extend ::BigMath }
  def fact(prec = 10) # !> method redefined; discarding old fact
    BigDecimal.new((Math.sqrt(2*Math::PI*self) * Math.exp(-self)).to_s) *
    BigMath.exp(BigDecimal.new(self.to_s) * BigMath.log(BigDecimal.new(self.to_s), prec), prec)
  end
end

Let's go back to the birthday problem:

(21..23).each do |i|
  [i, "%5.3f" % (1 - A(365,i)/(365**i))]  # => [21, "0.444"], [22, "0.476"], [23, "0.507"]
end

Good.


Read more...

Read: sort_by{rand}'s bias makes BigDecimal cry

Topic: Portland a hub for Open Source? Previous Topic   Next Topic Topic: MySQL Ruby Forum

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use