This post originated from an RSS feed registered with Scala Buzz
by Eric Torreborre.
Original Post: Edit distance in Scala
Feed Title: A++ [Eric Torreborre's Blog]
Feed URL: http://etorreborre.blogspot.com/feeds/posts/default/-/scala
Feed Description: This blog is "Software on my mind" and I try to write one post a month. As Scala is a lot on my mind recently, you can expect more Scala posts quite soon!
I was just looking for a way to highlight differences between 2 strings for my Behavior-Driven Development library, specs. And reading the excellent Algorithm Design Manual. Intuitively, I was thinking that there was a better way to show string differences than the one used in jUnit:
Expected kit<t&gt;en but was kit<ch&gt;en Expected <skate&gt; but was <kite&gt;
The first message shows that <t&gt; has been replaced by <ch&gt;, while the second message says that everything is different! There must be a way to find the minimal set of operations necessary to transform one string to another, right?
The Levenshtein distance
There is indeed. The "Edit" distance, also called "Levenshtein" distance, computes exactly this, the minimal number of insertions, deletions and substitutions required to transform "World" into "Peace" (5, they're very far apart,...).
The computation is usually done by:
examining the first 2 or last 2 letters of each string
assuming the eventual transformation which may occur for those 2 letters:
they're equal: "h..." and "h...." --> distance = 0
the first one has been removed: "ah..." and "h..." --> distance = 1
the second one has been inserted: "h..." and "ah..." --> distance = 1
they've been substituted: "ha..." and "ga..." --> distance = 1
3. saying that the final distance is the minimum of the distance when choosing one the
4 possibilities + the distance resulting from the consequences of that choice
You can notice that many variations are possible: the distance could be different depending on the letters being added or removed, there could be more allowed operations such as the swapping of 2 letters (distance 1 instead of 2 substitutions of distance 2),...
Then the final result can either be constructed incrementally or recursively. Incrementally, for "skate" and "kite" you will compute costs from any small prefix to another prefix:
"s" to "k" --> distance 1
"s" to "ki" --> distance 2
"sk" to "ki" --> distance 2
"ska" to "kit" --> distance 3
Then you increase the prefix size and you reuse the distance numbers found for smaller prefixes. The different results can be stored in the following matrix:
s k a t e
k 1 1 2 3 4 i 2 2 2 3 4 t 3 3 3 2 3 e 4 4 4 4 2
Recursively, you do the inverse and you establish that the distance between 2 strings can be computed from knowing the distance between smaller prefixes and you travel the matrix to its upper left corner.
This is a very good example of Dynamic Programming, where the optimal quantity (the distance at [i, j] in the matrix) you're looking for only depends on:
the optimal quantity for a subset of the data (the distance at [i-1, j-1], [i-1, j], [i, j-1])
the resulting state (the substrings defined by a position [i-x, j-x] in the matrix)
not how you got to that resulting state
The EditDistance trait
For the record, I will write here the code I'm using for the implementation, which is constructing the solution incrementally, as opposed to the recursive solution presented on Tony's blog, here.
The few interesting things to notice about this code are:
the algorithm itself ("initializing the matrix"), which is very straightforward
the minimum function which is a bit strange because it is not comparing all alternatives. The cost for an insertion is only computed if the cost for a suppression is not better than the cost for a substitution. I would like to work out a formal proof on why this is correct by my intuition tells me that a substitution is generally a "good" operation. It has the same cost as an insertion or a suppression but processes 2 letters at the time. So if we find a better cost using a suppression, it is going to be the best
the matrix traversal to retrieve one of the possible paths showing the letters which needs to be transformed for each string: not so fun code with a lot of corner cases
the "separators" feature which allows to select the separators of your choice to display the string differences
trait EditDistance { /** * Class encapsulating the functions related to the edit distance of 2 strings */ case class EditMatrix(s1: String, s2: String) { /* matrix containing the edit distance for any prefix of s1 and s2: matrix(i)(j) = edit distance(s1[0..i], s[0..j])*/ val matrix = new Array[Array[int]](s1.length + 1, s2.length + 1)
/* initializing the matrix */ for (i <- 0 to s1.length; j <- 0 to s2.length) { if (i == 0) matrix(i)(j) = j // j insertions else if (j == 0) matrix(i)(j) = i // i suppressions else matrix(i)(j) = min(matrix(i - 1)(j) + 1, // suppression matrix(i - 1)(j - 1) + (if (s1(i - 1) == s2(j - 1)) 0 else 1), // substitution matrix(i)(j - 1) + 1) // insertion
} /** @return the edit distance between 2 strings */ def distance = matrix(s1.length)(s2.length)
/** prints the edit matrix of 2 strings */ def print = { for (i <- 0 to s1.length) { def row = for (j <- 0 to s2.length) yield matrix(i)(j) println(row.mkString("|")) } this }
/** @return a (String, String) displaying the differences between each input strings. The used separators are parenthesis: '(' and ')'*/ def showDistance: (String, String) = showDistance("()")
/** * @param sep separators used to hightlight differences. If sep is empty, * then no separator is used. If sep contains one character, it is * taken as the unique separator. If sep contains 2 or more characters, * the first 2 characters are taken as opening separator and closing * separator. * * @return a (String, String) displaying the differences between each * input strings. The used separators are specified by the caller. */ def showDistance(sep: String) = { val (firstSeparator, secondSeparator) = separators(sep) def modify(s: String, c: Char): String = modifyString(s, c.toString) def modifyString(s: String, mod: String): String = { (firstSeparator + mod + secondSeparator + s). removeAll(secondSeparator + firstSeparator) } def findOperations(dist: Int, i: Int, j:Int, s1mod: String, s2mod: String): (String, String) = { if (i == 0 && j == 0) { ("", "") } else if (i == 1 && j == 1) { if (dist == 0) (s1(0) + s1mod, s2(0) + s2mod) else (modify(s1mod, s1(0)), modify(s2mod, s2(0))) } else if (j < 1) (modifyString(s1mod, s1.slice(0, i)), s2mod) else if (i < 1) (s1mod, modifyString(s2mod, s2.slice(0, j))) else { val (suppr, subst, ins) = (matrix(i - 1)(j), matrix(i - 1)(j - 1), matrix(i)(j - 1)) if (suppr < subst) findOperations(suppr, i - 1, j, modify(s1mod, s1(i - 1)), s2mod) else if (ins < subst) findOperations(ins, i, j - 1, s1mod, modify(s2mod, s2(j - 1))) else if (subst < dist) findOperations(subst, i - 1, j - 1, modify(s1mod, s1(i - 1)), modify(s2mod, s2(j - 1))) else findOperations(subst, i - 1, j - 1, s1(i - 1) + s1mod, s2(j - 1) + s2mod) } } findOperations(distance, s1.length, s2.length, "", "") } def min(suppr: Int, subst: Int, ins: =>Int) = { if(suppr < subst) suppr else if (ins < subst) ins else subst } } def editDistance(s1: String, s2: String): Int = EditMatrix(s1, s2).distance def showMatrix(s1: String, s2: String) = EditMatrix(s1, s2).print def showDistance(s1: String, s2: String) = EditMatrix(s1, s2).showDistance def showDistance(s1: String, s2: String, sep: String) = EditMatrix(s1, s2).showDistance(sep)