This post originated from an RSS feed registered with Scala Buzz
by Eric Torreborre.
Original Post: My first ICFP contest, Part 2: parser combinators are so cool
Feed Title: A++ [Eric Torreborre's Blog]
Feed URL: http://etorreborre.blogspot.com/feeds/posts/default/-/scala
Feed Description: This blog is "Software on my mind" and I try to write one post a month. As Scala is a lot on my mind recently, you can expect more Scala posts quite soon!
I guess that we need to make some mistakes (at least) once oneself before getting it!
However, the biggest lesson for me in this contest, was the use of Parser Combinators.
Do you speak Martian?
The second programming task in the contest, after setting up the network connection to be able to receive messages and send back commands, was to parse the input messages giving the position of the Martian rover, and its immediate environment.
initialization messages -> what's the map like? what are you technical constants?
telemetry stream -> where are you, what's your speed, what's around?
adverse event -> crashed in a crater!
success -> yes, home!
end of run -> time and status
The format of those messages is not particularly complex but the telemetry message can contain an unlimited number of obstacles descriptions for example.
For many developers, the most obvious way to transforms meaningless strings to meaningful objects like Telemetry or Martian, would be to start traversing the string, check for characters like 'T' for Telemetry, or 'I' for Initialization then pass the rest of the string to another function and continue slicing the string until all data is analyzed.
Parser combinators
But Haskell developers know better,... They can just describe the grammar of the messages they're parsing by combining parsers, then applying the resulting parser to the string input.
Something like that: telemetry :: Parser Message telemetry = do ts <- int state <- accelStateParser turning <- turningStateParser spaces x <- double y <- double dir <- double speed <- double obstacles <- obstaclesParser messageEnd return $ Telemetry ts state turning x y dir speed obstacles
hardleft = stateParser 'L' HardLeft softleft = stateParser 'l' SoftLeft hardright = stateParser 'R' HardRight softright = stateParser 'r' SoftRight straight = stateParser '-' Straight obstaclesParser :: Parser [Object] obstaclesParser = do l <- many obstacleParser return l
obstacleParser :: Parser Object obstacleParser = do code <- oneOf "bchm" spaces case code of 'b' -> object Boulder 'c' -> object Crater 'h' -> object Home 'm' -> martian _ -> error "Can't get here in parser!"
object :: (Circle -> a) -> Parser a object cons = do x <- double y <- double r <- double spaces return $ cons (Circle (x:.y) r)
martian :: Parser Object martian = do x <- double y <- double dir <- double speed <- double spaces return $ Martian (Circle (x:.y) 0.4) dir speed
In the above code, you can read that the parser for telemetry messages is a sequence of different parsers: a timestamp parser, a acceleration state parser, a turning state parser, ..., a parser for lists of obstacles. And a parser for list of obstacles is just "many" parsers for one obstacle. This is indeed as easy as being able to describe a grammar for that mini-language.
"k" should better be 1
But I didn't say that describing a grammar was easy! For instance, at first, we made a mistake in the way we managed spaces. We had almost each parser trying to consume spaces before doing its job. This generated unnecessary ambiguity in the parsing.
For example, let's say that you want to be able to parse this kind of message: " O T O O T"
(honestly, I don't know why you would like to do something like that but you're the boss here)
You'd better combine those 3 parsers:
spaces -> will consume any number of spaces
oneParser = char 'O' >>= spaces -> will consume 'O' followed by any number of spaces
twoParser = char 'T' >>= spaces -> will consume 'T' followed by any number of spaces
If you define parsers differently, having "oneParser" and "twoParser" consuming spaces before they parse their character, which parser should be used on the first space of the message?
Actually, parser combinators in Haskell are also capable of backtracking and "unparsing" results if the chosen parser was not appropriate (trying things with the the "try" parser combinator). But why add more complexity when things can be described more directly?
Of course, there is a formal theory behind the example above: LL(k) grammars, where LL(1) is a type of grammar only needing to look at the next character to know which parser to use. But there's nothing like a simple "real-life" example to be able to grasp it.
Compare and contrast
Now, honestly, most other contest solutions were not using Parser combinators and the aren't soooo ugly (in Dylan for example). I think they're just less readable and more error-prone because of index manipulation.
For fun and comparison, I also developed a similar parser in Scala: def telemetryMessage = for { timestamp <- intNumber accState <- accelState turnState <- turningState vehicleX <- doubleNumber vehicleY <- doubleNumber vehicleDir <- doubleNumber vehicleSpeed <- doubleNumber obstacles <- objects } yield Telemetry(timestamp, accState(), turnState(), vehicleX, vehicleY, vehicleDir, vehicleSpeed, obstacles)
def accelState = ( "a" ^^^ Accelerating | "b" ^^^ Braking | "-" ^^^ Rolling )
def circleParser = for { x <- doubleNumber y <- doubleNumber r <- doubleNumber } yield new Circle(x, y, r)
def martianParser = for { circle <- circleParser dir <- doubleNumber speed <- doubleNumber } yield new Martian(circle, dir, speed)
The interesting things in the Scala implementation, although it could seem a bit ugly at first sight, are:
the use of a sequence operator "~>" getting rid of what has already been parsed. Indeed, when we have parsed "b " we know we need to create a Boulder object. The 'b' character and the space are then not usefull anymore
the use of functions to transform the parsed string into a meaningful object, like "l" ^^^ SoftLeft, creating a new SoftLeft object (it's a case class) when parsing "l"
implicit definitions. In the example above I don't need to declare my parser for 'l' as letter("l"). Combining "l" with another parser automatically creates a parser for the character 'l'
Why XML?
One of the reasons for the ubiquitous adoption (and hate!) of XML is, I think, the general availability of parsers and binding libraries. This is why this often becomes the easiest option for configuration files (is it changing with YAML and JSON?).
It doesn't have to be that way. I hope that parser combinators can now bring another legitimate option when considering the implementation of an external mini-language for configuration or protocol. They're simple to use and readily available on your Java platform through Scala. Try them at home!