Ragel rocks (part 1 of 2)

Zed Shaw is spot-on: “The Ragel State Machine Compiler is one kick-ass piece of software.

As part of Kinetic development, I need a way to parse several custom-designed languages, each with its custom grammar. The term popularized by Martin Fowler is DSLs, short for “domain-specific languages.”


http://www.flickr.com/photos/dilaudid/278649026/

In the past, when faced with this task of parsing DSLs, I took one of two approaches:

Approach #1: For parsing complex DSLs, the typical starting point, and the one taught in college compiler design courses, is to first define the language’s custom syntax via a comprehensive set of BNF grammar rules.

In junior high, I was one of those kids who actually enjoyed diagramming sentences. Well, defining BNF grammars for DSLs requires a similar masochistic mindset. Parser generators such as ANTLR allow these BNF grammars to be defined, and custom logic to be executed when the constituent words/tokens are “found” in the input text.

I’m over-simplifying the process of course, as any CS major who wasted away Memorial Day weekend staring at lex/yacc debugging messages can attest. And that is in fact the primary drawback of this approach of using parser generators: it’s usually too heavy-handed unless you really want to design a fairly rich custom DSL.

It’s no surprise that parsers and compilers are written by a relatively-tiny and arcane community of computer engineers. That business developer you’re paying $90/hour for? Chances are, he or she has never written a custom DSL parser.

Thankfully, I’ve only had to go down this road twice in my programming career (both for personal projects). I absolutely love the power of this approach–using full-fledged parser generator tools gives me the freedom to define complex DSLs–but the price in terms of software complexity and time is difficult to justify for most projects.

Approach #2: For that reason, I generally opt to keep my DSLs simple. Simpler DSLs means simpler parsing. This allows me to just cobble together straightforward text-parsing routines. These routines rely on basic multi-stage regular expression (regex) matchers and core string-manipulation functions. Nothing fancy, it’s anything any engineer can do.

Yes, it’s simplistic and dirty with the taint of hackiness. But–more importantly, it gets the job done every time.

The ugly catch? Debugging and maintaining these simplistic parsers is no fun. There’s a pearl of programming wisdom:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.”

Step away from several complicated regexes, and try patching that code a few years later. I’ve had to do that on several occasions, and each time, I growled at my former self. I would’ve had more luck trying to decipher the green letters in the Matrix. Add a pair of parentheses to an existing regex, and you risk breaking your back-references (Ruby 1.9′s support of named groups only helps very slightly here). The pain of maintaining regexes and convoluted string-manipulation logic generally leads to parsing code collecting more than its fair share of dust.

Which is why I was thrilled to discover Ragel.

(to be continued tomorrow…)

3 Comments »

  1. Zac Said,

    June 8, 2010 @ 5:09 pm

    I never took the compiler course at my university–though I purchased the books and had signed up one Spring semester before deciding to drop. That said, this whole post puts me in mind of FSAs and state machines and all sorts of stuff that I knew once upon a time, but would be hard pressed to do nowadays.

    These are actually the *interesting* problems to solve, as opposed to most of what we do as professional programmers (i.e. make that button be higher up, etc).

    Long story short, this post just made me realize how stupid I am. Ah well. C’est la vie.

  2. Josh Said,

    June 8, 2010 @ 8:31 pm

    I too never took that class (I don’t understand why it wasn’t required at my school… seems like it should have been).

    Because of that, I have found it very difficult any time I have attempted to use javacc and the like to create parsers.

    Regexes seem quite a bit more useful when that’s all you have :)

  3. Jim Tran Said,

    June 9, 2010 @ 5:33 pm

    I think the vernacular of the field (“finite state automata” and “BNF”) can be a bit off-putting to those who want to dabble in language parsing. There’s a tremendous amount of mathematical rigor behind language parsing theory; unfortunately, the downside of that is that it makes it appear less accessible than it really is. Which is a shame, because as Zac hints at, there are many *interesting* problems where a custom DSL can form the groundwork for a robust solution.

Leave a Comment

Powered by WP Hashcash