The more I play around with it, the more impressed I become with the combination of ANTLR v4 and Groovy. I am learning while I play around with the Grammar and syntax of the language I intend to embed within DEX, AKA, TMI. That’s a lot of acronyms!
There is something that I find extremely empowering about the thought process of designing a language versus simply programming within one. When programming in a target language you ask questions such as “How can I say…<insert task here>”. When designing one you are liberated in that you can also ask “How should I be able to express a solution for…<some task>”.
I am becoming more impressed with ANTLR v4’s ability to unravel left recursive grammars. Let’s explore some.
Exploring TMI
I started by using a subset of ANSI C and borrowed some from Scala’s grammar. The C grammar’s been stable for ages and Martin Ordersky is about a thousand times smarter than I. I am also mixing in a dash of R with some of the convenience of Perl.
I hope that eventually I will get to my core task of Table transformations, however, no language is complete without basic math expressions, conditionals and so on.
Programs and Statements
program : statement* EOF; statement : foreachLoop | '{' statement* '}' | ifStatement | forLoop | whileLoop | builtin ';' | expression (('if'|'unless') expression)? ';';
Programs consist of zero or more statements followed by an end of file.
We’re currently supporting the following statements:
- foreach loops
- block statements
- if statements
- while loops
- builtins such as print and println.
- expressions which are optionally followed by an if or unless clause. I loved this in Perl because some statements read much simpler this way.
foreach loops
I have chosen the unconventional route of using Java as my target language and Groovy as the language which I have written the parse tree visitor.
The visitor nodes return type Object. Groovy magically does the correct thing with them most of the time. I did have to create one special type for Identifiers called TMIIdentifier to indicate when a value requires a secondary lookup within an internal symbol table which is simply a Groovy map.
The gramatical implementation for foreach is simply:
foreachLoop : 'foreach' IDENTIFIER '(' range ')' statement; range : INTEGER_LITERAL ('..' INTEGER_LITERAL)? # intRange | CHARACTER_LITERAL ('..' CHARACTER_LITERAL)? # charRange | range (',' range)+ # rangeUnion;
This will accept phrases such as:
- foreach i (1..10) { } // Simple integer iterator….
- foreach c (‘a’..’z’) {} // Simple character iterator…
- foreach i (1..5,10..15) {} // Union of two iterations…
- foreach (10..1) {} // Downward iteration
- foreach hex (0..9,’A’..’F’) {} // Mixed type iteration
How cool is that?
The visitor code is very simple in Groovy:
public Object visitForeachLoop(ForeachLoopContext ctx) { if (ctx.getChildCount() == 6) { TMIIdentifier id = new TMIIdentifier(ctx.IDENTIFIER().getText()) def range = visit(ctx.getChild(3)) range.each { it -> sym[id.getId()] = it; visit(ctx.getChild(5)) } } return null; }
We check to see if there are 6 children in the context. This should always be the case given this grammar, however, in many contexts this is not the case.
If for some reason, I do not get the expected 6 children, I return null. This is questionable, but for now will suffice. Probably I should create some non-null representation of an empty return type.
If we do get the 6 expected children, we know their types as there are no choices within this alternative. It is of the form:
- ‘foreach’ token – ctx.getChild(0)
- IDENTIFIER – ctx.getChild(1)
- ‘(‘ token – ctx.getChild(2)
- range – ctx.getChild(3)
- ‘)’ token – ctx.getChild(4)
- statement – ctx.getChild(5)
We create a new identifier from IDENTIFIER then define a range by visiting the 4th child.
The visit to the range child returns a list of integers, characters or both. It’s visitors are represented here:
public Object visitIntRange(IntRangeContext ctx) { def range = [] if (ctx.getChildCount() == 1) { int value = new Integer(ctx.getChild(0).getText()) range << value } else if (ctx.getChildCount() == 3) { int lower = new Integer(ctx.getChild(0).getText()) int upper = new Integer(ctx.getChild(2).getText()) range = (lower..upper).collect() } return range } public Object visitCharRange(CharRangeContext ctx) { def range = [] if (ctx.getChildCount() == 1) { Character value = ctx.getChild(0).getText().charAt(1) range << value } else if (ctx.getChildCount() == 3) { Character lower = ctx.getChild(0).getText().charAt(1) Character upper = ctx.getChild(2).getText().charAt(1) range = (lower..upper).collect() } return range }
For each element returned from the range we will:
- Set the identifier referenced by IDENTIFIER in the symbol table: sym[id.getId()] = it;
- Visit the statement child. This is equivalent to executing the statement, which can be a block of statements. We do this via the code: visit(ctx.getChild(5))
In my opinion, that’s an amazing amount of functionality for a meager amount of code. Groovy gurus will likely point out that it can be accomplished in 1/2 the code. They are undoubtedly correct but this is readable to me and my level of expertise and having the misfortune of having to support clever code I once wrote years later, I learned a deep appreciation for readability.
while loops
While loops are even simpler. Gramatically:
whileLoop : 'while' '(' expression ')' statement;
While an expression is true, execute a statement.
The visitor:
public Object visitWhileLoop(WhileLoopContext ctx) { if (ctx.getChildCount() == 5) { def cond = getValue(visit(ctx.getChild(2))) while (cond) { visit(ctx.getChild(4)); cond = getValue(visit(ctx.getChild(2))) } return null; } else { return visitChildren(ctx); } }
If the context contains 5 children, define an internal condition called “cond” which is initialized to the result of evaluating the third child which contains the condition.
While the condition is true, execute the statement held in 5th child and reevaluate the condition. Repeat infinitely or until the condition is no longer true.
Else, we simply return the result of visiting the children. This should never happen given our grammar. This is the default action of the automatically generated nodes. It is also required if your rules cascade precedence downward to other rules. With the visitor pattern, you manually govern the evaluation of nodes.