You begin with this example grammar:
You don’t like this style, and your suggested replacement is:
The suggested alternative is already longer than the original (more lines of code, more tokens). If you try to extend the grammar with a ‘-’ operator, then you add one line to the original construction:

| Expr '-' Factor

But you add many more lines to the alternative construction, because the number of restrictions blows up. You’d need to add:
This quickly becomes unwieldy for a non-trivial grammar.
One can replace any '+' with ('+' | '-').

Actually, the generative expression probably should say Expr ('+' | '*') Expr; it’s enough for it to only enumerate shapes of expressions.

I am not aiming at the shortest expression here, I am aiming at the one which most clearly communicates the intended set. To me, the subtractive formulation feels much more natural, as it directly expresses the local property we want (between + and *, multiplication wins). In the traditional formulation, this local property appears as a consequence of the global shape of the grammar, which to me is not intuitive.
I think there is still a problem.
Let’s define
and then see what happens when we add another level of operator precedence.
The traditional style:
Using this new style:
Using the traditional style, the size of the grammar grows linearly with the number of precedence levels. With this new style, it grows non-linearly:
The new style seems to be impractical for most commonly used programming languages.
I think it’s still linear:

Well, technically in this form it is still quadratic, but you can define aliases like AddOrMul, etc., to make this truly linear.
In the traditional style, we express the fact that (AddOp is left associative, MulOp is left associative, PowOp is right associative) in exactly one place in the code. Precedence relations are strictly ordered, and each ordering relationship (AddOp < MulOp, MulOp < PowOp) is stated in exactly two places, on two adjacent lines, regardless of the number of precedence levels. This makes the code relatively easy to reason about using local reasoning.
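For concreteness, the traditional layered shape being described might look like this (an illustrative sketch using the comment’s operator names, not the post’s exact grammar):

```
Expr   = Expr AddOp Term   | Term    (AddOp, left associative)
Term   = Term MulOp Factor | Factor  (MulOp, left associative)
Factor = Atom PowOp Factor | Atom    (PowOp, right associative)
Atom   = 'number' | '(' Expr ')'
```

Each associativity fact lives in one production, and each ordering relationship (e.g. AddOp < MulOp) is visible only where one layer refers to the next, on two adjacent lines.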
With the new style, this same information is smeared out across many restrictions.

In my version of the grammar, each associativity fact (left or right) is stated once per precedence level. However, ordering relationships between precedence levels are stated N-1 times for each precedence level, assuming there are N precedence levels. In your version of the grammar, it’s the same, even though your code is more compressed.
The problem with stating each precedence relation N-1 times is keeping all those instances in sync, and also reasoning about the code and finding bugs if one of those instances is misstated. It takes a higher cognitive load to convince yourself the code is correct.
So the general form for a simple precedence tree with equal-precedence ops evaluating left to right is:
On the one hand, I wrote that out without trouble and I’m convinced it’s correct, up to typos/editing errors. On the other hand, it feels like a grammar that ought to be generated by code, and would be annoying to review for typos because of the inherent repetitiveness.
Luka Miljak’s master’s thesis (https://repository.tudelft.nl/islandora/object/uuid%3A2d831aa4-df6b-4ab6-983e-9776c710b450) verifies an approach similar to the one suggested at the end of your post, where you have rules to filter out patterns that violate precedence rules. The idea is that such rules should, first, only filter out ambiguous cases, and never make a previously parseable string unparseable. Second, such rules should filter out all ambiguous cases, and never leave multiple parse trees for a given string.
Another approach is to write down a parameterised LR grammar (basically, a program that generates an LR grammar). This is similar to the approach suggested in the beginning of your post, but with a sufficiently high-level way of writing down the LR grammar.
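As a rough sketch of that second approach: a small program that, given a table of precedence levels, emits the productions of a layered grammar. The representation (productions as strings, level names E0, E1, …) is my own invention for illustration:

```python
def generate_grammar(levels, atom="Atom"):
    """Emit layered grammar productions from a precedence table.

    `levels` lists (operators, associativity) from loosest to tightest
    binding; each level i gets a nonterminal Ei that defers to the next,
    tighter level E(i+1).
    """
    names = [f"E{i}" for i in range(len(levels))] + [atom]
    rules = []
    for i, (ops, assoc) in enumerate(levels):
        this, tighter = names[i], names[i + 1]
        for op in ops:
            if assoc == "left":
                rules.append(f"{this} = {this} '{op}' {tighter}")
            else:  # right associative
                rules.append(f"{this} = {tighter} '{op}' {this}")
        rules.append(f"{this} = {tighter}")  # fall through to next level
    return rules

rules = generate_grammar([(["+", "-"], "left"),
                          (["*", "/"], "left"),
                          (["^"], "right")])
```

Note that the output grows linearly with the number of precedence levels, which is the property debated above.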
Thanks, getting a link like this was exactly the purpose of me writing this down!
Have you seen Labelled BNF grammar? Here’s your example in it:
There’s a tool called bnfc which can generate an AST, lexer, parser, pretty printer, and docs from the above in many programming languages.
Yes, cool project. Unfortunately, it doesn’t guarantee that the parser and pretty printer match, i.e. it doesn’t give a round-trip guarantee. Here is the bug I reported about this: https://github.com/BNFC/bnfc/issues/360 .

I wrote a Haskell library which does something similar to bnfc (but restricted to Haskell), and which does give a round-trip guarantee! https://mail.haskell.org/pipermail/haskell-cafe/2021-July/134217.html
This doesn’t actually solve the problem. The problem is characterizing

3 - 2 + 1 - 4 / 2 * 2

The only correct answer to that expression, in code, is to throw an error and have the programmer explicitly state the intent. The cost of a few extra punctuation symbols is less than the expected cost of the errors prevented; ((3 - 2) + (1 - 4)) / (2 * 2) is the correct monster equation. No one gets the intent wrong.

Also, Expr = Expr '-' Factor means lines like a-b would be acceptable, and they should never be. a - b and a + -b are reasonable, but the cost of mistakes from alien identifiers (looking at you, CSS) that use hyphens is too high to justify saving a whitespace.

Unfortunately, these are small-dollar items in the Maintenance Equation, so actual testing to determine the validity generally doesn’t pay for itself.
Thanks, I was waiting for this comment! One cool thing here is that it’s not some hack for precedence levels specifically, but rather seems to be a pretty general thing. So, you can decide on a case-by-case basis whether you want precedence between two operators, or whether a combination should be banned from the grammar.
Yes. It’s not exactly case by case.
The math term “associativity” matters. “A + (B + C) == (A + B) + C” means that “do pairwise additions on this list” gives the compiler or runtime free rein to farm out portions to multiple processors. Also, human brains do fine with “A + B + C” when the operator is the same and associative. Human brains do not do well with “2 ** 3 ** 4”. In practice, humans blow “A OR B AND C” as well.
The fast hack is to always require parentheses between binary operators unless they are the same and associative. You can still use precedence levels in areas where you expect good human comprehension. For example, “A + B AND C + D” usually gets parsed correctly. So, you end up with only a few levels of precedence, like handling unary operations early.
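The “fast hack” in the first sentence can be stated as a tiny predicate. This sketch (the names and the contents of the associative set are mine, for illustration) checks the binary operators appearing at one parenthesization level:

```python
# Operators we choose to treat as associative; a real language would also
# have to decide what to do about floating point, short-circuiting, etc.
ASSOCIATIVE = {"+", "*", "AND", "OR"}

def needs_parens(ops):
    """True if a chain using these binary operators must be parenthesized.

    `ops` is the list of operators at one parenthesis level, e.g. for
    a + b + c it is ['+', '+']. Parentheses are required unless all the
    operators are the same AND that operator is associative.
    """
    distinct = set(ops)
    return len(distinct) > 1 or not distinct <= ASSOCIATIVE
```

Note that under this rule even a - b - c requires parentheses, since subtraction is not associative.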
Confusing code should be incorrect code.
I saw this earlier, and Doug’s issues with it were spot on, but having thought about it a bit longer, it kept reminding me of Pratt parsing. So @matklad you may find the topic of Pratt parsing quite fun if you enjoyed playing with the ideas in your blog.
You mean like this? https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html
Yes indeed! Damned time machines are making me look all stupid now.
The idea that occurred to me is Operator precedence parsing, which is simpler than Pratt parsing, and which efficiently parses an infix expression with many levels of operator precedence, with a very compact specification. You’ll need a different parser technology to handle the non-infix part of your language.
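For readers unfamiliar with it, here is a minimal precedence-climbing variant of operator-precedence parsing (my own sketch; the precedence table and the tuple-based parse trees are illustrative). The whole precedence/associativity specification is the one table:

```python
# Minimal precedence-climbing parser for infix expressions.
# The entire operator specification: (binding power, associativity).
PREC = {"+": (1, "left"), "-": (1, "left"),
        "*": (2, "left"), "/": (2, "left"),
        "^": (3, "right")}

def parse(tokens):
    expr, rest = parse_expr(tokens, 0)
    assert not rest, "trailing tokens"
    return expr

def parse_expr(tokens, min_prec):
    # Parse an expression whose operators all bind at least `min_prec`.
    lhs, tokens = parse_atom(tokens)
    while tokens and tokens[0] in PREC and PREC[tokens[0]][0] >= min_prec:
        op = tokens[0]
        prec, assoc = PREC[op]
        # Left-associative: the recursive call must bind strictly tighter.
        next_min = prec + 1 if assoc == "left" else prec
        rhs, tokens = parse_expr(tokens[1:], next_min)
        lhs = (op, lhs, rhs)
    return lhs, tokens

def parse_atom(tokens):
    if tokens[0] == "(":
        expr, rest = parse_expr(tokens[1:], 0)
        assert rest and rest[0] == ")", "unbalanced parenthesis"
        return expr, rest[1:]
    return tokens[0], tokens[1:]  # a number or identifier
```

Adding a precedence level is one new table entry; nothing else changes.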
Your formalism looks very similar to the formalism used in the EcmaScript spec.
Okay, so this is example from the end of your post:
It can be rewritten so:

Expr =
    'number'
    | '(' Expr ')'
    | Expr '+' Expr /* but right Expr cannot be a sum */
    | Expr '*' Expr /* but right Expr cannot be a product, and both Exprs cannot be a sum */
Then we can parameterize Expr with parameters “Sum” and “Prod”, and we get this (“Sum” means “may be a sum”, and “Prod” means “may be a product”) (in EcmaScript spec notation):
And we got exact notation from EcmaScript spec! It is described here: https://tc39.es/ecma262/#sec-grammatical-parameters
Then we can trivially convert this form into a usual CFG (using the rules from the link above). And then we can construct an LR(1) table to be sure our grammar is unambiguous.
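That conversion step is mechanical: for each on/off combination of the boolean parameters, a parameterized nonterminal expands into one plain nonterminal (and each production expands likewise, keeping only alternatives whose guards are satisfied). A minimal illustration of the nonterminal part, with a naming scheme of my own:

```python
from itertools import product

def expand_nonterminal(name, params):
    """List the plain CFG nonterminals for one parameterized nonterminal.

    Expr with parameters [Sum, Prod] becomes four ordinary nonterminals,
    one per combination of the parameters being set or unset.
    """
    names = []
    for bits in product([False, True], repeat=len(params)):
        suffix = "".join("_" + p for p, on in zip(params, bits) if on)
        names.append(name + suffix)
    return names
```

The expansion is exponential in the number of parameters, but each parameter combination is a plain nonterminal, so standard LR(1) table construction applies afterwards.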
Ask me any questions about grammars. I think I’m an expert. To prove that I’m an expert, let me show you my library: https://mail.haskell.org/pipermail/haskell-cafe/2021-July/134217.html . It is a reversible parser with guaranteed reversibility. It is unique among all programming languages.
bison has a very nice notation. Here is an example from the manual:
See full example in section 2.2 in https://www.gnu.org/software/bison/manual/bison.html
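The flavor of those declarations is roughly this (an illustrative sketch from memory, not the manual’s exact snippet; in bison, operators declared on later lines bind tighter):

```
%left '+' '-'
%left '*' '/'    /* declared later, so binds tighter than + and - */
%right '^'       /* right associative, tightest of the three */
```

Each line states an associativity, and the line order states the precedence, so each fact appears exactly once.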
Bison is able to construct an LR(1) table for this grammar, and thus prove that it is unambiguous (unfortunately, I don’t fully understand what exactly these %left, etc. commands mean).

Note that the order of these commands matters, i.e. we specify both the associativity and the priority of our operators.
lalrpop supports notation similar to bison’s. See the end of this section: https://lalrpop.github.io/lalrpop/tutorial/004_full_expressions.html
I like the notation for infix expressions in Isabelle. It is excellent!
Isabelle uses the following conventions (I give them in simplified form):

- If your grammar has some nonterminal, for example Expr, then Isabelle internally replaces it with 1001 nonterminals Expr0, Expr1, …, Expr1000
- When you use a nonterminal on the left side of a production, Isabelle adds 1000 to it. If you use it on the right side, then Isabelle adds 0. For example, the production Expr = '(' Expr ')' is converted to Expr1000 = '(' Expr0 ')'
- Isabelle adds the chain productions Expr0 = Expr1, Expr1 = Expr2, …, Expr999 = Expr1000
- Expr infix '+' 10 is automatically converted to Expr10 = Expr11 '+' Expr11 (for non-associative operators)
- Expr infixl '+' 10 is automatically converted to Expr10 = Expr10 '+' Expr11 (left associative)
- Expr infixr '+' 10 is automatically converted to Expr10 = Expr11 '+' Expr10 (right associative)
Using these conventions we can write this (I will call this snippet (*)):
Isabelle will automatically translate this grammar to this:
Then we can construct LR(1) table for the grammar, and thus check that it is unambiguous. (Unfortunately, Isabelle doesn’t do this, Isabelle uses Earley parser instead.)
And this notation is fully transparent. I can fully understand it. As opposed to bison’s and lalrpop’s precedence notation.
I simplified the Isabelle notation above. In fact, the (*) snippet looks like this in real Isabelle notation:
You can see this notation in the wild here: https://isabelle.in.tum.de/dist/library/HOL/HOL/HOL.html (search in this document for “infix”, “syntax”, “notation”, “nonterminal”). A full explanation of the rules is in https://isabelle.in.tum.de/dist/Isabelle2024/doc/isar-ref.pdf, sections 8.2 to 8.4.3. But I don’t think you need to read this manual or try to understand real Isabelle notation. I think the information I gave in this comment is enough.
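The desugaring rules in the conventions list above are mechanical enough to sketch in a few lines of Python (my own illustration; Isabelle itself obviously does not work this way internally):

```python
def infix_production(nt, kind, op, prec):
    """Desugar an infix declaration into one grammar production.

    Per the conventions above: infixl keeps the left operand at the same
    level (left recursion), infixr does the same on the right, and plain
    infix bumps both operands one level tighter.
    """
    left = prec if kind == "infixl" else prec + 1
    right = prec if kind == "infixr" else prec + 1
    return f"{nt}{prec} = {nt}{left} '{op}' {nt}{right}"
```

Together with the chain productions Expr0 = Expr1, …, Expr999 = Expr1000, this gives the layered grammar that an LR(1) construction could then check for ambiguity.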
lalrpop’s tier trick may be of interest. Search “tier” on this page: https://lalrpop.github.io/lalrpop/tutorial/006_macros.html