Regular, Recursive, Restricted

9

doug-moen 6 months ago | link

You begin with this example grammar:

Expr =
    Factor
  | Expr '+' Factor

Factor =
    Atom
  | Factor '*' Atom

Atom = 'number' | '(' Expr ')'

You don’t like this style, and your suggested replacement is:

Expr =
    'number'
  | '(' Expr ')'
  | Expr '+' Expr
  | Expr '*' Expr

Expr !=
             Expr '+' [Expr '+' Expr]
|            Expr '*' [Expr '*' Expr]
|            Expr '*' [Expr '+' Expr]
| [Expr '+' Expr] '*' Expr

The suggested alternative is already longer than the original (more lines of code, more tokens). If you try to extend the grammar with a ‘-’ operator, then you add one line to the original construction:

  | Expr '-' Factor

But you add many more lines to the alternative construction, because the number of restrictions blows up. You’d need to add:

|            Expr '-' [Expr '-' Expr]
|            Expr '-' [Expr '+' Expr]
|            Expr '+' [Expr '-' Expr]
|            Expr '*' [Expr '-' Expr]
|           [Expr '-' Expr] '*' Expr

This quickly becomes unwieldy for a non-trivial grammar.

5

matklad 6 months ago | link

One can replace any '+' with ('+' | '-').

Actually, the generative expression probably should say Expr ('+' | '*') Expr, it’s enough for it to only enumerate shapes of expressions.

I am not aiming at the shortest expression here, I am aiming at the one which mostly clearly communicates the intended set. To me, the subtractive formulation feels much more natural, as it directly expresses the local property we want (between + and *, multiplication wins). In the traditional formulation, this local property appears as a consequence of the global shape of the grammar, which for me is not intuitive.
1. 4
  doug-moen 6 months ago | link
  I think there is still a problem.
  
  Let’s define
  
  AddOp = '+' | '-' MulOp = '*' | '/' PowOp = '^'
  
  and then see what happens when we add another level of operator precedence.
  
  The traditional style:
  
  Expr = Factor | Expr AddOp Factor Factor = Power | Factor MulOp Power Power = Atom | Atom PowOp Power Atom = 'number' | '(' Expr ')'
  
  Using this new style:
  
  Expr = Expr (AddOp|MulOp|PowOp) Expr Expr != Expr AddOp [Expr AddOp Expr] | [Expr AddOp Expr] MulOp Expr | [Expr AddOp Expr] PowOp Expr | Expr MulOp [Expr AddOp Expr] | Expr MulOp [Expr MulOp Expr] | [Expr MulOp Expr] PowOp Expr | Expr PowOp [Expr AddOp Expr] | Expr PowOp [Expr MulOp Expr] | [Expr PowOp Expr] PowOp Expr
  
  Using the traditional style, the size of the grammar grows linearly with the number of precedence levels. With this new style, it grows non-linearly:
  
  With 2 levels of operator precedence, we need 2^2 = 4 restrictions.
  
  With 3 levels of operator precedence, we need 3^2 = 9 restrictions.
  
  With 20 levels of operator precedence, we need 20^2 = 400 restrictions.
  
  The new style seems to be impractical for most commonly used programming languages.
  1. 2
    matklad 6 months ago | link
    
    I think its still linear:
    
    Expr != Expr AddOp [Expr AddOp Expr] | [Expr AddOp Expr] MulOp Expr | Expr MulOp [Expr (AddOp|MulOp) Expr] | [Expr (AddOp|MulOp|PowOp) Expr] PowOp Expr | Expr PowOp [Expr (AddOp|MulOp) Expr] |
    
    well, technically in this form it is still quadratic, but you can define aliases like AddOrMul, etc, to make this truly linear.
    1. 5
      
      doug-moen 6 months ago | link
      
      In the traditional style, we express the fact that that (AddOp is left associative, MulOp is left associative, PowOp is right associative) in exactly one place in the code. Precedence relations are strictly ordered, and each ordering relationship (AddOp < MulOp, MulOp < PowOp) is stated in exactly two places, on two adjacent lines, regardless of the number of precedence levels. This makes the code relatively easy to reason about using local reasoning.
      
      With the new style, this same information is smeared out across many restrictions.
      
      In my version of the grammar, each (left,right)-associative fact is stated once per precedence level. However, ordering relationships between precedence levels are stated N-1 times for each precedence level, assuming there are N precedence levels.
      
      In your version of the grammar, it’s the same, even though your code is more compressed.
      
      The problem with stating each precedence relation N-1 times is keeping all those instances in sync, and also reasoning about the code and finding bugs if one of those instances is misstated. It’s higher cognitive code to convince yourself the code is correct.
    2. 3
      
      gpm edited 6 months ago | link
      
      So the general form for a simple precedence tree with equal-precedence ops evaluating left to right is:
      
      P1_OP = AddOp | ... P2_OP = MulOp | ... P3_OP = PowOp | ... ... PN_OP = ... AT_MOST_P1_OP = P1_OP ... AT_MOST_PN_OP = AT_MOST_P[N-1]_OP | PN_OP Expr = Expr PN_OP EXPR Expr != Expr P[i]_OP [Expr AT_MOST_P[i]_OP Expr] | [Expr AT_MOST_P[i-1]_OP Expr] P[i]_OP Expr
      
      On the one hand I wrote that out without trouble and I’m convinced up to typos/editing errors it’s correct. On the other hand it feels like a grammar that ought to be generated by code, and would be annoying to review for typos because of the inherent repetitiveness.

6

julesjacobs 6 months ago | link

Luka Miljak’s masters thesis (https://repository.tudelft.nl/islandora/object/uuid%3A2d831aa4-df6b-4ab6-983e-9776c710b450) verifies an approach similar to the one suggested at the end of your post, where you have rules to filter out patterns that violate precedence rules. The idea is that such rules should only filter out ambiguous cases, and never make a previously parseable string unparseable. Second, such rules should filter out all ambigous cases, and never leave multiple parse trees for a given string.

Another approach is to write down a parameterised LR grammar (basically, a program that generates a LR grammar). This is similar to the approach suggested in the beginning of your post, but with a sufficiently high level way of writing down the LR grammar.

3

matklad 6 months ago | link

Thanks, getting a link like this was exactly the purpose of me writing this down!

4

stevan 6 months ago | link

Have you seen Labelled BNF grammar? Here’s your example in it:

EAdd. 	Exp 	::= 	Exp 	"+" 	Exp1 	;
EMul. 	Exp1 	::= 	Exp1 	"*" 	Exp2 	;
EInt. 	Exp2 	::= 	Integer 	;
coercions 	Exp 	2 	;

There’s a tool called bnfc which can generate an AST, lexer, parser, pretty printer, and docs from the above in many programming languages.

1

safinaskar 5 months ago | link

Yes, cool project. Unfortunately, it doesn’t give guarantee that parser and pretty printer match, i. e. it doesn’t give round-trip guarantee. This is bug about this reported by me: https://github.com/BNFC/bnfc/issues/360 .

I wrote Haskell library, which does thing similar to bnfc (but restricted to Haskell), but which gives guarantee of round-trip! https://mail.haskell.org/pipermail/haskell-cafe/2021-July/134217.html

2

deepchasm 6 months ago | link

This doesn’t actually solve the problem. The problem is characterizing 3 - 2 + 1 - 4 / 2 * 2

The only correct answer to that equation, in code, is to throw an error and have the programmer explicitly state the intent. The cost of a few extra punctuation symbols is less than the expected cost of the errors prevented, ((3 - 2) + (1 - 4)) / (2 * 2) is the correct monster equation. No one gets the intent wrong.

Also, Expr = Expr '-' Factor means lines like a-bwould be acceptable, and they should never be. a - b, and a + -b` are reasonable, but the mistakes from alien identifiers (looking at you, CSS) that use hyphens is too high to justify saving a whitespace.

Unfortunately, these are small dollar items in the Maintenance Equation, so actual testing to determine the validity generally doesn’t pay for itself.

2

matklad 6 months ago | link

Thanks, I was waiting for this comment! One cool thing here is that it’s not some hack for precedence levels specifically, but rather seems to be a pretty general thing. So, you can decide on case by case basis whether you want precedence between two operators, or whether a combination should be banned from the grammar.
1. 1
  
  deepchasm 5 months ago | link
  
  Yes. It’s not exactly case by case.
  
  The math term Associativity matters. “A + ( B + C) == (A + B) + C”, meaning that “do pairwise additions on this list” allows the compiler or runtime to have free reign to farm out portions to multiple processors. Also, human brains do fine with “A + B + C” when the same operator and associative. Human brains do not do well with “2 ** 3 ** 4”. In practice, humans blow “A OR B AND C” as well.
  
  The fast hack is always require parenthesis between binary operators unless they are the same, and are associative. You can still use precedence levels for areas you expect good human comprehension. For example, “A + B AND C + D” usually gets parsed correctly. So, you end up with only a few levels of precedence, like handling unary operations early.
  
  Confusing code should be incorrect code.

2

cpurdy 6 months ago | link

I saw this earlier, and Doug’s issues with it were spot on, but having thought about it a bit longer, it kept reminding me of Pratt parsing. So @matklad you may find the topic of Pratt parsing quite fun if you enjoyed playing with the ideas in your blog.

3

doug-moen 6 months ago | link

You mean like this? https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html
1. 2
  
  cpurdy 6 months ago | link
  
  Yes indeed! Damned time machines are making me look all stupid now.
2

doug-moen 6 months ago | link

The idea that occurred to me is Operator precedence parsing, which is simpler than Pratt parsing, and which efficiently parses an infix expression with many levels of operator precedence, with a very compact specification. You’ll need a different parser technology to handle the non-infix part of your language.

1

safinaskar 6 months ago | link

Your formalism looks very similar to formalism used in EcmaScript spec.

Okay, so this is example from the end of your post:

Expr =
    'number'
  | '(' Expr ')'
  | Expr '+' Expr
  | Expr '*' Expr

Expr !=
             Expr '+' [Expr '+' Expr]
|            Expr '*' [Expr '*' Expr]
|            Expr '*' [Expr '+' Expr]
| [Expr '+' Expr] '*' Expr

It can be rewritten so:

Expr =
    'number'
  | '(' Expr ')'
  | Expr '+' Expr /* but right Expr cannot be a sum */
  | Expr '*' Expr /* but right Expr cannot be a product, and both Exprs cannot be a sum */

Then we can parameterize Expr with parameters “Sum” and “Prod” and we get this (“Sum” means “may be a sum”, and “prod” means “may be a product”) (in EcmaScript spec notation):

Expr[Sum, Prod] =
    'number'
  | '(' Expr[+Sum, +Prod] ')'
  | [+Sum]     Expr[+Sum, +Prod] '+' Expr[~Sum, +Prod]
  | [+Prod]    Expr[~Sum, +Prod] '*' Expr[~Sum, ~Prod]

And we got exact notation from EcmaScript spec! It is described here: https://tc39.es/ecma262/#sec-grammatical-parameters

Then we can trivially convert this form into usual CFG (using rules from link above). And then we can construct LR(1) table to be sure our grammar is unambiguous.

Ask me any questions about grammars. I think I’m expert. To prove that I’m expert, let me show you this my library: https://mail.haskell.org/pipermail/haskell-cafe/2021-July/134217.html . This is reversible parser with guaranteed reversibility. It is unique among all programming languages

1

safinaskar 6 months ago | link

bison has very nice notation. Here is example from manual:

%token NUM
%left '-' '+'
%left '*' '/'
%precedence NEG   /* negation--unary minus */
%right '^'        /* exponentiation */

%% /* The grammar follows. */

exp:
  NUM
| exp '+' exp        { $$ = $1 + $3;      }
| exp '-' exp        { $$ = $1 - $3;      }
| exp '*' exp        { $$ = $1 * $3;      }
| exp '/' exp        { $$ = $1 / $3;      }
| '-' exp  %prec NEG { $$ = -$2;          }
| exp '^' exp        { $$ = pow ($1, $3); }
| '(' exp ')'        { $$ = $2;           }
;

See full example in section 2.2 in https://www.gnu.org/software/bison/manual/bison.html

Bison is able to construct LR(1) table for this grammar, and thus prove that it is unambiguous (unfortunately, I don’t fully understand what exactly these %left, etc commands mean).

Note, that order of these command matters, i. e. we specify both associativity and priority of our operators

1

safinaskar 6 months ago | link

lalrpop supports notation similar to bison’s notation. See it in the end of this section: https://lalrpop.github.io/lalrpop/tutorial/004_full_expressions.html

1

safinaskar 5 months ago | link

I like notation for infix expressions in Isabelle. It is excellent!

Isabelle uses the following conventions (I give them in simplified form):

If you grammar has some nonterminal, for example, Expr, then Isabelle internally replaces it with 1001 nonterminals Expr0, Expr1, …, Expr1000
When you use nonterminal on the left side of production, Isabelle adds 1000 to it. If you use it on the right side, then Isabelle adds 0. For example, production Expr = '(' Expr ')' is converted to Expr1000 = '(' Expr0 ')'
Isabelle automatically introduces these productions: Expr0 = Expr1, Expr1 = Expr2, …, Expr999 = Expr1000
Expr infix '+' 10 is automatically converted to Expr10 = Expr11 '+' Expr11 (for non-associative operators)
Expr infixl '+' 10 is automatically converted to Expr10 = Expr10 '+' Expr11
Expr infixr '+' 10 is automatically converted to Expr10 = Expr11 '+' Expr10

Using these conventions we can write this (I will call this snippet as (*)):

Expr = Num
Expr = '(' Expr ')'
Expr infixl '+' 65
Expr infixl '-' 65
Expr infixl '*' 70
Expr infixl '/' 70
Expr infixr '^' 80

Isabelle will automatically translate this grammar to this:

Expr1000 = Num
Expr1000 = '(' Expr0 ')'
Expr65 = Expr65 '+' Expr66
Expr65 = Expr65 '-' Expr66
Expr70 = Expr70 '*' Expr71
Expr70 = Expr70 '/' Expr71
Expr80 = Expr81 '^' Expr80
Expr0 = Expr1
Expr1 = Expr2
Expr2 = Expr3
...
Expr999 = Expr1000

Then we can construct LR(1) table for the grammar, and thus check that it is unambiguous. (Unfortunately, Isabelle doesn’t do this, Isabelle uses Earley parser instead.)

And this notation is fully transparent. I can fully understand it. As opposed to bison’s and lalrpop’s precedence notation.

I simplified Isabelle notation above. In fact, (*) snippet looks so in real Isabelle notation:

syntax
  "num_const" :: "num_token => expr" ("_")
  "braces" :: "expr => expr" ("(_)")
  "add" :: "[expr, expr] => expr" (infixl "+" 65)
  "sub" :: "[expr, expr] => expr" (infixl "-" 65)
  "mul" :: "[expr, expr] => expr" (infixl "*" 70)
  "div" :: "[expr, expr] => expr" (infixl "/" 70)
  "pow" :: "[expr, expr] => expr" (infixr "^" 80)

You can see this notation in the wild here: https://isabelle.in.tum.de/dist/library/HOL/HOL/HOL.html (search in this document “infix”, “syntax”, “notation”, “nonterminal”). Full explanation of rules is in https://isabelle.in.tum.de/dist/Isabelle2024/doc/isar-ref.pdf in 8.2 - 8.4.3. But I don’t think you need to read this manual or try to understand real Isabelle notation. I think information I gave in this comment is enough

1

safinaskar 5 months ago | link

lalrpop’s tier trick may be of interest. Search “tier” on this page: https://lalrpop.github.io/lalrpop/tutorial/006_macros.html