1. 23
    1. 9

      You begin with this example grammar:

      Expr =
          Factor
        | Expr '+' Factor
      
      Factor =
          Atom
        | Factor '*' Atom
      
      Atom = 'number' | '(' Expr ')'
      

      You don’t like this style, and your suggested replacement is:

      Expr =
          'number'
        | '(' Expr ')'
        | Expr '+' Expr
        | Expr '*' Expr
      
      Expr !=
                   Expr '+' [Expr '+' Expr]
      |            Expr '*' [Expr '*' Expr]
      |            Expr '*' [Expr '+' Expr]
      | [Expr '+' Expr] '*' Expr
      

      The suggested alternative is already longer than the original (more lines of code, more tokens). If you try to extend the grammar with a ‘-’ operator, then you add one line to the original construction:

        | Expr '-' Factor
      

      But you add many more lines to the alternative construction, because the number of restrictions blows up. You’d need to add:

      |            Expr '-' [Expr '-' Expr]
      |            Expr '-' [Expr '+' Expr]
      |            Expr '+' [Expr '-' Expr]
      |            Expr '*' [Expr '-' Expr]
      |           [Expr '-' Expr] '*' Expr
      

      This quickly becomes unwieldy for a non-trivial grammar.

      1. 5

        One can replace any '+' with ('+' | '-').

        Actually, the generative expression probably should say Expr ('+' | '*') Expr, it’s enough for it to only enumerate shapes of expressions.

        I am not aiming at the shortest expression here, I am aiming at the one which mostly clearly communicates the intended set. To me, the subtractive formulation feels much more natural, as it directly expresses the local property we want (between + and *, multiplication wins). In the traditional formulation, this local property appears as a consequence of the global shape of the grammar, which for me is not intuitive.

        1. 4

          I think there is still a problem.

          Let’s define

          AddOp = '+' | '-'
          MulOp = '*' | '/'
          PowOp = '^'
          

          and then see what happens when we add another level of operator precedence.

          The traditional style:

          Expr = Factor | Expr AddOp Factor
          Factor = Power | Factor MulOp Power
          Power = Atom | Atom PowOp Power
          Atom = 'number' | '(' Expr ')'
          

          Using this new style:

          Expr = Expr (AddOp|MulOp|PowOp) Expr
          Expr !=
              Expr AddOp [Expr AddOp Expr] |
              [Expr AddOp Expr] MulOp Expr |
              [Expr AddOp Expr] PowOp Expr |
              Expr MulOp [Expr AddOp Expr] |
              Expr MulOp [Expr MulOp Expr] |
              [Expr MulOp Expr] PowOp Expr |
              Expr PowOp [Expr AddOp Expr] |
              Expr PowOp [Expr MulOp Expr] |
              [Expr PowOp Expr] PowOp Expr
          

          Using the traditional style, the size of the grammar grows linearly with the number of precedence levels. With this new style, it grows non-linearly:

          • With 2 levels of operator precedence, we need 2^2 = 4 restrictions.
          • With 3 levels of operator precedence, we need 3^2 = 9 restrictions.
          • With 20 levels of operator precedence, we need 20^2 = 400 restrictions.

          The new style seems to be impractical for most commonly used programming languages.

          1. 2

            I think its still linear:

            Expr !=
                Expr                            AddOp [Expr AddOp Expr] |
            
                [Expr AddOp Expr]               MulOp Expr |
                Expr                            MulOp [Expr (AddOp|MulOp) Expr] |
            
                [Expr (AddOp|MulOp|PowOp) Expr] PowOp Expr |
                Expr                            PowOp [Expr (AddOp|MulOp) Expr] |
            

            well, technically in this form it is still quadratic, but you can define aliases like AddOrMul, etc, to make this truly linear.

            1. 5

              In the traditional style, we express the fact that that (AddOp is left associative, MulOp is left associative, PowOp is right associative) in exactly one place in the code. Precedence relations are strictly ordered, and each ordering relationship (AddOp < MulOp, MulOp < PowOp) is stated in exactly two places, on two adjacent lines, regardless of the number of precedence levels. This makes the code relatively easy to reason about using local reasoning.

              With the new style, this same information is smeared out across many restrictions.

              • In my version of the grammar, each (left,right)-associative fact is stated once per precedence level. However, ordering relationships between precedence levels are stated N-1 times for each precedence level, assuming there are N precedence levels.
              • In your version of the grammar, it’s the same, even though your code is more compressed.

              The problem with stating each precedence relation N-1 times is keeping all those instances in sync, and also reasoning about the code and finding bugs if one of those instances is misstated. It’s higher cognitive code to convince yourself the code is correct.

            2. 3

              So the general form for a simple precedence tree with equal-precedence ops evaluating left to right is:

              P1_OP = AddOp | ...
              P2_OP = MulOp | ...
              P3_OP = PowOp | ...
              ...
              PN_OP = ...
              
              AT_MOST_P1_OP = P1_OP
              ...
              AT_MOST_PN_OP = AT_MOST_P[N-1]_OP | PN_OP
              
              Expr = Expr PN_OP EXPR
              Expr != 
                  Expr P[i]_OP [Expr AT_MOST_P[i]_OP Expr]
                  | [Expr AT_MOST_P[i-1]_OP Expr] P[i]_OP Expr
              

              On the one hand I wrote that out without trouble and I’m convinced up to typos/editing errors it’s correct. On the other hand it feels like a grammar that ought to be generated by code, and would be annoying to review for typos because of the inherent repetitiveness.

    2. 6

      Luka Miljak’s masters thesis (https://repository.tudelft.nl/islandora/object/uuid%3A2d831aa4-df6b-4ab6-983e-9776c710b450) verifies an approach similar to the one suggested at the end of your post, where you have rules to filter out patterns that violate precedence rules. The idea is that such rules should only filter out ambiguous cases, and never make a previously parseable string unparseable. Second, such rules should filter out all ambigous cases, and never leave multiple parse trees for a given string.

      Another approach is to write down a parameterised LR grammar (basically, a program that generates a LR grammar). This is similar to the approach suggested in the beginning of your post, but with a sufficiently high level way of writing down the LR grammar.

      1. 3

        Thanks, getting a link like this was exactly the purpose of me writing this down!

    3. 4

      Have you seen Labelled BNF grammar? Here’s your example in it:

      EAdd. 	Exp 	::= 	Exp 	"+" 	Exp1 	;
      EMul. 	Exp1 	::= 	Exp1 	"*" 	Exp2 	;
      EInt. 	Exp2 	::= 	Integer 	;
      coercions 	Exp 	2 	;
      

      There’s a tool called bnfc which can generate an AST, lexer, parser, pretty printer, and docs from the above in many programming languages.

      1. 1

        Yes, cool project. Unfortunately, it doesn’t give guarantee that parser and pretty printer match, i. e. it doesn’t give round-trip guarantee. This is bug about this reported by me: https://github.com/BNFC/bnfc/issues/360 .

        I wrote Haskell library, which does thing similar to bnfc (but restricted to Haskell), but which gives guarantee of round-trip! https://mail.haskell.org/pipermail/haskell-cafe/2021-July/134217.html

    4. 2

      This doesn’t actually solve the problem. The problem is characterizing 3 - 2 + 1 - 4 / 2 * 2

      The only correct answer to that equation, in code, is to throw an error and have the programmer explicitly state the intent. The cost of a few extra punctuation symbols is less than the expected cost of the errors prevented, ((3 - 2) + (1 - 4)) / (2 * 2) is the correct monster equation. No one gets the intent wrong.

      Also, Expr = Expr '-' Factor means lines like a-bwould be acceptable, and they should never be. a - b, and a + -b` are reasonable, but the mistakes from alien identifiers (looking at you, CSS) that use hyphens is too high to justify saving a whitespace.

      Unfortunately, these are small dollar items in the Maintenance Equation, so actual testing to determine the validity generally doesn’t pay for itself.

      1. 2

        Thanks, I was waiting for this comment! One cool thing here is that it’s not some hack for precedence levels specifically, but rather seems to be a pretty general thing. So, you can decide on case by case basis whether you want precedence between two operators, or whether a combination should be banned from the grammar.

        1. 1

          Yes. It’s not exactly case by case.

          The math term Associativity matters. “A + ( B + C) == (A + B) + C”, meaning that “do pairwise additions on this list” allows the compiler or runtime to have free reign to farm out portions to multiple processors. Also, human brains do fine with “A + B + C” when the same operator and associative. Human brains do not do well with “2 ** 3 ** 4”. In practice, humans blow “A OR B AND C” as well.

          The fast hack is always require parenthesis between binary operators unless they are the same, and are associative. You can still use precedence levels for areas you expect good human comprehension. For example, “A + B AND C + D” usually gets parsed correctly. So, you end up with only a few levels of precedence, like handling unary operations early.

          Confusing code should be incorrect code.

    5. 2

      I saw this earlier, and Doug’s issues with it were spot on, but having thought about it a bit longer, it kept reminding me of Pratt parsing. So @matklad you may find the topic of Pratt parsing quite fun if you enjoyed playing with the ideas in your blog.

        1. 2

          Yes indeed! Damned time machines are making me look all stupid now.

      1. 2

        The idea that occurred to me is Operator precedence parsing, which is simpler than Pratt parsing, and which efficiently parses an infix expression with many levels of operator precedence, with a very compact specification. You’ll need a different parser technology to handle the non-infix part of your language.

    6. 1

      Your formalism looks very similar to formalism used in EcmaScript spec.

      Okay, so this is example from the end of your post:

      Expr =
          'number'
        | '(' Expr ')'
        | Expr '+' Expr
        | Expr '*' Expr
      
      Expr !=
                   Expr '+' [Expr '+' Expr]
      |            Expr '*' [Expr '*' Expr]
      |            Expr '*' [Expr '+' Expr]
      | [Expr '+' Expr] '*' Expr
      

      It can be rewritten so:

      Expr =
          'number'
        | '(' Expr ')'
        | Expr '+' Expr /* but right Expr cannot be a sum */
        | Expr '*' Expr /* but right Expr cannot be a product, and both Exprs cannot be a sum */
      

      Then we can parameterize Expr with parameters “Sum” and “Prod” and we get this (“Sum” means “may be a sum”, and “prod” means “may be a product”) (in EcmaScript spec notation):

      Expr[Sum, Prod] =
          'number'
        | '(' Expr[+Sum, +Prod] ')'
        | [+Sum]     Expr[+Sum, +Prod] '+' Expr[~Sum, +Prod]
        | [+Prod]    Expr[~Sum, +Prod] '*' Expr[~Sum, ~Prod]
      

      And we got exact notation from EcmaScript spec! It is described here: https://tc39.es/ecma262/#sec-grammatical-parameters

      Then we can trivially convert this form into usual CFG (using rules from link above). And then we can construct LR(1) table to be sure our grammar is unambiguous.


      Ask me any questions about grammars. I think I’m expert. To prove that I’m expert, let me show you this my library: https://mail.haskell.org/pipermail/haskell-cafe/2021-July/134217.html . This is reversible parser with guaranteed reversibility. It is unique among all programming languages

    7. 1

      bison has very nice notation. Here is example from manual:

      %token NUM
      %left '-' '+'
      %left '*' '/'
      %precedence NEG   /* negation--unary minus */
      %right '^'        /* exponentiation */
      
      %% /* The grammar follows. */
      
      exp:
        NUM
      | exp '+' exp        { $$ = $1 + $3;      }
      | exp '-' exp        { $$ = $1 - $3;      }
      | exp '*' exp        { $$ = $1 * $3;      }
      | exp '/' exp        { $$ = $1 / $3;      }
      | '-' exp  %prec NEG { $$ = -$2;          }
      | exp '^' exp        { $$ = pow ($1, $3); }
      | '(' exp ')'        { $$ = $2;           }
      ;
      

      See full example in section 2.2 in https://www.gnu.org/software/bison/manual/bison.html

      Bison is able to construct LR(1) table for this grammar, and thus prove that it is unambiguous (unfortunately, I don’t fully understand what exactly these %left, etc commands mean).

      Note, that order of these command matters, i. e. we specify both associativity and priority of our operators

    8. 1

      lalrpop supports notation similar to bison’s notation. See it in the end of this section: https://lalrpop.github.io/lalrpop/tutorial/004_full_expressions.html

    9. 1

      I like notation for infix expressions in Isabelle. It is excellent!

      Isabelle uses the following conventions (I give them in simplified form):

      • If you grammar has some nonterminal, for example, Expr, then Isabelle internally replaces it with 1001 nonterminals Expr0, Expr1, …, Expr1000
      • When you use nonterminal on the left side of production, Isabelle adds 1000 to it. If you use it on the right side, then Isabelle adds 0. For example, production Expr = '(' Expr ')' is converted to Expr1000 = '(' Expr0 ')'
      • Isabelle automatically introduces these productions: Expr0 = Expr1, Expr1 = Expr2, …, Expr999 = Expr1000
      • Expr infix '+' 10 is automatically converted to Expr10 = Expr11 '+' Expr11 (for non-associative operators)
      • Expr infixl '+' 10 is automatically converted to Expr10 = Expr10 '+' Expr11
      • Expr infixr '+' 10 is automatically converted to Expr10 = Expr11 '+' Expr10

      Using these conventions we can write this (I will call this snippet as (*)):

      Expr = Num
      Expr = '(' Expr ')'
      Expr infixl '+' 65
      Expr infixl '-' 65
      Expr infixl '*' 70
      Expr infixl '/' 70
      Expr infixr '^' 80
      

      Isabelle will automatically translate this grammar to this:

      Expr1000 = Num
      Expr1000 = '(' Expr0 ')'
      Expr65 = Expr65 '+' Expr66
      Expr65 = Expr65 '-' Expr66
      Expr70 = Expr70 '*' Expr71
      Expr70 = Expr70 '/' Expr71
      Expr80 = Expr81 '^' Expr80
      Expr0 = Expr1
      Expr1 = Expr2
      Expr2 = Expr3
      ...
      Expr999 = Expr1000
      

      Then we can construct LR(1) table for the grammar, and thus check that it is unambiguous. (Unfortunately, Isabelle doesn’t do this, Isabelle uses Earley parser instead.)

      And this notation is fully transparent. I can fully understand it. As opposed to bison’s and lalrpop’s precedence notation.

      I simplified Isabelle notation above. In fact, (*) snippet looks so in real Isabelle notation:

      syntax
        "num_const" :: "num_token => expr" ("_")
        "braces" :: "expr => expr" ("(_)")
        "add" :: "[expr, expr] => expr" (infixl "+" 65)
        "sub" :: "[expr, expr] => expr" (infixl "-" 65)
        "mul" :: "[expr, expr] => expr" (infixl "*" 70)
        "div" :: "[expr, expr] => expr" (infixl "/" 70)
        "pow" :: "[expr, expr] => expr" (infixr "^" 80)
      

      You can see this notation in the wild here: https://isabelle.in.tum.de/dist/library/HOL/HOL/HOL.html (search in this document “infix”, “syntax”, “notation”, “nonterminal”). Full explanation of rules is in https://isabelle.in.tum.de/dist/Isabelle2024/doc/isar-ref.pdf in 8.2 - 8.4.3. But I don’t think you need to read this manual or try to understand real Isabelle notation. I think information I gave in this comment is enough

    10. 1

      lalrpop’s tier trick may be of interest. Search “tier” on this page: https://lalrpop.github.io/lalrpop/tutorial/006_macros.html