`tokenize-ebnf.scm` defines a tokenizer (aka lexer, or scanner) for EBNF. It is compliant with ISO, but several other meta-structures, such as RegExp (`/.../`), are also consumed. Comments starting with `#` are tokenized, consumed, and appended to the token stream. Do with them as you wish!
This was an exercise in Scheme --- and, even more so, an exercise in hand-rolling lexers in a stateless language with the functional paradigm built in.
You can use this to, for example, translate EBNF to LaTeX, HTML, PostScript, or ROFF --- or maybe define a parser for it and convert it to Yacc, PEG, ANTLR, Flex, re2c, etc.
Extended Backus-Naur Form is a meta-language: a domain-specific language used to describe other languages. Meta-languages can describe the lexical grammar of a language, or its syntactic grammar. They can also describe the abstract syntax of the language (e.g. Zephyr ASDL; see my translator to C).
EBNF has been 'standardized' by ISO, but the standard is lacking. For example, it is much easier to use regular expressions to embed the lexical grammar of terminals amongst the non-terminals, whereas with ISO EBNF you'd have to define it in a much less readable way. I am not a fan of using regular expressions in tokenizers; as Rob Pike explains, they just make things messy and the final DFA unreadable. Plus, just you try to compile an NFA into a DFA in a language without any imperative control structures! But using Regex to describe the lexical grammar of terminals, a la PEG, is fine. (I am not sure what PEG generators do with Regex --- do they build a DFA? PEG is often translated to an LL(1) recursive-descent parser, so it's up to them what they do. They can scoop characters up as they go. ANTLR is LL(k), and it uses so many heuristics to parse that it runs the Euphrates dry!)
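Since much of this Gist is about hand-rolling a lexer, here is what the regex-free, character-dispatch style looks like in miniature. This is a hypothetical Python sketch of the technique, not the Scheme code in `tokenize-ebnf.scm`; the token names and the tiny EBNF subset it covers are my own choices for illustration.

```python
# A minimal, regex-free lexer sketch in the character-dispatch style.
# Illustrative only -- NOT the tokenizer from tokenize-ebnf.scm.

def tokenize(src):
    """Tokenize a toy EBNF subset: names, '::=', quoted terminals, punctuation."""
    tokens = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c.isspace():
            i += 1                                   # skip whitespace
        elif src.startswith("::=", i):               # definition operator
            tokens.append(("DEFINE", "::="))
            i += 3
        elif c in "'\"`":                            # quoted terminal
            j = src.index(c, i + 1)                  # matching close quote
            tokens.append(("TERMINAL", src[i + 1:j]))
            i = j + 1
        elif c in "|;[]{}()":                        # one-char punctuation
            tokens.append(("PUNCT", c))
            i += 1
        else:                                        # non-terminal name
            j = i
            while j < n and (src[j].isalnum() or src[j] == "-"):
                j += 1
            tokens.append(("NONTERM", src[i:j]))
            i = j
    return tokens

print(tokenize("lower ::= 'a' | 'b' ;"))
# [('NONTERM', 'lower'), ('DEFINE', '::='), ('TERMINAL', 'a'),
#  ('PUNCT', '|'), ('TERMINAL', 'b'), ('PUNCT', ';')]
```

Notice there is not a single regular expression in sight: each branch dispatches on the current character, exactly the style the paragraph above advocates.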
Anyways, the rules for EBNF:
- We have 'rules', which are 'non-terminal names' separated by `::=` from 'non-terminal definitions':

      non-terminal-name ::= ... ;
- We have 'terminals', inside single, double or backtick quotes;
- You can use infix `...` to denote a range between the left terminal operand and the right terminal operand;
- We can use infix `|` to denote 'choice' between rule clusters (terminals, or composites);
- It's best, for the sake of conversion to e.g. Yacc, to use single quotes for characters, and double quotes for strings:

      lower ::= 'a' | 'b' | ... | 'y' | 'z' ;
      keywords ::= "do" | "while" ;
- Delimiters: `[]` denotes 'optional', `{}` denotes 'repetition' (zero or more), `()` denotes 'group':

      if-else-stmt ::= "if" expr [ "then" ] stmt { stmt } { "elsif" { stmt } } [ "else" stmt { stmt } ] ;
- 'Free capture' can be done by delimiting text between `?` and `?`:

      any-char ::= ? any ascii character ?
There are some other rules, but you get the gist!
EBNF grammars are well capable of describing Context-Free (Type 2) and Regular (Type 3) grammars. In fact, EBNF's operators map one-to-one onto regular-expression operators. So despite what I said about ISO EBNF having no rules for RegExp, it's just a matter of rolling up your sleeves to describe or define a Regular grammar, that is, a lexical grammar, in EBNF.
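To make that operator-for-operator correspondence concrete, here is a hedged sketch that compiles a tiny EBNF AST down to Python regular expressions. The tuple representation (`"lit"`, `"range"`, `"seq"`, `"alt"`, `"rep"`, `"opt"`) is a hypothetical mini-AST of my own invention, not the representation used anywhere in this Gist:

```python
import re

# Illustrative, assumption-laden mapping from EBNF operators to regex
# operators; the tuple AST below is invented for this sketch.
def to_regex(node):
    kind = node[0]
    if kind == "lit":                  # 'a'        ->  a
        return re.escape(node[1])
    if kind == "range":                # 'a'...'z'  ->  [a-z]
        return f"[{node[1]}-{node[2]}]"
    if kind == "seq":                  # a b        ->  ab
        return "".join(to_regex(n) for n in node[1:])
    if kind == "alt":                  # a | b      ->  (a|b)
        return "(" + "|".join(to_regex(n) for n in node[1:]) + ")"
    if kind == "rep":                  # { a }      ->  (a)*
        return "(" + to_regex(node[1]) + ")*"
    if kind == "opt":                  # [ a ]      ->  (a)?
        return "(" + to_regex(node[1]) + ")?"
    raise ValueError(kind)

# id ::= id-head { letter | digit | '_' } ;
ident = ("seq",
         ("alt", ("range", "a", "z"), ("range", "A", "Z"), ("lit", "_")),
         ("rep", ("alt", ("range", "a", "z"), ("range", "A", "Z"),
                  ("range", "0", "9"), ("lit", "_"))))

assert re.fullmatch(to_regex(ident), "foo_42")
```

Each EBNF operator lands on exactly one regex operator: range becomes a character class, `{}` becomes `*`, `[]` becomes `?`, and `|` stays `|`.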
Let's compare with PEG, which is also capable of both Type 2 and Type 3, but has Regexp built in.
- PEG:

      id <- [a-zA-Z_][a-zA-Z0-9_]* ;
      id-list <- id ( ',' id )*
- Equivalent EBNF:

      upper ::= 'A' ... 'Z' ;
      lower ::= 'a' ... 'z' ;
      digit ::= '0' ... '9' ;
      letter ::= upper | lower ;
      id-head ::= letter | '_' ;
      id ::= id-head { letter | digit | '_' } ;
      id-list ::= id { ',' id } ;
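As a sanity check, the PEG character classes above transliterate directly into Python regular expressions. (For brevity, neither pattern below handles whitespace around the comma, and the comma handling in `id_list_re` is my own transliteration of `( ',' id )*`:)

```python
import re

# Direct transliteration of the PEG lexical rules above.
id_re = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
id_list_re = re.compile(rf"{id_re.pattern}(,{id_re.pattern})*")

assert id_re.fullmatch("snake_case_42")
assert not id_re.fullmatch("42starts_with_digit")   # may not start with a digit
assert id_list_re.fullmatch("a,b2,_c")
```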
I think you should use a mix of PEG and EBNF --- EBNF with Regexp, like Python describes and defines its grammar. PEG is more 'terse' than it should be for a meta-language; after all, PEG is meant to be fed to parser generators!
I have included an `ebnf-bootstrap.ebnf` file in this Gist. You can `wget` it and begin writing your grammar. If you are a [Neo]Vim user, you can use `ebnf.vim`, also in this Gist.
I hope you learned something from this. Have fun.
Contact me at chubakbidpaa [at] riseup [dot] net if you need help. Check out my work!