Parsing Expression Grammars
Parsing expression grammars (PEGs) are a new-ish formal grammar intended to describe computer languages. PEGs differ from the more commonplace context-free grammars most notably in that PEGs offer an
Another natural consequence of ordered choice is that so-called repetition operators (
*
and +
) and the option operator (?
) are necessarily greedy. (Any time a grammar rule can recur, ordered choice ensures that it does recur, since the option of not recurring will implicitly not match.)A PEG rule, as described in the Wikipedia entry, may be either:
- a terminal (a symbol),
- a nonterminal (another rule), or
- the empty string.
- sequence:
A B
- ordered choice:
A / B
- zero or more:
A*
- one or more:
A+
- optional:
A?
- positive lookahead:
&A
- negative lookahead:
!A
Translating into Mercury
It’s well-known that CFGs can be implemented directly in logic languages such as Mercury as recursive-descent parsers. Prolog and Mercury even have a special syntax known as definite clause grammar (DCG) to facilitate doing so. It’s similarly easy to implement PEGs, and there’s a surprise at the end!. Let’s see how.
PEG rules can be represented using DCG clauses like so:
rule --> body
The type and mode of rules can be inferred by Mercury, but if you need to declare them, you can do so as:
:- pred rule(list(char)::in, list(char)::out) is semidet.
The elements of the body of a PEG are translated as follows:
- terminal C
[C]
- nonterminal A
A
- empty string
[]
- sequence
A B
A, B
- ordered choice
A / B
A -> []; B
Note that using-> [] ;
(if/else) instead of;
gives us exactly the greedy semantics we require.- optional
A?
A -> []; []
- positive lookahead
&A
not not A
- or equivalently,
=(S), A, :=(S)
- negative lookahead
!A
not A
Repetition constructions (
A*
and A+
) are recursive in nature and cannot be represented inline. However we can define higher order rules to represent them as::- mode *(pred(in, out) is semidet, in, out) is det. *(P) --> (P, *(P)) -> []; []. :- mode +(pred(in, out) is semidet, in, out) is semidet. +(P) --> (P, +(P)) -> []; P.Then we can translate repetition constructions as:
- zero or more:
A*
*(A)
or*(pred(in, out) is semidet --> A)
- one or more:
A+
+(A)
or+(pred(in, out) is semidet --> A)
Example
Let’s translate the example of basic arithmetic expressions from Wikipedia:
Value ← [0-9]+ / '(' Expr ')' Product ← Value (('*' / '/') Value)* Sum ← Product (('+' / '-') Product)* Expr ← SumFirst let’s get some boilerplate out of the way. We’ll need the
list
module for DCG-syntax terminals::- module peg_example. :- interface. :- import_module io. :- pred main(io::di, io::uo) is det. :- implementation. :- import_module list.Now onto
Value
. Right off the bat Wikipedia throws us a curveball with the range expression [0-9]
. Fortunately that’s easily translated as a specially-moded function::- mode in..in = in is semidet. A..B = C :- A @=< C, C @=< B.Even though (in fact, because)
..
is a builtin operator, we can use it in definitions as well.Armed with our new range operator, we can translate the
Value
rule as:value --> +(pred(in, out) is semidet --> ['0'..'9']) -> []; (['('], expr, [')']).Note that the lambda-predicate (
pred(in, out) is semidet -->
) is needed since ['0'..'9']
is not a predicate on its own. The translation is otherwise straightforward.Product
, Sum
, and Expr
follow similarly:product --> value, *(pred(in, out) is semidet --> ((['*'] -> []; ['/']), value)). sum --> product, *(pred(in, out) is semidet --> ((['+'] -> []; ['-']), value)). expr --> sum.Our
main
function will read a line, try to parse it using expr
and make sure that only the newline at the end is left unparsed:main(!IO) :- read_line(Res, !IO), (if Res = ok(Line) then (if expr(Line, ['\n']) then print("Success!\n", !IO) else print("Failure!\n", !IO)), main(!IO) else true).Compile with
mmc --infer-all --make peg_example
, run it and try out some expressions, keeping in mind that our grammar doesn’t allow negatives, decimals, or whitespace:$ ./peg_example 123+456 Success! 5/(10-20) Success! 45-(40 Failure! *12 Failure! a+18 Failure!Of course, a parser that merely tests whether input is well-formed is only mildly useful. Next time, we’ll look into creating parse trees, and the real-life example of parsing C code.
Download peg_example.m
This comment has been removed by the author.
ReplyDeleteHi,
ReplyDeletegr8 post.
I think negation is one of the most underated and badly covered feature (in text books and papers) of DCG. I believe it has a lot of potential in parsing, especially when one takes the idea of "Parsing is Deduction" seriously.
Best Regards
Hi Jan, thanks!
ReplyDeleteI agree about negation; in particular I find myself wanting negation often when using regular expressions (fortunately some RE engines such as Tcl’s support negation as a “negative lookahead assertion”).
I don’t suppose many of websites give this kind of information.Grammarly
ReplyDelete