### Parsing Expression Grammars

Parsing expression grammars (PEGs) are a new-ish formal grammar intended to describe computer languages. PEGs differ from the more commonplace context-free grammars most notably in that PEGs offer an

*does not match*any expression which is matched by the first branch. Since any text can only match exactly one branch of an ordered choice, this has the practical implication that

*a PEG cannot be ambiguous*and rule precedence declarations are not needed.

Another natural consequence of ordered choice is that so-called repetition operators (

`*`

and `+`

) and the option operator (`?`

) are necessarily greedy. (Any time a grammar rule *can*recur, ordered choice ensures that it

*does*recur, since the option of not recurring will implicitly not match.)

A PEG rule, as described in the Wikipedia entry, may be either:

- a terminal (a symbol),
- a nonterminal (another rule), or
- the empty string.

- sequence:
`A``B` - ordered choice:
`A`/`B` - zero or more:
`A`* - one or more:
`A`+ - optional:
`A`? - positive lookahead:
`&`

`A` - negative lookahead:
`!`

`A`

### Translating into Mercury

It’s well-known that CFGs can be implemented directly in logic languages such as Mercury as recursive-descent parsers. Prolog and Mercury even have a special syntax known as definite clause grammar (DCG) to facilitate doing so. It’s similarly easy to implement PEGs, and

*there’s a surprise at the end!*. Let’s see how.

PEG rules can be represented using DCG clauses like so:

`rule` --> `body`

The type and mode of rules can be inferred by Mercury, but if you need to declare them, you can do so as:

`:- pred ``rule`(list(char)::in, list(char)::out) is semidet.

The elements of the body of a PEG are translated as follows:

- terminal
`C` `[`

`C`]- nonterminal
`A` `A`- empty string
`[]`

- sequence
`A``B` `A`,`B`- ordered choice
`A`/`B` `A`-> [];`B`

Note that using`-> [] ;`

(if/else) instead of`;`

gives us exactly the greedy semantics we require.- optional
`A`? `A`-> []; []- positive lookahead
`&`

`A` `not not`

`A`- or equivalently,
`=(S),`

`A`, :=(S) - negative lookahead
`!`

`A` `not`

`A`

Repetition constructions (

`A*`

and `A+`

) are recursive in nature and cannot be represented inline. However we can define higher order rules to represent them as::- mode *(pred(in, out) is semidet, in, out) is det. *(P) --> (P, *(P)) -> []; []. :- mode +(pred(in, out) is semidet, in, out) is semidet. +(P) --> (P, +(P)) -> []; P.Then we can translate repetition constructions as:

- zero or more:
`A`* `*(`

or`A`)`*(pred(in, out) is semidet -->`

`A`)- one or more:
`A`+ `+(`

or`A`)`+(pred(in, out) is semidet -->`

`A`)

### Example

Let’s translate the example of basic arithmetic expressions from Wikipedia:

Value ← [0-9]+ / '(' Expr ')' Product ← Value (('*' / '/') Value)* Sum ← Product (('+' / '-') Product)* Expr ← SumFirst let’s get some boilerplate out of the way. We’ll need the

`list`

module for DCG-syntax terminals::- module peg_example. :- interface. :- import_module io. :- pred main(io::di, io::uo) is det. :- implementation. :- import_module list.Now onto

`Value`

. Right off the bat Wikipedia throws us a curveball with the range expression `[0-9]`

. Fortunately that’s easily translated as a specially-moded function::- mode in..in = in is semidet. A..B = C :- A @=< C, C @=< B.Even though (in fact, because)

`..`

is a builtin operator, we can use it in definitions as well.Armed with our new range operator, we can translate the

`Value`

rule as:value --> +(pred(in, out) is semidet --> ['0'..'9']) -> []; (['('], expr, [')']).Note that the lambda-predicate (

`pred(in, out) is semidet -->`

) is needed since `['0'..'9']`

is not a predicate on its own. The translation is otherwise straightforward.`Product`

, `Sum`

, and `Expr`

follow similarly:product --> value, *(pred(in, out) is semidet --> ((['*'] -> []; ['/']), value)). sum --> product, *(pred(in, out) is semidet --> ((['+'] -> []; ['-']), value)). expr --> sum.Our

`main`

function will read a line, try to parse it using `expr`

and make sure that only the newline at the end is left unparsed:main(!IO) :- read_line(Res, !IO), (if Res = ok(Line) then (if expr(Line, ['\n']) then print("Success!\n", !IO) else print("Failure!\n", !IO)), main(!IO) else true).Compile with

`mmc --infer-all --make peg_example`

, run it and try out some expressions, keeping in mind that our grammar doesn’t allow negatives, decimals, or whitespace:Of course, a parser that merely tests whether input is well-formed is only mildly useful. Next time, we’ll look into creating parse trees, and the real-life example of parsing C code.$./peg_example123+456Success!5/(10-20)Success!45-(40Failure!*12Failure!a+18Failure!

Download peg_example.m

This comment has been removed by the author.

ReplyDeleteHi,

ReplyDeletegr8 post.

I think negation is one of the most underated and badly covered feature (in text books and papers) of DCG. I believe it has a lot of potential in parsing, especially when one takes the idea of "Parsing is Deduction" seriously.

Best Regards

Hi Jan, thanks!

ReplyDeleteI agree about negation; in particular I find myself wanting negation often when using regular expressions (fortunately some RE engines such as Tcl’s support negation as a “negative lookahead assertion”).

I don’t suppose many of websites give this kind of information.Grammarly

ReplyDelete