Sunday, October 2, 2011

Parsing Expression Grammars part 1

First, I’d like to say welcome back! AiM is returning from its hiatus since April. The summer has been busy for me, but I hope to update AiM weekly henceforth. Let’s start!

Parsing Expression Grammars


Parsing expression grammars (PEGs) are a new-ish formal grammar intended to describe computer languages. PEGs differ from the more commonplace context-free grammars most notably in that PEGs offer an ordered choice operator. The second branch of an ordered choice implicitly does not match any expression which is matched by the first branch. Since any text can only match exactly one branch of an ordered choice, this has the practical implication that a PEG cannot be ambiguous and rule precedence declarations are not needed.

Another natural consequence of ordered choice is that so-called repetition operators (* and +) and the option operator (?) are necessarily greedy. (Any time a grammar rule can recur, ordered choice ensures that it does recur, since the option of not recurring will implicitly not match.)

A PEG rule, as described in the Wikipedia entry, may be either:
  • a terminal (a symbol),
  • a nonterminal (another rule), or
  • the empty string.
or may be constructed from other rules by:
  • sequence: A B
  • ordered choice: A / B
  • zero or more: A*
  • one or more: A+
  • optional: A?
  • positive lookahead: &A
  • negative lookahead: !A
Positive lookahead and negative lookahead test for a rule match (or lack thereof) but do not consume input.

Translating into Mercury


It’s well-known that CFGs can be implemented directly in logic languages such as Mercury as recursive-descent parsers. Prolog and Mercury even have a special syntax known as definite clause grammar (DCG) to facilitate doing so. It’s similarly easy to implement PEGs, and there’s a surprise at the end!. Let’s see how.

PEG rules can be represented using DCG clauses like so:

rule --> body

The type and mode of rules can be inferred by Mercury, but if you need to declare them, you can do so as:

:- pred rule(list(char)::in, list(char)::out) is semidet.

The elements of the body of a PEG are translated as follows:
terminal C
[C]
nonterminal A
A
empty string
[]
sequence A B
A, B
ordered choice A / B
A -> []; B
Note that using -> [] ; (if/else) instead of ; gives us exactly the greedy semantics we require.
optional A?
A -> []; []
positive lookahead &A
not not A
or equivalently, =(S), A, :=(S)
negative lookahead !A
not A

Repetition constructions (A* and A+) are recursive in nature and cannot be represented inline. However we can define higher order rules to represent them as:
:- mode *(pred(in, out) is semidet, in, out) is det.
*(P) --> (P, *(P)) -> []; [].
:- mode +(pred(in, out) is semidet, in, out) is semidet.
+(P) --> (P, +(P)) -> []; P.
Then we can translate repetition constructions as:
zero or more: A*
*(A) or *(pred(in, out) is semidet --> A)
one or more: A+
+(A) or +(pred(in, out) is semidet --> A)
The first, shorter form, can only be used for named rules with defined modes. The longer form is applicable to any rule construct.

Example


Let’s translate the example of basic arithmetic expressions from Wikipedia:
Value   ← [0-9]+ / '(' Expr ')'
Product ← Value (('*' / '/') Value)*
Sum     ← Product (('+' / '-') Product)*
Expr    ← Sum
First let’s get some boilerplate out of the way. We’ll need the list module for DCG-syntax terminals:
:- module peg_example.

:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.

:- implementation.

:- import_module list.
Now onto Value. Right off the bat Wikipedia throws us a curveball with the range expression [0-9]. Fortunately that’s easily translated as a specially-moded function:
:- mode in..in = in is semidet.
A..B = C :- A @=< C, C @=< B.
Even though (in fact, because) .. is a builtin operator, we can use it in definitions as well.

Armed with our new range operator, we can translate the Value rule as:
value -->
    +(pred(in, out) is semidet --> ['0'..'9']) -> [];
    (['('], expr, [')']).
Note that the lambda-predicate (pred(in, out) is semidet -->) is needed since ['0'..'9'] is not a predicate on its own. The translation is otherwise straightforward.

Product, Sum, and Expr follow similarly:
product -->
    value, *(pred(in, out) is semidet -->
             ((['*'] -> []; ['/']), value)).

sum -->
    product, *(pred(in, out) is semidet -->
               ((['+'] -> []; ['-']), value)).

expr --> sum.
Our main function will read a line, try to parse it using expr and make sure that only the newline at the end is left unparsed:
main(!IO) :-
    read_line(Res, !IO),
    (if Res = ok(Line) then
        (if expr(Line, ['\n'])
         then print("Success!\n", !IO)
         else print("Failure!\n", !IO)),
        main(!IO)
     else true).
Compile with mmc --infer-all --make peg_example, run it and try out some expressions, keeping in mind that our grammar doesn’t allow negatives, decimals, or whitespace:
$ ./peg_example
123+456
Success!
5/(10-20)
Success!
45-(40
Failure!
*12
Failure!
a+18
Failure!
Of course, a parser that merely tests whether input is well-formed is only mildly useful. Next time, we’ll look into creating parse trees, and the real-life example of parsing C code.

Download peg_example.m

4 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Hi,

    gr8 post.

    I think negation is one of the most underated and badly covered feature (in text books and papers) of DCG. I believe it has a lot of potential in parsing, especially when one takes the idea of "Parsing is Deduction" seriously.

    Best Regards

    ReplyDelete
  3. Hi Jan, thanks!

    I agree about negation; in particular I find myself wanting negation often when using regular expressions (fortunately some RE engines such as Tcl’s support negation as a “negative lookahead assertion”).

    ReplyDelete
  4. I don’t suppose many of websites give this kind of information.Grammarly

    ReplyDelete