Sunday, October 9, 2011

Parsing Expression Grammars part 2

Continuing last week’s post, I’d like to give an example of implementing a real-life PEG. What better example than the C language, specifically preprocessed C99? (Note: it may be helpful to open the linked PDF to Annex A, “Language syntax summary”.)

Recall that PEGs differ from the more commonly used CFGs in that they are completely deterministic, greedy, and don’t allow left recursion. This can make translating some grammars difficult or impossible if they rely on those features. Fortunately, C99’s grammar is straightforward enough that, with a little thought, we can work around these problems. Let’s see how.

The Lexer


First off, let’s create a submodule for the lexer like so:
:- module lexer.
:- interface.
From the lexer we will export two predicates, one which lexes single tokens (token), and one which lexes a sequence of whitespace (whitespace):
:- pred token(token::out, list(char)::in, list(char)::out) is semidet.
:- pred whitespace(list(char)::in, list(char)::out) is det.
Note: it’s important that we define the mode of the token argument of token as out rather than in, even though they are logically equivalent. This will allow us later to memoize token on one argument instead of two, greatly reducing our memory use at negligible (if any) processing overhead. This trick is possible precisely because PEGs are deterministic: the determinism of token with token as out will still be semidet, rather than nondet (and hence unmemoizable) as it would be with a CFG.

The C99 standard defines five kinds of lexical token our lexer will need to generate: keyword, identifier, constant, string-literal, and punctuator. A glance through the C99 grammar reveals that merely to parse the grammar, we do not need to know the specific values of constants and string-literals, so let’s define the type of a lexing token like so:
:- type token --->
    keyword(string);
    identifier(string);
    constant;
    string_literal;
    punctuator(string).
Of course, if we were constructing a parse tree, we would want to hold onto the values of constants and string-literals, but since for now we are just matching the grammar, it is OK to leave them out. On to the implementation!
:- implementation.

Whitespace, or, using built-in character classes


Let’s get whitespace out of the way first. The C99 definition of whitespace corresponds to Mercury’s, so we can import char and define whitespace as a greedy loop: consume one whitespace character and recurse, or else consume nothing:
whitespace --> ([W], {is_whitespace(W)}) -> whitespace; [].

Tokens, or, parameterizing rules


C99 defines token as follows:
token:
        keyword
        identifier
        constant
        string-literal
        punctuator
A straightforward translation of this to a PEG in Mercury would read:
token -->
    keyword -> [];
    identifier -> [];
    constant -> [];
    string_literal -> [];
    punctuator.
However, because this is an unordered choice in the spec but an ordered choice in Mercury, we need to apply a moment’s thought to confirm that choosing this ordering will not change the language. Fortunately this is easy to confirm: the latter four must each start with a different kind of character (letter, digit, quote, and non-quote symbol, respectively), and the first (keyword) should be matched in preference to the second (identifier) anyway.

The C99 spec assumes that whitespace is removed in a phase between lexing and parsing, and that adjacent string literals are concatenated. Because we will be attaching our lexer directly to our parser (such is the magic of PEGs), we will need to perform these two steps in our tokenizer. Finally, we will want to actually produce a lexing token as well (rather than just match it). The final implementation of token is thus:
token(Token) -->
    whitespace,
    (keyword(KW) -> {Token = keyword(KW)};
     identifier(Ident) -> {Token = identifier(Ident)};
     constant -> {Token = constant};
     (string_literal, (token(string_literal) -> []; [])) ->
        {Token = string_literal};
     punctuator(Punct), {Token = punctuator(Punct)}).
(Note how the string_literal branch implements concatenation: after matching one string literal, it optionally consumes the next token as well, but only if that token is itself a string literal.) Parameterizing rules like this can be very powerful, as we’ll see later in this post. However, let’s tackle each kind of token in turn.

Keywords, or, fixing choice using negative lookahead assertions


keyword looks deceptively simple; it is merely a list of valid C99 keywords. We might be tempted to implement it as such:
keyword(Keyword) --> % WRONG
    ['a', 'u', 't', 'o'] -> {Keyword = "auto"};
    ['b', 'r', 'e', 'a', 'k'] -> {Keyword = "break"};
    etc.
But the devil is in the details. We must ensure that no keyword matches a mere prefix of a longer keyword or identifier (e.g. we should not match the “do” in the keyword “double”, or the “for” in the identifier “format”). The latter can be fixed by using a negative lookahead assertion:
keyword(Keyword) -->
    (['a', 'u', 't', 'o'] -> {Keyword = "auto"};
     ['b', 'r', 'e', 'a', 'k'] -> {Keyword = "break"};
     etc.),
    not (identifier_nondigit(_) -> []; digit(_)).
In other words, a keyword must not be followed immediately by one of the characters that are valid in a keyword or identifier (defined later). Note that this won’t fix partial keyword matching: if the assertion fails, the parse as a keyword is rejected entirely. Either we should move the negative lookahead assertion into the body of each branch, or simply reorder the offending keywords from longest to shortest. Since “do” and “double” are the only offenders, it’s easy enough to swap them. (We’ll encounter a better example later.)
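Sketched with just the two affected branches (the rest of the list reads as before):
keyword(Keyword) -->
    (['d', 'o', 'u', 'b', 'l', 'e'] -> {Keyword = "double"};
     ['d', 'o'] -> {Keyword = "do"};
     etc.),
    not (identifier_nondigit(_) -> []; digit(_)).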

(Note: In the completed lexer linked below, I add a little extra machinery to allow listing keywords as strings in a list. There is nothing PEG-specific about this technique, and it saves me only keystrokes.)

Identifiers, or, transforming left-recursive productions


Identifiers take a recursive form commonly seen in the C99 spec:
identifier:
        identifier-nondigit
        identifier identifier-nondigit
        identifier digit
That is, an identifier is a certain initial production followed by a sequence of either of two other productions. A direct implementation is no good:
identifier --> % WRONG
    identifier_nondigit -> [];
    (identifier, identifier_nondigit) -> [];
    (identifier, digit).
This will either match a single identifier_nondigit, or, failing to do so, will get caught in a loop calling itself again having consumed no input. Not what we wanted. We need to split the initial call to identifier_nondigit out, and write a right-recursive rule for the rest of the identifier:
identifier --> identifier_nondigit, identifier_rec.

identifier_rec -->
    (identifier_nondigit, identifier_rec) -> [];
    (digit, identifier_rec) -> [];
    [].
Because identifier-nondigit and digit are disjoint, order is not important. This is the standard left-recursion elimination: match the non-recursive alternative once up front, then iterate over the recursive tails. This formulation is a proper PEG and will behave as expected.

Of course, we determined earlier that it was important to us to know what identifier was parsed. We can easily add expressions to our rules to do just this:
identifier(from_char_list(Ident): string) -->
    identifier_nondigit(C), identifier_rec(Ident0),
    {Ident = [C | Ident0]}.

identifier_rec(Ident) -->
    (identifier_nondigit(C), identifier_rec(Ident0)) ->
        {Ident = [C | Ident0]};
    (digit(D), identifier_rec(Ident0)) ->
        {Ident = [D | Ident0]};
    {Ident = []}.

identifier_nondigit(D) -->
    nondigit(D0) -> {D = D0};
    universal_character_name(D0), {D = D0}.
(The typecast to string in the head of identifier is necessary because from_char_list is overloaded and Mercury gets confused otherwise.)

universal-character-name, nondigit, and digit are straightforward and use the techniques described thus far, so I’m going to skip over them and move on to constants.
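As a sketch, the latter two might read like this (assuming the char module’s is_alnum_or_underscore and is_digit; universal-character-name is more involved and elided here):
nondigit(C) --> [C], {is_alnum_or_underscore(C), not is_digit(C)}.

digit(D) --> [D], {is_digit(D)}.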

Constants, or, fixing choice using order


C99 defines constants as:
constant:
       integer-constant
       floating-constant
       enumeration-constant
       character-constant
Now, the astute reader will notice that something will go wrong here: the leading digits of a floating-point literal form a valid integer literal in C99, so if we wrote this production directly as a PEG, it would always match the integer part of a floating-point literal as an integer. Again, not what we want.

We could use a negative lookahead assertion as above to fix this, but easier still is to change the order, prioritizing floats over integers:
constant -->
    floating_constant -> [];
    integer_constant -> []; 
    enumeration_constant -> [];
    character_constant.
Easy enough! Since enumeration constants must always start with a letter (which floats and integers do not), and character constants must always start with a single quote, we are all set here.

The rest of the lexer for constants and string literals uses these same techniques. As one example, a simplified string_literal might read as follows (a sketch only: it ignores the L prefix and treats any escaped character as valid):
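string_literal -->
    ['"'], s_chars, ['"'].

s_chars -->
    (['\\'], [_]) -> s_chars;       % escape sequence, simplified
    ([C], {C \= '"'}) -> s_chars;
    [].
With that, let’s move on to the last lexing token: punctuators.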

Punctuators, or, fixing choice by ordering by length


punctuator, like keyword, is merely a list of strings of symbols which are meaningful in C99. This list presents the same problem that the list of keywords presents: a shorter punctuator may match the prefix of what is in fact a longer punctuator. (For example, < would match only the first character of <<= in source text.)

For this production, we can use the solution of ordering the punctuators from longest to shortest. This works without much further thought. For instance, the < family comes out like this (a sketch; the other branches are elided):
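punctuator(Punct) -->
    ['<', '<', '='] -> {Punct = "<<="};
    ['<', '<'] -> {Punct = "<<"};
    ['<', '='] -> {Punct = "<="};
    ['<'] -> {Punct = "<"};
    etc.
(Again, I use some extra code to allow me to define this production as a list of strings in my example code.) And that about does it for the lexer.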
:- end_module lexer.

The Parser


The parser is, largely, trivial, using the same four techniques described above for the lexer (rewriting left recursion, plus the three choice-fixing techniques). Since the grammar makes lots of reference to individual kinds of tokens, I defined a few shortcuts:
kw(KW) --> token(keyword(KW)).
id(Ident) --> token(identifier(Ident)).
const --> token(constant).
str_lit --> token(string_literal).
p(Punct) --> token(punctuator(Punct)).
The grammar also references the enumeration-constant production, which, although not a token directly, is merely an identifier and we can treat it as such:
enumeration_const --> id(_).
Ensuring that all calls to the lexer are funneled through token will make our lives easier when we optimize later.

Now before I sign off, there are two gotchas in the grammar, for both of which C is infamous.

The “dangling else”


C’s if/then/[else] has the problem that, in a CFG, it is ambiguous. Does if (1) if (0) puts("hi"); else puts("bye"); print bye or nothing? The C99 standard indicates that an else should be paired with the nearest if it legally can be (so the above code prints bye).

Fortunately, with PEGs this ambiguity is trivial to resolve. We need simply write the if/then/else case first in the grammar. Any else will then be matched with the innermost (and thus nearest) if, whenever possible:
selection_statement -->
    (kw("if"), p("("), expression, p(")"), statement,
     kw("else"), statement) -> [];
    (kw("if"), p("("), expression, p(")"), statement) -> [];
    etc.

Typedefs


For the unfamiliar, a typedef in C is a means of defining an alias for an existing type specification. There’s nothing odd about that, except that type names in C act syntactically like keywords, so the set of typedefed names in scope determines the parse. In other words, C is not context-free.

Even though PEGs can directly represent some non-context-free grammars, C’s grammar is not one of them. Fortunately the problem is easily solved by parameterizing the grammar and passing around a set of “typedefed” identifiers:
:- import_module bool, set.
:- type typedefs == set(string).
Typedefs are created in declarations, if one of the declaration-specifiers is a storage-class-specifier which is the keyword typedef:
declaration(!Typedefs) -->
    declaration_specifiers(!.Typedefs, HasTypedef),
    (init_declarator_list(Decls0) ->
        {Decls = Decls0}; {Decls = init}),
    p(";"), {if HasTypedef = yes then union(Decls, !Typedefs) else true}.

declaration_specifiers(Typedefs, HasTypedef) -->
    (storage_class_specifier(IsTypedef),
     declaration_specifiers_opt(Typedefs, HasTypedef0)) ->
        {HasTypedef: bool = IsTypedef `or` HasTypedef0};
    (type_specifier,
     declaration_specifiers_opt(Typedefs, HasTypedef0)) ->
        {HasTypedef = HasTypedef0};
    etc.

declaration_specifiers_opt(Typedefs, HasTypedef) -->
    declaration_specifiers(Typedefs, HasTypedef0) ->
        {HasTypedef = HasTypedef0};
    {HasTypedef = no}.

storage_class_specifier(IsTypedef) -->
    kw("typedef") -> {IsTypedef = yes};
    kw("extern")  -> {IsTypedef = no}; 
    kw("static")  -> {IsTypedef = no};
    kw("auto")    -> {IsTypedef = no};
    kw("register"),  {IsTypedef = no}.
(Note the use of state variable syntax, and the factored-out declaration_specifiers_opt, which make the code passably legible.) Typedef declarations may legally exist in block-items, scoped within compound-statements:
compound_statement(Typedefs) -->
    p("{"), (block_item_list(Typedefs, _) -> []; []), p("}").

block_item_list(!Typedefs) -->
    (block_item(!Typedefs), block_item_list(!Typedefs)) -> [];
    block_item(!Typedefs).

block_item(!Typedefs) --> declaration(!Typedefs) -> []; statement.
and as external-declarations:
translation_unit(!Typedefs) -->
    (external_declaration(!Typedefs), translation_unit(!Typedefs)) -> [];
    external_declaration(!Typedefs).

external_declaration(!Typedefs) -->
    function_definition -> [];
    declaration(!Typedefs).
Finally, typedef-name itself references the set of typedefs:
typedef_name(Typedefs) --> id(Ident), {set.member(Ident, Typedefs)}.
The rest of the implementation of typedefs involves parameterizing most productions to pass the set of typedefs from external_declaration down to typedef_name, and parameterizing the few productions in which typedefs are illegal (for example, function signatures) to simply drop the modified set of typedefs. Nothing brain-bending; just a lot of typing. For a taste, a parameterized type-specifier might look like this sketch (the branch list is abridged):
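type_specifier(Typedefs) -->
    kw("void") -> [];
    kw("char") -> [];
    kw("int") -> [];
    % ...the remaining built-in type keywords elided...
    struct_or_union_specifier(Typedefs) -> [];
    enum_specifier(Typedefs) -> [];
    typedef_name(Typedefs).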

Summary & Test


In today’s post we saw how to translate a real language grammar, that of C99, into a Mercury program. We learned how to transform left-recursive rules into right-recursive ones; how to ensure proper production choice using negative lookahead assertions, ordering by precedence, and ordering by length; and how to implement the non-context-free elements of C99.

Let’s add a main function similar to last week’s, compile, and try out our spiffy new parser. The main is just glue; a sketch like the following should do (it assumes the parser exports translation_unit, the lexer exports whitespace, and that io, list, set, and string are imported):
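main(!IO) :-
    io.read_file_as_string(Result, !IO),
    (
        Result = ok(Input),
        ( if
            translation_unit(set.init, _Typedefs,
                to_char_list(Input), Rest),
            whitespace(Rest, [])    % allow only trailing whitespace
          then
            io.write_string("Success!\n", !IO)
          else
            io.write_string("Parse failed.\n", !IO)
        )
    ;
        Result = error(_Partial, Error),
        io.write_string(io.error_message(Error) ++ "\n", !IO)
    ).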
$ ./c_parser_direct
int main(void) {}
…and wait, and wait, and wait, and… what’s going on? Why are we getting no response? Do we have an infinite loop?

Nope. Add this line to the lexer module (fast_loose memoization compares arguments by address rather than by value; it may miss some duplicate calls, but it is fast and never gives wrong answers):
:- pragma memo(token/3, [fast_loose]).
$ ./c_parser_direct
int main(void) {}
(wait for it…)
Success!
Success indeed, but it’s still slow. More on memoization and how to speed this up even more in part 3.

Download c_parser_direct.m
