Recall that PEGs differ from more commonly used CFGs in that they are completely deterministic, greedy, and don’t allow left recursion. This can make translating some grammars difficult or impossible if they rely on those features. Fortunately, C99’s grammar is straightforward enough that, with a little thought, we can work around these problems. Let’s see how.
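To make the greediness concrete, here is a tiny hypothetical rule in the Mercury DCG style we’ll use below (not part of the C99 grammar). Because a PEG repetition consumes as much input as it can and never backtracks into a committed choice, anything that needs one of those characters afterward can never succeed:

as --> (['a'], as) -> []; [].  % greedily consume a run of 'a's
never_matches --> as, ['a'].   % can never succeed: as has already eaten every 'a'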
The Lexer
First off, let’s create a submodule for the lexer like so:
:- module lexer.
:- interface.
From the lexer we will export two predicates: one which lexes single tokens (token), and one which lexes a sequence of whitespace (whitespace):

:- pred token(token::out, list(char)::in, list(char)::out) is semidet.
:- pred whitespace(list(char)::in, list(char)::out) is det.
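(A reminder of how these declarations line up with the DCG syntax below: a clause written with --> implicitly threads two extra list(char) arguments, the input before and after the rule. Roughly speaking,

token(Token) --> whitespace, etc.

expands to

token(Token, Chars0, Chars) :- whitespace(Chars0, Chars1), etc.

which is where the second and third arguments above come from.)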
Note: it’s important that we define the mode of the token argument of token as out rather than in, even though they are logically equivalent. This will allow us later to memoize token on one argument instead of two, greatly reducing our memory use at negligible (if any) processing overhead. This trick is possible precisely because PEGs are deterministic: the determinism of token with token as out will still be semidet, rather than nondet (and hence unmemoizable) as it would be with a CFG.

The C99 standard defines 5 kinds of lexical tokens our lexer will need to generate: keyword, identifier, constant, string-literal, and punctuator. A glance through the C99 grammar reveals that merely to parse the grammar, we do not need to know the specific values of constants and string-literals, so let’s define the type of a lexing token like so:
:- type token --->
keyword(string);
identifier(string);
constant;
string_literal;
punctuator(string).
Of course, if we were constructing a parse tree, we would want to hold onto the values of constants and string-literals, but for now, since we are just matching the grammar, it is OK to leave them out. On to the implementation!

:- implementation.
Whitespace, or, using built-in character classes
Let’s get whitespace out of the way. The C99 definition of whitespace corresponds to Mercury’s definition of whitespace, so we can import char and define whitespace like so:

whitespace -->
    ([W], {is_whitespace(W)}) -> whitespace;
    [].
Tokens, or, parameterizing rules
C99 defines token as follows:

token:
    keyword
    identifier
    constant
    string-literal
    punctuator

A straightforward translation of this to a PEG in Mercury would read:
token -->
keyword -> [];
identifier -> [];
constant -> [];
string_literal -> [];
punctuator.
However, because this is an unordered choice in the spec but an ordered choice in Mercury, we need to take a moment to confirm that choosing this ordering will not change the language. Fortunately this is easy to do: the latter four each must start with a different character (letter, digit, quote, and non-quote symbol, respectively), and the first (keyword) should be matched in favor of the second (identifier) anyway.

The C99 spec assumes that whitespace is removed in some phase in-between lexing and parsing, and that adjacent string literals are concatenated. Because we will be attaching our lexer directly to our parser (such is the magic of PEGs), we will need to perform these two steps in our tokenizer. Finally, we will want to actually produce a lexing token as well (rather than just match it). The final implementation of token is thus:

token(Token) -->
whitespace,
(keyword(KW) -> {Token = keyword(KW)};
identifier(Ident) -> {Token = identifier(Ident)};
constant -> {Token = constant};
(string_literal, (token(string_literal) -> []; [])) ->
{Token = string_literal};
punctuator(Punct), {Token = punctuator(Punct)}).
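To see the lexer in action, here is a minimal hypothetical harness (assuming we import io and string for the I/O and for to_char_list; this is illustration only, not part of the module):

:- import_module io, string.

main(!IO) :-
    ( if token(Token, string.to_char_list("int x"), _Rest) then
        io.write(Token, !IO),   % prints keyword("int")
        io.nl(!IO)
    else
        io.write_string("no token\n", !IO)
    ).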
Parameterizing rules like this can be very powerful, as we’ll see later in this post. But first, let’s tackle each kind of token in turn.

Keywords, or, fixing choice using negative lookahead assertions
keyword looks deceptively simple; it is merely a list of valid C99 keywords. We could be tempted to implement it as such:
keyword(Keyword) --> % WRONG
['a', 'u', 't', 'o'] -> {Keyword = "auto"};
['b', 'r', 'e', 'a', 'k'] -> {Keyword = "break"};
etc.
But the devil is in the details. We must ensure that no keyword is matched against the prefix of another keyword or identifier (e.g. we should not match the “do” in the keyword “double”, or the “for” in the identifier “format”). The latter can be fixed by using a negative lookahead assertion:

keyword(Keyword) -->
    (['a', 'u', 't', 'o'] -> {Keyword = "auto"};
     ['b', 'r', 'e', 'a', 'k'] -> {Keyword = "break"};
     etc.),
    not (identifier_nondigit(_) -> []; digit(_)).

In other words, a keyword must not be followed immediately by one of the characters which are valid in a keyword or identifier (defined later). Note that this won’t fix partial keyword matching, since if the assertion fails, the parse as a keyword will be rejected entirely. Either we should move the negative lookahead assertion into the bodies of the branches, or simply rearrange the offending keywords from longest to shortest. Since “do” and “double” are the only offenders, it’s easy enough to swap them. (We’ll encounter a better example later.)
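(Had there been many offenders, reordering is mechanical. A hypothetical excerpt with the swap applied — “double” now shadows “do” rather than the other way around:)

keyword(Keyword) -->
    (etc.;
     ['d', 'o', 'u', 'b', 'l', 'e'] -> {Keyword = "double"};
     ['d', 'o'] -> {Keyword = "do"};
     etc.),
    not (identifier_nondigit(_) -> []; digit(_)).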
(Note: In the completed lexer linked below, I add a little extra machinery to allow listing keywords as strings in a list. There is nothing PEG-specific about this technique, and it saves me only keystrokes.)
Identifiers, or, transforming left-recursive productions
Identifiers take a recursive form commonly seen in the C99 spec:

identifier:
    identifier-nondigit
    identifier identifier-nondigit
    identifier digit

That is, an identifier is a certain initial production followed by a sequence of either of two other productions. A direct implementation is no good:
identifier --> % WRONG
    identifier_nondigit -> [];
    (identifier, identifier_nondigit) -> [];
    (identifier, digit).
This will either match a single identifier_nondigit, or, failing to do so, will get caught in a loop calling itself again having consumed no input. Not what we wanted. We need to split the initial call to identifier_nondigit out, and write a right-recursive rule for the rest of the identifier:

identifier --> identifier_nondigit, identifier_rec.
identifier_rec -->
(identifier_nondigit, identifier_rec) -> [];
(digit, identifier_rec) -> [];
[].
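(This transformation is mechanical: any left-recursive production of the form A: B | A C becomes a call to the non-recursive alternative followed by a right-recursive loop. A hypothetical schema, assuming rules b and c:)

a --> b, a_rec.
a_rec -->
    (c, a_rec) -> [];
    [].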
Because identifier-nondigit and digit are disjoint, order is not important. This formulation is a proper PEG and will behave as expected.

Of course, we determined earlier that it was important to us to know what identifier was parsed. We can easily add expressions to our rules to do just this:
identifier(from_char_list(Ident) : string) -->
identifier_nondigit(C), identifier_rec(Ident0),
{Ident = [C | Ident0]}.
identifier_rec(Ident) -->
(identifier_nondigit(C), identifier_rec(Ident0)) ->
{Ident = [C | Ident0]};
(digit(D), identifier_rec(Ident0)) ->
{Ident = [D | Ident0]};
{Ident = []}.
identifier_nondigit(D) -->
nondigit(D0) -> {D = D0};
universal_character_name(D0), {D = D0}.
(The typecast to string in the head of identifier is necessary because from_char_list is overloaded and Mercury gets confused otherwise.)

identifier-nondigit, nondigit, and digit are straightforward and use the techniques described thus far. I’m going to skip over them and move on to constants.
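(For the curious, minimal sketches of the two simplest of these, assuming the character-class tests from Mercury’s char module; the real definitions are in the download below, and universal-character-name takes more work:)

digit(D) --> [D], {is_digit(D)}.
nondigit(C) --> [C], {is_alpha_or_underscore(C)}.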
Constants, or, fixing choice using order
C99 defines constants as:

constant:
    integer-constant
    floating-constant
    enumeration-constant
    character-constant

Now, the astute reader will notice that something will go wrong here: the integer part of a floating-point literal is itself a valid integer literal in C99, so if we wrote this production directly as a PEG, it would always match the integer part of a floating-point literal as an integer. Again, not what we want.
We could use a negative lookahead assertion as above to fix this, but easier still is to change the order, prioritizing floats above integers:
constant -->
floating_constant -> [];
integer_constant -> [];
enumeration_constant -> [];
character_constant.
Easy enough! Since enumeration constants must always start with a letter (which floats and integers do not), and character constants must always start with a single quote, we are all set here.

The rest of the lexer for constants and string literals uses these same techniques, so let’s move on to the last lexing token, punctuators.
Punctuators, or, fixing choice by ordering by length
punctuator, like keyword, is merely a list of strings of symbols which are meaningful in C99. This list presents the same problem that the list of keywords presents. Namely, that a shorter punctuator may match the prefix of what is in fact a longer punctuator. (For example, < may match the symbol <<= in source text.)
For this production, we can use the solution of ordering the punctuators from longest to shortest. This works without much further thought. (Again, I use some extra code to allow me to define this production as a list of strings in my example code.)
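A hypothetical excerpt of the left-shift family shows the shape, longest first:

punctuator(Punct) -->
    ['<', '<', '='] -> {Punct = "<<="};
    ['<', '<'] -> {Punct = "<<"};
    ['<', '='] -> {Punct = "<="};
    ['<'] -> {Punct = "<"};
    etc.

And that about does it for the lexer.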
:- end_module lexer.
The Parser
The parser is, largely, trivial, using the same four techniques described above for the lexer (rewriting left-recursion and the three choice-fixing techniques). Since the grammar makes lots of reference to individual kinds of tokens, I defined a few shortcuts:
kw(KW) --> token(keyword(KW)).
id(Ident) --> token(identifier(Ident)).
const --> token(constant).
str_lit --> token(string_literal).
p(Punct) --> token(punctuator(Punct)).

The grammar also references the enumeration-constant production which, although not a token directly, is merely an identifier, and we can treat it as such:
enumeration_const --> id(_).
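With these helpers, productions read almost like the spec. For example, a hypothetical rendering of C99’s jump-statement (assuming an expression rule defined elsewhere):

jump_statement -->
    (kw("goto"), id(_), p(";")) -> [];
    (kw("continue"), p(";")) -> [];
    (kw("break"), p(";")) -> [];
    (kw("return"), (expression -> []; []), p(";")).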
Ensuring that all calls to the lexer are funneled through token will make our lives easier when we optimize later.

Now, before I sign off, there are two gotchas in the grammar, for both of which C is infamous.
The “dangling else”
C’s if/then/[else] has the problem that in a CFG, it is ambiguous. Does
if (1) if (0) puts("hi"); else puts("bye");
print bye or nothing? The C99 standard indicates that an else should match the nearest if it legally can (so the above code would print bye).

Fortunately with PEGs, this ambiguity is trivial to solve. We need simply write the if/then/else case first in the grammar. Any else will be matched by the innermost (and thus nearest) if possible:

selection_statement -->
(kw("if"), p("("), expression, p(")"), statement,
kw("else"), statement) -> [];
(kw("if"), p("("), expression, p(")"), statement) -> [];
etc.
Typedefs
For the unfamiliar, a typedef declaration introduces a synonym for a type, and whether an identifier names a type changes how later code must be parsed: foo * bar; declares bar as a pointer to foo if foo has been typedefed, but multiplies foo by bar otherwise. This makes C’s grammar context-sensitive. Even though PEGs can directly represent some non-context-free grammars, C is not one of them. Fortunately the problem is easily solved by parameterizing the grammar and passing around a set of “typedefed” identifiers:
:- import_module bool, set.
:- type typedefs == set(string).
Typedefs are created in declarations, if one of the declaration-specifiers is a storage-class-specifier which is the keyword typedef:

declaration(!Typedefs) -->
    declaration_specifiers(!.Typedefs, HasTypedef),
    (init_declarator_list(Decls0) -> {Decls = Decls0}; {Decls = init}),
    p(";"),
    {if HasTypedef = yes then union(Decls, !Typedefs) else true}.

declaration_specifiers(Typedefs, HasTypedef) -->
    (storage_class_specifier(IsTypedef),
     declaration_specifiers_opt(Typedefs, HasTypedef0)) ->
        {HasTypedef: bool = IsTypedef `or` HasTypedef0};
    (type_specifier, declaration_specifiers_opt(Typedefs, HasTypedef0)) ->
        {HasTypedef = HasTypedef0};
    etc.

declaration_specifiers_opt(Typedefs, HasTypedef) -->
    declaration_specifiers(Typedefs, HasTypedef0) -> {HasTypedef = HasTypedef0};
    {HasTypedef = no}.

storage_class_specifier(IsTypedef) -->
    kw("typedef") -> {IsTypedef = yes};
    kw("extern") -> {IsTypedef = no};
    kw("static") -> {IsTypedef = no};
    kw("auto") -> {IsTypedef = no};
    kw("register"), {IsTypedef = no}.

(Note the use of state variable syntax, and the factored-out declaration_specifiers_opt, which makes the code passably legible.)

Typedef declarations may legally exist in block-items, scoped within compound-statements:

compound_statement(Typedefs) -->
p("{"), (block_item_list(Typedefs, _) -> []; []), p("}").
block_item_list(!Typedefs) -->
(block_item(!Typedefs), block_item_list(!Typedefs)) -> [];
block_item(!Typedefs).
block_item(!Typedefs) -->
    declaration(!Typedefs) -> [];
    statement.
and as external-declarations:

translation_unit(!Typedefs) -->
(external_declaration(!Typedefs), translation_unit(!Typedefs)) -> [];
external_declaration(!Typedefs).
external_declaration(!Typedefs) -->
function_definition -> [];
declaration(!Typedefs).
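(A side note on the state variable syntax used above: !Typedefs stands for a threaded pair of arguments, with !.Typedefs reading the current set and !:Typedefs binding the next. As a rough, hypothetical expansion, block_item could have been written with the threading spelled out:)

block_item(Typedefs0, Typedefs) -->
    declaration(Typedefs0, Typedefs1) -> {Typedefs = Typedefs1};
    statement, {Typedefs = Typedefs0}.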
Finally, typedef-name itself references the set of typedefs:

typedef_name(Typedefs) --> id(Ident), {set.member(Ident, Typedefs)}.
The rest of the implementation of typedefs involves parameterizing most productions to pass the set of typedefs from external_declaration down to typedef_name, and parameterizing the few illegal uses of typedefs (for example, in function signatures) to simply drop the modified set of typedefs. Nothing brain-bending; just a lot of typing.

Summary & Test
In today’s post we saw how to translate a real language grammar, that of C99, into a Mercury program. We learned how to transform left-recursive rules into right-recursive ones, how to ensure proper production choice by using negative lookahead assertions, ordering by precedence, and ordering by length, and how to implement the non-context-free elements of C99.
Let’s add a main function similar to yesterday’s, compile, and try out our spiffy new parser:
$ ./c_parser_direct
int main(void) {}

…and wait, and wait, and wait, and… what’s going on? Why are we getting no response? Do we have an infinite loop?
Nope. Add this line to the lexer module:

:- pragma memo(token/3, [fast_loose]).
$ ./c_parser_direct
int main(void) {}
(wait for it…)

Success!

Success indeed, but it’s still slow. More on memoization and how to speed this up even more in part 3.
Download c_parser_direct.m