Monday, October 17, 2011

Parsing Expression Grammars part 3 (final)

So, last week, we built a C99 parser. A slow C99 parser. At the end we added one line of code that sped things up a bunch. Let’s look at that.

Why recursive descent parsing is slow

Right now our PEG parser is what’s known as a recursive descent parser – it recursively descends down the parse tree looking for matches. Unfortunately, recursive descent is very slow for two reasons:
  1. One grammar rule may try to parse a given span of input the same way multiple times.
  2. Multiple grammar rules may try to parse a given span of input the same way multiple times.
An example of the former is in the selection_statement production (Typedefs omitted for clarity):
selection_statement -->
    (kw("if"), p("("), expression, p(")"), statement, kw("else"), statement) -> [];
    (kw("if"), p("("), expression, p(")"), statement) -> [];
    (kw("switch"), p("("), expression, p(")"), statement).
The string “if (x > 0) x--; return;” will be parsed once by the first branch, which fails when it finds “return” instead of “else”, and will then be reparsed by the second branch, which will succeed. This sort of prefix-repetition can be eliminated by factoring out the repeated portion:
selection_statement -->
    (kw("if"), p("("), expression, p(")"), statement,
     ((kw("else"), statement) -> []; [])) -> [];
    (kw("switch"), p("("), expression, p(")"), statement).
Of course, this means that our code is no longer structurally identical to the grammar we are parsing, but this may be a welcome tradeoff.

An example of the latter, which is more pernicious, is given by, well, almost every production. Why? Consider statement:
statement -->
    labeled_statement -> [];
    compound_statement -> [];
    expression_statement -> [];
    selection_statement -> []; 
    iteration_statement -> []; 
    jump_statement.
Pretty innocuous, right? Take a look at each of the productions it invokes. Almost all of them start with a call to kw, which in turn calls token. In fact, all of them start with a call, directly or indirectly, to token. In fact, this is true of every production in the parser. token is not the only culprit but it is the most pervasive.

This problem can’t be solved through a simple restructuring of the grammar. For this we need a more powerful tool: memoization.

Memoization

Memoization is one of my favorite features. Simply put, memoization of a procedure means maintaining a table of its past input and output values, and looking up output values in this table when the procedure is called in the future with the same input values. Of course, this only works if the procedure has no side effects; fortunately this is true of most Mercury code. Mercury provides built-in support for memoization via the memo pragma. Basic usage looks like this:
:- pragma memo(Name/Arity).
By default, Mercury memoizes inputs by value. This will catch all repeated calls to a procedure, but can be slow for large structures. We can specify that we would like Mercury to memoize inputs by address instead:
:- pragma memo(Name/Arity, [fast_loose]).
Memoizing by address is much faster but may miss calls with identical values at different addresses. However, because our primary data structure is a list of characters which is very large (meaning memoize-by-value will be slow) and which we are only destructing (meaning identical sublists of this list must be physically the same), we will memoize by address.
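As a standalone illustration of the pragma (a minimal sketch, assuming int is imported; fib is the usual doubly recursive example, not part of our parser):
:- func fib(int) = int.
:- pragma memo(fib/1).    % table past results: the exponential recursion runs in linear time
fib(N) = (if N =< 1 then N else fib(N - 1) + fib(N - 2)).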

Profiling

Now, although we could memoize every procedure in our lexer and parser, memoization does have overhead, and so we should not use it on procedures which are not called often or with repeated inputs. We should limit our scope to grammar rules which are invoked by many different rules, or multiple times by one rule, since these are likely to be invoking the rule repeatedly on the same input.

To find such candidates for memoization, we can use runtime code profiling. To generate profiling data, we can simply compile with the -p switch and then run our program with our test from last week (minus the memoization of token we added at the end):
$ mmc -p --infer-all --make c_parser_direct
$ echo 'int main(void) {}' | ./c_parser_direct
Let that churn for a few seconds, and then kill it with Ctrl+C. Now run mprof to find out what’s been taking so long:
   %  cumulative    self              self    total                                                            
 time    seconds  seconds    calls  ms/call  ms/call name                                                      
 68.5       8.70     8.70 17602295     0.00     0.00 <function `string.to_char_list/1' mode 0> [1]
  7.1       9.60     0.90 17602483     0.00     0.00 <predicate `builtin.unify/2' mode 0> [2]     
  6.3      10.40     0.80   202300     0.00     0.00 <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [3]
  5.9      11.15     0.75   202388     0.00     0.00 <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [4]   
  4.7      11.75     0.60 10114349     0.00     0.00 <predicate `c_parser_aim.lexer.punctuator_aux/3' mode 0> [5]
  4.3      12.30     0.55  7487945     0.00     0.00 <predicate `c_parser_aim.lexer.keyword_aux/3' mode 0> [6]
Over 95% of that run was spent just checking for keywords and punctuators! In particular, the biggest culprit, string.to_char_list, is highly suspect. To find out who’s calling it, we can run mprof -c [note: sometimes I have had to run mprof -cd; I can’t reproduce this now so I’m not sure why] and search for the callers of string.to_char_list:
                3.70        3.70 7487945/17602295    <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [24]
                5.00        5.00 10114350/17602295    <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [22]
[21]    68.5    8.70        8.70 17602295         <function `string.to_char_list/1' mode 0> [21]
The [21] in the left-most column indicates that this section is dedicated to procedure #21, string.to_char_list, which was called 17,602,295 times. The two lines above it indicate that both lexer.keyword_rec and lexer.punctuator_rec call string.to_char_list, 7,487,945 and 10,114,350 times respectively. Of course: we use string.to_char_list to convert our keywords and punctuators to character lists for these two procedures to match against the input data. But clearly we don’t have 7,487,945 keywords or 10,114,350 punctuators, so string.to_char_list must be getting called with the same arguments repeatedly. Let’s fix that:
:- func to_char_list_memo(string) = list(char).
:- pragma memo(to_char_list_memo/1, [fast_loose]).
to_char_list_memo(S) = to_char_list(S).
Unfortunately we can’t instruct Mercury to memoize procedures in other modules, so we must define to_char_list_memo as above. We can then replace the two calls to to_char_list in lexer.keyword_rec and lexer.punctuator_rec with the memoized version. Compiling and reprofiling tells us:
   %  cumulative    self              self    total                                                             
 time    seconds  seconds    calls  ms/call  ms/call name                                                       
 45.2       1.65     1.65 10400248     0.00     0.00 <function `c_parser_aim.to_char_list_memo/1' mode 0> [22]
 12.3       2.10     0.45 10400437     0.00     0.00 <predicate `builtin.unify/2' mode 0> [29]                
 12.3       2.55     0.45  5975244     0.00     0.00 <predicate `c_parser_aim.lexer.punctuator_aux/3' mode 0> [27]
 12.3       3.00     0.45   119606     0.00     0.01 <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [24]   
  8.2       3.30     0.30  4425004     0.00     0.00 <predicate `c_parser_aim.lexer.keyword_aux/3' mode 0> [28]
  4.1       3.45     0.15   119517     0.00     0.02 <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [21]

Tracking down token

We’ve reduced the percentage of total runtime consumed by to_char_list, but it’s still getting called an awful lot. We know keyword_rec and punctuator_rec don’t call it any more often than they need to, so we need to find out who’s calling them. Let’s follow keyword_rec:
                0.45        1.64  119606/119606      <predicate `c_parser_aim.lexer.keyword/3' mode 0> [23]
[24]    45.0    0.45        1.64  119606         <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [24]
                0.70        0.70 4425004/10400248    <function `c_parser_aim.to_char_list_memo/1' mode 0> [22]
                0.30        0.49 4425004/4425004     <predicate `c_parser_aim.lexer.keyword_aux/3' mode 0> [28]
keyword is the only caller, no surprise there. Let’s go up one more level:
                0.00        1.64  119606/119606      <predicate `c_parser_aim.lexer.token/3' mode 0> [1]       
[23]    45.0    0.00        1.64  119606         <predicate `c_parser_aim.lexer.keyword/3' mode 0> [23]
                0.00        0.00      57/239246      <predicate `c_parser_aim.lexer.identifier_nondigit/3' mode 0> [67]
                0.45        1.64  119606/119606      <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [24]        
                0.00        0.00      57/836707      <predicate `char.is_digit/1' mode 0> [59]
Again, no surprise – keyword is called only by token.
                0.00        0.21    7009/119606      <predicate `c_parser_aim.cast_expression/3' mode 0> [8]
                0.00        0.00       2/119606      <predicate `c_parser_aim.compound_statement/3' mode 0>  <cycle 3> [55]
                0.00        0.00     117/119606      <predicate `c_parser_aim.declaration_specifiers/4' mode 0>  <cycle 4> [35]
                0.00        0.00       4/119606      <predicate `c_parser_aim.direct_abstract_declarator/3' mode 0> [52]             
                0.00        0.00       9/119606      <predicate `c_parser_aim.direct_declarator/4' mode 0>  <cycle 2> [50]
                0.00        0.00       4/119606      <predicate `c_parser_aim.direct_declarator_rec/3' mode 0>  <cycle 2> [47]
                0.00        0.00      36/119606      <predicate `c_parser_aim.enum_specifier/3' mode 0> [42]                        
                0.01        0.43   14042/119606      <predicate `c_parser_aim.kw/3' mode 0> [30]            
                0.00        0.00       3/119606      <predicate `c_parser_aim.labeled_statement/3' mode 0> [53]
                0.00        0.00       8/119606      <predicate `c_parser_aim.p/3' mode 0> [51]                
                0.00        0.00       2/119606      <predicate `c_parser_aim.parameter_list/3' mode 0>  <cycle 2> [54]
                0.00        0.00       1/119606      <predicate `c_parser_aim.parameter_type_list/3' mode 0>  <cycle 2> [56]
                0.00        0.00      26/119606      <predicate `c_parser_aim.pointer/2' mode 0> [45]                             
                0.01        0.43   14019/119606      <predicate `c_parser_aim.postfix_expression/3' mode 0> [25]
                0.01        0.86   28036/119606      <predicate `c_parser_aim.primary_expression/3' mode 0> [26]
                0.00        0.00      48/119606      <predicate `c_parser_aim.struct_or_union/2' mode 0> [41]   
                0.00        0.00      36/119606      <predicate `c_parser_aim.type_qualifier/2' mode 0> [43] 
                0.00        0.00     120/119606      <predicate `c_parser_aim.type_specifier/3' mode 0> [37]
                0.00        0.00      12/119606      <predicate `c_parser_aim.typedef_name/3' mode 0> [48]  
                0.02        1.71   56072/119606      <predicate `c_parser_aim.unary_expression/3' mode 0> [20]
[1]    100.0    0.05        3.65  119606         <predicate `c_parser_aim.lexer.token/3' mode 0> [1]
                0.00        0.00  119517/119517      <function `c_parser_aim.lexer.punctuator_list/0' mode 0> [66]
                0.00        0.00      31/31          <function `string.from_char_list/1' mode 0> [60]             
                0.00        0.10  119517/119517      <predicate `c_parser_aim.lexer.constant/2' mode 0> [32]
                0.00        0.00  119548/239246      <predicate `c_parser_aim.lexer.identifier_nondigit/3' mode 0> [67]
                0.00        0.00      31/31          <predicate `c_parser_aim.lexer.identifier_rec/3' mode 0> [71]     
                0.00        1.64  119606/119606      <predicate `c_parser_aim.lexer.keyword/3' mode 0> [23]       
                0.15        1.81  119517/119517      <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [21]
                0.00        0.05  119606/119606      <predicate `c_parser_aim.lexer.whitespace/2' mode 0> [34]
And there’s the culprit. token is called by many other procedures, and as discussed above, almost certainly with repeated inputs. Let’s memoize token.
:- pragma memo(token/3, [fast_loose]).
Now (as we found out last week), parsing our simple test completes in a reasonable amount of time. Let’s give our parser something bigger to chew on.

Factoring out multiple calls from multiplicative_expression

For stress testing, I used a 6kb program from Glen McCluskey’s C99 test suite, tdes_047.c. (It is available in the set of free samples if you wish to try it yourself.) We can use this (or any C99 code) as follows:
$ cpp -P c9_samp/tdes_047.c | ./c_parser_direct
And of course, we are met with no end to the parsing. mprof still tells us that token is responsible for the most CPU usage, but since we’ve already memoized it, let’s work our way up the chain some more. Who is responsible for most of the calls to token?
                0.23        0.23 1358257/95678933    <predicate `c_parser_aim.additive_expression/3' mode 0>  <cycle 2> [19]
                0.00        0.00    7546/95678933    <predicate `c_parser_aim.and_expression/3' mode 0>  <cycle 2> [24]     
                0.00        0.00    2370/95678933    <predicate `c_parser_aim.assignment_operator/2' mode 0> [29]            
                1.36        1.36 8218754/95678933    <predicate `c_parser_aim.cast_expression/3' mode 0>  <cycle 2> [12]
                0.00        0.00       7/95678933    <predicate `c_parser_aim.compound_statement/3' mode 0> [14]              
                0.00        0.00     236/95678933    <predicate `c_parser_aim.conditional_expression/3' mode 0>  <cycle 2> [50]
                0.00        0.00      14/95678933    <predicate `c_parser_aim.declaration/4' mode 0> [5]                             
                0.00        0.00     457/95678933    <predicate `c_parser_aim.declaration_specifiers/4' mode 0>  <cycle 2> [43]
                0.00        0.00     392/95678933    <predicate `c_parser_aim.designator/3' mode 0> [48]                             
                0.00        0.00      95/95678933    <predicate `c_parser_aim.direct_abstract_declarator/3' mode 0> [53]
                0.00        0.00     368/95678933    <predicate `c_parser_aim.direct_declarator/4' mode 0>  <cycle 2> [49]
                0.00        0.00     890/95678933    <predicate `c_parser_aim.direct_declarator_rec/3' mode 0>  <cycle 2> [27]
                0.00        0.00     999/95678933    <predicate `c_parser_aim.enum_specifier/3' mode 0> [38]                        
                0.01        0.01   30183/95678933    <predicate `c_parser_aim.equality_expression/3' mode 0>  <cycle 2> [23]
                0.00        0.00    3773/95678933    <predicate `c_parser_aim.exclusive_or_expression/3' mode 0>  <cycle 2> [26]
                0.00        0.00       2/95678933    <predicate `c_parser_aim.identifier_list/2' mode 0> [57]                         
                0.00        0.00    1886/95678933    <predicate `c_parser_aim.inclusive_or_expression/3' mode 0>  <cycle 2> [31]
                0.00        0.00      17/95678933    <predicate `c_parser_aim.init_declarator/4' mode 0> [8]                          
                0.00        0.00       8/95678933    <predicate `c_parser_aim.init_declarator_list/4' mode 0> [7]
                0.00        0.00       2/95678933    <predicate `c_parser_aim.initializer/3' mode 0>  <cycle 2> [56]
                0.00        0.00      48/95678933    <predicate `c_parser_aim.initializer_list/3' mode 0>  <cycle 2> [45]
                0.02        0.02  139104/95678933    <predicate `c_parser_aim.kw/3' mode 0> [22]                               
                0.00        0.00     943/95678933    <predicate `c_parser_aim.logical_and_expression/3' mode 0>  <cycle 2> [39]
                0.00        0.00     472/95678933    <predicate `c_parser_aim.logical_or_expression/3' mode 0>  <cycle 2> [44] 
                1.01        1.01 6112154/95678933    <predicate `c_parser_aim.multiplicative_expression/3' mode 0>  <cycle 2> [17]
                0.00        0.00    1474/95678933    <predicate `c_parser_aim.p/3' mode 0> [34]                                         
                0.00        0.00    1172/95678933    <predicate `c_parser_aim.pointer/2' mode 0> [32]
                2.72        2.72 16437987/95678933    <predicate `c_parser_aim.postfix_expression/3' mode 0>  <cycle 2> [4]
                8.10        8.10 48898650/95678933    <predicate `c_parser_aim.postfix_expression_rec/3' mode 0> [9]             
                2.20        2.20 13264944/95678933    <predicate `c_parser_aim.primary_expression/3' mode 0> [13]   
                0.03        0.03  181101/95678933    <predicate `c_parser_aim.relational_expression/3' mode 0>  <cycle 2> [21]
                0.08        0.08  452752/95678933    <predicate `c_parser_aim.shift_expression/3' mode 0>  <cycle 2> [20]     
                0.00        0.00      68/95678933    <predicate `c_parser_aim.struct_declaration/3' mode 0>  <cycle 2> [54]
                0.00        0.00     136/95678933    <predicate `c_parser_aim.struct_declarator/3' mode 0>  <cycle 2> [52] 
                0.00        0.00      68/95678933    <predicate `c_parser_aim.struct_declarator_list/3' mode 0>  <cycle 2> [55]
                0.00        0.00    1404/95678933    <predicate `c_parser_aim.struct_or_union/2' mode 0> [36]                        
                0.00        0.00     102/95678933    <predicate `c_parser_aim.struct_or_union_specifier/3' mode 0>  <cycle 2> [33]
                0.00        0.00    2334/95678933    <predicate `c_parser_aim.type_qualifier/2' mode 0> [30]                            
                0.00        0.00    4079/95678933    <predicate `c_parser_aim.type_specifier/3' mode 0>  <cycle 2> [25]
                0.00        0.00     333/95678933    <predicate `c_parser_aim.typedef_name/3' mode 0> [51]                   
                0.09        0.09  553352/95678933    <predicate `c_parser_aim.unary_expression/3' mode 0>  <cycle 2> [18]
[6]     52.1   15.85       15.85 95678933         <predicate `c_parser_aim.lexer.token/3' mode 0> [6]                          
                0.00        0.00      96/96          <function `c_parser_aim.lexer.punctuator_list/0' mode 0> [81]
                0.00        0.00      58/58          <function `string.from_char_list/1' mode 0> [64]             
                0.00        0.00     108/108         <predicate `c_parser_aim.lexer.constant/2' mode 0> [88]
                0.00        0.00     166/412         <predicate `c_parser_aim.lexer.identifier_nondigit/3' mode 0> [86]
                0.00        0.00      58/58          <predicate `c_parser_aim.lexer.identifier_rec/3' mode 0> [97]     
                0.00        0.00     225/225         <predicate `c_parser_aim.lexer.keyword/3' mode 0> [83]       
                0.00        0.00      96/96          <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [95]
                0.00        0.00     225/225         <predicate `c_parser_aim.lexer.whitespace/2' mode 0> [82]
It seems like postfix_expression_rec is the winner; it is responsible for over half the calls to token. Tracing it back similarly to how we traced keyword_rec, we find that it is called solely by postfix_expression 8,149,775 times, which is called solely by unary_expression a similar number of times, which is called almost exclusively by cast_expression a similar number of times, which is called almost exclusively by multiplicative_expression, which is called… only 2,054,666 times:
                                 2054666             <predicate `c_parser_aim.additive_expression/3' mode 0>  <cycle 2> [19]
[17]     7.2    0.70        2.18 2054666         <predicate `c_parser_aim.multiplicative_expression/3' mode 0>  <cycle 2> [17]
                1.01        1.01 6112154/95678933    <predicate `c_parser_aim.lexer.token/3' mode 0> [6]                            
                0.47        0.47 6112154/87459316    <unification predicate for type `c_parser_aim.lexer.token/0' mode 0> [11]
                                 8218660             <predicate `c_parser_aim.cast_expression/3' mode 0>  <cycle 2> [12]
In other words, multiplicative_expression is responsible for multiplying the number of calls passing through it to token fourfold. Let’s look at multiplicative_expression (again, with Typedefs elided):
multiplicative_expression -->
    (cast_expression, p("*"), multiplicative_expression) -> [];
    (cast_expression, p("/"), multiplicative_expression) -> [];
    (cast_expression, p("%"), multiplicative_expression) -> [];
    cast_expression.
A-ha, a repeated prefix! We know how to fix this:
multiplicative_expression -->
    cast_expression,
    ((p("*"), multiplicative_expression) -> [];
     (p("/"), multiplicative_expression) -> [];
     (p("%"), multiplicative_expression) -> [];
     []).
Or, if we’d prefer to leave the structure of the grammar intact at the cost of memoization overhead, we can memoize cast_expression:
:- pred cast_expression(list(char)::in, list(char)::out) is semidet.
:- pragma memo(cast_expression/2, [fast_loose]).
(Note the :- pred declaration is necessary as you cannot memoize undeclared modes.)

Notice that most of the other _expression productions follow the same pattern; all of them will in fact benefit from the same prefix-factoring or memoization, as sketched below.
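For instance, additive_expression can be prefix-factored the same way (a sketch, with the Typedefs arguments again omitted):
additive_expression -->
    multiplicative_expression,
    ((p("+"), additive_expression) -> [];
     (p("-"), additive_expression) -> [];
     []).
Let’s do that for all of them and move on.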

Memoizing multiple calls to unary_expression

Despite all our optimization so far, token (and postfix_expression_rec in turn) still reigns supreme. However, tracing the chain of calls back this time now reveals this:
                                  170474             <predicate `c_parser_aim.assignment_expression/3' mode 0>  <cycle 2> [15]
                                  198268             <predicate `c_parser_aim.cast_expression/3' mode 0>  <cycle 2> [32]      
[18]     7.2    0.10        0.32  368742         <predicate `c_parser_aim.unary_expression/3' mode 0>  <cycle 2> [18]
                0.03        0.06  171588/338952      <predicate `c_parser_aim.kw/3' mode 0> [30]                           
                0.13        0.13  686352/12057646    <predicate `c_parser_aim.lexer.token/3' mode 0> [7]
                0.03        0.03  686352/11652569    <unification predicate for type `c_parser_aim.lexer.token/0' mode 0> [17]
                                  368742             <predicate `c_parser_aim.postfix_expression/3' mode 0>  <cycle 2> [9]
unary_expression is called approximately an equal number of times by both assignment_expression and cast_expression. This makes it a perfect candidate for memoization:
:- pred unary_expression(typedefs::in, list(char)::in, list(char)::out) is semidet.
:- pragma memo(unary_expression/3, [fast_loose]).

Ad nauseam

And that’s about all there is to it: following this method, I found I needed to memoize two more procedures and factor one more, after which the parser parsed the test file faster than time could reliably measure. Of course, it would behoove us to test on different inputs in order to exercise different grammar productions, but the method is the same.

Summary

To recap:
  1. Use mprof -c to determine which procedure is the largest time hog.
  2. If the procedure is called by primarily one other procedure on a 1:1 basis, or if you have already optimized it, look instead at its most prominent caller (etc.).
  3. If the procedure is called by primarily one other procedure on a greater than 1:1 basis and it is a common prefix of this caller, factor it out.
  4. Otherwise, memoize it.
  5. Recompile and repeat.
This simple method will asymptotically increase the speed of almost any naïve PEG parser, in my experience, from unusable to quite fast, and will avoid the overhead associated with unnecessarily memoizing every grammar production in a parser.

Download c_parser_memo.m

Further reading

The direct encoding of PEGs in Mercury, and the performance implications thereof, were first explored in-depth by Ralph Becket and Zoltan Somogyi in the paper DCGs + Memoing = Packrat Parsing; But is it worth it?; I highly recommend it as further reading for those interested in the subject.

Sunday, October 9, 2011

Parsing Expression Grammars part 2

Continuing last week’s post, I’d like to give an example of implementing a real-life PEG. What better than the C language, specifically, preprocessed C99? (Note: it may be helpful to the reader to open the linked PDF to Annex A, “Language syntax summary”.)

Recall that PEGs differ from more commonly used CFGs in that they are completely deterministic, greedy, and don’t allow left recursion. This can make translating some grammars difficult or impossible if they rely on those features. Fortunately, C99’s grammar is straightforward enough that, with a little thought, we can work around these problems. Let’s see how.

The Lexer


First off, let’s create a submodule for the lexer like so:
:- module lexer.
:- interface.
From the lexer we will export two predicates, one which lexes single tokens (token), and one which lexes a sequence of whitespace (whitespace):
:- pred token(token::out, list(char)::in, list(char)::out) is semidet.
:- pred whitespace(list(char)::in, list(char)::out) is det.
Note: it’s important that we define the mode of the token argument of token as out rather than in, even though they are logically equivalent. This will allow us later to memoize token on one argument instead of two, greatly reducing our memory use at negligible (if any) processing overhead. This trick is possible precisely because PEGs are deterministic: the determinism of token with token as out will still be semidet, rather than nondet (and hence unmemoizable) as it would be with a CFG.

The C99 standard defines 5 kinds of lexical tokens our lexer will need to generate: keyword, identifier, constant, string-literal, and punctuator. A glance through the C99 grammar reveals that merely to parse the grammar, we do not need to know the specific values of constants and string-literals, so let’s define the type of a lexing token like so:
:- type token --->
    keyword(string);
    identifier(string);
    constant;
    string_literal;
    punctuator(string).
Of course, if we were constructing a parse tree, we would want to hold onto the values of constants and string-literals, but for now, since we are just matching the grammar, it is OK to leave them out.
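If we ever did want those values, the type might look like this instead (a sketch only; the rest of this post sticks with the value-free type above):
:- type token --->
    keyword(string);
    identifier(string);
    constant(string);
    string_literal(string);
    punctuator(string).
Onto the implementation!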
:- implementation.

Whitespace, or, using built-in character classes


Let’s get whitespace out of the way. The C99 definition of whitespace corresponds to Mercury’s definition of whitespace, so we can import char and define whitespace like so:
whitespace --> ([W], {is_whitespace(W)}) -> whitespace; [].

Tokens, or, parameterizing rules


C99 defines token as follows:
token:
        keyword
        identifier
        constant
        string-literal
        punctuator
A straightforward translation of this to a PEG in Mercury would read:
token -->
    keyword -> [];
    identifier -> [];
    constant -> [];
    string_literal -> [];
    punctuator.
However, because this is an unordered choice in the spec but an ordered choice in Mercury, we need to convince ourselves that imposing this ordering will not change the language. Fortunately this is easy to confirm: the latter four must each start with a different kind of character (letter, digit, quote, and non-quote symbol, respectively), and the first (keyword) should be matched in preference to the second (identifier) anyway.

The C99 spec assumes that whitespace is removed in some phase in-between lexing and parsing, and that string literals are concatenated. Because we will be attaching our lexer directly to our parser (such is the magic of PEGs), we will need to perform these two steps in our tokenizer. Finally, we will want to actually produce a lexing token as well (rather than just match it). The final implementation of token is thus:
token(Token) -->
    whitespace,
    (keyword(KW) -> {Token = keyword(KW)};
     identifier(Ident) -> {Token = identifier(Ident)};
     constant -> {Token = constant};
     (string_literal, (token(string_literal) -> []; [])) ->
        {Token = string_literal};
     punctuator(Punct), {Token = punctuator(Punct)}).
Parameterizing rules like this can be very powerful, as we’ll see later in this post. For now, let’s tackle each kind of token in turn.

Keywords, or, fixing choice using negative lookahead assertions


keyword looks deceptively simple; it is merely a list of valid C99 keywords. We could be tempted to implement it as such:
keyword(Keyword) --> % WRONG
    ['a', 'u', 't', 'o'] -> {Keyword = "auto"};
    ['b', 'r', 'e', 'a', 'k'] -> {Keyword = "break"};
    etc.
But the devil is in the details. We must ensure that no keyword is matched against the prefix of another keyword or identifier (e.g. we should not match the “do” in the keyword “double”, or the “for” in the identifier “format”). The latter can be fixed by using a negative lookahead assertion:
keyword(Keyword) -->
    (['a', 'u', 't', 'o'] -> {Keyword = "auto"};
     ['b', 'r', 'e', 'a', 'k'] -> {Keyword = "break"};
     etc.),
    not (identifier_nondigit(_) -> []; digit(_)).
In other words, a keyword must not be followed immediately by one of the characters which are valid in a keyword or identifier (defined later). Note that this won’t fix partial keyword matching, since if the assertion fails, the parse as a keyword is rejected entirely. Either we should move the negative lookahead assertion into the bodies of the branches, or simply rearrange the offending keywords from longest to shortest. Since “do” and “double” are the only offenders, it’s easy enough to swap them. (We’ll encounter a better example later.)
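After the swap, the relevant branches read (a sketch showing just the two offenders, in the corrected order):
keyword(Keyword) -->
    (['d', 'o', 'u', 'b', 'l', 'e'] -> {Keyword = "double"};
     ['d', 'o'] -> {Keyword = "do"};
     etc.),
    not (identifier_nondigit(_) -> []; digit(_)).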

(Note: In the completed lexer linked below, I add a little extra machinery to allow listing keywords as strings in a list. There is nothing PEG-specific about this technique, and it saves me only keystrokes.)

Identifiers, or, transforming left-recursive productions


Identifiers take a recursive form commonly seen in the C99 spec:
identifier:
        identifier-nondigit
        identifier identifier-nondigit
        identifier digit
That is, an identifier is a certain initial production followed by a sequence of either of two other productions. A direct implementation is no good:
identifier --> % WRONG
    identifier_nondigit -> [];
    (identifier, identifier_nondigit) -> [];
    (identifier, digit).
This will either match a single identifier_nondigit, or, failing to do so, will get caught in a loop calling itself again having consumed no input. Not what we wanted. We need to split the initial call to identifier_nondigit out, and write a right-recursive rule for the rest of the identifier:
identifier --> identifier_nondigit, identifier_rec.

identifier_rec -->
    (identifier_nondigit, identifier_rec) -> [];
    (digit, identifier_rec) -> [];
    [].
Because identifier-nondigit and digit are disjoint, order is not important. This formulation is a proper PEG and will behave as expected.

Of course, we determined earlier that it was important to us to know what identifier was parsed. We can easily add expressions to our rules to do just this:
identifier(from_char_list(Ident): string) -->
    identifier_nondigit(C), identifier_rec(Ident0),
    {Ident = [C | Ident0]}.

identifier_rec(Ident) -->
    (identifier_nondigit(C), identifier_rec(Ident0)) ->
        {Ident = [C | Ident0]};
    (digit(D), identifier_rec(Ident0)) ->
        {Ident = [D | Ident0]};
    {Ident = []}.

identifier_nondigit(D) -->
    nondigit(D0) -> {D = D0};
    universal_character_name(D0), {D = D0}.
(The typecast to string in the head of identifier is necessary because from_char_list is overloaded and Mercury gets confused otherwise.)

nondigit, digit, and universal_character_name are straightforward and use the techniques described thus far. I’m going to skip over them and move on to constants.

Constants, or, fixing choice using order


C99 defines constants as:
constant:
       integer-constant
       floating-constant
       enumeration-constant
       character-constant
Now, the astute reader will notice that something will go wrong here: an integer constant is also a valid prefix of most floating-point literals in C99 (consider the 123 in 123.45), so if we wrote this production directly as a PEG, it would match just the integer part of a floating-point literal as an integer. Again, not what we want.

We could use a negative lookahead assertion as above to fix this, but easier still is to change the order, prioritizing floats over integers:
constant -->
    floating_constant -> [];
    integer_constant -> []; 
    enumeration_constant -> [];
    character_constant.
Easy enough! Since enumeration constants must always start with a letter (which floats and integers do not), and character constants must always start with a single quote, we are all set here.

The rest of the lexer for constants and string literals uses one of these techniques, so let’s move on to the last lexing token: punctuators.

Punctuators, or, fixing choice by ordering by length


punctuator, like keyword, is merely a list of strings of symbols which are meaningful in C99. This list presents the same problem that the list of keywords presents. Namely, that a shorter punctuator may match the prefix of what is in fact a longer punctuator. (For example, < may match the symbol <<= in source text.)

For this production, we can use the solution of ordering the punctuators from longest to shortest. This works without much further thought. (Again, I use some extra code to allow me to define this production as a list of strings in my example code.)
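For instance, the punctuators beginning with < must appear in this order (a sketch; the real rule covers the full punctuator list):
punctuator(Punct) -->
    (['<', '<', '='] -> {Punct = "<<="};
     ['<', '<'] -> {Punct = "<<"};
     ['<', '='] -> {Punct = "<="};
     ['<'] -> {Punct = "<"};
     etc.).
And that about does it for the lexer.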
:- end_module lexer.

The Parser


The parser is, largely, trivial, using the same four techniques described above for the lexer (rewriting left-recursion and the three ordering techniques). Since the grammar makes lots of reference to individual kinds of tokens, I defined a few shortcuts:
kw(KW) --> token(keyword(KW)).
id(Ident) --> token(identifier(Ident)).
const --> token(constant).
str_lit --> token(string_literal).
p(Punct) --> token(punctuator(Punct)).
The grammar also references the enumeration-constant production, which, although not a token directly, is merely an identifier and we can treat it as such:
enumeration_const --> id(_).
Ensuring that all calls to the lexer are funneled through token will make our lives easier when we optimize later.

Now before I sign off, there are two gotchas in the grammar, for both of which C is infamous.

The “dangling else”


C’s if/then/[else] has the problem that, in a CFG, it is ambiguous. Does if (1) if (0) puts("hi"); else puts("bye"); print bye or nothing? The C99 standard indicates that an else should match the nearest if that it legally can (so the above code would print bye).

Fortunately with PEGs, this ambiguity is trivial to solve. We need simply write the if/then/else case first in the grammar. Any else will then be matched to the innermost (and thus nearest) if that can legally take it:
selection_statement -->
    (kw("if"), p("("), expression, p(")"), statement,
     kw("else"), statement) -> [];
    (kw("if"), p("("), expression, p(")"), statement) -> [];
    etc.

Typedefs


For the unfamiliar, a typedef in C is a means of defining an alias for an existing type specification. There’s nothing odd about that, except that typedefed names parse like keywords, so the set of types in scope determines the parse. In other words, C is not context-free.

Even though PEGs can directly represent some non-context-free grammars, C is not one of them. Fortunately the problem is easily solved by parameterizing the grammar and passing around a set of “typedefed” identifiers:
:- import_module bool, set.
:- type typedefs == set(string).
Typedefs are created in declarations, if one of the declaration-specifiers is a storage-class-specifier which is the keyword typedef:
declaration(!Typedefs) -->
    declaration_specifiers(!.Typedefs, HasTypedef),
    (init_declarator_list(Decls0) ->
        {Decls = Decls0}; {Decls = init}),
    p(";"), {if HasTypedef = yes then union(Decls, !Typedefs) else true}.

declaration_specifiers(Typedefs, HasTypedef) -->
    (storage_class_specifier(IsTypedef),
     declaration_specifiers_opt(Typedefs, HasTypedef0)) ->
        {HasTypedef: bool = IsTypedef `or` HasTypedef0};
    (type_specifier,
     declaration_specifiers_opt(Typedefs, HasTypedef0)) ->
        {HasTypedef = HasTypedef0};
    etc.

declaration_specifiers_opt(Typedefs, HasTypedef) -->
    declaration_specifiers(Typedefs, HasTypedef0) ->
        {HasTypedef = HasTypedef0};
    {HasTypedef = no}.

storage_class_specifier(IsTypedef) -->
    kw("typedef") -> {IsTypedef = yes};
    kw("extern")  -> {IsTypedef = no}; 
    kw("static")  -> {IsTypedef = no};
    kw("auto")    -> {IsTypedef = no};
    kw("register"),  {IsTypedef = no}.
(Note the use of state variable syntax, and the factored-out declaration_specifiers_opt, which makes the code passably legible.) Typedef declarations may legally exist in block-items, scoped within compound-statements:
compound_statement(Typedefs) -->
    p("{"), (block_item_list(Typedefs, _) -> []; []), p("}").

block_item_list(!Typedefs) -->
    (block_item(!Typedefs), block_item_list(!Typedefs)) -> [];
    block_item(!Typedefs).

block_item(!Typedefs) --> declaration(!Typedefs) -> []; statement.
and as external-declarations:
translation_unit(!Typedefs) -->
    (external_declaration(!Typedefs), translation_unit(!Typedefs)) -> [];
    external_declaration(!Typedefs).

external_declaration(!Typedefs) -->
    function_definition -> [];
    declaration(!Typedefs).
Finally, typedef-name itself references the set of typedefs:
typedef_name(Typedefs) --> id(Ident), {set.member(Ident, Typedefs)}.
The rest of the implementation of typedefs involves parameterizing most productions to pass the set of typedefs from external_declaration down to typedef_name, and parameterizing the few illegal uses of typedefs (for example, in function signatures) to simply drop the modified set of typedefs. Nothing brain-bending; just a lot of typing.

Summary & Test


In today’s post we saw how to translate a real language grammar, that of C99, into a Mercury program. We learned how to transform left-recursive rules into right-recursive ones, how to ensure proper production choice by using negative lookahead assertions, ordering by precedence, and ordering by length, and how to implement the non-context-free elements of C99.

Let’s add a main function similar to last week’s, compile, and try out our spiffy new parser:
$ ./c_parser_direct
int main(void) {}
…and wait, and wait, and wait, and… what’s going on? Why are we getting no response? Do we have an infinite loop?

Nope. Add this line in the lexer module:
:- pragma memo(token/3, [fast_loose]).
$ ./c_parser_direct
int main(void) {}
(wait for it…)
Success!
Success indeed, but it’s still slow. More on memoization and how to speed this up even more in part 3.

Download c_parser_direct.m

Sunday, October 2, 2011

Parsing Expression Grammars part 1

First, I’d like to say welcome back! AiM is returning from the hiatus it has been on since April. The summer has been busy for me, but I hope to update AiM weekly henceforth. Let’s start!

Parsing Expression Grammars


Parsing expression grammars (PEGs) are a new-ish kind of formal grammar intended to describe computer languages. PEGs differ from the more commonplace context-free grammars most notably in that PEGs offer an ordered choice operator. The second branch of an ordered choice implicitly does not match any expression which is matched by the first branch. Since any text can match at most one branch of an ordered choice, this has the practical implication that a PEG cannot be ambiguous, and rule precedence declarations are not needed.

Another natural consequence of ordered choice is that the so-called repetition operators (* and +) and the option operator (?) are necessarily greedy. (Any time a grammar rule can recur, ordered choice ensures that it does recur, since the option of not recurring will implicitly not match.) For example, the PEG 'a'* 'a' can never match any input, because 'a'* greedily consumes every 'a' and never gives one back.

A PEG rule, as described in the Wikipedia entry, may be one of:
  • a terminal (a symbol),
  • a nonterminal (another rule), or
  • the empty string.
or may be constructed from other rules by:
  • sequence: A B
  • ordered choice: A / B
  • zero or more: A*
  • one or more: A+
  • optional: A?
  • positive lookahead: &A
  • negative lookahead: !A
Positive lookahead and negative lookahead test for a rule match (or lack thereof) but do not consume input.

Translating into Mercury


It’s well-known that CFGs can be implemented directly in logic languages such as Mercury as recursive-descent parsers. Prolog and Mercury even have a special syntax known as definite clause grammar (DCG) to facilitate doing so. It’s similarly easy to implement PEGs, and there’s a surprise at the end! Let’s see how.

PEG rules can be represented using DCG clauses like so:

rule --> body

The type and mode of rules can be inferred by Mercury, but if you need to declare them, you can do so as:

:- pred rule(list(char)::in, list(char)::out) is semidet.

The elements of the body of a PEG are translated as follows:
  • terminal C: [C]
  • nonterminal A: A
  • empty string: []
  • sequence A B: A, B
  • ordered choice A / B: A -> []; B
  • optional A?: A -> []; []
  • positive lookahead &A: not not A (or equivalently, =(S), A, :=(S))
  • negative lookahead !A: not A
Note that using -> []; (if/else) instead of ; is what gives us exactly the greedy semantics we require.
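As a tiny worked example (not taken from any real grammar), the PEG rule A ← 'a' A / 'b' translates mechanically to:
a_rule --> (['a'], a_rule) -> []; ['b'].
That is: try to match an 'a' followed recursively by a_rule; failing that, match a single 'b'.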

Repetition constructions (A* and A+) are recursive in nature and cannot be represented inline. However, we can define higher-order rules to represent them as:
:- mode *(pred(in, out) is semidet, in, out) is det.
*(P) --> (P, *(P)) -> []; [].
:- mode +(pred(in, out) is semidet, in, out) is semidet.
+(P) --> (P, +(P)) -> []; P.
Then we can translate repetition constructions as:
  • zero or more A*: *(A) or *(pred(in, out) is semidet --> A)
  • one or more A+: +(A) or +(pred(in, out) is semidet --> A)
The first, shorter form, can only be used for named rules with defined modes. The longer form is applicable to any rule construct.
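For instance, given a named rule with a declared mode, the short form suffices (a sketch, assuming char and list are imported):
:- pred digit(list(char)::in, list(char)::out) is semidet.
digit --> [D], {is_digit(D)}.

digits --> +(digit).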

Example


Let’s translate the example of basic arithmetic expressions from Wikipedia:
Value   ← [0-9]+ / '(' Expr ')'
Product ← Value (('*' / '/') Value)*
Sum     ← Product (('+' / '-') Product)*
Expr    ← Sum
First let’s get some boilerplate out of the way. We’ll need the list module for DCG-syntax terminals:
:- module peg_example.

:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.

:- implementation.

:- import_module list.
Now onto Value. Right off the bat Wikipedia throws us a curveball with the range expression [0-9]. Fortunately that’s easily translated as a specially-moded function:
:- mode in..in = in is semidet.
A..B = C :- A @=< C, C @=< B.
Even though (in fact, because) .. is a builtin operator, we can use it in definitions as well.

Armed with our new range operator, we can translate the Value rule as:
value -->
    +(pred(in, out) is semidet --> ['0'..'9']) -> [];
    (['('], expr, [')']).
Note that the lambda-predicate (pred(in, out) is semidet -->) is needed since ['0'..'9'] is not a predicate on its own. The translation is otherwise straightforward.

Product, Sum, and Expr follow similarly:
product -->
    value, *(pred(in, out) is semidet -->
             ((['*'] -> []; ['/']), value)).

sum -->
    product, *(pred(in, out) is semidet -->
               ((['+'] -> []; ['-']), product)).

expr --> sum.
Our main function will read a line, try to parse it using expr and make sure that only the newline at the end is left unparsed:
main(!IO) :-
    read_line(Res, !IO),
    (if Res = ok(Line) then
        (if expr(Line, ['\n'])
         then print("Success!\n", !IO)
         else print("Failure!\n", !IO)),
        main(!IO)
     else true).
Compile with mmc --infer-all --make peg_example, run it and try out some expressions, keeping in mind that our grammar doesn’t allow negatives, decimals, or whitespace:
$ ./peg_example
123+456
Success!
5/(10-20)
Success!
45-(40
Failure!
*12
Failure!
a+18
Failure!
Of course, a parser that merely tests whether input is well-formed is only mildly useful. Next time, we’ll look into creating parse trees, and the real-life example of parsing C code.

Download peg_example.m