Monday, October 17, 2011

Parsing Expression Grammars part 3 (final)

So, last week, we built a C99 parser. A slow C99 parser. At the end we added one line of code that sped things up a bunch. Let’s look at that.

Why recursive descent parsing is slow

Right now our PEG parser is what’s known as a recursive descent parser – it recursively descends down the parse tree looking for matches. Unfortunately, recursive descent is very slow for two reasons:
  1. One grammar rule may try to parse a given span of input the same way multiple times.
  2. Multiple grammar rules may try to parse a given span of input the same way multiple times.
An example of the former is in the selection_statement production (Typedefs omitted for clarity):
selection_statement -->
    (kw("if"), p("("), expression, p(")"), statement, kw("else"), statement) -> [];
    (kw("if"), p("("), expression, p(")"), statement) -> [];
    (kw("switch"), p("("), expression, p(")"), statement).
The string “if (x > 0) x--; return;” will be parsed once by the first branch, which fails when it finds “return” instead of “else”, and will then be reparsed by the second branch, which will succeed. This sort of prefix-repetition can be eliminated by factoring out the repeated portion:
selection_statement -->
    (kw("if"), p("("), expression, p(")"), statement,
     ((kw("else"), statement) -> []; [])) -> [];
    (kw("switch"), p("("), expression, p(")"), statement).
Of course, this means that our code no longer is structurally identical to the grammar we are parsing, but this may be a welcome tradeoff.

An example of the latter, which is more pernicious, is given by, well, almost every production. Why? Consider statement:
statement -->
    labeled_statement -> [];
    compound_statement -> [];
    expression_statement -> [];
    selection_statement -> []; 
    iteration_statement -> []; 
    jump_statement.
Pretty innocuous, right? Take a look at each of the productions it invokes. Almost all of them start with a call to kw, which in turn calls token. In fact, all of them start with a call, directly or indirectly, to token. In fact, this is true of every production in the parser. token is not the only culprit but it is the most pervasive.

This problem can’t be solved through a simple restructuring of the grammar. For this we need a more powerful tool: memoization.

Memoization

Memoization is one of my favorite features. Simply put, memozation of a procedure means maintaining a table of its past input and output values, and looking up output values in this table when the procedure is called in the future with the same input values. Of course, this only works if the procedure has no desirable side effects; fortunately this is true of most Mercury code. Mercury provides built-in support for memoization via the memo pragma. Basic usage is so:
:- pragma memo(Name/Arity).
By default, Mercury memoizes inputs by value. This will catch all repeated calls to a procedure, but can be slow for large structures. We can specify that we would like Mercury to memoize inputs by address instead:
:- pragma memo(Name/Arity, [fast_loose]).
Memoizing by address is much faster but may miss calls with identical values at different addresses. However, because our primary data structure is a list of characters which is very large (meaning memoize-by-value will be slow) and which we are only destructing (meaning identical sublists of this list must be physically the same), we will memoize by address.

Profiling

Now, although we could memoize every procedure in our lexer and parser, memoization does have overhead, and so we should not use it on procedures which are not called often or with repeated inputs. We should limit our scope to grammar rules which are invoked by many different rules, or multiple times by one rule, since these are likely to be invoking the rule repeatedly on the same input.

To find such candidates for memoization, we can use runtime code profiling. To generate profiling data, we can simply compile with the -p switch and then run our program with our test from last week (minus the memoization of token we added at the end):
$ mmc -p --infer-all --make c_parser_direct
$ echo 'int main(void) {}' | ./c_parser_direct
Let that churn for a few seconds, and then kill it with Ctrl+C. Now run mprof to find out what’s been taking so long:
   %  cumulative    self              self    total                                                            
 time    seconds  seconds    calls  ms/call  ms/call name                                                      
 68.5       8.70     8.70 17602295     0.00     0.00 <function `string.to_char_list/1' mode 0> [1]
  7.1       9.60     0.90 17602483     0.00     0.00 <predicate `builtin.unify/2' mode 0> [2]     
  6.3      10.40     0.80   202300     0.00     0.00 <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [3]
  5.9      11.15     0.75   202388     0.00     0.00 <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [4]   
  4.7      11.75     0.60 10114349     0.00     0.00 <predicate `c_parser_aim.lexer.punctuator_aux/3' mode 0> [5]
  4.3      12.30     0.55  7487945     0.00     0.00 <predicate `c_parser_aim.lexer.keyword_aux/3' mode 0> [6]
Over 95% of that run was spent just checking for keywords and punctuators! In particular, the biggest culprit, string.to_char_list, is highly suspect. To find out who’s calling it, we can run mprof -c [note: sometimes I have had to run mprof -cd; I can’t reproduce this now so I’m not sure why] and search for the callers of string.to_char_list:
                3.70        3.70 7487945/17602295    <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [24]
                5.00        5.00 10114350/17602295    <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [22]
[21]    68.5    8.70        8.70 17602295         <function `string.to_char_list/1' mode 0> [21]
The [21] in the left-most column indicates that this section is dedicated to procedure #21, string.to_char_list, which was called 17,602,295 times. The two lines above this line indicate that both lexer.keyword_rec and lexer.punctuator_rec call string.to_char_list, 7,487,945 and 10,114,350 times respectively. Of course; we use string.to_char_list to convert our keywords and punctuators to character lists for these two procedures to match against the input data. But clearly we don't have 7,487,945 keywords or 10,114,350 punctuators, so string.to_char_list must be getting called with the same arguments repeatedly. Let’s fix that:
:- func to_char_list_memo(string) = list(char).
:- pragma memo(to_char_list_memo/1, [fast_loose]).
to_char_list_memo(S) = to_char_list(S).
Unfortunately we can’t instruct Mercury to memoize procedures in other modules, so we must define to_char_list_memo as above. We can replace the two calls to it in lexer.keyword_rec and lexer.punctuator_rec with the memoized version. Compiling and reprofiling tells us:
   %  cumulative    self              self    total                                                             
 time    seconds  seconds    calls  ms/call  ms/call name                                                       
 45.2       1.65     1.65 10400248     0.00     0.00 <function `c_parser_aim.to_char_list_memo/1' mode 0> [22]
 12.3       2.10     0.45 10400437     0.00     0.00 <predicate `builtin.unify/2' mode 0> [29]                
 12.3       2.55     0.45  5975244     0.00     0.00 <predicate `c_parser_aim.lexer.punctuator_aux/3' mode 0> [27]
 12.3       3.00     0.45   119606     0.00     0.01 <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [24]   
  8.2       3.30     0.30  4425004     0.00     0.00 <predicate `c_parser_aim.lexer.keyword_aux/3' mode 0> [28]
  4.1       3.45     0.15   119517     0.00     0.02 <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [21]

Tracking down token

We’ve reduced the percentage of total runtime consumed by to_char_list, but it’s still getting called an awful lot. We know keyword_rec and punctuator_rec don’t call it any more often than they need to, so we need to find out who’s calling them. Let’s follow keyword_rec:
                0.45        1.64  119606/119606      <predicate `c_parser_aim.lexer.keyword/3' mode 0> [23]
[24]    45.0    0.45        1.64  119606         <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [24]
                0.70        0.70 4425004/10400248    <function `c_parser_aim.to_char_list_memo/1' mode 0> [22]
                0.30        0.49 4425004/4425004     <predicate `c_parser_aim.lexer.keyword_aux/3' mode 0> [28]
keyword is the only caller, no surprise there. Let’s go up one more level:
                0.00        1.64  119606/119606      <predicate `c_parser_aim.lexer.token/3' mode 0> [1]       
[23]    45.0    0.00        1.64  119606         <predicate `c_parser_aim.lexer.keyword/3' mode 0> [23]
                0.00        0.00      57/239246      <predicate `c_parser_aim.lexer.identifier_nondigit/3' mode 0> [67]
                0.45        1.64  119606/119606      <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [24]        
                0.00        0.00      57/836707      <predicate `char.is_digit/1' mode 0> [59]
Again, no surprise – keyword is called only by token.
                0.00        0.21    7009/119606      <predicate `c_parser_aim.cast_expression/3' mode 0> [8]
                0.00        0.00       2/119606      <predicate `c_parser_aim.compound_statement/3' mode 0>  <cycle 3> [55]
                0.00        0.00     117/119606      <predicate `c_parser_aim.declaration_specifiers/4' mode 0>  <cycle 4> [35]
                0.00        0.00       4/119606      <predicate `c_parser_aim.direct_abstract_declarator/3' mode 0> [52]             
                0.00        0.00       9/119606      <predicate `c_parser_aim.direct_declarator/4' mode 0>  <cycle 2> [50]
                0.00        0.00       4/119606      <predicate `c_parser_aim.direct_declarator_rec/3' mode 0>  <cycle 2> [47]
                0.00        0.00      36/119606      <predicate `c_parser_aim.enum_specifier/3' mode 0> [42]                        
                0.01        0.43   14042/119606      <predicate `c_parser_aim.kw/3' mode 0> [30]            
                0.00        0.00       3/119606      <predicate `c_parser_aim.labeled_statement/3' mode 0> [53]
                0.00        0.00       8/119606      <predicate `c_parser_aim.p/3' mode 0> [51]                
                0.00        0.00       2/119606      <predicate `c_parser_aim.parameter_list/3' mode 0>  <cycle 2> [54]
                0.00        0.00       1/119606      <predicate `c_parser_aim.parameter_type_list/3' mode 0>  <cycle 2> [56]
                0.00        0.00      26/119606      <predicate `c_parser_aim.pointer/2' mode 0> [45]                             
                0.01        0.43   14019/119606      <predicate `c_parser_aim.postfix_expression/3' mode 0> [25]
                0.01        0.86   28036/119606      <predicate `c_parser_aim.primary_expression/3' mode 0> [26]
                0.00        0.00      48/119606      <predicate `c_parser_aim.struct_or_union/2' mode 0> [41]   
                0.00        0.00      36/119606      <predicate `c_parser_aim.type_qualifier/2' mode 0> [43] 
                0.00        0.00     120/119606      <predicate `c_parser_aim.type_specifier/3' mode 0> [37]
                0.00        0.00      12/119606      <predicate `c_parser_aim.typedef_name/3' mode 0> [48]  
                0.02        1.71   56072/119606      <predicate `c_parser_aim.unary_expression/3' mode 0> [20]
[1]    100.0    0.05        3.65  119606         <predicate `c_parser_aim.lexer.token/3' mode 0> [1]
                0.00        0.00  119517/119517      <function `c_parser_aim.lexer.punctuator_list/0' mode 0> [66]
                0.00        0.00      31/31          <function `string.from_char_list/1' mode 0> [60]             
                0.00        0.10  119517/119517      <predicate `c_parser_aim.lexer.constant/2' mode 0> [32]
                0.00        0.00  119548/239246      <predicate `c_parser_aim.lexer.identifier_nondigit/3' mode 0> [67]
                0.00        0.00      31/31          <predicate `c_parser_aim.lexer.identifier_rec/3' mode 0> [71]     
                0.00        1.64  119606/119606      <predicate `c_parser_aim.lexer.keyword/3' mode 0> [23]       
                0.15        1.81  119517/119517      <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [21]
                0.00        0.05  119606/119606      <predicate `c_parser_aim.lexer.whitespace/2' mode 0> [34]
And there’s the culprit. token is called by many other procedures, and as discussed above, almost certainly with repeated inputs. Let’s memoize token.
:- pragma memo(token/3, [fast_loose]).
Now (as we found out last week), parsing our simple test completes in a reasonable amount of time. Let’s give our parser something bigger to chew on.

Factoring out multiple calls from multiplicative_expression

For stress testing, I used a 6kb program from Glen McCluskey’s C99 test suite, tdes_047.c. (It is available in the set of free samples if you wish to try it yourself.) We can use this (or any C99 code) as follows:
$ cpp -P c9_samp/tdes_047.c | ./c_parser_direct
And of course, we are met with no end to the parsing. mprof still tells us that token is responsible for the most CPU usage, but since we’ve already memoized it, let’s work our way up the chain some more. Who is responsible for most of the calls to token?
                0.23        0.23 1358257/95678933    <predicate `c_parser_aim.additive_expression/3' mode 0>  <cycle 2> [19]
                0.00        0.00    7546/95678933    <predicate `c_parser_aim.and_expression/3' mode 0>  <cycle 2> [24]     
                0.00        0.00    2370/95678933    <predicate `c_parser_aim.assignment_operator/2' mode 0> [29]            
                1.36        1.36 8218754/95678933    <predicate `c_parser_aim.cast_expression/3' mode 0>  <cycle 2> [12]
                0.00        0.00       7/95678933    <predicate `c_parser_aim.compound_statement/3' mode 0> [14]              
                0.00        0.00     236/95678933    <predicate `c_parser_aim.conditional_expression/3' mode 0>  <cycle 2> [50]
                0.00        0.00      14/95678933    <predicate `c_parser_aim.declaration/4' mode 0> [5]                             
                0.00        0.00     457/95678933    <predicate `c_parser_aim.declaration_specifiers/4' mode 0>  <cycle 2> [43]
                0.00        0.00     392/95678933    <predicate `c_parser_aim.designator/3' mode 0> [48]                             
                0.00        0.00      95/95678933    <predicate `c_parser_aim.direct_abstract_declarator/3' mode 0> [53]
                0.00        0.00     368/95678933    <predicate `c_parser_aim.direct_declarator/4' mode 0>  <cycle 2> [49]
                0.00        0.00     890/95678933    <predicate `c_parser_aim.direct_declarator_rec/3' mode 0>  <cycle 2> [27]
                0.00        0.00     999/95678933    <predicate `c_parser_aim.enum_specifier/3' mode 0> [38]                        
                0.01        0.01   30183/95678933    <predicate `c_parser_aim.equality_expression/3' mode 0>  <cycle 2> [23]
                0.00        0.00    3773/95678933    <predicate `c_parser_aim.exclusive_or_expression/3' mode 0>  <cycle 2> [26]
                0.00        0.00       2/95678933    <predicate `c_parser_aim.identifier_list/2' mode 0> [57]                         
                0.00        0.00    1886/95678933    <predicate `c_parser_aim.inclusive_or_expression/3' mode 0>  <cycle 2> [31]
                0.00        0.00      17/95678933    <predicate `c_parser_aim.init_declarator/4' mode 0> [8]                          
                0.00        0.00       8/95678933    <predicate `c_parser_aim.init_declarator_list/4' mode 0> [7]
                0.00        0.00       2/95678933    <predicate `c_parser_aim.initializer/3' mode 0>  <cycle 2> [56]
                0.00        0.00      48/95678933    <predicate `c_parser_aim.initializer_list/3' mode 0>  <cycle 2> [45]
                0.02        0.02  139104/95678933    <predicate `c_parser_aim.kw/3' mode 0> [22]                               
                0.00        0.00     943/95678933    <predicate `c_parser_aim.logical_and_expression/3' mode 0>  <cycle 2> [39]
                0.00        0.00     472/95678933    <predicate `c_parser_aim.logical_or_expression/3' mode 0>  <cycle 2> [44] 
                1.01        1.01 6112154/95678933    <predicate `c_parser_aim.multiplicative_expression/3' mode 0>  <cycle 2> [17]
                0.00        0.00    1474/95678933    <predicate `c_parser_aim.p/3' mode 0> [34]                                         
                0.00        0.00    1172/95678933    <predicate `c_parser_aim.pointer/2' mode 0> [32]
                2.72        2.72 16437987/95678933    <predicate `c_parser_aim.postfix_expression/3' mode 0>  <cycle 2> [4]
                8.10        8.10 48898650/95678933    <predicate `c_parser_aim.postfix_expression_rec/3' mode 0> [9]             
                2.20        2.20 13264944/95678933    <predicate `c_parser_aim.primary_expression/3' mode 0> [13]   
                0.03        0.03  181101/95678933    <predicate `c_parser_aim.relational_expression/3' mode 0>  <cycle 2> [21]
                0.08        0.08  452752/95678933    <predicate `c_parser_aim.shift_expression/3' mode 0>  <cycle 2> [20]     
                0.00        0.00      68/95678933    <predicate `c_parser_aim.struct_declaration/3' mode 0>  <cycle 2> [54]
                0.00        0.00     136/95678933    <predicate `c_parser_aim.struct_declarator/3' mode 0>  <cycle 2> [52] 
                0.00        0.00      68/95678933    <predicate `c_parser_aim.struct_declarator_list/3' mode 0>  <cycle 2> [55]
                0.00        0.00    1404/95678933    <predicate `c_parser_aim.struct_or_union/2' mode 0> [36]                        
                0.00        0.00     102/95678933    <predicate `c_parser_aim.struct_or_union_specifier/3' mode 0>  <cycle 2> [33]
                0.00        0.00    2334/95678933    <predicate `c_parser_aim.type_qualifier/2' mode 0> [30]                            
                0.00        0.00    4079/95678933    <predicate `c_parser_aim.type_specifier/3' mode 0>  <cycle 2> [25]
                0.00        0.00     333/95678933    <predicate `c_parser_aim.typedef_name/3' mode 0> [51]                   
                0.09        0.09  553352/95678933    <predicate `c_parser_aim.unary_expression/3' mode 0>  <cycle 2> [18]
[6]     52.1   15.85       15.85 95678933         <predicate `c_parser_aim.lexer.token/3' mode 0> [6]                          
                0.00        0.00      96/96          <function `c_parser_aim.lexer.punctuator_list/0' mode 0> [81]
                0.00        0.00      58/58          <function `string.from_char_list/1' mode 0> [64]             
                0.00        0.00     108/108         <predicate `c_parser_aim.lexer.constant/2' mode 0> [88]
                0.00        0.00     166/412         <predicate `c_parser_aim.lexer.identifier_nondigit/3' mode 0> [86]
                0.00        0.00      58/58          <predicate `c_parser_aim.lexer.identifier_rec/3' mode 0> [97]     
                0.00        0.00     225/225         <predicate `c_parser_aim.lexer.keyword/3' mode 0> [83]       
                0.00        0.00      96/96          <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [95]
                0.00        0.00     225/225         <predicate `c_parser_aim.lexer.whitespace/2' mode 0> [82]
It seems like postfix_expression_rec is the winner; it is responsible for over half the calls to token. Tracing it back similarly to how we traced keyword_rec, we find that it is called solely by postfix_expression 8,149,775 times, which is called solely by unary_expression a similar number of times, which is called almost exclusively by cast_expression a similar number of times, which is called almost exclusively by multiplicative_expression, which is called… only 2,054,666 times:
                                 2054666             <predicate `c_parser_aim.additive_expression/3' mode 0>  <cycle 2> [19]
[17]     7.2    0.70        2.18 2054666         <predicate `c_parser_aim.multiplicative_expression/3' mode 0>  <cycle 2> [17]
                1.01        1.01 6112154/95678933    <predicate `c_parser_aim.lexer.token/3' mode 0> [6]                            
                0.47        0.47 6112154/87459316    <unification predicate for type `c_parser_aim.lexer.token/0' mode 0> [11]
                                 8218660             <predicate `c_parser_aim.cast_expression/3' mode 0>  <cycle 2> [12]
In other words, multiplicative_expression is responsible for multiplying the number of calls passing through it to token fourfold. Let’s look at multiplicative_expression (again, with Typedefs elided):
multiplicative_expression -->
    (cast_expression, p("*"), multiplicative_expression) -> [];
    (cast_expression, p("/"), multiplicative_expression) -> [];
    (cast_expression, p("%"), multiplicative_expression) -> [];
    cast_expression.
A-ha, a repeated prefix! We know how to fix this:
multiplicative_expression -->
    cast_expression,
    ((p("*"), multiplicative_expression) -> [];
     (p("/"), multiplicative_expression) -> [];
     (p("%"), multiplicative_expression) -> [];
     []).
Or, if we’d prefer to leave the structure of the grammar intact at the cost of memoization overhead, we can memoize cast_expression:
:- pred cast_expression(list(char)::in, list(char)::out) is semidet.
:- pragma memo(cast_expression/2, [fast_loose]).
(Note the :- pred declaration is necessary as you cannot memoize undeclared modes.)

Notice that most of the other _expression productions follow the same pattern; all of them will in fact benefit from prefix-factoring or memoization. Let’s do that and move on.

Memoizing multiple calls to unary_expression

Despite all our optimization so far, token (and postfix_expression_rec in turn) still reigns supreme. However, tracing the chain of calls back this time now reveals this:
                                  170474             <predicate `c_parser_aim.assignment_expression/3' mode 0>  <cycle 2> [15]
                                  198268             <predicate `c_parser_aim.cast_expression/3' mode 0>  <cycle 2> [32]      
[18]     7.2    0.10        0.32  368742         <predicate `c_parser_aim.unary_expression/3' mode 0>  <cycle 2> [18]
                0.03        0.06  171588/338952      <predicate `c_parser_aim.kw/3' mode 0> [30]                           
                0.13        0.13  686352/12057646    <predicate `c_parser_aim.lexer.token/3' mode 0> [7]
                0.03        0.03  686352/11652569    <unification predicate for type `c_parser_aim.lexer.token/0' mode 0> [17]
                                  368742             <predicate `c_parser_aim.postfix_expression/3' mode 0>  <cycle 2> [9]
unary_expression is called approximately an equal number of times by both assignment_expression and cast_expression. This makes it a perfect candidate for memoization:
:- pred cast_expression(typedefs::in, list(char)::in, list(char)::out) is semidet.
:- pragma memo(cast_expression/3, [fast_loose]).

Ad nauseum

And that’s about all there is to it: following this method I found I needed to memoize two more procedures and factor one more, and the parser parsed the test file faster than time could reliably distinguish. Of course, it would behoove us to test on different inputs in order to exercise different grammar productions, but the method is the same.

Summary

To recap:
  1. Use mprof -c to determine which procedure is the largest time hog.
  2. If the procedure is called by primarily one other procedure on a 1:1 basis, or if you have already optimized it, look instead at its most prominent caller (etc.).
  3. If the procedure is called by primarily one other procedure on a greater than 1:1 basis and it is a common prefix of this caller, factor it out.
  4. Otherwise, memoize it.
  5. Recompile and repeat.
This simple method will asymptotically increase the speed of almost any naïve PEG parser, in my experience, from unusable to quite fast, and will avoid the overhead associated with unnecessarily memoizing every grammar production in a parser.

Download c_parser_memo.m

Further reading

The direct encoding of PEGs in Mercury, and the performance implications thereof, were first explored in-depth by Ralph Becket and Zoltan Somogyi in the paper DCGs + Memoing = Packrat Parsing; But is it worth it?; I highly recommend it as further reading for those interested in the subject.

2 comments:

  1. Hi, a while back, when I was part of the Mercury group, I did some research on this, culminating in a paper with Zoltan Somogyi, "DCGs + Memoization = Packrat Parsing -- But Is It Worth It?".

    http://www.mercury.csse.unimelb.edu.au/information/papers/packrat.pdf

    In that work, we used a sensible tokeniser: that is, we examined characters directly from the source string, rather than converting them to list(char) which is vastly more expensive. Speedy tokenisation makes sense anyway since you're going to be doing a lot of it in parsing.

    We found that direct recursive descent (at least against the standard Java grammar) was almost certainly as fast as you could go, taking this approach. The modest amount of backtracking typically was way cheaper than the overheads of memoisation. Note that we made no attempt to optimise the grammar past the following, fairly standard, points: speedy tokenisation; and, at a choice point, committing to the first matching branch (I think this is what PEGs do anyway).

    We absolutely trounced the Rats! parser, which is based on a Packrat parsing toolset with extensive optimisations.

    Unfortunately, our results seem to have been ignored by the community :-)

    ReplyDelete
  2. Hi Rafe, yes, your paper was what first introduced me to the similarities between DCGs and PEGs. Now that I think of it I was remiss to give it a shout out in this series; I'll amend that.

    I don't doubt the {string, int} parser state approach to be faster; I left it out for pedagogical purposes. I should rework my parser to use that approach and see if there are significant speedups; I'll post my results when I do.

    ReplyDelete