So, last week we built a C99 parser. A slow C99 parser. At the end we added one line of code that sped things up a bunch. Let’s look at that.
Why recursive descent parsing is slow
Right now our PEG parser is what’s known as a recursive descent parser – it recursively descends down the parse tree looking for matches. Unfortunately, recursive descent is very slow, for two reasons:
- One grammar rule may try to parse a given span of input the same way multiple times.
- Multiple grammar rules may try to parse a given span of input the same way multiple times.
An example of the former is in the selection_statement production (Typedefs omitted for clarity):
selection_statement -->
(kw("if"), p("("), expression, p(")"), statement, kw("else"), statement) -> [];
(kw("if"), p("("), expression, p(")"), statement) -> [];
(kw("switch"), p("("), expression, p(")"), statement).
The string “if (x > 0) x--; return;” will be parsed once by the first branch, which fails when it finds “return” instead of “else”, and will then be reparsed by the second branch, which will succeed. This sort of prefix repetition can be eliminated by factoring out the repeated portion:
selection_statement -->
(kw("if"), p("("), expression, p(")"), statement,
((kw("else"), statement) -> []; [])) -> [];
(kw("switch"), p("("), expression, p(")"), statement).
Of course, this means that our code is no longer structurally identical to the grammar we are parsing, but this may be a welcome tradeoff.
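The cost of an unfactored alternation can be made concrete with a small sketch. This is hypothetical Python, not the Mercury parser: a toy expr stand-in counts how many times it runs under each formulation.

```python
# Hypothetical sketch (not the article's Mercury code): counting how often a
# shared prefix is re-parsed by an unfactored vs. a factored alternation.

calls = {"expr": 0}

def expr(s, i):
    """Toy stand-in for an expensive sub-parse; matches one letter."""
    calls["expr"] += 1
    return i + 1 if i < len(s) and s[i].isalpha() else None

def lit(ch):
    def parse(s, i):
        return i + 1 if i < len(s) and s[i] == ch else None
    return parse

def unfactored(s, i):
    # (expr, ';', '!') / (expr, ';') -- expr is parsed twice when '!' is absent
    j = expr(s, i)
    if j is not None:
        k = lit(";")(s, j)
        if k is not None:
            m = lit("!")(s, k)
            if m is not None:
                return m
    j = expr(s, i)              # re-parse of the shared prefix
    if j is not None:
        return lit(";")(s, j)
    return None

def factored(s, i):
    # expr, ';', ('!' / epsilon) -- shared prefix parsed exactly once
    j = expr(s, i)
    if j is None:
        return None
    k = lit(";")(s, j)
    if k is None:
        return None
    m = lit("!")(s, k)
    return m if m is not None else k

calls["expr"] = 0
unfactored("x;", 0)
print(calls["expr"])            # 2: prefix parsed twice

calls["expr"] = 0
factored("x;", 0)
print(calls["expr"])            # 1: prefix parsed once
```

The counter is exactly the difference the factoring above buys us, before any memoization enters the picture.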
An example of the latter, which is more pernicious, is given by, well, almost every production. Why? Consider statement:
statement -->
labeled_statement -> [];
compound_statement -> [];
expression_statement -> [];
selection_statement -> [];
iteration_statement -> [];
jump_statement.
Pretty innocuous, right? Take a look at each of the productions it invokes. Almost all of them start with a call to kw, which in turn calls token. In fact, all of them start with a call, directly or indirectly, to token – and this is true of every production in the parser. token is not the only culprit, but it is the most pervasive.
This problem can’t be solved through a simple restructuring of the grammar. For this we need a more powerful tool: memoization.
Memoization
Memoization is one of my favorite features. Simply put, memoization of a procedure means maintaining a table of its past input and output values, and looking up output values in this table when the procedure is called in the future with the same input values. Of course, this only works if the procedure has no observable side effects; fortunately this is true of most Mercury code. Mercury provides built-in support for memoization via the memo pragma. Basic usage is as follows:
:- pragma memo(Name/Arity).
By default, Mercury memoizes inputs by value. This will catch all repeated calls to a procedure, but can be slow for large structures. We can specify that we would like Mercury to memoize inputs by address instead:
:- pragma memo(Name/Arity, [fast_loose]).
Memoizing by address is much faster but may miss calls with identical values at different addresses. However, because our primary data structure is a list of characters which is very large (meaning memoize-by-value will be slow) and which we are only destructing (meaning identical sublists of this list must be physically the same), we will memoize by address.
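The distinction between the two strategies can be sketched in Python. This is a rough analogue, not Mercury: Python has no fast_loose, so id() stands in for address comparison, and all names here are illustrative.

```python
# Rough Python analogue of Mercury's two memo strategies:
# by value (hash the whole argument) vs. fast_loose (compare addresses only).

def memo_by_value(f):
    table = {}
    def wrapper(arg):
        key = tuple(arg)        # hashes the entire structure: O(n) per call
        if key not in table:
            table[key] = f(arg)
        return table[key]
    return wrapper

def memo_fast_loose(f):
    table = {}
    def wrapper(arg):
        key = id(arg)           # O(1), but only hits on the *same* object
        if key not in table:
            table[key] = f(arg)
        return table[key]
    return wrapper

suffix = list("abc")
other  = list("abc")            # equal value, different address

hits = {"n": 0}
@memo_fast_loose
def head(chars):
    hits["n"] += 1
    return chars[0]

head(suffix); head(suffix)      # second call is a table hit
head(other)                     # equal value, different object: a miss
print(hits["n"])                # 2

vhits = {"n": 0}
@memo_by_value
def head_v(chars):
    vhits["n"] += 1
    return chars[0]

head_v(suffix); head_v(other)   # equal values share one entry
print(vhits["n"])               # 1
```

The fast_loose miss on `other` is exactly the risk described above – and exactly why it is safe here: since the parser only destructs one input list, every suffix it passes around is physically shared.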
Profiling
Now, although we could memoize every procedure in our lexer and parser, memoization does have overhead, and so we should not use it on procedures which are not called often or with repeated inputs. We should limit our scope to grammar rules which are invoked by many different rules, or multiple times by one rule, since these are the rules likely to be invoked repeatedly on the same input.
To find such candidates for memoization, we can use runtime code profiling. To generate profiling data, we simply compile with the -p switch and then run our program with our test from last week (minus the memoization of token we added at the end):
$ mmc -p --infer-all --make c_parser_direct
$ echo 'int main(void) {}' | ./c_parser_direct
Let that churn for a few seconds, and then kill it with Ctrl+C. Now run mprof to find out what’s been taking so long:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
68.5 8.70 8.70 17602295 0.00 0.00 <function `string.to_char_list/1' mode 0> [1]
7.1 9.60 0.90 17602483 0.00 0.00 <predicate `builtin.unify/2' mode 0> [2]
6.3 10.40 0.80 202300 0.00 0.00 <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [3]
5.9 11.15 0.75 202388 0.00 0.00 <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [4]
4.7 11.75 0.60 10114349 0.00 0.00 <predicate `c_parser_aim.lexer.punctuator_aux/3' mode 0> [5]
4.3 12.30 0.55 7487945 0.00 0.00 <predicate `c_parser_aim.lexer.keyword_aux/3' mode 0> [6]
Over 95% of that run was spent just checking for keywords and punctuators! In particular, the biggest culprit, string.to_char_list, is highly suspect. To find out who’s calling it, we can run mprof -c [note: sometimes I have had to run mprof -cd instead; I can’t reproduce this now so I’m not sure why] and search for the callers of string.to_char_list:
3.70 3.70 7487945/17602295 <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [24]
5.00 5.00 10114350/17602295 <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [22]
[21] 68.5 8.70 8.70 17602295 <function `string.to_char_list/1' mode 0> [21]
The [21] in the left-most column indicates that this section is dedicated to procedure #21, string.to_char_list, which was called 17,602,295 times. The two lines above it indicate that both lexer.keyword_rec and lexer.punctuator_rec call string.to_char_list – 7,487,945 and 10,114,350 times respectively. Of course: we use string.to_char_list to convert our keywords and punctuators to character lists for these two procedures to match against the input data. But clearly we don't have 7,487,945 keywords or 10,114,350 punctuators, so string.to_char_list must be getting called with the same arguments repeatedly. Let’s fix that:
:- func to_char_list_memo(string) = list(char).
:- pragma memo(to_char_list_memo/1, [fast_loose]).
to_char_list_memo(S) = to_char_list(S).
Unfortunately we can’t instruct Mercury to memoize procedures in other modules, so we must define to_char_list_memo ourselves, as above. We can then replace the two calls to to_char_list in lexer.keyword_rec and lexer.punctuator_rec with the memoized version. Compiling and reprofiling tells us:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
45.2 1.65 1.65 10400248 0.00 0.00 <function `c_parser_aim.to_char_list_memo/1' mode 0> [22]
12.3 2.10 0.45 10400437 0.00 0.00 <predicate `builtin.unify/2' mode 0> [29]
12.3 2.55 0.45 5975244 0.00 0.00 <predicate `c_parser_aim.lexer.punctuator_aux/3' mode 0> [27]
12.3 3.00 0.45 119606 0.00 0.01 <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [24]
8.2 3.30 0.30 4425004 0.00 0.00 <predicate `c_parser_aim.lexer.keyword_aux/3' mode 0> [28]
4.1 3.45 0.15 119517 0.00 0.02 <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [21]
Tracking down token
We’ve reduced the percentage of total runtime consumed by to_char_list, but it’s still getting called an awful lot. We know keyword_rec and punctuator_rec don’t call it any more often than they need to, so we need to find out who’s calling them. Let’s follow keyword_rec:
0.45 1.64 119606/119606 <predicate `c_parser_aim.lexer.keyword/3' mode 0> [23]
[24] 45.0 0.45 1.64 119606 <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [24]
0.70 0.70 4425004/10400248 <function `c_parser_aim.to_char_list_memo/1' mode 0> [22]
0.30 0.49 4425004/4425004 <predicate `c_parser_aim.lexer.keyword_aux/3' mode 0> [28]
keyword is the only caller – no surprise there. Let’s go up one more level:
0.00 1.64 119606/119606 <predicate `c_parser_aim.lexer.token/3' mode 0> [1]
[23] 45.0 0.00 1.64 119606 <predicate `c_parser_aim.lexer.keyword/3' mode 0> [23]
0.00 0.00 57/239246 <predicate `c_parser_aim.lexer.identifier_nondigit/3' mode 0> [67]
0.45 1.64 119606/119606 <predicate `c_parser_aim.lexer.keyword_rec/4' mode 0> [24]
0.00 0.00 57/836707 <predicate `char.is_digit/1' mode 0> [59]
Again, no surprise – keyword is called only by token. One more level up:
0.00 0.21 7009/119606 <predicate `c_parser_aim.cast_expression/3' mode 0> [8]
0.00 0.00 2/119606 <predicate `c_parser_aim.compound_statement/3' mode 0> <cycle 3> [55]
0.00 0.00 117/119606 <predicate `c_parser_aim.declaration_specifiers/4' mode 0> <cycle 4> [35]
0.00 0.00 4/119606 <predicate `c_parser_aim.direct_abstract_declarator/3' mode 0> [52]
0.00 0.00 9/119606 <predicate `c_parser_aim.direct_declarator/4' mode 0> <cycle 2> [50]
0.00 0.00 4/119606 <predicate `c_parser_aim.direct_declarator_rec/3' mode 0> <cycle 2> [47]
0.00 0.00 36/119606 <predicate `c_parser_aim.enum_specifier/3' mode 0> [42]
0.01 0.43 14042/119606 <predicate `c_parser_aim.kw/3' mode 0> [30]
0.00 0.00 3/119606 <predicate `c_parser_aim.labeled_statement/3' mode 0> [53]
0.00 0.00 8/119606 <predicate `c_parser_aim.p/3' mode 0> [51]
0.00 0.00 2/119606 <predicate `c_parser_aim.parameter_list/3' mode 0> <cycle 2> [54]
0.00 0.00 1/119606 <predicate `c_parser_aim.parameter_type_list/3' mode 0> <cycle 2> [56]
0.00 0.00 26/119606 <predicate `c_parser_aim.pointer/2' mode 0> [45]
0.01 0.43 14019/119606 <predicate `c_parser_aim.postfix_expression/3' mode 0> [25]
0.01 0.86 28036/119606 <predicate `c_parser_aim.primary_expression/3' mode 0> [26]
0.00 0.00 48/119606 <predicate `c_parser_aim.struct_or_union/2' mode 0> [41]
0.00 0.00 36/119606 <predicate `c_parser_aim.type_qualifier/2' mode 0> [43]
0.00 0.00 120/119606 <predicate `c_parser_aim.type_specifier/3' mode 0> [37]
0.00 0.00 12/119606 <predicate `c_parser_aim.typedef_name/3' mode 0> [48]
0.02 1.71 56072/119606 <predicate `c_parser_aim.unary_expression/3' mode 0> [20]
[1] 100.0 0.05 3.65 119606 <predicate `c_parser_aim.lexer.token/3' mode 0> [1]
0.00 0.00 119517/119517 <function `c_parser_aim.lexer.punctuator_list/0' mode 0> [66]
0.00 0.00 31/31 <function `string.from_char_list/1' mode 0> [60]
0.00 0.10 119517/119517 <predicate `c_parser_aim.lexer.constant/2' mode 0> [32]
0.00 0.00 119548/239246 <predicate `c_parser_aim.lexer.identifier_nondigit/3' mode 0> [67]
0.00 0.00 31/31 <predicate `c_parser_aim.lexer.identifier_rec/3' mode 0> [71]
0.00 1.64 119606/119606 <predicate `c_parser_aim.lexer.keyword/3' mode 0> [23]
0.15 1.81 119517/119517 <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [21]
0.00 0.05 119606/119606 <predicate `c_parser_aim.lexer.whitespace/2' mode 0> [34]
And there’s the culprit. token is called by many other procedures, and as discussed above, almost certainly with repeated inputs. Let’s memoize token:
:- pragma memo(token/3, [fast_loose]).
Now (as we found out last week), parsing our simple test completes in a reasonable amount of time. Let’s give our parser something bigger to chew on.
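Why this one pragma pays off so dramatically can be sketched as follows. This is hypothetical Python, not the Mercury lexer: the table is keyed by input position (one source string assumed), which plays the role that fast_loose address comparison plays on our shared character list.

```python
# Sketch of why memoizing the tokenizer pays off: many grammar rules probe the
# same position, but the underlying tokenizer only needs to run once per position.

raw_calls = {"n": 0}

def token_raw(s, i):
    raw_calls["n"] += 1
    # skip spaces, then take a maximal run of non-space characters
    while i < len(s) and s[i] == " ":
        i += 1
    j = i
    while j < len(s) and s[j] != " ":
        j += 1
    return (s[i:j], j) if j > i else None

_table = {}                     # position -> result; one source string assumed
def token(s, i):
    if i not in _table:
        _table[i] = token_raw(s, i)
    return _table[i]

src = "if x else y"
# simulate six different productions all trying to parse from position 0
for _ in range(6):
    token(src, 0)
print(raw_calls["n"])           # 1: the underlying tokenizer ran once
```

Without the table, each of the six rules would have re-lexed the same characters – which is precisely the 17-million-call blowup the profile showed.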
Factoring out multiple calls from multiplicative_expression
For stress testing, I used a 6 KB program from Glen McCluskey’s C99 test suite, tdes_047.c. (It is available in the set of free samples if you wish to try it yourself.) We can use this (or any C99 code) as follows:
$ cpp -P c9_samp/tdes_047.c | ./c_parser_direct
And of course, we are met with no end to the parsing. mprof still tells us that token is responsible for the most CPU usage, but since we’ve already memoized it, let’s work our way up the chain some more. Who is responsible for most of the calls to token?
0.23 0.23 1358257/95678933 <predicate `c_parser_aim.additive_expression/3' mode 0> <cycle 2> [19]
0.00 0.00 7546/95678933 <predicate `c_parser_aim.and_expression/3' mode 0> <cycle 2> [24]
0.00 0.00 2370/95678933 <predicate `c_parser_aim.assignment_operator/2' mode 0> [29]
1.36 1.36 8218754/95678933 <predicate `c_parser_aim.cast_expression/3' mode 0> <cycle 2> [12]
0.00 0.00 7/95678933 <predicate `c_parser_aim.compound_statement/3' mode 0> [14]
0.00 0.00 236/95678933 <predicate `c_parser_aim.conditional_expression/3' mode 0> <cycle 2> [50]
0.00 0.00 14/95678933 <predicate `c_parser_aim.declaration/4' mode 0> [5]
0.00 0.00 457/95678933 <predicate `c_parser_aim.declaration_specifiers/4' mode 0> <cycle 2> [43]
0.00 0.00 392/95678933 <predicate `c_parser_aim.designator/3' mode 0> [48]
0.00 0.00 95/95678933 <predicate `c_parser_aim.direct_abstract_declarator/3' mode 0> [53]
0.00 0.00 368/95678933 <predicate `c_parser_aim.direct_declarator/4' mode 0> <cycle 2> [49]
0.00 0.00 890/95678933 <predicate `c_parser_aim.direct_declarator_rec/3' mode 0> <cycle 2> [27]
0.00 0.00 999/95678933 <predicate `c_parser_aim.enum_specifier/3' mode 0> [38]
0.01 0.01 30183/95678933 <predicate `c_parser_aim.equality_expression/3' mode 0> <cycle 2> [23]
0.00 0.00 3773/95678933 <predicate `c_parser_aim.exclusive_or_expression/3' mode 0> <cycle 2> [26]
0.00 0.00 2/95678933 <predicate `c_parser_aim.identifier_list/2' mode 0> [57]
0.00 0.00 1886/95678933 <predicate `c_parser_aim.inclusive_or_expression/3' mode 0> <cycle 2> [31]
0.00 0.00 17/95678933 <predicate `c_parser_aim.init_declarator/4' mode 0> [8]
0.00 0.00 8/95678933 <predicate `c_parser_aim.init_declarator_list/4' mode 0> [7]
0.00 0.00 2/95678933 <predicate `c_parser_aim.initializer/3' mode 0> <cycle 2> [56]
0.00 0.00 48/95678933 <predicate `c_parser_aim.initializer_list/3' mode 0> <cycle 2> [45]
0.02 0.02 139104/95678933 <predicate `c_parser_aim.kw/3' mode 0> [22]
0.00 0.00 943/95678933 <predicate `c_parser_aim.logical_and_expression/3' mode 0> <cycle 2> [39]
0.00 0.00 472/95678933 <predicate `c_parser_aim.logical_or_expression/3' mode 0> <cycle 2> [44]
1.01 1.01 6112154/95678933 <predicate `c_parser_aim.multiplicative_expression/3' mode 0> <cycle 2> [17]
0.00 0.00 1474/95678933 <predicate `c_parser_aim.p/3' mode 0> [34]
0.00 0.00 1172/95678933 <predicate `c_parser_aim.pointer/2' mode 0> [32]
2.72 2.72 16437987/95678933 <predicate `c_parser_aim.postfix_expression/3' mode 0> <cycle 2> [4]
8.10 8.10 48898650/95678933 <predicate `c_parser_aim.postfix_expression_rec/3' mode 0> [9]
2.20 2.20 13264944/95678933 <predicate `c_parser_aim.primary_expression/3' mode 0> [13]
0.03 0.03 181101/95678933 <predicate `c_parser_aim.relational_expression/3' mode 0> <cycle 2> [21]
0.08 0.08 452752/95678933 <predicate `c_parser_aim.shift_expression/3' mode 0> <cycle 2> [20]
0.00 0.00 68/95678933 <predicate `c_parser_aim.struct_declaration/3' mode 0> <cycle 2> [54]
0.00 0.00 136/95678933 <predicate `c_parser_aim.struct_declarator/3' mode 0> <cycle 2> [52]
0.00 0.00 68/95678933 <predicate `c_parser_aim.struct_declarator_list/3' mode 0> <cycle 2> [55]
0.00 0.00 1404/95678933 <predicate `c_parser_aim.struct_or_union/2' mode 0> [36]
0.00 0.00 102/95678933 <predicate `c_parser_aim.struct_or_union_specifier/3' mode 0> <cycle 2> [33]
0.00 0.00 2334/95678933 <predicate `c_parser_aim.type_qualifier/2' mode 0> [30]
0.00 0.00 4079/95678933 <predicate `c_parser_aim.type_specifier/3' mode 0> <cycle 2> [25]
0.00 0.00 333/95678933 <predicate `c_parser_aim.typedef_name/3' mode 0> [51]
0.09 0.09 553352/95678933 <predicate `c_parser_aim.unary_expression/3' mode 0> <cycle 2> [18]
[6] 52.1 15.85 15.85 95678933 <predicate `c_parser_aim.lexer.token/3' mode 0> [6]
0.00 0.00 96/96 <function `c_parser_aim.lexer.punctuator_list/0' mode 0> [81]
0.00 0.00 58/58 <function `string.from_char_list/1' mode 0> [64]
0.00 0.00 108/108 <predicate `c_parser_aim.lexer.constant/2' mode 0> [88]
0.00 0.00 166/412 <predicate `c_parser_aim.lexer.identifier_nondigit/3' mode 0> [86]
0.00 0.00 58/58 <predicate `c_parser_aim.lexer.identifier_rec/3' mode 0> [97]
0.00 0.00 225/225 <predicate `c_parser_aim.lexer.keyword/3' mode 0> [83]
0.00 0.00 96/96 <predicate `c_parser_aim.lexer.punctuator_rec/4' mode 0> [95]
0.00 0.00 225/225 <predicate `c_parser_aim.lexer.whitespace/2' mode 0> [82]
It seems like postfix_expression_rec is the winner; it is responsible for over half the calls to token. Tracing it back similarly to how we traced keyword_rec, we find that it is called solely by postfix_expression 8,149,775 times, which is called solely by unary_expression a similar number of times, which is called almost exclusively by cast_expression a similar number of times, which is called almost exclusively by multiplicative_expression – which is called… only 2,054,666 times:
2054666 <predicate `c_parser_aim.additive_expression/3' mode 0> <cycle 2> [19]
[17] 7.2 0.70 2.18 2054666 <predicate `c_parser_aim.multiplicative_expression/3' mode 0> <cycle 2> [17]
1.01 1.01 6112154/95678933 <predicate `c_parser_aim.lexer.token/3' mode 0> [6]
0.47 0.47 6112154/87459316 <unification predicate for type `c_parser_aim.lexer.token/0' mode 0> [11]
8218660 <predicate `c_parser_aim.cast_expression/3' mode 0> <cycle 2> [12]
In other words, multiplicative_expression is responsible for multiplying the number of calls passing through it to token fourfold. Let’s look at multiplicative_expression (again, with Typedefs elided):
multiplicative_expression -->
(cast_expression, p("*"), multiplicative_expression) -> [];
(cast_expression, p("/"), multiplicative_expression) -> [];
(cast_expression, p("%"), multiplicative_expression) -> [];
cast_expression.
A-ha, a repeated prefix! We know how to fix this:
multiplicative_expression -->
cast_expression,
((p("*"), multiplicative_expression) -> [];
(p("/"), multiplicative_expression) -> [];
(p("%"), multiplicative_expression) -> [];
[]).
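The factored shape can be sketched in Python as well – hypothetical helpers again, with a toy cast_expression that counts its own invocations so the effect of the factoring is visible.

```python
# Hypothetical Python rendering of the factored multiplicative_expression:
# parse the operand once, then look for an optional operator tail.

cast_calls = {"n": 0}

def cast_expression(s, i):
    """Toy operand parser: matches a single digit."""
    cast_calls["n"] += 1
    return i + 1 if i < len(s) and s[i].isdigit() else None

def multiplicative_expression(s, i):
    # cast_expression, then optionally ('*'|'/'|'%') multiplicative_expression.
    # Each operand is parsed exactly once, whichever operator (or none) follows.
    j = cast_expression(s, i)
    if j is None:
        return None
    if j < len(s) and s[j] in "*/%":
        k = multiplicative_expression(s, j + 1)
        if k is not None:
            return k
    return j

cast_calls["n"] = 0
multiplicative_expression("1*2%3", 0)
print(cast_calls["n"])          # 3: one parse per operand, no reparsing
```

In the unfactored version, an input ending without “*” would have re-parsed its leading cast_expression for the “/”, “%”, and bare branches in turn – up to four parses of the same operand, matching the fourfold multiplication seen in the profile.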
Or, if we’d prefer to leave the structure of the grammar intact at the cost of memoization overhead, we can memoize cast_expression:
:- pred cast_expression(list(char)::in, list(char)::out) is semidet.
:- pragma memo(cast_expression/2, [fast_loose]).
(Note that the :- pred declaration is necessary, as you cannot memoize undeclared modes.)
Notice that most of the other _expression productions follow the same pattern; all of them will in fact benefit from prefix factoring or memoization. Let’s do that and move on.
Memoizing multiple calls to unary_expression
Despite all our optimization so far, token (and postfix_expression_rec in turn) still reigns supreme. However, tracing the chain of calls back this time reveals this:
170474 <predicate `c_parser_aim.assignment_expression/3' mode 0> <cycle 2> [15]
198268 <predicate `c_parser_aim.cast_expression/3' mode 0> <cycle 2> [32]
[18] 7.2 0.10 0.32 368742 <predicate `c_parser_aim.unary_expression/3' mode 0> <cycle 2> [18]
0.03 0.06 171588/338952 <predicate `c_parser_aim.kw/3' mode 0> [30]
0.13 0.13 686352/12057646 <predicate `c_parser_aim.lexer.token/3' mode 0> [7]
0.03 0.03 686352/11652569 <unification predicate for type `c_parser_aim.lexer.token/0' mode 0> [17]
368742 <predicate `c_parser_aim.postfix_expression/3' mode 0> <cycle 2> [9]
unary_expression is called approximately an equal number of times by both assignment_expression and cast_expression. This makes it a perfect candidate for memoization:
:- pred unary_expression(typedefs::in, list(char)::in, list(char)::out) is semidet.
:- pragma memo(unary_expression/3, [fast_loose]).
Ad nauseam
And that’s about all there is to it: following this method, I found I needed to memoize two more procedures and factor one more, and then the parser parsed the test file faster than time could reliably distinguish. Of course, it would behoove us to test on different inputs in order to exercise different grammar productions, but the method is the same.
Summary
To recap:
- Use mprof -c to determine which procedure is the largest time hog.
- If the procedure is called by primarily one other procedure on a 1:1 basis, or if you have already optimized it, look instead at its most prominent caller (etc.).
- If the procedure is called by primarily one other procedure on a greater than 1:1 basis and it is a common prefix of this caller, factor it out.
- Otherwise, memoize it.
- Recompile and repeat.
In my experience, this simple method will asymptotically increase the speed of almost any naïve PEG parser, taking it from unusable to quite fast, while avoiding the overhead associated with unnecessarily memoizing every grammar production in the parser.
Download
c_parser_memo.m
Further reading
The direct encoding of PEGs in Mercury, and the performance implications thereof, were first explored in depth by Ralph Becket and Zoltan Somogyi in the paper “DCGs + Memoing = Packrat Parsing; But is it worth it?”; I highly recommend it as further reading for those interested in the subject.
Hi, a while back, when I was part of the Mercury group, I did some research on this, culminating in a paper with Zoltan Somogyi, "DCGs + Memoization = Packrat Parsing -- But Is It Worth It?".
http://www.mercury.csse.unimelb.edu.au/information/papers/packrat.pdf
In that work, we used a sensible tokeniser: that is, we examined characters directly from the source string, rather than converting them to list(char) which is vastly more expensive. Speedy tokenisation makes sense anyway since you're going to be doing a lot of it in parsing.
We found that direct recursive descent (at least against the standard Java grammar) was almost certainly as fast as you could go, taking this approach. The modest amount of backtracking typically was way cheaper than the overheads of memoisation. Note that we made no attempt to optimise the grammar past the following, fairly standard, points: speedy tokenisation; and, at a choice point, committing to the first matching branch (I think this is what PEGs do anyway).
We absolutely trounced the Rats! parser, which is based on a Packrat parsing toolset with extensive optimisations.
Unfortunately, our results seem to have been ignored by the community :-)
Hi Rafe, yes, your paper was what first introduced me to the similarities between DCGs and PEGs. Now that I think of it, I was remiss not to give it a shout-out in this series; I'll amend that.
I don't doubt the {string, int} parser state approach to be faster; I left it out for pedagogical purposes. I should rework my parser to use that approach and see if there are significant speedups; I'll post my results when I do.