Why recursive descent parsing is slow
Right now our PEG parser is what’s known as a recursive descent parser – it recursively descends the parse tree looking for matches. Unfortunately, recursive descent is very slow, for two reasons:

- One grammar rule may try to parse a given span of input the same way multiple times.
- Multiple grammar rules may try to parse a given span of input the same way multiple times.
An example of the former is the selection_statement production (Typedefs omitted for clarity):

selection_statement -->
    (kw("if"), p("("), expression, p(")"), statement, kw("else"), statement) -> [];
    (kw("if"), p("("), expression, p(")"), statement) -> [];
    (kw("switch"), p("("), expression, p(")"), statement).
The string “if (x > 0) x--; return;” will be parsed once by the first branch, which fails when it finds “return” instead of “else”, and will then be reparsed by the second branch, which will succeed. This sort of prefix-repetition can be eliminated by factoring out the repeated portion:

selection_statement -->
    (kw("if"), p("("), expression, p(")"), statement,
        ((kw("else"), statement) -> []; [])) -> [];
    (kw("switch"), p("("), expression, p(")"), statement).
Of course, this means that our code is no longer structurally identical to the grammar we are parsing, but this may be a welcome tradeoff.

An example of the latter, which is more pernicious, is given by, well, almost every production. Why? Consider statement:

statement -->
    labeled_statement -> [];
    compound_statement -> [];
    expression_statement -> [];
    selection_statement -> [];
    iteration_statement -> [];
    jump_statement.
Pretty innocuous, right? Take a look at each of the productions it invokes. Almost all of them start with a call to kw, which in turn calls token. In fact, all of them start with a call, directly or indirectly, to token – and this is true of every production in the parser. token is not the only culprit, but it is the most pervasive.

This problem can’t be solved through a simple restructuring of the grammar. For this we need a more powerful tool: memoization.
Memoization
Memoization is one of my favorite Mercury features. Simply put, Mercury can cache the results of a procedure call, keyed on its inputs, so that repeated calls with the same inputs are answered from the cache instead of being recomputed; we request this with the memo pragma. Basic usage is so:

:- pragma memo(Name/Arity).
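For instance, here is a minimal standalone sketch (a hypothetical fib function, not part of our parser). Without the pragma, the doubly recursive definition takes exponential time; with it, each input is computed at most once:

    % Assumes :- import_module int. in the enclosing module.
:- func fib(int) = int.
:- pragma memo(fib/1).

    % Each distinct N is computed once and answered from the cache thereafter.
fib(N) = ( if N =< 1 then N else fib(N - 1) + fib(N - 2) ).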
By default, Mercury memoizes inputs by value. This will catch all repeated calls to a procedure, but can be slow for large structures. We can specify that we would like Mercury to memoize inputs by address instead:

:- pragma memo(Name/Arity, [fast_loose]).
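To see the difference in practice, here is a small self-contained sketch (hypothetical, not part of our parser). The trace goal prints only when the body actually runs, so calling length_memo twice on the same physical list prints just once – the second call is answered from the address-keyed cache:

:- module fast_loose_demo.
:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.
:- implementation.
:- import_module char.
:- import_module int.
:- import_module list.
:- import_module string.

:- pred length_memo(list(char)::in, int::out) is det.
:- pragma memo(length_memo/2, [fast_loose]).

length_memo(Cs, N) :-
    trace [io(!IO)] ( io.write_string("computing...\n", !IO) ),
    list.length(Cs, N).

main(!IO) :-
    Cs = string.to_char_list("hello world"),
    length_memo(Cs, N1),    % runs the body: prints "computing..."
    length_memo(Cs, N2),    % same address: cache hit, nothing printed
    io.format("%d %d\n", [i(N1), i(N2)], !IO).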
Memoizing by address is much faster, but may miss calls with identical values at different addresses. However, because our primary data structure is a list of characters which is very large (meaning memoize-by-value would be slow) and which we only ever deconstruct (meaning identical sublists of this list must be physically the same), we will memoize by address.

Profiling
Now, although we could memoize every procedure in our lexer and parser, memoization does have overhead, and so we should not use it on procedures which are not called often or with repeated inputs. We should limit our scope to grammar rules which are invoked by many different rules, or multiple times by one rule, since such rules are likely to be invoked repeatedly on the same input.

To find such candidates for memoization, we can use runtime code profiling. To generate profiling data, we simply compile with the -p switch and then run our program with our test from last week (minus the memoization of token we added at the end):

$ mmc -p --infer-all --make c_parser_direct
$ echo 'int main(void) {}' | ./c_parser_direct

Let that churn for a few seconds, and then kill it with Ctrl+C. Now run mprof to find out what’s been taking so long. Over 95% of that run was spent just checking for keywords and punctuators! In particular, the biggest culprit, string.to_char_list, is highly suspect. To find out who’s calling it, we can run mprof -c [note: sometimes I have had to run mprof -cd instead; I can’t reproduce this now, so I’m not sure why] and search for the callers of string.to_char_list.
The [21] in the left-most column of the output indicates that this section is dedicated to procedure #21, string.to_char_list, which was called 17,602,295 times. The two lines above it indicate that both lexer.keyword_rec and lexer.punctuator_rec call string.to_char_list – 7,487,945 and 10,114,350 times respectively. Of course: we use string.to_char_list to convert our keywords and punctuators to character lists for these two procedures to match against the input data. But clearly we don’t have 7,487,945 keywords or 10,114,350 punctuators, so string.to_char_list must be getting called with the same arguments repeatedly. Let’s fix that:

:- func to_char_list_memo(string) = list(char).
:- pragma memo(to_char_list_memo/1, [fast_loose]).

to_char_list_memo(S) = to_char_list(S).
Unfortunately we can’t instruct Mercury to memoize procedures in other modules, so we must define to_char_list_memo ourselves, as above. We can then replace the two calls to string.to_char_list in lexer.keyword_rec and lexer.punctuator_rec with the memoized version; the shape of the change is sketched below.
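For illustration, here is the substitution in a hypothetical prefix-matching helper; the real keyword_rec and punctuator_rec in our lexer differ, but the change has the same shape:

    % Hypothetical helper: succeeds if the input starts with S’s characters,
    % returning the remaining input. Assumes list, char, and string are imported.
:- pred match_prefix(string::in, list(char)::in, list(char)::out) is semidet.

match_prefix(S, Chars0, Rest) :-
    % Previously: list.append(to_char_list(S), Rest, Chars0).
    list.append(to_char_list_memo(S), Rest, Chars0).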
Now recompile and reprofile.

Tracking down token
We’ve reduced the percentage of total runtime consumed by to_char_list, but it’s still getting called an awful lot. We know keyword_rec and punctuator_rec don’t call it any more often than they need to, so we need to find out who’s calling them. Let’s follow keyword_rec first: keyword is its only caller, no surprise there. Going up one more level brings no surprise either: keyword is called only by token.

And there’s the culprit. token is called by many other procedures, and, as discussed above, almost certainly with repeated inputs. Let’s memoize token:

:- pragma memo(token/3, [fast_loose]).
Now (as we found out last week), parsing our simple test completes in a reasonable amount of time. Let’s give our parser something bigger to chew on.

Factoring out multiple calls from multiplicative_expression
For stress testing, I used a 6 KB program from Glen McCluskey’s C99 test suite, tdes_047.c. (It is available in the set of free samples if you wish to try it yourself.) We can use this (or any C99 code) as follows:

$ cpp -P c9_samp/tdes_047.c | ./c_parser_direct

And of course, we are met with no end to the parsing. mprof still tells us that token is responsible for the most CPU usage, but since we’ve already memoized it, let’s work our way up the chain some more. Who is responsible for most of the calls to token? It seems like postfix_expression_rec is the winner; it is responsible for over half the calls to token. Tracing it back similarly to how we traced keyword_rec, we find that it is called solely by postfix_expression 8,149,775 times, which is called solely by unary_expression a similar number of times, which is called almost exclusively by cast_expression a similar number of times, which is called almost exclusively by multiplicative_expression, which is called… only 2,054,666 times. In other words,
multiplicative_expression is responsible for multiplying the number of calls passing through it to token fourfold. Let’s look at multiplicative_expression (again, with Typedefs elided):

multiplicative_expression -->
    (cast_expression, p("*"), multiplicative_expression) -> [];
    (cast_expression, p("/"), multiplicative_expression) -> [];
    (cast_expression, p("%"), multiplicative_expression) -> [];
    cast_expression.
A-ha, a repeated prefix! We know how to fix this:

multiplicative_expression -->
    cast_expression,
    ((p("*"), multiplicative_expression) -> [];
     (p("/"), multiplicative_expression) -> [];
     (p("%"), multiplicative_expression) -> [];
     []).
Or, if we’d prefer to leave the structure of the grammar intact at the cost of memoization overhead, we can memoize cast_expression:

:- pred cast_expression(list(char)::in, list(char)::out) is semidet.
:- pragma memo(cast_expression/2, [fast_loose]).

(Note that the :- pred declaration is necessary, as you cannot memoize undeclared modes.)

Notice that most of the other _expression productions follow the same pattern; all of them will in fact benefit from prefix-factoring or memoization. Let’s do that and move on.

Memoizing multiple calls to unary_expression
Despite all our optimization so far, token (and postfix_expression_rec in turn) still reigns supreme. However, tracing the chain of calls back this time reveals something new: unary_expression is called approximately an equal number of times by both assignment_expression and cast_expression. This makes it a perfect candidate for memoization:

:- pred unary_expression(typedefs::in, list(char)::in, list(char)::out) is semidet.
:- pragma memo(unary_expression/3, [fast_loose]).

Ad nauseam
And that’s about all there is to it: following this method, I found I needed to memoize two more procedures and factor one more, after which the parser parsed the test file faster than time could reliably measure. Of course, it would behoove us to test on different inputs in order to exercise different grammar productions, but the method is the same.

Summary
To recap:

- Use mprof -c to determine which procedure is the largest time hog.
- If the procedure is called primarily by one other procedure on a 1:1 basis, or if you have already optimized it, look instead at its most prominent caller (and so on up the call chain).
- If the procedure is called primarily by one other procedure on a greater-than-1:1 basis and it is a common prefix of that caller, factor it out.
- Otherwise, memoize it.
- Recompile and repeat.
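For reference, here are the memo pragmas we introduced in this post, with arities as they appeared above (the two further procedures I memoized at the end are not listed):

:- pragma memo(to_char_list_memo/1, [fast_loose]).   % lexer helper
:- pragma memo(token/3, [fast_loose]).               % the lexer workhorse
:- pragma memo(cast_expression/2, [fast_loose]).     % /3 once Typedefs are threaded through
:- pragma memo(unary_expression/3, [fast_loose]).    % with Typedefs threaded through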
Download c_parser_memo.m