Saturday, April 2, 2011

Benchmarks Game: spectral-norm

The Computer Language Benchmarks Game measures the performance of programming language implementations over a set of algorithms. Mercury is known for being fast, so let’s put it to the test. We will compare it to four high-performing language implementations, C (gcc), Java (Sun), Haskell (GHC), and OCaml, on a 64-bit Intel Atom D510 (two cores, four hardware threads).

Problem description

Today we will develop a solution to the spectral-norm test, whose description reads:

Calculate an eigenvalue using the power method.

Eigenvectors are vectors which, when multiplied by a matrix, produce a multiple of themselves; that multiple is the eigenvalue associated with the eigenvector. The power method repeatedly multiplies a starting vector by the matrix in question until the vector’s direction converges. The result is an eigenvector, and its associated eigenvalue is then easily found.

To compute the spectral norm of a matrix A, one takes the square root of the largest eigenvalue of AᵀA (which, conveniently, is exactly the eigenvalue the power method finds when applied to AᵀA).
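
In symbols, writing B = AᵀA for the matrix the program iterates, here is a sketch of the quantities involved (this is just the Rayleigh-quotient view of what the approximate function computes later on; v is the vector obtained after 19 multiplications by B, starting from the all-ones vector u):

v = B^19 u
λ ≈ (v · Bv) / (v · v)
spectral norm of A = sqrt(λ)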

Implementation

Let’s take the design of the single-threaded Java entry as a template. This way, we can ensure that we are using the same algorithm as the other implementations, which is a requirement of the Benchmarks Game.

The interface and main predicate are unremarkable:

:- module spectralnorm.                   

:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.

:- implementation.

:- import_module float, int, list, math, std_util, string.  

main(!IO) :-
    command_line_arguments(Args, !IO),
    (if to_int(head(Args), N0) then N = N0 else N = 100),
    format("%.9f\n", [f(approximate(N))], !IO).

Note that we have imported list, which we will use to represent the vectors. One might think that using arrays would perform much better, but surprisingly the gain is negligible in this program (mostly because there are no random accesses), and it would come at the cost of more complicated code.

The approximate function works by iterating the multiplication 20 times and using the final result together with the second-to-last to compute the spectral norm. It is easily written using higher-order functions; in particular, list.foldl and list.map_corresponding are used to compute the dot product of two vectors:

approximate(N) = sqrt(VBv / Vv) :-
    V = pow(multiplyAtAv(N), 19, unit(N)),
    U = multiplyAtAv(N, V),
    VBv = foldl((+), map_corresponding((*), U, V), 0.0),
    Vv = foldl((+), map_corresponding((*), V, V), 0.0).

The unit vector, used as the initial vector, is easily defined recursively:

unit(N) = (if N = 0 then [] else [1.0 | unit(N - 1)]).

Next we define a function a to represent the matrix we are using:

a(I, J) = 1.0 / float((I + J) * (I + J + 1) / 2 + I + 1).

Next we define multiplication of the matrix by a vector using two recursive functions. multiplyAvI produces one element of the result for each row I < N by calling multiplyAvIJ, which multiplies and sums the elements of that row with V; multiplyAv simply starts the process.

multiplyAv(N, V) = multiplyAvI(N, 0, V).

multiplyAvI(N, I, V) = YYs :-
    if I < N then
        Y = multiplyAvIJ(I, 0, V),
        Ys = multiplyAvI(N, I + 1, V),
        YYs = [Y|Ys]
    else
        YYs = [].

multiplyAvIJ(_, _, []) = 0.0.
multiplyAvIJ(I, J, [X|V]) = a(I, J) * X + multiplyAvIJ(I, J + 1, V).

To multiply by the transpose of A, we similarly define multiplyAtv:

multiplyAtv(N, V) = multiplyAtvI(N, 0, V).

multiplyAtvI(N, I, V) = YYs :-
    if I < N then
        Y = multiplyAtvIJ(I, 0, V),
        Ys = multiplyAtvI(N, I + 1, V),
        YYs = [Y|Ys]
    else
        YYs = [].

multiplyAtvIJ(_, _, []) = 0.0.
multiplyAtvIJ(I, J, [X|V]) = a(J, I) * X + multiplyAtvIJ(I, J + 1, V).

Finally, multiplyAtAv ties these together. Note we must declare its type, since we partially apply it in approximate.

:- func multiplyAtAv(int, list(float)) = list(float). 
multiplyAtAv(N, V) = multiplyAtv(N, multiplyAv(N, V)).

Compiling with mmc --infer-all -O6 --intermod-opt --rebuild spectralnorm, we obtain a runtime of about 38.1 seconds for N = 3000. Not too bad, considering that the highly optimized C implementation has a single-threaded runtime of about 26.5 seconds. But we can do better.

Optimization

Compiling with -p and running mprof after running our program tells us that the majority of our time (99%) is spent in multiplyAvIJ and multiplyAtvIJ, as expected. So let's focus our efforts there, and on a, which has apparently been inlined into each of those functions, since it does not otherwise appear in the profile.

Unchecked arithmetic

The most obvious speedup is to replace the floating-point division in a with `unchecked_quotient`, which doesn't check whether its right argument is zero. (We make sure to add parentheses as appropriate, since infix functions bind very tightly.) Recompiling, this reduces our runtime to 29.8 seconds! Similarly, replacing the integer division by two with an `unchecked_right_shift` by 1 reduces the runtime to 28.4 seconds.
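
For concreteness, the rewritten a might look something like this (a sketch; the exact parenthesization in the final entry may differ, but note the parentheses forced on us by the tight binding of the backquoted operators):

a(I, J) = 1.0 `unchecked_quotient`
    float((((I + J) * (I + J + 1)) `unchecked_right_shift` 1) + I + 1).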

Tail-call optimization / accumulator introduction

Next, both multiplyAvIJ and multiplyAtvIJ are written in a fashion which precludes tail-call optimization: they perform an operation on the result of the recursive call, which prevents Mercury from turning them into simple loops. Fortunately, we can direct mmc to transform these functions into accumulator-passing functions (which can be optimized) by using the --introduce-accumulators flag and telling Mercury that floating-point addition is associative:

:- promise all [A, B, C, ABC]
    (A + B) + C = ABC: float <=> A + (B + C) = ABC: float.

Unfortunately, due to bug 193, mmc fails to parse this declaration because of the typecast (which is necessary to distinguish between integer and floating-point addition), so we must instead introduce a dummy function fplus to disambiguate:

:- func fplus(float, float) = float.
A `fplus` B = A + B.

:- promise all [A, B, C, ABC]
    (A `fplus` B) `fplus` C = ABC <=> A `fplus` (B `fplus` C) = ABC.

After replacing our use of + with `fplus` in multiplyAvIJ and multiplyAtvIJ (and again, parenthesizing appropriately), we cut our runtime down to 25.2 seconds -- better than the C implementation!
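
For example, multiplyAvIJ now looks something like this (multiplyAtvIJ is analogous; a sketch, assuming the fplus function and promise above are in scope):

multiplyAvIJ(_, _, []) = 0.0.
multiplyAvIJ(I, J, [X|V]) = (a(I, J) * X) `fplus` multiplyAvIJ(I, J + 1, V).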

An optimization along similar lines is to enable --optimize-constructor-last-call, which can enable tail-call optimization in functions like unit, multiplyAvI, and multiplyAtvI, which construct a list after their recursive call. While this is a win for smaller values of N, it actually degrades performance for larger values so we will not enable it.
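
For reference, trying that flag is just a matter of appending it to the compile line used earlier (shown for illustration only; this is not the command used for the timings below):

mmc --infer-all -O6 --intermod-opt --introduce-accumulators --optimize-constructor-last-call --rebuild spectralnorm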

Parallelization

Comparing our 25.2 seconds to the C implementation’s 26.5 seconds is not really fair, since the C implementation is parallelized, running in 6.9 seconds across four CPUs.

Fortunately, parallelization in Mercury is simple. The & operator solves two goals in parallel, expecting them both to succeed (just like ,). Our job is to decide where best to place this operator.

At the top level, operations in approximate and multiplyAtAv are necessarily sequential, since the result of each successive multiplication depends on the result of the previous.

Moving down a level, the two operations in multiplyAvI and multiplyAtvI seem parallelizable, as they do not depend on one another. Let’s introduce a parallel goal:

multiplyAvI(N, I, V) = YYs :-
    if I < N then
        (Y = multiplyAvIJ(I, 0, V) &
         Ys = multiplyAvI(N, I + 1, V)),
        YYs = [Y|Ys]
    else
        YYs = [].

multiplyAtvI is modified similarly (see the sketch below). Note the placement of the parentheses: & binds more loosely than ,. If we did not have parentheses, the list construction would be moved into the second branch, making it dependent on the first branch, and causing (for reasons unknown to the author) massive amounts of memory to be used.
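
For completeness, the corresponding change to multiplyAtvI would look like this (a sketch mirroring the modification above):

multiplyAtvI(N, I, V) = YYs :-
    if I < N then
        (Y = multiplyAtvIJ(I, 0, V) &
         Ys = multiplyAtvI(N, I + 1, V)),
        YYs = [Y|Ys]
    else
        YYs = [].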

At the lowest level, we could introduce parallel goals in multiplyAvIJ and multiplyAtvIJ, but we should have sufficient parallelization already, and the operations at this level are so short that they would be dwarfed by parallelization overhead.

Compiling with mmc --parallel --infer-all -O6 --intermod-opt --introduce-accumulators --rebuild spectralnorm, we obtain a runtime of 7.2 seconds – less than 5% slower than the C counterpart. The difference between Mercury and C is even less pronounced at N = 5500, coming in at 23.8 and 23.7 seconds respectively.

Results

Results obtained using the Benchmarks Game’s bencher.py program for N = 5500:

  ×   Language          CPU (s)  Elapsed (s)  Memory (kB)  Code (B)  CPU load
 1.0  java                87.24        23.36        15617      1227      385%
 1.0  ocaml               88.94        23.73         4037      1065      391%
 1.0  gcc                 89.98        23.74          763      1992      399%
 1.0  mmc --parallel      90.85        23.83        54974       722      394%
 1.3  ghc                 95.45        29.27         2496      1346      346%
 3.1  mmc -H              70.06        70.09         7118       722      116%
 3.7  mmc                 84.96        84.98        40751       722      112%
 3.9  mmc --java          88.86        88.21        33007       722      113%

Curiously, compiling with the -H flag (which enables the high-level C “hlc” grade) generates much faster single-threaded code. Checking the generated C code in Mercury/cs/spectralnorm.c indicates that tail calls are implemented using loops in this version, compared to calls to a Mercury primitive in the other versions. Unfortunately, the high-level C grade does not support the and-parallelism we are using.
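
(For reference, the mmc -H row above was presumably produced by adding -H to the sequential command line, e.g. mmc -H --infer-all -O6 --intermod-opt --introduce-accumulators --rebuild spectralnorm.)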

Download spectralnorm.m

8 comments:

  1. > (also known as the “Shootout”)

    By people who haven't looked at the website for years and years - please just use the correct name.

    "The Virginia Tech shooting in April 2007 once again pushed gun violence into the media headlines. There was no wish to be associated with or to trivialise the slaughter behind the phrase shootout so the project was renamed back on 20th April 2007"

    http://c2.com/cgi/wiki?GreatComputerLanguageShootout

  2. > Results obtained using the Shootout’s bencher.py program

    Thank you for doing the work to make those measurements yourself instead of demanding that Mercury be included in the benchmarks game!

    You might find some Mercury implementations for those problems in CVS

    http://alioth.debian.org/scm/viewvc.php/shootout/bench/spectralnorm/spectralnorm.mercury?view=markup&revision=1.2&root=shootout

    http://alioth.debian.org/scm/viewvc.php/shootout/bench/?root=shootout

  3. Understood about the name; I’ve updated my post accordingly. I assumed it was an official alias because it is still used in the domain name.

    > Thank you for doing the work to make those measurements yourself instead of demanding that Mercury be included in the benchmarks game!

    Thank you for providing the scripts you use for others!

    > You might find some Mercury implementations for those problems in CVS

    Thanks, I was wondering where those had gotten off to. For this particular benchmark that implementation is much (3×) slower, but I will evaluate others as I'm doing my work.

  4. > For this particular benchmark that implementation is much (3×) slower

    I'd like to show Mercury on the benchmarks game website - but the Mercury programs that have been contributed don't seem to show Mercury at its best, and I don't know how to make them better or how to get the best out of the compiler.

  5. I plan to do implementations of all the benchmarks; I will let you know when they’re complete.

    Mercury is no longer distributed in Debian (I’m not sure whether it ever was)… is this an impediment to inclusion in the Game?

  6. Cool!

    > is this an impediment to inclusion in the Game?

    No, even though I do have a vague recollection that building the compiler from source took an incredibly long time.

  7. Hi Chris,

    Well done, I love this blog - the community needs more people like you and I'm afraid that we (the Mercury developers) don't do enough to assist you.

    I'm the maintainer of most of the parallel 'things' in Mercury, so I can explain why your conjunction consumed a lot of memory when it was dependent. Basically, in previous versions of the runtime system (there's a work-around now),
    a parallel conjunction of the form:

    p :-
    q & p.

    will always fork a new thread (consuming memory) for each recursive call to p. There are two workarounds for this; the first is to, wherever possible, re-order the conjuncts of a parallel conjunction so that it looks like:

    p :-
    p & q.

    In this case a new thread may be created for q (all but the first conjunct are forked off); however, q finishes before p finishes, and therefore any thread it was using can be freed earlier. Secondly, in this case, since q is initially a spark, a thread is not always created (some of this is complicated).

    The problem is that when you included the list construction in the second argument of the parallel conjunction, the compiler could not re-order the arguments of the parallel conjunction because doing so would break the mode constraint that producers must occur to the left of consumers. Making this construction an accumulator (by having the recursive call to p perform the construction) would solve the problem, but not as well as just using the parentheses carefully as you have done.

    One more thing: when using parallelism you might find that a stack-segmented grade makes things a bit faster, and it is a bit more forgiving when dependent parallelism causes this memory problem, since each thread uses a segmented stack and therefore less memory. Some programs will, however, run more slowly in such a grade, especially when they don't use parallel conjunction often and don't use tail recursion. Please let me know if this improves the performance.

    Keep up the good work.

  8. Hi Paul, glad you like the blog! I apologize I haven't updated it in a while; I've been essentially AFK all summer. I am just now finding the time to get back into Mercury. I plan upcoming posts on release 11.07 and PEGs in Mercury.

    Thanks for the explanation about parallel goals, that makes a lot of sense.
