## Problem description

Today we will develop a solution to the spectral-norm test, whose description reads:

> Calculate an eigenvalue using the power method.
Eigenvectors are vectors which, when multiplied by a matrix, produce a multiple of themselves. The eigenvalue of this eigenvector is this multiple. The power method involves repeatedly multiplying a random vector by the matrix in question until the vector converges. This is then an eigenvector, and its associated eigenvalue is found easily.
To compute the spectral norm of a matrix A, one computes the square root of the largest eigenvalue of AᵀA (which, conveniently, is exactly the eigenvalue the power method finds, since it converges to the eigenvalue of largest magnitude).
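Before diving into Mercury, the whole algorithm can be sketched in a few lines of Python (an illustration only, not part of the entry; the function names here are invented, but the matrix entries and iteration count mirror the solution below):

```python
from math import sqrt

def a(i, j):
    # Entry (i, j) of the benchmark's matrix A.
    return 1.0 / ((i + j) * (i + j + 1) // 2 + i + 1)

def mul_av(n, v):
    # A times v.
    return [sum(a(i, j) * v[j] for j in range(n)) for i in range(n)]

def mul_atv(n, v):
    # A-transpose times v.
    return [sum(a(j, i) * v[j] for j in range(n)) for i in range(n)]

def spectral_norm(n, iterations=20):
    # Power method on B = (A^T)A: repeated multiplication drives v toward
    # the dominant eigenvector of B.
    v = [1.0] * n
    u = v
    for _ in range(iterations):
        u = v
        v = mul_atv(n, mul_av(n, u))
    # Eigenvalue estimate u.Bu / u.u, with Bu = v; its square root is ||A||.
    ubu = sum(x * y for x, y in zip(u, v))
    uu = sum(x * x for x in u)
    return sqrt(ubu / uu)

print(round(spectral_norm(100), 9))
```

The 20 multiplications by B correspond to the `pow(multiplyAtAv(N), 19, unit(N))` plus one final multiplication in the Mercury code below.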
## Implementation

Let’s take the design of the single-threaded Java entry as a template. This way, we can ensure that we are using the same algorithm as the other implementations, which is a requirement of the Benchmarks Game.
The interface and `main` predicate are unremarkable:
```
:- module spectralnorm.
:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.

:- implementation.
:- import_module float, int, list, math, std_util, string.

main(!IO) :-
    command_line_arguments(Args, !IO),
    (if to_int(head(Args), N0) then N = N0 else N = 100),
    format("%.9f\n", [f(approximate(N))], !IO).
```
Note that we have imported `list`, for use in representing the vectors. One might think that using arrays would perform much better, but surprisingly the gain is negligible in this program (mostly because there are no random accesses) and would come at the cost of more complicated code.
The `approximate` function works by iterating the multiplication 20 times, and using that result and the second-to-last to compute the spectral norm. This function is easily written using higher-order functions; in particular, `list.foldl` and `list.map_corresponding` are used to compute the dot product of two vectors:
```
approximate(N) = sqrt(VBv / Vv) :-
    V = pow(multiplyAtAv(N), 19, unit(N)),
    U = multiplyAtAv(N, V),
    VBv = foldl((+), map_corresponding((*), U, V), 0.0),
    Vv = foldl((+), map_corresponding((*), V, V), 0.0).
```
The unit vector, used as the initial vector, is easily defined recursively:
```
unit(N) = (if N = 0 then [] else [1.0 | unit(N - 1)]).
```
Next we define a function `a` to represent the matrix we are using:
```
a(I, J) = 1.0 / float((I + J) * (I + J + 1) / 2 + I + 1).
```
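The denominator in `a` numbers the entries of A consecutively along anti-diagonals, which a quick Python check makes visible (illustrative only; `denom` is a name invented here):

```python
def denom(i, j):
    # The denominator of a(I, J): entries are numbered 1, 2, 3, ...
    # along the anti-diagonals of A.
    return (i + j) * (i + j + 1) // 2 + i + 1

# Walk the first four anti-diagonals: the numbering is consecutive.
order = [denom(i, d - i) for d in range(4) for i in range(d + 1)]
print(order)  # → [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```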
Next we define multiplication of the matrix by a vector using two recursive functions: `multiplyAvI`, which multiplies each row `I` of the matrix by the vector using `multiplyAvIJ`, which in turn multiplies and sums each element of a row. `multiplyAv` simply starts the process.
```
multiplyAv(N, V) = multiplyAvI(N, 0, V).

multiplyAvI(N, I, V) = YYs :-
    ( if I < N then
        Y = multiplyAvIJ(I, 0, V),
        Ys = multiplyAvI(N, I + 1, V),
        YYs = [Y | Ys]
    else
        YYs = []
    ).

multiplyAvIJ(_, _, []) = 0.0.
multiplyAvIJ(I, J, [X | V]) = a(I, J) * X + multiplyAvIJ(I, J + 1, V).
```
To multiply by the transpose of A, we similarly define:

```
multiplyAtv(N, V) = multiplyAtvI(N, 0, V).

multiplyAtvI(N, I, V) = YYs :-
    ( if I < N then
        Y = multiplyAtvIJ(I, 0, V),
        Ys = multiplyAtvI(N, I + 1, V),
        YYs = [Y | Ys]
    else
        YYs = []
    ).

multiplyAtvIJ(_, _, []) = 0.0.
multiplyAtvIJ(I, J, [X | V]) = a(J, I) * X + multiplyAtvIJ(I, J + 1, V).
```

Finally,
`multiplyAtAv` ties these together. Note that we must declare its type, since we partially apply it in `approximate`:

```
:- func multiplyAtAv(int, list(float)) = list(float).
multiplyAtAv(N, V) = multiplyAtv(N, multiplyAv(N, V)).
```

Compiling with `mmc --infer-all -O6 --intermod-opt --rebuild spectralnorm`, we obtain a runtime of about 38.1 seconds for N = 3000. Not too bad, considering that the highly optimized C implementation has a single-threaded runtime of about 26.5 seconds. But we can do better.
Running `mprof` after our program tells us that the majority of our time (99%) is spent in `multiplyAvIJ` and `multiplyAtvIJ`, as expected. So let's focus our efforts there, and on `a`, which apparently has been inlined into each of those functions, since it does not otherwise appear in the profile.
## Unchecked arithmetic

The most obvious speedup is to replace the floating-point division in `a` with `unchecked_quotient`, which doesn't check whether its right operand is zero. (We make sure to add parentheses as appropriate, since infix function applications bind very tightly.) Recompiling, this reduces our runtime to 29.8 seconds! Similarly, replacing the integer division by two with an infix `unchecked_right_shift` by 1 reduces the runtime to 28.4 seconds.
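The shift is safe here because the quantity being halved, (I + J) * (I + J + 1), is a product of consecutive integers, hence always even and nonnegative. A quick Python check of the identity this optimization relies on:

```python
# For nonnegative integers, shifting right by one is exactly division by two.
for k in range(10_000):
    assert k >> 1 == k // 2

# The halved quantity in a(I, J) is a product of consecutive integers,
# so it is always even: no remainder is discarded by the shift.
assert all((i + j) * (i + j + 1) % 2 == 0 for i in range(100) for j in range(100))
print("shift identity holds")
```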
## Tail-call optimization / accumulator introduction

Next, both `multiplyAvIJ` and `multiplyAtvIJ` are written in a fashion which precludes tail-call optimization: each performs an operation on the result of its recursive call, which prevents Mercury from turning it into a simple loop. Fortunately, we can direct `mmc` to transform these functions into accumulator-passing functions (which can be so optimized) by using the `--introduce-accumulators` flag, and telling Mercury that floating-point addition is associative:
```
:- promise all [A, B, C, ABC]
    ((A + B) + C = ABC : float <=> A + (B + C) = ABC : float).
```

Unfortunately, due to bug 193, `mmc` doesn't recognize this declaration because of the typecast (which is necessary to distinguish between integer and floating-point addition), so we must instead introduce a dummy function:

```
:- func fplus(float, float) = float.
A `fplus` B = A + B.

:- promise all [A, B, C, ABC]
    ((A `fplus` B) `fplus` C = ABC <=> A `fplus` (B `fplus` C) = ABC).
```

After replacing our use of `+` with `fplus` in `multiplyAvIJ` and `multiplyAtvIJ` (and again, parenthesizing appropriately), we cut our runtime down to 25.2 seconds -- better than the C implementation!
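The rewrite that `--introduce-accumulators` performs can be illustrated in Python (hypothetical helper names; Mercury derives the second form automatically once it knows the operation is associative):

```python
def dot_recursive(i, j, row_fn, v):
    # Transliteration of multiplyAvIJ: the addition happens *after* the
    # recursive call returns, so the call is not in tail position and the
    # stack grows with the vector length.
    if not v:
        return 0.0
    return row_fn(i, j) * v[0] + dot_recursive(i, j + 1, row_fn, v[1:])

def dot_accumulated(i, j, row_fn, v, acc=0.0):
    # Accumulator-passing version: the recursive call is in tail position,
    # so a compiler with tail-call optimization turns it into a loop.
    # Associativity of + is what guarantees both versions compute the same sum.
    if not v:
        return acc
    return dot_accumulated(i, j + 1, row_fn, v[1:], acc + row_fn(i, j) * v[0])

row = lambda i, j: 1.0 / ((i + j) * (i + j + 1) // 2 + i + 1)
v = [1.0, 2.0, 3.0]
print(dot_recursive(0, 0, row, v), dot_accumulated(0, 0, row, v))  # → 2.75 2.75
```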
An optimization along similar lines is to enable `--optimize-constructor-last-call`, which can enable tail-call optimization in functions like `multiplyAvI` and `multiplyAtvI`, which construct a list after their recursive call. While this is a win for smaller values of N, it actually degrades performance for larger values, so we will not enable it.
## Parallelization

Comparing our 25.2 seconds to the C implementation’s 26.5 seconds is not really fair, since the C implementation is parallelized, running in 6.9 seconds across four CPUs.
Fortunately, parallelization in Mercury is simple. The `&` operator solves two goals in parallel, expecting them both to succeed (like `,`). Our job is to decide the best place to put this operator.
At the top level, the operations in `multiplyAtAv` are necessarily sequential, since each successive multiplication depends on the result of the previous one.
Moving down a level, the two operations in `multiplyAvI` and `multiplyAtvI` seem parallelizable, as they do not depend on one another. Let’s introduce a parallel goal:
```
multiplyAvI(N, I, V) = YYs :-
    ( if I < N then
        ( Y = multiplyAvIJ(I, 0, V)
        & Ys = multiplyAvI(N, I + 1, V)
        ),
        YYs = [Y | Ys]
    else
        YYs = []
    ).
```
`multiplyAtvI` is modified similarly. Note the placement of the parentheses: `&` binds more loosely than `,`. If we did not have the parentheses, the list construction would be moved into the second branch, making it dependent on the first branch and causing (for reasons unknown to the author) massive amounts of memory to be used.
At the lowest level, we could introduce parallel goals in `multiplyAvIJ` and `multiplyAtvIJ`, but we should have sufficient parallelization already, and the operations at this level are so short that they would be dwarfed by the parallelization overhead.
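The shape of the row-level split can be sketched with a Python thread pool (an analogy only: Mercury’s `&` is not thread-pool based, and the names here are invented). Each row’s dot product is an independent task, just as each `Y = multiplyAvIJ(...)` goal is:

```python
from concurrent.futures import ThreadPoolExecutor

def a(i, j):
    return 1.0 / ((i + j) * (i + j + 1) // 2 + i + 1)

def row(i, v):
    # One row of A times v -- the unit of work we run in parallel, mirroring
    # the parallel goal placed around multiplyAvIJ in the Mercury version.
    return sum(a(i, j) * x for j, x in enumerate(v))

def mul_av_parallel(n, v, workers=4):
    # Caveat: in CPython the GIL means threads give no real speedup for
    # pure-Python arithmetic; this only illustrates where the split goes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: row(i, v), range(n)))

def mul_av_sequential(n, v):
    return [row(i, v) for i in range(n)]

v = [1.0] * 8
print(mul_av_parallel(8, v) == mul_av_sequential(8, v))  # → True
```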
Recompiling with `mmc --parallel --infer-all -O6 --intermod-opt --introduce-accumulators --rebuild spectralnorm`, we obtain a runtime of 7.2 seconds – less than 5% slower than the C counterpart. The difference between Mercury and C is even less pronounced at N = 5500, coming in at 23.8 and 23.7 seconds respectively.
## Results

Results obtained using the Benchmarks Game’s `bencher.py` program for N = 5500:
| × | Language | CPU (s) | Elapsed (s) | Memory (kB) | Code (B) | CPU load |
|---|----------|---------|-------------|-------------|----------|----------|
Curiously, compiling with the `-H` flag (to enable the high-level C “hlc” grade) generates much faster single-threaded code. Checking the generated C code in `Mercury/cs/spectralnorm.c` indicates that tail calls are implemented as loops in this version, compared to calls to a Mercury primitive in the other versions. Unfortunately, the high-level C grade does not support the and-parallelism we are using.