Automatic Parallelization with Mercury

Mercury is a fully declarative functional/logic programming language. Unlike many other functional languages, it supports non-deterministic programming and automatic backtracking (or other search strategies); unlike many other logic languages, it supports static typing and efficient compilation to C, Java, C# and Erlang. Its syntax is inspired by Prolog, but because the language is fully declarative, any predicate can in theory be computed in parallel. An experimental feature uses this to parallelize programs automatically, and that feature is the topic of this article.

It is often claimed that purely functional languages are easy to parallelize and are therefore the future, as CPUs get ever more cores. The problem, of course, is that it does not make sense to parallelize everything: the overhead would eliminate any benefit of the parallelization.

Parallel Conjunctions in Mercury

The following is a small excerpt from a Mercury program. More precisely, it shows part of the native biginteger multiplication in the standard library. The implementation uses a divide-and-conquer method devised by Anatoly Karatsuba, the first sub-quadratic multiplication algorithm.

% Karatsuba / Toom-2 multiplication in O(n^1.585)
:- func pos_mul_karatsuba(integer, integer) = integer.

pos_mul_karatsuba(i(L1, Ds1), i(L2, Ds2)) = Res :-
    ( L1 < karatsuba_threshold ->
        Res = pos_mul_list(Ds1, integer.zero, i(L2, Ds2))
    ;
        ( L2 < L1 ->
            unexpected($module, $pred, "second factor smaller")
        ;
            Middle = L2 div 2,
            HiDigits = L2 - Middle,
            HiDigitsSmall = max(0, L1 - Middle),
            % Split Ds1 in [L1 - Middle; Middle] digits if L1 > Middle
            % or leave as [L1] digits.
            list.split_upto(HiDigitsSmall, Ds1, Ds1Upper, Ds1Lower),
            % Split Ds2 in [L2 - Middle; Middle] digits.
            list.split_upto(HiDigits, Ds2, Ds2Upper, Ds2Lower),
            LoDs1 = i(min(L1, Middle), Ds1Lower),
            LoDs2 = i(Middle, Ds2Lower),
            HiDs1 = i(HiDigitsSmall, Ds1Upper),
            HiDs2 = i(HiDigits, Ds2Upper),

            % If Middle is larger than karatsuba_parallel_threshold,
            % do the three recursive multiplications in parallel.
            ( Middle > karatsuba_parallel_threshold ->
                % parallel-AND conjunction using '&'
                Res0 = pos_mul(LoDs1, LoDs2) &
                Res1 = pos_mul(LoDs1 + HiDs1, LoDs2 + HiDs2) &
                Res2 = pos_mul(HiDs1, HiDs2)
            ;
                % normal conjunction using ','
                Res0 = pos_mul(LoDs1, LoDs2),
                Res1 = pos_mul(LoDs1 + HiDs1, LoDs2 + HiDs2),
                Res2 = pos_mul(HiDs1, HiDs2)
            )
        ),
        Res = big_left_shift(Res2, 2*Middle*log2base) +
              big_left_shift(Res1 - (Res2 + Res0), Middle*log2base) + Res0
    ).

In Mercury, bigintegers are represented as lists of 14-bit digits. Karatsuba's algorithm splits such a list at the Middle position and calls itself recursively. The speedup comes from the fact that instead of 4 multiplications of the half-sized numbers, only 3 multiplications plus a few cheap shifts and additions are needed, which makes the algorithm much faster than quadratic multiplication.
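
To see why three multiplications suffice, we can write out the identity behind the code. The names are taken from the code above; m (for Middle) and b (for the digit base 2^log2base) are shorthands introduced here only for readability:

    A = HiDs1 * b^m + LoDs1
    B = HiDs2 * b^m + LoDs2

    Res0 = LoDs1 * LoDs2
    Res2 = HiDs1 * HiDs2
    Res1 = (LoDs1 + HiDs1) * (LoDs2 + HiDs2)
         = Res0 + Res2 + (LoDs1 * HiDs2 + HiDs1 * LoDs2)

    A * B = Res2 * b^(2m) + (Res1 - Res2 - Res0) * b^m + Res0

The two cross products are recovered as Res1 - Res2 - Res0 from the three multiplications that are actually performed, and multiplying by b^m and b^(2m) corresponds exactly to the big_left_shift calls by Middle*log2base and 2*Middle*log2base bits in the last two lines of the code.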

This implementation shows a classic approach to divide-and-conquer algorithms. For small numbers, the overhead of splitting and recombining becomes too high, so bigintegers with fewer than karatsuba_threshold digits are multiplied using the quadratic method.

Analogously, the recursive calls are only parallelized (using the parallel-AND & instead of the normal conjunction ,) if the number of digits is greater than karatsuba_parallel_threshold. In the current implementation these thresholds are 35 and 350 respectively, but their values may change with newer processor generations.

Automatic Parallelization

Based on a lot of research work, in particular Paul Bone's PhD thesis, Mercury also implements automatic parallelization of conjunctions. The idea is to profile a program on a typical workload and then analyze the profiling data to find places where normal conjunctions can profitably be replaced with parallel-AND.

The code above is a classic example where manual parallelization is easy, but doing it efficiently requires adding thresholds with a fallback to the non-parallel version. This modifies the code considerably and is currently out of reach for automatic parallelization.

Therefore, to illustrate the automatic approach, we will use the following code as an example. It implements a solution to Project Euler problems 81, 82 and 83. All three can be solved by applying Dijkstra's single-source shortest path (SSSP) algorithm to different graphs. The non-parallel version of main looks like this:

main(!IO) :-
    Problem = big_problem,
    Matrix = Problem ^ problem_matrix,
    MaxX = Problem ^ maxX,
    MaxY = Problem ^ maxY,
    Map = convert_lists(0, Matrix, map.init),
    euler_vertex_list(Map, VList),
    W = (func(U, V) = euler_problem_weight(Map, U, V)),

    euler_edge_list(VList, MaxX, MaxY, euler_edge_81, EList_81),
    Graph_81 = create_graph(EList_81),
    dijkstra(Graph_81, W, s, PreDistMap_81),

    euler_edge_list(VList, MaxX, MaxY, euler_edge_82, EList_82),
    Graph_82 = create_graph(EList_82),
    dijkstra(Graph_82, W, s, PreDistMap_82),

    euler_edge_list(VList, MaxX, MaxY, euler_edge_83, EList_83),
    Graph_83 = create_graph(EList_83),
    dijkstra(Graph_83, W, s, PreDistMap_83),

There are three distinct calculations; for each, first a list of edges is computed, then a graph is created from these edges, and finally Dijkstra's algorithm is run on the resulting graph.
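
These three blocks are independent of each other, so a programmer could parallelize them by hand with a parallel-AND, roughly as in the following sketch (it reuses the variable names from the main predicate above and changes nothing else):

    (
        % each arm of the parallel conjunction handles one problem
        euler_edge_list(VList, MaxX, MaxY, euler_edge_81, EList_81),
        Graph_81 = create_graph(EList_81),
        dijkstra(Graph_81, W, s, PreDistMap_81)
    &
        euler_edge_list(VList, MaxX, MaxY, euler_edge_82, EList_82),
        Graph_82 = create_graph(EList_82),
        dijkstra(Graph_82, W, s, PreDistMap_82)
    &
        euler_edge_list(VList, MaxX, MaxY, euler_edge_83, EList_83),
        Graph_83 = create_graph(EList_83),
        dijkstra(Graph_83, W, s, PreDistMap_83)
    ),

The point of the automatic approach is that we do not have to make even this small change ourselves.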

We can compile the program so that it produces profiling information, and then use that information to introduce implicit parallel-AND conjunctions where they pay off:

% mmc -r --make --profile-for-feedback --deep-profiling program
% ./program
% mdprof_create_feedback --implicit-parallelism Deep.data feedback.data
% mmc -r --make --parallel --stack-segments --feedback-file feedback.data program

This first compiles the program with profiling enabled, then runs it to produce the profiling data, then calls a tool that converts the profiling data into parallelization feedback, and finally re-compiles the program using this feedback.

The result is the following (an excerpt taken directly from the generated file feedback.data):

Parallel conjunction:
  (
    create_graph(EList_81, Graph_81)
  &
    dijkstra(Graph_81, VList, Map, PreMap_81, DistMap_81)
  &
    euler_edge_list(VList, MaxX, MaxY, V_33, EList_82)
  &
    create_graph(EList_82, Graph_82)
  &
    dijkstra(Graph_82, VList, Map, PreMap_82, DistMap_82)
  &
    euler_edge_list(VList, MaxX, MaxY, V_34, EList_83)
  &
    create_graph(EList_83, Graph_83)
  &
    dijkstra(Graph_83, VList, Map, PreMap_83, DistMap_83)
  )

Using the time tool to measure CPU usage, we get the following report on a laptop with a dual-core i5:

5.85user 0.07system 0:03.15elapsed 188%CPU (0avgtext+0avgdata 97480maxresident)

This means we get almost full use of both cores without any additional work on the code: (5.85 s user + 0.07 s system) / 3.15 s elapsed ≈ 1.88, which matches the reported 188% CPU!

This feature is still experimental, and its authors will be the first to tell you that it has some remaining problems. Even so, the fully declarative nature of Mercury and the large body of existing research already make automatic parallelization usable today for many programs!

There are limitations, of course: the transformation applies only to deterministic predicates, the quality of the result depends on a well-chosen, representative input, and the profiling build uses more RAM. Also, the parallelization potential in this example is quite obvious, whereas something like the Karatsuba multiplication, where the recursive calls can be parallelized, cannot be parallelized automatically at the moment, as that would require changing the code.

Nevertheless, the authors have solved many of the practical problems of automatic parallelization. Using this feature and reporting any problems to the Mercury team will certainly help turn it into a robust feature that may one day find its way into more mainstream languages as well.