% (Removed: web-scrape artifacts — GitHub page chrome and line-number gutter
% accidentally captured ahead of the LaTeX source.)
% vim: ts=4 sw=4 et ft=tex
\chapter{Loop Control}
\label{chap:lc}
\label{chap:loop_control}
\status{Peter has checked the first 3 sections and I will retrieve them from
him later.
The next three sections are now ready for someone to check.}
In the previous chapter we described our
system that uses profiling data
to automatically parallelise Mercury programs by
finding conjunctions with expensive conjuncts
that can run in parallel with minimal synchronisation delays.
This worked very well in some programs but not as well as we had hoped for
others,
including the raytracer.
This is because the way Mercury must execute dependent conjunctions
and the way programmers typically write logic programs are at odds.
We introduced this as
``the right recursion problem''
in Section~\ref{sec:rts_original_scheduling_performance}.
In this chapter we present a novel program transformation that eliminates
this problem in all situations.
The transformation has several benefits:
First, it reduces peak memory consumption
by putting a limit on how many stacks
a conjunction will need to have in memory at the same time.
Second,
it reduces the number of synchronisation barriers needed
from one per loop iteration to one per loop.
Third, it allows recursive calls inside parallel conjunctions to take
advantage of tail recursion optimisation.
Finally, it obsoletes the conjunct reordering transformation.
Our benchmark results show that our new transformation
greatly increases the speedups we can get from parallelising Mercury
programs;
in one case, it changes no speedup into almost perfect speedup on four cores.
We have written about the problem elsewhere in the dissertation;
however, we have found that this problem is sometimes difficult to
understand.
Therefore
the introduction section
(Section~\ref{sec:lc_intro})
briefly describes the problem,
providing only the details necessary to understand and evaluate the rest of
this chapter.
\paul{It also motivates the reader correctly, but I do not say this.}
For more details about the problem, see
Sections~\ref{sec:rts_original_scheduling}
and~\ref{sec:rts_original_scheduling_performance};
see also \citet{bone:2012:loop_control}, the paper on which this chapter is
based.
The rest of the chapter is organised as follows.
Section~\ref{sec:lc_transformation} describes
the program transformation we have developed
to control memory consumption by loops.
Section~\ref{sec:lc_perf} evaluates
how our system works in practice on some benchmarks.
Section~\ref{sec:lc_further_work}
describes potential further work,
and
Section~\ref{sec:lc_conc} concludes with discussion of related work.
\section{Introduction}
\label{sec:lc_intro}
\status{This section is ready for Peter Schachte to check.}
The implementation of a parallel conjunction
has to execute the first conjunct after spawning off the later conjuncts.
For dependent conjunctions, it cannot be done the other way around,
because only the first conjunct is guaranteed to be immediately executable:
later conjuncts may need to wait for data to be generated by earlier
conjuncts.
This poses a problem when the last conjunct contains a recursive call:
\begin{itemize}
\item
the state of the computation up to this iteration of the loop
is stored in the stack used by the original computation's context,
whereas
\item
the state of the computation after this iteration of the loop
is stored in the stack used by the spawned off context.
\end{itemize}
We can continue the computation after the parallel conjunction
only if we have both the original stack
and the results computed on the spawned-off stack.
This means the original stack must be kept in memory
until the recursive call is done.
However, there is no way to distinguish
the first iteration of a loop from the second, third, etc.,
so we must preserve the original stack on \emph{every} iteration.
This problem is very common,
since in logic programming languages,
tail recursive code has long been the preferred way to write a loop.
In independent code,
we can work around the issue by reordering the conjuncts
in a conjunction (see Section~\ref{sec:rts_reorder}),
but in dependent code this is not possible.
We call this the ``right recursion problem'' because its effects are at
their worst when the recursive call is on the right hand side of a
parallel conjunction operator.
Unfortunately,
this is a natural way to (manually or automatically) parallelise programs
that were originally written with tail recursion in mind.
Thus, parallelisation often
transforms tail recursive sequential computations,
which run in constant stack space,
into parallel computations
that allocate a complete stack for each recursive call
and do not free them until the recursive call returns.
This means that each iteration effectively requires memory
to store an entire stack, not just a stack frame.
\picfigurelabel{linear_context_usage}{fig:linear_context_usage2}{Linear context usage in right recursion}
Figure~\ref{fig:linear_context_usage2} shows a visualisation of this stack
usage.
At the top left,
four contexts are created to execute four iterations of a loop,
as indicated by boxes.
Once each of these iterations finishes,
its context stays in memory but is suspended,
indicated by the long vertical lines.
Another four iterations of the loop create another four contexts, and so on.
Later, when all iterations of the loop have been executed,
each of the blocked contexts resumes execution and immediately exits,
indicated by the horizontal line at the bottom of each of the vertical
lines.
If we allow the number of contexts, and therefore their stacks, to grow
without bound, then the program will very quickly run out of memory,
often bringing the operating system to its knees.
This is why we introduced the context limit workaround
(described in Section~\ref{sec:rts_original_scheduling_performance}),
which can prevent a program from crashing,
but which limits the amount of parallel execution.
% The natural way to execute a parallel version of such a loop
% is to keep spawning off the task of performing each iteration
% until all the available cores are busy
%
Our transformation explicitly limits
the number of stacks allocated to recursive calls
to a small multiple of the number of available processors in the system.
This transformation can also be asked
to remove the dependency of a parallel loop iteration
on the parent stack frame from which it was spawned,
allowing the parent frame to be reclaimed
before the completion of the recursive call.
This allows parallel tail recursive computations
to run in constant stack space.
The transformation is applied
after the automatic parallelisation transformation,
so it benefits both manually and automatically parallelised Mercury code.
% Our benchmark results are very encouraging.
% Limiting the number of stacks
% not only permits deep tail recursions to take advantage of multiple cores,
% but it also significantly improves performance.
% For most of our benchmarks, we get near-optimal speedups.
% % The second
% % transformation does not produce a speed improvement; in some cases it
% % causes a slight slow-down. However, this small price is well worth
% % paying to allow parallel tail recursive computations to run in
% % constant stack space.
% % This is the old explaination, which is now covered in Ch3.
% \section{The main problem}
% \label{sec:problem}
%
% As Mercury is a declarative programming language,
% Mercury programs make heavy use of recursion.
% Like the compilers for most declarative languages,
% the Mercury compiler optimises tail recursive procedures
% into code that can run in constant stack space.
% Since this generally makes tail recursive computations
% more efficient than code using other forms of recursion,
% typical Mercury code makes heavy use of tail recursion in particular.
%
% Unfortunately, tail recursive computations are not naturally handled well
% by Mercury's implementation of parallel conjunctions.
% Consider the \mapfoldl{} predicate in Figure~\ref{fig:mapfoldl}.
% This code applies the map predicate \code{M} to each element of an input list,
% and then uses the fold predicate \code{F}
% to accumulate (in a left-to-right order) all the results produced by \code{M}.
% The best parallelisation of \mapfoldl{} executes
% \code{M} and \code{F} in parallel with the recursive call.
% \tr{
% \code{M(H, MappedH), F(MappedH, Acc0, Acc1)} in parallel with
% Although all executions of \code{M} are independent
% and need not wait for anything to begin their computation,
% each call to \code{F} must wait until
% the call to \code{M} generates the value of \code{MappedH}.
% Thus there would be no point in executing \code{M} and \code{F} in parallel:
% \code{F} would immediately suspend until \code{M} had produced its result.
% However, the recursive call to \mapfoldl \emph{can} begin in parallel,
% allowing the next call to \code{M} to run in parallel with this iteration.
% }
% The programmer (or an automatic tool) can make this happen
% in the original sequential version of \mapfoldl
% by replacing the comma before the recursive call
% with the parallel conjunction operator \verb'&'.
% % Example.
% The problem is that the execution of a call to \mapfoldlpar{}
% has bad memory behaviour.
% When a context begins execution of a call to \mapfoldl{},
% it begins by creating a spark for the second conjunct
% (which contains the recursive call),
% and executes the first conjunct (which starts with the call to \code{M}).
% If another Mercury engine is available at that time,
% it will pick up and execute the spark for the recursive call,
% itself creating a spark for another recursive call
% and executing the \emph{next} call to \code{M}.
% This will continue until all Mercury engines are in use
% and the newest spark for a recursive call % to \mapfoldlpar{}
% must wait for an % available
% engine.
% When an engine completes execution of \code{M} and \code{F},
% it posts the value of \code{Acc1} into \code{FutureAcc1}.
% Any computations waiting for \code{Acc1} will then be woken up;
% these will be the calls that wait for \code{Acc0} in the next iteration.
% In this case, the woken code will resume execution
% immediately before the call to \code{F}
% in the recursive invocation of \mapfoldlpar{}.
%
% One might hope that after a spark for the recursive call has been created,
% and once \code{M} and \code{F} had completed execution
% and \code{Acc1} has been signalled,
% the context used to execute the first conjunct could be released.
% Unfortunately, it cannot because this context is the one that was running
% when execution entered the parallel conjunction,
% and therefore this is the context whose stacks
% contain the state of the computation outside the parallel conjunction.
% If we allowed this context to be reused,
% then all this state would be lost.
%
% This means that until the base case of the recursion is reached,
% \emph{every} recursive call must have its own complete execution context.
% Since each context contains two stacks,
% it can occupy a rather large amount of memory,
% so it is not practical to simultaneously preserve an execution context
% for each and every recursive call to a tail-recursive predicate.
% Originally, programs which bumped into this problem
% often ran themselves and the operating system out of memory rather quickly,
% because the default size of every det stack was several megabytes.
% To reduce the scope of the problem,
% we made stacks dynamically expandable,
% which allowed us to reduce their initial size,
% but programs with the problem can still run out of memory,
% it just takes more iterations to do so.
% \label{sec:context_limit}
% Our runtime system prevents such crashes
% by imposing a global limit on the number of contexts
% that can be running or suspended at any point:
% if a context is needed to execute a spark
% and allocating the context would breach this limit,
% then the spark will not be executed.
% Eventually, the context that created the spark will execute it on its
% own stack, but this limits the remainder of the recursive computation
% to use only that context, so parallelism is curtailed at that point.
%
% A much better solution is to swap the order of the conjuncts
% in the parallel conjunction
% so that the conjunct containing the recursive call is executed first.
% This means that we will spawn off the non-recursive conjuncts,
% whose contexts \emph{can} be freed when their execution is complete.
% However, since the Mercury mode system requires that
% the producer of a variable precede all its consumers,
% this is possible only if the conjuncts are independent.
% The approach we have taken in this paper
% is to spawn off the non-recursive conjuncts,
% and continue execution of the recursive call
% without swapping the order of the conjuncts.
% We also directly limit the number of contexts that are used in a loop
% to a small multiple of the number of available CPUs.
% Finally, we can arrange for
% the inputs and outputs of the non-recursive conjuncts
% to be stored outside the stack frame of a tail recursive procedure,
% which allows such procedures to run in fixed stack space
% even when executed in parallel.
% In the next section, we explain all of these improvements.
% % If they are dependent, this solution is not applicable,
% % so for such conjunctions we need a completely different solution.
% % We present one in the next section.
% \peter{Not sure how to cover this:}
% \paul{This is not as important as other issues,
% I am not sure it is worth confusing the reader.}
% zs: compiling with and without loop control both have their own unique
% overheads. There is not a clear advantage either way, so the issue is not
% important enough to mention.
%
% There is at most, only one spark on the spark queue,
% which means that parallel work is not abundant.
% There are likely to be more context switches.
% These problems are secondary to the memory consumption problems.
% However, loop control also fixes them ensuring that loop control has lower
% overhead than parallel conjunctions.
\section{The loop control transformation}
\label{sec:lc_transformation}
\status{This section is ready for someone to check.}
\begin{figure}[tb]
\begin{verbatim}
map_foldl_par(M, F, L, FutureAcc0, Acc) :-
lc_create_loop_control(LC),
map_foldl_par_lc(LC, M, F, L, FutureAcc0, Acc).
map_foldl_par_lc(LC, M, F, L, FutureAcc0, Acc) :-
(
L = [],
% The base case.
wait_future(FutureAcc0, Acc0),
Acc = Acc0,
lc_finish(LC)
;
L = [H | T],
new_future(FutureAcc1),
lc_wait_free_slot(LC, LCslot),
lc_spawn_off(LC, LCslot, (
M(H, MappedH),
wait_future(FutureAcc0, Acc0),
F(MappedH, Acc0, Acc1),
signal_future(FutureAcc1, Acc1),
lc_join_and_terminate(LC, LCslot)
)),
map_foldl_par_lc(LC, M, F, T,
FutureAcc1, Acc)
).
\end{verbatim}
%\vspace{2mm}
\caption{\mapfoldlpar after the loop control transformation}
\label{fig:map_foldl_transformed}
%\vspace{-1\baselineskip}
\end{figure}
The main aim of loop control is to set an upper bound
on the number of contexts that a loop may use,
regardless of how many iterations of the loop may be executed,
without limiting the amount of parallelism available.
The loops we are concerned about
are procedures that we call right recursive:
procedures in which the recursive execution path
ends in a parallel conjunction
whose last conjunct contains the recursive call.
A right recursive procedure may be tail recursive, or it may not be:
the recursive call could be followed by other code
either within the last conjunct, or after the whole parallel conjunction.
Programmers have long tended to write loops whose last call is recursive
in order to benefit from tail recursion.
\paul{My recursion type analysis does not yet separate right and left
recursion,
so I cannot actually say how common it is.}
Therefore,
right recursion is very common;
most parallel conjunctions in recursive procedures are right recursive.
To guarantee the imposition of an upper bound
on the number of contexts created during one of these loops,
we associate with each loop a data structure
that has a fixed number of slots,
and require each iteration of the loop that would spawn off a goal
to reserve a slot for the context of each spawned-off computation.
This slot is marked as in-use until that spawned-off computation finishes,
at which time it becomes available for use by another iteration.
This scheme requires us to use two separate predicates:
the first sets up the data structure
(which we call the \emph{loop control} structure)
and the second actually performs the loop.
The rest of the program knows only about the first predicate;
the second predicate is only ever called from the first predicate
and from itself.
Figure~\ref{fig:map_foldl_transformed} shows what these predicates look like.
In Section~\ref{sec:lc_structs},
we describe the loop control structure and the operations on it;
in Section~\ref{sec:lc_trans},
we give the algorithm that does the transformation;
in Section~\ref{sec:lc_tailrec},
we discuss its interaction with tail recursion optimisation.
\subsection{Loop control structures}
\label{sec:lc_structs}
\begin{figure}
\begin{verbatim}
typedef struct MR_LoopControl_Struct MR_LoopControl;
typedef struct MR_LoopControlSlot_Struct MR_LoopControlSlot;
struct MR_LoopControlSlot_Struct
{
MR_Context *MR_lcs_context;
MR_bool MR_lcs_is_free;
};
struct MR_LoopControl_Struct
{
volatile MR_Integer MR_lc_outstanding_workers;
MR_Context* volatile MR_lc_master_context;
volatile MR_Lock MR_lc_master_context_lock;
volatile MR_bool MR_lc_finished;
/*
** MR_lc_slots MUST be the last field, since in practice, we treat
** the array as having as many slots as we need, adding the size of
** all the elements except the first to sizeof(MR_LoopControl) when
** we allocate memory for the structure.
*/
unsigned MR_lc_num_slots;
MR_LoopControlSlot MR_lc_slots[1];
};
\end{verbatim}
\caption{Loop control structure}
\label{fig:loop_control_structure}
\end{figure}
The loop control structure,
shown in Figure~\ref{fig:loop_control_structure},
contains the following fields:
\begin{description}
\item[\code{MR\_lc\_slots}]
is an array of slots, each of which contains a boolean and a pointer.
The boolean says whether the slot is free,
and if it is not,
the pointer points to the context that is currently occupying it.
When the occupying context finishes,
the slot is marked as free again,
but the pointer remains in the slot
to make it easier (and faster) for the next computation that uses that slot
to find a free context to reuse.
Therefore we cannot encode the boolean in the pointer being \NULL or non-\NULL.
Although this is the most significant field in the structure, it is last so
that the array can be stored inline with the rest of the structure,
avoiding an extra memory dereference.
% (The description of the \code{lc\_wait\_free\_slot(LC)} operation below
% will show why cannot we encode the boolean
% in the pointer being null/non-null.)
\item[\code{MR\_lc\_num\_slots}]
stores the number of slots in the array.
\item[\code{MR\_lc\_outstanding\_workers}]
is the count of the number of slots that are currently in use.
\item[\code{MR\_lc\_master\_context}]
is a possibly null pointer to the \emph{master} context,
the context that created this structure,
and the context that will spawn off all of the iterations.
This slot will point to the master context whenever it is sleeping,
and will be \NULL at all other times.
\item[\code{MR\_lc\_master\_context\_lock}]
is a mutex that protects access to \code{MR\_lc\_master\_context}.
The other fields are protected using atomic instructions;
we will describe them when we show the code for the loop control procedures.
% \zoltan{Should we explain how accesses to some fields
% can dispense with the mutex?}
% \paul{We decided not to explain this}
\item[\code{MR\_lc\_finished}]
is a boolean flag that says whether the loop has finished.
It is initialised to false, and is set to true
as the first step of the \lcfinish operation.
%\item[\code{MR\_lc\_free\_slot\_hint}]
%contains an index into the array of free slots.
%It indicates which slot may be free;
%we use it to speed up the search for a free slot in the average case.
%\zoltan{we need an argument for WHY this is a speedup,
%but that argument does not belong here,
%since it depends on the behaviour of operations we have not described yet.
%Alternatively, we can completely avoid mentioning the hint field.}
%\paul{I think it is easier to not mention it, it is optimal in some cases and
%neither optimal nor pessimal in other cases.}
\end{description}
\noindent
The finished flag
is not strictly needed for the correctness of the following operations,
but it can help the loop control code cleanup at the end of a loop more
quickly.
In the following description of the primitive operations
on the loop control structure,
\LC is a reference to the whole of a loop control structure,
while \LCS is an index into the array of slots stored within \LC.
\begin{description}
\item[\code{LC = lc\_create\_loop\_control()}]
This operation creates a new loop control structure,
and initialises its fields.
The number of slots in the array in the structure
will be a small multiple of the number of cores in the system.
The multiplier is configurable
by setting an environment variable when the program is run.
%\begin{algorithm}[tbp]
%\begin{algorithmic}
%\Procedure{MR\_lc\_wait\_free\_slot}{$lc$, $retry\_label$}
% %unsigned hint, offset, i;
%
% \If{$lc.MR\_lc\_outstanding\_workers = lc.MR\_lc\_num\_slots$}
% \State MR\_aquire\_lock($lc.MR\_lc\_master\_context\_lock$)
% \If{$lc.MR\_lc\_outstanding\_workers = lc.MR\_lc\_num\_slots$}
% %MR\_Context *ctxt;
% \State /* Only commit to sleeping while holding the lock. */
% %so retest the outstanding worker count.
% \State $ctxt \gets$ MR\_ENGINE($MR\_eng\_this\_context$)
% \State MR\_save\_context($ctxt$)
% \State $ctxt.MR\_ctxt\_resume \gets retry\_label$
% %\State $ctxt.MR\_ctxt\_resume\_owner\_engine \gets$ MR\_ENGINE($MR_eng_id$)
% \State $lc.MR\_lc\_master\_context \gets ctxt$
% %MR_CPU_SFENCE;
% \State MR\_release\_lock($lc.MR\_lc\_master\_context\_lock$)
% \State MR\_ENGINE($MR\_eng\_this\_context$) $\gets$ \NULL
% \State MR\_idle()
% \EndIf
% \State MR\_release\_lock($lc.MR\_lc\_master\_context\_lock$)
% \EndIf
%
% \State $hint \gets lc.MR\_lc\_free\_slot\_hint$
%
% \For{$offset \gets 0$ to $lc.MR\_lc\_num\_slots$}
% \State $i \gets (hint + offset) \bmod lc.MR\_lc\_num\_slots$
% \If{$lc.MR\_lc\_slots[i].MR\_lcs\_is\_free$}
% \State $lc.MR\_lc\_slots[i].MR\_lcs\_is\_free \gets false$
% \State $lc.MR\_lc\_free\_slot\_hint \gets
% (i + 1) \bmod lc.MR\_lc\_num\_slots$
% \State MR\_atomic\_inc\_int($lc.MR\_lc\_outstanding\_workers$)
% \State $lcs\_idx \gets i$
% \State \Break
% \EndIf
% \EndFor
%
% \If{$lc.MR\_lc\_slots[i].MR\_lcs\_context = $\NULL}
% %\Comment Allocate a new context.
% \State $lc.MR\_lc\_slots[i].MR\_lcs\_context \gets$
% MR\_create\_context()
% %\State
% % $lc.MR\_lc\_slots[i].MR\_lcs\_context.MR\_ctxt\_thread\_local\_mutables
% % \gets MR\_THREAD\_LOCAL\_MUTABLES$
% \EndIf
%
% \State reset\_context\_stack\_ptr($lc.MR\_lc\_slots[i].MR\_lcs\_context$)
%
% \State \Return $lcs\_idx$
%\EndProcedure
%\end{algorithmic}
%\caption{\lcwaitfreeslot}
%\label{alg:lc_free_slot}
%\end{algorithm}
\item[\code{LCslot = lc\_wait\_free\_slot(LC)}]
% This operation tests \linebreak[3] whether \LC{} has any free slots.
% This hyphenation improves this paragraph, I have swapped one evil for
% another.
% This operation tests whe\-ther \LC{} has any free slots.
This operation tests whether \LC{} has any free slots.
If it does not, the operation suspends until a slot becomes available.
When some slots are available, either immediately or after a wait,
the operation chooses one of the free slots, marks it in use,
fills in its context pointer and returns its index.
It can get the context to point to
from the last previous user of the slot,
from a global list of free contexts
(in both cases it gets contexts which have been used previously
by computations that have terminated earlier),
or by allocating a new context
(which typically happens only soon after startup).
% I have edited this so that it does not refer to sparks, since they are not used
% with loop control.
\item[\code{lc\_spawn\_off(LC, LCslot, CodeLabel)}]
This operation sets up the context in the loop control slot,
and then puts it on the global runqueue,
where any engine looking for work can find it.
Setup of the context consists of initialising the context's parent stack
pointer to point to its master's stack frame,
and the context's resume instruction pointer to the value of \code{CodeLabel}.
% \paul{Possibly:
% Setup consists of creating a stack frame on the context's stack and copying
% the relevant values from the master's stack frame onto the worker-context's.
% Also, the context's resume instruction pointer will be initialised to
% the value of \code{CodeLabel}.
% }
\item[\code{lc\_join\_and\_terminate(LC, LCslot)}]
This operation marks the slot named by \LCS{} in \LC{} as available again.
It then terminates the context executing it,
allowing the engine that was running it to look for other work.
\item[\code{lc\_finish(LC)}]
This operation is executed by the master context when we know
that this loop will not spawn off any more work packages.
It suspends its executing context
until all the slots in \LC{} become free.
This will happen only when all the goals spawned off by the loop
have terminated.
This is necessary to ensure that
all variables produced by the recursive call
that are \emph{not} signalled via futures
have in fact had values generated for them.
A variable generated by a parallel conjunct
that is consumed by a later parallel conjunct will be signalled via a future,
but if the variable is consumed only by code after the parallel conjunction,
then it is made available by writing its value directly in its stack
slot.
Therefore such variables can exist
only if the original predicate had code after the parallel conjunction;
for example, map over lists
must perform a construction after the recursive call.
This barrier is the only barrier in the loop and it is executed just once;
in comparison, the normal parallel conjunction execution mechanism
executes one barrier in each iteration of the loop.
\end{description}
\zoltan{Should we give the pseudo-code of the operations?}
\paul{XXX: I have given the pseudo-code for one of these, if I have time I'll add
pseudo-code for the others.}
\noindent
See Figure~\ref{fig:map_foldl_transformed}
for an example of how we use these operations.
Note in particular that in this transformed version of \mapfoldl{},
the spawned-off computation contains the calls to \var{M} and \var{F},
with the main thread of execution making the recursive call.
This is the first step in preserving tail recursion optimisation.
% Some of these operations also perform some scheduling;
% we will discuss that later.
\picfigure{lc_context_usage}{Loop control context usage}
Figure~\ref{fig:lc_context_usage} shows a visual representation of context
usage when using loop control;
it should be compared with Figure~\ref{fig:linear_context_usage2}.
As before, this is how contexts are likely to be used on a four processor
system
when using a multiplier of two so that eight slots are used;
minor differences in the execution times of each task and similar variables
will mean that no execution will look as regular as in the figures.
In Figure~\ref{fig:lc_context_usage},
we can see that a total of eight contexts are created and four are in use at
a time.
When the loop begins,
the master thread performs a few iterations, executing \lcwaitfreeslot and
\lcspawnoff.
This creates all the contexts and adds them to the runqueue,
but only the first four contexts can be executed as there are only four
processors.
Once those contexts finish their work,
they execute \lcjoinandterminate.
Each call to \lcjoinandterminate marks the relevant slot in the loop control
structure as free,
allowing the master context to spawn off more work using the free slot.
Meanwhile, the other four contexts are now able to execute their work.
This continues until the loop is finished, at which point \lcfinish releases
all the contexts.
\subsection{The loop control transformation}
\label{sec:lc_trans}
Our algorithm for transforming procedures to use loop control
is shown in
Algorithms~\ref{alg:transform_alg},~\ref{alg:reccases_alg} and~\ref{alg:basecases_alg}.
\begin{algorithm}[tbp]
\begin{algorithmic}
\Procedure{loop\_control\_transform}{$OrigProc$}
\State $OrigGoal \gets$ body($OrigProc$)
\State $RecParConjs \gets$ set of parallel conjunctions
in $OrigGoal$ that contain recursive calls
\BigIf
\algcondition{1}{$OrigProc$ is directly but not mutually recursive
(HO calls are assumed not to create recursion), and}
\algcondition{2}{$OrigGoal$ has at most one recursive call
on all possible execution paths, and}
\algcondition{3}{$OrigGoal$ has determinism \ddet, and}
\algcondition{4}{no recursive call is within a disjunction,
a scope that changes the determinism of a goal,
a negation, or the condition of an if-then-else, and}
\algcondition{5}{no member of $RecParConjs$ is nested within
another parallel conjunction, and}
\algcondition{6}{every recursive call is inside
the last conjunct of a member of $RecParConjs$, and}
\algcondition{7}{every execution path through
one of these last conjuncts
makes exactly one recursive call}
\BigIfThen
\State $LC \gets$ create\_new\_variable()
\State $LCGoal \gets$ the call
`lc\_create\_loop\_control($LC$)'
\State $LoopProcName \gets$ a new unique predicate name
\State $OrigArgs \gets$ arg\_list($OrigProc$)
\State $LoopArgs \gets$ [$LC$] ++ $OrigArgs$
\State $CallLoopGoal \gets$ the call
`LoopProcName($LoopArgs$)'
\State $NewProcBody \gets$ the conjunction `$LCGoal,~CallLoopGoal$'
\State $NewProc \gets OrigProc$ with its body replaced
by $NewProcBody$
\State \parbox{0.98\textwidth}{
\begin{tabbing}
$LoopGoal \gets$ create\_loop\_goal(\=$OrigGoal$,
$OrigProcName$, $LoopProcName$, \\
\>$RecParConjs$, $LC$)
\end{tabbing}}
\State $LoopProc \gets \text{new\_procedure}\left(
\begin{tabular}{l}
$LoopProcName$\code{(}$LoopArgs$\code{) :-} \\
\code{~~~~}$LoopGoal$\code{.}
\end{tabular} \right)$
\State $NewProcs \gets$ [$NewProc,~LoopProc$]
\BigIfElse
\State $NewProcs \gets$ [$OrigProc$]
\EndBigIfElse
\State \Return $NewProcs$
\EndProcedure
\end{algorithmic}
%\vspace{2mm}
\caption{The top level of the transformation algorithm}
\label{alg:transform_alg}
%\vspace{-1\baselineskip}
\end{algorithm}
\begin{algorithm}[tbp]
\begin{algorithmic}
\Procedure{create\_loop\_goal}{$OrigGoal$, $OrigProcName$, $LoopProcName$,
$RecParConjs$, $LC$}
\State $LoopGoal \gets OrigGoal$
\For{$RecParConj \in RecParConjs$}
\State $RecParConj$ has the form `$Conjunct_1~\&~\ldots~\&~Conjunct_n$'
for some $n$
\For{$i \gets 1$ to $n-1$}
\Comment This does not visit the last goal in $RecParConj$
\State $LCSlot_i \gets$ create\_new\_variable()
\State $WaitGoal_i \gets$ the call
`lc\_wait\_free\_slot($LC$, $LCSlot_i$)'
\State $JoinGoal_i \gets$ the call
`lc\_join\_and\_terminate($LC$, $LCSlot_i$)'
\State \parbox{0.7\textwidth}{
\begin{tabbing}
$SpawnGoal_i \gets$ \=a goal that spawns off the sequential
conjunction \\
\>`$Conjunct_i, JoinGoal_i$' as a work package
\end{tabbing}
}
\State $Conjunct_i' \gets$ the sequential conjunction
`$WaitGoal_i, SpawnGoal_i$'
\EndFor
\State $Conjunct_n' \gets Conjunct_n$
\For{each recursive call $RecCall$ in $Conjunct_n'$}:
\State $RecCall$ has the form `$OrigProcName(Args)$'
\State $RecCall' \gets$ the call
`$LoopProcName$([$LC$] ++ $Args$)'
\State \textbf{replace} $RecCall$ with $RecCall'$ in $Conjunct_n'$
\EndFor
\State \parbox{0.7\textwidth}{
\begin{tabbing}
$Replacement \gets$ \=the flattened form
of the sequential conjunction \\
\>`$Conjunct_1',~\ldots,~Conjunct_n'$'
\end{tabbing}}
\State \textbf{replace} $RecParConj$ in $LoopGoal$ with $Replacement$
\EndFor
\State $LoopGoal' \gets$ put\_barriers\_in\_base\_cases($LoopGoal$,
$RecParConjs$, $LoopProcName$, $LC$)
\State \Return $LoopGoal'$
\EndProcedure
\end{algorithmic}
%\vspace{2mm}
\caption{Algorithm for transforming the recursive cases}
\label{alg:reccases_alg}
%\vspace{-1\baselineskip}
\end{algorithm}
Algorithm~\ref{alg:transform_alg} shows the top level of the algorithm,
which is mainly concerned with testing
whether the loop control transformation is applicable to a given procedure,
and creating the interface procedure if it is.
We impose conditions (1) and (2) because we need to ensure
that every loop we start for \var{OrigProc} is finished exactly once,
by the call to \lcfinish we insert into its base cases.
If \var{OrigProc} is mutually recursive with some other procedure,
then the recursion may terminate in a base case of the other procedure,
which our algorithm does not transform.
Additionally, if \var{OrigProc} has some execution path on which it calls
itself twice,
then the second call may continue executing loop iterations
after a base case reached through the first call has finished the loop.
We impose conditions (3) and (4) because the Mercury implementation
does not support the parallel execution of code that is not deterministic.
We do not want a recursive call to be called twice because
some code between the entry point of \var{OrigProc} and the recursive call
succeeded twice,
and we do not want a recursive call to be backtracked into because
some code between the recursive call and the exit point of \var{OrigProc}
has failed.
These conditions prevent both of those situations.
We impose condition (5) because we do not want another instance of loop control,
or an instance of the normal parallel conjunction execution mechanism,
to interfere with this instance of loop control.
We impose condition (6) for two reasons.
First, the structure of our transformation requires right-recursive code:
we could not terminate the loop in base case code
if the call that led to that code
was followed by any part of an earlier loop iteration.
Second, allowing recursion to sometimes occur
outside the parallel conjunctions we are trying to optimise
would unnecessarily complicate the algorithm.
(We do believe that it should be possible to extend our algorithm
to handle recursive calls made outside of parallel conjunctions.)
% We impose condition (7) to simplify the construction of a correctness argument
% in favor of the proposition that the two algorithms
% that transform the recursive calls and the bases cases respectively
% (which are shown in Figures \ref{fig:reccases_alg}
% and \ref{fig:basecases_alg})
% do not interfere in each other's operation.
We impose condition (7) to ensure that
our algorithm for transforming base cases
(Algorithm~\ref{alg:basecases_alg})
does not have to process goals that have already been processed
by our algorithm for transforming recursive calls
(Algorithm~\ref{alg:reccases_alg}).
If the transformation is applicable, we apply it.
The transformed original procedure has only one purpose:
to initialise the loop control structure.
Once that is done, it passes a reference to that structure to \var{LoopProc},
the procedure that does the actual work.
The argument list of \var{LoopProc}
is the argument list of \var{OrigProc}
plus the \LC variable that holds the reference to the loop control structure.
The code of \var{LoopProc} is derived from the code of \var{OrigProc}.
Some execution paths in this code include a recursive call; some do not.
The execution paths that contain a recursive call
are transformed by Algorithm~\ref{alg:reccases_alg};
the execution paths that do not
are transformed by Algorithm~\ref{alg:basecases_alg}.
We start with Algorithm~\ref{alg:reccases_alg}.
Due to condition (6),
every recursive call in \var{OrigGoal}
will be inside the last conjunct of a parallel conjunction,
and the main task of \createloopgoal
is to iterate over and transform these parallel conjunctions.
(It is possible that some parallel conjunctions do not contain recursive calls;
\createloopgoal will leave these untouched.)
% The twin aims of our transformation are to
% (a) spawn off each parallel conjunct before the final recursive conjunct,
% in order to generate work for other cores to do, and
% (b) limit the number of work packages spawned off by the loop at any one time,
% in order to limit memory consumption.
The main aim of the loop control transformation is
to limit the number of work packages spawned off by the loop at any one time,
in order to limit memory consumption.
The goals we want to spawn off
%as work packages that other cores can pick up and execute
are all the conjuncts before the final recursive conjunct.
(Without loop control, we would spawn off all the \emph{later} conjuncts.)
The first half of the main loop in \createloopgoal
therefore generates code that creates and makes available each work package
only after it obtains a slot for it in the loop control structure,
waiting for a slot to become available if necessary.
We make the spawned-off computation free that slot when it finishes.
To implement the spawning off process,
we extended the internal representation of Mercury goals
with a new kind of scope.
The only one shown in the abstract syntax in
Figure~\ref{fig:abstractsyntax}
%Section~\ref{sec:backgnd_mercury}
was the existential quantification scope
($some~[X_1,\ldots,X_n]~G$),
but the Mercury implementation had several other kinds of scopes already,
though none of those are relevant for this dissertation.
We call the new kind of scope the spawn-off scope,
and we make $SpawnGoal_i$ be a scope goal of this kind.
When the code generator processes such scopes,
it
\begin{itemize}
\item
generates code for the goal inside the scope
(which will end with a call to \lcjoinandterminate),
\item
allocates a new label,
\item
puts the new label in front of that code,
\item
puts this labelled code aside so that
later it can be added to the end of the current procedure's code, and
\item
inserts into the instruction stream a call to \lcspawnoff
that specifies that the spawned-off computation should start execution
at the label of the set-aside code.
The other arguments of \lcspawnoff come from the scope kind.
\end{itemize}
\noindent
Since we allocate a loop slot \var{LCSlot}
just before we spawn off this computation,
waiting for a slot to become available if needed,
and free the slot once this computation has finished executing,
the number of computations that have been spawned off by this loop
and which have not yet been terminated
cannot exceed the number of slots in the loop control structure.
\begin{algorithm}[tbp]
\begin{algorithmic}
\Procedure{put\_barriers\_in\_base\_cases}{$LoopGoal$,
$RecParConjs$, $LoopProcName$, $LC$}
\If{$LoopGoal$ is a parallel conjunction in $RecParConjs$}
\Comment{case 1}
\State $LoopGoal' \gets LoopGoal$
\ElsIf{there is no call to $LoopProcName$ in $LoopGoal$}
\Comment{case 2}
\State $FinishGoal \gets$ the call `lc\_finish($LC$)'
\State $LoopGoal' \gets$ the sequential conjunction
`$LoopGoal,~FinishGoal$'
\Else
\Comment{case 3}
\Switch{goal\_type($LoopGoal$)}
\Case{`ite($C$, $T$, $E$)'}
\State $T' \gets$ put\_barriers\_in\_base\_cases($T$,
$RecParConjs$, $LoopProcName$, $LC$)
\State $E' \gets$ put\_barriers\_in\_base\_cases($E$,
$RecParConjs$, $LoopProcName$, $LC$)
\State $LoopGoal' \gets$ `ite($C$, $T'$, $E'$)'
\EndCase
\Case{`switch($V$, [$Case_1$, \ldots, $Case_N$])'}
\For{$i \gets 1$ to $N$}
\State $Case_i \gets$ `case($FunctionSymbol_i$, $Goal_i$)'
\State \parbox{0.7\textwidth}{
\begin{tabbing}
$Goal_i' \gets$ \=put\_barriers\_in\_base\_cases($Goal_i$,
$RecParConjs$, $LoopProcName$,\\
\>$LC$)
\end{tabbing}}
\State $Case_i' \gets$ `case($FunctionSymbol_i$, $Goal_i'$)'
\EndFor
\State $LoopGoal' \gets$ `switch($V$, [$Case_1'$, \ldots, $Case_N'$])'
\EndCase
\Case{`$Conj_1$, \ldots, $Conj_N$'}
\Comment Sequential conjunction
\State $i \gets 1$
\While{$Conj_i$ does not contain a call to $LoopProcName$}
\State $i \gets i + 1$
\EndWhile
\State $Conj_i' \gets$ put\_barriers\_in\_base\_cases($Conj_i$,
$RecParConjs$, $LoopProcName$, $LC$)
\State $LoopGoal' \gets LoopGoal$ with
$Conj_i$ replaced with $Conj_i'$
\EndCase
\Case{`some($Vars$, $SubGoal$)'}
\Comment Existential quantification
\State \parbox{0.7\textwidth}{
\begin{tabbing}
$SubGoal' \gets$ \=put\_barriers\_in\_base\_cases($SubGoal$,
$RecParConjs$,\\
\>$LoopProcName$, $LC$)
\end{tabbing}}
\State $LoopGoal' \gets$ `some($Vars$, $SubGoal'$)'
\EndCase
\Case{a call `$ProcName$($Args$)'}
\If{$ProcName = OrigProcName$}
\State $LoopGoal' \gets$ the call `$LoopProcName$([$LC$] ++ $Args$)'
\Else
\State $LoopGoal' \gets LoopGoal$
\EndIf
\EndCase
\EndSwitch
\EndIf
\State \Return $LoopGoal'$
\EndProcedure
\end{algorithmic}
%\vspace{2mm}
\caption{Algorithm for transforming the base cases}
\label{alg:basecases_alg}