OpenMP/Tasks

If you followed along with the previous chapter and did the exercises, you'll have discovered that the sum function we developed was quite unstable. (To be fair, we set up the data to show the instability.) In fact, the naive way of summing has $$O(n)$$ round-off error: the error is proportional to the number of elements we sum. We can fix this by changing the summation algorithm. There are three candidates for a stable sum algorithm:


 * Sort the numbers from small to large, then sum them as before. Problem: this changes the complexity of the problem from linear to $$O(n \log n)$$.
 * Use Kahan's algorithm. While this algorithm takes linear time and is very stable, it is quite slow in practice and harder to parallelize than our third alternative, which is
 * divide-and-conquer recursion. The round-off error for such an algorithm is $$O(\log n)$$, i.e. proportional to the logarithm of the number of elements.
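
For reference, Kahan's algorithm can be sketched as follows. This listing is not from the chapter itself; the function name is ours, and the trick assumes IEEE arithmetic (aggressive compiler options such as `-ffast-math` may optimize the compensation away):

```c
#include <stddef.h>

/* Kahan (compensated) summation: c accumulates the low-order bits
   that plain addition into s would otherwise lose. */
double kahan_sum(const double *x, size_t n) {
  double s = 0.0, c = 0.0;
  for (size_t i = 0; i < n; i++) {
    double y = x[i] - c;  /* apply the running correction */
    double t = s + y;     /* the low bits of y are lost here... */
    c = (t - s) - y;      /* ...and recovered into c */
    s = t;
  }
  return s;
}
```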

The basic divide and conquer summation is very easy to express in C:
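The chapter's original listing is not reproduced here; a minimal sketch of such a function (the name and signature are our assumptions) could be:

```c
#include <stddef.h>

/* Divide-and-conquer summation: split the array in half, sum each
   half recursively, and combine.  The round-off error of this
   scheme grows as O(log n) rather than O(n). */
double sum(const double *x, size_t n) {
  if (n <= 1)
    return n ? x[0] : 0.0;
  size_t half = n / 2;
  return sum(x, half) + sum(x + half, n - half);
}
```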

If you use this definition of the sum function in the program we developed for the previous chapter, you'll see that it produces exactly the expected result. But this algorithm doesn't have a loop, so how do we make a parallel version using OpenMP?

We'll use the `task` construct in OpenMP, treating the problem as task-parallel instead of data-parallel. In a first version, the task-recursive version of the sum function looks like
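The original listing is lost here; a version consistent with the surrounding description (function names are ours) might look like:

```c
#include <stddef.h>

/* Task-recursive sum: each half of the array becomes an OpenMP task.
   s1 and s2 must be shared, or each task would write a private copy. */
double sum(const double *x, size_t n) {
  if (n <= 1)
    return n ? x[0] : 0.0;
  size_t half = n / 2;
  double s1, s2;
#pragma omp task shared(s1)
  s1 = sum(x, half);
#pragma omp task shared(s2)
  s2 = sum(x + half, n - half);
#pragma omp taskwait              /* wait for both halves */
  return s1 + s2;
}

/* Driver: start the thread pool, then let a single thread begin
   the recursion; the tasks it spawns are picked up by the pool. */
double parallel_sum(const double *x, size_t n) {
  double result;
#pragma omp parallel
#pragma omp single nowait
  result = sum(x, n);
  return result;
}
```

Note that without `-fopenmp` (or your compiler's equivalent) the pragmas are ignored and this reduces to the serial recursion, which is a convenient way to check correctness.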

We introduced two tasks, each of which sets a variable that is declared `shared` with the other task. If we did not declare the variables shared, each task would set its own private copy, then throw away the result. We then wait for the tasks to complete with `taskwait` and combine the recursive results.

You may be surprised by the `parallel` directive followed immediately by `single nowait`. The thing is that the `parallel` pragma causes all of the threads in the pool to execute the next block of code. The `single` directive then restricts execution of that block to one thread (usually the first to encounter it), while the `nowait` clause turns off the implied barrier at the end of the single; there's already a barrier at the end of the enclosing `parallel` region, to which the other threads will rush.

Unfortunately, if you actually try to run this code, you'll find that it's still not extremely fast. The reason is that the tasks are much too fine-grained: near the bottom of the recursion tree, $$\tfrac{n}{2}$$ invocations are splitting two-element arrays into subtasks that process one element each. We can solve this problem by introducing, apart from the base and recursive cases, an "intermediate case" for the recursion which is recursive, but does not involve setting up parallel tasks: if the recursion hits a prespecified cutoff, it will no longer try to set up tasks for the OpenMP thread pool, but will just do the recursive sum itself.


 * Exercise: introduce the additional case in the recursion and measure how fast the program is. Don't peek ahead to the next program, because it contains the solution to this exercise.

Now we effectively have two recursions rolled into one: one with a parallel recursive case, and a serial one. We can disentangle the two to get better performance by doing fewer checks at each level. We also separate the parallel setup code into a driver function.
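One possible shape for the disentangled version (again a sketch; the cutoff value and the names are our assumptions, not the chapter's exact program) is:

```c
#include <stddef.h>

#define CUTOFF 1000  /* below this size, recurse without tasks */

/* Serial recursion: no task bookkeeping at all. */
static double sum_serial(const double *x, size_t n) {
  if (n <= 1)
    return n ? x[0] : 0.0;
  size_t half = n / 2;
  return sum_serial(x, half) + sum_serial(x + half, n - half);
}

/* Parallel recursion: spawns tasks only while the pieces are large,
   then hands over to the serial recursion. */
static double sum_tasks(const double *x, size_t n) {
  if (n < CUTOFF)
    return sum_serial(x, n);
  size_t half = n / 2;
  double s1, s2;
#pragma omp task shared(s1)
  s1 = sum_tasks(x, half);
#pragma omp task shared(s2)
  s2 = sum_tasks(x + half, n - half);
#pragma omp taskwait
  return s1 + s2;
}

/* Driver: all the parallel setup lives here. */
double sum(const double *x, size_t n) {
  double result;
#pragma omp parallel
#pragma omp single nowait
  result = sum_tasks(x, n);
  return result;
}
```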

This technique works better when the code inside the parallel tasks spends more time computing and less time doing memory accesses, because those may need to be synchronized between processors.