Cg Programming/Unity/Computing the Brightest Pixel



This tutorial shows how to compute the position of the brightest pixel in an image with the help of compute shaders in Unity. In particular, it shows how threads in a thread group can use “groupshared” data and how the execution of these threads can be synchronized. If you are not familiar with compute shaders in Unity, you should first read an introduction to them. Note that compute shaders are not supported on macOS.

Why Would Anyone Want to Do This?
Finding the brightest pixel in photographed images is useful for some applications of optical motion capture. Another application is a template matching algorithm that is applied at the positions of all pixels of an image and that stores the likelihood of a match at each pixel in an intermediate image. In this case, the “brightest” pixel of this intermediate image represents the best match with the template. Finding this best match is useful for template-based feature detection and tracking.

Also, the problem of finding the brightest pixel is closely related to many other problems, e.g., finding the darkest pixel, finding the two (or more) brightest pixels, finding the two (or more) brightest pixels with a certain distance between them, or computing the sum or average of all pixels of an image. By solving the problem of finding the brightest pixel, one is therefore very close to solving several of these related problems.

Finding the Brightest Pixel with a Compute Shader
To find the brightest pixel in an image, one has to look at all pixels of the image; thus, the problem can benefit a lot from parallelization.

In this tutorial, we implement a compute shader that first finds the brightest pixel in one pixel row of an image, simply by going through all pixels of the row in a loop and keeping track of the brightest pixel it encounters. We call this compute shader for all rows of the image in parallel. The result is an array of the brightest pixels of each row, which might be a relatively large array (depending on the height of the image). Therefore, we reduce the size of this array by computing the brightest pixel of each thread group at the end of the shader. Since we use thread groups of 64 threads, this reduces the size of the resulting array by a factor of 64, and the new result is an array of the brightest pixels of each thread group. One could try to further reduce this array in parallel, but since this array is already relatively small, we simply transfer the data to the CPU and find the brightest pixel of the whole image by a linear search on the CPU. Note: for any tool, it is not only important to know when to use it, but also when not to use it.
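To make the three stages concrete, the following Python sketch mimics them on the CPU. It is only an illustration under simplifying assumptions: the image is a list of rows of precomputed luminance values, and the names brightest_in_row, brightest_per_group, and brightest_pixel are hypothetical, not taken from the shader. On the GPU, the first two stages run in parallel; here they are ordinary loops.

```python
GROUP_SIZE = 64  # threads (rows) per thread group, as in the tutorial

def brightest_in_row(row, y):
    # Stage 1: one thread scans one row; a candidate is (x, y, luminance).
    return max(((x, y, lum) for x, lum in enumerate(row)), key=lambda c: c[2])

def brightest_per_group(row_results):
    # Stage 2: one winner per block of GROUP_SIZE consecutive rows.
    return [max(row_results[i:i + GROUP_SIZE], key=lambda c: c[2])
            for i in range(0, len(row_results), GROUP_SIZE)]

def brightest_pixel(image):
    rows = [brightest_in_row(row, y) for y, row in enumerate(image)]
    groups = brightest_per_group(rows)
    # Stage 3: linear search over the small per-group array (done on the CPU).
    return max(groups, key=lambda c: c[2]), len(groups)

# Hypothetical 130-row, 4-column image with a single bright pixel.
image = [[0] * 4 for _ in range(130)]
image[100][2] = 1023
result, group_count = brightest_pixel(image)
```

For this 130-row image, the per-row array has 130 entries, while the per-group array that would be transferred to the CPU has only 3.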

Here is a first version of the compute shader:

The first (Unity-specific) line, a #pragma kernel directive, specifies which function is a compute shader function that can be called from a script.

The shader then declares one uniform variable to access the RGBA input texture and another uniform variable to get its width, i.e., the length of a row of pixels.

The next lines define a struct to store the data about candidates for the brightest pixel. Two of its members are the pixel's coordinates, while the third is its relative luminance, scaled to the range from 0 to 1023.

The next definition uses this struct to define a writable structured buffer (corresponding to a compute buffer in Unity) to store the information about the brightest pixel of each thread group.

The definition after that uses the same struct to define a groupshared array to store the information about the brightest pixel of each thread (i.e., of each row) within the current thread group. Note that the total size of the groupshared data in Direct3D 11 is limited to 32 KB. An unsigned int in HLSL is 32 bits wide, but even assuming that it requires as many as 8 bytes, the groupshared array requires at most 64 × 3 × 8 = 1536 bytes, which is well below the limit of 32 KB.
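The memory estimate can be checked with a few lines of arithmetic. The sketch below assumes a 32-bit (4-byte) unsigned int and three unsigned ints per candidate; the constant names are hypothetical and chosen only for this illustration.

```python
GROUP_SIZE = 64           # threads per thread group
UINTS_PER_CANDIDATE = 3   # two coordinates plus the luminance
BYTES_PER_UINT = 4        # assumption: a 32-bit unsigned int
LIMIT_BYTES = 32 * 1024   # groupshared limit quoted in the text

used_bytes = GROUP_SIZE * UINTS_PER_CANDIDATE * BYTES_PER_UINT
conservative_bytes = GROUP_SIZE * UINTS_PER_CANDIDATE * 8  # 8 bytes per uint
share_of_limit = used_bytes / LIMIT_BYTES
```

With 4-byte unsigned ints the array occupies 768 bytes, and even the conservative 8-byte estimate of 1536 bytes is less than 5% of the limit.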

We define the thread group as one-dimensional, with all 64 threads along the x dimension, even though each thread processes a row of the image. The thread group is assumed to work on a “1D array” of 64 rows, and it is usually more straightforward to use the x dimension for a one-dimensional group.

The compute shader function asks for all thread-related indices that are available (although it doesn't use all of them). The index of the thread group is used to index the buffer of per-group results, the thread index within the thread group is used to index the groupshared array, and the overall dispatch index specifies the row of the whole image.
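The relation between these indices can be sketched as follows. In HLSL they are exposed through the system-value semantics SV_GroupID, SV_GroupThreadID, and SV_DispatchThreadID; the Python function names below are hypothetical, and the sketch assumes a one-dimensional group of 64 threads.

```python
GROUP_SIZE = 64  # threads per group along the x dimension

def dispatch_index(group_id, thread_in_group):
    # The overall index (here: the image row) follows from the other two.
    return group_id * GROUP_SIZE + thread_in_group

def split_dispatch_index(row):
    # Inverse mapping: which group and which thread within it handle a row.
    return divmod(row, GROUP_SIZE)

row = dispatch_index(2, 5)
group_id, thread_in_group = split_dispatch_index(133)
```

Thread 5 of group 2 therefore processes row 133, writes its candidate to entry 5 of its group's shared array, and its group's result ends up in entry 2 of the output buffer.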

The function then runs a loop over all pixels of the thread's row by counting a column index from 0 up to the end of the row. It computes the relative luminance (scaled with 1023 in order to work with unsigned ints) of each pixel, compares this luminance to the greatest luminance so far, and if the new luminance is greater, it updates the thread's entry in the groupshared array, which at the end of the loop contains the information about the brightest pixel of the row.
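The per-row loop might look like the following CPU-side sketch. The Rec. 709 luminance weights are an assumption (the shader may use different weights), colors are assumed to be linear RGB values between 0 and 1, and the function names are hypothetical.

```python
# Assumption: Rec. 709 weights for relative luminance.
WEIGHTS = (0.2126, 0.7152, 0.0722)

def scaled_luminance(r, g, b):
    # Relative luminance in [0, 1], scaled to 0..1023 and truncated to an int.
    return int(1023 * (WEIGHTS[0] * r + WEIGHTS[1] * g + WEIGHTS[2] * b))

def brightest_in_row(row, y):
    best_x, best_lum = 0, 0
    for x, (r, g, b) in enumerate(row):
        lum = scaled_luminance(r, g, b)
        if lum > best_lum:  # strictly greater: the first maximum wins
            best_x, best_lum = x, lum
    return (best_x, y, best_lum)

# Hypothetical row with black, pure red, pure green, and pure blue pixels.
row = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
candidate = brightest_in_row(row, 7)
```

Green contributes most to relative luminance, so the pure green pixel at x = 2 is reported as the brightest pixel of this row.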

After computing the brightest pixel of a row, the function computes the brightest pixel of the thread group. Since we need to compare the data of different threads, we first have to make sure that all threads have determined the brightest pixel of their row. This is achieved with a call to GroupMemoryBarrierWithGroupSync(), which not only makes sure that all memory writes of the thread group before this line are completed but also waits until all threads in the thread group reach this line. The code then checks whether the thread index within the group is 0, i.e., whether this is the zeroth thread of the thread group. Only this thread determines the brightest pixel among the entries of the groupshared array and writes it to the output buffer. While this solution works (and is straightforward to implement), it is somewhat wasteful because the 63 other threads in the same thread group have nothing to do while the zeroth thread is working in this loop.
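The pattern of writing to shared data, waiting at a barrier, and letting a single thread finish the work can be imitated with CPU threads. The following Python sketch is only an analogy: it uses a group of 8 threads instead of 64, works on luminance values alone, and all names in it are hypothetical.

```python
import threading

GROUP_SIZE = 8                       # small group for illustration
shared = [0] * GROUP_SIZE            # stands in for the groupshared array
barrier = threading.Barrier(GROUP_SIZE)
group_result = []                    # stands in for the group's output entry

def thread_main(thread_index, row_luminance):
    shared[thread_index] = row_luminance  # publish this thread's row result
    barrier.wait()                        # wait until every thread has written
    if thread_index == 0:                 # only thread 0 scans the shared array
        best = 0
        for i in range(1, GROUP_SIZE):
            if shared[i] > shared[best]:
                best = i
        group_result.append((best, shared[best]))

row_values = [12, 907, 3, 455, 1023, 88, 1023, 0]  # hypothetical row maxima
threads = [threading.Thread(target=thread_main, args=(i, v))
           for i, v in enumerate(row_values)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the barrier, thread 0 could start scanning before the other threads have written their entries and would then miss some candidates.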

A more efficient alternative is given in the second version of the compute shader below. It implements a reduce operation (or fold function) similar to a knockout tournament: In the first step, each thread with an even number compares its brightest pixel with the brightest pixel of the next thread. In the second step, each thread with a number that is divisible by 4 compares its best candidate pixel with the next thread but one, etc. In the sixth (and last) step, the zeroth thread compares its best candidate with the best candidate of the 32nd thread. The “winner” of this last comparison is then the brightest pixel of the group. There are still many idle threads in this version but it only takes 6 steps instead of a loop of 64 iterations, which is a worthwhile improvement. Avoiding any idle threads would require multiple dispatch calls which come with some overhead and therefore might not save any time.
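The tournament can be simulated sequentially to check the pairing and the number of steps. This Python sketch is an illustration under simplifying assumptions: it tracks only luminance values, the function name tournament_reduce is hypothetical, and the inner loop stands in for threads that would run in parallel on the GPU, with a synchronization between consecutive steps.

```python
def tournament_reduce(values):
    # Returns (index of the largest value, number of steps).
    # Assumes that len(values) is a power of 2.
    best = list(range(len(values)))  # best[i]: thread i's current candidate
    steps = 0
    stride = 1
    while stride < len(values):
        # Active threads are those whose index is divisible by 2 * stride.
        for i in range(0, len(values), 2 * stride):
            if values[best[i + stride]] > values[best[i]]:
                best[i] = best[i + stride]
        stride *= 2
        steps += 1
    return best[0], steps

values = [(37 * i) % 64 for i in range(64)]  # hypothetical row luminances
winner, steps = tournament_reduce(values)
```

For 64 entries, the stride takes the values 1, 2, 4, 8, 16, and 32, which gives the 6 steps mentioned above; in the last step, entry 0 is compared with entry 32.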

Here is the improved shader:

Note that the code uses the bitwise AND operator & with powers of 2 minus 1 to test whether the thread index is divisible by the power of 2. We could also use the modulus operator % with the power of 2 instead.
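The equivalence of the two tests is easy to verify for all thread indices of a group. The helper name in this Python sketch is hypothetical; the operators & and % behave the same way for non-negative integers as in the shader.

```python
def divisible_by_power_of_two(index, power):
    # power - 1 has all lower bits set; they must all be 0 in index.
    return (index & (power - 1)) == 0

agree = all(divisible_by_power_of_two(i, p) == (i % p == 0)
            for i in range(64) for p in (2, 4, 8, 16, 32, 64))
active_in_second_step = [i for i in range(16)
                         if divisible_by_power_of_two(i, 4)]
```

For example, 4 − 1 is 3 (binary 11), so an index passes the test for 4 exactly if its two lowest bits are 0.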

Calling the Compute Shader
The C# script to call the compute shader is relatively straightforward:

The script has public variables for the compute shader and the input texture image that you have to set. It returns its result in an array of unsigned ints at a position determined by the index of the brightest thread group.

The structured buffer of the compute shader corresponds to the compute buffer in the script. Note that this is an array of elements that each contain 3 unsigned ints. The result array in the script has the same memory layout but consists of individual unsigned ints; thus, it has three times as many elements as the compute buffer.

The initialization function does some error checking, finds the handle for the compute shader function, creates the compute buffer and the result array, and sets the uniform variables of the compute shader.

The cleanup function releases the compute buffer because it is not released by the garbage collector.

The function that performs the actual computation simply dispatches the compute shader function, where the number of thread groups is determined by the number of rows of the image (i.e., its height) divided by the number of threads in one thread group (i.e., 64 in our case). We add 63 to the number of rows before the integer division in order to make sure that we have enough thread groups for image heights that are not divisible by 64.
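Adding 63 before an integer division by 64 rounds the quotient up. The following Python sketch shows the effect for a few image heights; the function name is hypothetical, and // is Python's integer division.

```python
GROUP_SIZE = 64  # threads per thread group

def thread_group_count(height):
    # Integer division that rounds up instead of down.
    return (height + GROUP_SIZE - 1) // GROUP_SIZE

counts = {h: thread_group_count(h) for h in (1, 64, 65, 1080)}
surplus_threads = thread_group_count(1080) * GROUP_SIZE - 1080
```

An image with 1080 rows needs 17 thread groups, i.e., 1088 threads, so the last 8 threads of the last group have no row of their own to process.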

A call to the compute buffer's GetData method copies the data from the compute buffer to the result array. Then the code finds the brightest pixel in that array by a loop over all groups. Note that the relative luminance of a group is found at a fixed offset within the three consecutive entries that belong to that group because the result array is a “flattened” array of unsigned ints instead of an array of structs with 3 unsigned ints.

At the end, the relative luminance of the brightest pixel and its x and y coordinates are found in the three consecutive entries of the result array that belong to the brightest group.
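The indexing into the flattened array can be sketched as follows. The field order within each group of three entries (x, y, luminance) is an assumption made for this illustration, and the function name find_brightest is hypothetical.

```python
UINTS_PER_GROUP = 3
# Assumed field order within each group's entries: x, y, luminance.
X, Y, LUM = 0, 1, 2

def find_brightest(flat):
    # Linear search over all groups, comparing only the luminance entries.
    best_group = 0
    for group in range(len(flat) // UINTS_PER_GROUP):
        if (flat[UINTS_PER_GROUP * group + LUM]
                > flat[UINTS_PER_GROUP * best_group + LUM]):
            best_group = group
    base = UINTS_PER_GROUP * best_group
    return flat[base + X], flat[base + Y], flat[base + LUM]

# Hypothetical results of three thread groups, flattened into one array.
flat = [5, 10, 300,   7, 70, 1001,   1, 130, 42]
brightest = find_brightest(flat)
```

Here the second group wins, so the result is read from entries 3, 4, and 5 of the flattened array.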

Summary
You have reached the end of this tutorial! A few of the things that you have learned are:
 * How to parallelize a search over all pixels of an image.
 * How to synchronize the execution of threads in a thread group.
 * How to communicate data between threads in a thread group.
 * How to use reduce operations to speed up a search in a “groupshared” array.