Message-Passing Interface

This guide assumes prior knowledge of C programming and introduces the Message-Passing Interface (MPI) through several examples.

What is MPI?
MPI is a standardized and portable message-passing system. Message-passing systems are used especially on distributed machines with separate memory to execute parallel applications. With this system, each executing process communicates and shares its data with the others by sending and receiving messages. MPI is the specification produced by the MPI Forum, in which several organizations took part in designing a portable system (one that allows programs to work on a heterogeneous network).

Since the data can only be shared by exchanging messages, this standard is not intended for use on shared-memory systems, such as a multiprocessor computer (although it will work there, this is not its primary goal and there are more powerful alternatives, for instance OpenMP). In essence, MPI includes point-to-point communication, collective communication (over a network of processes), process groups, bindings for Fortran and C, and other advanced functions. On the other hand, the standard does not specify explicit shared-memory operations, debugging facilities, explicit support for threads, or I/O functions.

The current version is 2.0, although 1.1 is still used.

Documentation
The volume Using MPI: Portable Parallel Programming with the Message-Passing Interface by William Gropp, Ewing Lusk and Anthony Skjellum is recommended as an introduction to MPI. For more complete information, read MPI: The Complete Reference by Snir, Otto, Huss-Lederman, Walker and Dongarra. Also, the standard itself can be found at or. For other Internet references, such as tutorials, see the Open Directory Project or the List of Tutorials at the Argonne National Laboratory.

Implementation
Currently, there are two principal implementations: MPICH and LAM. The official web-site of the latter contains much information about LAM and also about MPI.

Supplementary library
Additional libraries exist for solving numerical problems or for using I/O functions on a distributed machine. Among them are the ScaLAPACK library, FFTW (a portable C FFT library), and the PETSc scientific computing libraries. A more complete list can be found on the site

Compilers
Each implementation provides a compiler or at least a front-end to an existing compiler.

Debugging
As MPI does not specify debugging facilities, several programs can be used for this purpose: XMPI is a run/debug GUI for LAM/MPI; MPI-CHECK is a debugger for Fortran; Etnus, Inc. offers a version of the TotalView debugger that supports MPICH (as well as some other MPI implementations); and Streamline Computing's Distributed Debugging Tool works with several MPI implementations, including MPICH and LAM/MPI.

Benchmarks
Mpptest, SKaMPI and MPBench measure the performance of MPI implementations. They can be used to decide what implementation to use, how to implement portable and efficient MPI programs and also for predicting the performance of MPI programs.

Well-known MPI benchmarks include the NAS Parallel Benchmarks, SPEC MPI2007 and the HPCC benchmark suite.

Applications
MPI can be used on a large heterogeneous pool of machines.

Sum of an array
The point of this first simple example is to calculate the sum of all the values stored in an array. In C programming, the sequential version will be as below:
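A minimal sketch of such a sequential version could look like this. The function name sum_array and the use of a long accumulator are assumptions made for illustration, not the names used by the original program.

```c
#include <stddef.h>

/* Sum the n values stored in array (hypothetical helper name). */
long sum_array(const int *array, size_t n)
{
    long sum = 0;

    for (size_t i = 0; i < n; i++) {
        sum += array[i];
    }
    return sum;
}
```

The parallel versions below keep this loop but run it on a sub-array in each process.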

First version - Basic point-to-point communications
MPI implementations allow a user to launch several processes, each of which has a defined numerical identifier, its rank. By convention, we will consider the master to be the process with rank 0. All other processes are slaves. In this program, the master divides the array into sub-arrays and sends one to each slave. Each slave then calculates the sum of its sub-array. When the size of the array is not an exact multiple of the number of slaves, the master finishes the remaining work by calculating the sum of the last values of the array (see figure below).



The source code of the program is identical for every process, and so is the executable code. The code must therefore contain both the slave and the master parts. The part that is executed depends on the rank of the process.

In this first version, the work is divided into just two parts:

This program will be explained step by step.

First, for using MPI functions, we must include the file  which contains the prototypes of the required functions. The size of the array is next defined by. and  will be used by the   and the   functions. Two prototypes of functions are declared. These functions contain the code for the master process and for the slave processes, respectively.

The main function chooses whether to execute the master part (if the process rank is 0) or the slave part (for any other process rank). It begins with the initialization routine, which must be called before any other MPI function and performs various initialization tasks. Next, the rank of the process is obtained from MPI and, depending on the result, the master part or the slave part is executed. At the end of the computation, the finalization routine is called. It performs various administrative tasks associated with ending the use of MPI.

Two declarations not found in the sequential version of the program are included in the master: the variable, which will store the result in each of the two slaves, and the   variable (whose type is defined by MPI), which is needed by the   function.

The computation part is now replaced by two calls to  and. The arguments of the  function are the address of the first element to send, the number of elements to send, their type (the more common MPI types are ,  ,  ,  , ...), the rank of the receiver, the tag of the message, and the communicator. In the simplest usage, the communicator would be, which includes all the processes sharing in the execution of the program. The arguments of the  function are the address of the receive buffer, the maximum number of elements to receive, their type, the rank of the sender, the tag of the message, communicator, and the status. For a common message, the type, the tag and the communicator must be the same (see figure below).



The first half of the array is sent to the first slave and the second half of the array is sent to the second. In each case, the size of the sub-array is. Next, the result is received from the slaves, stored, and added to the final sum.

Before the master can receive the computed sums, however, each slave must first sum the elements of the sub-array it has received from the master. The slave then sends the result back to the master.

When several messages are sent to the same process in an arbitrary order, the tag argument allows this process to distinguish between these messages and to receive them.

To execute this program with a LAM implementation, the LAM daemon must first be started:

$ lamboot

Next, mpicc and mpirun are used for the compilation and the execution:

$ mpicc ./source.c
$ mpirun -np 3 ./a.out

The option -np specifies the number of processes to launch. In this example, it must be three (one master, two slaves). The program  creates an MPI environment, copying   to all three processors and running each one with the appropriate processor number set.

Second version - Adaptiveness to the number of processes
This basic program has two problems:
 * The program needs two slaves to run correctly and cannot adapt itself to the number of processes available.
 * The master receives the data in a predefined order (the first slave and then the second). However, the second slave can finish before the first.

This leads to this second version:

For the first issue, the master needs the number of processes to determine the size of the sub-arrays (which will be stored in the step variable). MPI provides this number, which is stored in the size variable, so the size of the array is divided by the number of slaves, namely size − 1. Each sub-array is then sent to a slave, and while the slaves compute, the master finishes the remaining work if necessary. It then receives the results and adds them to the final sum. To solve the second issue, the master receives results from any source. This leads to the use of MPI_ANY_SOURCE in the receive call in place of the rank of the sending process.
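The division of work described above can be sketched in plain C, without any MPI call. The names struct range and work_range are hypothetical, and the special case of a single process (no slave at all) is an assumption of this sketch about how the master ends up doing all the work.

```c
/* Range of array indices [begin, end) handled by one process. */
struct range {
    int begin;
    int end;
};

/*
 * The array of n elements is cut into size - 1 sub-arrays of `step`
 * elements, one per slave (ranks 1 .. size - 1), and the master (rank 0)
 * keeps whatever is left at the end of the array.
 * With a single process there is no slave, so the master keeps everything.
 */
struct range work_range(int n, int size, int rank)
{
    int slaves = size - 1;
    int step = (slaves > 0) ? n / slaves : 0;
    struct range r;

    if (rank == 0) {
        r.begin = slaves * step;
        r.end = n;
    } else {
        r.begin = (rank - 1) * step;
        r.end = rank * step;
    }
    return r;
}
```

For an array of 10 elements and 4 processes, each slave receives 3 elements and the master sums the single remaining one.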

The slave does not need to know the number of processes, but it does need the size of the sub-array that it has received. For this we use the status argument. Status is a structure that contains three fields: the source of the message (useful when the message was received from any source), the tag of the message (useful when it was received with any tag), and the error code. Additionally, the actual length of the received data can be obtained from the status argument. The count variable is the length of the sub-array.

This version of the program can be used with any number of processes, even 1. In this special case, the master will do all the calculations.

Unfortunately, this program is not well suited to MPI because it transfers a lot of data and performs little computation. Since MPI is designed for non-shared-memory operation, there must be more computation than data transfer in order to lose as little time as possible in messages. We could improve this program by sending only the first and the last numbers to sum (and adding all the integers between them), but in that case only an arithmetic series could be summed.

Integration by the Simpson method
The purpose of this program is to calculate the integral of a given function by the Simpson method. This example is better suited to MPI because it requires more computation and less data transfer than the previous one. The mathematical equation is:

$$\int_{a}^{b}f\left(x\right)dx=\lim_{n\rightarrow\infty}\frac{b-a}{3n}\left[f\left(x_{0}\right)+4f\left(x_{n-1}\right)+f\left(x_{n}\right)+2\sum_{k=1}^{\frac{n}{2}-1}\left[2f\left(x_{2k-1}\right)+f\left(x_{2k}\right)\right]\right]$$

In C programming, the sequential version is:
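A minimal sketch of a sequential version that follows the equation above term by term is given here. The names simpson and square are hypothetical, n is assumed to be even, and the points are assumed to be evenly spaced with x_k = a + k(b − a)/n.

```c
/*
 * Composite Simpson rule on [a, b] with n sub-intervals (n must be even).
 * The points are x_k = a + k * h with h = (b - a) / n.
 */
double simpson(double (*f)(double), double a, double b, int n)
{
    double h = (b - a) / n;
    /* f(x_0) + 4 f(x_{n-1}) + f(x_n) */
    double sum = f(a) + 4.0 * f(a + (n - 1) * h) + f(b);

    /* 2 * sum over k of [2 f(x_{2k-1}) + f(x_{2k})] */
    for (int k = 1; k <= n / 2 - 1; k++) {
        sum += 2.0 * (2.0 * f(a + (2 * k - 1) * h) + f(a + 2 * k * h));
    }
    return h / 3.0 * sum;
}

/* Example integrand, chosen only for illustration. */
double square(double x)
{
    return x * x;
}
```

Calling simpson(square, 0.0, 3.0, 10) approximates the integral of x² over [0, 3], whose exact value is 9.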

First version - User-defined type
The work to be done is a sum, as in the previous example (an array with a different coefficient for each cell). The work will be divided as shown in the figure below.



The sum is divided and calculated by the slaves; as before, the master calculates the other values and completes the work if necessary.

The domain of the sum is sent to each slave, which evaluates the function at several points and sums the returned values. The information sent consists of the beginning of the sum (begin), the difference between successive values (step) and the number of values for which the function must be evaluated, that is, two doubles and one integer. However, all the data sent in a single message must have the same type. Therefore, to minimize the number of messages and so optimize the program, an MPI type will be created and used for sending these data.

In general, the data to send should be grouped and kept to a minimum in order to reduce the time lost in communication (sending as few messages as possible, each with as little data as possible). For example, two integers should be stored in an array and sent in a single message rather than in two separate messages.

Since dividing the work is faster than the work itself, when only a few processes are used the master will wait a relatively long time for the results of the slaves. In this program, when there are fewer than a defined number of processes (LIMIT_PROC), the master takes part in the computation (namely, the work is divided by the number of processes rather than by the number of slaves).

All the routines used for creating the MPI type are grouped in the function Init_Type_Domain to simplify the code. The new type is created by a single MPI call and stored in Domain. This call needs five arguments: the number of blocks that the new type will contain (in this case two doubles and one integer); an array giving the number of elements in each of these blocks (one double, another double and one integer); an array of the displacements of each block in the message; an array of the types of the blocks; and the address of the variable that will contain the new type. The displacement of each block is obtained from the MPI function that gives the length of a type; it is the MPI equivalent of asking for the size of a type in C. The length of an MPI type is stored in a variable whose type is defined by MPI.

Another possibility would be to declare a structure that contains an array of two doubles and an integer:
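A minimal sketch of such a structure, together with what a slave might do with the domain it receives, is shown here. The names domain, bounds, count, sum_over_domain and twice are hypothetical, and the sketch leaves out the MPI calls that would describe the structure to MPI and transmit it.

```c
/* Domain sent to one slave: an array of two doubles and one integer. */
struct domain {
    double bounds[2]; /* bounds[0]: first point, bounds[1]: spacing (step) */
    int count;        /* number of points at which to evaluate f */
};

/* Evaluate f at every point of the domain and sum the returned values. */
double sum_over_domain(const struct domain *d, double (*f)(double))
{
    double sum = 0.0;

    for (int i = 0; i < d->count; i++) {
        sum += f(d->bounds[0] + i * d->bounds[1]);
    }
    return sum;
}

/* Example function, chosen only for illustration. */
double twice(double x)
{
    return 2.0 * x;
}
```

Grouping the two doubles in an array keeps values of the same type together, which is in line with the advice above about sending grouped data.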

Second version
Two different methods have been used so far for determining the number of operations that the slave must perform. In the first example, the slave used the status argument of the receive function. In the second, this number was sent directly with the data. Another way is to obtain it in the same way that the master determined it, namely from the number of processes. To minimize the quantity of data transferred, we will use this last method (this version is just an optimization):

Product of a vector and a matrix
The point of this program is to calculate the product of a vector and a matrix. In C programming, the sequential version is:
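A minimal sketch of such a sequential version follows. The name mat_vec and the choice of double elements are assumptions, and the matrix is assumed to be stored line by line in a one-dimensional array.

```c
/*
 * result = matrix * vector, where matrix has `rows` lines and `cols`
 * columns and is stored line by line in a one-dimensional array.
 */
void mat_vec(const double *matrix, const double *vector, double *result,
             int rows, int cols)
{
    for (int i = 0; i < rows; i++) {
        result[i] = 0.0;
        for (int j = 0; j < cols; j++) {
            result[i] += matrix[i * cols + j] * vector[j];
        }
    }
}
```

Because each line of the result depends only on one line of the matrix and on the whole vector, the lines can be computed independently by different processes.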

First version - Broadcast
To parallelize this program, the matrix is divided into groups of lines and each group is sent to a different slave, while the vector is sent to every slave (see figure below).



Sending the same data to every slave can be done by broadcasting the data with the broadcast function, which must be called by all the processes.

The broadcast function requires fewer arguments than the send and the receive functions. It requires the address of the buffer that contains or will contain the broadcast data, the number of entries in the buffer, their type, the rank of the broadcast root (the process that sends the data) and the communicator. There are neither tag nor status arguments: they serve to differentiate messages and to give information about the received data, and in this case all processes are involved in the broadcast (so there is no need to differentiate messages) and the information that status would hold is already known (the source is the broadcast root, the length is the number of entries and there is no tag).

When the master receives the results, it must reorder them because the order in which they arrive is not specified. For this purpose, the source of each message, stored in the status structure, is used for reordering the pieces and composing the vector.

In this example, the matrix is stored in a one-dimensional array instead of a conventional two-dimensional array. It is possible to send a multi-dimensional array with MPI, but it must be carefully allocated. Namely, the array must be stored contiguously because the first argument of the send functions is the address of the start of the buffer, and the second argument specifies the number of values, starting from that address, that will be sent. The static and dynamic allocations are then:
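A statically declared array such as double m[ROWS][COLS] is already contiguous. For the dynamic case, one possible sketch is given here; the names alloc_matrix and free_matrix are hypothetical, and this is only one of several ways to obtain a contiguous two-dimensional array.

```c
#include <stdlib.h>

/*
 * Allocate a rows x cols matrix whose elements are contiguous in memory.
 * m[0] points to one block of rows * cols doubles and every m[i] points
 * inside that block, so &m[0][0] can be given to a send function together
 * with a count of rows * cols.
 * Returns NULL if an allocation fails.
 */
double **alloc_matrix(int rows, int cols)
{
    double **m = malloc((size_t)rows * sizeof *m);

    if (m == NULL) {
        return NULL;
    }
    m[0] = malloc((size_t)rows * cols * sizeof **m);
    if (m[0] == NULL) {
        free(m);
        return NULL;
    }
    for (int i = 1; i < rows; i++) {
        m[i] = m[0] + (size_t)i * cols;
    }
    return m;
}

/* Release a matrix obtained from alloc_matrix. */
void free_matrix(double **m)
{
    if (m != NULL) {
        free(m[0]);
        free(m);
    }
}
```

Allocating each line with its own call to malloc would not work here, because the lines would then be scattered in memory and could not be sent as a single buffer.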

Second version - Scatter and gather
The MPI specification provides collective communication functions, and the broadcast function is one of these. The next version of this program will use the "scatter" and "gather" functions which, respectively, send different data to everyone and receive different data from everyone (see figure below).



Thus, the work will be divided a little differently, namely, the master will also participate in the computation.

The call to the functions MPI_SCATTER and MPI_GATHER is the same for the master as for the slaves (emphasized arguments are ignored by the slaves). The arguments are:

Note that in the first case, the master sends data to itself and in the second, it receives data from itself. In this example, the matrix is divided by the number of processes and the master sends the first part to itself, the second part to the first slave, etc. Afterwards the master receives the first part of the resulting vector from itself, the second from the first slave, etc.

Cellular automata
In this example, we will code a cellular automaton. A cellular automaton is the evolution of a matrix to which a defined function is applied repeatedly until it converges or becomes cyclic. The function returns a value for each point of the matrix depending on the values of its four neighbors. Each border of the matrix takes as neighbors the values on the border of the opposite side, so the resulting matrix is a torus. In C programming, the sequential version is:
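A minimal sketch of one generation of such an automaton is given here. The matrix dimensions, the names rule and step, and in particular the parity rule itself are assumptions chosen for illustration; the function actually applied by the original program is not known.

```c
#define ROWS 4
#define COLS 4

/* Hypothetical rule, used only for illustration: parity of the neighbors. */
static int rule(int up, int down, int left, int right)
{
    return (up + down + left + right) % 2;
}

/* Compute one generation; indices wrap around, so the matrix is a torus. */
void step(int cur[ROWS][COLS], int next[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            int up = cur[(i + ROWS - 1) % ROWS][j];
            int down = cur[(i + 1) % ROWS][j];
            int left = cur[i][(j + COLS - 1) % COLS];
            int right = cur[i][(j + 1) % COLS];

            next[i][j] = rule(up, down, left, right);
        }
    }
}
```

The wrap-around on the line index is what forces neighboring processes to exchange their first and last lines once the matrix is split into groups of lines.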

First version - Deadlock
The matrix will be divided into groups of lines, as in the product of a vector and a matrix. To calculate its new part of the matrix, each process needs the values of points that belong to other parts of the matrix; each process therefore calculates its part of the new matrix and exchanges its first line and its last line with its two neighbors (see figure below).



While the processes exchange their data, a deadlock can occur. A deadlock is a blocked state of a program in which one process is waiting for a specific action to be taken by another process, while the latter is itself waiting for a specific action to be taken by the former. This phenomenon leads to a permanent waiting state (see figure below).



A deadlock can occur in this code because the MPI_SEND and MPI_RECV functions block the program until they are completed, namely, until the receiver receives the message sent or the sender sends the message to be received. If all the processes perform the same code (excluding the master), process one will send data to process two, process two will send data to process three, etc. Process one will wait until process two receives its data, but this will never happen because process two is also waiting until process three receives its data. Because the communication is cyclic, the program is blocked. As a first solution, we can change the order of sending and receiving on each even process.

In this example, the master and the slave have almost the same code, so it is not necessary to divide the code into two functions (the code specific to the master is simply preceded by a condition on the rank).

While the master waits for data from process 1, process 1 sends data to the master. Process 1 then waits for data from the master while the master sends data to it. The code behaves in the same way for all processes and is thus safe.

Second version - Non-blocking send and receive
A second solution is to use non-blocking send and receive: these functions return control to the process immediately after being called. The process can then carry out some computation, and when the data of the message need to be accessed, it completes the communication by calling the appropriate functions. Programs using non-blocking communication functions can be faster. In this second version, all the communications are completed just after all the sends and all the receives have been started.

In the previous example, the dimension of the matrix had to be a multiple of the number of processes because the remaining lines were not sent to any process. In this version, the master sends these lines to the last process and receives the resulting lines separately.

The MPI_Isend and MPI_Irecv functions (I stands for immediate) need an additional argument: the address of an MPI_Request variable, which identifies the request so that it can be completed later. The status argument is given to the completion functions instead. MPI_WAIT is used to complete a call to one of these functions (identified by its first argument) and waits until the operation has actually completed.

Third version - Sendrecv
The next version uses a send-receive function: it is equivalent to calling a send and a receive function in parallel. Only the while loop differs:

The MPI_SENDRECV function takes essentially the same arguments as those required by the MPI_SEND and MPI_RECV functions.

Fourth version - Ring topology
This last version is radically different in that the topology used is a ring. A new matrix is calculated on process 0 and then sent to process 1. After one iteration, this process sends it to process 2, which calculates the new matrix and sends it on to the next process. The last process returns the matrix to the master, and the cycle continues until the number of iterations reaches a defined constant (see figure below).



With this topology, several different data sets can be processed at the same time. Although in this example the operations applied to the data are all the same, the processes could carry out different kinds of operations. Only one matrix is sent around the ring here. It is, however, possible (and this is the advantage of this structure) to send a second matrix just after the first.

The number of iterations is stored with the matrix, in an additional element. It serves to determine when to stop the loop. When one process reaches the maximum number of iterations, the matrix is then passed on to all the other processes. This is necessary to stop them properly; otherwise they would continue their loop and never stop.

The MPI_SENDRECV_REPLACE function is similar to the MPI_SENDRECV function, though it uses the same buffer for sending and receiving the data.

Differentiation
This program calculates the numerical value of the derivative of a function at multiple points by using the finite difference method. The equation is: $$f'\left(x\right)=\lim_{h\rightarrow0}\frac{f\left(x+h\right)-f\left(x-h\right)}{2h}$$

However, h cannot actually reach 0 in a numerical computation, and its value must not be chosen arbitrarily (see figure below).



The optimal value of h is unknown, so the algorithm decreases h from a starting value until the precision begins to degrade (as long as precision improves, the difference between two derivatives calculated with successive values of h must decrease). In C programming, the sequential version is:
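A minimal sketch of this algorithm is given here. The names central_diff, derivative, absolute and cube are hypothetical, and halving h at each step as well as capping the loop at 60 iterations are assumptions of this sketch rather than choices taken from the original program.

```c
/* Absolute value helper, to avoid depending on the math library. */
static double absolute(double v)
{
    return v < 0.0 ? -v : v;
}

/* Central difference estimate of f'(x) for a given h. */
double central_diff(double (*f)(double), double x, double h)
{
    return (f(x + h) - f(x - h)) / (2.0 * h);
}

/*
 * Halve h for as long as two successive estimates keep getting closer,
 * and return the last estimate obtained before precision degraded.
 * Halving h and the cap of 60 iterations are assumptions of this sketch.
 */
double derivative(double (*f)(double), double x, double h)
{
    double cur = central_diff(f, x, h);
    double next = central_diff(f, x, h / 2.0);
    double diff = absolute(next - cur);

    h /= 2.0;
    cur = next;
    for (int i = 0; i < 60; i++) {
        h /= 2.0;
        next = central_diff(f, x, h);
        double next_diff = absolute(next - cur);
        if (next_diff >= diff) {
            break; /* the estimates stopped converging: keep cur */
        }
        cur = next;
        diff = next_diff;
    }
    return cur;
}

/* Example function, chosen only for illustration: f(x) = x^3. */
double cube(double x)
{
    return x * x * x;
}
```

For instance, derivative(cube, 2.0, 0.1) should return a value close to 12, the exact derivative of x³ at x = 2. The number of iterations depends on the function and on the point, which is why the execution time is hard to predict.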

First version - Farming topology
Since the number of loop iterations varies, the execution time cannot be predicted. This program sends one derivative to calculate to each process, and as soon as one of them has finished (not necessarily the first), the master immediately sends it new data. The data and the results are gathered into two arrays (deriv[] and result[]).

Second version - Completion of send and receive
The second topology illustrated resembles a ring topology and introduces a new routine: MPI_REQUEST_FREE. The master always sends data to process 1 and receives the results from any other process (see figure below).



Since the code of the slave is similar to that of the cellular automaton (in the version using a ring topology), only the master function will be described. To avoid deadlock, only non-blocking sends are used (the non-blocking receive is used here to improve performance).

Once the master has sent enough data (for all processes), it enters a new loop. It receives a result and initializes new data at the same time; next, it completes the send of the data that led to the result it has just received, and it sends new data into the ring. In fact, if the master has received the result for some data, the send of those data must have completed, so there is no need to wait for it. The MPI_REQUEST_FREE function frees a request without checking when that request logically completes.

Communicators and groups
Collective communications (like MPI_BCAST, MPI_SCATTER and MPI_GATHER) send data to the processes that belong to the same group. MPI provides functions for managing groups and communicators. This makes it possible to use collective communications on selected processes and thus to divide a network into groups that can execute different tasks.

For example, different operations can be carried out on a matrix (trace, transpose, inversion, diagonalization). For each operation, a group is defined, and each process communicates with the other processes of its group by using collective communications without disturbing the other groups.

A communicator is an object that specifies a communication domain. It is associated with a group. A group is created from existing groups, and the initial group is the group of the communicator MPI_COMM_WORLD, which contains all processes.

First version - Creation of communicator from a group
The first method for creating new communicators is to create groups from this first group and then to create the communicators of these groups.

In this program, data are broadcast only to the even-ranked processes. This is achieved by creating a communicator that contains the group of the even-ranked processes. First, the initial group of MPI_COMM_WORLD is obtained with the MPI_COMM_GROUP routine. The newly created group contains the $$\frac{\left(size+1\right)}{2}$$ processes that have an even rank in the initial group. The MPI_GROUP_INCL function requires an array of the ranks of the processes of the first group that will belong to the new group. Then MPI_COMM_CREATE creates the communicator of the new group, and if the process has an even rank, it takes part in the broadcast. The group is freed by MPI_GROUP_FREE.
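The array of ranks required by MPI_GROUP_INCL can be built as in the following sketch. The function name even_ranks is hypothetical, and the division is the integer division of C.

```c
/*
 * Fill ranks[] with the even ranks 0, 2, 4, ... found among `size`
 * processes and return how many there are, namely (size + 1) / 2.
 * ranks[] must have room for at least (size + 1) / 2 entries.
 */
int even_ranks(int size, int *ranks)
{
    int n = (size + 1) / 2;

    for (int i = 0; i < n; i++) {
        ranks[i] = 2 * i;
    }
    return n;
}
```

With 5 processes the function returns 3 and the ranks 0, 2 and 4; with 4 processes it returns 2 and the ranks 0 and 2. The returned count and the array would then be passed to MPI_GROUP_INCL.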

The MPI_GROUP_SIZE and MPI_GROUP_RANK functions are similar to MPI_COMM_SIZE and MPI_COMM_RANK. The latter are just shortcuts:

However, if the process that calls MPI_GROUP_RANK does not belong to the group given as argument, the value returned in rank is MPI_UNDEFINED, whereas a call to MPI_COMM_RANK will not work and will stop the program.

Second version - Creation of communicator from a communicator
A communicator can also be created without using groups, directly by splitting an existing communicator.

Each process is placed in a new communicator according to its color. For instance, if two processes have a color equal to 1 and one process has a color equal to 2, two communicators are created, one for each group. The third argument required by MPI_COMM_SPLIT is used for ordering the processes in the new communicator (it determines the rank of each process in the new communicator). Here, the same value is given for all processes.
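The effect of the color and of the ordering argument can be illustrated without MPI by the following sketch, which computes the rank that a process would obtain after a split. The function name rank_after_split and the arrays colors and keys are hypothetical; the sketch assumes that processes of the same color are ordered by key and that equal keys are ordered by rank in the original communicator.

```c
/*
 * Rank that process `me` would get in its new communicator after a split,
 * given the color and the key chosen by each of the `size` processes.
 * Processes with the same color end up in the same communicator; they are
 * ordered by key, and by their original rank when the keys are equal.
 */
int rank_after_split(const int *colors, const int *keys, int size, int me)
{
    int rank = 0;

    for (int p = 0; p < size; p++) {
        if (colors[p] != colors[me]) {
            continue; /* p ends up in another communicator */
        }
        if (keys[p] < keys[me] || (keys[p] == keys[me] && p < me)) {
            rank++;   /* p comes before me in the new communicator */
        }
    }
    return rank;
}
```

With colors 1, 1 and 2 and the same key everywhere, processes 0 and 1 keep their relative order (new ranks 0 and 1) and process 2 becomes rank 0 of its own communicator.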

Environmental management
Finally, environmental management functions exist, for example, for getting the name of the node on which a process runs or for measuring the time taken by a part of a program.

Other possibilities
There are many other functions that have not been mentioned: there are different kinds of point-to-point communication functions; MPI allows users to define datatypes through many different functions; there are also other collective communication and reduction functions; and there are more functions for using communicators and groups.

MPI defines many useful constant, among those,  ,  , ...

MPI can also be used in an advanced way for managing process topologies. Namely, the MPI implementation can exploit the specific features of a physical network (for example a heterogeneous network) by favoring the transfer of messages over the fastest connection.

Prototype of basic functions
For more information on these MPI functions, refer to the man pages:

int MPI_Init(int *argc, char ***argv);
int MPI_Finalize(void);
int MPI_Comm_rank(MPI_Comm comm, int *rank);
int MPI_Comm_size(MPI_Comm comm, int *size);
int MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);
int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);
int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count);
int MPI_Type_extent(MPI_Datatype datatype, MPI_Aint *extent);
int MPI_Type_struct(int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types, MPI_Datatype *newtype);
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);
int MPI_Scatter(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm);
int MPI_Gather(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm);
int MPI_Isend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request);
int MPI_Irecv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request);
int MPI_Sendrecv(void* sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void* recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status);
int MPI_Sendrecv_replace(void* buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status);
int MPI_Wait(MPI_Request *request, MPI_Status *status);
int MPI_Request_free(MPI_Request *request);
int MPI_Group_rank(MPI_Group group, int *rank);
int MPI_Group_size(MPI_Group group, int *size);
int MPI_Comm_group(MPI_Comm comm, MPI_Group *group);
int MPI_Group_free(MPI_Group *group);
int MPI_Group_incl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup);
int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm);
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm);
double MPI_Wtime(void);
int MPI_Get_processor_name(char *name, int *resultlen);

As usual in the C programming language, the "address of an array" is really the address of the first element of the array.

Prerequisites

 * knowledge about C programming: C programming language, Wikibooks:C Programming