Linux Applications Debugging Techniques/Deadlocks

Analysis
Searching for a deadlock means reconstructing the graph of dependencies between threads and resources (mutexes, semaphores, condition variables, etc.): who owns what and who wants to acquire what. A typical deadlock shows up as a cycle in that graph. The task is tedious, as some of the parameters we are looking for have been optimized by the compiler into registers.
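As a rough illustration of that graph search, the sketch below models the ownership and wait information collected from a debugger session and follows the chain "thread, mutex it wants, owner of that mutex" until it either ends or loops back. The LockSnapshot type, its field names and find_deadlock_cycle are hypothetical names invented for this example; it also assumes each blocked thread waits on exactly one mutex.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of what a debugging session reveals.
struct LockSnapshot {
    std::map<std::string, int> owner_of;     // mutex name -> owning thread id
    std::map<int, std::string> waiting_for;  // thread id  -> mutex it is blocked on
};

// Follow thread -> wanted mutex -> owner of that mutex, starting at `start`.
// Returns the threads that form a cycle, or an empty vector if the chain
// ends at a thread that is not blocked (or at a mutex nobody owns).
std::vector<int> find_deadlock_cycle(const LockSnapshot& s, int start) {
    std::vector<int> path;
    std::set<int> seen;
    int t = start;
    while (true) {
        if (seen.count(t)) {
            // t was visited before: the cycle is the part of the path from t on.
            std::vector<int>::iterator it = std::find(path.begin(), path.end(), t);
            assert(it != path.end());
            return std::vector<int>(it, path.end());
        }
        seen.insert(t);
        path.push_back(t);

        std::map<int, std::string>::const_iterator w = s.waiting_for.find(t);
        if (w == s.waiting_for.end()) return std::vector<int>();  // t is not blocked

        std::map<std::string, int>::const_iterator o = s.owner_of.find(w->second);
        if (o == s.owner_of.end()) return std::vector<int>();     // mutex is free

        t = o->second;  // continue with the thread that owns the wanted mutex
    }
}
```

With thread 3 owning mutex A and waiting for B, and thread 4 owning B and waiting for A, the function returns the cycle {3, 4}. Real sessions also involve semaphores, condition variables and joins, which this simplified model does not cover.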

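Below is an analysis of an x86_64 deadlock. In the session analyzed here, register r8 holds the address of the mutex, which is the first argument of the locking call. (Under the System V AMD64 calling convention the first argument is passed in rdi at function entry; by the time a thread is blocked inside the locking code, the value may have been moved to another register.)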

Threads 3 and 4 are deadlocking over two mutexes.
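The shape of such a two-mutex deadlock can be reproduced with a small sketch. It is not the program from the analyzed session: the function name demonstrate_lock_inversion and the 50 ms timeout are choices made for this example, and std::timed_mutex is used instead of a plain mutex only so that the demonstration terminates instead of hanging.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// Returns true when both threads ended up holding one mutex while failing to
// acquire the other: the lock-order inversion behind a two-mutex deadlock.
bool demonstrate_lock_inversion() {
    std::timed_mutex a, b;
    std::atomic<int> holding(0);   // threads that hold their first mutex
    std::atomic<int> finished(0);  // threads done with their second attempt
    std::atomic<int> blocked(0);   // threads whose second attempt timed out

    auto worker = [&](std::timed_mutex& first, std::timed_mutex& second) {
        std::lock_guard<std::timed_mutex> hold(first);
        ++holding;
        // Wait until the peer also holds its first mutex.
        while (holding.load() < 2) std::this_thread::yield();

        // A plain lock() here would block forever; the timeout ends the demo.
        if (second.try_lock_for(std::chrono::milliseconds(50))) {
            second.unlock();
        } else {
            ++blocked;
        }
        ++finished;
        // Keep `first` locked until the peer has finished its attempt too.
        while (finished.load() < 2) std::this_thread::yield();
    };

    std::thread t1(worker, std::ref(a), std::ref(b));  // locks a, then wants b
    std::thread t2(worker, std::ref(b), std::ref(a));  // locks b, then wants a
    t1.join();
    t2.join();
    assert(finished.load() == 2);
    return blocked.load() == 2;
}
```

Replacing try_lock_for with lock() turns the sketch into a program that actually hangs, which can then be inspected by attaching gdb to it.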

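Note: If gdb is unable to resolve the type pthread_mutex_t because it has not loaded debug information for pthreadtypes.h, you can still print the individual members of the struct by reading them as raw integers at the mutex address. With glibc on x86_64, the structure begins with the 32-bit fields __lock, __count and __owner, in that order, so the owner is the third 32-bit word.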

A std::mutex has a similar structure, where __owner is the LWP (the kernel thread ID) of the thread holding the lock.
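The same field can be observed from inside a program, which may help when checking what gdb reports. The sketch below assumes glibc and libstdc++ on Linux: __data.__owner is an internal glibc field rather than a public API, and native_handle() returning a pthread_mutex_t* is a libstdc++ implementation detail. It also assumes lock elision is not enabled, since glibc does not record the owner in that case. The helper names current_lwp and recorded_owner are invented for this example.

```cpp
#include <cassert>
#include <mutex>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

// Kernel thread ID (LWP) of the calling thread, the number gdb prints as "LWP nnnn".
long current_lwp() {
    return syscall(SYS_gettid);
}

// glibc-specific: read the owner recorded inside the underlying pthread mutex.
// Intended for diagnostics only; the field may change between glibc versions.
long recorded_owner(std::mutex& m) {
    pthread_mutex_t* raw = m.native_handle();  // libstdc++: pthread_mutex_t*
    assert(raw != nullptr);
    return raw->__data.__owner;
}
```

While a thread holds the mutex, recorded_owner() is expected to match that thread's current_lwp(); for an unlocked mutex it is expected to be 0.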

Automation
An interposition library can be built to automate deadlock analysis. A significant number of APIs have to be interposed and even then there are cases that would go unnoticed by the library. For instance, one creative way to deadlock two threads without involving any userland locking mechanism is to have each thread join the other. Thus, interposition tools have limited diagnostic functionality.

There are a number of other tools available:


 * gdb-automatic-deadlock-detector: a script that adds a new command, 'blocked', to GDB. The command analyzes all threads, displays which threads are waiting for other threads, and reports deadlocks between threads.
 * Userspace lockdep. Incipient work.
 * Locksmith. Basic.
 * Valgrind Helgrind. Does not suffer from the terrible slowdown other Valgrind tools have (in particular the memory analysis ones) but does not survive long on an amd64 platform.

pthreads
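pthreads has some built-in deadlock detection for certain synchronization mechanisms. For example, a mutex of type PTHREAD_MUTEX_ERRORCHECK returns EDEADLK instead of blocking when a thread tries to lock a mutex it already owns.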
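A minimal sketch of the error-checking behaviour follows. The function name relock_errorcheck_mutex is made up for this example; the pthread calls are the standard POSIX ones.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Lock an error-checking mutex twice from the same thread and return the
// result of the second attempt.
int relock_errorcheck_mutex() {
    pthread_mutexattr_t attr;
    pthread_mutex_t m;

    int rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&m, &attr);
    assert(rc == 0);
    (void)rc;  // silence unused-variable warnings when NDEBUG is defined
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);
    int second = pthread_mutex_lock(&m);  // a default mutex could hang here

    pthread_mutex_unlock(&m);             // the mutex was locked only once
    pthread_mutex_destroy(&m);
    return second;
}
```

The second lock attempt is expected to return EDEADLK. This catches only self-deadlock on a single mutex, not cycles involving several threads.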

POSIX also defines a mutex attribute for robust mutexes, which remain recoverable when the owning thread or process dies while holding them. The next attempt to lock such a mutex returns EOWNERDEAD to indicate that the earlier owner died. The caller acquires the lock anyway, and with it the responsibility for cleaning up the protected state and marking the mutex consistent with pthread_mutex_consistent().
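The recovery sequence can be sketched as follows, using a thread that exits while holding the lock. The names g_robust, die_holding_lock and lock_after_owner_death are hypothetical; a real program would also repair whatever data the mutex protects before marking it consistent.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

static pthread_mutex_t g_robust;  // hypothetical global used by this sketch

// Thread body: take the mutex and exit without ever unlocking it.
static void* die_holding_lock(void*) {
    pthread_mutex_lock(&g_robust);
    return nullptr;
}

// Returns the result of locking the mutex after its owner has terminated.
int lock_after_owner_death() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&g_robust, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_t t;
    int rc = pthread_create(&t, nullptr, die_holding_lock, nullptr);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, nullptr);  // the owner is now gone, the mutex still locked

    int result = pthread_mutex_lock(&g_robust);
    if (result == EOWNERDEAD) {
        // We hold the lock: repair shared state here, then mark it usable again.
        pthread_mutex_consistent(&g_robust);
    }
    if (result == 0 || result == EOWNERDEAD) {
        pthread_mutex_unlock(&g_robust);
    }
    pthread_mutex_destroy(&g_robust);
    return result;
}
```

With a non-robust mutex the lock call after pthread_join would block forever; here it is expected to return EOWNERDEAD.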

lockdep
On lockdep-enabled kernels, pressing Alt+SysRq+d dumps information about all the locks currently held in the kernel. The same dump can be triggered by writing d to /proc/sysrq-trigger.