Linux Applications Debugging Techniques/Aiming for and measuring performance

gprof & -pg
To profile the application with gprof:


 * Compile the code with -pg
 * Link with -pg
 * Run the application. This creates a gmon.out file in the current folder of the application.
 * At the prompt, in the folder where gmon.out lives: gprof <application> gmon.out

PAPI
The Performance Application Programming Interface (PAPI) offers the programmer access to the performance counter hardware found in most major microprocessors. With a decent C++ wrapper, measuring branch mispredictions and cache misses (and much more) is literally one line of code away.

By default, these are the events that papi::counters is looking for:

The counters class is parameterized with a policy instructing what to do with the counters once they go out of scope.

As an example, let's look at these lines:

The code just loops over an array, but in the wrong order: the innermost loop iterates over the outer index. While the result is the same whether the innermost loop runs over the first index or over the last one, in theory, to preserve cache locality, the innermost loop should iterate over the innermost index. This should make a big difference in the time it takes to iterate over the array:

papi::counters is a class wrapping around the PAPI functionality. It will take a snapshot of some performance counters (in our case, we are interested in cache misses and in branch mispredictions) when a counters object is instantiated, and another snapshot when the object is destroyed. Then it will print out the differences.

A first measure, with non-optimized code (-O0), shows the following:

While the cache misses have indeed improved, branch mispredictions exploded. Not exactly a good tradeoff. Down in the pipeline of the processor, a comparison operation translates into a branch operation. Something is funny with the unoptimized code the compiler generated.

Typically, branch machine code is generated directly by if statements, loops, and ternary operators; and indirectly by virtual calls and by calls through pointers.

Maybe the optimized code (-O2) is behaving better? Or maybe not:

This time the compiler optimized the loops out! It figured out that we do not really use the data in the array, so it got rid of it. Completely!

Let's see how this code behaves:

Both cache misses and branch mispredictions improved by at least an order of magnitude. A run with unoptimized code will show the same order of improvement.

OProfile
OProfile offers access to the same hardware counters as PAPI but without having to instrument the code:
 * It is coarser-grained than PAPI - at function level.
 * Some out-of-the-box kernels (RedHat) are not OProfile-friendly.
 * You need root access.

perf
perf is a kernel-based subsystem that provides a framework for analyzing the performance impact of the programs being run on the kernel. It covers both hardware features (CPU/PMU, Performance Monitoring Unit) and software features (software counters, tracepoints).

perf tutorial

perf list
Lists the events available on a particular machine. These events vary based on the performance monitoring hardware and the software configuration of the system.

Note: Running this as root will give an extended list of events; some of the events (tracepoints?) require root privileges.

perf stat
Gathers overall statistics for common performance events, including instructions executed and clock cycles consumed. There are option flags to gather statistics on events other than the default measurement events.

perf record
Records performance data into a file which can be later analyzed using perf report.

perf report
Reads the performance data from a file and analyzes the recorded data.

perf annotate
Reads the input file and displays an annotated version of the code. If the object file has debug symbols then the source code will be displayed alongside assembly code. If there is no debug info in the object, then annotated assembly is displayed. Broken!?

perf top
A top-like performance tool: it generates and displays a performance counter profile in real time.

Valgrind: cachegrind
Cachegrind simulates a machine with a two-level cache ([I1 & D1] and L2) and branch (mis)prediction. It is useful in that it annotates the code down to line level. The simulated cache can differ significantly from the machine's real CPU. It will not go very far on an AMD64 CPU (VEX disassembler issues). It is extremely slow, typically slowing the application down 12-15 times.

DIY: libhitcount
libmemleak can easily be modified to keep track of what calls into a particular point in the code: just insert a tracking call at that place.

How things scale up

 * Source: https://gist.github.com/2843375


 * Source: Paul E. McKenney

Notes on Hyper Threading

 * Presumed loss of performance for intensive floating point calculations (only one FPU and one ALU (two pipelines) per core).
 * http://ayumilove.net/hyperthreading-faq/
 * http://en.wikipedia.org/wiki/Hyper-threading

Other tools

 * bpftrace
 * SchedViz
 * eBPF & friends
 * Intel VTune
 * List of performance analysis tools

Varia

 * Microbenchmarks
 * Macrobenchmarks
 * Atomics, fences