X86 Assembly/AVX, AVX2, FMA3, FMA4

Prerequisites: X86 Assembly/SSE.

Example FMA4 program
The following program uses the FMA4 instruction vfmaddps, which, on 256-bit YMM registers, performs eight single-precision floating-point multiply-add operations in a single instruction.
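The following is not the assembly program itself. It is a small Python model of the per-lane arithmetic, given as a sketch of what one 256-bit fused multiply-add computes: a product plus an addend in each of eight lanes, rounded to single precision only once. The function names (`to_f32`, `round_fraction_to_f32`, `fma_lanes`) and the input values are hypothetical, chosen only for illustration. Overflow, infinities and NaNs are not modelled.

```python
from fractions import Fraction
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_fraction_to_f32(x: Fraction) -> float:
    """Round an exact rational to binary32 once, ties to even.

    Assumption: the result is within binary32 range (no overflow handling).
    """
    if x == 0:
        return 0.0
    n, d = abs(x.numerator), x.denominator
    e = n.bit_length() - d.bit_length()  # floor(log2(|x|)), or one too high
    if Fraction(2) ** e > abs(x):
        e -= 1
    # Spacing of binary32 values around x (clamped for subnormals).
    quantum = Fraction(2) ** (max(e, -126) - 23)
    # round() on a Fraction rounds half to even and returns an int.
    return float(round(x / quantum) * quantum)


def fma_lanes(a, b, c):
    """Model of an 8-lane fused multiply-add: a[i] * b[i] + c[i] per lane.

    The product and sum are computed exactly and rounded a single time,
    which is what distinguishes a fused operation from multiply-then-add.
    """
    if not (len(a) == len(b) == len(c) == 8):
        raise ValueError("expected three vectors of 8 lanes each")
    return [
        round_fraction_to_f32(Fraction(x) * Fraction(y) + Fraction(z))
        for x, y, z in zip(a, b, c)
    ]


if __name__ == "__main__":
    # Hypothetical inputs, rounded to binary32 as they would be stored in memory.
    a = [to_f32(v) for v in (1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8)]
    b = [to_f32(v) for v in (8.8, 7.7, 6.6, 5.5, 4.4, 3.3, 2.2, 1.1)]
    c = [float(i) for i in range(8)]
    for lane, result in enumerate(fma_lanes(a, b, c)):
        print(lane, result)
```

Exact rational arithmetic is used here so that the single rounding step is explicit. A model that multiplied and added with ordinary floats would round twice and could differ from the hardware result in the last bit.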

If you set a debugger breakpoint after the program's last instruction, you can use GDB to inspect the result. Look at the program and try to spot any problems.

Spoiler alert: dumping the result "vector" in binary shows that precision has been lost.

Comparing v4+12 with v4+16 (byte offsets 12 and 16 into the result, that is, its fourth and fifth floats), one can see that the addend became too small and was lost. We only halved it between +12 and +16, so why is it gone? The reason is that the exponent of the result changed as well, so the addend ended up slightly more than one bit position below the last set bit of the previous mantissa. At that point it was too small for a 32-bit single-precision float, with its 24-bit significand, to represent, and it was rounded away. The data loss is also visible when the floats are dumped in base 10, but be careful: a base-10 printout is itself rounded and is not always a faithful picture of the stored bits.
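The effect described above can be reproduced without FMA4 hardware. The sketch below is a minimal Python model and does not use the program's actual data: the values 3.0 and 6.0, the addend, and the helper names `add_f32` and `bits` are all hypothetical. It shows an addend that survives being halved but is absorbed once the larger operand's exponent also goes up by one.

```python
import struct


def add_f32(a: float, b: float) -> float:
    """Add two binary32 values and round the sum to binary32.

    The sum is formed in binary64 first; for binary32 inputs this extra
    step does not change the final rounded result.
    """
    return struct.unpack("<f", struct.pack("<f", a + b))[0]


def bits(x: float) -> str:
    """Format a binary32 value as 'sign exponent fraction' bit fields."""
    (raw,) = struct.unpack("<I", struct.pack("<f", x))
    s = f"{raw:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"


# Hypothetical addend: 1.25 units in the last place (ulp) of 3.0.
ADDEND = 5 * 2.0 ** -24

if __name__ == "__main__":
    cases = [
        (3.0, ADDEND),      # addend is 1.25 ulp of 3.0
        (3.0, ADDEND / 2),  # halved: 0.625 ulp, still above half an ulp
        (6.0, ADDEND / 2),  # exponent up by one too: 0.3125 ulp, below half
    ]
    for base, addend in cases:
        total = add_f32(base, addend)
        status = "absorbed" if total == base else "kept"
        print(f"{base} + {addend!r}: {status}")
        print("  base  ", bits(base))
        print("  result", bits(total))
```

With round-to-nearest, an addend smaller than half an ulp of the larger operand leaves that operand unchanged. Comparing the `base` and `result` bit patterns side by side is the same check as comparing the binary dumps in the debugger.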

Resources

 * Virtual machine (pre-built for Ubuntu) and VM snapshot for testing new instructions on legacy CPUs
 * IEEE 754 single-precision interactive applet
 * Introduction to Intel AVX
 * Binutils test suite (for AVX and FMA examples in AT&T and Intel syntax):
    * http://sourceware.org/git/?p=binutils.git;a=blob_plain;f=gas/testsuite/gas/i386/fma.s;hb=HEAD
    * http://sourceware.org/git/?p=binutils.git;a=blob_plain;f=gas/testsuite/gas/i386/fma4.s;hb=HEAD
    * http://sourceware.org/git/?p=binutils.git;a=blob_plain;f=gas/testsuite/gas/i386/avx.s;hb=HEAD
    * http://sourceware.org/git/?p=binutils.git;a=blob_plain;f=gas/testsuite/gas/i386/avx-gather.s;hb=HEAD
    * http://sourceware.org/git/?p=binutils.git;a=blob_plain;f=gas/testsuite/gas/i386/avx2.s;hb=HEAD
    * http://sourceware.org/git/?p=binutils.git;a=blob_plain;f=gas/testsuite/gas/i386/avx256int.s;hb=HEAD