X86 Assembly/SSE

SSE stands for Streaming SIMD Extensions. It is essentially the floating-point equivalent of the MMX instructions. The SSE registers are 128 bits, and can be used to perform operations on a variety of data sizes and types. Unlike MMX, the SSE registers do not overlap with the floating point stack.

Registers
SSE, introduced by Intel in 1999 with the Pentium III, creates eight new 128-bit registers:

XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7

Originally, an SSE register could only be used as four 32-bit single precision floating point numbers (the equivalent of a float in C). SSE2 expanded the capabilities of the XMM registers, so they can now be used as:


 * 2 64-bit floating points (double precision)
 * 2 64-bit integers
 * 4 32-bit floating points (single-precision)
 * 4 32-bit integers
 * 8 16-bit integers
 * 16 8-bit characters (bytes)
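
The layouts listed above can be pictured as different views of the same 16 bytes. The following is a minimal sketch in C rather than assembly; the names `xmm_view` and `xmm_fill` are hypothetical and only serve to illustrate the lane counts.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical type: one 128-bit XMM register seen through each lane layout. */
typedef union {
    double  f64[2];  /* 2 double-precision floats */
    int64_t i64[2];  /* 2 64-bit integers         */
    float   f32[4];  /* 4 single-precision floats */
    int32_t i32[4];  /* 4 32-bit integers         */
    int16_t i16[8];  /* 8 16-bit integers         */
    uint8_t u8[16];  /* 16 bytes                  */
} xmm_view;

/* Hypothetical helper: set all 16 bytes of the register to the same value. */
static xmm_view xmm_fill(uint8_t byte) {
    xmm_view r;
    memset(r.u8, byte, sizeof r.u8);
    return r;
}
```

Every member covers the same 128 bits, so filling the bytes with 0xFF makes every integer lane read as -1, whichever width is used to look at it.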

Data movement examples
The following program (using NASM syntax) performs data movements using SIMD instructions.

Arithmetic example using packed singles
The following program (using NASM syntax) performs a few SIMD operations on some numbers.

The result values should be: 30.800   51.480    77.000    107.360
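
As a cross-check of these values, here is a scalar model in C of one instruction sequence that yields them. The inputs (v1 = 1.1, 2.2, 3.3, 4.4 and v2 = 5.5, 6.6, 7.7, 8.8) and the add, multiply, subtract order are assumptions chosen to match the stated results, and the function names are hypothetical.

```c
#include <stddef.h>

/* Scalar models of packed single-precision instructions: all four lanes are processed. */
static void addps(float dst[4], const float src[4]) {
    for (size_t i = 0; i < 4; i++) dst[i] += src[i];
}
static void mulps(float dst[4], const float src[4]) {
    for (size_t i = 0; i < 4; i++) dst[i] *= src[i];
}
static void subps(float dst[4], const float src[4]) {
    for (size_t i = 0; i < 4; i++) dst[i] -= src[i];
}

/* Assumed inputs and sequence: out = ((v1 + v2) * v2) - v2, lane by lane. */
static void packed_demo(float out[4]) {
    const float v1[4] = {1.1f, 2.2f, 3.3f, 4.4f};  /* assumption */
    const float v2[4] = {5.5f, 6.6f, 7.7f, 8.8f};  /* assumption */
    for (size_t i = 0; i < 4; i++) out[i] = v1[i];
    addps(out, v2);
    mulps(out, v2);
    subps(out, v2);
}
```

For the first lane this gives (1.1 + 5.5) * 5.5 - 5.5 = 30.8, and the other lanes follow the same pattern.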

Using the GNU toolchain, you can debug and single-step like this:

Debugger commands explained

 * break: In this case, sets a breakpoint at a given label
 * stepi: Steps one instruction forward in the program
 * p: short for print, prints a given register or variable. Registers are prefixed by $ in GDB.
 * x: short for examine, examines a given memory address. The "/4f" means "4 floats" (floats in GDB are 32 bits wide). You can use c for chars or x for hexadecimal in place of f, and any other count in place of 4. The "&" takes the address of v1, as in C.

Shuffling example using shufps
shufps can be used to shuffle packed single-precision floats. The instruction takes three parameters: arg1, an xmm register; arg2, an xmm register or a 128-bit memory location; and IMM8, an 8-bit immediate control byte. shufps takes two elements each from arg1 and arg2 and copies them to arg1. The lower two elements come from arg1 and the higher two elements from arg2.

IMM8 control byte description
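The IMM8 control byte is split into four 2-bit fields that control the output into arg1 as follows:

 * IMM8[1:0] specifies which element of arg1 ends up in the least significant element of arg1.
 * IMM8[3:2] specifies which element of arg1 ends up in the second element of arg1.
 * IMM8[5:4] specifies which element of arg2 ends up in the third element of arg1.
 * IMM8[7:6] specifies which element of arg2 ends up in the most significant element of arg1.

In each field, the value 0 selects the least significant element (bits 31-0), 1 the second element (bits 63-32), 2 the third element (bits 95-64) and 3 the most significant element (bits 127-96).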

IMM8 Example

Consider the byte 0x1B, which is 00 01 10 11 in binary (bits 7-6, 5-4, 3-2 and 1-0 respectively).

These 2-bit values are used to determine which elements are copied to arg1. Bits 7-4 are "indexes" into arg2, and bits 3-0 are "indexes" into arg1.
 * Since bits 7-6 are 0, the least significant element of arg2 is copied to the most significant element of arg1, bits 127-96.
 * Since bits 5-4 are 1, the second element of arg2 is copied to the third element of arg1, bits 95-64.
 * Since bits 3-2 are 2, the third element of arg1 is copied to the second element of arg1, bits 63-32.
 * Since bits 1-0 are 3, the fourth element of arg1 is copied to the least significant element of arg1, bits 31-0.

Note that since the first and second arguments are equal in the following example, the mask 0x1B will effectively reverse the order of the floats in the XMM register, since the 2-bit integers are 0, 1, 2, 3. Had it been 3, 2, 1, 0 (0xE4) it would be a no-op. Had it been 0, 0, 0, 0 (0x00) it would be a broadcast of the least significant 32 bits.
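
The selection rules described above can be checked with a scalar model in C. This is only a sketch of the lane selection, not how the instruction is implemented; `shufps_model` is a hypothetical name and index 0 stands for the least significant element.

```c
#include <stdint.h>

/* Scalar model of "shufps dst, src, imm8". Index 0 is the least significant element. */
static void shufps_model(float dst[4], const float src[4], uint8_t imm8) {
    float r[4];
    r[0] = dst[imm8 & 3];         /* bits 1-0 index into the first argument  */
    r[1] = dst[(imm8 >> 2) & 3];  /* bits 3-2 index into the first argument  */
    r[2] = src[(imm8 >> 4) & 3];  /* bits 5-4 index into the second argument */
    r[3] = src[(imm8 >> 6) & 3];  /* bits 7-6 index into the second argument */
    /* Write back only after all reads, so dst and src may be the same array. */
    for (int i = 0; i < 4; i++) dst[i] = r[i];
}
```

Calling `shufps_model(v, v, 0x1B)` on v = {1, 2, 3, 4} leaves {4, 3, 2, 1} in v, while 0xE4 leaves v unchanged and 0x00 fills it with copies of v[0].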

Example

Using GAS to build an ELF executable

Text Processing Instructions
SSE 4.2 adds four string text processing instructions: PCMPISTRI, PCMPISTRM, PCMPESTRI and PCMPESTRM. These instructions take three parameters: arg1, an xmm register; arg2, an xmm register or a 128-bit memory location; and IMM8, an 8-bit immediate control byte. These instructions perform an arithmetic comparison between the packed contents of arg1 and arg2. IMM8 specifies the format of the input/output as well as the operation of two intermediate stages of processing. The results of stage 1 and stage 2 of intermediate processing will be referred to as IntRes1 and IntRes2 respectively. These instructions also provide additional information about the result through overloaded use of the arithmetic flags (AF, CF, OF, PF, SF and ZF).

The instructions proceed in multiple steps:
 * 1) arg1 and arg2 are compared
 * 2) An aggregation operation is applied to the result of the comparison with the result flowing into IntRes1
 * 3) An optional negation is performed with the result flowing into IntRes2
 * 4) An output in the form of an index (in ECX) or a mask (in XMM0) is produced
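
The steps above can be sketched in C for one particular configuration: unsigned bytes, implicit (null-terminated) lengths, the "equal any" aggregation and an index output. This is a simplified model under those assumptions; the function names and the `negate` and `msb` parameters are hypothetical stand-ins for the corresponding IMM8 settings.

```c
#include <stdint.h>

/* Number of valid bytes before the first null terminator (at most 16). */
static int implicit_len(const uint8_t s[16]) {
    int n = 0;
    while (n < 16 && s[n] != 0) n++;
    return n;
}

/*
 * Simplified model of an index-returning compare with "equal any" aggregation.
 * arg1 is the set of characters to look for, arg2 is the string to search.
 * negate: 0 = positive polarity, 1 = masked negative polarity.
 * msb:    0 = return least significant index, 1 = most significant index.
 * Returns the index that would be produced, or 16 if no bit is set.
 */
static int pcmpistri_equal_any(const uint8_t arg1[16], const uint8_t arg2[16],
                               int negate, int msb) {
    int len1 = implicit_len(arg1);
    int len2 = implicit_len(arg2);
    uint16_t intres1 = 0;

    /* Steps 1 and 2: compare every valid pair, OR the results per arg2 element. */
    for (int j = 0; j < len2; j++)
        for (int i = 0; i < len1; i++)
            if (arg2[j] == arg1[i]) intres1 |= (uint16_t)(1u << j);

    /* Step 3: masked negation flips only the bits of valid arg2 elements. */
    uint16_t valid2 = (uint16_t)((1u << len2) - 1u);
    uint16_t intres2 = negate ? (uint16_t)(intres1 ^ valid2) : intres1;

    /* Step 4: produce the index of the lowest or highest set bit. */
    if (msb) {
        for (int j = 15; j >= 0; j--)
            if (intres2 & (1u << j)) return j;
    } else {
        for (int j = 0; j < 16; j++)
            if (intres2 & (1u << j)) return j;
    }
    return 16;
}
```

With arg1 holding "aeiou" and arg2 holding "hello world", the model returns 1 for the first vowel, 7 for the last vowel when `msb` is set, and 0 for the first non-vowel when `negate` is set.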

IMM8 control byte description
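The IMM8 control byte is split into four groups of bit fields that control the following settings:

 * IMM8[1:0] specifies the format of the 128-bit source data (arg1 and arg2): 00b for unsigned bytes, 01b for unsigned words, 10b for signed bytes and 11b for signed words.
 * IMM8[3:2] specifies the aggregation operation whose result will be placed in intermediate result 1, which we will refer to as IntRes1: 00b for Equal Any, 01b for Ranges, 10b for Equal Each and 11b for Equal Ordered. The size of IntRes1 depends on the format of the source data: 16 bits for packed bytes and 8 bits for packed words.
 * IMM8[5:4] specifies the polarity, or the processing of IntRes1 into intermediate result 2, which will be referred to as IntRes2: 00b for positive polarity (IntRes2 = IntRes1), 01b for negative polarity (every bit of IntRes1 is inverted), 10b for masked positive (IntRes2 = IntRes1) and 11b for masked negative (only the bits that correspond to valid elements of arg2 are inverted).
 * IMM8[6] specifies the output selection, or how IntRes2 will be processed into the output. For PCMPESTRI and PCMPISTRI, the output is an index into the data currently referenced by arg2: 0 selects the index of the least significant set bit of IntRes2, 1 the index of the most significant set bit. For PCMPESTRM and PCMPISTRM, the output is a mask reflecting all the set bits in IntRes2: 0 returns a bit mask zero-extended into XMM0, 1 expands each bit into a full byte or word.
 * IMM8[7] should be set to zero since it has no designed meaning.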
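
To make the layout of the control byte concrete, here is a small C sketch that splits an IMM8 value into its fields. The bit positions used (1:0 format, 3:2 aggregation, 5:4 polarity, 6 output selection, 7 reserved) follow Intel's documentation of these instructions; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the decoded control fields. */
typedef struct {
    unsigned format;       /* IMM8[1:0]: source data format   */
    unsigned aggregation;  /* IMM8[3:2]: aggregation operation */
    unsigned polarity;     /* IMM8[5:4]: polarity              */
    unsigned output;       /* IMM8[6]:   output selection      */
    unsigned reserved;     /* IMM8[7]:   should be zero        */
} pcmpstr_ctrl;

/* Split the control byte into its bit fields. */
static pcmpstr_ctrl decode_pcmpstr_imm8(uint8_t imm8) {
    pcmpstr_ctrl c;
    c.format      = imm8 & 0x3u;
    c.aggregation = (imm8 >> 2) & 0x3u;
    c.polarity    = (imm8 >> 4) & 0x3u;
    c.output      = (imm8 >> 6) & 0x1u;
    c.reserved    = (imm8 >> 7) & 0x1u;
    return c;
}
```

For example, 0x58 (binary 0101 1000) decodes to format 0, aggregation 2, polarity 1 and output selection 1.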

The Four Instructions
PCMPISTRI, Packed Compare Implicit Length Strings, Return Index. Compares strings of implicit length and generates an index in ECX.

Operands

 * arg1: XMM register
 * arg2: XMM register or memory
 * IMM8: 8-bit immediate value

Modified flags

 * 1) CF is reset if IntRes2 is zero, set otherwise
 * 2) ZF is set if a null terminating character is found in arg2, reset otherwise
 * 3) SF is set if a null terminating character is found in arg1, reset otherwise
 * 4) OF is set to IntRes2[0]
 * 5) AF is reset
 * 6) PF is reset
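
The flag rules for the implicit-length forms can be written down as a small C model, using the usual flag names CF, ZF, SF, OF, AF and PF. It is a sketch for byte data only, and it takes the second intermediate result (IntRes2) as an input rather than computing it; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the six arithmetic flags (1 = set, 0 = reset). */
typedef struct { int cf, zf, sf, of, af, pf; } pcmpistr_flags;

/* Returns 1 if any of the 16 bytes is a null terminator. */
static int has_null(const uint8_t s[16]) {
    for (int i = 0; i < 16; i++)
        if (s[i] == 0) return 1;
    return 0;
}

/* Flag results for an implicit-length compare on bytes, given IntRes2. */
static pcmpistr_flags pcmpistr_flags_model(const uint8_t arg1[16],
                                           const uint8_t arg2[16],
                                           uint16_t intres2) {
    pcmpistr_flags f;
    f.cf = (intres2 != 0);   /* reset only when IntRes2 is zero */
    f.zf = has_null(arg2);   /* terminator found in arg2        */
    f.sf = has_null(arg1);   /* terminator found in arg1        */
    f.of = intres2 & 1;      /* copy of IntRes2[0]              */
    f.af = 0;                /* always reset                    */
    f.pf = 0;                /* always reset                    */
    return f;
}
```

A typical loop over a long string would test ZF to detect that the current 16-byte block contains the end of the string, and CF to detect that a match was found in it.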

Example

Expected output:

PCMPISTRM, Packed Compare Implicit Length Strings, Return Mask. Compares strings of implicit length and generates a mask stored in XMM0.

Operands

 * arg1: XMM register
 * arg2: XMM register or memory
 * IMM8: 8-bit immediate value

Modified flags

 * 1) CF is reset if IntRes2 is zero, set otherwise
 * 2) ZF is set if a null terminating character is found in arg2, reset otherwise
 * 3) SF is set if a null terminating character is found in arg1, reset otherwise
 * 4) OF is set to IntRes2[0]
 * 5) AF is reset
 * 6) PF is reset

PCMPESTRI, Packed Compare Explicit Length Strings, Return Index. Compares strings of explicit length and generates an index in ECX.

Operands

 * arg1: XMM register
 * arg2: XMM register or memory
 * IMM8: 8-bit immediate value

Implicit Operands

 * EAX holds the length of arg1
 * EDX holds the length of arg2

Modified flags

 * 1) CF is reset if IntRes2 is zero, set otherwise
 * 2) ZF is set if the absolute value of EDX is < 16 (for bytes) or 8 (for words), reset otherwise
 * 3) SF is set if the absolute value of EAX is < 16 (for bytes) or 8 (for words), reset otherwise
 * 4) OF is set to IntRes2[0]
 * 5) AF is reset
 * 6) PF is reset
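
For the explicit-length forms, Intel's documentation states that each length register (EAX for the first operand, EDX for the second) is interpreted as an absolute value that saturates at 16 for bytes or 8 for words. A minimal C sketch of that rule follows; the names `explicit_len` and `length_flag` are hypothetical.

```c
#include <stdint.h>

/*
 * Effective element count for an explicit-length compare.
 * reg:       signed value of the length register (EAX or EDX)
 * max_elems: 16 for byte data, 8 for word data
 */
static int explicit_len(int32_t reg, int max_elems) {
    int64_t n = reg < 0 ? -(int64_t)reg : (int64_t)reg;  /* absolute value */
    return n > max_elems ? max_elems : (int)n;           /* saturate       */
}

/* Flag value (ZF for EDX, SF for EAX): set when the string ends inside the block. */
static int length_flag(int32_t reg, int max_elems) {
    return explicit_len(reg, max_elems) < max_elems;
}
```

A length of -3 therefore behaves like 3, and a length of 100 on byte data behaves like 16 and leaves the corresponding flag reset.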

PCMPESTRM, Packed Compare Explicit Length Strings, Return Mask. Compares strings of explicit length and generates a mask stored in XMM0.

Operands

 * arg1: XMM register
 * arg2: XMM register or memory
 * IMM8: 8-bit immediate value

Implicit Operands

 * EAX holds the length of arg1
 * EDX holds the length of arg2

Modified flags

 * 1) CF is reset if IntRes2 is zero, set otherwise
 * 2) ZF is set if the absolute value of EDX is < 16 (for bytes) or 8 (for words), reset otherwise
 * 3) SF is set if the absolute value of EAX is < 16 (for bytes) or 8 (for words), reset otherwise
 * 4) OF is set to IntRes2[0]
 * 5) AF is reset
 * 6) PF is reset

SSE Instruction Set
There are literally hundreds of SSE instructions, some of which are capable of much more than simple SIMD arithmetic. For more in-depth references take a look at the resources chapter of this book.

You may notice that many floating point SSE instructions end with something like PS or SD. These suffixes differentiate between different versions of the operation. The first letter describes whether the instruction should be Packed or Scalar. Packed operations are applied to every member of the register, while scalar operations are applied to only the first value. For example, in pseudo-code, a packed add would be executed as:

    v1[0] = v1[0] + v2[0]
    v1[1] = v1[1] + v2[1]
    v1[2] = v1[2] + v2[2]
    v1[3] = v1[3] + v2[3]

While a scalar add would only be:

    v1[0] = v1[0] + v2[0]

The second letter refers to the data size: either Single or Double. This simply tells the processor whether to use the register as four 32-bit floats or two 64-bit doubles, respectively.
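
The four suffix combinations can be modelled in C as follows. This is a scalar sketch of the lane behaviour using ADD as the example operation; the function names are hypothetical and index 0 stands for the lowest element of the register.

```c
/* PS: packed single, all four 32-bit lanes are added. */
static void addps_model(float v1[4], const float v2[4]) {
    for (int i = 0; i < 4; i++) v1[i] += v2[i];
}

/* SS: scalar single, only the lowest 32-bit lane is added. */
static void addss_model(float v1[4], const float v2[4]) {
    v1[0] += v2[0];
}

/* PD: packed double, both 64-bit lanes are added. */
static void addpd_model(double v1[2], const double v2[2]) {
    for (int i = 0; i < 2; i++) v1[i] += v2[i];
}

/* SD: scalar double, only the lowest 64-bit lane is added. */
static void addsd_model(double v1[2], const double v2[2]) {
    v1[0] += v2[0];
}
```

With v1 = {1, 2, 3, 4} and v2 = {10, 20, 30, 40}, the packed single model leaves {11, 22, 33, 44} in v1, whereas the scalar single model leaves {11, 2, 3, 4}.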

SSE: Added with Pentium III
Floating-point Instructions:

ADDPS, ADDSS, CMPPS, CMPSS, COMISS, CVTPI2PS, CVTPS2PI, CVTSI2SS, CVTSS2SI, CVTTPS2PI, CVTTSS2SI, DIVPS, DIVSS, LDMXCSR, MAXPS, MAXSS, MINPS, MINSS, MOVAPS, MOVHLPS, MOVHPS, MOVLHPS, MOVLPS, MOVMSKPS, MOVNTPS, MOVSS, MOVUPS, MULPS, MULSS, RCPPS, RCPSS, RSQRTPS, RSQRTSS, SHUFPS, SQRTPS, SQRTSS, STMXCSR, SUBPS, SUBSS, UCOMISS, UNPCKHPS, UNPCKLPS

Integer Instructions:

ANDNPS, ANDPS, ORPS, PAVGB, PAVGW, PEXTRW, PINSRW, PMAXSW, PMAXUB, PMINSW, PMINUB, PMOVMSKB, PMULHUW, PSADBW, PSHUFW, XORPS

SSE2: Added with Pentium 4
Floating-point Instructions:

ADDPD, ADDSD, ANDNPD, ANDPD, CMPPD, CMPSD*, COMISD, CVTDQ2PD, CVTDQ2PS, CVTPD2DQ, CVTPD2PI, CVTPD2PS, CVTPI2PD, CVTPS2DQ, CVTPS2PD, CVTSD2SI, CVTSD2SS, CVTSI2SD, CVTSS2SD, CVTTPD2DQ, CVTTPD2PI, CVTTPS2DQ, CVTTSD2SI, DIVPD, DIVSD, MAXPD, MAXSD, MINPD, MINSD, MOVAPD, MOVHPD, MOVLPD, MOVMSKPD, MOVSD*, MOVUPD, MULPD, MULSD, ORPD, SHUFPD, SQRTPD, SQRTSD, SUBPD, SUBSD, UCOMISD, UNPCKHPD, UNPCKLPD, XORPD


 * * CMPSD and MOVSD have the same name as the string instruction mnemonics CMPSD (CMPS) and MOVSD (MOVS); however, the former refer to scalar double-precision floating-points whereas the latter refer to doubleword strings.

Integer Instructions:

MOVDQ2Q, MOVDQA, MOVDQU, MOVQ2DQ, PADDQ, PSUBQ, PMULUDQ, PSHUFHW, PSHUFLW, PSHUFD, PSLLDQ, PSRLDQ, PUNPCKHQDQ, PUNPCKLQDQ

SSE3: Added with later Pentium 4
ADDSUBPD, ADDSUBPS, HADDPD, HADDPS, HSUBPD, HSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP

SSSE3: Added with Xeon 5100 and early Core 2
PSIGNW, PSIGND, PSIGNB, PSHUFB, PMULHRSW, PMADDUBSW, PHSUBW, PHSUBSW, PHSUBD, PHADDW, PHADDSW, PHADDD, PALIGNR, PABSW, PABSD, PABSB

SSE4.1: Added with later Core 2
MPSADBW, PHMINPOSUW, PMULLD, PMULDQ, DPPS, DPPD, BLENDPS, BLENDPD, BLENDVPS, BLENDVPD, PBLENDVB, PBLENDW, PMINSB, PMAXSB, PMINUW, PMAXUW, PMINUD, PMAXUD, PMINSD, PMAXSD, ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD, INSERTPS, PINSRB, PINSRD, PINSRQ, EXTRACTPS, PEXTRB, PEXTRW, PEXTRD, PEXTRQ, PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXBQ, PMOVZXBQ, PMOVSXWD, PMOVZXWD, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ, PTEST, PCMPEQQ, PACKUSDW, MOVNTDQA

SSE4a: Added with Phenom
LZCNT, POPCNT, EXTRQ, INSERTQ, MOVNTSD, MOVNTSS

SSE4.2: Added with Nehalem
CRC32, PCMPESTRI, PCMPESTRM, PCMPISTRI, PCMPISTRM, PCMPGTQ