Talk:Optimizing C++/Writing efficient code/Performance improving features

float and double performance
In C++, double is often more efficient than float. On my computer (Intel Core2 Quad CPU Q6600 @ 2.4 GHz), the following code runs in about 17 seconds (exploiting only one core) when compiled with the Microsoft C++ compiler version 15 for 32-bit 80x86 using the "/Ox" option:
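(The original benchmark code is not reproduced on this page. A minimal microbenchmark in the same spirit might look like the sketch below; this is a hypothetical reconstruction, not the original code.)

 // Hypothetical reconstruction of a float-vs-double microbenchmark:
 // a long chain of dependent floating-point operations on a single
 // accumulator, timed with clock(). Replace "float" with "double"
 // (and the f suffixes accordingly) to compare the two types.
 #include <ctime>
 #include <iostream>
 
 int main() {
     std::clock_t start = std::clock();
     float x = 0.0f;
     for (long i = 0; i < 1000000000L; ++i) {
         x = x * 0.999999f + 0.001f;   // dependent multiply-add chain
     }
     std::clock_t end = std::clock();
     std::cout << x << " computed in "
               << double(end - start) / CLOCKS_PER_SEC << " s\n";
 }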

Replacing "float" with "double", it runs in 7 seconds. How can you explain it? Carlo.milanesi (talk) 23:28, 31 August 2010 (UTC)


 * I compiled your microbenchmark with gcc 4.5.1 using -O2 and got the following results on an Intel Core2 Duo T7250 @ 2.0 GHz:
 * float 7.759s +- 0.02s
 * double 7.797s +- 0.02s
 * I am certain that in your case the Microsoft C++ compiler uses a non-optimal code path for float. It probably performs the calculations on the x87 FPU and rounds the internal 80-bit precision data to 32-bit precision after each operation. If the compiler used SSE2/3, it would have almost the same performance in both cases.
 * Also, I must note that this microbenchmark is not a real-world scenario, as its bottleneck is raw computational speed (all variables will almost certainly be in registers), while most real-world applications have their bottlenecks in memory bandwidth. Using a data type twice as large consumes twice as much memory bandwidth, introduces many more cache misses, and is twice as slow even to compute when using vectorized code (e.g. employing SSE2/3/4 capabilities). For example, try to multiply two 2000x2000 matrices; you would definitely get a huge slowdown with double, as in the sketch below. 1exec1 (talk) 14:06, 1 September 2010 (UTC)
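A minimal sketch of such a bandwidth-bound case (illustrative only, not a rigorous benchmark): with n = 2000, each matrix occupies about 16 MB as float and about 32 MB as double, so the double version moves twice as much data through the caches and the memory bus.

 #include <cstddef>
 #include <vector>
 
 // Naive n x n matrix multiplication, c += a * b.
 // All matrices are stored row-major in flat vectors of size n * n;
 // c is assumed to be zero-initialized by the caller.
 template <typename T>
 void matmul(const std::vector<T>& a, const std::vector<T>& b,
             std::vector<T>& c, std::size_t n) {
     for (std::size_t i = 0; i < n; ++i)
         for (std::size_t k = 0; k < n; ++k) {
             T aik = a[i * n + k];
             for (std::size_t j = 0; j < n; ++j)
                 c[i * n + j] += aik * b[k * n + j];
         }
 }

Instantiating matmul<float> and matmul<double> with n = 2000 and timing both calls illustrates the effect described above.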


 * Again. One x86 or ARM floating-point SIMD instruction can operate on twice as many floats as doubles (see this, the compiler intrinsics section; also this, pp. 111-112, Vector programming, quote: "It is advantageous to choose the smallest data size that fits the purpose in order to pack as many data as possible into one vector register."). Moreover, the throughput of instructions operating on doubles is a bit lower (see here, e.g. mulps vs. mulpd). This is how we arrive at the fact that doubles are 2 to 3 times slower. I'm not even talking about increased register pressure and lower cache performance, both of which are non-linear; that is, we are already guaranteed to have 2 times lower performance, but there is potential for much higher penalties (a sketch follows below). 1exec1 (discuss • contribs) 23:06, 20 August 2011 (UTC)
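A minimal sketch with SSE intrinsics showing the packing difference (assuming an x86 target with SSE2; remainder elements that do not fill a whole register are ignored here for brevity):

 #include <cstddef>
 #include <emmintrin.h>   // SSE2 intrinsics (includes SSE)
 
 // One 128-bit XMM register holds 4 floats but only 2 doubles, so each
 // mulps processes twice as many elements as a mulpd.
 void scale(float* data, std::size_t n, float factor) {
     __m128 f = _mm_set1_ps(factor);
     for (std::size_t i = 0; i + 4 <= n; i += 4) {
         __m128 v = _mm_loadu_ps(data + i);
         _mm_storeu_ps(data + i, _mm_mul_ps(v, f));   // 4 elements per mulps
     }
 }
 
 void scale(double* data, std::size_t n, double factor) {
     __m128d f = _mm_set1_pd(factor);
     for (std::size_t i = 0; i + 2 <= n; i += 2) {
         __m128d v = _mm_loadu_pd(data + i);
         _mm_storeu_pd(data + i, _mm_mul_pd(v, f));   // 2 elements per mulpd
     }
 }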

Grouping several arrays into a single array of structures
I have to strongly disagree with this advice. Any compiler worth its salt will see that the initial code can be vectorised trivially, whereas the proposed "optimisation" would make it impossible for the compiler to perform such an important optimisation. The reverse transformation, array of structs to struct of arrays, is a very common performance optimisation taught all over the place, simply because it lets the compiler apply vectorisation. I would strongly suggest that this section be rewritten to recommend the reverse transformation.

I checked the performance of both solutions and actually found that you are right. I remembered that the architectures of 20 years ago behaved differently, but that is no longer the case. Therefore I removed this advice and added the opposite advice in the last section of the last chapter.--Carlo.milanesi (discuss • contribs) 20:48, 27 April 2012 (UTC)
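For reference, a minimal sketch of the two layouts discussed above (illustrative names, not code from the book):

 #include <cstddef>
 #include <vector>
 
 // Array of structures (AoS): the x, y, z of one particle are adjacent,
 // so a loop touching only x strides over the unused y and z fields.
 struct ParticleAoS { float x, y, z; };
 
 // Structure of arrays (SoA): all x values are contiguous, so the same
 // loop reads memory sequentially and is trivially vectorised.
 struct ParticlesSoA { std::vector<float> x, y, z; };
 
 void shift_x(std::vector<ParticleAoS>& p, float dx) {
     for (std::size_t i = 0; i < p.size(); ++i)
         p[i].x += dx;        // strided access, wastes cache bandwidth
 }
 
 void shift_x(ParticlesSoA& p, float dx) {
     for (std::size_t i = 0; i < p.x.size(); ++i)
         p.x[i] += dx;        // contiguous access, vectorises easily
 }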

Use of const after class function declaration
Constness is generally not intended for performance; instead, it is a safeguard to prevent inadvertently modifying the state of an object.
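For example (a minimal illustration of the point):

 class Counter {
     int value = 0;
 public:
     // 'const' documents that get() does not modify the object and lets
     // it be called through const references; by itself it does not
     // make the call any faster.
     int get() const { return value; }
     void increment() { ++value; }
 };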

The int and unsigned int types are, by definition, the most efficient ones on any platform.
Sorry, but that is wrong for small devices. An int is at least 16 bits wide. On an 8-bit CPU (microcontroller) an int is therefore still 16 bits wide and needs more clock cycles than an 8-bit operation. For example, take an ATmega (AVR). It has an 8-bit data path and offers only the addition instructions ADD (adds two 8-bit registers, 1 cycle), ADC (adds two 8-bit registers plus the carry flag, 1 cycle) and ADIW (adds an immediate to a 16-bit register pair, 2 cycles). So there is no single-cycle operation for adding two 16-bit values, and this behaviour is typical of most 8-bit processors.
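A sketch of the kind of loop where an 8-bit type pays off on such a target (assuming a C++ toolchain with <cstdint>, e.g. avr-g++; names are illustrative):

 #include <cstdint>
 
 // On an 8-bit AVR, a uint8_t accumulator and counter each fit in a
 // single register and each addition is a single ADD instruction,
 // whereas an int (16 bits) needs register pairs and extra instructions.
 std::uint8_t sum_bytes(const std::uint8_t* data, std::uint8_t n) {
     std::uint8_t sum = 0;
     for (std::uint8_t i = 0; i < n; ++i)
         sum += data[i];      // 8-bit addition, cheap on an 8-bit core
     return sum;
 }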


 * Don't forget to sign your comment.
 * I agree that the book is overstating the case here. I'll have a quick go at qualifying the sentence, but I'm happy for someone else to refine it further. --Fishpi (discuss • contribs) 07:47, 5 April 2014 (UTC)