Optimizing Code for Speed/Factor Optimizations

What are Factor Optimizations?
Factor optimizations are the opposite of order-of-complexity optimizations: they don't change the overall complexity of the algorithm, but rather make it run faster by a certain factor. As an extreme (but not so unrealistic) example, I may reduce the running time of my program from 5 seconds times N^2 (where N is the length of the input) to 1 second times N^2. The program becomes five times faster, but its running time still grows quadratically, so the optimization does not address potential scalability problems.
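To make the distinction concrete, here is a small Python sketch of the two running-time formulas from the example above. The function name `run_time` and the sample input sizes are illustrative assumptions, not measurements of a real program.

```python
def run_time(seconds_per_unit, n):
    """Model a quadratic algorithm: a constant factor times N^2."""
    return seconds_per_unit * n ** 2

for n in (10, 20, 40):
    before = run_time(5, n)  # 5 seconds times N^2
    after = run_time(1, n)   # 1 second times N^2
    # The ratio is always 5, yet both versions quadruple when N doubles.
    print(n, before, after, before / after)
```

The ratio between the two columns never changes, which is exactly what makes this a factor optimization rather than a complexity one.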

The Motivation for Factor Optimizations
Are factor optimizations worth your time? Some people don't seem to think so. For example, Eric Raymond has this to say in "The Art of Unix Programming":

One is the exponential effect of Moore's Law — the smartest, cheapest, and often fastest way to collect performance gains is to wait a few months for your target hardware to become more capable. Given the cost ratio between hardware and programmer time, there are almost always better things to do with your time than to optimize a working system.

We can get mathematically specific about this. It is almost never worth doing optimizations that reduce resource use by merely a constant factor; it's smarter to concentrate effort on cases in which you can reduce average-case running time or space use from O(n^2) to O(n) or O(n log n),[112] or similarly reduce from a higher order. Linear performance gains tend to be rapidly swamped by Moore's Law.[113]

Randall Hyde presents an opposing view in his OnLAMP.com feature, "Why Learning Assembly Language is Still a Good Idea":

"Most of the time you can achieve very good performance boosts by simply improving the implementation of an existing algorithm. A computer scientist may argue that a constant improvement in performance isn't as good as, say, going from an algorithm with O(n^2) performance to one with O(n lg n) performance, but the truth is that most of the time a constant factor of two or three times improvement, applied throughout a piece of software, can make the difference between a practical application and one that is simply too slow to comfortably use. And it is exactly this type of optimization with which most modern programmers have little experience."

So which viewpoint is correct? I tend to think that Raymond is wrong, while Hyde is correct. The constant factors in your program's running time are still important, and can make a world of difference. Some factor optimizations may yield huge benefits and will make you and your users happier.

Raymond's argument also overlooks the fact that one cannot expect end users to upgrade their machines. Furthermore, it seems that we've hit the end of the steady increase in speed of semiconductor-based CPUs, and that we cannot expect serial code to become faster with new hardware. Highly parallelized code may become faster, but parallelization is not always possible, or desirable.

Another illustrative story about why it's not a good idea to depend on Moore's law to speed up your code was posted by Gilboa Davara to the Linux-IL mailing list. Quoting from there while editing for coherency:

A couple of years ago I worked for a medical software development company. I was working on the database development side. (We had our own proprietary object oriented database)

Our database was pretty cool; it could handle a hospital-level load on a dual Pentium Pro machine. (Which was a far cry from most big iron machines that were used back then.)

Our medical software side used PowerBuilder (and later VB) to develop the medical applications. To put it mildly, the medical application itself was, by far, slower and heavier than the medical database that it was built upon. While 50 clients could run easily on a Pentium I 90 MHz with 32MB of RAM, the medical application ran very slowly on a Pentium I 166 MHz machine with 64MB of RAM!

And every time we pointed out this anomaly to the med team, they claimed that "new machines are bound, new CPUs; by the time we are out, CPU power won't be an issue."

You know what, that med software now runs slower than a dead dog on a top-level Pentium 3 / Pentium 4 / Athlon machine… nothing has changed.

Are "Small" Optimizations Desirable?
Should we invest time working on optimizations that only save us 5 seconds (or fewer) out of the total run? Is a 5% increase in speed desirable? Some people are likely to say "no" to these questions, but the answer is not that straightforward. For example, Bill Raymond has written a solver for FreeCell that can reportedly solve 10,000,000 deals per hour on a 733 MHz computer, and he testified that:

I achieved my fast times by multitudes of 1% reductions.

From my experience working on my own solver (which is open-source and has a public version control repository), I was able to gradually reduce the run-time of a certain benchmark from 224 seconds down to 94 seconds (a speed increase of 138%) just by shaving off a few seconds at a time and doing many different optimizations. In a blog post, I further explain how I quadrupled the performance of the "File-Find-Object" Perl 5 module by applying many small optimizations.

As a result, one should not dismiss even "small" optimizations as negligible and not worth pursuing. It takes just four 20% speed increases to double the speed of your program, and while the yield of each small optimization is not very dramatic, such increases add up to a lot pretty quickly. On the other hand, if you don't apply any small optimizations to your code, and instead wait for the pot of gold at the end of the rainbow, then your code is likely to stay only as fast (or as slow) as it is at present.
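The arithmetic behind that claim can be checked with a short Python sketch. The helper name `compound_speedup` is an illustrative assumption, and the model assumes that every step multiplies the speed by the same relative gain.

```python
def compound_speedup(gain_per_step, steps):
    """Overall speed multiplier after several equal relative speed increases."""
    return (1 + gain_per_step) ** steps

print(compound_speedup(0.20, 4))   # about 2.07: four 20% increases double the speed
print(compound_speedup(0.01, 70))  # about 2.01: seventy 1% increases do as well
```

Because the gains multiply rather than add, even a long series of 1% improvements ends up doubling the speed.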

Managing Pointers to Structs Instead of the Structs Themselves
If we have a collection of many C/C++ structs (structures that contain a number of adjacent elements of possibly different data types), then swapping two such structs (say, for re-ordering or sorting) will require a lot of memory access. On the other hand, if we manage pointers to such structures, which stay at permanent addresses, then swapping two 32-bit or 64-bit pointers will be relatively cheap.

The first ANSI C release (0.2.0) of Freecell Solver allocated a direct array of large C structs, and then sorted them and binary searched them. In the 0.4.0 release, an array of pointers to individually malloc-ed structs was implemented instead, which resulted in a huge speed boost. This taught me a valuable lesson on how to make smart use of pointers as an optimization tool.

Reducing Memory Consumption
The more memory an application or its individual elements consume, the more cache misses it has, the more page swapping it requires, and the more likely it is that a piece of data will span more than one cache line. As a result, reducing the memory footprint of a program can often lead to speed benefits.

As documented in a post to Linux-IL, when Freecell Solver was converted from representing cards as 32-bit values to representing them as 8-bit octets, it became much faster. This is likely due to less swapping, fewer cache misses, and the fact that more cards could fit in the Pentium's 32-byte cache line.
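As a rough illustration of the space saving, the following Python sketch stores a deck of cards first in an array of unsigned ints and then in an array of single bytes. The bit layout in `pack_card` is a hypothetical encoding chosen for this example, not the one Freecell Solver uses.

```python
from array import array

def pack_card(rank, suit):
    """Pack a rank (1-13) and a suit (0-3) into one byte (hypothetical layout)."""
    return (rank << 2) | suit

def unpack_card(byte):
    """Recover the (rank, suit) pair from a packed byte."""
    return byte >> 2, byte & 0b11

deck = [(rank, suit) for suit in range(4) for rank in range(1, 14)]

# "I" is an unsigned int (typically 4 bytes); "B" is an unsigned byte.
wide = array("I", (pack_card(r, s) for r, s in deck))
narrow = array("B", (pack_card(r, s) for r, s in deck))

wide_bytes = wide.itemsize * len(wide)
narrow_bytes = narrow.itemsize * len(narrow)
print(wide_bytes, narrow_bytes)
```

With one byte per card, a whole 52-card deck fits in 52 bytes, so more cards share each cache line.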

The "Cost of Inline Functions" LWN.net story is also illustrative. When one function in the Linux kernel was un-inlined, it made the kernel somewhat faster. The reason was that all of the inlined instances of it occupied a (relatively) large amount of memory per-instance, and as a result, the kernel was larger, and there were more cache misses.

Note about the Memory-Speed Tradeoff
After reading what was written here, you may think this contradicts the common Memory-Speed Trade-off "truism". The Memory/Speed Trade-off has its origins in theoretical computer science, where it is shown that for certain tasks, one can reduce the run-time's asymptotic complexity by increasing the asymptotic amount of memory used (and vice versa). For example, we can sometimes reduce the asymptotic run-time from O(N^2) to O(N) by increasing the memory from O(1) to O(N).
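One common instance of this asymptotic trade-off is checking a list for duplicates, sketched below in Python. The function names are illustrative, and the linear version assumes that the items are hashable.

```python
def has_duplicates_quadratic(items):
    """O(N^2) time, O(1) extra memory: compare every pair of items."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicates_linear(items):
    """O(N) expected time, O(N) extra memory: remember what was already seen."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False

print(has_duplicates_quadratic([3, 1, 4, 1, 5]), has_duplicates_linear([3, 1, 4, 1, 5]))
```

Both functions return the same answers; the second one pays for its speed with a set that may grow as large as the input.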

This is all nice and true, but it doesn't contradict the fact that, given the architecture of contemporary computer hardware and operating systems, the less memory a program uses (while the logic of the algorithm remains the same), the faster it will probably run. It's not an asymptotic trade-off, but rather a win-win situation.

Parallelization
By parallelizing a task, one can split it into several lines of execution (processes, threads, tasks on different computers in a cluster, etc.) that run in parallel. As a result, the complete task will hopefully finish faster. A good example of parallelization is performing the same time-consuming operation on a lot of inputs. If we assign different subsets of the inputs to different processes, then they will likely finish the work faster.
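A minimal Python sketch of this pattern follows. It uses threads, and `slow_operation` is a made-up stand-in that merely sleeps; in CPython, threads mainly help with work that waits (I/O, network), while CPU-bound work would typically use processes instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_operation(x):
    """Stand-in for a time-consuming operation (here it only sleeps)."""
    time.sleep(0.05)
    return x * x

def run_serial(inputs):
    """Process the inputs one after another."""
    return [slow_operation(x) for x in inputs]

def run_parallel(inputs, workers=4):
    """Spread the inputs over a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands inputs to idle workers and returns results in input order.
        return list(pool.map(slow_operation, inputs))

print(run_parallel(range(8)))
```

With four workers, the eight sleeps overlap, so the parallel version should take roughly a quarter of the serial version's wall-clock time on this artificial workload.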

Lately, parallelization has become especially attractive for making code run faster due to the spread of multi-core CPUs. However, it should be noted that parallelization is still limited by factors such as locking, serialization and de-serialization of inter-process data, context switching, CPU and operating-system constraints, and the fact that the number of tasks will often exceed the number of available processors. Most importantly, Amdahl's law states that the serial part of the task (by definition) cannot be parallelized, and so limits the gain from parallelization.
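Amdahl's law can be written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of workers. The Python sketch below evaluates it; the function name and the 90% figure are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on the speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

# With 90% of the work parallelizable, the speedup never reaches 10.
for workers in (2, 8, 64, 1024):
    print(workers, round(amdahl_speedup(0.9, workers), 2))
```

However many workers are added, the 10% serial part caps the overall gain at a factor of ten.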

Putting the Most Important struct Members First
In the Linux kernel, struct members are ordered so that the most important members fit within the Pentium architecture's 32-byte cache line. This way, access to the struct in general is sped up, because its most frequently used members stay in the same cache line most of the time and can be accessed more quickly.
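The effect of member order on layout can be inspected from Python with `ctypes`, which lays structures out the way the platform's C compiler does. The two structs below are hypothetical (they are not taken from the kernel), and the sketch assumes that each struct starts on a cache-line boundary.

```python
import ctypes

CACHE_LINE = 32  # bytes; the cache line size the text refers to

class HotFirst(ctypes.Structure):
    # Frequently used members first, rarely used bulk data last.
    _fields_ = [
        ("state", ctypes.c_uint32),
        ("flags", ctypes.c_uint32),
        ("next_item", ctypes.c_void_p),
        ("name", ctypes.c_char * 64),
    ]

class ColdFirst(ctypes.Structure):
    # Same members, but the bulky name pushes the hot ones far away.
    _fields_ = [
        ("name", ctypes.c_char * 64),
        ("state", ctypes.c_uint32),
        ("flags", ctypes.c_uint32),
        ("next_item", ctypes.c_void_p),
    ]

def hot_members_end(struct):
    """Offset just past the last of the three frequently used members."""
    return max(getattr(struct, name).offset + getattr(struct, name).size
               for name in ("state", "flags", "next_item"))

print(hot_members_end(HotFirst), hot_members_end(ColdFirst))
```

Both structs have the same total size, but only in `HotFirst` do all three frequently used members end within the first 32 bytes.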

Copy-on-write
Another useful optimization is known as copy-on-write. A good example of this is the implementation of a virtual machine for a programming language, in which we assign one variable to another. While we could duplicate the contents of the variable each time, it is cheaper to just increase a reference count on the shared contents, and to copy them only when one of the variables is about to change.

If the contents being copied are large, then copy-on-write can yield significant savings in time.
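The following Python sketch shows one possible way to implement the idea with an explicit reference count. The class names `CowList` and `_Shared` are made up for this illustration, and a real virtual machine would handle many more operations.

```python
class _Shared:
    """A buffer together with the number of CowList objects that share it."""
    def __init__(self, items):
        self.items = items
        self.refcount = 1

class CowList:
    """A list-like value whose copies share storage until one of them is written."""
    def __init__(self, items=()):
        self._shared = _Shared(list(items))

    def copy(self):
        # "Assignment": share the buffer and bump the reference count.
        clone = CowList.__new__(CowList)
        clone._shared = self._shared
        self._shared.refcount += 1
        return clone

    def __getitem__(self, index):
        return self._shared.items[index]

    def __setitem__(self, index, value):
        if self._shared.refcount > 1:
            # Someone else still uses this buffer: detach a private copy first.
            self._shared.refcount -= 1
            self._shared = _Shared(list(self._shared.items))
        self._shared.items[index] = value

a = CowList([1, 2, 3])
b = a.copy()   # cheap: nothing is duplicated yet
b[0] = 99      # the real copy happens here
print(a[0], b[0])
```

Reads never trigger a copy; only the first write to a shared value pays for the duplication.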

Caching
If we perform many costly queries, then caching commonly-issued queries along with the results we got in memory is likely to increase the overall performance of the program. Caching is implemented in all kinds of software - from operating system kernels that hold recently-accessed file-system entries in cache, to database servers that keep caches of the various stages of the queries given to them.
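As a small illustration, Python's standard `functools.lru_cache` decorator keeps the results of recent calls in memory and evicts the least recently used ones when the cache is full. The `lookup` function is a made-up stand-in for a costly query whose result depends only on its argument.

```python
from functools import lru_cache

query_count = 0  # how many times the costly query really ran

@lru_cache(maxsize=128)
def lookup(key):
    """Stand-in for a costly query; the result depends only on `key`."""
    global query_count
    query_count += 1
    return key.upper()

for key in ["alice", "bob", "alice", "alice", "bob"]:
    lookup(key)

print(query_count)          # 2: only the first request for each key ran the query
print(lookup.cache_info())  # reports 3 hits and 2 misses
```

Five requests resulted in only two real queries; the other three were answered from the cache.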

There's a variation on caching called memoization, in which we never discard our results. It can be demonstrated that by memoizing the naïve (tree-recursive) Fibonacci-number calculation, one can actually reduce its complexity from an exponential one to a linear one.
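The Fibonacci claim can be checked with a short Python sketch that counts how many times each version does real work. The one-element list used as a counter and the module-level `_memo` dictionary are conveniences of this example.

```python
def fib_naive(n, counter):
    """Tree-recursive Fibonacci; counter[0] counts every call."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)

_memo = {}  # results are kept forever, never evicted

def fib_memo(n, counter):
    """Memoized Fibonacci; counter[0] counts only real computations."""
    if n in _memo:
        return _memo[n]
    counter[0] += 1
    result = n if n < 2 else fib_memo(n - 1, counter) + fib_memo(n - 2, counter)
    _memo[n] = result
    return result

naive_counter, memo_counter = [0], [0]
naive_value = fib_naive(20, naive_counter)
memo_value = fib_memo(20, memo_counter)
print(naive_value, naive_counter[0])  # 6765 21891
print(memo_value, memo_counter[0])    # 6765 21
```

The naïve version makes 21,891 calls to compute the 20th number, while the memoized one computes each of the 21 values from 0 to 20 exactly once.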

One should note that caching or memoization is not possible in many cases, for example when the queries have side effects of most kinds.

Avoiding Copying Altogether
A post to Hackers-IL made the case for avoiding unnecessary copying of objects in order to increase performance. Calling copy constructors excessively can have a negative impact on performance, and reducing the number of such calls to a minimum can improve it. Even just copying a large contiguous area of memory several times may hurt performance, and eliminating the copying may turn out to be beneficial.
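In Python, one way to avoid copying large contiguous buffers is to slice a `memoryview` instead of the underlying `bytes` object. The checksum functions below are a made-up workload used only to contrast the two approaches.

```python
def checksum_with_copies(data, chunk_size):
    """Slicing bytes creates a new bytes object for every chunk."""
    total = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]  # copies the chunk
        total += sum(chunk)
    return total

def checksum_without_copies(data, chunk_size):
    """Slicing a memoryview only creates a small view onto the same buffer."""
    view = memoryview(data)
    total = 0
    for start in range(0, len(data), chunk_size):
        chunk = view[start:start + chunk_size]  # no bytes are copied
        total += sum(chunk)
    return total

payload = bytes(range(256)) * 4
print(checksum_with_copies(payload, 64), checksum_without_copies(payload, 64))
```

Both functions return the same total; the second one never duplicates the payload. Whether the difference is measurable depends on the buffer sizes involved, so it is worth benchmarking.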

Inline Functions
Despite what was said earlier in the context of reducing memory consumption, inlining functions in languages such as C or C++ can often have a positive effect on performance. Function calls are relatively costly operations with noticeable overhead, and avoiding the call by inserting the function's code in place wherever it is used can increase performance. Inlining becomes more beneficial the shorter the function is, and especially so if the inlined code occupies less memory than the call sequence it replaces.

If you are unsure whether inlining a certain function has a positive or negative effect, then benchmark and see.

The Schwartzian Transform
The Schwartzian transform is a way to optimize certain kinds of comparison-based sort operations. If the keys that we want to compare the inputs by take a lot of time to calculate (for example, if they require access to the hard disk), then we can first prepare an equivalent array of pairs of the inputs and their keys, then sort the pairs based on the keys, and finally extract the original inputs from the sorted pairs.

It should be noted that the Schwartzian transform actually reduces the order of complexity of the number of times the keys are calculated from O(N*log(N)) to O(N). However, the overall order of complexity of the sort remains O(N*log(N)).
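The transform is best known as a Perl idiom, but the same decorate-sort-undecorate steps can be sketched in Python. Here `expensive_key` is a made-up stand-in for a costly key computation, and `key_calls` counts how often it runs.

```python
from functools import cmp_to_key

key_calls = 0

def expensive_key(word):
    """Stand-in for a costly key computation (for example, a disk access)."""
    global key_calls
    key_calls += 1
    return len(word)

def naive_sort(items):
    # The key is recomputed for both operands of every comparison.
    def compare(a, b):
        ka, kb = expensive_key(a), expensive_key(b)
        return (ka > kb) - (ka < kb)
    return sorted(items, key=cmp_to_key(compare))

def schwartzian_sort(items):
    pairs = [(expensive_key(item), item) for item in items]  # decorate
    pairs.sort(key=lambda pair: pair[0])                      # sort by the stored key
    return [item for _, item in pairs]                        # undecorate

words = ["pear", "fig", "banana", "kiwi", "apple", "plum", "date", "cherry"]

key_calls = 0
naive_result = naive_sort(words)
naive_calls = key_calls

key_calls = 0
schwartzian_result = schwartzian_sort(words)
schwartzian_calls = key_calls

print(naive_calls, schwartzian_calls)
```

Both sorts return the same order, but the second one computes each key exactly once. Python's own `sorted(items, key=...)` already computes every key only once, so in Python the explicit transform is rarely needed.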

Call for a Catalog of Optimizations
I believe it would be a good idea to concentrate all the known optimizations in one place, in a similar fashion to the Catalog of Refactorings and the Portland Pattern Repository. This would be a list that people can read for completeness, and to realize where their code can be improved, and for general inspiration, and to facilitate communication.

I'm not aware of a similar effort as of the time of this writing, and it would be useful. This section only covered a very small subset of the possible optimization strategies, just to give a taste of them, and did not aim to be comprehensive.