Encyclopedia > Performance tuning

  Article Content

Optimization (computer science)

Redirected from Performance tuning

In computing, optimization modifies a system to improve its efficiency. This either reduces the resources the system consumes or increases the performance.

The system can be a single computer program, a collection of computers, or even an entire network such as Internet. The optmization might be to reduce the maximum execution time, memory use, bandwidth, or some other resource. Those objectives can be mutually exclusive, and require a tradeoff[?].

In operations research, optimization is the problem of determining the inputs of a function that minimize or maximize its value. Sometimes constraints are imposed on the values that the inputs can take; this problem is known as constrained optimization.

In computer programming, optimization usually specifically means to modify code and its compilation settings on a given computer architecture to produce more efficient software.

Usually, the term "optimization" presumes that the system retains the same functionality. However, often a crucial optimization is to solve only the actual problem, removing useless code.

Although the word "optimization" shares the same root as "optimal," economical optimizations can rarely find a truly optimal system.

Typical problems have such a large number of possibilities that a programming organization can only afford a "good enough" solution.

Some authorities define these terms:

  • code optimization or code improvement -- is used in some academic environments to mean the most effective and efficient possible code. These authorities prefer the term "code improvement" to avoid confusion with performance tuning.
  • performance tuning is said to include both "code improvement" and the scalability now often needed with network-based computing and large-scale software projects.

Optimization can be automated by compilers or performed by programmers. Gains are usually limited for local optimization, and larger for global optimizations. Perhaps the most powerful optimization is to find a superior algorithm.

Table of contents

When do I want to do optimization? It is well told that optimization often undermines readibility and add code that is uded only to improve the performance. This may complicate programs or systems, making hard to maintain and debug. Compiler optimization, for example, may introduce weird behavior because of compiler bugs. Because of that, it is preferable that optimization or performance tuning is done in the end of development stage. In other words, often systems or programs perform poorly in middle of developing.

Systemic Optimization The architectural design of a system can overwhelmingly affect its performance.

Load balancing spreads the load over a large number of servers. Often load balancing is done transparently (i.e., without users noticing it), using a so-called layer 4 router[?].

Caching stores intermediate products of computation to avoid duplicate computations.

Optimizing a whole system is usually done by human beings because the system is too complex for automated optimizers. Grid computing or distributed computing aims to optimize the whole system, by moving tasks from computers with high usage to computers with idle time.

Algorithms and data structures The choice of algorithm affects efficiency more than any other item of the design. Usually, more complex algorithms and data structures perform well with many items while simple algorithms are more suitable to small amounts of data. For small sets of data, the set-up and initialization time of the more complex algorithm can outweigh the benefit of the better algorithm.

Usually, the more memory the program uses, the faster program runs. Take a filtering program. The common practice in such a program is read each line and filter and output that line in the same time. The memory is only needed for one line, but typically the performance is poor. To improve the performance, read the entire file then output the filtered result. This typically improve the peformance dramatically but causes heaviy memory use. Caching the result is also effective though requiring some or huge memory use.

Manual optimization In this technique, programmers or system administrators explicitly change code so that the system performs better. Although it can produce better efficiencies, it is far more expensive than automated optimizations.

Code optimization usually starts with a rethinking of the algorithm used in the program: more often than not, a particular algorithm can be specifically tailored to a particular problem, yelding better performance than a generic algorithm. For example, the task of sorting a huge list of items is usually done with a quicksort routine, which is one of the most efficient generic algorithms. But if some characteristic of the items is exploitable (for example, they are already arranged in some particular order), a different method can be used, or even a custom-made sort routine.

After one is reasonably sure that the best algorithm is selected, code optimization can start: loops can be unrolled (for maximum efficiency of a processor cache memory), data types as small as possible can be used, an integer arithmetic can be used instead of a floating-point one, hash tables can replace linear vectors, and so on.

Performance bottlenecks can be due to the language rather than algorithms or datastructures used in the program. Sometimes, a critical part of the program can be re-written in a different, faster programming language. For example, it is common for very high-level languages like Python to have modules written in C, for a greater speed. Programs already written in C can have modules written in assembly. See subpages for each language-specific optimization:

Rewriting pays off because of a law known as the 90/10 law[?], which states that 90% of the time is spent in 10% of the code, and only 10% of the time in the remaining 90% of the code. So optimizing just a small part of the program can have a huge effect on the overall speed.

Manual optimization often has the side-effect of undermining readability. Thus code optimizations should be carefully documented and their effect on future development evaluated.

Automated Optimizations The program that does the automated optimization is called an optimizer. Most optimizers are embedded in compilers and operate during compilation. Optimizers often can tailor the generated code to specific processors.

Today, automated optimizations are almost exclusively limited to compiler optimization. Compiler optimization is used to improve the efficiency (in terms of running time or resource usage) of the code output by a compiler. These techniques allow programmers to write code in a straightforward manner, expressing their intentions clearly, while allowing the computer to make choices about implementation details that lead to efficient code. Contrary to what the term might imply, this rarely results in code that is perfectly "optimal" by any measure, only code that is much improved compared to direct translation of the programmer's original code.

Usually, the optimizer takes what has been produced by the compiler and replaces it with a better version. In other words, high-level redundancy in the source program (such as an inefficient algorithm) remain unchanged with optimizers.

In the early times of computer science, compiler optimizations were not as good as hand-written ones. As compiler technologies have improved, good compilers can often generate better code than human programmers. In the RISC CPU architecture, compiler optimization is the key for obtaining an optimal code, because the RISC instruction set is so compact that it is hard for a human to manually schedule or combine small instructions to get efficient results.

Techniques in optimization can be broken up along various dimensions:

local vs. global
Local techniques tend to be easier to implement, but result in lesser gains, while global techniques make the opposite tradeoff. Often, optimizations that act on the complete control flow graph are termed global, and those that work on a basic block alone are termed local.

programming language-independent vs. language-dependent
Most high-level languages share common programming constructs and abstractions--decision (if, switch, case), looping (for, while, repeat .. until, do .. while), encapsulation (structures, objects). Thus similar optimization techniques can be used across languages. However certain language features make some kinds of optimizations possible and/or difficult. For instance, the existence of pointers in C and [[C++ programming language|C++]] makes certain optmizations of array accesses difficult. Conversely, in some languages functions are not permitted to have "side effects". Therefore, if repeated calls to the same function with the same arguments are made, the compiler can immediately infer that results need only be computed once and the result referred to repeatedly.

machine independent vs. machine dependent
A lot of optimizations that operate on abstract programming concepts (loops, objects, structures) are independent of the machine targetted by the compiler. But many of the most effective optimizations are those that best exploit special features of the target platform.

These dimensions aren't completely orthogonal. It is often the case that machine dependent optimizations are local, for example.

An instance of a local machine dependent optimization: to set a register to 0, the obvious way is to use the constant 0 with the instruction that sets a register value to a constant. A less obvious way is to XOR a register with itself. It is up to the compiler to know which instruction variant to use. On many RISC machines, both instructions would be equally appropriate, since they would both be the same length and take the same time. On many other microprocessors such as the Intel x86 family, it turns out that the XOR variant is shorter and maybe faster (no need to decode an immediate operand nor use the internal "immediate operand register").

Factors affecting optimization

The machine itself

It is sometimes possible to parametrize some of these machine dependent factors. A single piece of compiler code can be used to optimize different machines just by altering the machine description parameters. See GCC for a compiler that exemplifies this approach.

The architecture of the target CPU

  • Number of CPU registers: To a certain extent, the more registers, easier it is to optimize for performance. Local variables can be allocated in the registers and not on the stack. Temporary/intermediate results can be left in registers without writing to and reading back from memory.
  • RISC vs. CISC: CISC instruction sets often have variable instruction lengths, often have a larger number of possible instructions that can be used, and each instruction could take differing amounts of time. RISC instruction sets attempt to limit the variability in each of these: instruction sets are usually constant length, with few exceptions, there are usually fewer combinations of registers and memory operations, and the instruction issue rate (the number of instructions completed per time period, usually an integer multiple of the clock cycle) is usually constant in cases where memory latency is not a factor. There may be several ways of carrying out a certain task, with CISC usually offering more alternatives than RISC. Compilers have to know the relative costs among the various instructions and choose the best instruction sequence.
  • Pipelines: A pipeline is essentially an ALU broken up into an assembly line. It allows use of parts of the ALU for different instructions by breaking up the execution of instructions into various stages: instruction decode, address decode, memory fetch, register fetch, compute, register store, etc. One instruction could be in the register store stage, while another could be in the register fetch stage. Pipeline conflicts occur when an instruction in one stage of the pipeline depends on the result of another instruction ahead of it in the pipeline but not yet completed. Pipeline conflicts can lead to pipeline stalls: where the CPU wastes cycles waiting for a conflict to resolve.
Compilers have to schedule instructions such that the pipelines don't stall, or stalls are reduced to a minimum.
  • number of functional units: Some CPUs have several ALUs and FPUs. This allows them to execute multiple instructions simultaneously. There may be restrictions on which instructions can pair with which other instructions ("pairing" is the simultaneous execution of two or more instructions), and which functional unit can execute which instruction. They also have issues similar to pipeline conflicts.
Here again, instructions have to be scheduled so that the the various functional units are fully fed with instructions to execute.

The architecture of the machine

  • Cache size (256KB, 1MB) and type (fully associative, 4-way associative): Techniques like inlining and loop unrolling may increase the size of the generated code and reduce code locality. The program may slow down drastically if an oft-run piece of code (like inner loops in various algorithms) suddenly cannot fit in the cache. Also non-fully associative caches have higher chances of cache collisions even in an unfilled cache.
  • Cache/Memory transfer rates: These give the compiler an indication of the penalty for cache misses. This is used only in specialized applications.

Intended use of the generated code

Debugging
A main factor is speed of compilation. Also, debugging code is usually stepped through in a symbolic debugger[?] -- so it is useful not to apply transformations that make it difficult to identify the source code line numbers from which the code being stepped through was generated from.

General purpose use
Prepackaged software is very often expected to be executed on a variety of machines and CPUs that may share the same instruction set, but have different timing, cache or memory characteristics. So, the code may not be tuned to any particular CPU, or may be tuned to work well on the most popular CPU and work reasonably on other CPUs.

Special purpose use
If the software is compiled to be used on one or a few very similar machines, with known characteristics, then the compiler can heavily tune the generated code to those machines alone.

Optimization techniques

Common Themes

To a large extent, optimization techniques have the following themes, which sometime conflict

Less code
There is less work for the CPU, cache, and memory. So, likely to be faster.

Straight line code, fewer jumps
Less complicated code. Jumps interfere with the prefetching of instructions, thus slowing down code.

Code locality
Pieces of code executed close together in time should be placed close together in memory, which increases spatial locality of reference.

Extract more information from code
The more information the compiler has, the better it can optimize.

Techniques

Some optimization techniques are:

loop unrolling
Attempts to reduce the overhead inherent in testing a loop's condition by replicating the code in the loop body. Completely unrolling a loop, of course, requires that the number of iterations be known at compile time.

loop combining
Another technique which attempts to reduce loop overhead. When two adjacent loops would iterate the same number of times (whether or not that number is known at compile time), their bodies can be combined as long as they make no reference to each other's data.

loop interchange
(swapping inner and outer loops)

common subexpression elimination
In the expression "(a+b)-(a+b)/4", "common subexpression" refers to the duplicated "(a+b)". Compilers implementing this technique realize that "(a+b)" won't change, and as such, only calculate its value once.

test reordering
if we have two tests that are the condition for something, we can first deal with the simpler tests (e.g. comparing a variable to something) and only then with the complex tests (e.g. those that require a function call). This technique complements lazy evaluation, but can be used only when the tests are not dependent on one another.

constant folding and propagation
replacing expressions consisting of constants (e.g. "3 + 5") which their final value ("8") rather than doing the calculation in run-time. Used in most modern languages.

inlining of procedures
when some codes invokes a procedure, it is possible to put the body of the procedure inside this code rather than invoking it from another location. This saves the overhead related to procedure calls, but comes at the cost of duplicating the function body each time it's called inline. Generally, inlining is useful in performance-critical code that uses makes a lot of small procedure calls. If any parameters to the procedure are constants known at compile time, inlining may result in more constants to propagate.

instruction scheduling

unreachable code elimination
Dead code refers to code segments which can never be executed. Dead code elimination prevents the compiler from emitting code for such segments, saving on CPU instruction cache[?].

dead code elimination
Remove instructions that will not affect the behaviour of the program. For example in "int myFunc(int a){int b,c; b=a*2; c=a*3; return b;}" the statement "c=a*3" doesn't do anything useful and can be removed.

using CPU instructions with complex offsets to do math
on many processors in the 68000 family, for example, "lea a0,25(a1,d5*4)" assigns to the a0 register 25 + the contents of a1 + 4 * the contents of d5 in a single instruction and without an explicit move or overwriting a1 or d5

code-block reordering
reduce conditional branches and improve locality of reference

factoring out of invariants
if an expression is carried out both when a condition is met and otherwise, it can be written just once outside of the conditional statement. Similarly, if certain types of expressions (e.g. the assignment of a constant into a variable) appear inside a loop, they can be moved out of it because their effect will be the same no matter if they're executed many times or just once. Also known as total redundancy elimination. A more powerful optimization is partial redundancy elimination (PRE).

removing recursion
recursion is often expensive as it involves the function call overhead, and inconvenient as it can rapidly deplete the stack. Tail recursive algorithms can be converted to iteration, which does not have call overhead and uses a constant amount of the stack.

strength reduction
a general term encompassing optimisations that replace complex or difficult or expensive operations with simpler ones. A classic example is array indexing in C. Say I have the following code:
/* set all elements of a to x */
for(i=0; i<n; ++i) {
  a[i] = x;
}
the semantics of C say that array indexing is equivalent to pointer arithmetic and dereferencing. So this is equivalent to:
for(i=0; i<n; ++i) {
  *(a+i) = x;
}
In C the expression (a+i) is likely to involve evaluating i, multiplying by the size (in bytes) of the elements of the array and then adding that to the array address. Let's say the elements of the array are 4 bytes long (which would be typical of int on a 32-bit architecture). So in something not quite like C:
for(i=0); i<n; ++i) {
  t1 = address(a)
  t2 = 4 * i
  t3 = t1 + t2
  *t3 = x;
}
The computation of t2 involves a multiplication, a strength reduction optimisation can reduce this to an addition. Notice that the multiplication is done each time we go round the loop. The optimisation relies on the fact that when we know 4*i it is easy to compute 4*(i+1) which we need next time we go round the loop. We just add 4 to the value that we already computed for 4*i. So the new, strength reduced, code looks something like this:
t2 = 0
for(i=0; i<n; ++i) {
  t1 = address(a)
  t3 = t1 + t2
  *t3 = x
  t2 = t2 + 4
}
(as it happens the computation of t1 is invariant and can be removed from the loop by "factoring out of invariants" mentioned above).

Strength reduction can apply to many things such as: turning divides into multiplies (for example, when dividing by a constant; it may be possible to replace this with a multiplication by the reciprocal of the constant), turning multiplies into shifts and adds, etc.

reduction of cache collisions
(e.g. by disrupting alignment within a page)

register optimization
The most frequently used variables should be kept in processor registers for fastest access. To find which variables to put in registers a interference-graph is created. Each variable is a vertex and when two variables are used at the same time (have an intersecting liverange) they have an edge between them. This graph is colored using for example Chaitin's algorithm using the same number of colors as there are registers. If the coloring fails one variable is "spilled" to memory and the coloring is retried.

Stack height reduction
Rearrange expression tree to minimize resources needed for expression evaluation.

Branch offset optimization (machine independent)
Choose the shortest branch displacement that reaches target

See also: control flow graph, SSA form, queueing theory, simulation.

Subpages

References



All Wikipedia text is available under the terms of the GNU Free Documentation License

 
  Search Encyclopedia

Search over one million articles, find something about almost anything!
 
 
  
  Featured Article
Sakhalin

... which has a large Korean minority who immigrated to the island by their free will or were brought for compulsory service during World War II, to work in the coal ...

 
 
 
This page was created in 29.3 ms