Shortcut to seniority
If you have a performance issue, or you want to optimize your code, keep a few things in mind first.
First of all, never assume anything: everything should be benchmarked (measured) first.
Secondly, do not measure the debug build, but the release one - that's the one that matters for the user.
Last but not least, the optimization step is one of the last ones within a project - plan for it, but never optimize too soon.
A bottleneck is an inefficiency where the flow of work stops or slows at a particular point or, in simple terms, it's the first component that starts failing or causing slowness within the system. Finding the bottleneck is the goal of a performance tester.
A CPU cache is a hardware cache used by the CPU to reduce the cost (time/energy) of accessing data from main memory. The cache is a small and fast memory, close to the processor core, which stores copies of data from frequently accessed memory locations. Modern CPUs have multiple cache levels (L1, L2, etc.) which increase in size and decrease in speed, but even the slowest level is still very fast compared to the time required to fetch data from main memory.
A cache hit/miss refers to a hit/miss in the highest level of cache in the CPU. On a cache hit, the data is served from the cache; on a cache miss, the data needs to be loaded from RAM / main memory, and the amount read from that memory is equal to the size of the memory bus width. Cache hits are crucial for performance, and developers should strive to build their applications in a way that benefits from the cache - in other words, so that the data they use is already stored in the cache.
The principle of locality refers to keeping related data close together, so that caching can occur more often.
The prefetcher helps optimize performance by reading the traffic that comes into the CPU and the cache, recognizing access patterns, and fetching blocks of memory before they are requested.
False sharing occurs when two CPUs do not logically share any resource but end up sharing one anyway because of the underlying hardware architecture. When two threads each write to a different memory location, they do not need any synchronization, because the data is not dependent on one another. However, if the two locations are mapped to the same cache line, writing to one variable will invalidate the cache line for the other, forcing constant reloads.
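A minimal sketch of how to avoid it, assuming 64-byte cache lines (the structure and counter names are made up for illustration):

    #include <atomic>
    #include <thread>

    // Each counter is forced onto its own cache line via alignas(64),
    // so the two writer threads no longer invalidate each other's line.
    struct Counters {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    int main() {
        Counters c;
        std::thread t1([&] { for (int i = 0; i < 1000000; ++i) ++c.a; });
        std::thread t2([&] { for (int i = 0; i < 1000000; ++i) ++c.b; });
        t1.join();
        t2.join();
    }

Without the alignas, both counters would likely land on the same cache line, and the two threads would keep stealing it from each other.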
Because static libraries are linked fully into the application, they allow many optimizations that are not available for dynamically linked libraries, such as whole program optimization, local jumps, or inlining.
If memory is accessed in a regular pattern that the system can recognize, the prefetcher can do its job and fetch blocks of memory before they are used, so you will not wait for the memory to be read into cache - you can use it right away!
Jumping around in memory does not benefit from caching, and dependencies between data do not allow the hardware to run instructions in parallel.
You might want to create your own implementation of containers, but keep in mind that there probably is one already implemented and used intensively by other developers of that programming language. Those libraries are already written, debugged, and tested - meaning they are less buggy, safer, and probably faster than a new implementation.
Consider writing a new implementation only if you need something specific that is not covered by the standard library and not found elsewhere. Standard libraries are a double-edged sword: although the implementations are fast and usually bug-free, they can still be inefficient, because they pack in as much functionality as possible to please as many people as possible.
In languages that have both references and pointers, the developer needs to decide which one to use in their code. Usually it depends, but there are a couple of differences between them, such as:
Passing objects by value requires the entire object to be copied through the copy constructor, whereas passing by reference does not involve any copy operations.
Having a virtual function in a class requires every instance of that class and of its derived classes to contain an extra pointer. There's also a performance hit when you call a virtual function: there's an additional level of indirection, since the virtual pointer has to access the virtual table in order to get the memory address of the real function.
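A small sketch of the size overhead (the exact numbers are implementation-dependent; an 8-byte virtual pointer is assumed, as on typical 64-bit platforms):

    #include <iostream>

    struct Plain   { int x; };                      // no hidden pointer
    struct Virtual { int x; virtual void f() {} };  // gains a hidden vptr

    int main() {
        // Typically prints 4 and 16 (8-byte vptr plus alignment padding).
        std::cout << sizeof(Plain) << ' ' << sizeof(Virtual) << '\n';
    }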
Declare objects as constants (read-only) if you’re aware that the data is not supposed to change.
Data written on the stack has zero addressing cost and is faster, since it is usually stored in cache or registers.
Accessing a global is as slow as a function call, because the compiler must assume it may have been seen by other threads, so it has to reload the data from memory. Globals also hurt readability and increase the number of bugs, because the developer now has one more piece of data to reason about: globals can be changed from any thread and can cause data races or other problems that appear in multithreaded environments.
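A hedged sketch of the usual workaround (g_count and process are made-up names): accumulate in a local variable, which the compiler is free to keep in a register, and touch the global only once:

    #include <vector>

    int g_count = 0;  // global: the compiler must assume other code may change it

    void process(const std::vector<int>& values) {
        int local = 0;             // local accumulator, register-friendly
        for (int v : values) {
            if (v > 0) ++local;
        }
        g_count += local;          // one global access instead of one per iteration
    }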
The ALU (arithmetic logic unit) in the CPU can do one 64-bit operation at a time, or two 32-bit operations at the same time, using the same hardware and bus. This can double the speed of operations when using 32-bit instead of 64-bit integers. Integers smaller than 32 bits get automatically promoted to 32-bit, so there's no speed to be gained by using even smaller data types.
Unsigned integers are faster to work with, except when they are converted to floating point.
When optimizing code, you should take into account that the majority of numbers that are used in an application are small.
When you want to change a specific byte, for example, you don't read/write only that value, but the whole cache line.
When you want to write a value smaller than the cache line (usually 64 bytes), the CPU has to read the whole cache line, change the value, and then write it back to memory.
When you write 64 bytes (the full cache line) at once, and those bytes are cache aligned, the read no longer occurs - the CPU writes the data directly to memory.
If you write a value in memory, the CPU will mark a bit to indicate that the cache line is dirty.
Dirty memory means that it was changed, so it has to be written back to main memory, which is a performance hit.
Therefore, you should try to limit this by not rewriting the same value - checking whether the value has actually changed may be faster (reading a value from cache and comparing it is cheaper than reading, invalidating the cache, writing the value back to cache, and eventually to main memory).
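A minimal sketch of the check-before-write idea (the class and member names are hypothetical):

    class Widget {
        int value_ = 0;
    public:
        void setValue(int v) {
            // Comparing is cheap when value_ is already in cache; skipping
            // the store avoids dirtying the cache line when nothing changed.
            if (value_ != v) {
                value_ = v;
            }
        }
    };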
In order to have as many of our operations as possible executed out of order or pre-computed before they are really needed, we need fewer data dependencies (code that relies on the results of other code being executed first).
Dynamic memory allocations and deallocations are expensive, so try to reduce them where possible.
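For example, with std::vector, reserving the capacity up front avoids repeated reallocations while the container grows (a sketch, assuming the final size is known in advance):

    #include <vector>

    std::vector<int> build(int n) {
        std::vector<int> v;
        v.reserve(n);          // one allocation up front...
        for (int i = 0; i < n; ++i) {
            v.push_back(i);    // ...instead of several grow-reallocate-copy cycles
        }
        return v;
    }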
Exceptions should not be used intensively in the code; they exist only for truly exceptional circumstances.
Each container is good at something and bad at something else; there's no such thing as a perfect container.
Some containers are good at accessing elements from it, some are good at inserting elements into it, and so on.
You should consider the container you’re using based on the actions you’ll be performing on the data.
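A rough sketch of the trade-off, using two standard containers:

    #include <list>
    #include <vector>

    std::vector<int> v{1, 2, 3};  // contiguous: cache-friendly iteration and O(1)
                                  // random access, but middle insertions shift elements
    std::list<int> l{1, 2, 3};    // node-based: O(1) insertion at a known position,
                                  // but no random access and poor cache locality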
If you have an expression such as ‘a && b’, the compiler is not allowed to change the order of operations because it would change the logic of your software.
We need to keep in mind that the CPU will not evaluate the entire condition if the value of the expression is already clear.
Example:
A && B will result in 1 if and only if both of them are true:
if A is false, we know the result is 0, so B will not be evaluated.
A || B will result in 1 if at least one of them is true:
if A is true, we know the result is 1, so B will not be evaluated.
For example ‘if (pointer != null && pointer->getData())’:
If the compiler were allowed to change the order of operations, the second condition could be executed first, and if the pointer were null, we would get a null pointer dereference and the application would crash.
Therefore, the CPU is not allowed to change the order of operations in such a case.
For the example ‘A && B’:
If we know that expression B has more chances to be false, we could switch the order to ‘B && A’.
That is because, assuming A is false in 50% of cases and B is false in 80% of cases:
for the condition ‘A && B’, if we execute it 10 times, A will be true in 5 cases, and we will have to evaluate B in all 5 of those cases;
for the condition ‘B && A’, under the same terms, B will be false in 8 out of 10 cases, so A will be evaluated in only 2 cases.
We also need to take into account the time required to evaluate both the A and B expressions, and to make sure that changing the condition order will not break the logic of the application.
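A small sketch combining both points (Node, isRare, and isCommon are hypothetical):

    struct Node { int data = 0; int getData() const { return data; } };

    bool isRare(int x)   { return x % 5 == 0; }  // false ~80% of the time
    bool isCommon(int x) { return x % 2 == 0; }  // false ~50% of the time

    bool check(const Node* pointer, int x) {
        // Safety: short-circuiting guarantees the null check runs first,
        // so the dereference can never happen on a null pointer.
        if (pointer != nullptr && pointer->getData() > 0) {
            return true;
        }
        // Probability: isRare() fails most of the time, so isCommon() is
        // rarely evaluated. Only valid because the two calls are independent.
        return isRare(x) && isCommon(x);
    }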
The following example does not benefit from the cache, causing cache misses, because the matrix is stored row by row in memory but is traversed column by column:

    const int rows = 10;
    const int cols = 10;

    int matrix[rows][cols] = {};
    for (int col = 0; col < cols; ++col) {
        for (int row = 0; row < rows; ++row) {
            // matrix[row][col]: each access jumps a full row ahead in memory
        }
    }
The following example benefits from the cache and is much faster, due to traversal by rows instead of columns - memory is accessed sequentially:

    const int rows = 10;
    const int cols = 10;

    int matrix[rows][cols] = {};
    for (int row = 0; row < rows; ++row) {
        for (int col = 0; col < cols; ++col) {
            // matrix[row][col]: each access moves to the adjacent element
        }
    }
In case we use related data that is far apart in memory, we should gather it together to make use of cache locality and cache hits. In the first version, data[i] and items[i] live in two separate allocations, far from each other:

    class Data {};
    class Item {};

    vector<Data> data(10000);
    vector<Item> items(10000);

    for (int i = 0; i < 10000; ++i) {
        execute(data[i], items[i]);   // touches two distant memory regions
    }

Grouping the two into one structure keeps each pair together, often on the same cache line:

    class Entry {
        Data data;
        Item item;
    };

    vector<Entry> entries(10000);

    for (int i = 0; i < 10000; ++i) {
        execute(entries[i]);          // one contiguous region, better locality
    }
Don’t define object variables before their use/initialization.
If the variable is a complex object, constructing it can be costly and may not even be required - for example, when the function returns early, before using the object at all.
Even for small objects or primitives, defining them before they are used forces the developer to keep an eye on that variable and to reason about where it might change inside the function. Having only the items you need, at the time they are needed, simplifies the logic and makes bugs easier to detect. Also, by delaying the construction of the object, we ensure that the cost is paid only when needed, and that the data is still in cache when we use it.
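A sketch of the idea (the function name and the buffer size are hypothetical):

    #include <string>

    std::string buildReport(bool hasData) {
        if (!hasData) {
            return "";           // early return: no report object was ever built
        }

        std::string report;      // declared only on the path that actually uses it
        report.reserve(1024);
        // ... fill the report ...
        return report;
    }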
Well-performing code allows the branch predictor to identify and follow a pattern in order to speed things up. To help it, we should try to avoid unpredictable branches or make them predictable with other operations, such as sorting the data first.
                                            
    double sum = 0;
    const int arraySize = 10000;
    int data[arraySize];
    fill(data, arraySize); // fill the array with random values

    for (int i = 0; i < arraySize; ++i) {
        if (data[i] > 1000) {   // unpredictable branch: values are random
            sum += data[i];
        }
    }
                                        
The problem with the code above is that we don't know what values the array holds, so the processor cannot predict whether the sum will be calculated or not.
If we sort the data, the first values will be below the threshold (1000), and once we pass it, all the following items will be above it. Therefore, the processor can easily predict whether the branch will be taken and calculate the values in advance.
                                            
    double sum = 0;
    const int arraySize = 10000;
    int data[arraySize];
    fill(data, arraySize); // fill the array with random values
    sort(data, data + arraySize);

    for (int i = 0; i < arraySize; ++i) {
        if (data[i] > 1000) {   // predictable: false at first, then always true
            sum += data[i];
        }
    }
                                        
                                        Natural memory alignment generally refers to the alignment of individual variables.
Let's take for example the following sizes:

    char:    1 byte
    short:   2 bytes
    int:     4 bytes
    double:  8 bytes

Each member must start at an address that is a multiple of its alignment, so the compiler inserts padding before it when needed. It is important to note that the last member is also padded, so that the total size of the structure is a multiple of the largest alignment of any structure member.
                                            
    class Data {
        char c;     // offset 0
        double d;   // offset 8  (7 bytes of padding before it)
        short s;    // offset 16
        int i;      // offset 20 (2 bytes of padding before it)
    };

    sizeof(Data) = 24 bytes
                                        
                                        
                                            
    X 0 0 0 0 0 0 0   // char   = 1 byte,  7 bytes of padding
    X X X X X X X X   // double = 8 bytes, no padding
    X X 0 0 X X X X   // short  = 2 bytes, 2 bytes of padding, then int = 4 bytes

This is the layout of our structure in memory (X = byte used, 0 = padding).
Below is another example, in which we can see that if we change the order of the members, the size of the class changes, along with the alignment and padding.
                                            
    class Data {
        char c;     // offset 0
        short s;    // offset 2 (1 byte of padding before it)
        int i;      // offset 4
        double d;   // offset 8
    };

    sizeof(Data) = 16 bytes
                                            
                                        
                                        
                                            
    X 0 X X           // char = 1 byte, 1 byte of padding, then short = 2 bytes
    X X X X           // int    = 4 bytes, no padding
    X X X X X X X X   // double = 8 bytes, no padding
                                        
Be aware of the optimizations performed by the compiler, and of the compiler settings used to build your application's binary, such as the debug level.
The Address Sanitizer (ASan) finds memory corruptions and other memory errors at runtime. Memory errors can lead to unpredictable behavior and can be hard to reproduce consistently.
Memory issues are very important and should not be treated lightly, as they are the main cause of security vulnerabilities.
The Address Sanitizer replaces the allocation and deallocation functions with custom implementations that allow the regions surrounding the requested memory to be checked for invalid access.
When memory is allocated, the surrounding regions are also allocated and marked as off-limits.
When memory is deallocated, its region is marked as off-limits and added to a quarantine queue, which delays the time until that memory can be reused.
Any access to an off-limits memory region results in an error reported by the sanitizer.
The Address Sanitizer can only detect memory errors that manifest at runtime, such as use-after-free, heap/stack/global buffer overflows, and double frees.
Out of scope are errors such as memory leaks, access to uninitialized memory, or integer overflow.
In order to test for those errors, other instruments can be used, such as TSan (thread sanitizer) or UBSan (undefined behavior sanitizer).
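As a minimal sketch, here is a use-after-free that ASan reports at runtime (the -fsanitize=address flag is supported by both Clang and GCC):

    // Compile with: clang++ -g -fsanitize=address use_after_free.cpp

    int main() {
        int* p = new int[10];
        delete[] p;
        return p[0];  // heap-use-after-free: ASan aborts with a report here
    }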
The Thread Sanitizer (TSan) is a tool that detects data races at runtime. Data races are dangerous because they cause programs to behave unpredictably or result in memory corruption, while also being very hard to find unless explicitly searched for.
TSan can also detect other threading bugs, including uninitialized mutexes and thread leaks.
The Thread Sanitizer records information about each memory access and checks whether that access participates in a data race.
TSan does so by keeping timestamps for all threads in order to understand which threads access the data and in which order. If the order is not consistent, a data race has been identified.
The Memory Sanitizer (MSan) is a tool that detects uninitialized memory reads, which happen when stack- or heap-allocated memory is read before it is written.
The Memory Sanitizer tracks the spread of uninitialized data through memory and reports when a code branch is taken (or not) depending on an uninitialized value.
Profiling is the activity of measuring the space (memory) or time complexity of an application, the usage of a particular instruction, or the frequency and duration of function calls, and it is used to further optimize the application.
Package diagram is a view displaying the coupled classes and their encapsulation into a group (similar to a namespace).
Activity diagram is a view representing the flow from one activity to another, describing an operation within a system.
Reserve is a function that pre-allocates a specific memory size to accommodate new data.
Privilege escalation is the act of exploiting a bug in order to get administrator access.
Composition refers to two classes (composite and component) in which one of them (the composite) contains an instance of the other (the component), creating a ‘has a’ relationship. The composite object owns the component, meaning that the lifetime of the component starts and ends together with that of the composite. In simple terms: a class contains an instance of another class.
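In code, a minimal sketch (Engine and Car are illustrative names):

    class Engine {};      // component

    class Car {           // composite: "a Car has an Engine"
        Engine engine;    // owned by value: created and destroyed together
    };                    // with the Car that contains it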