Shortcut to seniority
If you have a performance issue, or you want to optimize your code, keep a few things in mind first.
First of all, never assume anything: everything should be benchmarked (measured) first.
Secondly, do not measure the debug build, but the release one - that's the one that matters for the user.
Last but not least, the optimization step is one of the last ones within a project - plan for it, but never optimize too soon.
A bottleneck is an inefficiency where the flow of work stops or slows at a particular point or, in simple terms, it's the first component that starts failing or causing slowness within the system. Finding the bottleneck is the goal of a performance tester.
A CPU cache is a hardware cache used by the CPU to reduce the cost (time/energy) of accessing data from main memory. The cache is a small and fast memory, close to the processor core, which stores copies of data from frequently accessed memory locations. Modern CPUs have multiple cache levels (L1, L2, etc.) which increase in size and decrease in speed, but even the slowest level is still very fast compared to the time required to fetch data from main memory.
A cache hit/miss refers to a hit/miss in the highest level of cache in the CPU. On a cache hit, the data is served from the cache; on a cache miss, the data needs to be loaded from RAM / main memory, and the amount read from that memory is equal to the size of the memory bus width. Cache hits are crucial for performance, and developers should strive to build their applications in a way that benefits from the cache - in other words, so that the data they use is already stored in the cache.
The principle of locality refers to keeping related data close together, so that caching can occur more often.
The prefetcher helps optimize performance by reading the traffic that comes into the CPU and the cache, recognizing access patterns, and fetching blocks of memory before they are requested.
False sharing occurs when two CPUs do not logically share any resource but end up sharing one anyway because of the underlying hardware architecture. When two threads each write to a different memory location, they do not need any synchronization, because the data is not dependent on one another. However, if the two locations are mapped to the same cache line, writing to one variable will invalidate the cache line for the other, forcing constant reloads.
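A minimal sketch of how to avoid it, assuming 64-byte cache lines (the structure and counter names are made up for illustration):

    #include <atomic>
    #include <thread>

    // Each counter is forced onto its own cache line via alignas(64),
    // so the two writer threads no longer invalidate each other's line.
    struct Counters {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    int main() {
        Counters c;
        std::thread t1([&] { for (int i = 0; i < 1000000; ++i) ++c.a; });
        std::thread t2([&] { for (int i = 0; i < 1000000; ++i) ++c.b; });
        t1.join();
        t2.join();
    }

Without the alignas, both counters would likely land on the same cache line, and the two threads would keep stealing it from each other.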
Because static libraries are linked fully into the application, they allow many optimizations that are not available for dynamically linked libraries, such as whole program optimization, local jumps, or inlining.
If memory is accessed in a regular pattern that the system can recognize, the prefetcher can do its job and fetch blocks of memory before they are used, so you will not wait for the memory to be read into cache - you can use it right away!
Jumping around in memory does not benefit from caching, and dependencies between data do not allow the hardware to run instructions in parallel.
You might want to create your own implementation of containers, but keep in mind that there probably is one already implemented and used intensively by other developers of that programming language. Those libraries are already written, debugged, and tested - meaning they are less buggy, safer, and probably faster than a new implementation.
Consider writing a new implementation only if you need something specific that is not covered by the standard library and not found elsewhere. Standard libraries are a double-edged sword: although the implementations are fast and usually bug-free, they can still be inefficient, because they pack in as much functionality as possible to please as many people as possible.
In languages that have both references and pointers, the developer needs to decide which one to use in their code. Usually it depends, but there are a couple of differences between them, such as:
Passing objects by value requires the entire object to be copied through the copy constructor, whereas passing by reference does not involve any copy operations.
Having a virtual function in a class requires every instance of that class and of its derived classes to contain an extra pointer. There's also a performance hit when you call a virtual function: there's an additional level of indirection, since the virtual pointer has to access the virtual table in order to get the memory address of the real function.
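A small sketch of the size overhead (the exact numbers are implementation-dependent; an 8-byte virtual pointer is assumed, as on typical 64-bit platforms):

    #include <iostream>

    struct Plain   { int x; };                      // no hidden pointer
    struct Virtual { int x; virtual void f() {} };  // gains a hidden vptr

    int main() {
        // Typically prints 4 and 16 (8-byte vptr plus alignment padding).
        std::cout << sizeof(Plain) << ' ' << sizeof(Virtual) << '\n';
    }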
Declare objects as constants (read-only) if you’re aware that the data is not supposed to change.
Data written on the stack has zero addressing cost and is faster, since it is usually stored in cache or registers.
Accessing a global is as slow as a function call, because the compiler must assume it may have been seen by other threads, so it has to reload the data from memory. Globals also hurt readability and increase the number of bugs, because the developer now has one more piece of data to reason about: globals can be changed from any thread and can cause data races or other problems that appear in multithreaded environments.
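A hedged sketch of the usual workaround (g_count and process are made-up names): accumulate in a local variable, which the compiler is free to keep in a register, and touch the global only once:

    #include <vector>

    int g_count = 0;  // global: the compiler must assume other code may change it

    void process(const std::vector<int>& values) {
        int local = 0;             // local accumulator, register-friendly
        for (int v : values) {
            if (v > 0) ++local;
        }
        g_count += local;          // one global access instead of one per iteration
    }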
The ALU (arithmetic logic unit) in the CPU can do one 64-bit operation at a time, or two 32-bit operations at the same time, using the same hardware and bus. This can double the speed of operations when using 32-bit instead of 64-bit integers. Integers smaller than 32 bits get automatically promoted to 32-bit, so there's no speed to be gained by using even smaller data types.
Unsigned integers are faster to work with, except when they are converted to floating point.
When optimizing code, you should take into account that the majority of numbers that are used in an application are small.
When you want to change a specific byte, for example, you don't read/write only that value, but the whole cache line.
When you want to write a value smaller than the cache line (usually 64 bytes), the CPU has to read the whole cache line, change the value, and then write it back to memory.
When you write 64 bytes (the full cache line) at once, and those bytes are cache aligned, the read no longer occurs - the CPU writes the data directly to memory.
If you write a value in memory, the CPU will mark a bit to indicate that the cache line is dirty.
Dirty memory means that it was changed, so it has to be written back to main memory, which is a performance hit.
Therefore, you should try to limit this by not rewriting the same value - checking whether the value has actually changed may be faster (reading a value from cache and comparing it is cheaper than reading, invalidating the cache, writing the value back to cache, and eventually to main memory).
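A minimal sketch of the check-before-write idea (the class and member names are hypothetical):

    class Widget {
        int value_ = 0;
    public:
        void setValue(int v) {
            // Comparing is cheap when value_ is already in cache; skipping
            // the store avoids dirtying the cache line when nothing changed.
            if (value_ != v) {
                value_ = v;
            }
        }
    };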
In order to have as many of our operations as possible executed out of order or pre-computed before they are really needed, we need fewer data dependencies (code that relies on the results of other code being executed first).
Dynamic memory allocations and deallocations are expensive, so try to reduce them where possible.
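For example, with std::vector, reserving the capacity up front avoids repeated reallocations while the container grows (a sketch, assuming the final size is known in advance):

    #include <vector>

    std::vector<int> build(int n) {
        std::vector<int> v;
        v.reserve(n);          // one allocation up front...
        for (int i = 0; i < n; ++i) {
            v.push_back(i);    // ...instead of several grow-reallocate-copy cycles
        }
        return v;
    }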
Exceptions should not be used intensively in the code; they exist only for truly exceptional circumstances.
Each container is good at something and bad at something else; there's no such thing as a perfect container.
Some containers are good at accessing elements from it, some are good at inserting elements into it, and so on.
You should consider the container you’re using based on the actions you’ll be performing on the data.
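A rough sketch of the trade-off, using two standard containers:

    #include <list>
    #include <vector>

    std::vector<int> v{1, 2, 3};  // contiguous: cache-friendly iteration and O(1)
                                  // random access, but middle insertions shift elements
    std::list<int> l{1, 2, 3};    // node-based: O(1) insertion at a known position,
                                  // but no random access and poor cache locality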
If you have an expression such as ‘a && b’, the compiler is not allowed to change the order of operations because it would change the logic of your software.
We need to keep in mind that the CPU will not evaluate the entire condition if the value of the expression is already clear.
Example:
A && B will result in 1 if and only if both of them are true:
if A is false, we know the result is 0, so B will not be evaluated.
A || B will result in 1 if at least one of them is true:
if A is true, we know the result is 1, so B will not be evaluated.
For example ‘if (pointer != null && pointer->getData())’:
If the compiler were allowed to change the order of operations, the second condition could be executed first, and if the pointer were null, we would get a null pointer dereference and the application would crash.
Therefore, the CPU is not allowed to change the order of operations in such a case.
For the example ‘A && B’:
If we know that expression B has more chances to be false, we could switch the order to ‘B && A’.
That is because, assuming A is false in 50% of cases and B is false in 80% of cases:
for the condition ‘A && B’, if we execute it 10 times, A will be true in 5 cases, and we will have to evaluate B in all 5 of those cases;
for the condition ‘B && A’, under the same terms, B will be false in 8 out of 10 cases, so A will be evaluated in only 2 cases.
We also need to take into account the time required to evaluate both the A and B expressions, and to make sure that changing the condition order will not break the logic of the application.
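A small sketch combining both points (Node, isRare, and isCommon are hypothetical):

    struct Node { int data = 0; int getData() const { return data; } };

    bool isRare(int x)   { return x % 5 == 0; }  // false ~80% of the time
    bool isCommon(int x) { return x % 2 == 0; }  // false ~50% of the time

    bool check(const Node* pointer, int x) {
        // Safety: short-circuiting guarantees the null check runs first,
        // so the dereference can never happen on a null pointer.
        if (pointer != nullptr && pointer->getData() > 0) {
            return true;
        }
        // Probability: isRare() fails most of the time, so isCommon() is
        // rarely evaluated. Only valid because the two calls are independent.
        return isRare(x) && isCommon(x);
    }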
The following example does not benefit from the cache, causing cache misses, because the matrix is stored row by row in memory but is traversed column by column:

    const int rows = 10;
    const int cols = 10;

    int matrix[rows][cols] = {};
    for (int col = 0; col < cols; ++col) {
        for (int row = 0; row < rows; ++row) {
            // matrix[row][col]: each access jumps a full row ahead in memory
        }
    }
The following example benefits from the cache and is much faster, due to traversal by rows instead of columns - memory is accessed sequentially:

    const int rows = 10;
    const int cols = 10;

    int matrix[rows][cols] = {};
    for (int row = 0; row < rows; ++row) {
        for (int col = 0; col < cols; ++col) {
            // matrix[row][col]: each access moves to the adjacent element
        }
    }
In case we use related data that is far apart in memory, we should gather it together to make use of cache locality and cache hits. In the first version, data[i] and items[i] live in two separate allocations, far from each other:

    class Data {};
    class Item {};

    vector<Data> data(10000);
    vector<Item> items(10000);

    for (int i = 0; i < 10000; ++i) {
        execute(data[i], items[i]);   // touches two distant memory regions
    }

Grouping the two into one structure keeps each pair together, often on the same cache line:

    class Entry {
        Data data;
        Item item;
    };

    vector<Entry> entries(10000);

    for (int i = 0; i < 10000; ++i) {
        execute(entries[i]);          // one contiguous region, better locality
    }
Don’t define object variables before their use/initialization.
If the variable is a complex object, constructing it can be costly and may not even be required - for example, when the function returns early, before using the object at all.
Even for small objects or primitives, defining them before they are used forces the developer to keep an eye on that variable and to reason about where it might change inside the function. Having only the items you need, at the time they are needed, simplifies the logic and makes bugs easier to detect. Also, by delaying the construction of the object, we ensure that the cost is paid only when needed, and that the data is still in cache when we use it.
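A sketch of the idea (the function name and the buffer size are hypothetical):

    #include <string>

    std::string buildReport(bool hasData) {
        if (!hasData) {
            return "";           // early return: no report object was ever built
        }

        std::string report;      // declared only on the path that actually uses it
        report.reserve(1024);
        // ... fill the report ...
        return report;
    }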
Well-performing code allows the branch predictor to identify and follow a pattern in order to speed things up. To help it, we should try to avoid unpredictable branches or make them predictable with other operations, such as sorting the data first.
                                            
    double sum = 0;
    const int arraySize = 10000;
    int data[arraySize];
    fill(data, arraySize); // fill the array with random values

    for (int i = 0; i < arraySize; ++i) {
        if (data[i] > 1000) {   // unpredictable branch: values are random
            sum += data[i];
        }
    }
                                        
The problem with the code above is that we don't know what values the array holds, so the processor cannot predict whether the sum will be calculated or not.
If we sort the data, the first values will be below the threshold (1000), and once we pass it, all the following items will be above it. Therefore, the processor can easily predict whether the branch will be taken and calculate the values in advance.
                                            
    double sum = 0;
    const int arraySize = 10000;
    int data[arraySize];
    fill(data, arraySize); // fill the array with random values
    sort(data, data + arraySize);

    for (int i = 0; i < arraySize; ++i) {
        if (data[i] > 1000) {   // predictable: false at first, then always true
            sum += data[i];
        }
    }
                                        
                                        Natural memory alignment generally refers to the alignment of individual variables.
Let's take for example the following sizes:

    char:    1 byte
    short:   2 bytes
    int:     4 bytes
    double:  8 bytes

Each member must start at an address that is a multiple of its alignment, so the compiler inserts padding before it when needed. It is important to note that the last member is also padded, so that the total size of the structure is a multiple of the largest alignment of any structure member.
                                            
    class Data {
        char c;     // offset 0
        double d;   // offset 8  (7 bytes of padding before it)
        short s;    // offset 16
        int i;      // offset 20 (2 bytes of padding before it)
    };

    sizeof(Data) = 24 bytes
                                        
                                        
                                            
    X 0 0 0 0 0 0 0   // char   = 1 byte,  7 bytes of padding
    X X X X X X X X   // double = 8 bytes, no padding
    X X 0 0 X X X X   // short  = 2 bytes, 2 bytes of padding, then int = 4 bytes

This is the layout of our structure in memory (X = byte used, 0 = padding).
Below is another example, in which we can see that if we change the order of the members, the size of the class changes, along with the alignment and padding.
                                            
    class Data {
        char c;     // offset 0
        short s;    // offset 2 (1 byte of padding before it)
        int i;      // offset 4
        double d;   // offset 8
    };

    sizeof(Data) = 16 bytes
                                            
                                        
                                        
                                            
    X 0 X X           // char = 1 byte, 1 byte of padding, then short = 2 bytes
    X X X X           // int    = 4 bytes, no padding
    X X X X X X X X   // double = 8 bytes, no padding
                                        
Be aware of the optimizations performed by the compiler, and of the compiler settings used to build your application's binary, such as the debug level.
The Address Sanitizer (ASan) finds memory corruptions and other memory errors at runtime. Memory errors can lead to unpredictable behavior and can be hard to reproduce consistently.
Memory issues are very important and should not be treated lightly, as they are the main cause of security vulnerabilities.
The Address Sanitizer replaces the allocation and deallocation functions with custom implementations that allow the regions surrounding the requested memory to be checked for invalid access.
When memory is allocated, the surrounding regions are also allocated and marked as off-limits.
When memory is deallocated, its region is marked as off-limits and added to a quarantine queue, which delays the time until that memory can be reused.
Any access to an off-limits memory region results in an error reported by the sanitizer.
The Address Sanitizer can only detect memory errors that manifest at runtime, such as use-after-free, heap/stack/global buffer overflows, and double frees.
Out of scope are errors such as memory leaks, access to uninitialized memory, or integer overflow.
In order to test for those errors, other instruments can be used, such as TSan (thread sanitizer) or UBSan (undefined behavior sanitizer).
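As a minimal sketch, here is a use-after-free that ASan reports at runtime (the -fsanitize=address flag is supported by both Clang and GCC):

    // Compile with: clang++ -g -fsanitize=address use_after_free.cpp

    int main() {
        int* p = new int[10];
        delete[] p;
        return p[0];  // heap-use-after-free: ASan aborts with a report here
    }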
The Thread Sanitizer (TSan) is a tool that detects data races at runtime. Data races are dangerous because they cause programs to behave unpredictably or result in memory corruption, while also being very hard to find unless explicitly searched for.
TSan can also detect other threading bugs, including uninitialized mutexes and thread leaks.
The Thread Sanitizer records information about each memory access and checks whether that access participates in a data race.
TSan does so by keeping timestamps for all threads in order to understand which threads access the data and in which order. If the order is not consistent, a data race has been identified.
The Memory Sanitizer (MSan) is a tool that detects uninitialized memory reads, which happen when stack- or heap-allocated memory is read before it is written.
The Memory Sanitizer tracks the spread of uninitialized data through memory and reports when a code branch is taken (or not) depending on an uninitialized value.
Profiling is the activity of measuring the space (memory) or time complexity of an application, the usage of a particular instruction, or the frequency and duration of function calls, and it is used to further optimize the application.
Package diagram is a view displaying the coupled classes and their encapsulation into a group (similar to a namespace).
Activity diagram is a view representing the flow from one activity to another, describing an operation within a system.
Reserve is a function that pre-allocates a specific memory size to accommodate new data.
Privilege escalation is the act of exploiting a bug in order to get administrator access.
Composition refers to two classes (composite and component) in which one of them (the composite) contains an instance of the other (the component), creating a ‘has a’ relationship. The composite object owns the component, meaning that the lifetime of the component starts and ends together with that of the composite. In simple terms: a class contains an instance of another class.
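In code, a minimal sketch (Engine and Car are illustrative names):

    class Engine {};      // component

    class Car {           // composite: "a Car has an Engine"
        Engine engine;    // owned by value: created and destroyed together
    };                    // with the Car that contains it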