04 Mar 2023
As a C++ programmer, templates are important to know, but in my personal experience I rarely get to write template libraries as part of my day job; I mostly just use them. As such, it takes some dedicated effort to understand why certain design decisions are made. In this article, I’ll list a number of key terms that are important when learning about templates. I find that knowing the taxonomy of a new subject gives me a skeleton to “attach” knowledge to and helps me prioritize what to learn first.
This article is a cop-out from writing a longer article on how templates work in depth, or even why and when they are useful, but there are already plenty of resources out there covering that. I’ll link some of them at the end of this article and might expand on the topic in a future sprint.
Function Templates
The simplest way to use templates is to write a template function:
template<typename T>
void my_func(T t) {
// do something
}
Class Templates
Class templates let you generalize classes for various types:
template<typename T>
class MyClass
{
void my_func(T t) {
// do something
}
};
Full Specialization
Full specializations allow you to override the template class or template function for a specific type.
template<>
class MyClass<int>
{
void my_func(int t) {
// do something special for integers
}
};
Partial Specialization
Only class templates can be partially specialized. That means you provide an override for a type that is still not fully specified, but more specific than just any type T. An example would be a specialization for arrays of arbitrary size N, MyClass<T[N]>.
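A minimal sketch of such a partial specialization of the MyClass template above for built-in arrays could look like this:
#include <cstddef>  // std::size_t

template<typename T, std::size_t N>
class MyClass<T[N]>
{
    void my_func(T (&t)[N]) {
        // do something specific to arrays of size N
    }
};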
Type Parameters
Type parameters are the parameters specified in the template declaration. In the following example those would be T and N: template<typename T, typename N>
Variadic Templates
If a template takes an arbitrary number of parameters, we are speaking of variadic templates. This is indicated via an ellipsis (...), as in template<typename... T>
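For example, here is a minimal variadic function template that simply counts its arguments via sizeof...:
#include <cstddef>

template<typename... Args>
constexpr std::size_t count_args(Args... args) {
    return sizeof...(args);   // number of arguments in the pack
}

// count_args(1, 'a', 3.0) == 3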
Template Instantiation
Template instantiation refers to the process in which the compiler generates a class or function from the template.
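For example, using the templates from the sections above, instantiation happens implicitly when the template is used with concrete types, or it can be requested explicitly:
template class MyClass<double>;   // explicit instantiation of the class template

void example() {
    my_func(42);        // implicitly instantiates my_func<int>
    MyClass<char> c;    // implicitly instantiates MyClass<char>
    (void)c;            // silence the unused-variable warning
}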
Closing Words
Once you get your hands dirty with actual code you’ll also run across shorthands like SFINAE (Substitution Failure Is Not An Error) and practical patterns such as CRTP (the Curiously Recurring Template Pattern). The elephant in the room is template metaprogramming. Templates are quite an extensive topic :)
References
- if you’re into books, I can recommend “C++ Templates - The Complete Guide” by David Vandevoorde et al.
- the Microsoft guide is actually a pretty decent resource to start out with
11 Feb 2023
Memory allocators are a buzzword for everybody interested in systems programming. Today, we’ll dig into malloc() and write our own version of it in C. The principles in other languages will be similar. Even in garbage-collected languages, such as Java and Golang, many of the concepts apply, with the important addition of compaction in Java for more effective space reclamation. But that’s a different topic.
On most systems your program will by default link against glibc’s malloc() implementation, which uses mmap() or sbrk() to allocate larger chunks of memory that are then handed out piecewise on every malloc() call. That’s fine for most applications, but since memory allocation is such a fundamental aspect of a program, commercial applications tend to override this allocator with their own implementation.
Even custom implementations usually follow the same principles and have to solve the same problems using various tradeoffs. Let’s first define a rough interface we are trying to achieve:
void * malloc(size_t bytes);
void free(void *);
There are three classical approaches to implement this:
- slab allocators
- pool allocators
- for very large allocations we can pass through to mmap() directly
Slab Allocators
With all approaches we end up going to the kernel to request a big chunk of memory (our “slab”). We don’t want to request too much because we might not need that much memory, but we want large allocations to amortize the syscall overhead. We then hand out a small piece of memory of arbitrary size on every allocation.
We will need to store the size of the handed-out memory somewhere. Either we place a small header before every chunk, or we only allow specific sizes that we have in our pool allocators and round any smaller allocation up to the next biggest aligned one. Both approaches imply internal fragmentation, meaning we cannot use all the memory with 100% efficiency. Both TCMalloc and the Golang allocator go for the latter solution of rounding up. They then allow looking up the size of an element via a radix tree.
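As a toy sketch of the first option, storing a size header in front of every chunk could look roughly like this (ignoring alignment and out-of-memory handling; the names are made up for illustration):
#include <cstddef>

// one "slab" we pretend to have received from the kernel
static char  slab[1 << 20];
static char* bump = slab;

void* bump_alloc(std::size_t bytes) {
    std::size_t* header = reinterpret_cast<std::size_t*>(bump);
    *header = bytes;                      // remember the size in a header
    bump += sizeof(std::size_t) + bytes;  // advance past header + payload
    return header + 1;                    // user memory starts after the header
}

std::size_t allocation_size(void* p) {
    return *(static_cast<std::size_t*>(p) - 1);   // read the stored size back
}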
Free list and coalescing
Handing out the memory was easy, but if we do not free all memory at the same time we end up with external fragmentation, meaning we might not be able to fulfill large allocations even though we have many small free chunks, because they are not adjacent. Freed elements are typically stored in a free list (a sorted linked list) and adjacent free elements can be coalesced into bigger chunks.
Should the whole buffer become free we can release it back to the kernel.
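A rough sketch of the idea, assuming each free chunk stores its own size and the list is kept sorted by address (backward merging and splitting on allocation are omitted):
#include <cstddef>

struct FreeChunk {
    std::size_t size;   // size of this free chunk in bytes, header included
    FreeChunk*  next;   // next free chunk, sorted by address
};

// merge a chunk with its successor if the two are adjacent in memory
void coalesce_with_next(FreeChunk* chunk) {
    FreeChunk* next = chunk->next;
    if (next && reinterpret_cast<char*>(chunk) + chunk->size ==
                reinterpret_cast<char*>(next)) {
        chunk->size += next->size;   // absorb the neighbour
        chunk->next  = next->next;   // unlink it from the free list
    }
}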
Pool Allocators
We saw that it is messy to serve allocations of different sizes from the same slab. It is much easier to use a single slab if all allocations are of the same size.
Again, we will need a way to determine the size of the allocation on free. TCMalloc uses a radix tree, but it is also possible to encode it in the interface of malloc if we change it a little bit:
struct element;
struct memory_pool;
memory_pool*
create_memory_pool(size_t size_of_element,
size_t num_elements_in_pool);
void * malloc(memory_pool*);
void free(memory_pool*, void *);
As long as we keep the memory pool object around we only need to store the information about size in a single place.
We can use a free list storing pointers to the next free chunk, or we can use a bit vector. This is less efficient than using a radix tree, but hey, this is a teaching exercise, so don’t use this in production.
// note: this pool allocator is not taking care of alignment
// and does not zero memory
#include <cstddef>   // size_t
#include <cstdlib>   // std::malloc, std::free
#include <vector>

struct PoolAllocator
{
    size_t const numElements = 0;
    size_t const elemSize = 0;
    std::vector<bool> freeList; /* true -> element is used */
    void * memoryChunk = nullptr;

    PoolAllocator(size_t numElements_, size_t elemSize_) :
        numElements(numElements_),
        elemSize(elemSize_),
        freeList(numElements_, false)
    {
        // one backing slab for all elements
        memoryChunk = std::malloc(numElements_ * elemSize_);
    }

    ~PoolAllocator()
    {
        std::free(memoryChunk);
    }

    // hand out the first unused element, or nullptr if the pool is full
    void * map()
    {
        for(size_t i = 0; i < numElements; ++i) {
            if (!freeList[i]) {
                freeList[i] = true;
                return (void*)((char*)memoryChunk + i*elemSize);
            }
        }
        return nullptr;
    }

    // mark the element containing addr as free again
    void unmap(void * addr)
    {
        size_t idx = ((char*)addr - (char*)memoryChunk)/elemSize;
        freeList[idx] = false;
    }
};
Multithreading
The implementations we discussed work fine on a single thread but will require synchronization once we go parallel. Beefy machines can have hundreds of CPUs, making this a real bottleneck if synchronization means using a global mutex. Google and Facebook both wrote and open-sourced their own memory allocators, mainly to improve scaling across many cores.
In practice, memory allocators use the concept of arenas, sometimes also referred to as heaps, to solve this problem. Every arena will be specific to a logical core (i.e. a hyperthread on x86), meaning no locking is required as long as we can serve from the arena. We will only require locking to transfer buffers between cores or to get a new buffer from the kernel.
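As a very rough sketch of the idea, one arena per thread (rather than per core, which is what real allocators like TCMalloc shard on) already gives a lock-free fast path:
#include <cstddef>
#include <cstdlib>

struct Arena {
    std::size_t bytesServed = 0;   // purely thread-local bookkeeping, no lock

    void* allocate(std::size_t n) {
        bytesServed += n;
        return std::malloc(n);     // stand-in for carving from a per-arena slab
    }
    void deallocate(void* p) { std::free(p); }
};

Arena& thisThreadArena() {
    thread_local Arena arena;      // one arena per thread, created lazily
    return arena;
}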
Re: My program is spending too much time in malloc()
I mentioned that Google came up with TCMalloc to improve scaling, but it turns out that just measuring the time spent on allocation and deallocation is not necessarily the best metric to optimize for. If you’re curious, you can read more about it on Google’s blog.
26 Oct 2022
flat_maps are a new container adaptor in C++23. Like std::map, it is an ordered associative container, meaning that it allows you to insert key-value pairs and look up keys later on. While std::map is implemented using balanced binary trees, std::flat_map maintains a pair of sorted vectors, one for keys and a second for values. That means that std::map has better asymptotic complexity, but std::flat_map might still perform better due to the cache-friendliness of contiguous memory. To understand that better, let’s look at how exactly std::flat_map is implemented.
How is std::flat_map implemented exactly?
std::flat_map is implemented using two sequence containers (by default std::vector). One container contains the keys in sorted order and the second container maintains all the values in the corresponding order.
This means that inserting is O(n), because if we insert at the front we have to shift all existing elements toward the back. If we insert at the back, insertion is amortized O(1) as long as no reallocation is needed. That makes std::flat_map particularly performant if your input is already sorted. Similar logic applies to deletion.
This helps us understand the following asymptotic complexities:
+-----------+----------+---------------+
| Operation | std::map | std::flat_map |
+-----------+----------+---------------+
| Create | O(nlgn) | O(nlgn) |
| Insert | O(lgn) | O(n) |
| Delete | O(lgn) | O(n) |
| Lookup | O(lgn) | O(lgn) |
+-----------+----------+---------------+
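To make the O(n) insertion concrete, here is a rough sketch of how inserting into two parallel sorted vectors could work (illustrative only, not how the standard library actually implements std::flat_map):
#include <algorithm>
#include <vector>

void flat_insert(std::vector<int>& keys, std::vector<int>& values,
                 int key, int value) {
    // binary search for the insertion point: O(lg n)
    auto it  = std::lower_bound(keys.begin(), keys.end(), key);
    auto idx = it - keys.begin();
    if (it != keys.end() && *it == key) {
        values[idx] = value;                    // key already present: overwrite
        return;
    }
    // shifting the tail to make room is what costs O(n)
    keys.insert(it, key);
    values.insert(values.begin() + idx, value);
}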
When is std::flat_map useful?
std::flat_map was introduced because it is more cache-friendly to search across keys in contiguous memory than to do the many pointer dereferences required by std::map when traversing the binary tree. However, as the complexity table above shows, random insertion and deletion are expensive operations, at least for large sizes.
A rule of thumb is to use flat_maps when insertions and deletions are rare and the workload mostly consists of lookups and iteration. Those are the use cases where std::flat_map shines and where you profit from its cache-friendliness.
How do I use std::flat_map?
Flat map is mostly used like std::map. There are some caveats around iterator stability, and node extraction is not supported. You can look in the standard for more details.
std::flat_map<int, int> my_map;
my_map[0] = 500;
my_map[4] = 3;
Together with std::flat_map, the committee also introduced std::flat_multimap, std::flat_set and std::flat_multiset in C++23. As you can imagine, the multi versions allow duplicate keys, and the sets only maintain a single array of keys without values. Note also that it’s fairly easy to roll your own version of flat_map using std::pair and a single vector. Having two separate vectors is more cache-friendly, which is why it’s used in the standard.
If you’ve enjoyed this article and you are interested in writing high-performance software scaling across dozens of CPUs consider joining our team at Pure Storage. We have offices in Mountain View, Prague and Bangalore and are always looking for top notch talent to help us shape the future of data storage.