Primer on Template Taxonomy

Templates are important for any C++ programmer to know, but in my personal experience I rarely get to write template libraries as part of my day job and mostly just use them. As such, it takes some dedicated effort to understand why certain design decisions are made. In this article, I’ll list a number of keywords that are important when learning about templates. I find that knowing the taxonomy of a new subject gives me a skeleton to “attach” knowledge to and helps me prioritize what to learn first.

This article is a cop-out from writing a longer article on how templates work in depth, or even why and when they are useful, but there are already plenty of resources on that out there. I’ll link some of them at the end of this article and might expand on the topic in a future sprint.

Function Templates

The simplest way to use templates is to write a function template:

template<typename T>
void my_func(T t) {
    // do something
}

Class Templates

Class templates let you generalize classes for various types:

template<typename T>
class MyClass
{
    void my_func(T t) {
        // do something
    }
};

Full Specialization

Full specializations allow you to override a class template or function template for a specific type:

template<>
class MyClass<int>
{
    void my_func(int t) {
        // do something special for integers
    }
};

Partial Specialization

Only class templates can be partially specialized. That means that you provide an override for a type that is still not fully specified, but is more specific than just any type T. An example would be a specialization for arrays of arbitrary element type T and size N: MyClass<T[N]>.
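To make that concrete, here is a minimal sketch of what such a partial specialization could look like (the array-specific signature of my_func is my own illustration):

#include <cstddef>

template<typename T>
class MyClass;                      // primary template, as above

// partial specialization for arrays of arbitrary type T and size N
template<typename T, std::size_t N>
class MyClass<T[N]>
{
    void my_func(T (&t)[N]) {
        // do something special for arrays
    }
};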

Type Parameters

Type parameters are the parameters specified in the template declaration. In the following example, those would be T and N: template<typename T, typename N>

Variadic Templates

If a template takes an arbitrary number of template arguments, we are speaking of a variadic template. This is indicated via an ellipsis (...), as in template<typename... T>
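As a minimal sketch, consider a hypothetical print_all() that accepts any number of arguments of any types (the fold expression requires C++17):

#include <iostream>

template<typename... Args>
void print_all(Args... args)
{
    // fold expression: stream every argument in order
    (std::cout << ... << args) << '\n';
}

// usage: print_all(1, " + ", 2.5, " = ", 3.5);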

Template Instantiation

Template instantiation refers to the process in which the compiler generates a class or function from the template.
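For example, using the templates from above (the last line is an explicit instantiation, which forces the compiler to generate the code even without a call):

my_func(42);                          // implicitly instantiates my_func<int>
MyClass<double> m;                    // implicitly instantiates MyClass<double>
template void my_func<float>(float); // explicit instantiation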

Closing Words

Once you get your hands dirty with actual code, you’ll also run across acronyms like SFINAE and practical patterns such as CRTP. The elephant in the room will be template metaprogramming. Templates are quite an extensive topic :)

References

  • If you’re into books, I can recommend “C++ Templates - The Complete Guide” by David Vandevoorde et al.
  • The Microsoft guide is actually a pretty decent resource to start out with.

Understanding `malloc()` - Dive into Memory Allocators

Memory allocators are a buzzword for everybody interested in systems programming. Today, we’ll dig into malloc() and write our own version of it in C. The principles in other languages are similar. Even in garbage-collected languages such as Java and Golang, many of the concepts apply, with the important addition of compaction in Java for more effective space reclamation. But that’s a different topic.

On most systems your program will by default link against glibc’s malloc() implementation, which uses mmap() or sbrk() to allocate larger chunks of memory that are then handed out piecewise on every malloc() call. That’s fine for most applications, but since memory allocation is such a fundamental aspect of a program, commercial applications tend to override this allocator with their own implementation.

Even custom implementations usually follow the same principles and have to solve the same problems using various tradeoffs. Let’s first define a rough interface we are trying to achieve:

void * malloc(size_t bytes);
void free(void *);

There are three classical approaches to implement this:

  • slab allocators
  • pool allocators
  • for very large allocations we can pass through to mmap() directly

Slab Allocators

With all approaches, we end up going to the kernel to request a big chunk of memory (our “slab”). We don’t want to request too much, because we might not need that much memory, but we want large allocations to amortize the syscall overhead. We then hand out a small piece of memory of arbitrary size on every allocation.

We will need to store the size of the handed-out memory somewhere. Either we allocate a small header before every chunk, or we only allow the specific sizes we have in our pool allocators and round up any smaller allocation to the next biggest aligned one. Both approaches imply internal fragmentation, meaning we cannot use all the memory with 100% efficiency. Both TCMalloc and the Golang allocator go for the latter solution of rounding up. They then allow looking up the size of an element via a radix tree.
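To make the first option concrete, here is a minimal sketch of the size-header approach. As simplifying assumptions, the slab is a static buffer rather than memory from mmap()/sbrk(), and there is no alignment handling, freeing, or error checking:

#include <cstddef>
#include <cstdint>

static std::uint8_t slab[1 << 20];  // our 1 MiB "slab"
static std::size_t  offset = 0;

void * slab_malloc(std::size_t bytes)
{
    if (offset + sizeof(std::size_t) + bytes > sizeof(slab))
        return nullptr;                              // slab exhausted
    std::uint8_t * p = slab + offset;
    *reinterpret_cast<std::size_t *>(p) = bytes;     // store the size header
    offset += sizeof(std::size_t) + bytes;
    return p + sizeof(std::size_t);                  // memory after the header
}

std::size_t size_of_allocation(void * addr)
{
    // read the size back from the header preceding the chunk
    return *(reinterpret_cast<std::size_t *>(addr) - 1);
}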

Free list and coalescing

Handing out the memory was easy, but if we do not free all memory at the same time we end up with external fragmentation, meaning we might not be able to fulfill large allocations even though we have many small free chunks, because they are not adjacent. Freed elements are typically stored in a free list (a sorted linked list), and adjacent free elements can be coalesced into bigger chunks.
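A minimal sketch of such an address-sorted free list and its coalescing step (the node lives inside the freed memory itself, which is why free chunks must be at least as big as the node):

#include <cstddef>

struct FreeChunk
{
    std::size_t size;  // size of this free chunk, including this header
    FreeChunk * next;  // next free chunk, kept sorted by address
};

void coalesce(FreeChunk * chunk)
{
    // merge with the successor if the two chunks touch in memory
    if (chunk->next &&
        reinterpret_cast<char *>(chunk) + chunk->size ==
            reinterpret_cast<char *>(chunk->next))
    {
        chunk->size += chunk->next->size;
        chunk->next = chunk->next->next;
    }
}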

Should the whole buffer become free, we can release it back to the kernel.

Pool Allocators

We saw that it is messy to serve allocations of different sizes from the same slab. It is much easier to use a single slab if all allocations are of the same size.

Again, we will need a way to determine the size of the allocation on free. TCMalloc uses a radix tree, but it is also possible to encode the size in the interface of malloc() if we change it a little bit:

struct element;
struct memory_pool;

memory_pool*
create_memory_pool(size_t size_of_element,
                   size_t num_elements_in_pool);

void * malloc(memory_pool*);
void free(memory_pool*, void *);

As long as we keep the memory pool object around, we only need to store the size information in a single place.

We can use a free list storing pointers to the next free chunk, or we can use a bit vector. This is less efficient than using a radix tree, but hey, this is a teaching exercise, so don’t use this in production.

// note: this pool allocator is not taking care of alignment
//       and does not zero memory

#include <cstddef>
#include <cstdlib>
#include <vector>

struct PoolAllocator
{
    size_t const numElements = 0;
    size_t const elemSize = 0;
    std::vector<bool> freeList; /* true -> element is used */

    void * memoryChunk = nullptr;

    PoolAllocator(size_t numElements_, size_t elemSize_) :
        numElements(numElements_),
        elemSize(elemSize_),
        freeList(numElements_, false)
    {
        memoryChunk = std::malloc(numElements_ * elemSize_);
    }

    ~PoolAllocator()
    {
        std::free(memoryChunk);
    }

    void * map()
    {
        // linear scan for the first unused slot
        for (size_t i = 0; i < numElements; ++i) {
            if (!freeList[i]) {
                freeList[i] = true;
                return (void*)((char*)memoryChunk + i * elemSize);
            }
        }

        return nullptr; // pool exhausted
    }

    void unmap(void * addr)
    {
        // recover the slot index from the address
        size_t idx = ((char*)addr - (char*)memoryChunk) / elemSize;
        freeList[idx] = false;
    }
};
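A quick usage sketch, with arbitrary sizes:

// a pool of 128 elements of 16 bytes each
PoolAllocator pool(128, 16);

void * p = pool.map();  // grab one element
pool.unmap(p);          // and return it to the pool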

Multithreading

The implementations we discussed work fine on a single thread, but will require synchronization once we go parallel. Beefy machines can have hundreds of CPUs, making this a real bottleneck if synchronization means taking a global mutex. Google and Facebook both wrote and open-sourced their own memory allocators (TCMalloc and jemalloc, respectively) mainly to improve scaling across many cores.

In practice, memory allocators use the concept of arenas, sometimes also referred to as heaps, to solve this problem. Every arena will be specific to a logical core (i.e. a hyperthread on x86), meaning no locking is required as long as we can serve from the arena. We will only require locking to transfer buffers between cores or to get a new buffer from the kernel.
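Here is a minimal sketch of that idea, with the simplifying assumptions that arenas are per thread rather than per core, the fast path is a plain bump allocator, and exhausted buffers are simply leaked:

#include <cstddef>
#include <cstdlib>
#include <mutex>

struct Arena
{
    char *      buf  = nullptr;
    std::size_t used = 0;
    std::size_t cap  = 0;
};

static thread_local Arena local_arena;  // one arena per thread
static std::mutex         global_lock;  // only taken on the slow path

void * arena_malloc(std::size_t bytes)
{
    Arena & a = local_arena;
    if (a.used + bytes > a.cap) {
        // slow path: fetch a fresh buffer for this thread; a real
        // allocator would go to the kernel or a global pool here
        std::lock_guard<std::mutex> guard(global_lock);
        a.cap = bytes > (1u << 20) ? bytes : (1u << 20);
        a.buf = static_cast<char *>(std::malloc(a.cap));
        a.used = 0;
    }
    void * p = a.buf + a.used;  // fast path: no locking required
    a.used += bytes;
    return p;
}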

Re: My program is spending too much time in malloc()

I mentioned that Google came up with TCMalloc to improve scaling, but it turns out that just measuring the time spent on allocation and deallocation is not necessarily the best metric to optimize for. If you’re curious, you can read more about it on Google’s blog.


std::flat_map in C++

std::flat_map is a new container adaptor in C++23. Like std::map, it is an ordered associative container, meaning that it allows you to insert key-value pairs and look up keys later on. While std::map is typically implemented using balanced binary trees, std::flat_map maintains a pair of sorted vectors, one for keys and a second for values. That means that std::map has better asymptotic complexity for insertion and deletion, but std::flat_map might still perform better due to the cache-friendliness of contiguous memory. To understand that better, let’s look at how exactly std::flat_map is implemented.

How is std::flat_map implemented exactly?

std::flat_map is implemented using two sequence containers (by default std::vector). One container contains the keys in sorted order and the second container maintains all the values in the corresponding order.

This means that inserting is O(n) in general: if we insert at the front, we have to shift all existing elements towards the back. If we insert at the back, insertion is O(1) as long as no reallocation is needed. That makes std::flat_map particularly performant if your input is already sorted. Similar logic applies to deletion.

This helps us understand the following asymptotic complexities:

+-----------+-----------+---------------+
| Operation | std::map  | std::flat_map |
+-----------+-----------+---------------+
| Create    | O(n lg n) | O(n lg n)     |
| Insert    | O(lg n)   | O(n)          |
| Delete    | O(lg n)   | O(n)          |
| Lookup    | O(lg n)   | O(lg n)       |
+-----------+-----------+---------------+

When is std::flat_map useful?

std::flat_map was introduced because it is more cache-friendly to search across keys in contiguous memory than to perform the many pointer dereferences required by std::map when traversing the binary tree. However, we can see in the complexity table above that random insertion and deletion are expensive operations, at least for large sizes.

The rule of thumb is to use flat maps when insertions and deletions are rare and the workload consists mostly of lookups and iteration. Those are the use cases where std::flat_map shines and where you profit from its cache-friendliness.

How do I use std::flat_map?

A flat map is mostly used like std::map. There are some caveats around iterator stability, and node extraction is not supported. You can look in the standard for more details.

#include <flat_map>

std::flat_map<int, int> my_map;
my_map[0] = 500;
my_map[4] = 3;

Closing comments

Together with std::flat_map, the committee also introduced std::flat_multimap, std::flat_set and std::flat_multiset in C++23. As you can imagine, the multi versions allow duplicate keys, and the sets only maintain a single array of keys without values. Note also that it’s fairly easy to roll your own version of flat_map using std::pair and a single vector, as sketched below. Having two separate vectors is more cache-friendly, which is why the standard uses them.
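As a minimal sketch of that roll-your-own approach, here is a hypothetical MiniFlatMap over a single sorted vector of pairs (it ignores duplicate keys and all the other conveniences of the real container):

#include <algorithm>
#include <utility>
#include <vector>

template<typename K, typename V>
struct MiniFlatMap
{
    std::vector<std::pair<K, V>> data;  // kept sorted by key

    void insert(K key, V value)
    {
        // binary search for the insertion point, then shift: O(n)
        auto it = std::lower_bound(data.begin(), data.end(), key,
            [](const auto & p, const K & k) { return p.first < k; });
        data.insert(it, {std::move(key), std::move(value)});
    }

    V * find(const K & key)
    {
        // binary search over contiguous memory: O(lg n)
        auto it = std::lower_bound(data.begin(), data.end(), key,
            [](const auto & p, const K & k) { return p.first < k; });
        return it != data.end() && it->first == key ? &it->second : nullptr;
    }
};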

If you’ve enjoyed this article and you are interested in writing high-performance software scaling across dozens of CPUs, consider joining our team at Pure Storage. We have offices in Mountain View, Prague and Bangalore and are always looking for top-notch talent to help us shape the future of data storage.
