Fastest heterogeneous container and benchmarks

11/11/2020 updated on 18/06/2022  │ Blog

While benchmarking the heterogeneous containers described here and here along with some related containers, I could not help but notice the performance of entt, more specifically its registry class, which relies on a modified sparse set data structure. Faced with such witchcraft, I decided to peek behind the curtain and see for myself how a heterogeneous container based on it would perform.

Sparse sets

Sparse sets are widely covered data structures (1, 2, 3, 4). I will nevertheless give it a shot here and explain the concepts behind the data structure as succinctly and clearly as possible.

Sparse sets, in their pure form, are actually maps: they map integers $\mathbb N_1$ to integers $\mathbb N_2$. The main members of the structure are the so-called sparse vector and dense vector.

The sparse vector contains at index $n_1\in \mathbb N_1$ a value $n_2\in\mathbb N_2$.

The dense vector contains at index $n_2\in \mathbb N_2$ a value $n_1\in\mathbb N_1$.

What makes the sparse vector sparse is that not all consecutive values within $\mathbb N_1$ are valid. By contrast, all consecutive values within $\mathbb N_2$ exist.

For example, assume that we expect $\mathbb N_1$ to cover the range of integers up to $2^{32}$, i.e. $\mathbb N_1$ = uint32, and $\mathbb N_2$ to cover up to $2^{8}$, i.e. $\mathbb N_2$ = uint8; then the sparse vector can be represented by std::vector<uint8> and the dense vector by std::vector<uint32>. Here, the data structure can store only up to $256$ values, bounded by the maximum value representable by $\mathbb N_2$, but these values can be spread over indices up to $2^{32}$ in the sparse vector. Theoretically, the resulting sparse vector can grow as big as $\max(\mathbb N_1) \times \log_2(\max(\mathbb N_2))$ bits $=2^{32}\times8$ bits $=4\,$GiB… just to store up to $256$ values. Obviously, this is not great. As such, the sparse vector is only suitable for storing relatively sparse but near-consecutive ranges of values. Basically, forget about storing hash values in there, since they usually aim for a great dispersion of values over the available range.
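To make the mapping concrete, here is a minimal integer sparse set sketch (names are hypothetical, not from entt) illustrating the classic constant-time membership test: a value $n_1$ is present exactly when sparse[n1] points into the dense vector and dense[sparse[n1]] == n1.

```cpp
#include <cstdint>
#include <vector>

// Minimal integer sparse set: N1 = uint32_t indices, N2 = uint8_t dense slots.
struct IntSparseSet {
    std::vector<std::uint8_t>  sparse; // sparse[n1] -> index into dense
    std::vector<std::uint32_t> dense;  // dense[n2]  -> n1 stored at slot n2

    void insert(std::uint32_t n) {
        if (n >= sparse.size()) sparse.resize(n + 1, 0xFF); // 0xFF marks "empty"
        sparse[n] = static_cast<std::uint8_t>(dense.size());
        dense.push_back(n);
    }
    bool contains(std::uint32_t n) const {
        // The round trip through both vectors doubles as the membership test.
        return n < sparse.size() && sparse[n] < dense.size()
            && dense[sparse[n]] == n;
    }
};
```

Note that the sparse vector only grows up to the largest inserted index, while the dense vector stays as small as the number of stored values.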

Alright, so far so good, we can store pairs of integers… yay. Fortunately, adapting the structure to store arbitrary objects is just a matter of adding another array that follows the ordering of the dense vector.

template<typename N1, typename N2, typename ObjectT>
struct SparseSet {
    using sparse_vector = std::vector<N2>;
    using dense_vector  = std::vector<N1>;
    using object_vector = std::vector<ObjectT>;
};

Into a heterogeneous container

Two alternatives are presented below, following the path laid out by the previous posts. The first relies on a type-erased pointer to store each object, while the second is based on manual memory management akin to the heterogeneous dynamic array.

The mandatory part is a generator of consecutive identifiers for types. Fortunately, it is relatively simple to implement thanks to atomic integers or static template variables.

using typeid_t = std::uint32_t;
class TypeCounter {
    static inline typeid_t i = 0;
    public:
    template<typename T>
    static inline const auto id = i++;
};
template<typename T> inline auto typeid_v = TypeCounter::id<T>;
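To see what the generator hands out, the counter is repeated below with a small check (verbatim, so the snippet is self-contained). Note that which type receives which value depends on the unspecified order of dynamic initialization of the instantiations; only distinctness and compactness of the ids are guaranteed.

```cpp
#include <cstdint>

using typeid_t = std::uint32_t;
class TypeCounter {
    static inline typeid_t i = 0;
    public:
    template<typename T>
    static inline const auto id = i++;
};
template<typename T> inline auto typeid_v = TypeCounter::id<T>;

// Each distinct type instantiates its own `id<T>`, incrementing the shared
// counter once: with n types used, the ids are exactly {0, 1, ..., n-1}.
```

Beware that the ids are per-program, not stable across builds, so they should never be serialized.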

Alternative 1: the easy way, unique pointers are your friends

The values within $\mathbb N_1$ represent type identifiers, except for one sentinel value marking the absence of a meaningful value. $\mathbb N_2$ only defines how many values can be stored; the smaller it is, the more memory is saved as the sparse vector grows. Here $\mathbb N_2$ is chosen to fit within a byte. The object type is abstracted behind a type-erased pointer std::unique_ptr<void,void(*)(void*)>. The members are:

std::vector<std::uint8_t> sparse;
std::vector<typeid_t>     dense ;
using ptr_t = std::unique_ptr<void,void(*)(void*)>;
std::vector<ptr_t>        objects;

Retrieving a type from the container is an (almost) one-liner:

template<typename T, typename U = std::remove_reference_t<T>>
decltype(auto) get() { return *static_cast<U*>(objects[sparse[typeid_v<U>]].get()); }
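The trick that makes the void pointer safe to own is a captureless lambda, converted to a plain function pointer with unary +, that remembers the concrete type in its body. A minimal round trip through the type erasure (the helper name is hypothetical):

```cpp
#include <memory>
#include <string>
#include <utility>

using ptr_t = std::unique_ptr<void, void(*)(void*)>;

// Wrap any object into a type-erased owning pointer;
// the deleter carries the knowledge of T.
template<typename T>
ptr_t make_erased(T value) {
    return ptr_t{ new T{ std::move(value) },
                  +[](void* p) { delete static_cast<T*>(p); } };
}
```

Casting back with static_cast<T*> is only valid for the exact T that was stored, which is precisely what the typeid_v bookkeeping guarantees.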

Insertion is a bit more involved because the auxiliary sparse and dense vectors must be updated:

template<typename... Ts>
void insert(Ts&& ... values) {
    // construct objects
    objects.reserve(objects.size() + sizeof...(Ts));
    auto construct = [&](auto&& t) {
        using T = decltype(t);
        using U = std::remove_reference_t<decltype(t)>;
        objects.push_back(
            ptr_t{ new U{ std::forward<T>(t) },
                  +[](void* instance) { delete static_cast<U*>(instance); } }
        );
    };
    (construct(std::forward<Ts>(values)), ...);
    // update sparse vector
    const auto max_id = std::max({typeid_v<Ts>...});
    if (max_id >= sparse.size())
        sparse.resize(max_id+1, -1); // -1 wraps to 0xFF, the "empty" sentinel
    const auto ids = std::array{typeid_v<Ts>...};
    for (std::size_t i = 0; i < ids.size(); ++i)
        sparse[ids[i]] = dense.size() + i; // offset by entries already present
    // update dense vector
    dense.reserve(dense.size()+sizeof...(Ts));
    (dense.push_back(typeid_v<Ts>), ...);
}

Obviously, these are the “I know what I am doing” kind of functions: insertion assumes the inserted types are not already present, and lookup expects the object to actually be there.
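Putting the pieces together, a condensed, self-contained sketch of alternative 1 can look as follows (simplified from the full snippet at the end of the post); the sparse entry is offset by dense.size() so that repeated insert calls keep the mapping valid:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

using typeid_t = std::uint32_t;
class TypeCounter {
    static inline typeid_t i = 0;
public:
    template<typename T> static inline const auto id = i++;
};
template<typename T> inline auto typeid_v = TypeCounter::id<T>;

// Condensed pointer-based heterogeneous sparse set.
struct hcs {
    using ptr_t = std::unique_ptr<void, void(*)(void*)>;
    std::vector<std::uint8_t> sparse;  // typeid -> slot in dense/objects
    std::vector<typeid_t>     dense;   // slot   -> typeid
    std::vector<ptr_t>        objects; // slot   -> type-erased instance

    template<typename T>
    T& get() { return *static_cast<T*>(objects[sparse[typeid_v<T>]].get()); }

    template<typename... Ts>
    void insert(Ts&&... values) {
        auto construct = [&](auto&& t) {
            using U = std::remove_reference_t<decltype(t)>;
            objects.push_back(ptr_t{ new U{ std::forward<decltype(t)>(t) },
                +[](void* p) { delete static_cast<U*>(p); } });
        };
        (construct(std::forward<Ts>(values)), ...);
        const auto max_id = std::max({ typeid_v<Ts>... });
        if (max_id >= sparse.size()) sparse.resize(max_id + 1, 0xFF);
        const auto ids = std::array{ typeid_v<Ts>... };
        for (std::size_t i = 0; i < ids.size(); ++i)
            sparse[ids[i]] = static_cast<std::uint8_t>(dense.size() + i);
        (dense.push_back(typeid_v<Ts>), ...);
    }
};
```

The same preconditions apply: each type may be inserted at most once, and get<T>() on an absent type is undefined behavior.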

Alternative 2: the harder way, allocation only under manual supervision

The external level of indirection introduced by the std::unique_ptr, and the potential fragmentation, can be avoided if we handle the memory management ourselves. The objects are stored sequentially as bytes in a buffer, and the location of each object must be stored elsewhere. Here I decided to store it within the dense vector along with the type identifier. To avoid alignment issues, I rely on the aligned allocator provided by the boost library. It yields:

template<typename T, size_t N>
using aligned_allocator = boost::alignment::aligned_allocator<T, N>;
static constexpr inline auto max_alignment = 256UL;
using offset_t = std::uint32_t;
using destructor_t = void(*)(void*);
struct typeid_offset_dtor { typeid_t tag; offset_t offset; destructor_t dtor; };

std::vector<std::uint8_t> sparse;
std::vector<typeid_offset_dtor> dense;
std::vector<std::byte, aligned_allocator<std::byte, max_alignment>> objects;

The extra level of indirection is directly visible, and embedded within the container. For lookup, the raw bytes must be laundered to obtain a usable typed pointer.

template<typename T>
decltype(auto) get() { return do_get<T>(dense[sparse[typeid_v<T>]].offset); }
template<typename T, typename U = std::remove_reference_t<T>>
U& do_get(offset_t n) { return *std::launder(reinterpret_cast<U*>(&objects[n])); }

Insertion is analogous to alternative 1, except for the manual memory allocation. Fortunately, the buffer allocation is similar to the one implemented for the heterogeneous dynamic array described in a previous post.
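The per-object placement hinges on an align-up computation: the padding needed to bring an address up to the next alignment boundary (a power of two) can be computed branch-free, as in this small sketch of the expression used in the allocator:

```cpp
#include <cstddef>
#include <cstdint>

// Padding to add to `address` so the result is a multiple of `alignment`
// (alignment must be a power of two). (~address + 1) is the two's-complement
// negation of the address, so the masked low bits are exactly the distance
// to the next boundary, and 0 when the address is already aligned.
constexpr std::size_t padding_for(std::uintptr_t address, std::size_t alignment) {
    return (~address + 1) & (alignment - 1);
}
```

Since the buffer itself is allocated with max_alignment, offsets computed this way remain valid even after the buffer is reallocated to a different (equally aligned) base address.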

Move-construction details were omitted here for the sake of conciseness, but are present below in the full snippet. In particular, moving the objects explicitly was needed to avoid undefined behavior on Linux.

Limitless variations

A huge number of combinations is possible depending on the preferred tradeoffs. For example, if wasting space is not a problem because only very few types will be stored, then duplicating the object offset within the sparse vector can be a good option for even faster random access.

struct v3
{
    using offset_t = std::uint32_t;
    struct denseid_offset { std::uint8_t denseid; offset_t offset; };
    std::vector<denseid_offset> sparse;
    struct typeid_offset_dtor { typeid_t tag; offset_t offset; destructor_t dtor; };
    std::vector<typeid_offset_dtor> dense;
    std::vector<std::byte, aligned_allocator<std::byte, max_alignment>> objects;

    template<typename T> decltype(auto) get()
    { return do_get<T>(sparse[typeid_v<T>].offset); }
};

Since there exists only a single destructor function for a given type, the address of that function is itself a type identifier. Therefore, the dense vector could be further lightened by dropping the identifier pointing back into the sparse vector, at the cost of a more difficult dense-to-sparse lookup.

struct v4
{
    using offset_t = std::uint32_t;
    struct denseid_offset { std::uint8_t denseid; offset_t offset; };
    std::vector<denseid_offset> sparse;
    struct offset_dtor { offset_t offset; destructor_t dtor; };
    std::vector<offset_dtor> dense;
    std::vector<std::byte, aligned_allocator<std::byte, max_alignment>> objects;

    template<typename T> decltype(auto) get()
    { return *std::launder(reinterpret_cast<T*>(&objects[sparse[typeid_v<T>].offset])); }
};

Benchmarks

Results plotted here were computed with google benchmark. Two scenarios were considered, insertion and lookup. 32 random structures were generated with a python script. For the insertion scenario, tests were done for 1 to 8 types inserted into the container. For the lookup scenario, 32 types were first inserted into the containers, then measurements were performed for the query of 1 to 8 types out of the 32. The queried types are chosen randomly from the 32, but each type can only appear once. Benchmarks were run on an i7-6700k and repeated 11 times. Unless otherwise specified, tests were carried out on Windows and compiled with Clang 10.

Comparison of maps

Since a class of heterogeneous containers presented throughout these posts relies heavily on a map to access its content, let us first have a look at the performance of several kinds of map implementations. Besides the standard library std::unordered_map; boost::flat_map, abseil::flat_hash_map, tsl::robin_map, tsl::hopscotch_map, tsl::sparse_map, and spp::sparsepp were tested for the container described here.

Insertion is fastest for the group comprising std::unordered_map, abseil::flat_hash_map, boost::flat_map, and spp::sparsepp. tsl::robin_map follows closely. The slowest, by a huge margin, is tsl::hopscotch_map.

On lookup, both tsl::robin_map and std::unordered_map are the fastest. I must admit I did not expect the standard container to be so fast, considering all the “horrible” things generally said about it online.

In conclusion, tsl::robin_map and std::unordered_map offer the best performance for the task, especially considering lookup. A different but extensive suite of tests can be found here. Nevertheless, bear in mind that such benchmarks are extremely platform and compiler sensitive! For example, std::unordered_map was found to be the slowest on lookup on WSL. Therefore, tsl::robin_map is chosen as the map implementation owing to its consistency across platforms.

Heterogeneous containers showdown

hc refers to this container, ha to this one, hcs is the pointer-based heterogeneous sparse set, and has is the array-based heterogeneous sparse set. In addition, some other containers are added: vecany represents a vector<any>, mapany is based on unordered_map<typeid_t,any>, and entt_ctx is a container akin to a vector<pair<typeid_t,unique_ptr<void,void(*)(void*)>>>. The last two (andyg and entt_reg) are data structures that can store n instances of a type; they are more generic and beyond the scope of this post, but included out of curiosity. Objects are inserted into the container either one at a time through a loop (“1by1”) or all at once (“bulk”).

The vecany solution is the fastest on insertion for fewer than 7 types, owing to fewer memory allocations (growing memory strategy) and types potentially falling into the small buffer optimization (SBO). However, beyond 7 types, bulk insertion into a heterogeneous sparse set shows near-constant time thanks to the absence of the per-type heap allocation required by unique_ptr<void,...> and std::any.

On lookup, vector-based solutions relying on linear iteration (vecany and entt_ctx) are off the chart, many times slower than the rest, and thus hidden by default (click on their label to make them visible).

Sparse-set-based solutions are significantly faster than std::any solutions and around 2 times faster than map-based heterogeneous containers. The most interesting fact is that the versions with data locality guaranteed by byte-array storage are slightly slower than their std::unique_ptr counterparts. For random-access lookup over only a handful of types, data locality simply does not pay off.

Note that results describe specifically Clang 10 on Windows, and the outcome across platforms may vary. The full set of results can be found here.

Conclusion

Sparse sets seem to be the ideal data structure for storing (near-consecutive) type identifiers while allowing the fastest access to the stored content. Storing objects contiguously in an array of bytes improves neither lookup nor insertion; worse, it affects both negatively.

This post concludes the small series on heterogeneous containers in C++, which specifically covered containers that can hold a single instance of each type. Meanwhile, you can find the code from my past and ongoing experiments on github.

Code

For the sake of completeness, the code of the containers is shown in full below (under MIT license). It can also be toyed with on compiler explorer. Have fun!

heco_1_sparseset_ptr

#include<vector>
#include<array>
#include<algorithm>
#include<cstdint>
#include<cstdlib>
#include<memory>
#include<tuple>
#include<type_traits>
#include<utility>

using typeid_t = std::uint32_t;
class TypeCounter {
    static inline typeid_t i = 0;
    public:
    template<typename T>
    static inline const auto id = i++;
};
template<typename T> inline auto typeid_v = TypeCounter::id<T>;

struct heco_1_sparseset_ptr
{
    using ptr_t = std::unique_ptr<void, void(*)(void*)>;
    std::vector<std::uint8_t> sparse;
    std::vector<typeid_t>     dense;
    std::vector<ptr_t>        objects;

    template<typename T, typename... Rest>
    decltype(auto) get()
    {
        using U = std::remove_reference_t<T>;
        if constexpr (sizeof...(Rest) == 0)
            return *static_cast<U*>(objects[sparse[typeid_v<U>]].get());
        else
            return std::forward_as_tuple(get<T>(), get<Rest>()...);
    }

    template<typename... Ts>
    void insert(Ts&& ... values)
    {
        objects.reserve(objects.size() + sizeof...(Ts));
        auto construct = [&](auto&& t) {
            using T = decltype(t);
            using U = std::remove_reference_t<decltype(t)>;
            objects.push_back(
                std::unique_ptr<void, void(*)(void*)>{
                new U{ std::forward<T>(t) }, +[](void* instance) { delete static_cast<U*>(instance); } });
        };
        (construct(std::forward<Ts>(values)), ...);
        // update sparse vector
        const auto max_id = std::max({ typeid_v<Ts>... });
        if (max_id >= sparse.size())
            sparse.resize(max_id + 1, -1); // -1 wraps to 0xFF, the "empty" sentinel
        const auto ids = std::array{ typeid_v<Ts>... };
        for (std::size_t i = 0; i < ids.size(); ++i)
            sparse[ids[i]] = static_cast<std::uint8_t>(dense.size() + i); // offset by existing entries
        // update dense vector
        dense.reserve(dense.size() + sizeof...(Ts));
        (dense.push_back(typeid_v<Ts>), ...);
    }
};

heco_1_sparseset_array

#include<vector>
#include<array>
#include<cstddef>
#include<cstdint>
#include<cstdlib>
#include<memory>
#include<new>
#include<type_traits>
#include<utility>
#include<algorithm>
#include<boost/align/aligned_allocator.hpp>

using typeid_t = std::uint32_t;
class TypeCounter {
    static inline typeid_t i = 0;
    public:
    template<typename T>
    static inline const auto id = i++;
};
template<typename T> inline auto typeid_v = TypeCounter::id<T>;

struct heco_1_sparseset_bytes
{
    template<typename T, size_t N>
    using aligned_allocator = boost::alignment::aligned_allocator<T, N>;
    static constexpr inline auto max_alignment = 64UL;
    using offset_t = std::uint32_t;

    enum class ACTION { CONSTRUCT, COPY, MOVE, DESTROY };
    using destructor_t = size_t(*)(void*, void*, ACTION);
    struct typeid_offset_dtor { typeid_t tag; offset_t offset; destructor_t dtor; };

    std::vector<std::uint8_t> sparse;
    std::vector<typeid_offset_dtor> dense;
    using vector_bytes = std::vector<std::byte, aligned_allocator<std::byte, max_alignment>>;
    vector_bytes objects;

    heco_1_sparseset_bytes() = default;
    heco_1_sparseset_bytes(heco_1_sparseset_bytes&&) = default;
    heco_1_sparseset_bytes(const heco_1_sparseset_bytes&) = delete;
    heco_1_sparseset_bytes& operator=(const heco_1_sparseset_bytes&) = delete;
    heco_1_sparseset_bytes& operator=(heco_1_sparseset_bytes&&) = default;
    ~heco_1_sparseset_bytes() {
        for (auto& i : dense)
            i.dtor(&objects[i.offset],nullptr,ACTION::DESTROY);
    }

    template<typename T>
    auto record_destructor() {
        return +[](void* src, void* tgt, ACTION action)
        {
            switch (action) {
            case ACTION::CONSTRUCT: new(tgt) T; break;
            case ACTION::DESTROY: std::destroy_at(static_cast<T*>(src)); break;
            case ACTION::COPY: new(tgt) T{ *static_cast<T*>(src) }; break;
            case ACTION::MOVE: new(tgt) T{ std::move(*static_cast<T*>(src)) }; break;
            default: break;
            }
            return sizeof(T);
        };
    }

    template<typename T>
    decltype(auto) get() { return do_get<T>(dense[sparse[typeid_v<T>]].offset); }

    template<typename T, typename U = std::remove_reference_t<T>>
    U& do_get(offset_t n) { return *std::launder(reinterpret_cast<U*>(&objects[n])); }

    template<typename... Ts>
    void insert(Ts&&... types) {
        // allocate and construct objects
        const auto offsets = allocate<Ts...>();
        auto construct = [&](offset_t n, auto&& t) {
            using T = decltype(t);
            using U = std::remove_reference_t<T>;
            new(&objects[n]) U{ std::forward<T>(t) };
        };
        {size_t i = 0; (construct(offsets[i++], std::forward<Ts>(types)), ...); }
        // update sparse vector
        const auto ids = std::array{ typeid_v<Ts>... };
        const auto max_id = *std::max_element(ids.cbegin(), ids.cend());
        if (max_id >= sparse.size())
            sparse.resize(max_id + 1, -1); // -1 wraps to 0xFF, the "empty" sentinel
        for (std::size_t i = 0; i < ids.size(); ++i)
            sparse[ids[i]] = static_cast<std::uint8_t>(dense.size() + i); // offset by existing entries
        // update dense vector
        dense.reserve(dense.size() + sizeof...(Ts));
        {size_t i = 0; (dense.push_back({ typeid_v<Ts>,offsets[i++], record_destructor<Ts>() }), ...); }
    }

    template<typename... T>
    auto allocate()->std::array<offset_t, sizeof...(T)> {
        static_assert(((alignof(T) <= max_alignment) && ...));
        using namespace std;

        constexpr size_t N = sizeof...(T);
        constexpr array<size_t, N> alignments = { alignof(T)... };
        constexpr array<size_t, N> sizes = { sizeof(T)... };

        array<offset_t, N> output;

        const size_t size_before = objects.size();
        uintptr_t ptr_end = uintptr_t(objects.data() + size_before);
        size_t to_allocate = 0;
        for (std::size_t i = 0; i < N; ++i)
        {
            size_t padding = ((~ptr_end + 1) & (alignments[i] - 1));
            output[i] = size_before + to_allocate + padding;
            to_allocate += padding + sizes[i];
            ptr_end += padding + sizes[i];
        }
        size_t size_after = size_before + to_allocate;
        // objects.resize(size_after); // Resizing screws up runtime on Linux ("free(): invalid pointer"): relocating non-trivial objects without moving them is UB
        vector_bytes tmp(size_after);
        for (auto&& [tid, offset, dtor] : dense) {
            dtor(&objects[offset], &tmp[offset], ACTION::MOVE);
            dtor(&objects[offset], nullptr, ACTION::DESTROY); // moved-from objects still need their destructor run
        }
        objects = std::exchange(tmp, {});
        return output;
    }
};