Fastest heterogeneous container and benchmarks
While running some benchmarks on the heterogeneous containers described here and here, along with some related containers, I could not help but notice the performance of entt, more specifically its registry class, which relies on a modified sparse set data structure. Faced with such witchcraft, I decided to peek behind the curtain and see for myself the performance of a heterogeneous container based on it.
Sparse sets
Sparse sets are a widely covered data structure (1, 2, 3, 4). I will give it a shot here and explain, as succinctly and clearly as possible, the concepts behind it.
Sparse sets, in their pure form, are actually maps: they map integers $\mathbb N_1$ to integers $\mathbb N_2$. The main members of the structure are the so-called sparse vector and dense vector.
The sparse vector contains at index $n_1\in \mathbb N_1$ a value $n_2\in\mathbb N_2$.
The dense vector contains at index $n_2\in \mathbb N_2$ a value $n_1\in\mathbb N_1$.
What makes the sparse vector sparse is that not all consecutive values within $\mathbb N_1$ are valid. In contrast, all consecutive values within $\mathbb N_2$ exist.
For example, assume that we expect $\mathbb N_1$ to cover the range of integers up to $2^{32}$, i.e. $\mathbb N_1$ = uint32, and $\mathbb N_2$ to cover up to $2^{8}$, i.e. $\mathbb N_2$ = uint8. Then the sparse vector can be represented by std::vector<uint8> and the dense vector by std::vector<uint32>. Here, the data structure can store only up to $256$ values, bounded by the maximum value describable by $\mathbb N_2$, but these values can spread up to $2^{32}$, representing an index in the sparse vector. Theoretically, the resulting sparse vector can grow as big as $\max(\mathbb N_1) \times \log_2(\max(\mathbb N_2))$ bits $= 2^{32}\times 8$ bits $\approx 4\,\text{GB}$… just to store up to $256$ values. Obviously, this is not great. As such, the sparse vector is only good for storing relatively sparse but near-consecutive ranges of values. Basically, forget about storing hash values in there, since they usually aim for a great dispersion of values over the available range.
Alright, so far so good, we can store pairs of integers… yay. Fortunately, adapting the structure to store any sequence of objects is just a matter of adding another array that follows the ordering of the dense vector.
Into a heterogeneous container
Two alternatives are presented below, following the path laid out by the previous posts. The first relies on a type-erased pointer to store each object, while the second is based on manual memory management akin to the heterogeneous dynamic array.
The mandatory part is the generator of consecutive identifiers for types. Fortunately, it is relatively simple thanks to atomic integers and static template variables.
Alternative 1: the easy way, unique pointers are your friends
The values within $\mathbb N_1$ represent a type identifier, except for one value representing the absence of a meaningful value. $\mathbb N_2$ only defines how many values can be stored; the smaller it is, the more memory is saved as the sparse vector grows. Here $\mathbb N_2$ is chosen to fit within a byte. The object type is abstracted behind a type-erased pointer std::unique_ptr<void, void(*)(void*)>. The members are:
Retrieving a type from the container is (almost) a one-liner:
Insertion is a bit more involved because the auxiliary sparse and dense vectors must be updated:
Obviously, these are "I know what I am doing" kinds of functions, since insertion assumes the inserted type is not already present, and access assumes the object is there.
Alternative 2: the harder way, allocation only under manual supervision
The external level of indirection introduced by the std::unique_ptr, and the potential fragmentation, can be avoided if we handle memory management ourselves. The objects are stored sequentially as bytes in a buffer, and the location of each object must be stored elsewhere. Here I decided to store it within the dense vector, along with the type identifier. To avoid alignment issues, I rely on the aligned allocator provided by the Boost library. It yields:
The extra indirection is directly visible, and embedded within the container. For lookup, the raw bytes must be cleaned up (laundered) to obtain a usable pointer.
Insertion is analogous to alternative 1, except for the manual memory allocation. Fortunately, allocating the buffer is similar to the implementation for the heterogeneous dynamic array described in a previous post.
Move construction details were omitted here for the sake of conciseness, but are present in the full snippet below. In particular, they were needed to avoid undefined behavior on Linux.
Limitless variations
A huge number of combinations is possible depending on the preferred tradeoffs. For example, if wasting space is not a problem because only very few types will be stored, then duplicating the offset for object access within the sparse vector can be a good option for even faster random access.
Since there exists only a single destructor for a given type, the address of that function is itself a type identifier. Therefore, the dense vector could be further lightened by removing the identifier representing the index within the sparse vector, at the cost of a more difficult dense-to-sparse lookup.
Benchmarks
Results plotted here were computed with Google Benchmark. Two scenarios were considered: insertion and lookup. 32 random structures were generated with a Python script. For the insertion scenario, tests were done for 1 to 8 types inserted into the container. For the lookup scenario, 32 types were first inserted into the containers, then measurements were performed for the query of 1 to 8 types out of the 32. The queried types are chosen randomly from the 32, but each type can only appear once. Benchmarks were run on an i7-6700K and repeated 11 times. Unless otherwise specified, tests were carried out on Windows and compiled with Clang 10.
Comparison of maps
Since a class of heterogeneous containers presented throughout these posts relies heavily on a map to access its content, let us first have a look at the performance of several map implementations. Besides the standard library's std::unordered_map, the containers boost::flat_map, absl::flat_hash_map, tsl::robin_map, tsl::hopscotch_map, tsl::sparse_map, and spp::sparse_hash_map were tested for the container described here.
Insertion is fastest for the group comprising std::unordered_map, absl::flat_hash_map, boost::flat_map and spp::sparse_hash_map. The tsl::robin_map follows up close. The slowest by a huge margin is tsl::hopscotch_map.
On lookup, both tsl::robin_map and std::unordered_map are the fastest. I must admit I did not expect the standard container to be so fast, considering all the "horrible" things generally said about it online.
In conclusion, tsl::robin_map and std::unordered_map offer the best performance for the task, especially considering lookup. A different but extensive suite of tests can be found here. Nevertheless, bear in mind that such benchmarks are extremely platform- and compiler-sensitive! For example, std::unordered_map was found to be the slowest on lookup on WSL. Therefore, tsl::robin_map is chosen as the map implementation owing to its consistency across platforms.
Heterogeneous containers showdown
hc refers to this container, ha to this one, hcs is the pointer-based heterogeneous sparse set, and has is the array-based heterogeneous sparse set. In addition, some other containers are included: vecany represents a vector<any>, mapany is based on unordered_map<typeid_t, any>, and entt_ctx is a container akin to a vector<pair<typeid_t, unique_ptr<void, void(*)(void*)>>>. The two others (andyg and entt_reg) are data structures that can store n instances of a type; they are more generic and beyond the scope of this post, but included out of curiosity. Insertion of objects into the container is done either through a loop ("1by1") or with all objects inserted at the same time ("bulk").
The vecany solution is the fastest on insertion for fewer than 7 types, owing to fewer memory allocations (growing memory strategy) and types potentially falling into the small buffer optimization. However, beyond 7 types, bulk insertion into a heterogeneous sparse set shows near-constant time, thanks to the absence of the repeated per-type heap allocation needed by unique_ptr<void,...> and std::any.
Vector-based solutions relying on linear iteration (vecany and entt_ctx) are off the chart, many times slower than the rest, and thus hidden by default (click on their label to make them visible).
Sparse set-based solutions are significantly faster than the std::any solutions and around 2 times faster than map-based heterogeneous containers. The most interesting fact is that, despite the data locality guaranteed by the array-of-bytes storage, it is slightly slower than the respective std::unique_ptr versions. The advantage of data locality for random-access lookup, albeit only going through a few types, turns out not to be beneficial here.
Note that these results specifically describe Clang 10 on Windows; the outcome may vary across platforms. The full set of results can be found here.
Conclusion
Sparse sets seem to be the ideal data structure for storing (near-consecutive) type identifiers while allowing the fastest access to stored content. Storing objects contiguously in an array of bytes does not improve lookup or insertion; worse, it affects them negatively.
This post concludes the small series on heterogeneous containers in C++. It covered specifically containers that can hold a single instance of each type. Meanwhile, you can find the code from my past and ongoing experiments on GitHub.
Code
For the sake of completeness, the code of the containers is shown in full below (under the MIT license). It can also be toyed with on Compiler Explorer. Have fun!
heco_1_sparseset_ptr
heco_1_sparseset_array