[![Build Status](https://travis-ci.org/greg7mdp/sparsepp.svg?branch=master)](https://travis-ci.org/greg7mdp/sparsepp)

# Sparsepp: A fast, memory efficient hash map for C++

Sparsepp is derived from Google's excellent [sparsehash](https://github.com/sparsehash/sparsehash) implementation. It aims to achieve the following objectives:

- A drop-in alternative for unordered_map and unordered_set.
- **Extremely low memory usage** (typically about one byte overhead per entry).
- **Very efficient**, typically faster than your compiler's unordered map/set or Boost's.
- **C++11 support** (if supported by compiler).
- ~~Single header~~ not anymore
- **Tested** on Windows (vs2010-2015, g++), Linux (g++, clang++) and macOS (clang++).

We believe Sparsepp provides an unparalleled combination of performance and memory usage, and will outperform your compiler's unordered_map on both counts. Only Google's `dense_hash_map` is consistently faster, at the cost of much greater memory usage (especially when the final size of the map is not known in advance).

For a detailed comparison of various hash implementations, including Sparsepp, please see our [write-up](bench.md).
## Example

```c++
#include <iostream>
#include <string>
#include <sparsepp/spp.h>

using spp::sparse_hash_map;

int main()
{
    // Create an unordered_map of three strings (that map to strings)
    sparse_hash_map<std::string, std::string> email =
    {
        { "tom",  "tom@gmail.com"},
        { "jeff", "jk@gmail.com"},
        { "jim",  "jimg@microsoft.com"}
    };

    // Iterate and print keys and values
    for (const auto& n : email)
        std::cout << n.first << "'s email is: " << n.second << "\n";

    // Add a new entry
    email["bill"] = "bg@whatever.com";

    // and print it
    std::cout << "bill's email is: " << email["bill"] << "\n";

    return 0;
}
```
## Installation

No compilation is needed, as this is a header-only library. Installation consists of copying the sparsepp directory to wherever it is convenient to include it in your project(s). Also make sure the path to this directory is provided to the compiler with the `-I` option.
## Warning - iterator invalidation on erase/insert

1. Erasing elements is likely to invalidate iterators (for example, when calling `erase()`).
2. Inserting new elements is likely to invalidate iterators (this can also happen with std::unordered_map if the insertion triggers a rehash).
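
One way to stay clear of both hazards is to never keep an iterator alive across a modification: stage the changes while iterating, then apply them afterwards. The sketch below is only an illustration. `add_shifted_copies` is a hypothetical helper, and `std::unordered_map` is used as a stand-in so the snippet compiles without sparsepp; the assumption is that `spp::sparse_hash_map` can be swapped in, since it offers the same interface.

```c++
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in container so the sketch compiles without sparsepp.
// Assumption: spp::sparse_hash_map<int, int> could be used here instead.
using IntMap = std::unordered_map<int, int>;

// Hypothetical helper: for every entry (k, v), add an entry (k + offset, v).
// The new entries are staged in a vector first, so no iterator into `m`
// is alive while `m` is being modified.
template <class Map>
std::size_t add_shifted_copies(Map& m, typename Map::key_type offset)
{
    std::vector<std::pair<typename Map::key_type, typename Map::mapped_type>> staged;
    staged.reserve(m.size());
    for (const auto& kv : m)
        staged.emplace_back(kv.first + offset, kv.second);

    std::size_t added = 0;
    for (const auto& kv : staged)
        if (m.insert(kv).second)   // insert() leaves existing keys untouched
            ++added;
    return added;
}
```

If an element has to be revisited after an insertion, looking it up again by key with `find()` avoids relying on an iterator obtained earlier.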
## Usage

As shown in the example above, you need to include the header file: `#include <sparsepp/spp.h>`

This provides the implementation for the following classes:

```c++
namespace spp
{
    template <class Key,
              class T,
              class HashFcn  = spp_hash<Key>,
              class EqualKey = std::equal_to<Key>,
              class Alloc    = libc_allocator_with_realloc<std::pair<const Key, T>>>
    class sparse_hash_map;

    template <class Value,
              class HashFcn  = spp_hash<Value>,
              class EqualKey = std::equal_to<Value>,
              class Alloc    = libc_allocator_with_realloc<Value>>
    class sparse_hash_set;
}
```
These classes provide the same interface as std::unordered_map and std::unordered_set, with the following differences:

- Calls to `erase()` may invalidate iterators. However, conformant to the C++11 standard, the position and range erase functions return an iterator pointing to the position immediately following the last of the elements erased. This makes it easy to traverse a sparse hash table and delete elements matching a condition. For example, to delete the entries with odd keys:

   ```c++
   for (auto it = c.begin(); it != c.end(); )
       if (it->first % 2 == 1)
           it = c.erase(it);
       else
           ++it;
   ```

   As for std::unordered_map, the order of the elements that are not erased is preserved.

- Since items are not grouped into buckets, Bucket APIs have been adapted: `max_bucket_count` is equivalent to `max_size`, and `bucket_count` returns the sparsetable size, which is normally at least twice the number of items inserted into the hash_map.
## Memory allocator on Windows (when building with Visual Studio)

When building with the Microsoft compiler, we provide a custom allocator because the default one (from the Visual C++ runtime) fragments memory when reallocating.

This is desirable *only* when creating large sparsepp hash maps. If you create lots of small hash_maps, memory usage may increase instead of decreasing as expected. The reason is that, for each instance of a hash_map, the custom memory allocator creates a new memory space to allocate from, which is typically 4K, so it may be a big waste if just a few items are allocated.

In order to use the custom spp allocator, define the following preprocessor macro before including `<sparsepp/spp.h>`:

`#define SPP_USE_SPP_ALLOC 1`
## Integer keys, and other hash function considerations

1. For basic integer types, sparsepp provides a default hash function which does some mixing of the bits of the keys (see [Integer Hashing](http://burtleburtle.net/bob/hash/integer.html)). This prevents a pathological case where inserted keys are sequential (1, 2, 3, 4, ...), and the lookup on non-present keys becomes very slow.

   Of course, users of sparsepp may provide their own hash function, as shown below:

   ```c++
   #include <cstddef>
   #include <cstdint>
   #include <sparsepp/spp.h>

   struct Hash64 {
       size_t operator()(uint64_t k) const { return (k ^ 14695981039346656037ULL) * 1099511628211ULL; }
   };

   struct Hash32 {
       size_t operator()(uint32_t k) const { return (k ^ 2166136261U) * 16777619UL; }
   };

   int main()
   {
       spp::sparse_hash_map<uint64_t, double, Hash64> map;
       // ...
   }
   ```

2. When users provide their own hash function, for example when inserting custom classes into a hash map, the resulting hash values sometimes have similar low-order bits and cause many collisions, decreasing the efficiency of the hash map. To address this use case, sparsepp provides an optional 'mixing' of the hash key (see [Integer Hash Function](https://gist.github.com/badboy/6267743)), which can be enabled by defining the preprocessor macro `SPP_MIX_HASH`.
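
To illustrate why weak low-order bits hurt and what a mixing step buys, here is a small sketch. It does not reproduce sparsepp's internal mixing: `mix64` below is the 64-bit finalizer from MurmurHash3, swapped in purely for illustration, and `aligned_hashes`, `mixed` and `distinct_low_bytes` are hypothetical helpers introduced for this example.

```c++
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// MurmurHash3's 64-bit finalizer, used here as an example of a bit mixer.
// Assumption: this is NOT the function sparsepp applies for SPP_MIX_HASH.
inline std::uint64_t mix64(std::uint64_t k)
{
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

// Hash values that are all multiples of 256 (for example an identity hash
// over aligned offsets): their low 8 bits are always zero.
inline std::vector<std::uint64_t> aligned_hashes(std::size_t n)
{
    std::vector<std::uint64_t> out;
    for (std::size_t i = 0; i < n; ++i)
        out.push_back(static_cast<std::uint64_t>(i) << 8);
    return out;
}

// The same values after the mixing step.
inline std::vector<std::uint64_t> mixed(const std::vector<std::uint64_t>& in)
{
    std::vector<std::uint64_t> out;
    for (std::uint64_t h : in)
        out.push_back(mix64(h));
    return out;
}

// Number of distinct values found in the low 8 bits of the hashes.
inline std::size_t distinct_low_bytes(const std::vector<std::uint64_t>& hashes)
{
    std::set<unsigned> seen;
    for (std::uint64_t h : hashes)
        seen.insert(static_cast<unsigned>(h & 0xffu));
    return seen.size();
}
```

Before mixing, every value shares the same low byte, so a table that indexes by low bits would send them all to one slot; after mixing, the low bytes are spread over many values.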
## Example 2 - providing a hash function for a user-defined class

In order to use a sparse_hash_set or sparse_hash_map, a hash function should be provided. Even though the hash function can be provided via the HashFcn template parameter, we recommend injecting a specialization of `std::hash` for the class into the "std" namespace. For example:
```c++
#include <iostream>
#include <functional>
#include <string>
#include <sparsepp/spp.h>

using std::string;

struct Person
{
    bool operator==(const Person &o) const
    { return _first == o._first && _last == o._last; }

    string _first;
    string _last;
};

namespace std
{
    // inject specialization of std::hash for Person into namespace std
    // ----------------------------------------------------------------
    template<>
    struct hash<Person>
    {
        std::size_t operator()(Person const &p) const
        {
            std::size_t seed = 0;
            spp::hash_combine(seed, p._first);
            spp::hash_combine(seed, p._last);
            return seed;
        }
    };
}

int main()
{
    // As we have defined a specialization of std::hash() for Person,
    // we can now create sparse_hash_set or sparse_hash_map of Persons
    // ----------------------------------------------------------------
    spp::sparse_hash_set<Person> persons = { { "John", "Galt" },
                                             { "Jane", "Doe" } };
    for (auto& p: persons)
        std::cout << p._first << ' ' << p._last << '\n';
}
```
The `std::hash` specialization for `Person` combines the hash values for both first and last name using the convenient `spp::hash_combine` function, and returns the combined hash value.

`spp::hash_combine` is provided by the header `sparsepp/spp.h`. However, class definitions often appear in header files, and it is desirable to limit the size of headers included in such header files, so we provide the very small header `sparsepp/spp_utils.h` for that purpose:

```c++
#include <string>
#include <sparsepp/spp_utils.h>

using std::string;

struct Person
{
    bool operator==(const Person &o) const
    {
        return _first == o._first && _last == o._last && _age == o._age;
    }

    string _first;
    string _last;
    int    _age;
};

namespace std
{
    // inject specialization of std::hash for Person into namespace std
    // ----------------------------------------------------------------
    template<>
    struct hash<Person>
    {
        std::size_t operator()(Person const &p) const
        {
            std::size_t seed = 0;
            spp::hash_combine(seed, p._first);
            spp::hash_combine(seed, p._last);
            spp::hash_combine(seed, p._age);
            return seed;
        }
    };
}
```
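
For readers curious about what a combiner of this kind does, the sketch below uses the classic Boost-style formula. It is an assumption that `spp::hash_combine` behaves along these lines; the function is named `hash_combine_sketch` to make clear it is not the library's own. `Employee` and `EmployeeHash` are hypothetical, and `std::unordered_set` stands in for `spp::sparse_hash_set` so the snippet compiles on its own. It also shows the alternative mentioned above, passing the hash through the HashFcn template parameter instead of specializing `std::hash`.

```c++
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Boost-style combiner (illustrative; not taken from sparsepp's sources).
// Each call folds the hash of `v` into the running `seed`.
template <class T>
inline void hash_combine_sketch(std::size_t& seed, const T& v)
{
    seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hypothetical user-defined key type.
struct Employee
{
    bool operator==(const Employee& o) const { return name == o.name && id == o.id; }

    std::string name;
    int         id;
};

// Hash functor combining the same fields that operator== compares.
struct EmployeeHash
{
    std::size_t operator()(const Employee& e) const
    {
        std::size_t seed = 0;
        hash_combine_sketch(seed, e.name);
        hash_combine_sketch(seed, e.id);
        return seed;
    }
};

// Stand-in for spp::sparse_hash_set<Employee, EmployeeHash>.
using EmployeeSet = std::unordered_set<Employee, EmployeeHash>;
```

Whichever route is chosen, the hash should cover the same fields that `operator==` compares, so that equal objects always produce equal hash values.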
## Example 3 - serialization

sparse_hash_set and sparse_hash_map can easily be serialized/unserialized to a file or network connection.

This support is implemented in the following APIs:

```c++
template <typename Serializer, typename OUTPUT>
bool serialize(Serializer serializer, OUTPUT *stream);

template <typename Serializer, typename INPUT>
bool unserialize(Serializer serializer, INPUT *stream);
```

The following example demonstrates how a simple sparse_hash_map can be written to a file, and then read back. The serializer we use reads and writes to a file using the stdio APIs, but it would be equally simple to write a serializer using the stream APIs:
```c++
#include <cstdio>
#include <string>
#include <sparsepp/spp.h>

using spp::sparse_hash_map;
using namespace std;

class FileSerializer
{
public:
    // serialize basic types to FILE
    // -----------------------------
    template <class T>
    bool operator()(FILE *fp, const T& value)
    {
        return fwrite((const void *)&value, sizeof(value), 1, fp) == 1;
    }

    template <class T>
    bool operator()(FILE *fp, T* value)
    {
        return fread((void *)value, sizeof(*value), 1, fp) == 1;
    }

    // serialize std::string to FILE
    // -----------------------------
    bool operator()(FILE *fp, const string& value)
    {
        const size_t size = value.size();
        // fwrite() reports 0 items for a zero-length write, so skip it for empty strings
        return (*this)(fp, size) && (size == 0 || fwrite(value.c_str(), size, 1, fp) == 1);
    }

    bool operator()(FILE *fp, string* value)
    {
        size_t size;
        if (!(*this)(fp, &size))
            return false;
        char* buf = new char[size];
        // same zero-length caveat as above applies to fread()
        if (size != 0 && fread(buf, size, 1, fp) != 1)
        {
            delete [] buf;
            return false;
        }
        new (value) string(buf, (size_t)size);
        delete[] buf;
        return true;
    }

    // serialize std::pair<const A, B> to FILE - needed for maps
    // ---------------------------------------------------------
    template <class A, class B>
    bool operator()(FILE *fp, const std::pair<const A, B>& value)
    {
        return (*this)(fp, value.first) && (*this)(fp, value.second);
    }

    template <class A, class B>
    bool operator()(FILE *fp, std::pair<const A, B> *value)
    {
        return (*this)(fp, (A *)&value->first) && (*this)(fp, &value->second);
    }
};

int main(int argc, char* argv[])
{
    sparse_hash_map<string, int> age{ { "John", 12 }, {"Jane", 13 }, { "Fred", 8 } };

    // serialize age hash_map to "ages.dmp" file
    FILE *out = fopen("ages.dmp", "wb");
    if (!out)
        return 1;
    age.serialize(FileSerializer(), out);
    fclose(out);

    sparse_hash_map<string, int> age_read;

    // read from "ages.dmp" file into age_read hash_map
    FILE *input = fopen("ages.dmp", "rb");
    if (!input)
        return 1;
    age_read.unserialize(FileSerializer(), input);
    fclose(input);

    // print out contents of age_read to verify correct serialization
    for (auto& v : age_read)
        printf("age_read: %s -> %d\n", v.first.c_str(), v.second);
}
```
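
As the text above notes, a serializer over the stream APIs follows the same shape. The sketch below mirrors the overload set of `FileSerializer` for `std::ostream` and `std::istream`. It has not been checked against sparsepp itself: that `serialize()` and `unserialize()` accept it with `std::ostream*` / `std::istream*` is an assumption based on the templated signatures shown earlier. `StreamSerializer` and `round_trip` are hypothetical names, and the string reader constructs in place only because `FileSerializer` does.

```c++
#include <cstddef>
#include <istream>
#include <new>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

class StreamSerializer
{
public:
    // basic (trivially copyable) types: raw bytes
    template <class T>
    bool operator()(std::ostream *out, const T& value)
    {
        out->write(reinterpret_cast<const char *>(&value), sizeof(value));
        return out->good();
    }

    template <class T>
    bool operator()(std::istream *in, T* value)
    {
        in->read(reinterpret_cast<char *>(value), sizeof(*value));
        return in->good();
    }

    // std::string: length prefix followed by the characters
    bool operator()(std::ostream *out, const std::string& value)
    {
        const std::size_t size = value.size();
        if (!(*this)(out, size))
            return false;
        out->write(value.data(), static_cast<std::streamsize>(size));
        return out->good();
    }

    bool operator()(std::istream *in, std::string* value)
    {
        std::size_t size = 0;
        if (!(*this)(in, &size))
            return false;
        std::string tmp(size, '\0');
        if (size != 0)
            in->read(&tmp[0], static_cast<std::streamsize>(size));
        if (!in->good())
            return false;
        // Mirrors FileSerializer: `value` is treated as raw storage.
        new (value) std::string(std::move(tmp));
        return true;
    }

    // std::pair<const A, B>, needed for maps
    template <class A, class B>
    bool operator()(std::ostream *out, const std::pair<const A, B>& value)
    {
        return (*this)(out, value.first) && (*this)(out, value.second);
    }

    template <class A, class B>
    bool operator()(std::istream *in, std::pair<const A, B> *value)
    {
        return (*this)(in, const_cast<A *>(&value->first)) && (*this)(in, &value->second);
    }
};

// Hypothetical helper: encodes one string into memory and decodes it again.
// Returns an empty string if either step fails.
inline std::string round_trip(const std::string& text)
{
    StreamSerializer s;
    std::ostringstream out;
    if (!s(&out, text))
        return std::string();

    std::istringstream in(out.str());
    alignas(std::string) unsigned char raw[sizeof(std::string)];  // raw storage
    std::string *decoded = reinterpret_cast<std::string *>(raw);
    if (!s(&in, decoded))
        return std::string();

    std::string result = *decoded;
    using String = std::string;
    decoded->~String();   // destroy the object built with placement new
    return result;
}
```

When reading from or writing to files with this approach, the streams should be opened in binary mode (`std::ios::binary`), since the format is raw bytes.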
## Thread safety

Sparsepp follows the thread safety rules of the Standard C++ library. In particular:

- A single sparsepp hash table is thread safe for reading from multiple threads. For example, given a hash table A, it is safe to read A from thread 1 and from thread 2 simultaneously.
- If a single hash table is being written to by one thread, then all reads and writes to that hash table on the same or other threads must be protected. For example, given a hash table A, if thread 1 is writing to A, then thread 2 must be prevented from reading from or writing to A.
- It is safe to read and write to one instance of a type even if another thread is reading or writing to a different instance of the same type. For example, given hash tables A and B of the same type, it is safe if A is being written in thread 1 and B is being read in thread 2.
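
The second rule is the one that needs code. A minimal sketch of one way to satisfy it is to route every access through a single mutex. `LockedTable` and `parallel_fill` are hypothetical names, and `std::unordered_map` is a stand-in so the snippet compiles without sparsepp; the assumption is that `spp::sparse_hash_map<int, int>` can be substituted, since the same rules apply to it.

```c++
#include <cstddef>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Stand-in table type (assumption: spp::sparse_hash_map<int, int> fits here).
using Table = std::unordered_map<int, int>;

// Hypothetical wrapper: every read and write takes the same mutex, so a
// writer always excludes all other readers and writers.
class LockedTable
{
public:
    void put(int key, int value)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        table_[key] = value;
    }

    // Returns true and fills `out` if the key is present.
    bool get(int key, int& out) const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        Table::const_iterator it = table_.find(key);
        if (it == table_.end())
            return false;
        out = it->second;   // copy while the lock is still held
        return true;
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return table_.size();
    }

private:
    mutable std::mutex mutex_;
    Table              table_;
};

// Fills `t` from several writer threads, each using its own key range.
inline void parallel_fill(LockedTable& t, int threads, int per_thread)
{
    std::vector<std::thread> workers;
    for (int w = 0; w < threads; ++w)
        workers.emplace_back([&t, w, per_thread] {
            for (int i = 0; i < per_thread; ++i)
                t.put(w * per_thread + i, w);
        });
    for (std::thread& th : workers)
        th.join();
}
```

The wrapper hands out copies rather than iterators or references, because those would outlive the lock. A single mutex also serializes readers, which the first rule does not require; a reader/writer lock would be the natural refinement for read-heavy workloads.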