Using plain '''malloc()''' can cause serious bottlenecks in parallel programs, because threads contend for the allocator's shared state. Many alternatives are available, for example:
- Intel Threading Building Blocks, which provides a scalable memory allocator.
- The Hoard memory allocator.
- Google's TCMalloc, part of gperftools.
All of these allocators are designed with multiple running threads in mind, and can avoid common pitfalls that arise on modern multi-core hardware.
CPU memory caches
Concurrent programming introduces new problems for making good use of CPU memory caches. Two things are particularly important for performance:
- Keep unrelated data that different threads write to at least one cache line apart. Modern CPUs move memory in fixed-size cache lines, and a core effectively needs exclusive ownership of a line in order to write to it, so writing to the same cache line from multiple threads behaves like lock contention even when the threads touch different variables. This problem is commonly referred to as '''False Sharing''', and avoiding it removes that contention on writes.
- Keep related data as close together as possible. This increases cache locality and reduces the number of cache lines that a particular thread or task has to touch.
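
Both rules can be illustrated with a minimal C++17 sketch. The names `kCacheLine`, `WorkerStats` and `count_in_parallel` are hypothetical, and the 64-byte line size is an assumption that holds on many x86-64 CPUs but is not universal. Each thread's related fields live together in one struct, and each struct is aligned to its own cache line so that no two threads write to the same line.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache line size; check the target CPU before relying on it.
constexpr std::size_t kCacheLine = 64;

// Related per-thread data is kept together (locality), and alignas rounds
// the struct up to a full cache line so neighbouring elements of an array
// never share a line (no false sharing between threads).
struct alignas(kCacheLine) WorkerStats {
    std::uint64_t items = 0;  // number of items processed by this thread
    std::uint64_t sum = 0;    // running sum over those items
};

static_assert(sizeof(WorkerStats) % kCacheLine == 0,
              "each WorkerStats must occupy whole cache lines");

// Starts num_threads workers. Each worker writes only to its own
// WorkerStats slot, so no locking is needed. Returns the total number
// of items processed across all workers.
std::uint64_t count_in_parallel(unsigned num_threads, std::uint64_t iterations) {
    assert(num_threads > 0);
    // C++17 aligned allocation lets std::vector honour the alignas above.
    std::vector<WorkerStats> stats(num_threads);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&stats, t, iterations] {
            for (std::uint64_t i = 0; i < iterations; ++i) {
                stats[t].items += 1;
                stats[t].sum += i;
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    std::uint64_t total = 0;
    for (const auto& s : stats) {
        total += s.items;
    }
    return total;
}
```

Calling `count_in_parallel(4, 1000)` runs four workers and returns 4000. Removing the `alignas` would pack four `WorkerStats` objects into a single 64-byte line, which is the layout the first rule warns against. Whether the difference is measurable depends on the hardware and on how far the compiler optimises the loop.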