Drastically decrease memory usage by switching multiprocessing to forkserver
Before this MR, the parallelism relied on the fork method for multiprocessing.set_start_method (the default), meaning the multiprocessing is done in the usual Unix fashion: all of the parent's memory is passed to the child (forked) process in a Copy-on-Write (CoW) fashion.
Yet, Python being Python, it seems that the huge chunks of memory used (up to 15GB observed on a dense benchmark matrix) are actually copied at some point, probably because data structures get re-indexed or otherwise written to (in CPython, even updating an object's reference count writes to the page that holds it). This leads to a situation where we need ~10GB x NB_CPU of RAM, which is often just too much, and we run out of memory.
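This MR makes the CorePinnedPool rely on the forkserver method instead. With forkserver, workers are forked from a clean server process rather than from the parent, so they do not inherit the parent's memory and there is nothing to CoW. This required a bit of a revamp.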
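
For illustration only, here is a minimal sketch of the forkserver start method using the standard library's plain Pool. It is not the CorePinnedPool code from this MR; forkserver_context, row_sum and run_demo are hypothetical names made up for the example.

```python
import multiprocessing as mp


def forkserver_context():
    # A local context leaves the process-wide start method untouched,
    # unlike multiprocessing.set_start_method("forkserver").
    return mp.get_context("forkserver")


def row_sum(row):
    # Must live at module level: forkserver workers import it by name.
    return sum(row)


def run_demo(rows, processes=2):
    # Workers are forked from the clean server process, so `rows` is not
    # inherited; each item is pickled and sent to a worker explicitly.
    with forkserver_context().Pool(processes=processes) as pool:
        return pool.map(row_sum, rows)
```

Call run_demo(...) from under an `if __name__ == "__main__":` guard, since forkserver workers re-import the main module. The trade-off is that anything a worker needs has to be passed in (and pickled) explicitly, which is presumably part of why a revamp was required.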