dnadna simulation with multiple CPUs fails (gets stuck after generating only part of the simulations)
I launched simulations on titanic to build a large dataset (1,000 scenarios with 10 replicates each), from an interactive session. With the --n-cpus option, the run gets stuck every time after only a few percent of the simulations have been generated: no new output is produced and the CPUs go idle. I have to interrupt the command myself after several minutes. The error I get is pasted below.
When I launch the same simulation without --n-cpus, it's much slower of course, but it runs all the way to the end.
(dnadna) fjay@titanic-5 ~/gitlab/dnadna (master) $ dnadna simulation run largedatatest/largedatatest_simulation_config.yml --n-cpus 44 --overwrite
2022-04-13 14:22:06; INFO; Running one_event simulator with n_scenarios=1000 and n_replicates=10
2022-04-13 14:22:06; WARNING; Existing scenario params file /home/tao/fjay/gitlab/dnadna/largedatatest/one_event_params.csv and associated simulation data will be overwritten
2022-04-13 14:22:06; INFO; Generating new scenario params table
2022-04-13 14:22:06; INFO; Saving generated scenario params file to /home/tao/fjay/gitlab/dnadna/largedatatest/one_event_params.csv
3%|██▌ | 258/10000 [00:14<08:52, 18.29sample/s]Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/tao/fjay/miniconda3/envs/dnadna/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/tao/fjay/miniconda3/envs/dnadna/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/tao/fjay/miniconda3/envs/dnadna/lib/python3.8/multiprocessing/pool.py", line 576, in _handle_results
task = get()
File "/home/tao/fjay/miniconda3/envs/dnadna/lib/python3.8/multiprocessing/connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
File "/home/tao/fjay/miniconda3/envs/dnadna/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
fd = df.detach()
File "/home/tao/fjay/miniconda3/envs/dnadna/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home/tao/fjay/miniconda3/envs/dnadna/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
return recvfds(s, 1)[0]
File "/home/tao/fjay/miniconda3/envs/dnadna/lib/python3.8/multiprocessing/reduction.py", line 164, in recvfds
raise RuntimeError('received %d items of ancdata' %
RuntimeError: received 0 items of ancdata
7%|██████▊ | 708/10000 [00:50<07:04, 21.87sample/s]
nproc reports 48 CPUs; I tried with both 44 and 30 CPUs.
I'll try again with a lower number of CPUs, as well as in debug mode.
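
One thing that might be worth checking in the meantime: "received 0 items of ancdata" coming from torch/multiprocessing/reductions.py is often reported when the process runs out of file descriptors while tensors are passed between workers. I have not confirmed that this is the cause here. Below is a minimal sketch, using only the standard library, that reads the open-file limit and raises the soft limit if possible. The helper name `raise_open_file_limit` and the target of 4096 are my own choices for illustration and are not part of dnadna.

```python
import resource


def raise_open_file_limit(target=4096):
    """Raise the soft open-file limit toward `target` (hypothetical helper).

    Returns the (soft, hard) limits in effect after the attempt.
    The soft limit is never raised above the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process cannot exceed the hard limit, so cap the target.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)

    # Nothing to do if the soft limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # Some platforms refuse the new value; keep the current limits.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("limits before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("limits after: ", raise_open_file_limit())
```

The same check can be done from the shell with `ulimit -n` before launching the command. If the limit does turn out to be the problem, another commonly suggested workaround is calling `torch.multiprocessing.set_sharing_strategy('file_system')` before the worker pool starts, although where that call would belong in dnadna is something I have not looked into.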