Local Submission
Description
The Deep Learning component of Melissa is still a work in progress: we are investigating how to avoid catastrophic forgetting and how to use active learning to train faster on fewer simulations.
To run these experiments we rely on a few fairly simple examples (e.g. the Lorenz attractor). Each simulation lasts less than a second, and the model being trained is also simple, with fewer than 1M trainable parameters. For such experiments a cluster may not be appropriate: the time spent waiting for the submission can be prohibitive when the experiment itself is short. We also found, together with @mschoule, that launching many simulations with Slurm was problematic. First, the cluster may limit the number of running jobs; second, Slurm has an overhead that makes the experiment much longer than when it is executed locally. Hence we lose time precisely where we want to be quick and responsive.
The `mpirun` scheduler of `melissa.launcher` seemed appropriate for local testing. We added the `--oversubscribe` option to the submission to allow oversubscription by many jobs. On a local computer, running many Lorenz simulations (around 1000) freezes the machine, which is overwhelmed (16 processors, 16 GB of RAM).
The `mpirun` scheduler is not really a scheduler: it does not regulate submissions based on what has already been submitted.
Understanding of the Current Implementation of the Launcher
The `launcher` is the module responsible for launching jobs in Melissa. It launches and monitors both the server and the clients.
The `IoMaster` class takes action depending on the event messages it receives. Its `_handle_job_submission` method is responsible for submitting jobs. It calls `scheduler.submit_job`, which provides the command to launch the job. Hence the `melissa.launcher.Scheduler` class is merely responsible for formatting a command line (e.g. an `srun ...` or `mpirun ...` command with the appropriate options). Once the command is formatted, it is submitted through the `_launcher_process` method of the `IoMaster`. This ultimately calls the `launch` method of `process.py`. The submission is done through `subprocess.Popen`.
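As a rough sketch of this flow (the function names below are illustrative only, not Melissa's actual API): the scheduler merely builds a command line, and the launcher hands it to `subprocess.Popen`.

```python
import subprocess

# Hypothetical sketch of the flow described above; the names are
# illustrative, not Melissa's actual API.
def format_command(executable, num_procs):
    # the Scheduler's role: merely format the command line
    return ["mpirun", "-n", str(num_procs), executable]

def launch(cmd):
    # the launcher's role: submit the formatted command as a new process
    return subprocess.Popen(cmd)

cmd = format_command("./simulation", 4)
# cmd == ["mpirun", "-n", "4", "./simulation"]
```

Note that `Popen` returns immediately: nothing in this chain counts how many of the launched processes are still alive.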
What we need is to know how many processes have been submitted and how many are still running, in order to regulate the number of alive clients.
I proposed to do so using `concurrent.futures` in https://gitlab.inria.fr/melissa/melissa-combined/-/merge_requests/31. The `concurrent.futures` module is convenient as it provides an easy way to do this kind of scheduling. `multiprocessing` also provides a `Pool` of processes, but it is slightly less straightforward to use than the one from `concurrent.futures`.
The proposed solution is to have a limited number of worker processes that themselves run the `subprocess.Popen` command, hence launch yet another process. This gives no direct control over the number of processes launched.
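The idea can be sketched as follows (a minimal, hypothetical illustration, not the actual code of the merge request): each pool worker blocks on a `subprocess` call, so the pool size caps the number of simultaneous jobs only indirectly.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

def run_job(cmd):
    # each pool worker launches and waits on yet another OS process,
    # the actual job; the pool size only indirectly caps the job count
    return subprocess.run(cmd).returncode

def submit_all(commands, max_workers=4):
    # at most max_workers jobs run at once, but every job consumes
    # a worker process *and* a subprocess: the dichotomy mentioned above
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, commands))

if __name__ == "__main__":
    exit_codes = submit_all([["sleep", "0"] for _ in range(8)])
```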
This approach is rather clumsy as it introduces a dichotomy between processes launched by `subprocess` and those launched by `concurrent.futures`. Besides, it doesn't work.
The log files are attached here: logs.zip. The `openmpi.*.err` files are either empty or not relevant. The error still seems to occur in the client execution. Due to `concurrent.futures` we may have lost part of the traceback.
The implementation has to be thought through further before any fix is made.
I see two options to be discussed:
- Improving `process.py` so as to record the number of running subprocesses and create a queue for the ones that are waiting. I'm not sure this doesn't break the launcher logic.
- Using, as @mschoule suggested, an event to mark a job as pending.