Local Submission
Description
The Deep Learning component of Melissa is still a work in progress: we are investigating how to avoid catastrophic forgetting and how to use active learning to train faster on fewer simulations.
To run these experiments we rely on a few fairly simple examples (e.g. the Lorenz attractor). Each simulation lasts less than a second, and the model being trained is also simple, with fewer than 1M trainable parameters. For such experiments a cluster may not be appropriate: the time spent waiting for the submission can be prohibitive when the experiment itself is short. We also found, together with @mschoule, that launching many simulations with Slurm was problematic. First, the cluster may limit the number of running jobs; second, Slurm has an overhead that makes the experiment much longer than when it is executed locally. Hence we lose time precisely where we want to be quick and responsive.
The `mpirun` scheduler of `melissa.launcher` seemed appropriate for local testing. We added the `--oversubscribe` option to the submission to allow oversubscription by many jobs. On a local computer, running many Lorenz simulations (around 1000) freezes the machine, which is overwhelmed (16 processors, 16 GB of RAM).
The `mpirun` scheduler is not really a scheduler: it does not regulate submissions based on what has already been submitted.
Understanding of the Current Implementation of the Launcher
The `launcher` is the module responsible for launching jobs in Melissa. It launches and monitors both the server and the clients.
The `IoMaster` class takes action depending on the event messages it receives. Its `_handle_job_submission` method is responsible for submitting jobs. It calls `scheduler.submit_job`, which provides the command to launch the job. Hence the `melissa.launcher.Scheduler` class is merely responsible for formatting a command line (e.g. an `srun ...` or `mpirun ...` command with the appropriate options). Once the command is formatted, it is submitted through the `_launcher_process` method of the `IoMaster`. This ultimately calls the `launch` method of `process.py`. The submission is done through `subprocess.Popen`.
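As a rough sketch of this flow (the function names below are illustrative only, not Melissa's actual API): the scheduler merely builds a command line, and the launcher hands it to `subprocess.Popen`.

```python
import subprocess

# Hypothetical sketch of the flow described above; the names are
# illustrative, not Melissa's actual API.
def format_command(executable, num_procs):
    # the Scheduler's role: merely format the command line
    return ["mpirun", "-n", str(num_procs), executable]

def launch(cmd):
    # the launcher's role: submit the formatted command as a new process
    return subprocess.Popen(cmd)

cmd = format_command("./simulation", 4)
# cmd == ["mpirun", "-n", "4", "./simulation"]
```

Note that `Popen` returns immediately: nothing in this chain counts how many of the launched processes are still alive.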
What we need is to know how many processes have been submitted and how many are still running, in order to regulate the number of alive clients.
I proposed to do so using `concurrent.futures` in https://gitlab.inria.fr/melissa/melissa-combined/-/merge_requests/31. The `concurrent.futures` module is convenient as it provides an easy way to do this kind of scheduling. `multiprocessing` also provides a `Pool` of processes, but it is slightly less straightforward to use than the one from `concurrent.futures`.
The proposed solution is to have a limited number of worker processes that themselves run the `subprocess.Popen` command, hence launch yet another process. This gives no direct control over the number of processes launched.
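The idea can be sketched as follows (a minimal, hypothetical illustration, not the actual code of the merge request): each pool worker blocks on a `subprocess` call, so the pool size caps the number of simultaneous jobs only indirectly.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

def run_job(cmd):
    # each pool worker launches and waits on yet another OS process,
    # the actual job; the pool size only indirectly caps the job count
    return subprocess.run(cmd).returncode

def submit_all(commands, max_workers=4):
    # at most max_workers jobs run at once, but every job consumes
    # a worker process *and* a subprocess: the dichotomy mentioned above
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, commands))

if __name__ == "__main__":
    exit_codes = submit_all([["sleep", "0"] for _ in range(8)])
```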
This approach is rather clumsy as it introduces a dichotomy between processes launched by `subprocess` and those launched by `concurrent.futures`. Besides, it doesn't work.
The log files are attached here: logs.zip. The `openmpi.*.err` files are either empty or not relevant. The error still seems to occur in the client execution. Due to `concurrent.futures` we may have lost part of the traceback.
The implementation has to be thought through further before any fix is made.
I see two options to be discussed:
- Improving `process.py` so as to record the number of running subprocesses and create a queue for the ones that are waiting. I'm not sure this doesn't break the launcher logic.
- Using, as @mschoule suggested, an event to mark a job as pending.