g5k: delegate execution of big experiments to OAR during the night
Context
On Grid'5000, large experiments cannot run during the day; they must run at night (19:00 - 9:00) or on weekends.
This is possible with Enoslib by passing a parameter such as reservation="2022-08-12 19:00" to the G5K provider. However, Enoslib then simply blocks until the OAR job starts. There is a manual way to make this asynchronous: run your script once in advance, kill it after it has reserved resources, and run the same script again around 19:00. During this second run, Enoslib picks up the existing OAR job and uses it for the experiment. However, this method has several issues:
- you need to be there at 19:00 to manually start your experiment
- G5K has a limit of 2 advance reservations, so it's not possible to plan several big experiments in advance
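As an illustration of the reservation parameter mentioned above, here is a minimal sketch that computes the next 19:00 slot as a string. The helper name `next_night_reservation` is hypothetical, and the output format simply mirrors the example value "2022-08-12 19:00" from this issue; the exact format Enoslib expects should be checked against its documentation.

```python
from datetime import datetime, timedelta

# Nights start at 19:00 according to the context above.
NIGHT_START_HOUR = 19


def next_night_reservation(now: datetime) -> str:
    """Return the next 19:00 slot after `now` as 'YYYY-MM-DD HH:MM'.

    Hypothetical helper: it is not part of Enoslib.
    """
    start = now.replace(hour=NIGHT_START_HOUR, minute=0, second=0, microsecond=0)
    if now >= start:
        # 19:00 has already passed today, so target tomorrow evening.
        start += timedelta(days=1)
    return start.strftime("%Y-%m-%d %H:%M")


if __name__ == "__main__":
    print(next_night_reservation(datetime.now()))
```

The resulting string could then be passed as the reservation value to the G5K provider configuration.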
Idea
The idea is to implement a way for Enoslib to re-execute itself automatically when the OAR job starts. The user workflow would look like the following:
- the user writes their experiment script using Enoslib, and makes sure that it works well by doing small experiments during the day
- for the big experiment, the user activates a special Enoslib mode and runs the script
- in this mode:
  - Enoslib will reserve resources in advance as usual
  - but then Enoslib will instruct OAR to re-run the same script automatically when the OAR job starts
  - Enoslib simply exits for now
- the user is happy because they can stop working before 19:00, everything is automated after that
- a few hours/days later, the OAR job starts and re-runs the same Enoslib script automatically, which finds the resources and can start the actual experiment
- the day after, the user can look at the result of the experiment
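The two-phase workflow above could be sketched as follows. This is only an illustration under assumptions: it assumes that OAR exposes an `OAR_JOB_ID` environment variable inside a running job, and the function names (`inside_oar_job`, `rerun_command`, `choose_phase`) are hypothetical. How Enoslib would actually register the command with OAR is left open here.

```python
import os
import shlex
import sys


def inside_oar_job(environ=None) -> bool:
    """Guess whether we are running inside an OAR job.

    Assumption: OAR sets OAR_JOB_ID in the environment of the job.
    """
    env = os.environ if environ is None else environ
    return "OAR_JOB_ID" in env


def rerun_command(argv, python=None) -> str:
    """Build a shell-safe command line that re-runs the current script."""
    python = python or sys.executable
    # Use an absolute path: OAR may start the job from another directory.
    script = os.path.abspath(argv[0])
    return shlex.join([python, script, *argv[1:]])


def choose_phase(environ=None) -> str:
    """Decide which half of the workflow this execution belongs to."""
    if inside_oar_job(environ):
        # Second run, started by OAR: resources exist, run the experiment.
        return "run-experiment"
    # First run, started by the user: reserve, hand the command to OAR, exit.
    return "reserve-and-exit"


if __name__ == "__main__":
    print(choose_phase(), "|", rerun_command(sys.argv))
```

Quoting the command with `shlex.join` keeps script paths and arguments containing spaces intact when the command line is later interpreted by a shell.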
Caveats
Some caveats about this special mode:
- it will likely not work in Jupyter notebooks: it needs a real script
- the script and its environment should not be changed between the first run and the start of the OAR job. In particular, changing the job_name or the number of resources will break things
- the second execution will be done on a control node, not on a frontend: the user needs to ensure that this will work. The /home is the same, but there are subtle differences such as different API authentication.
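One possible way to guard against the "script must not change" caveat is to record a fingerprint at the first run and compare it at the second run. The sketch below is not an existing Enoslib feature: the `fingerprint` and `check_unchanged` helpers are hypothetical, and where the saved value would be stored (for example a file in the shared /home) is an assumption.

```python
import hashlib
import json


def fingerprint(script_source: bytes, job_name: str, nodes: int) -> str:
    """Hash the script content together with the reservation parameters."""
    digest = hashlib.sha256()
    digest.update(script_source)
    # sort_keys makes the serialized parameters deterministic.
    params = json.dumps({"job_name": job_name, "nodes": nodes}, sort_keys=True)
    digest.update(params.encode("utf-8"))
    return digest.hexdigest()


def check_unchanged(saved: str, script_source: bytes, job_name: str, nodes: int) -> None:
    """Raise if the script or its parameters changed since the first run."""
    if fingerprint(script_source, job_name, nodes) != saved:
        raise RuntimeError(
            "script, job_name or number of resources changed since the reservation"
        )
```

At the first run, the script content can be read with `pathlib.Path(sys.argv[0]).read_bytes()` and the fingerprint saved; the second run would call `check_unchanged` before starting the experiment.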