Proper server job cancellation for slurm semiglobal
The recent introduction of server fault tolerance requires proper management of server job cancellation with `slurm-semiglobal`. This MR introduces the formal notion of a hybrid scheduler, which does the following:
- submission: all jobs are submitted as direct processes, which means that although the server job is submitted through the batch scheduler, the success of its submission won't be monitored by the launcher (no calls to `_run_process_asynchronously` or `_wait_for_process`, and no creation of a `ProcessCompletion_` event)
- updates: client job updates rely on the status of their associated subprocess, while server job updates rely only on PING receptions
- cancellation: client jobs are cancelled through their associated subprocess, while the server job is properly killed via the scheduler. A cancellation request therefore applies both the direct and the indirect cancellation steps.
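
The cancellation split described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in this MR: `HybridJob`, `plan_cancellation` and `cancel` are hypothetical names, and the sketch assumes the server job is known by a Slurm job ID that can be passed to `scancel`.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class HybridJob:
    """Hypothetical job record, not the launcher's actual job class."""
    is_server: bool
    process: Optional[subprocess.Popen] = None  # client jobs: local subprocess
    scheduler_id: Optional[str] = None          # server job: batch scheduler job ID


def plan_cancellation(
    jobs: List[HybridJob],
) -> Tuple[List[subprocess.Popen], Optional[List[str]]]:
    """Split a cancellation request into its direct and indirect parts.

    Returns the client subprocesses to terminate directly and the
    scheduler command (or None) used to kill the server job.
    """
    direct = [j.process for j in jobs if not j.is_server and j.process is not None]
    server_ids = [j.scheduler_id for j in jobs if j.is_server and j.scheduler_id]
    # Assumption: the server job is cancelled with Slurm's scancel command.
    indirect = ["scancel", *server_ids] if server_ids else None
    return direct, indirect


def cancel(jobs: List[HybridJob]) -> None:
    """Apply both cancellation steps."""
    direct, indirect = plan_cancellation(jobs)
    for proc in direct:
        proc.terminate()  # direct step: signal the client subprocess
    if indirect is not None:
        subprocess.run(indirect, check=False)  # indirect step: go through the scheduler
```

Keeping the planning step separate from the side effects makes it possible to check which jobs take which path without a Slurm installation.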