Fix server time-out issue with the slurm_semiglobal scheduler (!75) · Merge requests · melissa / Melissa

SCHOULER Marc requested to merge fix-semiglobal-scheduler into develop Jan 11, 2023

Because server jobs were initialized with a RUNNING state, the launcher expected life signals from the server within 2 * server_ping_interval seconds of the launcher starting. This resulted in a time-out being detected even though the server might not have started running yet. Nothing guarantees that the server job starts before the time-out delay elapses, since the start time depends on GPU availability on the cluster.

This MR solves the issue by using WAITING instead of RUNNING as the initial job state. In addition, the server job state is no longer monitored through the scheduler function _update_jobs_impl. It is now monitored solely via the connection and the messages handled by the state machine.
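The behaviour change can be illustrated with a minimal sketch. This is not the Melissa launcher code: `JobState`, `ServerMonitor`, `on_server_message` and `timed_out` are hypothetical names introduced only for illustration, and the only details taken from the description above are the WAITING/RUNNING states and the 2 * server_ping_interval delay.

```python
from enum import Enum, auto
from typing import Optional


class JobState(Enum):
    WAITING = auto()
    RUNNING = auto()


class ServerMonitor:
    """Hypothetical stand-in for the launcher's view of the server job."""

    def __init__(self, server_ping_interval: float, initial_state: JobState) -> None:
        # Time-out delay as described in the MR: 2 * server_ping_interval.
        self.timeout = 2 * server_ping_interval
        self.state = initial_state
        self.started_at = 0.0  # launcher start time (assumed to be t = 0)
        self.last_signal: Optional[float] = None

    def on_server_message(self, now: float) -> None:
        # Only a connection or message from the server marks the job as RUNNING.
        self.state = JobState.RUNNING
        self.last_signal = now

    def timed_out(self, now: float) -> bool:
        # A job still waiting in the queue cannot time out.
        if self.state is JobState.WAITING:
            return False
        reference = self.started_at if self.last_signal is None else self.last_signal
        return now - reference > self.timeout


# Old behaviour: initial state RUNNING, server still queued after 60 s.
old = ServerMonitor(server_ping_interval=5.0, initial_state=JobState.RUNNING)
# New behaviour: initial state WAITING, same situation.
new = ServerMonitor(server_ping_interval=5.0, initial_state=JobState.WAITING)
```

In this sketch, `old.timed_out(60.0)` reports a time-out although the server never started, whereas `new.timed_out(60.0)` does not. Once `new.on_server_message(...)` is called, the time-out check applies relative to the last signal received.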

Admin message

Fix server time-out issue with the slurm_semiglobal scheduler
