Fix server time-out issue with the slurm_semiglobal scheduler

SCHOULER Marc requested to merge fix-semiglobal-scheduler into develop

Because server jobs were initialized in the RUNNING state, the launcher expected life signals from the server within 2 * server_ping_interval seconds of the launcher starting. This resulted in a time-out being detected even though the server might not have started at all: nothing guarantees that the server job starts before the time-out delay elapses, since that depends on GPU availability on the cluster.

This MR solves the issue by using WAITING instead of RUNNING as the initial job state. In addition, the server job state is no longer monitored through the scheduler function _update_jobs_impl. It is now monitored solely via the connection and the messages handled by the state machine.
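
The behaviour described above can be illustrated with a minimal sketch. It is not the code from this MR: the names `JobState`, `ServerJob`, `on_message` and `server_timed_out` are hypothetical, and the `2 * server_ping_interval` threshold is assumed from the description. The point it shows is that a job starting in WAITING cannot time out until a first message from the server has moved it to RUNNING.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class JobState(Enum):
    WAITING = auto()
    RUNNING = auto()


@dataclass
class ServerJob:
    # Hypothetical stand-in for the launcher's view of the server job.
    state: JobState = JobState.WAITING   # initial state is WAITING, not RUNNING
    last_signal: Optional[float] = None  # time of the last message received

    def on_message(self, now: float) -> None:
        """Called when a connection or message from the server is handled."""
        self.state = JobState.RUNNING
        self.last_signal = now


def server_timed_out(job: ServerJob, now: float, server_ping_interval: float) -> bool:
    """Return True only if a running server has been silent for too long."""
    if job.state is not JobState.RUNNING or job.last_signal is None:
        # Still queued on the cluster: no life signal can be expected yet.
        return False
    # Assumed threshold, taken from the MR description.
    return now - job.last_signal > 2 * server_ping_interval
```

With this shape, a job that sits in the Slurm queue for an arbitrary time is never flagged, and the time-out clock effectively starts at the first message received from the server.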
