
Implementation of server restart and checkpointing

SCHOULER Marc requested to merge server-checkpoint into develop

Until now, a server failure resulted in a complete restart of the study. This MR adds the possibility of checkpointing the server to keep track of a given state (statistics structures or buffer state and population, list of finished clients, list of clients to resubmit, etc.). It will require the following:

  • Add an environment variable to indicate that the server has been submitted multiple times (see the restart-decision sketch below the list).
  • Checkpoint the data structures (e.g. a new server.checkpoint() function; see the checkpoint/restart sketch below the list).
  • Load the data structures (e.g. a new server.restart() function).
  • Ensure TensorBoard log continuity (potentially not possible :/).
  • Add Sensitivity Analysis server checkpointing (to be done in a separate MR).
  • Split/clean restart_from_checkpoint() in torch_server.py.
  • Add rank-specific buffer checkpointing.
  • Test the checkpointing mechanism on G5k or JZ.
  • Handle the case where only one of the server processes crashes while the other stays up.
  • Restart the server from scratch if the failure occurred before its first checkpoint.
  • Add a config variable to define how often the server should be checkpointed.
  • Make it possible to start the launcher from an existing folder.
  • Add documentation on how to use checkpointing/fault tolerance.
  • Add a full fault-tolerance test to CI.
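As a rough illustration of the checkpoint/restart pair, the sketch below dumps and reloads the state listed in the description. The class name, attribute names (statistics, buffer, finished_clients, clients_to_resubmit), the pickle-based format, and the checkpoint path are assumptions made for this sketch, not the actual server API.

```python
import os
import pickle
import tempfile


class Server:
    """Illustrative server holding the state mentioned in the description (names are assumed)."""

    def __init__(self):
        self.statistics = {}            # statistics structures
        self.buffer = []                # buffer state and population
        self.finished_clients = set()   # finished clients
        self.clients_to_resubmit = []   # clients to resubmit

    def checkpoint(self, path="checkpoint.pkl"):
        """Dump the server state atomically: write to a temp file, then rename."""
        state = {
            "statistics": self.statistics,
            "buffer": self.buffer,
            "finished_clients": self.finished_clients,
            "clients_to_resubmit": self.clients_to_resubmit,
        }
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, path)  # atomic rename: a crash never leaves a half-written checkpoint

    def restart(self, path="checkpoint.pkl"):
        """Reload the last checkpointed state."""
        with open(path, "rb") as f:
            state = pickle.load(f)
        self.statistics = state["statistics"]
        self.buffer = state["buffer"]
        self.finished_clients = state["finished_clients"]
        self.clients_to_resubmit = state["clients_to_resubmit"]
```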
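The restart decision could then combine the resubmission indicator, the checkpoint interval, and the restart-from-scratch fallback roughly as sketched below. The environment variable name, the config key, the checkpoint file name, and server.run() are hypothetical placeholders, not names from this codebase.

```python
import os

# Hypothetical names, chosen for this sketch only: the variable exported by the
# launcher on resubmission and the config key controlling checkpoint frequency.
RESUBMISSION_ENV = "SERVER_RESUBMISSION_COUNT"
CHECKPOINT_FILE = "checkpoint.pkl"


def start_server(server, config):
    """Restart from the last checkpoint when possible, otherwise start from scratch."""
    resubmissions = int(os.environ.get(RESUBMISSION_ENV, "0"))
    if resubmissions > 0 and os.path.exists(CHECKPOINT_FILE):
        # The server was resubmitted and a checkpoint exists: resume from it.
        server.restart(CHECKPOINT_FILE)
    # Otherwise (first submission, or failure before the first checkpoint),
    # the server simply starts from scratch.

    interval = config.get("checkpoint_interval", 10)  # checkpoint every N steps
    for step, _work_item in enumerate(server.run(), start=1):  # server.run() is a placeholder
        if step % interval == 0:
            server.checkpoint(CHECKPOINT_FILE)
```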
Edited by CAULK Robert
