Implementation of the server restart and checkpointing
Until now, a server failure resulted in a complete restart of the study. This MR adds the ability to checkpoint the server so that a given state can be recovered (statistics structures or buffer state and population, list of finished clients, list of clients to resubmit, etc.). It will require the following:
- environment variable addition to indicate that the server was submitted multiple times,
- data structures checkpointing (e.g. new function `server.checkpoint()`; a rough sketch follows this list),
- data structure loading (e.g. new function `server.restart()`),
- ensure Tensorboard log continuity - potentially not possible :/
- Add Sensitivity Analysis server checkpointing (doing it in a separate MR)
- Split/clean `torch_server.py` `restart_from_checkpoint()`
- Add rank-specific buffer checkpointing
- Test checkpointing mechanism on G5k or JZ
- Handle the case where only one of the server processes crashes while the other stays up
- restarting the server from scratch if the failure occurred before its first checkpoint,
- having a config variable to define how often the server should be checkpointed (also sketched below),
- Make it possible to start the launcher from an existing folder
- Add documentation for how to use checkpointing/fault tolerance
- Add a full fault-tolerance test to CI
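To make the `server.checkpoint()`/`server.restart()` items more concrete, here is a minimal sketch in Python. Everything except the two proposed method names (the checkpoint directory, the state fields, the file name, the surrounding class) is a placeholder rather than the project's actual API; a real implementation would also have to cover the statistics structures and the per-rank buffer state listed above.

```python
import os
import pickle
import tempfile


class Server:
    """Hypothetical server skeleton; only the checkpoint-related pieces are sketched."""

    def __init__(self, checkpoint_dir="checkpoints"):
        self.checkpoint_dir = checkpoint_dir
        # State that must survive a restart (field names are illustrative only).
        self.buffer = []                # buffer state and population
        self.finished_clients = set()   # clients that already finished
        self.clients_to_resubmit = []   # clients whose jobs must be resubmitted

    def checkpoint(self):
        """Serialize the restart-relevant state to disk atomically."""
        os.makedirs(self.checkpoint_dir, exist_ok=True)
        state = {
            "buffer": self.buffer,
            "finished_clients": self.finished_clients,
            "clients_to_resubmit": self.clients_to_resubmit,
        }
        # Write to a temporary file first, then rename, so a crash during
        # checkpointing never leaves a truncated checkpoint behind.
        fd, tmp_path = tempfile.mkstemp(dir=self.checkpoint_dir)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, os.path.join(self.checkpoint_dir, "server_state.pkl"))

    def restart(self):
        """Reload the last checkpoint; return False if none exists yet."""
        path = os.path.join(self.checkpoint_dir, "server_state.pkl")
        if not os.path.exists(path):
            return False  # failure happened before the first checkpoint
        with open(path, "rb") as f:
            state = pickle.load(f)
        self.buffer = state["buffer"]
        self.finished_clients = state["finished_clients"]
        self.clients_to_resubmit = state["clients_to_resubmit"]
        return True
```

Writing to a temporary file and renaming it keeps the last complete checkpoint usable even if the server dies mid-checkpoint.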
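The restart detection via the environment variable, the fall-back to a fresh start when the failure occurred before the first checkpoint, and the configurable checkpoint frequency could then fit together roughly as below. `RESTART_COUNT`, `checkpoint_interval`, and `server.process()` are made-up names used only for illustration, not actual config keys or functions of this project.

```python
import os


def run_server(server, config, batches):
    # Hypothetical: the launcher is assumed to export RESTART_COUNT > 0
    # whenever it resubmits the server after a failure.
    restarted = int(os.environ.get("RESTART_COUNT", "0")) > 0

    if restarted and not server.restart():
        # The failure happened before the first checkpoint was written:
        # fall back to restarting the study from scratch.
        restarted = False

    # Hypothetical config field: checkpoint every N processed batches.
    interval = config.get("checkpoint_interval", 10)

    for i, batch in enumerate(batches, start=1):
        server.process(batch)      # hypothetical training/statistics step
        if i % interval == 0:
            server.checkpoint()    # periodic checkpoint, frequency from config
```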