Fix client fault-tolerance with parallel server (!132) · Merge requests · melissa / Melissa

SCHOULER Marc requested to merge fix-client-ft into develop Jun 23, 2023

Currently, the launcher communicates only with server rank 0, so any failure it detects is reported to that rank alone. Likewise, only rank 0 is aware of a client restart triggered by a failure. This can cause serious problems when the client inputs are resampled, since only rank 0 then holds an up-to-date simulation dictionary.

This MR addresses the problem by making all ranks receive the job status update messages from the launcher, so that every rank is informed of the client statuses. Ultimately, all ranks should call relaunch_group in case of failure, so that every simulation dictionary is updated when needed.
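The divergence described above can be illustrated with a small, MPI-free sketch. It is not Melissa's actual code: the `ServerRank` class, the `notify_ranks` helper, and the signature given to `relaunch_group` here are assumptions made for illustration only. Each object stands in for one server rank holding its own copy of the simulation dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServerRank:
    """Hypothetical stand-in for one server rank and its simulation dictionary."""
    rank: int
    simulations: Dict[int, List[float]] = field(default_factory=dict)

    def relaunch_group(self, group_id: int, new_inputs: List[float]) -> None:
        # Assumed signature: record the resampled inputs for the restarted group.
        self.simulations[group_id] = list(new_inputs)


def notify_ranks(ranks: List[ServerRank], group_id: int,
                 new_inputs: List[float]) -> None:
    """Deliver a failure notification to every rank in the given list."""
    for server_rank in ranks:
        server_rank.relaunch_group(group_id, new_inputs)


# Three ranks start with identical dictionaries.
ranks = [ServerRank(rank=r, simulations={7: [0.1, 0.2]}) for r in range(3)]

# Previous behaviour: only rank 0 hears about the failure of group 7.
notify_ranks(ranks[:1], 7, [0.5, 0.6])
diverged = ranks[0].simulations != ranks[1].simulations

# Behaviour targeted by this MR: every rank is notified and updates its copy.
notify_ranks(ranks, 7, [0.5, 0.6])
consistent = all(r.simulations == ranks[0].simulations for r in ranks)
```

In this sketch the resampled inputs are passed in explicitly; how the real server obtains identical resampled values on every rank is not covered here.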

To-do:

communicating the server comm_size to the launcher. This makes it possible to check (in both io.py and state_machine.py) that the number of connections to the launcher matches the expected one, i.e. the number of server ranks. A new message type and a new action were designed for that purpose.
modifying _handle_message_sending in io.py so that update messages are sent to all server ranks. This required adding a server_cids attribute to the State object to keep track of the file descriptor number corresponding to each server rank. The MessageSending(Action) constructor was also modified to take a list of connection ids (cid) instead of a single one.
modifying connect_to_launcher so that all server ranks > 0 connect to the launcher after rank 0. The ordering is enforced with an MPI.comm.Barrier() call.
modifying run so that all ranks call handle_fd.
fixing all unit tests affected by these changes.
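
The launcher-side changes listed above (registering comm_size, tracking server_cids, and sending one update to a list of cids) could look roughly like the following sketch. It only borrows the names `State`, `server_cids`, and `MessageSending` from the description; the fields, the helper functions, and the `send` callback are hypothetical and do not reproduce the code in io.py or state_machine.py.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class State:
    """Hypothetical launcher state: expected rank count and one cid per rank."""
    expected_server_ranks: int = 0
    server_cids: List[int] = field(default_factory=list)


@dataclass
class MessageSending:
    """Action carrying a list of connection ids instead of a single one."""
    cids: List[int]
    payload: bytes


def register_comm_size(state: State, comm_size: int) -> None:
    # Assumed handler for the new message carrying the server comm_size.
    state.expected_server_ranks = comm_size


def register_server_connection(state: State, cid: int) -> None:
    # Refuse more connections than there are server ranks.
    if len(state.server_cids) >= state.expected_server_ranks:
        raise RuntimeError("more server connections than server ranks")
    state.server_cids.append(cid)


def make_update_action(state: State, payload: bytes) -> MessageSending:
    # Only broadcast once every server rank has connected.
    if len(state.server_cids) != state.expected_server_ranks:
        raise RuntimeError("not all server ranks are connected yet")
    return MessageSending(cids=list(state.server_cids), payload=payload)


def handle_message_sending(action: MessageSending,
                           send: Callable[[int, bytes], None]) -> None:
    # Send the same update to each server rank's connection.
    for cid in action.cids:
        send(cid, action.payload)


# Usage: two server ranks connected on (made-up) file descriptors 11 and 12.
state = State()
register_comm_size(state, 2)
register_server_connection(state, 11)
register_server_connection(state, 12)

sent = []
action = make_update_action(state, b"job-update")
handle_message_sending(action, lambda cid, data: sent.append((cid, data)))
```

Checking the connection count before building the action is one way to express the consistency check mentioned in the first item; the actual implementation may place that check elsewhere.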

Edited Jun 27, 2023 by SCHOULER Marc
