Fault Tolerance and simulation time-out monitoring
Problem
Fault tolerance (FT) was temporarily set aside. This issue aims at:

- building a `FaultTolerance` class
- monitoring simulation time-outs efficiently
In Melissa, FT-related tasks are spread across the whole server, but they boil down to the following:

1. restart/checkpointing of the server (not implemented for deep-melissa; it will be handled in another issue)
2. discarding duplicated messages (with deep-melissa this is currently handled in `handle_simulation_data`)
3. relaunching failed simulations (with deep-melissa this is currently handled in `handle_fd`)
4. monitoring the number of crashed simulations, i.e. unfinished simulations that have not sent any message for at least `max_delay` (this has been removed and should be reimplemented)
Since task 2 is already taken care of in a separate context (data assembling), it makes sense to leave it untouched. Tasks 1, 3 and 4, however, should be managed by FT methods. Note that since message sending is handled by the `MelissaServer` object and its methods, the FT functions should return the IDs of the simulations to be relaunched (with or without resampling).
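To make the intended interface concrete, here is a minimal sketch of what such a `FaultTolerance` class could look like. The `SimulationState` record, its field names, and the method name `timed_out_ids` are illustrative assumptions, not the actual Melissa API; the point is only that the class inspects server state and returns IDs, leaving all message sending to `MelissaServer`:

```python
import time
from dataclasses import dataclass, field


@dataclass
class SimulationState:
    """Hypothetical per-simulation record (names are illustrative)."""
    finished: bool = False
    last_message_time: float = field(default_factory=time.monotonic)


class FaultTolerance:
    """Sketch of an FT class covering tasks 1, 3 and 4.

    It never sends messages itself: its methods return the IDs of
    simulations to be relaunched, and MelissaServer acts on them.
    """

    def __init__(self, max_delay: float):
        self.max_delay = max_delay

    def timed_out_ids(self, simulations: dict) -> list:
        """Task 4: IDs of unfinished simulations that have not sent
        a message for at least max_delay seconds."""
        now = time.monotonic()
        return [
            sim_id
            for sim_id, sim in simulations.items()
            if not sim.finished
            and now - sim.last_message_time >= self.max_delay
        ]
```

Checkpointing (task 1) and relaunch bookkeeping (task 3) would follow the same pattern: pure state inspection and ID reporting, with no socket I/O inside the class.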
Regarding task 4, it was initially handled in `all_done` with the `crashed` function. But as Lucas pointed out while profiling the server, this was highly inefficient: on every `all_done` call in `receive`, the server checked all simulations for time-outs, which is wasteful. Instead, a new monitoring strategy based on time-separated events should be considered, as discussed below.
Proposed solution for task 4
Similarly to what is done in the launcher (see `timer` and its `run` function), it is more appropriate to trigger this kind of periodic event with a separate thread in charge of sending (null) bytes on a dedicated file descriptor. Every time a verification is requested through this socket (i.e. every `walltime`), the `Simulation` objects in the `simulations` dictionary are scanned, and timed-out IDs are identified and processed.