Fault Tolerance and simulation time-out monitoring
Problem
Fault tolerance (FT) was temporarily set aside. This issue aims at:

- building a `FaultTolerance` class
- monitoring simulation time-outs efficiently
In Melissa, FT-related tasks are spread across the whole server, but they boil down to the following:

1. restart/checkpointing of the server (not implemented for deep-melissa; it will be handled in another issue)
2. discarding duplicated messages (with deep-melissa this is currently handled in `handle_simulation_data`)
3. relaunching failed simulations (with deep-melissa this is currently handled in `handle_fd`)
4. monitoring the number of crashed simulations, i.e. unfinished simulations that have not sent any message for at least `max_delay` (this has been removed and should be reimplemented)
Since task 2 is already taken care of in a separate context (data assembling), it makes sense to leave it untouched. Tasks 1, 3 and 4, however, should be managed by FT methods. Note that since message sending is handled by the `MelissaServer` object and its methods, the FT functions should return the IDs of the simulations to be relaunched (with or without resampling).
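To make the intended interface concrete, here is a minimal sketch of what such a `FaultTolerance` class could look like. The `SimulationState` record, its field names, and the method name `timed_out_ids` are illustrative assumptions, not the actual Melissa API; the point is only that the class inspects server state and returns IDs, leaving all message sending to `MelissaServer`:

```python
import time
from dataclasses import dataclass, field


@dataclass
class SimulationState:
    """Hypothetical per-simulation record (names are illustrative)."""
    finished: bool = False
    last_message_time: float = field(default_factory=time.monotonic)


class FaultTolerance:
    """Sketch of an FT class covering tasks 1, 3 and 4.

    It never sends messages itself: its methods return the IDs of
    simulations to be relaunched, and MelissaServer acts on them.
    """

    def __init__(self, max_delay: float):
        self.max_delay = max_delay

    def timed_out_ids(self, simulations: dict) -> list:
        """Task 4: IDs of unfinished simulations that have not sent
        a message for at least max_delay seconds."""
        now = time.monotonic()
        return [
            sim_id
            for sim_id, sim in simulations.items()
            if not sim.finished
            and now - sim.last_message_time >= self.max_delay
        ]
```

Checkpointing (task 1) and relaunch bookkeeping (task 3) would follow the same pattern: pure state inspection and ID reporting, with no socket I/O inside the class.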
Regarding task 4, it was initially handled in `all_done` with the `crashed` function. But as Lucas pointed out while profiling the server, this was highly inefficient: on every `all_done` call in `receive`, the server checked all simulations for time-outs, which is wasteful. Instead, a new monitoring strategy based on time-separated events should be considered, as discussed below.
Proposed solution for task 4
Similarly to what is done in the launcher (see `timer` and its `run` function), it is more appropriate to trigger this kind of periodic event with a separate thread in charge of sending (null) bytes on a dedicated file descriptor. Every time a verification is requested through this socket (i.e. every `walltime`), the `Simulation` objects in the `simulations` dictionary are scanned, and timed-out IDs are identified and processed.