Client termination from finalization message
This MR was made in the context of the SC2023 paper work. Its purpose is to add a termination signal to the `melissa_finalize` function in the API. This becomes necessary when the number of expected time-steps is not known a priori, which can happen when the client termination condition is not based on the simulation time.
For instance, with Code-Saturne, if we want each client to simulate 10 periods of von Kármán vortices, the total number of time-steps may vary from one client to another and cannot be anticipated. In this case, the API can inform the server that a given client is done: the server considers the client finished as soon as it receives a message whose time-step is a negative number (in our case, the number of sent time-steps).
This will require the following modifications:
- adding a message to the server socket inside the `melissa_finalize` function [API side],
- checking for the "termination" field name and updating the termination monitoring [DL server side],
- making sure this does not break the fault tolerance or the training (e.g. by introducing a deadlock risk) [pass CI].
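The server-side step can be sketched as follows. This is only an illustration, not the actual Melissa implementation: the `TerminationMonitor` class, its method names, and the message layout (client id, field name, time-step) are hypothetical. It also assumes that the final message stores the number of sent time-steps as a negative value.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TerminationMonitor:
    """Hypothetical helper tracking which clients have sent their final message."""

    num_clients: int
    # client_id -> number of regular (data) messages received so far
    received: Dict[int, int] = field(default_factory=dict)
    # client_id -> number of time-steps announced in the final message
    announced: Dict[int, int] = field(default_factory=dict)

    def on_message(self, client_id: int, field_name: str, time_step: int) -> None:
        if field_name == "termination":
            # Assumption: the time-step of the final message is negative and
            # its magnitude is the number of time-steps the client sent.
            self.announced[client_id] = abs(time_step)
        else:
            self.received[client_id] = self.received.get(client_id, 0) + 1

    def client_done(self, client_id: int) -> bool:
        # A client is considered done as soon as its final message arrived.
        return client_id in self.announced

    def missing(self, client_id: int) -> int:
        # Time-steps announced by the client but not seen by the server.
        return self.announced.get(client_id, 0) - self.received.get(client_id, 0)

    def all_done(self) -> bool:
        return all(self.client_done(c) for c in range(self.num_clients))
```

Keeping the announced count next to the received count lets the server notice lost messages, which may help when checking that fault tolerance is not broken.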
Notes:
- the main point of this solution is that it stays compatible with the way solvers are instrumented with Melissa,
- this allows clients to produce different numbers of time-steps, which is fine for DL but does not make sense for SA,
- this allows users not to set `num_samples` in the configuration file.
Regarding this last point, an immediate drawback is that an inconsistent `watermark` value will only be caught once all clients are done.
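A minimal sketch of such a late check is given below. It assumes that the watermark is a minimum number of buffered samples and that it must not exceed the total number of samples produced by all clients. The function name `check_watermark` and its arguments are hypothetical.

```python
from typing import Dict


def check_watermark(watermark: int, samples_per_client: Dict[int, int]) -> int:
    """Validate the watermark once every client has announced its sample count.

    Assumption: the watermark is inconsistent when it exceeds the total
    number of samples the clients actually produced.
    """
    total = sum(samples_per_client.values())
    if watermark > total:
        raise ValueError(
            f"watermark ({watermark}) exceeds the total number of samples ({total})"
        )
    return total
```

Since the total is only known after the last final message, this check cannot run at start-up the way it could with a fixed `num_samples`.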