SP10-item0 : re-design plan of lower layers (redesign internal thread communication for the researcher)
Propose and agree on a redesign plan for the lower layers so that they can handle errors:
- add proper error messages
- separate logging and error transmission (node -> researcher)
- change message de-queuing on researcher side
When a node disappears, the main process (of the researcher) deals with the error when it reaches it in its TaskQueue:
- training loop detects faulty nodes and sends them to the strategy
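A minimal sketch of the de-queuing step described above, to make the discussion concrete. Everything here is an assumption for illustration: `NodeMessage`, its `kind` values and `drain_messages` are hypothetical names, and the standard-library `queue.Queue` stands in for the researcher's TaskQueue. The point is only that log messages and error messages travel as separate kinds, and that nodes which never answer end up in the same "faulty" set as nodes that reported an error.

```python
import queue
from dataclasses import dataclass


@dataclass
class NodeMessage:
    """Hypothetical message shape; not the project's actual protocol."""
    node_id: str
    kind: str      # assumed kinds: "reply", "log" or "error"
    payload: str


def drain_messages(task_queue, expected_nodes):
    """Read all pending messages and return (replies, faulty_nodes).

    Both results are dicts keyed by node id. `faulty_nodes` maps a node
    id to a short reason string.
    """
    replies = {}
    faulty = {}
    while True:
        try:
            msg = task_queue.get_nowait()
        except queue.Empty:
            break  # nothing left to read
        if msg.kind == "error":
            faulty[msg.node_id] = msg.payload
        elif msg.kind == "reply":
            replies[msg.node_id] = msg.payload
        # "log" messages are for display only and never change control flow

    # A node that neither replied nor reported an error is treated as missing.
    for node_id in expected_nodes:
        if node_id not in replies and node_id not in faulty:
            faulty[node_id] = "no answer"
    return replies, faulty
```

In this sketch the training loop would call `drain_messages` once per round and hand the `faulty` dict to the strategy, which decides what to do with it.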
First implementation of error handling in default_strategy:
- stop the experiment when a node does not answer
- use the FBxxx error codes
- send an exception when something went wrong
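The behaviour listed above could look roughly like the following sketch. It is not the actual default_strategy code: `StrategyError`, `DefaultStrategySketch` and the `refine` signature are hypothetical, and the literal string `"FBxxx"` is kept as a placeholder because the concrete error codes are not given here.

```python
class StrategyError(Exception):
    """Hypothetical exception carrying an FBxxx-style error code."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


class DefaultStrategySketch:
    """Illustrative stand-in for a default strategy, not the real class."""

    def __init__(self, node_ids):
        self.node_ids = list(node_ids)

    def refine(self, replies, faulty):
        """Return the replies in node order, or raise if any node failed.

        `replies` maps node id -> reply, `faulty` maps node id -> reason.
        """
        problems = dict(faulty)
        # A node without a reply counts as "no answer" even if nobody flagged it.
        for node_id in self.node_ids:
            if node_id not in replies and node_id not in problems:
                problems[node_id] = "no answer"
        if problems:
            details = ", ".join(
                f"{node} ({reason})" for node, reason in sorted(problems.items())
            )
            # Placeholder code: the real FBxxx value is not specified here.
            raise StrategyError(
                "FBxxx", f"stopping experiment, faulty nodes: {details}"
            )
        return [replies[node_id] for node_id in self.node_ids]
```

Raising an exception, rather than returning a status flag, lets the caller stop the experiment at a single place and report the error code alongside the message.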
Edited by SZPYRKA Jean-Luc