Need more robust handling of secagg training errors
When training `notebooks/101_getting-started.ipynb` with:
- node1: SECURE_AGGREGATION=True FORCE_SECURE_AGGREGATION=True
- node2: SECURE_AGGREGATION=True FORCE_SECURE_AGGREGATION=False

node1 fails with:
```
2023-04-17 18:16:48,257 fedbiomed INFO - Error message received during training: FB300: undetermined node error - FB314: Node round error: Node requires to apply secure aggregation but Secure aggregation context for the training is not defined.
2023-04-17 18:16:48,259 fedbiomed INFO - Downloading model params after training on node_c14d0b6b-23ea-4b81-aebd-e956e68f4ba4 - from http://localhost:8844/media/uploads/2023/04/17/node_params_5d2a9ef6-81b6-4ee5-b7ce-bd5cebb32505.mpk
2023-04-17 18:16:48,283 fedbiomed DEBUG - download of file node_params_d302f411-57c1-413f-a855-3005b262ddf6.mpk successful, with status code 200
2023-04-17 18:16:48,289 fedbiomed ERROR - FB408: node did not answer during training (node = node_5d16974d-c308-463b-8efe-06cd22703b8e)
2023-04-17 18:16:48,290 fedbiomed CRITICAL - FB408: node did not answer during training
```
For robustness' sake, it would be better to try/except the error raised by `Round._configure_secagg` (here: https://gitlab.inria.fr/fedbiomed/fedbiomed/-/blob/develop/fedbiomed/node/round.py#L250) and return a `TrainingReply` rather than an Error message. For example:
```python
try:
    secagg_arguments = {} if secagg_arguments is None else secagg_arguments
    self._use_secagg = self._configure_secagg(
        secagg_servkey_id=secagg_arguments.get('secagg_servkey_id'),
        secagg_biprime_id=secagg_arguments.get('secagg_biprime_id'),
        secagg_random=secagg_arguments.get('secagg_random')
    )
except FedbiomedRoundError:
    return self._send_round_reply(success=False, message=...)
```
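A minimal, self-contained sketch of the pattern (with stand-in classes, not Fed-BioMed's actual `Round`/`TrainingReply` API): catching the configuration error and returning a failed reply means the researcher side gets an answer for the round instead of hitting the FB408 "node did not answer" timeout.

```python
# Hypothetical stand-ins to illustrate the proposed fix; names and
# signatures only approximate the real fedbiomed.node.round module.

class FedbiomedRoundError(Exception):
    """Stand-in for fedbiomed.common.exceptions.FedbiomedRoundError."""


class Round:
    def _configure_secagg(self, secagg_servkey_id, secagg_biprime_id,
                          secagg_random):
        # Simulate the failing case: the node forces secure aggregation
        # but no secagg context was sent for the training.
        if secagg_servkey_id is None or secagg_biprime_id is None:
            raise FedbiomedRoundError(
                "Node requires to apply secure aggregation but Secure "
                "aggregation context for the training is not defined."
            )
        return True

    def _send_round_reply(self, success, message=""):
        # Stand-in for sending a TrainingReply back to the researcher.
        return {"success": success, "msg": message}

    def run(self, secagg_arguments=None):
        try:
            secagg_arguments = {} if secagg_arguments is None else secagg_arguments
            self._use_secagg = self._configure_secagg(
                secagg_servkey_id=secagg_arguments.get('secagg_servkey_id'),
                secagg_biprime_id=secagg_arguments.get('secagg_biprime_id'),
                secagg_random=secagg_arguments.get('secagg_random'),
            )
        except FedbiomedRoundError as exc:
            # Report the failure as a (failed) TrainingReply so the
            # researcher's round terminates cleanly instead of timing out.
            return self._send_round_reply(success=False, message=str(exc))
        return self._send_round_reply(success=True)
```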
In some cases, successive such failures with one node doing secagg and one node not doing secagg (could not reproduce) put the Fed-BioMed instance in an inconsistent state (the message queues for the nodes and the server had to be cleaned to restore a coherent state).
Edited by VESIN Marc