Training does not fail gracefully when the dataset does not exist on the node
When a training request is sent, the dataset may no longer exist on the node, for example if:
- the researcher searches for available datasets
- then the node removes the dataset selected by the researcher
- then the researcher sends a training request
In that case, we want the experiment to fail gracefully. The error is first properly caught and reported, but Experiment()
then fails with an uncaught KeyError on 'job_id':
-----------------------------------------------------------------
2023-02-01 08:59:39,341 fedbiomed DEBUG - researcher_26940cf7-f022-46b4-84ad-2a55a4c35780
2023-02-01 08:59:39,345 fedbiomed INFO - ERROR
NODE node_2ff3675d-9d8b-458c-aa5a-6b44ecfe84d2
MESSAGE: Did not found proper data in local datasets on node=node_2ff3675d-9d8b-458c-aa5a-6b44ecfe84d2
-----------------------------------------------------------------
2023-02-01 08:59:49,355 fedbiomed INFO - Error message received during training: FB313: no dataset matching request - Did not found proper data in local datasets
2023-02-01 08:59:49,360 fedbiomed CRITICAL - Fed-BioMed stopped due to unknown error:
'job_id'
--------------------
Fed-BioMed researcher stopped due to unknown error:
'job_id'
More details in the backtrace extract below
--------------------
Traceback (most recent call last):
  File "/home/mvesin/GIT/fedbiomed/fedbiomed/fedbiomed/researcher/experiment.py", line 63, in payload
    ret = function(*args, **kwargs)
  File "/home/mvesin/GIT/fedbiomed/fedbiomed/fedbiomed/researcher/experiment.py", line 1593, in run_once
    _ = self._job.start_nodes_training_round(round=self._round_current,
  File "/home/mvesin/GIT/fedbiomed/fedbiomed/fedbiomed/researcher/job.py", line 432, in start_nodes_training_round
    m['job_id'] != self._id or m['node_id'] not in list(self._nodes):
KeyError: 'job_id'
--------------------
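
One possible direction for a fix, as a minimal sketch only. The traceback suggests the reply filter in job.py indexes m['job_id'] directly, and the error reply from the node apparently carries no 'job_id' key. The snippet below is not the actual Fed-BioMed code: the function name split_replies, the sample message fields ("command", "extra_msg") and the ids are all hypothetical, chosen just to illustrate looking up the keys with dict.get() so that a reply without 'job_id' is skipped instead of raising.

```python
from typing import Any, Dict, Iterable, List, Tuple


def split_replies(
    replies: Iterable[Dict[str, Any]],
    job_id: str,
    node_ids: Iterable[str],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate replies that belong to this job from those that do not.

    Hypothetical helper, not part of Fed-BioMed.
    """
    known_nodes = set(node_ids)
    matching: List[Dict[str, Any]] = []
    skipped: List[Dict[str, Any]] = []
    for m in replies:
        # dict.get() returns None for a missing key instead of raising KeyError
        if m.get("job_id") == job_id and m.get("node_id") in known_nodes:
            matching.append(m)
        else:
            skipped.append(m)
    return matching, skipped


# Hypothetical sample replies: the second one mimics an error reply
# that carries no 'job_id' field (assumed shape, not taken from the logs).
replies = [
    {"command": "train", "job_id": "job-1", "node_id": "node-a"},
    {"command": "error", "node_id": "node-b", "extra_msg": "no dataset matching request"},
]
matching, skipped = split_replies(replies, "job-1", ["node-a", "node-b"])
print(len(matching), len(skipped))
```

With this kind of filter, the error reply ends up in the skipped list, and the experiment could then stop with the FB313 message already logged rather than with an unrelated KeyError. Whether error replies should instead be made to include 'job_id' on the node side is a separate design question.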