Multiple datasets from one node in an `Experiment()`
The current implementation allows multiple datasets from the same node to participate in an experiment. In this case, the `TrainRequest` sent to a node contains multiple datasets, and the node should train the model on each of these datasets in sequence.
This is a historical functionality that has been part of Fed-BioMed since the beginning of the current implementation, but:
- we currently don't use it (no demo/notebook using it)
- moreover, we may decide that it is not a necessary concept/functionality and say that at most one dataset from a node may participate in an experiment (which means: at most one `dataset_id` attached to a `node_id` in an `Experiment()`)
- also, this is probably broken (developers know little about this aspect, so some extensions may have been coded in a way that does not support it)
Tasks of this issue:
- as a product owner (assisted by dev team?) I want to decide whether at most one (or multiple) dataset from a node may participate in an experiment => DECISION: at most one dataset per node for a `TrainRequest` (and in an `Experiment()`)
- [ ] (if multiple) as the dev team I want to review/test/fix the code to validate proper support of multiple datasets per node in experiments
- (if at most one) as the dev team I want to clean the code to remove provisions for supporting multiple datasets per node in experiments
  - clean node side
    - don't modify handling of search (`SearchRequest`/`SearchReply`): a search can still return multiple datasets (all the ones matching the tags)
    - adapt `Node`
    - no adaptation needed for `Round`
  - clean researcher side
    - don't modify the handling of search
    - adapt `FederatedDataSet`, `Job`
    - no adaptation needed for `Requests`, `Aggregator`, `Experiment`
  - update notebooks
  - unit tests
  - documentation
Implementation choices:
- node side data registration: changed in #471 (closed) for new (coherent) tag rules
- `SearchReply`/`ListReply` messages: unchanged, can still return multiple datasets per node. This can be useful for exploring the data shared by nodes
- `Job()`, `Experiment()`, `FederatedDataSet()`: changed handling of training data, must now have one dataset per node
- `FederatedDataSet()` format is unchanged (see below), but we now enforce that it contains only one dataset per node.

      # valid `data` for calling `FederatedDataSet(data)`, so it accepts `Requests().search()` result
      {
        'node1_id': [ { .... dataset1 desc } ],
        'node2_id': [ { .... dataset2 desc } ],
        ...
      }

      # valid `data` for calling `FederatedDataSet(data)`, because we don't need a list
      {
        'node1_id': { .... dataset1 desc },
        'node2_id': { .... dataset2 desc },
        ...
      }

      # invalid `FederatedDataSet(data)`
      {
        'node1_id': [ { .... dataset1 desc }, { ... dataset3 desc } ],
        'node2_id': [ { .... dataset2 desc } ],
        ...
      }
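
The one-dataset-per-node rule illustrated by the three `data` examples above could be checked with a small helper such as the one below. This is only a sketch, not the Fed-BioMed implementation: the function name `normalize_training_data`, the use of `ValueError`, and the `dataset_id` key inside the sample descriptions are assumptions made for illustration. Only the three input shapes come from this issue.

```python
from typing import Any, Dict, List, Union

DatasetDesc = Dict[str, Any]


def normalize_training_data(
    data: Dict[str, Union[DatasetDesc, List[DatasetDesc]]]
) -> Dict[str, DatasetDesc]:
    """Return a {node_id: dataset_desc} mapping with exactly one dataset per node.

    Hypothetical helper: accepts both valid shapes (one-element list, or the
    description itself) and rejects a node with zero or several datasets.
    """
    normalized: Dict[str, DatasetDesc] = {}
    for node_id, entry in data.items():
        if isinstance(entry, dict):
            # The dataset description is given directly, without a list.
            normalized[node_id] = entry
        elif isinstance(entry, list) and len(entry) == 1 and isinstance(entry[0], dict):
            # A one-element list, as returned by a search.
            normalized[node_id] = entry[0]
        else:
            raise ValueError(
                f"node {node_id!r} must provide exactly one dataset, got {entry!r}"
            )
    return normalized


# Sample input mixing both valid shapes (descriptions reduced to a placeholder key).
sample = {
    "node1_id": [{"dataset_id": "dataset1"}],
    "node2_id": {"dataset_id": "dataset2"},
}
one_per_node = normalize_training_data(sample)
```

Normalizing both valid shapes to a plain `{node_id: dataset_desc}` mapping would let the rest of the researcher code ignore whether the input came from a search result or was written by hand.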
- `TrainingRequest` message: changed format of dataset
  - now the field is `dataset_id`, typed as `str`
    - only one dataset per node
    - remove useless `NODE_ID` as we use only training requests targeted to one node through a dedicated channel (no broadcast message train request)
  - before, the field was `training_data`

        { 'node1': [ 'dataset1', 'dataset2' ],
          'node2': [ 'dataset3', 'dataset4' ],
          ...
        }
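
To illustrate the format change, the sketch below converts an old-style `training_data` mapping into a single `dataset_id` per node, and fails when a node lists more than one dataset. The function name `to_dataset_ids` is hypothetical, and the result is a plain dict rather than a real training request message, whose other fields are not described in this issue.

```python
from typing import Dict, List


def to_dataset_ids(training_data: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each node to its single dataset id (new format).

    Hypothetical migration helper: `training_data` uses the old format,
    {node_id: [dataset_id, ...]}.
    """
    dataset_ids: Dict[str, str] = {}
    for node_id, datasets in training_data.items():
        if len(datasets) != 1:
            raise ValueError(
                f"node {node_id!r} has {len(datasets)} datasets, expected exactly one"
            )
        dataset_ids[node_id] = datasets[0]
    return dataset_ids


# Each node would then receive its own request carrying one `dataset_id` string.
old_format = {"node1": ["dataset1"], "node2": ["dataset3"]}
new_format = to_dataset_ids(old_format)
```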
Edited by VESIN Marc