Implement (federated) evaluation metrics
Currently, the only evaluation metric available in declearn is the model's loss, which is computed and shared as part of the implemented federated learning process, and monitored to decide on the "best" achieved model and on (optional) early stopping.
It would be valuable for the framework to enable users to compute (and optionally share) a broader range of evaluation metrics, notably to make training progress human-readable and to parametrize the associated early-stopping criteria.
This could be done in a number of ways:
- Allowing users to write a custom method to evaluate their model (and/or adding a framework-specific method to the Model API, which may or may not be parametrized, as the loss is).
- Designing a generic metrics API, and adding a method to the Model API that produces standard inputs for it (see the sketch after this list).
- Doing a little bit of both, i.e. writing framework-specific metrics and providing a way to specify which ones should be used for a given task/model/process.
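
As a rough illustration of the second option, a minimal generic Metric API could look like the sketch below. This is only a sketch under assumptions: the `Metric` base class, its `update` / `get_result` / `reset` methods and the `Accuracy` subclass are illustrative names, not existing declearn code.

```python
from abc import ABC, abstractmethod
from typing import Dict

import numpy as np


class Metric(ABC):
    """Hypothetical base API for a framework-agnostic evaluation metric."""

    @abstractmethod
    def update(self, y_true: np.ndarray, y_pred: np.ndarray) -> None:
        """Accumulate internal states based on a batch of predictions."""

    @abstractmethod
    def get_result(self) -> Dict[str, float]:
        """Finalize and return the metric's value(s) from current states."""

    @abstractmethod
    def reset(self) -> None:
        """Clear internal states, e.g. in between evaluation rounds."""


class Accuracy(Metric):
    """Example subclass: classification accuracy from model predictions."""

    def __init__(self) -> None:
        self._correct = 0.0
        self._total = 0.0

    def update(self, y_true: np.ndarray, y_pred: np.ndarray) -> None:
        # Turn scores into hard labels (argmax for multi-class, rounding
        # for binary probabilities), then count correct predictions.
        labels = y_pred.argmax(axis=-1) if y_pred.ndim > 1 else np.round(y_pred)
        self._correct += float((labels == y_true).sum())
        self._total += float(len(y_true))

    def get_result(self) -> Dict[str, float]:
        return {"accuracy": self._correct / max(self._total, 1.0)}

    def reset(self) -> None:
        self._correct = 0.0
        self._total = 0.0
```

Under such a design, the Model API would merely be in charge of producing the `(y_true, y_pred)` arrays fed to `update`, keeping metrics framework-agnostic.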
Additionally, there are at least two ways these metrics may be shared with and aggregated by the server:
- Sending local metric values and averaging them, which is the simplest way but is also biased for quite a number of metrics (e.g. classification metrics that rely on class labels' support).
- Sending underlying state variables that may be aggregated (by averaging or more complex operations) prior to being "finalized" into a global metric (e.g. TP/FP/TN/FN counts from clients, summed and finally used to compute the precision and recall on the union of local validation datasets); see the sketch below.
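
The following sketch illustrates the second scheme on a toy binary-precision example: clients would share raw TP/FP counts, which the server sums before finalizing, rather than averaging locally-computed precision values. All names and methods (`BinaryPrecision`, `get_states`, `finalize`) are illustrative assumptions.

```python
import numpy as np


class BinaryPrecision:
    """Hypothetical metric exposing aggregatable states (TP/FP counts)."""

    def __init__(self) -> None:
        self.states = {"tpos": 0.0, "fpos": 0.0}

    def update(self, y_true: np.ndarray, y_pred: np.ndarray) -> None:
        pred = (y_pred >= 0.5).astype(int)
        self.states["tpos"] += float(((pred == 1) & (y_true == 1)).sum())
        self.states["fpos"] += float(((pred == 1) & (y_true == 0)).sum())

    def get_states(self) -> dict:
        """Return sharable states, to be sum-aggregated across clients."""
        return dict(self.states)

    @staticmethod
    def finalize(states: dict) -> float:
        """Compute precision from (aggregated) states."""
        total = states["tpos"] + states["fpos"]
        return states["tpos"] / total if total else 0.0


# Server-side aggregation: sum clients' states, then finalize, rather
# than averaging each client's locally-finalized precision value.
client_states = [
    {"tpos": 90.0, "fpos": 10.0},  # client A: local precision = 0.90
    {"tpos": 1.0, "fpos": 9.0},    # client B: local precision = 0.10
]
summed = {
    key: sum(states[key] for states in client_states)
    for key in ("tpos", "fpos")
}
global_precision = BinaryPrecision.finalize(summed)  # 91 / 110 ≈ 0.83
naive_average = (0.90 + 0.10) / 2                    # 0.50: biased estimate
```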
Finally, it may make sense (probably in the longer run) to implement local DP and/or SecAgg mechanisms to provide clients with privacy guarantees as to their local values (whether those of the metrics themselves or of the underlying counts).
Based on the three aspects above, some design and implementation effort will be required to modularize (see the sketch after this list):
- which metrics are computed locally as part of the periodic evaluation process during training;
- which of these metrics are shared with the server, and how / under which privacy constraints;
- which of these metrics is/are monitored to decide on model convergence, trigger early stopping and/or select the "best" final model.
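
One possible way to modularize these three aspects would be a per-task configuration object exposed to users (and possibly negotiated between server and clients). The sketch below is purely illustrative: the `MetricsConfig` name, its fields and the sharing modes are assumptions, not an existing declearn API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MetricsConfig:
    """Hypothetical per-task configuration of evaluation metrics."""

    # Metrics computed locally at each evaluation round, by name.
    compute: List[str] = field(default_factory=lambda: ["loss", "accuracy"])
    # How each computed metric is shared with the server: "values" for
    # plain averaging, "states" for state-based aggregation, "none" to
    # keep the metric local-only.
    share: Dict[str, str] = field(
        default_factory=lambda: {"loss": "values", "accuracy": "states"}
    )
    # Metric monitored for early stopping and best-model selection.
    monitor: str = "loss"
    # Placeholder for future privacy options (local DP, SecAgg...).
    privacy: Optional[str] = None
```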
Tasks:
- Write a minimal Metrics API and/or a metrics-computing method for the Model API to compute metrics locally.
- Write a minimal sharing API that relies on averaging local values (which is bound to be biased in some cases).
- Write additional mechanisms to share and finalize state variables for correct metrics aggregation.
- Design and write how the server and/or clients decide on metrics to compute and/or share.
- Enable clients to refuse sharing local metrics.
- Modularize metric-tracking by the Checkpointer and EarlyStopping utils.
- Pave the way for (and document as a dedicated issue) the addition of privacy-preserving mechanisms.
- Write associated documentation.
- Write associated tests:
  - Metric API and subclasses
  - MetricSet util (see the usage sketch below)
  - Integration into Model
  - Integration into TrainingManager
  - Integration into federated process => tackled in Heart UCI example
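
To give a rough idea of how the planned MetricSet util could tie into a client-side evaluation round, here is a hypothetical usage sketch building on the Metric API sketched earlier; the interface, as well as the `predict_on` placeholder, are assumptions rather than existing declearn code.

```python
from typing import Dict, Iterable

import numpy as np


class MetricSet:
    """Hypothetical utility wrapping a collection of Metric-like objects."""

    def __init__(self, metrics: Iterable) -> None:
        # Each wrapped metric is expected to expose the update / get_result
        # / reset methods from the Metric API sketched earlier.
        self.metrics = list(metrics)

    def update(self, y_true: np.ndarray, y_pred: np.ndarray) -> None:
        for metric in self.metrics:
            metric.update(y_true, y_pred)

    def get_result(self) -> Dict[str, float]:
        results: Dict[str, float] = {}
        for metric in self.metrics:
            results.update(metric.get_result())
        return results

    def reset(self) -> None:
        for metric in self.metrics:
            metric.reset()


# Client-side evaluation round (pseudo-usage; `model`, `valid_dataset`
# and `predict_on` are placeholders, not declearn objects):
# metrics = MetricSet([Accuracy()])
# for y_true, y_pred in model.predict_on(valid_dataset):
#     metrics.update(y_true, y_pred)
# report = metrics.get_result()  # then shared with the server, per config
```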