Commit 08bdb5fb authored by CREMONESI Francesco

Heart disease notebooks

parent 5507c89a
@@ -12,6 +12,10 @@ parts:
  chapters:
  - file: fedbiomed-tutorial/intro-tutorial-mednist.ipynb
    title: Intro tutorial (MedNIST)
  - file: fedbiomed-tutorial/tutorial-sklearn-problem.ipynb
    title: Heart disease detection
  - file: fedbiomed-tutorial/tutorial-sklearn-solutions.ipynb
    title: Heart disease detection
  - file: fedbiomed-tutorial/brain-segmentation-exercise.ipynb
    title: Brain segmentation
  - file: fedbiomed-tutorial/brain-segmentation-solution.ipynb
......
%% Cell type:markdown id:64e87007 tags:
# Heart disease detection
%% Cell type:markdown id:9e351015 tags:
In this tutorial, we will focus on applying federated learning techniques to a classification problem using Scikit-Learn, a popular machine learning library in Python. We will walk you through the process step by step, from setting up the federated learning environment to evaluating the model's performance.
Scikit-Learn, also known as sklearn, provides a wide range of tools and algorithms for data preprocessing, feature selection, model training, and evaluation. It is widely used for classification, regression, clustering, and dimensionality reduction. It offers a user-friendly interface and integrates well with other libraries in the Python ecosystem, making it a go-to choice for many machine learning practitioners and researchers.
%% Cell type:code id:ade4cbea tags:
``` python
%load_ext autoreload
%autoreload 2
```
%% Cell type:markdown id:5827f560 tags:
# Table of contents
1. [The dataset](#dataset)
2. [Task 1: training plan](#task1)
3. [Task 2: the experiment](#task2)
4. [Task 3: model validation](#task3)
%% Cell type:markdown id:59d00ae5 tags:
# Tutorial
%% Cell type:markdown id:b9036f7d tags:
## The dataset <a name="dataset"></a>
%% Cell type:markdown id:e4f03a34 tags:
The Heart Disease dataset available at https://archive.ics.uci.edu/dataset/45/heart+disease is a widely used dataset in the field of cardiovascular research and machine learning. It contains a collection of medical attributes from patients suspected of having heart disease, along with their corresponding diagnosis (presence or absence of heart disease). The dataset includes information such as age, sex, blood pressure, cholesterol levels, and various other clinical measurements.
It was collected in 4 hospitals in the USA, Switzerland and Hungary. This dataset contains tabular information about 740 patients distributed among these four clients.
A federated version of this dataset has been proposed in [FLamby](https://arxiv.org/pdf/2210.04620.pdf). Following their approach, we preprocess the dataset by removing missing values and encoding non-binary categorical variables as dummy variables. We finally obtain the following centers:
| Number | Client | Dataset size |
|--------|----------------------|--------------|
| 0 | Cleveland’s Hospital | 303 |
| 1 | Hungarian Hospital | 261 |
| 2 | Switzerland Hospital | 46 |
| 3 | Long Beach Hospital | 130 |
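The two preprocessing steps described above can be sketched with pandas. This is a minimal illustration on a made-up four-row table, not the FLamby pipeline itself; the column names (`age`, `sex`, `cp`, `target`) and all values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data; column names and values are illustrative only.
raw = pd.DataFrame({
    "age": [63, 41, np.nan, 57],
    "sex": [1, 0, 1, 0],  # binary variable: kept as a single column
    "cp": ["typical", "atypical", "typical", "non-anginal"],  # non-binary categorical
    "target": [1, 0, 0, 1],
})

# 1. Remove rows that contain missing values.
clean = raw.dropna()

# 2. Encode the non-binary categorical variable as dummy (one-hot) columns.
encoded = pd.get_dummies(clean, columns=["cp"])

print(encoded.columns.tolist())
```

Binary variables such as `sex` are left untouched, since a single 0/1 column already carries all the information.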
%% Cell type:markdown id:6d8554fa tags:
For teaching purposes, we merged the four centers into two clients: client 0 with client 2, and client 1 with client 3.
The resulting federated scenario is the following:
- **client1**, with 349 elements
- **client2**, with 391 elements
%% Cell type:markdown id:e860a74e tags:
## Task 1: Defining the training plan <a name="task1"></a>
%% Cell type:markdown id:fd2daf01 tags:
A training plan is a class that defines the four main components of federated model training: the data, the model, the loss and the optimizer. It provides the custom methods that allow every node to perform the training.
In the case of scikit-learn, Fed-BioMed already does a lot of the heavy lifting for you by providing the FedPerceptron, FedSGDClassifier and FedSGDRegressor classes as training plans. These classes already take care of the model, optimizer, loss function and related dependencies for you, so you only need to define how the data will be loaded.
In this tutorial we are going to use an [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html), and therefore the corresponding FedSGDClassifier training plan.
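For readers unfamiliar with `SGDClassifier`, here is a minimal sketch of the plain scikit-learn estimator trained incrementally on synthetic data. It does not use Fed-BioMed; the data, the `hinge` loss, the batch size of 50 and the random seeds are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Synthetic binary classification data: 200 samples, 18 features.
X = rng.normal(size=(200, 18))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(loss="hinge", random_state=0)

# Incremental training: each call performs one pass of SGD over the batch.
# The full list of classes must be given on the first call to partial_fit.
for start in range(0, len(X), 50):
    batch = slice(start, start + 50)
    clf.partial_fit(X[batch], y[batch], classes=np.array([0, 1]))

print(clf.coef_.shape)   # one weight per feature
print(clf.score(X, y))   # accuracy on the training data
```

The learned `coef_` and `intercept_` arrays are the kind of model parameters that can be exchanged and averaged between nodes in a federated setting.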
%% Cell type:markdown id:11ec284b tags:
### Model arguments
*model_args* is a dictionary with the arguments related to the model, which will be passed to the SGDClassifier constructor.
**IMPORTANT** For classification tasks, you are required to specify the following two fields:
- n_features: the number of features in each input sample (in our case, the number of columns in the preprocessed patient table)
- n_classes: the number of classes in the target data
Other model arguments depend on the specific model you are using, and are defined in the model definition. Refer to the model documentation.
### Training arguments
*training_args* is a dictionary containing the arguments for the training routine (e.g. batch size, learning rate, epochs, etc.). This will be passed to the routine on the node side.
**IMPORTANT** To set the training arguments we may either pass them to the Experiment constructor, or set them on an instance with the setter method:
`exp.set_training_arguments(training_args=training_args)`
Setters are also available for other individual experiment settings, for example:
`exp.set_aggregator(aggregator=FedAverage)`
%% Cell type:markdown id:7571ab66 tags:
**TO_DO:**
- Apply the scaler to your data
- Define training args: num_updates, batch_size.
- Define model args as explained above.
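As a reminder of how `MinMaxScaler` behaves, the sketch below scales a small made-up array outside of any training plan. `fit_transform` learns each column's minimum and maximum and maps the values to the range [0, 1]; the numbers are illustrative assumptions, not values from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two made-up features on very different scales (e.g. age and cholesterol).
X = np.array([[29.0, 126.0],
              [54.0, 240.0],
              [77.0, 564.0]])

scaler = MinMaxScaler()
# Learn per-column min and max, then rescale each column to [0, 1].
X_scaled = scaler.fit_transform(X)

print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```

Bringing features to a common scale generally helps SGD-based models, whose updates are sensitive to the magnitude of the inputs.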
%% Cell type:code id:998bb584 tags:
``` python
import pandas as pd

from fedbiomed.common.training_plans import FedSGDRegressor, FedPerceptron, FedSGDClassifier
from fedbiomed.common.data import DataManager
from sklearn.preprocessing import MinMaxScaler


class SkLearnClassifierTrainingPlan(FedSGDClassifier):
    def init_dependencies(self):
        """Define the additional imports required on the node side.

        In this case, we rely on scikit-learn's MinMaxScaler for preprocessing the data.
        """
        return ["from sklearn.preprocessing import MinMaxScaler"]

    def training_data(self, batch_size):
        """Load the node's CSV file and wrap features and target in a DataManager."""
        # The file uses ';' as separator and has no header row.
        df = pd.read_csv(self.dataset_path, delimiter=';', header=None)
        X = df.iloc[:, :-1]  # all columns except the last are features
        y = df.iloc[:, -1]   # the last column is the target
        self.scaler = MinMaxScaler()
        # X = [...] TODO: apply the transformer to the data.
        return DataManager(dataset=X, target=y, batch_size=batch_size, shuffle=True)
```
%% Cell type:code id:653e8aea tags:
``` python
n_features = 18
n_classes = 2
model_args = {
    'max_iter': 100,
    'tol': 1e-1,
    'loss': 'huber',
    # [...] TODO: Insert the missing model arguments.
}
training_args = {
    # [...] TODO: Insert the training arguments as entries in the dict.
}
```
%% Cell type:markdown id:b3020aa6 tags:
## Task 2: the Experiment <a name="task2"></a>
%% Cell type:markdown id:9ae58ccf tags:
The experiment enables Federated Learning by orchestrating the training process across multiple nodes. It searches for datasets based on specific tags, uploads the training plan file, sends model and training arguments, tracks and checks training progress, and downloads and aggregates model parameters for the next round.
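The aggregation step mentioned above can be illustrated with a small NumPy sketch of federated averaging, where each node's parameters are weighted by its number of samples. This is a simplified illustration and not Fed-BioMed's `FedAverage` implementation; the function name `federated_average` and the parameter values are hypothetical, while the sample counts are those of the two clients in this tutorial.

```python
import numpy as np

def federated_average(params_per_node, n_samples_per_node):
    """Weighted average of per-node parameter vectors (hypothetical helper)."""
    weights = np.asarray(n_samples_per_node, dtype=float)
    weights /= weights.sum()                       # normalise weights to sum to 1
    stacked = np.stack(params_per_node)            # shape: (n_nodes, n_params)
    return np.tensordot(weights, stacked, axes=1)  # weighted sum over nodes

# Made-up local parameters returned by two nodes after one round.
local_params = [np.array([0.2, -1.0, 0.5]), np.array([0.4, -0.6, 0.1])]
n_samples = [349, 391]  # client sizes from this tutorial

global_params = federated_average(local_params, n_samples)
print(global_params)
```

The resulting vector would then be sent back to the nodes as the starting point for the next round.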
%% Cell type:markdown id:c17ef104 tags:
**TO_DO:**
- Specify the training plan to use.
- Pass the model and training arguments.
%% Cell type:markdown id:28f602da tags:
<div class="alert alert-block alert-info"> <b>TAGS:</b> Replace %%%% in the tags with your username </div>
%% Cell type:code id:4b1a1341 tags:
``` python
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags = ['heart-jupyter-%%%%']
rounds = 10

# Search the nodes for datasets matching the tags
exp = Experiment(tags=tags,
                 model_args=None,  # TODO: insert the correct value
                 training_plan_class=None,  # TODO: insert the correct value
                 training_args=None,  # TODO: insert the correct value
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)
```
%% Cell type:code id:d6ff55da tags:
``` python
exp.run()
```
%% Cell type:markdown id:88e2a782 tags:
## Task 3: Model Validation <a name="task3"></a>
%% Cell type:markdown id:84f2ad10 tags:
During federated training, model validation plays a crucial role in assessing performance without a dedicated holdout dataset. Fed-BioMed enables separate model validation on each node after parameter updates, allowing comparison of model performances. Two types of validation can be performed:
- one on globally updated parameters before training a round,
- another on locally updated parameters after local training is completed on a node.
This helps users evaluate the impact of node-specific training on model improvement.
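To make the two validation points concrete, the following sketch simulates one round on a single node with plain scikit-learn: the model received from the server is first scored on a held-out validation split, then trained locally and scored again. It does not call Fed-BioMed; the synthetic data, the stand-in for the aggregated model and all variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 18))
y = (X[:, :3].sum(axis=1) > 0).astype(int)

# Hold out part of the node's data for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Stand-in for the aggregated model: a classifier briefly trained on a few samples.
model = SGDClassifier(random_state=0)
model.partial_fit(X_train[:10], y_train[:10], classes=np.array([0, 1]))

# Validation on "global" parameters, before local training.
acc_global = model.score(X_val, y_val)

# Local training on the node, then validation on the locally updated parameters.
for _ in range(5):
    model.partial_fit(X_train, y_train)
acc_local = model.score(X_val, y_val)

print(f"before local training: {acc_global:.2f}, after: {acc_local:.2f}")
```

Comparing the two scores on the same validation split shows how much the node's own training changed the model it received.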
%% Cell type:markdown id:1cbc4d1a tags:
Here is the list of validation arguments that can be configured.
- *test_ratio*: Ratio of the validation partition of the dataset. The remaining samples will be used for training. By default, it is 0.0.
- *test_on_global_updates*: Boolean value that indicates whether validation will be applied to globally updated (aggregated) parameters (see Figure 1). Default is False.
- *test_on_local_updates*: Boolean value that indicates whether validation will be applied to locally updated (trained) parameters (see Figure 1). Default is False.
- *test_metric*: One of the MetricTypes, indicating which metric will be used for validation. It can be a str or an instance of MetricTypes (e.g. MetricTypes.RECALL or RECALL). If it is None and no testing_step is defined in the training plan (see section: Define Custom Validation Step), the default metric is ACCURACY.
- *test_metric_args*: A dictionary that contains the arguments that will be used for the metric function.
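The roles of *test_ratio*, *test_metric* and *test_metric_args* can be illustrated with scikit-learn alone. In the sketch below, the split and the metric call are assumptions about what these arguments mean conceptually, not Fed-BioMed's internal code; the data are synthetic and the variable names mirror the arguments only for readability.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 18))
y = (X[:, 0] > 0).astype(int)

test_ratio = 0.1                         # fraction of samples kept for validation
test_metric_args = {"average": "macro"}  # extra keyword arguments for the metric

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=test_ratio, random_state=0
)

model = SGDClassifier(random_state=0).fit(X_train, y_train)

# The metric is computed on the validation partition only.
recall = recall_score(y_val, model.predict(X_val), **test_metric_args)
print(len(X_val), recall)
```

With a ratio of 0.0, which is the default listed above, no samples would be set aside and no validation could be performed.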
%% Cell type:markdown id:700d8da1 tags:
**TO_DO:**
- Initialize a new experiment.
- Use the setters to define the validation arguments.
- Launch the training and check the validation performances.
%% Cell type:code id:9972f57d tags:
``` python
exp = Experiment(
    # [...]
    training_args=training_args
)
# TODO: set the parameters using the setters. Example: exp.set_test_ratio(test_ratio=0.1)
```
%% Cell type:code id:827b9920 tags:
``` python
exp.run()
```