Commit 4e6cd19f authored by xtof


parent 1617b31f
title: "Federated Deep Learning"
date: 2019-10-23T12:02:00+06:00
description : "On federated deep learning"
type: draft
author: Christophe Cerisara
tags: ["AI", "deep learning"]
## Challenge of training deep learning models
Training deep learning models has always been a challenge.
Not only because of the large quantity of data to train on,
or because of the large number of model parameters to optimize,
but also because training a deep learning model
involves finding its way through a huge variety of related paths,
which means testing many different combinations of hyper-parameters
and model topologies, without any certainty, guided only by basic intuition
and human knowledge and expertise.
As a result, not only may a single training run be very long even on
many GPUs, but we also have to perform many such runs before finding a good path.
So, overall, the cost of training deep learning models, in time, human resources and money, is very high.
In order to speed up this process, the common trend is to deploy models
on high-performance computing clusters equipped with powerful, latest-generation GPUs.
But such clusters are extremely costly, and become outdated after only a few years.
This is why another paradigm is currently emerging under the name **federated deep learning**.
## Distributed training
Let us first review the various strategies to train deep learning models, centralized or distributed:
- On a single GPU on a single machine: this is the easiest nowadays, thanks to the efficient implementation of CUDA operations in PyTorch and TensorFlow.
- Across GPUs on a single machine: this is a bit more difficult to realize, because it often involves modifying the code of the model to distribute it efficiently on multiple GPUs. More and more libraries are however being proposed to try and automate this distribution.
- Across GPUs on multiple nodes in a single cluster: this is more difficult, because central memory is not shared anymore, and latency between nodes is higher.
- Across low-end devices: this is the so-called **federated deep learning**. Latency is however a major issue there, and synchronous SGD is not an option any more.
So the idea is to train on each device for multiple epochs using only the data hosted locally, before updating the global model.
This is also presented as a way to protect privacy, as device-dependent data is not transferred to another node.
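
The idea of training locally for several epochs and only then updating the global model can be illustrated with a minimal sketch. It assumes one common way of performing the update, namely averaging the locally trained parameters weighted by the amount of local data (in the style of federated averaging). The post does not specify the update rule, so this choice, the toy linear model `y = w*x + b`, and all names (`local_train`, `federated_round`, `make_device`) are illustrative assumptions, not a reference implementation.

```python
import random

def local_train(weights, data, epochs=5, lr=0.05):
    """Run several epochs of plain SGD using only one device's local data."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # prediction error on one local sample
            w -= lr * err * x       # gradient of 0.5 * err**2 w.r.t. w
            b -= lr * err           # gradient of 0.5 * err**2 w.r.t. b
    return (w, b)

def federated_round(global_weights, devices, epochs=5, lr=0.05):
    """One communication round: each device trains locally, then the
    central model becomes the data-size-weighted average of the results.
    Only parameters are exchanged; the local samples never leave a device."""
    updates = [local_train(global_weights, d, epochs, lr) for d in devices]
    total = sum(len(d) for d in devices)
    w = sum(len(d) * u[0] for d, u in zip(devices, updates)) / total
    b = sum(len(d) * u[1] for d, u in zip(devices, updates)) / total
    return (w, b)

def make_device(rng, n, lo, hi):
    """Hypothetical local dataset: n noise-free samples of y = 2x + 1,
    with x drawn from a device-specific range [lo, hi]."""
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    return [(x, 2.0 * x + 1.0) for x in xs]

rng = random.Random(0)
# Three devices holding different amounts of data from different input ranges.
devices = [
    make_device(rng, 20, -1.0, 0.0),
    make_device(rng, 30, 0.0, 1.0),
    make_device(rng, 10, -0.5, 0.5),
]

global_weights = (0.0, 0.0)
for _ in range(50):
    global_weights = federated_round(global_weights, devices)

print(global_weights)  # should be close to the true parameters (2.0, 1.0)
```

Each round costs a single exchange of parameters per device, regardless of how many local SGD steps were taken, which is what makes the scheme tolerant to high latency. In this toy setting all devices share the same underlying function, so averaging converges easily; with strongly heterogeneous local data, the locally trained models may drift apart and the behaviour can be quite different.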
## Federated, but not in the sense of OLKi
The term **federated deep learning** in the sense given above involves a central model, and every node is still working to update this central model.
In that sense, it fundamentally differs from the use of "federated" in the **federated OLKi platform**, where there is no central model
and every node has its own objectives, but still voluntarily shares knowledge with other nodes in its community.