TensorFlow and Torch enable the use of GPUs, when available, to accelerate computations. At the moment, no GPU-dedicated tools are implemented as part of declearn, which means it is up to the clients to implement whatever operations are required to make GPUs detectable and usable by the computation framework backing the federated model.
It could therefore be useful (and user-friendly) to implement (limited) framework-specific and/or generic tools that deal with GPU use - or, at the very least, to document how GPU support and use depend on the way the third-party frameworks work.
Notional task list (tasks' utility and complexity should be evaluated before completing them):
Add code to move TorchModel to a given device (CPU or GPU).
Add code to move TensorflowModel to a given device (if needed).
Add generic configuration tools to detect / hide / select GPU devices to use.
Abstract device-management-related Model API changes.
Fully test the implementation and improve the docs prior to merging.
tensor.cpu() or tensor.cuda(device=None) to place on a given device (type)
alternatively, tensor.to(device) to place on a specific device
A torch.nn.Module may (recursively) be moved to a given device.
this in fact moves each and every wrapped parameter
By default, created tensors and modules are backed on CPU.
When feeding data into a module, the data must be placed on the same device as the module.
When converting a tensor to numpy, it must first be moved to CPU.
tensor.detach().cpu().numpy() creates a copy of a GPU-placed tensor, without removing the initial one (i.e. there is no need to place it on GPU again).
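To summarize these rules, here is a minimal illustrative example of Torch device placement (not declearn code):

```python
import torch

# Tensors and modules are created on CPU by default.
model = torch.nn.Linear(4, 2)
inputs = torch.randn(8, 4)

if torch.cuda.is_available():
    device = torch.device("cuda")
    # Moving a module (recursively) moves each and every wrapped parameter.
    model.to(device)
    # Input data must be moved to the same device as the module.
    inputs = inputs.to(device)

outputs = model(inputs)

# Converting to numpy requires detaching and moving back to CPU first;
# this copies the GPU tensor without altering its original placement.
array = outputs.detach().cpu().numpy()
```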
Possible implementations:
Add device-management parameters and attribute to TorchModel.
Use a custom wrapper to automate module and input data placement.
Solution 1 - Add device attribute to TorchModel
TorchModel could be given a device attribute, based on which the wrapped modules (model and loss) and their input data would be moved:
place self._model and self._loss_fn on the target device as part of __init__
provide an API-defined method to change their placement (and update self.device)
alter _unpack_batch to move tensors to self.device
Pros:
easy to write and read
Cons:
requires adjusting TorchModel.compute_batch_predictions to avoid unnecessary GPU-CPU copies of output data and sample weights
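For illustration, a rough sketch of Solution 1 could look as follows; names and signatures (notably to_device) are hypothetical and only the device-management aspects are shown:

```python
import torch

class TorchModel:
    """Simplified sketch of Solution 1 (device attribute on TorchModel)."""

    def __init__(self, model: torch.nn.Module, loss: torch.nn.Module, device: str = "cpu") -> None:
        self.device = torch.device(device)
        self._model = model.to(self.device)
        self._loss_fn = loss.to(self.device)

    def to_device(self, device: str) -> None:
        # Hypothetical API-defined method to change placement.
        self.device = torch.device(device)
        self._model.to(self.device)
        self._loss_fn.to(self.device)

    def _unpack_batch(self, batch):
        # Move input tensors to the wrapped modules' device.
        return [
            t.to(self.device) if isinstance(t, torch.Tensor) else t
            for t in batch
        ]
```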
Solution 2 - Use a custom Module wrapper
This solution consists in adding a torch.nn.Module wrapper that manages the wrapped module's device placement and adequately places input data as it comes.
Pros:
internalizes some code, reducing changes to TorchModel itself
avoids unnecessary GPU placement
Cons:
still requires writing loss.mul_(s_wght.to(self.device)) (in compute_batch_gradients)
might be somewhat harder / more complex for developers to read (?)
produces unneeded GPU placement in TorchModel.loss_function
Note: in general, AutoDeviceModule(LossModule) only adds input-placement management since there are no weights
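A minimal sketch of what such a wrapper could look like (simplified, not the actual implementation):

```python
import torch

class AutoDeviceModule(torch.nn.Module):
    """Sketch of a wrapper managing a torch Module's device placement."""

    def __init__(self, module: torch.nn.Module, device: torch.device) -> None:
        super().__init__()
        self.device = device
        # Move the wrapped module (hence all of its parameters) to the device.
        self.module = module.to(device)

    def forward(self, *inputs: torch.Tensor) -> torch.Tensor:
        # Move input tensors to the wrapped module's device on the fly.
        inputs = tuple(
            x.to(self.device) if isinstance(x, torch.Tensor) else x
            for x in inputs
        )
        return self.module(*inputs)
```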
Additional notes
It might be worthwhile to implement a generic torch_to_numpy function that takes care of device placement (in addition to detaching, etc.) rather than writing tensor.detach().cpu().numpy() everywhere (see the sketch after these notes).
cpu_tensor.cpu() and gpu_tensor.cuda() add an overhead, although they do not create a copy (they return the same tensor).
With a functorch-translated module, device placement logic remains the same:
fn_mod does not need to be placed on a device since it has no parameters
params and data placement matter (as in non-functional mode)
=> hence, proper management of self._model and input data placement should be sufficient to cover the functorch computations (similarly to the base-torch ones)
The discussion above does not cover the device-management of TorchVector in other contexts, e.g. in optimizer plug-ins. If a state variable, or output tensor, is created from an input one, it will be placed on the same device; there may however be value in catching device-based errors and fixing them as part of the TorchVector backend.
In unit tests, it might be good to parametrize some of the operations to test them on GPU when one is available (and skip when there is no GPU).
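As an illustration of the first and last notes above, a generic conversion helper and a GPU-aware test parametrization could look along these lines (sketch only; names are not part of the current code base):

```python
import numpy as np
import pytest
import torch

def torch_to_numpy(tensor: torch.Tensor) -> np.ndarray:
    """Convert a torch Tensor to numpy, handling detach and CPU placement."""
    return tensor.detach().cpu().numpy()

# Run device-dependent tests on GPU only when one is available.
DEVICES = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])

@pytest.mark.parametrize("device", DEVICES)
def test_torch_to_numpy(device: str) -> None:
    tensor = torch.ones(4, device=device)
    assert isinstance(torch_to_numpy(tensor), np.ndarray)
```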
TorchVector now handles combining a pair of vectors with different device-placement schemes.
TorchModel now exposes an optional device-placement API, using the previously-hinted AutoDeviceModule wrapper (Solution 2 was preferred over Solution 1).
I was able to run some tests that confirm things work smoothly. Those will need to be formalized in the future and integrated into the code base.
Issue: TorchVector device-placement information is lost in (de)serialization.
It does not make sense to force using the same scheme when de-serializing data sent by another agent (e.g. by the server to a client).
This is not an issue with model weights: TorchModel.set_weights preserves pre-existing device placement (hence a CPU-backed vector or weights received by either party will not change their local computation-device strategy).
This can be an issue with state and auxiliary variables of OptiModule and Regularizer instances: if a plug-in holds a CPU-backed vector and combines it with a GPU-backed input one, device placement will change based on the ordering of the operation (cpu + gpu = cpu while gpu + cpu = gpu).
For ScaffoldClientModule: output corrected gradients will be on GPU, but the states will be on CPU, and both the inner-state-update and gradient-correction operations will cause data copies, which is entirely sub-optimal.
For any momentum-based module: states and gradients will be on the device of the first input gradients vector, unless set_state has been called with CPU-backed reloaded states, in which case everything (states, inputs and outputs) will be moved to CPU.
=> Hence the current propagation rule will probably need to be updated (cpu + gpu = gpu no matter the ordering), OR some form of device-placement policy will need to be implemented to cleverly place vectors managed by optimizer plug-ins and/or deserialized vectors.
=> For now I will move on to adding TensorFlow GPU support, so that the differences and commonalities between frameworks can guide the overall design of such mechanisms, and of a harmonized policy regarding the use of computation-backing devices.
Some utils are provided under tf.config to list and manage visible devices.
List devices: tf.config.list_physical_devices (hardware devices) and tf.config.list_logical_devices (initialized logical devices)
Get/set usable devices: tf.config.get_visible_devices and tf.config.set_visible_devices
However, the latter cannot be used past a certain initialization point (typically once data has been placed on a device that would no longer be visible to TensorFlow).
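For illustration, these utils may be used as follows (device restriction has to happen before the runtime is initialized):

```python
import tensorflow as tf

# List GPU devices known to TensorFlow.
physical_gpus = tf.config.list_physical_devices("GPU")
logical_gpus = tf.config.list_logical_devices("GPU")

# Restrict TensorFlow to the first GPU (if any); this raises a RuntimeError
# when called once the runtime has already been initialized.
if physical_gpus:
    tf.config.set_visible_devices(physical_gpus[:1], "GPU")
```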
Device placement:
Is automated in general: created tensors go to GPU and so do most operations (unless designed not to do so, e.g. lookup tables remain on CPU)
Can be controlled using with tf.device("..."): context managers. Typically, with tf.device("GPU:0") will force placement on the first GPU... if there is one. Otherwise, it will silently fail and place things declared under its scope on CPU.
Can be verified once a tensor is created, by accessing its device attribute, or its backing_device attribute, since in some cases the two can differ, e.g. when using signed integers. See this issue and this trick to place an int tensor on GPU.
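For example (assuming a GPU is actually visible to TensorFlow):

```python
import tensorflow as tf

# Request placement on the first GPU; in eager mode this silently falls
# back to CPU if no GPU is visible.
with tf.device("GPU:0"):
    tensor = tf.random.uniform((4, 4))

print(tensor.device)          # e.g. '/job:localhost/replica:0/task:0/device:GPU:0'
print(tensor.backing_device)  # may differ from tensor.device in some corner cases
```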
Operations:
Can combine a CPU-backed and a GPU-backed tensor. In most cases this will result in a GPU-backed tensor.
In Eager mode, by default, there is so-called soft placement: operations that do not have a GPU kernel are automatically backed on CPU. In graph mode, they will fail. In declearn, we use a combination of both modes: our code is eager, but some parts (namely, gradients-computation) are graph-compiled using tf.function to optimize their runtime.
Therefore:
In the absence of GPU-oriented code, the current declearn implementation natively supports GPUs and will actually use one whenever it is visible to tensorflow.
End-users might run some python instructions prior to running a declearn experiment and/or set the CUDA_VISIBLE_DEVICES environment variable to disable using a GPU, or target a specific one (see the example at the end of this list).
The full extent of GPU-placement with the current code has not been tested yet.
Moving a wrapped model to CPU or to a GPU should be fairly easy using a tf.device context-manager.
Ensuring computations, and notably created outputs, remain on that same device might be trickier.
Deserialized tensorflow vectors, such as the reloaded-state and/or network-obtained auxiliary variables of an optimizer plugin, will be placed on GPU whenever one is available. At any rate, combining them with GPU-backed input tensors will result in GPU-backed outputs. It is unclear yet whether or in which cases the CPU-backed states would be durably moved to GPU (if not, this might generate a lot of CPU->GPU copy operations, as in the current Torch implementation).
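As an example of the end-user-side control mentioned above, GPU visibility can already be restricted without any declearn-specific code, e.g. by setting CUDA_VISIBLE_DEVICES before TensorFlow is imported:

```python
import os

# Hide all GPUs from TensorFlow (must be set before importing it);
# set e.g. "1" instead to expose only the second GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf  # noqa: E402

assert not tf.config.list_physical_devices("GPU")
```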
Possible solutions
Solution 1 - select the used device(s) once and for all
Have the users specify whether they want to use a GPU or not, and if so which one.
Use tf.config.set_visible_devices to disable using undesired GPUs.
Rely on TensorFlow to place things adequately.
Solution 2 - extensively use tf.device
Context-manage operations that might result in new tensors being created and placed on a GPU device.
Attach a device attribute to TensorflowModel (and enable changing it via a dedicated method) and use it.
Note: this does not resolve the placement of optimizer state variables.
Solution 2.a - add a Vector.to_device method
This could enable objects that handle vectors (including optimizer plug-ins) to propagate a device-placement policy to the instances they manage.
This could be coupled with adding a device-backing policy to Optimizer, but that might be counterproductive in terms of clarity and isolation of framework-specific needs...
Solution 2.b - add a general GPU-management policy
Add declearn config utils to specify whether to use the CPU or a GPU.
Have framework-specific backend code access that information and act accordingly to place the data and computations on the appropriate device, using framework-specific logic to do so.
Pros:
Would unify the device-policy language and logic in declearn.
Would enable changing said policy when required (but is that useful?).
Could solve optimizer state variables' placement without adding device-placement code elsewhere than in Vector subclasses and in the dedicated utils.
Cons:
Newcomers who are used to a given framework's default behavior might need to put in more adaptation effort.
Perhaps not as flexible as letting the user set a layer-wise device policy (but the latter is hard to implement, and currently unsupported in our torch gpu-support implementation).
Might alter the way "experimental mode" works if the config ends up being shared between spawned processes.
This morning I pushed a draft of a small utils API to manage a global device-placement policy, specified through minimal parameters (for now, gpu: bool and idx: Optional[int]).
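In a nutshell, the drafted API revolves around a structure along the following lines (a simplified sketch; actual names, module paths and accessors may differ):

```python
import dataclasses
from typing import Optional

@dataclasses.dataclass
class DevicePolicy:
    """Global device-placement policy (sketch of the drafted API)."""

    gpu: bool            # whether to use a GPU (if any is available)
    idx: Optional[int]   # optional index of the GPU device to use

# Hypothetical module-level accessors to the global policy.
_POLICY = DevicePolicy(gpu=True, idx=None)

def set_device_policy(gpu: bool, idx: Optional[int] = None) -> None:
    """Update the global device-placement policy."""
    global _POLICY
    _POLICY = DevicePolicy(gpu=gpu, idx=idx)

def get_device_policy() -> DevicePolicy:
    """Access the current global device-placement policy."""
    return _POLICY
```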
Torch implementation
This was then used to correct and complement yesterday's implementation of GPU support for Torch:
TorchModel: remove cuda init argument and replace move_to_device method with update_device_policy.
TorchVector: use the policy to place unpacked tensors when deserializing a vector.
The latter change should make it so that optimizer plug-ins place their state and auxiliary variables on GPU when a client has decided to use one.
I still have to run proper tests to ensure things go smoothly.
Tensorflow implementation
I am currently working towards implementing controlled GPU support for TensorFlow (rather than letting tensorflow decide to use GPU whenever one is visible to it).
The current status is that I am mirroring the torch implementation whenever possible:
Add declearn.model.tensorflow.utils submodule, with a select_device method and an AutoDeviceLayer wrapper for keras layers.
TensorflowModel: rely on AutoDeviceLayer and the global DevicePolicy to place computations on the desired type of device, using the tf.device context manager.
TensorflowVector: rely on the device policy and tf.device to manage the placement of operations as well as of unpacked tensors.
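As an indication, the select_device util for TensorFlow could look along these lines (a sketch under the assumptions above; the actual signature and fallback logic may differ):

```python
from typing import Optional

import tensorflow as tf

def select_device(gpu: bool, idx: Optional[int] = None) -> str:
    """Return the name of a tf logical device matching the device policy."""
    if gpu:
        gpus = tf.config.list_logical_devices("GPU")
        if gpus:
            index = idx if (idx is not None and idx < len(gpus)) else 0
            return gpus[index].name
    # Fall back to (the first) CPU when no GPU is requested or available.
    return tf.config.list_logical_devices("CPU")[0].name

# Usage: wrap computations in a tf.device context based on the policy.
# with tf.device(select_device(gpu=True)):
#     ...
```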
This implementation effort is currently ongoing, and might change before I push. Notably, I am still unsure whether AutoDeviceLayer is a good idea, since it is insufficient to place backward computations on the desired device (it might be easier to just use tf.device directly in TensorflowModel).
I finalized a first implementation of proper device-management in TensorflowModel and TensorflowVector, using custom utils to limit redundant boilerplate code.
The use of an AutoDeviceLayer wrapper was dropped, as it made the TensorflowModel code harder to read while still requiring most tf.device context managers to be added to it.
The specifications of the implemented changes are similar to those of TorchModel (which was revised today as well - see post above on the experimental DevicePolicy API).
The implementation should be on par with the torch one (in terms of API, documentation and effective behaviors), but it requires further testing to look out for bugs and/or check that everything runs as expected. This also holds for the TorchVector revisions, and notably for the way both frameworks' Vector subclass revisions impact the identified issues regarding the placement of data and operations as part of Optimizer pipelines (and, to a lesser extent, Aggregator computations).