TensorFlow and Torch enable the use of GPUs, when available, to accelerate computations. At the moment, no GPU-dedicated tools are implemented as part of declearn, which means it is up to the clients to implement whatever operations are required to make GPUs detectable and usable by the computation framework backing the federated model.
It could therefore be useful (and user-friendly) to implement (limited) framework-specific and/or generic tools that deal with GPU use - or, at the very least, to document how GPU support and use depend on the way the third-party frameworks work.
Notional task list (tasks' utility and complexity should be evaluated before completing them):
Add code to move TorchModel to a given device (CPU or GPU).
Add code to move TensorflowModel to a given device (if needed).
Add generic configuration tools to detect / hide / select GPU devices to use.
Abstract device-management-related Model API changes.
Fully test the implementation and improve the docs prior to merging.
tensor.cpu() or tensor.cuda(device=None) to place on a given device (type)
alternatively, tensor.to(device) to place on a specific device
A torch.nn.Module may (recursively) be moved to a given device.
this in fact moves each and every wrapped parameter
By default, created tensors and modules are backed on CPU.
When feeding data into a module, the data must be placed on the same device as the module.
When converting a tensor to numpy, it must first be moved to CPU.
tensor.detach().cpu().numpy() creates a copy of a GPU-placed tensor, without removing the initial one (i.e. there is no need to place it on GPU again).
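To summarize these rules, here is a minimal illustrative example of Torch device placement (not declearn code):

```python
import torch

# Tensors and modules are created on CPU by default.
model = torch.nn.Linear(4, 2)
inputs = torch.randn(8, 4)

if torch.cuda.is_available():
    device = torch.device("cuda")
    # Moving a module (recursively) moves each and every wrapped parameter.
    model.to(device)
    # Input data must be moved to the same device as the module.
    inputs = inputs.to(device)

outputs = model(inputs)

# Converting to numpy requires detaching and moving back to CPU first;
# this copies the GPU tensor without altering its original placement.
array = outputs.detach().cpu().numpy()
```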
Possible implementations:
Add device-management parameters and attribute to TorchModel.
Use a custom wrapper to automate module and input data placement.
Solution 1 - Add device attribute to TorchModel
TorchModel could be given a device attribute, based on which the wrapped modules (model and loss) and their input data would be moved:
place self._model and self._loss_fn on the target device as part of __init__
provide an API-defined method to change their placement (and update self.device)
alter _unpack_batch to move tensors to self.device
Pros:
easy to write and read
Cons:
requires adjusting TorchModel.compute_batch_predictions to avoid unnecessary GPU-CPU copies of output data and sample weights
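For illustration, a rough sketch of Solution 1 could look as follows; names and signatures (notably to_device) are hypothetical and only the device-management aspects are shown:

```python
import torch

class TorchModel:
    """Simplified sketch of Solution 1 (device attribute on TorchModel)."""

    def __init__(self, model: torch.nn.Module, loss: torch.nn.Module, device: str = "cpu") -> None:
        self.device = torch.device(device)
        self._model = model.to(self.device)
        self._loss_fn = loss.to(self.device)

    def to_device(self, device: str) -> None:
        # Hypothetical API-defined method to change placement.
        self.device = torch.device(device)
        self._model.to(self.device)
        self._loss_fn.to(self.device)

    def _unpack_batch(self, batch):
        # Move input tensors to the wrapped modules' device.
        return [
            t.to(self.device) if isinstance(t, torch.Tensor) else t
            for t in batch
        ]
```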
Solution 2 - Use a custom Module wrapper
This solution consists in adding a torch.nn.Module wrapper that manages the wrapped module's device placement and adequately places input data as it comes.
Pros:
internalizes some code, reducing changes to TorchModel itself
avoids unnecessary GPU placement
Cons:
still requires writing loss.mul_(s_wght.to(self.device)) (in compute_batch_gradients)
might be somewhat harder / more complex for developers to read (?)
produces unneeded GPU placement in TorchModel.loss_function
Note: in general, AutoDeviceModule(LossModule) only adds input-placement management since there are no weights
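A minimal sketch of what such a wrapper could look like (simplified, not the actual implementation):

```python
import torch

class AutoDeviceModule(torch.nn.Module):
    """Sketch of a wrapper managing a torch Module's device placement."""

    def __init__(self, module: torch.nn.Module, device: torch.device) -> None:
        super().__init__()
        self.device = device
        # Move the wrapped module (hence all of its parameters) to the device.
        self.module = module.to(device)

    def forward(self, *inputs: torch.Tensor) -> torch.Tensor:
        # Move input tensors to the wrapped module's device on the fly.
        inputs = tuple(
            x.to(self.device) if isinstance(x, torch.Tensor) else x
            for x in inputs
        )
        return self.module(*inputs)
```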
Additional notes
It might be worthwhile to implement a generic torch_to_numpy function that takes care of device placement (in addition to detaching, etc.) rather than writing tensor.detach().cpu().numpy() everywhere (see the sketch after these notes).
cpu_tensor.cpu() and gpu_tensor.cuda() add an overhead, although they do not create a copy (they return the same tensor).
With a functorch-translated module, device placement logic remains the same:
fn_mod does not need to be placed on a device since it has no parameters
params and data placement matter (as in non-functional mode)
=> hence, proper management of self._model and input data placement should be sufficient to cover the functorch computations (similarly to the base-torch ones)
The discussion above does not cover the device-management of TorchVector in other contexts, e.g. in optimizer plug-ins. If a state variable, or output tensor, is created from an input one, it will be placed on the same device; there may however be value in catching device-based errors and fixing them as part of the TorchVector backend.
In unit tests, it might be good to parametrize some of the operations to test them on GPU when one is available (and skip when there is no GPU).
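As an illustration of the first and last notes above, a generic conversion helper and a GPU-aware test parametrization could look along these lines (sketch only; names are not part of the current code base):

```python
import numpy as np
import pytest
import torch

def torch_to_numpy(tensor: torch.Tensor) -> np.ndarray:
    """Convert a torch Tensor to numpy, handling detach and CPU placement."""
    return tensor.detach().cpu().numpy()

# Run device-dependent tests on GPU only when one is available.
DEVICES = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])

@pytest.mark.parametrize("device", DEVICES)
def test_torch_to_numpy(device: str) -> None:
    tensor = torch.ones(4, device=device)
    assert isinstance(torch_to_numpy(tensor), np.ndarray)
```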
TorchVector now handles combining a pair of vectors with different device-placement schemes.
TorchModel now exposes an optional device-placement API, using the previously-hinted AutoDeviceModule wrapper (Solution 2 was preferred over Solution 1).
I was able to run some tests that confirm things work smoothly. Those will need to be formalized in the future and integrated into the code base.
Issue: TorchVector device-placement information is lost in (de)serialization.
It does not make sense to force using the same scheme when de-serializing data sent by another agent (e.g. by the server to a client).
This is not an issue with model weights: TorchModel.set_weights preserves pre-existing device placement (hence a CPU-backed vector or weights received by either party will not change their local computation-device strategy).
This can be an issue with state and auxiliary variables of OptiModule and Regularizer instances: if a plug-in holds a CPU-backed vector and combines it with a GPU-backed input one, device placement will change based on the ordering of the operation (cpu + gpu = cpu while gpu + cpu = gpu).
For ScaffoldClientModule: output corrected gradients will be on GPU, but the states will be on CPU, and both the inner-state-update and gradient-correction operations will cause data copies, which is entirely sub-optimal.
For any momentum-based module: states and gradients will be on the device of the first input gradients vector, unless set_state has been called with CPU-backed reloaded states, in which case everything (states, inputs and outputs) will be moved to CPU.
=> Hence the current propagation rule will probably need to be updated (cpu + gpu = gpu no matter the ordering), OR some form of device-placement policy will need to be implemented to cleverly place vectors managed by optimizer plug-ins and/or deserialized vectors.
=> For now I will move on to adding TensorFlow GPU support, so that the differences and commonalities between frameworks can guide the overall design of such mechanisms, and of a harmonized policy regarding the use of computation-backing devices.
Some utils are provided under tf.config to list and manage visible devices.
List devices: tf.config.list_physical_devices (hardware devices) and tf.config.list_logical_devices (initialized logical devices)
Get/set usable devices: tf.config.get_visible_devices and tf.config.set_visible_devices
However, the latter cannot be used past a certain initialization point (typically once data has been placed on a device that would no longer be visible to TensorFlow).
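For illustration, these utils may be used as follows (device restriction has to happen before the runtime is initialized):

```python
import tensorflow as tf

# List GPU devices known to TensorFlow.
physical_gpus = tf.config.list_physical_devices("GPU")
logical_gpus = tf.config.list_logical_devices("GPU")

# Restrict TensorFlow to the first GPU (if any); this raises a RuntimeError
# when called once the runtime has already been initialized.
if physical_gpus:
    tf.config.set_visible_devices(physical_gpus[:1], "GPU")
```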
Device placement:
Is automated in general: created tensors go to GPU and so do most operations (unless designed not to do so, e.g. lookup tables remain on CPU)
Can be controlled using with tf.device("..."): context managers. Typically, with tf.device("GPU:0") will force placement on the first GPU... if there is one. Otherwise, it will silently fail and place things declared under its scope on CPU.
Can be verified once a tensor is created, by accessing its device attribute, or its backing_device attribute, since in some cases the two can differ, e.g. when using signed integers. See this issue and this trick to place an int tensor on GPU.
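For example (assuming a GPU is actually visible to TensorFlow):

```python
import tensorflow as tf

# Request placement on the first GPU; in eager mode this silently falls
# back to CPU if no GPU is visible.
with tf.device("GPU:0"):
    tensor = tf.random.uniform((4, 4))

print(tensor.device)          # e.g. '/job:localhost/replica:0/task:0/device:GPU:0'
print(tensor.backing_device)  # may differ from tensor.device in some corner cases
```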
Operations:
Can combine a CPU-backed and a GPU-backed tensor. In most cases this will result in a GPU-backed tensor.
In Eager mode, by default, there is so-called soft placement: operations that do not have a GPU kernel are automatically backed on CPU. In graph mode, they will fail. In declearn, we use a combination of both modes: our code is eager, but some parts (namely, gradients-computation) are graph-compiled using tf.function to optimize their runtime.
Therefore:
In the absence of GPU-oriented code, the current declearn implementation natively supports GPUs and will actually use one whenever it is visible to tensorflow.
End-users might run some python instructions prior to running a declearn experiment and/or set the CUDA_VISIBLE_DEVICES environment variable to disable using a GPU, or target a specific one (see the example at the end of this list).
The full extent of GPU-placement with the current code has not been tested yet.
Moving a wrapped model to CPU or to a GPU should be fairly easy using a tf.device context-manager.
Ensuring computations, and notably created outputs, remain on that same device might be trickier.
Deserialized tensorflow vectors, such as the reloaded-state and/or network-obtained auxiliary variables of an optimizer plugin, will be placed on GPU whenever one is available. At any rate, combining them with GPU-backed input tensors will result in GPU-backed outputs. It is unclear yet whether or in which cases the CPU-backed states would be durably moved to GPU (if not, this might generate a lot of CPU->GPU copy operations, as in the current Torch implementation).
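As an example of the end-user-side control mentioned above, GPU visibility can already be restricted without any declearn-specific code, e.g. by setting CUDA_VISIBLE_DEVICES before TensorFlow is imported:

```python
import os

# Hide all GPUs from TensorFlow (must be set before importing it);
# set e.g. "1" instead to expose only the second GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf  # noqa: E402

assert not tf.config.list_physical_devices("GPU")
```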
Possible solutions
Solution 1 - select the used device(s) once and for all
Have the users specify whether they want to use a GPU or not, and if so which one.
Use tf.config.set_visible_devices to disable using undesired GPUs.
Rely on TensorFlow to place things adequately.
Solution 2 - extensively use tf.device
Context-manage operations that might result in new tensors being created and placed on a GPU device.
Attach a device attribute to TensorflowModel (and enable changing it via a dedicated method) and use it.
Note: this does not resolve the placement of optimizer state variables.
Solution 2.a - add a Vector.to_device method
This could enable objects that handle vectors (including optimizer plug-ins) to propagate a device-placement policy to the instances they manage.
This could be coupled with adding a device-backing policy to Optimizer, but that might be counterproductive in terms of clarity and isolation of framework-specific needs...
Solution 2.b - add a general GPU-management policy
Add declearn config utils to specify whether to use the CPU or a GPU.
Have framework-specific backend code access that information and act accordingly to place the data and computations on the appropriate device, using framework-specific logic to do so.
Pros:
Would unify the device-policy language and logic in declearn.
Would enable changing said policy when required (but is that useful?).
Could solve optimizer state variables' placement without adding device-placement code elsewhere than in Vector subclasses and in the dedicated utils.
Cons:
Newcomers who are used to a given framework's default behavior might need to put in more adaptation effort.
Perhaps not as flexible as letting the user set a layer-wise device policy (but the latter is hard to implement, and currently unsupported in our torch gpu-support implementation).
Might alter the way "experimental mode" works if the config ends up being shared between spawned processes.
This morning I pushed a draft of a small utils API to manage a global device-placement policy, specified through minimal parameters (for now, gpu: bool and idx: Optional[int]).
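In a nutshell, the drafted API revolves around a structure along the following lines (a simplified sketch; actual names, module paths and accessors may differ):

```python
import dataclasses
from typing import Optional

@dataclasses.dataclass
class DevicePolicy:
    """Global device-placement policy (sketch of the drafted API)."""

    gpu: bool            # whether to use a GPU (if any is available)
    idx: Optional[int]   # optional index of the GPU device to use

# Hypothetical module-level accessors to the global policy.
_POLICY = DevicePolicy(gpu=True, idx=None)

def set_device_policy(gpu: bool, idx: Optional[int] = None) -> None:
    """Update the global device-placement policy."""
    global _POLICY
    _POLICY = DevicePolicy(gpu=gpu, idx=idx)

def get_device_policy() -> DevicePolicy:
    """Access the current global device-placement policy."""
    return _POLICY
```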
Torch implementation
This was then used to correct and complement yesterday's implementation of GPU support for Torch:
TorchModel: remove cuda init argument and replace move_to_device method with update_device_policy.
TorchVector: use the policy to place unpacked tensors when deserializing a vector.
The latter change should make it so that optimizer plug-ins place their state and auxiliary variables on GPU when a client has decided to use one.
I still have to run proper tests to ensure things go smoothly.
Tensorflow implementation
I am currently working towards implementing controlled GPU support for TensorFlow (rather than letting tensorflow decide to use GPU whenever one is visible to it).
The current status is that I am mirroring the torch implementation whenever possible:
Add declearn.model.tensorflow.utils submodule, with a select_device method and an AutoDeviceLayer wrapper for keras layers.
TensorflowModel: rely on AutoDeviceLayer and the global DevicePolicy to place computations on the desired type of device, using the tf.device context manager.
TensorflowVector: rely on the device policy and tf.device to manage the placement of operations as well as of unpacked tensors.
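As an indication, the select_device util for TensorFlow could look along these lines (a sketch under the assumptions above; the actual signature and fallback logic may differ):

```python
from typing import Optional

import tensorflow as tf

def select_device(gpu: bool, idx: Optional[int] = None) -> str:
    """Return the name of a tf logical device matching the device policy."""
    if gpu:
        gpus = tf.config.list_logical_devices("GPU")
        if gpus:
            index = idx if (idx is not None and idx < len(gpus)) else 0
            return gpus[index].name
    # Fall back to (the first) CPU when no GPU is requested or available.
    return tf.config.list_logical_devices("CPU")[0].name

# Usage: wrap computations in a tf.device context based on the policy.
# with tf.device(select_device(gpu=True)):
#     ...
```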
This implementation effort is currently ongoing, and might change before I push. Notably, I am still unsure whether AutoDeviceLayer is a good idea, since it is insufficient to place backward computations on the desired device (it might be easier to just use tf.device directly in TensorflowModel).
I finalized a first implementation of proper device-management in TensorflowModel and TensorflowVector, using custom utils to limit redundant boilerplate code.
The use of an AutoDeviceLayer wrapper was dropped, as it made the TensorflowModel code harder to read while still requiring most tf.device context managers to be added to it.
The specifications of the implemented changes are similar to those of TorchModel (which was revised today as well - see post above on the experimental DevicePolicy API).
The implementation should be on par with the torch one (in terms of API, documentation and effective behaviors), but it requires further testing to look out for bugs and/or check that everything runs as expected. This also holds for the TorchVector revisions, and notably for the way both frameworks' Vector subclass revisions impact the identified issues regarding the placement of data and operations as part of Optimizer pipelines (and, to a lesser extent, Aggregator computations).