Tensorflow server (!103) · Merge requests · melissa / Melissa

SCHOULER Marc requested to merge tensorflow-server into develop Mar 14, 2023

This MR introduces a TensorFlow deep learning server.

To-do:

Note: unlike torch DistributedDataParallel (DDP), tf.distribute.MultiWorkerMirroredStrategy seems to require several adjustments for multi-GPU-per-node execution to work within the Melissa framework.

With TF, a specific GPU can be selected by setting the environment variable CUDA_VISIBLE_DEVICES=GPU_ID before running the server, as explained in the following threads:
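
As a rough illustration of the environment-variable approach, the sketch below builds a per-process environment in which only one GPU is exposed to CUDA. The helper names env_for_gpu and launch_server, and the idea of starting the server through subprocess, are assumptions made for the example and are not taken from Melissa.

```python
import os
import subprocess


def env_for_gpu(gpu_id, base_env=None):
    """Return a copy of the environment in which CUDA only sees `gpu_id`."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA reads this variable when the process first initializes the driver,
    # so it has to be set before the server process starts.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def launch_server(command, gpu_id):
    """Start `command` (a hypothetical server launch line) pinned to one GPU."""
    return subprocess.Popen(command, env=env_for_gpu(gpu_id))
```

Inside the launched process the selected GPU appears as device 0, since the other devices are hidden from CUDA.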

Another solution would be to use tf.config.set_visible_devices before defining the distributed strategy:

physical_devices = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(physical_devices[self.rank], 'GPU')

This might avoid having to hard-code the CUDA environment variable before launching the server.

Edit: this was successfully tested on Jean-Zay (JZ) and has therefore been implemented.
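
A minimal standalone sketch of this approach is given below. It assumes that rank is the index of the GPU the process should use on its node. The function names are hypothetical, and TensorFlow is imported lazily only so that the module can be loaded on a machine without it.

```python
def pick_device_for_rank(devices, rank):
    """Return the single device that the process with this rank should see."""
    if not 0 <= rank < len(devices):
        raise IndexError(
            f"rank {rank} has no matching GPU among {len(devices)} device(s)"
        )
    return devices[rank]


def restrict_to_rank_gpu(rank):
    """Expose one GPU to TensorFlow, then create the distribution strategy."""
    import tensorflow as tf  # lazy import: assumption made for this sketch

    gpus = tf.config.list_physical_devices("GPU")
    # Visible devices must be fixed before the TensorFlow runtime is initialized.
    tf.config.set_visible_devices(pick_device_for_rank(gpus, rank), "GPU")
    # The strategy is created only once the visible device is set.
    return tf.distribute.MultiWorkerMirroredStrategy()
```

The explicit bounds check is an addition for the example: it turns a rank with no matching GPU into a readable error instead of a bare list index failure.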

Additional changes and remarks:

Since Jean-Zay prevents users from loading both torch and tensorflow at the same time, the following modifications were made:

The MelissaIterableDataset was subclassed into the TorchMelissaIterableDataset and TfMelissaIterableDataset classes.
Because of the torch dependency induced by SummaryWriter, new TorchTensorboardLogger and TfTensorboardLogger classes were introduced.
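
The split can be pictured with the sketch below, in which each logger imports its framework only when it is instantiated, so that a process never loads both torch and tensorflow. Only the two class names come from this MR. The base class, the log_scalar method, the make_logger factory and all method bodies are assumptions made for illustration, not the actual Melissa code.

```python
from abc import ABC, abstractmethod


class BaseTensorboardLogger(ABC):
    """Framework-agnostic interface (hypothetical, for illustration only)."""

    @abstractmethod
    def log_scalar(self, tag, value, step):
        ...


class TorchTensorboardLogger(BaseTensorboardLogger):
    def __init__(self, logdir):
        # torch is imported here only, never at module level.
        from torch.utils.tensorboard import SummaryWriter

        self.writer = SummaryWriter(log_dir=logdir)

    def log_scalar(self, tag, value, step):
        self.writer.add_scalar(tag, value, step)


class TfTensorboardLogger(BaseTensorboardLogger):
    def __init__(self, logdir):
        # tensorflow is imported here only, never at module level.
        import tensorflow as tf

        self._tf = tf
        self.writer = tf.summary.create_file_writer(logdir)

    def log_scalar(self, tag, value, step):
        with self.writer.as_default():
            self._tf.summary.scalar(tag, value, step=step)


def make_logger(framework, logdir):
    """Pick the logger matching the framework loaded in this process."""
    loggers = {"torch": TorchTensorboardLogger, "tensorflow": TfTensorboardLogger}
    if framework not in loggers:
        raise ValueError(f"unknown framework: {framework!r}")
    return loggers[framework](logdir)
```

The same lazy-import pattern would apply to the two dataset subclasses.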

Finally, TensorboardLogger was initialized before the distribution strategy, which TensorFlow does not allow:

Important: There is currently a TensorFlow limitation on declaring the strategy; it must be done before any other call to a TensorFlow operation.

The logger initialization was therefore moved to the end of the setup_environment method of each DL server.
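
The resulting ordering can be sketched as follows. SketchTfServer, make_strategy and make_logger are hypothetical names, and the real setup_environment methods in Melissa likely do more than this. The only point illustrated is that the strategy is declared first and the TensorBoard writer last.

```python
class SketchTfServer:
    """Hypothetical stand-in for a TensorFlow DL server (not Melissa's class)."""

    def __init__(self, rank, logdir):
        self.rank = rank
        self.logdir = logdir
        self.strategy = None
        self.tb_logger = None

    def make_strategy(self):
        import tensorflow as tf  # lazy import: assumption made for this sketch

        gpus = tf.config.list_physical_devices("GPU")
        if gpus:
            tf.config.set_visible_devices(gpus[self.rank], "GPU")
        return tf.distribute.MultiWorkerMirroredStrategy()

    def make_logger(self):
        import tensorflow as tf

        return tf.summary.create_file_writer(self.logdir)

    def setup_environment(self):
        # 1. Declare the strategy before any other TensorFlow operation.
        self.strategy = self.make_strategy()
        # 2. Creating the summary writer touches TensorFlow, so it comes last.
        self.tb_logger = self.make_logger()
```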

Edited May 04, 2023 by SCHOULER Marc
