TensorFlow server

SCHOULER Marc requested to merge tensorflow-server into develop

This MR introduces a TensorFlow deep learning server.


Note: unlike torch.DDP, tf.distribute.MultiWorkerMirroredStrategy seems to require several adjustments for multi-GPU-per-node execution to work in the Melissa framework.

With TensorFlow, a specific GPU can be selected by setting the environment variable CUDA_VISIBLE_DEVICES=GPU_ID before running the server, as explained in several online threads.
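As a minimal illustration of this approach (the launch command is hypothetical; Melissa's actual server entry point may differ), each server process is started with only its own GPU visible:

```shell
# Hypothetical launch: the process only sees the GPU listed in the variable
CUDA_VISIBLE_DEVICES=0 python -c "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
```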

Another solution is to call tf.config.set_visible_devices before defining the distribution strategy:

# Expose only this rank's GPU to TensorFlow (must run before the strategy is defined)
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(physical_devices[self.rank], 'GPU')

This would avoid having to hard-code the CUDA_VISIBLE_DEVICES declaration before executing the server.

Edit: this was successfully tested on Jean-Zay and was therefore implemented.

Additional changes and remarks:

Since Jean-Zay prevents users from loading both torch and tensorflow at the same time, the following modifications were made:

  • The MelissaIterableDataset class was subclassed into TorchMelissaIterableDataset and TfMelissaIterableDataset;
  • because SummaryWriter induces a torch dependency, new TorchTensorboardLogger and TfTensorboardLogger classes were introduced.
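A sketch of the resulting split (method bodies are illustrative, not the MR's actual implementation): the shared iteration logic lives in the base class, and each subclass defers its framework import so that using one never pulls in the other.

```python
class MelissaIterableDataset:
    """Framework-agnostic base: iterates over batches from the Melissa buffer."""

    def __init__(self, buffer):
        self.buffer = buffer

    def __iter__(self):
        yield from self.buffer


class TorchMelissaIterableDataset(MelissaIterableDataset):
    """torch-specific variant; torch is imported only when actually needed."""

    def to_framework(self):
        import torch  # deferred import keeps TensorFlow-only environments torch-free
        return torch.tensor(list(self.buffer))


class TfMelissaIterableDataset(MelissaIterableDataset):
    """TensorFlow-specific variant; tensorflow is imported only when actually needed."""

    def to_framework(self):
        import tensorflow as tf  # deferred for the same reason
        return tf.convert_to_tensor(list(self.buffer))
```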

Finally, the TensorboardLogger was initialized before the distribution strategy, which is not allowed:

Important: There is currently a TensorFlow limitation on declaring the strategy; it must be done before any other call to a TensorFlow operation.

Its initialization was therefore moved to the end of the setup_environment method of each DL server.
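The corrected ordering can be sketched as follows (the function body and log path are illustrative, not the MR's actual setup_environment; tensorflow is imported lazily so the sketch itself imports without TF installed):

```python
def setup_environment(rank: int):
    """Illustrative initialization order for one server process of the given rank."""
    import tensorflow as tf  # deferred so importing this module needs no TensorFlow

    # 1. Restrict GPU visibility before any other TensorFlow call.
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.set_visible_devices(gpus[rank], 'GPU')

    # 2. Declare the strategy next: TensorFlow forbids declaring it after
    #    other TensorFlow operations have already run.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    # 3. Only now create the Tensorboard logger, at the very end of setup.
    logger = tf.summary.create_file_writer('/tmp/tb_logs')
    return strategy, logger
```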

Edited by SCHOULER Marc
