This MR introduces a TensorFlow deep learning server.
Note: unlike `torch.DDP`, `tensorflow.MultiWorkerMirroredStrategy` seems to require several adjustments for multi-GPU-per-node execution to work in the Melissa framework.
With TF, a specific GPU can be selected by setting the environment variable `CUDA_VISIBLE_DEVICES=GPU_ID` before running the server, as explained in the following threads:
- Tensorflow set CUDA_VISIBLE_DEVICES within jupyter
- How do I select which GPU to run a job on?
- How to set specific gpu in tensorflow?
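As a sketch of that route (the helper name is illustrative, not part of the MR), the variable has to be exported before `tensorflow` is imported, since TF enumerates the visible CUDA devices when it initializes:

```python
import os

def pin_gpu(gpu_id: int) -> None:
    """Restrict CUDA device visibility for this process.

    Must run before `import tensorflow`: changing the variable after TF
    has initialized CUDA has no effect.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

pin_gpu(0)
# import tensorflow as tf  # only import TF after the variable is set
```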
Another solution would be to use `tf.config.set_visible_devices` before defining the distributed strategy:

```python
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(physical_devices[self.rank], 'GPU')
```
This would perhaps remove the need to hard-code the `CUDA_VISIBLE_DEVICES` declaration before executing the server.

Edit: this was successfully tested on JZ and was therefore implemented.
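The rank-to-GPU mapping this relies on can be isolated in a small helper; the following is a sketch (the helper and its round-robin fallback are assumptions, not the MR's exact code):

```python
def device_for_rank(rank, physical_devices):
    """Pick the GPU a given server rank should use.

    `physical_devices` is the list returned by
    tf.config.list_physical_devices('GPU'); round-robin keeps this working
    when there are more ranks per node than GPUs.
    """
    if not physical_devices:
        raise RuntimeError("no GPU visible to this process")
    return physical_devices[rank % len(physical_devices)]

# In the server, before creating the distribution strategy:
# import tensorflow as tf
# gpus = tf.config.list_physical_devices('GPU')
# tf.config.set_visible_devices(device_for_rank(self.rank, gpus), 'GPU')
```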
Additional changes and remarks:
Since Jean-Zay prevents users from loading both `torch` and `tensorflow` at the same time, the following modifications were made:
- `MelissaIterableDataset` was subclassed.
- Because of the `torch` dependency induced by the existing `TensorboardLogger`, dedicated `TfTensorboardLogger` classes were introduced.
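A torch-free logger along these lines could look like the sketch below (the class body and the lazily created writer are assumptions, not the MR's actual implementation); it relies on `tf.summary` instead of `torch.utils.tensorboard`:

```python
class TfTensorboardLogger:
    """Sketch of a TensorBoard logger with no torch dependency."""

    def __init__(self, logdir: str):
        self.logdir = logdir
        self._writer = None  # created lazily, after the strategy is declared

    def _get_writer(self):
        import tensorflow as tf  # deferred so importing this module needs neither torch nor TF
        if self._writer is None:
            self._writer = tf.summary.create_file_writer(self.logdir)
        return self._writer

    def log_scalar(self, tag: str, value: float, step: int) -> None:
        import tensorflow as tf
        with self._get_writer().as_default():
            tf.summary.scalar(tag, value, step=step)
```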
`TensorboardLogger` was initialized before the distribution strategy, which is not allowed:

> Important: There is currently a TensorFlow limitation on declaring the strategy; it must be done before any other call to a TensorFlow operation.

Its initialization was therefore moved to the end of the `setup_environment` method of each DL server.
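The resulting order can be summarized by the following skeleton (class and method bodies are illustrative stand-ins; the real TensorFlow calls are shown in comments):

```python
class TfServer:
    """Minimal sketch of the corrected initialization order."""

    def __init__(self, log_dir: str = "logs"):
        self.log_dir = log_dir
        self.strategy = None
        self.logger = None

    def _make_strategy(self):
        # Real server: tf.distribute.MultiWorkerMirroredStrategy()
        return "MultiWorkerMirroredStrategy"

    def _make_logger(self):
        # Real server: a TfTensorboardLogger wrapping tf.summary
        return f"TfTensorboardLogger({self.log_dir})"

    def setup_environment(self):
        # The strategy must be the first TensorFlow call in the process...
        self.strategy = self._make_strategy()
        # ...so the logger, which touches tf.summary, is created only afterwards.
        self.logger = self._make_logger()
```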