This MR introduces a TensorFlow deep learning server.
To-do:
- `tf_server.py`
- `config_mpi_tf.json`
- `tf_heatpde_dl_server.py`
- `tf-plot-result-dl.py`
- `dataset.py`
- `tensorboard_logger`
- CI test
- multi-GPU test
Note: as opposed to `torch.DDP`, it seems that `tensorflow.MultiWorkerMirroredStrategy` requires several adjustments for multi-GPU-per-node execution to work in the Melissa framework. With TF, a specific GPU can be selected by setting the environment variable `CUDA_VISIBLE_DEVICES=GPU_ID` before running the server, as explained in the following threads (a minimal sketch follows the list):
- Tensorflow set CUDA_VISIBLE_DEVICES within jupyter,
- How do I select which GPU to run a job on?,
- How to set specific gpu in tensorflow?.
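For illustration, a minimal Python sketch of this first approach; the `gpu_id` value and its derivation from the server process rank are assumptions, not part of this MR:

```python
import os

# Hypothetical GPU index, e.g. derived from the MPI rank of the server process.
gpu_id = 0

# CUDA_VISIBLE_DEVICES must be set before TensorFlow initializes the GPUs,
# hence it is exported before the import.
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

import tensorflow as tf  # noqa: E402

# TensorFlow now only sees the selected GPU.
print(tf.config.list_physical_devices("GPU"))
```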
Another solution would be to use `tf.config.set_visible_devices` before defining the distributed strategy:

```python
physical_devices = tf.config.list_physical_devices('GPU')
# Expose only the GPU matching this server process rank.
tf.config.set_visible_devices(physical_devices[self.rank], 'GPU')
```
This would avoid having to hard-code the CUDA environment variable declaration before executing the server.

Edit: this was successfully tested on Jean-Zay and was therefore implemented.
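A minimal sketch of the implemented idea, under the assumption that each server process knows its rank; the `rank` argument and the function name below are illustrative, not the actual Melissa API:

```python
import tensorflow as tf


def select_gpu_and_build_strategy(rank: int) -> tf.distribute.MultiWorkerMirroredStrategy:
    """Pin this process to one GPU of the node, then declare the strategy."""
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        # One process per GPU: expose only the device matching the rank.
        tf.config.set_visible_devices(gpus[rank % len(gpus)], "GPU")
    # The strategy must be created after the visible devices are fixed
    # and before any other TensorFlow operation.
    return tf.distribute.MultiWorkerMirroredStrategy()
```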
Additional changes and remarks:

Since Jean-Zay prevents users from loading both `torch` and `tensorflow` at the same time, the following modifications were made:
- the `MelissaIterableDataset` class was subclassed into `TorchMelissaIterableDataset` and `TfMelissaIterableDataset` classes,
- because of the `torch` dependency induced by `SummaryWriter`, new `TorchTensorboardLogger` and `TfTensorboardLogger` classes were introduced (a sketch of the TensorFlow variant follows this list).
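As an illustration of the split, a hypothetical sketch of the TensorFlow-side logger built on `tf.summary`; the method names are assumptions, only the class name comes from this MR:

```python
import tensorflow as tf


class TfTensorboardLogger:
    """Scalar logging through tf.summary, without any torch dependency."""

    def __init__(self, logdir: str):
        self.writer = tf.summary.create_file_writer(logdir)

    def log_scalar(self, tag: str, value: float, step: int) -> None:
        # Write a single scalar under the given tag at the given step.
        with self.writer.as_default():
            tf.summary.scalar(tag, value, step=step)

    def close(self) -> None:
        self.writer.close()
```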
Finally, the `TensorboardLogger` was initialized before the distribution strategy, which is not allowed:

> Important: There is currently a TensorFlow limitation on declaring the strategy; it must be done before any other call to a TensorFlow operation.

Its initialization was therefore moved to the end of the `setup_environment` method of each DL server (see the sketch below).
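A minimal sketch of the resulting call order; the class and attribute names are illustrative, not the actual server code:

```python
import tensorflow as tf


class TfDLServer:
    def setup_environment(self, logdir: str = "./tb_logs") -> None:
        # The distribution strategy is declared first, before any other
        # TensorFlow call ...
        self.strategy = tf.distribute.MultiWorkerMirroredStrategy()
        # ... and the TensorBoard writer is only created afterwards,
        # at the end of the method.
        self.tb_writer = tf.summary.create_file_writer(logdir)
```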