TensorFlow server checkpointing
This MR implements the missing pieces of checkpointing for the TensorFlow server.
Tests were performed locally (sequential training) and on Jean-Zay (parallel training with 2 GPUs).
To this end, the following lines were added to tf_heatpde_dl_server.py to force a crash right after a checkpoint:
self.checkpoint(batch + 1)
logger.debug('Checkpointed')
# testing changes: kill the server process right after the checkpoint
import os
os.kill(os.getpid(), 4)  # signal 4 is SIGILL on Linux
The Slurm config and batch files are both attached.
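For reviewers who want to reproduce the crash-and-resume check without the full server, the idea can be sketched in a standalone script. This is only an illustration under stated assumptions, not the code of tf_heatpde_dl_server.py: the worker, the JSON checkpoint file, and the names WORKER, run_worker and crash_and_resume are all hypothetical stand-ins, and the checkpoint stores nothing but a batch counter.

```python
import json
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical stand-in for the server's training loop (not the real server):
# it checkpoints the batch counter to a JSON file and, when the counter
# reaches crash_at, kills itself right after the checkpoint was written.
WORKER = textwrap.dedent("""
    import json, os, signal, sys

    path, crash_at, n_batches = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])

    # Resume from the last checkpoint if one exists, otherwise start at 0.
    start = 0
    if os.path.exists(path):
        with open(path) as f:
            start = json.load(f)["batch"]

    for batch in range(start, n_batches):
        # ... a training step would run here ...
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"batch": batch + 1}, f)
        os.replace(tmp, path)  # atomic rename: never leaves a half-written file
        if batch + 1 == crash_at:
            os.kill(os.getpid(), signal.SIGILL)  # same fault injection as in the MR
""")


def run_worker(path, crash_at, n_batches):
    """Run the worker in a child process and return its exit code."""
    cmd = [sys.executable, "-c", WORKER, path, str(crash_at), str(n_batches)]
    return subprocess.run(cmd).returncode


def crash_and_resume(n_batches=5, crash_at=2):
    """Crash the worker after crash_at batches, then restart it to completion."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "ckpt.json")
        first = run_worker(path, crash_at, n_batches)
        with open(path) as f:
            after_crash = json.load(f)["batch"]
        # crash_at=-1 never matches a batch index, so this run is not killed.
        second = run_worker(path, -1, n_batches)
        with open(path) as f:
            after_resume = json.load(f)["batch"]
    return first, after_crash, second, after_resume


if __name__ == "__main__":
    print(crash_and_resume())
```

A non-zero exit code for the first run, a checkpoint at batch 2 after the crash, and a checkpoint at batch 5 after the restart indicate that the second run picked up from the saved state rather than from scratch.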
Edited by SCHOULER Marc