Tensorflow server checkpointing (!123) · Merge requests · melissa / Melissa

SCHOULER Marc requested to merge checkpoint-tf into develop May 15, 2023

This MR implements the missing pieces of TensorFlow server checkpointing.

Tests were performed locally (sequential training) and on Jean-Zay (parallel training with 2 GPUs). To this end, the following lines were added to tf_heatpde_dl_server.py so that the server kills itself right after writing a checkpoint:

self.checkpoint(batch + 1)
logger.debug('Checkpointed')
# testing changes: kill this process with signal 4 (SIGILL on Linux) to simulate a crash
import os
os.kill(os.getpid(), 4)
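For reviewers who want to reproduce the crash-then-resume logic without TensorFlow or a cluster, here is a minimal standalone sketch of the same test pattern. It is not the Melissa implementation: `save_checkpoint`, `load_checkpoint`, `train`, `run_crash_and_resume` and the JSON checkpoint file are hypothetical names chosen for illustration, the "weights" are a single float, and the crash is simulated with an exception rather than `os.kill` so that the example can run inside one process.

```python
import json
import os
import tempfile


class SimulatedCrash(RuntimeError):
    """Stands in for the os.kill call used in the real test (assumption)."""


def save_checkpoint(path, next_batch, weights):
    # Write to a temporary file first, then rename it, so that a crash
    # in the middle of a write never leaves a truncated checkpoint behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_batch": next_batch, "weights": weights}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    # Return (next_batch, weights); start from scratch if no checkpoint exists.
    if not os.path.exists(path):
        return 0, 0.0
    with open(path) as f:
        data = json.load(f)
    return data["next_batch"], data["weights"]


def train(path, n_batches, crash_after=None):
    # Resume from the last checkpoint, if any.
    first_batch, weights = load_checkpoint(path)
    for batch in range(first_batch, n_batches):
        weights += 1.0  # placeholder for one training step
        save_checkpoint(path, batch + 1, weights)
        if crash_after is not None and batch + 1 == crash_after:
            raise SimulatedCrash(f"crashed after checkpoint {batch + 1}")
    return weights


def run_crash_and_resume(n_batches=5, crash_after=3):
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "checkpoint.json")
        try:
            train(path, n_batches, crash_after=crash_after)
        except SimulatedCrash:
            pass  # the "server" died right after checkpointing
        resumed_from, _ = load_checkpoint(path)
        final_weights = train(path, n_batches)  # restart picks up the checkpoint
    return resumed_from, final_weights
```

Calling `run_crash_and_resume()` crashes the first run after the third checkpoint, then restarts and finishes the remaining batches from the saved state instead of starting over, which is the behaviour the tests above check for the real server.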

The Slurm config and batch files are both attached:

study_sg.sh

config_slurm.json

Edited May 24, 2023 by SCHOULER Marc
