TensorFlow server checkpointing

SCHOULER Marc requested to merge checkpoint-tf into develop

This MR implements the missing pieces of TensorFlow server checkpointing.

Tests were run locally (sequential training) and on Jean-Zay (parallel training with 2 GPUs). To this end, the following test-only lines were added to tf_heatpde_dl_server.py; they write a checkpoint and then kill the server process:

self.checkpoint(batch + 1)
logger.debug('Checkpointed')
# testing changes: kill the server right after the checkpoint
import os
os.kill(os.getpid(), 4)  # signal 4 is SIGILL on Linux
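
The same checkpoint-then-kill procedure can be reproduced outside the server. The following is a minimal, standard-library-only sketch, not the code from this MR: the worker script, the JSON checkpoint file and the names `WORKER`, `run_worker` and `crash_and_resume` are all hypothetical stand-ins for the server and its `checkpoint` method. It assumes a POSIX system, where a process killed by a signal reports a negative return code.

```python
import json
import os
import signal
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the server: resume from the checkpoint file if it
# exists, process batches, checkpoint after each one, and optionally kill
# itself right after a given checkpoint (as in the test lines above).
WORKER = r"""
import json, os, signal, sys

path, kill_at, n_batches = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
state = {"batch": 0, "total": 0}
if os.path.exists(path):
    with open(path) as f:
        state = json.load(f)

for batch in range(state["batch"], n_batches):
    state["total"] += batch          # placeholder for a training step
    state["batch"] = batch + 1
    tmp = path + ".tmp"
    with open(tmp, "w") as f:        # write to a temp file, then rename,
        json.dump(state, f)          # so a crash never leaves a partial file
    os.replace(tmp, path)
    if state["batch"] == kill_at:
        os.kill(os.getpid(), signal.SIGILL)
"""


def run_worker(path, kill_at, n_batches):
    """Run the worker in a child process and return its exit code."""
    cmd = [sys.executable, "-c", WORKER, path, str(kill_at), str(n_batches)]
    return subprocess.run(cmd).returncode


def crash_and_resume(n_batches=5, kill_at=3):
    """Crash the worker after `kill_at` batches, then restart it to completion."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "checkpoint.json")
        first = run_worker(path, kill_at, n_batches)
        with open(path) as f:
            after_crash = json.load(f)
        second = run_worker(path, -1, n_batches)  # -1: never self-kill
        with open(path) as f:
            final = json.load(f)
    return first, after_crash, second, final
```

Calling `crash_and_resume()` returns the exit code of the crashed run, the state found in the checkpoint after the crash, the exit code of the restarted run, and the final state, so one can check that the restart picks up at the checkpointed batch instead of starting over.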

The Slurm configuration and batch files are both attached:

study_sg.sh

config_slurm.json

Edited by SCHOULER Marc
