TensorFlow server checkpointing

SCHOULER Marc requested to merge checkpoint-tf into develop

This MR implements the missing pieces of the TensorFlow server checkpointing.

Tests were performed locally (sequential training) and on Jean-Zay (parallel training with 2 GPUs). To this end, the following lines were added to

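self.checkpoint(batch + 1)
# testing changes: kill the server process right after the checkpoint is written
import os
os.kill(os.getpid(), 4)  # signal 4 is SIGILL on Linux; simulates a crash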

The Slurm config and batch files are both attached.
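
The test procedure described above (checkpoint, kill the process, restart and resume) can be sketched in a self-contained way. This is only an illustration under stated assumptions, not the code of this MR: `run_training`, `crash_and_resume`, `SimulatedCrash` and the JSON checkpoint file are hypothetical names introduced here, and the crash is simulated with an exception instead of `os.kill` so that the sketch runs in a single process.

```python
import json
import os
import tempfile


class SimulatedCrash(Exception):
    """Hypothetical stand-in for the os.kill call used in the real test."""


def run_training(ckpt_path, n_batches, crash_at=None):
    """Toy training loop that resumes from ckpt_path if it exists.

    A checkpoint is written after every batch. Returns the list of
    batch indices processed during this run.
    """
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["next_batch"]

    processed = []
    for batch in range(start, n_batches):
        processed.append(batch)  # placeholder for the actual training step
        # Write to a temporary file, then rename it, so that a crash
        # cannot leave a half-written checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_batch": batch + 1}, f)
        os.replace(tmp_path, ckpt_path)
        if crash_at is not None and batch + 1 == crash_at:
            raise SimulatedCrash(f"crashed after checkpointing batch {batch + 1}")
    return processed


def crash_and_resume(n_batches=5, crash_at=2):
    """Run once with a simulated crash, then restart from the checkpoint."""
    with tempfile.TemporaryDirectory() as workdir:
        ckpt_path = os.path.join(workdir, "checkpoint.json")
        try:
            run_training(ckpt_path, n_batches, crash_at=crash_at)
        except SimulatedCrash:
            pass  # the first run is expected to die here
        # The second run should skip every batch that was checkpointed.
        return run_training(ckpt_path, n_batches)
```

With the defaults, the first run stops after checkpointing batch 2 and the restarted run processes only the remaining batches, so `crash_and_resume()` returns `[2, 3, 4]`.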


Edited by SCHOULER Marc
