Add a `--resume` feature
This feature makes it possible to continue a training run (for example after an unplanned shutdown or a Ctrl+C interruption), or to start from an already trained model for fine-tuning or transfer learning.
The feature is documented in the command's help:
dnadna train --help
(...)
--overwrite overwrite run (otherwise, create a new run), if -r is not defined, overwrites last execution
--resume RESUME load a trained model; `RESUME` is the path of the `.pth` file.
Example of a Ctrl+C interruption:
dnadna train my_model/my_model_training_config.yml
(...)
2023-10-24 17:06:56; INFO; Process ID: 64290
2023-10-24 17:06:56; INFO; Preparing training run
2023-10-24 17:06:56; INFO; Initializing dataset...
2023-10-24 17:06:56; INFO; 20 samples in the validation set and 20 in the training set
2023-10-24 17:06:56; INFO; inferred parameters for CustomCNN: n_snp=500, n_indiv=50, concat=True
2023-10-24 17:06:56; INFO; Start training
2023-10-24 17:06:56; INFO; Networks states are saved after each validation step
2023-10-24 17:06:56; WARNING; Current behavior if SNP matrices have different shapes: padding with -1 (right and bottom) to fit the maximum dimension within each batch.
2023-10-24 17:06:56; INFO; Starting Epoch #1
2023-10-24 17:06:58; INFO; Validation at epoch: 1 and batch: 1
2023-10-24 17:06:58; INFO; Compute all outputs for validation dataset...
2023-10-24 17:06:58; INFO; Done
2023-10-24 17:06:58; INFO; training loss = 0.9783284068107605 // validation loss = 1.1616548299789429
2023-10-24 17:06:58; INFO; Better loss found on validation set: None --> 1.1616548299789429
2023-10-24 17:06:58; INFO; Saving model to ".../dnadna/my_model/run_072/my_model_run_072_best_net.pth" ...
2023-10-24 17:06:58; INFO; Starting Epoch #2
^Cpoch 2/6: 17%|█████████████████████████▊ | 1/6 [00:02<00:10, 2.17s/batch]
2023-10-24 17:07:00; ERROR; Training stopped due to keyboard interrupt and will attempt to shut down gracefully
2023-10-24 17:07:00; INFO; Saving model to ".../dnadna/my_model/run_072/my_model_run_072_epoch1_net.pth" ...
2023-10-24 17:07:00; ERROR; The model checkpoint is saved here:.../dnadna/my_model/run_072/my_model_run_072_epoch1_net.pth
2023-10-24 17:07:00; ERROR; This training run can be resumed from the last checkpoint with:
2023-10-24 17:07:00; ERROR;
2023-10-24 17:07:00; ERROR; dnadna train .../dnadna/my_model/my_model_training_config.yml --resume .../dnadna/my_model/run_072/my_model_run_072_epoch1_net.pth
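The graceful shutdown visible in this log can be illustrated with a minimal sketch. This is not dnadna's implementation: the function `train`, the `step` callback, the state dictionary, and the JSON checkpoint format are all hypothetical stand-ins (dnadna itself writes `.pth` files). The sketch only shows the general pattern of catching `KeyboardInterrupt`, saving a checkpoint named after the last completed epoch, and reporting where it was saved.

```python
import json
from pathlib import Path


def train(run_dir, n_epochs, step, state=None):
    """Run `step(epoch, state)` once per epoch and checkpoint on Ctrl+C.

    Hypothetical helper, not part of dnadna. Returns (state, checkpoint_path);
    checkpoint_path is None when training ran to completion.
    """
    state = state or {"epoch": 0, "weights": [0.0]}
    try:
        for epoch in range(state["epoch"] + 1, n_epochs + 1):
            step(epoch, state)      # may be interrupted by Ctrl+C
            state["epoch"] = epoch  # only fully completed epochs are counted
    except KeyboardInterrupt:
        # Name the checkpoint after the last completed epoch.
        path = Path(run_dir) / f"epoch{state['epoch']}_checkpoint.json"
        path.write_text(json.dumps(state))
        print(f"Training interrupted; checkpoint saved to {path}")
        return state, path
    return state, None
```

As in the log, an interruption during epoch 2 produces a checkpoint labelled with epoch 1, because epoch 2 did not finish.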
Training can then be resumed with the command given in the ERROR log:
dnadna train .../dnadna/my_model/my_model_training_config.yml --resume .../dnadna/my_model/run_072/my_model_run_072_epoch1_net.pth
2023-10-24 17:10:35; INFO; Process ID: 65104
2023-10-24 17:10:35; INFO; Preparing training run
2023-10-24 17:10:35; INFO; Initializing dataset...
2023-10-24 17:10:35; INFO; 20 samples in the validation set and 20 in the training set
2023-10-24 17:10:35; INFO; inferred parameters for CustomCNN: n_snp=500, n_indiv=50, concat=True
2023-10-24 17:10:35; INFO; Start training
2023-10-24 17:10:35; INFO; Networks states are saved after each validation step
2023-10-24 17:10:35; WARNING; Current behavior if SNP matrices have different shapes: padding with -1 (right and bottom) to fit the maximum dimension within each batch.
2023-10-24 17:10:35; INFO; Starting Epoch #2
epoch 2/6: 0%|
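The resumed run starts directly at epoch 2, the epoch after the one stored in the checkpoint. A minimal sketch of that logic follows, under stated assumptions: `load_checkpoint`, `resume`, the `step` callback, and the JSON checkpoint with an `epoch` field are hypothetical and only illustrate the idea. The actual `--resume` option loads a `.pth` file, whose internal layout is not described here.

```python
import json
from pathlib import Path


def load_checkpoint(path):
    """Return (state, first_epoch_to_run) from a JSON checkpoint (hypothetical format)."""
    state = json.loads(Path(path).read_text())
    # Resume at the epoch following the last completed one.
    return state, state["epoch"] + 1


def resume(path, n_epochs, step):
    """Continue training from the checkpoint at `path` up to `n_epochs`."""
    state, start = load_checkpoint(path)
    for epoch in range(start, n_epochs + 1):
        step(epoch, state)
        state["epoch"] = epoch
    return state
```

With a checkpoint recording epoch 1 and `n_epochs=6`, `step` is called for epochs 2 through 6, which matches the `Starting Epoch #2` line in the log.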
Edited by REGAN Cyril