Add a `--resume` feature (!178) · Merge requests · Machine learning for population genetics / private / dnadna

REGAN Cyril requested to merge resume into master Oct 24, 2023

This feature makes it possible to resume a training run (for example after an unplanned shutdown or a Ctrl+C interruption), and to start from a previously trained model for fine-tuning or transfer learning.

The feature is documented in the command-line help:

dnadna train --help
(...)
--overwrite           overwrite run (otherwise, create a new run), if -r is not defined, overwrites last execution
--resume RESUME       load a trained model; `RESUME` is the path of the `.pth` file.
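
For readers who want to picture how these two options fit together, here is a minimal `argparse` sketch. It is not the actual dnadna parser: the `build_parser` helper and the `train-sketch` program name are hypothetical, and only the two flags and their help strings are taken from the output above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two options shown in the help text."""
    parser = argparse.ArgumentParser(prog="train-sketch")
    parser.add_argument("config", help="path to the training config file")
    parser.add_argument(
        "--overwrite",
        action="store_true",
        help="overwrite run (otherwise, create a new run)",
    )
    parser.add_argument(
        "--resume",
        metavar="RESUME",
        default=None,
        help="load a trained model; RESUME is the path of the .pth file",
    )
    return parser


# Parse an example command line (paths shortened for illustration).
args = build_parser().parse_args(
    ["my_model_training_config.yml", "--resume", "my_model_run_072_epoch1_net.pth"]
)
```

In this sketch `--resume` defaults to `None`, so a plain `train` call without the flag would start from scratch.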

Example of a Ctrl+C interruption:

dnadna train my_model/my_model_training_config.yml
(...)

2023-10-24 17:06:56;     INFO;  Process ID: 64290
2023-10-24 17:06:56;     INFO;  Preparing training run
2023-10-24 17:06:56;     INFO;  Initializing dataset...
2023-10-24 17:06:56;     INFO;  20 samples in the validation set and 20 in the training set
2023-10-24 17:06:56;     INFO;  inferred parameters for CustomCNN: n_snp=500, n_indiv=50, concat=True
2023-10-24 17:06:56;     INFO;  Start training
2023-10-24 17:06:56;     INFO;  Networks states are saved after each validation step
2023-10-24 17:06:56;  WARNING;  Current behavior if SNP matrices have different shapes: padding with -1 (right and bottom) to fit the maximum dimension within each batch.
2023-10-24 17:06:56;     INFO;  Starting Epoch #1
2023-10-24 17:06:58;     INFO;  Validation at epoch: 1 and batch: 1                                                                                                                                         
2023-10-24 17:06:58;     INFO;  Compute all outputs for validation dataset...                                                                                                                               
2023-10-24 17:06:58;     INFO;  Done                                                                                                                                                                        
2023-10-24 17:06:58;     INFO;  training loss = 0.9783284068107605 // validation loss = 1.1616548299789429                                                                                                  
2023-10-24 17:06:58;     INFO;  Better loss found on validation set: None --> 1.1616548299789429                                                                                                            
2023-10-24 17:06:58;     INFO;  Saving model to ".../dnadna/my_model/run_072/my_model_run_072_best_net.pth" ...                                                    
2023-10-24 17:06:58;     INFO;  Starting Epoch #2                                                                                                                                                           
^Cpoch 2/6:  17%|█████████████████████████▊                    | 1/6 [00:02<00:10,  2.17s/batch]
2023-10-24 17:07:00;    ERROR;  Training stopped due to keyboard interrupt and will attempt to shut down gracefully
2023-10-24 17:07:00;     INFO;  Saving model to ".../dnadna/my_model/run_072/my_model_run_072_epoch1_net.pth" ...                                                  
2023-10-24 17:07:00;    ERROR;  The model checkpoint is saved here:.../dnadna/my_model/run_072/my_model_run_072_epoch1_net.pth                                     
2023-10-24 17:07:00;    ERROR;  This training run can be resumed from the last checkpoint with:                                                                                                             
2023-10-24 17:07:00;    ERROR;                                                                                                                                                                              
2023-10-24 17:07:00;    ERROR;      dnadna train .../dnadna/my_model/my_model_training_config.yml --resume .../dnadna/my_model/run_072/my_model_run_072_epoch1_net.pth
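
The graceful shutdown seen in this log can be pictured with the sketch below. It is only an illustration under stated assumptions, not the code from this merge request: the `train` and `fake_epoch` functions are hypothetical, and a small JSON file stands in for the real `.pth` checkpoint. The checkpoint is labelled with the last completed epoch, which matches the `epoch1` file name written when epoch 2 was interrupted.

```python
import json
import tempfile
from pathlib import Path


def train(n_epochs, run_dir, run_epoch, start_epoch=1):
    """Run epochs; on Ctrl+C, save a checkpoint and return its path."""
    run_dir = Path(run_dir)
    last_completed = start_epoch - 1
    try:
        for epoch in range(start_epoch, n_epochs + 1):
            run_epoch(epoch)        # may be interrupted at any time
            last_completed = epoch  # only updated once the epoch finishes
    except KeyboardInterrupt:
        # Label the checkpoint with the last epoch that fully completed.
        ckpt = run_dir / f"epoch{last_completed}_net.json"
        ckpt.write_text(json.dumps({"epoch": last_completed}))
        print(f"Training interrupted; resume with: --resume {ckpt}")
        return ckpt
    return None


def fake_epoch(epoch):
    """Stand-in for one epoch; simulates Ctrl+C during epoch 2."""
    if epoch == 2:
        raise KeyboardInterrupt


with tempfile.TemporaryDirectory() as tmp:
    ckpt = train(6, tmp, fake_epoch)
    ckpt_name = ckpt.name
    saved = json.loads(ckpt.read_text())
```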

Training can then be resumed with the command given in the ERROR log:

dnadna train .../dnadna/my_model/my_model_training_config.yml --resume .../dnadna/my_model/run_072/my_model_run_072_epoch1_net.pth

2023-10-24 17:10:35;     INFO;  Process ID: 65104
2023-10-24 17:10:35;     INFO;  Preparing training run
2023-10-24 17:10:35;     INFO;  Initializing dataset...
2023-10-24 17:10:35;     INFO;  20 samples in the validation set and 20 in the training set
2023-10-24 17:10:35;     INFO;  inferred parameters for CustomCNN: n_snp=500, n_indiv=50, concat=True
2023-10-24 17:10:35;     INFO;  Start training
2023-10-24 17:10:35;     INFO;  Networks states are saved after each validation step
2023-10-24 17:10:35;  WARNING;  Current behavior if SNP matrices have different shapes: padding with -1 (right and bottom) to fit the maximum dimension within each batch.
2023-10-24 17:10:35;     INFO;  Starting Epoch #2
epoch 2/6:   0%|
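
The resumed run starts at epoch 2 because the checkpoint corresponds to the end of epoch 1. One way to picture that bookkeeping is the sketch below. It assumes, purely for illustration, that the epoch is read from the `_epoch<N>_net.pth` file name; the real implementation may just as well store the epoch inside the checkpoint. The `next_epoch_from_checkpoint` helper is hypothetical.

```python
import re
from pathlib import PurePosixPath


def next_epoch_from_checkpoint(path):
    """Return the epoch to resume at, or None if the name carries no epoch.

    Assumes file names ending in `_epoch<N>_net.pth`, as in the log above.
    """
    name = PurePosixPath(path).name
    match = re.search(r"_epoch(\d+)_net\.pth$", name)
    if match is None:
        return None
    return int(match.group(1)) + 1


start = next_epoch_from_checkpoint(
    ".../dnadna/my_model/run_072/my_model_run_072_epoch1_net.pth"
)
```

With the checkpoint from the example, `start` is 2. A file such as `my_model_run_072_best_net.pth` has no epoch in its name, so the helper returns `None` and the caller would have to decide where to start.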

Edited Oct 24, 2023 by REGAN Cyril
