Improved interrupt handling for dnadna train
Previously, trying to interrupt dnadna train (e.g. with Ctrl-C) could often take several tries, and would result in a bunch of messy tracebacks (often overlapping each other due to tracebacks from interrupted worker processes).
This change handles some interrupts more cleanly.
In particular, pressing Ctrl-C does two things:
- Rather than immediately interrupting the training, it simply pauses it. Pressing Enter resumes the training, and pressing Ctrl-C again cancels it.
- When canceling the training, it attempts to shut down gracefully: if in a validation pass, it interrupts the validation; if in a training pass, it interrupts the training loop and tries to exit cleanly. A clean exit is not 100% guaranteed, since not all code is interrupt-safe, but non-clean interrupts should be rarer.
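The pause-then-cancel behavior described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `PauseOnInterrupt`, `install`, `run_training`, `train_step`, and `prompt` are hypothetical names, and the sketch assumes the loop only checks for a pause request between batches.

```python
import signal


class PauseOnInterrupt:
    """Hypothetical helper: first Ctrl-C pauses, Ctrl-C while paused cancels."""

    def __init__(self):
        self.paused = False

    def handler(self, signum, frame):
        if self.paused:
            # A pause was already requested: treat this interrupt as "cancel".
            raise KeyboardInterrupt
        # First interrupt: only record the request; the loop acts on it
        # at the next safe point (between batches).
        self.paused = True


def install(state):
    """Install the handler for SIGINT (must be called from the main thread)."""
    return signal.signal(signal.SIGINT, state.handler)


def run_training(batches, train_step, state, prompt=input):
    """Run train_step on each batch; return the number of batches completed."""
    completed = 0
    try:
        for batch in batches:
            if state.paused:
                # Blocks until Enter; a Ctrl-C here raises KeyboardInterrupt.
                prompt("Paused. Press Enter to resume, Ctrl-C to cancel: ")
                state.paused = False
            train_step(batch)
            completed += 1
    except KeyboardInterrupt:
        print(f"Cancelled after {completed} batches; exiting cleanly.")
    return completed
```

Deferring the pause to a point between batches is what keeps the first Ctrl-C from tearing through arbitrary code; only the second one raises.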
In a follow-up, we could also choose to save a checkpoint right when interrupted.
Likewise, trying to terminate the process will attempt a clean shutdown.
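One common way to get a clean shutdown on termination is to turn the signal into an exception so the usual cleanup path runs. The sketch below assumes termination arrives as SIGTERM and uses hypothetical names (`TerminationRequested`, `raise_on_sigterm`, `run_with_clean_shutdown`); it is not necessarily how this PR implements it.

```python
import signal


class TerminationRequested(Exception):
    """Hypothetical exception raised when a termination signal arrives."""


def raise_on_sigterm(signum, frame):
    # Convert the signal into an exception so try/finally blocks run.
    raise TerminationRequested(f"received signal {signum}")


def install():
    """Install the handler for SIGTERM (must be called from the main thread)."""
    return signal.signal(signal.SIGTERM, raise_on_sigterm)


def run_with_clean_shutdown(work, cleanup):
    """Run work(); always call cleanup(). Return False if terminated early."""
    try:
        work()
        return True
    except TerminationRequested:
        return False
    finally:
        cleanup()
```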