Data loader freeze with num_workers > 0
During training we observe sporadic latency spikes (see profiler screenshot): some batches take between 2x and 10x longer than usual, making the overall training roughly 2x slower. This is confirmed by the timings displayed by tqdm; the reported speed fluctuates heavily, and the "time remaining" estimate is consistently wrong. The profiler points to a communication issue between the workers (DataLoader worker processes).
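To reproduce and quantify the spikes independently of tqdm's smoothed throughput, a minimal sketch like the following can time each batch fetch directly. The dataset, batch size, and worker count below are placeholders rather than the actual Jean Zay configuration, and `persistent_workers=True` is included only as a setting worth testing, not as a confirmed fix.

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

# Hypothetical dataset standing in for the real training data;
# the actual pipeline used in these tests is not shown in this report.
class DummyDataset(Dataset):
    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 10

loader = DataLoader(
    DummyDataset(),
    batch_size=64,
    num_workers=8,           # the freeze only appears with num_workers > 0
    pin_memory=True,
    persistent_workers=True, # keeps workers alive across epochs; worth testing as a mitigation
)

# Time each batch fetch to surface the sporadic 2x-10x latency spikes.
times = []
it = iter(loader)
for step in range(100):
    t0 = time.perf_counter()
    batch = next(it)
    times.append(time.perf_counter() - t0)

median = sorted(times)[len(times) // 2]
spikes = [(i, round(t * 1e3, 1)) for i, t in enumerate(times) if t > 2 * median]
print(f"median batch fetch: {median * 1e3:.1f} ms")
print(f"spikes (>2x median), as (step, ms): {spikes}")
```

Running the same measurement with `num_workers=0` as a baseline should help isolate whether the stalls really come from inter-process communication in the loader.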
These tests were run on the Jean Zay cluster; the issue still needs to be confirmed on other GPU configurations.