ModelTrainer.prepare slow on large data sets

@pjobic and, I think, @jcury have reported long start-up times when running dnadna train: it sits there for a long time without printing anything before it finally starts producing output.

We believe the delay occurs somewhere in ModelTrainer.prepare().

@pjobic suspects the data loader, and in particular the step that guesses the correct filename. I agree that this is a possibility, but I think it is less likely, since !40 (merged) implemented many optimizations to that code, and the same code is also used during pre-processing. It was in the context of pre-processing that this issue was first reported, and @jcury wrote that !40 solved the issue for him. Nevertheless, there could still be a corner case that is not handled.

I am able to reproduce the problem on a simulation of my own containing 2000000 samples, although in my case it is not so extreme: ModelTrainer.prepare() took 176.246s on those 2000000 samples. This was with the dataset on a local disk rather than an NFS share, which may also be contributing to the extreme delays @pjobic is seeing.
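To narrow down where the time goes (data loader, filename guessing, or something else), one option is to run the call under the standard-library profiler. The following is only a minimal sketch: `profile_call` is a hypothetical helper written for this issue, not part of dnadna, and it makes no assumptions about how a ModelTrainer is constructed.

```python
import cProfile
import io
import pstats
import time


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile.

    Returns (result, elapsed_seconds, report), where report is a text
    table of the `top` entries sorted by cumulative time.
    Hypothetical helper for this issue; not part of dnadna.
    """
    profiler = cProfile.Profile()
    start = time.perf_counter()
    result = profiler.runcall(func, *args, **kwargs)
    elapsed = time.perf_counter() - start

    # Render the statistics into a string instead of printing to stdout
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, elapsed, buf.getvalue()
```

Given an already-constructed trainer, something like `_, elapsed, report = profile_call(trainer.prepare)` followed by `print(report)` should show which functions dominate the cumulative time (this assumes prepare() needs no extra arguments; any it does need can be passed through). Comparing the report for a local disk against one for an NFS share might also show whether file-system calls account for the difference.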
