ModelTrainer.prepare slow on large data sets
@pjobic and, I think, @jcury have reported long start-up times when running dnadna train, where it sits there and outputs nothing for a long time before producing output. We believe this is something in ModelTrainer.prepare().
@pjobic suspects the data loader, and in particular the step that guesses the correct filename. I agree that this is a possibility, but I think it is less likely, since !40 (merged) implemented many optimizations to this code, which is also used during pre-processing. It was in the context of pre-processing that this issue was first reported, and @jcury wrote that !40 solved the issue for him. Nevertheless, there could still be a corner case that is not handled.
I am able to reproduce the problem on a simulation of my own containing 2000000 samples, though in my case it is not so extreme: ModelTrainer.prepare() took 176.246s on 2000000 samples. But this is with a dataset on a local disk, not an NFS share (which may also be playing a part in the extreme delays @pjobic is seeing).
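
To narrow down where the time goes, one option is to run the call under Python's built-in profiler. The following is only a minimal sketch: prepare() and guess_filename() here are hypothetical stand-ins, not dnadna's actual code, and profile_call() is a helper name made up for illustration. Only cProfile, pstats, io and time are real (standard library) modules.

```python
import cProfile
import io
import pstats
import time


def guess_filename(index):
    # Hypothetical stand-in for per-sample filename resolution.
    return f"scenario_{index // 100:04d}/sample_{index:07d}.npz"


def prepare(n_samples):
    # Hypothetical stand-in for ModelTrainer.prepare(); NOT dnadna's real code.
    return [guess_filename(i) for i in range(n_samples)]


def profile_call(func, *args, top=10):
    """Run func under cProfile; return (result, elapsed seconds, report text)."""
    profiler = cProfile.Profile()
    start = time.perf_counter()
    result = profiler.runcall(func, *args)
    elapsed = time.perf_counter() - start

    # Collect the report into a string, sorted by cumulative time so the
    # most expensive call chains appear first.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, elapsed, buffer.getvalue()


if __name__ == "__main__":
    files, elapsed, report = profile_call(prepare, 20000)
    print(f"prepare() stand-in took {elapsed:.3f}s for {len(files)} samples")
    print(report)
```

Wrapping the real ModelTrainer.prepare() call the same way, once on a local disk and once on the NFS share, should show whether the filename guessing or the file system access dominates.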