Makes progress on issue #65 (!64) · Merge requests · Machine learning for population genetics / private / dnadna

E Madison Bray requested to merge embray/issue-65 into master Mar 22, 2021

This eliminates the worst pain points when generating the integer indices for the PyTorch Dataset from the scenario parameters table.

It's still slower than I would like, but it is now several times faster than before. In particular, I changed the training_series argument to DNATrainingDataset to just "training_set", which takes a normal Python set containing the scenario_idx for each scenario used in the training set. This can be generated quickly with a one-liner from the Pandas dataframe (and this will be reworked later when we work on #14, better management of training/validation/test sets).
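As a rough sketch of the one-liner mentioned above: the table layout here is an assumption, with a scenario_idx column and a hypothetical split column marking which scenarios belong to the training set. The real scenario parameters table may mark training scenarios differently.

```python
import pandas as pd

# Toy scenario parameters table. The "split" column is hypothetical and
# only stands in for however training scenarios are actually marked.
scenario_params = pd.DataFrame({
    "scenario_idx": [0, 1, 2, 3],
    "split": ["train", "validation", "train", "train"],
})

# The one-liner: a plain Python set of the scenario_idx values used for training.
training_set = set(
    scenario_params.loc[scenario_params["split"] == "train", "scenario_idx"]
)
```

Membership tests against a set are constant time on average, which is presumably why a set is a good fit when checking each scenario while building the index.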

It also changes the meaning of the ignore_missing argument to DNADataset.

Now, if ignore_missing=True, it no longer checks up front whether each sample is missing. Instead, when it tries to load a missing/corrupt sample, it converts the exception to a warning and returns None.

This means that when indexing a DNADataset like dataset[idx] it might return None for the sample, meaning it couldn't be loaded. So any code that uses it needs to be able to account for the possibility that a sample will be missing (see e.g. the changes to collate_batch).

This defers checking whether the file exists until the file is actually opened. This is better in a way: in addition to checking whether the file exists, it also checks whether the file can be read (e.g. in case it's corrupt), and it handles that case with a warning instead of crashing the training run.
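To illustrate the new ignore_missing behaviour and what callers have to do about it, here is a minimal pure-Python sketch. ToyDataset and collate_skip_missing are hypothetical stand-ins, not the actual DNADataset or collate_batch code, and the use of OSError as the load failure is an assumption.

```python
import warnings


class ToyDataset:
    """Hypothetical stand-in for a dataset with ignore_missing semantics."""

    def __init__(self, samples, ignore_missing=False):
        # A None entry simulates a sample whose file is missing or corrupt.
        self.samples = samples
        self.ignore_missing = ignore_missing

    def __len__(self):
        return len(self.samples)

    def _load(self, idx):
        sample = self.samples[idx]
        if sample is None:
            raise OSError(f"sample {idx} is missing or corrupt")
        return sample

    def __getitem__(self, idx):
        try:
            return self._load(idx)
        except OSError as exc:
            if not self.ignore_missing:
                raise
            # Convert the exception to a warning and signal "no sample".
            warnings.warn(f"skipping sample {idx}: {exc}")
            return None


def collate_skip_missing(batch):
    """Drop samples that could not be loaded before building the batch."""
    return [sample for sample in batch if sample is not None]
```

With ignore_missing=True, indexing a bad sample yields None plus a warning, and the collate step filters it out. With ignore_missing=False the original exception propagates as before.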

Also added a log message when starting training that the dataset is being initialized, so that it at least does not appear to be doing nothing...
