Implement new dataset splits configuration
Currently the split names are hard-coded: each must be one of "training", "validation", "test", or "unused".
Of these, currently only "training" and "validation" are required, because they are the only two used in the code. "Test" can be used for a test set, but we don't currently use that, so it is optional. We have not (to my recollection) discussed whether there are plans for explicitly doing something with the test set (e.g. the dnadna predict command could have an option to run against the test set).
"Unused" just means that some scenarios are set aside for an otherwise unspecified reason. If the ratios of the splits add up to less than 1, the remaining scenarios go to "unused" by default (generally this is not a desirable situation, but we allow it, and just log a warning).
If the ratios add up to greater than 1, this is an error.
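To make the current rules concrete, here is a minimal sketch of how the ratio handling described above could look. It is not the actual dnadna code: the function name `resolve_split_ratios`, the constants, and the exact error messages are hypothetical, and rejecting unknown or missing split names with an error is my assumption about what "hard-coded" and "required" should mean in practice.

```python
import logging
import math

log = logging.getLogger(__name__)

# Split names as described above; these constants are illustrative only.
KNOWN_SPLITS = ("training", "validation", "test", "unused")
REQUIRED_SPLITS = ("training", "validation")


def resolve_split_ratios(ratios):
    """Validate a {split name: ratio} mapping and return a completed copy.

    Hypothetical helper: any remainder below 1 is assigned to "unused"
    with a warning, and a total above 1 is rejected as an error.
    """
    # Assumption: names outside the hard-coded set are rejected.
    unknown = sorted(set(ratios) - set(KNOWN_SPLITS))
    if unknown:
        raise ValueError(f"unknown split name(s): {unknown}")

    # Assumption: a missing required split is an error.
    missing = [name for name in REQUIRED_SPLITS if name not in ratios]
    if missing:
        raise ValueError(f"missing required split(s): {missing}")

    total = math.fsum(ratios.values())
    # Treat totals within floating-point tolerance of 1 as exactly 1.
    sums_to_one = math.isclose(total, 1.0)

    if total > 1 and not sums_to_one:
        raise ValueError(f"split ratios add up to {total}, which is greater than 1")

    resolved = dict(ratios)
    if not sums_to_one:
        remainder = 1.0 - total
        log.warning(
            "split ratios add up to %s; assigning the remaining %s to 'unused'",
            total,
            remainder,
        )
        resolved["unused"] = resolved.get("unused", 0.0) + remainder
    return resolved
```

For example, `resolve_split_ratios({"training": 0.7, "validation": 0.2})` would log a warning and return a mapping with an extra "unused" entry of roughly 0.1, while `{"training": 0.8, "validation": 0.3}` would raise a `ValueError`.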