Major refactoring of config format and init/preprocess/train commands in response to #68 (!74)

E Madison Bray requested to merge embray/refactoring/issue-68-1 into master Apr 08, 2021

This makes major changes to how the config files are structured as well as how the commands that use them work. Since all of these changes are linked to each other, it's almost impossible to make them piecemeal, hence the huge diff.

Here is an overview of the changes:

Config file changes

dataset/simulation config: essentially unchanged from the previous format; the only difference is that generated config files also note the dnadna version that generated them.
preprocessing config: this is a new file that did not previously exist on its own (although the existing config format would have previously enabled splitting preprocessing/training configs if desired).

As discussed, this contains just the essential settings for running preprocessing, including the dataset config, learned_params config, and the config for the preprocessor itself (min_snp, etc.). It also adds a new preprocessing.n_workers property, rather than reusing the loader_num_workers property.

For now it keeps n_validation_scenarios since nothing has been done on #14 (closed) yet, but that will change when we add dataset_splits.
training config: mostly the same as the original training config format. It now references the preprocessing.yml schema for sections like learned_params. I also reorganized the properties in the schema into what I think is a slightly more logical order. It does not make any of the other changes to the training config discussed in #68 (closed). I also eliminated the start_from_last_checkpoint property, which is unused. When we implement resume-from-checkpoint functionality, I think that will make more sense as a command-line argument to dnadna train.
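To make the split between the two files a bit more concrete, here is a rough sketch of which settings land where, written as Python dicts rather than YAML. The nesting, the data_root key, the placeholder values, and the missing_sections helper are all assumptions for illustration; only the property names mentioned above come from this merge request, and the real schemas may differ.

```python
# Illustrative layout only: nesting and values are assumptions,
# not dnadna's actual schema.
preprocessing_config = {
    "dataset": {"data_root": "path/to/simulations"},  # hypothetical dataset config
    "learned_params": {"param_a": {}},                # dummy parameter name
    "n_validation_scenarios": 2,                      # kept for now, see #14
    "preprocessing": {
        "min_snp": 500,  # placeholder value
        "n_workers": 4,  # workers used by preprocessing only
    },
}

training_config = {
    # Sections shared with the preprocessing config are carried over as-is.
    "dataset": preprocessing_config["dataset"],
    "learned_params": preprocessing_config["learned_params"],
    "loader_num_workers": 2,  # data-loader workers, independent of n_workers
}

# Hypothetical notion of "essential" sections for a preprocessing config.
REQUIRED_PREPROCESSING_KEYS = {"dataset", "learned_params", "preprocessing"}


def missing_sections(config):
    """Return the essential top-level sections absent from a preprocessing config."""
    return sorted(REQUIRED_PREPROCESSING_KEYS - set(config))
```

The point of the separate n_workers key is that preprocessing parallelism can be tuned without touching the training-time loader_num_workers setting.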

Command changes

dnadna init: contrary to some of the discussion on #68 (closed), after a separate discussion with @jcury we decided to keep dnadna init for now. It can be used to generate a template preprocessing config, and optionally a template dataset config (if it is not passed an existing one via --dataset-config or --simulation-config).

It does not output a training config.
dnadna preprocess: it no longer generates a run_NNN/ directory. Instead it just outputs the preprocessing results directly to the model_root directory. Later we may wish to output each preprocessing call to some preprocessing_NNN/ directory as discussed in #68 (closed), but this isn't done yet. Running dnadna preprocess does output an example training config file, which also includes preprocessing results, such as the path to the preprocessed scenario params CSV file, and statistics like train_mean.
dnadna train: takes the training config file generated by dnadna preprocess (and any edits made by the user) and produces a training run. Now dnadna train creates the run_NNN/ directory which includes a full copy of the training config used for the run, a copy of the preprocessed scenario params CSV file (not sure if this is needed?), and of course the trained models.
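As a rough illustration of the run_NNN/ behaviour described above, here is a minimal sketch of how a training command might pick the next run directory and copy the config into it. This is not dnadna's actual code: the create_run_dir name, the three-digit zero padding, numbering from 000, and keeping the config's original file name are all assumptions.

```python
import re
import shutil
from pathlib import Path

# Assumption: run directories are named run_000, run_001, ...
RUN_DIR_PATTERN = re.compile(r"^run_(\d{3})$")


def create_run_dir(model_root, training_config_path):
    """Create the next run_NNN/ directory under model_root and copy the config in."""
    model_root = Path(model_root)
    # Collect the numeric IDs of existing run directories.
    existing = [
        int(m.group(1))
        for p in model_root.iterdir()
        if p.is_dir() and (m := RUN_DIR_PATTERN.match(p.name))
    ]
    next_id = max(existing, default=-1) + 1
    run_dir = model_root / f"run_{next_id:03d}"
    run_dir.mkdir()
    # Keep a full copy of the training config used for this run.
    shutil.copy(training_config_path, run_dir / Path(training_config_path).name)
    return run_dir
```

Copying the config into the run directory keeps each run self-describing even if the user later edits the config in model_root.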

Other changes

After discussion with @jcury I decided to get rid of the different options for config templates for now (e.g. dnadna init --template=default). Now there is just one default config template for each config file that contains examples for most of the settings that can be changed.

I did keep the ability for Simulator classes to provide their own extensions to the default templates, which I believe is useful (and is very useful for the tests). But writing custom Simulators hasn't been explored yet so it remains a mostly undocumented feature.
Improvements to various utility code. In particular, some improvements have been made to the code which generates example config files from their schemas, to ensure that more properties' default values and documentation are written.

I'm not fully happy with all the details of the code in this commit. It contains a number of quick workarounds that get things working but have a bit of a code smell. In particular I'm still unhappy with the config file handling in some areas.

TODO

Some TODO items to be handled in follow-ups:

Make further changes to the training config format as discussed in #68 (closed). Opened #80 (closed) for this.
It would be nice if the default preprocessing/training config could automatically generate a learned_params section using the actual param names from the scenario_params.csv file. Currently it just inserts some dummy values. I opened #81 for this.
learned_params section in preprocessing/training config should actually be validated against the param names in the scenario params table. I've opened #79 to track this issue, which I think can be kept separate from this merge request.
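For the validation item, one possible shape of the check is sketched below. It is only a sketch under assumptions: the unknown_learned_params name is hypothetical, the scenario params table is assumed to be a comma-separated file whose first row holds the column names, and learned_params is assumed to be a mapping keyed by param name. The actual design is left to #79.

```python
import csv


def unknown_learned_params(learned_params, scenario_params_csv):
    """Return learned_params names that are not columns of the scenario params CSV."""
    with open(scenario_params_csv, newline="") as f:
        # Only the header row is needed; an empty file yields no columns.
        header = next(csv.reader(f), [])
    columns = set(header)
    return sorted(name for name in learned_params if name not in columns)
```

A config loader could call this right after parsing and raise an error listing the returned names, so a typo in learned_params is caught before preprocessing starts.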

Edited Apr 14, 2021 by E Madison Bray
