Major refactoring of config format and init/preprocess/train commands in response to #68
This makes major changes to how the config files are structured as well as how the commands that use them work; since all of these changes are linked to each other, it's almost impossible to make them piecemeal, hence the huge diff.
Here is an overview of the changes:
Config file changes
- dataset/simulation config: essentially unchanged from the previous format; the only difference is that generated config files also note the dnadna version that generated them.
- preprocessing config: this is a new file that did not previously exist on its own (although the existing config format would have allowed splitting the preprocessing and training configs if desired). As discussed, this contains just the essential settings for running preprocessing, including the dataset config, learned_params config, and the config for the preprocessor itself (min_snp, etc.). It also adds a new `preprocessing.n_workers` property, rather than reusing the `loader_num_workers` property. For now it keeps `n_validation_scenarios` since nothing has been done on #14 (closed) yet, but that will change when we add `dataset_splits`.
- training config: mostly the same as the original training config format. It now references the preprocessing.yml schema for sections like `learned_params`. I also reorganized the order of the properties in the schema a little more logically (I think). It does not make any of the other changes to the training config discussed in #68 (closed). I also eliminated the `start_from_last_checkpoint` property, which is unused. When we implement resume-from-checkpoint functionality, that will make more sense as a command-line argument to `dnadna train`, I think.
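To make the dotted property names above concrete, here is a minimal Python sketch of how the preprocessing config might look once loaded, with a small lookup helper. The nesting, the `dataset` key name, the placement of `n_validation_scenarios`, and every value are assumptions for illustration only; `get_property` is a hypothetical helper, not part of dnadna.

```python
# Hypothetical in-memory shape of a loaded preprocessing config.
# Key nesting and all values are illustrative placeholders, not the real schema.
EXAMPLE_PREPROCESSING_CONFIG = {
    "dataset": {},                  # dataset config (contents omitted)
    "learned_params": {},           # learned_params config (contents omitted)
    "n_validation_scenarios": 10,   # placement and value are assumed
    "preprocessing": {
        "min_snp": 100,             # placeholder value
        "n_workers": 2,             # the new preprocessing.n_workers property
    },
}


def get_property(config, dotted_path):
    """Look up a nested property such as 'preprocessing.n_workers'.

    Raises KeyError if any component of the path is missing.
    """
    node = config
    for key in dotted_path.split("."):
        node = node[key]
    return node


if __name__ == "__main__":
    print(get_property(EXAMPLE_PREPROCESSING_CONFIG, "preprocessing.n_workers"))
```

In this sketch a lookup of `loader_num_workers` fails with a `KeyError`, mirroring the point that preprocessing no longer reuses that property.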
Command changes
- `dnadna init`: contrary to some of the discussion on #68 (closed), after a separate discussion with @jcury we decided to keep `dnadna init` for now. It can be used to generate a template preprocessing config, and optionally a template dataset config (if it is not passed an existing one via `--dataset-config` or `--simulation-config`). It does not output a training config.
- `dnadna preprocess`: it no longer generates a `run_NNN/` directory. Instead it just outputs the preprocessing results directly to the `model_root` directory. Later we may wish to output each preprocessing call to some `preprocessing_NNN/` directory as discussed in #68 (closed), but this isn't done yet. Running `dnadna preprocess` does output an example training config file, which also includes preprocessing results, such as the path to the preprocessed scenario params CSV file, and statistics like `train_mean`.
- `dnadna train`: takes the training config file generated by `dnadna preprocess` (and any edits made by the user) and produces a training run. Now `dnadna train` creates the `run_NNN/` directory, which includes a full copy of the training config used for the run, a copy of the preprocessed scenario params CSV file (not sure if this is needed?), and of course the trained models.
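For readers unfamiliar with the `run_NNN/` convention, the following is a rough Python sketch of how a fresh run directory could be allocated under `model_root`. It is not the code in this merge request: the function name `next_run_dir`, the zero-padded three-digit numbering, and starting at `run_000` are all assumptions.

```python
import re
import tempfile
from pathlib import Path

# Matches directory names such as run_000, run_001, ...
RUN_DIR_RE = re.compile(r"^run_(\d+)$")


def next_run_dir(model_root: Path, width: int = 3) -> Path:
    """Create and return the next unused run_NNN/ directory under model_root.

    Hypothetical helper: the numbering scheme is assumed, not taken from dnadna.
    """
    model_root.mkdir(parents=True, exist_ok=True)
    used = []
    for child in model_root.iterdir():
        match = RUN_DIR_RE.match(child.name)
        if child.is_dir() and match:
            used.append(int(match.group(1)))
    # One past the highest existing run number; 0 if there are no runs yet.
    run_id = max(used, default=-1) + 1
    run_dir = model_root / f"run_{run_id:0{width}d}"
    run_dir.mkdir()
    return run_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "model_root"
        print(next_run_dir(root).name)
        print(next_run_dir(root).name)
```

Numbering from the highest existing run, rather than filling gaps, keeps run numbers in chronological order even if older run directories are deleted.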
Other changes
- After discussion with @jcury I decided to get rid of the different options for config templates for now (e.g. `dnadna init --template=default`). Now there is just one default config template for each config file that contains examples for most of the settings that can be changed. I did keep the ability for Simulator classes to provide their own extensions to the default templates, which I believe is useful (and is very useful for the tests). But writing custom Simulators hasn't been explored yet, so it remains a mostly undocumented feature.
- Improvements to various utility code. In particular, some improvements have been made to the code which generates example config files from their schemas, to ensure that more properties' default values and documentation are written.
I'm not fully happy with all the details of the code in this commit. It contains a number of quick workarounds to get things working, and some of them have a bit of a code smell. In particular I'm still unhappy with the config file handling in some areas.
TODO
Some TODO items to be handled in follow-ups:
- Make further changes to the training config format as discussed in #68 (closed). Opened #80 (closed) for this.
- It would be nice if the default preprocessing/training config could automatically generate a `learned_params` section using the actual param names from the scenario_params.csv file. Currently it just inserts some dummy values. I opened #81 for this.
- The `learned_params` section in the preprocessing/training config should actually be validated against the param names in the scenario params table. I've opened #79 to track this issue, which I think can be kept separate from this merge request.
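As a rough illustration of the last two items, the sketch below reads the column names from a scenario params CSV and reports any `learned_params` entries that have no matching column. This is only one possible approach, not the planned implementation for #79 or #81: the helper names, the `ignore` argument, and the column names in the demo are all hypothetical.

```python
import csv
import io


def param_names_from_csv(csv_file, ignore=()):
    """Return the CSV header's column names, minus any names listed in `ignore`.

    `csv_file` is any open text file object; `ignore` is a hypothetical way to
    skip non-parameter columns (e.g. an index column).
    """
    header = next(csv.reader(csv_file))
    return [name for name in header if name not in ignore]


def unknown_learned_params(learned_params, csv_file, ignore=()):
    """Return a sorted list of learned_params names absent from the CSV header."""
    known = set(param_names_from_csv(csv_file, ignore=ignore))
    return sorted(name for name in learned_params if name not in known)


if __name__ == "__main__":
    # Made-up table contents, purely for demonstration.
    table = "scenario_idx,mutation_rate,recombination_rate\n0,1e-8,1e-8\n"
    learned = {"mutation_rate": {}, "event_time": {}}
    print(param_names_from_csv(io.StringIO(table), ignore=("scenario_idx",)))
    print(unknown_learned_params(learned, io.StringIO(table)))
```

The same header-reading helper could serve both purposes: seeding a default `learned_params` section with real names, and rejecting a config that refers to names the table does not contain.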