Proposal for new organisation of the configuration files

As of now, there are some confusion about the configuration files. I'll assume here, that we are in the standard situation where data are already simulated and a user wants to train a network on its data. There will be 3 steps, and 3 configuration files one for each step. I make comments in the files below, as well as new propositions without any constrains with respect to what is already inplace, I just try to keep the average user's point of view, and what seems (to me at least) to give less friction in the process.

In all situations, the subcommand without a config file would start the first run wizard (so dnadna train would start the wizard from scratch). But let's not discuss this here, but instead, there #48

Initialization : dnadna init dataset_config.yml
- Specifies where is the data, what is the format, etc…, but also where the upcoming runs should be, what's the name of the model (the model being what describes briefly the simulation, and scenarios are variations around the model), where is the table with the parameters.

1. dataset_config.yml

# JSON Schema (YAML-formatted) for basic properties of a simulation on which a
# model will be trained.

# root directory for all files related to the simulation, either as an absolute
# path, or as a path relative to the location of this config file
data_root: /path/to/DATA/

# a name to give the simulation; used in generating filenames and logging output
dataset_name: myDataset

# path to the CSV file containing the per-scenario parameters used in this
# simulation, either as an absolute path, or as a path relative to this config
# file
scenario_params_path: /path/to/my_param.csv

# options for describing the format in which the simulations are written;
# currently only one format ("dnadna", the native format for DNADNA) is
# understood, but others may be added later
data_source:
    # a unique label identifying the data format; the format property determines
    # what reader is used for simulation data, and any further options in
    # data_source may depend on the format
    format: dnadna
    filename_format: scenario_{scenario:05}/myDataset_{scenario:05}_{replicate:03}.npz
    position_format: # Here it is the format of the position vector as stored on disk
        distance: False
        chromosome_size: 2e6 

# root directory for all training runs of this model / training configuration
model_root: .

# unique name to give to models trained with this configuration; individual
# training runs will prepend this to the run_id
model_name: myModel


---- END ----
--- Below are not needed ----
--- I don't know whether I got this from old simulation config file or not---

# number of different scenarios simulated; each scenario is a parameterization
# of the simulation with a different set of (possibly random) parameter values;
# each scenario may have one or more "replicates"--simulations using the same
# parameters, but with different randomized outputs--the number of replicates of
# each scenario should be listed in the scenario parameters table
n_scenarios: 11000  
n_replicates: 100
^-- Those 2 parameters above are currenlty mandatory (at least n_scenario) but have not effect whatsoever in the downstream analaysis, we may remove them here, they are just for the simulations.

Preprocessing: dnadna preprocess myModel/preprocessing_config.yml
- the preprocess_config.yml is created with the previous command. We should specify in it, what is required in our dataset, so we can use it for training. Conditions on min number of snp and indiv, the learned parameter and whether they are for regression or classification (to standardize the data), number of cpu we want to use, fraction of validation dataset, etc..

2. preprocessing_config.yml

# the dataset configuration
dataset:
    inherit: ../dataset_config.yml

# these are parameters used for data pre-processing prior to
# training; they determine the subset of the dataset that will be
# used for a training run
preprocessing:
    min_snp: 400
    min_indiv: 600

# Split dataset between different sets:
split_datasets: 
    train: 0.6
    validation: 0.2
    test: 0.1

# configuration of parameters to learn in training
# It checks that the params are present in the table, and if it's regression it standardizes the parameters
learned_params:
    T00:
        type: regression
        log_transform: false
    T01:
        type: regression
        log_transform: false

# format string for the name given to this run for a sequence of runs of the
# same model; the outputs of each run are placed in subdirectories of
# <run_path>/<model_name> with the name of this run; the format string can use
# the template variables model_name and run_id
run_name_format: run_{run_id}

# format string for the filename of the final output model; it can use the
# template variables model_name, run_name, and/or run_id
model_filename_format: '{model_name}_{run_name}_last_epoch_net.pth'

# number of subprocesses to use for data loading
loader_num_workers: 20 # I suggest we rename it n_cpu or something more common and less convoluted, and keep that for any parallelization process. 

# seed for initializing the PRNG prior to a training run for reproducible
# results; if unspecified the PRNG chooses its default seeding method
seed: 13
        
---- END ----
# I don't know if GPU are used in the preprocessing, I believe not, but maybe I'm missing something, so if yes, we can keep it there. 
# use CUDA-capable GPU where available
use_cuda: true
# specifies the CUDA device index to use
cuda_device: null

Training: dnadna train myModel/run_000/training_config.yml
- Again, created with the previous command, in a run_xxx folder. We need to modify it, to choose things related to the run (network name, n_epoch, batch_size, etc..), but also parameters related to the dataloader (max_snp, max_indiv, uniform, transforms, etc...)

3. training_config.yml


# Parameters telling the dataloader how to load the data so they will match network's input. 
dataset_params:  # Maybe we could rename it to dataloader_params
    max_snp: 400
    max_indiv: 100
    uniform: true # what about fixed_dimension or something less confusing. Isn't it, by the way, obvious from the 2 previous param ? Maybe we can get rid of this one. 
    # How positions should be formatted before going in the net: 
    position_format: # I put that here, because it is done in the dataloader as other arg here.
         distance: True
         normalized: True
    concat: true
    ignore_missing: false

# I'm not sure of the current syntax for transforms, I'm making things up here (but we might take this into account)
# The difference with max_snp/indiv is that transforms below are coded as classes of Transform metaclass, whereas above they are coded within the dataloader.
# The order should be taken into account, as they will be "composed" from a list. 
transforms:
    # data augmentation on training set
    training: 
         subsample: 100
         maf: 0.05 # this parameter was alone somewhere, it is said it is used in preprocessing or for data augmentation depending on where we look, but I think the latter is more appropriate so I put it there. 
         transform_allel_min_major: false # maybe this should be above in dataset_params ?
    validation:
          maf: 0.05
    test: 
          maf: 0.05

# New proposal for network parameters, to group similar stuff a bit more.
network_params:
    name: SPIDNA
    params:
        n_outputs: 2
        n_blocks: 7
        n_features: 50

# training hyperparameters:
n_epochs: 10
batch_size: 64
optimizer_params:
     name: Adam # This is not currently modifiable, although it should be, the next 2 param are linked to it.
     params: # It should be the kwargs from the choosen optimizer, i.e. it is called Adam(lr=0.01, weight_decay=0.02)
        learning_rate: 0.01 # it should thus be lr=0.01 here. Maybe we can create an alias for lr as it is in every optimizer and lr is not really a good variable name.
        weight_decay: 0.02
evaluation_interval: 250
learned_params:
    T00:
        type: regression
        loss_func: MSE
        loss_weight: 1
        tied_to_position: false
    T01:
        type: regression
        loss_func: MSE
        loss_weight: 1
        tied_to_position: false


# hardware used: 
use_cuda: true
cuda_device: null
loader_num_workers: 20 # Change name to n_cpu if it's fine for you 

# Seed for samplers I guess.
seed: 13

start_from_last_checkpoint: false


# Other parameters inherited from previous step, stored for reproductibility, and ease of reuse (with predict command notably).

train_mean:
    T00: 82073.26118722184
    T01: 82073.26118722184
train_std:
    T00: 34809.54181663087
    T01: 34809.54181663087
model_root: /path/to/.
model_name: myModel
dataset:
    data_root: /path/to/DATA/
    dataset_name: myDataset
    scenario_params_path: /path/to/my_param.csv
    data_source:
        format: dnadna
        filename_format: scenario_{scenario:05}/myDataset_{scenario:05}_{replicate:03}.npz
        keys:
        - SNP
        - POS

# these are parameters used for data pre-processing prior to
# training; they determine the subset of the dataset that will be
# used for a training run
preprocessing:
    min_snp: 400
    min_indiv: 600

# Split dataset between different sets:
split_datasets: 
    train: 0.6
    validation: 0.2
    test: 0.1

dnadna_version: 0.1.dev690+gf7e9624
run_id: 0
run_name: run_000
run_name_format: run_{run_id}
model_filename_format: '{model_name}_{run_name}_last_epoch_net.pth'
preprocessed_datetime: '2020-12-16T17:30:31.460204+00:00'
preprocessed_scenario_params_path: myDataset_000_preprocessed_params.csv

Possible commands from there, could be (not all are actually implemented):

dnadna preprocess --clone myModel/run_000/training_config.yml # To create run_001 and change some hyperparameters. See issues #42 (comment 372222)
dnadna predict myModel_run000_best_net.pth --test # to run on test set (this is now #104 (closed))
dnadna predict myModel_run000_best_net.pth *.npz --preprocess # to predict of another dataset while still preprocessing the data.

The above proposal may contain mistakes, and I may have missed some situations. The overall idea is also to make things clearer, so if you could read it and give feedback, it would be great (@embray @fjay @thsanche @pjobic @j.guez)

Edited Jul 19, 2021 by E. Madison Bray

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information

Admin message

Proposal for new organisation of the configuration files