diff --git a/docs/data_preprocessing.rst b/docs/data_preprocessing.rst index 565fcced1996449808b9083fe93b9633146632ec..9b4fd953c234c65c7b7d4ccf64a67a9e3f3e6a0e 100644 --- a/docs/data_preprocessing.rst +++ b/docs/data_preprocessing.rst @@ -227,11 +227,11 @@ as well as the version of ``dnadna`` used. Command line ============ -Once the preprocessing configuration file has been filled and the required input files are created, the command to start the preprocessing is simply: +Once the preprocessing configuration file has been filled and the required input files are created, the command to start the preprocessing is simply: .. code-block:: bash - dnadna preprocess preprocess_config_file.yml + dnadna preprocess <model_name>_preprocessing_config.yml More details can be found in the :ref:`introduction:Quickstart Tutorial`.
diff --git a/docs/datasets.rst b/docs/datasets.rst index 8669bec4d4f94960100496c88c85d31d063fac4d..a6f6eba675557a8fbe340f98aff80d0c0b3be281 100644 --- a/docs/datasets.rst +++ b/docs/datasets.rst @@ -205,6 +205,33 @@ DNADNA format can be changed to suit your wishes, e.g. you could change to:: filename = f"{dataset_name}/scen_{scenario}_arbitrary_text/rep_{replicate}/{scenario}_{replicate}.npz" + +In that case, update ``filename_format`` in the +:ref:`dataset config file <dnadna-dataset-simulation-config>`: + +.. code-block:: yaml + + data_source: + # string template for per-replicate simulation files in Python + # string template format; the following template variables may be + # used: 'name', the same as the name property used in this config + # file; 'scenario', the scenario number, and 'replicate', the + # replicate number of the scenario (if there are multiple + # replicates); path separators may also be used in the template to + # form a directory structure + filename_format: "{dataset_name}/scen_{scenario}_arbitrary_text/rep_{replicate}/{scenario}_{replicate}.npz" + + +before running: + +.. code-block:: bash + + $ dnadna init --dataset-config={dataset_name}/{dataset_name}_dataset_config.yml + +where ``{dataset_name}/{dataset_name}_dataset_config.yml`` is the name you +picked for the config file. + + You can check our `notebook <https://gitlab.com/mlgenetics/dnadna/-/tree/master/examples/example_simulate_msprime_save_dnadna_npz.ipynb>`_ for an illustration of a simple constant demographic scenario in ``msprime``
diff --git a/docs/overview.rst b/docs/overview.rst index fb472023248f58383400e723d8d6ea2a6ff8ddb8..78ee14f1ec51d31cea9671b53dc2fd75ab57be00 100644 --- a/docs/overview.rst +++ b/docs/overview.rst @@ -140,20 +140,126 @@ would output config files to ``/mnt/nfs/username/models/my_model/``. Preprocessing ============= -TODO +The preprocessing step performs the following: +* validating input files and filtering out scenarios that do not meet the minimal requirements (defined by the user) +* splitting the dataset into training/validation/test sets (the latter is optional) +* applying transformations to target parameter(s) if requested by the user (e.g. a log transformation) +* standardizing target parameter(s) for regression tasks (the mean and standard deviation used in standardization are computed from the training set only). + +Preprocessing is necessary before performing the first training run and should +be re-run if and only if one of the following is true: + +* the dataset changed, + +* the task changed (e.g. predicting other parameters, or the same parameters but with different transformations), + +* the required input dimensions changed (e.g. to match the dimensions expected by some networks). + +At this stage we expect the user to open ``my_model_preprocessing_config.yml`` +and edit the properties to match the task/network needs in terms of the minimal +number of SNPs and individuals required for a dataset to be valid, the names of the +evolutionary parameters to be targeted, split proportions, etc.
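The scenario-level split and the train-only standardization described above can be sketched in a few lines of Python. This is a simplified illustration only, not dnadna's actual implementation; the helper name and the scenario-to-value mapping are made up for the example:

```python
import math
import random

def split_and_standardize(params, train_frac=0.8, seed=0):
    """Illustrate a scenario-level split plus train-only standardization.

    ``params`` maps a scenario id to the raw value of one target
    parameter.  All replicates of a scenario inherit its assignment, and
    the mean/standard deviation used for standardization are computed
    from the training scenarios only.
    """
    scenarios = sorted(params)
    rng = random.Random(seed)
    rng.shuffle(scenarios)
    n_train = int(train_frac * len(scenarios))
    train = set(scenarios[:n_train])
    # Statistics from the training set only, so the validation set
    # cannot leak into the standardization.
    mean = sum(params[s] for s in train) / len(train)
    std = math.sqrt(sum((params[s] - mean) ** 2 for s in train) / len(train))
    return {
        s: ((params[s] - mean) / std, "training" if s in train else "validation")
        for s in params
    }
```

By construction, the standardized training values have mean zero, while validation values are scaled with the training statistics and generally do not.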
More details +are provided in the :doc:`dedicated preprocessing page <data_preprocessing>`. + +Once the preprocessing configuration file has been filled and the required input +files are created, run preprocessing with: + +.. code-block:: bash + + $ dnadna preprocess my_model_preprocessing_config.yml + + +which outputs ``my_model/my_model_training_config.yml``, +``my_model/my_model_preprocessed_params.csv`` and +``my_model/my_model_preprocessing.log``. + +The last of these is simply a log file. ``my_model_preprocessed_params.csv`` is a +parameter table similar to ``my_model_params.csv`` but with log-transformed (if +required) and standardized target parameters, and with an additional column +indicating the assignment of each scenario to the training, validation or test set. +Note that all replicates of a scenario are assigned to the same set. +``my_model/my_model_training_config.yml`` will be described in the next section. + +More details can be found on the dedicated :doc:`preprocessing page +<data_preprocessing>`. .. _overview-training: Training ======== -TODO +We can now proceed to training. It consists of optimizing the parameters of a +statistical model (here the weights of a network) based on a training dataset +and optimization hyperparameters, and evaluating the performance on a validation +set. + +First, edit ``my_model/my_model_training_config.yml`` to define, in +particular, which network should be trained, its hyperparameters and loss +function, the optimization hyperparameters, transformations for data +augmentation, etc. More details can be found on the dedicated :doc:`training page +<training>`. + +Then run: + +.. code-block:: bash + + $ dnadna train my_model/my_model_training_config.yml + +which creates a subdirectory ``run_{run_id}/`` containing the optimized network +``my_model_run_{run_id}_best_net.pth`` as well as checkpoints during training, a +log file and loss values stored in a tensorboard directory.
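The ``run_{run_id}`` directory name above is derived from an auto-incrementing run id. A minimal sketch of such a naming scheme (a hypothetical helper for illustration, not dnadna's code):

```python
import re

def next_run_id(existing_runs):
    """Return the next default run directory name: run_000, run_001, ...

    ``existing_runs`` is an iterable of existing run directory names;
    entries that do not look like run directories are ignored.
    """
    ids = [int(m.group(1)) for name in existing_runs
           if (m := re.fullmatch(r"run_(\d+)", name))]
    return f"run_{max(ids) + 1 if ids else 0:03d}"
```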
+ +``dnadna train`` takes additional arguments such as: + +* ``--plugin PLUGIN`` to pass plugin files that define custom networks, + optimizers or transformations that we would like to use for training + even though they are not part of the original dnadna code. See the dedicated + :doc:`plugin page <extending>`. +* ``-r RUN_ID`` or ``--run-id RUN_ID`` to specify a run identifier different from the one created by default (the default starts at ``run_000`` and increments to ``run_001``, ``run_002``, etc.). RUN_ID can also be specified in the config file. + +* ``--overwrite`` to overwrite the previous run (otherwise a new run directory is created). + + +More details can be found on the dedicated :doc:`training page <training>`. .. _overview-prediction: Prediction ========== -TODO +Once trained, a network can be applied to a dataset in :doc:`DNADNA dataset format <datasets>` to classify/predict its evolutionary parameters. The following command is used: + +.. code-block:: bash + + $ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth realdata/dataset.npz + + + +This will use the best net, but you can use any net name, such as ``run_{run_id}/my_model_run_{run_id}_last_epoch_net.pth``. + +This outputs the predictions in CSV format, which is printed to standard output +by default while the process runs. You can pipe this to a file using +standard shell redirection operators like ``dnadna predict {args} > +predictions.csv``, or you can specify a file to output to using the +``--output`` option. + + +You can also apply ``dnadna predict`` to multiple npz files as follows: + +.. code-block:: bash + + $ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth {extra_dir_name}/scenario*/*.npz + +where ``{extra_dir_name}`` is a directory (that you created) containing +independent simulations which will serve as a test set for all networks or as +an illustration of predictive performance under specific conditions.
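The stored ``.pth`` file carries the information needed to map raw network outputs back to the original parameter scale (the "unstandardize"/"untransform" step, for parameters that were standardized and possibly log-transformed at preprocessing time). A minimal sketch of that inverse mapping, with a made-up helper name and under the assumption of a log transform followed by standardization:

```python
import math

def untransform(prediction, mean, std, log_transformed=True):
    """Map a network output back to the original parameter scale.

    Reverses the preprocessing steps in order: first undo the
    standardization (using the training-set ``mean`` and ``std`` saved
    at preprocessing time), then undo the log transform if one was
    applied to this parameter.
    """
    value = prediction * std + mean
    return math.exp(value) if log_transformed else value
```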
+ + +Importantly, if you want to ensure that target examples comply with the +preprocessing constraints (such as the minimal number of SNPs and individuals), +use ``--preprocess``. In that case, a warning will be displayed for each rejected scenario, with the reason for rejection (such as the minimal number of SNPs). + + +More details can be found on the dedicated :doc:`prediction page <prediction>`.
diff --git a/docs/prediction.rst b/docs/prediction.rst index 19d5c9a894d067a1d8bad8657919f96cbc139719..88d8a4962a1b054335e598a234c9f9f220af2cec 100644 --- a/docs/prediction.rst +++ b/docs/prediction.rst @@ -2,3 +2,83 @@ Prediction ########## + + +Once trained, a network can be applied (through a simple forward pass) to other +datasets, such as: + +* a test set, after hyperparameter optimization has been done for all networks. It enables a fair comparison of multiple networks and a check of whether they overfitted the validation set, + +* specific examples, to evaluate predictive performance on specific scenarios or the robustness under specific conditions (such as new data under selection while selection was absent from the training set), + +* real datasets, to reconstruct the past evolutionary history of real populations. + + +The required arguments for ``dnadna predict`` are: + +* MODEL: most commonly a path to a .pth file, such as + ``run_{runid}/my_model_run_{runid}_best_net.pth``, that contains the + trained network we wish to use and additional information (such as data + transformations that should be applied beforehand and information to unstandardize + and/or "untransform" the predicted parameters). Alternatively, the final + config file of a run ``run_{runid}/my_model_run_{runid}_final_config.yml`` + can be passed (in which case the best network of the given run is used by + default). + +* INPUT: path to one or more npz files, or to a :ref:`dataset config file <dnadna-dataset-simulation-config>` (describing a whole dataset). + + +A typical usage will thus be: + +.. code-block:: bash + + $ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth realdata/sample.npz + +to classify/predict evolutionary parameters for a single data sample +``realdata/sample.npz`` in :doc:`DNADNA dataset format <datasets>`. + +This will use the best net, but you can use any net name, such as ``run_{run_id}/my_model_run_{run_id}_last_epoch_net.pth``. + +This outputs the predictions in CSV format, which is printed to standard output +by default while the process runs. You can pipe this to a file using +standard shell redirection operators like ``dnadna predict {args} > +predictions.csv``, or you can specify a file to output to using the +``--output`` option. + + +You can also apply ``dnadna predict`` to multiple npz files as follows: + +.. code-block:: bash + + $ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth {extra_dir_name}/scenario*/*.npz + +where ``{extra_dir_name}`` is a directory (that you created) containing +independent simulations which will serve as a test set for all networks or as +an illustration of predictive performance under specific conditions. + + +The previous command is equivalent to: + +.. code-block:: bash + + $ dnadna predict run_{run_id}/my_model_run_{run_id}_final_config.yml {extra_dir_name}/scenario*/*.npz + +where the training config file is passed rather than the ``.pth`` of the best +network, but you could alternatively add the option ``--checkpoint last_epoch`` +to use the network at the final stage of training rather than the best one. + + +Importantly, if you want to ensure that target examples comply with the +preprocessing constraints (such as the minimal number of SNPs and individuals), +use ``--preprocess``. In that case, a warning will be displayed for each rejected scenario, with the reason for rejection (such as the minimal number of SNPs). + +In the current version, the same data transformations are applied to the +training/validation/test sets and to extra simulations or real data on which +the prediction is made.
These are the same data transformations that are +defined in the training config file for the training run that produced the +model. + +Finally, you can fine-tune resource usage with the options ``--gpus GPUS`` and +``--loader-num-workers LOADER_NUM_WORKERS`` to indicate the specific GPUs and +the number of CPU workers to use. You can display a progress bar with the option +``--progress-bar``.
diff --git a/docs/spelling_wordlist.txt b/docs/spelling_wordlist.txt index 3b6a3472345b432e63ba35bd27af8610c3434c7c..46f6f9aa86b23e7ae470725fb56df2e0124f9a60 100644 --- a/docs/spelling_wordlist.txt +++ b/docs/spelling_wordlist.txt @@ -98,7 +98,9 @@ normalizations npz nSl numpydoc +optimizers overfit +overfitted overfitting overline parallelization @@ -150,6 +152,7 @@ uncategorized unnormalized unregister unstandardize +untransform utils validator validators
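Returning to the ``filename_format`` template shown in the datasets section above: it uses Python's ``str.format`` placeholders, so its expansion for a given scenario and replicate can be checked with a short snippet (the helper below is hypothetical, for illustration only):

```python
# The template matches the filename_format example from the dataset
# config excerpt in docs/datasets.rst.
filename_format = (
    "{dataset_name}/scen_{scenario}_arbitrary_text/"
    "rep_{replicate}/{scenario}_{replicate}.npz"
)

def replicate_path(dataset_name, scenario, replicate):
    """Return the on-disk path of one simulated replicate."""
    return filename_format.format(
        dataset_name=dataset_name, scenario=scenario, replicate=replicate
    )

print(replicate_path("my_dataset", 3, 0))
```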