Commit 0475ac08 authored by E Madison Bray

Merge branch 'flora/documentation/overview_network' into 'master'

Flora/documentation/overview network

See merge request !120
parents c0a45224 b5bd9735
......@@ -227,11 +227,11 @@ as well as the version of ``dnadna`` used.
Command line
============
Once the preprocessing configuration file has been filled and the required input files are created, the command to start the preprocessing is simply:

.. code-block:: bash

   dnadna preprocess <model_name>_preprocessing_config.yml
More details can be found in the :ref:`introduction:Quickstart Tutorial`.
......
......@@ -205,6 +205,33 @@ DNADNA format can be changed to suit your wishes, e.g. you could change to::

    filename = f"{dataset_name}/scen_{scenario}_arbitrary_text/rep_{replicate}/{scenario}_{replicate}.npz"
In which case you will update ``filename_format`` in the
:ref:`dataset config file <dnadna-dataset-simulation-config>`:

.. code-block:: yaml

   data_source:
     # string template for per-replicate simulation files in Python
     # string template format; the following template variables may be
     # used: 'name', the same as the name property used in this config
     # file; 'scenario', the scenario number, and 'replicate', the
     # replicate number of the scenario (if there are multiple
     # replicates); path separators may also be used in the template to
     # form a directory structure
     filename_format: '{dataset_name}/scen_{scenario}_arbitrary_text/rep_{replicate}/{scenario}_{replicate}.npz'
before running:

.. code-block:: bash

   $ dnadna init --dataset-config={dataset_name}/{dataset_name}_dataset_config.yml
where ``{dataset_name}/{dataset_name}_dataset_config.yml`` is the name you
picked for the config file.
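Since ``filename_format`` follows Python's string template format, you can
sanity-check a custom template before running ``dnadna init`` by expanding it
yourself; a minimal sketch (the values below are placeholders):

.. code-block:: python

   # Expand the filename_format template by hand to check that it matches
   # the layout of your simulation files (placeholder values).
   filename_format = ("{dataset_name}/scen_{scenario}_arbitrary_text/"
                      "rep_{replicate}/{scenario}_{replicate}.npz")
   print(filename_format.format(dataset_name="my_dataset", scenario=3,
                                replicate=0))
   # -> my_dataset/scen_3_arbitrary_text/rep_0/3_0.npz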
You can check our `notebook
<https://gitlab.com/mlgenetics/dnadna/-/tree/master/examples/example_simulate_msprime_save_dnadna_npz.ipynb>`_
for an illustration of a simple constant demographic scenario in ``msprime``.
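For instance, a single replicate of a constant-size scenario could be simulated
and saved along the lines of the following sketch (the array names ``SNP`` and
``POS`` are illustrative assumptions; see the notebook and the :doc:`DNADNA
dataset format <datasets>` page for the exact keys expected):

.. code-block:: python

   import os

   import msprime
   import numpy as np

   # Simulate one replicate of a constant-size demographic scenario.
   ts = msprime.sim_ancestry(samples=50, population_size=10_000,
                             sequence_length=2e6, recombination_rate=1e-8,
                             random_seed=1)
   mts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)

   snp = mts.genotype_matrix()  # SNP matrix, one row per site
   pos = np.array([site.position for site in mts.sites()])

   # Save the replicate following the filename_format shown above.
   os.makedirs("my_dataset/scen_0_arbitrary_text/rep_0", exist_ok=True)
   np.savez("my_dataset/scen_0_arbitrary_text/rep_0/0_0.npz", SNP=snp, POS=pos)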
......
......@@ -140,20 +140,126 @@ would output config files to ``/mnt/nfs/username/models/my_model/``.
Preprocessing
=============
The preprocessing step performs the following:

* validating input files and filtering out scenarios that do not match minimal requirements (defined by the user)
* splitting the dataset into training/validation/test sets (the latter is optional)
* applying transformations to the target parameter(s) if requested by the user (e.g. a log transformation)
* standardizing the target parameter(s) for regression tasks; the mean and standard deviation used in standardization are computed on the training set only (see the sketch below)
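Conceptually, the scenario-level split and the training-set-only
standardization amount to something like the following sketch (illustrative
Python only, not DNADNA's actual implementation; the column names are made up):

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Hypothetical parameter table: 100 scenarios, 3 replicates each.
   params = pd.DataFrame({
       "scenario": np.repeat(np.arange(100), 3),
       "mutation_rate": np.random.default_rng(0).lognormal(size=300),
   })

   # Split at the scenario level, so that all replicates of a scenario
   # end up in the same set.
   scenarios = np.random.default_rng(1).permutation(params["scenario"].unique())
   in_train = params["scenario"].isin(scenarios[:70])

   # Optional log transform, then standardization using statistics
   # computed on the training set only.
   x = np.log(params["mutation_rate"])
   params["mutation_rate"] = (x - x[in_train].mean()) / x[in_train].std()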
Preprocessing is necessary before performing the first training run and should
be re-run if and only if one of the following is true:

* the dataset changed,
* the task changed (e.g. predicting other parameters, or the same parameters but with different transformations),
* the required input dimensions changed (e.g. to match the dimensions expected by some networks).
At this stage we expect the user to open ``my_model_preprocessing_config.yml``
and edit the properties to match the task/network needs in terms of minimal
number of SNPs and individuals required for a dataset to be valid, names of the
evolutionary parameters to be targeted, split proportions, etc. More details
are provided in the :doc:`dedicated preprocessing page <data_preprocessing>`.
Once the preprocessing configuration file has been filled and the required input
files are created, run preprocessing with:

.. code-block:: bash

   $ dnadna preprocess my_model_preprocessing_config.yml
which outputs ``my_model/my_model_training_config.yml``,
``my_model/my_model_preprocessed_params.csv`` and
``my_model/my_model_preprocessing.log``.
The latter is simply a log file. ``my_model_preprocessed_params.csv`` is a
parameter table similar to ``my_model_params.csv`` but with log-transformed (if
requested) and standardized target parameters, and with an additional column
indicating the assignment of each scenario to the training, validation or test set.
Note that all replicates of a scenario are assigned to the same set.
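A quick way to convince yourself of this is to check that each scenario carries
exactly one split label; a sketch (the ``scenario`` and ``split`` column names
are hypothetical, check the header of your own CSV):

.. code-block:: python

   import pandas as pd

   # Every scenario should be assigned to exactly one of the
   # training/validation/test sets (column names are hypothetical).
   params = pd.read_csv("my_model/my_model_preprocessed_params.csv")
   assert (params.groupby("scenario")["split"].nunique() == 1).all()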
``my_model/my_model_training_config.yml`` will be described in the next section.
More details on the dedicated :doc:`preprocessing page
<data_preprocessing>`.
.. _overview-training:
Training
========
We can now proceed to training. It consists of optimizing the parameters of a
statistical model (here the weights of a network) based on a training dataset
and optimization hyperparameters, and evaluating the performance on a validation
set.
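Schematically, one epoch of this procedure looks as follows (a conceptual
PyTorch sketch, not DNADNA's actual training loop):

.. code-block:: python

   import torch

   def run_epoch(net, loss_fn, optimizer, train_loader, val_loader):
       net.train()
       for batch, target in train_loader:  # optimize weights on training data
           optimizer.zero_grad()
           loss_fn(net(batch), target).backward()
           optimizer.step()

       net.eval()  # evaluate, without gradients, on the validation set
       with torch.no_grad():
           losses = [loss_fn(net(batch), target).item()
                     for batch, target in val_loader]
       return sum(losses) / len(losses)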
First edit ``my_model/my_model_training_config.yml`` to define, in
particular, which network should be trained, its hyperparameters and loss
function, the optimization hyperparameters, the transformations used for data
augmentation, etc. More details on the dedicated :doc:`training page
<training>`.
Then run:

.. code-block:: bash

   $ dnadna train my_model/my_model_training_config.yml
which creates a subdirectory ``run_{run_id}/`` containing the optimized network
``my_model_run_{run_id}_best_net.pth`` as well as checkpoints during training, a
log file and loss values stored in a tensorboard directory.
``dnadna train`` takes additional arguments such as:

* ``--plugin PLUGIN`` to pass plugin files that define custom networks,
  optimizers or transformations that we would like to use for training even
  though they are not part of the original dnadna code. See the
  :doc:`dedicated plugin page <extending>`.
* ``-r RUN_ID`` or ``--run-id RUN_ID`` to specify a run identifier different
  from the default (the default starts at ``run_000`` and then increments to
  ``run_001``, etc.). RUN_ID can also be specified in the config file.
* ``--overwrite`` to overwrite the previous run (otherwise a new run directory
  is created).
More details on the dedicated :doc:`training page <training>`.
.. _overview-prediction:
Prediction
==========
Once trained, a network can be applied to a dataset in :doc:`DNADNA dataset format <datasets>` to classify/predict its evolutionary parameters. The following command is used:

.. code-block:: bash

   $ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth realdata/dataset.npz
This uses the best network, but you can point to any saved network, such as ``run_{run_id}/my_model_run_{run_id}_last_epoch_net.pth``.
This outputs the predictions in CSV format, printed to standard output by
default while the process runs. You can pipe this to a file using standard
shell redirection, e.g. ``dnadna predict {args} > predictions.csv``, or you
can specify an output file with the ``--output`` option.
You can also apply ``dnadna predict`` to multiple npz files as follows:

.. code-block:: bash

   $ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth {extra_dir_name}/scenario*/*.npz
where ``{extra_dir_name}`` is a directory (that you created) containing
independent simulations, which will serve as a test set for all networks or
illustrate predictive performance under specific conditions.
Importantly, if you want to ensure that target examples comply with the
preprocessing constraints (such as the minimal number of SNPs and
individuals), use ``--preprocess``. In that case, a warning is displayed for
each rejected scenario, along with the reason for rejection (such as too few SNPs).
More details on the dedicated :doc:`prediction page <prediction>`.
......@@ -2,3 +2,83 @@
Prediction
##########
Once trained, a network can be applied (through a simple forward pass) to other
datasets, such as:

* a test set, after hyperparameter optimization has been done for all
  networks; this enables a fair comparison of multiple networks and a check of
  whether they overfitted the validation set,
* specific examples, to evaluate predictive performance on specific scenarios
  or robustness under specific conditions (such as new data under selection
  while selection was absent from the training set),
* real datasets, to reconstruct the past evolutionary history of real
  populations.
The required arguments for ``dnadna predict`` are:

* MODEL: most commonly a path to a .pth file, such as
  ``run_{runid}/my_model_run_{runid}_best_net.pth``, that contains the
  trained network we wish to use along with additional information (such as
  the data transformations that should be applied beforehand, and what is
  needed to unstandardize and/or "untransform" the predicted parameters).
  Alternatively, the final config file of a run,
  ``run_{runid}/my_model_run_{runid}_final_config.yml``, can be passed (in
  which case the best network of the given run is used by default).
* INPUT: path to one or more npz files, or to a :ref:`dataset config file
  <dnadna-dataset-simulation-config>` (describing a whole dataset); the
  sketch after this list shows how to peek inside such an npz file.
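Since an npz file is a standard NumPy archive, you can peek inside one before
running a prediction without assuming anything about its internal layout; a
minimal sketch:

.. code-block:: python

   import numpy as np

   # List the arrays stored in a DNADNA-format npz file, with their shapes.
   with np.load("realdata/sample.npz") as data:
       for name in data.files:
           print(name, data[name].shape, data[name].dtype)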
A typical usage will thus be:

.. code-block:: bash

   $ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth realdata/sample.npz
to classify/predict evolutionary parameters for a single data sample
``realdata/sample.npz`` in :doc:`DNADNA dataset format <datasets>`.
This uses the best network, but you can point to any saved network, such as ``run_{run_id}/my_model_run_{run_id}_last_epoch_net.pth``.
This outputs the predictions in CSV format, printed to standard output by
default while the process runs. You can pipe this to a file using standard
shell redirection, e.g. ``dnadna predict {args} > predictions.csv``, or you
can specify an output file with the ``--output`` option.
You can also apply ``dnadna predict`` to multiple npz files as follows:

.. code-block:: bash

   $ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth {extra_dir_name}/scenario*/*.npz
where ``{extra_dir_name}`` is a directory (that you created) containing
independent simulations, which will serve as a test set for all networks or
illustrate predictive performance under specific conditions.
The previous command is equivalent to:

.. code-block:: bash

   $ dnadna predict run_{run_id}/my_model_run_{run_id}_final_config.yml {extra_dir_name}/scenario*/*.npz
where the run's final config file is passed rather than the ``.pth`` of the
best network; you could alternatively add the option ``--checkpoint
last_epoch`` to use the network at the final stage of training rather than the
best one.
Importantly, if you want to ensure that target examples comply with the
preprocessing constraints (such as the minimal number of SNPs and
individuals), use ``--preprocess``. In that case, a warning is displayed for
each rejected scenario, along with the reason for rejection (such as too few SNPs).
In the current version, the same data transformations are applied to the
training/validation/test sets and to any extra simulations or real data on
which the prediction is made. These are the data transformations defined in
the training config file of the run that produced the model.
Finally, you can fine-tune resource usage with the options ``--gpus GPUS``, to
indicate which specific GPUs to use, and ``--loader-num-workers
LOADER_NUM_WORKERS``, to set the number of CPU worker processes used for data
loading. You can display a progress bar with the option ``--progress-bar``.
......@@ -98,7 +98,9 @@ normalizations
npz
nSl
numpydoc
optimizers
overfit
overfitted
overfitting
overline
parallelization
......@@ -150,6 +152,7 @@ uncategorized
unnormalized
unregister
unstandardize
untransform
utils
validator
validators
......