Rework dataset configuration

E. Madison Bray requested to merge embray/refactoring/dataset-config into master

This is part of #68 (closed).

The main change here is to clarify the distinction (or lack thereof) between a "simulation" and a "dataset".

The "dataset" is just the files that will be trained on, and the associated configuration for how the dataset is loaded (the dataset section in preprocessing/training configs).

A "simulation" is nothing more than a dataset, with possibly additional properties in the configuration related to the simulation that produced the dataset.

Conceptually this has been the case for a while, but I clarified it in the config formats and in the code.

On the config format side this follows from #68 (closed): There is no longer a simulation property in the config formats. It's just called dataset.
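
As a rough illustration of the rename (not code from this MR), a config that still uses the old simulation key could be migrated with something like the sketch below. The helper name migrate_config and the example keys data_root and name are assumptions made up for this example; they are not taken from the actual config schema.

```python
import copy


def migrate_config(config):
    """Return a copy of ``config`` with a top-level 'simulation' section
    renamed to 'dataset'.

    Hypothetical helper for illustration only; not part of the codebase.
    """
    new = copy.deepcopy(config)
    if "simulation" in new:
        if "dataset" in new:
            # Refuse to guess which section should win.
            raise ValueError(
                "config has both 'simulation' and 'dataset' sections")
        new["dataset"] = new.pop("simulation")
    return new


# Example keys below are made up; only the section name matters here.
old_config = {"simulation": {"data_root": "/path/to/data", "name": "example"}}
new_config = migrate_config(old_config)
```

The original dict is left untouched, so an old-style config can be compared against the migrated one.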

On the code side, the Simulation class is now just a subclass of DNADataset (in fact, it already duplicated most of the functionality of DNADataset, just in a slightly different way).

The Simulation class itself is now barely used anywhere in the code and could possibly be deleted.

Classes/methods that previously took a Simulation object as one of their arguments now just take a DNADataset.
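
A minimal sketch of the relationship described above, assuming a constructor that takes a config dict. Only the class names DNADataset and Simulation come from this MR; the constructor signature, the name and simulator_params properties, and the describe function are stand-ins invented for illustration and do not reflect the real implementation.

```python
class DNADataset:
    """Stand-in for the real class: the files to train on plus the
    configuration describing how they are loaded."""

    def __init__(self, config):
        # Assumed: the dataset is constructed from its config section.
        self.config = config

    @property
    def name(self):
        # Hypothetical property, used only to make the example concrete.
        return self.config.get("name")


class Simulation(DNADataset):
    """A dataset with additional, simulation-specific properties."""

    @property
    def simulator_params(self):
        # Hypothetical: extra properties only a simulated dataset carries.
        return self.config.get("simulator_params", {})


def describe(dataset: DNADataset) -> str:
    """Example consumer: accepts any DNADataset, including a Simulation."""
    return f"dataset {dataset.name!r}"
```

Because Simulation is a subclass, code written against DNADataset keeps working when handed a simulated dataset, which is why the Simulation type no longer needs to appear in those signatures.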

So overall we speak primarily of "datasets", where a "simulation" is just a dataset that was produced by a DNADNA Simulator.

This should clarify what @jcury wrote in his example dataset_config.yml from #68 (closed):

---- END ----
--- Below are not needed ----
--- I don't know whether I got this from old simulation config file or not---

The properties below this point are related to running a simulation, but are not needed for an arbitrary dataset (which may or may not have been a simulated dataset).

This MR is built on top of !70 (merged) and in turn !69 (merged), so the changes specific to this MR may be hard to distinguish until those are merged (for !69 (merged) the only remaining question is whether or not to still include torchvision in the default conda environment we give people).
