Position normalization
Here's an attempt at implementing #6.
As we discussed in the issue, there are two sides to this:
- A dataset can now specify the format of its position arrays. This is a good idea in general, but it is only required if we're going to transform the positions. E.g. in the simulation config, adding:
position_format:
  distance: True
  normalized: True
  circular: True
  chromosome_size: 100000
  initial_position: 100
specifies that the position arrays are normalized circular distances, with the given chromosome size and initial position. This can also be something simpler like:
position_format:
  normalized: True
meaning these are just positions normalized to [0.0, 1.0).
- If we want to transform the positions, this is now another transformation that can be applied to the dataset during training. No position transforms are applied by default anymore. The old default behavior happened to be equivalent to (in the training config):
dataset_params:
  transforms:
    - {'name': 'reformat-position', 'distance': True, 'circular': True}
That is, it converted positions to circular distances. Note the warning that the 'reformat-position' transform should usually come before any other transforms. For example, if you put 'rotate' before 'reformat-position', you'll get strange results.
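To make the "circular distances" format concrete, here is a minimal sketch of one plausible reading of it. This is not the code in this MR: the function names are hypothetical, and it assumes that each entry is the gap to the previous position, that the first entry wraps around from the last position past the end of the chromosome, and that normalizing means dividing by chromosome_size. Under that reading, initial_position is what lets you recover absolute positions again.

```python
def positions_to_circular_distances(positions, chromosome_size, normalized=True):
    """Convert sorted absolute positions to circular distances (assumed semantics)."""
    # positions[i - 1] with i == 0 is the last element, so the first distance
    # wraps around the end of the chromosome.
    distances = [
        (positions[i] - positions[i - 1]) % chromosome_size
        for i in range(len(positions))
    ]
    if normalized:
        distances = [d / chromosome_size for d in distances]
    return distances


def circular_distances_to_positions(distances, chromosome_size, initial_position,
                                    normalized=True):
    """Invert the conversion above, starting from a known initial position."""
    if normalized:
        distances = [d * chromosome_size for d in distances]
    positions = []
    current = initial_position
    for i, d in enumerate(distances):
        # The first entry is the wrap-around gap, so it is skipped here:
        # the first position is initial_position itself.
        if i > 0:
            current = (current + d) % chromosome_size
        positions.append(current)
    return positions


distances = positions_to_circular_distances([100, 300, 700], chromosome_size=1000)
print(distances)  # [0.4, 0.2, 0.4]
```

Under these assumptions the distances always sum to one full chromosome length, which is why the absolute offset is lost unless initial_position is stored alongside them.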
Additional:
This MR also reworks how transforms are specified in the config (see #23), mainly to accommodate the fact that transforms should be applied in a specific order. So now, instead of being a mapping, dataset.transforms is a list of all the transforms that should be applied, in the order they are applied. Generally each entry has the format {'name': <transform-name>, 'param1': ..., 'param2': ..., ...}, where the additional paramN's are parameters for the transform itself. If the transform does not require any parameters (e.g. currently Rotate does not), you can also give the transform name by itself as a string. E.g.
dataset_params:
  transforms:
    - {'name': 'reformat-position', 'distance': True, 'circular': True}
    - 'rotate'
    - {'name': 'subsample', 'size': 5}
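For reviewers, the list format above can be read roughly like the following sketch. It is only an illustration of the idea, not the implementation in this MR: parse_transform_spec, build_pipeline and the toy registry are hypothetical names, and the "transforms" here are plain functions on numbers so that the effect of ordering is easy to see.

```python
def parse_transform_spec(spec):
    """Return (name, params) for one entry of the transforms list."""
    if isinstance(spec, str):
        # A bare string is a transform that takes no parameters.
        return spec, {}
    if isinstance(spec, dict):
        params = dict(spec)  # copy so the loaded config is not mutated
        if "name" not in params:
            raise ValueError(f"transform spec is missing 'name': {spec!r}")
        name = params.pop("name")
        return name, params
    raise TypeError(f"transform spec must be a str or dict, got {type(spec).__name__}")


def build_pipeline(specs, registry):
    """Instantiate each transform and compose them in list order."""
    steps = []
    for spec in specs:
        name, params = parse_transform_spec(spec)
        if name not in registry:
            raise KeyError(f"unknown transform: {name!r}")
        steps.append(registry[name](**params))

    def pipeline(sample):
        for step in steps:
            sample = step(sample)
        return sample

    return pipeline


# Hypothetical stand-ins for real transforms: each factory returns a callable.
toy_registry = {
    "add": lambda amount: (lambda x: x + amount),
    "double": lambda: (lambda x: x * 2),
}

pipeline = build_pipeline([{"name": "add", "amount": 3}, "double"], toy_registry)
print(pipeline(1))  # (1 + 3) * 2 = 8
```

Swapping the two entries gives 1 * 2 + 3 = 5 instead, which is the same reason a mapping (with no guaranteed meaning attached to key order) was not a good fit for transforms.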
Let me know if there's anything I can improve on this approach.
Note: This MR currently has some conflicts with !24, but I'll resolve those later, depending on which one gets merged first.