Position normalization

E. Madison Bray requested to merge embray/position-normalization into master

Here's an attempt at implementing #6 (closed)

As we discussed in the issue, there are two sides to this:

  1. A dataset can now specify the format of its position arrays. This is a good idea in general, but it is only necessary if we're going to transform the positions. E.g. adding the following to the simulation config:
position_format:
    distance: True
    normalized: True
    circular: True
    chromosome_size: 100000
    initial_position: 100

specifies that the position arrays are normalized circular distances, with the given chromosome size and initial position. This can also be something simpler like:

position_format:
    normalized: True

meaning these are just positions normalized to [0.0, 1.0).

  2. If we want to transform the positions, this is now just another transform that can be applied to the dataset during training. No position transforms are applied by default anymore. The old default behavior happened to be equivalent to the following in the training config:
dataset_params:
    transforms:
        - {'name': 'reformat-position', 'distance': True, 'circular': True}

That is, it converted positions to circular distances. Note the warning that the 'reformat-position' transform should usually come before any other transforms. For example if you put 'rotate' before 'reformat-position' you'll get strange results.
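To make "circular distances" concrete, here is a minimal sketch of one plausible reading of that conversion. It is not the code from this MR: the function name `to_circular_distances` is hypothetical, and the convention that the first entry is the wrap-around gap from the last position back to the first is an assumption. `initial_position` from the format description is not used here.

```python
def to_circular_distances(positions, chromosome_size, normalized=True):
    """Convert sorted absolute positions to gaps between neighbours.

    Assumed convention (not confirmed by the MR): entry 0 is the
    wrap-around gap from the last position back to the first one.
    """
    distances = []
    for i, pos in enumerate(positions):
        prev = positions[i - 1]  # for i == 0 this is the last position
        distances.append((pos - prev) % chromosome_size)
    if normalized:
        # Scale so that gaps are expressed as a fraction of the chromosome.
        distances = [d / chromosome_size for d in distances]
    return distances


raw = to_circular_distances([100, 300, 900], 1000, normalized=False)
scaled = to_circular_distances([100, 300, 900], 1000)
```

With two or more distinct positions the raw gaps sum to the chromosome size, which is what makes the representation circular. A later transform such as 'rotate' would then presumably operate on gaps rather than absolute positions, which may be one reason the order of transforms matters.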

Additional:

This MR also reworks how transforms are specified in the config (see #23), mainly to accommodate the fact that transforms should be applied in a specific order. Instead of being a mapping, dataset.transforms is now a list of all the transforms to apply, in the order they are applied. Generally each entry has the format {'name': <transform-name>, 'param1': ..., 'param2': ..., ...}, where the additional paramN entries are parameters for the transform itself. If the transform does not require any parameters (e.g. currently Rotate does not), you can also give the transform name by itself as a string. E.g.

dataset_params:
    transforms:
        - {'name': 'reformat-position', 'distance': True, 'circular': True}
        - 'rotate'
        - {'name': 'subsample', 'size': 5}
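
As a rough illustration of how such a list could be consumed, the sketch below splits each entry into a name and its parameters while preserving order. The helper names `parse_transform_spec` and `parse_transforms` are hypothetical and not taken from this MR, and the ordering warning is only an approximation of the one mentioned above.

```python
import warnings


def parse_transform_spec(spec):
    """Split one entry of the transforms list into (name, params)."""
    if isinstance(spec, str):
        # Bare string: a transform that takes no parameters, e.g. 'rotate'.
        return spec, {}
    if isinstance(spec, dict):
        params = dict(spec)  # copy so the loaded config is not mutated
        if 'name' not in params:
            raise ValueError(f"transform spec has no 'name': {spec!r}")
        return params.pop('name'), params
    raise TypeError(f"unsupported transform spec: {spec!r}")


def parse_transforms(specs):
    """Parse the whole list, keeping the order given in the config."""
    parsed = [parse_transform_spec(spec) for spec in specs]
    names = [name for name, _ in parsed]
    # Assumed check: warn when 'reformat-position' is not the first entry.
    if 'reformat-position' in names[1:]:
        warnings.warn("'reformat-position' should usually come first")
    return parsed


transforms = parse_transforms([
    {'name': 'reformat-position', 'distance': True, 'circular': True},
    'rotate',
    {'name': 'subsample', 'size': 5},
])
```

Returning plain (name, params) pairs keeps the parsing step separate from however the transform classes are actually looked up and instantiated.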

Let me know if there's anything I can improve on this approach.

Note: This MR currently has some conflicts with !24 (merged), but I'll resolve those conflicts later depending on which one gets merged first.

Edited by E. Madison Bray
