Reworked transform configuration from #68 (!70) · Merge requests · Machine learning for population genetics / private / dnadna

E. Madison Bray requested to merge embray/refactoring/transform-configuration-take2 into master Apr 02, 2021

Partial implementation of the new transform configuration format suggested by @jcury in #68 (closed). It does not yet use splits for training/validation/test but that will be easier to add on once I implement better dataset splitting.

The dataset_params section in the training config is now gone. Every option previously in it has been moved elsewhere (namely ignore_missing) or re-implemented as a transform.

To summarize, you can now configure transforms like:

    dataset_transforms:
        - rotate
        - subsample: 100
        - crop:
            max_snp: 500
            max_indiv: 100

This demonstrates 3 different styles allowed depending on the transform:

- If the transform does not take any arguments, such as Rotate, you just list the name of the transform as a string. (In the future we could maybe extend this to any transform that has defaults for all its arguments.)
- If the transform takes only one argument, you can pass that argument as the value of the transform without specifying the argument's name; e.g.
  ```
  - subsample: 100
  ```
  and
  ```
  - subsample:
      size: 100
  ```
  are equivalent.
- Otherwise, you give the transform name and, as its value, a dict mapping its argument names to values:
  ```
  - crop:
      max_snp: 500
      max_indiv: 100
  ```

This last style has a slight ambiguity: if the transform takes only one argument and that argument happens to be a dict, the value could be read either as the argument itself or as a mapping of argument names to values. I try to resolve this by checking the keys in the dict, and this case will probably be rare or even non-existent anyway.
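
To make the three styles and the key check concrete, here is a rough sketch of how one config entry could be normalized into a name and keyword arguments. It is not the code from this merge request: the `REGISTRY` dict, the `normalize` function and the toy `rotate`, `subsample` and `crop` callables are hypothetical stand-ins, and using `inspect.signature` to find the argument names is an assumption about how the lookup might be done.

```python
import inspect


# Hypothetical stand-ins for real transforms; only their signatures matter here.
def rotate():
    return "rotate"


def subsample(size):
    return f"subsample(size={size})"


def crop(max_snp=None, max_indiv=None):
    return f"crop(max_snp={max_snp}, max_indiv={max_indiv})"


# Hypothetical registry mapping config names to transform callables.
REGISTRY = {"rotate": rotate, "subsample": subsample, "crop": crop}


def normalize(entry):
    """Turn one config entry into a (name, kwargs) pair."""
    if isinstance(entry, str):
        # Style 1: bare name, no arguments.
        return entry, {}

    if not (isinstance(entry, dict) and len(entry) == 1):
        raise ValueError(f"invalid transform entry: {entry!r}")

    (name, value), = entry.items()
    params = list(inspect.signature(REGISTRY[name]).parameters)

    # Style 3: a dict whose keys are all argument names of the transform.
    if isinstance(value, dict) and set(value) <= set(params):
        return name, dict(value)

    # Style 2: a bare value for a single-argument transform. A dict value
    # whose keys are not argument names also lands here, which is the
    # key-based disambiguation.
    if len(params) == 1:
        return name, {params[0]: value}

    raise ValueError(f"cannot interpret arguments for {name!r}: {value!r}")
```

With this sketch, `{"subsample": 100}` and `{"subsample": {"size": 100}}` both normalize to the same pair. A single-argument transform whose dict argument happens to use its own argument name as a key would still be misread, which is the residual ambiguity mentioned above.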

Additionally:

- max_snp and max_indiv are replaced by a single transform named crop.
- The concat: true|false option has been re-implemented as a transform named snp_format, which can be specified like snp_format: concat or snp_format: product. If unspecified, 'concat' is still the default format. This could easily be extended to support additional formats if there are any that make sense.
- The pesky uniform: true|false option which we've struggled with has also been removed. Its functionality is replaced by an identity transform called validate_snp, which does not modify the data but can perform different validations on it during training and throw out SNPs that fail validation.

  This is not enabled, however, unless validate_snp is explicitly included in the transform list.

  Currently the only validation performed is the same one that uniform: true implemented: it checks whether all SNPs have the same shape, and throws out the ones that don't have the same shape as the first one. This could probably be further improved if anyone even has a use for it.
- I removed the maf and transform_allel_min_major options entirely since they were unused. As previously noted, they could probably be implemented as transforms. I think I roughly understand what they are supposed to do, but I opted to exclude them for now until someone explicitly requests them. It might also be a good chance for someone else to try implementing a Transform 😃
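
As an illustration of the shape check that validate_snp inherits from uniform: true, here is a minimal sketch. It assumes each SNP sample is a rectangular list of rows; `shape_of` and `drop_mismatched` are hypothetical helpers written for this example, not the actual implementation, which presumably works on the project's own data types.

```python
def shape_of(matrix):
    """Shape of a rectangular list-of-lists matrix as (rows, columns)."""
    return (len(matrix), len(matrix[0]) if matrix else 0)


def drop_mismatched(samples):
    """Keep only the samples whose shape matches that of the first one."""
    if not samples:
        return []
    expected = shape_of(samples[0])
    return [s for s in samples if shape_of(s) == expected]
```

Note that the first sample sets the reference shape, so if the first sample is the odd one out, every other sample gets discarded. That may be one of the things that could be improved.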

Edited Apr 02, 2021 by E. Madison Bray
