Reworked transform configuration from #68
Partial implementation of the new transform configuration format suggested by @jcury in #68 (closed). It does not yet use splits for training/validation/test, but that will be easier to add once I implement better dataset splitting.
The `dataset_params` section in the training config is now gone. Every option previously in it has either been moved elsewhere (namely `ignore_missing`) or been re-implemented as a transform.
To summarize, you can now configure transforms like:

    dataset_transforms:
      - rotate
      - subsample: 100
      - crop:
          max_snp: 500
          max_indiv: 100

This demonstrates 3 different styles allowed depending on the transform:
- If the transform takes no arguments, such as `rotate`, just list its name as a string. (In the future we could perhaps extend this to any transform that has defaults for all its arguments.)
- If the transform takes only one argument, you can pass that argument directly as the value of the transform without specifying the argument's name; e.g.

      - subsample: 100

  and

      - subsample:
          size: 100

  are equivalent.
- Otherwise, you give the transform name and, as its value, a dict mapping its argument names to values:

      - crop:
          max_snp: 500
          max_indiv: 100

There is a slight ambiguity between the last two cases: if a transform takes only one argument and that argument is itself a dict, the value could be read either as the argument or as a mapping of argument names to values (I try to resolve this by checking the keys in the dict). This case will probably be rare or even non-existent, though.
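To make the three styles concrete, here is a minimal sketch of how a parsed `dataset_transforms` list could be normalized into `(name, kwargs)` pairs, including the key check used to resolve the ambiguity. It is not the code in this MR: the `TRANSFORM_ARGS` registry, the `normalize_transform_spec` helper, and the error messages are all hypothetical, and the config is written as the plain Python lists and dicts a YAML loader would produce.

```python
# Hypothetical registry: transform name -> ordered argument names.
# The real transform classes and their signatures may differ.
TRANSFORM_ARGS = {
    "rotate": [],
    "subsample": ["size"],
    "crop": ["max_snp", "max_indiv"],
}


def normalize_transform_spec(spec):
    """Return (name, kwargs) for one entry of a dataset_transforms list."""
    if isinstance(spec, str):
        # Style 1: a bare string names a transform without arguments.
        name, value = spec, None
    elif isinstance(spec, dict) and len(spec) == 1:
        # Styles 2 and 3: a single-key mapping {name: value}.
        (name, value), = spec.items()
    else:
        raise ValueError(f"malformed transform entry: {spec!r}")

    if name not in TRANSFORM_ARGS:
        raise ValueError(f"unknown transform: {name!r}")
    arg_names = TRANSFORM_ARGS[name]

    if value is None:
        if arg_names:
            raise ValueError(f"{name!r} requires arguments: {arg_names}")
        return name, {}

    # Style 3: a dict whose keys are all known argument names is
    # treated as keyword arguments (this is the key check).
    if isinstance(value, dict) and value and set(value) <= set(arg_names):
        return name, dict(value)

    # Style 2: any other value is the sole argument of a
    # single-argument transform.
    if len(arg_names) == 1:
        return name, {arg_names[0]: value}

    raise ValueError(f"cannot interpret arguments for {name!r}: {value!r}")


# The example config from above, as parsed Python objects.
config = [
    "rotate",
    {"subsample": 100},
    {"crop": {"max_snp": 500, "max_indiv": 100}},
]
normalized = [normalize_transform_spec(entry) for entry in config]
```

With this approach a single-argument transform whose argument is a dict is only misread when the dict's keys happen to match the transform's own argument name, which is why the ambiguity should rarely matter.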
Additionally:
- `max_snp` and `max_indiv` are replaced by a single transform named `crop`.
- The `concat: true|false` option has been re-implemented as a transform named `snp_format`, which can be specified like `snp_format: concat` or `snp_format: product`. If unspecified, `'concat'` is still the default format. This could easily be extended to support additional formats if there are any that make sense.
- The pesky `uniform: true|false` option, which we've struggled with, has also been removed. Its functionality is replaced by an identity transform called `validate_snp`, which does not modify the data but can perform different validations on it during training and throw out SNPs that fail validation.

  This is not enabled, however, unless `validate_snp` is explicitly included in the transform list.

  Currently the only validation performed is the same one that `uniform: true` implemented: it checks that all SNPs have the same shape, and throws out any whose shape differs from that of the first one. This could probably be improved further if anyone even has a use for it.
- I removed the `maf` and `transform_allel_min_major` options entirely since they were unused. As previously noted, they could probably be implemented as transforms. I think I roughly understand what they are supposed to do, but I opted to exclude them for now until someone explicitly requests them. It might also be a good chance for someone else to try implementing a Transform 😃
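For anyone who wants to try writing a transform, the shape check described for `validate_snp` can be sketched roughly as below. This is only an illustration under assumptions, not the implementation in this MR: the `ValidateSnp` class, the `InvalidSnpError` exception, the `apply_and_filter` loop, and the use of plain lists of lists in place of real SNP matrices are all hypothetical, and the actual mechanism for discarding a sample may differ.

```python
class InvalidSnpError(ValueError):
    """Hypothetical exception signalling that a sample failed validation."""


def shape_of(matrix):
    """Shape of a rectangular list of lists as (rows, columns)."""
    return (len(matrix), len(matrix[0]) if matrix else 0)


class ValidateSnp:
    """Identity transform: returns its input unchanged, or raises."""

    def __init__(self):
        # The expected shape is taken from the first sample seen.
        self.expected_shape = None

    def __call__(self, snp):
        shape = shape_of(snp)
        if self.expected_shape is None:
            self.expected_shape = shape
        elif shape != self.expected_shape:
            raise InvalidSnpError(
                f"expected shape {self.expected_shape}, got {shape}"
            )
        return snp


def apply_and_filter(samples, transform):
    """Apply the transform, dropping samples that fail validation."""
    kept = []
    for sample in samples:
        try:
            kept.append(transform(sample))
        except InvalidSnpError:
            continue  # throw out the sample and move on
    return kept


samples = [
    [[0, 1], [1, 0]],        # 2 x 2, sets the expected shape
    [[1, 1, 0], [0, 0, 1]],  # 2 x 3, fails validation
    [[1, 0], [0, 1]],        # 2 x 2, passes
]
kept = apply_and_filter(samples, ValidateSnp())
```

Here `kept` contains the first and third samples, untouched, which mirrors the "does not modify the data" behaviour described above.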