Define an interface for dataset transformations

Some of the documentation (in docstrings) currently suggests that a dataset transformation can be any function/callable that accepts a Tensor as an argument and returns a Tensor. This is not accurate: the existing transformation functions in the code work on NumPy arrays, not PyTorch Tensors, and their actual signatures are different from what the docstrings describe.

Currently there are two transforms defined in the code (taken originally from Jean's code):

In Théophile's code there are two additional related operations that could be implemented as transforms in the same interface:

transform_min_to_major (as in allele frequency)
remove_maf_folded

These transforms take a 3-tuple of (pos, snp, shift) as input and return a tuple of the same form. Here shift is any shift applied to the position values (the current implementation assumes positions are normalized to [0.0, 1.0)). None of the transforms currently reads shift as an input, though one could if, say, one wanted to perform multiple rounds of rotation, so it makes sense to pass it through.

My point is just that this interface needs to be formalized and documented. I would also swap snp and pos in the tuple, which is more consistent with other parts of the code, but this is trivial. And maybe, rather than taking a tuple, the transforms could simply take three arguments.
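As a starting point for discussion, here is a minimal sketch of what a formalized interface might look like. Everything in it is an assumption rather than existing code: the name DatasetTransform, the three-argument (snp, pos, shift) order, and the Identity example are hypothetical, and plain Python sequences stand in for the NumPy arrays the real transforms use.

```python
from typing import Protocol, Sequence, Tuple, runtime_checkable

# Stand-ins for the array types; in the real code these would be NumPy arrays.
Snp = Sequence[Sequence[int]]
Pos = Sequence[float]
Sample = Tuple[Snp, Pos, float]


@runtime_checkable
class DatasetTransform(Protocol):
    """Hypothetical interface: takes (snp, pos, shift), returns the same triple."""

    def __call__(self, snp: Snp, pos: Pos, shift: float = 0.0) -> Sample:
        ...


class Identity:
    """Trivial transform that passes all three values through unchanged."""

    def __call__(self, snp, pos, shift=0.0):
        return snp, pos, shift
```

Using a Protocol means existing functions or callable classes would satisfy the interface without inheriting from a base class, as long as they accept and return the three values.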

Finally, it is currently not possible to specify the order in which the transforms are applied, nor to apply the same transform more than once (such as performing multiple rounds of rotation, as in the example I gave earlier). I don't know if this is desirable, but I imagine it likely is.
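One possible way to support ordering and repetition would be to pass an explicit list of transforms that is applied in sequence. The sketch below is only an illustration under stated assumptions: Compose and rotate_by are hypothetical names, the three-argument (snp, pos, shift) signature is the proposed one rather than the current one, and the toy rotation uses plain lists instead of NumPy arrays.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[list, list, float]
Transform = Callable[[list, list, float], Sample]


class Compose:
    """Apply transforms in the given order; the same transform may appear twice."""

    def __init__(self, transforms: Sequence[Transform]):
        self.transforms = list(transforms)

    def __call__(self, snp, pos, shift=0.0):
        for transform in self.transforms:
            snp, pos, shift = transform(snp, pos, shift)
        return snp, pos, shift


def rotate_by(offset):
    """Hypothetical toy rotation: move normalized positions by offset, wrapping into [0, 1).

    It only updates positions and the accumulated shift; it does not reorder
    SNP columns the way a real rotation would need to.
    """

    def transform(snp, pos, shift=0.0):
        new_pos = [(p + offset) % 1.0 for p in pos]
        return snp, new_pos, (shift + offset) % 1.0

    return transform


# Two rounds of rotation, in an explicit order.
pipeline = Compose([rotate_by(0.25), rotate_by(0.25)])
snp, pos, shift = pipeline([[0, 1, 1]], [0.0, 0.5, 0.75])
```

Here the accumulated shift is passed through each step, which is the case where having shift as an input becomes useful.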
