Jcury/fix preprocessing (!44) · Merge requests · Machine learning for population genetics / private / dnadna

Jean Cury requested to merge jcury/fix_preprocessing into master Oct 23, 2020

Proposal to fix issue #47 (closed) about multiprocessing.

In the NpzSNPSource object, I added an argument to return only the shapes of the arrays. If True, it returns just a dict with POS and SNP as keys and the shape of the corresponding array as values. If False (the default), the behavior is unchanged.
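For illustration only (this is not the code in this branch): one way to get array shapes from an .npz file without loading the data is to read just the .npy header of each member of the archive. The sketch below assumes the arrays are stored under the keys SNP and POS; the function name npz_shapes is hypothetical and not part of dnadna.

```python
import zipfile

from numpy.lib import format as npy_format


def npz_shapes(path, keys=("SNP", "POS")):
    """Return {key: shape} for arrays in an .npz file, reading headers only.

    Hypothetical helper: the key names are an assumption based on the
    description above, not on the actual NpzSNPSource code.
    """
    shapes = {}
    with zipfile.ZipFile(path) as archive:
        for key in keys:
            # np.savez stores each array as a member named "<key>.npy".
            with archive.open(key + ".npy") as member:
                version = npy_format.read_magic(member)
                if version == (1, 0):
                    header = npy_format.read_array_header_1_0(member)
                elif version == (2, 0):
                    header = npy_format.read_array_header_2_0(member)
                else:
                    raise ValueError(f"unsupported .npy version {version}")
                # header is (shape, fortran_order, dtype); keep the shape.
                shapes[key] = header[0]
    return shapes
```

Only the first bytes of each member are read, so the cost does not grow with the size of the arrays.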

DataPreprocessor.check_scenario() is adapted accordingly and no longer uses a SNPSample object. It would have been nice to use those objects, but they are created by loading the arrays directly into memory as tensors, which is too costly just to get the size of an array.

I don't understand the use of the magic method __new__() in SNPSample objects (why not a standard __init__()?). Maybe this class can be reworked so that instantiating it does not load the data (and could load the shape only), with a separate method to load data from disk into tensors. I am really not used to the __new__() method, so there might be a good reason for using it this way, but that's why I did it another way.
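A minimal sketch of the kind of rework suggested above, assuming a class that only records the file path at instantiation and reads the arrays on demand. The class name LazySample and its attributes are hypothetical; this is not the SNPSample API, and it returns NumPy arrays rather than tensors to stay self-contained.

```python
import numpy as np


class LazySample:
    """Hypothetical sample object that defers reading data from disk."""

    def __init__(self, path):
        # Instantiation is cheap: nothing is read from disk here.
        self.path = path
        self._arrays = None

    @property
    def loaded(self):
        return self._arrays is not None

    def load(self):
        """Read all arrays from the .npz file once and cache them."""
        if self._arrays is None:
            with np.load(self.path) as data:
                self._arrays = {key: data[key] for key in data.files}
        return self._arrays
```

With this layout, a check that only needs metadata never has to call load(), and converting the arrays to tensors could happen inside load().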

If others could confirm that using this branch instead of master massively decreases the preprocessing time with their data, that would be nice.
