Jcury/fix preprocessing
Proposal to fix issue #47 (closed), about multiprocessing.
In the `NpzSNPSource` object, I added an argument to return only the shape of the arrays. If True, it returns just a dict with POS and SNP as keys and the shape of the corresponding array as values. If False (the default), the behavior is unchanged.
`DataPreprocessor.check_scenario()` is adapted accordingly. It doesn't use a `SNPSample` object anymore.
It would have been nice to use those objects, but creating one loads the arrays from disk directly into memory as tensors, which is far too costly when all we need is the size of an array.
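For illustration, here is a minimal sketch of how array shapes can be read from an `.npz` file without loading the data. It is not the code in this branch: the function name `npz_shapes` is hypothetical, and it assumes the file stores its arrays under the keys `SNP` and `POS`. It only parses the `.npy` header of each archive member, using `numpy.lib.format`.

```python
import zipfile

import numpy as np
from numpy.lib import format as npy_format


def npz_shapes(path, keys=("SNP", "POS")):
    """Return {key: shape} for arrays in an .npz file, reading headers only.

    Hypothetical helper: the name and the default keys are assumptions.
    """
    shapes = {}
    # An .npz file is a zip archive with one "<key>.npy" member per array.
    with zipfile.ZipFile(path) as archive:
        for key in keys:
            with archive.open(key + ".npy") as member:
                # The .npy header stores the shape, so the data is never read.
                version = npy_format.read_magic(member)
                if version == (1, 0):
                    shape, _, _ = npy_format.read_array_header_1_0(member)
                elif version == (2, 0):
                    shape, _, _ = npy_format.read_array_header_2_0(member)
                else:
                    raise ValueError(f"unsupported .npy version: {version}")
                shapes[key] = shape
    return shapes
```

Because only the first bytes of each member are read, the cost should not depend on the size of the arrays, even for compressed archives.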
I don't understand the use of the magic method `__new__()` in `SNPSample` objects (why not a standard `__init__()`?). Maybe this class could be reworked so that instantiating it does not load the data (it could load only the shape), with a separate method to load the data from disk into tensors. I am really not used to the `__new__()` method, so there might be a good reason for using it this way, which is why I went a different route.
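To make the suggested rework concrete, here is a rough sketch of a class whose instantiation does not touch the disk. The class name `LazySNPSample` and its attributes are hypothetical, it uses a plain `__init__()`, and it returns NumPy arrays instead of tensors to stay self-contained. It is not a drop-in replacement for `SNPSample`.

```python
import numpy as np


class LazySNPSample:
    """Hypothetical stand-in for SNPSample that defers reading from disk."""

    def __init__(self, path):
        # Only remember where the data lives; nothing is read here.
        self.path = path
        self._arrays = None

    @property
    def loaded(self):
        """True once the arrays have been read into memory."""
        return self._arrays is not None

    def load(self):
        """Read the SNP and POS arrays on first call, then reuse the cache."""
        if self._arrays is None:
            with np.load(self.path) as npz:
                # Indexing the NpzFile is what actually reads each array.
                self._arrays = {"SNP": npz["SNP"], "POS": npz["POS"]}
        return self._arrays
```

With this layout, code that only needs metadata can create the object cheaply, and the conversion to tensors would happen inside `load()`.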
It would be nice if others could confirm that using this branch instead of master massively decreases the preprocessing time on their data.