Bug with missing data during preprocessing
I noticed that with the latest master branch (f568547e), ignore_missing is not working properly during preprocessing.
My dataset is missing a few files, so preprocessing crashed with:
an unexpected error occurred: could not load scenario 5864 replicate 0 from "/data1/embray/dnadna/one_event/scenarios/scenario_05864/one_event_05864_00.npz": FileNotFoundError(2, 'No file matching or similar to'); run again with --debug to view the full traceback
This is because ignore_missing=False by default. This is unfortunate, since the error can come as a nasty surprise far into a preprocessing run. Perhaps we should change the default for ignore_missing to True?
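Until the default is settled, one workaround is to check for missing files before starting a long run. The following is only a rough sketch: find_missing_replicates is a hypothetical helper (not part of dnadna), and the directory layout and zero-padding widths are assumptions inferred from the single path in the error message above.

```python
from pathlib import Path


def find_missing_replicates(root, name, n_scenarios, n_replicates):
    """Return the expected replicate files that do not exist under root.

    Hypothetical helper: the layout
    <root>/scenarios/scenario_NNNNN/<name>_NNNNN_RR.npz
    is assumed from the path shown in the error message.
    """
    root = Path(root)
    missing = []
    for scenario in range(n_scenarios):
        for replicate in range(n_replicates):
            path = (root / 'scenarios' / f'scenario_{scenario:05d}'
                    / f'{name}_{scenario:05d}_{replicate:02d}.npz')
            if not path.is_file():
                missing.append(path)
    return missing
```

Calling it with the dataset root, the dataset name (here one_event) and the expected scenario and replicate counts returns the list of absent files, so gaps show up before any preprocessing work is done.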
However, even after I edited the config file to set ignore_missing: true, I now get:
an unexpected error occurred: DNADataset index out of range; run again with --debug to view the full traceback
The full traceback:
Traceback (most recent call last):
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/datasets.py", line 555, in _guess_filename
    _, _, matching_filename = next(filename_iter)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/datasets.py", line 381, in __getitem__
    filename = self._get_filename(scenario, replicate)
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/datasets.py", line 530, in _get_filename
    return self._guess_filename(scenario, replicate)
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/datasets.py", line 561, in _guess_filename
    replicate=replicate))
FileNotFoundError: [Errno 2] No file matching or similar to: 'scenarios/scenario_5864/one_event_5864_0.npz'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/data_preprocessing.py", line 175, in check_scenario
    snp = self.dataset.source[scenario_idx, replicate_idx]
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/datasets.py", line 389, in __getitem__
    raise MissingSNPSample(scenario, replicate, filename, reason=exc)
dnadna.datasets.MissingSNPSample: could not load scenario 5864 replicate 0 from "scenarios/scenario_5864/one_event_5864_0.npz": FileNotFoundError(2, 'No file matching or similar to')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/datasets.py", line 950, in _get_index
    next_idx, value = self._cache_next_index()
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/datasets.py", line 924, in _cache_next_index
    next_idx, value = next(self._sample_iter)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/users/galac/embray/.local/opt/miniconda3/envs/dnadna/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/data_preprocessing.py", line 199, in _check_scenario_wrapped
    return self.check_scenario(*scenario)
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/data_preprocessing.py", line 182, in check_scenario
    if not self.dataset['ignore_missing']:
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/datasets.py", line 886, in __getitem__
    scenario_idx, replicate_idx = self._get_index(index)
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/datasets.py", line 957, in _get_index
    f'{self.__class__.__name__} index out of range')
IndexError: DNADataset index out of range
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/users/galac/embray/.local/opt/miniconda3/envs/dnadna/bin/dnadna", line 33, in <module>
    sys.exit(load_entry_point('dnadna', 'console_scripts', 'dnadna')())
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/utils/cli.py", line 242, in main
    raise exc
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/utils/cli.py", line 234, in main
    ret2 = cls.run_subcommand(args)
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/utils/cli.py", line 201, in run_subcommand
    return command_cls.main(command[1:], namespace=args)
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/utils/cli.py", line 242, in main
    raise exc
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/utils/cli.py", line 226, in main
    ret = cls.run(args)
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/cli/preprocess.py", line 38, in run
    preprocessor.run_preprocessing(progress_bar=True)
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/data_preprocessing.py", line 453, in run_preprocessing
    self.preprocess_scenario_params(progress_bar=progress_bar)
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/data_preprocessing.py", line 290, in preprocess_scenario_params
    for idx, result in enumerate(bar):
  File "/users/galac/embray/.local/opt/miniconda3/envs/dnadna/lib/python3.7/site-packages/tqdm/std.py", line 1097, in __iter__
    for obj in iterable:
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/data_preprocessing.py", line 252, in check_scenarios
    for result in iter_results():
  File "/users/galac/embray/src/ml_genetics/dnadna/dnadna/data_preprocessing.py", line 244, in iter_results
    param_iter):
  File "/users/galac/embray/.local/opt/miniconda3/envs/dnadna/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
IndexError: DNADataset index out of range
This used to work, so I believe this is a regression.
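One possible reading of the traceback: the final IndexError is raised from line 182 of data_preprocessing.py, where self.dataset['ignore_missing'] is evaluated. If self.dataset is a DNADataset there rather than a config mapping, the string key is treated as a sample index, the sample iterator runs out without matching it, and the lookup ends in "index out of range". I have not confirmed this against the source. The toy below only illustrates that failure mode: ToyDataset and its config attribute are made up for illustration and are not dnadna code.

```python
class ToyDataset:
    """Stand-in for an index-based dataset; NOT dnadna's DNADataset."""

    def __init__(self, samples, config):
        self._samples = list(samples)
        # Hypothetical attribute holding the settings mapping.
        self.config = dict(config)

    def __getitem__(self, index):
        # Every key is treated as a sample index, so a string key such as
        # 'ignore_missing' never matches and falls through to IndexError.
        for idx, value in enumerate(self._samples):
            if idx == index:
                return value
        raise IndexError(f'{self.__class__.__name__} index out of range')


dataset = ToyDataset(['sample0', 'sample1'], {'ignore_missing': True})

try:
    dataset['ignore_missing']  # looks like a config lookup, acts as an index
except IndexError as exc:
    message = str(exc)

# Reading the option from the settings mapping instead works as intended.
ignore_missing = dataset.config['ignore_missing']
```

If this reading is right, the fix would be to read ignore_missing from wherever the dataset's configuration actually lives rather than indexing the dataset object itself.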