Checking for data quality causes more bugs than it solves
Issue description
Our `job.check_data_quality` function may not be ideal. We check things like whether the number of columns matches across sites, but none of this matters: the researcher may only be interested in a subset of those columns anyway, and we shouldn't prevent them from running the experiment!
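For context, here is a minimal sketch of the kind of check at issue (the `site_frames` structure and the exact comparison are assumptions for illustration, not our actual implementation):

```python
import pandas as pd


def check_data_quality(site_frames: dict[str, pd.DataFrame]) -> None:
    """Hypothetical sketch of the current check: require every site's
    table to expose exactly the same columns, even ones the researcher
    will never use."""
    reference_site, reference_frame = next(iter(site_frames.items()))
    for site, frame in site_frames.items():
        # This is the kind of blanket assertion the issue argues against:
        # it fails even when the mismatched columns are irrelevant.
        assert list(frame.columns) == list(reference_frame.columns), (
            f"columns of site {site!r} do not match site {reference_site!r}"
        )
```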
Furthermore, we have a recent bug raised on the mailing list by Armin, caused by pandas trying to guess the data types of columns. Here again we shouldn't raise any error, since the researcher may prefer to do that conversion in the `training_data` function, or may not even care about converting them!
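To illustrate (the CSV contents and the `training_data` body below are made up for the example): pandas will infer `object` for a column with mixed values, and the researcher can handle the conversion themselves where it matters, rather than us raising an error upfront:

```python
import io

import pandas as pd

# A single stray string makes pandas guess dtype "object" for the column.
raw = pd.read_csv(io.StringIO("age\n31\n45\nunknown\n"))
print(raw["age"].dtype)  # object


def training_data(frame: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical researcher-side conversion: coerce what can be
    # coerced, drop the rest.
    frame = frame.copy()
    frame["age"] = pd.to_numeric(frame["age"], errors="coerce")
    return frame.dropna(subset=["age"])


print(training_data(raw)["age"].dtype)  # float64
```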
Finally, we use `assert` statements, which lead to confusing exceptions: in Armin's case, a generic `AssertionError` was raised, although thankfully the error message was preserved and was therefore easily understandable by humans.
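A minimal illustration of the difference (the check itself is hypothetical): an `assert` surfaces as a bare `AssertionError` whose type says nothing about the problem, and it disappears entirely when Python runs with `-O`, whereas an explicit exception names what went wrong:

```python
import pandas as pd


def check_with_assert(frame: pd.DataFrame) -> None:
    # Fails with AssertionError; the exception type carries no
    # information, and the whole check vanishes under `python -O`.
    assert "age" in frame.columns, "missing 'age' column"


def check_with_exception(frame: pd.DataFrame) -> None:
    # An explicit, named exception tells the researcher exactly
    # what happened.
    if "age" not in frame.columns:
        raise ValueError(
            f"expected an 'age' column, got {list(frame.columns)!r}"
        )
```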
Solutions
My proposal is that we don't check for any of this quality stuff. We should not be so strict, and we certainly should not check for quality before the `training_data` function has been executed. In short: drop the checks altogether. If there is an error, it will surface during training and still be communicated to the researcher.
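As a sketch of the proposed flow (`training_data`, `train_model`, and `run_job` are stand-ins for our API, not the real signatures), the job would simply run and let any exception propagate with its original message:

```python
import pandas as pd


def training_data(frame: pd.DataFrame) -> pd.DataFrame:
    # Researcher-owned preparation step (stub for the example).
    return frame


def train_model(frame: pd.DataFrame) -> None:
    # Stand-in for the actual training entry point.
    frame.mean(numeric_only=True)


def run_job(frame: pd.DataFrame) -> None:
    # No upfront quality gate: if the data is unusable, the error
    # surfaces here during training, with its message intact.
    train_model(training_data(frame))


run_job(pd.DataFrame({"age": [31, 45]}))
```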