Checking for data quality causes more bugs than it solves
Issue description
Our `job.check_data_quality` function may not be ideal. We check things like whether the number of columns matches across sites, but none of this matters: the researcher may only be interested in a subset of those columns anyway, and we shouldn't prevent them from running the experiment!
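For context, here is a minimal sketch of the kind of check at issue (the `site_frames` structure and the exact comparison are assumptions for illustration, not our actual implementation):

```python
import pandas as pd


def check_data_quality(site_frames: dict[str, pd.DataFrame]) -> None:
    """Hypothetical sketch of the current check: require every site's
    table to expose exactly the same columns, even ones the researcher
    will never use."""
    reference_site, reference_frame = next(iter(site_frames.items()))
    for site, frame in site_frames.items():
        # This is the kind of blanket assertion the issue argues against:
        # it fails even when the mismatched columns are irrelevant.
        assert list(frame.columns) == list(reference_frame.columns), (
            f"columns of site {site!r} do not match site {reference_site!r}"
        )
```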
Furthermore, we have a recent bug raised on the mailing list by Armin, caused by pandas trying to guess the data types of columns. Here again we shouldn't raise any error, since the researcher may prefer to do that conversion in the `training_data` function, or may not even care about converting them!
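To illustrate (the CSV contents and the `training_data` body below are made up for the example): pandas will infer `object` for a column with mixed values, and the researcher can handle the conversion themselves where it matters, rather than us raising an error upfront:

```python
import io

import pandas as pd

# A single stray string makes pandas guess dtype "object" for the column.
raw = pd.read_csv(io.StringIO("age\n31\n45\nunknown\n"))
print(raw["age"].dtype)  # object


def training_data(frame: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical researcher-side conversion: coerce what can be
    # coerced, drop the rest.
    frame = frame.copy()
    frame["age"] = pd.to_numeric(frame["age"], errors="coerce")
    return frame.dropna(subset=["age"])


print(training_data(raw)["age"].dtype)  # float64
```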
Finally, we use `assert` statements, which lead to confusing exceptions: in Armin's case, a generic `AssertionError` was raised, although thankfully the error message was preserved and was therefore easily understandable by humans.
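A minimal illustration of the difference (the check itself is hypothetical): an `assert` surfaces as a bare `AssertionError` whose type says nothing about the problem, and it disappears entirely when Python runs with `-O`, whereas an explicit exception names what went wrong:

```python
import pandas as pd


def check_with_assert(frame: pd.DataFrame) -> None:
    # Fails with AssertionError; the exception type carries no
    # information, and the whole check vanishes under `python -O`.
    assert "age" in frame.columns, "missing 'age' column"


def check_with_exception(frame: pd.DataFrame) -> None:
    # An explicit, named exception tells the researcher exactly
    # what happened.
    if "age" not in frame.columns:
        raise ValueError(
            f"expected an 'age' column, got {list(frame.columns)!r}"
        )
```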
Solutions
My proposal is that we don't check for any of this quality stuff. We should not be so strict, and we certainly should not check for quality before the `training_data` function has been executed. In short: drop the checks altogether. If there is an error, it will surface during training and still be communicated to the researcher.
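As a sketch of the proposed flow (`training_data`, `train_model`, and `run_job` are stand-ins for our API, not the real signatures), the job would simply run and let any exception propagate with its original message:

```python
import pandas as pd


def training_data(frame: pd.DataFrame) -> pd.DataFrame:
    # Researcher-owned preparation step (stub for the example).
    return frame


def train_model(frame: pd.DataFrame) -> None:
    # Stand-in for the actual training entry point.
    frame.mean(numeric_only=True)


def run_job(frame: pd.DataFrame) -> None:
    # No upfront quality gate: if the data is unusable, the error
    # surfaces here during training, with its message intact.
    train_model(training_data(frame))


run_job(pd.DataFrame({"age": [31, 45]}))
```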