ce-bench now fails when there is a diff
ce-bench is run but always returns exit code 0, so differences in counterexamples are never noticed. I don't know whether this was intentional, on the assumption that counterexample diffs are minor. If it was, we should not run ce-bench at all. If not, this merge request solves the problem. Cc @marche for information.
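
For illustration only, here is a minimal Python sketch of the intended behaviour: compare each output against its oracle and return a non-zero status as soon as any case differs. This is not the actual ce-bench script. The names `diff_against_oracle` and `run_bench`, and the in-memory `cases` mapping, are hypothetical and chosen just for this example.

```python
import difflib


def diff_against_oracle(actual: str, expected: str) -> list[str]:
    """Return unified-diff lines between the oracle and the actual output."""
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="oracle",
            tofile="output",
            lineterm="",
        )
    )


def run_bench(cases: dict[str, tuple[str, str]]) -> int:
    """Check every case and return an exit status.

    `cases` maps a (hypothetical) case name to a pair (actual, expected).
    The status is 0 when all outputs match and 1 when at least one differs,
    so a caller such as CI can detect the diff.
    """
    failed = 0
    for name, (actual, expected) in cases.items():
        diff = diff_against_oracle(actual, expected)
        if diff:
            failed += 1
            print(f"FAILED {name}")
            print("\n".join(diff))
    # Report failure only through the return value; the caller decides
    # how to exit.
    return 1 if failed else 0
```

A wrapper would pass the result to `sys.exit(run_bench(cases))` so that the status reaches the CI job instead of being dropped.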