This was discussed approximately 6 months ago, and several Chameleon developers agreed that it would be nice to have such a thing in Chameleon. However, it raised several questions: what is the proper way to do it, where to put the code (GFlop is unrelated to the runtime, but as far as I know only StarPU uses it for now), how to ensure that all codelets have GFlop, etc. As no conclusion was reached, the question was postponed and unfortunately nothing was pushed to Chameleon.
At the time, I added a few lines of code for computing GFlop for the Chameleon Cholesky. Maybe this can serve as an example for whoever addresses the whole issue. The attached patch_flops contains this code, but the patch itself is probably old and broken, and likely needs some minor editing before it can be applied directly.
I don't see any major problem with it. So, if one of you proposes a merge request that covers all the main kernels, there is no problem. For information, the proposed patch is incorrect, as it uses the wrong precision: s instead of z.
Also be careful: most of the flops are already provided (sometimes incorrectly) in the callback function.
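To illustrate why the s versus z mix-up matters, here is a minimal sketch of the usual LAPACK-style operation counts for a Cholesky factorization of order n. The function names are hypothetical and not taken from the patch or from Chameleon. It assumes the common convention that a complex multiplication costs 6 real flops and a complex addition costs 2.

```c
/* Number of multiplications in a Cholesky factorization of order n
 * (conventional LAPACK operation count). */
double example_fmuls_potrf(double n)
{
    return n * (n * (n / 6.0 + 0.5) + 1.0 / 3.0);
}

/* Number of additions in a Cholesky factorization of order n. */
double example_fadds_potrf(double n)
{
    return n * (n * (n / 6.0) - 1.0 / 6.0);
}

/* Real precisions (s, d): one flop per multiplication or addition. */
double example_flops_spotrf(double n)
{
    return example_fmuls_potrf(n) + example_fadds_potrf(n);
}

/* Complex precisions (c, z): assumed 6 flops per multiplication
 * and 2 flops per addition. */
double example_flops_zpotrf(double n)
{
    return 6.0 * example_fmuls_potrf(n) + 2.0 * example_fadds_potrf(n);
}
```

With these counts, the complex figure is roughly four times the real one for large n, so reporting the s count for a z kernel would badly underestimate the GFlop rate.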
The only thing I don't like about this, even though it might be really useful for all of us, is the multiplication of those trace-specific parameters, which increases the number of arguments given to the insert task function and therefore its cost. We might need to think about enabling it at compile time only.
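One possible way to do the compile-time enabling is a macro that expands either to the extra tag/value pair or to nothing. The sketch below is self-contained and uses a mock variadic insert function. Every name in it (`EX_ENABLE_FLOPS`, `EX_FLOPS_ARG`, `ex_insert_task`, the `EX_*` tags) is hypothetical and is not the Chameleon or StarPU API.

```c
#include <stdarg.h>

/* Hypothetical argument tags for a variadic task-insertion call. */
enum { EX_END = 0, EX_VALUE = 1, EX_FLOPS = 2 };

/* Build with -DEX_ENABLE_FLOPS to pass the flop count. Otherwise the
 * macro expands to nothing and the call carries no extra argument. */
#ifdef EX_ENABLE_FLOPS
#define EX_FLOPS_ARG(f) EX_FLOPS, (double)(f),
#else
#define EX_FLOPS_ARG(f)
#endif

/* Mock insert function: walks the tag/value pairs up to EX_END and
 * returns how many pairs were passed. */
int ex_insert_task(const char *name, ...)
{
    va_list ap;
    int count = 0;

    (void)name;
    va_start(ap, name);
    for (;;) {
        int tag = va_arg(ap, int);
        if (tag == EX_VALUE) {
            (void)va_arg(ap, int);
        } else if (tag == EX_FLOPS) {
            (void)va_arg(ap, double);
        } else {
            break; /* EX_END or an unknown tag stops the walk */
        }
        count++;
    }
    va_end(ap);
    return count;
}

/* Example call site: the flops argument only exists when enabled. */
int ex_submit_potrf(int n, double flops)
{
    (void)flops; /* unused when EX_ENABLE_FLOPS is not defined */
    return ex_insert_task("potrf", EX_VALUE, n, EX_FLOPS_ARG(flops) EX_END);
}
```

In a default build the call site passes a single pair, so the insert function has nothing extra to parse. The flops pair only appears when the option is turned on at compile time.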