Check the number of instructions per benchmark
500 instr/loop is probably too much, as the µop cache only holds 1500 entries of unfused µops (see WikiChip: µop fusion occurs in the IDQ, after the cache!) and complex instructions (note that this article states that fused µops stay fused):
- are decoded at a rate of at most 1 instr/cycle (with a peak throughput of 2 µops/cycle)
- produce >= 2 µops/cycle
So the bench `add reg, mem`^1 `sub mem, reg`^1 will saturate the µop cache. However, these instructions are also port-limited, and uops.info even predicts a lower throughput than expected (due to fusion?). We need to discuss this and settle on a decision.
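To make the discussion concrete, here is a rough sketch of the footprint arithmetic. It assumes the 1500-entry figure quoted above. The per-instruction unfused µop counts and the even 250/250 split of the loop body are hypothetical placeholders, to be checked against uops.info. The function name `unfused_uop_footprint` is made up for this example, and the sketch ignores any per-line packing constraints of the cache.

```python
# Capacity figure quoted in this issue (not measured here).
UOP_CACHE_ENTRIES = 1500


def unfused_uop_footprint(body, uops_per_instr):
    """Return the total number of unfused µops in one loop body.

    body maps an instruction to its number of occurrences in the loop,
    uops_per_instr maps an instruction to its unfused µop count.
    """
    return sum(count * uops_per_instr[name] for name, count in body.items())


# Hypothetical unfused µop counts (assumption, to be checked on uops.info).
uops_per_instr = {"add reg, mem": 2, "sub mem, reg": 4}

# Hypothetical 500-instruction loop body, split evenly between the two.
body = {"add reg, mem": 250, "sub mem, reg": 250}

footprint = unfused_uop_footprint(body, uops_per_instr)
occupancy = footprint / UOP_CACHE_ENTRIES
print(f"{footprint} unfused µops, {occupancy:.0%} of the µop cache")
```

Under these assumptions the loop body already fills the whole cache, so a smaller loop (or a count derived from the actual µop numbers) may be the safer choice.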
Edited by DERUMIGNY Nicolas