John Mather authored
Implemented roundeven with the corresponding assembly instruction on Arm platforms when the compiler doesn't provide __builtin_roundeven.

Apple clang does not currently provide __builtin_roundeven, so this makes a large difference on Apple silicon: it removes 19 instructions. The observed speedup ranges from 1.00x to 1.15x on an Apple M1 Ultra.

cosf      4.65580 ns/call ->  4.03543 ns/call (-0.62037 ns) [1.15x]
coshf     3.84132 ns/call ->  3.68328 ns/call (-0.15804 ns) [1.04x]
sinf      4.65548 ns/call ->  4.19025 ns/call (-0.46523 ns) [1.11x]
sinhf     3.99483 ns/call ->  3.68801 ns/call (-0.30682 ns) [1.08x]
tanf      4.34781 ns/call ->  4.19637 ns/call (-0.15144 ns) [1.04x]
tgammaf  20.97220 ns/call -> 20.62030 ns/call (-0.35190 ns) [1.02x]
acos      7.94175 ns/call ->  7.48937 ns/call (-0.45238 ns) [1.06x]
erfc     17.50430 ns/call -> 17.32990 ns/call (-0.17440 ns) [1.01x]
exp       4.47247 ns/call ->  4.17019 ns/call ...
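
For illustration, the fallback described in this commit might look roughly like the sketch below. This is not the committed code. It assumes the Arm instruction in question is FRINTN (round to integral, ties to even), and the names roundeven_fallback and ROUNDEVEN are hypothetical. The non-Arm branch is a generic add-and-subtract trick included only so the sketch builds elsewhere; it assumes the default round-to-nearest mode and no fast-math.

```c
/* Some compilers lack __has_builtin; treat every builtin as absent there. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Hypothetical helper: round to nearest integer, ties to even. */
static inline double roundeven_fallback(double x)
{
#if defined(__aarch64__)
  /* Assumption: FRINTN is the "corresponding assembly instruction". */
  double r;
  __asm__("frintn %d0, %d1" : "=w"(r) : "w"(x));
  return r;
#else
  /* Portable stand-in (not from the commit): adding and subtracting 2^52
     rounds to an integer under the default round-to-nearest-even mode. */
  const double big = 0x1p52;
  double ax = x < 0 ? -x : x;
  if (!(ax < big) || x == 0)
    return x;                  /* already integral, zero, infinity, or NaN */
  volatile double t = ax + big; /* volatile keeps the compiler from folding */
  double r = t - big;
  return x < 0 ? -r : r;
#endif
}

/* Hypothetical dispatch: prefer the builtin when the compiler has it. */
#if __has_builtin(__builtin_roundeven)
#define ROUNDEVEN(x) __builtin_roundeven(x)
#else
#define ROUNDEVEN(x) roundeven_fallback(x)
#endif
```

Callers would then use ROUNDEVEN(x) everywhere. A single-precision variant would presumably use the s-register form of the same instruction.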