John Mather authored
Upon profiling on Arm-based systems, it was observed that the call to fegetexceptflag was quite expensive. After further investigation, it was discovered that fegetexceptflag accesses the FPSR register, which seems to be much slower than accessing the FPCR register. The rounding mode is exposed in FPCR, so we can obtain the information from it instead to increase performance.

The added code emulates the _mm_getcsr SSE intrinsic and was adapted from the sse2neon project: https://github.com/DLTcollab/sse2neon. The changes included removing 32-bit codepaths and replacing the switch statement with a branchless lookup table.

The following performance improvements were observed:

Apple M1 Ultra
---------------------------------------------------------------
cbrtf    7.33838 ns/call ->  5.62758 ns/call (-1.7108 ns) [1.30x]
asin    14.23860 ns/call -> 13.38090 ns/call (-0.8577 ns) [1.06x]
cbrt     8.91429 ns/call ->  6.89891 ns/call (-2.0154 ns) [1.29x]
hypot ...
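
For illustration, the FPCR-based lookup described above might look roughly like the sketch below. It is not the code from this commit: the `sketch_` names are hypothetical, and the mapping assumes the architectural layout in which FPCR.RMode occupies bits 23:22 and the MXCSR rounding control occupies bits 14:13. The pure mapping function is kept separate from the register read so that it can be exercised on any host.

```c
#include <stdint.h>

/* MXCSR rounding-control values as defined for SSE (bits 14:13).
 * Redefined under hypothetical names so the sketch needs no x86 headers. */
#define SKETCH_MM_ROUND_NEAREST     0x0000u
#define SKETCH_MM_ROUND_DOWN        0x2000u
#define SKETCH_MM_ROUND_UP          0x4000u
#define SKETCH_MM_ROUND_TOWARD_ZERO 0x6000u

/* Hypothetical helper: translate FPCR.RMode (bits 23:22) into the
 * equivalent MXCSR rounding bits with a table lookup instead of a switch. */
static inline uint32_t sketch_fpcr_to_mxcsr_round(uint64_t fpcr)
{
    /* Indexed by RMode: 0 = to nearest, 1 = toward +inf,
     * 2 = toward -inf, 3 = toward zero. */
    static const uint32_t table[4] = {
        SKETCH_MM_ROUND_NEAREST,
        SKETCH_MM_ROUND_UP,
        SKETCH_MM_ROUND_DOWN,
        SKETCH_MM_ROUND_TOWARD_ZERO,
    };
    return table[(fpcr >> 22) & 0x3u];
}

#if defined(__aarch64__)
/* Read FPCR directly (GCC/Clang inline assembly, AArch64 only).
 * FPSR is never touched on this path. */
static inline uint64_t sketch_read_fpcr(void)
{
    uint64_t fpcr;
    __asm__ __volatile__("mrs %0, fpcr" : "=r"(fpcr));
    return fpcr;
}

/* Rounding-mode portion of an emulated _mm_getcsr(). */
static inline uint32_t sketch_getcsr_round(void)
{
    return sketch_fpcr_to_mxcsr_round(sketch_read_fpcr());
}
#endif
```

On an AArch64 build, a caller would compare `sketch_getcsr_round()` against `SKETCH_MM_ROUND_NEAREST` and the other constants. The two middle table entries are swapped relative to the index because the Arm and x86 encodings order "toward +inf" and "toward -inf" differently.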