    Improve ARM Performance by avoiding expensive FPSR access · 46e4883f
    John Mather authored
    Profiling on Arm-based systems showed that the call to fegetexceptflag is
    quite expensive. Further investigation showed that fegetexceptflag reads the
    FPSR register, which appears to be much slower than reading the FPCR
    register. The rounding mode is exposed in FPCR, so it can be read from there
    instead to improve performance.
    
    The added code emulates the _mm_getcsr SSE intrinsic and was adapted from the
    sse2neon project: https://github.com/DLTcollab/sse2neon. The changes included
    removing 32-bit codepaths and replacing the switch statement with a branchless
    lookup table.
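
The lookup-table idea described above can be sketched as follows. This is an illustrative sketch only, not the code from this commit or from sse2neon: the names `read_fpcr_sketch`, `mxcsr_rounding_from_fpcr`, and the `MXCSR_ROUND_*` macros are hypothetical. It assumes the architectural layouts, where FPCR holds the rounding mode in bits 23:22 (0 = nearest, 1 = toward +inf, 2 = toward -inf, 3 = toward zero) and MXCSR holds it in bits 14:13 (0x0000 = nearest, 0x2000 = down, 0x4000 = up, 0x6000 = toward zero).

```c
#include <stdint.h>

/* MXCSR rounding-control values (bits 14:13). Hypothetical names that
   mirror the values of the x86 _MM_ROUND_* constants. */
#define MXCSR_ROUND_NEAREST     0x0000u
#define MXCSR_ROUND_DOWN        0x2000u
#define MXCSR_ROUND_UP          0x4000u
#define MXCSR_ROUND_TOWARD_ZERO 0x6000u

#if defined(__aarch64__)
/* Read FPCR directly; only available when compiling for AArch64. */
static inline uint64_t read_fpcr_sketch(void)
{
    uint64_t fpcr;
    __asm__ volatile("mrs %0, fpcr" : "=r"(fpcr));
    return fpcr;
}
#endif

/* Translate the FPCR RMode field (bits 23:22) into MXCSR rounding bits.
   The table index is the 2-bit RMode value, so no branch is needed.
   Note that "up" and "down" are encoded in opposite order on the two
   architectures, which is why entries 1 and 2 are swapped. */
static inline unsigned mxcsr_rounding_from_fpcr(uint64_t fpcr)
{
    static const unsigned table[4] = {
        MXCSR_ROUND_NEAREST,     /* RMode 0: round to nearest     */
        MXCSR_ROUND_UP,          /* RMode 1: round toward +inf    */
        MXCSR_ROUND_DOWN,        /* RMode 2: round toward -inf    */
        MXCSR_ROUND_TOWARD_ZERO  /* RMode 3: round toward zero    */
    };
    return table[(fpcr >> 22) & 0x3u];
}
```

On AArch64, a caller would write `mxcsr_rounding_from_fpcr(read_fpcr_sketch())`. The mapping function is kept separate from the register read so that it can be checked on any platform.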
    
    The following performance improvements were observed:
    
    Apple M1 Ultra
    ---------------------------------------------------------------
    cbrtf  7.33838 ns/call ->  5.62758 ns/call (-1.7108 ns) [1.30x]
    asin  14.23860 ns/call -> 13.38090 ns/call (-0.8577 ns) [1.06x]
    cbrt   8.91429 ns/call ->  6.89891 ns/call (-2.0154 ns) [1.29x]
    hypot ...