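There is a straightforward algorithm that uses the fact that the
difference can cross only one of the two bounds (assuming both inputs
already lie in [-pi, pi]), so a single correction of 2*pi is enough.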
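Something like this:
(Inputs in %xmm0 and %xmm1, output in %xmm0; AT&T operand order)
# SSE2 has no cmpgtsd, so both tests use lt/nle with d as the source.
subsd %xmm1, %xmm0              # d = a - b
movsd plusM_PI(%rip), %xmm1     # xmm1 = +pi
movsd minusM_PI(%rip), %xmm2    # xmm2 = -pi
cmpltsd %xmm0, %xmm1            # xmm1 = (+pi < d) ? all ones : 0
cmpnlesd %xmm0, %xmm2           # xmm2 = !(-pi <= d) ? all ones : 0
andpd minus2M_PI(%rip), %xmm1   # -2*pi if d > +pi (16-byte aligned constant)
andpd plus2M_PI(%rip), %xmm2    # +2*pi if d < -pi (16-byte aligned constant)
addsd %xmm1, %xmm0              # apply the downward correction, if any
addsd %xmm2, %xmm0              # apply the upward correction, if any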
Double-check the direction of the comparisons before using this, but
you get the idea: the two comparisons do not depend on each other, so
they can execute in parallel. Sign tricks don't seem to be profitable,
since they lengthen the critical path.
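
For reference, here is roughly the same idea as a C sketch. The function name angle_diff is made up for illustration, and it assumes both inputs are already in [-pi, pi], so at most one of the two corrections is ever non-zero. Whether a compiler turns the two selects into compare-and-mask sequences like the ones above depends on the compiler and flags.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Hypothetical helper: returns a - b wrapped back into [-pi, pi].
 * Assumes a and b are each already in [-pi, pi]. */
static double angle_diff(double a, double b)
{
    double d = a - b;

    /* The two tests are independent of each other, so neither has to
     * wait for the other's result. */
    double down = (d >  M_PI) ? -2.0 * M_PI : 0.0;  /* crossed +pi */
    double up   = (d < -M_PI) ?  2.0 * M_PI : 0.0;  /* crossed -pi */

    /* At most one of down/up is non-zero, so adding both is safe. */
    return d + down + up;
}
```

For example, angle_diff(3.0, -3.0) gives 6 - 2*pi, which is a small negative angle, while angle_diff(1.0, 0.5) passes through unchanged as 0.5.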