help-gplusplus

Re: Help with Hand-Optimized Assembly


From: Terje Mathisen
Subject: Re: Help with Hand-Optimized Assembly
Date: Wed, 28 Mar 2012 18:29:55 -0000
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20111221 Firefox/9.0.1 SeaMonkey/2.6.1

James Van Buskirk wrote:
"Bill Woessner" <woessner@nospicedham.gmail.com> wrote in message
news:67ddafac-ae03-4ef1-b156-5488e8b8086a@i26g2000vbt.googlegroups.com...

This compiles, runs and produces the correct answers.  But I have a
few issues with it:

1) If I declare this function inline, it gives me garbage (like
10^-304)
2) If I compile with -Wall, I get a warning that the function doesn't
return a value, which is absolutely true, but I don't know how to fix
it.
3) I don't like how TWO_PI and NEG_TWO_PI are defined.  I had to steal
it from some generated assembly.  It would be nice to use M_PI,
4*atan(1) or something like that.

I can't help you with your questions because I would always write
something like this in assembly rather than C, but is there some
reason that you can't use SSE2 rather than x87 here?  SSE2 should
be much faster if available in the context of your problem.


I'll second James' suggestion about SSE2!

Anyway, it seems that what you are trying to do is take the difference between two angles and then make sure that the difference ends up in the [-pi, pi) range, right?

I.e. what is the rotation angle to get from theta2 to theta1?

Let's start by looking at the various alternatives:

If the signs of th1 and th2 are the same (and both angles are already within -pi to pi), then the difference _must_ be in range:

 0 - pi = -pi
 pi - 0 = pi

 -0 - -pi = pi
 -pi - 0  = -pi

It is only when the signs differ that you might need to add or subtract 2pi to bring it into range:

 pi - -pi = 2pi
 -pi - pi = -2pi

I don't see immediately how I can use this to speed it up though...
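
For reference, the case analysis above can be written as a plain C++ sketch. The name angle_diff is just a placeholder of mine, and it assumes both inputs are already within -pi to pi, so the raw difference stays within -2pi to 2pi and a single correction is enough. Folding exactly +pi down to -pi is my own choice of endpoint convention here:

```cpp
// Hypothetical helper, not from the original post.
// Assumes theta1 and theta2 are both already in [-pi, pi].
double angle_diff(double theta1, double theta2)
{
    const double pi     = 3.14159265358979323846;
    const double two_pi = 2.0 * pi;

    double d = theta1 - theta2;   // in [-2pi, 2pi]

    // Same-sign inputs can never leave [-pi, pi], so a correction
    // can only be needed when the signs differ.
    if (d >= pi)
        d -= two_pi;              // e.g. 3.0 - (-3.0) = 6.0 -> 6.0 - 2pi
    else if (d < -pi)
        d += two_pi;              // e.g. -3.0 - 3.0 = -6.0 -> -6.0 + 2pi
    return d;                     // in [-pi, pi)
}
```

This is the branchy version; the point of the SSE2 sequence is to replace both branches with a compare mask.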

Anyway, trying your original algorithm:

  movq xmm0,[theta1]

  subsd xmm0,[theta2]   ;; Result in [-2pi to 2pi]
  movq xmm2,[plus_mask] ;; 0x7fffff...

  andpd xmm2,xmm0       ;; ABS(diff), [0 to 2pi]
  movq xmm3,[pi]

  cmplesd xmm3,xmm2     ;; -1 mask if pi <= ABS(diff)
  andpd xmm3,[twopi]    ;; 0 or 2pi
  subsd xmm2,xmm3       ;; [-pi to pi]

If the original subtraction sign was negative, then we must invert the sign of the result:

  andpd xmm0,[signbits] ;; (-0.0 , -0.0)
  xorpd xmm0, xmm2
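
The same sequence as a rough SSE2 intrinsics sketch for g++, in case inline assembly is not wanted. The function name is hypothetical, the constants are built with _mm_set_sd instead of being loaded from memory, and I use an and-not with the sign bit in place of the plus_mask load, so treat it as an illustration and not as a tuned implementation:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical wrapper mirroring the assembly above.
// Assumes theta1 and theta2 are both already in [-pi, pi].
double angle_diff_sse2(double theta1, double theta2)
{
    const __m128d pi      = _mm_set_sd(3.14159265358979323846);
    const __m128d two_pi  = _mm_set_sd(6.28318530717958647692);
    const __m128d signbit = _mm_set_sd(-0.0);

    // subsd: raw difference, in [-2pi, 2pi]
    __m128d diff = _mm_sub_sd(_mm_set_sd(theta1), _mm_set_sd(theta2));

    // andpd with plus_mask in the assembly; here (~signbit) & diff = |diff|
    __m128d mag = _mm_andnot_pd(signbit, diff);

    // cmplesd: all-ones mask when pi <= |diff|, else zero
    __m128d wrap = _mm_cmple_sd(pi, mag);

    // andpd + subsd: subtract 2pi only where the mask is set
    mag = _mm_sub_sd(mag, _mm_and_pd(wrap, two_pi));

    // andpd + xorpd: put the sign of the original difference back
    __m128d sign = _mm_and_pd(diff, signbit);
    return _mm_cvtsd_f64(_mm_xor_pd(sign, mag));
}
```

Because the compare is pi <= |diff|, a difference of exactly +pi comes out of this sketch as -pi, and exactly -pi comes out as +pi.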

The code can be rescheduled a bit, and the mix of 64-bit scalar and 128-bit packed operations needs to be checked to make sure it doesn't introduce forwarding problems, but it seems like it should run in 10-15 cycles, depending on the latency of the FP operations (SUBSD, CMPLESD, SUBSD).

I tried to figure out a way to use scaling and integer math, but that is likely to be slower.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

