Re: [avr-libc-dev] Faster integer division
From: Ambroz Bizjak
Subject: Re: [avr-libc-dev] Faster integer division
Date: Sat, 8 Jun 2013 18:15:07 +0200
I've done some further optimization; it's now even faster and uses four
fewer registers (but it needs >=r16 for the result due to the use of
immediate operands). Even better, there are now only four kinds of
iterations (four macros). The code is at http://ideone.com/EGWcuI and
pasted below.
I've calculated that my division takes 384 instructions in the worst
case. I've also disassembled a compiled program where gcc did the
division (-O4, the attached program with only the main function
changed); see http://ideone.com/jZNxkF . I see that the call to
__udivmodsi4 hasn't actually been inlined, but the extra instructions
are almost negligible. By my calculation, gcc's division takes 653
instructions in the worst case (this may be a bit off), excluding the
call entry and exit. Put simply, my code takes about 58% of the time
of gcc's division (384 vs. 653 instructions). When I said 2/3
previously, I was referring to relative times, not probabilities, but
that was measured in a benchmark, so it appeared slower because of the
testing code.
The sources of speedup in my algorithm seem to be:
1. The high-level algorithm. In each iteration, gcc does four 32-bit
operations in the worst case (add, compare, subtract, add); I only do
three (shift, compare, subtract). See the rough C model after this
list.
2. Optimization of groups of 8 iterations. This allows removing
instructions that would only operate on remainder bytes still known to
be zero.
3. Unrolling individual iterations.
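For illustration, here is a rough C model of the restoring division.
This is only a sketch of the idea in point 1, not the asm below, and
the name divide_model is made up; it just shows that each iteration
needs only a shift, a compare and a conditional subtract:

#include <stdint.h>

/* Rough C model of restoring division: one shift, one compare and a
   conditional subtract per quotient bit. Illustration only. */
static uint32_t divide_model (uint32_t n, uint32_t d)
{
    uint32_t rem = 0;
    uint32_t q = 0;
    for (int bit = 31; bit >= 0; bit--) {
        rem = (rem << 1) | ((n >> bit) & 1); /* shift next bit of n into the remainder */
        if (rem >= d) {                      /* compare */
            rem -= d;                        /* subtract */
            q |= (uint32_t)1 << bit;         /* record the quotient bit */
        }
    }
    return q;
}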
About code size: yes, the algorithm I attached here is indeed much
larger. However, I believe we can compact it into four loops and still
be faster than gcc (points 1 and 2 above would still apply). I think
the result would be about twice the size of gcc's algorithm. Even
compacting it into one loop should still be faster than gcc, due to
point 1.
I don't really know when a faster but larger algorithm should be used
by gcc; that is part of the reason I'm posting here. The -Os flag is
probably one thing to look at, but we need to consider that people may
not want giant division code even at other optimization levels.
Perhaps more specific flags could be used to control the division
algorithm? On the other hand, if the large algorithm goes into an
inline function in avr-libc, how do we make sure people can find it?
#include <stdint.h>
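/*
 * Each DIVIDE_ITER_* macro is one unrolled step of restoring division.
 * __tmp_reg__ holds the low byte of the remainder; as the bytes of n
 * are shifted out they become free and are reused as the higher
 * remainder bytes (%D[n], then %C[n], then %B[n]). In the earlier
 * groups the upper remainder bytes are still known to be zero, so
 * __zero_reg__ stands in for them in the compare and the matching sbc
 * instructions are omitted.
 */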
#define DIVIDE_ITER_0_7(i) \
" lsl %D[n]\n" \
" rol __tmp_reg__\n" \
" cp __tmp_reg__,%A[d]\n" \
" cpc __zero_reg__,%B[d]\n" \
" cpc __zero_reg__,%C[d]\n" \
" cpc __zero_reg__,%D[d]\n" \
" brcs zero_bit_" #i "__%=\n" \
" sub __tmp_reg__,%A[d]\n" \
" ori %D[q],1<<(7-" #i ")\n" \
"zero_bit_" #i "__%=:\n"
#define DIVIDE_ITER_8_15(i) \
" lsl %C[n]\n" \
" rol __tmp_reg__\n" \
" rol %D[n]\n" \
" cp __tmp_reg__,%A[d]\n" \
" cpc %D[n],%B[d]\n" \
" cpc __zero_reg__,%C[d]\n" \
" cpc __zero_reg__,%D[d]\n" \
" brcs zero_bit_" #i "__%=\n" \
" sub __tmp_reg__,%A[d]\n" \
" sbc %D[n],%B[d]\n" \
" ori %C[q],1<<(15-" #i ")\n" \
"zero_bit_" #i "__%=:\n"
#define DIVIDE_ITER_16_23(i) \
" lsl %B[n]\n" \
" rol __tmp_reg__\n" \
" rol %D[n]\n" \
" rol %C[n]\n" \
" cp __tmp_reg__,%A[d]\n" \
" cpc %D[n],%B[d]\n" \
" cpc %C[n],%C[d]\n" \
" cpc __zero_reg__,%D[d]\n" \
" brcs zero_bit_" #i "__%=\n" \
" sub __tmp_reg__,%A[d]\n" \
" sbc %D[n],%B[d]\n" \
" sbc %C[n],%C[d]\n" \
" ori %B[q],1<<(23-" #i ")\n" \
"zero_bit_" #i "__%=:\n"
#define DIVIDE_ITER_24_30(i) \
" lsl %A[n]\n" \
" rol __tmp_reg__\n" \
" rol %D[n]\n" \
" rol %C[n]\n" \
" rol %B[n]\n" \
" cp __tmp_reg__,%A[d]\n" \
" cpc %D[n],%B[d]\n" \
" cpc %C[n],%C[d]\n" \
" cpc %B[n],%D[d]\n" \
" brcs zero_bit_" #i "__%=\n" \
" sub __tmp_reg__,%A[d]\n" \
" sbc %D[n],%B[d]\n" \
" sbc %C[n],%C[d]\n" \
" sbc %B[n],%D[d]\n" \
" ori %A[q],1<<(31-" #i ")\n" \
"zero_bit_" #i "__%=:\n"
static inline uint32_t divide (uint32_t n, uint32_t d)
{
uint32_t q = 0;
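/* q starts at 0, so each iteration only has to set its quotient bit.
   The "a" constraint keeps q in r16..r23, since ori and sbci only
   accept registers >= r16 for their immediate operands. */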
asm(
"clr __tmp_reg__\n"
DIVIDE_ITER_0_7(0)
DIVIDE_ITER_0_7(1)
DIVIDE_ITER_0_7(2)
DIVIDE_ITER_0_7(3)
DIVIDE_ITER_0_7(4)
DIVIDE_ITER_0_7(5)
DIVIDE_ITER_0_7(6)
DIVIDE_ITER_0_7(7)
DIVIDE_ITER_8_15(8)
DIVIDE_ITER_8_15(9)
DIVIDE_ITER_8_15(10)
DIVIDE_ITER_8_15(11)
DIVIDE_ITER_8_15(12)
DIVIDE_ITER_8_15(13)
DIVIDE_ITER_8_15(14)
DIVIDE_ITER_8_15(15)
DIVIDE_ITER_16_23(16)
DIVIDE_ITER_16_23(17)
DIVIDE_ITER_16_23(18)
DIVIDE_ITER_16_23(19)
DIVIDE_ITER_16_23(20)
DIVIDE_ITER_16_23(21)
DIVIDE_ITER_16_23(22)
DIVIDE_ITER_16_23(23)
DIVIDE_ITER_24_30(24)
DIVIDE_ITER_24_30(25)
DIVIDE_ITER_24_30(26)
DIVIDE_ITER_24_30(27)
DIVIDE_ITER_24_30(28)
DIVIDE_ITER_24_30(29)
DIVIDE_ITER_24_30(30)
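/* Final iteration (quotient bit 0): the remainder is not needed
   afterwards, so the subtract is skipped and sbci q,-1 adds 1 to the
   quotient exactly when the compare leaves the carry flag clear
   (i.e. the remainder is >= d). */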
" lsl %A[n]\n"
" rol __tmp_reg__\n"
" rol %D[n]\n"
" rol %C[n]\n"
" rol %B[n]\n"
" cp __tmp_reg__,%A[d]\n"
" cpc %D[n],%B[d]\n"
" cpc %C[n],%C[d]\n"
" cpc %B[n],%D[d]\n"
" sbci %A[q],-1\n"
: [q] "=&a" (q),
[n] "=&r" (n)
: "[q]" (q),
"[n]" (n),
[d] "r" (d)
);
return q;
}
// worst-case instruction count:
// 4 (init q) + 1 (clr) + 8*9 (iters 0-7) + 8*11 (iters 8-15)
//   + 8*13 (iters 16-23) + 7*15 (iters 24-30) + 10 (final iter) = 384
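// volatile keeps the compiler from constant-folding the division away
// in this test.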
volatile uint32_t test1;
volatile uint32_t test2;
volatile uint32_t test3;
int main ()
{
test3 = divide(test1, test2);
//test3 = test1 / test2;
}
On Sat, Jun 8, 2013 at 4:31 PM, Weddington, Eric
<address@hidden> wrote:
>
>
>> -----Original Message-----
>> From: address@hidden
>> [mailto:address@hidden On
>> Behalf Of Ambroz Bizjak
>> Sent: Friday, June 07, 2013 8:34 PM
>> To: address@hidden
>> Subject: [avr-libc-dev] Faster integer division
>>
>> Hi!
>>
>> I've found 32bit integer division in gcc too slow, and I managed to
>> implement division in asm that's faster than what gcc 4.8 produces at
>> -O4, about 2/3 the time (but I'm not sure if any of this is due to
>> inlining). The algorithm is restoring division but unrolled and
>> heavily optimized. Could this get in gcc or avr-libc, wherever the
>> right place is?
>>
>
> Hi Ambroz,
>
> What's difficult is to balance different needs of different users. For the
> most part, and there are always exceptions to this, most of the AVR GCC users
> are interested in code size, rather than speed. They would rather see the
> smallest way to do division, even if it is slower. But I also recognize that
> there are definitely times where speed is the most desired.
>
> So given that, where do you think it best for your algorithm to be generated?
> Should it be generated at a specific optimization level in gcc? Should it be
> a specialized inline-assembly function call in avr-libc that is specifically
> called by the user?
>
> You said that it's faster than what gcc produces about 2/3 of the time, but
> you're not sure if that is due to inlining. Can you do some further testing
> to see if it's due to inlining? Or whether it's because your algorithm is
> better? This is important to know, to see if it's worth the time and effort
> to get it in gcc or avr-libc.
>
> Thanks,
> Eric