Georg-Johann Lay wrote:
If you are sure it is really some performance issue/regression and not
due to some language standard implication, you can add a report to
http://sourceforge.net/tracker/?group_id=68108
so that the subject won't be forgotten. Also mind
http://gcc.gnu.org/bugs.html
And of course, you can ask questions here. In that case it is helpful
if you can manage to simplify the source to a small piece of code that
triggers the problem and allows others to reproduce the problem. (i.e.
no #include in the code, no ... (except for varargs), a.s.o).
Snippets of .s may point to the problem when you add -dp -fverbose-asm
And there are lots of places where avr-gcc produces suboptimal or even
bad code, so feedback is welcome.
But note that just a few guys are working on the AVR part of gcc.
I would do more if I had the time (and the support of some gurus to
ask questions on internals then and when...)
OK, I only spent a few minutes looking at old code and I found some
obviously sub-optimal results. It distills down to this:
#include <avr/io.h>
int main(void) {
unsigned long packet = 0;
while(1) {
if( !(PINC & _BV(PC2)) ) {
packet = (packet<<1)|(((unsigned char)PINC>>1)&1);
}
PORTB = packet;
}
}
avr-gcc is: avr-gcc (Gentoo 4.3.3 p1.0, pie-10.1.5) 4.3.3
The avr/io stuff is just so that it won't optimise the code away to nothing.
I tried compiling it with both -Os and -O2:
avr-gcc -g -dp -fverbose-asm -Os -S -mmcu=atmega48 -o test test.c
The result includes this:
lsl r18 ; packet ; 50 *ashlsi3_const/2 [length = 4]
rol r19 ; packet
rol r20 ; packet
rol r21 ; packet
in r24,38-0x20 ; D.1214, ; 16 *movqi/4 [length = 1]
lsr r24 ; D.1214 ; 17 lshrqi3/3 [length = 1]
ldi r25,lo8(0) ; , ; 48 *movqi/2 [length = 1]
ldi r26,lo8(0) ; , ; 46 *movhi/4 [length = 2]
ldi r27,hi8(0) ; ,
andi r24,lo8(1) ; tmp52, ; 19 andsi3/2 [length = 4]
andi r25,hi8(1) ; tmp52,
andi r26,hlo8(1) ; tmp52,
andi r27,hhi8(1) ; tmp52,
or r18,r24 ; packet, tmp52 ; 20 iorsi3/1 [length
= 4]
or r19,r25 ; packet, tmp52
or r20,r26 ; packet, tmp52
or r21,r27 ; packet, tmp52
The problem, it seems, is that the compiler doesn't realize that the
right hand side of the expression can only have any non-zero values in
the bottom 8 bits, since it's an unsigned char which is being implicitly
expanded to 32 bits for the or operation. In fact, it's only the bottom
bit that's ever non-zero. As a result it's spending a number of cycles
and registers doing useless things. I'll copy a report to the locations
you mention in your e-mail.
There are probably ways to work around this, such as making "packet" a
union of an unsigned char and a long, then shifting the long and only
ORing in the unsigned char. I'll note that there's also an optimization
to be had with the right hand side of the expression. I would write the
assembly something like this: