Hi,
On 4/20/09, Georg-Johann Lay <address@hidden> wrote:
Hi Sean,
as far as I understand, your tool runs on asm code, i.e. you map short
sequences of asm instructions (which must not contain code labels) to
other instruction sequences.
Right, and it's far from complete. In theory it could handle code
with labels and branches but I haven't implemented that yet, and it
would be slower for processing.. but I have ideas for how to do it
with reasonable efficiency.
The lookup table is generated by brute force? Observing the results of
asm sequences and trying to map to an other sequence that shows better
performance depending on the cost function (time, space, ...).
Yes, right now it's brute force. Fortunately the lookup table can be
stored to disk with a checksum of each instruction sequence which is
computed in such a way that it is guaranteed to be the same checksum
for a longer sequence if they are identical
Then you use patterns that actually match some pieces of gcc output to
automatically generate peephole2 resp. peephole patterns for avr-gcc?
Not quite yet. My peepholes are assembly->assembly. The peepholes
for gcc are rtl.
I must admit that I am not a big fan of insn peepholes generated this
way: peepholes can fix some mess from register allocation, but what is
possible on peep level is very limited because register allocation is
finished. So you cannot get some scratch register, and if a register is
no more used after the peephole, you cannot free it and use for
something else. Moreover, if there is a peephole that matches a sequence
A, B
it won't match
A, X, B
where X is independent of A and B.
Right, this is a major concern to me. I know of a number of
peepholes defined in avr.md which do not always get applied in cases
that it seems like they should.
Maybe it's possible to invent your work to generate patterns for insn
combine rather that for insn peephole? That pass runs before register
allocation and allows to transform from RTL to RTL.
Yes, I'm still don't think it's the perfect solution though. I have
to look into this.
Skimmin over the code abr-gcc generates for libgcc, e.g., we see much
romm for improving code both in size and in speed.
Finally I have some comments/question on code snippets in your avrfreaks
post:
*I*
from:
eor r6, r6
eor r7, r7
eor r8, r8
eor r9, r9
to:
ldi r6, 0 ;; typo
eor r7, r7
movw r8, r6
Besides the typo, avr-gcc already knows how to do this, see
avr.c::output_movsisf:
AVR_HAVE_MOVW ? (AS1 (clr,%A0) CR_TAB
AS1 (clr,%B0) CR_TAB
AS2 (movw,%C0,%A0))
: (AS1 (clr,%A0) CR_TAB
AS1 (clr,%B0) CR_TAB
AS1 (clr,%C0) CR_TAB
AS1 (clr,%D0));
Maybe you did -fsplit-wide-types? In many situations
-fno-split-wide-types yields better code, but not always. Without
splitting wide types RTL is sometimes a bit unhandy because it has to
deal with larger entities, but with splitting wide types it's harder to
keep track of the bigger picture.
I did not know about this optimization. I was just using -Os. I will try it.
Also, that peephole only works for 32bit numbers correct? What if
there happen to be 2 16 bit ones? Or even 4 8bit numbers that happen
to be able to benefit from this. Also what if you want to load 0x3bd3
into the upper and lower half using ldi, ldi, movw? Currently gcc
just does 4 ldis
*II*
from:
ldi r16, 101
ori r16, 50
swap r16
to:
ldi r16, 255
andi r16, 119
I am a little bit confused. Is the source an output of avr-gcc with
optimization turned on? If so, we should find out why the compiler
generates this code and remove the cause rather than the symptom and say
"ldi r16, 119".
Sorry the second set of examples are ones I manually entered.. gcc did
not generate it. I wanted to demonstrate the capability of my
superoptimizer has for solving constants (pretty cool isn't it?) And
*III*
from:
mov r24, r25
ldi r25, 0
mov r25, r24
eor r24, r24
to:
eor r24, r24
The original snippet would look something like
(set (reg:HI 101) (zero_extend:HI (reg:QI 100)))
(set (reg:HI 102) (ashift:HI (reg:HI 101) (const_int 8)))
Combine will try
(set (reg:HI 102)
(ashift:HI (zero_extend:HI (reg:QI 100))
(const_int 8)))
Which could be split/mapped to
(set (subreg:QI (reg:HI 102) 1) (reg:QI 100))
(set (subreg:QI (reg:HI 102) 0) (const_int 0))
Note that this is more general and contains the code above if reload
decides to allocate 102 to R24 and 100 to R25 (or 102 to REGNO and 100
to REGNO+1).
Unfortunately, skimming generated asm for such sequences and writing
patterns to catch them is very time consuming. But I am unsure if
automatically generated peepholes2 is what we want
-- there will be bulk of patterns in the backend where no one really
knows where they come from. There will be no individual comments why
they are there. It will be harder to maintain the backend.
I would specify them as generated and keep them in a separate file.
-- I think before adding peepholes we should try to fix the very
problems: maybe missing combine patterns, playing around with command
line options, smarter ways to printout assembler, maybe costs, insn
constraints, see if the bad code still persist in gcc 4.5, etc.
Yes I agree, it is better to handcode a few patterns to take care of
90% of the cases than to have a few hundred generated cases.
As I said, IMHO peep2 should be a last resort to fix mess if more
sophisticated and more general approaches fail. I guess a bunch of the
cases you see and treat is just because gcc doesn't handle what it could
have handled if it was described somewhere.
Yes, and it's annoying the way gcc is 32bit centric so it means all
the patterns have to be duplicated for 8, 16, and 32 bits on avr.
Maybe I'm getting carried away, but Ideally gcc would figure out how
to add 32bit numbers if it knows how to add 8 bit ones.. it should be
able to generate multiplication and division routines using it's
knowledge of the assembly language.. and then it would be trivial to
support 24bit integers, or fixed point types of any size for any
target (It was a pain writing all the multiplication and division
routines for 8 different types of fixed point numbers) If it could
do this type of thing, it would significantly speed up writing new
backends since you would only need to define the instruction set to
the compiler, not rtl to the instruction set.