On Sun, 14 Feb 2021, Dmitry Selyutin wrote:
The first patch introduces a set of routines which any platform which
wants to support atomics must implement. I don't quite like that
there's a lot of code duplication, but I haven't come up with a good
idea on how to avoid it (I've been thinking of some trick with weak
functions, though). I'm also not sure about the ST_STATIC specifier; any
tips regarding its usage are highly appreciated. I added it because I saw
it used in the surrounding code; perhaps it is not required, so I can
make the routines weak by default?
The second patch adjusts the tokenizer and generator appropriately, and
also fixes some minor issues. From now on, the count of tokens matches
the count of atomic routines, and each token calls platform-specific code
instead of the usual functions. I'd like to keep this approach in order
to make the code a bit more flexible. This is not for speed but, rather,
for being able to tune per-platform code in the future. I'm totally
open to discussion.
The third patch extends the x86_64 code generator to generate code from
binary buffers rather than byte by byte, as with the g() routine. This
functionality will be used in the last patch, if it gets accepted.
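A minimal sketch of the difference between the two emission styles
described above. Everything here (code_buf, code_ind, emit_byte,
emit_bytes and the two demo functions) is a hypothetical stand-in written
for illustration; it is not TCC's actual g() or its section handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical output buffer standing in for the text section. */
unsigned char code_buf[256];
size_t code_ind;

/* Byte-by-byte emission, in the spirit of a g()-like routine. */
void emit_byte(int c)
{
    assert(code_ind < sizeof code_buf);
    code_buf[code_ind++] = (unsigned char)c;
}

/* Buffer emission: copy a ready-made byte sequence in one call. */
void emit_bytes(const void *buf, size_t len)
{
    assert(len <= sizeof code_buf - code_ind);
    memcpy(code_buf + code_ind, buf, len);
    code_ind += len;
}

/* Emit "mov %rdi, %rax" (48 89 f8) one byte at a time. */
size_t emit_mov_bytewise(void)
{
    size_t start = code_ind;
    emit_byte(0x48); /* REX.W */
    emit_byte(0x89); /* MOV r/m64, r64 */
    emit_byte(0xf8); /* ModRM: mod=11, reg=rdi, rm=rax */
    return code_ind - start;
}

/* Emit the same instruction from a constant buffer. */
size_t emit_mov_buffer(void)
{
    static const unsigned char insn[] = { 0x48, 0x89, 0xf8 };
    size_t start = code_ind;
    emit_bytes(insn, sizeof insn);
    return code_ind - start;
}
```

Both functions produce identical bytes; the buffer variant only saves
calls and keeps a fixed sequence in one place.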
The last patch is the implementation for x86_64. This patch is likely
a controversial one. I tried to make the code somewhat generic to
different argument sizes, at the same time making it look like a
function call. The design also follows from the code that gcc generates
when the usual stdatomic routines are wrapped into simple routines, which
I checked. I'm pretty sure a lot can be improved there; perhaps many of
you will find the approach somewhat unorthodox. This is just the idea;
I'm totally open to discussion.
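For illustration, "usual stdatomic routines wrapped into simple routines"
could look like the sketch below. The wrapper names are hypothetical and
not taken from the patch; only the <stdatomic.h> calls are standard C11.
Compiling such a file with something like "gcc -O2 -S" shows what the
compiler emits for each wrapper; the exact output depends on the compiler
version and flags.

```c
#include <stdatomic.h>

/* Each wrapper is a plain function around one C11 atomic operation,
   so its generated code can be inspected in isolation. */

int wrap_load(atomic_int *p)
{
    return atomic_load(p);
}

void wrap_store(atomic_int *p, int v)
{
    atomic_store(p, v);
}

/* Returns the value held before the addition. */
int wrap_fetch_add(atomic_int *p, int v)
{
    return atomic_fetch_add(p, v);
}

/* Returns the value held before the exchange. */
int wrap_exchange(atomic_int *p, int v)
{
    return atomic_exchange(p, v);
}

/* Returns nonzero on success; on failure *expected receives the
   value that was actually found. */
int wrap_cas(atomic_int *p, int *expected, int desired)
{
    return atomic_compare_exchange_strong(p, expected, desired);
}
```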
So, I think you want to iterate a bit on this to find some tiny ways :)
Some ideas:
* For the unimplemented targets: e.g. introduce a define that a target
sets, define erroring fallbacks (or empty macros or suchlike) if the
macro isn't set (see e.g. CONFIG_TCC_ASM in tccgen.c).
* commonize the routines: there is no reason why you need four routines
for four basic arithmetic operations, if gen_op() supports all
arithmetic operations. I.e. make it an argument to a single routine.
* The atomic routines themselves: like others, I suggest doing normal
calls to library routines. TCC is _not_ about the fastest code.
* For the routines you do want to inline, the use of opcode bytes: nah,
that can be done in a nicer way. Think about the very core that you
need: you will find it's the locked cmpxchg loop and the xchg insn
itself. Both have the property that they are very similar to stores
(including the fact that they have a size); they just happen to leave
something interesting in the register operand.
I.e. it would be natural to just extend the 'store' routine to be able
to emit (lock cmp)xchg instead of the store opcode.
That will give you the possibility to accept arbitrary registers instead
of having to hard-code ax/si/di, obviating the need for the
prologue/epilogue routines.
For that you also probably want to use the existing helpers orex and
gen_modrm(64) from x86_64-gen.c. After some fiddling you will probably
find that _not_ hardcoding specific registers is actually going to be
easier.
* In a similar vein: your atomic load/store routines: these are simply
load/store themselves again.
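
To make two of the points above concrete (one routine taking the
operation as an argument, and the compare-exchange loop as the core
primitive), here is a minimal sketch at the C level using <stdatomic.h>.
It is not TCC code: enum fetch_op, apply_op and fetch_and_op are
hypothetical names for illustration, and inside TCC the operation would
presumably be the token that gen_op() already takes.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical operation selector. */
enum fetch_op { OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR };

/* Plain, non-atomic computation of the new value. */
unsigned apply_op(enum fetch_op op, unsigned old, unsigned val)
{
    switch (op) {
    case OP_ADD: return old + val;
    case OP_SUB: return old - val;
    case OP_AND: return old & val;
    case OP_OR:  return old | val;
    case OP_XOR: return old ^ val;
    }
    assert(!"unknown fetch_op");
    return old;
}

/* One routine for every fetch-and-op, built on a compare-exchange
   loop. Returns the value held before the operation. */
unsigned fetch_and_op(atomic_uint *ptr, enum fetch_op op, unsigned val)
{
    unsigned old = atomic_load(ptr);
    unsigned desired;

    do {
        desired = apply_op(op, old, val);
        /* On failure, old is refreshed with the current value and
           the new value is recomputed on the next iteration. */
    } while (!atomic_compare_exchange_weak(ptr, &old, desired));

    return old;
}
```

The same shape carries over to code generation: only the step that
computes the new value differs per operation, while the load, the
compare-exchange and the retry are shared.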