Re: memchr2 speed, gcc
From: Bruno Haible
Subject: Re: memchr2 speed, gcc
Date: Tue, 4 Mar 2008 03:41:08 +0100
User-agent: KMail/1.5.4
Eric Blake wrote:
> +2008-03-01 Eric Blake <address@hidden>
> +
> + New module 'memchr2'.
> + * modules/memchr2: New file.
> + * modules/memchr2-tests: Likewise.
> + * lib/memchr2.h: Likewise.
> + * lib/memchr2.c: Likewise, based on memchr.c.
Wondering why you used 'uintmax_t' as basic word type, rather than the
'unsigned long' that memchr.c uses, I benchmarked this and a few other
variations of the memchr2.c implementation.
Summary of results:
- With gcc 3.2.2 and 4.2.2, the word type 'unsigned long' is more efficient.
- With gcc 4.3-20080215, it is the opposite. But this version of gcc also
exhibits mysterious performance characteristics.
Details about the variants of memchr2.c:
- Variant M is the original one (word type 'uintmax_t'), variant L the one
with 'unsigned long'.
- Variant O is the original one, with the test like this:
    ((((longword1 + magic_bits) ^ ~longword1) & ~magic_bits) != 0
     || (((longword2 + magic_bits) ^ ~longword2) & ~magic_bits) != 0)
  Variant S uses a simplified expression:
    (((((longword1 + magic_bits) ^ ~longword1)
       | ((longword2 + magic_bits) ^ ~longword2)) & ~magic_bits) != 0)
- Variant X uses a __builtin_expect (..., 0) around this expression.
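For readers without memchr2.c at hand, here is a minimal sketch of how the O, S and X tests can be written as standalone functions. The names word_t, magic, repeat_byte, test_O, test_S, test_SX and check_agree are made up for this illustration and are not taken from memchr2.c; the sketch assumes that longword1 and longword2 are the data word XORed with a word full of the first and second search byte, so that a zero byte marks a candidate match. A nonzero result only means "possibly a match"; the surrounding loop would still have to confirm it byte by byte.

```c
#include <assert.h>
#include <stdbool.h>

/* Word type under test: 'unsigned long' corresponds to variant L.
   Switching this typedef to uintmax_t (from <stdint.h>) would
   correspond to variant M. */
typedef unsigned long word_t;

/* Magic constant with "holes" at bits 8, 16, 24, ... and the top bit:
   0x7efefeff for 32-bit words, 0x7efefefefefefeff for 64-bit words. */
static word_t magic(void)
{
  word_t m = (word_t) -1;
  return ((m / 0xff * 0xfe) << 1 >> 1) | 1;
}

/* Fill every byte of a word with c. */
static word_t repeat_byte(unsigned char c)
{
  return (word_t) -1 / 0xff * c;
}

/* Variant O: two separate tests joined by ||. */
static bool test_O(word_t longword1, word_t longword2)
{
  word_t magic_bits = magic();
  return ((((longword1 + magic_bits) ^ ~longword1) & ~magic_bits) != 0
          || (((longword2 + magic_bits) ^ ~longword2) & ~magic_bits) != 0);
}

/* Variant S: OR the two intermediate words, then mask and compare once. */
static bool test_S(word_t longword1, word_t longword2)
{
  word_t magic_bits = magic();
  return (((((longword1 + magic_bits) ^ ~longword1)
            | ((longword2 + magic_bits) ^ ~longword2)) & ~magic_bits) != 0);
}

/* Variant X (here applied to S): tell gcc that a hit is the unlikely case. */
static bool test_SX(word_t longword1, word_t longword2)
{
#ifdef __GNUC__
  return __builtin_expect(test_S(longword1, longword2), 0);
#else
  return test_S(longword1, longword2);
#endif
}

/* The three forms are meant to compute the same truth value. */
static void check_agree(word_t w1, word_t w2)
{
  assert(test_O(w1, w2) == test_S(w1, w2));
  assert(test_S(w1, w2) == test_SX(w1, w2));
}
```

Since (A & m) != 0 || (B & m) != 0 is the same condition as ((A | B) & m) != 0, variants O and S differ only in how the compiler gets to schedule the work, not in what they detect.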
Details about the compilers used:
- gcc 3.2.2
- gcc 4.2.2
- gcc 4.3-20080215
CPU: x86 (Athlon-K7).
The attached test program and the variant file were compiled with -O2 -g
and linked. Then "time ./a.out 100000" was run two or three times, and
the average of the "user" time taken. All times are in seconds.
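For illustration only, a rough sketch of what such a timing loop can look like. This is not the attached main.c; memchr2_ref, memchr2_fn and cpu_seconds are hypothetical names, the byte-at-a-time search merely stands in for the variant being measured, and clock() is used here in place of the shell's "time".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Byte-at-a-time stand-in: first occurrence of c1 or c2 in s[0..n-1],
   or NULL if neither byte occurs. */
static void *memchr2_ref(const void *s, int c1, int c2, size_t n)
{
  const unsigned char *p = s;
  unsigned char u1 = (unsigned char) c1;
  unsigned char u2 = (unsigned char) c2;

  for (; n > 0; p++, n--)
    if (*p == u1 || *p == u2)
      return (void *) p;
  return NULL;
}

typedef void *(*memchr2_fn)(const void *, int, int, size_t);

/* Call fn 'repeat' times on the same buffer and return the CPU time
   consumed, in seconds, as reported by clock(). */
static double cpu_seconds(memchr2_fn fn, const char *buf, size_t len,
                          long repeat)
{
  void *volatile sink = NULL;  /* keeps the calls from being optimized away */
  clock_t start, end;
  long i;

  assert(repeat > 0);
  start = clock();
  for (i = 0; i < repeat; i++)
    sink = fn(buf, 'x', 'y', len);
  end = clock();
  (void) sink;
  return (double) (end - start) / CLOCKS_PER_SEC;
}
```

A caller would fill a buffer, pass the variant under test as fn, and average the result over a few runs, as was done by hand for the numbers below.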
Results:
MO MOX MS MSX LO LOX LS LSX
gcc-3.2.2 6.75 6.28 5.84 4.13
gcc-3.2.2 -mcpu=athlon 6.72 5.16 4.68 5.25
gcc-4.2.2 6.17 5.27 5.91 5.32
gcc-4.2.2 -mtune=athlon 6.14 4.98 5.36 5.25
gcc-4.3-ss 4.51 4.72 4.51 4.72 4.75 4.67 5.26 4.67
gcc-4.3-ss -mtune=athlon 4.69 4.39 4.69 4.39 4.75 4.67 4.75 4.68
Result interpretation:
- Variant O vs. variant S: no clear winner on either side.
- gcc 4.3 results are pretty random: Sometimes -mtune=athlon (tuning for the
CPU actually used) is a win, sometimes a deterioration. Sometimes variant M
is better than variant L, sometimes the opposite.
- But gcc 4.3's absolute results are always better than those of previous gcc
versions.
- Looking at the -mtune=athlon cases only:
- Variant O vs. variant S: still no clear winner on either side.
- Variant M vs. variant L: no clear winner here either.
Btw, how does one have to write code so that gcc uses the SSE3 instructions?
Bruno
Attachment: main.c (text data)