coreutils

Re: [PATCH] Speedup wc -l


From: Pádraig Brady
Subject: Re: [PATCH] Speedup wc -l
Date: Wed, 18 Mar 2015 15:57:08 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0

On 15/03/15 22:18, Pádraig Brady wrote:
> On 15/03/15 21:14, Kristoffer Brånemyr wrote:
>>
>>
>>
>>
>>> On Sunday, 15 March 2015 at 20:13, Pádraig Brady <address@hidden> wrote:
>>>
>>>
>>>> On 15/03/15 08:33, Kristoffer Brånemyr wrote:
>>>>
>>>> Hi,
>>>>
>>>> I did some tests and found out you can actually beat memchr with a simple 
>>>> loop. Tests were done on an Intel Xeon E3-1231v3 (4*3.4GHz), on a 4GB 
>>>> file that was already cached in memory. Benchmarking was done simply 
>>>> with the 'time' command. I don't know how this code would run on other 
>>>> architectures, but I guess you could put it in an #ifdef?
>>>>
>>>> Coreutils 2.83 version, compiled with -O3:
>>>> 507755520 /home/ztion/words
>>>>
>>>> real    0m3.126s
>>>> user    0m2.699s
>>>> sys    0m0.429s
>>>>
>>>>
>>>> Improved version compiled with -O2:
>>>> 507755520 /home/ztion/words
>>>>
>>>> real    0m2.857s
>>>> user    0m2.461s
>>>> sys    0m0.396s
>>>>
>>>> Improved version compiled with -O3:
>>>>  507755520 /home/ztion/words
>>>>
>>>> real    0m1.518s
>>>> user    0m1.157s
>>>> sys    0m0.361s
>>>>
>>>> I studied the generated assembly and with -O3 gcc generates some fancy SSE 
>>>> code, getting some nice speedups. memchr is also SSE optimized as far as I 
>>>> know, so it's interesting that this is so much faster, twice as fast 
>>>> actually.
>>>>
>>>> In case you don't like turning -O3 on for some reason (the default in 
>>>> coreutils is -O2 I think), the best version I could put together for -O2 
>>>> was this:
>>>>
>>>> Improved version 2, compiled with -O2:
>>>> 507755520 /home/ztion/words
>>>>
>>>> real    0m2.206s
>>>> user    0m1.827s
>>>> sys    0m0.379s
>>
>>
>>> Interesting. Thanks for the results.
>>> I use 'gcc -march=native -g -O3' locally, and with that can't see a 
>>> difference in performance.
>>>
>>> What version of glibc and gcc are you using?
>>> gcc-4.9.2-1.fc21.x86_64 and glibc-2.20-7.fc21.x86_64 here.
>>>
>>> thanks,
>>> Pádraig.
>>
>>
>> Hi,
>>
>> This is with gcc 4.9.2-7 and glibc 2.19-17 on Debian amd64. The difference 
>> is still there for me when compiling with your CFLAGS. Have they improved 
>> memchr in glibc 2.20? I don't think they have that yet in Debian 
>> unfortunately.
>>
>> What cpu do you have?
> 
> 
> i3-2310M
> 
> I was doing a very quick test with _short_ lines,
> specifically /usr/share/dict/words.
> 
> Note GCC should be using builtin_memchr here so not
> hitting the function call overhead.
> 
> I'll look in more detail later.

builtin_memchr isn't significant here, as I think it's only used for
constant folding.
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=124617

As hinted above, the main difference in results is due to the line lengths 
involved.

glibc memchr() is more sophisticated, though there is overhead
in calling the external function.  I.e. memchr() should be faster
for longer lines, where that overhead is paid less often per byte
scanned, which I tested with:

  yes "$(printf %80s '')" | head -n10m > 80x10M.txt

With that the existing code performs like:

  $ time src/wc.memchr -l 80x10M.txt
  real 0.253s
  user 0.128s
  sys  0.126s

  $ time src/wc.memchr -l 2x100M.txt
  real 0.842s
  user 0.810s
  sys  0.033s

while the newly proposed code gave
both improvements and regressions:

  $ time src/wc.internal -l 80x10M.txt
  real 0.543s
  user 0.422s
  sys  0.121s

  $ time src/wc.internal -l 2x100M.txt
  real 0.156s
  user 0.122s
  sys  0.035s

Given the new code is better for shorter lines,
a hybrid approach looks beneficial, and
testing with the attached patch gives:

  $ time src/wc.hybrid -l 80x10M.txt
  real 0.253s
  user 0.132s
  sys  0.122s

  $ time src/wc.hybrid -l 2x100M.txt
  real 0.142s
  user 0.111s
  sys  0.031s

The most important technique used is to avoid conditionals
within the loop. This extends to adding a separate loop
for the short line case, as gcc 4.9.2 at least isn't sophisticated
enough to hoist the invariant "check_len" test out of the loop.
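
To make the idea concrete, here is a minimal sketch of such a hybrid
counter. It is not the attached patch: the names line_counter and
count_block and the threshold of 15 bytes are assumptions made up for
illustration, and the real cutoff would need measuring per platform.
The strategy is chosen once per block, so neither loop tests it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical threshold: average line length (in bytes) at or above
   which the next block is scanned with memchr().  Needs tuning.  */
enum { LONG_LINE_THRESHOLD = 15 };

/* State carried across the blocks of one file.  */
struct line_counter
{
  uintmax_t lines;   /* newlines seen so far */
  bool long_lines;   /* scan the next block with memchr()? */
};

/* Count the newlines in one block, then pick the strategy for the next.
   BUF must be a valid pointer even when LEN is 0.  */
static void
count_block (struct line_counter *lc, const char *buf, size_t len)
{
  const char *p = buf;
  const char *end = buf + len;
  uintmax_t before = lc->lines;

  if (lc->long_lines)
    {
      /* Long lines: few calls, each one skipping many bytes.  */
      while ((p = memchr (p, '\n', end - p)))
        {
          ++p;
          ++lc->lines;
        }
    }
  else
    {
      /* Short lines: no function call and no conditional in the body,
         just an add of the comparison result.  */
      while (p != end)
        lc->lines += *p++ == '\n';
    }

  /* Decide once per block, outside both loops: few newlines relative
     to the block size means the lines were long on average.  */
  lc->long_lines = lc->lines - before <= len / LONG_LINE_THRESHOLD;
}
```

A caller would zero-initialize a struct line_counter and pass each
buffer it reads to count_block(); starting with long_lines false means
the first block always goes through the simple loop.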

cheers,
Pádraig.

Attachment: wc-l-short-lines.patch
Description: Text Data

