Re: [PATCH] sort: Add --threads option, which parallelizes internal sort
From: Pádraig Brady
Subject: Re: [PATCH] sort: Add --threads option, which parallelizes internal sort.
Date: Mon, 20 Apr 2009 00:57:59 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)
Ralf Wildenhues wrote:
>
> I've looked at this in a bit more detail; no big conclusion but maybe
> a few more hints that could help.
>
> I am now pretty confident that your patch implements the threading
> correctly. When inserting some
> expensive_computation ();
>
> in the sortlines function right after `swap' is computed, then all
> timings look good (in relation to each other). For the expensive
> computation, I used a function in a separate translation unit doing some
> floating point, and compiled that without optimization.
>
> 'valgrind --tool=cachegrind' shows a very low cache miss rate of 0.2% max
> with both 1 and 2 threads, indicating there isn't anything funny going
> on here.
>
> So I think the only remaining issues that could be relevant for the
> original timings are, IMHO:
> - false data sharing (though I don't see where that could happen),
> - memory bandwidth limitations of the specific system, or
> - suboptimal NUMA memory distribution.
>
> That said, I did manage to get 'valgrind --tool=helgrind sort
> --threads=2' to show a potential error, with a large data set. I'm
> quite sure it is a false positive, but I post the output below for
> reference. The 'valgrind --tool=drd' tool could not reproduce this,
> though.
Thanks for looking at this Ralf!
Lots of valuable info there...
> Anyway. I tried another experiment. I replicated some large data set 8
> times. I then timed a manual merge:
> sort -m <( sort data ) <( sort data ) ... (8 times) >/dev/null
>
> and compared that with:
>
> sort --threads=P 8-fold-data
>
> manual merge: 16.75 s
> --threads=8: 44.11 s
> --threads=1: 81.32 s
>
> This comparison isn't entirely fair, as the splicing was done as a
> precomputation. However, the difference is so pronounced that even
> taking the splicing into account, the non-thread version would be
> quite a bit faster. So to me, it would seem that some approach going
> in that direction could be beneficial.
Absolutely. That's one of the things Glen was going to work on this summer.
I.e., enhance split and perhaps provide a helper script so as to achieve the
above more easily.
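A minimal sketch of that split / sort / merge pipeline (the file names, chunk
size, and sample data here are purely illustrative; a real helper script would
compute the split size from the input):

```shell
#!/bin/sh
# Illustrative split/sort/merge pipeline; names and sizes are made up.
set -e

seq 20 | sort -R > data.txt           # sample unsorted input

split -l 5 data.txt chunk.            # split into 4 pieces of 5 lines each

for f in chunk.??; do                 # sort each piece in the background
  sort "$f" -o "$f.sorted" &
done
wait                                  # wait for all background sorts

sort -m chunk.??.sorted > sorted.txt  # merge the pre-sorted pieces only

sort -c sorted.txt && echo OK         # verify the result is in order
rm -f data.txt chunk.?? chunk.??.sorted sorted.txt
```

The key point is that `sort -m` only merges its already-sorted inputs rather
than re-sorting everything, so the expensive comparisons happen in the
per-chunk sorts, which run concurrently.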
I would suggest that any data which can be split, processed, and recombined
doesn't need a threaded solution. Now a threaded solution, while being more
complicated, may involve less data copying and thus be enough of a
performance win to be worth it. It may also be more transparent.
But I think the simple option should be explored first, and tested with the
various scenarios I mentioned in a previous mail.
It would be useful to know where you stored the 8 sets of split data.
Were they files on different spindles, cached in RAM, on SSD, etc.?
> Other than that, have you thought of implementing something like
> described in <http://www.it.uom.gr/teaching/dbpp/text/node127.html>?
That link is timing out for me?
thanks again!
Pádraig.