bug-coreutils

Re: fix for inefficient merges in sort.c


From: Paul Eggert
Subject: Re: fix for inefficient merges in sort.c
Date: Mon, 14 Feb 2005 10:23:20 -0800
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.4 (gnu/linux)

Lemley James - jlemle <address@hidden> writes:

> I've coded a patch to sort.c from coreutils 5.3.0 for your careful
> consideration (and my use ;) to perform a binary search in the merge pass,
> only if the key was not found in the first merge temp file.

Thanks for that idea.  How about the following patch instead?  It uses
just one loop rather than two nested ones, and should do two fewer key
comparisons (in the common case with large merges, anyway), and I
think the code is simpler overall.  I installed this patch, but if you
see something wrong with it, please let me know so that we can improve
it.

I also added you to the THANKS file.  Thanks!

While we're on the subject of large NMERGE values, why don't we modify
"sort" as follows?

  1. Add an option to let the user specify NMERGE on the command line,
  e.g., "sort --n-way-merge=1024".

  2. (Probably more important in practice.)  Use a better heuristic for
  NMERGE when the user does not specify NMERGE.

Can you suggest a good heuristic, given your experience with large
merges and GNU sort?  For example, would it be good simply to use
an NMERGE that is as large as possible (i.e., does not exhaust memory
and/or file descriptors)?  If not, what would be the limiting factors?

--- sort.c      2 Dec 2004 06:55:31 -0000       1.302
+++ sort.c      14 Feb 2005 18:04:22 -0000      1.303
@@ -1778,18 +1778,31 @@ mergefps (char **files, size_t ntemps, s
        }
 
       /* The new line just read in may be larger than other lines
-        already in core; push it back in the queue until we encounter
-        a line larger than it. */
-      for (i = 1; i < nfiles; ++i)
-       {
-         int cmp = compare (cur[ord[0]], cur[ord[i]]);
-         if (cmp < 0 || (cmp == 0 && ord[0] < ord[i]))
-           break;
-       }
-      t = ord[0];
-      for (j = 1; j < i; ++j)
-       ord[j - 1] = ord[j];
-      ord[i - 1] = t;
+        already in main memory; push it back in the queue until we
+        encounter a line larger than it.  Optimize for the common
+        case where the new line is smallest.  */
+      {
+       size_t lo = 1;
+       size_t hi = nfiles;
+       size_t probe = lo;
+       size_t ord0 = ord[0];
+       size_t count_of_smaller_lines;
+
+       while (lo < hi)
+         {
+           int cmp = compare (cur[ord0], cur[ord[probe]]);
+           if (cmp < 0 || (cmp == 0 && ord0 < ord[probe]))
+             hi = probe;
+           else
+             lo = probe + 1;
+           probe = (lo + hi) / 2;
+         }
+
+       count_of_smaller_lines = lo - 1;
+       for (j = 0; j < count_of_smaller_lines; j++)
+         ord[j] = ord[j + 1];
+       ord[count_of_smaller_lines] = ord0;
+      }
     }
 
   if (unique && savedline)



