bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#8824: sort occasionally hangs - appears to be in a merge-sort loop


From: Mina Naguib
Subject: bug#8824: sort occasionally hangs - appears to be in a merge-sort loop
Date: Wed, 8 Jun 2011 16:16:17 -0400

Hi

I've observed a few instances when `sort` simply "hangs" and never returns the 
sorted data.

Currently using coreutils 8.12 on gentoo linux 2.6.25, `locale` outputs:
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

An example ongoing right now on one of my servers.  Sort invoked as:
/usr/bin/sort -t '|' -k1,1 -k6,6n

It was fed via STDIN pipe-delimited line-based records, and STDIN was closed.  
The parent process is waiting on sort to start receiving the output via STDOUT 
which never happens.

Input was roughly 350 million records, each record looking something like this:
000WQg16MdC2Puk|direct_count|2049|63581|15090|1306695571

I dug into the sort process with strace and I believe it's doing one of the 
final merge sorts:
read(12, "986|61518|14528|1301616354\nnHM5Og"..., 4096) = 4096
read(12, "|1967|61925|14584|1301634780\nnHMk"..., 4096) = 4096
read(12, "533\nnHNQDohSrKT3Ffn|direct_count|"..., 4096) = 4096
write(4, "4488|1302016964\nnGq7jG4hfFS9flL|d"..., 4096) = 4096
write(4, "nGq9POv11Jh7OiQ|direct_count|1965"..., 4096) = 4096
write(4, "05|1299763468\nnGqCgAT3jwkrzhc|dir"..., 4096) = 4096
write(4, "460\nnGqFItcr2F1Pz8n|tag|1995|6275"..., 4096) = 4096
write(4, "JuhUfMkOPC7J|direct_count|1914|57"..., 4096) = 4096
write(4, "|1914|57180|13801|1302640204\nnGqO"..., 4096) = 4096

After watching it do this diddy for 5.5 hours I dug in further.  Here's its 
open temp files:

$ date; lsof -n -p 31370 | grep tmp | nl
Wed Jun  8 15:37:41 EDT 2011
     1  sort    31370  dmt    4u   REG    8,1 3123896320  12705303 
/tmp/sort0YtEtm
     2  sort    31370  dmt    5r   REG    8,1 1199907358  12705551 
/tmp/sortWNOpxy
     3  sort    31370  dmt    7r   REG    8,1 1181994889  12705627 
/tmp/sortIpm44r
     4  sort    31370  dmt    8r   REG    8,1 1199857626  12705636 
/tmp/sort2WOPPo
     5  sort    31370  dmt    9r   REG    8,1 1196868088  12705640 
/tmp/sortCUkW6a
     6  sort    31370  dmt   10r   REG    8,1 1199488185  12705644 
/tmp/sortSDopQX
     7  sort    31370  dmt   11r   REG    8,1 1188267263  12705646 
/tmp/sortUdXqke
     8  sort    31370  dmt   12r   REG    8,1 1176087055  12705665 
/tmp/sortMiavw5
     9  sort    31370  dmt   13r   REG    8,1 1184572313  12705666 
/tmp/sortUf5q03
    10  sort    31370  dmt   14r   REG    8,1 1187433621  12705672 
/tmp/sorti3mKO9
    11  sort    31370  dmt   15r   REG    8,1 1184537051  12705680 
/tmp/sort0iYrQr
    12  sort    31370  dmt   16r   REG    8,1 1186874578  12705696 
/tmp/sortUwVkpQ
    13  sort    31370  dmt   17r   REG    8,1 1173767980  12705703 
/tmp/sortov5uyz
    14  sort    31370  dmt   18r   REG    8,1 1172616590  12705711 
/tmp/sort2ibXqt
    15  sort    31370  dmt   19r   REG    8,1 1184319006  12705714 
/tmp/sortA1DmGE
    16  sort    31370  dmt   20r   REG    8,1 1188372691  12705728 
/tmp/sortibGbvd
    17  sort    31370  dmt   21r   REG    8,1 1185407259  12705713 
/tmp/sort4Ou1u6

again:
$ date; lsof -n -p 31370 | grep tmp | nl
Wed Jun  8 15:37:58 EDT 2011
     1  sort    31370  dmt    4u   REG    8,1 3432607744  12705303 
/tmp/sort0YtEtm
     2  sort    31370  dmt    5r   REG    8,1 1199907358  12705551 
/tmp/sortWNOpxy
     3  sort    31370  dmt    7r   REG    8,1 1181994889  12705627 
/tmp/sortIpm44r
     4  sort    31370  dmt    8r   REG    8,1 1199857626  12705636 
/tmp/sort2WOPPo
     5  sort    31370  dmt    9r   REG    8,1 1196868088  12705640 
/tmp/sortCUkW6a
     6  sort    31370  dmt   10r   REG    8,1 1199488185  12705644 
/tmp/sortSDopQX
     7  sort    31370  dmt   11r   REG    8,1 1188267263  12705646 
/tmp/sortUdXqke
     8  sort    31370  dmt   12r   REG    8,1 1176087055  12705665 
/tmp/sortMiavw5
     9  sort    31370  dmt   13r   REG    8,1 1184572313  12705666 
/tmp/sortUf5q03
    10  sort    31370  dmt   14r   REG    8,1 1187433621  12705672 
/tmp/sorti3mKO9
    11  sort    31370  dmt   15r   REG    8,1 1184537051  12705680 
/tmp/sort0iYrQr
    12  sort    31370  dmt   16r   REG    8,1 1186874578  12705696 
/tmp/sortUwVkpQ
    13  sort    31370  dmt   17r   REG    8,1 1173767980  12705703 
/tmp/sortov5uyz
    14  sort    31370  dmt   18r   REG    8,1 1172616590  12705711 
/tmp/sort2ibXqt
    15  sort    31370  dmt   19r   REG    8,1 1184319006  12705714 
/tmp/sortA1DmGE
    16  sort    31370  dmt   20r   REG    8,1 1188372691  12705728 
/tmp/sortibGbvd
    17  sort    31370  dmt   21r   REG    8,1 1185407259  12705713 
/tmp/sort4Ou1u6


again several times, you see as the main writable file approaches the total 
size sum of the other files combined:
Wed Jun  8 15:55:20 EDT 2011
     1  sort    31370  dmt    4u   REG    8,1 18965975040  12705303 
/tmp/sort0YtEtm
     2  sort    31370  dmt    5r   REG    8,1  1199907358  12705551 
/tmp/sortWNOpxy
     3  sort    31370  dmt    7r   REG    8,1  1181994889  12705627 
/tmp/sortIpm44r
     4  sort    31370  dmt    8r   REG    8,1  1199857626  12705636 
/tmp/sort2WOPPo
     5  sort    31370  dmt    9r   REG    8,1  1196868088  12705640 
/tmp/sortCUkW6a
     6  sort    31370  dmt   10r   REG    8,1  1199488185  12705644 
/tmp/sortSDopQX
     7  sort    31370  dmt   11r   REG    8,1  1188267263  12705646 
/tmp/sortUdXqke
     8  sort    31370  dmt   12r   REG    8,1  1176087055  12705665 
/tmp/sortMiavw5
     9  sort    31370  dmt   13r   REG    8,1  1184572313  12705666 
/tmp/sortUf5q03
    10  sort    31370  dmt   14r   REG    8,1  1187433621  12705672 
/tmp/sorti3mKO9
    11  sort    31370  dmt   15r   REG    8,1  1184537051  12705680 
/tmp/sort0iYrQr
    12  sort    31370  dmt   16r   REG    8,1  1186874578  12705696 
/tmp/sortUwVkpQ
    13  sort    31370  dmt   17r   REG    8,1  1173767980  12705703 
/tmp/sortov5uyz
    14  sort    31370  dmt   18r   REG    8,1  1172616590  12705711 
/tmp/sort2ibXqt
    15  sort    31370  dmt   19r   REG    8,1  1184319006  12705714 
/tmp/sortA1DmGE
    16  sort    31370  dmt   20r   REG    8,1  1188372691  12705728 
/tmp/sortibGbvd
    17  sort    31370  dmt   21r   REG    8,1  1185407259  12705713 
/tmp/sort4Ou1u6

then what appears to be finalization and cleanup:
Wed Jun  8 15:55:28 EDT 2011
     1  sort    31370  dmt    4u   REG    8,1 18990366720  12705303 
/tmp/sort0YtEtm
     2  sort    31370  dmt    5r   REG    8,1  1199907358  12705551 
/tmp/sortWNOpxy
     3  sort    31370  dmt    7r   REG    8,1  1181994889  12705627 
/tmp/sortIpm44r
     4  sort    31370  dmt    8r   REG    8,1  1199857626  12705636 
/tmp/sort2WOPPo
     5  sort    31370  dmt    9r   REG    8,1  1196868088  12705640 
/tmp/sortCUkW6a
     6  sort    31370  dmt   12r   REG    8,1  1176087055  12705665 
/tmp/sortMiavw5
     7  sort    31370  dmt   13r   REG    8,1  1184572313  12705666 
/tmp/sortUf5q03
     8  sort    31370  dmt   14r   REG    8,1  1187433621  12705672 
/tmp/sorti3mKO9
     9  sort    31370  dmt   15r   REG    8,1  1184537051  12705680 
/tmp/sort0iYrQr
    10  sort    31370  dmt   16r   REG    8,1  1186874578  12705696 
/tmp/sortUwVkpQ
    11  sort    31370  dmt   17r   REG    8,1  1173767980  12705703 
/tmp/sortov5uyz
    12  sort    31370  dmt   18r   REG    8,1  1172616590  12705711 
/tmp/sort2ibXqt

again, almost done:
Wed Jun  8 15:55:34 EDT 2011
     1  sort    31370  dmt    4u   REG    8,1 18990370816  12705303 
/tmp/sort0YtEtm
     2  sort    31370  dmt   16r   REG    8,1  1186874578  12705696 
/tmp/sortUwVkpQ

again, 1 main file left:
Wed Jun  8 15:55:35 EDT 2011
     1  sort    31370  dmt    4u   REG    8,1 18990370816  12705303 
/tmp/sort0YtEtm

But alas, 16 more readable files pop up and 1 writable starts adding up again:
Wed Jun  8 15:55:36 EDT 2011
     1  sort    31370  dmt    4u   REG    8,1          0  12705551 
/tmp/sortevbKK1
     2  sort    31370  dmt    5r   REG    8,1 1178392494  12705739 
/tmp/sortIIvZhW
     3  sort    31370  dmt    7r   REG    8,1 1180102559  12706259 
/tmp/sortwZr2sU
     4  sort    31370  dmt    8r   REG    8,1 1181439270  12706530 
/tmp/sort8B2hbx
     5  sort    31370  dmt    9r   REG    8,1 1183275901  12706731 
/tmp/sorteGJa35
     6  sort    31370  dmt   10r   REG    8,1 1178538478  12706793 
/tmp/sortoGXmOo
     7  sort    31370  dmt   11r   REG    8,1 1191511729  12706795 
/tmp/sortG0RNr9
     8  sort    31370  dmt   12r   REG    8,1 1199965207  12706799 
/tmp/sort6Q8tAy
     9  sort    31370  dmt   13r   REG    8,1 1192944995  12706817 
/tmp/sortCGXzT2
    10  sort    31370  dmt   14r   REG    8,1 1191485559  12706823 
/tmp/sortMwsygv
    11  sort    31370  dmt   15r   REG    8,1 1200144993  12706830 
/tmp/sortyA2RWw
    12  sort    31370  dmt   16r   REG    8,1 1198251727  12706866 
/tmp/sortyZlqHC
    13  sort    31370  dmt   17r   REG    8,1 1184876212  12706871 
/tmp/sortCCjKOH
    14  sort    31370  dmt   18r   REG    8,1 1174916186  12706880 
/tmp/sortKrUJSJ
    15  sort    31370  dmt   19r   REG    8,1 1184800950  12706824 
/tmp/sort2Vjy7s
    16  sort    31370  dmt   20r   REG    8,1 1187817055  12706891 
/tmp/sort89xsBm
    17  sort    31370  dmt   21r   REG    8,1 1185686697  12706944 
/tmp/sortwjlE5L

Wed Jun  8 15:55:40 EDT 2011
     1  sort    31370  dmt    4u   REG    8,1   62631936  12705551 
/tmp/sortevbKK1
     2  sort    31370  dmt    5r   REG    8,1 1178392494  12705739 
/tmp/sortIIvZhW
     3  sort    31370  dmt    7r   REG    8,1 1180102559  12706259 
/tmp/sortwZr2sU
     4  sort    31370  dmt    8r   REG    8,1 1181439270  12706530 
/tmp/sort8B2hbx
     5  sort    31370  dmt    9r   REG    8,1 1183275901  12706731 
/tmp/sorteGJa35
     6  sort    31370  dmt   10r   REG    8,1 1178538478  12706793 
/tmp/sortoGXmOo
     7  sort    31370  dmt   11r   REG    8,1 1191511729  12706795 
/tmp/sortG0RNr9
     8  sort    31370  dmt   12r   REG    8,1 1199965207  12706799 
/tmp/sort6Q8tAy
     9  sort    31370  dmt   13r   REG    8,1 1192944995  12706817 
/tmp/sortCGXzT2
    10  sort    31370  dmt   14r   REG    8,1 1191485559  12706823 
/tmp/sortMwsygv
    11  sort    31370  dmt   15r   REG    8,1 1200144993  12706830 
/tmp/sortyA2RWw
    12  sort    31370  dmt   16r   REG    8,1 1198251727  12706866 
/tmp/sortyZlqHC
    13  sort    31370  dmt   17r   REG    8,1 1184876212  12706871 
/tmp/sortCCjKOH
    14  sort    31370  dmt   18r   REG    8,1 1174916186  12706880 
/tmp/sortKrUJSJ
    15  sort    31370  dmt   19r   REG    8,1 1184800950  12706824 
/tmp/sort2Vjy7s
    16  sort    31370  dmt   20r   REG    8,1 1187817055  12706891 
/tmp/sort89xsBm
    17  sort    31370  dmt   21r   REG    8,1 1185686697  12706944 
/tmp/sortwjlE5L

I may be mistaken, but I don't think this is normal behaviour - once the 16-way 
merge sort is done, I see no reason to observe this whole process happening 
over and over.

Ironically, I had upgraded from coreutils 8.5 to 8.12 because other tools on 
the same machine invokes sort with "--compress-program=gzip" and had observed 
similar-but-different hanging on that tool's invocation.  I dug in and found 
this commit ( 
http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=1b31ce6982a9151d9dfe2ea3595ad7595cb9ca86
 ) which encouraged me to try the upgrade to fix the hang-while-compressing 
issue, but I'm now seeing this hanging even in the non-compressed invocation.

In my case, I think I'll downgrade coreutils back down to 8.5 as well as 
disable all tools from invoking it with "--compress-program" to get predictable 
non-hanging behaviour across the board.

If I can be of further help please let me know.

Thank you.







reply via email to

[Prev in Thread] Current Thread [Next in Thread]