coreutils

Re: Multithreaded sort hangs on Solaris


From: Pádraig Brady
Subject: Re: Multithreaded sort hangs on Solaris
Date: Tue, 12 Mar 2013 11:06:59 +0000

On 03/11/2013 03:47 PM, McFarland, Jeffrey wrote:
> I have come across some odd results regarding the sort utility in coreutils 
> version 8.20.  I’ve looked through the archives and don’t see any similar 
> issues so it may be something specific to our systems.
>
> System:  SunOS 5.10 Generic_147440-26 sun4u sparc SUNW,Sun-Fire-V890
>
> Issue:  When running sort on a 22.5 GB file I found that about 1 out of 10 
> times the process seems to hang (out of 100+ tests).  The process is still 
> running but the temp files are no longer changing and the final file either 
> has not been created or is a 0 byte file.  When this happens the temp files 
> are never in the exact same state as a previous run.  On this machine a 
> complete sort normally takes about 20 minutes.  On one occasion the process 
> hung for over 48 hours before I killed it.  Running top shows no significant 
> load on the system. 
>
> Command run:
>
> ./sort -t\n -S 256M --batch-size=100 -T /disk/craiwk01/prod/SORTWK -T 
> /disk/craiwk02/prod/SORTWK -T /disk/craiwk03/prod/SORTWK -T 
> /disk/craiwk04/prod/SORTWK -T /disk/craiwk06/prod/SORTWK -k1.1,1.10 infile -o 
> infile.sorted
>
> >: ps
>    PID TTY         TIME CMD
>  16328 pts/3       5:06 sort
>  12697 pts/3       0:00 ps
>
> >: sudo truss -rall -wall -f -p 16328
> 16328:  lwp_park(0x00000000, 0)         (sleeping...)
>
> >: sudo pstack 16328
> 16328:  /usr/local/abacus/etsort/sort -tn -S 295063 --batch-size=100 -T /disk/
> -----------------  lwp# 1 / thread# 1  --------------------
> ffffffff7d4d8818 lwp_park (0, 0, 0)
> 0000000100009c74 sortlines (111b56580, 111c56080, ffffffff7fffeab0, 10012a321, ffffffff7fffead0, 10012a328) + 514
> 000000010000a5cc sortlines (111558380, 2, ffffffff7fffeab0, 1121765e0, 0, ffffffff7fffeab0) + e6c
> 000000010000a5cc sortlines (111956f80, 4, ffffffff7fffeab0, 112176420, 0, ffffffff7fffeab0) + e6c
> 000000010000a5cc sortlines (112154760, 8, ffffffff7fffeab0, 1121760a0, 1, ffffffff7fffeab0) + e6c
> 000000010000c070 sort (10012a740, 0, ffffffff7fffead0, 23, 10012cddd, 112154760) + 350
> 000000010000e6e8 main (13, ffffffff7ffff148, 0, 10012c220, fffd, 10012b1e0) + 1ee8
> 00000001000041bc _start (0, 0, 0, 0, 0, 0) + 7c
> -----------------  lwp# 240 / thread# 240  --------------------
> 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
>         ** zombie (exited, not detached, not yet joined) **
> -----------------  lwp# 241 / thread# 241  --------------------
> 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
>         ** zombie (exited, not detached, not yet joined) **
> -----------------  lwp# 242 / thread# 242  --------------------
> 000000010000a600 sortlines_thread(), exit value = 0x0000000000000000
>         ** zombie (exited, not detached, not yet joined) **
>
> If I change the sort to run as a single-threaded process (add "--parallel=1"
> to the above command) then it doesn’t hang.  This makes me think that it’s
> most likely a threading issue.  I ran the same tests on a Linux machine and
> it did not have the same hanging issue so it’s most likely only an issue with
> Solaris.
>
> I initially found this issue using coreutils 8.9 and I changed to 8.20 to see 
> if a fix had been made but no luck.
>
> Is this a known issue?  Are there any additional tests I should run to 
> further narrow down this issue?

I can't think of anything offhand, TBH.
There may be some portability issues when --compress is combined with --parallel
(functions that are not async-signal-safe may be called after a fork),
but you're not using --compress, so we can rule that out at least.

Whether the bug is in coreutils or in Solaris,
adding some sleeps may help trigger the race more often.
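
To judge whether a change (added sleeps, a different --parallel value) makes the hang more or less frequent, it helps to count hangs over many unattended runs. The following is only a rough sketch: `count_hangs` is a hypothetical helper name, and it assumes GNU `timeout` is on the PATH and that any run exceeding the limit counts as a hang.

```bash
# count_hangs LIMIT RUNS CMD...
# Run CMD up to RUNS times, killing any run that exceeds LIMIT seconds,
# and print how many runs had to be killed.
count_hangs() {
  limit=$1
  runs=$2
  shift 2
  hangs=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    timeout "$limit" "$@" >/dev/null 2>&1
    # GNU timeout exits with status 124 when the command timed out.
    if [ $? -eq 124 ]; then
      hangs=$((hangs + 1))
    fi
    i=$((i + 1))
  done
  echo "$hangs"
}
```

For a sort that normally finishes in about 20 minutes, something like `count_hangs 3600 20 ./sort -S 256M -k1.1,1.10 infile -o infile.sorted` would report how many of 20 runs were still going after an hour. Note that the hung process is killed, so this measures frequency only; to inspect a hang with pstack you would still need to catch one by hand.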

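BTW the `sort -t\n` looks strange: the shell removes the unquoted backslash,
so sort receives -tn (as your pstack output shows) and uses 'n' as the
field separator. Did you mean: sort -t$'\n' ?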
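
To see what the shell actually hands to the program for each spelling, the argument can be dumped as hex bytes. This is a small illustration, assuming bash (the `$'\n'` form is not understood by every shell); `show_arg` is a hypothetical helper, not part of coreutils.

```bash
#!/usr/bin/env bash
# show_arg ARG: print ARG as hex bytes so that a newline is visible.
show_arg() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
  echo
}

show_arg -t\n      # unquoted: the shell drops the backslash  -> 2d746e   ("-tn")
show_arg -t$'\n'   # ANSI-C quoting gives a real newline      -> 2d740a
show_arg '-t\n'    # single quotes keep backslash and 'n'     -> 2d745c6e
```

Only the second form passes a newline byte (0a) after -t.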

thanks,
Pádraig.


