Re: Whether the parallel tool has the similar size limit problem as xargs?
Tue, 30 Nov 2021 21:58:29 +0800
On Tue, Nov 30, 2021 at 9:12 PM Joe Sapp <firstname.lastname@example.org> wrote:
> If you're trying to get around the maximum argument limit, maybe this will
> Or this:
> Parallel can break up the calls to a command, limiting the number of
> arguments to the maximum allowed on the command line. But then you
> won't have one sorted file in the end. Try the examples, use "-m -j1"
> or "-X -j1", and do a final `sort -u` on the output file.
I am still not sure whether these tricks can deal with the following
problem, analyzed by Janis Papanagnou:
The issue stems from the fact of a limited exec-buffer size and
that [shell-external] commands will operate on that limited buffer.
Whenever your sample size - actually the argument list size - will
exceed that limit the outcome is unreliable and depends on the data
used; it may work in 10 cases and fail in 100, or vice versa, it
may work for all your application cases (because you are operating
only on toy data), or it may always fail (because you are working
with huge amounts of scientific data), or anything else.
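For reference, the limit in question can be looked up on a given system. This is a minimal sketch, assuming a POSIX `getconf` is available; the variable name `arg_max` is only illustrative. The value is the maximum combined size, in bytes, of the arguments and environment that one exec call accepts.

```shell
# Ask the system for the exec argument limit (arguments plus environment), in bytes.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX is $arg_max bytes"
```

The number differs between systems, which is one reason a command line that works on one machine can be split into several calls on another.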
To understand the issue it suffices to assume small values, say a
buffer-size of 15 and a few short arguments.
Say you have the file arguments A B C D ... Z and want to sort
them. Say in the buffer there's room for only 5, so that sorting
with the above 'find'-based constructs will result in many calls:
sort A B C D E
sort F G H I J
and the output will be the concatenation of the individual calls.
A..E will be sorted, F..J will be sorted, etc. but A..Z will not
be sorted after the concatenation of the individual sorted parts.
Very subtle errors can occur this way if one is not aware of that
fact; the result may look correct if one looks at the first few MB
of the result, but may actually be wrong.
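The effect described above can be reproduced on toy data. The following is only a sketch: `xargs -n 3` is used as a stand-in for the real exec-buffer limit, and the file names `a` to `f` and the variables `workdir`, `batched` and `single` are made up for the illustration.

```shell
# Six small input files whose contents descend across files: a=6, b=5, ..., f=1.
workdir=$(mktemp -d)
n=6
for name in a b c d e f; do
    echo "$n" > "$workdir/$name"
    n=$((n - 1))
done

# Batches of three file arguments per sort call (stand-in for the real limit).
batched=$(cd "$workdir" && printf '%s\n' a b c d e f | xargs -n 3 sort -n | tr '\n' ' ')
# A single sort call that sees all six files at once.
single=$(cd "$workdir" && sort -n a b c d e f | tr '\n' ' ')

echo "batched: $batched"   # 4 5 6 1 2 3
echo "single:  $single"    # 1 2 3 4 5 6
```

Each batch is sorted on its own, but the concatenated output is not.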
Whether other tools (like the one mentioned below) circumvent the
exec-buffer issue must be checked - but I wouldn't expect them to.
What a tool would need is either the ability to see all data
in one call, or the ability to create partly sorted data and make
more sort runs on that partly sorted data; merge sort is an algorithm
that works that way (and was used on sequentially operating
tape archives, especially in former times).
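The two-pass idea can be sketched with standard tools. This is an illustration under assumptions, not a tested recipe for large data: `xargs -n 3` again stands in for the real limit, the `run.XXXXXX` file names are hypothetical, and `sort -m` merges inputs that are already sorted.

```shell
# Same toy data as before: six files whose contents descend across files.
workdir=$(mktemp -d)
n=6
for name in a b c d e f; do
    echo "$n" > "$workdir/$name"
    n=$((n - 1))
done

merged=$(
    cd "$workdir" || exit 1
    # Pass 1: each batch of three input files becomes one sorted run file.
    printf '%s\n' a b c d e f |
        xargs -n 3 sh -c 'sort -n "$@" > "$(mktemp run.XXXXXX)"' sh
    # Pass 2: merge the already sorted runs into one sorted stream.
    sort -n -m run.* | tr '\n' ' '
)

echo "merged: $merged"   # 1 2 3 4 5 6
```

The merge step takes one argument per run rather than one per input file, so it is far less likely to hit the limit; if there were still too many runs, the same two passes could be repeated on the run files.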