Proposed optional asymmetry within the 'join' command of 'coreutils'
From: Tony Peters & Robyn Smart
Subject: Proposed optional asymmetry within the 'join' command of 'coreutils'
Date: Sat, 1 May 2004 11:34:13 +1000
Hello,

I am not sure whether this is the appropriate place to log an enhancement request for the 'join' program within the 'coreutils' package.

In my work I perform a lot of text file manipulations on behalf of my clients (through my data cleansing and matching services). I use the 'textutils' programs extensively to manipulate (intermediate) files that typically exceed 10 GB in size.

A problem I encounter with the 'join' program in all variants of Unix/Linux is that it is symmetrical in its handling of files 1 and 2.

The problem occurs when the values of the joining field are heavily skewed. For instance, if one or both files contain many repeats of a particular value (for instance, where a foreign key value is optional in the data), the internal memory buffer may be exhausted. This typically results in a memory fault and the 'join' operation being aborted (or, in some Unix-like environments such as MS SFU, being silently suspended when run within a pipeline).

In the past I have obtained the source code for the 'join' command and made minor code adjustments to make the file processing asymmetrical: the internal buffering is retained for file 1, but only a single record at a time is processed from file 2.
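
The asymmetry described above can be sketched roughly as follows. This is an illustration in Python, not the patch made to the 'join' source, and it rests on several assumptions: both inputs are already sorted on the join field in plain byte order (as with LC_ALL=C), the join field is the first whitespace-separated field and is never empty, and only matching lines are output. The name asymmetric_join is hypothetical.

```python
from itertools import groupby


def asymmetric_join(file1, file2):
    """Join two iterables of lines already sorted on their first field.

    Only the current run of equal keys from file1 is held in memory;
    file2 is consumed one record at a time.
    """
    # Group file 1 by key; each group is a run of records sharing one key.
    rows1 = (line.split() for line in file1 if line.strip())
    groups = groupby(rows1, key=lambda fields: fields[0])

    def next_group():
        # Materialise the next run from file 1 (this is the only buffer).
        item = next(groups, None)
        if item is None:
            return None, []
        key, run = item
        return key, list(run)

    buf_key, buf = next_group()
    for line in file2:
        fields = line.split()
        if not fields:
            continue
        key = fields[0]
        # Skip runs in file 1 whose key sorts before the current file 2 key.
        while buf_key is not None and buf_key < key:
            buf_key, buf = next_group()
        if buf_key is None:
            break  # file 1 is exhausted, so nothing more can match
        if buf_key == key:
            # Pair the single file 2 record with every buffered file 1 record.
            for left in buf:
                yield " ".join([key] + left[1:] + fields[1:])


if __name__ == "__main__":
    left = ["a 1", "b 2", "b 3", "d 4"]
    right = ["b x", "b y", "c z", "d w"]
    for joined in asymmetric_join(left, right):
        print(joined)
```

Memory use in this sketch depends only on the longest run of a repeated key in file 1, however many times that key repeats in file 2. One visible difference from a symmetrical join is the order of output within a repeated key: records are grouped by the file 2 record, not by the file 1 record.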

With this asymmetry, the buffer is exceeded only when the buffered file (file 1) contains a large run of repeats of a key value. If this does occur, I typically swap the input files around. Of course, the problem may still occur if the repeated key values in both files would exceed the internal capacity. However, this would result in an 'equi-join' outputting many millions of records (typically indicating a logic flaw within my system that I would need to correct).
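
To see why heavy repeats in both files lead to so much output: in an equi-join, each key contributes the product of its repeat counts in the two files, so a key that occurs 10,000 times in each file produces 100,000,000 output records on its own. A minimal sketch of that arithmetic follows; the function name equi_join_output_size is hypothetical, and the inputs are simply the lists of join-field values from each file.

```python
from collections import Counter


def equi_join_output_size(keys1, keys2):
    """Return how many records an equi-join on these key lists would emit."""
    counts1 = Counter(keys1)
    counts2 = Counter(keys2)
    # Each key yields (repeats in file 1) * (repeats in file 2) records;
    # a key missing from file 2 has a count of 0 and contributes nothing.
    return sum(n1 * counts2[key] for key, n1 in counts1.items())


if __name__ == "__main__":
    print(equi_join_output_size(["k"] * 10_000, ["k"] * 10_000))
```

Running a count like this over the join fields before the real join is one way to spot a skewed key early.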

So, to the question: is it feasible to introduce an optional asymmetrical behaviour to the 'join' command?

Thank you for your time.