Proposed optional asymmetry within the 'join' command of 'coreutils'

From: Tony Peters & Robyn Smart
Subject: Proposed optional asymmetry within the 'join' command of 'coreutils'
Date: Sat, 1 May 2004 11:34:13 +1000

I am not sure whether this is the appropriate place to log an enhancement request for the 'join' program within the 'coreutils' package.
In my work I perform a lot of text file manipulation on behalf of my clients (through my data cleansing and matching services).  I use the 'textutils' programs extensively to manipulate (intermediate) files that typically exceed 10 GB in size.
A problem I encounter with the 'join' program in all variants of Unix/Linux is that it is symmetrical in its handling of files 1 and 2.
The problem occurs when the values of the joining field are heavily skewed.  For instance, if one or both files contain a value that is repeated many times (for example, where a foreign key value is optional in the data, so that many records share the same empty value), the internal memory buffer may be exhausted.  This typically results in a memory fault and the 'join' operation being aborted (or, in some Unix-like environments such as MS SFU, being silently suspended when run within a pipeline).
In the past I have obtained the source code for the 'join' command and made minor code adjustments to make the file processing asymmetrical.  That is, I retained the internal buffering for file 1 but processed only a single record at a time from file 2.
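The asymmetrical processing described above can be illustrated with a minimal sketch.  It is written in Python rather than the C of 'join' itself, it is not the patch referred to above, and the function name 'asymmetric_join' is hypothetical.  It assumes both inputs are already sorted on the key in the same order that Python uses to compare strings, and that each record has been split into a (key, value) pair.

```python
from itertools import groupby
from operator import itemgetter


def asymmetric_join(file1_rows, file2_rows):
    """Equi-join two iterables of (key, value) pairs sorted by key.

    Only the current key group of file 1 is buffered in memory;
    file 2 is consumed strictly one record at a time.
    """
    groups1 = groupby(file1_rows, key=itemgetter(0))
    key1, buffered = None, []
    exhausted1 = False

    def advance():
        # Load the next run of equal keys from file 1 into the buffer.
        nonlocal key1, buffered, exhausted1
        try:
            key1, group = next(groups1)
            buffered = [value for _, value in group]
        except StopIteration:
            exhausted1 = True

    advance()
    for key2, value2 in file2_rows:
        # Skip file 1 groups whose key sorts before the file 2 key.
        while not exhausted1 and key1 < key2:
            advance()
        if exhausted1:
            break
        if key1 == key2:
            # Pair the single file 2 record with every buffered file 1 record.
            for value1 in buffered:
                yield key2, value1, value2


if __name__ == "__main__":
    rows1 = [("a", "1"), ("b", "2"), ("b", "3"), ("d", "4")]
    rows2 = [("b", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
    for row in asymmetric_join(rows1, rows2):
        print(" ".join(row))
```

In this sketch memory use is bounded by the longest run of equal keys in file 1 only; a long run in file 2 costs nothing extra.  The order of output lines within a key group may differ from that of the stock 'join'.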
This asymmetry means the buffer would only be exceeded when file 1 (the buffered file) contains a large number of repeats of a key value.  If this does occur, I typically swap the input file definitions around.  Of course the problem may still occur if the repeated key values exceed the internal capacity in both files.  However, this would result in an 'equi-join' outputting many millions of records (typically indicating a logic flaw within my system that I would need to correct).
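To make the arithmetic concrete: a key that occurs m times in file 1 and n times in file 2 produces m x n output records in an equi-join.  The following rough sketch (Python, with the hypothetical helper name 'join_output_size') counts this per key.  It keeps a counter for every distinct key in memory, so it is an illustration rather than something to run on very large files.

```python
from collections import Counter


def join_output_size(keys1, keys2):
    """Return {key: number of output records} for an equi-join.

    keys1 and keys2 are iterables of the join-field values from
    file 1 and file 2 respectively.
    """
    counts1 = Counter(keys1)
    counts2 = Counter(keys2)
    # Only keys present in both files contribute output records.
    return {key: counts1[key] * counts2[key]
            for key in counts1.keys() & counts2.keys()}


if __name__ == "__main__":
    # An empty key repeated in both files dominates the output.
    sizes = join_output_size(["", "", "", "a"], ["", "", "a", "b"])
    for key, size in sorted(sizes.items()):
        print(repr(key), size)
```

A key repeated 10,000 times in each file would by the same calculation yield 100 million output records.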
So, to the question: is it feasible to introduce an optional asymmetrical behaviour to the 'join' command?
Thank you for your time.
Tony Peters
Phone:      +61 7 3848 1607
Fax:            +61 7 3848 1607
Mobile:     0402 292 459
