Join enhancements

From: David G. Pickett
Subject: Join enhancements
Date: Mon, 27 Feb 2006 08:31:34 -0800 (PST)

I wrote an enhanced join, which I call m1join:
   - It allows you to join files and pipes for many-to-one, one-to-many, and 
one-to-one joins, which are far more common than many-to-many joins.
   - It is about 30% faster than join because of its cleaner flow (join's 
gratuitous seeks flush buffers).
   - It also allows you to specify more than one key field, bringing it more in 
line with sort.
  Usage: m1_join [-i] [-a] [-t <sep_chars>] [-m] [-c <key_col_ct>] f1 f2
  Joins (possibly multiple) lines from sorted file f1 with each line from 
sorted file f2, on leading key fields.  Leading separators are not ignored.  
Output is all the f1 fields followed by the first separator character (tab) 
followed by non-matched fields of f2.
** Does not mind pipes as files! **
** Does not support 'one to many' or 'many to many', just 'many to one'! **
 -a   All lines are output (full outer join).
 -a1  All lines of f1 are output (left outer join).
 -a2  All lines of f2 are output (right outer join).
 -c   Only one column is matched unless -c specifies the number of leading 
key columns.
 -i   Keys (and the required sort order) are case-sensitive unless -i is 
specified, in which case all letters are treated as lower case in the ASCII 
binary sort order: both 'A' 0101 0x41 and 'a' 0141 0x61 are greater than '_' 
0137 0x5F.
 -m   Multiple separators are treated as one unless -m is specified.
 -t   Columns are separated by tab, space, carriage return or linefeed unless 
-t specifies a string of other character(s).
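
To illustrate the -i sort-order requirement described above, here is a small 
Python sketch (not part of the tool; the sample keys are made up).  It assumes 
plain ASCII keys and compares them by code point, first as-is and then folded 
to lower case:

```python
# Hypothetical sample keys, chosen to straddle '_' (0x5F) in ASCII.
keys = ["Zeta", "_tmp", "alpha"]

# Case-sensitive binary order: 'Z' (0x5A) < '_' (0x5F) < 'a' (0x61).
binary_order = sorted(keys)

# Case-folded order: every letter compares as lower case, so both
# 'Z' and 'a' now sort after '_'.
folded_order = sorted(keys, key=str.lower)

print(binary_order)  # ['Zeta', '_tmp', 'alpha']
print(folded_order)  # ['_tmp', 'alpha', 'Zeta']
```

Input meant for a case-insensitive join would need to be sorted in the second 
order, since the first no longer matches how the keys are compared.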

  I think we might extend GNU join in a backwards-compatible way to have 
this flavor of capabilities, and make it much more useful.  Detecting one 
pipe could be changed to never be an error or, to keep things simple at 
first, to not be an error only when you provide a new argument.  Similarly, 
detecting two pipes might be an error only if a many-to-many join is 
detected.  Detecting and handling a 'many' side requires saving the last line 
for each side, not a very big memory penalty these days.
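
A single-pass, many-to-one merge join of this kind might look roughly like the 
Python sketch below.  This is not the m1_join source; the names 
many_to_one_join and split_key are hypothetical, and it assumes an inner join 
only, a single separator character, and inputs already sorted on the leading 
key fields in code-point order.  It only ever reads forward, so either input 
could be a pipe, and it keeps just the current 'one'-side line in memory:

```python
from typing import Iterable, Iterator, List, Tuple


def split_key(line: str, key_cols: int, sep: str) -> Tuple[Tuple[str, ...], List[str]]:
    """Split a line into (leading key fields, remaining fields)."""
    fields = line.rstrip("\n").split(sep)
    return tuple(fields[:key_cols]), fields[key_cols:]


def many_to_one_join(many: Iterable[str], one: Iterable[str],
                     key_cols: int = 1, sep: str = "\t") -> Iterator[str]:
    """Inner-join each line of `many` with the matching line of `one`.

    Assumes both inputs are sorted on their first `key_cols` fields and
    that keys in `one` are unique.  Neither input is ever rewound.
    """
    one_iter = iter(one)
    one_key = None            # key of the saved 'one'-side line
    one_rest: List[str] = []  # its non-key fields
    exhausted = False

    for raw in many:
        line = raw.rstrip("\n")
        key, _ = split_key(line, key_cols, sep)

        # Advance the 'one' side until its key catches up with ours.
        while not exhausted and (one_key is None or one_key < key):
            nxt = next(one_iter, None)
            if nxt is None:
                exhausted = True
                one_key = None
            else:
                one_key, one_rest = split_key(nxt, key_cols, sep)

        # The saved 'one' line is reused for every 'many' line that shares
        # its key; that reuse is what makes the join many-to-one.
        if one_key == key:
            yield sep.join([line] + one_rest)


if __name__ == "__main__":
    f1 = ["a\t1\tx", "a\t1\ty", "b\t2\tz", "c\t3\tw"]  # 'many' side
    f2 = ["a\t1\tALPHA", "c\t3\tGAMMA"]                # 'one' side
    for row in many_to_one_join(f1, f2, key_cols=2):
        print(row)
```

With the two-column key in the sample data, both "a 1" lines pick up ALPHA, 
the "b 2" line is dropped for lack of a match, and the "c 3" line picks up 
GAMMA.  Outer joins would additionally emit the unmatched lines.
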
  The sort merge is still often the fastest solution to a join.  A 
piped-together sort merge join allows some parallel processing and avoids 
creating and managing intermediate files.  The ksh on systems with /dev/fd/ has 
the nice <(...) and >(...) operators to create and manage multiple pipes for 
you.  Using the sort and join tools with scripting can save writing a lot of 
tedious and error-prone code.
