David G. Pickett
Mon, 27 Feb 2006 08:31:34 -0800 (PST)
I wrote an enhanced join, which I call m1join:
- It joins files and pipes many to one, one to many, and one to one, all of
which are far more common than many to many joins.
- It is about 30% faster than join because of its cleaner flow (gratuitous
seeks flush buffers).
- It also allows you to specify more than one key field, bringing it more in
line with sort.
Usage: m1_join [-i] [-a] [-t <sep_chars>] [-m] [-c <key_col_ct>] f1 f2
Joins (possibly multiple) lines from sorted file f1 with each line from
sorted file f2, on leading key fields. Leading separators are not ignored.
Output is all the f1 fields followed by the first separator character (tab)
followed by non-matched fields of f2.
** Does not mind pipes as files! **
** Does not support 'one to many' or 'many to many', just 'many to one'! **
-a   All lines are output (full outer join).
-a1  All lines of f1 are output (left outer join).
-a2  All lines of f2 are output (right outer join).
-c   Only one column is matched, unless -c is specified.
-i   Keys (and the required sort order) are case-sensitive unless -i is
     specified, in which case all letters are treated as lower case in the
     ASCII binary sort order: both 'A' 0101 0x41 and 'a' 0141 0x61 are
     greater than '_'.
-m   Multiple separators are treated as one unless -m is specified.
-t   Columns are separated by tab, space, carriage return or linefeed unless
     -t specifies a string of other character(s).
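To make the key options concrete, here is a rough Python sketch of how a leading key could be pulled from a line under the rules in the usage text. It is not the m1_join source; the names split_fields and key_of and their parameters are made up for illustration, and details such as how trailing separators are treated are assumptions.

```python
import re

DEFAULT_SEPS = "\t \r\n"  # tab, space, carriage return, linefeed

def split_fields(line, seps=DEFAULT_SEPS, merge_runs=True):
    """Split a line into fields on any character in seps.

    merge_runs=True treats a run of separators as one (the default);
    merge_runs=False keeps the empty fields between them (like -m).
    Leading separators are not skipped, so they give an empty first field.
    """
    line = line.rstrip("\r\n")  # assumption: the line terminator is not a field
    pattern = "[" + re.escape(seps) + "]"
    if merge_runs:
        pattern += "+"
    return re.split(pattern, line)

def key_of(line, key_cols=1, fold_case=False, seps=DEFAULT_SEPS, merge_runs=True):
    """Return the leading key_cols fields as a tuple (like -c), lowercased if fold_case (like -i)."""
    key = tuple(split_fields(line, seps, merge_runs)[:key_cols])
    if fold_case:
        key = tuple(k.lower() for k in key)
    return key
```

Comparing the tuples with the ordinary < and == operators gives the plain binary order the inputs are expected to be sorted in; with fold_case=True both "A" and "a" compare greater than "_".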
I think we might extend GNU join in a backwards-compatible way to have
capabilities of this flavor, and make it much more useful. The detection
of one pipe could be changed to never be an error or, to keep things simple at
first, to not be an error only when you provide a new argument. Similarly, the
detection of two pipes might be an error only if a many to many match is
detected. Detecting and handling a 'many' side requires saving the last line
for each side, which is not a very big memory penalty these days.
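The single-pass idea can be sketched in a few lines of Python. This is only an illustration of a many to one merge join under stated assumptions, not the m1_join code: the function name many_to_one_join is hypothetical, both inputs are assumed to be sorted on the leading key in plain binary order, the 'one' side is assumed to hold each key at most once, and only the inner join is shown.

```python
def many_to_one_join(many, one, key_cols=1, sep="\t"):
    """Yield joined lines from two iterables of lines sorted on the leading key.

    Each input is read strictly forward and only the current line of each
    side is kept, so either input (or both) may be a pipe.
    """
    def parse(line):
        fields = line.rstrip("\n").split(sep)
        return tuple(fields[:key_cols]), fields

    def advance(it):
        line = next(it, None)
        return parse(line) if line is not None else None

    one_iter = iter(one)
    right = advance(one_iter)           # saved line from the 'one' side
    for left_line in many:
        lkey, lfields = parse(left_line)
        # Skip 'one' lines whose key sorts before the current 'many' key.
        while right is not None and right[0] < lkey:
            right = advance(one_iter)
        if right is not None and right[0] == lkey:
            # All f1 fields, then the non-key fields of f2.
            yield sep.join(lfields + right[1][key_cols:])
```

Because the saved 'one' line is reused for every consecutive 'many' line with the same key, no seeking or rereading is needed; for example, many_to_one_join(sys.stdin, open("f2")) would work with f1 arriving on a pipe.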
The sort merge is still often the fastest solution to a join. A
piped-together sort merge join allows some parallel processing and avoids
creating and managing intermediate files. The ksh on systems with /dev/fd/ has
the nice <(...) and >(...) operators to create and manage multiple pipes for
you. Scripting with the sort and join tools can save writing a lot of tedious
and error-prone code.