Re: join bug
From: Bob Proulx
Subject: Re: join bug
Date: Wed, 5 Mar 2008 22:25:01 -0700
User-agent: Mutt/1.5.13 (2006-08-11)
Martin,
Martin Schmeing wrote:
> Hi Bob,
> Join works fine with my test smaller files, giving an appropriate
> output. When both files are 1000 (short) lines long, it outputs
> maybe one or two of the joined lines, but there should be hundreds
> output. The files are sorted, and there is no error message given.
> Here are my test files:
pcmodel.list
pcmodel1000.list
radmodel.list
radmodel1000.list
This one is tricky. At first glance everything seems to be in good
shape for join. For example, the input files to join must be sorted,
and unsorted input is a common problem. But these are obviously
sorted. The first thing I did was to check this.
for f in *.list; do sort -c "$f"; done
No errors from sort. All of the files were sorted. So I tried
joining the larger files.
join pcmodel1000.list radmodel1000.list
992 16023 239 3915 2793 43472.2226562 257.2904053
993 16023 240 4134 2889 44867.9531250 393.2121582
Two lines. What are in these files? The first 15 lines of the first
file show the problem. But it is tricky. In fact I missed it until
this point.
1 16021 1 834 6525
2 16021 2 1005 6699
3 16021 3 1296 6651
4 16021 4 1380 6594
5 16021 5 1188 6534
6 16021 6 1044 6363
7 16021 7 498 6240
8 16021 8 357 6405
9 16021 9 270 5886
10 16021 10 957 5436
11 16021 11 1122 6096
12 16021 12 1506 5865
13 16021 13 1407 6030
14 16021 14 1383 5922
15 16021 15 1533 6045
The first field is lined up using a variable number of leading spaces.
That is the root of the issue here. By default sort compares the
entire line, leading blanks included, using the character collating
sequence specified by the LC_COLLATE locale. Join uses the same
collating sequence but ignores blanks at the start of the field.
Because of the variable number of blanks, sort and join see a
different sort order for the first field.
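The mismatch can be reproduced on a tiny file. This is only a sketch:
the file name demo.list and its contents are made up for illustration,
and LC_ALL=C is an assumption added here so that the result does not
depend on the local collation rules.

```shell
# Assumption: force the C locale so the comparison is byte-wise and repeatable.
export LC_ALL=C

# Hypothetical sample: keys 8 9 10 11, padded with leading spaces to width 3.
tmpdir=$(mktemp -d)
printf '%3d x\n' 8 9 10 11 > "$tmpdir/demo.list"

# Whole-line check, the way plain 'sort -c' sees the file.
if sort -c "$tmpdir/demo.list" 2>/dev/null; then
    whole_line=sorted
else
    whole_line=disorder
fi

# Check on field 1 with leading blanks skipped, the way join sees the key.
if sort -c -k 1b,1 "$tmpdir/demo.list" 2>/dev/null; then
    join_key=sorted
else
    join_key=disorder
fi

echo "whole line: $whole_line, join key: $join_key"
```

With the padding in place the whole-line check passes, while the
blank-skipping check on the first field reports disorder where 9 is
followed by 10.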
Just last month (Feb 19 2008) James Youngman added a new feature to
join that warns about this case. This very recent join prints the
following diagnostic. Over time it should make this problem much
easier to spot than it is with older versions of join.
join: File 1 is not in sorted order
join: File 2 is not in sorted order
Knowing this makes it obvious that I used the wrong sort check. What
I should have done was use -b to skip leading blanks, matching what
join does. Or more precisely 'sort -k 1b,1'.
for f in *.list; do sort -c -k 1b,1 "$f"; done
sort: pcmodel1000.list:10: disorder: 10 16021 10 957 5436
sort: radmodel1000.list:116: disorder: 1001 44867.9531250 393.2121582
Now the problem is much more apparent. The files need to be sorted in
the order that join expects: not numerically but lexically, using
'sort -k 1b,1'.
sort -k 1b,1 -o pcmodel1000.list pcmodel1000.list
sort -k 1b,1 -o radmodel1000.list radmodel1000.list
head -n10 pcmodel1000.list
1 16021 1 834 6525
10 16021 10 957 5436
100 16021 100 1764 714
1000 16023 247 4833 3609
101 16021 101 1932 588
102 16021 102 2058 501
103 16021 103 2418 399
104 16021 104 2256 447
105 16021 105 1644 849
Looks better for join even if it looks worse for humans. That is the
ordering that is needed for character sorting.
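To see the difference between the two orderings in isolation, here is
a small sketch. The three sample keys are made up, and LC_ALL=C is an
assumption added so the character comparison is byte-wise.

```shell
# Assumption: C locale, so keys are compared byte by byte.
export LC_ALL=C

# Character (lexical) order on field 1, as join expects it.
lexical=$(printf '%s\n' 9 10 100 | sort -k 1b,1 | tr '\n' ' ')

# Numeric order, which reads better for humans but is not what join wants.
numeric=$(printf '%s\n' 9 10 100 | sort -n | tr '\n' ' ')

echo "lexical: $lexical"
echo "numeric: $numeric"
```

The lexical run puts 10 and 100 ahead of 9 because the character "1"
collates before "9", which is the same pattern seen in the head output
above.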
join pcmodel1000.list radmodel1000.list | wc -l
115
That looks a little more reasonable.
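The whole fix can be tried end to end on throwaway data. This is a
sketch under stated assumptions: left.list and right.list are
hypothetical stand-ins for the real files, and LC_ALL=C is set so that
sort and join agree on the collating order.

```shell
# Assumption: one fixed locale for both sort and join.
export LC_ALL=C

# Hypothetical inputs with the first field padded by leading spaces.
tmpdir=$(mktemp -d)
printf '%3d a\n' 8 9 10 11 > "$tmpdir/left.list"
printf '%3d b\n' 8 9 10 11 > "$tmpdir/right.list"

# Re-sort each file in place on field 1, skipping leading blanks.
sort -k 1b,1 -o "$tmpdir/left.list" "$tmpdir/left.list"
sort -k 1b,1 -o "$tmpdir/right.list" "$tmpdir/right.list"

# Every key now pairs up; tr strips the padding some wc versions print.
joined=$(join "$tmpdir/left.list" "$tmpdir/right.list" | wc -l | tr -d ' ')
echo "joined lines: $joined"
```

All four keys appear in both sample files, so all four lines should
join once the files are in the order join expects.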
Hope that explanation helped.
Bob