Re: [coreutils] join feature: auto-format


From: Pádraig Brady
Subject: Re: [coreutils] join feature: auto-format
Date: Thu, 06 Jan 2011 12:05:01 +0000

On 07/10/10 19:25, Pádraig Brady wrote:
> On 07/10/10 18:43, Assaf Gordon wrote:
>> Pádraig Brady wrote, On 10/07/2010 06:22 AM:
>>> On 07/10/10 01:03, Pádraig Brady wrote:
>>>> On 06/10/10 21:41, Assaf Gordon wrote:
>>>>>
>>>>> The "--auto-format" feature simply builds the "-o" format line 
>>>>> automatically, based on the number of columns from both input files.
>>>>
>>>> Thanks for persisting with this and presenting a concise example.
>>>> I agree that this is useful and can't think of a simple workaround.
>>>> Perhaps the interface would be better as:
>>>>
>>>> -o {all (default), padded, FORMAT}
>>>>
>>>> where padded is the functionality you're suggesting?
>>>
>>> Thinking more about it, we mightn't need any new options at all.
>>> Currently -e is redundant if -o is not specified.
>>> So how about changing that so that if -e is specified
>>> we operate as above by auto inserting empty fields?
>>> Also I wouldn't base on the number of fields in the first line,
>>> instead auto padding to the biggest number of fields
>>> on the current lines under consideration.
>>
>> My concern is the principle of "least surprise" - if there are existing 
>> scripts/programs that specify "-e" without "-o" (doesn't make sense, but 
>> still possible) - this change will alter their behavior.
>>
>> Also, implying/forcing 'auto-format' when "-e" is used without "-o" might be 
>> a bit confusing.
> 
> Well seeing as -e without -o currently does nothing,
> I don't think we need to worry too much about changing that behavior.
> Also to me, specifying -e EMPTY implicitly means I want
> fields missing from one of the files replaced with EMPTY.
> 
> Note POSIX is more explicit, and describes our current operation:
> 
> -e EMPTY
>   Replace empty output fields in the list selected by -o with EMPTY
> 
> So changing that would be an extension to POSIX.
> But I still think it makes sense.
> I'll prepare a patch soon, to do as I describe above,
> unless there are objections.

The attached patch changes `join` (away from what's done on other platforms) so that:

`join -e` will automatically pad missing fields from one file
so that the same number of fields are output from each file.
Previously -e was only used for missing fields specified with -o or -j.

With this change join now does:

$ cat file1
a 1 2
b 1
d 1 2

$ cat file2
a 3 4
b 3 4
c 3 4

$ join -a1 -a2 -1 1 -2 1 -e. file1 file2
a 1 2 3 4
b 1 . 3 4
c . . 3 4
d 1 2 . .

$ join -a1 -a2 -1 1 -2 4 -e. file1 file2
. . . . a 3 4
. . . . b 3 4
. . . . c 3 4
a 1 2 . .
b 1 .
d 1 2 . .

$ join -a1 -a2 -1 4 -2 1 -e. file1 file2
. a 1 2 . . .
. b 1 . .
. d 1 2 . . .
a . . 3 4
b . . 3 4
c . . 3 4

$ join -a1 -a2 -1 4 -2 4 -e. file1 file2
. a 1 2 a 3 4
. a 1 2 b 3 4
. a 1 2 c 3 4
. b 1 . a 3 4
. b 1 . b 3 4
. b 1 . c 3 4
. d 1 2 a 3 4
. d 1 2 b 3 4
. d 1 2 c 3 4
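
For comparison, here is a sketch of how the first result above can be had from an unpatched join by writing the format out by hand. The scratch directory and the variable name `padded` are only for illustration, and the field list assumes you already know that neither file has more than three fields, which is exactly the knowledge the auto padding is meant to make unnecessary.

```shell
# Recreate the two sample files in a scratch directory (illustrative setup).
tmpdir=$(mktemp -d)
printf 'a 1 2\nb 1\nd 1 2\n' > "$tmpdir/file1"
printf 'a 3 4\nb 3 4\nc 3 4\n' > "$tmpdir/file2"

# Explicit format: the join field (0), then fields 2 and 3 of each file.
# -e. fills any listed field that is missing from a line with ".".
padded=$(join -a1 -a2 -e. -o 0,1.2,1.3,2.2,2.3 "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$padded"
rm -rf "$tmpdir"
```

The -o list has to be edited whenever the column count of either input changes.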

While -e without -o was previously a no-op, and so could safely be extended IMHO,
this will also change the behavior when both -e and -j are specified.
Previously, if -j > 1 was specified and that field was missing,
then -e would be used in its place, rather than the empty string.
The patch still does that, but also does the padding.
Without the -j issue I'd be 80:20 in favour of just extending -e to auto pad,
but given -j I'm 50:50.  The alternative is to select this with,
say, '-o padded', but that's less discoverable, and complicates
the interface somewhat.

cheers,
Pádraig.

Attachment: join-auto-format.diff
Description: Text Data

