coreutils


From: Rob Landley
Subject: Re: [PATCH 4/4] cut: Optionally treat multiple consecutive delimiters as one
Date: Wed, 16 Aug 2023 01:02:03 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.10.0


On 8/15/23 06:31, Pádraig Brady wrote:
> On 15/08/2023 11:22, Dragan Simic wrote:
>> On 2023-08-10 17:05, Dragan Simic wrote:
>>> On 2023-08-01 20:37, Dragan Simic wrote:
>>>> On 2023-08-01 16:42, Pádraig Brady wrote:
>>>>> On 01/08/2023 10:07, Dragan Simic wrote:
>>>>>> Add new command-line option and the required logic that allow multiple
>>>>>> consecutive delimiters to be treated as a single delimiter.  Of course,
>>>>>> this option is valid only with the cut's field mode.
>>>>>>
>>>>>> This new feature should make cut much more usable in various real-world
>>>>>> applications, some of which are already mentioned in the gotchas.  For
>>>>>> example, merging the consecutive delimiters is very useful when cut is
>>>>>> used to process the outputs of various commands.
>>>>>>
>>>>>> Add a whole battery of new cut tests, which cover this new feature, and
>>>>>> add more tests for the related already existing features, to make sure
>>>>>> no regressions are introduced.
>>>>>>
>>>>>> While there, clean up the comments and the whitespace in the cut tests
>>>>>> a bit, to make them slightly more readable.
>>>>>
>>>>> Thanks for the patch.
>>>>> I wonder whether a --empty-fields={ignore,suppress} is a more general
>>>>> interface.
>>>>
>>>> I wonder whether that would be a more complex approach and, more
>>>> importantly, less intuitive?  Quite frankly, I think it's easier to
>>>> visualize the empty space, or the delimiters as a more general
>>>> approach, becoming "squeezed".  I think that visualizing the empty
>>>> fields is harder, especially when the delimiter is a whitespace
>>>> character.
>>>>
>>>>> This overlaps somewhat with the -w option in FreeBSD's cut,
>>>>> which merges runs of whitespace, and which I was also considering
>>>>> adding.
>>>>
>>>> After thinking a bit about it, how about having both "-m", from the
>>>> patch I submitted, and "-w", which would behave differently than the
>>>> FreeBSD's "-w"?  Please, allow me to explain.
>>>>
>>>> More specifically, our "-w" would simply "squeeze" all the whitespace
>>>> in the input without forcing the delimiter to be whitespace.  The
>>>> "squeezing" would produce a whitespace character in the input, instead
>>>> of whatever got "squeezed" there.  That would be either the whitespace
>>>> character specified as an optional value for the "-w" option, or it
>>>> may by default produce a space wherever only spaces were "squeezed",
>>>> or a tab wherever the "squeezed" whitespace contained at least one
>>>> tab.
>>>>
>>>> With both "-m" and "-w" options in place we'd end up with a quite
>>>> versatile cut, which would cover what FreeBSD's cut does, and be able
>>>> to do more.  I'd be willing to implement the "-w" option as well.
>>>
>>> Just checking, any further thoughts on this approach?
>> 
>> This feature for cut has been requested more than a few times; here are
>> a few examples:
>> 
>> - https://stackoverflow.com/questions/21322968/does-cut-support-multiple-spaces-as-the-delimiter
>> - https://stackoverflow.com/questions/7142735/how-to-specify-more-spaces-for-the-delimiter-using-cut
>> - https://unix.stackexchange.com/questions/109835/how-do-i-use-cut-to-separate-by-multiple-whitespace
>> - https://unix.stackexchange.com/questions/606639/why-does-cut-d-not-work-with-space-in-this-case
>> - https://unix.stackexchange.com/questions/387544/cut-with-2-character-delimiter
>> - https://stackoverflow.com/questions/25447324/how-to-use-cut-with-multiple-character-delimiter-in-unix
>> 
>> I'd really appreciate it if we could discuss this further.
> 
> Yes this functionality is definitely under consideration.
> The interface is the main consideration for me at present.
> I need to review the existing interfaces to see how best to proceed.
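
For reference, the default "-w" behaviour proposed above (a run of only
spaces becomes one space, a run containing at least one tab becomes one
tab) can be roughly approximated with sed today. This is only a sketch
under that reading of the proposal; squeeze_ws is a hypothetical helper
name, not something from the patch or from any cut implementation:

```shell
# squeeze_ws: hypothetical helper, illustrating the proposed default "-w"
# squeezing rule with plain sed. Not taken from the patch series.
squeeze_ws() {
  tab=$(printf '\t')
  # 1) any run of blanks that contains a tab collapses to one tab
  # 2) any remaining run of spaces collapses to one space
  sed -e "s/[ $tab]*$tab[ $tab]*/$tab/g" -e 's/  */ /g'
}

# Example: squeeze, then let cut split on the single remaining space
printf 'one  two   three\n' | squeeze_ws | cut -d ' ' -f 2
```

The optional "-w" value (forcing a particular whitespace character) is
not covered by this sketch.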

Would this be instead of the -DFO stuff, or in addition to?

It seems to me that "delimiter can be a regex, which means it can be an
arbitrary string if you don't use special characters or escape them", covers the
use case? And the default delimiter for -F _is_ a run of whitespace, because
it's the common case when replacing awk '{print $3,$7}'.

This has worked in toybox (and busybox) for years now:

$ echo "one  two   three" | toybox cut -F 2
two
$ echo abconedefoneghi | toybox cut -F 2 -d one
def
$ echo abconeonedefoneoneoneghionejkl | toybox cut -F 2,3 -d '(one)+' -O potato
defpotatoghi
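
For comparison, and only as a sketch, the same three results can be
obtained with tools that are already everywhere: tr -s to squeeze a
repeated delimiter before cut, or awk with a regex field separator.
This assumes an awk that treats a multi-character FS as an extended
regex, which is what POSIX specifies:

```shell
# Squeeze runs of spaces first, then cut on a single space.
# Note: a leading space would still yield an empty first field.
echo "one  two   three" | tr -s ' ' | cut -d ' ' -f 2

# awk's default field splitting already merges runs of whitespace.
echo "one  two   three" | awk '{print $2}'

# Regex delimiter plus a custom output separator via OFS.
echo abconeonedefoneoneoneghionejkl | awk -F '(one)+' -v OFS=potato '{print $2, $3}'
```

The first two print "two" and the last prints "defpotatoghi", matching
the toybox output above.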

Prebuilt binaries you can play with:

https://landley.net/bin/toybox/0.8.10/

> cheers,
> Pádraig

Rob


