coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: cut -DF


From: Pádraig Brady
Subject: Re: cut -DF
Date: Fri, 7 Jan 2022 15:06:28 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Thunderbird/95.0

On 06/01/2022 23:02, Assaf Gordon wrote:
Hello,

On 2022-01-06 7:35 a.m., Pádraig Brady wrote:
Thanks for taking the time to consolidate options/functionality
across different implementations.  This is important for users.
Some notes below...

On 05/01/2022 16:23, Rob Landley wrote:
Around 5 years ago toybox added the -D, -F, and -O options to cut:

      -D  Don't sort/collate selections or match -fF lines without
delimiter
      -F  Select fields separated by DELIM regex
      -O  Output delimiter (default one space for -F, input delim for -f)


As I see it, the main functionalities added here:
    - reordering of selected fields
    - adjusted suppression of lines without matching fields
    - regex delimiter support

I see regex support as less important, but still useful.



Attached is a suggestion for initial implementation of "cut -FDO".
It's split into smaller steps to ease review.

The main issue is that the current "cut_fields" and "cut_bytes" are
highly optimized for speed, so I left them as-is and created a secondary
set of 'cut' functions - slower but with additional options.

If this is acceptable, I'll go on to clean up the patches, add more
tests and write documentation.

There are likely some edge-cases regarding regex matching that need to
be decided upon (e.g. BRE or ERE, what about BOL/EOL anchors, groups, etc.).

Comments and feedback very welcomed,

Looking good.

About the default regex delimiter, you're checking for "blanks" rather than 
"whitespace".
join(1) has similar functionality, and it delimits fields also with blanks,
so if this is the case we should be careful to describe it as such
in the comments, and documentation.
BTW it might be more performant to use isblank() rather than regular expressions
for this mode (if it wasn't too awkward)

About regular expression syntax, it's probably best to be consistent
with other uses in coreutils:

$ git grep RE_SYNTAX
src/csplit.c:    RE_SYNTAX_POSIX_BASIC & ~RE_CONTEXT_INVALID_DUP & 
~RE_NO_EMPTY_RANGES;
src/expr.c:    RE_SYNTAX_POSIX_BASIC & ~RE_CONTEXT_INVALID_DUP & 
~RE_NO_EMPTY_RANGES;
src/nl.c:        RE_SYNTAX_POSIX_BASIC & ~RE_CONTEXT_INVALID_DUP & 
~RE_NO_EMPTY_RANGES;

I agree with the splitting out of the implementation for this.
The existing implementation is highly tuned for perf,
so it would be good to keep that.

I'm not sure about the --allow-duplicates long option for -D,
as the duplicate aspect is not the only or even the most important
aspect of that mode.  I do like it matches the "-D", though I
don't like "-D" was used for this as I mentioned previously.
Maybe -D, --definitive-list would be better?
I.e. the specified list has a defined order and duplicates.

thanks!
Pádraig



reply via email to

[Prev in Thread] Current Thread [Next in Thread]