From: Kingsley G. Morse Jr.
Subject: bug#32099: Testing with other options (Was: Benchmarks: Hashing ~70% faster)
Date: Fri, 13 Jul 2018 20:03:01 -0700
User-agent: NeoMutt/20170306 (1.8.0)
Hey Berny,
I like your suggestion of testing whether hashing
interferes with any other option.
I was glad to see the POSIX standard doesn't
explicitly require sorted input or output.
If someone writes the patch, I should be able to
test it, at least on up to 1 GB of input.
So,
Kingsley
On 07/12/2018 08:43, Bernhard Voelker wrote:
> On 07/10/2018 08:21 PM, Assaf Gordon wrote:
> > I would suggest several improvements to the measurements, to ensure
> > relevant information is conveyed:
> >
> > 1.
> > [...]
>
> Great summary. At the risk of repeating aspects already mentioned:
>
> 7. Consider all the existing options, i.e. processing modes, of 'uniq' as well.
>    This means it has to be considered (upfront!) whether introducing an
>    alternate way to do things - in this case hashing - only serves the trivial
>    case, or whether it would slow down or even contradict processing with the
>    other options, currently:
>
> -c, --count prefix lines by the number of occurrences
> -d, --repeated only print duplicate lines, one for each group
> -D print all duplicate lines
> --all-repeated[=METHOD] like -D, but allow separating groups
> with an empty line;
> METHOD={none(default),prepend,separate}
> -f, --skip-fields=N avoid comparing the first N fields
> --group[=METHOD] show all items, separating groups with an empty line;
> METHOD={separate(default),prepend,append,both}
> -i, --ignore-case ignore differences in case when comparing
> -s, --skip-chars=N avoid comparing the first N characters
> -u, --unique only print unique lines
> -z, --zero-terminated line delimiter is NUL, not newline
> -w, --check-chars=N compare no more than N characters in lines
>
> Without thinking about it too deeply, the combinations with the
> --group, --repeated and --unique options in particular might be problematic.
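The conflict with --count, for one, is easy to see: uniq -c counts runs of adjacent lines, so on unsorted input a global hash table would necessarily report different counts.

```shell
# uniq -c counts *adjacent* repeats only: on unsorted input the two
# 'a' lines below fall into separate groups, each with count 1.
printf 'a\nb\na\n' | uniq -c
# A hash-based global count would instead merge them (a=2, b=1),
# changing the observable output.
```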
>
> 8. Standards. POSIX says:
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html
>
> The uniq utility shall read an input file comparing adjacent lines,
> and write one copy of each input line on the output. The second and
> succeeding copies of repeated adjacent input lines shall not be written.
>
> This makes assumptions about both the input and the output.
>
> Regarding the input, this means that the processing only has to remember
> the previous line in order to decide whether to print or discard it in the
> output. Therefore, input of arbitrary size is fine.
> Hashing will most probably have issues with arbitrarily large input;
> I am not talking about 1 GB files, but _much_ larger ones (yes, it
> is 2018, when 1 GB is like nothing).
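One way to see the memory tradeoff: the classic awk idiom below is effectively a hash-based uniq, and it has to keep every distinct line in memory, whereas uniq only remembers the previous line.

```shell
# Hash-based de-duplication (keeps the first occurrence of every line):
# the 'seen' array grows with the number of *distinct* lines, so memory
# use is unbounded for arbitrarily large, high-cardinality input.
printf 'b\na\nb\nc\n' | awk '!seen[$0]++'
# uniq, by contrast, holds only the previous line, whatever the input size.
```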
>
> Regarding the output, this means that the output is implicitly in
> sort order as well. As with the input, uniq can discard the already
> written data from memory because it is certain it will not need
> to consider it again.
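A quick sanity check of that point: on sorted input, adjacent de-duplication is exactly set-uniqueness, which is why 'sort | uniq' and 'sort -u' agree, and why uniq can flush each line as soon as the next one differs.

```shell
# On sorted input, comparing only adjacent lines removes all duplicates,
# and the output stays sorted.
printf 'c\na\nb\na\n' | sort | uniq
printf 'c\na\nb\na\n' | sort -u   # same result
```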
>
>
> That said, a successful optimization idea does not only have to show
> that it is faster or needs fewer resources in _some_ cases; it also
> must prove that it does not degrade many other cases, including extreme
> ones. The current implementation might be as-is for a good reason.
> If the optimization idea turns out to break even a single one of the
> above use cases, then the dilemma is whether keeping two implementations
> is warranted (maintenance!), and whether the extreme cases can be
> detected early enough to switch from the default to the other
> processing strategy.
>
> https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good
> or D. Knuth:
> "premature optimization is the root of all evil
> (or at least most of it) in programming"
>
> As Assaf said, a patch with a proof of concept would be helpful ... even
> if it only helps to prove that the current implementation is fine.
>
> Have a nice day,
> Berny
--
Time is the fire in which we all burn.