From: Kingsley G. Morse Jr.
Subject: bug#32099: Testing with other options (Was: Benchmarks: Hashing ~70% faster)
Date: Fri, 13 Jul 2018 20:03:01 -0700
User-agent: NeoMutt/20170306 (1.8.0)
Hey Berny,
I like your suggestion of testing whether hashing
interferes with any other option.
I was glad to see the POSIX standard doesn't
explicitly require sorted input or output.
If someone writes the patch, I should be able to
test it, at least on up to 1 GB of input.
So,
Kingsley
On 07/12/2018 08:43, Bernhard Voelker wrote:
> On 07/10/2018 08:21 PM, Assaf Gordon wrote:
> > I would suggest several improvements to the measurements, to ensure
> > relevant information is conveyed:
> >
> > 1.
> > [...]
>
> Great summary. At the risk of repeating aspects already mentioned:
>
> 7. Consider all the existing options, i.e. processing modes, of 'uniq' as well.
>    This means it has to be considered (upfront!) whether introducing an
>    alternate way to do things - in this case hashing - only serves the trivial
>    case, or whether it would slow down or even contradict processing with the
>    other options, currently:
>
> -c, --count prefix lines by the number of occurrences
> -d, --repeated only print duplicate lines, one for each group
> -D print all duplicate lines
> --all-repeated[=METHOD] like -D, but allow separating groups
> with an empty line;
> METHOD={none(default),prepend,separate}
> -f, --skip-fields=N avoid comparing the first N fields
> --group[=METHOD] show all items, separating groups with an empty line;
> METHOD={separate(default),prepend,append,both}
> -i, --ignore-case ignore differences in case when comparing
> -s, --skip-chars=N avoid comparing the first N characters
> -u, --unique only print unique lines
> -z, --zero-terminated line delimiter is NUL, not newline
> -w, --check-chars=N compare no more than N characters in lines
>
> Without thinking about it too deeply, the combinations with the
> --group, --repeated and --unique options in particular might be problematic.
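The conflict with --count, for one, is easy to see: uniq -c counts runs of adjacent lines, so on unsorted input a global hash table would necessarily report different counts.

```shell
# uniq -c counts *adjacent* repeats only: on unsorted input the two
# 'a' lines below fall into separate groups, each with count 1.
printf 'a\nb\na\n' | uniq -c
# A hash-based global count would instead merge them (a=2, b=1),
# changing the observable output.
```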
>
> 8. Standards. POSIX says:
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html
>
> The uniq utility shall read an input file comparing adjacent lines,
> and write one copy of each input line on the output. The second and
> succeeding copies of repeated adjacent input lines shall not be written.
>
> This makes assumptions about both the input and the output.
>
> Regarding the input, this means that the processing only has to remember
> the previous line in order to decide whether to print or discard it in the
> output. Therefore, input of arbitrary size is fine.
> Hashing will most probably have issues with arbitrarily large input;
> I am not talking about 1 GB files, but _much_ larger ones (yes, it
> is 2018, when 1 GB is like nothing).
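One way to see the memory tradeoff: the classic awk idiom below is effectively a hash-based uniq, and it has to keep every distinct line in memory, whereas uniq only remembers the previous line.

```shell
# Hash-based de-duplication (keeps the first occurrence of every line):
# the 'seen' array grows with the number of *distinct* lines, so memory
# use is unbounded for arbitrarily large, high-cardinality input.
printf 'b\na\nb\nc\n' | awk '!seen[$0]++'
# uniq, by contrast, holds only the previous line, whatever the input size.
```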
>
> Regarding the output, this means that the output is implicitly in
> sort order as well. As with the input, uniq can discard the already
> written data from memory because it is certain it will not need
> to consider it again.
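A quick sanity check of that point: on sorted input, adjacent de-duplication is exactly set-uniqueness, which is why 'sort | uniq' and 'sort -u' agree, and why uniq can flush each line as soon as the next one differs.

```shell
# On sorted input, comparing only adjacent lines removes all duplicates,
# and the output stays sorted.
printf 'c\na\nb\na\n' | sort | uniq
printf 'c\na\nb\na\n' | sort -u   # same result
```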
>
>
> That said, a successful optimization idea does not only have to show
> that it is faster or needs fewer resources in _some_ cases; it also
> must prove that it does not degrade many other cases, including extreme
> ones. The current implementation might be as-is for a good reason.
> If the optimization idea turns out to break even a single one of the
> above use cases, then the dilemma is whether keeping two implementations
> is warranted (maintenance!), and whether the extreme cases can be
> detected early enough to switch from the default to the other
> processing strategy.
>
> https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good
> or D. Knuth:
> "premature optimization is the root of all evil
> (or at least most of it) in programming"
>
> As Assaf said, a patch with a proof of concept would be helpful ... even
> if it only helps to prove that the current implementation is fine.
>
> Have a nice day,
> Berny
--
Time is the fire in which we all burn.