bug-coreutils

bug#32099: datamash wins! (Was: New uniq option for speed)


From: Kingsley G. Morse Jr.
Subject: bug#32099: datamash wins! (Was: New uniq option for speed)
Date: Wed, 11 Jul 2018 16:43:09 -0700
User-agent: NeoMutt/20170306 (1.8.0)

Hey Assaf,

Thank you very much for taking the time to share
your thoughts.

People like you, willing to improve free software
and help the common good, restore my faith in
humanity.


I followed much of your advice.

I ran more benchmarks: 

gawk, mawk, sort -u, sort | uniq, and... datamash rmdup!

Plus, I benchmarked them against a bigger file.

~1 GB!

Plus, plus, I measured how much memory they used!


The result?

Hashing algorithms always won.

I expected it.

When I benchmarked data compression apps, one
named "rzip", which searched its cache of unique
values with hashing, was much faster.

    https://www.linuxjournal.com/article/8051


But, to be fair, I was surprised by how little
memory datamash seemed to use.

Maybe I screwed up.

I checked the maximum resident set size of each
process over its lifetime with /usr/bin/time's
"%M" format specifier.
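
For example, each run looked roughly like this
("input" stands in for the actual test file):

    # GNU time: %e = elapsed seconds, %M = max resident set size in KB
    /usr/bin/time -f "elapsed=%e maxrss_kb=%M" \
        datamash rmdup 1 < input > /dev/null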


Plus, plus, plus, I created a composite
performance criterion of

    elapsed time * memory used

(It's in the attached spreadsheet.)
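
A rough sketch of computing that composite straight
from the shell (file names are placeholders):

    # GNU time writes its stats to stderr
    /usr/bin/time -f "%e %M" sort -u input > /dev/null 2> stats
    # composite = elapsed seconds * max RSS in KB
    awk '{print "composite =", $1 * $2}' stats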

The winner at ~1 GB?

datamash!

No matter whether the keys were unique or
duplicated, "datamash rmdup 1" won!

New files are attached.


Humble suggestion:

Don't worry about running out of memory.

Hashing could be added to the "uniq" command as...
(wait for it...) an option.

So you could always omit it and use sort.
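
To be concrete, the behavior I have in mind is
roughly what this well-known awk one-liner already
does (keep the first occurrence of each line, in
input order, no sort needed; "input" is again a
placeholder):

    # hash table keyed on the whole line
    awk '!seen[$0]++' input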


For yours truly, the big gain is actually being
able to ditch sort.

Less is more.

It's a UNIX thing.


So it is said.

So it is done.

Sorting is great, but

less is number one!

~K

PS: Congratulations on datamash performing so
    well.

On 07/10/2018 12:21, Assaf Gordon wrote:
> tag 32099 notabug
> severity 32099 wishlist
> close 32099
> stop
> 
> Hello Kingsley,
> 
> On 09/07/18 10:29 PM, Kingsley G. Morse Jr. wrote:
> > And I like your benchmarks.
> > 
> > Mine are humbly attached.
> > 
> > They compare sorting to hashing.
> > 
> > At the moment, hashing seems to me to be about 70%
> > faster.
> > 
> > And to scale better.
> > 
> > I'd still like uniq to be faster and free from
> > sort.
> 
> Thank you for taking the time to create and provide these,
> including the scripts to reproduce it.
> 
> I've marked this item as "closed / wishlist" - but only because it
> is technically not a bug report. Discussion is welcome to continue
> in replies to this thread.
> 
> Being "faster" is important, but it is not the only criterion (as
> mentioned in the examples in the previous email).
> 
> Your measurements are a good start towards stronger support for a
> hash-based uniq implementation (a working patch, btw, would be even
> stronger).
> 
> I would suggest several improvements to the measurements, to ensure
> relevant information is conveyed:
> 
> 1.
> Specify exactly which versions of awk/sort you are using
> (whether they come from your distribution or were compiled from source,
> and if compiled, which compiler / optimization flags).
> Which CPU is being used, how much memory the computer has, etc.
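>
> For example, something along these lines would capture most of it
> (adjust for the exact tools you benchmark):
>
>   sort --version | head -n1
>   gawk --version | head -n1
>   mawk -W version 2>&1 | head -n1
>   datamash --version | head -n1
>   lscpu | grep -E 'Model name|^CPU\(s\)'
>   free -h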
> 
> 2.
> I would suggest adding "sort | uniq" to the charts, because that is the
> real baseline you are trying to improve on.
> I would also consider adding "datamash rmdup" to the charts, because its
> hash-based implementation could serve as a ballpark indication of how
> a naive hash implementation would behave in uniq.
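>
> As a sketch (file names are placeholders):
>
>   sort input | uniq > /dev/null
>   sort -u input > /dev/null
>   datamash rmdup 1 < input > /dev/null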
> 
> 3.
> Your input is very small (up to 90K lines of 10 characters each) - that
> does not seem to stress-test any significant part of the algorithms
> (and for a typical user, waiting 0.05 seconds vs 0.15 seconds does not
> matter).
> 
> I would recommend much larger input files, with much longer lines -
> trying to simulate various real-world scenarios.
> 
> Here's a project by someone who tried to compare various hashing
> functions: https://github.com/jhunt/hash/
> Under the "corpus" directory you'll find a few demo data files.
> Even the largest one (qnames) is only 84MB - not very big.
> 
> Perhaps consider duplicating the content a few times, then test it:
>   for i in $(seq 1 5);do cat qnames; done > input
> 
> To get even closer to real-world input, try several input files
> with different ratios of duplicated lines (your script creates completely
> random data - not very similar to real-world input files).
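>
> As an illustration, shuf can generate input with a controllable
> amount of duplication (the line counts here are arbitrary):
>
>   # ~10 million lines drawn, with repetition, from 1 million keys,
>   # i.e. roughly a 10:1 duplication ratio
>   shuf -r -n 10000000 -i 1-1000000 > input.dup10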
> 
> For example, what happens when many lines share similar sub-strings?
> Depending on the hashing implementation, this could lead to lower
> performance.
> 
> 4.
> It would be beneficial to see what the memory requirements and limits
> of a hash-based implementation are. Currently, because 'sort' is not
> limited by available RAM, there is no memory limit (though resorting to
> disk I/O might make everything slow).
> For example, is it possible to hash an 800MB file on a machine with only
> 1GB of RAM? Without additional code, hashing files larger than available
> memory is simply not possible - a very big disadvantage.
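>
> One way to probe this is to cap the available memory with ulimit in a
> subshell and see which approaches still finish (sizes are only for
> illustration, and "input800M" is a placeholder):
>
>   # limit virtual memory to ~1GB (ulimit -v takes KB)
>   ( ulimit -v 1000000; datamash rmdup 1 < input800M > /dev/null )
>   ( ulimit -v 1000000; sort -u input800M > /dev/null )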
> 
> 5.
> To compare "sort" fairly against the other options,
> I would add measurements using "--parallel=1" and later "--parallel=N"
> (N being the number of CPU cores you have).
> Does that improve performance?
> 
> The "awk" (or datamash) programs do not have a built-in memory limit,
> while sort does. It would be useful to give sort a large amount of
> memory (using "-S") to ensure it does not resort to disk I/O.
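>
> For instance (the 2G buffer is just an arbitrary "large enough" value,
> and "input" is a placeholder):
>
>   sort -u --parallel=1 -S 2G input > /dev/null
>   sort -u --parallel=$(nproc) -S 2G input > /dev/null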
> 
> 6.
> The reported time in your script is the elapsed time (aka "wall time").
> A more relevant measure would be "user time" + "kernel time"
> (options %U and %S in "time -f") - the wall time can be affected by
> factors that don't directly relate to hashing/sorting performance.
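>
> For example ("input" again being a placeholder file):
>
>   /usr/bin/time -f "user=%U sys=%S wall=%e" sort -u input > /dev/null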
> 
> 7.
> Since your script provides file-based input, it might be useful to
> clear the kernel's page cache before each run, to ensure cached data
> does not skew the results (especially when dealing with large files).
> See:
> https://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/
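>
> Something like this before each timed run (requires root):
>
>   sync; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null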
> 
> To conclude:
> there is definitely merit in trying to optimize uniq's performance,
> but it must be weighed against other factors (e.g. the ability to 'uniq'
> a file larger than available memory, and the ease of using existing
> programs to achieve the same effect).
> 
> IMHO it is not a top priority, so I don't want to give the impression
> that if an amazing speed improvement is presented, uniq will be swiftly
> rewritten.
> 
> But of course, a working patch and strong pro arguments would go a long
> way towards incorporating this feature.
> 
> regards,
>  - assaf

-- 
Time is the fire in which we all burn.

Attachment: uniq_demo
Description: updated benchmarking script

Attachment: hash_and_sort_benchmarks.2.gnumeric
Description: updated Gnumeric spreadsheet

Attachment: uniq_demo.log.2
Description: new benchmark results

