bug#32073: Improvements in Grep (Bug#32073)

From: Sergiu Hlihor
Subject: bug#32073: Improvements in Grep (Bug#32073)
Date: Wed, 1 Jan 2020 20:06:39 +0100

Arnold, there is no need to write user code; it has already been done in
benchmarks. One of the standard benchmarks when testing HDDs and SSDs is
read throughput versus block size at different queue depths. Take a look
at this benchmark: at queue depth 4 and a 128KB block size, the SSD was
not yet able to reach its maximum throughput of 5GB/s. Moreover, if you
extrapolate the results to a queue depth of 1, you get roughly 1.2GB/s out
of over 5GB/s theoretical. For that particular model you therefore need to
issue read requests with a block size of at least 512KB to achieve maximum
throughput.

With hard drives I have already explained the issue. I have a production
server whose HDD RAID array can theoretically do 2.5GB/s, and I see
sustained read speeds of over 500MB/s when large block sizes are used for
reads, yet when I use grep I get a practical bandwidth of 20 to 50 MB/s.
Moreover, when it comes to HDDs the math is quite simple; here it is for a
standard HDD at 7200 RPM with a 240MB/s media transfer rate:
7200 RPM => 120 revolutions per second
240 MB/s at 120 revolutions/s => 2MB per revolution
One revolution = 1000/120 => 8.33 ms
Read throughput per ms = 240KB

Worst-case scenario: each read request requires a full revolution to reach
the data (head positioning is done concurrently, so it can be overlapped
with the rotation):
Seek time: 8.33ms
At 96KB:
 - Read time: 96/240 = 0.4ms
 - Total read latency = 8.33 + 0.4 = 8.73ms; read throughput = 1000 /
8.73 * 96KB ≈ 11MB/s
At 512KB:
 - Read time: 512/240 = 2.13ms
 - Total read latency = 8.33 + 2.13 = 10.46ms; read throughput = 1000 /
10.46 * 512KB ≈ 49MB/s
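The seek-plus-transfer model above can be checked with a short Python
sketch. The 8.33ms worst-case rotational delay and the 240KB/ms media rate
are the figures from this example, not measured values:

```python
def hdd_read_throughput_mb_s(block_kb, seek_ms=8.33, media_kb_per_ms=240):
    """Model random-read throughput of a 7200 RPM HDD.

    seek_ms: worst-case rotational delay (one full revolution).
    media_kb_per_ms: sequential media transfer rate (240 MB/s = 240 KB/ms).
    """
    transfer_ms = block_kb / media_kb_per_ms      # time to read the block itself
    reads_per_s = 1000 / (seek_ms + transfer_ms)  # one seek charged per request
    return reads_per_s * block_kb / 1000          # KB/s -> MB/s

for block in (96, 512, 768):
    print(f"{block:4d}KB -> {hdd_read_throughput_mb_s(block):5.1f} MB/s")
```

Halving seek_ms to 4.16 (the average case mentioned below) roughly doubles
each result, matching the text.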
In practice the average rotational latency is 4.16ms (half a revolution),
so throughput is roughly double. This is the cold hard reality. When each
one of you tests, you are very likely deceived by testing on *one HDD, on
an idle system* where nothing else is consuming IO in the background, such
as a database. In such an ideal scenario you do see 240MB/s, because HDDs
also read ahead: by the time one chunk has been transferred over the
interface and consumed, the next chunk is already in the buffer and can be
delivered with an apparent seek time of zero. This means the first read
takes 4ms and the following ones take 0.1ms.

With a *HDD RAID array on a server where your IO is always at 50% load*,
if you have a stripe size of 128KB or more, you are hitting one drive at a
time, each with a penalty of 4.16ms. And due to the constant load, by the
time you hit the first HDD again, the read-ahead buffer maintained by the
drive itself has been discarded, so all reads go directly to the physical
medium. If however you hit all HDDs at the same time, you benefit from the
drives' read-ahead for at least one or more cycles, getting reads with
apparently zero latency and a much higher average bandwidth. The cost of
reading from all HDDs at once is potentially added latency for every other
application running; this is why the value should be configurable, so that
the best value can be set for the hardware at hand.

The issue of too-small block sizes for IO operations is widespread across
Linux tools such as rsync and cp, and it is only getting worse, to the
extent that my company is considering writing its own tools for something
that should have worked out of the box.

One side issue, which I have to mention as I am not aware of the
implementation details: as we get into GB/s territory, reading is best
done in its own thread, which then serves the data to the processing
thread. With SSDs that can do multiple GB/s, this matters.
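The reader-thread idea can be sketched as follows. This is a minimal
illustration with a hypothetical 1MB block size, not grep's actual
implementation (real grep would still need its line-boundary handling on
top of the raw blocks):

```python
import queue
import threading

BLOCK_SIZE = 1024 * 1024  # hypothetical large, tunable read size

def reader(path, out_q):
    """Read the file in large blocks in a dedicated thread."""
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            out_q.put(block)
    out_q.put(None)  # sentinel: end of file

def count_bytes(path):
    """Consume blocks in the main thread while the reader stays ahead."""
    q = queue.Queue(maxsize=4)  # bounded, so the reader cannot run away
    t = threading.Thread(target=reader, args=(path, q))
    t.start()
    total = 0
    while (block := q.get()) is not None:
        total += len(block)  # stand-in for the real matching work
    t.join()
    return total
```

Because blocking file reads release Python's GIL, the read genuinely
overlaps with processing here; in grep itself the equivalent would be a
worker thread or asynchronous IO in C.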

On Wed, 1 Jan 2020 at 12:19, <address@hidden> wrote:

> As a quite serious question, how is someone writing user-level code
> supposed to be able to figure out the right buffer size for a particular
> file, and to do so portably? ("Show me the code.")
> Gawk bases its reads on the st_blksize member in struct stat.  That will
> typically be something like 4K - not nearly enough, given your description
> below.
> Arnold
> Sergiu Hlihor <address@hidden> wrote:
> > This topic is getting more and more frustrating. If you rely on the OS,
> > then you are at the mercy of whatever read-ahead configuration you have.
> > And read-ahead is typically 128KB, so it does not help that much. A HDD
> > RAID 10 array with 12 disks and a stripe size of 128KB reaches its
> > maximum read throughput only if the read block size is 6 * 128 = 768KB.
> > When issuing read requests with 128KB, you only hit one HDD, getting 1/6
> > of the read throughput. With flash it is the same. A state-of-the-art
> > SSD that can do 5GB/s reads can actually do around 1GB/s or less at a
> > 128KB block size. Why is it so hard to understand how the hardware
> > works, and that you need huge block sizes to actually read at full
> > speed? Why not just expose the read buffer size as a configurable
> > parameter, so that anyone can tune it as needed? 96KB is far too small.
> >
> > On Wed, 1 Jan 2020 at 08:52, Paul Eggert <address@hidden> wrote:
> >
> > > > This makes me think we should follow Coreutils' lead[0] and increase
> > > > grep's initial buffer size from 32KiB, probably to 128KiB.
> > >
> > > I see that Jim later installed a patch increasing it to 96 KiB.
> > >
> > > Whatever number is chosen, it's "wrong" for some configuration. And I
> > > suppose
> > > the particular configuration that Sergiu Hlihor mentioned could be
> tweaked
> > > so
> > > that it worked better with grep (and with other programs).
> > >
> > > I'm inclined to mark this bug report as a wishlist item, in the sense
> that
> > > it'd
> > > be nice if grep and/or the OS could pick buffer sizes more
> intelligently
> > > (though
> > > it's not clear how grep and/or the OS could go about this).
> > >
