
[Grep-devel] decompression/decoding preprocessor subprogram patch


From: Charle chaud
Subject: [Grep-devel] decompression/decoding preprocessor subprogram patch
Date: Thu, 29 Mar 2018 19:55:59 -0400

This is a long email for a short patch I hope you like.  I start with
user-level stuff and finish with coding.  Your patience is appreciated.

I have something like 15 zgrep-style wrapper scripts on my systems now.
They're fragile (e.g., when grep gains an option/ability like --color,
it can take many years for the wrappers to pick up the new feature, if
ever).  The GNU grep TODO has mentioned unifying over gzip/bzip2 for
over a year, based on some FreeBSD idea that I don't know anything
about (I haven't looked at any of that code).  So, GNU grep already has
a --label option and already accepts some maintenance burden to support
auto-decoding.

With this patch, [ef]*grep -p zcat works the way z[ef]*grep always
have, i.e. with the requirement that all inputs are gzipped, only
without all the wrapper-script gotchas.  Moreover, the way this is set
up, the burden of tracking an ever-growing set of encodings falls not
on GNU grep but on some third party, or on whoever is doing the
encoding in the first place.  For under 60 lines of new C code, many
fragile wrapper scripts can be replaced by simultaneously simpler and
more complete grep functionality!

Even better, it is easy to achieve unification over gzip/bzip2/xz/any
desirable encoding.  One just needs a program that can dispatch to the
right decoder.  The simplest possibility is filename extension-based
identification using the patch's new GREP_INPUT environment variable:
  -- grepPP.sh -- (extension-based; see below for magic number-based)
  #!/bin/sh
  case "$GREP_INPUT" in     # should be safe from most hostile filenames
    *gz)  exec gzip -dc ;;  # "exec" here just saves a fork()
    *xz)  exec xzcat ;;
    *)    exec cat ;;       # add however many more file types above..
  esac
With such a script in PATH, "grep -p grepPP.sh" can work through sets
of files compressed many different ways (or not compressed at all!).
"/var/log" and pre-formatted man page directories often have that sort
of structure, to name just two examples.  The older a set of files is,
the more likely it is to contain .gz, .xz, .bz2, etc. variations.

These days, there are *many* extensions and surely a few collisions.
A less name-based/name-trusting design would examine the magic numbers
of stdin.  For example, in shell it would *almost* be as simple as:
  case "`file -`" in
    *gzip compressed*) gzip -dc ;;
    *) cat ;;
  esac
That shell snippet won't *quite* work because gzip will not see the
initial data that "file" reads, in particular the magic number.  "file"
does not at present have an option to reset stdin's file position to
the start of the file with lseek(2).  Even if it had such an option,
some inputs (e.g. pipes, FIFOs) may not be seekable.  In that case, the
decoder dispatcher program must identify the type, create a child
decoder, replenish the bytes it consumed during identification, and
write the rest of the input to the child.
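
To make that concrete, below is a minimal sketch of such a dispatcher,
assuming gzip/xzcat/bzcat are on PATH.  It only illustrates the
technique (identify, spawn, replenish, stream); it is not the program
described next, and a real version would loop on short reads, handle
SIGPIPE, and report errors:
  /* magic-dispatch.c -- hypothetical sketch of a magic-number-based
     decoder dispatcher for use as a grep -p COMMAND. */
  #include <string.h>
  #include <unistd.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  int main(void) {
      unsigned char magic[6];
      ssize_t got = read(0, magic, sizeof magic);  /* bytes consumed to ID */
      if (got < 0) return 1;

      char *const gz[] = { "gzip", "-dc", NULL };
      char *const xz[] = { "xzcat", NULL };
      char *const bz[] = { "bzcat", NULL };
      char *const id[] = { "cat", NULL };
      char *const *cmd = id;                       /* default: pass through */
      if (got >= 2 && magic[0] == 0x1f && magic[1] == 0x8b)
          cmd = gz;                                /* gzip:  1f 8b */
      else if (got >= 6 && memcmp(magic, "\xfd" "7zXZ\0", 6) == 0)
          cmd = xz;                                /* xz:    fd 37 7a 58 5a 00 */
      else if (got >= 3 && memcmp(magic, "BZh", 3) == 0)
          cmd = bz;                                /* bzip2: 42 5a 68 */

      int fd[2];                                   /* pipe feeding the decoder */
      if (pipe(fd) != 0) return 1;
      pid_t pid = fork();
      if (pid < 0) return 1;
      if (pid == 0) {                              /* child: become the decoder */
          dup2(fd[0], 0);
          close(fd[0]);
          close(fd[1]);
          execvp(cmd[0], cmd);
          _exit(127);
      }
      close(fd[0]);

      /* replenish the bytes consumed for identification ... */
      if (got > 0 && write(fd[1], magic, got) != got) return 1;
      /* ... then hand the remainder of the input to the decoder */
      char buf[65536];
      while ((got = read(0, buf, sizeof buf)) > 0)
          if (write(fd[1], buf, got) != got) break;
      close(fd[1]);

      int status;
      waitpid(pid, &status, 0);
      return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
  }
Only the table of magic numbers and decoder argv's needs to grow as
new encodings appear.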

So, a C program that handles those wrinkles would be best for the most
general input cases.  I wrote such a 200-line C program decades ago.
Such programs have a few applications: I've used mine as a universal
zgrep-style wrapper, for 'untar' wrappers, as a LESSOPEN filter, etc.
Given the broader applicability, I realize GNU grep may not be the best
home for such a program, but I am happy to donate it wherever.

--

In terms of the patch itself, there's not much to it beyond the idea,
but there were a few debatable design considerations along the way.

First, the big one: a subprocess preprocessor/decoder is expensive
compared to library calls (though not always by as much as you might
think).  There are at least half a dozen compressors in common use
these days and surely dozens in less common use.  Even if a library
existed that (successfully) abstracted decoding of all the various
desired encodings, using it and whatever compression libraries it
depended upon would be a complex configure/build-time dependency for
grep.  Some compressors may well be command-line only. [ Admittedly,
some may be library-only, though those are rarer in my experience. ]
Decryption programs can usually be used much like decompression
programs, though similar program-only/library-only caveats apply.  An
omnibus decryption-decompression library seems really unlikely, but a
couple of wrapper commands - that could happen.

Current wrapper-script approaches to this functionality also spawn a
new process per file, in every script I've seen. [ Doing otherwise
would require decompressors that process >1 file to frame their outputs
so grep could put proper "file:" labels on matches. ]  So, the
efficiency of subprograms is no worse than current practice for
compressed files.  This patch's approach can even be faster than
current practice, since it needs no script interpretation.  The feature
also does not prevent grep from someday further optimizing common cases
by recognizing a subset of very common encodings and using library code
for those.  Subprograms, like wrapper scripts or decoders, are just a
more loosely coupled way to cover a lot of functionality.

Ok.  So, assuming subprograms are acceptable, a few questions arise.
First, how are subprograms launched?  I introduced a new environment
variable, $GREP_SHELL.  $SHELL often refers to the user's interactive
shell, not a launcher program, and glibc simply hard codes "/bin/sh"
(e.g. in system() and popen()).  If GREP_SHELL is not set then /bin/sh
is used.  Not hard coding /bin/sh or overloading regular SHELL lets
performance-ambitious users set GREP_SHELL=smart-dispatcher, where that
program ignores the -c COMMAND and execvp()s the appropriate decoder
based on GREP_INPUT or magic numbers.  If the dispatcher and
gzip/bzip2/etc. are all statically linked and environ[] is small,
overhead really can be <100 microseconds per input file.  That
performance can be fragile to a whole cascade of added costs.  Heck,
zcat and gunzip are scripts these days.  GREP_SHELL=gzip grep -p-d
works with this patch, though.  (A happy coincidence that "gzip -c -d"
is accepted here, with that "-c" having a very distinct meaning from
sh's "-c".)
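
For concreteness, here is a rough sketch of what the launch side could
look like.  This is only an illustration of the mechanism described
above, not the patch's actual code, and the helper name spawn_decoder
is invented:
  /* Hypothetical sketch: run `$GREP_SHELL -c COMMAND` with the input
     file on the child's stdin and the file's name exported as
     GREP_INPUT; grep then reads decoded bytes from *out_fd. */
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/types.h>

  static pid_t spawn_decoder(const char *command, const char *filename,
                             int file_fd, int *out_fd)
  {
      const char *shell = getenv("GREP_SHELL");
      if (!shell || !*shell)
          shell = "/bin/sh";                   /* default launcher */

      int p[2];
      if (pipe(p) != 0)
          return -1;

      pid_t pid = fork();
      if (pid < 0) {
          close(p[0]);
          close(p[1]);
          return -1;
      }
      if (pid == 0) {                          /* child */
          setenv("GREP_INPUT", filename, 1);   /* let dispatchers see the name */
          dup2(file_fd, 0);                    /* input file   -> child stdin  */
          dup2(p[1], 1);                       /* child stdout -> pipe to grep */
          close(p[0]);
          close(p[1]);
          execlp(shell, shell, "-c", command, (char *) NULL);
          _exit(127);                          /* exec failed */
      }
      close(p[1]);
      *out_fd = p[0];                          /* decoded data arrives here */
      return pid;
  }
With GREP_SHELL unset this runs /bin/sh -c COMMAND; with GREP_SHELL=gzip
and COMMAND="-d" it runs exactly the gzip -c -d case above.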

Besides launching, we also have to deal with subprogram termination.
grep needs to kill the subprogram whenever it is not going to read any
more from it, to avoid hanging in wait() forever.  That is actually a
normal termination in this scenario.  We could instead drain all the
output, but decoding a bunch of data that will never be used seems
silly.
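
A sketch of that termination handling, again illustrative rather than
the patch's actual code (the helper name finish_decoder is invented):
  /* Hypothetical sketch: called once grep will read no more from the
     decoder.  Signalling an already-exited child is harmless, and
     waitpid() still reports its real status. */
  #include <signal.h>
  #include <unistd.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  static int finish_decoder(pid_t pid, int out_fd)
  {
      int status;
      close(out_fd);              /* no more reads from the decoder */
      kill(pid, SIGTERM);         /* stop it rather than drain its output */
      waitpid(pid, &status, 0);
      /* Death by our SIGTERM, or by SIGPIPE from the closed pipe, is the
         normal-termination case described above; anything else is the
         abnormal case discussed next. */
      if (WIFSIGNALED(status) &&
          (WTERMSIG(status) == SIGTERM || WTERMSIG(status) == SIGPIPE))
          return 0;
      return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
  }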

When the subprogram ends abnormally (the spawned process exits != 0),
grep can either keep going or die.  Given that grep continues to open
input files in my patch, failures seem more likely to be ongoing and
related to the COMMAND than transient and related to one input file.
It seems unfriendly to keep producing the same error for perhaps
thousands of files (errors emitted beyond grep's control by the
shell/decoder programs).  So, this patch just quits immediately at the
first such error.  It may be better to decide failure behavior based on
recursiveness, an error counter, another option, etc.  This choice
seems particularly debatable.

The patch could definitely use some stress testing and other eyeballs.
I've done only basic tests, but it all seems to work fine.  I'm sure
there are at least several details to fix, but I thought I should first
see whether this approach is even a possibility for GNU grep.

Attachment: preprocV2.patch
Description: Text Data

