From: Charles Blake
Subject: [Grep-devel] [PATCH] Add GREP_OPEN preprocessor feature, documentation, and a test program.
Date: Wed, 4 Apr 2018 18:14:50 -0400

This change adds a feature that allows easily searching through files
that are preprocessed on the fly into more searchable text.  There is
no impact on users who do not set the GREP_OPEN environment variable.
The grep.texi portion of this patch describes usage more thoroughly.
A sketch of this change follows.

Declare one new global and, at start up, tokenize 'GREP_OPEN' into it
with 'strtok'.  Just before calling 'grep (desc,..)', if we have
GREP_OPEN tokens, create a pipe to a new preprocessor for each input
and replace 'desc' with the read end of that pipe.

Post-'grep ()', once done with the input, if we have a kid then we
need to be sure to shut down that process.  Things like "grep -l"
mean this shut down could (and should) happen long before the whole
input is consumed.  We use SIGKILL to ensure grep does not hang (or
is at least no more likely to hang on uninterruptible sleeping kids
than if grep were doing the IO itself).

Finally, large sets of input files may contain a few culprits causing
some errors, but need not stop the whole grep.  So, we guard messages
related to unusual child termination with 'if (!suppress_errors)' to
track the way that is handled elsewhere for missing/unreadable files.

The overhead of the new feature is likely to be smaller than the cost
of all but the most trivial preprocessors.  Pipe bandwidth usually
dwarfs even grep -F search performance, never mind the preprocessor.
The per-file overhead can be substantial for many small files, since
each starts a new program (minimized with vfork).  In my timings,
'env -i PATH=$PATH GREP_OPEN=statically-linked-cat grep' ran only
70 us per file slower (0.225 s/3200 files) than "grep" on the same
pattern and data.  Per-byte overhead was below statistical variation
(5%).
---
 doc/grep.in.1     | 45 +++++++++++++++++++++++++++++
 doc/grep.texi     | 56 ++++++++++++++++++++++++++++++++++++
 src/grep.c        | 72 +++++++++++++++++++++++++++++++++++++++++++++++
 tests/Makefile.am |  1 +
 tests/grep-open   | 30 ++++++++++++++++++++
 5 files changed, 204 insertions(+)
 create mode 100755 tests/grep-open

diff --git a/doc/grep.in.1 b/doc/grep.in.1
index ecc8105..1c6d405 100644
--- a/doc/grep.in.1
+++ b/doc/grep.in.1
@@ -865,6 +865,51 @@ and
 warns if it is used.
 Please use an alias or script instead.
 .TP
+.B GREP_OPEN
+This variable specifies a program to preprocess all file inputs.  If set,
+.B grep
+filters each file input (not standard input) through a new instance of
+that program, searching its output rather than the direct contents of
+files.  E.g., to make
+.B grep
+function like
+.B zgrep\fR,
+simply export
+.B GREP_OPEN="gzip -d -c"\fR.
+.IP
+There is usually no need to re-open the file in the program
+indicated by \fBGREP_OPEN\fR.
+.B grep
+opens each input file and attaches the standard input of the
+.B GREP_OPEN
+program to the open file.
+.IP
+However, when
+.B GREP_OPEN
+is being used, the
+.B GREP_INPUT
+variable is set to each input file name.
+This may be useful to dispatch to decoders based on file name or
+when a decoding program cannot act as a standard input-output filter.
+.IP
+E.g., to have \fBgrep\fR \fIPATTERN\fR search through a mixture of
+gzipped files, the extractable text in PDF files, and text files,
+you could \fBGREP_OPEN=gunzip_or_pdf.sh\fR, where that script might
+be:
+.nf
+  #!/bin/sh
+  case "$GREP_INPUT" in
+    *.gz) gzip -dc ;;                   # input name not needed..
+    *.pdf) pdftotext "$GREP_INPUT" - ;; #..but is possible
+    *) cat ;; # unencoded
+  esac
+.fi
+.IP
+.B GREP_OPEN
+can be as many space-tab-delimited words as needed.
+There is no rule to quote said delimiters.
+So, complex invocations may require a wrapper script.
+.TP
 .B GREP_COLOR
 This variable specifies the color used to highlight matched (non-empty) text.
 It is deprecated in favor of
diff --git a/doc/grep.texi b/doc/grep.texi
index 922d96e..90765b8 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -875,6 +875,62 @@ export PATH=/usr/bin
 exec grep --color=auto --devices=skip "$@@"
 @end example

+@item GREP_OPEN
+@vindex GREP_OPEN @r{environment variable}
+@cindex preprocessor environment variable
+This variable specifies a program to preprocess all file inputs.
+
+If set, @command{grep} filters each file input (not standard input)
+through a new instance of that program, searching the output of the
+program rather than the direct contents of files.  For example,
+to make @command{grep} function like @command{zgrep}, simply do:
+
+@example
+export GREP_OPEN="gzip -d -c"
+@end example
+
+As in that example, there is usually no need to re-open the file in
+the program indicated by @env{GREP_OPEN}.  @command{grep} opens each
+input file and attaches the standard input of the @env{GREP_OPEN}
+program to the open file.
+
+However, when @env{GREP_OPEN} is being used, the @env{GREP_INPUT}
+variable is set to each input file name.  This may be useful to
+dispatch to preprocessors based on file name or when a decoding
+program cannot function as a standard input-output filter.
+
+For example, it is easy to make @command{grep} search through a
+mixture of gzipped files, the extractable text in PDF files,
+and text files with @samp{GREP_OPEN=decode_gunzip_or_pdf.sh}.
+That script could be:
+
+@example
+#!/bin/sh
+case "$GREP_INPUT" in
+  *.gz) gzip -dc ;;                   # input name not needed..
+  *.pdf) pdftotext "$GREP_INPUT" - ;; #..but is possible
+  *) cat ;; # unencoded
+esac
+@end example
+
+A more complete preprocessor handling @samp{xz} and other encodings
+could search through a large hierarchy of heterogeneous files with
+a simple @samp{grep -r}.
+
+Although @samp{gzip -dc "$GREP_INPUT"} would also have worked in the
+above example, it is somewhat poor form.  It unnecessarily creates a
+race condition.  If another process is renaming or deleting files in
+parallel with @command{grep} then it is possible for @command{grep}
+to succeed in opening, but for a re-open in @env{GREP_OPEN} to fail.
+Such use also creates another place where the variable might not
+be correctly quoted.  Sometimes there is no convenient alternative,
+though, as with @command{pdftotext} from the Poppler package.
+
+@env{GREP_OPEN} can be as many space-tab-delimited words as needed.
+There is no rule to quote said delimiters, but such complex invocations
+should be very rare.  A wrapper script that can be run more simply is
+also always possible.
+
 @item GREP_COLOR
 @vindex GREP_COLOR @r{environment variable}
 @cindex highlight markers
diff --git a/src/grep.c b/src/grep.c
index bc47243..acbd37a 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -21,6 +21,8 @@
 #include <config.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/wait.h>
+#include <string.h>
 #include <wchar.h>
 #include <fcntl.h>
 #include <inttypes.h>
@@ -535,6 +537,8 @@ static enum
     SKIP_DEVICES
   } devices = READ_COMMAND_LINE_DEVICES;

+static char **preproc;    /* if non-NULL pipe each file input to this command */
+
 static bool grepfile (int, char const *, bool, bool);
 static bool grepdesc (int, bool);

@@ -1761,6 +1765,7 @@ grepdesc (int desc, bool command_line)
   bool status = true;
   bool ineof = false;
   struct stat st;
+  pid_t pid; /* child preprocessor PID to wait for. */

   /* Get the file status, possibly for the second time.  This catches
      a race condition if the directory entry changes after the
@@ -1848,6 +1853,33 @@ grepdesc (int desc, bool command_line)
       goto closeout;
     }

+  if (preproc)              /* desc <- preprocessing output */
+    {
+      int fds[2];
+      if (pipe (fds) == -1)
+        die (EXIT_TROUBLE, 0, _("cannot create a pipe"));
+      switch (pid = vfork ()) /* vfork much faster but MUST exec|_exit */
+        {
+        case 0:             /* child */
+          dup2 (desc, 0);   /* stdin = desc on file itself */
+          dup2 (fds[1], 1); /* stdout = write end of pipe [1] */
+          close (fds[1]);   /* do not need extra handle on pipe */
+          close (fds[0]);   /* close read end of pipe [0] */
+          if (filename)
+            setenv ("GREP_INPUT", filename, 1);
+          execvp (preproc[0], preproc); /* become a preprocessor */
+          fprintf (stderr, "%s: %s\n", preproc[0], strerror (errno));
+          _exit (EXIT_TROUBLE);
+        default:            /* parent */
+          close (fds[1]);   /* close write end of pipe[1] */
+          close (desc);     /* close desc on file itself */
+          desc = fds[0];    /* replace desc value with read end of pipe [0] */
+          break;
+        case -1:            /* vfork failed => resource exhaustion => die */
+          die (EXIT_TROUBLE, 0, _("cannot vfork"));
+        }
+    }
+
   count = grep (desc, &st, &ineof);
   if (count_matches)
     {
@@ -1876,6 +1908,30 @@ grepdesc (int desc, bool command_line)
         fflush_errno ();
     }

+  if (preproc)
+    {
+      /* Done with this input by this point.  So, do a non-blocking wait and
+         then a forceful kill & wait loop and report any unusual termination. */
+      int wstatus = 0;
+      if (waitpid (pid, &wstatus, WNOHANG) != pid)
+        {
+          kill (pid, SIGKILL);
+          while (waitpid (pid, &wstatus, 0) == -1)
+            if (errno != EINTR)
+              break;
+        }
+      if (!suppress_errors)
+        {
+          if (WIFEXITED (wstatus) && WEXITSTATUS (wstatus) != EXIT_SUCCESS)
+            error (0, 0, _("warning: GREP_OPEN(\"%s\") failed for: %s"),
+                   getenv ("GREP_OPEN"), filename ? filename : "(stdin)");
+          else if (WIFSIGNALED (wstatus) && WTERMSIG (wstatus) != SIGKILL)
+            error (0, 0, _("warning: GREP_OPEN(\"%s\"): %s: signalled: %s"),
+                   getenv ("GREP_OPEN"), filename ? filename : "(stdin)",
+                   strsignal (WTERMSIG (wstatus)));
+        }
+    }
+
  closeout:
   if (desc != STDIN_FILENO && close (desc) != 0)
     suppressible_error (errno);
@@ -2814,6 +2870,22 @@ main (int argc, char **argv)
         }
     }

+  /* Split GREP_OPEN if available into argument vector for every execvp */
+  char *grep_open = getenv ("GREP_OPEN");
+  if (grep_open)
+    {
+      const char *delim = " \t";  /* could maybe also use GREP_OPEN_DELIM */
+      int n = 0, nA = 4;
+      preproc = (char **) malloc (nA * sizeof preproc[0]);
+      preproc[n++] = strtok (grep_open, delim);
+      while ((preproc[n++] = strtok (NULL, delim)))
+        if (n + 1 > nA)
+          {
+            nA *= 2;
+            preproc = (char **) realloc (preproc, nA * sizeof preproc[0]);
+          }
+    }
+
   /* POSIX says -c, -l and -q are mutually exclusive.  In this
      implementation, -q overrides -l and -L, which in turn override -c.  */
   if (exit_on_match | dev_null_output)
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 0ebb51b..b0b9bd2 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -106,6 +106,7 @@ TESTS = \
   grep-dev-null \
   grep-dev-null-out \
   grep-dir \
+  grep-open \
   help-version \
   high-bit-range \
   in-eq-out-infloop \
diff --git a/tests/grep-open b/tests/grep-open
new file mode 100755
index 0000000..cb062cf
--- /dev/null
+++ b/tests/grep-open
@@ -0,0 +1,30 @@
+#! /bin/sh
+# Test GREP_OPEN functionality; gzip & cat must be installed
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+
+fail=0
+
+cat > grep-open.sh <<- EOF
+#!/bin/sh
+case "\$GREP_INPUT" in
+  *.gz) exec gzip -dc ;;
+  *) exec cat ;;
+esac
+EOF
+chmod u+x grep-open.sh || framework_failure_
+
+echo abababababababab > 1 || framework_failure_
+echo abababababababab | gzip > 2.gz || framework_failure_
+
+cat > grep-open.expected <<- EOF
+1
+2.gz
+EOF
+
+GREP_OPEN=`pwd`/grep-open.sh grep -l ababab 1 2.gz > grep-open.out 2>&1
+# Could also export GREP_OPEN globally above
+test $? -eq 0 || fail=1
+
+compare grep-open.out grep-open.expected || fail=1
+
+Exit $fail
--
2.17.0

