[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: grep --color enhancement

From: Charles Levert
Subject: Re: grep --color enhancement
Date: Mon, 12 Dec 2005 08:59:56 -0500
User-agent: Mutt/1.4.1i

[ Here is a message sent via private email,
  for the benefit of everyone who subscribes
  to the bug-grep mailing list.  ]

* On Monday 2005-12-12 at 10:12:01 +0000, Pádraig Brady wrote:
> Hi Charles,

Thanks for your interest in GNU grep.

> I notice you've been working on the grep --color code lately.
> I've a small patch (against cvs) attached, which minimises
> the number of "match colour start" and "match colour end"
> sequences outputted.
> For e.g. the following will output 8 colour sequences:
> echo "test" | grep --color=always '.'
> With my patch there will only be 2.

This does output less octets in the end.

At issue is whether anyone currently relies on
every individual non-empty match being "marked
up" by an SGR_START/SGR_END pair.

We already elected a while back to not "mark
up" empty matches, which are a possibility with
patterns such as 'a*'.

> Where this really is beneficial is for multi byte chars.
> For e.g the following e.g. which highlights non ascii
> characters now works correctly on my UTF8 terminal:
> echo "utf∞" | LANG=C grep --color=always -E '[^ -~]'

Yes, but note that this is a somewhat perverse
combination of locales (an UTF-8 one for echo
and C for grep).

The octet sequence produced by echo is

   0x75, 0x74, 0x66, 0xE2, 0x88, 0x9E, 0x0A

with this interpretation as a sequence of Unicode
code points:

   U+0075, U+0074, U+0066, U+221E, U+000A

However, because of the C locale, grep see each
individual octet as a character to be matched.
It isn't even obvious what programs running under
the C/POSIX locale should do when they encounter
octets with the eight bit set.  From POSIX:

   The tables in Locale Definition describe
   the characteristics and behavior of the
   POSIX locale for data consisting entirely of
   characters from the portable character set
   and the control character set.  For other
   characters, the behavior is unspecified.

The terminal, assuming it is itself under an
UTF-8 locale, cannot recognize the now split
UTF-8 sequence in

   ..., 0xE2, SGR_END, SGR_START, 0x88, SGR_END, SGR_START, 0x9E, ...

This was to be expected from the terminal.

> thanks,
> Pádraig.
> p.s. I notice the end colour sequence is now ^[m rather than ^[0m
> Is that correct?

Yes (^[[m with no 0), in addition to being
followed by ^[[K unless the ne boolean
GREP_COLORS capability is specified.

For now, the default SGR substring remains
'01;31' for matched text, although it as well
might be changed to just '1;31'.

[ The originally attached patch follows.  ]

--- grep.c.orig 2005-12-09 17:37:35.000000000 +0000
+++ grep.c      2005-12-12 08:45:33.000000000 +0000
@@ -788,6 +788,19 @@
   const char *mid = NULL;
   char *buf;           /* XXX */
   const char *ibeg;    /* XXX */
+  int match_color_end_pending = 0;
+#define MATCH_COLOR_START() \
+  do { if (!match_color_end_pending) { \
+    PR_SGR_START_IF(match_color); \
+    match_color_end_pending = 1; }\
+  } while (0)
+#define MATCH_COLOR_END() \
+  do { if (match_color_end_pending) { \
+    PR_SGR_END_IF(match_color); \
+    match_color_end_pending = 0; }\
+  } while (0)
   if (match_icase)     /* XXX - None of the -i stuff should be here.  */
@@ -833,23 +846,29 @@
                                               : SEP_CHAR_SELECTED);
-             PR_SGR_START(line_color);
+             if (b - cur) {
+               MATCH_COLOR_END();
+               PR_SGR_START(line_color);
+               fwrite (cur, sizeof (char), b - cur, stdout);
+              }
              if (mid)
                  cur = mid;
                  mid = NULL;
-             fwrite (cur, sizeof (char), b - cur, stdout);
-         PR_SGR_START_IF(match_color);
+         MATCH_COLOR_START();
          fwrite (b, sizeof (char), match_size, stdout);
-         PR_SGR_END_IF(match_color);
-         if (only_matching)
+         if (only_matching) {
+           MATCH_COLOR_END();
            fputs("\n", stdout);
+         }
       cur = b + match_size;
   if (buf)
     free(buf); /* XXX */

The "else" code path at the top is broken by
this change, but this would be easy to fix.

The "if (only_matching)" code path at the bottom
doesn't need the test within MATCH_COLOR_END(),
but it doesn't break anything.

reply via email to

[Prev in Thread] Current Thread [Next in Thread]