bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: gawk regex stuff you may want


From: Paul Eggert
Subject: Re: gawk regex stuff you may want
Date: Thu, 21 Jan 2016 09:08:37 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.0

On 01/20/2016 11:59 AM, Aharon Robbins wrote:

I don't see that. In regex.h:

| struct re_pattern_buffer
| {
|   /* Space that holds the compiled pattern.  It is declared as
|      `unsigned char *' because its elements are sometimes used as
|      array indexes.  */
|   unsigned char *__REPB_PREFIX(buffer);

Ah, that's probably the problem, and is why your version needs extra casts between unsigned char * and re_dfa_t *. In gnulib, that last line reads 'struct re_dfa_t *__REPB_PREFIX(buffer);', which is what the buffer really is, and avoids the need for casts later.

Since malloc (0) is well-defined to return either NULL or a valid pointer
to a zero-byte object that can be freed, the code is working as-is.
Not on older systems. Although yes, I'm not sure I still support such
systems.  I don't know which checker complained. I think it was valgrind
at some point in the past.

Those older systems are long dead, as this requirement was standardized in C89 and even the oldest systems we formerly supported (SunOS 4, if I recall) conformed to the standard long ago. valgrind 3.11 does not complain about simple uses like free (malloc (0)) on my platform (Fedora 23 x86-64). Perhaps older versions of valgrind complain on some platforms, but I'd rather not worry about that; it's really a valgrind bug.


I'd rather look for a solution that involves silencing the checking
without making the code bulkier and typically slower.
This is in compilation, not execution; not sure the difference
is really noticable.

There is some runtime slowdown and code bloat. You're right, it's not significant, but it's the principle of the thing: I'd rather not bloat the code just to pacify a busted checker.

If a byte value given inside [...] is greater than 127, it should
be left alone to be matched as-is (no arbitrary limits).

Hah! We recently ran into a similar problem with 'grep'. Previously, part of the grep code assumed that unibyte locales cannot have encoding errors, so in a unibyte locale one can just use the traditional byte-oriented algorithm without worrying about mbrlen and the like. Other parts of the grep code checked for encoding errors in the usual way, e.g., with mbrlen. Unfortunately, the assumption about unibyte locales is incorrect, and the inconsistencies between the different parts of 'grep' caused confusing and arguably incorrect behavior which took a bit of hackery to resolve; see the thread starting at <http://bugs.gnu.org/20526#86>.

As far as 'grep' is concerned, it'll trust what regcomp does here, so we do have some freedom to change the code in this area. However, it looks to me like your patch would do the wrong thing for unibyte locales where btowc (b) returns a value that neither b nor WEOF. Also, the rest the code assumes that if btowc returns WEOF in a multibyte locale then there won't be a match (see the setup code in init_dfa, and I have the nagging feeling that this assumption is embedded elsewhere). So, how about the attached more-conservative patch instead?


        /* Build single byte matching table for this equivalence class.  */
+      char_buf[1] = (unsigned char) '\0';
This should be unnecessary, as the rest of the code shouldn't be looking
at that byte. Is this something to pacify a lint checker?
Or valgrind at some point. I think a static checker that someone submitted
a report from.

It might be helpful to get at the bottom of this. If it's some buggy static checker then I would rather not worry about it.


-       wc = wc2;
+       wc = (wint_t) wc2;
wc and wc2 are both integers so the cast is unnecessary. Similarly elsewhere.
There was a compiler (maybe VMS) for which this was necessary.

Again, it'd be helpful to know what the problem actually was. I expect it was just a warning, which is fine to ignore. It's valid standard C code, for what it's worth.

Attachment: 0001-regex-treat-x-as-x-if-x-is-a-unibyte-encoding-error.patch
Description: Source code patch


reply via email to

[Prev in Thread] Current Thread [Next in Thread]