emacs-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Questionable code in handling of wordend in the regexp engine in reg


From: Eli Zaretskii
Subject: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
Date: Fri, 01 Mar 2019 15:46:14 +0200

> Date: Fri, 1 Mar 2019 11:10:18 +0000
> From: Alan Mackenzie <address@hidden>
> Cc: address@hidden
> 
> SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in
> marker.c).  This latter function doesn't handle well the case of `d'
> being in the middle of a multibyte character; sometimes it "rounds it
> down", other times it "rounds it up" to a character position.  I think
> it should be defined as rounding it down.  It would be a relatively
> simple correction (at least, technically ;-).

buf_bytepos_to_charpos is not supposed to be called when the byte
position is in the middle of a multibyte sequence.  We have the
CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that.

> For that matter, how many charpos <-> bytepos functions are there in
> Emacs?

Only one pair of such function exists for buffer text, and another
pair for strings.

> >             ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
> >             ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> >             UPDATE_SYNTAX_TABLE (charpos);
> 
> > which seems even more broken because `d` might point to the first byte
> > after the gap, so `d - 1` will point in the middle of the gap, so it's
> > simply an invalid argument to PTR_TO_OFFSET.
> 
> I don't think this is right.  Both `d' and `offset' are byte
> measurements, not character measurements, so it shouldn't matter whether
> the "- 1" is inside or outside the parens.  However, it would be less
> confusing if they were both (?all) the same.

That's orthogonal.  Stefan is right in that you cannot in general do
pointer arithmetics on pointers into buffer text without considering
the gap.  You need to convert 'd' into a byte position (which is
actually an offset from the beginning of buffer text), then decrement
it, then convert back into a 'char *' pointer to the previous byte.
The macros used for these conversions take care of skipping the gap.

However, since the caller already took care to split the text into two
parts, one before the gap and the other after the gap, it sounds like
we don't need to bother about the gap in this case, unless 'd - 1'
happens to point before the beginning of string2 argument to
re_match_2_internal.

> > According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET,
> > the result may be the same as if we did the decrement after the fact,
> > but it still looks fishy.  WDYT?
> 
> I think it is suboptimal to have both PTR_TO_OFFSET and
> POINTER_TO_OFFSET meaning different things in the same source file.  ;-)

Those macros hide the fact that the argument could be a Lisp string or
a buffer, so I don't think I agree with you here.

> > diff --git a/src/regex-emacs.c b/src/regex-emacs.c
> > index b667a43a37..b21cba0e46 100644
> > --- a/src/regex-emacs.c
> > +++ b/src/regex-emacs.c
> > @@ -4811,9 +4811,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, 
> > re_char *string1,
> >           int c1, c2;
> >           int s1, s2;
> >           int dummy;
> > -         ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
> > +         ptrdiff_t offset = PTR_TO_OFFSET (d);
> >           ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> > -         UPDATE_SYNTAX_TABLE (charpos);
> > +         UPDATE_SYNTAX_TABLE (charpos - 1);
> >           GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
> >           s1 = SYNTAX (c1);
>  
> There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in
> regex-emacs.c.  I think I will check them all, amending them as in your
> patch.
> 
> What do you say?

I'm not Stefan, but what I say is that we should only make sure 'd'
never points to the very first byte of 'string2'.  If it does, then
decrementing it will produce invalid results.  If we cannot decide
whether that situation could happen, we should add an assertion there
to that effect.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]