bug-gnulib

Re: regex-quote.c syntax support


From: Bruno Haible
Subject: Re: regex-quote.c syntax support
Date: Sat, 5 Mar 2011 15:51:33 +0100
User-agent: KMail/1.9.9

Hello Reuben,

> regex-quote seems only to support two syntaxes at the moment

Yes. POSIX specifies two syntaxes.

> and it does it in a rather odd way: by a single boolean flag.

Rather it's an 'int' with the same meaning as the cflags argument that you
pass to regcomp().
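
To illustrate what such a cflags value selects, here is a minimal sketch using only the POSIX calls regcomp(), regexec() and regfree(). The helper name `matches` is hypothetical and not part of regex-quote; it only shows that the same pattern is read differently depending on whether REG_EXTENDED is set.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Returns 1 if PATTERN, compiled with CFLAGS, matches somewhere in TEXT,
   0 if it does not, and -1 if the pattern fails to compile.  */
static int
matches (const char *pattern, int cflags, const char *text)
{
  regex_t re;
  int result;

  assert (pattern != NULL && text != NULL);

  /* cflags == 0 selects the basic syntax, REG_EXTENDED the extended one.  */
  if (regcomp (&re, pattern, cflags | REG_NOSUB) != 0)
    return -1;
  result = (regexec (&re, text, 0, NULL, 0) == 0);
  regfree (&re);
  return result;
}
```

With the basic syntax, `matches ("a+", 0, "aa")` should yield 0, because '+' is an ordinary character there, whereas `matches ("a+", REG_EXTENDED, "aa")` should yield 1. This is why a quoting function needs to know the syntax: in the extended syntax the '+' has to be escaped, in the basic syntax it does not.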

> I wonder if there's scope to change this.
> 
> The obvious representation of syntaxes from a GNU point of view would
> be a bitmask of type reg_syntax_t as passed to re_set_syntax. This
> would however add a dependency on the regex module.

Here again we run into the ambivalence between the POSIX regex API and the GNU
regex API. The GNU regex API offers more flexibility but relies on a global
variable and is therefore not multithread-safe. If only the APIs could be
better designed...

> An alternative would be to allow the caller to pass a string argument
> containing the characters to be escaped, (plus an escape character?).
> This would make the routine more flexible.

True, but on the other hand if the caller is supposed to determine the
characters to be escaped ad-hoc, the risk of mistake is pretty high.
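
For concreteness, a string-based variant might look roughly like the sketch below. The name `quote_chars` and its signature are assumptions made for illustration; nothing like it exists in regex-quote.h. It works byte by byte, so it is only adequate when the characters to be escaped are unibyte and cannot occur inside a multibyte character.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical function, for illustration only.
   Returns a freshly allocated copy of S in which every byte that occurs
   in SPECIALS is preceded by ESC.  Returns NULL if allocation fails.
   The caller must free the result.  */
static char *
quote_chars (const char *s, const char *specials, char esc)
{
  char *result;
  char *p;

  assert (s != NULL && specials != NULL);

  /* In the worst case every byte gets an escape character.  */
  result = malloc (2 * strlen (s) + 1);
  if (result == NULL)
    return NULL;

  p = result;
  for (; *s != '\0'; s++)
    {
      if (strchr (specials, *s) != NULL)
        *p++ = esc;
      *p++ = *s;
    }
  *p = '\0';
  return result;
}
```

A caller would write something like `quote_chars ("a.b*c", "\\.*[]^$", '\\')` and get back `a\.b\*c`. The risk mentioned above is visible here: if the caller forgets a character in the second argument, say the backslash itself, the result is silently wrong for some inputs.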

> In the latter case one could also add a routine that translates a
> reg_syntax_t into a string.

That would make sense, to mitigate said risk of mistake.

> My particular desire is to be able to escape Emacs-syntax regexs, but
> I can imagine other users wishing to escape Awk, grep, or other
> syntaxes, in GNU or POSIX flavours. GNU's regex.h reveals these all to
> be distinct.
>
> By passing a string one could also escape PCRE syntax and others not
> supported by regex.h, e.g. Lua regexs.

If we are going to support Emacs regex syntax in this module, then it
makes sense to go all the way and support all GNU syntaxes, like you
suggest.

On the other hand, 'grep' supports basic, extended, and PCRE syntaxes,
but not the Emacs syntax.

I personally am not a friend of PCRE, because PCRE was the most frequent
cause of crashes in Safari on MacOS X 10.4.

Also, there is the wish that Eric Blake expressed: a flag for the anchor
<http://lists.gnu.org/archive/html/bug-gnulib/2010-09/msg00362.html>.

Various solutions come to mind:
  - A separate module that depends on 'regex' and that contains a function
    that takes a reg_syntax_t argument.
  - A variant in regex-quote.h that takes a string of characters to be
    escaped as argument.
  - A variant in regex-quote.h that takes an enum value designating the syntax
    as argument.
  - A more general quoting facility to which you pass a set of unibyte
    characters to be quoted.
  - An even more general text substitution facility to which you pass a set
    of fixed strings to be replaced by another set of fixed strings.

Before we can decide on this, IMO some analysis is needed:

  - What are the possible effects of reg_syntax_t on the string of
    characters to be escaped? I can see
      RE_BK_PLUS_QM                   ->    +?
      RE_INTERVALS, RE_NO_BK_BRACES   ->    {}
    What other relations are there?

  - What characters need to be escaped in Emacs syntax?

  - What characters need to be escaped in PCRE syntax?

  - Do Emacs and PCRE view a regex as a sequence of bytes or as a sequence
    of multibyte characters in the locale encoding (given by LC_CTYPE)?
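
As a starting point for the first question, the two relations listed above could be written down as follows. This is only a sketch: the function name `syntax_dependent_specials` is hypothetical, it assumes the GNU <regex.h> (glibc or gnulib) where reg_syntax_t and the RE_* bits are visible in GNU mode, and it deliberately covers nothing beyond RE_BK_PLUS_QM, RE_INTERVALS and RE_NO_BK_BRACES. Other bits that may matter are not handled.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE 1  /* glibc declares reg_syntax_t and RE_* in GNU mode */
#endif
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Stores in BUF those of the characters "+?{}" that act as operators
   without a backslash under SYNTAX, and therefore need escaping.
   BUF must have room for at least 5 bytes.  */
static void
syntax_dependent_specials (reg_syntax_t syntax, char *buf)
{
  assert (buf != NULL);
  buf[0] = '\0';

  /* Without RE_BK_PLUS_QM, plain + and ? are operators.  */
  if (!(syntax & RE_BK_PLUS_QM))
    strcat (buf, "+?");

  /* Plain { and } are operators only if intervals are enabled
     and are written without backslashes.  */
  if ((syntax & RE_INTERVALS) && (syntax & RE_NO_BK_BRACES))
    strcat (buf, "{}");
}
```

Under these assumptions, RE_SYNTAX_POSIX_BASIC should give an empty string and RE_SYNTAX_POSIX_EXTENDED should give "+?{}". The result would then be appended to the characters that are special in every syntax before being handed to a string-based quoting function.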

Bruno
-- 
In memoriam Hasso von Boehmer <http://en.wikipedia.org/wiki/Hasso_von_Boehmer>


