Re: regex-quote.c syntax support
From: Reuben Thomas
Subject: Re: regex-quote.c syntax support
Date: Sat, 5 Mar 2011 16:31:02 +0000
On 5 March 2011 14:51, Bruno Haible <address@hidden> wrote:
> Hello Reuben,
>
>> regex-quote seems only to support two syntaxes at the moment
>
> Yes. POSIX specifies two syntaxes.
regex.h suggests that in practice there are a couple more:
RE_SYNTAX_POSIX_EGREP
RE_SYNTAX_POSIX_AWK
each of which is different from the other and from POSIX basic and extended.
> Rather it's an 'int' with the same meaning as the cflags argument that you
> pass to regcomp().
Any non-zero value counts as selecting extended syntax in
regex_quote*, whereas in regcomp only one bit does that. (I point this
out only as a potential source of ABI breakage.)
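To make the mismatch concrete, here is a minimal sketch of the two tests side by side. The helper names are hypothetical and exist only for illustration; the first mirrors the non-zero test described above, the second mirrors regcomp()'s use of the REG_EXTENDED bit.

```c
#include <assert.h>
#include <regex.h>

/* Hypothetical helper: the test described above for regex_quote*,
   where any non-zero cflags value selects extended syntax. */
static int quote_selects_extended(int cflags)
{
    return cflags != 0;
}

/* Hypothetical helper: regcomp() looks only at the REG_EXTENDED bit. */
static int regcomp_selects_extended(int cflags)
{
    return (cflags & REG_EXTENDED) != 0;
}

/* The two tests agree for 0 and for REG_EXTENDED, but disagree when a
   caller passes some other flag, such as REG_ICASE, on its own. */
static void show_mismatch(void)
{
    assert(quote_selects_extended(REG_ICASE));
    assert(!regcomp_selects_extended(REG_ICASE));
}
```

A caller who passes the same cflags to both functions, say REG_ICASE alone, would get a string quoted for extended syntax but compiled as a basic regex.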
> True, but on the other hand if the caller is supposed to determine the
> characters to be escaped ad-hoc, the risk of mistake is pretty high.
> On the other hand, 'grep' supports basic, extended, and PCRE syntaxes,
> but not the Emacs syntax.
Presumably it supports not RE_SYNTAX_POSIX_EXTENDED but rather
RE_SYNTAX_POSIX_EGREP? Or both?
> Before we can decide on this, IMO some analysis is needed:
>
> - What are the possible effects of reg_syntax_t on the string of
> characters to be escaped? I can see
> RE_BK_PLUS_QM -> +?
> RE_INTERVALS, RE_NO_BK_BRACES -> {}
> What other relations are there?
RE_NO_BK_PARENS -> ()
RE_NO_BK_VBAR -> |
RE_NO_BK_REFS -> digits (\1 to \9 are back-references unless this bit is set)
> - What characters need to be escaped in Emacs syntax?
Emacs syntax is simply the syntax with all the bits switched off, so:
$^.*[]\+?
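The bit-to-character relations listed so far can be sketched as a function from a reg_syntax_t value to the set of characters that act as operators when unescaped. This is only a rough sketch under several assumptions: the helper name is hypothetical and not part of regex-quote.c; RE_LIMITED_OPS is not mentioned in this thread but is included because regex.h documents it as disabling +, ? and |; and context-dependent bits (RE_CONTEXT_INDEP_ANCHORS, RE_NEWLINE_ALT and so on) are ignored.

```c
#define _GNU_SOURCE   /* glibc exposes the RE_* syntax bits only with this */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper, not part of regex-quote.c.  Fills OUT (which must
   hold at least 16 bytes) with the characters that are operators when
   they appear unescaped under SYNTAX. */
static void syntax_special_chars(reg_syntax_t syntax, char *out)
{
    assert(out != NULL);

    /* Characters treated as special regardless of the bits set
       (the list given above for the all-bits-off syntax, minus +?). */
    strcpy(out, "$^.*[]\\");

    if (!(syntax & RE_LIMITED_OPS)) {
        /* With RE_BK_PLUS_QM set, \+ and \? are the operators instead. */
        if (!(syntax & RE_BK_PLUS_QM))
            strcat(out, "+?");
        /* Without RE_NO_BK_VBAR, \| is the operator instead. */
        if (syntax & RE_NO_BK_VBAR)
            strcat(out, "|");
    }
    /* Bare braces are operators only if intervals are enabled at all
       and are written without backslashes. */
    if ((syntax & RE_INTERVALS) && (syntax & RE_NO_BK_BRACES))
        strcat(out, "{}");
    if (syntax & RE_NO_BK_PARENS)
        strcat(out, "()");
}
```

Under these assumptions, passing RE_SYNTAX_EMACS (no bits set) yields exactly the list above, while RE_SYNTAX_POSIX_EXTENDED additionally yields |, {} and ().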
> - What characters need to be escaped in PCRE syntax?
According to pcrepattern(3):
^$.[|()?*+{
(Which makes me wonder why we treat ] as special in regex-quote.c.)
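As an illustration, quoting for PCRE could look roughly like the following. The function name and buffer convention are hypothetical. The character set is the list above plus backslash, which pcrepattern(3) describes as the general escape character and which therefore also needs escaping in a literal string. Note that ] is deliberately left alone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Metacharacters outside character classes, per the list above, plus
   backslash (added here because it is the escape character itself). */
static const char pcre_special[] = "\\^$.[|()?*+{";

/* Hypothetical helper: copies S into OUT, putting a backslash before
   each metacharacter.  OUT must hold at least 2 * strlen(S) + 1 bytes.
   Returns the length of the quoted string. */
static size_t pcre_quote_copy(char *out, const char *s)
{
    char *p = out;

    assert(out != NULL && s != NULL);
    for (; *s != '\0'; s++) {
        if (strchr(pcre_special, *s) != NULL)
            *p++ = '\\';
        *p++ = *s;
    }
    *p = '\0';
    return (size_t)(p - out);
}
```

Because every metacharacter is ASCII, this byte-wise loop should also be safe on UTF-8 input, where bytes of multibyte sequences never collide with ASCII values.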
> - Do Emacs and PCRE view a regex as a sequence of bytes or as a sequence
> of multibyte characters in the locale encoding (given by LC_CTYPE)?
PCRE doesn't do locales; it treats strings as either bytes or, given a
specific flag, UTF-8.
I don't really understand the question about Emacs: someone using
regex-quote in their own programs is worried about Emacs syntax, not
Emacs encodings, because Emacs doesn't have a C API. My understanding
of Emacs is that it has its own universal internal encoding, which
differs from the encoding of a particular buffer being edited; the
latter can be bytes, 7-bit or 8-bit characters, or multibyte
characters, according to the file being edited and the user's selected
encoding.
HTH!
--
http://rrt.sc3d.org