Re: [PATCH] ed with perl-compatible regular expressions

From: Shawn Wagner
Subject: Re: [PATCH] ed with perl-compatible regular expressions
Date: Thu, 22 Jul 2021 03:34:39 -0500

As far as I can tell from poking around in the source, ed itself knows
nothing about UTF-8, or about character sets in general. Your system's
regcomp() and regexec() implementation presumably handles multibyte
encodings when an appropriate locale is set. PCRE2, on the other hand,
understands only one multi-code-unit encoding: the UTF-XX that matches
the library's code unit size (8-bit bytes, 16-bit words, or 32-bit
words, depending on which version of the library is being used). And it
has to be explicitly told to use it.

While I do think there's a lot to be said for being explicit about
character encodings, I suppose I could look at the relevant
locale-related environment variables and try to work out whether they
describe a UTF-8 charset... GNU grep does something like this for its
-P matcher, but it relies on non-standard functions from gnulib,
which ed doesn't depend on. Hmm. I might be able to pull out just the
minimum relevant bits of code. I'll have to look into that more.
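
A rough sketch of the environment-variable idea, using only POSIX
functions. This is neither what grep does nor what the patch does; both
helper names are made up for illustration, and treating only the
codeset spellings "UTF-8" and "utf8" as matches is an assumption.

```c
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* Hypothetical helper: name of the locale that governs LC_CTYPE, using
 * the POSIX precedence LC_ALL, then LC_CTYPE, then LANG. Unset or
 * empty variables are skipped; "C" is the fallback. */
static const char *ctype_locale_name(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };

    for (size_t i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *v = getenv(vars[i]);
        if (v != NULL && *v != '\0')
            return v;
    }
    return "C";
}

/* Hypothetical helper: nonzero if a locale name of the form
 * language[_territory][.codeset][@modifier] names a UTF-8 codeset. */
static int locale_name_is_utf8(const char *name)
{
    const char *dot = strchr(name, '.');
    if (dot == NULL)
        return 0;                         /* no codeset part at all */

    const char *codeset = dot + 1;
    size_t len = strcspn(codeset, "@");   /* stop before any modifier */

    return (len == 5 && strncasecmp(codeset, "UTF-8", 5) == 0)
        || (len == 4 && strncasecmp(codeset, "utf8", 4) == 0);
}
```

Calling locale_name_is_utf8(ctype_locale_name()) at startup would then
decide whether to pass the UTF option to PCRE2. A locale name with no
explicit codeset is reported as non-UTF-8 here even if the system
default for it happens to be UTF-8; setlocale(LC_CTYPE, "") followed by
nl_langinfo(CODESET) is the POSIX way to ask the C library instead of
parsing the name.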

On Thu, Jul 22, 2021 at 2:58 AM Martin Guy <martinwguy@gmail.com> wrote:
> On 22/07/2021, Shawn Wagner <shawnw.mobile@gmail.com> wrote:
> > The attached patch (Made against 1.18-pre2) adds a -P option to use
> > PCRE2 regular expressions. Passing --disable-pcre2 to the configure
> > script will leave this feature out. There's also a --utf8 option that
> > turns on PCRE2's advanced Unicode matching.
> Hmm. Interesting.
> Currently ed already seems UTF8-aware. For example:
> a
> àèìòù
> .
> s/[à]/a/
> replaces the two-character sequence with a single 'a' if $LANG  ends
> in .UTF-8, while if LANG=C or is unset, it only replaces the first
> byte of the pair. Maybe you could find where it detects this and use
> the same logic instead of --utf8
> Cheers
>     M
