diffutils-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: mbcel module for Gnulib?, incomplete multibyte sequences


From: Bruno Haible
Subject: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date: Sat, 05 Aug 2023 01:05:52 +0200

Paul Eggert wrote in
<https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>:

> Here are the results:
> 
>         noop  mbiterf mbuiterf    mbcel   mbucel
>   a    1.546    2.038    1.931    1.639    1.968
>   b    1.545    2.016    1.936    1.705    1.969
>   c    2.153    8.160    8.255    6.160    5.550
>   d    2,153    6,692    6,686    6,322    5,520
>   e    2.234    6.866    6.817    6.214    5.781
>   f    1.843   95.276  104.270   58.667   60.902
>   g    1,844   76,126   83,899   65,541   66,983
>   h    3.297   75.727   83.735   64.261   65.686
>   i    1.941   31.064   34.887   26.262   27.072
>   j    1.310   29.620   33.698   24.751   25.550

To me, the columns timings of mbiterf and mbuiterf are good enough.
The rows c and f will improve when glibc's C locale bug is fixed.
For the rest, I see the timings as close enough to the optimum.

> But in another sense these benchmarks are biased against mbcel, as this 
> benchmark compiles in a special environment that defines NDEBUG, which 
> most Gnulib-using code doesn't do

Yes, it's so tempting to take out all assert()s from production code :)
But I won't do it nevertheless, because I fear platform bugs in the mb* code.

> For applications that need MEE, it'd be easy for mbcel to supply it. 
> Something like the following patch would do it, though it should be a 
> different function (e.g., mbcel_scanmee) instead of being a change to 
> mbcel_scan itself. However, diffutils doesn't need to bother with such a 
> change as it can use SEE like most other programs do.
> 
>   --- a/lib/mbcel.h
>   +++ b/lib/mbcel.h
>   @@ -191,3 +191,3 @@ mbcel_scan (char const *p, char const *lim)
>      if (_GL_UNLIKELY ((size_t) -1 / 2 < len))
>   -    return (mbcel_t) { .err = *p, .len = 1 };
>   +    return (mbcel_t) { .err = *p, .len = len == (size_t) -2 ? lim - p : 1 
> };

With a patch like this, i.e. with MEE, I would welcome mbcel and mbucel in 
Gnulib.

(Although maybe you may want to align the module name to be similar
to mbiterf and mbuiterf : maybe mbitervf and mbuitervf for "very fast"?)

> SEE is simpler and is common practice elsewhere, notably Emacs.

Emacs is a complex beast. I can understand if the Emacs developers want
an implementationally-simple behaviour, rather than a simple-from-the-
user-perspective behaviour. But other GNU programs can and should do
it better, since modules 'mb*iter*f' already exist that hide the
implementational complexity.

Paul Eggert wrote in
<https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00149.html>:

> It would be better to change mbu?iterf? to use 
> single-byte-per-encoding-error ("SEE") behavior

No. No way. This would be a regression.

> This is because mbu?iterf? doesn't implement MEE either.
>
> For MEE, mbiterf would need something like the attached untested patch, 
> and mbiter, mbcel, etc. would all need similar patches.

Good point. Actually, with iconv() based converters one needs such a
loop, because iconv's calling conventions don't allow to return detailed
information. (See iconv_carefully and iconv_carefully_1 in lib/striconveh.c.)

mbrtowc / mbrtoc32 is a bit different, but since upon return it does not
give detailed information about the structure of the invalid byte sequence,
it may indeed need such a loop.

For comparison, <unistr.h> u8_mbtouc returns the number of bytes, but OTOH
does not allow to distinguish U+FFFD vs. incomplete/invalid byte sequence.

I'll think more about this one...

Bruno






reply via email to

[Prev in Thread] Current Thread [Next in Thread]