bug-gnu-emacs
From: Eli Zaretskii
Subject: bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
Date: Mon, 06 Jun 2022 14:29:19 +0300

> Date: Sun, 5 Jun 2022 22:00:35 -0400
> Cc: 55777@debbugs.gnu.org
> From: Richard Hansen <rhansen@rhansen.org>
> 
> On 6/5/22 01:37, Eli Zaretskii wrote:
> > Could you please state what is confusing in the current wording?
> 
>    * "Raw 8-bit bytes" isn't really defined. It's mentioned earlier in
>      the chapter -- the term is even in a @dfn{} -- but there's no
>      definition there.

It is defined as best we could without confusing the readers:

     Occasionally, Emacs needs to hold and manipulate encoded text or
  binary non-text data in its buffers or strings.  For example, when Emacs
  visits a file, it first reads the file’s text verbatim into a buffer,
  and only then converts it to the internal representation.  Before the
  conversion, the buffer holds encoded text.

     Encoded text is not really text, as far as Emacs is concerned, but
  rather a sequence of raw 8-bit bytes.  We call buffers and strings that
  hold encoded text “unibyte” buffers and strings, because Emacs treats
  them as a sequence of individual bytes. [...]

(The @dfn part is markup used whenever new terminology is first used,
it doesn't imply "definition".)

You are welcome to propose a better explanation, but one thing is a
non-starter: mentioning the numerical codes of those bytes, certainly
as part of their "definition".  This is because their numerical codes
overlap Latin characters, and people were very confused about that
when we mentioned them in the documentation in the past.  So now we
deliberately don't mention the values.  The definition is effectively
"bytes that have no meaning as human-readable text".
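The overlap is easy to see in a short Emacs Lisp sketch.  This is
only an illustration, not text proposed for the manual: the function
name `my-raw-byte-overlap-demo' is hypothetical, and the byte E9 (hex)
is an arbitrary example.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-raw-byte-overlap-demo ()
  "Compare a raw byte with the Latin character sharing its number."
  (let* ((unibyte "\xe9")       ; unibyte string holding one raw byte
         (converted (string-to-multibyte unibyte))
         (latin "\u00e9"))      ; multibyte string holding e-acute
    (list (multibyte-string-p unibyte)              ; nil
          (aref unibyte 0)                          ; 233
          (aref latin 0)                            ; 233, the same number
          (= (aref converted 0) (aref latin 0))     ; nil
          (equal (string-to-unibyte converted) unibyte)))) ; t
```

In the unibyte string the raw byte reports the same number as the
character, which is exactly what confused people.  After conversion
with `string-to-multibyte' the two are no longer equal, and the
conversion round-trips through `string-to-unibyte'.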

>    * The term "raw 8-bit bytes" is misleading. It suggests binary data
>      (bytes with values 0-255) but it's actually meant to only cover
>      128-255.

It indeed could potentially mislead.  But not necessarily: it is
customary to use "eight-bit" to mean "with the 8th bit set".

Once again, you don't have to convince me that this area is confusing
and notoriously hard to document.  The challenge is to come up with
something that is better than what we have and yet doesn't trigger
confusion which we already had in the past.

>    * The term "raw 8-bit bytes" is not used consistently. Sometimes "8"
>      is spelled out as "eight", sometimes "raw" comes after "8-bit",
>      and sometimes it refers to all byte values 0-255 (see the first
>      sentence under `@cindex unibyte text`).

I see no problem here, none at all.  This is a manual, not a
mathematical treatise.

>    * It's not clear whether "raw 8-bit bytes" is meant to refer to
>      bytes with values 128-255, or to the *characters* that map to
>      those byte values.

We specifically say they are NOT characters.  From the above-cited
description:

     Encoded text is not really text, as far as Emacs is concerned, but
  rather a sequence of raw 8-bit bytes.

>    * The following phrasing is weird: "The function assumes that
>      @var{string} includes ASCII characters and raw 8-bit bytes". The
>      purpose of "raw 8-bit bytes" is to cover non-ASCII byte values, so
>      by definition that assumption is always true.

No, it isn't true "by definition".  We are trying to make it very
clear that we distinguish between "characters" and "raw bytes".
"Characters" are units of human-readable text, and each character has
a set of attributes that Emacs uses when processing text.  Characters
have letter-case, general category, directionality, numerical value,
etc.  By contrast, "raw bytes" don't have any such attributes: it is
meaningless to ask whether a given raw byte is upper- or lower-case,
or if its directionality is right-to-left, etc.
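As a rough sketch of that distinction, using letter-case as the
attribute (the function name `my-case-demo' is hypothetical, and the
byte value is again an arbitrary example):

```elisp
;; Hypothetical helper, for illustration only.
(defun my-case-demo ()
  "Up-case a Latin character and a raw byte held in multibyte strings."
  (let ((text "\u00e9")                      ; the character e-acute
        (raw (string-to-multibyte "\xe9")))  ; a raw byte, not a character
    (list (string= (upcase text) "\u00c9")   ; t: the character has an upper case
          (string= (upcase raw) raw))))      ; t: the raw byte is left unchanged
```

The character is mapped to its upper-case counterpart, while the raw
byte comes back untouched, because letter-case is not defined for it.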

I hope you now better understand what the sentence above attempts to
say; it doesn't say things that are trivially true.

>                                                       By saying "the
>      function assumes", the reader is left wondering about the cases
>      where that assumption is not true,

Those other cases are multibyte strings, of course.  We could add that
in parentheses, e.g.:

  The function assumes that @var{string} includes ASCII characters and
  raw 8-bit bytes (as opposed to multibyte text).

>      Maybe something like this:
> 
>          By definition, unibyte strings contain only @acronym{ASCII}
>          characters (bytes with values 0-127) and raw 8-bit bytes
>          (bytes with values 128-255); the latter are converted to their
>          corresponding multibyte representations in the
>          @code{eight-bit} character set (@pxref{Text Representations,
>          codepoints}).

As I tried to explain above, using the numerical codes of the bytes is
a step backward: we've been there and done that, and found that people
get confused by that, because the byte codes overlap the Unicode
codepoints of Latin characters.  Explaining the difference rigorously
is, in my experience, impossible without delving into the internal
representation of each of them, since that is how Emacs _really_
distinguishes between them.  But having all that in the ELisp
Reference manual is completely unjustified (let alone not
future-proof, since the internal representation can change).

Another problem with the above text is that it implies ASCII
characters are bytes: we don't want to call them that, to maintain the
fundamental difference between characters and bytes.

Yet another problem there is that you can have a multibyte string that
is pure-ASCII, so "by definition" is also problematic.
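A minimal sketch of that last point (the function name
`my-ascii-representation-demo' is hypothetical):

```elisp
;; Hypothetical helper, for illustration only.
(defun my-ascii-representation-demo ()
  "Show that a pure-ASCII string can be unibyte or multibyte."
  (let* ((plain "abc")                             ; read as a unibyte string
         (converted (string-to-multibyte plain)))  ; same text, multibyte
    (list (multibyte-string-p plain)        ; nil
          (multibyte-string-p converted)    ; t
          (string= plain converted))))      ; t
```

Both strings hold the same ASCII text and compare equal with
`string=', yet only one of them is unibyte, so "contains only ASCII"
does not by itself tell you which representation a string uses.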

Bottom line: I think the manual describes this reasonably well, and,
given the past experience, any change will have to be tangibly better
before we make it.

Thanks.




