bug#25987: 25.2; support gcc fixit notes

bug-gnu-emacs

From:	Eli Zaretskii
Subject:	bug#25987: 25.2; support gcc fixit notes
Date:	Sat, 14 Nov 2020 16:21:25 +0200

> From: David Malcolm <dmalcolm@redhat.com>
> Cc: 25987@debbugs.gnu.org
> Date: Fri, 13 Nov 2020 11:47:18 -0500
> 
> The names are identifiers from the user's program (names of variables,
> types, macros, etc), where an error has been issued, typically due to a
> misspelling of an identifier.  For example, somewhere there's a
> declaration of a constant named "two_π", and later the code erroneously
> references it as "two_pi"; we want to emit a diagnostic saying:
>   did you mean "two_π"?
> and provide a machine-readable fix-it hint suggesting the replacement
> of the pertinent source range with "two_π".
> 
> GCC converts the source code from any encoding specified by
> -finput-charset= to use UTF-8 internally...
> 
> https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html

And then GCC outputs these identifiers in UTF-8?  Or does it convert
back to the original input-charset?
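
To make the question concrete, here is a minimal Python sketch of the
input-side conversion described in the quoted text.  It is not GCC code:
the helper name to_internal_utf8 and the choice of ISO-8859-7 as the
input charset are assumptions made only for this illustration.

```python
def to_internal_utf8(raw: bytes, input_charset: str) -> bytes:
    """Decode bytes from the input charset and re-encode them as UTF-8.

    Hypothetical helper; it only mimics the idea of converting source
    text to a UTF-8 internal form.
    """
    return raw.decode(input_charset).encode("utf-8")


# "two_π" as it would sit on disk in a Greek (ISO-8859-7) source file.
on_disk = "two_π".encode("iso8859_7")

internal = to_internal_utf8(on_disk, "iso8859_7")

print(on_disk)    # the single byte for π differs from its UTF-8 form
print(internal)   # same identifier, now as UTF-8 bytes
```

If the identifier is printed from the UTF-8 form while the quoted source
line is printed from the on-disk bytes, the two spellings of "two_π" in
one diagnostic would use different byte sequences.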

> ...however there's a bug in GCC in how we print the source code itself,
> where we blithely emit the undecoded bytes directly to stderr when
> quoting the lines of source.  This GCC bug is 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067 (aka PR
> other/93067).  We ought to encode the source code into UTF-8 when
> printing it (which may be a no-op for the common case).

I'm not sure you are right here: I think it is better for GCC to use
the original bytestream, because the user's locale might not support
UTF-8 well; it is better to show the source to the user in the
encoding in which it was written.

However, I'm not familiar with GCC internals, so it is not clear to me
whether the bug report will indeed affect the way source fragments
will be output: the bug report only talks about converting the input,
and I don't know enough to understand how that will affect output.

> The annotation lines we print under the source lines for fix-it
> hints and labels are already printed in UTF-8, however.

The annotations are in US English, though, right?  If not, when will
they include non-ASCII characters?

> That said, the above bug is orthogonal to the fix-it hint issue, which
> prints the names in a different way (using UTF-8 encoded strings in
> GCC's symbol table, rather than scraping them from the filesystem,
> which is how the buggy source-quoting routines work).
> [...]
> As far as I can tell GCC handles filenames as raw bytes, and doesn't
> make any attempt to decode them, and emits them as bytes again in
> diagnostic messages.

This is okay, but since the other parts are in UTF-8, this will
complicate things, as I mentioned in my previous message.

> > > I tried creating a file with the name "byte 0xff" .txt, and with
> > > valid UTF-8 non-ASCII names, and emacs reported them as \377.txt
> > > and with the UTF-8 names respectively, so perhaps I should simply
> > > emit the bytes and pretend they are UTF-8?
> > 
> > What do you mean by "pretend" in this context?
> 
> By "pretend" I mean simply re-emitting the bytes of the filename to
> stderr and ignoring encoding issues in them, despite the fact that the
> rest of the stream is supposed to be UTF-8-encoded.

As explained, it will be easier for Emacs to process GCC output if its
encoding is consistent.
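
The difficulty with a mixed stream can be sketched in Python.  This is
only an illustration: the sample diagnostic line is made up, and
Python's "surrogateescape" error handler stands in for whatever
raw-byte handling the consumer has; it is not what Emacs does
internally.

```python
def decode_diagnostic(line: bytes) -> str:
    """Decode as UTF-8, keeping undecodable bytes as lone surrogates."""
    return line.decode("utf-8", errors="surrogateescape")


# Hypothetical output line: a raw 0xff byte in the file name, followed
# by message text that is valid UTF-8.
line = b'\xff.txt:1:1: error: did you mean "two_\xcf\x80"?'

try:
    line.decode("utf-8")              # strict decoding of the whole line
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False                 # the file name byte breaks it

text = decode_diagnostic(line)        # lenient decoding succeeds
round_trip = text.encode("utf-8", errors="surrogateescape")
```

The lenient decoder recovers the UTF-8 identifier and still round-trips
the file name bytes, but every consumer of the stream has to know to do
this, which is the extra complication a single consistent encoding
would avoid.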

> Currently the parseable-fixits option uses IS_PRINT on each "char"
> (i.e. byte) so that any non-printable bytes get octal-escaped.  Is that
> acceptable for filenames?  The other approach, to "pretend they're
> UTF-8", would mean to not escape such bytes, so that if they are UTF-8
> they are faithfully re-emitted.
> 
> I think I like the approach where the filename part of the fixit line
> is octal-escaped, and the replacement text is UTF-8, but I don't know
> what's going to be best for you.

Given your description, it sounds like it will not be simple whatever
you do.
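
For reference, the per-byte octal escaping described in the quoted text
can be sketched as follows.  This is a Python approximation, not GCC's
code: the function name octal_escape is hypothetical, "printable" is
assumed to mean ASCII 0x20 through 0x7E, and the backslash-escaping of
the backslash and double-quote characters is likewise an assumption.

```python
def octal_escape(raw: bytes) -> str:
    """Escape a byte string one byte at a time.

    Printable ASCII passes through; backslash and double quote get a
    backslash prefix; every other byte becomes a 3-digit octal escape.
    """
    out = []
    for b in raw:
        ch = chr(b)
        if ch in ("\\", '"'):
            out.append("\\" + ch)
        elif 0x20 <= b <= 0x7E:
            out.append(ch)
        else:
            out.append("\\%03o" % b)
    return "".join(out)


print(octal_escape(b"\xff.txt"))                # invalid-UTF-8 file name
print(octal_escape("π.txt".encode("utf-8")))    # valid UTF-8 file name
```

Under this scheme the escaped file name is pure ASCII whether or not
the original bytes were valid UTF-8, at the cost of making valid
non-ASCII names unreadable until the consumer undoes the escapes.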

I guess we should first try getting the plain-ASCII case to work, as
that is the most frequent use case anyway.
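
As a rough sketch of what the plain-ASCII case involves on the
consuming side, the following Python parses one machine-readable fix-it
line.  The line layout used here,
fix-it:"FILE":{LINE:COL-LINE:COL}:"REPLACEMENT", is my understanding of
what -fdiagnostics-parseable-fixits prints and should be checked
against the GCC manual; the names FixIt and parse_fixit are
hypothetical, and escapes inside the quoted strings are left undecoded.

```python
import re
from typing import NamedTuple, Optional


class FixIt(NamedTuple):
    filename: str
    start_line: int
    start_col: int
    end_line: int
    end_col: int
    replacement: str


# A double-quoted string that may contain backslash escapes.
_QUOTED = r'"((?:[^"\\]|\\.)*)"'
_FIXIT_RE = re.compile(
    r"^fix-it:" + _QUOTED + r":\{(\d+):(\d+)-(\d+):(\d+)\}:" + _QUOTED + r"$"
)


def parse_fixit(line: str) -> Optional[FixIt]:
    """Return the parsed fix-it, or None if the line is not a fix-it."""
    m = _FIXIT_RE.match(line.strip())
    if m is None:
        return None
    filename, l1, c1, l2, c2, replacement = m.groups()
    return FixIt(filename, int(l1), int(c1), int(l2), int(c2), replacement)


print(parse_fixit('fix-it:"test.c":{45:3-45:21}:"gtk_widget_show_all"'))
```

Handling escaped or non-ASCII file names and replacement text would be
a later step on top of this.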
