bug-gnu-emacs

bug#25987: 25.2; support gcc fixit notes


From: David Malcolm
Subject: bug#25987: 25.2; support gcc fixit notes
Date: Sat, 14 Nov 2020 14:46:29 -0500
User-agent: Evolution 3.36.5 (3.36.5-1.fc32)

On Sat, 2020-11-14 at 16:21 +0200, Eli Zaretskii wrote:
> > From: David Malcolm <dmalcolm@redhat.com>
> > Cc: 25987@debbugs.gnu.org
> > Date: Fri, 13 Nov 2020 11:47:18 -0500
> > 
> > The names are identifiers from the user's program (names of variables,
> > types, macros, etc), where an error has been issued, typically due to a
> > misspelling of an identifier.  For example, somewhere there's a
> > declaration of a constant named "two_π", and later the code erroneously
> > references it as "two_pi"; we want to emit a diagnostic saying:
> >   did you mean "two_π"?
> > and provide a machine-readable fix-it hint suggesting the replacement
> > of the pertinent source range with "two_π".
> > 
> > GCC converts the source code from any encoding specified by
> > -finput-charset= to use UTF-8 internally...
> > 
> > https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html
> 
> And then GCC outputs these identifiers in UTF-8?  Or does it convert
> back to the original input-charset?

It emits them as UTF-8 when emitting diagnostics.
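
To make the "two_π" example concrete, here is a minimal sketch in Python of how a consumer might apply such a fix-it hint to one source line.  It assumes the hint gives a half-open range of 1-based byte columns and that both the source line and the replacement are UTF-8; the function name, the sample line, and the columns are made up for illustration and are not taken from GCC or Emacs.

```python
def apply_fixit(line: bytes, start_col: int, end_col: int, replacement: str) -> bytes:
    """Replace the half-open byte range [start_col, end_col) of `line`.

    Assumption: columns are 1-based byte offsets, not character offsets.
    """
    new_bytes = replacement.encode("utf-8")
    return line[: start_col - 1] + new_bytes + line[end_col - 1 :]


# Hypothetical source line that misspells the constant as "two_pi".
source = "  return two_pi * r;".encode("utf-8")

# "two_pi" occupies byte columns 10..15, so the half-open range is 10-16.
fixed = apply_fixit(source, 10, 16, "two_π")
print(fixed.decode("utf-8"))
```

Working on bytes rather than characters matters once the line itself contains non-ASCII text, since a byte column and a character column then no longer agree.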

> > ...however there's a bug in GCC in how we print the source code itself,
> > where we blithely emit the undecoded bytes directly to stderr when
> > quoting the lines of source.  This GCC bug is
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067 (aka PR
> > other/93067).  We ought to encode the source code into UTF-8 when
> > printing it (which may be a no-op for the common case).
> 
> I'm not sure you are right here: I think it is better for GCC to use
> the original bytestream, because the user's locale might not support
> UTF-8 well; it is better to show the source to the user in the
> encoding in which it was written.

This seems to me to lead to a bigger question: what should the encoding
of GCC's stderr be?  Right now I believe we emit a mix of UTF-8 and
other encodings, as noted in my earlier post.
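
As an illustration of what a mixed stream means for a consumer, the following Python sketch decodes diagnostic output as UTF-8 while keeping any bytes that are not valid UTF-8 recoverable, using Python's standard "surrogateescape" error handler.  The sample bytes and function names are invented for the example; this is one possible way to cope with the problem, not a description of what GCC or Emacs does.

```python
def decode_diagnostics(raw: bytes) -> str:
    """Decode as UTF-8; undecodable bytes become lone surrogates (U+DC80..U+DCFF)."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_back(text: str) -> bytes:
    """Inverse of decode_diagnostics: restores the original byte sequence."""
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical stderr contents: a UTF-8 message followed by a filename
# containing the Latin-1 byte 0xE9, which is not valid UTF-8 here.
raw = "error: did you mean 'two_π'?\n".encode("utf-8") + b"caf\xe9.c:1:1\n"

text = decode_diagnostics(raw)
print("two_π" in text)           # the UTF-8 part decodes normally
print(encode_back(text) == raw)  # the stray byte survives a round trip
```

This only avoids losing information; it cannot tell the consumer which encoding the stray bytes were really in, which is why a consistent encoding on stderr would be preferable.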

> However, I'm not familiar with GCC internals, so it is not clear to me
> whether the bug report will indeed affect the way source fragments
> will be output: the bug report only talks about converting the input,
> and I don't know enough to understand how that will affect output.
> 
> > The annotation lines we print under the source lines for fix-it
> > hints and labels are already printed in UTF-8, however.
> 
> The annotations are in US English, though, right?  If not, when will
> they include non-ASCII characters?

Annotation lines can contain labels as of GCC 9, and these can contain
identifiers; for example in this C++ type mismatch error, where the
types of the pertinent expressions are labeled:
$ g++ t.cc
t.cc: In function 'int test(const shape&, const shape&)':
t.cc:15:4: error: no match for 'operator+' (operand types are 'boxed_value<double>' and 'boxed_value<double>')
   14 |   return (width(s1) * height(s1)
      |           ~~~~~~~~~~~~~~~~~~~~~~
      |                     |
      |                     boxed_value<[...]>
   15 |    + width(s2) * height(s2));
      |    ^ ~~~~~~~~~~~~~~~~~~~~~~
      |                |
      |                boxed_value<[...]>

where "boxed_value" is an identifier and in theory could have non-ASCII 
characters in it.

> > That said, the above bug is orthogonal to the fix-it hint issue, which
> > prints the names in a different way (using UTF-8 encoded strings in
> > GCC's symbol table, rather than scraping them from the filesystem,
> > which is how the buggy source-quoting routines work).
> > [...]
> > As far as I can tell GCC handles filenames as raw bytes, and doesn't
> > make any attempt to decode them, and emits them as bytes again in
> > diagnostic messages.
> 
> This is okay, but since the other parts are in UTF-8, this will
> complicate things, as I mentioned in my previous message.
> 
> > > > I tried creating a file with the name "byte 0xff" .txt, and with valid
> > > > UTF-8 non-ASCII names, and Emacs reported them as \377.txt and with
> > > > the UTF-8 names respectively, so perhaps I should simply emit the
> > > > bytes and pretend they are UTF-8?
> > > 
> > > What do you mean by "pretend" in this context?
> > 
> > By "pretend" I mean simply re-emitting the bytes of the filename to
> > stderr and ignoring encoding issues in them, despite the fact that the
> > rest of the stream is supposed to be UTF-8-encoded.
> 
> As explained, it will be easier for Emacs to process GCC output if its
> encoding is consistent.

Indeed.  I'll raise this issue on the GCC mailing list.

> > Currently the parseable-fixits option uses IS_PRINT on each "char"
> > (i.e. byte) so that any non-printable bytes get octal-escaped.  Is that
> > acceptable for filenames?  The other approach, to "pretend they're
> > UTF-8", would mean to not escape such bytes, so that if they are UTF-8
> > they are faithfully re-emitted.
> > 
> > I think I like the approach where the filename part of the fixit line
> > is octal-escaped, and the replacement text is UTF-8, but I don't know
> > what's going to be best for you.
> 
> Given your description, it sounds like it will not be simple whatever
> you do.
> 
> I guess we should first try getting the plain-ASCII case to work, as
> that is the most frequent use case anyway.
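
For reference, the octal-escaping approach for filenames discussed above can be sketched in Python roughly as follows.  This is an approximation under stated assumptions, not GCC's code: "printable" is taken to mean ASCII 0x20 through 0x7E, and the backslash-escaping of the quote and backslash characters is assumed so that the result can sit inside a quoted string.

```python
def octal_escape(name: bytes) -> str:
    """Escape a filename byte by byte, using three-digit octal for non-printables."""
    out = []
    for b in name:
        if b in (0x22, 0x5C):        # '"' and '\' (assumed to need escaping)
            out.append("\\" + chr(b))
        elif 0x20 <= b <= 0x7E:      # printable ASCII passes through unchanged
            out.append(chr(b))
        else:                        # everything else becomes \ooo
            out.append("\\%03o" % b)
    return "".join(out)


print(octal_escape(b"\xff.txt"))               # \377.txt
print(octal_escape("π.txt".encode("utf-8")))   # \317\200.txt
```

Under this scheme a valid UTF-8 filename such as "π.txt" is escaped too, so the consumer has to undo the escapes and then decide how to decode the resulting bytes; the "pretend they're UTF-8" alternative would pass those two bytes through untouched.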

I added some test cases and posted the patch to the gcc-patches mailing
list here:
  "[PATCH/RFC] Add GCC_EXTRA_DIAGNOSTIC_OUTPUT environment variable for fix-it hints"
  https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559105.html

Thanks
Dave
