bug-gnu-emacs

bug#25987: 25.2; support gcc fixit notes


From: David Malcolm
Subject: bug#25987: 25.2; support gcc fixit notes
Date: Fri, 13 Nov 2020 11:47:18 -0500
User-agent: Evolution 3.36.5 (3.36.5-1.fc32)

On Thu, 2020-11-12 at 15:54 +0200, Eli Zaretskii wrote:
> > From: David Malcolm <dmalcolm@redhat.com>
> > Cc: 25987@debbugs.gnu.org
> > Date: Wed, 11 Nov 2020 14:36:49 -0500
> > 
> > On Tue, 2020-10-20 at 18:54 +0300, Eli Zaretskii wrote:
> > > > From: David Malcolm <dmalcolm@redhat.com>
> > > > Cc: 25987@debbugs.gnu.org
> > > > Date: Tue, 20 Oct 2020 10:52:05 -0400
> > > > 
> > > > One possible issue: in the final diagnostic, there's a fix-it
> > > > hint with non-ASCII replacement text, replacing "two_pi" with
> > > > "two_π" (where the final char in the latter is GREEK SMALL
> > > > LETTER PI, U+03C0)
> > > > 
> > > > This replacement is currently expressed as encoded bytes, i.e.:
> > > > 
> > > > fix-it:"demo.c":{51:10-51:16}:"two_\317\200"
> > > > 
> > > > where \317\200 is the octal-escaped representation of the two
> > > > bytes of the UTF-8 encoding of the character.
> > > > 
> > > > Is this going to work for Emacs?
> > > 
> > > You mean, GCC doesn't actually emit the UTF-8 encoding of π, it
> > > emits its ASCII-fied representation?  We'd need to decode that,
> > > but is that really justified?  Why not emit UTF-8?
> > 
> > I have an implementation that simply emits UTF-8 in quotes,
> > escaping backslash, tab, newline, and doublequotes as before.  (we
> > have to escape at least newline, given that fix-it hint replacement
> > text can contain them, and we're using newline to terminate the
> > parseable hint).
> 
> Sorry, I've lost the context: where did those non-ASCII names come
> from?  Are they names of variables in the user's program?

The names are identifiers from the user's program (names of variables,
types, macros, etc), where an error has been issued, typically due to a
misspelling of an identifier.  For example, somewhere there's a
declaration of a constant named "two_π", and later the code erroneously
references it as "two_pi"; we want to emit a diagnostic saying:
  did you mean "two_π"?
and provide a machine-readable fix-it hint suggesting the replacement
of the pertinent source range with "two_π".

GCC converts the source code from any encoding specified by
-finput-charset= to use UTF-8 internally...

https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html

> If so, in what encoding does GCC quote portions of the source code
> in its warning/error messages?  Does it use the exact byte stream it
> found in the source, or does it perform any conversions of the
> encoding?

...however there's a bug in GCC in how we print the source code itself,
where we blithely emit the undecoded bytes directly to stderr when
quoting the lines of source.  This GCC bug is 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067 (aka PR
other/93067).  We ought to encode the source code into UTF-8 when
printing it (which may be a no-op for the common case).  The annotation
lines we print under the source lines for fix-it hints and labels are
already printed in UTF-8, however.

That said, the above bug is orthogonal to the fix-it hint issue, which
prints the names in a different way (using UTF-8 encoded strings in
GCC's symbol table, rather than scraping them from the filesystem,
which is how the buggy source-quoting routines work).

> > However, the filename also needs to be escaped.  Currently I'm
> > applying the same escaping rules to both filename and replacement
> > text.  What is the encoding of the filename?  What if the bytes in
> > a filename aren't UTF-8 encoded?  How does emacs handle this case?
> 
> Emacs has a separate variable for the encoding of file names, which
> gets set from the locale settings.  But this is not necessarily
> relevant to the issue at hand, because we are talking about
> processing
> output from a sub-process (GCC) which includes both file names and
> other stuff, such as fragments of the source code.  When Emacs
> processes sub-process output, it generally assumes all of it is
> encoded in the same encoding.  So if, for example, you encode
> non-ASCII variables in UTF-8 while the file names are emitted in some
> other encoding (perhaps because the locale's codeset is not UTF-8),
> then there will be complications: we will have to read the output
> from GCC in its raw form, and then decode "by hand" (in Lisp) each
> part of it as appropriate (which means we will need to be able to
> identify each such part).
> 
> So it's important to understand the situation and its limitations for
> proposing the best solution.

As far as I can tell GCC handles filenames as raw bytes, and doesn't
make any attempt to decode them, and emits them as bytes again in
diagnostic messages.

> > I tried creating a file with the name "byte 0xff" .txt, and files
> > with valid UTF-8 non-ASCII names, and emacs reported them as
> > \377.txt and with the UTF-8 names respectively, so perhaps I should
> > simply emit the bytes and pretend they are UTF-8?
> 
> What do you mean by "pretend" in this context?

By "pretend" I mean simply re-emitting the bytes of the filename to
stderr and ignoring encoding issues in them, despite the fact that the
rest of the stream is supposed to be UTF-8-encoded.
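
To make the "pretend" case concrete from the consumer's side, here is a
minimal Python sketch (not GCC or Emacs code) of what a reader of the
stream would see.  The filename b"\xff.txt" is the test case mentioned
above; the use of Python's "surrogateescape" error handler is my own
assumption about how a tool might cope, not something either project
does.

```python
# A filename consisting of the single byte 0xff followed by ".txt".
raw_name = b"\xff.txt"

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def decode_lossless(data: bytes) -> str:
    """Decode as UTF-8, smuggling undecodable bytes through as lone
    surrogates so the original bytes can be recovered later."""
    return data.decode("utf-8", errors="surrogateescape")

def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless."""
    return text.encode("utf-8", errors="surrogateescape")

# A strict UTF-8 reader rejects the raw filename outright...
strict_ok = is_valid_utf8(raw_name)
# ...while a lenient reader can round-trip it byte for byte.
round_trip = encode_lossless(decode_lossless(raw_name))
```

A strict decoder fails on such a stream, so a consumer has to either be
lenient in this way or treat the filename field separately from the rest
of the line.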

Currently the parseable-fixits option uses IS_PRINT on each "char"
(i.e. byte) so that any non-printable bytes get octal-escaped.  Is that
acceptable for filenames?  The other approach, to "pretend they're
UTF-8", would mean not escaping such bytes, so that if they are UTF-8
they are faithfully re-emitted.
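
The two escaping strategies can be sketched roughly as follows.  This is
an illustrative Python model, not the GCC implementation: the function
names are hypothetical, and treating "printable" as ASCII 0x20..0x7e is
an assumption about what the per-byte check amounts to in the C locale.

```python
def escape_octal(data: bytes) -> str:
    """Escape backslash, doublequote, newline and tab, and octal-escape
    every other byte outside printable ASCII (assumed 0x20..0x7e)."""
    out = []
    for b in data:
        ch = chr(b)
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\t":
            out.append("\\t")
        elif 0x20 <= b < 0x7F:
            out.append(ch)
        else:
            out.append("\\%03o" % b)  # three octal digits per byte
    return "".join(out)

_SPECIAL = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\t": "\\t"}

def escape_minimal(text: str) -> str:
    """Escape only backslash, doublequote, newline and tab; all other
    characters (including non-ASCII) pass through unchanged."""
    return "".join(_SPECIAL.get(ch, ch) for ch in text)

octal_form = escape_octal("two_π".encode("utf-8"))
utf8_form = escape_minimal("two_π")
```

With the first function the replacement text comes out as
two_\317\200, matching the existing output; with the second it comes
out as two_π unchanged.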

I think I like the approach where the filename part of the fixit line
is octal-escaped, and the replacement text is UTF-8, but I don't know
what's going to be best for you.
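
For what it's worth, here is a rough Python sketch of what a consumer of
that mixed format might look like: the filename is unescaped and kept as
raw bytes, and the replacement text is unescaped and then decoded as
UTF-8.  The line layout is taken from the demo.c example quoted earlier;
the function names and the regular expression are hypothetical, and the
sketch assumes any backslash not followed by n, t, backslash or
doublequote introduces exactly three octal digits.

```python
import re

# fix-it:"FILE":{LINE:COL-LINE:COL}:"TEXT", matched on raw bytes.
_FIXIT_RE = re.compile(
    rb'^fix-it:"((?:[^"\\]|\\.)*)":'
    rb'\{(\d+):(\d+)-(\d+):(\d+)\}:'
    rb'"((?:[^"\\]|\\.)*)"$'
)

_SIMPLE = {ord("n"): 0x0A, ord("t"): 0x09, ord("\\"): 0x5C, ord('"'): 0x22}

def unescape(raw: bytes) -> bytes:
    """Undo \\n, \\t, \\\\, \\" and three-digit octal escapes."""
    out = bytearray()
    i = 0
    while i < len(raw):
        b = raw[i]
        if b != 0x5C:                # not a backslash: copy through
            out.append(b)
            i += 1
            continue
        nxt = raw[i + 1]             # the regex guarantees a following byte
        if nxt in _SIMPLE:
            out.append(_SIMPLE[nxt])
            i += 2
        else:
            out.append(int(raw[i + 1:i + 4], 8))  # assumed: 3 octal digits
            i += 4
    return bytes(out)

def parse_fixit(line: bytes):
    """Return (filename_bytes, (line, col), (line, col), replacement_str),
    or None if the line is not a fix-it hint."""
    m = _FIXIT_RE.match(line.rstrip(b"\n"))
    if m is None:
        return None
    filename = unescape(m.group(1))                   # stays as raw bytes
    start = (int(m.group(2)), int(m.group(3)))
    end = (int(m.group(4)), int(m.group(5)))
    replacement = unescape(m.group(6)).decode("utf-8")
    return filename, start, end, replacement

hint = parse_fixit('fix-it:"demo.c":{51:10-51:16}:"two_π"\n'.encode("utf-8"))
```

The same parser also accepts the current all-octal output, since
unescaping \317\200 yields the two UTF-8 bytes of the character before
decoding.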

Hope the above clarifies things.

Dave





