From: Andrew Janke
Subject: [Octave-bug-tracker] [bug #53842] Handle m-files with arbitrary character encoding
Date: Mon, 25 Jun 2018 08:11:35 -0400 (EDT)
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36

Follow-up Comment #13, bug #53842 (project octave):

> UTF-8 is an encoding that covers all Unicode characters. What do you mean
> by changing the default encoding from "utf-8" to "unicode"?

I mean, make the default _external_ file encoding not strictly UTF-8, but a
"meta-encoding" that means "Unicode, with the actual UTF-whatever encoding
chosen/detected on a per-file basis". So you might have a project with some .m
files in UTF-8 and some in UTF-16, and each file's encoding is autodetected
from its contents when it is read. And for output, previously-existing files
are saved in their original encodings (to avoid introducing spurious changes
into source control or whatever), while newly-created files are saved in a
default Unicode encoding, probably UTF-8.
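
To make that a bit more concrete, here's a minimal sketch of the read/write
policy I have in mind (hypothetical names, not existing Octave code): remember
the encoding detected when a file is read, reuse it when writing that same
file back, and default brand-new files to UTF-8.

  // Hypothetical sketch of the "unicode" meta-encoding save policy.
  #include <map>
  #include <string>

  class encoding_policy
  {
  public:
    // Remember the encoding detected when FILE was read.
    void remember (const std::string& file, const std::string& enc)
    { m_known[file] = enc; }

    // Encoding to use when writing FILE: the original one if we've seen the
    // file before, otherwise the default for newly-created files.
    std::string output_encoding (const std::string& file) const
    {
      auto it = m_known.find (file);
      return it != m_known.end () ? it->second : "UTF-8";
    }

  private:
    std::map<std::string, std::string> m_known;  // file -> detected encoding
  };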

> Wrt your comment #10: Unfortunately(?) the C++ standard doesn't define any
> character encodings. To make matters worse wchar_t and wstring have different
> sizes on different platforms...

That's my understanding as well. And, yeah, "Unfortunately(?)" is right. It's
a bummer that there's no standard support in base C/C++, but this Unicode
stuff is so complex that whatever made it out of a C++ standards committee
would probably be woefully outdated by the time it hit the streets.

What I really mean is:
* Define an EncodedFileReader class that:
** detects file encodings using some heuristics, or takes an encoding
explicitly supplied by the caller, and then
** does line-oriented encoding-aware input reading (based on the initial
encoding detection/specification) and encoding conversion, returning the
results as UTF-8 strings inside `std::string` (or whatever Octave is using for
its internal Unicode string representation). So the external encoding could be
UTF-16, UTF-8, UTF-32, or whatever, as long as the Unicode library supports
translation between that and the native Octave representation, and line feeds
could be safely detected
*** (which maybe means slurping the entire input file into memory,
transcoding the entire thing at once, and _then_ detecting line feeds. But
that's probably not an issue because source code files are relatively small;
see the sketch just after this list.)
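
Here's roughly what that reading/conversion step could look like, as a
self-contained sketch. I'm using plain POSIX iconv() for the transcoding here
(gnulib wraps the same interface); "encoded_file_reader" and everything else
in it are made-up names, and the caller is assumed to supply the encoding
(from detection or otherwise):

  // Rough sketch only: slurp the whole file, transcode it to UTF-8 with
  // iconv (), then split on line feeds.
  #include <cerrno>
  #include <cstddef>
  #include <fstream>
  #include <iterator>
  #include <stdexcept>
  #include <string>
  #include <vector>

  #include <iconv.h>

  class encoded_file_reader
  {
  public:
    // ENCODING comes from the caller (or from a separate detection step).
    encoded_file_reader (const std::string& filename,
                         const std::string& encoding)
    {
      std::ifstream is (filename, std::ios::binary);
      if (! is)
        throw std::runtime_error ("cannot open " + filename);

      std::string raw ((std::istreambuf_iterator<char> (is)),
                       std::istreambuf_iterator<char> ());

      std::string utf8 = to_utf8 (raw, encoding);

      // Line feeds are only safe to look for *after* transcoding.
      std::size_t pos = 0;
      while (pos < utf8.size ())
        {
          std::size_t nl = utf8.find ('\n', pos);
          if (nl == std::string::npos)
            {
              m_lines.push_back (utf8.substr (pos));
              break;
            }
          m_lines.push_back (utf8.substr (pos, nl - pos));
          pos = nl + 1;
        }
    }

    const std::vector<std::string>& lines () const { return m_lines; }

  private:
    static std::string to_utf8 (const std::string& in, const std::string& from)
    {
      iconv_t cd = iconv_open ("UTF-8", from.c_str ());
      if (cd == (iconv_t) -1)
        throw std::runtime_error ("unsupported encoding: " + from);

      std::string out;
      std::vector<char> buf (4096);
      char *inp = const_cast<char *> (in.data ());
      std::size_t inleft = in.size ();

      while (inleft > 0)
        {
          char *outp = buf.data ();
          std::size_t outleft = buf.size ();
          if (iconv (cd, &inp, &inleft, &outp, &outleft)
                == static_cast<std::size_t> (-1)
              && errno != E2BIG)
            {
              iconv_close (cd);
              throw std::runtime_error ("conversion to UTF-8 failed");
            }
          out.append (buf.data (), buf.size () - outleft);
        }

      iconv_close (cd);
      return out;
    }

    std::vector<std::string> m_lines;
  };

A fancier version would stream in chunks instead of slurping, but for .m
files that hardly matters.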

My understanding is that Octave is currently using gnulib for Unicode support,
and that gnulib provides encoding translations, but not encoding/character-set
detection. So we'd have to roll our own encoding detection, or take a
dependency on a library that supports it.
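
If we do roll our own, the detection doesn't have to be fancy to be useful.
Something along these lines would already cover the common cases (entirely a
sketch; the heuristic and the name are invented here): honor a BOM if one is
present, otherwise call it UTF-8 if the bytes validate as UTF-8, otherwise
fall back to the locale/system encoding.

  // Sketch of a hand-rolled detection heuristic: BOM first, then a UTF-8
  // validity check.  (UTF-32 BOMs and overlong sequences are ignored here.)
  #include <cstddef>
  #include <string>

  std::string guess_encoding (const std::string& bytes)
  {
    // A byte-order mark identifies the UTF flavor unambiguously.
    if (bytes.compare (0, 3, "\xEF\xBB\xBF") == 0)
      return "UTF-8";
    if (bytes.compare (0, 2, "\xFF\xFE") == 0)
      return "UTF-16LE";
    if (bytes.compare (0, 2, "\xFE\xFF") == 0)
      return "UTF-16BE";

    // No BOM: accept the file as UTF-8 if every byte sequence validates.
    std::size_t i = 0;
    while (i < bytes.size ())
      {
        unsigned char c = bytes[i];
        std::size_t len = (c < 0x80) ? 1
                          : (c >> 5) == 0x06 ? 2
                          : (c >> 4) == 0x0E ? 3
                          : (c >> 3) == 0x1E ? 4 : 0;
        if (len == 0 || i + len > bytes.size ())
          return "";   // not UTF-8: caller falls back to the locale encoding
        for (std::size_t k = 1; k < len; k++)
          if ((static_cast<unsigned char> (bytes[i+k]) >> 6) != 0x02)
            return "";
        i += len;
      }
    return "UTF-8";
  }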

> While C++ 11 added u16string and u32string, I don't know of any iostreams
> that support these types out of the box.

I don't know of any standard C/C++ stuff that supports this either. AFAIK,
base C/C++ Unicode support is not great.

So that means rolling our own encoding-detection and Unicode-aware iostream
support (which might not be so bad if all you care about is UTF-8 + UTF-16 and
BMP-only support), or taking a dependency on ICU4C or similar, or deferring
this to later.
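
For reference, if we did take the ICU4C dependency, its charset detector is
only a handful of calls (just illustrating the ICU API here, nothing wired
into Octave):

  // Illustration of ICU4C's charset detector (unicode/ucsdet.h).
  #include <string>
  #include <unicode/ucsdet.h>

  std::string icu_detect_encoding (const std::string& bytes)
  {
    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector *det = ucsdet_open (&status);
    ucsdet_setText (det, bytes.data (), static_cast<int32_t> (bytes.size ()),
                    &status);
    const UCharsetMatch *match = ucsdet_detect (det, &status);

    std::string name;
    if (U_SUCCESS (status) && match)
      name = ucsdet_getName (match, &status);   // e.g. "UTF-8", "UTF-16LE"

    ucsdet_close (det);
    return name;
  }

That's a nontrivial dependency to take on just for this, though, so deferring
it is also a perfectly reasonable option.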

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?53842>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



