
Re: [Help-source-highlight] Unicode files ?


From: Martin Gebert
Subject: Re: [Help-source-highlight] Unicode files ?
Date: Tue, 30 Mar 2010 20:30:48 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; de-DE; rv:1.9.1.8) Gecko/20100306 Thunderbird/3.0.3

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> I would say that Unicode is an essential feature.  In fact, I
> thought Source-highlight was already Unicode-compliant, since this
> is 2010 and is hard to imagine an application that isn't.

The problem isn't the year, but that C/C++ and their standard
libraries neglected string processing beyond char for a long time
(even wchar_t is still something of a stepchild in today's C++
programming). So you either rely on some more advanced library that
supports this encoding stuff, which is non-trivial if done correctly
(QString is an excellent example, and in my eyes still a reference for
how string classes should be done), or you have to roll your own,
using what little support C is able to give. It should get better with
C++0x, but for source-highlight I wouldn't count on it, as it will
take a while until it's available on most platforms and installations.


> I think you are exaggerating the difficulty of dealing with
> variable-length encodings such as UTF-8.  In fact, almost every
> library I know that deals with Unicode does so using the UTF-8
> encoding.

That may be the case, but you still need some non-standard
infrastructure around it to make UTF-8 string processing work
properly, and usually that's not something you do in one evening for
your home-brew projects (not meant to slag you, Lorenzo ;-)).
One problem, aside from strlen() (without which it's IMHO hard to
write any string processing at all), is how to determine which
encoding a string literal in your code has, or which encoding the file
you're processing has.
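
To illustrate the file-encoding half of that problem, here is a
minimal sketch in C that only looks for a byte order mark at the start
of a buffer. The names guess_encoding_from_bom and enum bom_guess are
made up for this example and are not part of source-highlight or any
library. It is a heuristic at best: many UTF-8 files carry no BOM at
all, and a missing BOM says nothing about which single-byte encoding
might be in use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
enum bom_guess {
    BOM_NONE,      /* no byte order mark found; encoding still unknown */
    BOM_UTF8,
    BOM_UTF16_LE,
    BOM_UTF16_BE,
    BOM_UTF32_LE,
    BOM_UTF32_BE
};

/* Inspect the first bytes of a buffer for a Unicode byte order mark. */
enum bom_guess guess_encoding_from_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char utf32_le[] = { 0xFF, 0xFE, 0x00, 0x00 };
    static const unsigned char utf32_be[] = { 0x00, 0x00, 0xFE, 0xFF };
    static const unsigned char utf8[]     = { 0xEF, 0xBB, 0xBF };
    static const unsigned char utf16_le[] = { 0xFF, 0xFE };
    static const unsigned char utf16_be[] = { 0xFE, 0xFF };

    assert(buf != NULL || len == 0);

    /* The 4-byte marks are tested first, because the UTF-32 LE mark
       begins with the UTF-16 LE mark. This stays ambiguous: the same
       bytes could be a UTF-16 LE file that starts with U+0000. */
    if (len >= 4 && memcmp(buf, utf32_le, 4) == 0) return BOM_UTF32_LE;
    if (len >= 4 && memcmp(buf, utf32_be, 4) == 0) return BOM_UTF32_BE;
    if (len >= 3 && memcmp(buf, utf8, 3) == 0)     return BOM_UTF8;
    if (len >= 2 && memcmp(buf, utf16_le, 2) == 0) return BOM_UTF16_LE;
    if (len >= 2 && memcmp(buf, utf16_be, 2) == 0) return BOM_UTF16_BE;
    return BOM_NONE;
}
```

When the result is BOM_NONE, the caller still has to fall back on
something else, such as a command-line option or the locale.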

> Almost everyone uses either single-byte (non-Unicode, thus) or
> Unicode in the form of UTF-8.

That hasn't been the case for very long (I made the switch on my
workstations about 5 or 6 years ago), and you can't, nor should you,
count on it.

> There's also some UTF-16 out there and even UTF-32 (aka UCS-4), but
> these are less common.

Never heard of any environment using UTF-32 seriously. And UTF-16 I
know mostly from VFAT and NTFS... However, my experience in this field
is limited.

> It's really not as difficult as you make it to be...

Well, on the other hand, I don't think it's as easy as you think it
should be without some support from an appropriate library...

>
> Using UTF-8, you will need a special length() function, but you
> can use the regular strcpy() and strcat().

Right, you can always store a UTF-8 string (or any other Unicode
encoding) in a char array, but you have to be aware which standard
functions don't work with that...
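
As a small illustration of the length() point, the following C sketch
counts code points by skipping UTF-8 continuation bytes (those of the
form 10xxxxxx). The function name utf8_length is made up for this
example, and the sketch assumes its input is already valid,
NUL-terminated UTF-8; it does no validation and knows nothing about
combining characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of code points in a NUL-terminated
   string that is assumed to be valid UTF-8. Every byte that is not a
   continuation byte (10xxxxxx) starts a new code point. */
size_t utf8_length(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string "na\xC3\xAFve" (the word "naive" with a diaeresis on
the i, encoded as the two bytes C3 AF), strlen() reports 6 bytes while
utf8_length() reports 5 code points. strcpy() and strcat() keep
working because they only look for the terminating NUL, which never
occurs inside a multi-byte UTF-8 sequence.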

Just my 5 ¢.

Martin
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkuyQ1gACgkQ0qFE1uHKpVcJrQCfXbtkL7DSYPAchqHuLD3B/lCZ
lEQAoKAkunBS1MRXXb937nIF5f9AZHZY
=Rjds
-----END PGP SIGNATURE-----




