From: Lionel Fumery
Subject: Re: [Help-source-highlight] Unicode files ?
Date: Tue, 30 Mar 2010 11:59:01 +0200
User-agent: Thunderbird 2.0.0.24 (Windows/20100228)

Hi Lorenzo, Martin (and others maybe),

Thanks for your answers. Again, as I discovered Source-highlight only very recently, I don't know whether Unicode is an important feature for you or not. I sometimes read source code from Japanese or Chinese developers, and I am French myself, so it is not unusual for me to store code or text files in Unicode (I mostly work with Visual Studio).

Unicode files (UTF-8 for example, which is widely used on the Internet) store each character on a variable number of bytes: 1 to 4 in current UTF-8, up to 6 in the original design. So of course they are difficult to work with, since length() and similar functions count bytes rather than characters.

1) First you have to know whether the file is Unicode or not. Such files should start with a header, described here: http://en.wikipedia.org/wiki/Byte_order_mark
(Note that "bad" Unicode text files, without any header, are quite common, but there is no need to address this here.)

2) The second thing is to convert the whole file to a "fixed bytes per character" format, so you can work with it. A wide char format (16-bit wchar) is a good choice most of the time. Here is a FAQ explaining how to read Unicode files: http://www.cl.cam.ac.uk/~mgk25/unicode.html
I can provide some C source code snippets to match this.

3) And then you can work with the wchar functions. I don't know much about the Linux side, but with Visual Studio it is simply a matter of using wcslen, wcscpy, wcscat instead of length(), strcpy(), strcat().

I'm going to take a look at the Source-Highlight code to see whether this could be easy to add...

Best,
Lionel
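
As a rough illustration of step 1 in the message above, byte order mark detection could be sketched in C along these lines. The names text_encoding and detect_bom are hypothetical, chosen only for this example; they are not part of Source-highlight. The byte signatures are the standard BOM values listed on the Wikipedia page linked above.

```c
#include <stddef.h>

/* Encodings that can be recognized from a leading byte order mark (BOM).
 * Hypothetical type, for illustration only. */
typedef enum {
    ENC_UNKNOWN,   /* no BOM: plain 8-bit text or a "bad" BOM-less Unicode file */
    ENC_UTF8,
    ENC_UTF16_LE,
    ENC_UTF16_BE,
    ENC_UTF32_LE,
    ENC_UTF32_BE
} text_encoding;

/* Inspect the first bytes of a buffer and report which BOM, if any, it
 * starts with. If bom_len is not NULL, it receives the number of bytes
 * to skip before the actual text begins (0 when no BOM is found). */
text_encoding detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    text_encoding enc = ENC_UNKNOWN;
    size_t skip = 0;

    /* The UTF-32 checks must come first: the UTF-32 LE mark FF FE 00 00
     * begins with the UTF-16 LE mark FF FE. (A UTF-16 LE file whose first
     * character is NUL would be misdetected; this sketch ignores that case.) */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00) {
        enc = ENC_UTF32_LE; skip = 4;
    } else if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
               buf[2] == 0xFE && buf[3] == 0xFF) {
        enc = ENC_UTF32_BE; skip = 4;
    } else if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        enc = ENC_UTF8; skip = 3;
    } else if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        enc = ENC_UTF16_LE; skip = 2;
    } else if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        enc = ENC_UTF16_BE; skip = 2;
    }

    if (bom_len != NULL)
        *bom_len = skip;
    return enc;
}
```

A caller would read the first four bytes of the file, pass them to detect_bom, and then skip bom_len bytes before decoding the rest. A result of ENC_UNKNOWN does not prove the file is not Unicode, since files without a header are common.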