bug#7781: [PATCH] Fix ispell problem with hunspell and UTF-8 file
From: Eli Zaretskii
Subject: bug#7781: [PATCH] Fix ispell problem with hunspell and UTF-8 file
Date: Sun, 14 Apr 2013 10:08:42 +0300
> Date: Sun, 14 Apr 2013 10:33:39 +0400
> From: Николай Сущенко <sckol@yandex.ru>
> CC: 7781@debbugs.gnu.org
>
> Please send me this patch, I'll ask the hunspell developers to include it.
Attached. This is a small part of a much larger patch, most of it for
Windows-specific problems. If you have problems compiling the patched
hunspell, let me know: it could be that I omitted some hunk that is
needed for this part.
> Could you also recall which concrete problems this workaround produces?
> For me it works fine, but I haven't tested it with different languages and
> encodings.
One problem is that you assume the encoding of the communications with
hunspell is UTF-8, and thus matches the internal representation of
text in Emacs buffers and strings (only then will byte-to-position
give correct results). But that assumption is false: hunspell
supports any encoding that it can convert to/from UTF-8 (it uses
libiconv internally). The "usual" choice of the encoding is the one
used by the dictionary. Not every dictionary out there is in UTF-8.
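To make the mismatch concrete, here is a minimal sketch (not part of hunspell or Emacs; the names WordOffsets and locate_word are made up for illustration) that computes both offsets for a word in a UTF-8 line. It assumes the line is valid UTF-8 and counts characters by skipping continuation bytes, which is the same test the patch below uses.

```cpp
#include <cstddef>
#include <string>

// Hypothetical type for illustration only.
struct WordOffsets {
    std::size_t byte_offset;  // offset counted in bytes
    std::size_t char_offset;  // offset counted in characters
};

// Locate `word` in the UTF-8 encoded `line` and return both offsets.
// If `word` is not found, the offsets refer to the end of the line.
WordOffsets locate_word(const std::string& line, const std::string& word) {
    std::size_t pos = line.find(word);
    if (pos == std::string::npos)
        pos = line.size();

    std::size_t chars = 0;
    for (std::size_t i = 0; i < pos; ++i) {
        // Bytes of the form 10xxxxxx continue a multibyte character,
        // so only the other bytes start a new character.
        if ((static_cast<unsigned char>(line[i]) & 0xc0) != 0x80)
            ++chars;
    }
    return {pos, chars};
}
```

For a line consisting of the six Cyrillic letters of "привет", a space, and the misspelling "wrld", the byte offset of "wrld" is 13 while the character offset is 7; for pure ASCII text the two offsets coincide, which is why the problem does not show up there.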
> If there are problems, I could try to fix them.
I don't think you can fix this on the Emacs side, because Emacs cannot
easily and/or quickly convert between bytes and characters in an
arbitrary multibyte encoding.
When I discovered this problem, I also tried fixing it on the Emacs
side first, but then I realized that this kind of solution has too
many problems, and instead fixed it in hunspell.
--- src/tools/hunspell.cxx~0 2011-01-21 19:01:29.000000000 +0200
+++ src/tools/hunspell.cxx 2013-02-07 10:11:54.443610900 +0200
@@ -710,13 +748,22 @@ if (pos >= 0) {
fflush(stdout);
} else {
char ** wlst = NULL;
- int ns = pMS[d]->suggest(&wlst, token);
+ int byte_offset = parser->get_tokenpos() + pos;
+ int char_offset = 0;
+ if (strcmp(io_enc, "UTF-8") == 0) {
+ for (int i = 0; i < byte_offset; i++) {
+ if ((buf[i] & 0xc0) != 0x80)
+ char_offset++;
+ }
+ } else {
+ char_offset = byte_offset;
+ }
+ int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
if (ns == 0) {
- fprintf(stdout,"# %s %d", token,
- parser->get_tokenpos() + pos);
+ fprintf(stdout,"# %s %d", token, char_offset);
} else {
fprintf(stdout,"& %s %d %d: ", token, ns,
- parser->get_tokenpos() + pos);
+ char_offset);
fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], io_enc));
}
for (int j = 1; j < ns; j++) {
@@ -745,13 +792,23 @@ if (pos >= 0) {
if (root) free(root);
} else {
char ** wlst = NULL;
+ int byte_offset = parser->get_tokenpos() + pos;
+ int char_offset = 0;
+ if (strcmp(io_enc, "UTF-8") == 0) {
+ for (int i = 0; i < byte_offset; i++) {
+ if ((buf[i] & 0xc0) != 0x80)
+ char_offset++;
+ }
+ } else {
+ char_offset = byte_offset;
+ }
int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
if (ns == 0) {
fprintf(stdout,"# %s %d", chenc(token, io_enc, ui_enc),
- parser->get_tokenpos() + pos);
+ char_offset);
} else {
fprintf(stdout,"& %s %d %d: ", chenc(token, io_enc, ui_enc), ns,
- parser->get_tokenpos() + pos);
+ char_offset);
fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], ui_enc));
}
for (int j = 1; j < ns; j++) {
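
The counting loop appears twice in the hunks above. As a sketch only, it could be factored into a standalone helper like the one below. The name reported_offset and its signature are hypothetical and do not exist in hunspell; the logic simply mirrors the patch, including its fallback of reporting the byte offset unchanged for any encoding other than UTF-8.

```cpp
#include <cstring>

// Hypothetical helper; mirrors the loop added by the patch.
// buf         - the input line as received, in the io_enc encoding
// byte_offset - offset of the token in buf, counted in bytes
// io_enc      - name of the encoding used for input/output
int reported_offset(const char *buf, int byte_offset, const char *io_enc) {
    // For anything but UTF-8, the byte offset is passed through as-is.
    if (std::strcmp(io_enc, "UTF-8") != 0)
        return byte_offset;

    int char_offset = 0;
    for (int i = 0; i < byte_offset; i++) {
        // Count every byte that is not a UTF-8 continuation byte (10xxxxxx).
        if ((buf[i] & 0xc0) != 0x80)
            char_offset++;
    }
    return char_offset;
}
```

With such a helper, both call sites would reduce to a single call taking buf, parser->get_tokenpos() + pos, and io_enc.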