bug#7781: [PATCH] Fix ispell problem with hunspell and UTF-8 file

bug-gnu-emacs

From:	Eli Zaretskii
Subject:	bug#7781: [PATCH] Fix ispell problem with hunspell and UTF-8 file
Date:	Sun, 14 Apr 2013 10:08:42 +0300

> Date: Sun, 14 Apr 2013 10:33:39 +0400
> From: Николай Сущенко
>  <sckol@yandex.ru>
> CC: 7781@debbugs.gnu.org
> 
> Please send me this patch, I'll ask the hunspell developers to include it.

Attached.  This is a small part of a much larger patch, most of it for
Windows-specific problems.  If you have problems compiling the patched
hunspell, let me know: it could be that I omitted some hunk that is
needed for this part.

> Could you also recall which concrete problems this workaround produces?
> For me it works fine, but I haven't tested it with different languages
> and encodings.

One problem is that you assume the encoding of the communications with
hunspell is UTF-8, and thus matches the internal representation of
text in Emacs buffers and strings (only then will byte-to-position
give correct results).  But that assumption is false: hunspell
supports any encoding that it can convert to/from UTF-8 (it uses
libiconv internally).  The "usual" choice of the encoding is the one
used by the dictionary.  Not every dictionary out there is in UTF-8.
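
To make the mismatch concrete, here is a small sketch. It is not part of hunspell or Emacs: the helper name and the sample text are made up for illustration. The same two-word Russian line is written out byte by byte in UTF-8 and in KOI8-R, and the byte offset of the second word is computed the way a byte-oriented tool would report it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte offset of the first occurrence of `word` in `line`.
// Hypothetical helper: it models a tool that reports offsets in bytes.
std::size_t byte_offset_of(const std::string& line, const std::string& word) {
    std::size_t pos = line.find(word);
    assert(pos != std::string::npos);  // the sketch assumes the word is present
    return pos;
}

// The Russian text "да нет", spelled out as raw bytes in two encodings.
// UTF-8 uses two bytes per Cyrillic letter, KOI8-R uses one.
const std::string line_utf8  = "\xD0\xB4\xD0\xB0 \xD0\xBD\xD0\xB5\xD1\x82";
const std::string word_utf8  = "\xD0\xBD\xD0\xB5\xD1\x82";
const std::string line_koi8r = "\xC4\xC1 \xCE\xC5\xD4";
const std::string word_koi8r = "\xCE\xC5\xD4";
```

With these definitions, byte_offset_of(line_utf8, word_utf8) is 5, while byte_offset_of(line_koi8r, word_koi8r) is 3. The word starts at character 3 in both cases, so a byte offset is only meaningful together with the encoding it was counted in.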

> If there are problems, I could try to fix them

I don't think you can fix this on the Emacs side, because Emacs cannot
easily and/or quickly convert between bytes and characters in an
arbitrary multibyte encoding.

When I discovered this problem, I also tried fixing it on the Emacs
side first, but then I realized that this kind of solution has too
many problems, and instead fixed it in hunspell.

--- src/tools/hunspell.cxx~0    2011-01-21 19:01:29.000000000 +0200
+++ src/tools/hunspell.cxx      2013-02-07 10:11:54.443610900 +0200
@@ -710,13 +748,22 @@ if (pos >= 0) {
                        fflush(stdout);
                } else {
                        char ** wlst = NULL;
-                       int ns = pMS[d]->suggest(&wlst, token);
+                       int byte_offset = parser->get_tokenpos() + pos;
+                       int char_offset = 0;
+                       if (strcmp(io_enc, "UTF-8") == 0) {
+                               for (int i = 0; i < byte_offset; i++) {
+                                       if ((buf[i] & 0xc0) != 0x80)
+                                               char_offset++;
+                               }
+                       } else {
+                               char_offset = byte_offset;
+                       }
+                       int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
                        if (ns == 0) {
-                               fprintf(stdout,"# %s %d", token,
-                                   parser->get_tokenpos() + pos);
+                               fprintf(stdout,"# %s %d", token, char_offset);
                        } else {
                                fprintf(stdout,"& %s %d %d: ", token, ns,
-                                   parser->get_tokenpos() + pos);
+                                       char_offset);
                                fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], io_enc));
                        }
                        for (int j = 1; j < ns; j++) {
@@ -745,13 +792,23 @@ if (pos >= 0) {
                        if (root) free(root);
                } else {
                        char ** wlst = NULL;
+                       int byte_offset = parser->get_tokenpos() + pos;
+                       int char_offset = 0;
+                       if (strcmp(io_enc, "UTF-8") == 0) {
+                               for (int i = 0; i < byte_offset; i++) {
+                                       if ((buf[i] & 0xc0) != 0x80)
+                                               char_offset++;
+                               }
+                       } else {
+                               char_offset = byte_offset;
+                       }
                        int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
                        if (ns == 0) {
                                fprintf(stdout,"# %s %d", chenc(token, io_enc, ui_enc),
-                                   parser->get_tokenpos() + pos);
+                                   char_offset);
                        } else {
                                fprintf(stdout,"& %s %d %d: ", chenc(token, io_enc, ui_enc), ns,
-                                   parser->get_tokenpos() + pos);
+                                   char_offset);
                                fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], ui_enc));
                        }
                        for (int j = 1; j < ns; j++) {
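
For anyone who wants to try the conversion outside hunspell, the loop added by the patch above can be lifted into a standalone function. This is only a sketch: char_offset_for is a hypothetical name, and buf, byte_offset and io_enc stand in for the variables of the same names in the patch. The cast to unsigned char and the bounds assertion are additions made for this sketch and are not in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Convert a byte offset within `buf` into a character offset.
// Mirrors the logic of the patch: count characters only when the I/O
// encoding is UTF-8, otherwise return the byte offset unchanged.
int char_offset_for(const char* buf, int byte_offset, const char* io_enc) {
    // Added for the sketch: refuse offsets that point outside the buffer.
    assert(byte_offset >= 0 &&
           static_cast<std::size_t>(byte_offset) <= std::strlen(buf));

    if (std::strcmp(io_enc, "UTF-8") != 0)
        return byte_offset;  // pass-through branch, as in the patch

    int char_offset = 0;
    for (int i = 0; i < byte_offset; i++) {
        // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        // Every other byte starts a new character, so count only those.
        if ((static_cast<unsigned char>(buf[i]) & 0xc0) != 0x80)
            char_offset++;
    }
    return char_offset;
}
```

For the UTF-8 bytes of "да нет", where the second word starts at byte 5, char_offset_for(buf, 5, "UTF-8") gives 3. Note that the pass-through branch is exact only when the I/O encoding uses one byte per character.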
