Re: [bug-gnu-libiconv] iconv fails on large Greek files
From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] iconv fails on large Greek files
Date: Sun, 02 Oct 2022 00:04:34 +0200
Hello Wesley,
> The failure also occurs when the file does have known decomposed characters.
>
> WGroleau@MBP ~ % iconv --version
> iconv (GNU libiconv 1.16)
> WGroleau@MBP ~ % uname -a
> Darwin MBP.local 21.6.0 Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT
> 2022; root:xnu-8020.140.49~2/RELEASE_X86_64 x86_64
> WGroleau@MBP el % wc el.txt
> 179 975 8621 el.txt
> WGroleau@MBP el % iconv -f UTF8-MAC -t UTF-8 el.txt > /tmp/tmp
>
> iconv: el.txt:90:16: cannot convert
> WGroleau@MBP el % wc /tmp/tmp
> 89 457 4093 /tmp/tmp
> WGroleau@MBP el % iconv -f UTF-8 -t UTF8-MAC el.txt > /tmp/tmp
> WGroleau@MBP el % wc /tmp/tmp
> 179 1029 9537 /tmp/tmp
> WGroleau@MBP el % iconv -f UTF8-MAC -t UTF-8 el.txt > /tmp/tmp
>
> iconv: el.txt:90:16: cannot convert
> WGroleau@MBP el % wc el.txt
> 179 975 8621 el.txt
> WGroleau@MBP el % tail -$((179-90+2)) el.txt > el+.txt
> WGroleau@MBP el % wc el+.txt
> 90 522 4558 el+.txt
> WGroleau@MBP el % iconv -f UTF8-MAC -t UTF-8 el+.txt > /tmp/tmp
>
> iconv: el+.txt:84:36: cannot convert
> WGroleau@MBP el % wc /tmp/tmp
> 83 469 4093 /tmp/tmp
> WGroleau@MBP el % iconv -f UTF-8 -t UTF8-MAC el.txt > /tmp/tmp
> WGroleau@MBP el % iconv -f UTF8-MAC -t UTF-8 /tmp/tmp > temp.txt
>
> iconv: /tmp/tmp:161:7: cannot convert
> WGroleau@MBP el % wc temp.txt
> 160 835 7390 temp.txt
> WGroleau@MBP el % wc /tmp/tmp
> 179 1029 9537 /tmp/tmp
Apparently, the failures occur only when you use the 'UTF8-MAC' encoding.
In that case you need to report them to Apple, because GNU libiconv does
not have this encoding; it was added by Apple in the macOS version
of GNU libiconv.
> The failure usually occurs after processing APPROX. 4000 bytes,
> but occasionally approx. 8000.
I decided not to integrate Apple's code upstream because
* UTF8-MAC is a workaround for Apple's design mistakes: although
the W3C says that decomposed Unicode should not be user-visible,
Apple made it user-visible in HFS+. They ought to have hidden
it in their file system routines instead.
* The code that Apple added to GNU libiconv looked buggy to me. I am
not surprised at all that you have succeeded in finding a reproducer
for these bugs. You are probably the first one because most people
use iconv in this way only to convert file names, and file names are
shorter than 4000 bytes.
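To illustrate what "decomposed" means for Greek text, here is a minimal Python sketch. It uses the standard unicodedata module and standard NFD as a stand-in; that UTF8-MAC behaves like NFD for these letters is an assumption, since Apple's decomposition is a variant of NFD rather than NFD itself.

```python
import unicodedata

# Greek small letter alpha with tonos, precomposed: a single code point.
precomposed = "\u03ac"

# Canonical decomposition (NFD): alpha followed by a combining acute accent.
decomposed = unicodedata.normalize("NFD", precomposed)

# Show the code points of each form.
print([hex(ord(c)) for c in precomposed])
print([hex(ord(c)) for c in decomposed])

# The decomposed form needs more bytes in UTF-8, which is consistent with
# the converted file in the transcript being larger than the original.
print(len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

Both strings render as the same letter, which is why the difference is normally invisible until byte counts or file name comparisons are involved.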
As a workaround, you can use 'uconv -x NFC', where uconv is a program
that is part of ICU.
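If ICU is not installed, the same kind of normalization can be sketched in Python. This is not what uconv does internally, only an approximation under the assumption that the input is valid UTF-8 and that standard NFC is the desired result; the function name normalize_file_to_nfc is made up for this example.

```python
import unicodedata
from pathlib import Path


def normalize_file_to_nfc(src: str, dst: str) -> None:
    """Read src as UTF-8 and write its NFC-normalized form to dst."""
    # Read the whole file at once, so that a base letter and its combining
    # accents can never end up split across two separately processed chunks.
    text = Path(src).read_bytes().decode("utf-8")
    composed = unicodedata.normalize("NFC", text)
    # Write bytes directly so that line endings are preserved as they are.
    Path(dst).write_bytes(composed.encode("utf-8"))
```

For the file from the report, a call would look like normalize_file_to_nfc("el.txt", "/tmp/tmp"). Reading everything into memory is fine for a file of a few kilobytes; for very large inputs a streaming approach would need care at chunk boundaries.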
Best regards,
Bruno