--- Begin Message ---
Subject: grep doesn't match diacritical chars in ISO-8859 files
Date: Fri, 2 Oct 2015 11:43:58 +0200
User-agent: Mutt/1.5.23 (2014-03-12)
Hi,
In addition to http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19230, several
Debian users report that grep doesn't match characters with diacritical
marks in ISO-8859 files when run in a Unicode environment:
% file /tmp/q.h
/tmp/q.h: ISO-8859 text
% grep c /tmp/q.h
Coincidencia en el fichero binario /tmp/q.h
% grep -a c /tmp/q.h
struct cara* lcaras; //array de caras, habr� que usar reserva dinamica de
memoria.
% grep á /tmp/q.h
% grep -a á /tmp/q.h
grep does match the "á" pattern if the pattern is read from an ISO-8859 file:
% grep -f a q.h
Coincidencia en el fichero binario q.h
Test files attached
Full report:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800670
Regards,
Santiago
-- System Information:
Debian Release: stretch/sid
APT prefers squeeze-lts
APT policy: (500, 'squeeze-lts'), (500, 'oldoldstable'), (500, 'unstable'),
(500, 'testing'), (500, 'oldstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386
Kernel: Linux 3.16.0-4-amd64 (SMP w/4 CPU cores)
Locale: LANG=es_CO.utf8, LC_CTYPE=es_CO.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: sysvinit (via /sbin/init)
Versions of packages grep depends on:
ii dpkg 1.18.1
ii install-info 6.0.0.dfsg.1-3
ii libc6 2.19-19
ii libpcre3 2:8.35-7
Attachment: q.h
Description: Text Data
--- End Message ---
--- Begin Message ---
Subject: Re: bug#21604: grep doesn't match diacritical chars in ISO-8859 files
Date: Fri, 2 Oct 2015 13:01:04 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.1.0
On 10/02/2015 02:43 AM, Santiago Ruano Rincón wrote:
> grep doesn't match characters with diacritical
> marks in ISO-8859 files, inside a Unicode environment
That is normal and expected behavior. In a UTF-8 locale, "á" is
represented by the two bytes 0xC3 and 0xA1. In an ISO-8859 file, the
same character is represented by the single byte 0xE1. The UTF-8
pattern won't match the ISO-8859 representation.
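The byte-level mismatch can be checked directly. The following is a minimal Python sketch, not anything grep itself runs. It assumes ISO-8859-1 as the specific variant (the report says only "ISO-8859"), and the sample line is abridged from the q.h output in the report.

```python
# Assumption: the file is ISO-8859-1 (Latin-1); the report only says "ISO-8859".
A_ACUTE = "\u00e1"  # the character "á"

# Bytes the shell hands to grep as the pattern in a UTF-8 locale.
pattern_utf8 = A_ACUTE.encode("utf-8")
# Byte actually stored for the same character in an ISO-8859-1 file.
pattern_latin1 = A_ACUTE.encode("iso-8859-1")

# Sample line (abridged from the report), encoded as it would be on disk.
line_latin1 = ("habr" + A_ACUTE + " que usar reserva dinamica").encode("iso-8859-1")

# A byte-wise search with the UTF-8 pattern fails; the Latin-1 pattern succeeds.
utf8_pattern_found = pattern_utf8 in line_latin1
latin1_pattern_found = pattern_latin1 in line_latin1

print(pattern_utf8.hex(), pattern_latin1.hex())
print(utf8_pattern_found, latin1_pattern_found)
```

This also accounts for the "grep -f" observation in the report: a pattern read from an ISO-8859 file carries the single byte 0xE1, so it lines up with the file's bytes.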
To avoid this problem, switch to an ISO-8859 locale before using grep to
read ISO-8859 text files. This is true for pretty much any standard
utility, not just grep. Alternatively, you can translate the text files
from ISO-8859 to UTF-8, before giving the resulting text to grep or to
other utilities.
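The translation approach can be sketched as follows. This is an illustrative Python example under the assumption that the input is ISO-8859-1; the helper name latin1_to_utf8 and the sample bytes are hypothetical and not taken from the report's attachment.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 (hypothetical helper)."""
    # Every byte value is a valid ISO-8859-1 character, so decode cannot fail.
    return data.decode("iso-8859-1").encode("utf-8")

# Sample bytes as they might appear in an ISO-8859-1 file (0xE1 is "á").
original = b"habr\xe1 que usar reserva dinamica de memoria.\n"
converted = latin1_to_utf8(original)

# The pattern as typed in a UTF-8 locale: bytes 0xC3 0xA1.
needle = "\u00e1".encode("utf-8")

found_before = needle in original
found_after = needle in converted
print(found_before, found_after)
```

On the command line, iconv performs the same conversion, for example "iconv -f ISO-8859-1 -t UTF-8 /tmp/q.h | grep á", again assuming the file really is ISO-8859-1.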
--- End Message ---