bug-grep


From: Paul Eggert
Subject: bug#28255: grep erroneously skips Microsoft UTF-8 text files as being binary
Date: Sun, 27 Aug 2017 17:18:47 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1

Simon wrote:
Sorry my description was slightly ambiguous.  I should not have said
skip so much as treats the file as binary and does not find a match
because each character takes 2 octets as per utf-8.

$ mkdir tmp
$ cd tmp
$
$ printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\000' >1.txt
$ printf 'test2\r\n' >2.txt
$
$ hexdump -C 1.txt
00000000  ff fe 74 00 65 00 73 00  74 00 31 00 0d 00 0a 00  |..t.e.s.t.1.....|
00000010
$ hexdump -C 2.txt
00000000  74 65 73 74 32 0d 0a                              |test2..|
00000007
$
$ grep --include=*.txt test *
2.txt:test2
$

I've made the two files as they appear on a Windows system (since lots
of us move lots of files between operating systems).  As you can see,
"1.txt" is skipped because its characters are encoded as two octets
each.

As an example that "1.txt" is a valid Windows text file, if you edit
"1.txt" with Notepad on a Windows system, Notepad will detect BOM at the
beginning and switch to UTF-8 encoding, and preserve it upon saving.

That is, UTF-8 (BOM + 2 octet characters) is an acceptable text file
format for Windows text files.  (I can only confirm Win 7 or higher.)

I guess this should really be considered a feature, not a bug.

Something similar happens with Cygwin grep running under Windows.

You're right. grep and most other GNU tools do not support UTF-16. You can use the 'recode' command to convert to UTF-8, which grep does support.
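
The conversion step can be sketched as follows. This is a minimal sketch that assumes 'iconv' is installed and uses it in place of 'recode'; the temporary directory and the 'result' variable are only illustrative. It recreates the sample "1.txt" from the report, converts it to UTF-8 on the fly, and passes the result to grep.

```shell
# Recreate the sample from the report: BOM (ff fe) followed by "test1\r\n",
# two octets per character.
tmpdir=$(mktemp -d)
printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\000' >"$tmpdir/1.txt"

# Assumption: iconv stands in for recode here. It reads the BOM to pick the
# byte order and writes UTF-8; tr drops the Windows carriage return before
# grep searches the converted text.
result=$(iconv -f UTF-16 -t UTF-8 "$tmpdir/1.txt" | tr -d '\r' | grep test)
printf '%s\n' "$result"

# Remove the scratch directory.
rm -rf "$tmpdir"
```

Converting in a pipeline leaves the original file untouched, which may be preferable when the file has to go back to a Windows system unchanged.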
