bug#28255: grep erroneously skips Microsoft UTF-8 text files as being binary
Sun, 27 Aug 2017 17:18:47 -0700
Sorry, my description was slightly ambiguous. Rather than saying grep
skips the file, I should have said that it treats the file as binary and
does not find a match, because each character takes 2 octets as per utf-8.
$ mkdir tmp
$ cd tmp
$ printf 'test2\r\n' >2.txt
$ hexdump -C 1.txt
00000000 ff fe 74 00 65 00 73 00 74 00 31 00 0d 00 0a 00
$ hexdump -C 2.txt
00000000 74 65 73 74 32 0d 0a |test2..|
$ grep --include=*.txt test *
I've made the two files as they appear on a Windows system (since lots
of us move lots of files between operating systems). As you can see,
"1.txt" is skipped because the characters are encoded two octets per
character.
As an example that "1.txt" is a valid Windows text file, if you edit
"1.txt" with Notepad on a Windows system, Notepad will detect the BOM at
the beginning, switch to UTF-8 encoding, and preserve it upon saving.
That is, UTF-8 (BOM + 2 octet characters) is an acceptable text file
format for Windows text files. (I can only confirm Win 7 or higher.)
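The BOM detection described above can be sketched from the shell. This is only an illustration: it assumes "1.txt" contains exactly the 16 octets shown in the hexdump (the printf line recreates them), and it relies on head -c and od being available. Note that the octets ff fe are the byte order mark of little-endian UTF-16.

```shell
# Recreate the 16 octets shown in the hexdump of 1.txt
# (assumption: this matches the original file byte for byte).
printf '\377\376t\000e\000s\000t\0001\000\r\000\n\000' > 1.txt

# Read the first two octets as hex digits with the spacing removed.
bom=$(head -c 2 1.txt | od -An -tx1 | tr -d ' \n')

# ff fe marks little-endian UTF-16; fe ff marks big-endian UTF-16.
case "$bom" in
  fffe) echo "UTF-16LE BOM" ;;
  feff) echo "UTF-16BE BOM" ;;
  *)    echo "no UTF-16 BOM" ;;
esac
```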
I guess this should really be considered a feature, not a bug.
The same happens with Cygwin grep running under Windows.
You're right. grep and most other GNU tools do not support UTF-16. You can use
the 'recode' command to convert to UTF-8, which grep does support.
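As a rough sketch of that conversion, the following uses iconv rather than recode (a substitution made here on the assumption that iconv is more commonly installed). It also assumes "1.txt" holds the 16 octets from the hexdump earlier in the thread, which the printf line recreates.

```shell
# Recreate the UTF-16 file from the hexdump in the report
# (assumption: same 16 octets, BOM included).
printf '\377\376t\000e\000s\000t\0001\000\r\000\n\000' > 1.txt

# Convert to UTF-8 on the fly and search the converted text.
# iconv uses the BOM to pick the byte order when given plain "UTF-16".
iconv -f UTF-16 -t UTF-8 1.txt | grep test

# Or keep a converted copy next to the original and search that.
iconv -f UTF-16 -t UTF-8 1.txt > 1.utf8.txt
grep test 1.utf8.txt
```

With the file converted, the line containing "test1" should match, although the matched line still ends in a carriage return because the file uses Windows line endings.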