Hello,
I have trouble with locale handling. I run gawk on Linux; the locale is set to UTF-8, but I have to process text files in CP1250 encoding. I have no way to disable locale handling from within the AWK script, and I miss such an option.
This is a sentence from the gawk documentation:
> Gawk is multibyte aware. This means that index(), length(), substr() and match() all work in terms of characters, not bytes.
I already learned that I can start gawk with the "-b" switch or with LC_ALL set, like 'LC_ALL=C gawk -f script.awk data',
but there is no way to verify from the AWK script that the -b switch was used (I can only get the value of LC_ALL via ENVIRON["LC_ALL"]). The problem is that when the user forgets to activate "binary mode" with the -b switch, the result of parsing is wrong, because AWK removes "extended" ASCII characters from the results returned by substr(), length() returns a wrong value, etc.
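To make the LC_ALL workaround harder to forget, the call could be wrapped in a small shell function. This is only a sketch: the name awk_bytes is made up for illustration, and it assumes that awk on the PATH is gawk (or another awk that honours LC_ALL).

```shell
#!/bin/sh
# awk_bytes: hypothetical wrapper that always runs awk in the C locale,
# so that length(), substr(), index() and match() count bytes.
awk_bytes() {
    # The assignment applies only to this one awk process.
    LC_ALL=C awk "$@"
}

# Example call (same arguments as a plain awk call):
#   awk_bytes -f demo1.awk test.txt
```

This does not solve the problem inside the script itself; it only moves the locale setting into a place where the user cannot forget it.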
I assume LANG=en_US.UTF-8 and that LC_ALL is empty.
This is confusing: when I use 'printf "%s", $0;', I see all extended characters in the output, but when I run 'for (I=1; I<=length(); I++) printf "%c", substr($0,I,1);' some characters are missing (those with code > 127; I guess they are treated as invalid UTF-8 sequences).
I already tried to use BINMODE="r" but it doesn't affect substr().
So, I miss a way to force gawk to treat strings in terms of bytes, not characters, and to activate such an option from the script.
I have a simple demo that shows the difference when a byte with code 0x80 (which could be the EUR symbol) is in the data file. I am interested in the first case: I think there is no way to configure AWK (from the script) to process the file correctly, to get the EUR symbol at position 15, or to detect that something is wrong...
$ awk -f demo1.awk test.txt
Price is 35.12� (EUR).
Price is 35.12 (EUR).
U
$ awk -b -f demo1.awk test.txt
Price is 35.12� (EUR).
Price is 35.12� (EUR).
E
$ hexdump -C test.txt
00000000  50 72 69 63 65 20 69 73  20 33 35 2e 31 32 80 20  |Price is 35.12. |
00000010  28 45 55 52 29 2e 0a                              |(EUR)..|
00000017
$ cat demo1.awk
{
    printf "%s\n", $0;
    for (I=1; I<=length(); I++) printf "%c", substr($0,I,1);
    printf "\n";
    printf "%18s\n", substr($0,18,1);
}
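
One indirect check that might work, as a sketch only: compare length() of a known two-byte UTF-8 sequence. In byte mode (-b or LC_ALL=C) it should be 2, while in a UTF-8 locale without -b gawk should count it as 1 character. The helper name byte_mode_guard and the error message are made up for illustration, and I have not checked this across gawk versions. It only detects the mode; it does not switch it.

```shell
#!/bin/sh
# byte_mode_guard: hypothetical helper. Prints "byte mode" when awk string
# functions count bytes; otherwise prints an error and exits with status 2.
byte_mode_guard() {
    awk 'BEGIN {
        # "\303\251" is the two-byte UTF-8 encoding of a single character.
        # Counting bytes gives 2; counting characters in UTF-8 gives 1.
        if (length("\303\251") != 2) {
            print "error: run with -b or LC_ALL=C" > "/dev/stderr"
            exit 2
        }
        print "byte mode"
    }'
}
```

The same BEGIN rule could be placed at the top of the real script, so that it stops early instead of silently producing wrong positions.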