bug-gawk

[bug-gawk] Regex treatment of NUL characters within fields


From: Matt Wenham
Subject: [bug-gawk] Regex treatment of NUL characters within fields
Date: Sun, 29 Mar 2015 20:51:30 +0100

I have found a use case which has made me unsure as to how gawk 4.1.1
treats NUL characters within fields and how they are parsed by the
regex engine.

I have a series of files which I am trying to process and validate
using gawk. A small number of the files are corrupt and contain runs
of NUL characters which I would like to reject as invalid.

I tried the following code:

BEGIN {
    FS = "[#/]"   # Split at hash or slash
    OFS = ":"
}

$10 ~ "^7$" {
    print NR, $10
}

This matches a tenth field consisting of the digit '7' followed by a
run of NULs, even though the pattern ends with the '$' anchor.
However, using

$10 ~ "^7\0+$"

fails to match the same tenth field despite the explicitly specified
NUL character. From everything I've read, this is unexpected
behaviour.

I am using the GnuWin32 build in this case. I asked about the issue on
Stack Overflow, and another user has found that this behaviour does not
occur with gawk 3.1.5 on CentOS 5, but does occur with gawk 4.1.1 on
Debian unstable.

Is this expected behaviour? If so, why? Is it possible to successfully
match NUL characters in 4.1.1?
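
In the meantime, the corrupt files could be screened out before gawk
sees them, without involving the regex engine at all. The following is
only a minimal sketch under some assumptions: a POSIX shell with tr and
wc available, and a helper name (has_nul) that is hypothetical and not
part of the script above. It swaps the regex test for a byte-count
comparison, so it rejects a whole file rather than a single field.

```shell
# has_nul FILE (hypothetical helper): exit status 0 if FILE contains
# at least one NUL byte, non-zero otherwise.
has_nul() {
    # Byte count of the file as it is.
    total=$(( $(wc -c < "$1") ))
    # Byte count after deleting every NUL byte.
    stripped=$(( $(tr -d '\000' < "$1" | wc -c) ))
    # The counts differ only if at least one NUL was removed.
    [ "$total" -ne "$stripped" ]
}

# Possible usage (file names are placeholders):
#   for f in *.dat; do has_nul "$f" && echo "rejecting $f"; done
```

This does not answer the regex question, but it keeps the validation
step independent of how a given gawk build treats NULs.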

Many thanks,

Dr. Matt Wenham.


