[bug-gawk] Regex treatment of NUL characters within fields

From:	Matt Wenham
Subject:	[bug-gawk] Regex treatment of NUL characters within fields
Date:	Sun, 29 Mar 2015 20:51:30 +0100

I have found a use case which has made me unsure as to how gawk 4.1.1
treats NUL characters within fields and how they are parsed by the
regex engine.

I have a series of files which I am trying to process and validate
using gawk. A small number of the files are corrupt and contain runs
of NUL characters which I would like to reject as invalid.

I tried the following code:

BEGIN {
    FS = "[#/]"   # Split at hash or slash
    OFS = ":"
}

$10 ~ "^7$" {
    print NR, $10
}

This matches records in which the tenth field is the digit '7'
followed by a run of NULs. However, using

$10 ~ "^7\0+$"

fails to match the same tenth field despite the explicitly specified
NUL character. From everything I've read, this is unexpected
behaviour.
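For anyone wanting to try this without the original files, the following is a minimal reproduction sketch, not the actual data. The sample record, the AWK variable, and the helper names make_record and run_check are all made up for illustration. It builds one line whose tenth field is '7' followed by three NULs and reports whether each of the two patterns matches. No expected result is shown for the two pattern lines, since that is exactly what appears to differ between versions.

```shell
#!/bin/sh
# Assumption: AWK defaults to "awk"; set AWK=gawk to test gawk specifically.
AWK=${AWK:-awk}

# Hypothetical sample record: nine filler fields, then '7' plus three NULs.
make_record() {
    printf 'a#b/c#d/e#f/g#h/i#7\0\0\0\n'
}

# Split on hash or slash, then test field 10 against both patterns.
run_check() {
    make_record | "$AWK" '
        BEGIN { FS = "[#/]" }
        {
            printf "NF=%d\n", NF
            printf "anchored 7 only: %s\n", ($10 ~ "^7$" ? "match" : "no match")
            printf "7 plus NULs:     %s\n", ($10 ~ "^7\0+$" ? "match" : "no match")
        }'
}

run_check
```

Running it as AWK=gawk sh script.sh under each gawk version should make the two behaviours easy to compare side by side.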

I am using GnuWin32 in this case. I asked about the issue on
Stack Overflow, and another user has found that this behaviour does not
occur with gawk 3.1.5 on CentOS 5, but does occur with gawk 4.1.1 on
Debian unstable.

Is this expected behaviour and, if so, why? Is it possible to
successfully parse NUL characters in 4.1.1?

Many thanks,

Dr. Matt Wenham.
