bug-gawk

Re: Clang-built Gawk 5.2.1 regex oddity


From: Sam James
Subject: Re: Clang-built Gawk 5.2.1 regex oddity
Date: Sat, 31 Dec 2022 21:38:32 +0000


> On 30 Dec 2022, at 09:13, arnold@skeeve.com wrote:
> 
> Hi.
> 
> Thanks for the report.
> 
> Although the dfa and regex code changed some between releases,
> this smells strongly like a compiler issue and not a gawk issue.
> 
> I suggest first that you try compiling with clang but without
> optimization. After running configure, edit the top level Makefile *and*
> support/Makefile and remove any -O flags.  Then build.

Kenton mentioned to me that with no optimisation, it works okay.

> 
> If the bug goes away, it's definitely a clang issue.
> 

It _probably_ is, but it's also possible that it's undefined behaviour (UB)
in the code. I built with Clang and UBSAN (as did Kenton), and we both got
this when running the command he posted:
```
$ ./configure CC=clang CFLAGS="-O2 -fsanitize=undefined -ggdb3" \
    LDFLAGS="-fsanitize=undefined -ggdb3"
$ make
$ export UBSAN_OPTIONS=print_stacktrace=1
$ ./gawk 'BEGIN { RS="[[][:blank:]]" }'
dfa.c:1141:6: runtime error: execution reached an unreachable program point
    #0 0x5db652 in parse_bracket_exp /tmp/gawk/support/dfa.c:1141:6
    #1 0x5c241a in lex /tmp/gawk/support/dfa.c:1543:37
    #2 0x5dc8f1 in atom /tmp/gawk/support/dfa.c:1888:24
    #3 0x5dc8f1 in closure /tmp/gawk/support/dfa.c:1961:3
    #4 0x5dc022 in branch /tmp/gawk/support/dfa.c:2002:3
    #5 0x5c7082 in regexp /tmp/gawk/support/dfa.c:2014:3
    #6 0x5c0e32 in dfaparse /tmp/gawk/support/dfa.c:2042:3
    #7 0x5c76c2 in dfacomp /tmp/gawk/support/dfa.c:3812:5
    #8 0x5abb33 in make_regexp /tmp/gawk/re.c:272:3
    #9 0x56dffd in set_RS /tmp/gawk/io.c:4092:14
    #10 0x50510b in r_interpret /tmp/gawk/./interpret.h
    #11 0x5754d7 in main /tmp/gawk/main.c:538:3
    #12 0x7f7bb5df464f in __libc_start_call_main /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #13 0x7f7bb5df4708 in __libc_start_main@GLIBC_2.2.5 /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../csu/libc-start.c:381:3
    #14 0x4092a4 in _start /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/x86_64/start.S:115

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior dfa.c:1141:6 in # (yes, this is cut off, I don't know why!)
```

If I build with Clang and ASAN instead:
```
$ ./configure CC=clang CFLAGS="-O2 -fsanitize=address -ggdb3" \
    LDFLAGS="-fsanitize=address -ggdb3"
$ make
$ ./gawk 'BEGIN { RS="[[][:blank:]]" }'
=================================================================
==1517313==ERROR: AddressSanitizer: unknown-crash on address 0x7fa647137000 at pc 0x000000658214 bp 0x7ffe59482ad0 sp 0x7ffe59482ac8
READ of size 8 at 0x7fa647137000 thread T0
    #0 0x658213 in setbit /tmp/gawk/support/dfa.c:746:33
    #1 0x658213 in setbit_case_fold_c /tmp/gawk/support/dfa.c:868:7
    #2 0x658213 in parse_bracket_exp /tmp/gawk/support/dfa.c:1095:27
    #3 0x64b6d0 in lex /tmp/gawk/support/dfa.c:1543:37
    #4 0x6588dd in atom /tmp/gawk/support/dfa.c:1888:24
    #5 0x6588dd in closure /tmp/gawk/support/dfa.c:1961:3
    #6 0x64d84c in branch /tmp/gawk/support/dfa.c:2002:3
    #7 0x64d84c in regexp /tmp/gawk/support/dfa.c:2014:3
    #8 0x64aad6 in dfaparse /tmp/gawk/support/dfa.c:2042:3
    #9 0x64dbb7 in dfacomp /tmp/gawk/support/dfa.c:3812:5
    #10 0x6404df in make_regexp /tmp/gawk/re.c:272:3
    #11 0x611b66 in set_RS /tmp/gawk/io.c:4092:14
    #12 0x5c693b in r_interpret /tmp/gawk/./interpret.h
    #13 0x616e6b in main /tmp/gawk/main.c:538:3
    #14 0x7fa646ccc64f in __libc_start_call_main /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #15 0x7fa646ccc708 in __libc_start_main@GLIBC_2.2.5 /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../csu/libc-start.c:381:3
    #16 0x420df4 in _start /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/x86_64/start.S:115

Address 0x7fa647137000 is a wild pointer inside of access range of size 0x000000000008.
SUMMARY: AddressSanitizer: unknown-crash /tmp/gawk/support/dfa.c:746:33 in setbit
Shadow bytes around the buggy address:
  0x7fa647136d80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fa647136e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fa647136e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fa647136f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fa647136f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x7fa647137000:[00]00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fa647137080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fa647137100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fa647137180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fa647137200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7fa647137280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==1517313==ABORTING
```

I'm testing with Clang from git (LLVM 16, commit
dfc20708bcdf7b4c4bea8595fc4ac8674634d5e6), but I got the same result when
I tried Clang 15. I'm pretty sure Kenton is using Clang 15 as well.

Of course, this might still be a Clang bug though. I don't see this with
GCC but that's not proof either way. So if this all looks impossible, one
of us can forward it up to Clang and see what they say.

> In any case, in the gawk repo in helpers/testdfa.c is a program that
> may be useful for further isolating the problem, since it extracts
> the regex building and matching from the rest of gawk's code. If
> the problem persists with that program, it will be of more use
> in making a bug report to the clang team.
> 

Unfortunately, no matter what input I give to testdfa,
it seems to say "malloc failed", e.g.
```
$ ./testdfa 'a'
Ignorecase: false
Syntax: RE_BACKSLASH_ESCAPE_IN_LISTS|RE_CHAR_CLASSES|RE_CONTEXT_INDEP_ANCHORS|RE_DOT_NEWLINE|RE_INTERVALS|RE_NO_BK_BRACES|RE_NO_BK_PARENS|RE_NO_BK_VBAR|RE_NO_EMPTY_RANGES|RE_UNMATCHED_RIGHT_PAREN_ORD|RE_INVALID_INTERVAL_ORD
Pattern: /a/, len = 1
setup_pattern: malloc failed
```

This happens even if testdfa is built with GCC (12.2.1_20221224).

Best,
sam
