help-bash

Re: What is wrong with a regex?


From: Koichi Murase
Subject: Re: What is wrong with a regex?
Date: Sat, 4 Feb 2023 14:20:20 +0900

On Sat, Feb 4, 2023 at 14:06, Leonid Isaev <leonid.isaev@ifax.com> wrote:
> On Fri, Feb 03, 2023 at 10:08:29PM -0600, Dennis Williamson wrote:
> > Bash (and grep) don't allow an empty subexpression.
> >
> >  f=foo; [[ $f =~ (|o) ]]; echo $?; echo "${BASH_REMATCH}"
> > 2
> >
> > $ echo foo | grep -E '(|o)'
> > grep: empty (sub)expression
>
> [...]
>
> I-orca--05:02-~-> f=foo; [[ $f =~ (|o) ]]; echo $?
> 0
>
> I-orca--05:03-~-> grep -E "(|o)" <<< foo
> foo
>
> [...]
> WTF?

Bash relies on the system library <regex.h> for regular expressions,
so I guess the Bash version has little to do with this difference in
behavior.
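To see what the local library does, something like the following rough
sketch might help. The function name probe_regex is just made up for
illustration. It relies only on the documented exit status of
[[ str =~ re ]]: 0 for a match, 1 for no match, and 2 when the pattern
is rejected as syntactically incorrect.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Bash): report how [[ =~ ]], and thus
# the underlying <regex.h>, treats a pattern on this system.
probe_regex() {
  local str=$1 re=$2 status=0
  # Capture the exit status of [[ ]] without aborting under "set -e".
  [[ $str =~ $re ]] || status=$?
  case $status in
    0) echo "match" ;;
    1) echo "no match" ;;
    2) echo "rejected as invalid" ;;
    *) echo "unexpected status $status" ;;
  esac
}

# The result of this call depends on the system's regex implementation.
probe_regex foo '(|o)'
```

The two results quoted above would correspond to "rejected as invalid"
and "match", respectively.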

I've checked the standard:

From POSIX XBD 9.5.3:
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_05_03
> /* --------------------------------------------
>    Extended Regular Expression
>    --------------------------------------------
> */
> extended_reg_exp   :                      ERE_branch
>                    | extended_reg_exp '|' ERE_branch
>                    ;
> ERE_branch         :            ERE_expression
>                    | ERE_branch ERE_expression
>                    ;
> ERE_expression     : one_char_or_coll_elem_ERE
>                    | '^'
>                    | '$'
>                    | '(' extended_reg_exp ')'
>                    | ERE_expression ERE_dupl_symbol
>                    ;
> one_char_or_coll_elem_ERE  : ORD_CHAR
>                    | QUOTED_CHAR
>                    | '.'
>                    | bracket_expression
>                    ;
> ERE_dupl_symbol    : '*'
>                    | '+'
>                    | '?'
>                    | '{' DUP_COUNT               '}'
>                    | '{' DUP_COUNT ','           '}'
>                    | '{' DUP_COUNT ',' DUP_COUNT '}'
>                    ;

According to the standard, `|' connects one or more <ERE_branch>es, an
<ERE_branch> is a sequence of one or more <ERE_expression>s, and every
production of <ERE_expression> seems to require at least one element.
This means that an empty branch, as in (|_x), is not covered by POSIX
ERE, and that its acceptance by GNU grep and by Bash regular
expressions with Glibc <regex.h> is an extension.
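If portability matters, one way to avoid the empty branch is to make
the group optional with `?', which is part of the grammar quoted above
(<ERE_dupl_symbol>). The sketch below assumes a hypothetical pattern,
"foo" optionally followed by "_x"; the function name
has_optional_suffix is also only illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if $1 is "foo" optionally followed by "_x".
has_optional_suffix() {
  # (_x)? instead of (|_x): both accept either nothing or "_x" here,
  # but only the former uses constructs from the POSIX ERE grammar.
  local re='^foo(_x)?$'
  [[ $1 =~ $re ]]
}

for s in foo foo_x foo_y; do
  if has_optional_suffix "$s"; then
    echo "$s: match"
  else
    echo "$s: no match"
  fi
done
```

This should report a match for foo and foo_x and no match for foo_y,
regardless of whether the regex library accepts empty branches.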

I also checked the behavior of Cygwin, where <regex.h> seems to be
implemented as a part of Newlib. Newlib <regex.h> also seems to
support an empty <ERE_branch> and thus (|_x).


