[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: bash "extglob" needs to upgrade at least like zsh "kshglob"

From: Koichi Murase
Subject: Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
Date: Mon, 28 Nov 2022 19:51:20 +0900

2022年11月23日(水) 5:24 Chet Ramey <chet.ramey@case.edu>:
> I attached the latest patch against bash-5.2.9.

> commit 3c9dd4565792bc53de3a94ec38a65a1989f3fe2f (upstream/devel)
>     associative array elements; last set of changes to globbing
>     bracket expressions; fix for timing subshell commands

Thank you for the discussion and for applying the changes.  Besides, I
am sorry that I still have a discussion on the behavior of BRACKMATCH,
so it was not the last set of changes.  After the above fix, I moved
to check the behavior related to PATSCAN, where I found inconsistent
results related to the difference between BRACKMATCH and PATSCAN in
parsing the bracket expressions.  I checked also other parts of the
codebase and found additional inconsistencies.


First, let me introduce the symbols (A)..(D) to later reference the
implementations of the bracket expression in the codebase.  There are
four independent codes that implement rules for extracting the bracket
expression in the current codebase:

- (A) The main loop of BRACKMATCH: This handles sub-expressions of a
  bracket expression when a matching sub-expression is not found.

- (B) The section of the `matched' label in BRACKMATCH: This handles
  sub-expressions of the bracket expression after a matching
  sub-expression is found.

- (C) PATSCAN: This skips bracket expressions to determine the end of
  the extglob constructs of the form @(...), ?(...), +(...), etc.

- (D) MATCHLEN (lib/glob/gm_loop.c): This function handles bracket
  expressions to count the number of characters that a fixed-length
  pattern can match.

Actually, each of the four implements a distinct rule, which does not
match any of the other three.  These implementations need to be
adjusted to support an identical and consistent rule.


The differences between (A)..(D) cause various strange behaviors.

1. Strange behavior caused by an inconsistency between (A/B) and (C)

  This is what I was first faced with.  The following shows the result
  of [example4.sh] with the current devel, where column 4
  `{yes,no}/{yes,no}' shows `(result)/(what I expect)':

  #1: pat=@([[.].])A])         str=]                no/yes
  #2: pat=@([[.].])A])         str===]A])           no/no
  #3: pat=@([[.].])A])         str=AA])             yes/no
  #4: pat=@([[=]=])A])         str=]                no/no
  #5: pat=@([[=]=])A])         str===]A])           no/yes
  #6: pat=@([[=]=])A])         str=AA])             yes/no

  Obvious strange behaviors can be found in cases #3 and #6, where `A'
  matches twice even if there is only one `A' and no repetition such
  as `*()' or `+()' in the pattern.  This is because PATSCAN (C)
  considers the bracket expression to be `[[.].]' while BRACKMATCH
  (A/B) considers the bracket expression to be `[[.].])A]'.  First,
  PATSCAN extracts `@([[.].])', but BRACKMATCH next matches the first
  `A' in the input string using a pattern character `A' outside the
  construct `@()'.  Finally, the remaining part `A])' in the pattern
  is matched literally.

2. Inconsistency between (A) and (B):

  To fix the above item for (A/B) vs (C), I checked the detailed
  behaviors of both and found this.  The parsing of [.ch.], [=a=], and
  [:var:] are not totally consistent before and after a matching
  sub-expression is found.  The following is the second section of the
  result of [example4.sh]:

  --- BRACKMATCH: after match vs before match ---
  #7: pat=[[=]=]ab]            str=a                yes/no
  #8: pat=[[.[=.]ab]           str=a                yes/yes
  #9: pat=[[.[==].]ab]         str=a                yes/yes

  #10: pat=[a[=]=]b]            str=a                no/no
  #11: pat=[a[.[=.]b]           str=a                no/yes
  #12: pat=[a[.[==].]b]         str=a                no/yes

  #13: pat=[a[=]=]b]            str=b                yes/no
  #14: pat=[a[=]=]b]            str=a=]b]            yes/yes
  #15: pat=[a[.[=.]b]           str=b                yes/yes
  #16: pat=[a[.[=.]b]           str=ab]              yes/no
  #17: pat=[a[.[==].]b]         str=b                yes/yes
  #18: pat=[a[.[==].]b]         str=ab]              yes/no

  Cases #7..#9 succeeds, which means that `[=]=]', `[.[=.]', and
  `[.[==].]' form an equivalence class and collating symbols in
  BRACKMATCH (A).  However, cases #10..#12 (which are the bracket
  expressions of the same sub-expression with different ordering)
  fail, which means that `[=]=]', `[.[=.]', and `[.[==].]'  do not
  form an equivalence class or a collating symbol in BRACKMATCH (B).

  Also, cases #13 vs #14, #15 vs #16, and #17 vs #18 demonstrate that
  the same pattern consisting of bracket expressions and normal
  characters can match different numbers of characters.  This means
  that the boundary of the bracket expression can change depending on
  the input string.

  Actually, these patterns are undefined by the standard because an
  equivalence class shall not contain `]' for cases #7, #10, #13, and
  #14, and the opening `[.' and `[=' shall be followed by the
  corresponding `.]` and `=]`, respectively, for the other cases.
  Nevertheless, even if the behavior is undefined, I expect at least
  the same results for pairs (#7, #10), (#8, #11), and (#9, #12),
  respectively.  I also expect that only one from each pair (#13,
  #14), (#15, #16), or (#17, #18) succeeds.  Otherwise, we cannot
  determine the range of the bracket expression before seeing the
  input string.

3. Differences for incomplete [:ccname:], [=a=], and [:var:] within (A)

  In trying to implement a common implementation for (A)..(D), I also
  noticed that the behavior of incomplete [:cclass:], [=a=], and
  [.ch.] are different from one another within BRACKMATCH (A):

  --- incomplete POSIX brackets ---
  #19: pat=x[a[:y]              str=x[               no/???
  #20: pat=x[a[:y]              str=x:               yes/???
  #21: pat=x[a[:y]              str=xy               yes/???
  #22: pat=x[a[:y]              str=x[ay             no/???

  #23: pat=x[a[.y]              str=x[               no/???
  #24: pat=x[a[.y]              str=x.               no/???
  #25: pat=x[a[.y]              str=xy               no/???
  #26: pat=x[a[.y]              str=x[ay             yes/???

  #27: pat=x[a[=y]              str=x[               yes/???
  #28: pat=x[a[=y]              str=x=               yes/???
  #29: pat=x[a[=y]              str=xy               yes/???
  #30: pat=x[a[=y]              str=x[ay             no/???

  These special POSIX bracket expressions ([:cclass:], [=a=], and
  [.ch.]) are implemented separately in BRACKMATCH (A).  On the other
  hand, these special POSIX bracket expressions ([:cclass:], [=a=],
  and [.ch.]) are handled together in the other parts (B)..(D).

  The variations in the behaviors of [:cclass:], [=a=], and [.ch.]  in
  (A) does not actually violate the standard because the behavior is
  undefined by the standard when there is no corresponding ending
  brackets `:]', `.]', or `=]'.  However, if we would keep these
  variations of the behavior in BRACKMATCH (A), we then need to
  implement these seemingly random variations also in (B)..(D) to
  match the behaviors of (A)..(D).  Instead, I think the opposite
  would be more reasonable, i.e., to change (A) to handle incomplete
  [:cclass:], [=a=], and [.ch.] in a consistent manner and match its
  behavior to (B)..(D).

  However, it is still unclear what would be the preferred behavior
  because each of (B)..(D) implements its distinct rule.  I instead I
  checked the behaviors of other shells/implementations.  Here is the
  summary of the comparison:

  No. pattern input | bash  fnmatch/osh  zsh  ksh yash/busybox
  --- ------- ----- | ----- ------------ ---- --- ------------
  #19 x[a[:y] x[    | no        yes      yes  no  no
  #20 x[a[:y] x:    | yes       yes      yes  no  no
  #21 x[a[:y] xy    | yes       yes      yes  no  no
  #22 x[a[:y] x[ay  | yes       no       no   no  yes
  #23 x[a[.y] x[    | no        no       yes  no  no
  #24 x[a[.y] x.    | no        no       yes  no  no
  #25 x[a[.y] xy    | no        no       yes  no  no
  #26 x[a[.y] x[ay  | yes       no       no   no  yes
  #27 x[a[=y] x[    | yes       yes      yes  no  no
  #28 x[a[=y] x=    | yes       yes      yes  no  no
  #29 x[a[=y] xy    | yes       yes      yes  no  no
  #30 x[a[=y] x[ay  | no        no       no   no  yes

  The behavior of fnmatch(3) of GNU/Linux and osh (oilshell) is
  slightly different from that of Bash but are essentially similar.  I
  guess this is because the Bash implementation (seems to be) derived
  from a fnmatch(3) implementation.  I guess osh calls fnmatch(3)
  internally.  The other shells zsh, ksh, yash, and busybox sh produce
  consistent results for all of `[:cclass:]', `[=a=]', and `[.ch.]',
  though they differ from one another: zsh treats unclosed `[:', `[.',
  and `[=' as normal characters consisting of the bracket expression,
  ksh considers the entire pattern to be invalid and does not let it
  match anything, and yash and busybox sh consider the bracket
  expression to be invalid and let the first bracket `[' match

  I decided to choose the behavior of `zsh' because it is consistent
  for all [:cclass:], [.ch.], and [=a=] and closest to the current
  behavior of Bash.

Also, the following points are (partly) addressed in this report.

4. PATSCAN currently does not handle / in parsing bracket expressions

5. PATSCAN currently does not handle FNM_NOESCAPE.


I attach the first patch
[r0037.brackmatch8.incomplete-subbracket.patch.txt] against the devel
to solve the above "Repeat-By 2 and 3" of BRACKMATCH (A/B), where I
introduced a new helper function PARSE_SUBBRACKET to extract
[:cclass:], [.ch.], and [=a=] in the same way (except for the special
rule for [.].] in POSIX XCU  In the patch, I upgraded the
existing function PARSE_COLLSYM to PARSE_SUBBRACKET.  I initially
tried to add an independent function PARSE_SUBBRACKET and modify
PARSE_COLLSYM to use PARSE_SUBBRACKET internally, but the resulting
PARSE_COLLSYM became trivial, so I decided to remove the function.
Eventually, the patch became to have the current shape.

The fix for PATSCAN (C) is included in the second patch
[r0037.patscan1.parse_subbracket.patch.txt], which solves "Repeat-By 1
and 4" by using PARSE_SUBBRACKET of the first patch.  This patch
applies to devel after applying the first patch.  Actually, this is a
partial fix for "Repeat-By 4"; there are still failing cases after
applying this patch.  For the complete fix, I would like to use a
helper function introduced for the new DFA engine, so I will submit a
patch after it is determined whether the new DFA engine is accepted or

The third patch [r0037.patscan2.fnm_noescape.patch.txt] addresses
"Repeat-By 5". This is a single-line fix.  This patch applies to devel
after applying the second patch.

The adjustments of MATCHLEN (D) are not included here because I
intended to remove (or re-implement) MATCHLEN in the new DFA engine,
and the current code would be discarded when the new DFA engine would
be accepted.  I would submit another patch in case the new DFA engine
would be rejected.  (To make it clear, PATSCAN and BRACKMATCH are
still used in the new DFA engine, which is the reason I recently
submit changes to these functions).


Attachment: r0037.brackmatch8.incomplete-subbracket.patch.txt
Description: Text document

Attachment: example4.sh
Description: Bourne shell script

Attachment: r0037.patscan1.parse_subbracket.patch.txt
Description: Text document

Attachment: r0037.patscan2.fnm_noescape.patch.txt
Description: Text document

reply via email to

[Prev in Thread] Current Thread [Next in Thread]