|
From: | Koichi Murase |
Subject: | Re: bash "extglob" needs to upgrade at least like zsh "kshglob" |
Date: | Mon, 28 Nov 2022 19:51:20 +0900 |
2022年11月23日(水) 5:24 Chet Ramey <chet.ramey@case.edu>: > I attached the latest patch against bash-5.2.9. ---- > commit 3c9dd4565792bc53de3a94ec38a65a1989f3fe2f (upstream/devel) > > associative array elements; last set of changes to globbing > bracket expressions; fix for timing subshell commands Thank you for the discussion and for applying the changes. Besides, I am sorry that I still have a discussion on the behavior of BRACKMATCH, so it was not the last set of changes. After the above fix, I moved to check the behavior related to PATSCAN, where I found inconsistent results related to the difference between BRACKMATCH and PATSCAN in parsing the bracket expressions. I checked also other parts of the codebase and found additional inconsistencies. Description ----------- First, let me introduce the symbols (A)..(D) to later reference the implementations of the bracket expression in the codebase. There are four independent codes that implement rules for extracting the bracket expression in the current codebase: - (A) The main loop of BRACKMATCH: This handles sub-expressions of a bracket expression when a matching sub-expression is not found. - (B) The section of the `matched' label in BRACKMATCH: This handles sub-expressions of the bracket expression after a matching sub-expression is found. - (C) PATSCAN: This skips bracket expressions to determine the end of the extglob constructs of the form @(...), ?(...), +(...), etc. - (D) MATCHLEN (lib/glob/gm_loop.c): This function handles bracket expressions to count the number of characters that a fixed-length pattern can match. Actually, each of the four implements a distinct rule, which does not match any of the other three. These implementations need to be adjusted to support an identical and consistent rule. Repeat-By --------- The differences between (A)..(D) cause various strange behaviors. 1. Strange behavior caused by an inconsistency between (A/B) and (C) This is what I was first faced with. The following shows the result of [example4.sh] with the current devel, where column 4 `{yes,no}/{yes,no}' shows `(result)/(what I expect)': --- PATSCAN vs BRACKMATCH --- #1: pat=@([[.].])A]) str=] no/yes #2: pat=@([[.].])A]) str===]A]) no/no #3: pat=@([[.].])A]) str=AA]) yes/no #4: pat=@([[=]=])A]) str=] no/no #5: pat=@([[=]=])A]) str===]A]) no/yes #6: pat=@([[=]=])A]) str=AA]) yes/no Obvious strange behaviors can be found in cases #3 and #6, where `A' matches twice even if there is only one `A' and no repetition such as `*()' or `+()' in the pattern. This is because PATSCAN (C) considers the bracket expression to be `[[.].]' while BRACKMATCH (A/B) considers the bracket expression to be `[[.].])A]'. First, PATSCAN extracts `@([[.].])', but BRACKMATCH next matches the first `A' in the input string using a pattern character `A' outside the construct `@()'. Finally, the remaining part `A])' in the pattern is matched literally. 2. Inconsistency between (A) and (B): To fix the above item for (A/B) vs (C), I checked the detailed behaviors of both and found this. The parsing of [.ch.], [=a=], and [:var:] are not totally consistent before and after a matching sub-expression is found. The following is the second section of the result of [example4.sh]: --- BRACKMATCH: after match vs before match --- #7: pat=[[=]=]ab] str=a yes/no #8: pat=[[.[=.]ab] str=a yes/yes #9: pat=[[.[==].]ab] str=a yes/yes #10: pat=[a[=]=]b] str=a no/no #11: pat=[a[.[=.]b] str=a no/yes #12: pat=[a[.[==].]b] str=a no/yes #13: pat=[a[=]=]b] str=b yes/no #14: pat=[a[=]=]b] str=a=]b] yes/yes #15: pat=[a[.[=.]b] str=b yes/yes #16: pat=[a[.[=.]b] str=ab] yes/no #17: pat=[a[.[==].]b] str=b yes/yes #18: pat=[a[.[==].]b] str=ab] yes/no Cases #7..#9 succeeds, which means that `[=]=]', `[.[=.]', and `[.[==].]' form an equivalence class and collating symbols in BRACKMATCH (A). However, cases #10..#12 (which are the bracket expressions of the same sub-expression with different ordering) fail, which means that `[=]=]', `[.[=.]', and `[.[==].]' do not form an equivalence class or a collating symbol in BRACKMATCH (B). Also, cases #13 vs #14, #15 vs #16, and #17 vs #18 demonstrate that the same pattern consisting of bracket expressions and normal characters can match different numbers of characters. This means that the boundary of the bracket expression can change depending on the input string. Actually, these patterns are undefined by the standard because an equivalence class shall not contain `]' for cases #7, #10, #13, and #14, and the opening `[.' and `[=' shall be followed by the corresponding `.]` and `=]`, respectively, for the other cases. Nevertheless, even if the behavior is undefined, I expect at least the same results for pairs (#7, #10), (#8, #11), and (#9, #12), respectively. I also expect that only one from each pair (#13, #14), (#15, #16), or (#17, #18) succeeds. Otherwise, we cannot determine the range of the bracket expression before seeing the input string. 3. Differences for incomplete [:ccname:], [=a=], and [:var:] within (A) In trying to implement a common implementation for (A)..(D), I also noticed that the behavior of incomplete [:cclass:], [=a=], and [.ch.] are different from one another within BRACKMATCH (A): --- incomplete POSIX brackets --- #19: pat=x[a[:y] str=x[ no/??? #20: pat=x[a[:y] str=x: yes/??? #21: pat=x[a[:y] str=xy yes/??? #22: pat=x[a[:y] str=x[ay no/??? #23: pat=x[a[.y] str=x[ no/??? #24: pat=x[a[.y] str=x. no/??? #25: pat=x[a[.y] str=xy no/??? #26: pat=x[a[.y] str=x[ay yes/??? #27: pat=x[a[=y] str=x[ yes/??? #28: pat=x[a[=y] str=x= yes/??? #29: pat=x[a[=y] str=xy yes/??? #30: pat=x[a[=y] str=x[ay no/??? These special POSIX bracket expressions ([:cclass:], [=a=], and [.ch.]) are implemented separately in BRACKMATCH (A). On the other hand, these special POSIX bracket expressions ([:cclass:], [=a=], and [.ch.]) are handled together in the other parts (B)..(D). The variations in the behaviors of [:cclass:], [=a=], and [.ch.] in (A) does not actually violate the standard because the behavior is undefined by the standard when there is no corresponding ending brackets `:]', `.]', or `=]'. However, if we would keep these variations of the behavior in BRACKMATCH (A), we then need to implement these seemingly random variations also in (B)..(D) to match the behaviors of (A)..(D). Instead, I think the opposite would be more reasonable, i.e., to change (A) to handle incomplete [:cclass:], [=a=], and [.ch.] in a consistent manner and match its behavior to (B)..(D). However, it is still unclear what would be the preferred behavior because each of (B)..(D) implements its distinct rule. I instead I checked the behaviors of other shells/implementations. Here is the summary of the comparison: No. pattern input | bash fnmatch/osh zsh ksh yash/busybox --- ------- ----- | ----- ------------ ---- --- ------------ #19 x[a[:y] x[ | no yes yes no no #20 x[a[:y] x: | yes yes yes no no #21 x[a[:y] xy | yes yes yes no no #22 x[a[:y] x[ay | yes no no no yes #23 x[a[.y] x[ | no no yes no no #24 x[a[.y] x. | no no yes no no #25 x[a[.y] xy | no no yes no no #26 x[a[.y] x[ay | yes no no no yes #27 x[a[=y] x[ | yes yes yes no no #28 x[a[=y] x= | yes yes yes no no #29 x[a[=y] xy | yes yes yes no no #30 x[a[=y] x[ay | no no no no yes The behavior of fnmatch(3) of GNU/Linux and osh (oilshell) is slightly different from that of Bash but are essentially similar. I guess this is because the Bash implementation (seems to be) derived from a fnmatch(3) implementation. I guess osh calls fnmatch(3) internally. The other shells zsh, ksh, yash, and busybox sh produce consistent results for all of `[:cclass:]', `[=a=]', and `[.ch.]', though they differ from one another: zsh treats unclosed `[:', `[.', and `[=' as normal characters consisting of the bracket expression, ksh considers the entire pattern to be invalid and does not let it match anything, and yash and busybox sh consider the bracket expression to be invalid and let the first bracket `[' match literally. I decided to choose the behavior of `zsh' because it is consistent for all [:cclass:], [.ch.], and [=a=] and closest to the current behavior of Bash. Also, the following points are (partly) addressed in this report. 4. PATSCAN currently does not handle / in parsing bracket expressions with FNM_PATHNAME. 5. PATSCAN currently does not handle FNM_NOESCAPE. Fixes ----- I attach the first patch [r0037.brackmatch8.incomplete-subbracket.patch.txt] against the devel to solve the above "Repeat-By 2 and 3" of BRACKMATCH (A/B), where I introduced a new helper function PARSE_SUBBRACKET to extract [:cclass:], [.ch.], and [=a=] in the same way (except for the special rule for [.].] in POSIX XCU 9.3.5.1). In the patch, I upgraded the existing function PARSE_COLLSYM to PARSE_SUBBRACKET. I initially tried to add an independent function PARSE_SUBBRACKET and modify PARSE_COLLSYM to use PARSE_SUBBRACKET internally, but the resulting PARSE_COLLSYM became trivial, so I decided to remove the function. Eventually, the patch became to have the current shape. The fix for PATSCAN (C) is included in the second patch [r0037.patscan1.parse_subbracket.patch.txt], which solves "Repeat-By 1 and 4" by using PARSE_SUBBRACKET of the first patch. This patch applies to devel after applying the first patch. Actually, this is a partial fix for "Repeat-By 4"; there are still failing cases after applying this patch. For the complete fix, I would like to use a helper function introduced for the new DFA engine, so I will submit a patch after it is determined whether the new DFA engine is accepted or rejected. The third patch [r0037.patscan2.fnm_noescape.patch.txt] addresses "Repeat-By 5". This is a single-line fix. This patch applies to devel after applying the second patch. The adjustments of MATCHLEN (D) are not included here because I intended to remove (or re-implement) MATCHLEN in the new DFA engine, and the current code would be discarded when the new DFA engine would be accepted. I would submit another patch in case the new DFA engine would be rejected. (To make it clear, PATSCAN and BRACKMATCH are still used in the new DFA engine, which is the reason I recently submit changes to these functions). -- Koichi
r0037.brackmatch8.incomplete-subbracket.patch.txt
Description: Text document
example4.sh
Description: Bourne shell script
r0037.patscan1.parse_subbracket.patch.txt
Description: Text document
r0037.patscan2.fnm_noescape.patch.txt
Description: Text document
[Prev in Thread] | Current Thread | [Next in Thread] |