bug-bash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: bash "extglob" needs to upgrade at least like zsh "kshglob"


From: Koichi Murase
Subject: Re: bash "extglob" needs to upgrade at least like zsh "kshglob"
Date: Mon, 14 Nov 2022 20:06:00 +0900

2022年11月1日(火) 0:59 Chet Ramey <chet.ramey@case.edu>:
> If someone wanted to do this, I would take a look at incorporating the
> results, as long as it didn't add dependencies on, say, pcre or gnulib
> regex.

Instead of translating the pattern to a regular expression and
compiling it by regcomp (<regex.h>), I have experimentally implemented
a new extglob engine based on a naive DFA and was comparing the
behavior and the performance with the current implementations of
devel.  [Note: In this report hereafter, ``the current
implementation/engine'' always means the implementation of strmatch
(lib/glob/{strmatch,smatch,sm_loop.c}) in the current devel 31f4d468.]

However, I noticed two strange behaviors of the current engine.
Before adjusting the behavior of the new engine and submitting it for
review, I would like to confirm the (expected) behavior of the current
engine in the current devel.

These two behaviors finally turned out to be both related to the
matching of bracket expression by the function `BRACKMATCH'
(lib/glob/sm_loop.c).

----------------------------------------------------------------------
1. pattern [[=B=]][c] matches with c

  $ bash-devel --norc
  $ [[ Bc == [[=B=]][c] ]]; echo $?
  0      <-- OK. This is expected.
  $ [[ c == [[=B=]][c] ]]; echo $?
  0      <-- This is unexpected.

  See the attached [r0037.brackmatch1.equivalence-class.patch] for the
  fix.  The problem is caused because [[=B=]][c] is treated as a
  single bracket expression [ « [=B=]][c » ] when the equivalence
  class [=B=] does not match.  This is because `]' after `[[=B=]' is
  treated as if it is the first `]' in the bracket expression (such as
  `]' after `[' in the pattern « []aaa] »).  In the patch, when the
  next character after a failing equivalence class [=B=] is `]', the
  bracket expression has been changed to fail just the same as the
  case for a failing character class [:alpha:]
  (lib/glob/sm_loop.c:530; line number is that in the current devel
  31f4d468).

----------------------------------------------------------------------
2. bracket expression sometimes match or unmatch the slash with
  FNM_PATHNAME.

  FNM_PATHNAME is only used in two places in the Bash codebase.  1)
  One is for the glob matching for the filenames in the directory
  (lib/glob/glob.c).  However, this actually does not seem to have an
  actual effect because FNM_PATHNAME only causes differences in the
  matching when the target string contains a slash but the filenames
  do not contain any slashes.  2) The other is the filtering of the
  pathname-expansion results with GLOBIGNORE (pathexp.c).  So the only
  way to test the behavior of Bash's strmatch with FNM_PATHNAME
  (without writing a C program to directly use the function
  `strmatch') is to use GLOBIGNORE.

  To demonstrate the behavior of the current implementation, I attach
  a script [fnmpath.sh], which utilizes GLOBIGNORE for the Bash
  implementation.  It compares the result with fnmatch(3).  The result
  in my environment (x86_64-redhat-linux-gnu [Fedora Linux 36 (Server
  Edition)]) is this:

  $ bash-devel fnmpath.sh
  #1: pat=ab/cd/efg        yes/yes
  #2: pat=ab[/]cd/efg      yes/no
  #3: pat=ab[/a]cd/efg     yes/no
  #4: pat=ab[a/]cd/efg     no/no
  #5: pat=ab[!a]cd/efg     yes/no
  #6: pat=ab[.-0]cd/efg    yes/no
  #7: pat=*/*/efg          yes/yes
  #8: pat=*[/]*/efg        no/no
  #9: pat=*[/a]*/efg       no/no
  #10: pat=*[a/]*/efg       no/no
  #11: pat=*[!a]*/efg       no/no
  #12: pat=*[.-0]*/efg      no/no
  #13: pat=ab@(/)cd/efg     yes/yes
  #14: pat=*@(/)cd/efg      no/no
  #15: pat=*/cd/efg         yes/yes

  This tests whether each pattern matches the string "ab/cd/efg".  Two
  results by Bash's strmatch and fnmatch(3) are connected with
  `/'. "yes" means the pattern matches the string "ab/cd/efg" and "no"
  means it does not match.  Some observations are

  * In fnmatch(3), a bracket expression never matches a / with
    FNM_PATHNAME.

  * In Bash's strmatch, a bracket expression sometimes matches `/' and
    sometimes does not.  In the codebase, `/' is excluded only when it
    explicitly appears after another character in the bracket
    expression (lib/glob/sm_loop.c:574) even though the comment
    mentions [/].  This is the reason that only [a/] fails with Bash's
    implementation in cases #2..#6 in the above result.

  * What is happening with Bash's implementation in cases #7..#12 is
    related the assumption of the backtracking trick for `*': The
    trick for `*' backtracking explained in the code comment
    lib/glob/sm_loop.c:320---"This is a clever idea from
    glibc"---relies on the behavior that the bracket expression never
    matches a slash that `*' cannot match.  [Note: The exact
    requirements for this trick is actually slightly weaker: each
    bracket expression needs to match a fixed number of `/', 0 or 1,
    when FNM_PATHNAME is specified; it should never match a slash if
    it can match other characters, and it should never match other
    characters if it can match a slash.] Otherwise, backtracking for a
    different number of slashes would unexpectedly fail.

    It is hard to modify the current implementation so that it does
    not use the "clever idea (lib/glob/sm_loop.c:320)" which requires
    the assumption on the bracket expressions; it would be another
    re-implementation of the engine.  In addition, the time complexity
    of the current implementation is linear with respect to the string
    length O(len) as far as extglob is unused, but, if we allow
    bracket expressions to consistently match `/', the time complexity
    would become O(len^n) at the worst where n is the number of `*' as
    far as the backtracking algorithm is used.

    For this practical reason in addition to the compatibility with
    fnmatch(3), I think we should just follow fnmatch(3) to reject `/'
    for any bracket expressions with FNM_PATHNAME.

  * There is a similar inconsistency caused by the same trick with the
    extglob as observed in cases #13..#15.  For these cases, even
    fnmatch(3) behaves in a somewhat unpredictable way, so I would not
    try to fix this behavior in this report.

  The attached patch [r0037.brackmatch2.slash.patch] fixes this.  I
  move the check for the slash outside the loop of the bracket
  expression.  In particular, I moved the check outside the function
  BRACKMATCH because it is more consistent with the other similar
  checks for `?' (lib/glob/sm_loop.c:108) and `*'
  (lib/glob/sm_loop.c:179).

----------------------------------------------------------------------

By the way, the third patch [r0037.brackmatch3.unused-assign.patch] is
just a cosmetic fix that removes the assignments of unused values to
the variable `cend'.  The values are unused because they will be
overwritten by a later line (lib/glob/sm_loop.557) without being
referenced.

If you would like to keep the current assignments because it is
harmless, it also works for me, but in that case I think we should
also assign the value to `cend' in lib/glob/sm_loop.c:546 as `cstart =
cend = ...' for consistency with the other lines
lib/glob/sm_loop.c:442 and lib/glob/sm_loop.c:554.

--
Koichi

Attachment: r0037.brackmatch1.equivalence-class.patch.txt
Description: Text document

Attachment: fnmpath.sh
Description: Text Data

Attachment: r0037.brackmatch2.slash.patch.txt
Description: Text document

Attachment: r0037.brackmatch3.unused-assign.patch.txt
Description: Text document


reply via email to

[Prev in Thread] Current Thread [Next in Thread]