From: Rabvit
Subject: Manpage and infopage of wget need mention whether regex of wget is Extended or Basic
Date: Sat, 18 Dec 2021 05:30:03 +0100 (CET)
The man page of wget 1.21.2 (also 1.20.3) describes the following options
concerning regular expressions.
> --accept-regex urlregex
> --reject-regex urlregex
> Specify a regular expression to accept or reject
> the complete URL.
>
>
> --regex-type regextype
> Specify the regular expression type.
> Possible types are posix or pcre.
> Note that to be able to use pcre type
> wget has to be compiled with libpcre support.
However, the option description above does not say which kind of POSIX
regular expression wget uses. The info page of wget does not say which
either.
There are two kinds of POSIX regular expressions:
1. POSIX Extended Regular Expression (ERE)
2. POSIX Basic Regular Expression (BRE)
The difference between BRE and ERE is as follows:
POSIX ERE
? + | ( ) { } have special meanings by themselves
without being preceded by a backslash (\).
To be literal, they need to be escaped.
POSIX BRE
? + | are literal by themselves.
POSIX does not define a special meaning for
\? \+ \| either, although an implementation
may add one as an extension.
( ) { } are literal by themselves,
but have special meanings if and only if
they are escaped as in \( \) \{ \}
The other special symbols behave essentially the same in POSIX ERE and POSIX
BRE.
While the man page of the latest version of wget still does not say
whether wget uses ERE or BRE, an old message in the mailing list archive
suggests that wget is meant to use ERE.
Gijs van Tulder wrote on 11 Apr 2012
(https://lists.gnu.org/archive/html/bug-wget/2012-05/msg00021.html):
> Here is a new version of the regular expressions patch.
> The new version combines POSIX (always, from gnulib)
> and PCRE (if available).
>
> The patch adds these options:
>
> --accept-regex="..."
> --reject-regex="..."
>
> --regex-type=posix for POSIX extended regexes (the default)
> --regex-type=pcre for PCRE regexes (if PCRE is available)
Please verify that wget currently uses ERE (as opposed to BRE) and that it is
the default, by looking at the source code and by running wget. If so
verified, then, please add the sentence "posix is the default, and refers to
POSIX Extended Regular Expression (ERE)." to the manpage and the infopage.
Thus, the option description should become:
--regex-type regextype
Specify the regular expression type.
Possible types are posix or pcre.
posix is the default, and refers to
POSIX Extended Regular Expression (ERE).
Note that to be able to use pcre type
wget has to be compiled with libpcre support.
To test whether the regex of wget is ERE, you need to know the following.
When ? + | ( ) { } are special, they have the following meanings.
? zero or one of the preceding element
+ one or more of the preceding element
| alternation
( ) grouping
{n} the preceding element occurs exactly n times
{n,} the preceding element occurs at least n times
{n,m} the preceding element occurs at least n times
but at most m times
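As a small illustration of the three interval forms above, here is a minimal
sketch using `grep -E` with `-x` (match the whole line). The sample strings are
made up for this illustration only, and output lines are commented out by '#'.

```bash
# Sample strings (made up for illustration): a, then 0 to 4 b's, then c.
samples='ac
abc
abbc
abbbc
abbbbc'

# {n}: exactly two b's
exactly=$(printf '%s\n' "$samples" | grep -Ex 'ab{2}c')
printf '%s\n' "$exactly"
# abbc

# {n,}: at least three b's
atleast=$(printf '%s\n' "$samples" | grep -Ex 'ab{3,}c')
printf '%s\n' "$atleast"
# abbbc
# abbbbc

# {n,m}: at least one but at most two b's
between=$(printf '%s\n' "$samples" | grep -Ex 'ab{1,2}c')
printf '%s\n' "$between"
# abc
# abbc
```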
Before actually running `wget` to see whether the posix regex of wget is ERE,
let us get familiar with the behavior of ERE by running `grep`. The -E option
of GNU grep enables POSIX Extended Regular Expression (ERE). Without -E, the
regex of GNU grep is basic but deviates slightly from POSIX BRE.
Here are the differences among the three:
POSIX ERE
? + | ( ) { } have special meanings by themselves
without being preceded by a backslash (\).
To be literal, they need to be escaped.
POSIX BRE
? + | are literal by themselves.
POSIX does not define a special meaning for
\? \+ \| either, although an implementation
may add one as an extension.
( ) { } are literal by themselves,
but have special meanings if and only if
they are escaped as in \( \) \{ \}
GNU-grep basic (default for GNU grep)
? + | ( ) { } are literal by themselves,
but have special meanings if and only if
escaped as in
\? \+ \| \( \) \{ \}
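The GNU-grep basic behavior can be checked directly. The following is a minimal
sketch that assumes GNU grep (other grep implementations may treat \? \+ \|
differently); the sample strings are made up for illustration.

```bash
# Assumption: grep is GNU grep, run without -E (basic mode).
samples='ac
abc
abbc
ab+c
ab?c'

# \? : zero or one of the preceding b
opt=$(printf '%s\n' "$samples" | grep 'ab\?c')
printf '%s\n' "$opt"
# ac
# abc

# \+ : one or more of the preceding b
plus=$(printf '%s\n' "$samples" | grep 'ab\+c')
printf '%s\n' "$plus"
# abc
# abbc

# \| : alternation; the unescaped + in the second branch stays literal
alt=$(printf '%s\n' "$samples" | grep 'ac\|ab+c')
printf '%s\n' "$alt"
# ac
# ab+c
```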
The other special symbols behave essentially the same in POSIX ERE, POSIX BRE,
and GNU-grep basic. Let me mention two such symbols.
* zero or more of the preceding element
. matches any character except newline
The dot character '.' appears in a domain name such as "ftp.gnu.org" and before
a file extension such as "report.pdf". For '.' to literally mean a dot in
regex, it has to be escaped like "ftp\.gnu\.org" and "report\.pdf".
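A minimal sketch of why the escaping matters; the file names other than
"report.pdf" are made up for illustration.

```bash
# Made-up names: only the first one contains a literal dot.
names='report.pdf
reportXpdf
report-pdf'

# Unescaped dot: matches any single character, so all three lines match.
loose=$(printf '%s\n' "$names" | grep -E 'report.pdf')
printf '%s\n' "$loose"
# report.pdf
# reportXpdf
# report-pdf

# Escaped dot: matches only a literal dot.
strict=$(printf '%s\n' "$names" | grep -E 'report\.pdf')
printf '%s\n' "$strict"
# report.pdf
```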
Note that, in the context of regular expressions, a special character means a
character that has a meaning special to regular expressions. This is not to be
confused with a special character for bash. Many characters special to regex
are also special to bash (but the meanings to regex and the meanings to bash
may differ). Thus, when passing a regex string to `grep` or `wget` on the
command line, characters that happen to be special to bash must be protected
from bash. This protection is usually done by enclosing the regex string in
single quotes (''). The only special characters that double quotes fail to
protect from bash are the following four (the exclamation mark only when
history expansion is enabled):
dollar ($), backslash (\), backtick (`), exclamation (!)
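The following minimal sketch shows the difference for the dollar sign and the
backslash. The variable `name` and the sample string are made up for
illustration; `printf '%s\n'` is used so that the printing itself does not
interpret backslashes.

```bash
name=world

# Single quotes: nothing is interpreted by the shell.
single=$(printf '%s\n' 'hello $name \\.')
printf '%s\n' "$single"
# hello $name \\.

# Double quotes: $name is expanded and \\ collapses to a single backslash.
double=$(printf '%s\n' "hello $name \\.")
printf '%s\n' "$double"
# hello world \.
```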
Now, let us run `grep` to get familiar with the behavior of ERE. In the
following, output lines are commented out by '#' to distinguish them from
commands.
[code]
# ? question mark
quest='ac
abc
abbc
ab?c'
echo "$quest" | grep -E 'ab?c'
# ac
# abc
echo "$quest" | grep -E 'ab\?c'
# ab?c
echo "$quest" | grep 'ab?c'
# ab?c
# + plus sign
plus='ac
abc
abbc
ab+c'
echo "$plus" | grep -E 'ab+c'
# abc
# abbc
echo "$plus" | grep -E 'ab\+c'
# ab+c
echo "$plus" | grep 'ab+c'
# ab+c
# | vertical line
vert='ab
cd
ad
bc
b|c'
echo "$vert" | grep -E 'ab|cd'
# ab
# cd
echo "$vert" | grep -E 'ab\|cd'
# none matched
echo "$vert" | grep 'ab|cd'
# none matched
# () parentheses
paren='ad
abcd
abcbcd
bc
acbd
ebcf
a(bc)d
a(bcd'
echo "$paren" | grep -E 'a(bc)*d'
# ad
# abcd
# abcbcd
echo "$paren" | grep -E 'a\(bc\)*d'
# a(bc)d
# a(bcd
echo "$paren" | grep 'a(bc)*d'
# a(bc)d
# a(bcd
echo "$paren" | grep 'a\(bc\)*d'
# ad
# abcd
# abcbcd
# {} curly braces
brace='ac
abc
abbc
ab{0,1}c'
echo "$brace" | grep -E 'ab{0,1}c' # same as 'ab?c'
# ac
# abc
echo "$brace" | grep -E 'ab\{0,1\}c'
# ab{0,1}c
echo "$brace" | grep 'ab{0,1}c'
# ab{0,1}c
echo "$brace" | grep 'ab\{0,1\}c'
# ac
# abc
[/code]
If I were an admin of a web site, I would test `wget` with the same strings and
regexes as the above `grep` test. I would create files whose names are the
same as the sample strings of the above `grep` test. Then, I would run `wget`,
passing to --accept-regex essentially the same regexes as in the above `grep`
test.
The following code creates such files in directories whose names are the
same as the five variables of the above `grep` test.
[code]
mkdir /quest /plus /vert /paren /brace
cd /quest
questAry=(ac abc abbc 'ab?c')
echo foo | tee "${questAry[@]}" > /dev/null
cd /plus
plusAry=(ac abc abbc 'ab+c')
echo foo | tee "${plusAry[@]}" > /dev/null
cd /vert
vertAry=(ab cd ad bc 'b|c')
echo foo | tee "${vertAry[@]}" > /dev/null
cd /paren
parenAry=(ad abcd abcbcd bc acbd ebcf 'a(bc)d' 'a(bcd')
echo foo | tee "${parenAry[@]}" > /dev/null
cd /brace
braceAry=(ac abc abbc 'ab{0,1}c')
echo foo | tee "${braceAry[@]}" > /dev/null
[/code]
Suppose, unrealistically, that the above directories and files had been created
on "https://www.gnu.org" so that "https://www.gnu.org/quest/abc",
"https://www.gnu.org/plus/abc" and so on existed.
The following code would test `wget` to see how `wget` would handle '+', which
ERE and BRE handle differently.
[code]
links='<a href="https://www.gnu.org/plus/ac">
<a href="https://www.gnu.org/plus/abc">
<a href="https://www.gnu.org/plus/abbc">
<a href="https://www.gnu.org/plus/ab+c">'
re=".*/ab+c"
wget -rl1 --accept-regex "$re" -Fi <(echo "$links") -w1
[/code]
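To make the expected outcome concrete, the same four URLs can be filtered with
`grep`. This is only a sketch of what each regex flavor would select; it shows
the behavior of `grep`, not of `wget`. The variable names `urls`, `as_ere` and
`as_bre` are made up for this illustration.

```bash
# The four URLs from the wget test, one per line.
urls='https://www.gnu.org/plus/ac
https://www.gnu.org/plus/abc
https://www.gnu.org/plus/abbc
https://www.gnu.org/plus/ab+c'
re='.*/ab+c'

# If '+' is an operator, as in ERE: one or more b
as_ere=$(printf '%s\n' "$urls" | grep -E "$re")
printf '%s\n' "$as_ere"
# https://www.gnu.org/plus/abc
# https://www.gnu.org/plus/abbc

# If '+' is literal, as in BRE (grep without -E)
as_bre=$(printf '%s\n' "$urls" | grep "$re")
printf '%s\n' "$as_bre"
# https://www.gnu.org/plus/ab+c
```

If `wget` filtered with ERE, one would expect it to accept the two URLs in the
first result; with BRE, only the URL in the second result.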
I thought that `wget` would request only the files whose names match the regex
and would not request the files whose names do not match the regex. Hence,
even though www.gnu.org actually lacks such directories and files as
"plus/abc", "plus/abbc" and so on, I thought that, by looking at which files
`wget` would request, I would be able to see whether `wget` works identically
to the ERE of `grep`. Unfortunately, however, `wget` requests all the files
whether or not they match the regex. `wget` may be asking the web server
whether every file exists before the regex filter is applied, which would be
inefficient behavior.
Because I am not an admin of any web site, and because `wget` requests all the
files whether or not they match the regex, I gave up testing whether the
posix regex of `wget` works identically to the ERE of `grep`.
Anyway, I would like the maintainer of `wget` to verify that wget currently
uses ERE (as opposed to BRE) and that it is the default, by looking at the
source code and by running wget. If so verified, then, please add the sentence
"posix is the default, and refers to POSIX Extended Regular Expression (ERE)."
to the manpage and the infopage.
--- Rabvit