[Bug-wget] Re: Thoughts on regex support
From: Matthew Woehlke
Subject: [Bug-wget] Re: Thoughts on regex support
Date: Wed, 23 Sep 2009 11:52:34 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.23) Gecko/20090825 Fedora/2.0.0.23-1.fc10 Thunderbird/2.0.0.23 Mnenhy/0.7.5.0

Micah Cowan wrote:
[stuff about regex matching]
How will you handle nested boolean expressions? Same as 'find'?
IOW, how do you do this?
[url matches foo] AND ( [domain matches bar] OR [query matches baz] )
(Obviously I am intentionally choosing an example where the 'or' part
can't be easily expressed in the regex.)
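
To make the question concrete, here is a minimal Python sketch of the expression above. It assumes each component is already available as a string; the helper names (matches, accept) are invented for illustration and are not part of wget or of the proposal.

```python
import re

def matches(value, pattern):
    """Unanchored search, the way grep and sed treat a pattern."""
    return re.search(pattern, value) is not None

def accept(url, domain, query):
    # [url matches foo] AND ( [domain matches bar] OR [query matches baz] )
    return matches(url, r"foo") and (
        matches(domain, r"bar") or matches(query, r"baz")
    )
```

Whatever command-line syntax is chosen has to be able to express this grouping; a flat list of --match options with a single implied operator cannot.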
> --no-match ':field:action=(edit|print)'
Something like 'param[eter]' or 'arg[ument]' seems more sensible to me
(though as a programmer I am not the best to ask about usability
things). That such URLs come from a form isn't always obvious... and in
some cases isn't even true.
> . Don't follow links for producing printer-ready output, or editing
> pages. Equivalent to --no-match ':query:(.*&)?action=print(&.*)?',
> but somewhat easier to write.
Just in case you're planning on a conversion to that regex in the code,
remember that it is really:
'^.*[?]([^&]*&)*action=print(&.*)?$'
This simplification is probably safe:
'[?&]action=print&*.*$'
(I don't believe '&' has special meaning in a regex... it does on the
RHS of a substitution, but we aren't discussing those.)
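
One way to check the simplification is to run both patterns over a few URLs. The sketch below uses Python's re module, so results could differ under another regex engine; the URLs are made up, and TIGHTER is a variant added for this sketch, not something proposed in the thread.

```python
import re

# FULL and SIMPLE are copied from the message; TIGHTER is an assumption
# of this sketch.
FULL = r'^.*[?]([^&]*&)*action=print(&.*)?$'
SIMPLE = r'[?&]action=print&*.*$'
TIGHTER = r'[?&]action=print(&|$)'

def verdicts(url):
    """Return (full, simple, tighter) match results for one URL."""
    return tuple(re.search(p, url) is not None
                 for p in (FULL, SIMPLE, TIGHTER))

# Hypothetical URLs, chosen only to exercise the patterns.
for url in ("http://example.org/w?title=Foo&action=print",
            "http://example.org/w?action=print&x=1",
            "http://example.org/w?action=printable",
            "http://example.org/w?subaction=print"):
    print(verdicts(url), url)
```

In Python at least, the simplified pattern also accepts 'action=printable', because '&*.*' matches any trailing text; the full pattern does not. Ending the pattern with '(&|$)' keeps it short while still requiring the value to stop at 'print'.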
For that matter, if you support '\b', I wonder if you need "components"
at all...
> Components may be combined; to match against the combination of path and
> query string, you just specify :path+query:. That could be abbreviated
> as :p+q:. Combinations are only allowed if all the components involved
> are consecutive; :domain+query: (no path) would be illegal.
I can probably figure out technical reasons for that, but it doesn't
make much sense from a user perspective. Why shouldn't I be able to write:
-z ':d,f:foo'
...and have it match both
'http://foobar.com/'
and
'http://baz.org/index?title=foobar'
?
My expectation would be that it tries the match against the domain,
then, if that fails, tries it against the fields/params/args/whatever.
You could support both syntaxes easily enough:
-z ':d..f:expr' # match 'expr' in concatenation of domain through f/p/a.
-z ':p,q:expr' # match 'expr' in path or query
(And of course you can combine the above, e.g. 'p,file..args'. Another
reason to use 'args', you can use 'file' and still abbreviate to one
letter.)
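
A rough Python sketch of how the two separators could be interpreted: ',' lists alternatives that are tried in turn, and '..' names a range of consecutive components matched as one string. The one-letter names follow the component list later in this message; joining a range with '/' is a simplification made for this sketch, and the function names are hypothetical.

```python
import re

# Assumed component order: p[r]otocol, [d]omain, [p]ath, [f]ile, [q]uery.
ORDER = ["r", "d", "p", "f", "q"]

def expand_selector(spec):
    """Turn 'r,d..f' into [['r'], ['d', 'p', 'f']]."""
    groups = []
    for term in spec.split(","):
        if ".." in term:
            first, last = term.split("..")
            lo, hi = ORDER.index(first), ORDER.index(last)
            if lo > hi:
                raise ValueError("range runs backwards: " + term)
            groups.append(ORDER[lo:hi + 1])
        else:
            ORDER.index(term)  # raises ValueError for an unknown letter
            groups.append([term])
    return groups

def selector_matches(spec, expr, parts):
    """True if expr matches any group; a range is matched as one string."""
    for group in expand_selector(spec):
        text = "/".join(parts[c] for c in group)  # simplified joiner
        if re.search(expr, text):
            return True
    return False
```

With parts taken from the example URL below, selector_matches("d,q", "baz", parts) succeeds on the query after failing on the domain, which is the "try one, then the other" behaviour described above.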
BTW, what exactly are the components? Is this right?
[u]rl: http://foobar.com/site/images/thumb.php?name=baz.jpg&x=64&y=64
p[r]otocol: "http"
[d]omain: "foobar.com"
[p]ath: "site/images"
[f]ile: "thumb.php"
[q]uery: "name=baz.jpg&x=64&y=64"
[a]rgs: "name=baz.jpg", "x=64", "y=64"
(We could have also host/tld, but that seems like overkill when you can
match against '^www(\.|$)' and '\.com$', respectively. Or - did I
mention you should support '\b'? ;-) - '^www\b'.)
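
For reference, the breakdown above can be reproduced with Python's standard urllib.parse. This is only a sketch of the proposed component names; the function name and the dictionary keys are assumptions, and wget's own URL parser may split things differently.

```python
import posixpath
from urllib.parse import urlsplit

def components(url):
    """Split a URL into the components listed in this message."""
    s = urlsplit(url)
    # "site/images/thumb.php" -> ("site/images", "thumb.php")
    directory, filename = posixpath.split(s.path.lstrip("/"))
    return {
        "url": url,
        "protocol": s.scheme,
        "domain": s.hostname or "",
        "path": directory,
        "file": filename,
        "query": s.query,
        "args": s.query.split("&") if s.query else [],
    }

parts = components(
    "http://foobar.com/site/images/thumb.php?name=baz.jpg&x=64&y=64")
```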
> - Avoid adding both a --match and a --no-match option, by making
> negation a flag instead (/n or something: --match 'p/ni:.*\.js'
> would reject any paths ending in any case variant of ".js").
Similar ideas:
-z '(?!expr)'
-z ':opts!expr' # instead of ':opts:expr'
Personally I think there should be a way to do inverse matches with the
short option. At the same time, I don't feel strongly either way about
having a long option --no-match (i.e. to have both).
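
A caveat on the lookahead idea, sketched in Python (whose re module supports '(?!...)'; plain POSIX regexes do not). With unanchored matching, a bare '(?!expr)' succeeds at the first position where expr does not start, so it only works as an inverse match when anchored and widened to '^(?!.*expr)'. The function names are invented for this sketch.

```python
import re

def inverted(expr, text):
    """Flag-style negation: invert the result of an ordinary search."""
    return re.search(expr, text) is None

def lookahead_bare(expr, text):
    """'(?!expr)' used as-is with an unanchored search."""
    return re.search("(?!" + expr + ")", text) is not None

def lookahead_anchored(expr, text):
    """'^(?!.*expr)': succeeds only if expr matches nowhere in text."""
    return re.search("^(?!.*(?:" + expr + "))", text) is not None
```

For the path 'lib/app.js' and the expression '\.js$', inverted() and lookahead_anchored() both report "no", while lookahead_bare() reports "yes", which is not the intended inverse.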
> - Other anchoring options. I suspect that the many common use cases
> will begin with '.*'. We could remove the implicit anchoring, but
> then we'd probably usually want it at the end, forcing us to write
> the final '$'. That's one character versus two, but my gut tells me
> it's easier to forget anchors than it is to forget "match-any"
> patterns, which is why I lean toward implicit anchors.
MHO: implicit anchoring violates traditional regex usage. There is
probably an example of implicit anchoring somewhere, but offhand I can't
think of it. (And at any rate, sed/grep sure don't use implicit anchoring.)
That's inconvenient for args, but for everything else I still lean
toward no implicit anchoring.
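
The difference between the two conventions is easy to see in Python, which happens to offer both: re.search is unanchored like grep and sed, and re.fullmatch behaves like a pattern with implicit anchors at both ends. The wrapper names are assumptions of this sketch.

```python
import re

def explicit(pattern, text):
    """grep/sed convention: match anywhere unless the pattern anchors."""
    return re.search(pattern, text) is not None

def implicit(pattern, text):
    """Implicit anchoring: the pattern must cover the whole string."""
    return re.fullmatch(pattern, text) is not None

# The same intent, "path ends in .js", written for each convention.
IMPLICIT_STYLE = r".*\.js"
EXPLICIT_STYLE = r"\.js$"
```

The risk cuts both ways: the implicit-style pattern '.*\.js' run under the explicit convention also accepts 'app.json', while the explicit-style pattern '\.js$' run under the implicit convention rejects 'app.js'.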
Of course, if you support '\b' (and require explicit anchoring), then it
is somewhat hard to justify args (as you can just use '\bexpr\b' against
query, instead of '^expr$' against args).
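
A small Python sketch comparing the two approaches on the example query from earlier in this message. The helper names are hypothetical, and the comparison assumes an engine where '\b' means a word boundary in the Perl sense.

```python
import re

def match_args(expr, args):
    """'^expr$' tried against each argument in turn."""
    return any(re.search("^(?:" + expr + ")$", a) for a in args)

def match_query(expr, query):
    """'\\bexpr\\b' tried against the whole query string."""
    return re.search(r"\b(?:" + expr + r")\b", query) is not None

query = "name=baz.jpg&x=64&y=64"
args = query.split("&")
```

For 'x=64' and 'x=6' the two agree on this query. They can diverge when an argument name contains a non-word character other than '&': against 'max-x=64', match_query('x=64', ...) succeeds because '-' also forms a word boundary, while match_args does not.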
--
Matthew
Please do not quote my e-mail address unobfuscated in message bodies.
--
I picked up a Magic 8-Ball the other day and it said 'Outlook not so
good.' I said 'Sure, but Microsoft still ships it.'
-- Anonymous (from cluefire.net)