Re: [Pan-devel] Re: I see problem with Pan’s “http url detector”

From: Jeffrey Stedfast
Subject: Re: [Pan-devel] Re: I see problem with Pan’s “http url detector”
Date: Tue, 08 Feb 2011 07:15:46 -0500
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20101125 SUSE/3.0.11 Thunderbird/3.0.11

Hey guys, GMime author here...

On 02/07/2011 11:28 PM, Duncan wrote:
> Duncan posted on Mon, 07 Feb 2011 23:06:24 +0000 as excerpted:
>> Meanwhile, something else URL related that /used/ to annoy me, tho I've
>> not noticed it recently so maybe it's fixed (?), is unspaced commas or
>> the like, terminating a URL.  Here's testing it:
>>, Does the URL include the comma?
>> What about the terminating dot?
>> Question mark?
> Hmm... pan got those three right, now (as of... see the headers for git 
> commit, it's been a bit since I rebuilt).
>> ""; Double-quote?
>> '' Single-quote?
>> Colon?
> ... and those three wrong.  Pan didn't include the leading quote on either 
> of those, but parsed the trailing punctuation as part of the URL on all 
> three.
>> Those of us using pan to follow this list, thru gmane or whatever,
>> should get pan's behavior with the above tested directly.  I guess I'll
>> post a followup with the results for anyone using a standard mail
>> client.

I wrote up a quick test to see if it might be a bug in GMime 2.4's
gmime-filter-html.c implementation and it appears to get all of the
above urls correct[1], plus it didn't seem to get confused by
< ... and then no >, so I'm guessing that Pan doesn't
use GMime for this feature(?) and that maybe it has some custom regexes
or something.

I think a number of GNOME apps (including gnome-terminal) use regexes
that you may be able to steal (assuming they are not the same ones
already used by Pan), or another option is to use GMime's url scanner
instead. You can see example usage in gmime/gmime-filter-html.c (you'll
need something like the 'patterns' array at the top, although you could
probably drop the mask bit unless you want to keep a similar url vs
addrspec feature). The overall API is similar to regex and so could
almost be used as a drop-in replacement.

I mention this because it might be easier to try this out than to
debug/fix Pan's current url regexes (I say this as a non-Perl programmer
who is very much intimidated by regex syntax ;-)

As an added bonus, my url-scanner trie graph approach is ~13x faster
than regex for this particular purpose (or at least *was* back when I
first wrote it ~6-7 years ago).
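
For anyone unfamiliar with the idea, here is a rough sketch of what a
trie-based scanner with trailing-punctuation trimming might look like.
It is a toy in Python, not GMime's C code: the prefix list, the set of
trimmed characters, and the names build_trie, match_prefix and find_urls
are all assumptions made up for illustration.

```python
# Toy illustration only; none of these names come from GMime or Pan.
PREFIXES = ("http://", "https://", "ftp://", "www.")  # assumed prefix set
TRAILING = ".,:;?!'\""   # punctuation assumed never to end a URL
STOPPERS = '<>"'         # characters assumed to terminate a URL outright


def build_trie(words):
    """Build a character trie; the None key marks the end of a prefix."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def match_prefix(trie, text, pos):
    """Return the length of the longest known prefix at text[pos], or 0."""
    node, best, i = trie, 0, pos
    while i < len(text) and text[i].lower() in node:
        node = node[text[i].lower()]
        i += 1
        if None in node:
            best = i - pos
    return best


def find_urls(text):
    """Return URL-like substrings with trailing punctuation trimmed."""
    trie = build_trie(PREFIXES)
    urls, pos = [], 0
    while pos < len(text):
        n = match_prefix(trie, text, pos)
        if n == 0:
            pos += 1
            continue
        # Extend the match up to whitespace or a hard stop character.
        end = pos + n
        while (end < len(text) and not text[end].isspace()
               and text[end] not in STOPPERS):
            end += 1
        # Back off over trailing punctuation such as "," or ":".
        while end > pos + n and text[end - 1] in TRAILING:
            end -= 1
        if end > pos + n:          # ignore a bare prefix with no body
            urls.append(text[pos:end])
        pos = end
    return urls
```

Walking the trie once per input position avoids trying each pattern in
turn, which is the general reason this kind of scanner can beat a set of
regexes; the actual speedup will depend on the patterns and the input.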

Hope that helps...


1. There is, however, a difference between what GMime matches as the url
string and what Thunderbird matches in the double-quotes example
(Thunderbird includes both leading and closing quotes, GMime only
matches what is between them).
