emacs-devel

Re: Emacs needs truely useful flex matching


From: Le Wang
Subject: Re: Emacs needs truely useful flex matching
Date: Mon, 15 Apr 2013 08:14:44 +0800

On Mon, Apr 15, 2013 at 2:18 AM, Stefan Monnier <address@hidden> wrote:
>> "\\(\\<\\)?a\\([^b]*\\)\\(\\<\\)?b\\([^c]*\\)\\(\\<\\)?c\\([^d]*\\)\\(\\<\\)?d"
>> > the regexp matching should be fairly efficient and you should be able to
>> > compute the score efficiently as well (at least if
>> > you ignore the camelCase boundaries).
>> I hadn't thought of this, and I'll try it soon.
> I gave this a good try.  :-)
> Since we are favouring beginning of word anchors (often but not always), I
> actually had to insert "\\<" in the various slots in front of characters.
> That is all permutations of 4x"\\<", then 3x, then 2x, then 1x, etc.

> I don't understand.  The example I gave already has all the "\\<" we
> need, doesn't it?

The regexp you gave works the same as "a.*?b.*?c.*?d".  It doesn't really prefer word boundaries.

Consider "fafbfcfd-bcd": your regexp will match "afbfcfd" entirely from the first word, but we want to prefer "a" from the first word and "bcd" from the second.

I could only match the preferred parts by inserting the maximum number of "\\<" anchors first, then one fewer, and so on, trying them in different slots and preferring slots closer to the beginning of the string.
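To illustrate the problem in Python (a sketch only: `\b` stands in for Emacs's "\\<", and the last pattern relaxes the `[^b]*` gap so the match can skip past the first word's "b" -- neither is the actual elisp):

```python
import re

s = "fafbfcfd-bcd"

# Stefan's regexp translated to Python; the optional \(\<\)? groups
# just match empty here, so they are dropped:
m = re.search(r"a[^b]*b[^c]*c[^d]*d", s)
print(m.group(0))   # "afbfcfd" -- every char taken from the first word

# it behaves the same as the lazy version:
assert re.search(r"a.*?b.*?c.*?d", s).group(0) == "afbfcfd"

# making the word boundary before "b" mandatory (with a relaxed gap)
# forces the match to take "bcd" from the second word instead:
m2 = re.search(r"a.*?\b(b.*?c.*?d)", s)
print(m2.group(1))  # "bcd"
```

So the anchors have to be made mandatory, and tried in different slots, before the match lands on the preferred parts.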


> The algorithm works fast.  I believe it has feature parity with Sublime
> Text's fuzzy search.
> However it uses a fair bit of memory,
> 1. it's allocating one hash table and one same-length vector for each string
> 2. it's allocating one cons cell for each character.

Sounds great.  It does seem to require a fair bit of pre-processing, tho.
When you say it's fast, does that include the time it takes to build the
hash tables from the list of potential candidates?

I neglected to describe the top-level caching layer.  I keep two global caches -- one for filenames, one for other strings.  The key is a completion string and the value is the analysis for that string (hash table + heatmap).
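The two-cache layout might look roughly like this (a hypothetical Python sketch; `analyze` is a placeholder for the real hash-table + heatmap computation, which isn't shown in this mail):

```python
# two global caches, as described above: one keyed by filename
# candidates, one by all other completion strings
file_cache = {}
string_cache = {}

def analyze(candidate):
    # placeholder for the real analysis: a char -> positions table
    # (the heatmap half is omitted here)
    return {c: [i for i, x in enumerate(candidate) if x == c]
            for c in set(candidate)}

def get_analysis(candidate, is_file):
    # compute once per candidate string, then reuse on every keystroke
    cache = file_cache if is_file else string_cache
    if candidate not in cache:
        cache[candidate] = analyze(candidate)
    return cache[candidate]
```

Repeated lookups for the same candidate then return the cached analysis instead of recomputing it.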

The algorithm works differently for files than for other strings, scoring each path segment differently (the basepath is especially preferred).
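A heatmap that prefers the basepath might be sketched like this (hypothetical weights, not the actual elisp numbers):

```python
def heatmap(path):
    # per-character base scores: chars in the basepath (final path
    # segment) get a big bonus, segment/word starts a smaller one
    base_start = path.rfind("/") + 1
    scores = []
    for i, ch in enumerate(path):
        s = 10 if i >= base_start else 0   # basepath especially preferred
        if i == 0 or path[i - 1] in "/-_. ":
            s += 5                          # boundary-start bonus
        scores.append(s)
    return scores

hm = heatmap("src/editor/buffer.el")
# the characters of "buffer.el" outscore those in "src/" and "editor/"
```

Match scoring then sums the heatmap values at the matched positions, so a query hitting the basename beats the same characters scattered through directory names.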

When I say fast, I mean that the matching and scoring are fast once the static analysis is done.

Warming up the cache with 6300 filenames (average length 44 characters) takes 0.7 seconds and invokes the garbage collector 8 times.

Actually, can I skip the GC somehow, since I know I'm about to allocate a bunch of memory that never gets freed?

--
Le
