emacs-devel
Re: Character group folding in searches


From: Eli Zaretskii
Subject: Re: Character group folding in searches
Date: Fri, 06 Feb 2015 16:32:57 +0200

> Date: Fri, 6 Feb 2015 11:04:03 -0200
> From: Artur Malabarba <address@hidden>
> 
> 1. Follow the `decomposition' char property. For instance, the
> character "a" in the search string would match any one of  "aãáâ" (and
> so on). This is easy to do, and one of the patches below already shows
> how. Note that this won't handle symbols that are actually composed of
> multiple characters.
> 
> 2. Follow an intuitive sense of similarity which is not defined in the
> Unicode standard. For instance, an ASCII single quote in the search
> string should match any type of single quote (there are about a dozen
> that I know of).
> 
> 3. Ignore modifier (non-spacing) characters. Another way of writing
> "á" is to write "a" followed by a special non-spacing acute. This
> kind of thing (a symbol composed of multiple characters) is not
> handled by item 1, so I'm listing it as a separate point.
> 
> 4. Perform the conversion both ways. That is, item 1 should work even
> if the search contained "á" instead of "a". Item 2 should match an
> ASCII quote if the search string contains a curly quote. This is
> mostly useful when the user copies a fancy string from somewhere and
> pastes it into the search field.
> 
> 5. It should work for any searching, not just isearch.

The full set of "folding" transformations is described in Unicode
Technical Report #30 (UTR #30, "Character Foldings").  It was
withdrawn, but its last draft is still enlightening.

I think we should support some subset of what's described there.

The way to do it IMO is to generate a set of char-tables where each
character is mapped to its folded variant, one char-table for each
subset of folding.  A character whose folding is not a single
character should map to a vector or a string of characters (I'm not
sure which is best; we should choose whichever lends itself to the
most efficient use).
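
A minimal sketch of the "one table per subset of folding" idea, written
in Python rather than Emacs Lisp purely for illustration.  Plain dicts
stand in for char-tables; the helper names, the sample characters, and
the hand-written quote table are assumptions made for this example, not
part of any patch in this thread.  Each value is a string, so a folding
that yields more than one character fits the same table.

```python
import unicodedata

def diacritic_fold(ch):
    # Canonical decomposition, then drop the combining marks
    # (characters whose canonical combining class is non-zero).
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)

def compat_fold(ch):
    # Compatibility decomposition; the result may be several characters.
    return unicodedata.normalize("NFKD", ch)

def build_table(fold, chars):
    # Record only the characters that the folding actually changes.
    return {ch: fold(ch) for ch in chars if fold(ch) != ch}

# a, a-acute, a-tilde, a-circumflex, e, e-acute, and the "fi" ligature.
SAMPLE = "a\u00e1\u00e3\u00e2e\u00e9\ufb01"

# One table per kind of folding (hypothetical names).
FOLD_TABLES = {
    "diacritics": build_table(diacritic_fold, SAMPLE),
    "compatibility": build_table(compat_fold, SAMPLE),
    # Item 2 is not derivable from Unicode data, so it is written by hand.
    "quotes": {"\u2018": "'", "\u2019": "'", "\u201b": "'"},
}
```

Here the "diacritics" table maps U+00E1 to "a", while the
"compatibility" table maps the single character U+FB01 to the
two-character string "fi".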

I think the best approach is to modify search.c to be able to handle
folding that produces more than a single character.  I think we will
also need search.c to support several alternative foldings for the
same search operation.  Making these changes would be relatively easy,
I think, and once it's done, all the rest will "just work", because
the basic search algorithms don't need to be touched.
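
To make the point about leaving the basic search algorithm alone
concrete, here is a rough Python sketch (not what search.c does, and
all names are hypothetical).  Both the pattern and the text are passed
through the selected tables, an ordinary substring search runs on the
folded strings, and an offset list maps the result back to positions
in the original text.  The tiny tables are assumptions for the example.

```python
# Tiny illustrative tables (assumed data, not a real database).
DIACRITICS = {"\u00e1": "a", "\u00e3": "a"}   # a-acute, a-tilde
LIGATURES = {"\ufb01": "fi"}                  # one char folds to two
QUOTES = {"\u2019": "'"}                      # curly quote to ASCII quote

def fold_with_offsets(text, tables):
    """Fold TEXT through every table in TABLES, in order.

    Returns the folded string plus a list giving, for each folded
    character, the index of the original character that produced it.
    """
    pieces, offsets = [], []
    for i, ch in enumerate(text):
        out = ch
        for table in tables:
            out = "".join(table.get(c, c) for c in out)
        pieces.append(out)
        offsets.extend([i] * len(out))
    return "".join(pieces), offsets

def folded_search(pattern, text, tables):
    """Return (start, end) in TEXT of the first folded match, or None."""
    fpat, _ = fold_with_offsets(pattern, tables)
    if not fpat:
        return None
    ftext, offsets = fold_with_offsets(text, tables)
    pos = ftext.find(fpat)          # the unmodified basic search
    if pos < 0:
        return None
    start = offsets[pos]
    end = offsets[pos + len(fpat) - 1] + 1
    return start, end
```

For example, `folded_search("fish", "big \ufb01sh", [LIGATURES])`
reports the span covering the ligature and the two letters after it.
Because the pattern is folded as well, a search for a-acute also finds
a plain "a".  A match that begins or ends in the middle of a
multi-character folding is simply widened to the whole original
character in this sketch; a real implementation would have to decide
how to treat that case.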

As a final general remark, I don't think I like the "group" part of
the terminology.  Why not use "character-folding" instead?  That is
what this is called elsewhere.

> * group-folding-with-regexp-lisp.patch
> 
> This one takes each input character and either keeps it verbatim or
> transforms it into a regexp which matches the entire group that this
> character represents. It is implemented in isearch.
> 
> + It trivially handles goals 1, 2 and 3. Because regexps are quite
> versatile, it is the only solution that handles item 3 (it allows each
> character to match more than a single character).

But the downside is that we will have to construct such regexps for
all the foldings of all the characters we want to support.  That will
be quite a large database, and a lot of work to construct it.
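
For a sense of what such a regexp database involves, the following
Python sketch derives one slice of it from canonical decomposition
data.  It is only an illustration under stated assumptions: the
candidate range (U+0041 to U+017F), the function names, and the choice
to allow trailing marks from the U+0300 to U+036F block are all picked
for this example.  Foldings such as the quote variants of item 2 have
no decomposition and would still need hand-written entries.

```python
import re
import unicodedata

# Combining Diacritical Marks block, allowed after any matched character.
MARKS = "[\u0300-\u036f]*"

def build_groups(candidates):
    """Group characters by the first character of their NFD form."""
    groups = {}
    for ch in candidates:
        base = unicodedata.normalize("NFD", ch)[0]
        groups.setdefault(base, set()).add(ch)
    return groups

def fold_regexp(search, groups):
    """Turn SEARCH into a regexp matching any member of each group."""
    parts = []
    for ch in search:
        members = groups.get(ch, {ch})
        char_class = "[" + "".join(re.escape(m) for m in sorted(members)) + "]"
        parts.append(char_class + MARKS)
    return "".join(parts)

# ASCII letters through Latin Extended-A (an arbitrary sample range).
GROUPS = build_groups(chr(c) for c in range(0x41, 0x180))
```

With these groups, `fold_regexp("maca", GROUPS)` matches "maca" written
with c-cedilla and a-tilde, and `fold_regexp("a", GROUPS)` also matches
"a" followed by a separate combining acute, which is the case item 3
describes.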

> * group-folding-with-case-table-lisp.patch
> 
> This patch is entirely in elisp. I've put it all inside `isearch.el'
> for now, for the sake of simplicity, but it's not restricted to
> isearch.
> 
> It creates a new case-table which performs group folding by borrowing
> the case-folding machinery, so it is very fast. Then, group folding
> can be achieved by running the search inside a `with-group-folding'
> macro. There's also an example implementation which turns it on for
> isearch by default.
> 
> + It immediately satisfies items 1, 2, 4, and 5.
> + It is very fast.
> - It has no simple way of achieving item 3.

It could use a separate case-table for item 3, couldn't it?

I think we will need separate tables for different foldings anyway,
because each use case calls for some specific folding.  In isearch,
the user will have to specify which foldings she wants to be in
effect.

> - If the user decides to set `group-fold-search' to t, this can break
> existing code (a disadvantage that the lisp version above does not
> have).
> - It adds two extra fields to every buffer object (the boolean
> variable and the char table).

I'm not sure we need to add these tables to the buffer object.  The
experience with using case-tables this way is not encouraging, because
in several important cases it is not at all clear which buffer is
relevant to the folding-match operation one needs to do.

> Do any of these options seem good enough? Which would you all like to explore?
> I like the second one best, but goal 3 is quite important.

I think we must lift the restriction that a folding yields only a
single character, which means changes at the C level are inevitable.

I also think we need to talk a bit more about which kinds of folding
we would like to support.

Thanks.



