bug-gnu-emacs

bug#13041: 24.2; diacritic-fold-search


From: Juri Linkov
Subject: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 02 Dec 2012 02:27:32 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.50 (x86_64-pc-linux-gnu)

> Using these properties, every search string can be converted to a
> sequence of non-decomposable characters (this process is recursive,
> because the 'decomposition' property can use characters that
> themselves are decomposable).  If the user wants to ignore diacritics,
> then the diacritics should be dropped from the decomposition sequence
> before starting the search.  E.g., for the decomposition of è above,
> we will drop the 768 and will be left with 101, which is 'e'.  Then
> searching for that string should apply the same decomposition
> transformation to the text being searched, when comparing them.
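
The transformation described in the quoted text can be sketched roughly as
follows.  This is only a minimal illustration, not code from this thread:
`my-decompose-char' and `my-fold-string' are hypothetical names, and the
sketch assumes that `get-char-code-property' returns a one-element list
holding the character itself when the character does not decompose.

```elisp
(defun my-decompose-char (char)
  "Return the recursive decomposition of CHAR as a list of characters.
Hypothetical helper for illustration; not part of Emacs."
  (let ((d (get-char-code-property char 'decomposition)))
    ;; Skip a leading non-character tag such as `compat'.
    (unless (characterp (car d)) (pop d))
    (if (or (null d) (memq char d))
        ;; CHAR does not decompose any further.
        (list char)
      ;; Decompose every element in turn and join the results.
      (apply #'append (mapcar #'my-decompose-char d)))))

(defun my-fold-string (str)
  "Return STR decomposed, with nonspacing marks (category Mn) dropped.
Hypothetical helper for illustration; not part of Emacs."
  (let (result)
    (dolist (c (string-to-list str))
      (dolist (d (my-decompose-char c))
        (unless (eq (get-char-code-property d 'general-category) 'Mn)
          (push d result))))
    (concat (nreverse result))))

;; è (232) decomposes to 101 768; dropping 768 leaves "e".
(my-fold-string "è")
```

The same function would have to be applied both to the search string and to
the text being searched for the comparison to work.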

Yes, using the `decomposition' property would be better than hard-coding
these decomposition mappings.  Though I'm surprised to see case mappings
hard-coded in lisp/international/characters.el instead of using the
properties `uppercase' and `lowercase' during creation of case tables.

Nevertheless, the `decomposition' property should be used to find
all decomposable characters.  The question is how to use them in the search.
One solution is to use case tables.  I tried building a case table whose
canonicalization table maps each character to its base character, found by
following the `decomposition' property recursively:

(defvar decomposition-table nil)

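(defun make-decomposition-table ()
  "Build `decomposition-table', a case table that folds diacritics."
  ;; Copy the standard case table so that it is not modified in place.
  (let* ((table (copy-sequence (standard-case-table)))
         (canon (copy-sequence table))
         (c #x0000))
    ;; Record in CANON the base character of every character up to U+FFFD.
    (while (<= c #xFFFD)
      (make-decomposition-table-1 canon c c)
      (setq c (1+ c)))
    (set-char-table-extra-slot table 1 canon)
    ;; Clear the equivalences slot so that it is computed from CANON.
    (set-char-table-extra-slot table 2 nil)
    (setq decomposition-table table)))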

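(defun make-decomposition-table-1 (canon c0 c1)
  "Set the entry for C0 in CANON to the base character of C1.
Follow the `decomposition' property of C1 recursively."
  (let ((d (get-char-code-property c1 'decomposition)))
    (when d
      ;; Skip a leading non-character tag such as `compat'.
      (unless (characterp (car d)) (pop d))
      (if (eq c1 (car d))
          ;; C1 does not decompose any further.
          (aset canon c0 (car d))
        (make-decomposition-table-1 canon c0 (car d))))))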

(make-decomposition-table)

Then a new Isearch command (the existing `isearch-toggle-case-fold'
can't be used because it enables/disables the standard case table)
could toggle between the current case table and the decomposition
case table using

  (set-case-table decomposition-table)

After evaluating this, Isearch correctly finds all related characters
in every row of this example:

  http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html

But using the case table for decomposition seems to have one limitation.
I see no way to ignore combining accent characters in the case table,
i.e. to map combining accent characters to nothing.  These characters
have the general category "Mn (Mark, Nonspacing)", so they should be ignored
in the search.
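
As a small illustration of how such characters can be recognized, the sketch
below collects the characters with general category Mn in a given range.
`my-nonspacing-marks-in-range' is a hypothetical helper, and the range
U+0300..U+036F (the Combining Diacritical Marks block) is only an example;
nonspacing marks exist outside that block as well.

```elisp
(defun my-nonspacing-marks-in-range (from to)
  "Return the characters between FROM and TO (inclusive) of category Mn.
Hypothetical helper for illustration; not part of Emacs."
  (let ((c from)
        marks)
    (while (<= c to)
      (when (eq (get-char-code-property c 'general-category) 'Mn)
        (push c marks))
      (setq c (1+ c)))
    (nreverse marks)))

;; Combining marks such as COMBINING ACUTE ACCENT (U+0301) are collected;
;; ordinary letters are not.
(my-nonspacing-marks-in-range #x300 #x36F)
(my-nonspacing-marks-in-range ?a ?z)
```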

An alternative would be to build a regexp from the search string,
in the same way a regexp is built for word search:

(define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)

(defun isearch-toggle-decomposition ()
  "Toggle Unicode decomposition searching on or off."
  (interactive)
  (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
                       'isearch-decomposition-regexp))
  (if isearch-word (setq isearch-regexp nil))
  (setq isearch-success t isearch-adjusted t)
  (isearch-update))

(defun isearch-decomposition-regexp (string &optional _lax)
  "Return a regexp that matches decomposed Unicode characters in STRING."
  (mapconcat
   (lambda (c0)
     (if (eq (get-char-code-property c0 'general-category) 'Mn)
         ;; Mark-Nonspacing chars like COMBINING ACUTE ACCENT are optional.
         (concat (string c0) "?")
       (let ((c1 c0) c2 chars)
         ;; Walk the cycle of equivalents of C0 until it returns to C0.
         (while (and (setq c2 (aref (char-table-extra-slot
                                     decomposition-table 2) c1))
                     (not (eq c2 c0)))
           (push c2 chars)
           (setq c1 c2))
         (if chars
             ;; Character alternatives from the case equivalences table.
             (concat "[" (string c0) chars "]")
           ;; Quote characters that are special in regexps, such as `.'.
           (regexp-quote (string c0))))))
   string ""))

(put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")

This uses the decomposition table created above, but does not activate it.
Instead, the equivalences table (extra slot 2) must first be "shuffled" with
the following code, which prepares the table without enabling it in the
current buffer:

  (with-temp-buffer (set-case-table decomposition-table))

The advantage of the regexp-based approach is that it makes combining accents
in the search string optional.  But there is another problem: how to ignore
combining accents in the buffer when the search string doesn't contain them.
With regexps this means adding a group of all possible combining accents
after every character in the search string, for example turning a search string
like "abc" into "a[́̂̃̄̆]?b[́̂̃̄̆]?c[́̂̃̄̆]?".
This would make the search slow, and I have no better idea.
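
A rough sketch of that regexp construction follows, to make the idea
concrete.  `my-accent-insensitive-regexp' is a hypothetical name, and two
things are assumptions made for brevity: the marks are given as the range
U+0300..U+036F rather than as an explicit list of all possible combining
accents, and "*" is used instead of "?" so that several stacked marks are
also skipped.

```elisp
(defun my-accent-insensitive-regexp (str)
  "Return a regexp matching STR with optional combining marks after each char.
Hypothetical helper for illustration; not part of Emacs."
  ;; Assumption: only the block U+0300..U+036F is treated as accents.
  (let ((marks (format "[%c-%c]*" #x300 #x36F)))
    (mapconcat
     (lambda (c)
       ;; Quote the character itself, then allow any number of marks.
       (concat (regexp-quote (string c)) marks))
     str "")))

;; Matches both plain "abc" and "abc" with a combining acute after "a".
(string-match-p (my-accent-insensitive-regexp "abc")
                (concat "a" (string #x301) "bc"))
```

This shows only how the regexp is built; it says nothing about how slow the
resulting search is on large buffers.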




