emacs-devel

Re: [PATCH] add 'string-distance' to calculate Levenshtein distance


From: Eli Zaretskii
Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
Date: Sun, 15 Apr 2018 17:47:00 +0300

> From: Chen Bin <address@hidden>
> Cc: address@hidden
> Date: Sun, 15 Apr 2018 17:15:49 +1000
> 
> I attached patch for latest code.

Thanks, we are getting close.

> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 3, 0,
> +       doc: /* Return Levenshtein distance between STRING1 and STRING2.
> +If BYTECOMPARE is nil, we compare character of strings.
> +If BYTECOMPARE is t, we compare byte of strings.
> +Case is significant, but text properties are ignored. */)
> +  (Lisp_Object string1, Lisp_Object string2, Lisp_Object bytecompare)

I question the need for the BYTECOMPARE flag.  Emacs's editing
operations work in characters, not in bytes.  There's insert-byte, but
no delete-byte or replace-byte (although an application that for some
reason needs them could implement them, albeit not conveniently).
The byte-level operations are not exposed to Lisp for a good reason:
Emacs is primarily a text-processing environment, and text is
processed in character units.

So I think you should remove that option, unless you can explain why
you think it's needed and in what situations.

> +  unsigned short *ws1 = 0; /* 16 bit unicode character */
> +  unsigned short *ws2 = 0; /* 16 bit unicode character */
> +  if(!use_bytecompare)
> +    {
> +      /* convert utf-8 byte stream to 16 bit unicode array */
> +      string1 = code_convert_string_norecord (string1, Qutf_16le, 1);
> +      ws1 = (unsigned short *) SDATA (string1);
> +      string2 = code_convert_string_norecord (string2, Qutf_16le, 1);
> +      ws2 = (unsigned short *) SDATA (string2);
> +    }

Conversion to UTF-16 burns cycles, and as a result the function will
do the wrong thing for characters beyond the BMP: UTF-16 encodes each
such character as a pair of 16-bit words (a surrogate pair), so you
end up comparing 16-bit words instead of full characters.
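
To illustrate the surrogate-pair problem, here is a minimal standalone
sketch in plain C.  It is not Emacs code; the function name
utf16_encode is made up for this example.  It shows that one codepoint
outside the BMP occupies two 16-bit words in UTF-16:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Emacs: encode the Unicode
   codepoint CP as UTF-16 into OUT, which must have room for two
   16-bit words.  Return the number of words written.  */
size_t
utf16_encode (uint32_t cp, uint16_t *out)
{
  if (cp < 0x10000)
    {
      /* Inside the BMP: one word holds the whole character.  */
      out[0] = (uint16_t) cp;
      return 1;
    }
  /* Outside the BMP: split the character into a surrogate pair.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 + (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* low surrogate */
  return 2;
}
```

With this encoding, U+1F600 becomes the two words 0xD83D 0xDE00, so a
word-by-word comparison of "a" against that single character would
count one substitution plus one insertion, where a character-based
distance should be 1.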

Instead, please use the macros defined in character.h, either
STRING_CHAR_ADVANCE or FETCH_STRING_CHAR_ADVANCE (or their non-ADVANCE
counterparts), whichever better suits your coding style and needs.
These macros produce the full Unicode codepoint of each character in
the string, and you can then compare the codepoints without bumping
into the problems of the UTF-8 or UTF-16 encoding.  The *-ADVANCE
variants also advance the pointer or index to the next character as a
side effect, which is handy when examining successive characters in a
loop.
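
As a rough sketch of the idea, outside of Emacs, the loop could look
like the standalone C below.  The names next_char, decode and
char_distance are hypothetical; next_char is only a stand-in for an
ADVANCE-style macro from character.h, and it assumes valid UTF-8 input
with no error checking.  Inside Emacs you would use the real macros on
the string data instead of this decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an ADVANCE-style macro: return the
   codepoint of the character at *P and advance *P past it.
   Assumes valid UTF-8; performs no validation.  */
uint32_t
next_char (const unsigned char **p)
{
  const unsigned char *s = *p;
  uint32_t c = *s++;
  int extra = 0;

  if (c >= 0xF0)      { c &= 0x07; extra = 3; }
  else if (c >= 0xE0) { c &= 0x0F; extra = 2; }
  else if (c >= 0xC0) { c &= 0x1F; extra = 1; }
  while (extra-- > 0)
    c = (c << 6) | (*s++ & 0x3F);
  *p = s;
  return c;
}

/* Decode the NUL-terminated UTF-8 string S into a malloc'd array of
   codepoints and store the number of characters in *N.  */
uint32_t *
decode (const char *s, size_t *n)
{
  size_t bytes = strlen (s);
  uint32_t *out = malloc ((bytes + 1) * sizeof *out);
  const unsigned char *p = (const unsigned char *) s;
  const unsigned char *end = p + bytes;
  size_t i = 0;

  assert (out != NULL);
  while (p < end)
    out[i++] = next_char (&p);
  *n = i;
  return out;
}

/* Levenshtein distance between S1 and S2, counted in characters,
   using a single row of the dynamic-programming table.  */
size_t
char_distance (const char *s1, const char *s2)
{
  size_t n1, n2;
  uint32_t *a = decode (s1, &n1);
  uint32_t *b = decode (s2, &n2);
  size_t *row = malloc ((n2 + 1) * sizeof *row);

  assert (row != NULL);
  for (size_t j = 0; j <= n2; j++)
    row[j] = j;
  for (size_t i = 1; i <= n1; i++)
    {
      size_t diag = row[0];     /* cell (i-1, j-1) */
      row[0] = i;
      for (size_t j = 1; j <= n2; j++)
        {
          size_t up = row[j];   /* cell (i-1, j) */
          size_t best = diag + (a[i - 1] != b[j - 1]);
          if (up + 1 < best)
            best = up + 1;
          if (row[j - 1] + 1 < best)
            best = row[j - 1] + 1;
          diag = up;
          row[j] = best;
        }
    }
  size_t d = row[n2];
  free (row);
  free (a);
  free (b);
  return d;
}
```

Because the comparison is done on full codepoints, a call such as
char_distance ("a", "\xF0\x9F\x98\x80"), where the second argument is
U+1F600 in UTF-8, counts a single substitution, and no BYTECOMPARE
flag or UTF-16 conversion is involved.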


