emacs-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] add 'string-distance' to calculate Levenshtein distance


From: Chen Bin
Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
Date: Sun, 15 Apr 2018 02:40:18 +1000
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (gnu/linux)

Correct me if I'm wrong.

I read cod eand found definion of Lisp_String:
  struct GCALIGNED Lisp_String
  {
    ptrdiff_t size;
    ptrdiff_t size_byte;
    INTERVAL intervals;         /* Text properties in this string.  */
    unsigned char *data;
  };

I understand string text is encoded in UTF8 format and is stored in
'Lisp_String::data'. There is actually no difference between unibyte
and multibyte text since UTF8 is compatible with ASCII and we only deal
with 'data' field.

I attached the latest patch.

>From 53e6358e1346afd64bad261823f92ae9619b93d2 Mon Sep 17 00:00:00 2001
From: Chen Bin <address@hidden>
Date: Sun, 15 Apr 2018 02:20:29 +1000
Subject: [PATCH] add api string-distance

---
 etc/NEWS                |  2 ++
 src/fns.c               | 35 +++++++++++++++++++++++++++++++++++
 test/lisp/subr-tests.el | 10 ++++++++++
 3 files changed, 47 insertions(+)

diff --git a/etc/NEWS b/etc/NEWS
index 12b72eb..75cc92d 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -463,6 +463,8 @@ x-lost-selection-hooks, x-sent-selection-hooks
 +++
 ** New function assoc-delete-all.
 
+** New function string-distance to calculate Levenshtein distance between two 
strings.
+
 ** 'print-quoted' now defaults to t, so if you want to see
 (quote x) instead of 'x you will have to bind it to nil where applicable.
 
diff --git a/src/fns.c b/src/fns.c
index 94b9d98..f149001 100644
--- a/src/fns.c
+++ b/src/fns.c
@@ -153,6 +153,40 @@ If STRING is multibyte, this may be greater than the 
length of STRING.  */)
   return make_number (SBYTES (string));
 }
 
+DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 2, 0,
+       doc: /* Return Levenshtein distance between STRING1 and STRING2.
+Case is significant, but text properties are ignored.
+Strings are treated as byte arrays when calculating distance. */)
+  (Lisp_Object string1, Lisp_Object string2)
+{
+  CHECK_STRING (string1);
+  CHECK_STRING (string2);
+
+  char *s1 = SSDATA (string1);
+  char *s2 = SSDATA (string2);
+
+  ptrdiff_t s1len, s2len, x, y, lastdiag, olddiag;
+  s1len = SBYTES (string1);
+  s2len = SBYTES (string2);
+
+  USE_SAFE_ALLOCA;
+  ptrdiff_t *column = SAFE_ALLOCA ((s1len + 1) * sizeof (ptrdiff_t));
+  for (y = 1; y <= s1len; y++)
+    column[y] = y;
+  for (x = 1; x <= s2len; x++)
+    {
+      column[0] = x;
+      for (y = 1, lastdiag = x - 1; y <= s1len; y++)
+        {
+          olddiag = column[y];
+          column[y] = min (min (column[y] + 1, column[y-1] + 1), lastdiag + 
(s1[y-1] == s2[x-1] ? 0 : 1));
+          lastdiag = olddiag;
+        }
+    }
+  SAFE_FREE ();
+  return make_number (column[s1len]);
+}
+
 DEFUN ("string-equal", Fstring_equal, Sstring_equal, 2, 2, 0,
        doc: /* Return t if two strings have identical contents.
 Case is significant, but text properties are ignored.
@@ -5226,6 +5260,7 @@ this variable.  */);
   defsubr (&Slength);
   defsubr (&Ssafe_length);
   defsubr (&Sstring_bytes);
+  defsubr (&Sstring_distance);
   defsubr (&Sstring_equal);
   defsubr (&Scompare_strings);
   defsubr (&Sstring_lessp);
diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el
index 52b61d9..40a7727 100644
--- a/test/lisp/subr-tests.el
+++ b/test/lisp/subr-tests.el
@@ -281,6 +281,16 @@ subr-test--frames-1
   (should (equal (string-match-p "\\`[[:blank:]]\\'" "\u3000") 0))
   (should-not (string-match-p "\\`[[:blank:]]\\'" "\N{LINE SEPARATOR}")))
 
+(ert-deftest subr-tests--string-distance ()
+  "Test `string-distance' behavior."
+  (should (equal 1 (string-distance "heelo" "hello")))
+  (should (equal 2 (string-distance "aeelo" "hello")))
+  (should (equal 0 (string-distance "ab" "ab")))
+  (should (equal 1 (string-distance "ab" "abc")))
+  ;; string containing unicode character (Hanzi)
+  (should (equal 6 (string-distance "ab" "ab我她")))
+  (should (equal 3 (string-distance "我" "她"))))
+
 (ert-deftest subr-tests--dolist--wrong-number-of-args ()
   "Test that `dolist' doesn't accept wrong types or length of SPEC,
 cf. Bug#25477."
-- 
2.16.3

>>>>> "Eli" == Eli Zaretskii <address@hidden> writes:

    Eli> [Please CC the mailing list when you respond, so others could
    Eli> see your messages.]

    >> From: Chen Bin <address@hidden> Date: Sat, 14 Apr 2018
    >> 21:01:46 +1000
    >> 
    >> Hi, Eli, Thanks for the review.
    >> 
    >> I fixed most issues except two things.
    >> 
    >> 1. In Asia, it's possible to cacluate distance between one
    >> unibyte and one multibyte string. As a Chinese, I might create a
    >> document containing Hanzi characters whose file name is obviously
    >> multibyte string. I may need get the distance of this document to
    >> a file named "README.txt".

    Eli> If you mean unibyte pure-ASCII strings, then I agree.  But that
    Eli> doesn't mean we should avoid the text altogether, because we
    Eli> might compute non-zero distance between a string and its
    Eli> encoded unibyte variant, which will confuse users.  At the very
    Eli> least the doc string should say something about this.

    >> 2. Algorithm is based on
    >> 
https://en.wikibooks.org/w/index.php?title=Algorithm_Implementation/Strings/Levenshtein_distance&stable=0#C
    >> It's optimized to use O(min(m,n)) space instead of O(mn).  Say
    >> you compare two string whose string length is 512 bytes.  You
    >> only need allocate 512 bytes instead of 262K (512*512) in memory.
    >> 
    >> Please check attached patch for latest code.
    >> 
    >> --- a/etc/NEWS +++ b/etc/NEWS @@ -463,6 +463,8 @@
    >> x-lost-selection-hooks, x-sent-selection-hooks +++ ** New
    >> function assoc-delete-all.
    >> 
    >> +** New function string-distance.

    Eli> This should mention Levenshtein distance.

    >> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2,
    >> 2, 0, + doc: /* Return Levenshtein distance of STRING1 and
    >> STRING2.
    Eli>                                               ^^^^^^^^^^^^^^^^^^^^^^
    Eli> "between STRING1 and STRING2"

    >> + unsigned int s1len, s2len, x, y, lastdiag, olddiag;

    Eli> These variables should be declared EMACS_INT, not unsigned int,
    Eli> because the size of Emacs strings can be larger than UINT_MAX,
    Eli> especially on 64-bit systems.

    >> + unsigned int *column = SAFE_ALLOCA ((s1len + 1) * sizeof
    >> (unsigned int));

    Eli> Likewise here.

    >> + char *s1 = SSDATA (string1); + char *s2 = SSDATA (string2); + +
    >> unsigned int s1len, s2len, x, y, lastdiag, olddiag; + s1len =
    >> strlen(s1); + s2len = strlen(s2);

    Eli> You could optimize the code by using SCHARS and SBYTES, instead
    Eli> of calling strlen.

    >> +(ert-deftest subr-tests--string-distance () + "Test
    >> `string-distance' behavior."  + (should (equal 1 (string-distance
    >> "heelo" "hello"))) + (should (equal 2 (string-distance "aeelo"
    >> "hello"))) + (should (equal 0 (string-distance "ab" "ab"))) +
    >> (should (equal 1 (string-distance "ab" "abc"))))

    Eli> Could you please add a test or two with non-ASCII characters?

    Eli> Thanks.

-- 
Best Regards,
Chen Bin

--
Help me, help you

reply via email to

[Prev in Thread] Current Thread [Next in Thread]