bug-gnu-emacs

bug#24206: 25.1; Curly quotes generate invalid strings, leading to a seg


From: Eli Zaretskii
Subject: bug#24206: 25.1; Curly quotes generate invalid strings, leading to a segfault
Date: Mon, 15 Aug 2016 19:09:40 +0300

> Cc: p.stephani2@gmail.com, 24206@debbugs.gnu.org, johnw@gnu.org,
>  nicolas@petton.fr
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Sun, 14 Aug 2016 19:04:42 -0700
> 
> Eli Zaretskii wrote:
> > Its multibyteness is entirely in Emacs's imagination.
> 
> Sure, but Emacs should not substitute "\342\200\230" for "`". The point of 
> text-quoting-style is to substitute quotes, not byte string encodings of 
> quotes.

I'm not sure.  We never discussed what Emacs should do when
substitute-command-keys is called on a unibyte non-ASCII string which
requires quote substitution.  Other substitutions, including those
that produce ASCII quote characters, previously would leave the
unibyte string unibyte.  But with your changes, any substitution
converts the string into multibyte:

  (multibyte-string-p (substitute-command-keys "\200\\[goto-char]"))
    => t

I think this might be a subtle regression, because some code might
just find itself mixing multibyte and unibyte strings where previously
there were only unibyte strings.

> >> > More generally, Fsubstitute_command_keys is quite confused about unibyte
> >> > versus multibyte issues. It merges together a number of strings, and
> >> > assumes that they are all multibyte iff the original string is
> >> > multibyte, which is obviously not true in general.
> > Could you please point out the specific places where this is done?
> 
> OK, here's a contrived example. Run this code in emacs-25:
> 
> (progn
>    (setq km (make-keymap))
>    (define-key km "≠" 'global-set-key)
>    (substitute-command-keys "\200\\<km>\\[global-set-key]"))
> 
> This should return a 2-character string equal to "\200≠".

I'm not sure your expectations are correct: as the original string is
unibyte, the output of "\200≠", which is multibyte, might not be what
the users expect.  They might expect "\200\342\211\240" instead.
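
For reference, the octal escapes quoted in this thread are the UTF-8
bytes of the characters involved: U+2260 (≠) is \342\211\240 and
U+2018 (the left curly quote) is \342\200\230.  A minimal C sketch of
that encoding follows; utf8_encode is a hypothetical helper written
for illustration, not a function from the Emacs sources.

```c
#include <stddef.h>

/* Hypothetical helper (not Emacs code): encode code point CP as UTF-8
   into OUT, which must have room for 4 bytes.  Return the number of
   bytes written, or 0 if CP is a surrogate or beyond U+10FFFF.  */
size_t
utf8_encode (unsigned long cp, unsigned char *out)
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not encodable */
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp < 0x110000)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

Calling utf8_encode (0x2260, buf) stores 0xE2 0x89 0xA0 in buf, which
is \342\211\240 in octal, the 3-byte form a unibyte caller would see.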

> But in Emacs 25 it dumps core, at least on my platform (Fedora 23
> x86-64). And in Emacs 24 on my platform it returns a malformed
> string that prints as "\242\1340" but has length 2. I suppose we
> could make Emacs 24 dump core too, though I haven't tried hard to do
> that.

The errors are easily fixed, though.  Below I show 2 patches.  The
first one should go to master (after reverting yours), and IMO is also
safe enough for emacs-25.  But if it is deemed not safe enough for the
release, the second patch is safer.  The second patch doesn't produce
"\200≠" in your test case, but neither did Emacs 24, so this is not a
regression.

Comments?  Let's decide on what to do with emacs-25 first, since that
blocks the release, and then discuss master if needed.

Thanks.

--- src/doc.c~0 2016-06-20 08:49:44.000000000 +0300
+++ src/doc.c   2016-08-15 11:24:07.894579900 +0300
@@ -738,8 +738,9 @@ Otherwise, return a new string.  */)
   unsigned char const *start;
   ptrdiff_t length, length_byte;
   Lisp_Object name;
-  bool multibyte;
+  bool multibyte, pure_ascii;
   ptrdiff_t nchars;
+  Lisp_Object orig_string = Qnil;
 
   if (NILP (string))
     return Qnil;
@@ -752,6 +753,20 @@ Otherwise, return a new string.  */)
   enum text_quoting_style quoting_style = text_quoting_style ();
 
   multibyte = STRING_MULTIBYTE (string);
+  /* Pure-ASCII unibyte input strings should produce unibyte strings
+     if substitution doesn't yield non-ASCII bytes, otherwise they
+     should produce multibyte strings.  */
+  pure_ascii = SBYTES (string) == count_size_as_multibyte (SDATA (string),
+                                                          SCHARS (string));
+  /* If the input string is unibyte and includes non-ASCII characters,
+     make a multibyte copy, so as to be able to return the original
+     unibyte string if no substitution eventually happens.  */
+  if (!multibyte && !pure_ascii)
+    {
+      orig_string = string;
+      string = Fstring_make_multibyte (Fcopy_sequence (string));
+      multibyte = true;
+    }
   nchars = 0;
 
   /* KEYMAP is either nil (which means search all the active keymaps)
@@ -933,8 +948,8 @@ Otherwise, return a new string.  */)
 
        subst_string:
          start = SDATA (tem);
-         length = SCHARS (tem);
          length_byte = SBYTES (tem);
+         length = SCHARS (tem);
        subst:
          nonquotes_changed = true;
        subst_quote:
@@ -956,8 +971,8 @@ Otherwise, return a new string.  */)
               && quoting_style == CURVE_QUOTING_STYLE)
        {
          start = (unsigned char const *) (strp[0] == '`' ? uLSQM : uRSQM);
-         length = 1;
          length_byte = sizeof uLSQM - 1;
+         length = 1;
          idx = strp - SDATA (string) + 1;
          goto subst_quote;
        }
@@ -995,6 +1010,8 @@ Otherwise, return a new string.  */)
            }
        }
     }
+  else if (!NILP (orig_string))
+    tem = orig_string;
   else
     tem = string;
   xfree (buf);


--- src/doc.c~0 2016-06-20 08:49:44.000000000 +0300
+++ src/doc.c   2016-08-15 11:13:15.132137200 +0300
@@ -738,7 +738,7 @@ Otherwise, return a new string.  */)
   unsigned char const *start;
   ptrdiff_t length, length_byte;
   Lisp_Object name;
-  bool multibyte;
+  bool multibyte, pure_ascii;
   ptrdiff_t nchars;
 
   if (NILP (string))
@@ -752,6 +752,11 @@ Otherwise, return a new string.  */)
   enum text_quoting_style quoting_style = text_quoting_style ();
 
   multibyte = STRING_MULTIBYTE (string);
+  /* Pure-ASCII unibyte input strings should produce unibyte strings
+     if substitution doesn't yield non-ASCII bytes, otherwise they
+     should produce multibyte strings.  */
+  pure_ascii = SBYTES (string) == count_size_as_multibyte (SDATA (string),
+                                                          SCHARS (string));
   nchars = 0;
 
   /* KEYMAP is either nil (which means search all the active keymaps)
@@ -933,8 +938,11 @@ Otherwise, return a new string.  */)
 
        subst_string:
          start = SDATA (tem);
-         length = SCHARS (tem);
          length_byte = SBYTES (tem);
+         if (multibyte || pure_ascii)
+           length = SCHARS (tem);
+         else
+           length = length_byte;
        subst:
          nonquotes_changed = true;
        subst_quote:
@@ -956,8 +964,11 @@ Otherwise, return a new string.  */)
               && quoting_style == CURVE_QUOTING_STYLE)
        {
          start = (unsigned char const *) (strp[0] == '`' ? uLSQM : uRSQM);
-         length = 1;
          length_byte = sizeof uLSQM - 1;
+         if (multibyte || pure_ascii)
+           length = 1;
+         else
+           length = length_byte;
          idx = strp - SDATA (string) + 1;
          goto subst_quote;
        }
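
Both patches rely on the same pure_ascii test, and the second one
additionally picks the substitution length from it.  The standalone C
sketch below illustrates that logic outside Emacs.  The names
size_as_multibyte, is_pure_ascii and subst_length are hypothetical,
and the sketch assumes that count_size_as_multibyte charges 2 bytes
for every byte with the high bit set and 1 byte otherwise; it is not
the Emacs implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical analogue of count_size_as_multibyte: assume an ASCII
   byte needs 1 byte in the multibyte representation and any other
   byte needs 2.  */
ptrdiff_t
size_as_multibyte (const unsigned char *s, ptrdiff_t nbytes)
{
  ptrdiff_t total = 0;
  for (ptrdiff_t i = 0; i < nbytes; i++)
    total += (s[i] & 0x80) ? 2 : 1;
  return total;
}

/* A unibyte string is pure ASCII iff converting it to multibyte
   would not change its size in bytes.  */
bool
is_pure_ascii (const unsigned char *s, ptrdiff_t nbytes)
{
  return size_as_multibyte (s, nbytes) == nbytes;
}

/* Length rule from the second patch: count the substituted text in
   characters when the result can be treated as multibyte or pure
   ASCII, and in bytes when the input is unibyte non-ASCII.  */
ptrdiff_t
subst_length (bool multibyte, bool pure_ascii,
              ptrdiff_t nchars, ptrdiff_t nbytes)
{
  return (multibyte || pure_ascii) ? nchars : nbytes;
}
```

For a curly quote (1 character, 3 bytes), subst_length yields 1 for a
multibyte or pure-ASCII input and 3 for a unibyte non-ASCII input such
as "\200...", which is the case the second patch treats differently.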




