
bug#16286: 24.3.50; insert-file-contents may bring invisible garbage

From: Eli Zaretskii
Subject: bug#16286: 24.3.50; insert-file-contents may bring invisible garbage
Date: Thu, 02 Jan 2014 18:30:30 +0200

> From: Andrey Kotlarski <address@hidden>
> Date: Sun, 29 Dec 2013 16:05:22 +0200
> In trunk inserting few bytes from file may sometimes result in nothing
> visible in the buffer while invisible artifacts are present and may
> affect subsequent operations.  Moreover, there doesn't seem to be way to
> recover from this.  Here's example session with emacs -Q:
> (let ((file "test.txt"))
>   (unless (file-exists-p file)
>     (find-file file)
>     (insert "абв")                      ;Cyrillic letters
>     (save-buffer)
>     (kill-buffer))
>   (let ((buf (generate-new-buffer "test")))
>     (switch-to-buffer buf)
>     (insert-file-contents file nil 0 2) ;inserts а
>     (goto-char (point-max))
>     (insert-file-contents file nil 2 3) ;returns 0 bytes inserted, nothing
>                                         ;visible in the buffer
>                                         ;but actually there is
>     (erase-buffer)                      ;and still is
>     (insert-file-contents file nil 2 4) ;should insert б, instead let: Wrong
>                                         ;type argument: inserted-chars, 1
>     (message "%S" (buffer-string)) ;"бЀ" while buffer is visibly empty
>     ))
> Trying to insert multibyte characters now brings content length issues,
> garbage inserted and at some point Emacs crashes.

Your Emacs is built without --enable-checking; if that configure-time
switch is used, Emacs hits an assertion violation as soon as this sexp
is evaluated:

     (insert-file-contents file nil 2 3)

Also, you are wrong about there being some invisible stuff in the
buffer.  The problem is elsewhere: Emacs gets confused about the
number of characters and the number of bytes in the buffer.  These two
counts should be in sync at all times; once they become
unsynchronized, Emacs will generally crash very soon.

I'm CC'ing Handa-san in the hope that he will be able to suggest a

The problem happens in decode_coding_gap (called from
insert-file-contents), in this code fragment (note the call to

    detect_coding (coding);
  attrs = CODING_ID_ATTRS (coding->id);
  if (! disable_ascii_optimization
      && ! coding->src_multibyte
      && ! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
      && NILP (CODING_ATTR_POST_READ (attrs))
      && NILP (get_translation_table (attrs, 0, NULL)))
    {
      chars = coding->head_ascii;
      if (chars < 0)
        chars = check_ascii (coding);
      if (chars != bytes)
        {
          /* There exists a non-ASCII byte.  */
          if (EQ (CODING_ATTR_TYPE (attrs), Qutf_8))
            {
              if (coding->detected_utf8_chars >= 0)
                chars = coding->detected_utf8_chars;  <<<<<<<<<<<<<<
              else
                chars = check_utf_8 (coding);

This reuses the count of valid UTF-8 sequences in the byte stream to
be decoded.  detect_coding_utf_8, which is called by detect_coding,
computes that count and stores it in coding->detected_utf8_chars.  In
the case in point, detect_coding_utf_8 finds zero valid UTF-8
sequences, and so 'chars' becomes zero.  But the number of decoded
bytes is not adjusted to fit that, so it stays at its original value
of 1.  Then, decode_coding_gap does this:

          coding->produced = bytes;
          coding->produced_char = chars;
          insert_from_gap (chars, bytes, 1);

Since 'chars' is zero, but 'bytes' is 1, this causes a mismatch
between the buffer's Z and Z_BYTE values, and from there it's a
slippery slope all the way to an assertion violation during redisplay.
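
To make the mismatch concrete, here is a minimal standalone sketch in
C, not Emacs code.  The names utf8_seq_len and count_complete_chars
are hypothetical, and the validity check is simplified (it does not
reject overlong forms or surrogates).  It counts the complete UTF-8
sequences at the start of a byte range and reports how many bytes
they occupy:

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by lead byte C,
   or 0 if C cannot start a sequence (continuation or invalid byte).
   Hypothetical helper, not part of Emacs.  */
static size_t
utf8_seq_len (unsigned char c)
{
  if (c < 0x80) return 1;
  if (c >= 0xC2 && c <= 0xDF) return 2;
  if (c >= 0xE0 && c <= 0xEF) return 3;
  if (c >= 0xF0 && c <= 0xF4) return 4;
  return 0;
}

/* Count the complete UTF-8 sequences at the start of SRC[0..NBYTES),
   stopping at the first invalid or truncated one.  Store the number
   of bytes those sequences occupy in *CONSUMED.  */
static size_t
count_complete_chars (const unsigned char *src, size_t nbytes,
                      size_t *consumed)
{
  size_t i = 0, chars = 0;

  assert (src != NULL && consumed != NULL);
  while (i < nbytes)
    {
      size_t len = utf8_seq_len (src[i]);
      size_t k;

      /* Stop on a byte that cannot lead, or on a truncated tail.  */
      if (len == 0 || len > nbytes - i)
        break;
      /* Every following byte must look like 10xxxxxx.  */
      for (k = 1; k < len; k++)
        if ((src[i + k] & 0xC0) != 0x80)
          break;
      if (k < len)
        break;
      i += len;
      chars++;
    }
  *consumed = i;
  return chars;
}
```

The file in the recipe holds "абв", which is D0 B0 D0 B1 D0 B2 in
UTF-8.  The byte range 2..3 is the single byte D0, so the sketch
finds zero complete characters occupying zero bytes, while the caller
still has one byte in hand.  Inserting with a character count of 0
and a byte count of 1 is exactly the inconsistency described above.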

Similar problems happen when insert-file-contents is called to read
some number of bytes that doesn't end at a UTF-8 sequence boundary.

I think I see a potential reason for this in detect_coding_utf_8, near
its end:

      if (nchars < src_end - coding->source)
        /* The found characters are less than source bytes, which
           means that we found a valid non-ASCII characters.  */
        detect_info->found |= CATEGORY_MASK_UTF_8_AUTO | 

This misses cases such as the present one, where the detection loop
consumed one byte, found it not to be the head byte of a UTF-8
sequence, and then hit the end of the source bytes.  It looks like the
function incorrectly returns a success indication in this case, which
might be part of the problem.

> In release 24.3 and earlier insert-file-contents seems to always insert
> something, be it wrongly decoded or raw eight-bit characters.  But it is
> visible and easy to deal with.  The above example works fine there.
> This is useful for the vlf package (https://github.com/m00natic/vlfi) as
> a way to detect insufficient amount of bytes requested and allows
> further adjustment.

What vlf does is strange and IMO not the best possible solution to
this issue:

        (cond ((vlf-partial-decode-shown-p) ;remove raw bytes from end
               (goto-char (point-max))
               (while (eq (char-charset (preceding-char)) 'eight-bit)
                 (setq shift-end (1- shift-end))
                 (delete-char -1)))
              ((< end vlf-file-size) ;add bytes until new character is displayed
               (let ((position (or position (point-min)))
                     (expected-size (buffer-size)))
                 (while (and (progn
                               (setq shift-end (1+ shift-end)
                                     end (1+ end))
                               (delete-region position (point-max))
                               (goto-char position)
                               (insert-file-contents buffer-file-name
                                                     nil start end)
                               (< end vlf-file-size))
                             (= expected-size (buffer-size))))))))

This seems to have a subtle misfeature of not supporting files with
inconsistent encoding, or files with binary data, because there _all_
characters will belong to the eight-bit charset.  Also, I don't
understand why the removal of raw bytes is conditioned on the Emacs
version.  Why not just remove them unconditionally?  If there are
none, nothing will be removed.

More to the point, I'm not sure whether inserting raw bytes in
insert-file-contents when a portion of a multibyte sequence was read
(i.e. go back to what Emacs 24.3 did) will be good for vlf.  It sounds
to me much better if Emacs would only return complete characters read
from the file, so that applications will not need to remove those
stray bytes.
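
As a rough illustration of what "only complete characters" could
mean for UTF-8 data, here is a standalone C sketch, not a proposed
patch.  The name complete_prefix_len is hypothetical, and the sketch
assumes that everything before the final sequence is already valid,
so it looks only at the tail of the bytes that were read:

```c
#include <assert.h>
#include <stddef.h>

/* Return how many of the NBYTES bytes at SRC can be kept so that the
   kept prefix does not end in the middle of a UTF-8 sequence.  Only
   the tail is examined; earlier bytes are assumed to be valid.
   Hypothetical helper, not part of Emacs.  */
static size_t
complete_prefix_len (const unsigned char *src, size_t nbytes)
{
  size_t back = 0;
  size_t need;
  unsigned char lead;

  assert (src != NULL || nbytes == 0);

  /* Step back over at most three trailing continuation bytes.  */
  while (back < nbytes && back < 3
         && (src[nbytes - 1 - back] & 0xC0) == 0x80)
    back++;
  if (back == nbytes)
    return nbytes;          /* Empty, or no lead byte to judge by.  */

  lead = src[nbytes - 1 - back];
  if (lead < 0x80)
    need = 1;
  else if ((lead & 0xE0) == 0xC0)
    need = 2;
  else if ((lead & 0xF0) == 0xE0)
    need = 3;
  else if ((lead & 0xF8) == 0xF0)
    need = 4;
  else
    return nbytes;          /* Not a lead byte: leave the data alone.  */

  /* The last sequence has BACK + 1 bytes present; drop it if that is
     fewer than its lead byte calls for.  */
  if (back + 1 < need)
    return nbytes - 1 - back;
  return nbytes;
}
```

With the "абв" file from the recipe (D0 B0 D0 B1 D0 B2), reading the
first three bytes would keep two of them, and reading only the byte
at offset 2 would keep none.  A caller could then report the shorter
length back to the application instead of handing it a stray byte.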

Finally, it would seem a better design for vlf to always read a few
more bytes than was requested into some scratch buffer, and then
decode them manually to determine just how many to copy to the main
