bug-gnu-emacs

bug#13505: Bug#696026: emacs24: file corruption on saving


From: Eli Zaretskii
Subject: bug#13505: Bug#696026: emacs24: file corruption on saving
Date: Sun, 20 Jan 2013 18:49:38 +0200

> From: Rob Browning <rlb@defaultvalue.org>
> Date: Sat, 19 Jan 2013 22:09:28 -0600
> Cc: 696026-forwarded@bugs.debian.org, Vincent Lefevre <vincent@vinc17.net>,
>       696026@bugs.debian.org
> 
> Vincent Lefevre <vincent@vinc17.net> writes:
> 
> > Package: emacs24
> > Version: 24.2+1-1
> > Severity: grave
> > Justification: causes non-serious data loss
> >
> > The file "file1" (attached) has the following contents:
> >
> > 00000000  6c e2 80 99 c3 a9 0a 74  65 73 74 e9 0a           |l......test..|
> >
> > 1. Open "file1" with "emacs -Q". It is regarded as
> >    an in-is13194-devanagari-unix file.
> >
> > 2. Type M-: (set-buffer-modified-p t) to mark the buffer as modified
> >    (so that one can save it).
> >
> > 3. Save the file with C-x C-s. It is proposed:
> >
> > [...]
> > Select one of the safe coding systems listed below,
> > or cancel the writing with C-g and edit the buffer
> >    to remove or modify the problematic characters,
> > or specify any other coding system (and risk losing
> >    the problematic characters).
> >
> >   raw-text emacs-mule no-conversion
> >
> > 4. Choose raw-text (the default) or no-conversion. One can assume
> >    that the file will not be modified. But it gets corrupted: one
> >    obtains a file "file2" (attached) with the following contents:
> >
> > 00000000  6c e0 a5 88 80 99 e0 a4  a5 e0 a4 8a 0a 74 65 73  |l............tes|
> > 00000010  74 e0 a4 bc 0a                                    |t....|
> >
> > Note: Actually "file1" has mixed UTF-8 and ISO-8859-1 contents due to
> > a user error. But due to this bug, an attempt to fix the problem with
> > Emacs makes things even worse! BTW, I had the same problem in the past
> > when attempting to edit an mbox file with Emacs (in this case, having
> > mixed UTF-8 and ISO-8859-1 contents is normal). How Emacs interprets
> > such contents doesn't matter, but by default, it mustn't corrupt the
> > file on saving.
> >
> > There is no such problem with GNU Emacs 23.4.1 (Debian package
> > emacs23 23.4+1-4).

First, this isn't really a regression: Emacs 23 has the same
"problem".  It's just that Emacs 23 doesn't autodetect
in-is13194-devanagari in this file, while Emacs 24 does.  If you say
"C-x RET c raw-text RET C-x C-f" to visit this file in Emacs 24, the
problem will be gone, which is exactly what happens in Emacs 23,
because it visits the file in raw-text to begin with.  Conversely, if
you use "C-x RET c in-is13194-devanagari RET C-x C-f" to visit the
file in Emacs 23, you will get the same "problem" saving it.

I didn't research the reason why Emacs 24 autodetects this encoding,
and whether this is on purpose.  Perhaps Handa-san could tell.

More to the point: there seems to be a fundamental misunderstanding
here regarding the effect of selecting an encoding at save time.  It
sounds like the OP thought that selecting a "literal" encoding, such
as raw-text, which is supposed to leave the binary stream unaltered
(apart from the EOL format), will ensure that a buffer will be saved
exactly as it was originally found on disk.  But this is false.  What
raw-text and no-conversion do is to write out the _internal_
representation of each character without any conversions.  The
original encoded form of the characters as found on disk at visit time
_cannot_ be recovered by saving with raw-text, because that encoded
form is lost without a trace when the file is _visited_ and decoded
into the internal representation.  The only information that's left is
the coding-system used to decode the characters.  But since the file's
encoding in this case is inconsistent, that coding-system cannot be
used to save it back (Emacs will not let you do so, as demonstrated in
the report), and therefore the original form cannot be recovered this
way.
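
To make the mechanism concrete, here is a rough Python sketch of the
visit/save round trip from the report.  It is a simplified model, not
Emacs code: the table ISCII_SUBSET is a hypothetical four-entry mapping
inferred only from the two hex dumps quoted above, and the internal
representation is assumed to be UTF-8 for decoded characters, with
undecodable bytes kept as raw bytes.

```python
# Simplified model of the round trip in the report; NOT Emacs's real code.
# ISCII_SUBSET is a hypothetical partial table inferred from the hex dumps.
ISCII_SUBSET = {0xE2: "\u0948", 0xC3: "\u0925", 0xA9: "\u090A", 0xE9: "\u093C"}

# Byte contents of "file1" and "file2" as shown in the report.
FILE1 = bytes.fromhex("6c e2 80 99 c3 a9 0a 74 65 73 74 e9 0a")
FILE2 = bytes.fromhex(
    "6c e0 a5 88 80 99 e0 a4 a5 e0 a4 8a 0a 74 65 73 74 e0 a4 bc 0a"
)


def visit(data):
    """Decode at visit time.

    Returns a list whose items are either str (a decoded character)
    or int (a byte that could not be decoded and is kept raw).
    """
    buf = []
    for b in data:
        if b < 0x80:
            buf.append(chr(b))            # ASCII passes through
        elif b in ISCII_SUBSET:
            buf.append(ISCII_SUBSET[b])   # decoded as a Devanagari character
        else:
            buf.append(b)                 # undecodable: kept as a raw byte
    return buf


def save_raw_text(buf):
    """Write the internal form as-is (assumed here to be UTF-8)."""
    out = bytearray()
    for item in buf:
        if isinstance(item, int):
            out.append(item)
        else:
            out += item.encode("utf-8")
    return bytes(out)


saved = save_raw_text(visit(FILE1))
print(saved.hex(" "))
```

Under these assumptions the saved bytes match "file2" rather than
"file1": the original byte values were replaced by characters at visit
time, and nothing at save time can map them back.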

What the user should do to avoid this data loss is prevent the
incorrect decoding of the file's contents when the file is visited.
To this end, the file should be visited with no-conversion or
raw-text, using "C-x RET c raw-text RET C-x C-f".  Then it will be
possible to repair the file and write it back using the same raw-text
encoding.
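
As an illustration only, the same idea can be sketched in Python, with
Latin-1 standing in for raw-text because it maps every byte to exactly
one code point and back.  The per-line repair below (keep lines that
are valid UTF-8, reinterpret the others as ISO-8859-1) is an assumed
strategy for this particular file, not something Emacs does by itself.

```python
# Sketch: a byte-preserving visit makes a lossless save and a repair possible.
# Latin-1 is used as a stand-in for raw-text (one byte <-> one code point).
FILE1 = bytes.fromhex("6c e2 80 99 c3 a9 0a 74 65 73 74 e9 0a")


def visit_raw(data):
    """Visit without any real decoding: each byte becomes one code point."""
    return data.decode("latin-1")


def save_raw(text):
    """Save with the same byte-preserving mapping used at visit time."""
    return text.encode("latin-1")


def repair_line(line):
    """Hypothetical repair: keep valid UTF-8 lines, treat others as Latin-1."""
    try:
        line.decode("utf-8")
        return line
    except UnicodeDecodeError:
        return line.decode("latin-1").encode("utf-8")


# Visiting and saving with the same byte-preserving mapping changes nothing.
roundtrip = save_raw(visit_raw(FILE1))

# Repairing on the raw bytes yields a file that is consistently UTF-8.
repaired = b"".join(repair_line(l) for l in FILE1.splitlines(keepends=True))
print(repaired.hex(" "))
```

Here the unmodified round trip returns the original bytes, and only the
line containing the stray ISO-8859-1 byte is rewritten by the repair.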

If the fact that the file's encoding is inconsistent is not
realized until some time after the file is visited, the user should
use "C-x RET r raw-text RET" to re-visit the file using raw-text.

IOW, only selecting the appropriate encoding _at_visit_time_ can
prevent data loss in these cases.  The expectation that "Emacs mustn't
corrupt the file on saving" when the file has inconsistent encoding
and was decoded with anything but raw-text or no-conversion is
unjustified.

Personally, I don't think there's a bug here.  It's a cockpit error.
