Re: [h-e-w] Processing chars above \200

From: Eli Zaretskii
Subject: Re: [h-e-w] Processing chars above \200
Date: Fri, 21 Sep 2018 21:22:40 +0300

> Date: Fri, 21 Sep 2018 13:32:43 -0400
> From: John J. Xenakis <address@hidden>
> Cc: address@hidden
> (defun 8bit ()
> "Test 8-bit characters"
> (let* (
>      (pos (point))  (NL "\n")
>      (char1 "\235")  (char2 "\220")
>      (pat1  "\235")  (pat2 "[\230-\237]")
>   )
>     (insert "This is a char: " char1 NL)
>     (insert "This is another char: " char2 NL)
>     (goto-char pos)
>     (query-replace-regexp pat1 "x")  ; replaces
>     (goto-char pos)
>     (query-replace-regexp pat2 "y")  ; does not work
> ))
> Now, open a brand new empty file, and execute this macro.  The first
> replace works, but the second replace does not.  I don't know whether
> this is what's supposed to happen, but at least it doesn't work as I
> would expect.

After you execute this macro, if you go to the \235 or \220 characters
and type "C-x =", what do you see?  Does what Emacs says about these
raw bytes give you a hint regarding what is going on?

> OK, so here's the overall problem.  In the process of writing books
> and articles, I create text files with text from a variety of sources.
> The sources can include copy and paste from web sites, doc files, pdf
> files, and application windows, and can also include text generated by
> my scripts, usually in Perl or Java.

On what OS are you doing all that?  I assume Windows, but what
versions?  And what applications do you copy text from?

> I should mention that when I open a file, I use the coding system
> "windows-1252-dos."

That is probably wrong nowadays.  Since you seem to say your files are
full of raw bytes, you should use raw-text, not cp1252.  (That is, if
you cannot resolve your problem in a better way, so that what you get
in the buffer before saving it is not raw bytes, but actual non-ASCII
characters.  Given your answers to some of my questions, maybe we
could make that happen, unless you are working with very old

> Sometimes emacs opens one of these text files, and magically decides
> that it's a "(Unix)" file.  This is a nightmare because then I have "^M"
> at the end of each line, and I can't get rid of them.  I've written a macro
> that replaces all ^M's with "", and that gets rid of them for a while,
> but they come back.  I've tried using utility programs to convert files
> to windows or unix or mac formats, and back again, but the problem is never
> fixed.

These are all signs of working with files with inconsistent encoding.
Emacs employs some guesswork to decide what the encoding is, but it
only examines a small portion of the file before it makes the guess,
so inconsistent encoding can dupe it into making the wrong decisions.
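
One common cause of the stray ^M symptom, assumed here rather than
confirmed in this thread, is a file that mixes CRLF and bare LF line
endings.  The sketch below counts both kinds in the current buffer; the
helper name `my-count-eol-styles' is hypothetical, and the result is
only meaningful if the buffer was read without end-of-line conversion,
so that the carriage returns are still present as characters.

```elisp
(defun my-count-eol-styles ()
  "Return (CRLF . LF) counts of line endings in the current buffer.
Hypothetical diagnostic helper, not part of Emacs."
  (save-excursion
    (goto-char (point-min))
    (let ((crlf 0) (lf 0))
      ;; Visit every newline and look at the character just before it.
      (while (search-forward "\n" nil t)
        (if (eq (char-before (1- (point))) ?\r)
            (setq crlf (1+ crlf))
          (setq lf (1+ lf))))
      (cons crlf lf))))

;; Usage on a small mixed sample:
(with-temp-buffer
  (insert "a\r\nb\nc\r\n")
  (my-count-eol-styles))   ; => (2 . 1)
```

If both counts are nonzero, the file has inconsistent line endings, and
no single end-of-line convention describes it correctly.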

> OK, you may be sorry you asked, but that's what I'm trying to do.

I'm not sorry, I actually guessed you have something like that on your
hands.
> What's the solution?

I'd start with "emacs -Q", and upgrade to Emacs 26 if you haven't
already.  I think you may have accumulated quite a few semi-correct
hacks trying to solve these problems, and those hacks are now biting
you.
In "emacs -Q", try copy/pasting text from the applications you care
about, and see what apps give you which problems, if any.  Then we
will try to solve those problems one at a time.

Your first problem with the kind of solution you are used to is that
you assume \220 etc. are raw 8-bit bytes everywhere you see them in
Emacs.  That assumption is false, as "C-x =" above shows you.  I
actually hope that you won't need any such replacements at all, but if
you do, we will get to how one should go about doing this safely.
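
As a minimal sketch of why the assumption fails: a Lisp string literal
such as "\235" is a unibyte string holding the byte #x9D, but in
multibyte text the same raw byte is represented by a character with a
much larger code.  The helper name `my-describe-first-char' is
hypothetical; `string-to-multibyte', `multibyte-string-p' and
`char-charset' are standard Emacs functions.

```elisp
(defun my-describe-first-char (str)
  "Describe the first element of STR before and after multibyte conversion.
Return (MULTIBYTE-P BYTE-OR-CHAR MULTIBYTE-CHAR CHARSET).
Hypothetical helper, for illustration only."
  (let* ((multi (string-to-multibyte str)) ; raw bytes become eight-bit chars
         (ch (aref multi 0)))
    (list (multibyte-string-p str)         ; is the original string multibyte?
          (aref str 0)                     ; element as stored in STR
          ch                               ; the same element in multibyte text
          (char-charset ch))))             ; charset Emacs assigns to it

(my-describe-first-char "\235")
;; => (nil 157 4194205 eight-bit)   ; 157 = #x9D, 4194205 = #x3FFF9D
```

So a pattern written in terms of the codes \230-\237 is not necessarily
talking about the same characters that are in the buffer.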
