Re: How to convert .doc to plain text ascii in emacs.
From: Thomas Persson
Subject: Re: How to convert .doc to plain text ascii in emacs.
Date: Sun, 02 May 2004 21:26:45 +0200
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)
gebser@speakeasy.net writes:
> Thanks very much. Your elisp works great. There's one glitch (which I
> realize is from antiword):
>
> The three characters "\342\200\231" should be replaced by the single
> apostrophe character (').
The fact that antiword and my code leave you with a buffer containing
numerical codes instead of the characters themselves is your first
problem. This doesn't happen for me at all. It's either a problem with
antiword or a problem with how Emacs displays characters. Try running
antiword from the command line to figure out which.
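For what it's worth, the three bytes \342\200\231 are the UTF-8 encoding
of U+2019 (RIGHT SINGLE QUOTATION MARK), so seeing them as octal escapes
suggests the output was not decoded as UTF-8. A quick way to check this
from Lisp, as a sketch only (the function name my-decode-utf8-bytes is
made up for illustration and is not part of antiword or Emacs):

```elisp
;; Hypothetical helper: decode a string of raw bytes as UTF-8.
(defun my-decode-utf8-bytes (bytes)
  "Return BYTES, a unibyte string of raw bytes, decoded as UTF-8."
  (decode-coding-string bytes 'utf-8))

;; The octal escapes below form a unibyte string of three raw bytes.
;; Decoding them gives a one-character string; its code is #x2019.
(let ((decoded (my-decode-utf8-bytes "\342\200\231")))
  (list (length decoded) (string-to-char decoded)))
```

If the result is one character with code #x2019, the bytes themselves are
fine and the question is only how the buffer was decoded.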
> To do this by hand, I did M-x replace-regexp Return C-q 342 Return
> C-q 200 Return C-q 231 Return Return ' Return
>
> but this does not find the intended string. The problem seems to be
> that C-q 342 is immediately (in the minibuffer) converted into an 'a'
> with a grave symbol over it. Putting the point on the backslash (\)
> preceding the 342 in the antiword-converted buffer and doing "C-u C-x ="
> indeed shows this a-with-grave character to be (0342, 226, 0xe2).
>
> To create a simple test case, do the following:
>
> Open an empty *scratch* buffer. Enter into it: C-q 342 Return C-q 200
> Return C-q 231 Return. The first character that appears is the
> a-with-grave; the second and third characters appear properly as
> \200\231.
>
> It is, I think, the failure of C-q 342 to be represented as \342 which
> is the problem. What is the solution?
The trouble you have replacing the numerical character codes with the
characters themselves, however, is definitely an Emacs-related
problem. As far as I can tell, it would work to add the replacement
to the end of the antiword-buffer function, like this:
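(defun antiword-buffer ()
  "Take the current buffer as input to the external program antiword.
If the current buffer is an MS Word document, its contents are replaced
with the output from antiword and the extension `.doc' is replaced
with `.txt' in the buffer-file-name."
  (let ((txt-buffer-file-name (concat (substring (buffer-file-name) 0 -4)
                                      ".txt")))
    ;; Pipe the whole buffer through antiword, replacing its contents.
    (shell-command-on-region (point-min) (point-max)
                             "cat | antiword -" nil t nil)
    (undo-start)
    (if (equal (buffer-string) "- is not a Word Document.\n")
        ;; Not a Word file: undo the replacement and say so.
        (or (undo-more 1)
            (message "%s - is not a Word Document." (current-buffer)))
      (set-visited-file-name txt-buffer-file-name)
      ;; Replace the three-byte sequence with a plain apostrophe.
      ;; A search loop is used because `replace-regexp' is meant for
      ;; interactive use and starts at point, not at the buffer start.
      (goto-char (point-min))
      (while (search-forward "\342\200\231" nil t)
        (replace-match "'" t t))
      ;; Clear the modified flag last, so the replacement does not set it.
      (not-modified))))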
;; The following expression makes sure that antiword-buffer is run when a
;; file with the .doc extension is opened.
(setq auto-mode-alist
      (append '(("\\.doc\\'" . antiword-buffer))
              auto-mode-alist))
If that doesn't work, then perhaps "wvWare" or "undoc.el", as previous
posters have suggested, might be better solutions for you.
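
The replacement step can also be tried on its own, separately from
antiword-buffer. The following is a minimal sketch, assuming the buffer
has already been decoded so that the quote appears as the single
character U+2019 rather than as three raw bytes. The function name
my-asciify-apostrophes is hypothetical, chosen only for this example:

```elisp
;; Hypothetical helper, not part of the code above.
(defun my-asciify-apostrophes ()
  "Replace every U+2019 in the current buffer with an ASCII apostrophe."
  (interactive)
  (save-excursion
    ;; Start from the top so that no occurrence before point is missed.
    (goto-char (point-min))
    (let ((curly-quote (string #x2019)))
      (while (search-forward curly-quote nil t)
        ;; FIXEDCASE and LITERAL are both t: insert "'" exactly as given.
        (replace-match "'" t t)))))
```

It can be run with M-x my-asciify-apostrophes in the converted buffer.
If the buffer still shows \342\200\231, the text has not been decoded as
UTF-8 and this function will find nothing to replace.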