bug#1215: 23.0.60; unibyte->multibyte conversion problem (in search-forw
From: Eduardo Ochs
Subject: bug#1215: 23.0.60; unibyte->multibyte conversion problem (in search-forward and friends)
Date: Tue, 21 Oct 2008 12:00:58 -0400
Hello,
This may not be exactly a bug; I'm just struggling with an obscure
part of Emacs. Anyway, I did my best to make this look like a nice
bug report, and to make the tests clear enough to help other people
who also find unibyte<->multibyte conversions obscure.
The short story
===============
Let me refer to strings like "<<tag>>" - where the "<<" and ">>" stand
for guillemets, i.e., the characters that we type with `C-x 8 <' and
`C-x 8 >' - as "anchors". So: if I produce an anchor string in a
unibyte buffer and then I search for an occurrence of that string in
a multibyte buffer, the search fails.
The two small blocks below illustrate this. Instructions: save the
first one to "/tmp/1.txt", the second one to "/tmp/2.txt", and then
run:
(load-file "/tmp/1.txt")
It will show "uni" in the "*Messages*" buffer, and the search will
fail. The detailed message about the failure of the search will be
like this:
progn: Search failed: "\302\253foo\302\273"
meaning the anchor string has been incorrectly converted.
;;--------snip,snip--------
;; -*- coding: raw-text-unix -*-
;; (save-this-block-as "/tmp/1.txt")
(progn
  (find-file "/tmp/2.txt")
  (goto-char (point-min))
  (setq anchorstr "«foo»")
  (message (if (multibyte-string-p anchorstr) "multi" "uni"))
  (search-forward anchorstr))
;;--------snip,snip--------
;;--------snip,snip--------
;; -*- coding: latin-1 -*-
;; (save-this-block-as "/tmp/2.txt")
(search-forward "«foo»")
;; «foo»
;;--------snip,snip--------
The long story
==============
Save the block below as "/tmp/3.txt" and follow the instructions in
it. Note that it doesn't have any non-ASCII characters - the anchors
are produced by running the "(insert ...)" sexps.
;;--------snip,snip--------
;; -*- coding: latin-1 -*-
;; (save-this-block-as "/tmp/3.txt")
;; Run the "progn" below with C-x C-e.
;; It will create a line like this:
;; <<anchor>>\253anchor\273\253anchor\273\253anchor\273
;; (but the "<<", ">>", "\253", "\273" are single characters).
;; Don't delete that line, it will be used later.
;;
(progn
  (defun mmb (str) (string-make-multibyte str))
  (defun mub (str) (string-make-unibyte str))
  (insert 171 "anchor" 187)
  (insert "\253anchor\273")
  (insert (mub "\253anchor\273"))
  (insert (mmb (mub "\253anchor\273")))
  )
;; Now try to save this file.
;; Emacs will complain about the "\253"s and "\273"s - it will
;; say that iso-latin-1-unix and utf-8-unix cannot encode them.
;; The "<<" and ">>" are ok, though...
;;
;; So: leave the "<<anchor>>" above, delete the "\253anchor\273"s,
;; save this file, and reload it. DON'T SKIP THIS STEP - the
;; charset properties mentioned below behave differently before
;; and after reloads, and I don't know exactly the mechanics of
;; this... 8-\
;;
;; If we inspect the "<<", ">>", "\253", "\273" with `C-x ='
;; we see this:
;; Char: << (171, #o253, #xab, file #xAB)
;; Char: >> (187, #o273, #xbb, file #xBB)
;; Char: \253 (4194219, #o17777653, #x3fffab, raw-byte)
;; Char: \273 (4194235, #o17777673, #x3fffbb, raw-byte)
;;
;; Now mark the "<<anchor>>" above and copy it to the top of
;; the kill ring with `M-w'. Let's examine the results of
;; several obvious ways to (re)create the "<<anchor>>"
;; above as a string...
;; Here are some of the results:
;;
;; "\253anchor\273" ==> "<<anchor>>"
;; (mub "\253anchor\273") ==> "<<anchor>>"
;; (mmb (mub "\253anchor\273")) ==> "\253anchor\273"
;; (car kill-ring) ==>
;; #("<<anchor>>" 0 8 (charset iso-8859-1))
;; (mub (car kill-ring)) ==> "<<anchor>>"
;; (mmb (mub (car kill-ring))) ==> "\253anchor\273"
"\253anchor\273"
(mub "\253anchor\273")
(mmb (mub "\253anchor\273"))
(mub (mmb (mub "\253anchor\273")))
(mapcar 'identity "\253anchor\273")
(mapcar 'identity (mub "\253anchor\273"))
(mapcar 'identity (mmb (mub "\253anchor\273")))
(car kill-ring)
(mub (car kill-ring))
(mmb (mub (car kill-ring)))
(mapcar 'identity (car kill-ring))
(mapcar 'identity (mub (car kill-ring)))
(mapcar 'identity (mmb (mub (car kill-ring))))
;; This is the weird part.
;; Let's insert another "<<anchor>>"/"\253anchor\273" pair, and
;; let's try to jump to its "anchors" with `search-backward'.
(insert 171 "anchor" 187 "\n\253anchor\273")
(search-backward "\253anchor\273")
(search-backward (mub "\253anchor\273"))
(search-backward (mmb (mub "\253anchor\273")))
(search-backward (car kill-ring))
(search-backward (mub (car kill-ring)))
(search-backward (mmb (mub (car kill-ring))))
;; Only "(search-backward (car kill-ring))" jumps to
;; "<<anchor>>" - all the others jump to "\253anchor\273".
;; The trick - aha! - is that "(car kill-ring)" holds this
;; string,
;;
;; (car kill-ring) ==>
;; #("<<anchor>>" 0 8 (charset iso-8859-1))
;;
;; and the "(charset iso-8859-1)" property is essential...
;;--------snip,snip--------
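The difference that `C-x =' reports between the guillemets and the raw
bytes can also be checked from Lisp, without going through a buffer.
The sketch below is only an illustration, assuming that
`string-to-multibyte' and `decode-coding-string' behave as their
docstrings describe; `anchor-first-char-codes' is a hypothetical helper
name, not something that ships with Emacs.

```elisp
;; Hypothetical helper (not part of Emacs): return the code of the first
;; element of RAW under three different views of the same unibyte string.
(defun anchor-first-char-codes (raw)
  (list
   ;; Unibyte view: the element is a plain byte.
   (aref raw 0)
   ;; `string-to-multibyte' keeps bytes >= 128 as raw-byte characters.
   (aref (string-to-multibyte raw) 0)
   ;; Decoding as Latin-1 gives the Latin-1 character for that byte.
   (aref (decode-coding-string raw 'latin-1) 0)))

;; "\253anchor\273" is read as a unibyte string, as in the tests above.
(anchor-first-char-codes "\253anchor\273")
```

If the functions behave as documented, the second number should be the
raw-byte code 4194219 and the other two should be 171, matching the
`C-x =' output quoted in the block above.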
What is the standard way to convert unibyte strings (for example
anchor strings, generated from code in raw-text-unix ".el" files) to
strings with the right charset property (if needed) and the right
encoding? I couldn't find the functions for that...
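One candidate, offered as a sketch rather than a confirmed answer, is to
decode the unibyte string explicitly with `decode-coding-string' before
searching. This assumes the bytes are meant as Latin-1; `anchor-decode'
and `anchor-search-demo' are hypothetical names used only for
illustration.

```elisp
;; Hypothetical helper: treat the bytes of unibyte STR as Latin-1 and
;; return the decoded string.
(defun anchor-decode (str)
  (decode-coding-string str 'latin-1))

;; Hypothetical demo: insert a guillemet anchor as characters into a
;; temporary multibyte buffer, then search for the decoded form of the
;; unibyte literal.  Returns the position after the match, or nil.
(defun anchor-search-demo ()
  (with-temp-buffer
    (set-buffer-multibyte t)
    (insert 171 "anchor" 187)
    (goto-char (point-min))
    (search-forward (anchor-decode "\253anchor\273") nil t)))

(anchor-search-demo)
```

This sketch does not address whether the `charset' text property is
also needed, and it says nothing about bytes that came from a coding
system other than Latin-1.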
Cheers, thanks in advance,
Eduardo Ochs
eduardoochs at gmail.com
http://angg.twu.net/
P.S.: (emacs-version) ==>
"GNU Emacs 23.0.60.1 (i686-pc-linux-gnu, GTK+ Version 2.8.20)
of 2008-10-11 on dekooning"