bug-gnu-emacs

bug#5797: 23.1; search-forward in unibyte buffer for \377


From: rasmith
Subject: bug#5797: 23.1; search-forward in unibyte buffer for \377
Date: Mon, 29 Mar 2010 10:09:19 -0500 (CDT)


search-forward fails to find a unibyte \377 in a raw unibyte buffer.
I use "cgreek", a package written by Naoto Takahashi for handling
polytonic (ancient, fully accented) Greek.  It includes a file,
cgreek-tlg.el, for processing the files in the Thesaurus Linguae
Graecae, which have their own unique formats.  In these files, the
byte \377 is used as a string terminator.  Prior to Emacs 23, these
files could be processed by reading the file in with
insert-file-contents-literally, making the buffer unibyte with
(set-buffer-multibyte nil), and searching for the string terminator
with (search-forward (char-to-string ?\xff)).  In Emacs 23.1, however,
that search no longer finds the single byte \377 and instead matches
the two-byte sequence \231\277.

Changing the search function to (search-forward (unibyte-string ?\377))
has the same result.  

On further investigation, I'm not certain it's a bug: it may be an
intentional part of the modifications to accommodate utf-8.  Here are
the details:

In a multibyte buffer (set-buffer-multibyte t):

(search-forward (char-to-string ?\xff)) matches utf-8 "ÿ" (i.e. \303\277)
(search-forward (char-to-string ?\377)) matches utf-8 "ÿ"
(search-forward (unibyte-string ?\377)) matches byte \377

In a unibyte buffer (set-buffer-multibyte nil):

(search-forward (char-to-string ?\xff)) matches \231\277
(search-forward (char-to-string ?\377)) matches \231\277
(search-forward (unibyte-string ?\377)) matches \231\277

In other words, search-forward cannot find byte \377 when searching in
a *unibyte* buffer, but it can find that same byte if the buffer is
changed to multibyte.  The reason is that in a unibyte buffer,
search-forward apparently changes byte \377 to a two-byte
representation (but not to utf-8, which would be \303\277).  
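The unibyte-buffer cases above can be tried without the TLG files.  The
following is a minimal sketch, assuming a small test buffer holding the
bytes a, b, \377, c in place of real TLG data; the helper name
my-try-search-377 is hypothetical and is not part of cgreek-tlg.el.
It returns the position after the match, or nil if the search fails, so
the result can be compared across Emacs versions.

```elisp
;; Hypothetical reproduction helper; not taken from cgreek-tlg.el.
(defun my-try-search-377 (needle)
  "Search for NEEDLE in a fresh unibyte buffer holding a, b, \\377, c.
Return the position after the match, or nil if the search fails."
  (with-temp-buffer
    (set-buffer-multibyte nil)                 ; make the buffer unibyte
    (insert (unibyte-string ?a ?b #xff ?c))    ; assumed stand-in for TLG data
    (goto-char (point-min))
    (search-forward needle nil t)))            ; t: return nil instead of signaling

;; Try the two ways of building the search string discussed above.
(list (my-try-search-377 (char-to-string ?\xff))
      (my-try-search-377 (unibyte-string ?\377)))
```

What each call returns depends on the Emacs version, which is the point
of the comparison.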

This may be exactly the intended behavior of search-forward, but it
breaks scripts expecting search-forward to be able to find a single
high 8-bit byte in a unibyte buffer.  In context, changing the buffer
to multibyte is not a solution.

The code in which I found this error can be fixed by replacing
    (search-forward (char-to-string ?\xff))
with
    (skip-chars-forward "^\377")
    (forward-char 1)
(fix provided by Naoto Takahashi)
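
As a rough illustration, the replacement above could be wrapped in a
small function.  This is only a sketch: the name my-skip-past-byte-377
and the four-byte test buffer are assumptions made for the example, and
the end-of-buffer check is an addition that the two-line fix itself does
not have.

```elisp
;; Hypothetical wrapper around the skip-chars-forward workaround.
(defun my-skip-past-byte-377 ()
  "Move point just past the next byte \\377 in the current unibyte buffer.
Return the new position, or nil if no such byte follows point."
  (skip-chars-forward "^\377")   ; stop in front of the terminator, or at end
  (unless (eobp)                 ; added guard: no terminator was found
    (forward-char 1)
    (point)))

;; Usage on an assumed test buffer containing a, b, \377, c.
(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert (unibyte-string ?a ?b #xff ?c))
  (goto-char (point-min))
  (my-skip-past-byte-377))
```

Unlike search-forward, skip-chars-forward does not signal an error when
the byte is missing, so a caller that relied on the error would need the
nil return value (or an equivalent check) instead.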

However, that means that scripts counting on the old behavior of
search-forward will have to be modified. 



In GNU Emacs 23.1.1 (amd64-portbld-freebsd8.0, GTK+ Version 2.18.7)
 of 2010-03-25 on aristotle.tamu.edu
Windowing system distributor `The X.Org Foundation', version 11.0.10605000
configured using `configure  '--with-x-toolkit=gtk' 
'--x-libraries=/usr/local/lib' '--x-includes=/usr/local/include' 
'--prefix=/usr/local' '--mandir=/usr/local/man' '--infodir=/usr/local/info/' 
'--build=amd64-portbld-freebsd8.0' 'build_alias=amd64-portbld-freebsd8.0' 
'CC=cc' 'CFLAGS=-O2 -pipe -fno-strict-aliasing' 'LDFLAGS=-L/usr/local/lib 
-lintl' 'CPPFLAGS=-I/usr/local/include''

Important settings:
  value of $LC_ALL: en_US.UTF-8
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: en_US.UTF-8
  value of $XMODIFIERS: nil
  locale-coding-system: utf-8-unix
  default-enable-multibyte-characters: t

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  tool-bar-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  global-auto-composition-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
o <down> <down> <down> <return> C-q 0 0 0 <return> 
C-q 3 7 7 <return> <up> <up> <up> <left> <up> C-x C-e 
C-x o <down> <down> <down> <down> <backspace> <backspace> 
C-q 2 3 1 <return> ] <backspace> C-q 2 7 7 <return> 
<up> <up> <up> <up> C-e C-x C-e <up> <up> <left> C-x 
C-e <up> <up> <switch-frame> <down-mouse-1> <mouse-movement> 
<switch-frame> <mouse-1> <help-echo> <switch-frame> 
<switch-frame> <switch-frame> <switch-frame> <switch-frame> 
<switch-frame> <switch-frame> <switch-frame> <help-echo> 
<up> <up> <left> <up> <right> C-k C-y <return> C-y 
<left> <backspace> <backspace> <backspace> t <right> 
C-x C-e <down> <right> <right> <right> <right> <right> 
<right> <right> <right> <right> <right> <right> <right> 
<right> <right> <right> C-x C-e C-x o <down> C-x C-e 
<up> <up> <up> <left> <left> <left> <left> <return> 
<up> ( s e a r c h - f o r w a r d SPC ( c h a r - 
t o - s t r i o n g <backspace> <backspace> <backspace> 
g <backspace> g SPC <backspace> <backspace> n g SPC 
? \ x f f ) ) C-x C-e C-x o <up> <up> <down> <up> C-x 
C-e <down> <down> C-e C-x C-e <up> <up> <up> <up> C-e 
C-x C-e <up> <up> <left> C-x C-e <up> <up> <up> <up> 
<up> <up> C-e C-x C-e <down> C-e C-x C-e C-x o <down> 
<down> <down> <down> <down> <down> <return> C-q 3 7 
7 <return> <up> <up> <up> <up> <up> <up> <left> <left> 
C-x C-e <up> <up> <up> <up> <up> <up> <down> <left> 
<left> C-x C-e <up> <up> <up> <up> <left> C-x C-e <up> 
<up> <up> <up> <up> <left> <left> <left> <left> <left> 
C-x C-e <down> <down> C-e C-x C-e <up> <up> <up> <up> 
C-e C-x C-e <up> <up> <up> C-e C-x C-e <down> <switch-frame> 
<switch-frame> <help-echo> <help-echo> <help-echo> 
M-x r e p o r t <tab> b <tab> <return>

Recent messages:
Entering debugger...
326
Entering debugger...
nil
369 [3 times]
t
Entering debugger...
374 [2 times]
366
nil
369 [3 times]






