Re: [PATCH] Unicode Lisp reader escapes
From: Aidan Kehoe
Subject: Re: [PATCH] Unicode Lisp reader escapes
Date: Sun, 30 Apr 2006 10:14:20 +0200
On the twenty-ninth of April, Richard Stallman wrote:
> [Comments on the text taken into account in the revised patch below.]
>
> [...]
>
> What is the reason for needing both \u and \U, and the difference? Why
> not use a syntax like that of \x?
Both are fixed-length expressions, which is good, because people get into
the habit of typing "\u0123As I walked out one evening" instead of the more
disastrous "\u123As I walked out one evening", where the "A" that starts the
text is read as the fourth hex digit. We could provide the same
functionality with just the \U00ABCDEF syntax, but since code points above
#xFFFF are very rarely used, having to type the four leading zeroes would be
annoying most of the time.
That same "\u0123As I" versus "\u123As I walked out" problem is exactly why
the patch does not use variable-length constants, as \x does.
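
To make the difference concrete, here is a minimal C sketch, separate from the patch. The helper names `hex_value`, `parse_hex_variable` and `parse_hex_fixed` are made up for illustration and do not exist in lread.c. It assumes a NUL-terminated ASCII buffer holding the text that follows the `\u` or `\x`, and it does no overflow checking.

```c
#include <stddef.h>

/* Value of one ASCII hex digit, or -1 if C is not a hex digit.
   Explicit ranges are used instead of isxdigit(), which may be
   locale-dependent.  */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Variable-length style, in the manner of \x: keep consuming hex digits
   until a non-hex character is seen.  Returns the number of characters
   consumed and stores the value in *out.  */
static size_t parse_hex_variable(const char *s, unsigned long *out)
{
    size_t n = 0;
    unsigned long v = 0;
    int d;

    while ((d = hex_value((unsigned char) s[n])) >= 0) {
        v = (v << 4) + (unsigned long) d;
        n++;
    }
    *out = v;
    return n;
}

/* Fixed-length style: consume exactly COUNT hex digits.  Returns COUNT
   on success, or 0 if a non-hex character (including the terminating
   NUL) turns up first; *out is only written on success.  */
static size_t parse_hex_fixed(const char *s, size_t count, unsigned long *out)
{
    unsigned long v = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        int d = hex_value((unsigned char) s[i]);
        if (d < 0)
            return 0;
        v = (v << 4) + (unsigned long) d;
    }
    *out = v;
    return count;
}
```

Given the text "0123As I walked out", `parse_hex_fixed` with a count of 4 stops after "0123" and yields #x0123, leaving "As I walked out" untouched. `parse_hex_variable` on the same text also consumes the "A" and yields #x123A, so the string silently loses its first letter.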
lispref/ChangeLog addition:
2006-04-30 Aidan Kehoe <address@hidden>
* objects.texi (Character Type):
Describe the Unicode character escape syntax; \uABCD or \U00ABCDEF
specifies Unicode characters U+ABCD and U+ABCDEF respectively.
src/ChangeLog addition:
2006-04-30 Aidan Kehoe <address@hidden>
* lread.c (read_escape):
Provide a Unicode character escape syntax; \u followed by exactly
four or \U followed by exactly eight hex digits in a character or
string is read as a Unicode character with that code point.
GNU Emacs Trunk source patch:
Diff command: cvs -q diff -u
Files affected: src/lread.c lispref/objects.texi
Index: lispref/objects.texi
===================================================================
RCS file: /sources/emacs/emacs/lispref/objects.texi,v
retrieving revision 1.51
diff -u -u -r1.51 objects.texi
--- lispref/objects.texi 6 Feb 2006 11:55:10 -0000 1.51
+++ lispref/objects.texi 30 Apr 2006 08:08:05 -0000
@@ -431,6 +431,20 @@
bit values are 2**22 for alt, 2**23 for super and 2**24 for hyper.
@end ifnottex
+@cindex unicode character escape
+ Emacs provides a syntax for specifying characters by their Unicode code
+points. @code{?\uABCD} represents a character that maps to the code
+point @samp{U+ABCD} in Unicode-based representations (UTF-8 text files,
+Unicode-oriented fonts, etc.). There is a slightly different syntax for
+specifying characters with code points above @code{#xFFFF};
+@code{?\U00ABCDEF} represents an Emacs character that maps to the code
+point @samp{U+ABCDEF} in Unicode-based representations, if such an Emacs
+character exists.
+
+ Unlike in some other languages, while this syntax is available for
+character literals, and (see later) in strings, it is not available
+elsewhere in your Lisp source code.
+
@cindex @samp{\} in character constant
@cindex backslash in character constant
@cindex octal character code
Index: src/lread.c
===================================================================
RCS file: /sources/emacs/emacs/src/lread.c,v
retrieving revision 1.350
diff -u -u -r1.350 lread.c
--- src/lread.c 27 Feb 2006 02:04:35 -0000 1.350
+++ src/lread.c 30 Apr 2006 08:08:07 -0000
@@ -1743,6 +1743,9 @@
      int *byterep;
 {
   register int c = READCHAR;
+  /* \u takes exactly four hex digits, \U exactly eight.  Default to the
+     behaviour for \u, and change this value in the case that \U is seen.  */
+  int unicode_hex_count = 4;
 
   *byterep = 0;
 
@@ -1907,6 +1910,48 @@
         return i;
       }
 
+    case 'U':
+      /* Post-Unicode-2.0: exactly eight hex digits.  */
+      unicode_hex_count = 8;  /* Fall through to the \u case.  */
+    case 'u':
+
+      /* A Unicode escape.  We only permit them in strings and characters,
+         not arbitrarily in the source code as in some other languages.  */
+      {
+        int i = 0;
+        int count = 0;
+        Lisp_Object lisp_char;
+        while (++count <= unicode_hex_count)
+          {
+            c = READCHAR;
+            /* isdigit(), isalpha() may be locale-specific, which we don't
+               want.  */
+            if (c >= '0' && c <= '9') i = (i << 4) + (c - '0');
+            else if (c >= 'a' && c <= 'f') i = (i << 4) + (c - 'a') + 10;
+            else if (c >= 'A' && c <= 'F') i = (i << 4) + (c - 'A') + 10;
+            else
+              {
+                error ("Non-hex digit used for Unicode escape");
+                break;
+              }
+          }
+
+        lisp_char = call2(intern("decode-char"), intern("ucs"),
+                          make_number(i));
+
+        if (EQ(Qnil, lisp_char))
+          {
+            /* This is ugly and horrible and trashes the user's data.  */
+            XSETFASTINT (i, MAKE_CHAR (charset_katakana_jisx0201,
+                                       34 + 128, 46 + 128));
+            return i;
+          }
+        else
+          {
+            return XFASTINT (lisp_char);
+          }
+      }
+
     default:
       if (BASE_LEADING_CODE_P (c))
         c = read_multibyte (c, readcharfun);
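
For readers without an Emacs tree to hand, the control flow of the hunk above can be tried out with a standalone C sketch. It is not the patch. The function name `read_unicode_escape` is hypothetical, it reads from a NUL-terminated string rather than through READCHAR, and it reports failure with a return value instead of calling error(). The range check at the end is an assumption standing in for the case where decode-char returns nil: it rejects values above U+10FFFF and the surrogate range, which are not Unicode scalar values, whereas the real patch leaves that decision to decode-char.

```c
#include <stddef.h>

/* S points at the character just after the backslash.  On success,
   returns 1, stores the code point in *cp and the number of characters
   consumed (including the 'u' or 'U') in *consumed.  Returns 0 on a
   malformed or out-of-range escape and leaves both outputs untouched.  */
static int read_unicode_escape(const char *s, unsigned long *cp,
                               size_t *consumed)
{
    size_t digits, i;
    unsigned long value = 0;

    switch (s[0]) {
    case 'u': digits = 4; break;   /* \uABCD */
    case 'U': digits = 8; break;   /* \U00ABCDEF */
    default:  return 0;
    }

    for (i = 1; i <= digits; i++) {
        int c = (unsigned char) s[i];
        unsigned long d;

        /* Explicit ASCII ranges, as in the patch, to stay independent
           of the locale.  A NUL terminator fails here too, so the loop
           never reads past the end of the string.  */
        if (c >= '0' && c <= '9')      d = (unsigned long) (c - '0');
        else if (c >= 'a' && c <= 'f') d = (unsigned long) (c - 'a') + 10;
        else if (c >= 'A' && c <= 'F') d = (unsigned long) (c - 'A') + 10;
        else return 0;

        value = (value << 4) + d;
    }

    /* Assumption: stand-in for decode-char returning nil.  */
    if (value > 0x10FFFFUL || (value >= 0xD800UL && value <= 0xDFFFUL))
        return 0;

    *cp = value;
    *consumed = digits + 1;
    return 1;
}
```

With this sketch, "u0123As I" and "U00000123As I" both yield #x0123, consuming five and nine characters respectively, while "u123" fails for lack of a fourth digit. "U00ABCDEF" is also rejected here because #xABCDEF lies above U+10FFFF, which is the situation the documentation covers with "if such an Emacs character exists".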
--
In the beginning God created the heavens and the earth. And God was a
bug-eyed, hexagonal smurf with a head of electrified hair; and God said:
“Si, mi chiamano Mimi...”