--- Begin Message ---
Subject: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
Date: Sun, 15 Feb 2015 19:14:57 +0330 (Iran Standard Time)

This is to report that the Syntax class [:alpha:] wrongly matches the
Indian digits ۱۲۳۴۵۶۷۸۹۰ as letters.
In GNU Emacs 24.4.1 (i686-pc-mingw32)
of 2014-10-24 on LEG570
Windowing system distributor `Microsoft Corp.', version 6.1.7601
Configured using:
`configure --prefix=/c/usr'
Important settings:
value of $LANG: ENU
locale-coding-system: cp1256
--- End Message ---
--- Begin Message ---
Subject: Re: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
Date: Sat, 28 Feb 2015 14:29:52 +0200

> Date: Tue, 17 Feb 2015 18:13:05 +0200
> From: Eli Zaretskii <address@hidden>
> Cc: address@hidden, address@hidden
>
> > From: Andreas Politz <address@hidden>
> > Date: Sun, 15 Feb 2015 21:16:13 +0100
> > Cc: address@hidden
> >
> >
> > I think this is supposed to be:
> >
> > ,----[ (info "(elisp) Char Classes") ]
> > | `[:alpha:]'
> > | This matches any letter. (At present, for multibyte characters, it
> > | matches anything that has word syntax.)
> > `----
>
> Indeed, which doesn't sound very nice.
>
> Does someone object to the changes below (to be installed on master)?
> They make [:alpha:] and [:alnum:] closer to the Unicode
> recommendations in UTS #18, although we are still very far from
> supporting even Level 1 of conformance. But these two seem like
> low-hanging fruit to me.
>
> The modified definitions of these two sets are not 100% compatible
> with the old ones for the multibyte characters. However, if it turns
> out that some code used these to get word-constituent characters,
> those places should simply be changed to use \sw instead.

No further comments, so I pushed the changes as commit 1a50945 on the
master branch, and I'm marking this bug closed.

> Also, does someone see any potential problem to make [:digit:] be a
> superset of the current ASCII-only set, to match UTS #18 as well? The
> comment in regex.c says it is "only used for single-byte characters",
> but it isn't clear to me whether this is a requirement, i.e. there's
> some code in Emacs that relies on that, or just a statement of facts.

I'd still like to hear an answer and/or opinions about this. If I
hear no comments, I will look into making a similar change to
[:digit:] soon.
--- End Message ---
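
The distinction discussed in this thread can be illustrated outside Emacs
with a short Python sketch. It is not the Emacs implementation and does
not use the Emacs regex engine; it only shows how the Unicode character
database classifies the digits from the report. The helper names
(is_alpha, is_digit, is_alnum) are hypothetical, and using general
category L* for "alpha" is an approximation of the Alphabetic property
that UTS #18 refers to, since Python's unicodedata module does not expose
that property directly.

```python
import unicodedata

# The digits quoted in the bug report.
REPORTED_DIGITS = "۱۲۳۴۵۶۷۸۹۰"


def is_alpha(ch: str) -> bool:
    """Approximate 'alpha': general category is a letter (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def is_digit(ch: str) -> bool:
    """'digit' in the UTS #18 sense: general category Nd (Decimal_Number)."""
    return unicodedata.category(ch) == "Nd"


def is_alnum(ch: str) -> bool:
    """'alnum' as the union of the two classes above."""
    return is_alpha(ch) or is_digit(ch)


# Every reported character is a decimal digit, none is a letter.
categories = {unicodedata.category(ch) for ch in REPORTED_DIGITS}
print(categories)                                   # {'Nd'}
print(any(is_alpha(ch) for ch in REPORTED_DIGITS))  # False
print(all(is_digit(ch) for ch in REPORTED_DIGITS))  # True
print(all(is_alnum(ch) for ch in REPORTED_DIGITS))  # True
```

Under these definitions the reported characters fall into "digit" and
"alnum" but not "alpha", which is the behaviour the report expects. A
class defined by word syntax, by contrast, would include them, which is
why the reply suggests \sw for code that really wants word constituents.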