bug-gnu-emacs

From: Eli Zaretskii
Subject: bug#1877: Request: Regular expressions that can match Unicode general categories
Date: Mon, 30 Sep 2019 11:45:14 +0300

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Date: Mon, 30 Sep 2019 09:45:15 +0200
> Cc: 1877@debbugs.gnu.org
> 
> Derick Eddington <derick.eddington@gmail.com> writes:
> 
> > A new Scheme major mode I've made [1] requires regular expressions that
> > can match characters by their Unicode general categories.  It seems
> > Emacs regular expressions do not provide a way to do that directly (I'm
> > using GNU Emacs 23.0.60.1)
> 
> (I'm going through old bug reports that unfortunately didn't get any
> response at the time.)
> 
> I'm not quite sure what Unicode general categories you're referring to,
> but the Emacs regexp matcher has gained a bunch of categories in the ten
> years since you made the request.
> 
> Are the categories below what you were thinking of?
> 
> ‘[:print:]’
>      This matches any printing character—either whitespace, or a graphic
>      character matched by ‘[:graph:]’.
> ‘[:punct:]’
>      This matches any punctuation character.  (At present, for multibyte
>      characters, it matches anything that has non-word syntax.)
> ‘[:space:]’
>      This matches any character that has whitespace syntax (*note Syntax
>      Class Table::).
> ‘[:upper:]’
>      This matches any upper-case letter, as determined by the current
>      case table (*note Case Tables::).  If ‘case-fold-search’ is
>      non-‘nil’, this also matches any lower-case letter.
> ‘[:word:]’
>      This matches any character that has word syntax (*note Syntax Class
>      Table::).

No, he means the categories described in the node "Character
Properties" of the ELisp manual.

We don't yet have full support for the Unicode Regular Expressions, as
specified in UTS#18.  In particular, see

  http://unicode.org/reports/tr18/#General_Category_Property

for General Category regexp specs.
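
Until the regexp engine supports this directly, a general category
can be tested from Lisp with `get-char-code-property'.  The following
is only a minimal sketch of that workaround, not an existing Emacs
facility: the helper names `my-char-general-category-p' and
`my-skip-general-category-forward' are hypothetical.

```elisp
;; Sketch only: both helper names below are hypothetical.

(defun my-char-general-category-p (char categories)
  "Return non-nil if CHAR's Unicode general category is in CATEGORIES.
CATEGORIES is a list of symbols such as (Lu Ll Nd)."
  ;; `get-char-code-property' returns the category as a symbol, e.g. `Lu'.
  (memq (get-char-code-property char 'general-category) categories))

(defun my-skip-general-category-forward (categories)
  "Move point forward over characters whose category is in CATEGORIES.
Return the number of characters skipped."
  (let ((start (point)))
    (while (and (not (eobp))
                (my-char-general-category-p (char-after) categories))
      (forward-char 1))
    (- (point) start)))
```

A mode could call the second helper from its own scanning code where
it would otherwise have used a regexp; this is slower than a real
regexp construct and is meant only to illustrate the idea.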

It is not clear to me which categories are of interest here.  Some of
them are nowadays definitely available indirectly via the classes
mentioned above (they weren't available in Emacs 23 when the bug was
filed).  Maybe the OP could provide an explicit list of categories
needed for this Scheme mode, together with their required usage in
this mode.  Looking at R6RS sec 4.2.1, all I see is "whitespace"
(which we provide via [:space:]), "letter" (provided by [:alpha:]),
"digit" (provided by [:digit:]), and "intraline whitespace" (provided
by [:blank:]).  If this is all, then we have all the required support
now.
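
For illustration, here is a small sketch of how those classes can be
exercised.  The helper name `my-class-match-p' is hypothetical, and
the standard syntax table is used on the assumption that the mode's
own table may classify characters differently (which affects
[:space:]).

```elisp
;; Sketch only: `my-class-match-p' is a hypothetical helper.

(defun my-class-match-p (class string)
  "Return non-nil if every character of STRING matches CLASS.
CLASS is a bracket-class name such as \"space\" or \"alpha\".
STRING must be non-empty to match."
  (let ((case-fold-search nil))
    ;; [:space:] consults the syntax table, so pin it to the standard one.
    (with-syntax-table (standard-syntax-table)
      (string-match-p (format "\\`[[:%s:]]+\\'" class) string))))

;; Checks one might run for the R6RS notions mentioned above:
(my-class-match-p "space" " \t\n")   ; whitespace
(my-class-match-p "blank" " \t")     ; intraline whitespace
(my-class-match-p "blank" "\n")      ; newline is not intraline
(my-class-match-p "alpha" "λx")      ; letters
(my-class-match-p "digit" "2019")    ; digits
```

Note that [:digit:] covers only the ASCII digits 0 through 9, so it is
not the same thing as the Unicode Nd category.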




