help-gnu-emacs

From: tomas
Subject: Re: [External] : Re: Regexp for matching control character, say, FORM FEED. (Was: Re: The `^L' appeared in built-in help.)
Date: Thu, 22 Jul 2021 10:06:43 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Thu, Jul 22, 2021 at 09:13:31AM +0800, Hongyi Zhao wrote:

[...]

> I want to know whether there are some similar regexp patterns in Emacs
> as the ones used by grep, say, $'\014' or $'\f'.

To offer some other perspective on the (correct) answers by Emanuel and
Drew, remember that a regular expression is, basically, a string
where each character is interpreted as "itself", unless it is a "regexp
special" character [1]. So, for example searching for the regular expression
"a" will find all "a"s in your text, because the character a isn't a
"regexp special".

Now ASCII control characters are all *not* "regexp special", so you only
have to find a way to express them within a string. How to do that is
explained in the Emacs Lisp manual where it talks about "string type" [2]
(especially the subnode "Non-ASCII Characters in Strings"), which leads you
to "character type" [3]. The special forms "\f", "\^L" and "\C-L" (all of
them equivalent), all of which have come up in this thread, are treated in
a subnode of the above [4]. This notation carries some historical baggage,
so don't expect too much logic from it.
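
As a small sketch of the above (the sample text and the variable name
`sample' are made up for illustration), these forms can be evaluated
one by one with C-x C-e in the *scratch* buffer:

```elisp
;; The three spellings read as the same one-character string (code 12).
(string= "\f" "\^L")    ; ⇒ t  (C-style escape vs. caret notation)
(string= "\f" "\C-L")   ; ⇒ t  (C-style escape vs. \C- notation)

;; Form feed is not a "regexp special", so the string is usable as a
;; regexp as it stands.  `sample' is a made-up text with one form feed.
(let ((sample "page one\fpage two"))
  (string-match "\f" sample))   ; ⇒ 8, the index of the form feed
```

The same string should work unchanged as the argument to
`re-search-forward' when searching a buffer.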

For example, why ^L? Because form feed is at position 12 (in decimal) in the
ASCII table, and L at position 76, the difference being 64. What happens is
that the "^" "subtracts 64 from the character code", or more precisely
clears bit 6 (the bit worth 64, counting from bit 0) of its binary
representation. So ^M would be "carriage return" (77 - 64 = 13) and so on.
Just have a look at the ASCII table.
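
The arithmetic can be checked directly in Emacs Lisp. This is only an
illustration of the caret rule for "L" and "M", not a general
implementation of how the Lisp reader handles the notation:

```elisp
;; ?L is the read syntax for the character code of "L", i.e. 76.
(- ?L 64)                ; ⇒ 12, the code of form feed
(logand ?L (lognot 64))  ; ⇒ 12, same result by clearing the bit worth 64
(= ?\^L ?\f)             ; ⇒ t
(= ?\^M ?\r)             ; ⇒ t, ^M is carriage return (77 - 64 = 13)
```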

Then "\f" comes from the C string literal representation. It's meant to
be mnemonic ("f" for "form feed" -- similarly "\n" for "line feed", aka
"new line", "\b" for "backspace" and so on).

The references below lead you to more alternative representations, like
short hex "\x0C", short Unicode hex "\u000C", long Unicode hex "\U0000000C";
there are also (mostly historical) octals, etc.

You can even put the Unicode /names/ in there, using the "\N{...}"
notation, so your ^L can be named "\N{FORM FEED (FF)}" (yes, the (FF)
in parentheses is part of it: the Unicode Consortium put it in there.
Life is like that).
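
To see that the alternative spellings from the last two paragraphs
denote the same character, each can be compared against "\f". This is a
sketch; the "\N{...}" form and `char-from-name' assume a reasonably
recent Emacs (26 or later, as far as I know):

```elisp
;; Each of these reads as the same one-character string as "\f".
(string= "\f" "\014")               ; ⇒ t  (octal)
(string= "\f" "\x0C")               ; ⇒ t  (hex)
(string= "\f" "\u000C")             ; ⇒ t  (Unicode, four hex digits)
(string= "\f" "\U0000000C")         ; ⇒ t  (Unicode, eight hex digits)
(string= "\f" "\N{FORM FEED (FF)}") ; ⇒ t  (Unicode name, "(FF)" included)

;; The name can also be looked up at run time.
(char-from-name "FORM FEED (FF)")   ; ⇒ 12
```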

If you want to explore those Unicode names, type C-x 8 <RET>; you
can then autocomplete your way among them.

Hope this gives some rough map for that landscape :-)

Cheers

[1] Emacs Lisp reference manual, "Syntax of Regular Expressions"
    https://www.gnu.org/software/emacs/manual/html_node/elisp/Syntax-of-Regexps.html

[2] Emacs Lisp reference manual, "String Type" and its subnodes
    https://www.gnu.org/software/emacs/manual/html_node/elisp/String-Type.html

[3] Emacs Lisp reference manual, "Character Type"
    https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Type.html

[4] Emacs Lisp reference manual, "Control-Character Syntax"
    https://www.gnu.org/software/emacs/manual/html_node/elisp/Ctl_002dChar-Syntax.html

 - tomás
