Re: regexp filter to match non-english characters
From: Robert D. Crawford
Subject: Re: regexp filter to match non-english characters
Date: Thu, 06 Nov 2008 10:43:38 -0600
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.60 (gnu/linux)

Ted Zlatanov <tzz@lifelogs.com> writes:
> On Thu, 06 Nov 2008 10:34:25 +0100 Michal Nazarewicz <mina86@tlen.pl> wrote:
>
> MN> "<<" and ">>" have codes U+00AB and U+00BB, so that's why they match, but
> MN> there are plenty of other characters which may show up in an English
> MN> text, like (I'll use a sequence of ASCII characters which resembles
> MN> the proper Unicode character) "`" (U+2018), "'" (U+2019), "``" (U+201C),
> MN> "''" (U+201D) or "..." (U+2026), which will cause the entry to be
> MN> filtered out.
>
> Agreed. It's not an easy problem without Unicode properties, but for
> the *subject* of the message it's a passable heuristic.
>
> MN> Besides, I think what you really meant was:
>
> MN> (string-match "[^\0-\177]" "string")
>
> MN> since "1ff" is not a valid octal number.
>
> Yes. Sorry.
>
> MN> I think that taking the title of the entry and checking whether at least
> MN> 90% of its characters are ASCII would be sufficient to filter out Asian
> MN> texts. You can also try taking the first 100 (or so) characters of the
> MN> body. I think you could use replace-regexp-in-string for that purpose:
>
> MN> (defun mn-non-english-p (string)
> MN>   "Return non-nil when fewer than 90% of STRING's characters are ASCII."
> MN>   (<
> MN>    (* (length (replace-regexp-in-string "[^\0-\177]" "" string)) 10)
> MN>    (* (length string) 9)))
>
> That might work, but for a score file a simple regular expression is
> better, and I understood the OP to need a score file.

Score files are great. Truth be told, I'm just looking for what works.
I like your solution, but it will exclude posts that contain any Unicode
characters, which is something I would like to avoid if possible.
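
A rough sketch of what that could look like, assuming it is enough to
treat ASCII, the Latin blocks U+00A0..U+024F and General Punctuation
U+2000..U+206F as acceptable and to flag a subject only when most of it
falls outside those ranges. The function name and the 0.5 default are
made up for illustration and are not part of Gnus:

```elisp
;; Sketch only: `my-mostly-non-latin-p' and its default threshold are
;; hypothetical names/values, not taken from Gnus or from this thread.
(defun my-mostly-non-latin-p (string &optional threshold)
  "Return non-nil if more than THRESHOLD of STRING is outside Latin ranges.
ASCII, U+00A0..U+024F (accented Latin letters, guillemets) and
U+2000..U+206F (curly quotes, ellipsis, dashes) count as acceptable.
THRESHOLD is a fraction between 0 and 1 and defaults to 0.5."
  (let* ((allowed "[[:ascii:]\u00a0-\u024f\u2000-\u206f]")
         ;; Delete every acceptable character; what remains is "foreign".
         (rest (replace-regexp-in-string allowed "" string)))
    (and (> (length string) 0)
         (> (/ (float (length rest)) (length string))
            (or threshold 0.5)))))
```

With these ranges a subject that only has curly quotes, an ellipsis or
accented letters is left alone, while one written mostly in CJK
characters is flagged. It is a function rather than a plain regexp, so
it would have to be called from Lisp and cannot go directly into a
score file entry.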

Thanks,
rdc
--
Robert D. Crawford rdc1x@comcast.net
semper en excretus