
bug#27681: grep: Combining Mark-Nonspacing are classified as [:punct:]


From: Santiago
Subject: bug#27681: grep: Combining Mark-Nonspacing are classified as [:punct:]
Date: Thu, 13 Jul 2017 15:21:40 +0200

Hi,

I would like to forward the issue below, reported by Panu Kalliokoski
in 2012 (better late than never!). I think the correct category is
Mark, Nonspacing (Mn), though I am not very familiar with Unicode.

It still occurs in grep 3.1. In this case, the input is a decomposed á,
that is, "a" followed by U+0301 (combining acute accent):

 $ echo árbol | grep -o '[[:alpha:]]*'
 a
 rbol
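
To double-check the category, here is a small Python sketch that looks up
the Unicode general category of the combining characters involved. The
code points for the 2012 examples (U+032A and U+0300) are my assumption
about what the quoted strings contain; U+02B0 is included only as a
contrast, since it is a real Lm (modifier letter) character. The helper
name general_category is just for illustration.

```python
import unicodedata

# Code points assumed to appear in the examples of this report.
samples = {
    "U+0301": "\u0301",  # combining acute accent (decomposed á)
    "U+0300": "\u0300",  # combining grave accent (assumed: on ʌ)
    "U+032A": "\u032a",  # combining bridge below (assumed: on d)
    "U+02B0": "\u02b0",  # modifier letter small h, a genuine Lm, for contrast
}

def general_category(ch):
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)

for label, ch in samples.items():
    # Print the code point, its official name, and its general category.
    print(label, unicodedata.name(ch), general_category(ch))
```

The three combining characters come out as Mn rather than Lm, which
matches the guess above that Mark, Nonspacing is the relevant category.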

Cheers,

 -- Santiago

On Mon, 05 Mar 2012 13:08:43 +0200 "Panu A. Kalliokoski" <address@hidden> wrote:
> Package: grep
> Version: 2.6.3-3
> Severity: normal
> 
> 
> It seems that grep misclassifies combining letters (Unicode class Lm) as
> punctuation, when they should be letters.  For instance:
> 
> $ echo d̪ʌ̀lì | grep -o '[[:alpha:]]*'
> d
> ʌ
> li
> 
> As a consequence, combining accents are not seen as "word-constituent":
> 
> $ echo d̪ʌ̀lì | grep -o '\w*'
> d
> ʌ
> li
> 
> This also causes false positives on word-boundary conditions, such as
> the one below:
> 
> $ echo d̪ʌ̀lì | grep -w ʌ
> d̪ʌ̀lì
> 
> I suggest that combining letters should be part of [:alpha:] instead of
> [:punct:].
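
As an illustration of the behaviour being requested, here is a rough
Python sketch of a word splitter that treats combining marks as word
constituents. It is not how grep or the C library classify characters;
the functions is_word_constituent and words are hypothetical names, and
the rule (letters, marks, decimal digits, underscore) is only my reading
of the suggestion above.

```python
import unicodedata

def is_word_constituent(ch):
    """Treat letters (L*), marks (M*), decimal digits (Nd) and '_' as word characters."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat == "Nd" or ch == "_"

def words(text):
    """Split text into maximal runs of word constituents."""
    out, current = [], []
    for ch in text:
        if is_word_constituent(ch):
            current.append(ch)
        elif current:
            # A non-word character ends the current run.
            out.append("".join(current))
            current = []
    if current:
        out.append("".join(current))
    return out

# Decomposed "árbol": 'a' followed by U+0301, then "rbol".
decomposed = "a\u0301rbol"
print(len(words(decomposed)))
```

With this rule the decomposed word stays in one piece instead of being
split at the accent. One caveat of the sketch: a combining mark with no
preceding base letter would also be accepted as the start of a word.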




