bug-coreutils

bug#17196: UTF-8 printf string formating problem


From: Eric Blake
Subject: bug#17196: UTF-8 printf string formating problem
Date: Mon, 07 Apr 2014 15:57:03 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0

[adding the Austin Group]

On 04/07/2014 07:08 AM, Pádraig Brady wrote:
> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>> Pádraig Brady wrote:
>>> Yes printf follows the C standard which only considers bytes.
>>> ...
>>> I don't think we'd be able to change the current operation of printf
>>> due to backwards compat reasons? Though we might be able to somehow leverage
>>> the existing multibyte character aware alignment/truncation code in:
>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>
>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>> that ksh uses the L modifier.
>>
>>   http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>
>>   Dan Douglas wrote:
>>   > ksh93 already has this feature using the "L" modifier:
>>   > 
>>   > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>>   > ★★★
>>
>> At least there is prior art for it.
> 
> So we can count bytes, chars or cells (graphemes).
> 
> Thinking a bit more about it, I think shell level printf
> should be dealing in text of the current encoding and counting cells.
> In the edge case where you want to deal in bytes one can do:
>   LC_ALL=C printf ...
> 
> I see that ksh behaves as I would expect and counts cells,
> though requires the explicit %L enabler:
>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>   á★★
>   $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>   A★
>   $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
>
> 
> zsh seems to just count characters:
>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>   á★
>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>   á★
>   $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>   A★★
> 
> I see that dash gives invalid directive for any of %ls %Ls %S.
> 
> Pity there is no consensus here.
> Personally I would go for:
>   printf '%3s' 'blah'  # count cells
>   printf '%3Ls' 'blah' # count chars
>   LANG=C printf '%3Ls' 'blah' # count bytes
>   LANG=C printf '%3s' 'blah'  # count bytes

Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
and currently states that %Ls is undefined.  But I would LOVE to have a
standardized spelling for counting characters instead of bytes.  The
extension %Ls looks like a good candidate for standardization, precisely
because counting characters when printing a multibyte string is more
useful than counting bytes (you do NOT want to end in the middle of a
multibyte character), and because ksh offers it as existing practice.
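
To make the byte-versus-character distinction concrete, here is a minimal Python sketch. It does not reproduce any shell's printf; the helper names (truncate_bytes, truncate_chars, is_valid_utf8) are hypothetical and exist only for illustration, and UTF-8 is assumed as the encoding.

```python
# Illustrative only: hypothetical helpers, not code from any shell or coreutils.

def truncate_bytes(s: str, n: int) -> bytes:
    """Keep the first n bytes of the UTF-8 encoding (byte-counting precision)."""
    return s.encode("utf-8")[:n]


def truncate_chars(s: str, n: int) -> bytes:
    """Keep the first n characters (code points), then encode as UTF-8."""
    return s[:n].encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Report whether data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


stars = "\u2605" * 5  # five BLACK STAR characters, 3 bytes each in UTF-8

by_bytes = truncate_bytes(stars, 4)  # cuts into the second star
by_chars = truncate_chars(stars, 3)  # always ends on a character boundary

print(is_valid_utf8(by_bytes))  # the byte cut leaves a partial sequence
print(is_valid_utf8(by_chars))
print(by_chars.decode("utf-8"))
```

Cutting at a byte count that is not a multiple of the character's encoded length leaves a partial multibyte sequence, which is the failure mode described above. Cutting by characters cannot do that.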

Your idea for counting "cells" (by which I'm assuming you mean one or
more characters that all display within the same cell of the terminal,
as if the end user saw only one grapheme), on the other hand, does not
seem to have any precedent, and I would strongly object to having %s
count by cells because %s already has a standardized (if unfortunate)
meaning of counting by bytes.  Maybe yet another extension is warranted
(perhaps %LLs?) as a new notion for counting by cells instead of
characters, but it's harder to justify that without existing practice.
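
As a rough illustration of what counting by cells could mean, the Python sketch below truncates a string to a number of terminal columns. It is an assumption-laden approximation, not the behavior of ksh or any other shell: combining marks are taken to occupy zero cells, East Asian Wide and Fullwidth characters two cells, and everything else one cell. The names cell_width and truncate_cells are hypothetical. A real implementation would more likely rely on something like the C library's wcwidth().

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Approximate the number of terminal cells one character occupies."""
    if unicodedata.combining(ch):
        return 0  # combining marks share the cell of the preceding character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters take two cells
    return 1


def truncate_cells(s: str, max_cells: int) -> str:
    """Keep the longest prefix of s that fits in max_cells terminal cells."""
    kept = []
    used = 0
    for ch in s:
        width = cell_width(ch)
        if used + width > max_cells:
            break
        kept.append(ch)
        used += width
    return "".join(kept)


# 'a' + COMBINING ACUTE ACCENT is one cell; U+4E2D is a two-cell character.
print(truncate_cells("a\u0301\u4e2d\u4e2d", 3))
print(truncate_cells("A\u4e2d\u4e2d", 3))
```

Under these assumptions a limit of 3 cells keeps the accented letter plus one wide character, whereas a limit of 3 characters would keep the letter, the accent, and one wide character only by coincidence. The two counts diverge as soon as the mix of zero-width and double-width characters changes.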

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

