lilypond-user

Re: Character encoding / poor man's letterspacing


From: Aaron Hill
Subject: Re: Character encoding / poor man's letterspacing
Date: Tue, 12 Mar 2019 08:32:22 -0700
User-agent: Roundcube Webmail/1.3.8

On 2019-03-12 7:53 am, Urs Liska wrote:
Hi Alexander,

thank you for that pointer. This made my day!

On 12.03.19 at 14:54, Alexander Kobel wrote:
Hi,

On 12.03.19 10:43, Urs Liska wrote:
On 12.03.19 at 01:14, Aaron Hill wrote:
On 2019-03-11 3:40 pm, David Kastrup wrote:
Urs Liska <address@hidden> writes:
[...]

Also, I should have been clear before. David's code should work for most cases.  I was just being pedantic that /./ would not work if the input has combining characters.  For instance, if you type U+0308 (Combining Diaeresis) after an 'a', you'll get an ä.  But the simple regex would not treat that as a single grapheme.  The result would be "T a ̈ s t".
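To make the combining-character case concrete, here is a minimal sketch in plain Guile Scheme. It works on a list of code points rather than on a string, so it does not depend on how a given Guile version stores strings. The helper name space-between-each is hypothetical; it only stands in for what splitting on /./ and joining with spaces amounts to.

```scheme
;; Hypothetical helper: put a space (code point 32) between all code
;; points, which is what a per-character split-and-join amounts to.
(define (space-between-each cps)
  (cond ((null? cps) '())
        ((null? (cdr cps)) cps)
        (else (cons (car cps)
                    (cons 32 (space-between-each (cdr cps)))))))

;; "Täst" typed as T, a, U+0308 (Combining Diaeresis), s, t.
(define taest '(#x54 #x61 #x0308 #x73 #x74))

;; The space lands between 'a' (#x61) and its diaeresis (#x0308),
;; so the mark no longer directly follows its base letter.
(define spaced (space-between-each taest))
(display spaced)   ; prints (84 32 97 32 776 32 115 32 116)
(newline)
```

The 32 sitting between 97 and 776 is the stray space that detaches the diaeresis from the 'a'.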

I did understand it that way, and it would not be an issue in the project I'm working on. There it's just some umlauts.

given that Aaron, my undisputed hero of Lily-UTFxy-workarounds, is active in this thread: I'm surprised to see no mention of his wonderful example of such a workaround from

https://lists.gnu.org/archive/html/lilypond-user/2018-10/msg00473.html

IIRC, the essence of the approach is to encode everything as UTF-32 (more or less brute force) and handle each character as a chunk of 4 consecutive bytes, i.e. four 0..255 integers in a list. It's not the ultimate solution to all imaginable troubles with encodings, but should be good enough for almost every *practical* use case.

In his modified center-lyrics-ignoring-punctuation.ily from that thread, you'll find the two main utility functions as string->utf32 and utf32->string. I presume you could call string->utf32, slice in a '(0 0 0 32) after each 4 entries, convert back via utf32->string, et voilà.
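The slicing idea described above could look roughly like the following sketch. It assumes, as the description does, that the data is a flat list of big-endian UTF-32 bytes (integers 0..255, four per character); the function name interleave-space-utf32 is hypothetical and not taken from the referenced file. Unlike the literal "after each 4 entries", this version places the space only between characters, so no trailing space is produced.

```scheme
(use-modules (srfi srfi-1)) ; take, drop (already loaded inside LilyPond)

;; Assumption: BYTES is UTF-32 big-endian data, four entries per character.
;; Inserts the 4-byte space (0 0 0 32) between consecutive 4-byte chunks.
(define (interleave-space-utf32 bytes)
  (if (<= (length bytes) 4)
      bytes
      (append (take bytes 4)
              '(0 0 0 32)
              (interleave-space-utf32 (drop bytes 4)))))

;; "Ta" as two UTF-32 characters: U+0054 and U+0061.
(define spaced-bytes (interleave-space-utf32 '(0 0 0 84 0 0 0 97)))
(display spaced-bytes)   ; prints (0 0 0 84 0 0 0 32 0 0 0 97)
(newline)
```

Note that this still spaces every code point, so a combining mark would be separated from its base character.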


For future reference here is my solution:

[ . . . ]

Sorry for the long delay, Urs. As Alexander reminded me, I had already written a function for handling UTF-8 decoding/encoding. (Thanks, Alex!)

Here is my proposed solution:

%%%%
\version "2.19.82"

#(define (utf8->utf32 lst)
  "Converts a list of UTF8-encoded characters into UTF32."
  (if (null? lst) '()
    (let ((ch (char->integer (car lst))))
      (cond
        ;; Characters 0x00-0x7F
        ((< ch #b10000000) (cons ch (utf8->utf32 (cdr lst))))
        ;; Characters 0x80-0x7FF
        ((eqv? (logand ch #b11100000) #b11000000)
          (cons (let ((ch2 (char->integer (cadr lst))))
              (logior (ash (logand ch #b11111) 6)
                      (logand ch2 #b111111)))
            (utf8->utf32 (cddr lst))))
        ;; Characters 0x800-0xFFFF
        ((eqv? (logand ch #b11110000) #b11100000)
          (cons (let ((ch2 (char->integer (cadr lst)))
                      (ch3 (char->integer (caddr lst))))
              (logior (ash (logand ch #b1111) 12)
                      (ash (logand ch2 #b111111) 6)
                      (logand ch3 #b111111)))
            (utf8->utf32 (cdddr lst))))
        ;; Characters 0x10000-0x10FFFF
        ((eqv? (logand ch #b11111000) #b11110000)
          (cons (let ((ch2 (char->integer (cadr lst)))
                      (ch3 (char->integer (caddr lst)))
                      (ch4 (char->integer (cadddr lst))))
              (logior (ash (logand ch #b111) 18)
                      (ash (logand ch2 #b111111) 12)
                      (ash (logand ch3 #b111111) 6)
                      (logand ch4 #b111111)))
            (utf8->utf32 (cddddr lst))))
        ;; Ignore orphaned continuation characters
        ((eqv? (logand ch #b11000000) #b10000000) (utf8->utf32 (cdr lst)))
        ;; Error on all else
        (else (error "Unexpected character:" ch))))))

#(define (utf32->utf8 lst)
  "Converts a list of UTF32-encoded characters into UTF8."
  (if (null? lst) '()
    (let ((ch (car lst)))
      (append (cond
          ;; Characters 0x00-0x7F
          ((< ch #x80) (list (integer->char ch)))
          ;; Characters 0x80-0x7FF
          ((< ch #x800) (list
            (integer->char (logior #b11000000 (logand (ash ch -6) #b11111)))
            (integer->char (logior #b10000000 (logand ch #b111111)))))
          ;; Characters 0x800-0xFFFF
          ((< ch #x10000) (list
            (integer->char (logior #b11100000 (logand (ash ch -12) #b1111)))
            (integer->char (logior #b10000000 (logand (ash ch -6) #b111111)))
            (integer->char (logior #b10000000 (logand ch #b111111)))))
          ;; Characters 0x10000-0x10FFFF
          (else (list
            (integer->char (logior #b11110000 (logand (ash ch -18) #b111)))
            (integer->char (logior #b10000000 (logand (ash ch -12) #b111111)))
            (integer->char (logior #b10000000 (logand (ash ch -6) #b111111)))
            (integer->char (logior #b10000000 (logand ch #b111111))))))
        (utf32->utf8 (cdr lst))))))

#(define (string->utf32 s) (utf8->utf32 (string->list s)))
#(define (utf32->string l) (list->string (utf32->utf8 l)))

#(define (is-combining-mark? ucp)
  "Returns whether a code-point is a Unicode Combining Character."
  ;; Combining Diacritical Marks end at U+036F; U+0370-U+03FF is Greek.
  (or (<= #x0300 ucp #x036f)
      (<= #x1ab0 ucp #x1aff)
      (<= #x1dc0 ucp #x1dff)
      (<= #x20d0 ucp #x20ff)
      (<= #xfe20 ucp #xfe2f)))

#(define (utf32->graphemes lst)
  "Splits the UTF32-encoded characters into graphemes."
  (if (null? lst) '()
    (let* ((marks (take-while is-combining-mark? (cdr lst)))
           (rest (drop (cdr lst) (length marks))))
      (cons (cons (car lst) marks) (utf32->graphemes rest)))))

#(define-markup-command (letter-spaced layout props arg) (string?)
  (interpret-markup layout props #{ \markup {
    $@(map utf32->string (utf32->graphemes (string->utf32 arg)))
  } #}))

\markup { \column {
  \letter-spaced "address@hidden"
  \letter-spaced "àb᪰c᷀d⃐e︠"
  \override #'(word-space . 3) {
    \letter-spaced "address@hidden"
    \letter-spaced "àb᪰c᷀d⃐e︠"
  }
} }
%%%%

I attached the .ly in UTF-8 in case the email text does not survive transit. The example uses a rarely supported range of combining marks (U+1AB0-U+1AFF), so do not expect to see the U+1AB0 mark attached to the 'b'.

My solution does not actually add spaces to the string. Instead, it breaks the string into individual graphemes* and emits each one as a separate markup text. This lets LilyPond's default word-space handling take care of the spacing. You can then easily \override word-space to taste.

* I should note that this is only an approximation of Unicode graphemes. All I have done is support the officially defined ranges for combining marks. There are many more characters in Unicode that behave like combining marks and should not be spaced apart. These include the myriad adorning or modifier symbols found in languages like Arabic or Thai. Such so-called "Complex Scripts" are very much non-trivial to support.
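As a rough illustration of both the grouping and its limits, the following standalone Guile sketch applies the same idea to plain lists of code points. The names combining-mark? and group-graphemes are stand-ins chosen for this sketch, and the ranges are the five Unicode combining-mark blocks only (the first one being U+0300-U+036F). The Thai input is an assumption picked to show a mark outside those blocks.

```scheme
(use-modules (srfi srfi-1)) ; take-while, drop (already loaded inside LilyPond)

;; Only the five dedicated combining-mark blocks are recognised.
(define (combining-mark? cp)
  (or (<= #x0300 cp #x036f)
      (<= #x1ab0 cp #x1aff)
      (<= #x1dc0 cp #x1dff)
      (<= #x20d0 cp #x20ff)
      (<= #xfe20 cp #xfe2f)))

;; Attach every run of combining marks to the code point before it.
(define (group-graphemes cps)
  (if (null? cps)
      '()
      (let* ((marks (take-while combining-mark? (cdr cps)))
             (rest (drop (cdr cps) (length marks))))
        (cons (cons (car cps) marks) (group-graphemes rest)))))

;; T, a, U+0308, s, t: the diaeresis stays with the 'a'.
(define latin (group-graphemes '(#x54 #x61 #x0308 #x73 #x74)))
(display latin)   ; prints ((84) (97 776) (115) (116))
(newline)

;; Thai KO KAI (U+0E01) followed by the vowel sign SARA I (U+0E34):
;; the vowel sign is outside the ranges above, so it is split off.
(define thai (group-graphemes '(#x0E01 #x0E34)))
(display thai)    ; prints ((3585) (3636))
(newline)
```

Each inner list corresponds to one markup word, so the Latin example would be spaced as four units while the Thai example would be wrongly spaced as two.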


-- Aaron Hill

Attachment: letterspaced.cropped.png
Description: PNG image

Attachment: letterspaced.ly
Description: Text document

