lilypond-user

Re: Multi-byte characters in Lyrics


From: David Kastrup
Subject: Re: Multi-byte characters in Lyrics
Date: Fri, 27 Oct 2017 09:18:04 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.0.50 (gnu/linux)

Maurits Lamers <address@hidden> writes:

>> On 26 Oct 2017, at 17:27, David Kastrup <address@hidden> wrote:
>> 
>> Maurits Lamers <address@hidden> writes:
>> 
>>> Hi,
>>> 
>>> I am writing an extension to lilypond to support generating some basic
>>> braille inside an includable .ly file.
>>> I am trying to map the characters of lyric events into a set of braille 
>>> dots.
>>> One of the issues I have is that I have trouble finding a way to do
>>> this with characters which seem to be multi-byte.
>>> In this case, these characters are defined in text mode as
>>> 
>>> "’s He"
>>> 
>>> I have tried quite a few ways of simply getting 5 characters, but the
>>> first one (which I found out through other means) has charcode 8217.
>>> None of the functions I could find works to get this character as one
>>> character, as it seems that even integer->char only allows values
>>> between 0 and 255.
>> 
>> Characters in Guile 1.8 are bytes.  Where is the problem?
>
> I cannot convert a multi-byte character to a symbol, unless I do some
> very inelegant hacks.

Huh?  string->symbol works just fine.  So what do you mean when you say
"symbol"?

>>> I have fiddled with ly:wide-char->utf-8 and ly:encode-string-for-pdf
>>> but that doesn't bring much either.
>> 
>> \markup \char #5000
>> 
>> And it's not like
>> 
>> \markup #(ly:wide-char->utf-8 5000)
>> 
>> wouldn't work.  You just have to work with strings instead of characters.
>
> It is not a problem on the input side, it is a problem on the processing 
> side. 
> I set up an engraver to listen to lyric events. As the lyrics have to
> be mapped to braille, I map every character to a specific braille dot
> pattern.
> I have to do this in order to support braille embossers, which mostly
> still are ascii based and are not in agreement on which dot pattern
> maps to which ascii character. I also want to be able to support
> unicode.
> So, every lyric event, I retrieve the text with (ly:event-property
> event 'text), which then needs to be processed into braille dots,
> which I achieve by doing (string->list) or could do through
> (string-ref str pos).

There is your problem.  string->list will deliver bytes.  Try something
like

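;; Split a Guile 1.8 byte string into a list of strings, each holding
;; one complete utf-8 character.  The car of the accumulator collects
;; pending continuation bytes, its cdr the finished strings.
;; Applied to the utf-8 bytes of "’s He" it should return five strings.
(define (b->c input)
  (cdr
    (string-fold-right
      (lambda (new tail)
        (cond ((char<? new #\200)        ; ASCII byte: a character by itself
               (cons* '() (string new) (cdr tail)))
              ((char<? new #\300)        ; continuation byte: keep it pending
               (cons (cons new (car tail)) (cdr tail)))
              (else                      ; lead byte: completes a character
               (cons* '() (list->string (cons new (car tail))) (cdr tail)))))
      '(())
      input)))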

which will deliver one-utf-8-character strings when applied to a string.

> This works for almost all situations, except this one. I get a lyric
> which contains an inverted comma instead of an apostrophe, and
> literally defined as:
>
> "’s He"
>
> This inverted comma is a multi-byte character, but I cannot read it as
> a character, I can only read it as the separate bytes.
> This is problematic, because as far as I know these bytes could
> have a different meaning by themselves, as each of them could
> represent a different character.

No.  utf-8 multibyte character constituents cannot be confused with
single characters, but you still need to be able to distinguish
different utf-8 characters.
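
To make the byte ranges concrete, here is a minimal sketch of the
classification, using the same 0x80 and 0xC0 thresholds as the octal
#\200 and #\300 tests.  The name utf8-byte-class is made up for
illustration, and it works on integer byte values so that it does not
depend on how a particular Guile version represents characters.

```scheme
;; Classify one UTF-8 byte, given as an integer in the range 0..255.
(define (utf8-byte-class b)
  (cond ((< b #x80) 'ascii)          ; 0xxxxxxx: a complete character
        ((< b #xC0) 'continuation)   ; 10xxxxxx: inside a multibyte character
        (else 'lead)))               ; 11xxxxxx: starts a multibyte character

;; U+2019 (charcode 8217) is encoded as the bytes 226 128 153;
;; 115 is the ASCII letter "s".
(map utf8-byte-class '(226 128 153 115))
;; => (lead continuation continuation ascii)
```

Every byte of a multibyte character has its high bit set, so none of
them can be mistaken for an ASCII character.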

>>> Because of other limitations, it has to be compatible with Lilypond
>>> 2.14.
>> 
>> A really bad idea.
> Couldn't agree more, but at the moment I don't have much choice, and
> there doesn't seem much benefit in using 2.18 as it seems to suffer
> from the same problem.

It suffers less from a host of other problems.  2.14 is not supported on
any current platform.  The source will not compile given current
compilers.  Nobody will be able to help you with it, and you won't be
able to hot-patch any bugs critical to your project.  Your output will
suffer numerous problems, the PDF metadata will likely break when using
utf-8 characters in it, and multibyte output might not work properly in
PDF since Ghostscript went through a number of changes.

You won't evade upgrading eventually, so you have nothing to gain by
postponing: this is a cost you'll have to pay anyway.

> This was a very good lead. With great help from the scheme IRC
> channel, I figured out that having strings as keys works great, and
> because they suggested (and provided) a UTF8 byte count counter, I
> was able to implement a simple function which takes as many characters
> from the string as required to make a proper match to the assoc list.
>
> So, problem solved :)
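
The code from the IRC channel is not shown in this thread, so the
following is only a rough sketch of how such an approach might look,
assuming Guile 1.8 strings in which every element is one byte.  The
names utf8-sequence-length, braille-table and string->dots are made up
for illustration, and the dot lists are placeholders rather than a
verified braille table.

```scheme
;; Number of bytes in a UTF-8 sequence, judged from its lead byte
;; (an integer 0..255).  Assumes well-formed input.
(define (utf8-sequence-length lead)
  (cond ((< lead #x80) 1)    ; 0xxxxxxx: ASCII
        ((< lead #xE0) 2)    ; 110xxxxx
        ((< lead #xF0) 3)    ; 1110xxxx
        (else 4)))           ; 11110xxx

;; The three bytes of U+2019, built from byte values so that the
;; example does not depend on the encoding of the source file.
(define apostrophe
  (list->string (map integer->char '(226 128 153))))

;; Hypothetical lookup table with strings as keys.
(define braille-table
  (list (cons apostrophe '(3))
        (cons "s" '(2 3 4))))

;; Walk STR one UTF-8 character at a time and look each one up.
;; Characters missing from the table map to #f.
(define (string->dots str)
  (let loop ((pos 0) (acc '()))
    (if (>= pos (string-length str))
        (reverse acc)
        (let* ((n (utf8-sequence-length
                   (char->integer (string-ref str pos))))
               (end (min (+ pos n) (string-length str)))
               (hit (assoc (substring str pos end) braille-table)))
          (loop end (cons (and hit (cdr hit)) acc))))))

(string->dots (string-append apostrophe "s"))
;; => ((3) (2 3 4))
```

Because assoc compares keys with equal?, multibyte strings work as
keys without any conversion to characters or symbols.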

I should really read mails to the end before coming up with code.

-- 
David Kastrup


