lilypond-user
Re: UTF-8-aware backwards string searching in Guile, or: fixing centered


From: Aaron Hill
Subject: Re: UTF-8-aware backwards string searching in Guile, or: fixing centered lyrics ignoring punctuation
Date: Tue, 30 Oct 2018 16:56:53 -0700
User-agent: Roundcube Webmail/1.3.6

On 2018-10-30 10:01 am, Alexander Kobel wrote:
> This makes me wonder whether the problem is in the backwards
> string-search in string-skip-right or in the substring routine used in
> the make-center-on-word-callback; the only reason I can imagine why
> this pops up is that some blind-to-Unicode slicing cuts some string in
> the middle of a multi-byte Unicode character.

It's a little of both.

Near as I can tell, proper Unicode support was only added in Guile 2.0, so 1.8 treats every character as an 8-bit value. While a UTF-8-encoded string can be stored and passed around, none of the built-in character- or string-handling routines in 1.8 understand that encoding; they all operate on individual bytes.

For instance, the code you posted contains a mistake in defining the character set for punctuation symbols. string->list will convert a string into individual characters, but remember that to 1.8 every byte is its own character, so anything outside ASCII comes back as its separate UTF-8 bytes. As such, the following string:

    .?-;,:„“‚‘«»‹›『』「」“”‘’–— */()[]{}|<>!`~&…†‡

gets converted into the following list:

    (#\. #\? #\- #\; #\, #\: #\342 #\200 #\236 #\342 #\200 #\234
     #\342 #\200 #\232 #\342 #\200 #\230 #\302 #\253 #\302 #\273
     #\342 #\200 #\271 #\342 #\200 #\272 #\343 #\200 #\216 #\343
     #\200 #\217 #\343 #\200 #\214 #\343 #\200 #\215 #\342 #\200
     #\234 #\342 #\200 #\235 #\342 #\200 #\230 #\342 #\200 #\231
     #\342 #\200 #\223 #\342 #\200 #\224 #\space #\* #\/ #\( #\)
     #\[ #\] #\{ #\} #\| #\< #\> #\! #\` #\~ #\& #\342 #\200
     #\246 #\342 #\200 #\240 #\342 #\200 #\241)

The resulting character set is then just the unique individual bytes, not the original characters, many of which are encoded as two or three bytes:

    #<charset {#\space #\! #\& #\( #\) #\* #\, #\- #\. #\/ #\:
               #\; #\< #\> #\? #\[ #\] #\` #\{ #\| #\} #\~ #\200
               #\214 #\215 #\216 #\217 #\223 #\224 #\230 #\231
               #\232 #\234 #\235 #\236 #\240 #\241 #\246 #\253
               #\271 #\272 #\273 #\302 #\342 #\343}>
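
To make the byte-level view concrete, here is a small illustration in Python rather than Guile (Python's bytes objects behave much like 1.8's 8-bit characters). It is only a sketch: the names SYMBOLS, byte_list, byte_set and guile_octal are made up for this example, and it prints each byte in the same #\ooo octal notation used above.

```python
# Illustration only: Python bytes stand in for Guile 1.8's 8-bit characters.
# All names here are hypothetical helpers, not part of Guile or LilyPond.
SYMBOLS = ".?-;,:„“‚‘«»‹›『』「」“”‘’–— */()[]{}|<>!`~&…†‡"


def byte_list(text):
    """Return the UTF-8 encoding of text as a list of byte values."""
    return list(text.encode("utf-8"))


def byte_set(text):
    """Return the set of distinct byte values in the UTF-8 encoding of text."""
    return set(byte_list(text))


def guile_octal(byte):
    """Format a byte value the way Guile 1.8 prints a non-ASCII character."""
    return "#\\%o" % byte


# Only the bytes above 0x7F are interesting; the ASCII ones are unchanged.
high_bytes = sorted(b for b in byte_set(SYMBOLS) if b >= 0x80)
print(" ".join(guile_octal(b) for b in high_bytes))
```

The set contains lead bytes such as #\342 and continuation bytes such as #\240 as if they were independent characters, which is exactly what goes wrong in the next step.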

At first glance this appears to work, because the logic strips individual bytes from the left and right ends of the string. When a leading or trailing symbol is one of those in the list, all of its bytes are in the set, so the whole symbol is stripped. However, if the string begins or ends with some other character whose first or last byte happens to be in the set, that byte is stripped too and the character is cut in the middle.

In your example, "à" is encoded as #\303 #\240. But take note of #\240 which is in the character set. It was included in the set because of "†" which is encoded as #\342 #\200 #\240. If you were to remove "†" from the list of symbols, you'd find that the warning will go away, because #\240 is no longer being stripped.
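
The same effect can be reproduced outside of Guile. The following Python sketch is only an illustration of byte-wise stripping, not the code from your callback: strip_bytes is a hypothetical stand-in for the string-skip / string-skip-right pair, and the word "voilà," is an invented test input.

```python
# Illustration only: strip_bytes is a hypothetical stand-in for byte-wise
# string-skip / string-skip-right; "voilà," is an invented example word.
def strip_bytes(data, charset):
    """Drop bytes that are members of charset from both ends of data."""
    start, end = 0, len(data)
    while start < end and data[start] in charset:
        start += 1
    while end > start and data[end - 1] in charset:
        end -= 1
    return data[start:end]


with_dagger = set(".,†".encode("utf-8"))    # contains 0xA0 (octal 240)
without_dagger = set(".,".encode("utf-8"))  # ASCII bytes only

word = "voilà,".encode("utf-8")

broken = strip_bytes(word, with_dagger)     # the second byte of "à" is stripped
intact = strip_bytes(word, without_dagger)  # only the comma is stripped

print(broken)                  # ends in a lone lead byte, no longer valid UTF-8
print(intact.decode("utf-8"))
```

With the dagger in the set, the result ends in an orphaned #\303 byte, which is the kind of invalid sequence that triggers the warning.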

> Does anyone have a hint how to approach this one? (Or is the answer
> just: be patient and hope for Guile v2?)

The only hint here is to replace the built-in functions with ones that understand UTF-8 encoding and can perform the work needed. Someone online may well have done this already, which would save you from having to do it yourself.

Otherwise, the basic strategy is to replace string->list with a version that decodes UTF-8 and returns a list of integers (essentially UTF-32). All of the string work is then done on these lists of integers instead. (The character set would likewise be a set of integers representing the unique Unicode code points.) After you find the sublists that are interesting to measure, you'll need to convert them back into strings, which means encoding back into UTF-8.
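
As a rough sketch of that strategy, here it is again in Python rather than Guile, with the decoder and encoder written out by hand since that is the part you would have to supply in 1.8. Everything here is an assumption made for illustration: the function names, the shortened symbol list and the sample word are invented, and the decoder assumes well-formed input and does no validation beyond rejecting a stray continuation byte.

```python
# Sketch only: hand-written UTF-8 decode/encode so that stripping works on
# code points. Names, symbol list and sample input are hypothetical.
def decode_utf8(data):
    """Decode well-formed UTF-8 bytes into a list of code points."""
    points, i = [], 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                 # 0xxxxxxx: one byte
            value, extra = lead, 0
        elif lead >> 5 == 0b110:        # 110xxxxx: two bytes
            value, extra = lead & 0x1F, 1
        elif lead >> 4 == 0b1110:       # 1110xxxx: three bytes
            value, extra = lead & 0x0F, 2
        elif lead >> 3 == 0b11110:      # 11110xxx: four bytes
            value, extra = lead & 0x07, 3
        else:
            raise ValueError("unexpected byte at offset %d" % i)
        for cont in data[i + 1:i + 1 + extra]:
            value = (value << 6) | (cont & 0x3F)
        points.append(value)
        i += 1 + extra
    return points


def encode_utf8(points):
    """Encode a list of code points back into UTF-8 bytes."""
    out = bytearray()
    for p in points:
        if p < 0x80:
            out.append(p)
        elif p < 0x800:
            out += bytes([0xC0 | (p >> 6), 0x80 | (p & 0x3F)])
        elif p < 0x10000:
            out += bytes([0xE0 | (p >> 12), 0x80 | ((p >> 6) & 0x3F),
                          0x80 | (p & 0x3F)])
        else:
            out += bytes([0xF0 | (p >> 18), 0x80 | ((p >> 12) & 0x3F),
                          0x80 | ((p >> 6) & 0x3F), 0x80 | (p & 0x3F)])
    return bytes(out)


def strip_symbols(data, symbol_points):
    """Strip leading and trailing symbols, comparing whole code points."""
    points = decode_utf8(data)
    start, end = 0, len(points)
    while start < end and points[start] in symbol_points:
        start += 1
    while end > start and points[end - 1] in symbol_points:
        end -= 1
    return encode_utf8(points[start:end])


SYMBOL_POINTS = set(decode_utf8(".,„“†".encode("utf-8")))
result = strip_symbols("„voilà†".encode("utf-8"), SYMBOL_POINTS)
print(result.decode("utf-8"))
```

Because the set now holds code points such as 8224 (the dagger) rather than the byte #\240, the "à" at the end of the word is left alone while the surrounding symbols are still removed.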


-- Aaron Hill


