guile-user

Re: guile can't find a chinese named file


From: Mike Gran
Subject: Re: guile can't find a chinese named file
Date: Tue, 14 Feb 2017 20:54:07 +0000 (UTC)

On Tuesday, February 14, 2017 12:11 PM, Linas Vepstas <address@hidden> wrote:

> Which seems to be a bad decision. I've got strings, 10MBytes long, holding
> chinese in UTF8, and guile converts these internally, to UCS-32 which is a
> complete and total waste of CPU time. WTF.  It then has to  convert them
> back to UTF8 before passing them to my C++ code that actually does stuff
> with them.

> All I get for this design decision is poor performance, and endless
> complaints from boehm-gc:

I almost hate to wade in here, because no matter what I say, the
response is likely to be withering.

But, for what it is worth, the Latin-1/UTF-32 design decision came from
a couple of conflicting requirements.  The switch happened in the 1.9.x
series.


There were several examples of legacy C code, using Guile as an extension
language, that accessed the bytes of a string directly via
SCM_STRING_CHARS or scm_i_string_chars.  To keep from breaking that code,
we needed to retain the (then already deprecated) ability for C programs
to access 8-bit-locale string internals directly.

Also, in R6RS, there was the requirement that functions like "string-ref"
act in "constant time". This suggested either a codepoint-array
representation for strings, or a UTF-8 array representation with some
indexing to allow for constant-time access.
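
To illustrate the trade-off, here is a rough sketch in Python (not Guile's
actual code; the function name utf8_codepoint_at is made up for this
example).  With raw UTF-8 and no side index, finding the Nth character means
scanning from the start, while a codepoint array is a plain lookup:

```python
def utf8_codepoint_at(data: bytes, index: int) -> str:
    """Return the character at codepoint `index` by scanning UTF-8 bytes.

    Hypothetical helper: the cost grows with `index`, because the start of
    each character can only be found by walking the bytes before it.
    """
    if index < 0:
        raise IndexError(index)
    seen = -1  # index of the most recent codepoint whose first byte we passed
    for pos, byte in enumerate(data):
        # Bytes of the form 10xxxxxx continue a character; all others start one.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                end = pos + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1
                return data[pos:end].decode("utf-8")
    raise IndexError(index)


text = "guile 中文 file"
encoded = text.encode("utf-8")          # variable-width: needs the scan above
fixed_width = [ord(c) for c in text]    # codepoint array: one slot per character
```

Here chr(fixed_width[6]) is a single array access, whereas
utf8_codepoint_at(encoded, 6) has to walk the six characters before it.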

Note that the constant time access requirement was dropped in R7RS, if
I understand it correctly.

Guile wasn't the only language to make this decision.  Python strings
are similar, as you can see in PEP 393, though Guile's usage of such
an encoding scheme came first.
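
As a rough picture of what a narrow/wide scheme looks like, here is a Python
sketch.  The names pick_storage and string_ref are invented for this
example, and the layout is an assumption for illustration, not Guile's real
internals (PEP 393 also has a 2-byte form that is left out here):

```python
def pick_storage(text: str):
    """Return (bytes_per_char, buffer) for a two-width string sketch.

    Strings whose codepoints all fit in one byte are stored as Latin-1;
    anything else is stored as 4 bytes per codepoint.
    """
    if all(ord(c) < 256 for c in text):
        return 1, text.encode("latin-1")
    # The "-le" codec variant writes no byte-order mark.
    return 4, text.encode("utf-32-le")


def string_ref(width: int, buf: bytes, index: int) -> str:
    """Constant-time character access: the offset is just index * width."""
    if index < 0 or (index + 1) * width > len(buf):
        raise IndexError(index)
    chunk = buf[index * width:(index + 1) * width]
    return chr(int.from_bytes(chunk, "little"))


narrow_width, narrow_buf = pick_storage("café")
wide_width, wide_buf = pick_storage("中文 file")
```

Either way the index is always a character index, which is what keeps
string-ref simple; the cost is 4 bytes per character for any string that
contains even one character outside Latin-1.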

I still maintain that this design decision was a good one based on
the simplicity of implementation.  When I helped out with the coding of the
Unicode support, I had three different prototypes: a UTF-32-only
Guile, a UTF-8 Guile, and the current scheme.

The great difficulty with the UTF-8 Guile prototype was the need to
interrogate every string access or index to decide if it was a codepoint
index or a byte index. I abandoned that effort because it was doing my
head in.  Had we chosen that route, the result would likely have been
a long, long process of squashing difficult bugs related to byte vs
codepoint index confusion.
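
A small Python sketch of the kind of bug meant here (the file name is just an
example, and this is an illustration rather than anything from the
prototype).  The same position has two different numbers depending on
whether you count characters or bytes, and mixing them up fails silently:

```python
text = "文件名.txt"
encoded = text.encode("utf-8")

char_index = text.index(".")       # position counted in codepoints
byte_index = encoded.index(b".")   # position counted in UTF-8 bytes

# Correct: each index is used with the representation it came from.
stem_from_chars = text[:char_index]
stem_from_bytes = encoded[:byte_index].decode("utf-8")

# Bug 1: a byte index applied to the codepoint-indexed string.
# No error is raised; the slice simply runs too far.
wrong_stem = text[:byte_index]

# Bug 2: a codepoint index applied to the byte buffer.
# Again no error; the slice stops after the first character's bytes.
truncated_stem = encoded[:char_index].decode("utf-8")
```

With pure ASCII both indexes agree, so this sort of mistake tends to pass
testing and only shows up once someone feeds in non-ASCII text.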

But, for what it is worth, we've had a few years of the internal
representation of strings being private, so any modification of
that representation would be easier in 2017 than it would have been
in 2007, when the guts of strings were exposed to the C API.


Thanks,
Mike

(N.B. dak at gnu is on my block list, so I won't see any such response.)

