koha-devel

[Koha-devel] Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and MARC::Fi


From: Mike Rylander
Subject: [Koha-devel] Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and MARC::File::XML
Date: Mon, 20 Mar 2006 10:54:08 -0500

On 3/20/06, Pierrick LE GALL <address@hidden> wrote:
> Hello Mike,
>
> I'll answer the second question, since I worked with Paul on
> Perl/MySQL and UTF-8...
>
> On Mon, 20 Mar 2006 09:59:32 -0500
> "Mike Rylander" <address@hidden> wrote:
>
> > Are you using decode_utf8($mysql_string) to let Perl know that the
> > database is UTF8 encoded?  IIRC, MySQL doesn't know how to tell Perl
> > about that, and the DBD::mysql maintainers haven't added that
> > functionality to the module yet.
>
> We don't use decode_utf8. Just after creating the database handle, we
> force communication to be UTF-8 with the "set names 'UTF8'" SQL query.
> As we know our data is stored as UTF-8 and we want UTF-8, all works fine.
>

Except that Perl doesn't know that the data is already UTF-8, which
is the problem.  Perl /does/ know that the MARC data is UTF-8, and it
has to convert one string or the other on output.  If you explicitly
use binmode() to set the PerlIO layer to :utf8, then the MARC::Record
strings, which are known-good UTF-8, come out correctly, but the MySQL
data, about whose encoding Perl knows nothing, is treated as Latin-1
and encoded a second time, and thus broken.
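
To make the failure concrete, here is a minimal sketch that needs no
database.  It assumes the driver hands back the raw UTF-8 bytes for
"é" (C3 A9), and it writes to an in-memory handle with the :utf8 layer
instead of STDOUT so the resulting bytes can be inspected.  The helper
names written_bytes and hex_dump are made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: write a string through a :utf8 PerlIO layer
# into an in-memory buffer and return the bytes that were produced.
sub written_bytes {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>:utf8', \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh);
    return $buffer;
}

# Hypothetical helper: render bytes as space-separated hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Assumption: this is what a driver with no UTF-8 support returns
# for the single character "é": two undecoded bytes.
my $raw_from_db = "\xC3\xA9";

# Telling Perl about the encoding turns those bytes into one character.
my $characters = decode('UTF-8', $raw_from_db);

my $good = written_bytes($characters);   # decoded first
my $bad  = written_bytes($raw_from_db);  # never decoded

printf "decoded first: %s\n", hex_dump($good);  # C3 A9
printf "never decoded: %s\n", hex_dump($bad);   # C3 83 C2 A9
```

The second line of output is the double-encoded form: each original
byte was taken for a Latin-1 character and encoded again.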

The only consistent and correct way to deal with UTF-8 data in Perl is
to let PerlIO handle it by marking every source as either providing
UTF-8 data or not.  You can do that with binmode(), with a layer
argument to open(), and in several other ways, including the open
pragma in modern Perls (
http://search.cpan.org/~nwclark/perl-5.8.8/lib/open.pm ).  Because
DBD::mysql doesn't give you a way to mark its socket as UTF-8, you need
to be a little underhanded and tell Perl as soon as possible, either by
calling decode() on the fetched data or by making utf8 the default
layer for all PerlIO handles.  There really isn't any way around this
if you want to claim real UTF-8 support and be able to use components
that really do support UTF-8 natively, like MARC::File::XML and
MARC::Record.
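
As a rough sketch of the "tell Perl as soon as possible" approach, the
helper below decodes every defined column of a row right after it is
fetched.  decode_row is a hypothetical name, not part of DBI or Koha,
and the row here is simulated; in real code the array reference would
come from something like $sth->fetchrow_arrayref.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return a copy of the row in which every defined
# column has been decoded from UTF-8 bytes into Perl characters.
# NULL columns (undef) are passed through untouched.
sub decode_row {
    my ($row) = @_;
    return [ map { defined $_ ? decode('UTF-8', $_) : undef } @$row ];
}

# Simulated row, assuming the driver returned raw UTF-8 bytes.
# In a real program: my $row = $sth->fetchrow_arrayref;
my $row = [ "Biblioth\xC3\xA8que", undef, "42" ];

my $decoded = decode_row($row);

# 13 bytes before decoding, 12 characters afterwards.
printf "bytes: %d, characters: %d\n",
    length($row->[0]), length($decoded->[0]);
```

Decoding at the fetch boundary means the rest of the program only ever
sees character strings, so they can be mixed freely with MARC::Record
data and written through a :utf8 handle.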

It's unfortunate that the DBD::mysql people won't fix their module,
but there really is a right way to do this, even without their help. 
Is there a performance penalty with decode()?  Yep.  Would that go
away with a fix to the DBD::mysql module?  Mostly, so you really need
to bug them.

> Bye
>
> --
> Pierrick LE GALL
> INEO media system
>


--
Mike Rylander
address@hidden
GPLS -- PINES Development
Database Developer
http://open-ils.org



