RE: [Nuxeo-localizer] Non-MessageCatalog UTF-8 strings get encoded again

From:	Sean Treadway
Subject:	RE: [Nuxeo-localizer] Non-MessageCatalog UTF-8 strings get encoded again
Date:	Tue, 15 Oct 2002 12:05:57 +0200

Juan,

Thank you for the workaround; it works like a charm with one addition:
I also needed to encode the output of the Localizer.changeLanguageForm()
method as UTF-8.
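
Roughly, the extra step looks like the sketch below.  It is written for
a current Python 3 interpreter, where the old "unicode" type is str and
the old plain string is bytes.  The helper name to_utf8 and the sample
markup are made up for illustration; they are not part of Localizer, and
the real output of changeLanguageForm() is not reproduced here.

```python
def to_utf8(value):
    """Return value as UTF-8 encoded bytes.

    Text (Unicode) is encoded; bytes are assumed to be UTF-8 already
    and are passed through unchanged.
    """
    if isinstance(value, str):
        return value.encode("utf-8")
    return value


# Hypothetical stand-in for the Unicode markup that a call such as
# Localizer.changeLanguageForm() hands back.
form_html = "<option>Español</option>"
encoded = to_utf8(form_html)
```

With this, everything placed on the page is a UTF-8 byte string, so
nothing is left for Zope to promote.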

What should we do to fix this?  I think it is a bug that Zope assumes
custom content is encoded in Latin-1, but that has been the default
encoding for most of the strings up until now.  I see three solutions:

1. Convert all my UTF-8 strings to Unicode.  This seems like the
   'correct' solution because it removes any ambiguity, and I hold the
   knowledge of which strings are UTF-8 (my custom objects) and which
   are not (Zope's stock objects).

2. Update Zope to promote all non-Unicode strings using a user-defined
   encoding instead of Latin-1.

3. Update the Localizer to return strings in a user-defined encoding
   instead of Unicode.  Given that Localizer is one of the first
   products to make heavy use of Unicode, this would be a good place to
   add a fix for my application.  In the management pages, there is an
   option to select the character set of the PO files.  It would be a
   logical place to specify the character set of returned strings.
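
To make the first option concrete, here is a minimal sketch of the
conversion, again in Python 3 terms (bytes for the stored UTF-8 strings,
str for Unicode).  The function name utf8_to_unicode and the sample
values are hypothetical; walking the actual objects and properties is
not shown.

```python
def utf8_to_unicode(raw):
    """Decode a UTF-8 byte string into text; leave text untouched."""
    if isinstance(raw, bytes):
        return raw.decode("utf-8")
    return raw


# Legacy content as it is stored today: UTF-8 encoded bytes.
stored = "Ibáñez".encode("utf-8")

# Decode once, explicitly, with the encoding I know to be correct.
text = utf8_to_unicode(stored)

# Text combined with text: nothing is guessed, nothing is re-encoded.
page = text + " / " + "translated message"
```

Once the content is Unicode, there is no plain string left for a
default encoding to be applied to.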

I will file a bug against Zope within a couple of days.  In the meantime
the workaround for my site is operational, so I would not update
Localizer.  I think Localizer is doing the 'right thing' by returning
Unicode strings, and any change would be counter-productive in the long
run.  This behavior and the workaround were excellently described by you
and should be in Localizer's documentation.
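
For such documentation, the effect can be reproduced outside Zope.  The
sketch below uses Python 3 and assumes that the promoted text is
finally sent as UTF-8, as my charset header requests; the variable
names are only illustrative.

```python
original = "á"

# How my content is stored: UTF-8 encoded bytes.
stored = original.encode("utf-8")        # b'\xc3\xa1'

# The promotion to Unicode with a hardcoded Latin-1 assumption:
# each byte becomes one character.
promoted = stored.decode("latin-1")      # 'Ã¡'

# The page is then encoded as UTF-8 for the browser.
sent = promoted.encode("utf-8")          # b'\xc3\x83\xc2\xa1'

# Decoding with the right encoding instead gives back the original.
correct = stored.decode("utf-8")
```

The two bytes of one character end up as four bytes on the wire, which
is the "extra encoding" I saw in the browser.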

Thanks again for a quick and accurate response!

-Sean

> -----Original Message-----
> From: address@hidden [mailto:nuxeo-
> address@hidden On Behalf Of Juan David Ibáñez Palomar
> Sent: Tuesday, October 15, 2002 10:49 AM
> To: address@hidden
> Subject: Re: [Nuxeo-localizer] Non-MessageCatalog UTF-8 strings get
> encoded again
> 
> 
> I forgot to give you a short term solution.
> 
> Create the message catalog with the id "mc". Create a Python
> Script with the id "gettext"; it would look like this:
> 
> def gettext(message, lang=None):
>     translation = container.mc(message, lang)
>     return translation.encode('utf-8')
> 
> Then use the script instead of calling the message catalog
> directly. When this problem is fixed you can simply remove
> the Python script and rename the message catalog.
> 
> This way you can continue building your web site and don't
> have to wait for anything else.
> 
> 
> Regards,
> 
> 
> Juan David Ibáñez Palomar wrote:
> 
> > Sean Treadway wrote:
> >
> >> First off, Localizer promises to make a world of difference for our
> >> site.  We have customers all over the world.  Thanks for coming this
> >> far!
> >>
> >> Our site dates back to Zope 2.0 where I stored all the content in UTF-8
> >> encoded python strings.  Displaying the content worked great because I
> >> set the Content-Type header to "text/html;charset=utf-8" for all of the
> >> pages that submit and display the content.
> >>
> >> I've upgraded to Zope 2.6beta.  I installed a MessageCatalog (0.9.1) in
> >> the folder of a virtual site and have some translations in place.
> >>
> >> When I view pages that include the utf-8 encoded content from the new
> >> site, the non-ASCII characters look like they get an extra encoding
> >> (from utf-8 to utf-8).  When I remove the Content-Type header and
> >> display the page in Latin-1, it looks the same as if I have the
> >> Content-Type header and display the page in UTF-8.  If I switch a page
> >> without the Content-Type header from Latin-1 to UTF-8 it looks fine.
> >> However, I need to tell the browser to view the page in UTF-8 and get
> >> the content there without an extra encoding.
> >>
> >> My suspicion is that the MessageCatalog is doing something with the
> >> encoding of the response before the request is finished.  The same
> >> content, with the "Content-Type: text/html;charset=utf-8" header from
> >> the original site (without a message catalog) looks fine.  The content
> >> from the message catalog is fine for both pages that have and do not
> >> have the utf-8 charset header.  The content displays fine if I delete
> >> the message catalog and include the charset=utf-8 in the Content-Type
> >> header.
> >>
> >> Any insight?  What can I do?  I would really like to use the Localizer,
> >> but updating my content is a daunting task with many objects that have
> >> many properties.  Is there a place in the code I can look for answers or
> >> is this a fundamental behavior of the product?  If anyone can describe
> >> the logic that applies per request or has some sane advice for this i18n
> >> site, I am listening.
> >>
> >> Thanks,
> >> -Sean
> >>
> >>
> >>
> >
> > Hi Sean,
> >
> > The good news is that I know what is happening. It's not a
> > Localizer issue, it's Zope. To verify it, remove the Message
> > Catalog and remove Localizer if you like; then add a unicode
> > string in your template, for example, add:
> >
> > <dtml-var "u'I am a unicode string'">
> >
> > Now try the experiment again; you will see the same result.
> >
> > Here is the explanation. The problem is that Python has two
> > types of strings, normal and Unicode. In your current web site
> > you use normal strings encoded in UTF-8.
> >
> > What happens when a normal string and a Unicode string are
> > concatenated? The normal string is promoted to a Unicode
> > string; to do that it must be decoded. It isn't possible to
> > detect the encoding, so a default one is used.
> >
> > In Python the default is ASCII. Start the Python interpreter
> > and type:
> >
> > >>> 'á' + u'a'
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in ?
> > UnicodeError: ASCII decoding error: ordinal not in range(128)
> >
> > Python interprets the string 'á' as ASCII, but accented characters
> > can't be represented in ASCII, so it raises an exception. It's
> > possible to change the default encoding in Python.
> >
> > However, things are different with Zope. The Unicode support in
> > Zope 2.6 was implemented by Toby Dickenson, who decided to give
> > Latin-1 a prominent role. Look at the line:
> >
> > lib/python/DocumentTemplate/pDocumentTemplate.py:248
> >
> > within the method "join_unicode", it is:
> >
> > rendered[i] = unicode(rendered[i],'latin-1')
> >
> > In Zope, each time a normal string is concatenated with a Unicode
> > string it's interpreted as Latin-1, and that is hardcoded. Bad luck.
> >
> > When I implemented Unicode support I respected and followed this
> > policy. I didn't bother to address the problem.
> >
> >
> > So now, the solution is...
> >
> > The good one is to fix Zope. Zope 2.6 is still in the works; if
> > this problem is seen as a bug, and I think it is, then it has a
> > chance of being fixed. I will have to modify Localizer too, but
> > this won't be an issue.
> >
> > Florent is the one who can help here. Actually, maybe this has
> > already been fixed; I don't follow the Zope CVS activity. Florent,
> > could you give some insight?
> >
> >
> > In the worst case, if it is not fixed in Zope, I will have to
> > implement a workaround in Localizer, with your help I hope :-)
> >
> >
> > Regards,
> >
> 
> 
> --
> J. David Ibáñez, http://www.j-david.net
> Software Engineer / Ingénieur Logiciel / Ingeniero de Software
> 
> 
> 
> 
> _______________________________________________
> Nuxeo-localizer mailing list
> address@hidden
> http://mail.freesoftware.fsf.org/mailman/listinfo/nuxeo-localizer
