Re: [gnulib-tool-py] Python 2 vs Python 3: operating with strings


From: Bruno Haible
Subject: Re: [gnulib-tool-py] Python 2 vs Python 3: operating with strings
Date: Sun, 29 Apr 2012 23:33:24 +0200
User-agent: KMail/4.7.4 (Linux/3.1.10-1.9-desktop; KDE/4.7.4; x86_64; ; )

Hi Dmitry,

> I have one problem: there are some functions in the code which capture stdout
> after executing some shell commands.

I see from the code that you mean 'git' and 'date', for example. These
programs, like most Unix programs, do respect the Unix locale encoding
(nl_langinfo(CODESET)).
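
For illustration, the same value can be queried from Python on Unix systems. This is only a minimal sketch: locale_codeset is a hypothetical helper name, not something in the gnulib-tool-py code, and locale.nl_langinfo is not available on native Windows.

```python
import locale


def locale_codeset():
    """Return the codeset of the user's locale, as nl_langinfo(CODESET) reports it.

    Hypothetical helper for illustration; Unix only.
    """
    try:
        # Adopt the locale from the environment (LC_ALL, LC_CTYPE, LANG).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale this system lacks: stay in the "C" locale.
        pass
    return locale.nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    print(locale_codeset())
```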

> All the content from the shell has 'str'
> type in Python 2 and 'bytes' type in Python 3. All works great in Python 2,
> because all English strings can be converted to 'unicode' type, but
> Python 3 pays great attention to the type of string. You cannot
> concatenate 'str' and 'bytes' types without telling what encoding you use.
> So we have a problem of portability. What decisions can we make?
> 
> 1. We can check what encodings are used on different systems. All we need
> is to run some commands:
>     a) sys.getdefaultencoding()
>     b) sys.stdout.encoding
>     c) sys.getfilesystemencoding()
> I know that on Linux we have 'UTF-8' everywhere, but on pure Windows we have
> 'cp1251', 'cp866', 'mbcs' and it depends on the locale too. I think that cygwin
> uses 'UTF-8' too. However, we need to check everything and then write
> conditions for how to convert bytes to string if we use Python 3.
> 2. We could use my package streaming and fileutils, which could solve
>    this problem absolutely. There are two problems:
>    1) I haven't yet converted this package to Python 3;
>    2) package must be recompiled for each system.

Option 2 means a dependency on an external package, which we try to avoid
if we can. The fact that it needs to be recompiled (not 100% Python) is also
something to avoid.

So, about option 1. I made a test with two locales on a glibc system:
  - de_DE, which has encoding ISO-8859-1,
  - de_DE.UTF-8, which is a UTF-8 locale.

$ export LC_ALL=de_DE
$ python3
>>> import sys
>>> import locale
>>> sys.getdefaultencoding()
'utf-8'
>>> sys.stdout.encoding
'ISO-8859-1'
>>> locale.getpreferredencoding()
'ISO-8859-1'
>>> sys.getfilesystemencoding()
'iso8859-1'

$ export LC_ALL=de_DE.UTF-8
$ python3
>>> import sys
>>> import locale
>>> sys.getdefaultencoding()
'utf-8'
>>> sys.stdout.encoding
'UTF-8'
>>> locale.getpreferredencoding()
'UTF-8'
>>> sys.getfilesystemencoding()
'utf-8'

So, sys.getdefaultencoding() is not the right thing to use: it returns
'utf-8' in both locales.

Whereas sys.stdout.encoding, locale.getpreferredencoding(), and
sys.getfilesystemencoding() all follow the locale encoding and are the
right thing to use.
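
As a rough sketch of how this could be applied to the output of a shell command: run_and_decode is a hypothetical helper name, not something in the current code, and the Python interpreter itself stands in for 'git' or 'date' so that the example runs anywhere.

```python
import locale
import subprocess
import sys


def run_and_decode(args):
    """Run a command and return its standard output as text.

    Hypothetical helper for illustration, not part of gnulib-tool-py.
    """
    # check_output returns 'bytes' in Python 3 and 'str' in Python 2.
    raw = subprocess.check_output(args)
    # Decode with the locale encoding, not with sys.getdefaultencoding().
    return raw.decode(locale.getpreferredencoding())


if __name__ == "__main__":
    # The interpreter stands in for an external program such as 'date'.
    print(run_and_decode([sys.executable, "-c", "print('hello')"]).strip())
```

The decoding happens in one place, so the callers can work with text only and no per-platform conditions are needed.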

> 2. I can take from my streaming module only what will work cross-platform
> and create the same bstream and ustream classes. That avoids the need for
> conditions when converting bytes to strings. This approach really combines
> options 1 and 2. Another plus is that this can be used as a separate
> module. Of course this module will use pure Python, not Cython.

At this point - we are just at the beginning of the project and have little
experience with Python 3 - I would, by default, trust that the Python 3
designers have put everything we need into the built-in libraries. And
that therefore, when we think we need to depend on an external module,
we should look at it once again.

Bruno



