Re: [gnulib-tool-py] Python 2 vs Python 3: operating with strings
From: Bruno Haible
Subject: Re: [gnulib-tool-py] Python 2 vs Python 3: operating with strings
Date: Sun, 29 Apr 2012 23:33:24 +0200
User-agent: KMail/4.7.4 (Linux/3.1.10-1.9-desktop; KDE/4.7.4; x86_64; ; )
Hi Dmitry,
> I have one problem: there are some functions in the code which capture stdout
> after executing some shell commands.
I see from the code that you mean 'git' and 'date', for example. These
programs, like most Unix programs, do respect the Unix locale encoding
(nl_langinfo(CODESET)).
> All the content from the shell has 'str'
> type in Python 2 and 'bytes' type in Python 3. All works great in Python 2,
> because all English strings can be converted to 'unicode' type, but
> Python 3 pays close attention to the type of a string. You cannot
> concatenate 'str' and 'bytes' types without telling what encoding you use.
> So we have a portability problem. What decisions can we make?
>
> 1. We can check what encodings are used on different systems. All we need
> is to run some commands:
> a) sys.getdefaultencoding()
> b) sys.stdout.encoding
> c) sys.getfilesystemencoding()
> I know that on Linux we have 'UTF-8' everywhere, but on pure Windows we have
> 'cp1251', 'cp866', 'mbcs' and it depends on locale too. I think that cygwin
> uses 'UTF-8' too. However, we need to check everything and then write
> conditions for how to convert bytes to strings when we use Python 3.
> 2. We could use my package streaming and fileutils, which could solve
> this problem completely. There are two problems:
> 1) I haven't yet converted this package to Python 3;
> 2) package must be recompiled for each system.
Option 2 means a dependency on an external package, which we try to avoid
if we can. The fact that it needs to be recompiled (not 100% Python) is also
something to avoid.
So, about option 1. I made a test with two locales on a glibc system:
- de_DE, which has encoding ISO-8859-1,
- de_DE.UTF-8, which is a UTF-8 locale.
$ export LC_ALL=de_DE
$ python3
>>> import sys
>>> import locale
>>> sys.getdefaultencoding()
'utf-8'
>>> sys.stdout.encoding
'ISO-8859-1'
>>> locale.getpreferredencoding()
'ISO-8859-1'
>>> sys.getfilesystemencoding()
'iso8859-1'
$ export LC_ALL=de_DE.UTF-8
$ python3
>>> import sys
>>> import locale
>>> sys.getdefaultencoding()
'utf-8'
>>> sys.stdout.encoding
'UTF-8'
>>> locale.getpreferredencoding()
'UTF-8'
>>> sys.getfilesystemencoding()
'utf-8'
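So, sys.getdefaultencoding() is not the right thing to use: it returns
'utf-8' in both locales, regardless of the locale encoding.
Whereas sys.stdout.encoding, locale.getpreferredencoding(), and
sys.getfilesystemencoding() all follow the locale encoding, and so are
the right thing.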
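To illustrate, here is a minimal sketch of how the output of a command
could be converted from 'bytes' to 'str' with the locale encoding. The
helper name run_and_decode is hypothetical, not something that exists in
the gnulib-tool code, and the command being run is just a stand-in for
'git' or 'date'.

```python
import locale
import subprocess
import sys


def run_and_decode(args):
    """Run a command and return its stdout as text (hypothetical helper)."""
    # In Python 3, check_output returns 'bytes'.
    raw = subprocess.check_output(args)
    # Use the locale encoding, not sys.getdefaultencoding().
    encoding = locale.getpreferredencoding()
    return raw.decode(encoding)


if __name__ == "__main__":
    # Stand-in command: ask the running interpreter to print a word.
    print(run_and_decode([sys.executable, "-c", "print('hello')"]))
```

This assumes that the program writes its output in the locale encoding,
as 'git' and 'date' are expected to do.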
> 2. I can take from my streaming module only what will work cross-platform
> and create the same bstream and ustream classes. That would avoid the need
> for conditions when converting bytes to strings. This approach really combines
> options 1 and 2. Another plus is that it can be used as a separate
> module. Of course this module will use pure Python, not Cython.
At this point - we are just at the beginning of the project and have little
experience with Python 3 - I would, by default, trust that the Python 3
designers have put everything we need into the built-in libraries. And
that therefore, when we think we need to depend on an external module,
we should look at it once again.
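As one possible example of what the built-in libraries already offer:
as far as I know, the subprocess module can do the decoding itself when
asked for text mode through universal_newlines=True, using the encoding
from locale.getpreferredencoding(False). A sketch, with a hypothetical
helper name run_text and a stand-in command:

```python
import subprocess
import sys


def run_text(args):
    """Run a command and return its stdout as 'str' (hypothetical helper)."""
    # universal_newlines=True opens the pipe in text mode, so the result
    # is 'str' rather than 'bytes' and line endings are normalized to '\n'.
    return subprocess.check_output(args, universal_newlines=True)


if __name__ == "__main__":
    print(run_text([sys.executable, "-c", "print('hello')"]))
```

Whether this is sufficient for all the commands we invoke would still
need to be checked on each platform.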
Bruno