This explains why the following solves the issue OP is having, and gives some background on the issue OP is trying to describe:
from __future__ import unicode_literals
from builtins import str
In the default iPython 2.7 kernel :
(iPython session)
In [1]: type("é") # By default, quotes in py2 create py2 strings, which is the same thing as a sequence of bytes that given some encoding, can be decoded to a character in that encoding.
Out[1]: str
In [2]: type("é".decode("utf-8")) # We can get to the actual text data by decoding it if we know what encoding it was initially encoded in, utf-8 is a safe guess in almost every country but Myanmar.
Out[2]: unicode
In [3]: len("é") # Note that the py2 `str` representation has a length of 2. In utf-8, é is encoded as the two bytes 0xC3 0xA9, which only mean something together; neither byte is the "e" or the accent on its own.
Out[3]: 2
In [4]: len("é".decode("utf-8")) # the py2 `unicode` representation has length 1, since an accented e is a single character
Out[4]: 1
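For anyone who wants to reproduce the session above outside of a py2 kernel, here is a minimal sketch that should behave the same on Python 2.7 and Python 3. It assumes utf-8 as the encoding and uses explicit u"..." and b"..." literals so nothing depends on the interpreter's default string type.

```python
# -*- coding: utf-8 -*-
# u"..." is text and b"..." is bytes on both Python 2.7 and Python 3.3+.
text = u"\u00e9"              # the single character é (U+00E9) as text
data = text.encode("utf-8")   # the same character as utf-8 encoded bytes

print(len(text))              # 1: one character
print(len(data))              # 2: that one character takes two bytes in utf-8
print(data == b"\xc3\xa9")    # True: the two bytes are 0xC3 and 0xA9
print(data.decode("utf-8") == text)  # True: decoding gives the text back
```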
Some other things of note in python 2.7:
"é" is the same thing as str("é")
u"é" is the same thing as "é".decode('utf-8') or unicode("é", 'utf-8')
u"é".encode('utf-8') is the same thing as str("é")
- You typically call decode with a py2 str, and encode with py2 unicode.
- Due to early design issues, you can call both on either even though that doesn't really make any sense.
- In python3, str, which is the same as python2 unicode, can no longer be decoded, since a string is by definition an already decoded sequence of bytes. When you encode it, str.encode() uses the utf-8 encoding by default.
- In python 2.7, byte sequences that were encoded with the ascii codec behave almost exactly the same as their decoded counterparts.
- In python 2.7 with no future imports : type("a".decode('ascii')) gives a unicode object, but this behaves nearly identically with str("a"). This is not the case in python3.
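The ascii point in the list above is easy to check. This is a small sketch, not anything OP has to use; it compares an ascii byte string with its decoded text and reports the result per interpreter version.

```python
import sys

text = u"a"   # text (py2 unicode / py3 str)
data = b"a"   # bytes (py2 str / py3 bytes)

# Python 2 implicitly decodes ascii bytes when comparing with text,
# so the two compare equal. Python 3 never mixes the two types.
same = (text == data)

if sys.version_info[0] == 2:
    print("py2: bytes and text compare equal: %s" % same)
else:
    print("py3: bytes and text compare equal: %s" % same)
```

This implicit ascii decoding is the reason py2 code often works fine until the first non-ascii character shows up.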
With that said, here's what the snippets above do :
__future__ is a module maintained by the core python team that backports python3 functionality to python2 to allow you to use python3 idioms within python2.
from __future__ import unicode_literals has the following effect :
- Without the future import, "é" is the same thing as str("é")
- With the future import, "é" is functionally the same thing as unicode("é")
builtins is a module that is approved by the core python team, and contains safe aliases for using python3 idioms in python2 with the python3 api.
- Due to reasons beyond me, the package itself is named "future", so to install the builtins module you run : pip install future
from builtins import str has the following effect :
- The str constructor now gives what you think it does, i.e. text data in the form of python2 unicode objects. So it's functionally the same thing as str = unicode
- Note : Python3 str is functionally the same as Python2 unicode
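- Note : To get bytes in python2, you can use the "bytes" prefix, e.g. b'é'. Python3 only allows ascii characters in a bytes literal, so there you would write 'é'.encode('utf-8') instead.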
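If you can't install the future package, the "str = unicode" idea can be approximated by hand. The sketch below is only an illustration of that idea, not how the future package implements it; text_type and to_text are hypothetical names I made up for this example.

```python
# Hypothetical helper names; not part of the future package.
try:
    text_type = unicode   # Python 2: unicode is the text type
except NameError:
    text_type = str       # Python 3: str is already the text type


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):   # on Python 2, bytes is an alias of str
        return value.decode(encoding)
    return text_type(value)


print(to_text(b"\xc3\xa9") == u"\u00e9")   # True: bytes get decoded
print(to_text(u"\u00e9") == u"\u00e9")     # True: text passes through
print(to_text(42) == u"42")                # True: other objects are converted
```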
The takeaway is this :
- Decode on read/Decode early on and encode on write/encode at the end
- Use str objects for bytes and unicode objects for text
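To make "decode on read, encode on write" concrete, here is a rough sketch that should run on both Python 2.7 and Python 3. The temporary file and the utf-8 encoding are assumptions for the sake of the example; io.open is used because it accepts an encoding argument on both versions.

```python
import io
import os
import tempfile

# Hypothetical input file containing the utf-8 bytes for "café".
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "wb") as f:
    f.write(b"caf\xc3\xa9")

# Decode on read: io.open turns the bytes into text as they come in.
with io.open(path, "r", encoding="utf-8") as f:
    line = f.read()

# Work on text only: length and case operations now see characters, not bytes.
shouted = line.upper()
print(len(line))   # 4

# Encode on write: the text becomes bytes again only at the boundary.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(shouted)

with open(path, "rb") as f:
    raw = f.read()
print(raw == b"CAF\xc3\x89")   # True: "CAFÉ" encoded as utf-8
```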
If you add from __future__ import unicode_literals to the top of the file as well, does that solve your issue?