Understanding Python Unicode and Linux terminal

Question

I have a Python script that writes out some UTF-8 encoded strings. In the script I mainly use the str() function to cast to string. It looks like this:

mystring="this is unicode string:"+japanesevalues[1] 
#japanesevalues is a list of unicode values, I am sure it is unicode
print mystring

I don't use the Python terminal, just the standard Linux Red Hat x86_64 terminal. I set the terminal to output utf8 chars.

If I execute this:

#python myscript.py
this is unicode string: カラダーズ ソフィー

But if I do that:

#python myscript.py > output

I get the typical error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 253-254: ordinal not in range(128)

Why is that?

In your question, you said "some strings with UTF-8 encoding". How can you make sure that the strings were encoded with UTF-8? What have you done?
– venus.w, Commented Nov 7, 2013 at 14:18
@venus.w I'm sorry I can't help you much. I'm reading the strings from both a DB and a CSV that are encoded in UTF-8, but I just assume that the encoding is indeed UTF-8 (since if I print them out I can properly read the Japanese characters). They might actually be encoded in some other character set that also allows Japanese characters. I believe there are Python functions that can tell you the encoding of a string and even change it.
– Cesc, Commented Nov 8, 2013 at 4:34

Lennart Regebro · Accepted Answer · 2013-07-02 09:58:56Z

15

The terminal has a character set, and Python knows what that character set is, so it will automatically encode your Unicode strings to the byte encoding that the terminal uses, in your case UTF-8.

But when you redirect, you are no longer writing to the terminal. You are now writing to a file or a Unix pipe. That has no charset, and Python has no way of knowing which encoding you now want, so it will fall back to a default character set. You have tagged your question "Python-3.x", but your print syntax is Python 2, so I suspect you are actually using Python 2. In that case your sys.getdefaultencoding() is generally 'ascii', and in your case it definitely is. And of course you cannot encode Japanese characters as ASCII, so you get an error.
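The failure can be reproduced without a terminal or a redirect, since it is simply what happens when a unicode string with non-ASCII characters is encoded with the 'ascii' codec. A minimal sketch, using the sample text from the question (try_encode is a hypothetical helper for illustration, not part of Python):

```python
# -*- coding: utf-8 -*-
# Sample text from the question; the u"" prefix keeps this valid on Python 2 and 3.
text = u"this is unicode string: カラダーズ ソフィー"


def try_encode(s, encoding):
    """Hypothetical helper: return the encoded bytes, or the exception raised."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


# 'ascii' is the fallback Python 2 uses when stdout has no encoding.
ascii_result = try_encode(text, "ascii")
# UTF-8 can represent every character in the string.
utf8_result = try_encode(text, "utf-8")
```

Here ascii_result holds a UnicodeEncodeError like the one in the question, while utf8_result holds the encoded bytes.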

Your best bet when using Python 2 is to encode the string as UTF-8 before printing it. Then redirection will work, and the resulting file will be UTF-8. That will not display correctly if your terminal uses some other encoding, though. In that case you can get the terminal encoding from sys.stdout.encoding and use that (it will be None when redirecting under Python 2).
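One way to combine both suggestions is to use the stream's own encoding when it has one and fall back to UTF-8 when it is None. This is only a sketch: pick_encoding, encode_for and RedirectedStdout are hypothetical names, and RedirectedStdout merely imitates what Python 2 reports for a redirected sys.stdout.

```python
# -*- coding: utf-8 -*-


def pick_encoding(stream, fallback="utf-8"):
    """Return stream.encoding, or `fallback` when it is None or missing."""
    return getattr(stream, "encoding", None) or fallback


def encode_for(stream, text, fallback="utf-8"):
    """Encode unicode `text` to bytes using the encoding chosen for `stream`."""
    return text.encode(pick_encoding(stream, fallback))


class RedirectedStdout(object):
    """Stand-in (assumption) for Python 2's sys.stdout when redirected."""
    encoding = None


mystring = u"this is unicode string: カラダーズ ソフィー"
# No encoding on the stream, so the UTF-8 fallback is used.
encoded = encode_for(RedirectedStdout(), mystring)
```

In the Python 2 script from the question this would be used as print encode_for(sys.stdout, mystring), after import sys.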

In Python 3, your code should work as is, except that you need to change print mystring to print(mystring).
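In Python 3, sys.stdout is a text layer that encodes to bytes on write, and when output is redirected the encoding is taken from the locale, so this assumes a UTF-8 locale. The sketch below imitates that setup with an in-memory buffer instead of a real redirect (fake_stdout is a stand-in, not the real sys.stdout), and is Python 3 only:

```python
import io

text = "this is unicode string: カラダーズ ソフィー"

# A bytes buffer wrapped in a text layer with an explicit encoding,
# standing in for a redirected sys.stdout under a UTF-8 locale.
raw = io.BytesIO()
fake_stdout = io.TextIOWrapper(raw, encoding="utf-8")

# print() hands over text; the wrapper encodes it to UTF-8 bytes.
print(text, file=fake_stdout)
fake_stdout.flush()

written = raw.getvalue()
```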

edited Jul 2, 2013 at 9:58

answered Jul 2, 2013 at 7:06

Lennart Regebro


5 Comments

Cesc Over a year ago

Yes, it is Python 2. I just checked, sorry for that. And you are right: if I do print mystring.encode('utf-8'), then I have no problem redirecting. Thanks for the accurate explanation.

Martijn Pieters Over a year ago

You can force Python to use a different encoding when using pipes with the PYTHONIOENCODING environment variable as well.
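A sketch of how that could be checked: start a child interpreter with PYTHONIOENCODING set and its output going to a pipe, then ask it what encoding it chose for stdout. child_stdout_encoding is a hypothetical helper, and the sketch assumes sys.executable points at a usable interpreter.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child Python with PYTHONIOENCODING set; return its stdout encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    # Normalize spelling variants such as "UTF-8" or "utf8".
    return codecs.lookup(out.decode("ascii").strip()).name


# The child's stdout is a pipe here, not a terminal.
reported = child_stdout_encoding("utf-8")
```

The variable can be set the same way in the shell, in front of the python command that is being redirected.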

AlbertFerras Over a year ago

you said 'decode the string with utf-8 before printing it' but it should be 'encode the unicode to a utf8 encoded str before printing it' (mystring.encode('utf-8')).

jfs Over a year ago

Python has a way of knowing the encoding on Linux e.g., locale.getpreferredencoding(True)
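A small sketch of that call, wrapped so that an unknown or empty codec name falls back to UTF-8. locale_encoding and its fallback are assumptions for illustration; the result depends on the environment (for example LANG and LC_ALL).

```python
import codecs
import locale


def locale_encoding(fallback="utf-8"):
    """Return the normalized name of the locale's preferred encoding."""
    # Passing True lets Python apply the user's locale settings before asking.
    name = locale.getpreferredencoding(True) or fallback
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Python has no codec under that name; use the fallback instead.
        return fallback


preferred = locale_encoding()
```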

Albert Hendriks Over a year ago

Lennart's answer set me on the right track to find the best solution for me: import sys; sys.stdout.write(mystring.encode('utf-8')) instead of print. It works in both cases, and I found it here: stackoverflow.com/a/492711/838494

Ignacio Vazquez-Abrams · Answer · 2013-07-02 06:49:46Z

2

If it outputs to the terminal then Python can examine the value of $LANG to pick a charset. All bets are off if you redirect.

answered Jul 2, 2013 at 6:49

Ignacio Vazquez-Abrams
