In JavaScript, encoding means converting a string that contains special characters into an escaped form that is safe to use. For example, encodeURIComponent converts spaces to %20 so that the string can be used in a URI.

So encoding here means converting to a particular format.

In Python 2.7, I have a string: 奥多比. To convert it into UTF-8 format, however, I need to use the decode() function, like this: "奥多比".decode("utf-8") == u'\u5965\u591a\u6bd4'

I want to understand why the meaning of encode and decode changes between languages. To me, it seems I should be calling "奥多比".encode("utf-8") instead.

What am I missing here?

  • You convert from UTF-8 to a Unicode object. Commented Jan 8, 2018 at 13:12
  • Your console or terminal is set to UTF-8, so typing in "奥多比" sends UTF-8 bytes to the Python interactive interpreter process. Decoding then creates a Unicode object from the UTF-8 bytes. Commented Jan 8, 2018 at 13:12
  • @MartijnPieters: So when this is part of a script and I write str = "奥多比." and then str.decode("utf-8"), does that mean str is essentially UTF-8 already? However, when I append it to the URL of an API call, it is sent as "奥多比." only and not in the encoded format. Commented Jan 8, 2018 at 13:16
  • So are you really asking how to send UTF-8 bytes in a URL? Commented Jan 8, 2018 at 13:19
  • URLs are not UTF-8 encoded. They are percent encoded, often using UTF-8 as a starting point. In Python 2, use import urllib, then urllib.quote() to create URL percent-encoded data. Start with UTF-8 bytes. Commented Jan 8, 2018 at 13:21
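
The last comment can be sketched as follows. Note that this is a Python 3 sketch, not the Python 2 code from the thread: in Python 3 the function lives at urllib.parse.quote rather than urllib.quote, and the variable names are only illustrative.

```python
from urllib.parse import quote, unquote

text = "奥多比"

# Step 1: encode the Unicode text to UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# Step 2: percent-encode those bytes so they are safe inside a URL.
encoded = quote(utf8_bytes)
print(encoded)  # %E5%A5%A5%E5%A4%9A%E6%AF%94

# Reversing the process: unquote() assumes UTF-8 by default.
round_trip = unquote(encoded)
print(round_trip == text)  # True
```

This is roughly what encodeURIComponent does in JavaScript: UTF-8 first, percent-escapes second.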

2 Answers

You appear to be confusing Unicode text (represented in Python 2 as the unicode type, indicated by the u prefix on the literal syntax), with one of the standard Unicode encodings, UTF-8.

You are not creating UTF-8; you created a Unicode text object by decoding from a UTF-8 byte stream.

The byte string literal "奥多比" is a sequence of binary data, bytes. You either entered these in a text editor and saved the file as UTF-8 (and told Python to treat your source code as UTF-8 by starting the file with a PEP 263 codec header), or you typed it into the Python interactive prompt in a terminal that was configured to send UTF-8 data.
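
To make the direction of each operation concrete, here is a minimal sketch written for Python 3, where the two types are spelled bytes and str instead of Python 2's str and unicode. The variable names are illustrative only.

```python
# The nine UTF-8 bytes that a UTF-8 terminal or source file holds for 奥多比.
raw = b"\xe5\xa5\xa5\xe5\xa4\x9a\xe6\xaf\x94"

# decode: bytes -> Unicode text (three code points).
text = raw.decode("utf-8")
print(len(raw), len(text))  # 9 3
print(text == "\u5965\u591a\u6bd4")  # True

# encode: Unicode text -> bytes in a chosen encoding.
back = text.encode("utf-8")
print(back == raw)  # True
```

In Python 2 the literal "奥多比" already plays the role of raw here, which is why decode() is the call that produces the u'...' object.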

I strongly urge you to read more about the difference between bytes, codecs and Unicode text. The following links are highly recommended:

3 Comments

You mention that "奥多比" is a byte literal. I think they are Unicode characters, so there is no purpose in decoding them. Whereas if we encode them, we get a byte representation. The byte representation can be transferred over a network or written to a file, and the receiver can decode the bytes to get the original "奥多比" value. Have I got this concept right?
@variable: are you using Python 2? If not, then this is not something you need to worry about nearly as much. In any case, read the links I included, especially Ned Batchelder's. Try out the concepts in your interactive interpreter. Bytes are the lingua franca of data exchange; everything is bytes. Decoding to a text type (unicode in Python 2, str in Python 3) is turning bytes into a more useful object type, like using datetime.strptime() or int() or json.load().
I'm using Python 3 and have read that str in Python 3+ is Unicode.

In Python 2, "奥多比" is of type str, i.e. a sequence of bytes. To convert it to a Unicode string, you need to decode this sequence of bytes using a codec. Simply put, the codec specifies how the bytes should be converted to a sequence of Unicode code points. See the Unicode HOWTO for a more in-depth article on this.
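
As a rough illustration of why the codec matters, the sketch below (Python 3, with illustrative variable names) encodes the same three code points with two different codecs and then tries to decode with the wrong one.

```python
text = "\u5965\u591a\u6bd4"  # 奥多比 as Unicode code points

# The same text becomes different byte sequences under different codecs.
as_utf8 = text.encode("utf-8")       # 3 bytes per character here
as_utf16 = text.encode("utf-16-le")  # 2 bytes per character here
print(len(as_utf8), len(as_utf16))  # 9 6

# Decoding only works with the codec that produced the bytes.
print(as_utf16.decode("utf-16-le") == text)  # True

try:
    as_utf8.decode("ascii")
    wrong_codec_failed = False
except UnicodeDecodeError:
    # ASCII has no mapping for bytes above 0x7F.
    wrong_codec_failed = True
print(wrong_codec_failed)  # True
```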
