Why can't I convert unicode string to plain python string?

Question

url = u'/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A'

The decoded string is (through https://www.urldecoder.org/):

decoded_url = u'/wiki/Category:打磚塊'

In Python, I have the following code to do this conversion:

decoded_url = url.decode('utf-8')

This code doesn't change it at all. I also tried:

decoded_url = url.encode('utf-8')

The string remains the same. How do I convert it to the decoded string I want?

urllib.parse.unquote(u'/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A')
– furas, Commented Feb 26, 2021 at 19:28

CryptoFool · Accepted Answer · 2021-02-26 20:11:55Z

1

Here's Python 2.7 code that gives you the result you want from the original string in your question:

import urlparse

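url = u"/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A"
ascii_url = url.encode('ascii')        # unicode -> byte string (the escaped URL is pure ASCII)
decoded = urlparse.unquote(ascii_url)  # percent-decode into a UTF-8 byte string
print(decoded)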

Result:

/wiki/Category:打磚塊

It appears that, in Python 2.7, unquote does the wrong thing when given a unicode string: the escaped bytes are not decoded as UTF-8. You have to convert the unicode string to a byte string (str) first; unquote then does the right thing.
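The difference can be reproduced in Python 3 as a rough sketch. It assumes that the Python 2.7 behaviour on unicode input is equivalent to decoding each escaped byte as Latin-1, which matches the output reported in the comments on the other answer. The variable names are illustrative only.

```python
from urllib.parse import unquote_to_bytes

quoted = "/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A"

# Step 1: turn every %XX escape back into the raw byte it stands for.
raw = unquote_to_bytes(quoted)

# Step 2a: interpret those bytes as UTF-8 (what the question wants).
as_utf8 = raw.decode("utf-8")

# Step 2b: interpret each byte as its own Latin-1 character
# (assumed to mirror the broken Python 2.7 result on unicode input).
as_latin1 = raw.decode("latin-1")

print(as_utf8)         # readable title
print(len(as_utf8))    # 18 characters: 15 for the path prefix + 3 for the title
print(len(as_latin1))  # 24 characters: 15 for the path prefix + 9 for the title
```

The Latin-1 result is not lost data: encoding it back to Latin-1 and decoding as UTF-8 recovers the readable string.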

edited Feb 26, 2021 at 20:11

answered Feb 26, 2021 at 19:52

CryptoFool



3 Comments

furas Over a year ago

My original answer was for Python 3 because I didn't expect that someone would still be using Python 2 :)

marlon Over a year ago

@furas You need to put a 'u' before the string. Mine is a unicode string. In that case, your code gives broken characters in Python 2.

CryptoFool Over a year ago

Sorry. I got the 'u' issue mixed up in my first answer. Should be what you want now.

furas · Accepted Answer · 2021-02-26 20:12:25Z

0

This is not a UTF-8 encoding problem; it is URL escaping (also called URL quoting or percent-encoding).

import urllib.parse

print(urllib.parse.unquote(u'/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A'))

Result

/wiki/Category:打磚塊

Python 3.x doc: urllib.parse
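To illustrate why encode() and decode() leave the string unchanged, here is a small Python 3 sketch. The escaped URL contains only ASCII characters, so changing its encoding has nothing to act on; only quote() and unquote() translate between the characters and the %XX escapes. The variable names are made up for the example.

```python
from urllib.parse import quote, unquote

title = "打磚塊"

# quote() encodes the text as UTF-8 and writes each byte as %XX.
escaped = quote(title)
print(escaped)

# unquote() reverses that step.
print(unquote(escaped) == title)

# The escaped URL is plain ASCII, so an encode/decode round trip
# returns exactly the same characters, escapes included.
url = "/wiki/Category:" + escaped
print(url.encode("utf-8").decode("utf-8") == url)
```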

EDIT:

Python 2.7 has it in the module urlparse:

import urlparse

print(urlparse.unquote(u'/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A'))

Python 2.7 doc: urlparse

EDIT:

After testing with Python 2.7: it needs encode() before unquote(), so that unquote() works with a str (byte string) instead of unicode.

# -*- coding: utf-8 -*-
import urlparse
 
url = u'/wiki/Category:%e6%89%93%E7%A3%9A%E5%A1%8A'
url = url.encode('utf-8')    # convert `unicode` to `str`
url = urlparse.unquote(url)  # convert `%e6%89%93%E7%A3%9A%E5%A1%8A` to `打磚塊`

print url
print type(url)
print '打磚塊' in url

Result

/wiki/Category:打磚塊
<type 'str'>
True

BTW: the same in Python 3, which doesn't need encode():

import urllib.parse
 
url = u'/wiki/Category:%e6%89%93%E7%A3%9A%E5%A1%8A'
url = urllib.parse.unquote(url)  # convert `%e6%89%93%E7%A3%9A%E5%A1%8A` to `打磚塊`

print(url)
print(type(url))
print('打磚塊' in url)

Result:

/wiki/Category:打磚塊
<class 'str'>
True
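If the same code has to run on both versions, the two variants above could be wrapped in one function. This is only a sketch: decode_url is a hypothetical helper name, and unlike the Python 2.7 example above, the Python 2 branch returns unicode rather than str, which is an assumption about what the caller wants. Only the Python 3 branch is exercised when run under Python 3.

```python
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    from urllib.parse import unquote
else:
    from urlparse import unquote


def decode_url(url):
    """Return the percent-decoded URL as text (hypothetical helper)."""
    if PY3:
        # Python 3: unquote() decodes the escaped bytes as UTF-8 by default.
        return unquote(url)
    # Python 2: unquote a byte string, then decode the UTF-8 bytes to unicode.
    return unquote(url.encode('ascii')).decode('utf-8')


print(decode_url(u'/wiki/Category:%e6%89%93%E7%A3%9A%E5%A1%8A'))
```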

edited Feb 26, 2021 at 20:12

answered Feb 26, 2021 at 19:30

furas


12 Comments

marlon Over a year ago

I am using python 2.7. It doesn't have urllib.parse?

furas Over a year ago

Python 2.7 also has the modules urllib and urlparse, and the quoting functions are somewhere in urlparse.

furas Over a year ago

I added an example for Python 2.7.

marlon Over a year ago

But it gave this "u'/wiki/Category:\xe6\x89\x93\xe7\xa3\x9a\xe5\xa1\x8a'", not the readable characters

furas Over a year ago

You get the correct string, but you have to use print() to display it correctly. If you don't use print(), then it uses repr(), which displays UTF-8 codes instead of characters (for debugging).
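One way to check whether an escaped display is only a repr() effect or a genuinely mis-decoded string is to look at the length and at the kind of escapes shown. The following Python 3 sketch uses ascii(), which escapes every non-ASCII character in the way Python 2's repr() does for unicode strings. The names good and bad are illustrative.

```python
good = "打磚塊"

# Simulate a mis-decoded string: the UTF-8 bytes read as Latin-1 characters.
bad = good.encode("utf-8").decode("latin-1")

# A correctly decoded title has 3 characters and shows \uXXXX escapes.
print(len(good), ascii(good))

# A mis-decoded title has 9 characters and shows \xXX escapes,
# like the output quoted in the comment above.
print(len(bad), ascii(bad))
```

If the escaped form shows one \xXX per byte inside a unicode string, print() alone will not produce the readable characters; the string needs to be decoded properly first.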
