url = u'/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A'

The decoded string (obtained from https://www.urldecoder.org/) is:

decoded_url = u'/wiki/Category:打磚塊'

In Python, I tried the following code to do this conversion:

decoded_url = url.decode('utf-8')

This code doesn't change it at all. I also tried:

decoded_url = url.encode('utf-8')

The string remains the same. How can I convert it to the decoded string I want?

  • That's not plain UTF-8; it's URL encoding (percent-encoding), where each %XX stands for one byte of the UTF-8 encoded text. Commented Feb 26, 2021 at 19:24
  • urllib.parse.unquote(u'/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A') Commented Feb 26, 2021 at 19:28

2 Answers


Here's Python 2.7 code that gives you the result you want from the original string in your question:

import urlparse

url = u"/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A"
url_bytes = url.encode("ascii")        # unicode -> byte string (the URL is pure ASCII)
decoded = urlparse.unquote(url_bytes)  # %XX escapes -> raw UTF-8 bytes
print(decoded)

Result:

/wiki/Category:打磚塊

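In Python 2.7, unquote does the wrong thing when given a unicode string: it turns each %XX escape into a separate character instead of treating the escapes as UTF-8 bytes, so the result is garbled. You have to convert the string to a byte string (str) first; unquote then returns the UTF-8 bytes, which print as the expected characters.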
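To illustrate the failure described above, here is a minimal Python 3 sketch (not the answer's Python 2.7 code). It assumes that passing encoding='latin-1' to urllib.parse.unquote is a fair stand-in for what Python 2 did with unicode input; the variable names garbled and repaired are made up for the example.

```python
from urllib.parse import unquote

url = '/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A'

# Treat each %XX byte as its own Latin-1 character, which is
# effectively what Python 2's unquote did for unicode input.
garbled = unquote(url, encoding='latin-1')

# Undo the damage: recover the raw bytes, then decode them as UTF-8.
repaired = garbled.encode('latin-1').decode('utf-8')

print(repr(garbled))
print(repaired)
```

The same encode('latin-1').decode('utf-8') step can rescue a string that has already been unquoted the wrong way.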


3 Comments

My original answer was for Python 3 because I didn't expect anyone to still be using Python 2 :)
@furas You need to put a 'u' before the string. Mine is a unicode string. In that case, your code gives broken characters in Python 2.
Sorry. I got the 'u' issue mixed up in my first answer. Should be what you want now.
0

This is not a UTF-8 encoding problem but URL escaping (also called URL quoting or percent-encoding). Use unquote() to undo it:

import urllib.parse

print( urllib.parse.unquote( u'/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A') )

Result

/wiki/Category:打磚塊

Python 3.x doc: urllib.parse
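As a rough Python 3 sketch of what unquote() does under the hood, the steps can be split up: the URL itself is plain ASCII (which is why encode() and decode() on it change nothing), each %XX escape stands for one byte, and those bytes are UTF-8. The variable names raw, text and round_trip are illustrative only.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

url = '/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A'

# The input is plain ASCII, so a UTF-8 encode/decode round trip
# leaves its content unchanged.
assert url.encode('utf-8').decode('utf-8') == url

# Each %XX escape stands for one raw byte.
raw = unquote_to_bytes(url)

# Decoding those bytes as UTF-8 yields the readable text.
text = raw.decode('utf-8')
assert text == unquote(url)

# Quoting the text again reproduces the original URL; ':' is added to
# `safe` because quote() would otherwise escape it as %3A.
round_trip = quote(text, safe='/:')

print(text)
print(round_trip)
```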


EDIT:

Python 2.7 has it in the urlparse module:

import urlparse

print( urlparse.unquote(u'/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A') )

Python 2.7 doc: urlparse


EDIT:

After testing with Python 2.7: the string needs encode() before unquote(), so that unquote() works on a str (byte string) instead of unicode:

# -*- coding: utf-8 -*-
import urlparse
 
url = u'/wiki/Category:%e6%89%93%E7%A3%9A%E5%A1%8A'
url = url.encode('utf-8')    # convert `unicode` to `str`
url = urlparse.unquote(url)  # convert `%e6%89%93%E7%A3%9A%E5%A1%8A` to `打磚塊`

print url
print type(url)
print '打磚塊' in url

Result

/wiki/Category:打磚塊
<type 'str'>
True

BTW: The same in Python 3, which doesn't need encode() because unquote() takes a str and decodes the escapes as UTF-8 by default:

import urllib.parse
 
url = u'/wiki/Category:%e6%89%93%E7%A3%9A%E5%A1%8A'
url = urllib.parse.unquote(url)  # convert `%e6%89%93%E7%A3%9A%E5%A1%8A` to `打磚塊`

print(url)
print(type(url))
print('打磚塊' in url)

Result:

/wiki/Category:打磚塊
<class 'str'>
True
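If the same script has to run under both versions, the two variants above could be folded into one function. This is only a sketch: unquote_text is a hypothetical helper name, and the Python 2 branch assumes the escaped bytes are UTF-8.

```python
import sys

if sys.version_info[0] >= 3:
    from urllib.parse import unquote
else:
    from urlparse import unquote


def unquote_text(url):
    """Return url with %XX escapes decoded as UTF-8 text (hypothetical helper)."""
    if sys.version_info[0] >= 3:
        # Python 3: unquote() decodes the escaped bytes as UTF-8 by default.
        return unquote(url)
    # Python 2: pass a byte string to unquote(), then decode the result.
    if not isinstance(url, bytes):
        url = url.encode('utf-8')
    return unquote(url).decode('utf-8')


decoded = unquote_text(u'/wiki/Category:%e6%89%93%E7%A3%9A%E5%A1%8A')
print(decoded)
```

Unlike the Python 2.7 example above, this returns text (unicode) rather than a byte string on Python 2, so the return type is the same kind of object on both versions.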

12 Comments

I am using Python 2.7. Doesn't it have urllib.parse?
Python 2.7 also has the modules urllib and urlparse, and the quoting functions are somewhere in urlparse.
I added an example for Python 2.7.
But it gave this: u'/wiki/Category:\xe6\x89\x93\xe7\xa3\x9a\xe5\xa1\x8a', not the readable characters.
You get the correct string, but you have to use print() to display it correctly. Without print(), the interpreter uses repr(), which shows escape codes instead of the characters (for debugging).