string.translate() with unicode data in python

Question

I have three APIs that return JSON data into three dictionary variables. I take some of the values from the dictionaries and process them, reading the specific values I want into the list valuelist. One of the steps is to remove the punctuation from them. I normally use string.translate(None, string.punctuation) for this, but because the dictionary data is unicode I get the error:

    wordlist = [s.translate(None, string.punctuation) for s in valuelist]
    TypeError: translate() takes exactly one argument (2 given)

Is there a way around this? Either by encoding the unicode or a replacement for string.translate?

s.encode('utf-8').translate(None, string.punctuation) worked for me.
– Suzana, Commented Mar 18, 2014 at 20:22

joce · Accepted Answer · 2013-09-10 15:03:16Z

33

The translate method works differently on Unicode objects than on byte-string objects:

>>> help(unicode.translate)

S.translate(table) -> unicode

Return a copy of the string S, where all characters have been mapped
through the given translation table, which must be a mapping of
Unicode ordinals to Unicode ordinals, Unicode strings or None.
Unmapped characters are left untouched. Characters mapped to None
are deleted.

So your example would become:

import string

remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
word_list = [s.translate(remove_punctuation_map) for s in value_list]

Note however that string.punctuation only contains ASCII punctuation. Full Unicode has many more punctuation characters, but it all depends on your use case.
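
If non-ASCII punctuation matters, one option is to build the table from Unicode categories instead of string.punctuation. The sketch below is only an illustration and assumes Python 3 (on Python 2, unichr would take the place of chr); the names unicode_punctuation_map and strip_punctuation are made up for this example.

```python
import sys
import unicodedata

# Map every code point whose Unicode category starts with "P"
# (punctuation) to None, so that translate() deletes it.
unicode_punctuation_map = dict.fromkeys(
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)

def strip_punctuation(text):
    """Return text with all Unicode punctuation characters removed."""
    return text.translate(unicode_punctuation_map)

print(strip_punctuation(u"\u00abHello\u00bb, world\u2026 \u201cquoted\u201d!"))
```

Characters such as $ or + are classed as symbols (categories starting with "S") rather than punctuation, so this table leaves them in place even though string.punctuation lists them.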

edited Sep 10, 2013 at 15:03

joce

answered Jul 27, 2012 at 18:50

Simon Sapin

3 Comments

jfs Over a year ago

dict.fromkeys(map(ord, string.punctuation))

patrick Over a year ago

this is by far the best of all the answers on here. thanks.

Mike Honey Over a year ago

thanks! just to note, I think this implies import string

ncultra · Accepted Answer · 2012-07-27 17:14:31Z

6

I noticed that string.translate is deprecated. Since you are removing punctuation, not actually translating characters, you can use the re.sub function.

    >>> import re

    >>> s1="this.is a.string, with; (punctuation)."
    >>> s1
    'this.is a.string, with; (punctuation).'
    >>> re.sub(r"[.\t,:;()]", "", s1)
    'thisis astring with punctuation'
    >>>
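
To cover every character in string.punctuation rather than a hand-picked few, the character class can be generated. This is a rough sketch, not part of the original answer; the name punctuation_re is hypothetical.

```python
import re
import string

# re.escape() protects characters such as ] and \ that would
# otherwise be special inside a character class.
punctuation_re = re.compile(u"[" + re.escape(string.punctuation) + u"]")

s1 = u"this.is a.string, with; (punctuation)."
print(punctuation_re.sub(u"", s1))
```

Because the pattern is a plain string, the same compiled expression works on unicode input without any encoding step.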

answered Jul 27, 2012 at 17:14

ncultra

2 Comments

adohertyd Over a year ago

The translate function works great in Python 2.7 and is computationally faster than a regex. I may have no other option though. Thanks.

bheklilr Over a year ago

The module function string.translate is deprecated in favor of the method str.translate; the translate method (which the OP is using) is still usable.

madjardi · Accepted Answer · 2013-10-01 11:12:57Z

3

With this version you can map each letter to another letter:

def trans(to_translate):
    tabin = u'привет'
    tabout = u'тевирп'
    tabin = [ord(char) for char in tabin]
    translate_table = dict(zip(tabin, tabout))
    return to_translate.translate(translate_table)
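
A generalised form of the same idea, as a sketch under the assumption of Python 3 (on Python 2 the source file would also need an encoding declaration for the Cyrillic literals). The helper name make_table is hypothetical.

```python
def make_table(source_chars, target_chars):
    """Build a translate() table: ordinal of each source char -> target char."""
    return dict((ord(src), dst) for src, dst in zip(source_chars, target_chars))

table = make_table(u"привет", u"тевирп")

# Every character listed in the table is swapped for its counterpart.
print(u"привет".translate(table))
# Characters that are not in the table pass through unchanged.
print(u"привет, abc".translate(table))
```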

answered Oct 1, 2013 at 11:12

madjardi

sastanin · Accepted Answer · 2015-01-27 17:35:57Z

Python's re module accepts a function as the replacement argument; it takes a match object and returns a suitable replacement string. We can use this to build a custom character translation function:

import re

def mk_replacer(oldchars, newchars):
    """A function to build a replacement function"""
    mapping = dict(zip(oldchars, newchars))
    def replacer(match):
        """A replacement function to pass to re.sub()"""
        return mapping.get(match.group(0), "")
    return replacer

An example. Match all lower-case letters ([a-z]), translate 'h' and 'i' to 'H' and 'I' respectively, delete other matches:

>>> re.sub("[a-z]", mk_replacer("hi", "HI"), "hail")
'HI'

As you can see, it may be used with short (incomplete) replacement sets, and it may be used to delete some characters.

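A Unicode example (this relies on Python 2 behaviour, where \W without the re.UNICODE flag also matches non-ASCII letters):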

>>> re.sub("[\W]", mk_replacer(u'\u0435\u0438\u043f\u0440\u0442\u0432', u"EIPRTV"), u'\u043f\u0440\u0438\u0432\u0435\u0442')
u'PRIVET'
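
The pattern can also be derived from the characters being translated, so the result does not depend on how \W treats non-ASCII letters. A minimal sketch, assuming oldchars is non-empty; translate_with_re is a name invented for this example.

```python
import re

def translate_with_re(text, oldchars, newchars):
    """Replace each character of oldchars found in text with its counterpart.

    Characters of oldchars that have no counterpart in newchars are deleted.
    """
    mapping = dict(zip(oldchars, newchars))
    # Build a character class from the characters to translate;
    # re.escape() guards anything special inside [...].
    pattern = u"[" + re.escape(oldchars) + u"]"
    return re.sub(pattern, lambda m: mapping.get(m.group(0), u""), text)

print(translate_with_re(
    u"\u043f\u0440\u0438\u0432\u0435\u0442",
    u"\u0435\u0438\u043f\u0440\u0442\u0432",
    u"EIPRTV",
))
```

Unlike the "[a-z]" example, only the listed characters are matched here, so everything else in the text is left alone.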

Community · Accepted Answer · 2020-06-20 09:12:55Z

1

As I stumbled upon the same problem and Simon's answer was the one that helped me to solve my case, I thought of showing an easier example just for clarification:

from collections import defaultdict

And then for the translation, say you'd like to remove '@' and '\r' characters:

remove_chars_map = defaultdict()
# translate() looks keys up by Unicode ordinal, so use ord() for the keys
remove_chars_map[ord('@')] = None
remove_chars_map[ord('\r')] = None

new_string = old_string.translate(remove_chars_map)

And an example:

old_string = u"word1@\r word2@\r word3@\r"

new_string = "word1 word2 word3"

'@' and '\r' removed
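
One detail worth checking when building such a map: translate() looks keys up by ordinal, so the keys need to be integers from ord() rather than one-character strings. A small sketch of the difference (the variable names are illustrative only):

```python
old_string = u"word1@\r word2@\r word3@\r"

# Keys given as characters are never looked up, so nothing is removed.
char_keyed = {u"@": None, u"\r": None}
# Keys given as ordinals are what translate() actually consults.
ordinal_keyed = {ord(u"@"): None, ord(u"\r"): None}

print(repr(old_string.translate(char_keyed)))
print(repr(old_string.translate(ordinal_keyed)))
```

A plain dict is enough here; defaultdict adds nothing unless a default factory is supplied.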

edited Jun 20, 2020 at 9:12

CommunityBot

answered Dec 16, 2015 at 15:27

Ioannis Koumarelas
