24

I have three APIs that return JSON data, which I load into three dictionary variables. I take some of the values from these dictionaries for processing and collect the ones I want in the list valuelist. One of the processing steps is to remove the punctuation from them. I normally use string.translate(None, string.punctuation) for this, but because the dictionary data is unicode I get this error:

    wordlist = [s.translate(None, string.punctuation)for s in valuelist]
TypeError: translate() takes exactly one argument (2 given)

Is there a way around this, either by encoding the unicode or by using a replacement for string.translate?

3
  • 14
    s.encode('utf-8').translate(None, string.punctuation) worked for me. Commented Mar 18, 2014 at 20:22
  • 1
    @Suzana_K Thank you! This was the simplest solution for me. Commented Aug 6, 2015 at 14:05
  • related: Remove punctuation from Unicode formatted strings Commented Dec 17, 2015 at 1:59

5 Answers

33

The translate method works differently on Unicode objects than on byte-string objects:

>>> help(unicode.translate)

S.translate(table) -> unicode

Return a copy of the string S, where all characters have been mapped
through the given translation table, which must be a mapping of
Unicode ordinals to Unicode ordinals, Unicode strings or None.
Unmapped characters are left untouched. Characters mapped to None
are deleted.
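
To make the three kinds of mapping value concrete, here is a minimal sketch. The sample string and the name `table` are made up for illustration. The `u` prefixes keep it valid on Python 2; on Python 3 they are optional because every `str` is already Unicode.

```python
# Keys are always Unicode ordinals (integers); values may be an ordinal,
# a Unicode string, or None.
table = {
    ord(u"a"): ord(u"A"),  # ordinal -> ordinal: replace with one character
    ord(u"&"): u" and ",   # ordinal -> string: replace with several characters
    ord(u"!"): None,       # ordinal -> None: delete the character
}

# Characters that are not in the table ("s", "l", "t", ...) are left untouched.
result = u"salt&pepper!".translate(table)
```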

So your example would become:

remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
word_list = [s.translate(remove_punctuation_map) for s in value_list]

Note however that string.punctuation only contains ASCII punctuation. Full Unicode has many more punctuation characters, but it all depends on your use case.
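
If the wider Unicode punctuation set matters for your data, one possible approach is to build the deletion table from the Unicode general categories instead of from string.punctuation. The sketch below is written for Python 3 (it uses `chr`), and the names `unicode_punctuation_map` and `strip_unicode_punctuation` are hypothetical. It treats every code point whose category starts with "P" as punctuation.

```python
import sys
import unicodedata

# Map every code point in a punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po)
# to None so that translate() deletes it. Built once, reused for every string.
unicode_punctuation_map = dict.fromkeys(
    code_point
    for code_point in range(sys.maxunicode + 1)
    if unicodedata.category(chr(code_point)).startswith("P")
)


def strip_unicode_punctuation(text):
    """Return text with all Unicode punctuation characters removed."""
    return text.translate(unicode_punctuation_map)
```

Note that this is not a superset of string.punctuation: characters such as `$`, `+` and `^` belong to the symbol categories ("S..."), so they survive this table. Add them explicitly if you need them removed too.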

3 Comments

this is by far the best of all the answers on here. thanks.
thanks! just to note, I think this implies import string
6

I noticed that the module-level function string.translate is deprecated (the translate method on string objects is not). Since you are removing punctuation, not actually translating characters, you can also use the re.sub function.

    >>> import re

    >>> s1="this.is a.string, with; (punctuation)."
    >>> s1
    'this.is a.string, with; (punctuation).'
    >>> re.sub("[\.\t\,\:;\(\)\.]", "", s1, 0, 0)
    'thisis astring with punctuation'
    >>>
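
The character class above lists only a few punctuation marks by hand. As a sketch of how the same idea could cover all of string.punctuation, the class can be generated with re.escape. The names `punctuation_pattern` and `remove_ascii_punctuation` are made up for this example.

```python
import re
import string

# re.escape() backslash-escapes the characters that are special in a regex
# (such as "]", "\" and "^"), so the whole set can sit inside [...] safely.
punctuation_pattern = re.compile(u"[" + re.escape(string.punctuation) + u"]")


def remove_ascii_punctuation(text):
    """Delete every ASCII punctuation character from text."""
    return punctuation_pattern.sub(u"", text)
```

Compiling the pattern once and reusing it avoids rebuilding it for every string in a list.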

2 Comments

the translate function works great in python 2.7 and is computationally faster than REGEX. I may have no other option though. Thanks
The module function string.translate is deprecated in favor of the method str.translate; the translate method (which OP is using) is still usable.
3

This version maps one set of letters to another:

def trans(to_translate):
    tabin = u'привет'
    tabout = u'тевирп'
    tabin = [ord(char) for char in tabin]
    translate_table = dict(zip(tabin, tabout))
    return to_translate.translate(translate_table)
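
On Python 3, where every `str` is Unicode, the same kind of table can be built with the built-in static method `str.maketrans` instead of by hand. This is a sketch for Python 3 only (Python 2's `unicode` type has no `maketrans`); the variable names are made up for illustration.

```python
import string

# Two equal-length arguments: each character of the first maps to the
# character at the same position in the second.
swap_table = str.maketrans(u"привет", u"тевирп")

# Optional third argument: characters to delete (they are mapped to None).
delete_punctuation_table = str.maketrans(u"", u"", string.punctuation)

swapped = u"привет".translate(swap_table)
cleaned = u"this.is a.string, with; (punctuation).".translate(delete_punctuation_table)
```

Both tables are plain dictionaries keyed by Unicode ordinals, which is the form that translate() expects.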

1

Python's re module allows a function to be used as the replacement argument; it receives a Match object and returns a suitable replacement. We can use this to build a custom character translation function:

import re

def mk_replacer(oldchars, newchars):
    """A function to build a replacement function"""
    mapping = dict(zip(oldchars, newchars))
    def replacer(match):
        """A replacement function to pass to re.sub()"""
        return mapping.get(match.group(0), "")
    return replacer

An example. Match all lower-case letters ([a-z]), translate 'h' and 'i' to 'H' and 'I' respectively, delete other matches:

>>> re.sub("[a-z]", mk_replacer("hi", "HI"), "hail")
'HI'

As you can see, it may be used with short (incomplete) replacement sets, and it may be used to delete some characters.

A Unicode example:

>>> re.sub("[\W]", mk_replacer(u'\u0435\u0438\u043f\u0440\u0442\u0432', u"EIPRTV"), u'\u043f\u0440\u0438\u0432\u0435\u0442')
u'PRIVET'
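
The `[\W]` pattern above relies on `\W` being ASCII-only, which is the Python 2 default when re.UNICODE is not set. On Python 3, where str patterns are Unicode-aware by default, `\W` does not match Cyrillic letters and the string would come back unchanged. A variant that does not depend on that behaviour builds the character class from the characters being replaced. This is a sketch under that assumption; `translate_chars` is a hypothetical name, and unlike mk_replacer it only replaces characters and never deletes them.

```python
import re


def translate_chars(text, oldchars, newchars):
    """Replace each character of oldchars found in text with its counterpart
    at the same position in newchars. oldchars must not be empty."""
    mapping = dict(zip(oldchars, newchars))
    # Match exactly the characters we know how to replace, nothing else.
    pattern = u"[" + re.escape(oldchars) + u"]"
    return re.sub(pattern, lambda match: mapping[match.group(0)], text)
```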

1

As I stumbled upon the same problem and Simon's answer was the one that helped me to solve my case, I thought of showing an easier example just for clarification:

from collections import defaultdict

And then for the translation, say you'd like to remove '@' and '\r' characters:

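# translate() looks up each character by its Unicode ordinal, so the keys
# must be integers (ord(...)), not one-character strings.
remove_chars_map = defaultdict()
remove_chars_map[ord('@')] = None
remove_chars_map[ord('\r')] = None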

new_string = old_string.translate(remove_chars_map)

And an example:

old_string = "word1@\r word2@\r word3@\r"

new_string = "word1 word2 word3"

'@' and '\r' removed
