I have two lists:

wrong_chars = [
    ['أ','إ','ٱ','ٲ','ٳ','ٵ'],
    ['ٮ','ݕ','ݖ','ﭒ','ﭓ','ﭔ'],
    ['ڀ','ݐ','ݔ','ﭖ','ﭗ','ﭘ'],
    ['ٹ','ٺ','ٻ','ټ','ݓ','ﭞ'],
]

true_chars = [
    ['ا'],
    ['ب'],
    ['پ'],
    ['ت'],
]

For a given string I want to replace the entries in wrong_chars with those in true_chars. Is there a clean way to do that in python?

4 Answers

8

string module to the rescue!

There's a really handy function called translate, available both in the string module and as a string method, that does exactly what you're looking for, though for unicode strings you'll have to pass in your translation mapping as a dictionary.

The documentation is here

An example based on a tutorial from tutorialspoint is shown below:

>>> from string import maketrans  # Python 2 only; on Python 3 use str.maketrans
>>> trantab = maketrans("aeiou", "12345")
>>> "this is string example....wow!!!".translate(trantab)
'th3s 3s str3ng 2x1mpl2....w4w!!!'

It looks like you're using unicode here though, which works slightly differently. You can look at this question to get a sense, but here's an example that should work for you more specifically:

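# Map the code point of each wrong character to its replacement string.
translation_dict = {}
for i, char_list in enumerate(wrong_chars):
    for char in char_list:
        # true_chars[i] is a one-element list, so take the string inside it
        translation_dict[ord(char)] = true_chars[i][0]

# example must be a unicode string (u"..." on Python 2, any str on Python 3)
example.translate(translation_dict)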
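
On Python 3, the same kind of table could also be built with str.maketrans, which accepts a dictionary whose keys are single characters. The following is only a minimal sketch: it uses a shortened version of the lists from the question, writes the characters as escape sequences so that each key is guaranteed to be one code point, and the helper name normalize is made up for illustration.

```python
# Shortened, hypothetical subset of the question's data, written as escapes
# so that every entry is exactly one code point.
wrong_chars = [
    ["\u0623", "\u0625", "\u0671"],  # alef variants
    ["\u066E"],                      # dotless beh
]
true_chars = [
    ["\u0627"],  # alef
    ["\u0628"],  # beh
]

# Flatten both lists into {wrong character: true character}.
mapping = {
    wrong: true_group[0]
    for wrong_group, true_group in zip(wrong_chars, true_chars)
    for wrong in wrong_group
}

# str.maketrans converts the single-character keys to code points.
table = str.maketrans(mapping)


def normalize(text):
    """Return text with every wrong character replaced by its true one."""
    return text.translate(table)


print(normalize("\u0623\u0628\u066E"))
```

Characters that are not in the table pass through unchanged, so the table only needs to list the wrong forms.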

3 Comments

Thanks for the good answer, but I have another question. I changed your code to translation_dict[ord(char.decode('utf-8'))] = true_chars[i]. Is this right? I get the error "expected a character buffer object" on this line.
@chalist you shouldn't have to decode the character to get the ord. Have you tried on the raw unicode object?
Note that the string module does not contain the maketrans function in Python 3; it is only available there in Python 2. Anyone who wants to use maketrans in Python 3 needs to call it on str: str.maketrans(...)
2

I merged your wrong and true chars into a single list of dictionaries, each holding the wrong characters and the character that should replace them. Here you are:
Link to a working sample: http://ideone.com/mz7E0R
And the code itself:

given_string = "ayznobcyn"
correction_list = [
    {"wrongs": ['x', 'y', 'z'], "true": 'x'},
    {"wrongs": ['m', 'n', 'o'], "true": 'm'},
    {"wrongs": ['q', 'r', 's', 't'], "true": 'q'},
]

processed_string = ""

for s in given_string:
    true_char = s  # keep the character unless a correction matches
    for correction in correction_list:
        if s in correction['wrongs']:
            true_char = correction['true']
            break
    processed_string += true_char

print(given_string)      # ayznobcyn
print(processed_string)  # axxmmbcxm

This code could be optimized further, and I did not deal with any unicode issues. Since you appear to be working with Farsi text, you should take care of that yourself.
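
One possible optimization, sketched here under the assumption that every wrong entry is a single character, is to flatten the corrections into one dictionary up front so that each character costs a single lookup instead of a scan over every list. The names lookup and correct are made up for this sketch.

```python
correction_list = [
    {"wrongs": ['x', 'y', 'z'], "true": 'x'},
    {"wrongs": ['m', 'n', 'o'], "true": 'm'},
    {"wrongs": ['q', 'r', 's', 't'], "true": 'q'},
]

# Build {wrong character: true character} once.
lookup = {
    wrong: correction["true"]
    for correction in correction_list
    for wrong in correction["wrongs"]
}


def correct(text):
    """Replace each character found in lookup; keep all others unchanged."""
    return "".join(lookup.get(ch, ch) for ch in text)


print(correct("ayznobcyn"))  # axxmmbcxm
```

Joining the pieces at the end also avoids repeatedly growing a string with +=.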

1
#!/usr/bin/env python
from __future__ import unicode_literals

wrong_chars = [
    ['1', '2', '3'],
    ['4', '5', '6'],
    ['7'],
]
true_chars = 'abc'

table = {}
for keys, value in zip(wrong_chars, true_chars):
    table.update(dict.fromkeys(map(ord, keys), value))
print("123456789".translate(table))

Output

aaabbbc89

3 Comments

@chalist: the code works as is on Python 2 and 3. Do you have from __future__ import unicode_literals at the top in your code?
@chalist: here's a live example that demonstrates that it works. Update your question to include the complete (but minimal) code example with the full traceback, if any.
@chalist: a single user-perceived character may span several Unicode codepoints. (I've used 'abc' as a shortcut for ['a', 'b', 'c']). Use a list, to see the character boundaries: ideone.com/cweBU9 If a "wrong character" contains more than one Unicode codepoint then you could use text.replace(multiple_codepoints, true_char) or re.sub("|".join(map(re.escape, ['1', '2', '3'])), 'a', text)
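
The re.sub approach from the last comment might look roughly like the sketch below. The replacements dictionary is hypothetical sample data, and fix_text is a made-up helper name; the point is only that keys may span more than one code point, which a translate table cannot express.

```python
import re

# Hypothetical sample data: a key may be longer than one code point.
replacements = {
    "\u0628\u0650": "\u0628",  # beh + kasra (two code points) -> beh
    "\u0627\u0653": "\u0627",  # alef + maddah above (two code points) -> alef
    "\u0623": "\u0627",        # alef with hamza above -> alef
}

# Try longer sequences first so they are not shadowed by shorter keys.
keys = sorted(replacements, key=len, reverse=True)
pattern = re.compile("|".join(map(re.escape, keys)))


def fix_text(text):
    """Replace every matched sequence with its true character."""
    return pattern.sub(lambda match: replacements[match.group(0)], text)


print(fix_text("\u0628\u0650\u0623"))
```

For a very large table made up only of single code points, translate is likely the simpler choice; the regular expression is mainly useful for the multi-codepoint entries.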
0

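My idea is to use a single structure that also contains the true characters, like this:

new_chars = [["ا", "أ", "إ", "آ"], ["ب", "بِ", "بِ"]]
# Put the true character first in each inner list and collect the lists in one outer list, then:
def fix_char(ch):
    for group in new_chars:
        if ch in group:
            return group[0]  # the true character
    return ch  # not listed, so leave it unchanged

print(fix_char("إ"))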

1 Comment

Thanks, but the list is very, very big. Each row sometimes has over 100 characters.
