I have a dataset that contains dirty data caused by broken encoding.

Example:

column_header,    other_column
Kol^u00edn,       ...
^u00d8lstykke,    ...
Aalborg S^u00d8,  ...

I used pandas to import the data (read_csv) and replaced the "^" with "\" so that the text matches the way Python writes Unicode escapes:

df["column_header"].apply(lambda x: str(x).replace("^", "\\"))

which returns when printed:

0          Kol\u00edn
1       \u00d8lstykke
2     Aalborg S\u00d8

But what I need is not the Python escape \u00ed but the actual Unicode character í.

If I manually print("Kol\u00edn") I get Kolín, but the same does not work on the strings in my dataframe.

How can I transform the strings in the dataframe so that they contain the actual character and not the \u... representation?

Any help is much appreciated!

Edit: Might be helpful:

print("Kol\u00edn".encode()) # returns b'Kol\xc3\xadn'
print(df["column_header"][0].encode()) # returns b'Kol\\u00edn'
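
For illustration, a minimal sketch of why the two calls above differ. The variable names are made up for this example; the point is only that an escape typed in source code is resolved by the Python parser, while text read from a file is never parsed that way.

```python
# In source code, "\u00ed" is an escape sequence that the Python parser
# turns into the single character í before the program runs.
literal = "Kol\u00edn"

# Data read from a file never passes through the parser, so after the
# "^" -> "\" replacement the string holds six separate characters:
# a backslash, "u", "0", "0", "e", "d".
from_file = "Kol^u00edn".replace("^", "\\")

print(len(literal), repr(literal))      # 5 'Kolín'
print(len(from_file), repr(from_file))  # 10 'Kol\\u00edn'
```

So the replacement only changes which character precedes the "u"; the escape still has to be decoded explicitly.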
  • What encoding did you use while loading the file? Did you try iso-8859-1? Commented Aug 11, 2021 at 17:32
  • I did not specify any encoding when loading the data. The raw dataset is actually a dirty mixture of multiple combined datasets that had different encodings. E.g. I also have some strings that contain HTML entities like "&lt", but I wanted to clean up the \u... stuff first. I tried both iso-8859-1 and utf-8 encoding when loading the data, with the same result. Commented Aug 11, 2021 at 17:37

1 Answer

I actually found the answer by myself with the help of [this answer][1]:

import codecs
df["column_header"] = df["column_header"].apply(
    lambda x: codecs.unicode_escape_decode(x.replace("^", "\\"))[0]
)

# 1. Replace every ^ with \ so the text contains literal \uXXXX escape sequences.
# 2. Decode those escape sequences with the codecs module:
#    the 10-character string Kol\u00edn becomes the 5-character string Kolín.

Maybe someone else has a similar problem and this helps.
[1]: https://stackoverflow.com/a/57660758/9145756
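
As a possible alternative, here is a minimal sketch that decodes only the ^uXXXX sequences with a regular expression and leaves everything else in the string untouched. This matters for mixed data, because unicode_escape also interprets any other backslash sequence it finds (a literal \n in the text becomes a newline), and as far as I can tell it can garble non-ASCII characters that are already correct. The names CARET_ESCAPE and decode_caret_escapes are made up for this example, and the sketch assumes every escape is exactly four hex digits.

```python
import re

# Matches "^u" followed by exactly four hex digits, e.g. "^u00ed".
CARET_ESCAPE = re.compile(r"\^u([0-9a-fA-F]{4})")


def decode_caret_escapes(value):
    """Replace each ^uXXXX sequence with the character it encodes."""
    return CARET_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), str(value))


samples = ["Kol^u00edn", "^u00d8lstykke", "Aalborg S^u00d8"]
cleaned = [decode_caret_escapes(s) for s in samples]
print(cleaned)  # ['Kolín', 'Ølstykke', 'Aalborg SØ']
```

On the dataframe this could be applied with df["column_header"] = df["column_header"].apply(decode_caret_escapes). Note that str(value) turns missing values into the text "nan", so those may need to be handled separately.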
