encoding in strings in dataframe not recognized in Python

Question

I have a dataset that contains dirty data in the form of wrongly encoded characters:

Example:

column_header,    other_column
Kol^u00edn,       ...
^u00d8lstykke,    ...
Aalborg S^u00d8,  ...

I used pandas to import the data (read_csv) and replaced each "^" with "\" so that the text matches the way Python writes a Unicode escape:

df["column_header"].apply(lambda x: str(x).replace("^", "\\"))

which returns when printed:

0          Kol\u00edn
1       \u00d8lstykke
2     Aalborg S\u00d8

But what I need is not the Python escape \u00ed but the actual Unicode character í.

If I manually print("Kol\u00edn") I get Kolín, but the same thing does not work on the strings in my dataframe.

How can I transform the strings in the dataframe so that they contain the actual characters and not the \u... representation?

Any help is much appreciated!

Edit: Might be helpful:

print("Kol\u00edn".encode()) # returns b'Kol\xc3\xadn'
print(df["column_header"][0].encode()) # returns b'Kol\\u00edn'

What encoding did you use while loading the file? Was it iso-8859-1?
– Naga kiran, Commented Aug 11, 2021 at 17:32
I did not specify any encoding when loading the data. The raw dataset is actually a dirty mixture of several combined datasets that used different encodings. E.g. I also have some strings that contain HTML entities like "&lt", but I wanted to clean up the \u... stuff first. I tried both iso and utf-8 encoding when loading the data and got the same problem.
– mayool, Commented Aug 11, 2021 at 17:37

mayool · Accepted Answer · 2021-08-11 18:06:16Z

I actually found the answer myself with the help of [this answer][1]:

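import codecs

# 1. Replace every "^" with "\" so that "^u00ed" becomes the escape text "\u00ed".
# 2. Decode that escape text with the codecs library; [0] picks the decoded
#    string out of the returned (string, length) tuple.
# Backslash + "u00ed" stored as six plain characters --> the single character "í"
df["column_header"] = df["column_header"].apply(
    lambda x: codecs.unicode_escape_decode(x.replace("^", "\\"))[0]
)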
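
To illustrate why the decode step is needed, here is a minimal sketch on plain strings. The helper name decode_caret_escapes and the sample value are only assumptions for illustration; the codecs call is the same one used above. A literal typed into source code is converted by Python when the code is parsed, while a string read from a file keeps the backslash and the letters as separate characters until something decodes them.

```python
import codecs


def decode_caret_escapes(value):
    """Turn caret-style escapes such as ^u00ed into the characters they name."""
    # Hypothetical helper: swap "^" for "\" and let the codec interpret the escape.
    escaped = str(value).replace("^", "\\")
    return codecs.unicode_escape_decode(escaped)[0]


# Typed in source code: Python converts \u00ed to one character while parsing.
typed_literal = "Kol\u00edn"

# Read from a file and patched with replace(): the backslash, "u" and the four
# hex digits are still six separate characters.
from_file = "Kol^u00edn".replace("^", "\\")

print(len(typed_literal))  # 5
print(len(from_file))      # 10
print(decode_caret_escapes("Kol^u00edn") == typed_literal)  # True
```

With a dataframe, the same helper could be passed to apply, e.g. df["column_header"] = df["column_header"].apply(decode_caret_escapes).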

Maybe someone else has a similar problem and this helps.
[1]: https://stackoverflow.com/a/57660758/9145756
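
One caveat, since the dataset is described as a mixture of encodings: the unicode_escape codec decodes its input as Latin-1, so values that already contain real non-ASCII characters (or unrelated backslashes) can come out garbled. A possible alternative, sketched below under the assumption that the broken sequences always look like ^u followed by exactly four hex digits, is to replace only those sequences with a regular expression. The names CARET_ESCAPE and replace_caret_escapes and the sample list are hypothetical.

```python
import re

# Assumption: every broken sequence is "^u" followed by exactly four hex digits.
CARET_ESCAPE = re.compile(r"\^u([0-9a-fA-F]{4})")


def replace_caret_escapes(value):
    """Replace each ^uXXXX sequence with its character; leave all other text alone."""
    return CARET_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), str(value))


# Hypothetical sample: three dirty values, one value that is already clean,
# and one value with a backslash that must not be treated as an escape.
samples = ["Kol^u00edn", "^u00d8lstykke", "Aalborg S^u00d8", "Kolín", r"C:\temp"]
cleaned = [replace_caret_escapes(s) for s in samples]
print(cleaned)
```

On a dataframe this could be applied the same way as above, e.g. df["column_header"].apply(replace_caret_escapes). Values without a ^uXXXX sequence pass through unchanged.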

