I have a dataset that contains dirty data caused by a wrong encoding.
Example:
column_header, other_column
Kol^u00edn, ...
^u00d8lstykke, ...
Aalborg S^u00d8, ...
I used pandas to import the data (read_csv) and replaced the "^" with "\" so that each sequence matches Python's Unicode escape syntax:
df["column_header"].apply(lambda x: str(x).replace("^", "\\"))
which returns when printed:
0 Kol\u00edn
1 \u00d8lstykke
2 Aalborg S\u00d8
But what I need is not the Python escape sequence \u00ed but the actual Unicode character í.
If I manually print("Kol\u00edn") I get Kolín, but it does not work in my dataframe.
How can I transform the strings in the dataframe so that they contain the actual characters and not the \u... representation?
Any help is much appreciated!
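For reference, here is a minimal sketch of the kind of transformation being asked about. It assumes every escape in the data has the form ^u followed by exactly four hex digits, as in the three sample rows; the helper name decode_caret_escapes and the CARET_ESCAPE pattern are made up for illustration. It works on the original "^" form directly, so the intermediate replacement with a backslash is not needed.

```python
import re

# Assumed escape format: "^u" followed by exactly four hex digits.
CARET_ESCAPE = re.compile(r"\^u([0-9a-fA-F]{4})")

def decode_caret_escapes(text):
    """Replace each ^uXXXX sequence with the character that code point names."""
    # int(..., 16) parses the hex digits; chr() turns the code point into a character.
    return CARET_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), str(text))

# The three sample values from the question.
samples = ["Kol^u00edn", "^u00d8lstykke", "Aalborg S^u00d8"]
decoded = [decode_caret_escapes(s) for s in samples]
print(decoded)  # ['Kolín', 'Ølstykke', 'Aalborg SØ']
```

On the dataframe this would presumably be applied with df["column_header"] = df["column_header"].map(decode_caret_escapes), assigning the result back to the column. A regular expression is used here rather than the unicode_escape codec because it leaves any characters that are already correct untouched.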
Edit: Might be helpful:
print("Kol\u00edn".encode()) # returns b'Kol\xc3\xadn'
print(df["column_header"][0].encode()) # returns b'Kol\\u00edn'
Could the encoding be ISO-8859-1?
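The two byte strings in the edit above likely differ because "\u00ed" written in source code is resolved by the Python parser into a single character, while the value in the dataframe holds a literal backslash followed by the five characters u00ed. A small sketch of that difference, using only the first sample value (the variable names are illustrative):

```python
# Escape written in source code: the parser turns \u00ed into one character.
parsed = "Kol\u00edn"

# Escape built at runtime: the string holds a real backslash plus "u00ed".
replaced = "Kol^u00edn".replace("^", "\\")

print(len(parsed), len(replaced))  # 5 10
print(parsed.encode())             # b'Kol\xc3\xadn'
print(replaced.encode())           # b'Kol\\u00edn'
print(parsed == replaced)          # False
```

Since the escape is only interpreted when it appears in a string literal, replacing "^" with "\" at runtime cannot produce the character by itself; the code point has to be decoded explicitly.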