Python issues on character encoding

Question

I'm working on a program that needs to take two files, merge them, and write the union as a new file. The problem is that the output file contains sequences like \xf0 or, if I change some of the encodings, something like \u0028. The input files are encoded in UTF-8. How can I write characters like "è" or "ò" and "-" to the output file?

I have written this code:

import codecs
import pandas as pd
import numpy as np


goldstandard = "..\\files\\file1.csv"
tweets = "..\\files\\file2.csv"

with codecs.open(tweets, "r", encoding="utf8") as t:
    tFile = pd.read_csv(t, delimiter="\t",
                        names=['ID', 'Tweet'],
                        quoting=3)

IDs = tFile['ID']
tweets = tFile['Tweet']

dict = {}
for i in range(len(IDs)):
    dict[np.int64(IDs[i])] = [str(tweets[i])]


with codecs.open(goldstandard, "r", encoding="utf8") as gs:
    for line in gs:
        columns = line.split("\t")
        index = np.int64(columns[0])
        rowValue = dict[index]
        rowValue.append([columns[1], columns[2], columns[3], columns[5]])
        dict[index] = rowValue

import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)
f = codecs.open("out.csv", "w", "utf8")
f.write(ndic)
f.close()

and this is an example of the output:

   desired: Beyoncé
   obtained: Beyonc\xe9

Why are you pretty-printing? That produces representations, and string representations produce \xhh escape sequences (literally 4 characters, two of which are hex) for any non-printable or non-ASCII codepoint.
– Martijn Pieters, Commented Apr 28, 2016 at 19:31

Martijn Pieters · Accepted Answer · 2016-04-29 07:05:29Z


You are producing Python string literals, here:

import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)

Pretty-printing is useful for producing debugging output; objects are passed through repr() to make non-printable and non-ASCII characters easily distinguishable and reproducible:

>>> import pprint
>>> value = u'Beyonc\xe9'
>>> value
u'Beyonc\xe9'
>>> print value
Beyoncé
>>> pprint.pprint(value)
u'Beyonc\xe9'

The é character is in the Latin-1 range, outside of the ASCII range, so it is represented with syntax that produces the same value again when used in Python code.
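On Python 3, repr() no longer escapes printable non-ASCII characters, but the built-in ascii() produces the same kind of escaped literal. The following is a minimal sketch, assuming Python 3 (the session above is Python 2), showing that the escaped form is valid Python source that evaluates back to the original string:

```python
import ast

value = "Beyonc\xe9"             # the same string as "Beyoncé"
escaped = ascii(value)           # repr()-style literal, non-ASCII escaped

print(escaped)                   # 'Beyonc\xe9'
print(len(value), len(escaped))  # 7 12

# The escaped literal is valid Python source and yields the same string.
round_tripped = ast.literal_eval(escaped)
assert round_tripped == value
```

The escaped text is 12 characters long because the quotes and the four-character \xe9 sequence are part of the representation, not of the string itself.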

Don't use pprint if you want to write out actual string values to the output file. You'll have to do your own formatting in that case.
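One way to do that formatting by hand is to join the fields yourself and write them through a file object that encodes to UTF-8. This is a minimal sketch, assuming Python 3; the `rows` dictionary and the file name `out_example.tsv` are made up for illustration and stand in for the merged data:

```python
import io
import os
import tempfile

# Hypothetical merged data: tweet ID -> list of text fields.
rows = {
    "1": ["Beyonc\xe9", "positive"],
    "2": ["perch\xe8 no", "neutral"],
}

# Hypothetical output location in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "out_example.tsv")

# The file object encodes each string as UTF-8 on write; repr() is never involved.
with io.open(out_path, "w", encoding="utf-8", newline="") as outf:
    for key in sorted(rows):
        outf.write("\t".join([key] + rows[key]) + "\n")

# Read the file back to confirm the real characters were written.
with io.open(out_path, "r", encoding="utf-8", newline="") as inf:
    written = inf.read()
```

A plain join like this does not quote fields that themselves contain tabs or newlines, which is one reason to prefer the csv module for real data.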

Moreover, the pandas dataframe will hold bytestrings, not unicode objects, so you still have undecoded UTF-8 data at that point.

Personally, I wouldn't even bother using pandas here. You appear to want to write CSV data, so I've simplified your code to use the csv module instead. I'm not decoding the UTF-8 at all, which is safe in this case because both input and output are entirely UTF-8:

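import csv

# Input paths from the question.
goldstandard = "..\\files\\file1.csv"
tweets_file = "..\\files\\file2.csv"

# Map each tweet ID to its text; values stay as undecoded UTF-8 bytestrings.
tweets = {}
with open(tweets_file, "rb") as t:
    reader = csv.reader(t, delimiter='\t')
    for id_, tweet in reader:
        tweets[id_] = tweet

with open(goldstandard, "rb") as gs, open("out.csv", 'wb') as outf:
    reader = csv.reader(gs, delimiter='\t')
    writer = csv.writer(outf, delimiter='\t')
    for columns in reader:
        index = columns[0]
        # Tweet text first, then the selected gold-standard columns.
        writer.writerow([tweets[index]] + columns[1:4] + [columns[5]])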

Note that you really want to avoid using dict as a variable name, because it masks the built-in type; I used tweets instead.
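On Python 3 the csv module works on text rather than bytes, so the files are opened in text mode with an explicit encoding and newline="". The sketch below is an adaptation under that assumption, not part of the original answer. It writes two tiny sample files to a temporary directory (their names and contents are invented) so that it can run on its own:

```python
import csv
import os
import tempfile

workdir = tempfile.mkdtemp()
tweets_path = os.path.join(workdir, "tweets.tsv")   # hypothetical name
gold_path = os.path.join(workdir, "gold.tsv")       # hypothetical name
out_path = os.path.join(workdir, "out.csv")

# Invented sample input, tab-separated and UTF-8 encoded.
with open(tweets_path, "w", encoding="utf-8", newline="") as f:
    f.write("1\tBeyonc\xe9 \u201clive\u201d\n2\tperch\xe8 no\n")
with open(gold_path, "w", encoding="utf-8", newline="") as f:
    f.write("1\ta\tb\tc\tskipped\td\n2\te\tf\tg\tskipped\th\n")

# Map each tweet ID to its (already decoded) text.
tweets = {}
with open(tweets_path, encoding="utf-8", newline="") as t:
    for id_, tweet in csv.reader(t, delimiter="\t"):
        tweets[id_] = tweet

with open(gold_path, encoding="utf-8", newline="") as gs, \
        open(out_path, "w", encoding="utf-8", newline="") as outf:
    reader = csv.reader(gs, delimiter="\t")
    writer = csv.writer(outf, delimiter="\t", lineterminator="\n")
    for columns in reader:
        # Tweet text first, then columns 1-3 and column 5 of the gold row.
        writer.writerow([tweets[columns[0]]] + columns[1:4] + [columns[5]])

# Read the result back as text for inspection.
with open(out_path, encoding="utf-8", newline="") as f:
    result = f.read()
```

Here the strings are decoded on read and encoded again on write, so the accented letters and curly quotes reach the output file as real characters rather than escape sequences.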

edited Apr 29, 2016 at 7:05

answered Apr 28, 2016 at 19:36

Martijn Pieters


10 Comments

Cezar Sas Over a year ago

Hi, thanks for the help. I noticed that now, with json.dump, the output is \\u201c. For printing I changed the code to: json.dump(unicode(dict), json_file, ensure_ascii=False, indent=4). Without unicode it won't print (error).

Martijn Pieters Over a year ago

@ForceITA: unicode(dict) will convert your whole dictionary to a single unicode() object by calling repr() on the object first. You really don't want that. I see the problem now: you are using index = np.int64(columns[0]) as the dictionary key, and JSON requires that you use strings for keys instead.

Martijn Pieters Over a year ago

@ForceITA: updated the code to convert all keys to strings first.

Martijn Pieters Over a year ago

@ForceITA: also, you are converting all your tweets to str() earlier on which can produce similar issues; what format is that column in?

Cezar Sas Over a year ago

Now I'm watching the dict as I populate it, and in memory I had \u201c, so I think the problem is somewhere before printing. I will try to change the keys.
