I'm working on a program that needs to take two files, merge them, and write the merged result as a new file. The problem is that the output file contains escape sequences like \xf0, or, if I change some of the encodings, something like \u0028. The input files are encoded in UTF-8. How can I write characters like "è", "ò" and "-" to the output file?

I have written this code:

import codecs
import pandas as pd
import numpy as np


goldstandard = "..\\files\\file1.csv"
tweets = "..\\files\\file2.csv"

with codecs.open(tweets, "r", encoding="utf8") as t:
    tFile = pd.read_csv(t, delimiter="\t",
                        names=['ID', 'Tweet'],
                        quoting=3)

IDs = tFile['ID']
tweets = tFile['Tweet']

dict = {}
for i in range(len(IDs)):
    dict[np.int64(IDs[i])] = [str(tweets[i])]


with codecs.open(goldstandard, "r", encoding="utf8") as gs:
    for line in gs:
        columns = line.split("\t")
        index = np.int64(columns[0])
        rowValue = dict[index]
        rowValue.append([columns[1], columns[2], columns[3], columns[5]])
        dict[index] = rowValue

import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)
f = codecs.open("out.csv", "w", "utf8")
f.write(ndic)
f.close()

And this is an example of the output:

   desired: Beyoncé
   obtained: Beyonc\xe9
  • Why are you pretty-printing? That produces representations, and string representations produce \xhh escape sequences (literally 4 characters, two of which are hex) for any non-printable or non-ASCII codepoint. Commented Apr 28, 2016 at 19:31

1 Answer

You are producing Python string literals, here:

import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)

Pretty-printing is useful for producing debugging output; objects are passed through repr() to make non-printable and non-ASCII characters easily distinguishable and reproducible:

>>> import pprint
>>> value = u'Beyonc\xe9'
>>> value
u'Beyonc\xe9'
>>> print value
Beyoncé
>>> pprint.pprint(value)
u'Beyonc\xe9'

The é character is in the Latin-1 range, outside of the ASCII range, so it is represented with syntax that produces the same value again when used in Python code.

Don't use pprint if you want to write out actual string values to the output file. You'll have to do your own formatting in that case.
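As a rough illustration of what "your own formatting" could look like, here is a minimal sketch that joins the values with tabs and writes them through io.open, which encodes text as UTF-8 on write. The merged dictionary, the format_row helper and the temporary output path are all assumptions made up for this example; they are not taken from the question's files.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical merged data, shaped like the question's dictionary:
# key -> [tweet text, [extra columns...]]
merged = {
    u"1": [u"Beyonc\xe9 on stage", [u"pos", u"music"]],
    u"2": [u"perch\xe8 no", [u"neg", u"other"]],
}


def format_row(key, value):
    """Build one tab-separated line from a key and its [tweet, [columns]] value."""
    tweet, extra = value[0], value[1]
    return u"\t".join([key, tweet] + list(extra)) + u"\n"


# Hypothetical output location; a temporary directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "out.csv")

# Write the actual string values; io.open encodes them as UTF-8.
with io.open(out_path, "w", encoding="utf8") as f:
    for key in sorted(merged):
        f.write(format_row(key, merged[key]))

# Read the file back to check that it holds real characters, not escapes.
with io.open(out_path, "r", encoding="utf8") as f:
    written = f.read()
```

Because the strings are written directly instead of being passed through repr(), the file contains é itself rather than the four characters \xe9.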

Moreover, the pandas dataframe will hold bytestrings, not unicode objects, so you still have undecoded UTF-8 data at that point.
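To make the difference between undecoded UTF-8 bytes and a decoded string concrete, here is a small sketch. The sample value is an assumption chosen to match the example in the question, not data read from the actual files.

```python
# UTF-8 bytes as they would arrive from a file opened in binary mode.
raw = b"Beyonc\xc3\xa9"

# Decoding turns the two bytes \xc3\xa9 into the single code point U+00E9.
text = raw.decode("utf8")

# Encoding the text again gives back the original bytes.
round_trip = text.encode("utf8")

# The byte string is one element longer than the decoded text,
# because the accented letter takes two bytes in UTF-8.
byte_length = len(raw)
char_length = len(text)
```

As long as the data is only copied from a UTF-8 input to a UTF-8 output, the bytes can be passed through untouched; decoding only matters once you need to work with individual characters.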

Personally, I wouldn't bother using pandas here. You appear to want to write CSV data, so I've simplified your code to use the csv module instead. I'm also not decoding the UTF-8 at all; that is safe in this case because both input and output are entirely UTF-8:

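import csv

# Input paths, named as in the question
goldstandard = "..\\files\\file1.csv"
tweets_file = "..\\files\\file2.csv"

# Map each tweet ID to its text
tweets = {}
# Python 2: the csv module works on byte strings, so open in binary mode
with open(tweets_file, "rb") as t:
    reader = csv.reader(t, delimiter='\t')
    for id_, tweet in reader:
        tweets[id_] = tweet

with open(goldstandard, "rb") as gs, open("out.csv", 'wb') as outf:
    reader = csv.reader(gs, delimiter='\t')
    writer = csv.writer(outf, delimiter='\t')
    for columns in reader:
        index = columns[0]
        # Tweet text, then columns 1-3 and column 5 of the gold standard row
        writer.writerow([tweets[index]] + columns[1:4] + [columns[5]])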

Note that you really want to avoid using dict as a variable name, because it masks the built-in type; I used tweets instead.

10 Comments

Hi, thanks for the help. I noticed that now, with json.dump, the output is \\u201c. For printing I changed the code to: json.dump(unicode(dict), json_file, ensure_ascii=False, indent=4). Without unicode it won't print (error).
@ForceITA: unicode(dict) will convert your whole dictionary to a single unicode() object by calling repr() on the object first. You really don't want that. I see the problem now, you are using index = np.int64(columns[0]) as the dictionary key, and JSON requires that you use strings for keys instead.
@ForceITA: updated the code to convert all keys to strings first.
@ForceITA: also, you are converting all your tweets to str() earlier on which can produce similar issues; what format is that column in?
Now I'm watching the dict as I populate it, and in memory I have \u201c, so I think the problem is somewhere before printing. I will try to change the keys.