I'm having a curious issue with the csv package in Python 3.7.

I'm importing a csv file and can access all of the file as expected, with one exception - the header row, as stored in the "fieldnames" object, appears to have the first column header (the first item in fieldnames) malformed.

This first field always has the format: 'xxx"header"'

where:

  1. xxx are garbage characters that always seem to be the same
  2. header is the correct header text

See the following screenshot of my table <csv.DictReader> object from my debug window: [screenshot not included]

My code to open the file follows. I added the line headers[0] = table.fieldnames[0].split('"')[1] in order to extract the correct header and place it back into fieldnames.

import csv

with self.inputfile.open() as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    # Workaround: keep only the text between the double quotes
    headers[0] = table.fieldnames[0].split('"')[1]

(Note: self.inputfile is a pathlib.Path object)

I didn't notice this for a long time because I wasn't using the first column (with the # header) - I've been happily parsing with the rest of the columns for a while on multiple files.

If I look directly at the csv, there doesn't appear to be any issue.

Questions:

Does anyone know what the issue is? Is there anything I can try to correct the import issue?

If there isn't a fix, is there a better way to parse the garbage? I realize this could clear up in the future, but I think the split will still work even with just bare double quotes (the header should still be the 2nd item in the split, right?). Is there a better solution?

1 Answer

It looks like your csv file is encoded as utf-8-sig - a version of utf-8 used by some Windows applications, but it's being decoded as cp1252 - another encoding in common use on Windows.

>>> print('"#"'.encode('utf-8-sig').decode('cp1252'))
ï»¿"#"

The "garbage" characters preceding the header are the byte-order-mark that utf-8-sig uses to tell Windows applications that a file is encoded as utf-8 rather than one of the historically more common 8-bit encodings.
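
As a minimal sketch of the mechanism, using only the standard library and the "#" header from the question: the byte-order-mark is three bytes, cp1252 maps each of them to a visible character, and utf-8-sig recognises and drops them. The variable names are illustrative only.

```python
import codecs

# Encode a quoted header the way a BOM-writing application would.
raw = '"#"'.encode('utf-8-sig')

# The first three bytes are the UTF-8 byte-order-mark.
bom = raw[:3]
print(bom == codecs.BOM_UTF8)

# cp1252 maps each BOM byte to a printable character: the "garbage".
wrong = raw.decode('cp1252')

# utf-8-sig recognises the BOM and strips it.
right = raw.decode('utf-8-sig')

# Plain utf-8 keeps the BOM as the invisible character U+FEFF.
plain = raw.decode('utf-8')

print(repr(wrong))
print(repr(right))
print(repr(plain))
```

Note that plain utf-8 is not enough either: it decodes without error but leaves U+FEFF at the start of the first header.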

To avoid the "garbage", specify utf-8-sig as the encoding when opening your file.

The code in the question could be modified to work like this:

import csv

encoding = 'utf-8-sig'
with self.inputfile.open(encoding=encoding, newline='') as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    ...
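
The effect on fieldnames can be reproduced without the original file. The sketch below writes a small, hypothetical two-column csv (the real file's contents are not shown in the question) with a BOM, then reads it back with each encoding. A temporary pathlib.Path stands in for self.inputfile.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample data; only the "#" header is taken from the question.
sample = '"#","name"\r\n"1","alpha"\r\n'

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'sample.csv'
    # Write the file with a leading UTF-8 byte-order-mark.
    path.write_bytes(sample.encode('utf-8-sig'))

    # Wrong encoding: the BOM bytes become part of the first header.
    with path.open(encoding='cp1252', newline='') as fid:
        bad_headers = csv.DictReader(fid, delimiter=',').fieldnames

    # Correct encoding: the BOM is stripped before csv sees the text.
    with path.open(encoding='utf-8-sig', newline='') as fid:
        good_headers = csv.DictReader(fid, delimiter=',').fieldnames

print(bad_headers)
print(good_headers)
```

In this sketch the correctly decoded first header comes back as a bare # with no surrounding double quotes, because the csv reader consumes them once the quote is the first character of the field. The split('"')[1] workaround from the question would therefore raise an IndexError on the fixed input and should be removed rather than kept as a fallback.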

If - as seems likely - the encoding of input files may vary, the value of encoding (or a best guess) must be determined by using a tool like chardet, as used in the comments.
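
If the only variation that matters is whether a UTF-8 byte-order-mark is present, a narrower check than chardet can be done with the standard library. The sketch below is not chardet and does not detect encodings in general: guess_encoding is a hypothetical helper, and the cp1252 fallback is an assumption that should be replaced with whatever the other files really use.

```python
import codecs
import tempfile
from pathlib import Path


def guess_encoding(path, fallback='cp1252'):
    """Return 'utf-8-sig' if the file starts with a UTF-8 BOM, else fallback.

    Hypothetical helper: it only looks for the BOM and guesses nothing else.
    """
    with Path(path).open('rb') as fid:
        head = fid.read(len(codecs.BOM_UTF8))
    return 'utf-8-sig' if head == codecs.BOM_UTF8 else fallback


# Small demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    with_bom = Path(tmp) / 'with_bom.csv'
    without_bom = Path(tmp) / 'without_bom.csv'
    with_bom.write_bytes('"#"\r\n'.encode('utf-8-sig'))
    without_bom.write_bytes('"#"\r\n'.encode('cp1252'))
    detected = (guess_encoding(with_bom), guess_encoding(without_bom))

print(detected)
```

The returned value can be passed straight to self.inputfile.open(encoding=..., newline='').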

3 Comments

Great - that makes sense. Do you have any info on how to apply the utf-8-sig encoding when opening a file with the csv package?
Can you tell us what kind of object is self.inputfile?
self.inputfile is a pathlib.Path object. P.S. I used the chardet package and confirmed the encoding was utf-8-sig.
