Python csv package - issue with DictReader module

Question

I'm having a curious issue with the csv package in Python 3.7.

I'm importing a csv file and can access the whole file as expected, with one exception: the header row, as stored in the "fieldnames" attribute, appears to have its first column header (the first item in fieldnames) malformed.

This first field always has the format: 'xxx"header"'

where:

xxx are garbage characters that always seem to be the same
header is the correct header text

See the following screenshot of my table <csv.DictReader> object from my debug window:

My code to open the file follows. I added the line headers[0] = table.fieldnames[0].split('"')[1] to extract the correct header and place it back into fieldnames.

import csv

with self.inputfile.open() as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    headers[0] = table.fieldnames[0].split('"')[1]

(Note: self.inputfile is a pathlib.Path object)

I didn't notice this for a long time because I wasn't using the first column (with the # header) - I've been happily parsing with the rest of the columns for a while on multiple files.

If I look directly at the csv, there doesn't appear to be any issue:

Questions:

Does anyone know what the issue is? Is there anything I can try to correct the import issue?

If there isn't a fix, is there a better way to parse the garbage? I realize this could clear up in the future, but I think the split will still work even with just bare double quotes (the header should still be the 2nd item in the split, right?). Is there a better solution?

snakecharmerb · Accepted Answer · 2019-09-29 08:52:36Z

It looks like your csv file is encoded as utf-8-sig - a version of utf-8 used by some Windows applications, but it's being decoded as cp1252 - another encoding in common use on Windows.

>>> print('"#"'.encode('utf-8-sig').decode('cp1252'))
ï»¿"#"

The "garbage" characters preceding the header are the byte-order-mark that utf-8-sig uses to tell Windows applications that a file is encoded as utf-8 rather than one of the historically more common 8-bit encodings.
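
As a small illustration of the point above, the following sketch decodes the same bytes three ways. It uses only the standard library; the variable names are illustrative and not part of the question's code.

```python
import codecs

# The UTF-8 byte-order mark is the three bytes EF BB BF.
assert codecs.BOM_UTF8 == b'\xef\xbb\xbf'

# Encoding with utf-8-sig prepends the BOM to the encoded text.
raw = '"#"'.encode('utf-8-sig')

# cp1252 maps each of the three BOM bytes to its own visible character.
as_cp1252 = raw.decode('cp1252')

# Plain utf-8 decodes the BOM to the single invisible character U+FEFF.
as_utf8 = raw.decode('utf-8')

# utf-8-sig strips the BOM on decoding.
as_sig = raw.decode('utf-8-sig')

print(repr(as_cp1252), repr(as_utf8), repr(as_sig))
```

Note that decoding as plain utf-8 does not remove the mark either; it only makes it invisible, so the first header would still not compare equal to the expected text.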

To avoid the "garbage", specify utf-8-sig as the encoding when opening your file.

The code in the question could be modified to work like this:

import csv

encoding = 'utf-8-sig'
with self.inputfile.open(encoding=encoding, newline='') as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    ...
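
For a version that can be run outside the class, here is a minimal sketch that writes a small sample file with a byte-order mark and reads the header with both encodings. The helper read_fieldnames, the file name sample.csv and the sample rows are assumptions made up for the demonstration.

```python
import csv
import tempfile
from pathlib import Path


def read_fieldnames(path, encoding):
    """Return the header row of a CSV file opened with the given encoding."""
    # newline='' lets the csv module handle line endings itself.
    with path.open(encoding=encoding, newline='') as fid:
        return csv.DictReader(fid, delimiter=',').fieldnames


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file, written with a BOM as some Windows tools do.
    sample = Path(tmp) / 'sample.csv'
    sample.write_bytes('"#","name"\r\n"1","alpha"\r\n'.encode('utf-8-sig'))

    wrong = read_fieldnames(sample, 'cp1252')     # first header is malformed
    right = read_fieldnames(sample, 'utf-8-sig')  # first header is clean

print(wrong)
print(right)
```

With the right encoding the csv module strips the quotes itself, so the first header is just '#'. The split('"')[1] workaround from the question would then raise an IndexError, because the split yields only one item, so it should be removed rather than kept as a fallback.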

If - as seems likely - the encoding of input files may vary, the value of encoding (or a best guess at it) must be determined with a tool like chardet, as mentioned in the comments.
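
If the only variation is whether a UTF-8 byte-order mark is present, a much narrower check than chardet may be enough. The sketch below is not chardet and does no statistical guessing: it only looks at the first three bytes. The helper name pick_encoding and the cp1252 fallback are assumptions chosen for illustration.

```python
import codecs
import tempfile
from pathlib import Path


def pick_encoding(path, fallback='cp1252'):
    """Return 'utf-8-sig' if the file starts with a UTF-8 BOM, else fallback."""
    with Path(path).open('rb') as fid:
        head = fid.read(len(codecs.BOM_UTF8))
    return 'utf-8-sig' if head == codecs.BOM_UTF8 else fallback


with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical input files, one with a BOM and one without.
    with_bom = Path(tmp) / 'with_bom.csv'
    without_bom = Path(tmp) / 'without_bom.csv'
    with_bom.write_bytes('"#"\r\n'.encode('utf-8-sig'))
    without_bom.write_bytes('"#"\r\n'.encode('cp1252'))

    detected_with = pick_encoding(with_bom)
    detected_without = pick_encoding(without_bom)

print(detected_with, detected_without)
```

The result can be passed as the encoding argument when opening the file. Since utf-8-sig also decodes UTF-8 files that have no byte-order mark, it is a reasonable default whenever the inputs are known to be UTF-8.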

edited Sep 29, 2019 at 8:52

answered Sep 27, 2019 at 16:10

3 Comments

LightCC Over a year ago

Great - that makes sense. Do you have any info on how to apply the utf-8-sig encoding when opening a file with the csv package?

snakecharmerb Over a year ago

Can you tell us what kind of object is self.inputfile?

LightCC Over a year ago

self.inputfile is a pathlib.Path object. P.S. I used the chardet package and confirmed the encoding was utf-8-sig.
