Read data from CSV file and transform from string to correct data-type, including a list-of-integer column

Question

When I read data back in from a CSV file, every cell is interpreted as a string.

How can I automatically convert the data I read in into the correct type?
Or better: How can I tell the csv reader the correct data-type of each column?

(I wrote a 2-dimensional list, where each column is of a different type (bool, str, int, list of integer), out to a CSV file.)

Sample data (in CSV file):

IsActive,Type,Price,States
True,Cellphone,34,"[1, 2]"
,FlatTv,3.5,[2]
False,Screen,100.23,"[5, 1]"
True,Notebook, 50,[1]

Do you want to map the "States" column to a list of integers?
– user647772, Commented Jul 26, 2012 at 8:58
A bad idea: To convert the data, except string-data, to the correct format, I could use eval. But I'd prefer to avoid this method.
– wewa, Commented Jul 26, 2012 at 9:09
@JonClements: Good idea, but this method doesn't exist in Python version 2.5.1 (see: docs.python.org/library/ast.html#ast-helpers)
– wewa, Commented Jul 26, 2012 at 9:41
Oh - I thought it was in 2.5 - my bad - thanks for correction though
– Jon Clements, Commented Jul 26, 2012 at 9:51

martineau · Accepted Answer · 2022-03-29 18:32:53Z

I know this is a fairly old question, tagged python-2.5, but here's an answer that works with Python 3.6+, which might be of interest to folks using more up-to-date versions of the language.

It leverages the built-in typing.NamedTuple class, which was added in Python 3.5 (the class syntax with field annotations used below needs Python 3.6). What may not be evident from the documentation is that the "type" of each field can be a function.

The example usage code also uses so-called f-string literals which weren't added until Python 3.6, but their use isn't required to do the core data-type transformations.

#!/usr/bin/env python3.6
import ast
import csv
from typing import NamedTuple


class Record(NamedTuple):
    """ Define the fields and their types in a record. """
    IsActive: bool
    Type: str
    Price: float
    States: ast.literal_eval  # Handles string representation of literals.

    @classmethod
    def _transform(cls: 'Record', dict_: dict) -> dict:
        """ Convert string values in given dictionary to corresponding Record
            field type.
        """
        return {name: cls.__annotations__[name](value)
                    for name, value in dict_.items()}


filename = 'test_transform.csv'

with open(filename, newline='') as file:
    for i, row in enumerate(csv.DictReader(file)):
        row = Record._transform(row)
        print(f'row {i}: {row}')

Output:

row 0: {'IsActive': True, 'Type': 'Cellphone', 'Price': 34.0, 'States': [1, 2]}
row 1: {'IsActive': False, 'Type': 'FlatTv', 'Price': 3.5, 'States': [2]}
row 2: {'IsActive': True, 'Type': 'Screen', 'Price': 100.23, 'States': [5, 1]}
row 3: {'IsActive': True, 'Type': 'Notebook', 'Price': 50.0, 'States': [1]}
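
Note that row 2 above shows 'IsActive': True even though the CSV cell contains False: bool() returns True for any non-empty string. If that matters for your data, a small converter function can be used as the field "type" instead of bool. A minimal sketch, where parse_bool is a hypothetical name and treating only the text "true" as True is an assumption about the data:

```python
def parse_bool(text: str) -> bool:
    """Hypothetical converter: only the text 'true' (any case) counts as True."""
    return text.strip().lower() == 'true'


# bool() only checks for emptiness, so the string 'False' is truthy.
print(bool('False'))        # True
print(parse_bool('False'))  # False
print(parse_bool('True'))   # True
print(parse_bool(''))       # False
```

Writing IsActive: parse_bool in the Record class would then apply it in the same way ast.literal_eval is applied to States.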

Generalizing this by creating a base class with just the generic classmethod in it is not simple because of the way typing.NamedTuple is implemented.

To avoid that issue, in Python 3.7+, a dataclasses.dataclass could be used instead because they do not have the inheritance issue — so creating a generic base class that can be reused is simple:

#!/usr/bin/env python3.7
import ast
import csv
from dataclasses import dataclass, fields
from typing import Type, TypeVar

T = TypeVar('T', bound='GenericRecord')

class GenericRecord:
    """ Generic base class for transforming dataclasses. """
    @classmethod
    def _transform(cls: Type[T], dict_: dict) -> dict:
        """ Convert string values in given dictionary to corresponding type. """
        return {field.name: field.type(dict_[field.name])
                    for field in fields(cls)}


@dataclass
class CSV_Record(GenericRecord):
    """ Define the fields and their types in a record.
        Field names must match column names in CSV file header.
    """
    IsActive: bool
    Type: str
    Price: float
    States: ast.literal_eval  # Handles string representation of literals.


filename = 'test_transform.csv'

with open(filename, newline='') as file:
    for i, row in enumerate(csv.DictReader(file)):
        row = CSV_Record._transform(row)
        print(f'row {i}: {row}')

In one sense it's not really very important which one you use because an instance of the class is never created; using one is just a clean way of specifying and holding a definition of the field names and their types in a record data structure.

A TypedDict was added to the typing module in Python 3.8 that can also be used to provide the typing information, but must be used in a slightly different manner since it doesn't actually define a new type like NamedTuple and dataclasses do — so it requires having a standalone transforming function:

#!/usr/bin/env python3.8
import ast
import csv
from typing import TypedDict


def transform(dict_, typed_dict) -> dict:
    """ Convert values in given dictionary to corresponding types in TypedDict. """
    fields = typed_dict.__annotations__
    return {name: fields[name](value) for name, value in dict_.items()}


class CSV_Record_Types(TypedDict):
    """ Define the fields and their types in a record.
        Field names must match column names in CSV file header.
    """
    IsActive: bool
    Type: str
    Price: float
    States: ast.literal_eval


filename = 'test_transform.csv'

with open(filename, newline='') as file:
    for i, row in enumerate(csv.DictReader(file), 1):
        row = transform(row, CSV_Record_Types)
        print(f'row {i}: {row}')

@Bryan: In that case, you may be interested in the update I just made.
Ah, moving the transformation logic into the Record type is a good design choice

cortopy · Accepted Answer · 2015-09-04 11:54:37Z

17

As the docs explain, the CSV reader doesn't perform automatic data conversion. You have the QUOTE_NONNUMERIC format option, but that would only convert all non-quoted fields into floats. This is a very similar behaviour to other csv readers.
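
To illustrate what QUOTE_NONNUMERIC does, here is a small sketch. The two data rows are made up for the example, with every string field quoted and every number left unquoted:

```python
import csv
import io

# Made-up sample: quoted fields stay strings, unquoted fields become floats.
data = '"Cellphone",34,3.5\n"FlatTv",100,0.25\n'

reader = csv.reader(io.StringIO(data, newline=''), quoting=csv.QUOTE_NONNUMERIC)
rows = list(reader)
print(rows)
```

Every unquoted field is passed to float(), so integers come back as floats as well, and an unquoted field that is not a number (such as True or [2] in the question's data) raises ValueError.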

I don't believe Python's csv module would be of any help for this case at all. As others have already pointed out, literal_eval() is a far better choice.

The following does work and converts:

strings
int
floats
lists
dictionaries

You may also use it for booleans and NoneType, although these have to be formatted accordingly for literal_eval() to accept them. LibreOffice Calc writes booleans in all capital letters (TRUE, FALSE), whereas in Python they are capitalized (True, False). Also, you would have to replace empty strings with None (without quotes).
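
A minimal sketch of that clean-up step, run on each cell before it reaches literal_eval(). The name normalize_cell is hypothetical, and mapping every empty cell to None is an assumption about what the data means:

```python
import ast


def normalize_cell(cell):
    """Hypothetical helper: rewrite a raw cell so literal_eval() accepts it."""
    if cell == '':
        return 'None'             # empty cell -> the literal None
    if cell.upper() in ('TRUE', 'FALSE'):
        return cell.capitalize()  # TRUE -> True, FALSE -> False
    return cell


for raw in ['TRUE', 'FALSE', '', '3.5', '[1, 2]']:
    print(repr(raw), '->', repr(ast.literal_eval(normalize_cell(raw))))
```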

I'm writing an importer for mongodb that does all this. The following is part of the code I've written so far.

[NOTE: My csv uses tab as field delimiter. You may want to add some exception handling too]

import ast
from itertools import islice


def getFieldnames(csvFile):
    """
    Read the first row and store values in a tuple
    """
    with open(csvFile) as f:
        firstRow = f.readline()
        fieldnames = tuple(firstRow.strip('\n').split("\t"))
    return fieldnames

def writeCursor(csvFile, fieldnames):
    """
    Convert csv rows into an array of dictionaries
    All data types are automatically checked and converted
    """
    cursor = []  # Placeholder for the dictionaries/documents
    with open(csvFile) as f:
        for row in islice(f, 1, None):  # Skip the header row
            values = list(row.strip('\n').split("\t"))
            for i, value in enumerate(values):
                # Raises ValueError or SyntaxError if value is not a valid literal
                nValue = ast.literal_eval(value)
                values[i] = nValue
            cursor.append(dict(zip(fieldnames, values)))
    return cursor

answered Sep 4, 2015 at 11:54

cortopy


3 Comments

crow16384 Over a year ago

Good solution. Required modules: csv, ast and itertools.

jarmod Over a year ago

Related Python bug report on incompatibility between csv reader and writer with QUOTE_NONNUMERIC: bugs.python.org/issue30046

Shrout1 Over a year ago

Can't currently edit this post to update. I added a try: except: clause to the for i, value in enumerate(values): loop and it now works extremely well. This provides a lot of flexibility for converting files from CSV but won't choke on data it doesn't recognize (dates, ip addresses... you know... just about everything...) Also remember from itertools import islice. Thanks for this answer!

martineau · Accepted Answer · 2022-03-29 17:03:32Z

8

You have to map your rows:

import csv
import io

data = u"""\
True,foo,1,2.3,baz
False,bar,7,9.8,qux
"""

reader = csv.reader(io.StringIO(data, newline=""), delimiter=",")
parsed = (({'True': True}.get(row[0],False), row[1], int(row[2]), float(row[3]), row[4])
              for row in reader)
for row in parsed:
    print(row)

Results in

(True, 'foo', 1, 2.3, 'baz')
(False, 'bar', 7, 9.8, 'qux')

edited Mar 29, 2022 at 17:03

martineau


answered Jul 26, 2012 at 8:56

user647772

4 Comments

Jon Clements Over a year ago

As the OP has a bool column in their example: for row[0], assuming that only "True" is True, you could use {'True': True}.get(row[0], False)

wewa Over a year ago

@Tichodroma: But how do I convert cells which contain lists? (Like [1] or "[2, 3, 4]")

user647772 Over a year ago

Good question :) I don't recommend eval but have no solution ATM.

Wojciech Ptak Over a year ago

List format is json-like, so maybe json.loads will help. It should handle nested lists, integers and strings...
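
A minimal sketch of the json.loads suggestion, assuming the States cells always contain valid JSON. The inline sample is copied from the question so the snippet runs without a file:

```python
import csv
import io
import json

data = '''IsActive,Type,Price,States
True,Cellphone,34,"[1, 2]"
,FlatTv,3.5,[2]
'''

rows = []
for row in csv.DictReader(io.StringIO(data, newline='')):
    # Only the list column is parsed; the other cells stay strings here.
    row['States'] = json.loads(row['States'])
    rows.append(row)

print([row['States'] for row in rows])
```

Unlike ast.literal_eval, json.loads does not accept Python-only spellings such as True, None or single-quoted strings, so it fits the list column but not the IsActive column.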

doctaphred · Accepted Answer · 2015-09-08 17:19:37Z

Props to Jon Clements and cortopy for teaching me about ast.literal_eval! Here's what I ended up going with (Python 2; changes for 3 should be trivial):

from ast import literal_eval
from csv import DictReader
import csv


def csv_data(filepath, **col_conversions):
    """Yield rows from the CSV file as dicts, with column headers as the keys.

    Values in the CSV rows are converted to Python values when possible,
    and are kept as strings otherwise.

    Specific conversion functions for columns may be specified via
    `col_conversions`: if a column's header is a key in this dict, its
    value will be applied as a function to the CSV data. Specify
    `ColumnHeader=str` if all values in the column should be interpreted
    as unquoted strings, but might be valid Python literals (`True`,
    `None`, `1`, etc.).

    Example usage:

    >>> csv_data(filepath,
    ...          VariousWordsIncludingTrueAndFalse=str,
    ...          NumbersOfVaryingPrecision=float,
    ...          FloatsThatShouldBeRounded=round,
    ...          **{'Column Header With Spaces': arbitrary_function})
    """

    def parse_value(key, value):
        if key in col_conversions:
            return col_conversions[key](value)
        try:
            # Interpret the string as a Python literal
            return literal_eval(value)
        except Exception:
            # If that doesn't work, assume it's an unquoted string
            return value

    with open(filepath) as f:
        # QUOTE_NONE: don't process quote characters, to avoid the value
        # `"2"` becoming the int `2`, rather than the string `'2'`.
        for row in DictReader(f, quoting=csv.QUOTE_NONE):
            yield {k: parse_value(k, v) for k, v in row.iteritems()}

(I'm a little wary that I might have missed some corner cases involving quoting. Please comment if you see any issues!)
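
One quoting corner case that can be checked directly: with QUOTE_NONE the reader no longer treats a quoted field as one unit, so a list cell that contains a comma is split. A small sketch using one line from the question's sample data:

```python
import csv
import io

line = 'True,Cellphone,34,"[1, 2]"\n'

# Default quoting: the quoted list stays in a single field.
default = next(csv.reader(io.StringIO(line, newline='')))

# QUOTE_NONE: quote characters are ordinary text, so the comma splits the cell.
no_quotes = next(csv.reader(io.StringIO(line, newline=''),
                            quoting=csv.QUOTE_NONE))

print(default)
print(no_quotes)
```

So for files like the one in the question, the QUOTE_NONE setting would need to be dropped, or the list columns written without embedded commas.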

martineau · Accepted Answer · 2022-03-29 18:26:57Z

Here's a modified version of @user647772's answer that makes use of the ast.literal_eval() function so it can handle a list-of-integer column (as well as any other valid Python literal expression) in a field in a row of a CSV formatted file.

It works in both Python 2.7 and 3.x.

from ast import literal_eval
import csv
import io

data = u"""\
True,foo,1,2.3,baz,"[1, 2]"
False,bar,7,9.8,qux,"[5, 1]"
"""

def evaluate(expression):
    try:
        return literal_eval(expression)
    except (ValueError, SyntaxError):
        # Not a valid Python literal (e.g. an unquoted word): keep it as a string.
        return str(expression)

reader = csv.reader(io.StringIO(data, newline=""), delimiter=",")
parsed = (tuple(evaluate(field) for field in row) for row in reader)
for row in parsed:
    print(row)

Results:

(True, 'foo', 1, 2.3, 'baz', [1, 2])
(False, 'bar', 7, 9.8, 'qux', [5, 1])

Adrien C. · Accepted Answer · 2019-07-04 13:19:30Z

1

I love @martineau's answer. It's very clean.

One thing I needed was to convert only a couple of values and leave all the other fields as strings, like having strings as default and just updating the type for specific keys.

To do that, just replace this line:

row = CSV_Record._transform(row)

by this one:

row.update(CSV_Record._transform(row))

The 'update' function updates the variable row directly, merging the raw data from the csv extract with the values converted to the correct type by the '_transform' method.

Note there is no 'row = ' in the updated version.

Hope this will help in case anyone has a similar requirement.

(PS: I'm quite new to posting on stackoverflow, so please let me know if the above is not clear)

answered Jul 4, 2019 at 13:19

Adrien C.


1 Comment

VoteCoffee Over a year ago

It's a neat trick for certain use cases. I think this will lose some capability that a dataclass might otherwise provide, like functions, as the underlying type of row would remain a dict and not a dataclass.

fredmb · Accepted Answer · 2019-10-26 02:55:43Z

1

I too really liked @martineau's approach and was especially intrigued by his comment that the essence of his code was a clean mapping between fields and types. That suggested to me that a dictionary would work also. Hence the variation on his theme shown below. It's worked nicely for me.

Clearly the value field in the dictionary is really just a callable and thus could be used to provide a hook for data massaging as well as typecasting if one so chose.

import ast
import csv

fix_type = {'IsActive': bool, 'Type': str, 'Price': float, 'States': ast.literal_eval}

filename = 'test_transform.csv'

with open(filename, newline='') as file:
    for i, row in enumerate(csv.DictReader(file)):
        row = {k: fix_type[k](v) for k, v in row.items()}
        print(f'row {i}: {row}')

Output

row 0: {'IsActive': True, 'Type': 'Cellphone', 'Price': 34.0, 'States': [1, 2]}
row 1: {'IsActive': False, 'Type': 'FlatTv', 'Price': 3.5, 'States': [2]}
row 2: {'IsActive': True, 'Type': 'Screen', 'Price': 100.23, 'States': [5, 1]}
row 3: {'IsActive': True, 'Type': 'Notebook', 'Price': 50.0, 'States': [1]}

answered Oct 26, 2019 at 2:55

fredmb


1 Comment

martineau Over a year ago

The dictionary could be defined in an arguably more readable way with fix_type = dict(IsActive=bool, Type=str, Price=float, States=ast.literal_eval). Might also want to put the {k: fix_type[k](v) for k, v in row.items()} into a reusable data-driven function.
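
A minimal sketch of such a reusable function. The name convert_row is hypothetical, and falling back to str for columns without an entry is an assumption. The inline sample replaces the CSV file so the snippet runs on its own:

```python
import ast
import csv
import io


def convert_row(row, converters, default=str):
    """Apply converters[column] to each cell; columns without one use default."""
    return {key: converters.get(key, default)(value)
            for key, value in row.items()}


fix_type = dict(IsActive=bool, Price=float, States=ast.literal_eval)

data = 'IsActive,Type,Price,States\nTrue,Cellphone,34,"[1, 2]"\n'
converted = [convert_row(row, fix_type)
             for row in csv.DictReader(io.StringIO(data, newline=''))]
print(converted)
```

Because the function only looks up callables by column name, the same code can serve several CSV layouts by passing a different dictionary.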

Jon Clements · Accepted Answer · 2012-07-26 16:11:23Z

0

An alternative to ast.literal_eval (although it seems a bit extreme) is the pyparsing module, available on PyPI. See whether the http://pyparsing.wikispaces.com/file/view/parsePythonValue.py code sample is appropriate for what you require or can be easily adapted.

edited Jul 26, 2012 at 16:11

answered Jul 26, 2012 at 14:29

Jon Clements


1 Comment

PaulMcG Over a year ago

Pyparsing is no longer hosted on wikispaces.com. Go to github.com/pyparsing/pyparsing

Neil McGuigan · Accepted Answer · 2022-10-12 20:52:12Z

0

If you're using JSON Schema, you can use singer.transform():

import json
import singer
import csv

with open("my.schema.json") as f:
  schema = json.load(f)

fieldnames = list(schema["properties"].keys())

with open("my.csv", newline="") as f:
  reader = csv.DictReader(f, fieldnames=fieldnames)
  for row in reader:
    print(singer.transform(row, schema))

answered Oct 12, 2022 at 20:52

Neil McGuigan



Lionel Hamayon · Accepted Answer · 2021-08-25 08:13:48Z

This is my take on the question, in case you have to deal with multiple CSV formats, perform some additional custom data wrangling on some columns, and output the result as a list of lists or a tuple of tuples.

The types are given as strings because the column types are stored in a database outside of the Python code. This also makes it possible to add custom types if need be.

I haven't tested this against very large files, though, as in my production code I use pandas and this code is only here for some test setup. I would guess it consumes more memory than some of the other answers, as all the data from the CSV is loaded at once.

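import ast

dict_headers_type = {
    "IsActive": "bool",
    "Type": "str",
    "Price": "float",
    "State": "list",
}

# Map each type name to the function that converts a string to that type.
dict_converters = {
    "bool": lambda x: bool(x),
    "float": lambda x: float(x),
    "list": lambda x: ast.literal_eval(x),
}

# Headers whose type has no converter (here "str") are left out.
dict_header_converter = {
    header: dict_converters[my_type]
    for header, my_type in dict_headers_type.items()
    if my_type in dict_converters.keys()
}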

That in place, we can perform the conversion:

import csv

with open(csv_path, newline="") as f:
    data = [line for line in csv.reader(f)]

# list of the converters to apply
ls_f = [
    dict_header_converter[header]
    if header in dict_header_converter.keys() else None
    for header in data[0]
]

# apply each column's converter (if any) to every data row
ls_records = [
    [f(datapoint) if f else datapoint
     for f, datapoint in zip(ls_f, row)]
    for row in data[1:]
]

# to add headers, if needed:
ls_records.insert(0, data[0])

outputs:

[
  ['IsActive','Type','Price','State'],
  [True, 'Cellphone', 34.0, [1, 2]],
  [False, 'FlatTv', 3.5, [2]],
  [True, 'Screen', 100.23, [5, 1]],
  [True, 'Notebook', 50.0, [1]],
]
