
My code:

raw_data = pd.read_csv("C:/my.csv")

After I run it the file loads, but I get:

C:\Users\user\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3051: DtypeWarning: Columns (0,79,237,239,241,243,245,247,248,249,250,251,252,253,254,255,256,258,260,262,264) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)

Questions:

  1. What exactly does it mean?
  2. How do I fix it?

Sorry, I cannot share the data.

  • Does this help? Have a read on how to ask good pandas questions as well. Commented Jan 21, 2020 at 15:00
  • The warning is telling you that those columns have mixed data types. For example, you might expect column 79 to hold dates, but in your file one row contains '01/01/2020' while another contains 43831. Pandas tries to infer the type for you, but it is warning you that a consistent type can't be assigned because the data is inconsistent. Commented Jan 21, 2020 at 15:00
  • @gbeaven You mean "'01/01/2020' but you also have 43831 in another column"? Commented Jan 21, 2020 at 15:01
  • No, row is correct. Pandas has to read the entire file into memory (which can exhaust it). Consider a file with a column called user_id containing 10 million rows where user_id is always a number. Since pandas cannot know in advance that it is only numbers, it will probably keep the values as the original strings until it has read the whole file. Commented Jan 21, 2020 at 15:03
  • @vasili111 No, I mean row: one value in a column is expected to be a date while another value in the same column is an int. I'm suggesting you have differing (inconsistent) types of data in the same column; see the sketch after these comments. Commented Jan 21, 2020 at 15:04
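
To make the mechanism in these comments concrete, here is a minimal sketch with made-up values. It simulates what low_memory=True does internally: dtypes are inferred per chunk, and concatenating chunks that disagree leaves one column holding both ints and strings.

import io
import pandas as pd

# Made-up data: the first two values of the column look numeric,
# the last two do not.
csv = io.StringIO("when\n43831\n43832\n01/01/2020\n02/01/2020\n")

# Reading in chunks infers a dtype per chunk, as low_memory=True does.
chunks = list(pd.read_csv(csv, chunksize=2))
print(chunks[0]["when"].dtype)  # int64  -- this chunk looked numeric
print(chunks[1]["when"].dtype)  # object -- this chunk did not

# Concatenating leaves a single column with both int and str values,
# which is exactly what the DtypeWarning is about.
df = pd.concat(chunks, ignore_index=True)
print(df["when"].map(type).unique())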

4 Answers


Try this:

raw_data = pd.read_csv("C:/my.csv", low_memory=False)

3 Comments

From here: stackoverflow.com/a/27232309/1601703 it looks like low_memory=False is deprecated. What does it actually do?
It's not deprecated. Your error itself says to set it: "Specify dtype option on import or set low_memory=False." It has worked for me previously.
@vasili111 It isn't deprecated. If you read the docs, it says: low_memory : bool, default True. Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless; use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser.)

pd.read_csv has a number of parameters that will give you control over how to treat the different columns.

Without the data it is hard to be specific, so read up on what the dtype or converters options can do.

See the pandas manual for more details.

A first try could be

raw_data = pd.read_csv("C:/my.csv", dtype=str)

This should allow you to read the data and figure out how to set the data type on the columns that really matter.
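
Without the real data this is only a hedged sketch, but the inspect-then-cast workflow could look like the following; "col79" is a hypothetical column name standing in for one of the columns the warning lists.

import pandas as pd

# Read everything as strings first, so nothing is silently coerced.
raw_data = pd.read_csv("C:/my.csv", dtype=str)

# Hypothetical column: find the values that fail numeric conversion
# (and are not simply missing) to see why the types are mixed.
as_num = pd.to_numeric(raw_data["col79"], errors="coerce")
bad = raw_data.loc[as_num.isna() & raw_data["col79"].notna(), "col79"]
print(bad.unique())

# Once the stray values are understood and cleaned up, cast deliberately.
raw_data["col79"] = pd.to_numeric(raw_data["col79"], errors="coerce")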

4 Comments

Can I fix this by casting the problematic columns to the data type that I expect? For example, string type on a column that mixes strings and numerics?
Yes, you can do this. However, it's best to first understand why the data has mixed types in the relevant columns.
Sure, that's what I meant. First I will load the data with raw_data = pd.read_csv("C:/my.csv"). After receiving that warning I will look at the problematic columns and find out why pandas may think there are different data types (for example, strings and numerics in one column). Then I will fix that by changing the data if possible (recode, make NaN, etc.). Finally I will cast the column to the data type that I think is the correct one.
Sounds like a good plan. Good luck with your data wrangling :)

Pandas will read all the data into memory. If your CSV is large, this may be a tough task.

chunks = []
for chunk in pd.read_csv('desired_file...', chunksize=1000):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)

This will read the CSV into memory in chunks instead of all at once.
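
One hedged caveat: each chunk's dtypes are inferred independently, which is the same mechanism that produces the mixed-types warning in the first place. Passing an explicit dtype keeps the chunks consistent, for example:

for chunk in pd.read_csv('desired_file...', chunksize=1000, dtype=str):
    chunks.append(chunk)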



Try using the dtype parameter of pandas.read_csv.

You can find it documented here: Pandas.read_csv

For my CSVs, I just read all the columns in as strings, and after loading the dataset, I convert the columns I need into numbers using

DataFrame[Column] = pandas.to_numeric(DataFrame[Column], errors='coerce')
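
As a small made-up illustration of what errors='coerce' does: values that cannot be parsed as numbers become NaN instead of raising an error.

import pandas as pd

# 'oops' cannot be parsed as a number, so it becomes NaN.
s = pd.Series(["1.5", "2", "oops"])
print(pd.to_numeric(s, errors="coerce"))  # 1.5, 2.0, NaN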

