
My code:

raw_data = pd.read_csv("C:/my.csv")

After I run it the file loads, but I get:

C:\Users\user\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3051: DtypeWarning: Columns (0,79,237,239,241,243,245,247,248,249,250,251,252,253,254,255,256,258,260,262,264) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)

Questions:

  1. What exactly does it mean?
  2. How do I fix it?

Sorry, I cannot share the data.

  • Does this help? Have a read on how to ask good pandas questions as well. Commented Jan 21, 2020 at 15:00
  • The warning is telling you that those columns have mixed data types. For example, you might expect column 79 to hold dates, but in your file one row contains '01/01/2020' while another contains 43831. Pandas tries to infer the type for you, but it is warning you that a consistent type can't be assigned because the data is inconsistent. Commented Jan 21, 2020 at 15:00
  • @gbeaven You mean "'01/01/2020' but you also have 43831 in another column"? Commented Jan 21, 2020 at 15:01
  • No, row is correct. Pandas has to read the entire file into memory (which can exhaust it). Consider a file with a column called user_id containing 10 million rows where user_id is always a number. Since pandas cannot know in advance that it is only numbers, it will probably keep the values as the original strings until it has read the whole file. Commented Jan 21, 2020 at 15:03
  • @vasili111 No, I mean row: one value in a column is expected to be a date while another value in the same column is an int. I'm suggesting you have differing (inconsistent) types of data in the same column; see the sketch after these comments. Commented Jan 21, 2020 at 15:04
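
To make the mechanism in these comments concrete, here is a minimal sketch with made-up values. It simulates what low_memory=True does internally: dtypes are inferred per chunk, and concatenating chunks that disagree leaves one column holding both ints and strings.

import io
import pandas as pd

# Made-up data: the first two values of the column look numeric,
# the last two do not.
csv = io.StringIO("when\n43831\n43832\n01/01/2020\n02/01/2020\n")

# Reading in chunks infers a dtype per chunk, as low_memory=True does.
chunks = list(pd.read_csv(csv, chunksize=2))
print(chunks[0]["when"].dtype)  # int64  -- this chunk looked numeric
print(chunks[1]["when"].dtype)  # object -- this chunk did not

# Concatenating leaves a single column with both int and str values,
# which is exactly what the DtypeWarning is about.
df = pd.concat(chunks, ignore_index=True)
print(df["when"].map(type).unique())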

4 Answers


Try this:

raw_data = pd.read_csv("C:/my.csv", low_memory=False)

3 Comments

From here: stackoverflow.com/a/27232309/1601703 it looks like low_memory=False is deprecated. What does it actually do?
It's not deprecated. Your error itself says to set it: "Specify dtype option on import or set low_memory=False." It has worked for me previously.
@vasili111 It isn't deprecated. If you read the docs, it says: low_memory : bool, default True. Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless; use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser.)

pd.read_csv has a number of parameters that will give you control over how to treat the different columns.

Without the data it is hard to be specific, so read up on what the dtype or converters options can do.

See the pandas manual for more details.

A first try could be

raw_data = pd.read_csv("C:/my.csv", dtype=str)

This should allow you to read the data and figure out how to set the data type on the columns that really matter.
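
Without the real data this is only a hedged sketch, but the inspect-then-cast workflow could look like the following; "col79" is a hypothetical column name standing in for one of the columns the warning lists.

import pandas as pd

# Read everything as strings first, so nothing is silently coerced.
raw_data = pd.read_csv("C:/my.csv", dtype=str)

# Hypothetical column: find the values that fail numeric conversion
# (and are not simply missing) to see why the types are mixed.
as_num = pd.to_numeric(raw_data["col79"], errors="coerce")
bad = raw_data.loc[as_num.isna() & raw_data["col79"].notna(), "col79"]
print(bad.unique())

# Once the stray values are understood and cleaned up, cast deliberately.
raw_data["col79"] = pd.to_numeric(raw_data["col79"], errors="coerce")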

4 Comments

Can I fix this by casting the problematic columns to the data type that I expect? For example, string type on a column that mixes strings and numerics?
Yes, you can do this. However, it's best to first understand why the data has mixed types in the relevant columns.
Sure, that's what I meant. First I will load the data with raw_data = pd.read_csv("C:/my.csv"). After receiving that warning I will look at the problematic columns and find out why pandas may think there are different data types (for example, strings and numerics in one column). Then I will fix that by changing the data if possible (recode, make NaN, etc.). Finally I will cast the column to the data type that I think is the correct one.
Sounds like a good plan. Good luck with your data wrangling :)

Pandas will read all the data into memory. If your CSV is large, this may be a tough task.

chunks = []
for chunk in pd.read_csv('desired_file...', chunksize=1000):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)

This will read the CSV into memory in chunks instead of all at once.
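
One hedged caveat: each chunk's dtypes are inferred independently, which is the same mechanism that produces the mixed-types warning in the first place. Passing an explicit dtype keeps the chunks consistent, for example:

for chunk in pd.read_csv('desired_file...', chunksize=1000, dtype=str):
    chunks.append(chunk)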



Try using the dtype parameter of pandas.read_csv.

You can find it documented here: Pandas.read_csv

For my CSVs, I just read all the columns in as strings, and after loading the dataset, I convert the columns I need into numbers using

DataFrame[Column] = pandas.to_numeric(DataFrame[Column], errors='coerce')
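
As a small made-up illustration of what errors='coerce' does: values that cannot be parsed as numbers become NaN instead of raising an error.

import pandas as pd

# 'oops' cannot be parsed as a number, so it becomes NaN.
s = pd.Series(["1.5", "2", "oops"])
print(pd.to_numeric(s, errors="coerce"))  # 1.5, 2.0, NaN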

