
Right now I am working with some JSON data and trying to push it into a MySQL database on the fly. The JSON file is enormous, so I have to go through it line by line using a generator (yield) in Python, convert each JSON line into a small pandas DataFrame, and write it to MySQL. The problem is that when I create the DataFrame from JSON, it adds an index column, and it seems that when I write to MySQL the index=False option is ignored. Code below:

import gzip
import pandas as pd
from sqlalchemy import create_engine

#stuff to parse json file
def parseJSON(path):
  g = open(path, 'r')
  for l in g:
      yield eval(l)
#MySQL engine
engine = create_engine('mysql://login:password@localhost:1234/MyDB', echo=False)
#empty df just to have it
df = {}

for l in parseJSON("MyFile.json"):
    df = pd.DataFrame.from_dict(l, orient='index')
    df.to_sql(name='MyTable', con=engine, if_exists = 'append', index=False)

And I get an error:

OperationalError: (_mysql_exceptions.OperationalError) (1054, "Unknown column '0' in 'field list'")

Any ideas what I am missing? Or is there a way to get around this stuff?

UPD. I see that the DataFrame has a single column labeled 0 each time I create it in the inner loop.

Here is some info about DF:

df
Out[155]: 
                                                                0
reviewerID                                         A1C2VKKDCP5H97
asin                                                   0007327064
reviewerName                                        Donna Polston
helpful                                                    [0, 0]
unixReviewTime                                         1392768000
reviewText      love Oddie ,One of my favorite books are the O...
overall                                                         5
reviewTime                                            02 19, 2014
summary                                                       Wow

print(df.columns)
RangeIndex(start=0, stop=1, step=1)
  • Sounds like the column names differ from your dataframe to your table. Commented Apr 18, 2017 at 2:56
  • @BobHaffner, hi, I double-checked that, the columns are precisely the same. If a column did not exist, it would let me know, I believe. I updated the question a bit. Commented Apr 18, 2017 at 2:58
  • Ok, so they all match except you have an extra column with a value of 0? Can you do a print (df.columns) right before your df.to_sql()? Commented Apr 18, 2017 at 3:01
  • @BobHaffner done as well Commented Apr 18, 2017 at 3:08
  • Ok, now it's a little more clear. You currently have a frame with one column named 0, with your intended column names as the index of the frame. Perhaps you can try df = pd.DataFrame.from_dict(l) OR you could try df.T.to_sql(name='MyTable', con=engine, if_exists = 'append', index=False), where you transpose the frame before pushing it to MySQL. NOTE: I think you would get much better performance if you could build up a dict (or some other structure), convert all rows to a df, then push to MySQL. Writing one row at a time might be too slow. Commented Apr 18, 2017 at 3:20

1 Answer


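Because of orient='index', you currently have a frame with one column named 0, with your intended column names as the index of the frame. That is why MySQL complains about an unknown column '0'. Perhaps you can try dropping that argument: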

df = pd.DataFrame.from_dict(l)
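
To see how the two orientations differ, here is a minimal sketch using a shortened, hypothetical copy of the record printed in the question. One thing worth checking in your own data: with the default orientation, a list-valued field such as helpful sets the frame length, so the scalars are repeated across two rows. Wrapping the dict in a list (pd.DataFrame([record])) is one alternative that keeps a single row, although a list-valued cell would presumably still need converting, for example to a string, before to_sql can store it.

```python
import pandas as pd

# Hypothetical record, shortened from the one printed in the question.
record = {
    "reviewerID": "A1C2VKKDCP5H97",
    "helpful": [0, 0],
    "overall": 5,
    "summary": "Wow",
}

# orient='index' (shown on the scalar fields only): keys become row labels
# and the values land in a single column labeled 0.
scalars = {key: value for key, value in record.items() if key != "helpful"}
by_index = pd.DataFrame.from_dict(scalars, orient="index")
print(by_index.columns.tolist())

# Default orient='columns': keys become columns, but the two-element list
# sets the length, so the scalar values are repeated across two rows.
as_columns = pd.DataFrame.from_dict(record)
print(as_columns.shape)

# Wrapping the dict in a list treats it as exactly one row;
# 'helpful' stays a list inside a single cell.
one_row = pd.DataFrame([record])
print(one_row.shape)
```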

NOTE: I think you would get much better performance if you could build up a dict (or some other structure), convert all rows to a df, then push to MySQL. Writing one row at a time might be too slow.
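
A rough sketch of that batching idea, under some assumptions: batched_frames is a hypothetical helper (not part of pandas), the batch size of 1000 is arbitrary, the records are generated stand-ins for what parseJSON would yield, and an in-memory SQLite connection replaces the MySQL engine so the example runs on its own.

```python
import sqlite3
from itertools import islice

import pandas as pd


def batched_frames(records, batch_size):
    """Yield DataFrames of up to batch_size rows from an iterable of dicts."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            break
        # A list of dicts becomes one row per dict, one column per key.
        yield pd.DataFrame(chunk)


# Stand-in data; in the question these would come from parseJSON("MyFile.json").
records = ({"asin": str(i), "overall": i % 5 + 1} for i in range(2500))

# Assumption: in-memory SQLite keeps the sketch runnable without a server;
# the question passes a SQLAlchemy MySQL engine as con instead.
con = sqlite3.connect(":memory:")
for frame in batched_frames(records, batch_size=1000):
    frame.to_sql(name="MyTable", con=con, if_exists="append", index=False)

row_count = con.execute("SELECT COUNT(*) FROM MyTable").fetchone()[0]
print(row_count)
```

Only one batch is held in memory at a time, so this should still work for a file too large to load whole, while issuing far fewer insert calls than the row-at-a-time loop.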
