
I've been developing a product that centers on the daily execution of a Python 3.7.0 data analysis script. Every day at midnight it processes a huge amount of data and then exports the result to two MySQL tables. The first one contains only the data for the current day, while the other table contains the concatenated data of all executions.

To illustrate what I currently have, see the code below, supposing df is the final DataFrame generated by the data analysis:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(r"mysql+pymysql://user:psswd@localhost/pathToMyDB")

df = pd.DataFrame({'Something':['a','b','c']})

df.to_sql('DReg', engine, index=True, if_exists='replace')   # daily database
df.to_sql('AReg', engine, index=False, if_exists='append')   # annual database

As you can see in the parameters of my second to_sql call, I'm not setting an index for the annual database. However, my manager asked me to add one, following a simple rule: it should be an auto-incrementing numeric index that automatically assigns a number to every row saved in the database, corresponding to its position.

So basically, the first time I saved df, the database should look like:

index   Something
0       a
1       b
2       c

And in my second execution:

index   Something
0       a
1       b
2       c
3       a
4       b
5       c

However, when I set index to True in the second df.to_sql command (turning it into df.to_sql('AReg', engine, index=True, if_exists='append')), after two executions my database ends up looking like:

index   Something
0       a
1       b
2       c
0       a
1       b
2       c

I did some research but could not find a way to make the index auto-increment. I considered reading the annual database at every execution and then adapting my DataFrame's index to it, but my database can easily get REALLY huge, which would make execution absurdly slow (and also prevent me from running the same data analysis on two computers simultaneously without compromising the index).

So what is the best solution to make this index work? What am I missing here?

  • If you have two executions of this code then it is doing what you told it to do: it is writing the same DataFrame twice. The to_sql call writes a column named 'index' with the contents of the DataFrame's index, and the DataFrame is the same in each call; that's why it repeats. Commented Feb 21, 2019 at 14:10
  • Yeah, that makes sense... Do you have any idea how I could formulate this code properly then? Commented Feb 21, 2019 at 14:12
    one way to go about it (although it's ugly) is to first read the table (AReg, DReg), find the largest index, and offset the index of the dataframe you're about to write to the db (see the sketch after these comments). Let me think if there's a better way Commented Feb 21, 2019 at 14:14
  • Yeah, I thought about it too, but it is not much of an optimized solution... Let's see if some other approach pops up in someone's mind Commented Feb 21, 2019 at 14:14
  • I second this; it would be nice to have a way to delegate auto-indexing to the database without workarounds like reading the entire table first (which has its drawbacks, e.g. in a concurrent scenario). Commented May 4, 2019 at 10:21
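For reference, a minimal sketch of the workaround discussed in these comments (read only the current maximum of the index column, offset the new rows, then append); the connection string and the 'Something' column come from the question, while the MAX() query over the 'index' column is an assumption about the existing table:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("mysql+pymysql://user:psswd@localhost/pathToMyDB")

df = pd.DataFrame({'Something': ['a', 'b', 'c']})

# Read only the current maximum of the 'index' column instead of the whole table.
max_idx = pd.read_sql("SELECT MAX(`index`) AS max_idx FROM AReg", engine)['max_idx'].iloc[0]
start = 0 if pd.isna(max_idx) else int(max_idx) + 1

# Offset the DataFrame's index so the appended rows continue the numbering.
df.index = range(start, start + len(df))
df.index.name = 'index'
df.to_sql('AReg', engine, index=True, if_exists='append')

As noted above, this still leaves a race condition if two processes write at the same time.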

2 Answers


Even though Pandas has a lot of export options, it is not intended to be used as a database management API. Managing indexes is typically something the database should take care of.

I would suggest setting index=False, if_exists='append' and creating the table with an auto-increment index:

CREATE TABLE AReg (
     id INT NOT NULL AUTO_INCREMENT,
     # your fields here
     PRIMARY KEY (id)
);
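A minimal sketch of how this could look from Python, reusing the question's connection string and 'Something' column; the VARCHAR(255) type and the CREATE TABLE IF NOT EXISTS step are illustrative assumptions:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("mysql+pymysql://user:psswd@localhost/pathToMyDB")

# Create the annual table once, with an auto-increment primary key.
create_stmt = """
CREATE TABLE IF NOT EXISTS AReg (
    id INT NOT NULL AUTO_INCREMENT,
    Something VARCHAR(255),
    PRIMARY KEY (id)
)
"""
with engine.begin() as conn:
    conn.execute(sqlalchemy.text(create_stmt))

# Append rows without writing the DataFrame index; MySQL fills in id.
df = pd.DataFrame({'Something': ['a', 'b', 'c']})
df.to_sql('AReg', engine, index=False, if_exists='append')

Because the annual table is never replaced, the AUTO_INCREMENT counter keeps growing across daily runs, and concurrent writers each receive distinct ids from MySQL.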

2 Comments

This answer seems pretty cool; however, how do you create the table from pandas?
I guess when the table doesn't exist pandas runs a CREATE TABLE; we just need to pass the auto-increment index

Here is my solution: SQL + Python.

Use SQL to get the max index id instead of reading the whole table; it is fast and puts only a light load on the DB and on Python.

The new ids need to be read from a database sequence to ensure they stay unique in multi-user/multi-session cases.

It is best to design the table with an auto-incrementing id. If that is not possible, all new ids should be obtained from a sequence object in the database; the sequence guarantees the ids remain unique even with multiple users/sessions reading it.

In MySQL we get the max id manually; in Oracle or PostgreSQL we can get the next sequence ids with a single SQL statement.

import pandas as pd
from pprint import pprint
from sqlalchemy import create_engine


db_name = 'temp'
table_name = 'tmp_table'
index_name = 'id'
mysql_url = f'mysql+mysqlconnector://root:[email protected]:13306/{db_name}'
engine=create_engine(mysql_url)

def to_sql_seq(df, table_name=table_name, engine=engine):

    # Oracle-style query: fetch one sequence value per row to be inserted.
    # (The commented-out query below is the plain max-id fallback for MySQL.)
    get_seq_id_sql = f"""
                       select your_sequence.nextval as id
                         from dual
                      connect by level <= {df.shape[0]}
                     """

    # sql_get_max_id = f'select max({index_name}) as id from {table_name}'

    s_id = pd.read_sql(get_seq_id_sql, engine)

    # Use the sequence values as the DataFrame index, then append.
    df.index = s_id['id'].values
    df.index.name = index_name
    df.to_sql(table_name, engine, if_exists='append')
    return
#Check the current database record
current_table = pd.read_sql(f"select * from {table_name}",engine)
pprint(current_table)

# Simulate the new data
new_data = [1,2,3,4]
new_table = pd.DataFrame(new_data,columns=['value'])
to_sql_seq(new_table)

# Show the auto-increment index result
inserted_table = pd.read_sql(f'select * from {table_name}',engine)
pprint(inserted_table)

And the output:

   id  value
0   1    123
1   2    234
2   3      1
3   4      2
4   5      3
5   6      4
   id  value
0   1    123
1   2    234
2   3      1
3   4      2
4   5      3
5   6      4
6   7      1
7   8      2
8   9      3
9  10      4

5 Comments

Is the incrementation a database transaction here?
No, the above code gets the max id from the database with a SQL query, adjusts the DataFrame index in Python, and inserts into the database with the adjusted index. Note: id is the database index column name. Before the insert the database holds ids 1,2,3,4,5 and the DataFrame index is 0-based; the index is renamed and offset by 5 (the database's max(id)), so the new rows are inserted with ids 6,7,8,9,10.
Could run into a race condition
Thanks for the comment. I did notice that issue, especially in multi-user cases: only the first user succeeds and the following users get wrong ids. I have revised the code to use an Oracle sequence; with a sequence the ids stay unique even in multi-user cases.
In PostgreSQL, get_seq_id_sql can be replaced by select nextval('your_sequence') as id from generate_series(1, df.shape[0])
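A minimal sketch of that PostgreSQL variant as it would slot into to_sql_seq above; the sequence name your_sequence is assumed to already exist:

    # PostgreSQL: one nextval per DataFrame row (requires an existing sequence named your_sequence)
    get_seq_id_sql = f"""
                       select nextval('your_sequence') as id
                         from generate_series(1, {df.shape[0]})
                     """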
