
I've been developing a product that centers on the daily execution of a Python 3.7.0 data analysis script. Every day at midnight it processes a huge amount of data and then exports the result to two MySQL tables. The first one contains only the data for the current day, while the other table contains the concatenated data of all executions.

To illustrate what I currently have, see the code below, supposing df is the final DataFrame generated by the data analysis:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(r"mysql+pymysql://user:psswd@localhost/pathToMyDB")

df = pd.DataFrame({'Something':['a','b','c']})

df.to_sql('DReg', engine, index=True, if_exists='replace')   # daily database
df.to_sql('AReg', engine, index=False, if_exists='append')   # annual database

As you can see in the parameters of my second to_sql call, I'm not setting an index for the annual database. However, my manager asked me to add one, following a simple rule: it should be an auto-incrementing numeric index that automatically assigns a number to every row saved in the database, corresponding to its position.

So basically, the first time I saved df, the database should look like:

index   Something
0       a
1       b
2       c

And in my second execution:

index   Something
0       a
1       b
2       c
3       a
4       b
5       c

However, when I set index to True in the second df.to_sql command (turning it into df.to_sql('AReg', engine, index=True, if_exists='append')), after two executions my database ends up looking like:

index   Something
0       a
1       b
2       c
0       a
1       b
2       c

I did some research but could not find a way to make the index auto-increment. I considered reading the annual database at every execution and then adapting my DataFrame's index to it, but my database can easily get REALLY huge, which would make execution absurdly slow (and also prevent me from running the same data analysis on two computers simultaneously without compromising the index).

So what is the best solution to make this index work? What am I missing here?

  • If you have two executions of this code then it is doing what you told it to do: it is writing the same DataFrame twice. The to_sql call writes a column named 'index' with the contents of the DataFrame's index, and the DataFrame is the same in each call; that's why it repeats. Commented Feb 21, 2019 at 14:10
  • Yeah, that makes sense... Do you have any idea how I could formulate this code properly then? Commented Feb 21, 2019 at 14:12
    one way to go about it (although it's ugly) is to first read the table (AReg, DReg), find the largest index, and offset the index of the dataframe you're about to write to the db (see the sketch after these comments). Let me think if there's a better way Commented Feb 21, 2019 at 14:14
  • Yeah, I thought about it too, but it is not much of an optimized solution... Let's see if some other approach pops up in someone's mind Commented Feb 21, 2019 at 14:14
  • I second this; it would be nice to have a way to delegate auto-indexing to the database without workarounds like reading the entire table first (which has its drawbacks, e.g. in a concurrent scenario). Commented May 4, 2019 at 10:21
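For reference, a minimal sketch of the workaround discussed in these comments (read only the current maximum of the index column, offset the new rows, then append); the connection string and the 'Something' column come from the question, while the MAX() query over the 'index' column is an assumption about the existing table:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("mysql+pymysql://user:psswd@localhost/pathToMyDB")

df = pd.DataFrame({'Something': ['a', 'b', 'c']})

# Read only the current maximum of the 'index' column instead of the whole table.
max_idx = pd.read_sql("SELECT MAX(`index`) AS max_idx FROM AReg", engine)['max_idx'].iloc[0]
start = 0 if pd.isna(max_idx) else int(max_idx) + 1

# Offset the DataFrame's index so the appended rows continue the numbering.
df.index = range(start, start + len(df))
df.index.name = 'index'
df.to_sql('AReg', engine, index=True, if_exists='append')

As noted above, this still leaves a race condition if two processes write at the same time.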

2 Answers


Even though Pandas has a lot of export options, it is not intended to be used as a database management API. Managing indexes is typically something the database should take care of.

I would suggest setting index=False, if_exists='append' and creating the table with an auto-increment index:

CREATE TABLE AReg (
     id INT NOT NULL AUTO_INCREMENT,
     # your fields here
     PRIMARY KEY (id)
);
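A minimal sketch of how this could look from Python, reusing the question's connection string and 'Something' column; the VARCHAR(255) type and the CREATE TABLE IF NOT EXISTS step are illustrative assumptions:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("mysql+pymysql://user:psswd@localhost/pathToMyDB")

# Create the annual table once, with an auto-increment primary key.
create_stmt = """
CREATE TABLE IF NOT EXISTS AReg (
    id INT NOT NULL AUTO_INCREMENT,
    Something VARCHAR(255),
    PRIMARY KEY (id)
)
"""
with engine.begin() as conn:
    conn.execute(sqlalchemy.text(create_stmt))

# Append rows without writing the DataFrame index; MySQL fills in id.
df = pd.DataFrame({'Something': ['a', 'b', 'c']})
df.to_sql('AReg', engine, index=False, if_exists='append')

Because the annual table is never replaced, the AUTO_INCREMENT counter keeps growing across daily runs, and concurrent writers each receive distinct ids from MySQL.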

2 Comments

This answer seems pretty cool; however, how do you create the table from pandas?
I guess when the table doesn't exist pandas runs a CREATE TABLE; we just need to pass the auto-increment index

Here is my solution: SQL + Python.

Use SQL to get the max index id instead of reading the whole table; it is fast and puts only a light load on the DB and on Python.

The new ids need to be read from a database sequence to ensure they stay unique in multi-user/multi-session cases.

It is best to design the table with an auto-incrementing id. If that is not possible, all new ids should be obtained from a sequence object in the database; the sequence guarantees the ids remain unique even with multiple users/sessions reading it.

In MySQL we get the max id manually; in Oracle or PostgreSQL we can get the next sequence ids with a single SQL statement.

import pandas as pd
from pprint import pprint
from sqlalchemy import create_engine


db_name = 'temp'
table_name = 'tmp_table'
index_name = 'id'
mysql_url = f'mysql+mysqlconnector://root:[email protected]:13306/{db_name}'
engine=create_engine(mysql_url)

def to_sql_seq(df, table_name=table_name, engine=engine):

    # Oracle-style query: fetch one sequence value per row to be inserted.
    # (The commented-out query below is the plain max-id fallback for MySQL.)
    get_seq_id_sql = f"""
                       select your_sequence.nextval as id
                         from dual
                      connect by level <= {df.shape[0]}
                     """

    # sql_get_max_id = f'select max({index_name}) as id from {table_name}'

    s_id = pd.read_sql(get_seq_id_sql, engine)

    # Use the sequence values as the DataFrame index, then append.
    df.index = s_id['id'].values
    df.index.name = index_name
    df.to_sql(table_name, engine, if_exists='append')
    return
#Check the current database record
current_table = pd.read_sql(f"select * from {table_name}",engine)
pprint(current_table)

# Simulate the new data
new_data = [1,2,3,4]
new_table = pd.DataFrame(new_data,columns=['value'])
to_sql_seq(new_table)

# Show the auto-increment index result
inserted_table = pd.read_sql(f'select * from {table_name}',engine)
pprint(inserted_table)

And the output:

   id  value
0   1    123
1   2    234
2   3      1
3   4      2
4   5      3
5   6      4
   id  value
0   1    123
1   2    234
2   3      1
3   4      2
4   5      3
5   6      4
6   7      1
7   8      2
8   9      3
9  10      4

5 Comments

Is the incrementation a database transaction here?
No, the above code gets the max id from the database with a SQL query, adjusts the DataFrame index in Python, and inserts into the database with the adjusted index. Note: id is the database index column name. Before the insert the database holds ids 1,2,3,4,5 and the DataFrame index is 0-based; the index is renamed and offset by 5 (the database's max(id)), so the new rows are inserted with ids 6,7,8,9,10.
Could run into a race condition
Thanks for the comment. I did notice that issue, especially in multi-user cases: only the first user succeeds and the following users get wrong ids. I have revised the code to use an Oracle sequence; with a sequence the ids stay unique even in multi-user cases.
In PostgreSQL, get_seq_id_sql can be replaced by select nextval('your_sequence') as id from generate_series(1, df.shape[0])
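A minimal sketch of that PostgreSQL variant as it would slot into to_sql_seq above; the sequence name your_sequence is assumed to already exist:

    # PostgreSQL: one nextval per DataFrame row (requires an existing sequence named your_sequence)
    get_seq_id_sql = f"""
                       select nextval('your_sequence') as id
                         from generate_series(1, {df.shape[0]})
                     """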
