Pandas to_sql avoid duplicate rows

Question

I am using pandas' to_sql method to insert data into a mysql table. The mysql table already exists and I'd like to avoid inserting duplicate rows.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html

Is there a way to do this in python?

# mysql connection
import pandas as pd
import pymysql
from sqlalchemy import create_engine
user = 'user1'
pwd = 'xxxx'
host =  'aa1.us-west-1.rds.amazonaws.com'
port = 3306
database = 'main'

engine = create_engine("mysql+pymysql://{}:{}@{}/{}".format(user,pwd,host,database))


con = engine.connect()
df.to_sql(name="dfx", con=con, if_exists = 'append')
con.close()

Are there any work-arounds, if there isn't a straight forward way to do this?

If your table isn't exceptionally large you could simply read the whole DB into a df and do a concat with a drop duplicates into your existing code from you post — ArchAngelPwn
– ArchAngelPwn, Commented Jul 13, 2022 at 2:55
I'd say skip the Pandas layer and go for raw SQL so you can pass the duplicate handling to SQL. — Quang Hoang
– Quang Hoang, Commented Jul 13, 2022 at 3:03

OD1995 · Accepted Answer · 2022-07-13 09:42:32Z

1

It sounds like you want to do an "upsert" (insert or update). Pangres is a useful package that will allow you to do an upsert using a pandas df. If you don't want to update the row if it exists, that is also an option by setting if_row_exists to 'ignore'

answered Jul 13, 2022 at 9:42

OD1995

1,7954 gold badges26 silver badges60 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

ASH · Accepted Answer · 2022-11-11 20:31:02Z

0

I have never heard of 'upsert' before today, but it sounds interesting. You could certainly delete dupes after the data is loaded into your table.

WITH a as
(
SELECT Firstname,ROW_NUMBER() OVER(PARTITION by Firstname, empID ORDER BY Firstname) 
AS duplicateRecCount
FROM dbo.tblEmployee
)
--Now Delete Duplicate Records
DELETE FROM a
WHERE duplicateRecCount > 1

That will work fine, unless you have billions of rows.

answered Nov 11, 2022 at 20:31

ASH

20.5k28 gold badges117 silver badges247 bronze badges

Collectives™ on Stack Overflow

Pandas to_sql avoid duplicate rows

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related