Merging Pandas dataframe and SQL table values

Question

I have a dataframe and want to update it, or create a new dataframe, based on some input from an SQL table. Dataframe A has two columns (ID and Added_Date).

The SQL table, on the other hand, has a few more columns, including ID, Transaction_Date, Year, Month and Day. My idea is to join dataframe A to the SQL table and, after the join, keep only the records whose Transaction_Date falls within the 30 days after the Added_Date. In summary, I want a dataframe with all transactions (from the SQL table) that happened in the 30 days following the Added_Date in df_A. The SQL table is huge and is partitioned by Year, Month and Day. How can I optimize this process?

I understand the join can be done once the dataframe is converted to a tuple or maybe a dictionary, but I don't know how to proceed from there. Sample code is below:

import sqlite3
import pandas as pd 

# create df 

data = {'ID': [1, 2, 3], 'Added_Date': ['2023-02-01', '2023-04-15', '2023-03-17']}
df_A = pd.DataFrame(data)

Below is the code to create a sample transactions table in an in-memory SQLite database:

# Create an in-memory SQLite database
conn = sqlite3.connect(':memory:')
c = conn.cursor()

# Create the transactions table
c.execute('''CREATE TABLE transactions
             (ID INTEGER, transaction_date DATE)''')

# Insert sample data into the transactions table
c.execute('''INSERT INTO transactions VALUES
             (1, '2023-01-15'), (1, '2023-02-10'), (1, '2023-03-01'),
             (2, '2023-04-01'), (2, '2023-04-20'), (2, '2023-05-05'),
             (3, '2023-03-10'), (3, '2023-03-25'), (3, '2023-04-02')''')

Expected outcome should be something like this:

ID  transaction_date
1        2023-02-10
1        2023-03-01
2        2023-04-20
2        2023-05-05
3        2023-03-10
3        2023-03-25
3        2023-04-02

I hope that's more clear.

SQLite (the sqlite tag) and Microsoft SQL Server (the sql-server tag) are not even remotely the same thing. Which database system are you actually using? (Please correct your tags.)
– AlwaysLearning, Commented Apr 27, 2024 at 6:29
Your expected output includes '2023-03-10' for ID == 3. That seems incorrect, given '2023-03-17' in df_A. No?
– ouroboros1, Commented Apr 27, 2024 at 11:15

ouroboros1 · Accepted Answer · 2024-04-27 10:52:14Z

Here's one approach:

First, convert df_A['Added_Date'] to datetime with pd.to_datetime.
Create a temporary table with the schema of df_A (see this answer) and append the data with df.to_sql.
Now, execute a query with an INNER JOIN on both 'ID' and 'transaction_date BETWEEN ...' (cf. DATE) and fetch the rows with cursor.fetchall.
Pass the result to the data parameter of pd.DataFrame and take the column names from cursor.description.

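import re

# Parse the date strings into datetime values
df_A['Added_Date'] = pd.to_datetime(df_A['Added_Date'])

# Generate the CREATE TABLE statement for df_A and make the table temporary
create_tmp = pd.io.sql.get_schema(df_A, 'temporary_table')
create_tmp = re.sub(
    "^(CREATE TABLE)?",
    "CREATE TEMPORARY TABLE",
    create_tmp
)
c.execute(create_tmp)

# Append the rows of df_A to the temporary table
df_A.to_sql(name='temporary_table', con=conn, if_exists='append', index=False)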

query = """
SELECT 
  tr.ID, 
  tr.transaction_date 
FROM 
  transactions AS tr 
  INNER JOIN temporary_table AS tmp ON tr.ID = tmp.ID 
  AND tr.transaction_date BETWEEN tmp.Added_Date 
  AND DATE(tmp.Added_Date, '+30 day')
"""

out = pd.DataFrame(data=c.execute(query).fetchall(), 
                   columns=[desc[0] for desc in c.description])

Output

   ID transaction_date
0   1       2023-02-10
1   1       2023-03-01
2   2       2023-04-20
3   2       2023-05-05
4   3       2023-03-25
5   3       2023-04-02
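
One edge case may be worth checking in the query above. The transactions table stores plain 'YYYY-MM-DD' strings, and SQLite compares such values as text. If the converted Added_Date ends up in the temporary table with a time part (the 'YYYY-MM-DD HH:MM:SS' form is an assumption here, so inspect your own table), a transaction dated exactly on the Added_Date would sort before the lower bound and be dropped. The following sketch uses only literal values, not the tables from the answer, to show the comparison and how wrapping the lower bound in DATE() changes it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Literal values only; '2023-02-01 00:00:00' stands in for a stored
# Added_Date that carries a time part (an assumption about the table).
row = conn.execute(
    """
    SELECT
      '2023-02-01' BETWEEN '2023-02-01 00:00:00'
                       AND DATE('2023-02-01 00:00:00', '+30 day'),
      '2023-02-01' BETWEEN DATE('2023-02-01 00:00:00')
                       AND DATE('2023-02-01 00:00:00', '+30 day'),
      DATE('2023-02-01 00:00:00', '+30 day')
    """
).fetchone()

same_day_raw, same_day_normalized, end_date = row
print(same_day_raw)         # 0: text comparison excludes the same-day row
print(same_day_normalized)  # 1: DATE() on the lower bound includes it
print(end_date)             # 2023-03-03
```

If same-day transactions should count, using DATE(tmp.Added_Date) as the lower bound in the JOIN condition is one way to get that behavior. The sample data has no same-day transaction, so the output above is unaffected either way.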

It is also possible, of course, to compute the end date in df_A beforehand, using pd.offsets.Day:

df_A['End_Date'] = pd.to_datetime(df_A['Added_Date']) + pd.offsets.Day(30)
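
As a rough, self-contained sketch of that variant (not the exact code from the answer): both dates are written as 'YYYY-MM-DD' strings so they compare cleanly with transaction_date, the join uses the precomputed End_Date, and the result is read with pd.read_sql_query. The table name added_dates is made up for this example, and a regular table is used instead of a temporary one to keep the sketch short. The extra WHERE clause on the overall date range is only an assumption about what might help on a large table; whether it lets your engine skip partitions depends on the database and how the partitions are defined.

```python
import sqlite3
import pandas as pd

# Rebuild the sample transactions table from the question
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (ID INTEGER, transaction_date DATE)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, "2023-01-15"), (1, "2023-02-10"), (1, "2023-03-01"),
     (2, "2023-04-01"), (2, "2023-04-20"), (2, "2023-05-05"),
     (3, "2023-03-10"), (3, "2023-03-25"), (3, "2023-04-02")],
)

df_A = pd.DataFrame({"ID": [1, 2, 3],
                     "Added_Date": ["2023-02-01", "2023-04-15", "2023-03-17"]})

# Compute the end date in pandas, then store both dates as ISO date strings
added = pd.to_datetime(df_A["Added_Date"])
df_A["Added_Date"] = added.dt.strftime("%Y-%m-%d")
df_A["End_Date"] = (added + pd.offsets.Day(30)).dt.strftime("%Y-%m-%d")

# 'added_dates' is a hypothetical table name chosen for this sketch
df_A.to_sql("added_dates", conn, if_exists="replace", index=False)

query = """
SELECT tr.ID, tr.transaction_date
FROM transactions AS tr
INNER JOIN added_dates AS a
  ON tr.ID = a.ID
 AND tr.transaction_date BETWEEN a.Added_Date AND a.End_Date
WHERE tr.transaction_date BETWEEN ? AND ?  -- coarse overall range (optional)
ORDER BY tr.ID, tr.transaction_date
"""

# ISO date strings sort chronologically, so min/max give the overall range
overall_range = (df_A["Added_Date"].min(), df_A["End_Date"].max())
out = pd.read_sql_query(query, conn, params=overall_range)
print(out)
```

On the sample data this should return the same six rows as the Output above. On a table partitioned by Year, Month and Day, the equivalent coarse filter would presumably have to be written against those columns instead of transaction_date.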
