I have a dataframe and want to update it, or create a new dataframe, based on input from an SQL table. Dataframe A has two columns (ID and Added_Date).

The SQL table, on the other hand, has a few more columns, including ID, Transaction_Date, Year, Month and Day. My idea is to join dataframe A to the SQL table on ID and then keep only the records whose Transaction_Date falls within the 30 days after Added_Date. In summary, I want a dataframe with all transactions (from the SQL table) that happened within 30 days after the Added_Date in df A. The SQL table is quite large and is partitioned by Year, Month and Day. How can I optimize this process?

I understand the join can be done once the dataframe is converted to a tuple or maybe a dictionary, but I don't know how to proceed beyond that. Sample code is below:

import sqlite3
import pandas as pd 

# create df 

data = {'ID': [1, 2, 3], 'Added_Date': ['2023-02-01', '2023-04-15', '2023-03-17']}
df_A = pd.DataFrame(data)

Below is the code to create a sample in-memory transactions table in SQLite:

# Create an in-memory SQLite database
conn = sqlite3.connect(':memory:')
c = conn.cursor()

# Create the transactions table
c.execute('''CREATE TABLE transactions
             (ID INTEGER, transaction_date DATE)''')

# Insert sample data into the transactions table
c.execute('''INSERT INTO transactions VALUES
             (1, '2023-01-15'), (1, '2023-02-10'), (1, '2023-03-01'),
             (2, '2023-04-01'), (2, '2023-04-20'), (2, '2023-05-05'),
             (3, '2023-03-10'), (3, '2023-03-25'), (3, '2023-04-02')''') 

Expected outcome should be something like this:

ID  transaction_date
1        2023-02-10
1        2023-03-01
2        2023-04-20
2        2023-05-05
3        2023-03-10
3        2023-03-25
3        2023-04-02

I hope that makes it clearer.

  • Please add a data sample of your dataframe and SQL database. Commented Apr 27, 2024 at 4:03
  • Added more details. Commented Apr 27, 2024 at 4:24
  • SQLite (the sqlite tag) and Microsoft SQL Server (the sql-server tag) are not even remotely the same thing. Which database system are you actually using? (Please correct your tags.) Commented Apr 27, 2024 at 6:29
  • Your expected output includes '2023-03-10' for ID == 3. That seems incorrect, given '2023-03-17' in df_A. No? Commented Apr 27, 2024 at 11:15

1 Answer

Here's one approach: load df_A into a temporary table and let the database do the join and the date filter.

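import re

# Parse Added_Date so pandas maps the column to a date/time type in the schema
df_A['Added_Date'] = pd.to_datetime(df_A['Added_Date'])

# Generate a CREATE TABLE statement from df_A and turn it into CREATE TEMPORARY TABLE
create_tmp = pd.io.sql.get_schema(df_A, 'temporary_table')
create_tmp = re.sub(
    "^(CREATE TABLE)?",
    "CREATE TEMPORARY TABLE",
    create_tmp
)
c.execute(create_tmp)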

df_A.to_sql(name='temporary_table', con=conn, if_exists='append', index=False)

query = """
SELECT 
  tr.ID, 
  tr.transaction_date 
FROM 
  transactions AS tr 
  INNER JOIN temporary_table AS tmp ON tr.ID = tmp.ID 
  AND tr.transaction_date BETWEEN tmp.Added_Date 
  AND DATE(tmp.Added_Date, '+30 day')
"""

out = pd.DataFrame(data=c.execute(query).fetchall(), 
                   columns=[desc[0] for desc in c.description])

Output

   ID transaction_date
0   1       2023-02-10
1   1       2023-03-01
2   2       2023-04-20
3   2       2023-05-05
4   3       2023-03-25
5   3       2023-04-02
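
For reference, the steps above can be collected into one script that runs on its own. This is only a minimal sketch under a few assumptions that are not in the original snippets: the temporary table is created by hand rather than via get_schema, Added_Date is kept as plain 'YYYY-MM-DD' text so both sides of the comparison use the same format, the result is read with pd.read_sql_query, and an ORDER BY is added so the row order is deterministic.

```python
import sqlite3
import pandas as pd

# Sample data from the question
df_A = pd.DataFrame({
    "ID": [1, 2, 3],
    "Added_Date": ["2023-02-01", "2023-04-15", "2023-03-17"],
})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (ID INTEGER, transaction_date DATE)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [
        (1, "2023-01-15"), (1, "2023-02-10"), (1, "2023-03-01"),
        (2, "2023-04-01"), (2, "2023-04-20"), (2, "2023-05-05"),
        (3, "2023-03-10"), (3, "2023-03-25"), (3, "2023-04-02"),
    ],
)

# Temporary table holding df_A; Added_Date stays ISO text (assumption)
conn.execute(
    "CREATE TEMPORARY TABLE temporary_table (ID INTEGER, Added_Date TEXT)"
)
conn.executemany(
    "INSERT INTO temporary_table VALUES (?, ?)",
    [(int(i), str(d)) for i, d in zip(df_A["ID"], df_A["Added_Date"])],
)

# Same join and 30-day window as above, with a deterministic row order
query = """
SELECT tr.ID, tr.transaction_date
FROM transactions AS tr
INNER JOIN temporary_table AS tmp
  ON tr.ID = tmp.ID
 AND tr.transaction_date BETWEEN tmp.Added_Date
                             AND DATE(tmp.Added_Date, '+30 day')
ORDER BY tr.ID, tr.transaction_date
"""

out = pd.read_sql_query(query, conn)
print(out)
```

With the sample data this should return the same six rows as the output shown above.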

It is also possible, of course, to compute the end date in df_A up front, using pd.offsets.Day, and compare against that column in the query:

df_A['End_Date'] = pd.to_datetime(df_A['Added_Date']) + pd.offsets.Day(30)
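
A sketch of how the End_Date variant might look end to end, under stated assumptions: the table name tmp_windows is hypothetical, both dates are written as 'YYYY-MM-DD' text, and an extra WHERE clause restricts the scan to the overall date range covered by df_A. On a large table partitioned by Year, Month and Day, a coarse range filter like this may help the engine skip partitions, but whether it does depends on the database system and may require expressing the filter on the partition columns themselves, so treat that part as an idea to test rather than a guarantee.

```python
import sqlite3
import pandas as pd

df_A = pd.DataFrame({
    "ID": [1, 2, 3],
    "Added_Date": ["2023-02-01", "2023-04-15", "2023-03-17"],
})
df_A["Added_Date"] = pd.to_datetime(df_A["Added_Date"])
df_A["End_Date"] = df_A["Added_Date"] + pd.offsets.Day(30)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (ID INTEGER, transaction_date DATE)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [
        (1, "2023-01-15"), (1, "2023-02-10"), (1, "2023-03-01"),
        (2, "2023-04-01"), (2, "2023-04-20"), (2, "2023-05-05"),
        (3, "2023-03-10"), (3, "2023-03-25"), (3, "2023-04-02"),
    ],
)

# Hypothetical temp table with one (ID, start, end) window per row of df_A
fmt = "%Y-%m-%d"
conn.execute(
    "CREATE TEMPORARY TABLE tmp_windows (ID INTEGER, Added_Date TEXT, End_Date TEXT)"
)
conn.executemany(
    "INSERT INTO tmp_windows VALUES (?, ?, ?)",
    [
        (int(i), a.strftime(fmt), e.strftime(fmt))
        for i, a, e in zip(df_A["ID"], df_A["Added_Date"], df_A["End_Date"])
    ],
)

# Overall bounds across all windows, used as a coarse pre-filter
lo = df_A["Added_Date"].min().strftime(fmt)
hi = df_A["End_Date"].max().strftime(fmt)

query = """
SELECT tr.ID, tr.transaction_date
FROM transactions AS tr
INNER JOIN tmp_windows AS w
  ON tr.ID = w.ID
 AND tr.transaction_date BETWEEN w.Added_Date AND w.End_Date
WHERE tr.transaction_date BETWEEN ? AND ?
ORDER BY tr.ID, tr.transaction_date
"""

out = pd.read_sql_query(query, conn, params=(lo, hi))
print(out)
```

The per-ID window is still enforced in the join condition; the WHERE clause only narrows the range of transactions the database has to consider.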