Summary: I want to recreate my SQL logic in Python so that I don't have to type out each join by hand when the number of joins gets too large to manage.
I have one table:
import pandas
data_table_one = {'store': ['A','B', 'C', 'C', 'C'],
'new_item': ['red car', 'red boat', 'red plane', 'green plane', 'red bike'],
'previous_item':['green car', 'green boat', 'green plane', 'yellow plane' , 'green bike'],
'change_date': ['2025-01','2025-01','2025-01','2024-01','2025-01']}
df_table_one = pandas.DataFrame(data_table_one)
df_table_one looks like this:
| store | new_item | previous_item | change_date |
|-------|-------------|---------------|-------------|
| A | red car | green car | 2025-01 |
| B | red boat | green boat | 2025-01 |
| C | red plane | green plane | 2025-01 |
| C | green plane | yellow plane | 2024-01 |
| C | red bike | green bike | 2025-01 |
Assume all items are unique per store: store A will only have one red car, but store B can also have a red car. For each chain of changes, I want the latest new_item (by max change_date) and the first previous_item (by min change_date), following the links back until every item has been traced to its origin.
The desired output is: red car maps to green car, red boat to green boat, red bike to green bike, and red plane to yellow plane, since yellow plane was first replaced by green plane, which was then replaced by red plane.
Desired Output
| store | latest_item | latest_change_date | first_item | first_change_date |
|-------|-------------|--------------------|--------------|-------------------|
| A | red car | 2025-01 | green car | 2025-01 |
| B | red boat | 2025-01 | green boat | 2025-01 |
| C | red plane | 2025-01 | yellow plane | 2024-01 |
| C | red bike | 2025-01 | green bike | 2025-01 |
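To make the tracing rule concrete, here is a minimal sketch of the behaviour I am after, written as a plain loop over a lookup dictionary. The helper name `trace_items` is hypothetical, and the sketch assumes each item has at most one previous_item per store (as stated above); it is an illustration of the rule, not a tuned solution.

```python
import pandas

data_table_one = {'store': ['A', 'B', 'C', 'C', 'C'],
                  'new_item': ['red car', 'red boat', 'red plane', 'green plane', 'red bike'],
                  'previous_item': ['green car', 'green boat', 'green plane', 'yellow plane', 'green bike'],
                  'change_date': ['2025-01', '2025-01', '2025-01', '2024-01', '2025-01']}
df_table_one = pandas.DataFrame(data_table_one)


def trace_items(df):
    """Hypothetical helper: follow previous_item links back to the first item."""
    # Lookup: (store, new_item) -> (previous_item, change_date)
    links = {
        (r.store, r.new_item): (r.previous_item, r.change_date)
        for r in df.itertuples(index=False)
    }
    # Any item that appears as a previous_item has been replaced,
    # so it cannot be the latest item of its chain.
    superseded = set(zip(df['store'], df['previous_item']))

    rows = []
    for (store, item), (prev, date) in links.items():
        if (store, item) in superseded:
            continue  # skip intermediate links such as green plane
        latest_item, latest_date = item, date
        seen = {item}  # guard against accidental cycles in the data
        # Keep stepping back while the previous item was itself a replacement.
        while (store, prev) in links and prev not in seen:
            seen.add(prev)
            prev, date = links[(store, prev)]
        rows.append((store, latest_item, latest_date, prev, date))

    return pandas.DataFrame(
        rows,
        columns=['store', 'latest_item', 'latest_change_date',
                 'first_item', 'first_change_date'],
    )


result = trace_items(df_table_one)
print(result)
```

The loop runs as many hops as the data needs, so nothing has to be edited when a chain grows from one change to two or more.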
I can currently do this in SQL (Redshift), but it quickly becomes unscalable when more than one join is needed or when the number of joins is not known in advance, because the code has to be updated by hand each time. For example, in December 2025 one join is enough, but in January 2026 two may be needed.
select
a.store,
a.new_item as latest_item,
a.change_date as latest_change_date,
b.previous_item as first_item,
b.change_date as first_change_date
from
df_table_one a
join
df_table_one b
on b.new_item = a.previous_item
and b.store = a.store
;
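For reference, a pandas version of the same single self-join might look like the sketch below. The renamed columns and the variable names `a`, `b` and `one_hop` are my own choices for illustration. It resolves exactly one hop, and because it is an inner join it only returns rows whose previous_item was itself a new_item in the same store, which is the part that does not scale.

```python
import pandas

data_table_one = {'store': ['A', 'B', 'C', 'C', 'C'],
                  'new_item': ['red car', 'red boat', 'red plane', 'green plane', 'red bike'],
                  'previous_item': ['green car', 'green boat', 'green plane', 'yellow plane', 'green bike'],
                  'change_date': ['2025-01', '2025-01', '2025-01', '2024-01', '2025-01']}
df_table_one = pandas.DataFrame(data_table_one)

# Left side (alias "a" in the SQL): the later change.
a = df_table_one.rename(columns={'new_item': 'latest_item',
                                 'change_date': 'latest_change_date'})

# Right side (alias "b" in the SQL): the earlier change.
# Its new_item is renamed to previous_item so it lines up with a.previous_item.
b = df_table_one.rename(columns={'new_item': 'previous_item',
                                 'previous_item': 'first_item',
                                 'change_date': 'first_change_date'})

# Equivalent of: on b.new_item = a.previous_item and b.store = a.store
one_hop = a.merge(b, on=['store', 'previous_item'], how='inner')
one_hop = one_hop[['store', 'latest_item', 'latest_change_date',
                   'first_item', 'first_change_date']]
print(one_hop)
```

With the sample data this returns only the red plane chain; every extra hop would need another merge like this one.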