
Summary: I want to recreate my SQL logic in Python so that I don't have to manually type out each join when the number of required joins grows too large to handle.

I have one table

import pandas
data_table_one = {'store': ['A','B', 'C', 'C', 'C'],
        'new_item': ['red car', 'red boat', 'red plane', 'green plane', 'red bike'],
        'previous_item':['green car', 'green boat', 'green plane', 'yellow plane' , 'green bike'],
        'change_date': ['2025-01','2025-01','2025-01','2024-01','2025-01']}
df_table_one = pandas.DataFrame(data_table_one)

df_table_one is shown below:

| store | new_item    | previous_item | change_date |
|-------|-------------|---------------|-------------|
| A     | red car     | green car     | 2025-01     |
| B     | red boat    | green boat    | 2025-01     |
| C     | red plane   | green plane   | 2025-01     |
| C     | green plane | yellow plane  | 2024-01     |
| C     | red bike    | green bike    | 2025-01     |

Assume all items are unique per store, so store A will only ever have one red car, though store B can also have a red car. I want to get the latest new_item (the one with the max change_date) and the first previous_item (from the min change_date), tracing each chain back until every item is resolved.

The desired output joins red car to green car, red boat to green boat, and red bike to green bike. red plane joins to yellow plane, because yellow plane first changed to green plane and green plane then changed to red plane.

Desired Output

| store | latest_item | latest_change_date | first_item   | first_change_date |
|-------|-------------|--------------------|--------------|-------------------|
| A     | red car     | 2025-01            | green car    | 2025-01           |
| B     | red boat    | 2025-01            | green boat   | 2025-01           |
| C     | red plane   | 2025-01            | yellow plane | 2024-01           |
| C     | red bike    | 2025-01            | green bike   | 2025-01           |
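To make the chain rule concrete: for store C, "red plane" traces back through "green plane" to "yellow plane". A minimal illustrative sketch (the `links` dict and `trace_back` helper are hypothetical, not from the question):

```python
# previous_item lookup for store C (illustrative subset of the table)
links = {
    'red plane': 'green plane',
    'green plane': 'yellow plane',
    'red bike': 'green bike',
}

def trace_back(item, links):
    """Follow previous_item links until the chain ends."""
    while item in links:
        item = links[item]
    return item

print(trace_back('red plane', links))  # -> yellow plane
print(trace_back('red bike', links))   # -> green bike
```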

I can currently do this via SQL (Redshift), but it quickly becomes unscalable when more than one join is needed, or when the number of joins is not known in advance, so the code has to be updated manually each time; e.g. in December 2025 it is one join, but in January 2026 it could be two joins.

select
    a.store,
    a.new_item as latest_item,
    a.change_date as latest_change_date,
    b.previous_item as first_item,
    b.change_date as first_change_date
from
    df_table_one a
join
    df_table_one b
    on b.new_item = a.previous_item
    and b.store = a.store
;
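For reference, this single self-join maps one-to-one onto a pandas merge (a sketch reusing the frame defined above). Note the inner join only returns chains of exactly two links, and every extra link would require yet another merge, which is exactly the scaling problem described:

```python
import pandas as pd

df_table_one = pd.DataFrame({
    'store': ['A', 'B', 'C', 'C', 'C'],
    'new_item': ['red car', 'red boat', 'red plane', 'green plane', 'red bike'],
    'previous_item': ['green car', 'green boat', 'green plane', 'yellow plane', 'green bike'],
    'change_date': ['2025-01', '2025-01', '2025-01', '2024-01', '2025-01'],
})

# a join b on b.new_item = a.previous_item and b.store = a.store
joined = df_table_one.merge(
    df_table_one,
    left_on=['store', 'previous_item'],
    right_on=['store', 'new_item'],
    suffixes=('', '_b'),  # left = a, right = b
)
result = joined[
    ['store', 'new_item', 'change_date', 'previous_item_b', 'change_date_b']
].rename(columns={
    'new_item': 'latest_item',
    'change_date': 'latest_change_date',
    'previous_item_b': 'first_item',
    'change_date_b': 'first_change_date',
})
print(result)  # only the red plane -> yellow plane chain matches
```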

1 Answer


This is not at all easy. In SQL, this kind of chain-following is typically solved with recursive common table expressions.

In a Python application, a chain of unknown length cannot be expressed directly as a fixed set of joins either. One way to model it is as a disjoint connected-component graph problem:

import networkx
import pandas as pd

df_table_one = pd.DataFrame({
    'store': ['A', 'B', 'C', 'C', 'C'],
    'new_item':     [  'red car',   'red boat',   'red plane',  'green plane',   'red bike'],
    'previous_item':['green car', 'green boat', 'green plane', 'yellow plane', 'green bike'],
    'change_date': ['2025-01','2025-01','2025-01','2024-01','2025-01'],
})
df_table_one['change_date'] = pd.to_datetime(df_table_one['change_date'])
print(df_table_one)

# Label each row with a per-store connected-component id
component_id = 0
for store, items in df_table_one.groupby('store')[['previous_item', 'new_item']]:
    g = networkx.Graph()
    g.add_edges_from(items.values)

    for component in networkx.connected_components(g):
        predicate = items.isin(component).any(axis=1)
        df_table_one.loc[predicate.index[predicate], 'component'] = component_id
        component_id += 1

# Join between new and previous nodes; this becomes an anti-join after the isna() filter
joined = pd.merge(
    left=df_table_one, right=df_table_one,
    left_on=['store', 'component', 'new_item'], right_on=['store', 'component', 'previous_item'],
    suffixes=['_prev', '_next'], how='outer',
)

# Merge path endpoints
output = pd.merge(
    left=joined.loc[
        joined['previous_item_next'].isna(),
        ['store', 'component', 'new_item_prev', 'change_date_prev'],
    ],
    right=joined.loc[
        joined['new_item_prev'].isna(),
        ['store', 'component', 'previous_item_next', 'change_date_next'],
    ],
    left_on=['store', 'component'],
    right_on=['store', 'component'], how='inner',
).drop(columns='component').rename(columns={
    'new_item_prev': 'latest_item',
    'change_date_prev': 'latest_change_date',
    'previous_item_next': 'first_item',
    'change_date_next': 'first_change_date',
})

print(output)
  store latest_item latest_change_date    first_item first_change_date
0     A     red car         2025-01-01     green car        2025-01-01
1     B    red boat         2025-01-01    green boat        2025-01-01
2     C   red plane         2025-01-01  yellow plane        2024-01-01
3     C    red bike         2025-01-01    green bike        2025-01-01
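If pulling in networkx is undesirable, the component-labelling step can also be done with a small union-find (disjoint-set) structure; a dependency-free sketch under the same assumption that items are unique per store:

```python
import pandas as pd

df = pd.DataFrame({
    'store': ['A', 'B', 'C', 'C', 'C'],
    'new_item': ['red car', 'red boat', 'red plane', 'green plane', 'red bike'],
    'previous_item': ['green car', 'green boat', 'green plane', 'yellow plane', 'green bike'],
    'change_date': ['2025-01', '2025-01', '2025-01', '2024-01', '2025-01'],
})

def find(parent, x):
    # Path-halving find: walk to the root, shortening the chain as we go
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# Union each (store, new_item) node with its (store, previous_item) node
parent = {}
for store, new, prev in df[['store', 'new_item', 'previous_item']].itertuples(index=False):
    a, b = (store, new), (store, prev)
    parent.setdefault(a, a)
    parent.setdefault(b, b)
    parent[find(parent, a)] = find(parent, b)

# Label each row with its chain's root node
df['component'] = pd.Series(
    [find(parent, (s, n)) for s, n in zip(df['store'], df['new_item'])]
)
print(df)
```

The `component` column can then stand in for the networkx labelling above, and the endpoint merges proceed unchanged.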