I have a dataframe in the following general format:

id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8 

What I am trying to accomplish is to enumerate consecutive dates starting at 'transaction_dt', one for each of the 'units' in the same record, and unroll them into new records to produce something like this:

id,transaction_dt,measures
1,2018-01-01,30.5
1,2018-01-02,30.5
1,2018-01-03,30.5
1,2018-01-04,30.5
1,2018-01-03,26.3
1,2018-01-04,26.3
1,2018-01-05,26.3
1,2018-01-06,26.3
2,2018-01-01,12.7
2,2018-01-02,12.7
2,2018-01-03,12.7
2,2018-01-03,8.8
2,2018-01-04,8.8
2,2018-01-05,8.8 

I have been trying to create a vectorized, more performant version of the answer that someone kindly gave to my prior question, Python PANDAS: Stack and Enumerate Date to Create New Records:

df.set_index('transaction_dt', inplace=True)

df.apply(lambda x: pd.Series(pd.date_range(x.name, periods=x.units)), axis=1). \
    stack(). \
    reset_index(level=1). \
    join(df['measures']). \
    drop('level_1', axis=1). \
    reset_index(). \
    rename(columns={0:'enumerated_dt'}) 

This works, but I have a very large dataset to run it on, so I need to optimize it further. The answerer suggested creating an array of all dates, which I can do with something like this:

date_range = pd.date_range('2004-01-01', '2017-12-31', freq='1D')

He then suggested reindexing against that array and forward-filling the values somehow. If anyone could help me, I would sincerely appreciate it!

1 Answer

You can use numpy.repeat to repeat each index label as many times as the value in the units column, then select with loc to duplicate the rows. Next, number the copies of each original row with cumcount, convert the counts with to_timedelta, and add them to the transaction_dt column. Finally, call reset_index to get a default unique index:

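# Repeat each index label `units` times and select with loc to duplicate the rows
df = df.loc[np.repeat(df.index, df['units'])]
# Number the copies of each original row 0, 1, 2, ... and add that many days
df['transaction_dt'] += pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
# Restore a unique default index
df = df.reset_index(drop=True)
print(df)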
    id transaction_dt  units  measures
0    1     2018-01-01      4      30.5
1    1     2018-01-02      4      30.5
2    1     2018-01-03      4      30.5
3    1     2018-01-04      4      30.5
4    1     2018-01-03      4      26.3
5    1     2018-01-04      4      26.3
6    1     2018-01-05      4      26.3
7    1     2018-01-06      4      26.3
8    2     2018-01-01      3      12.7
9    2     2018-01-02      3      12.7
10   2     2018-01-03      3      12.7
11   2     2018-01-03      3       8.8
12   2     2018-01-04      3       8.8
13   2     2018-01-05      3       8.8
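
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same repeat/cumcount approach. It assumes the sample data from the question, read from an inline CSV string, and it parses transaction_dt to datetime first, since adding a timedelta needs a datetime column. The names csv, out and offsets are only illustrative, and the final drop of units is added to match the desired output in the question.

```python
import io

import numpy as np
import pandas as pd

# Sample data copied from the question (inline CSV is an assumption for the demo).
csv = """id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8
"""

# parse_dates gives a datetime64 column, which the timedelta addition requires.
df = pd.read_csv(io.StringIO(csv), parse_dates=['transaction_dt'])

# Repeat each index label `units` times, then select those labels to duplicate rows.
out = df.loc[np.repeat(df.index, df['units'])]

# Within each original row (same index label), number the copies 0, 1, 2, ...
offsets = out.groupby(level=0).cumcount()

# Shift each copy forward by its offset in days.
out['transaction_dt'] = out['transaction_dt'] + pd.to_timedelta(offsets, unit='D')

# Drop the helper column and restore a unique default index.
out = out.drop(columns='units').reset_index(drop=True)
print(out)
```

The row count of the result equals the sum of the units column, so on a very large frame it may be worth checking df['units'].sum() before expanding.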
1 Comment

@jezrael Top notch! It's currently running, but it seems much faster, with better memory utilization, etc. In addition, the code is unbelievably succinct and dense yet still very readable. Thanks as always for sharing your expertise.
