Python PANDAS: Stack by Enumerated Date to Create Records Vectorized

Question

I have a dataframe in the following general format:

id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8

What I am trying to accomplish is to enumerate 'transaction_dt' based on the value of the 'units' field in the same record, unrolling each record into one new record per consecutive day, to produce something like this:

id,transaction_dt,measures
1,2018-01-01,30.5
1,2018-01-02,30.5
1,2018-01-03,30.5
1,2018-01-04,30.5
1,2018-01-03,26.3
1,2018-01-04,26.3
1,2018-01-05,26.3
1,2018-01-06,26.3
2,2018-01-01,12.7
2,2018-01-02,12.7
2,2018-01-03,12.7
2,2018-01-03,8.8
2,2018-01-04,8.8
2,2018-01-05,8.8

I have been trying to create a vectorized, performant version of the answer to my prior question, which someone was kind enough to answer here: Python PANDAS: Stack and Enumerate Date to Create New Records

df.set_index('transaction_dt', inplace=True)

df.apply(lambda x: pd.Series(pd.date_range(x.name, periods=x.units)), axis=1). \
    stack(). \
    reset_index(level=1). \
    join(df['measures']). \
    drop('level_1', axis=1). \
    reset_index(). \
    rename(columns={0:'enumerated_dt'})

This does work, but I have a very large dataset to run it on, so I need to optimize it further. The answerer suggests creating an array of all dates, which I can do with something like this:

date_range = pd.date_range('2004-01-01', '2017-12-31', freq='1D')

And he suggests then reindexing the array and forward filling the values somehow. If anyone could help me, I would sincerely appreciate it!

jezrael · Accepted Answer · 2018-02-09 17:47:45Z


You can use numpy.repeat to repeat each index label according to the units column, then select with loc to duplicate the rows. Next, number the copies of each original row with cumcount, convert the counts with to_timedelta, and add them to the transaction_dt column. Finally, call reset_index to restore a default unique index:

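# assumes numpy as np, pandas as pd, a unique index, and transaction_dt already datetime64
df = df.loc[np.repeat(df.index, df['units'])]  # repeat each row `units` times
# cumcount numbers the copies of each original row 0, 1, 2, ...; add that many days
df['transaction_dt'] += pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df = df.reset_index(drop=True)
print(df)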
    id transaction_dt  units  measures
0    1     2018-01-01      4      30.5
1    1     2018-01-02      4      30.5
2    1     2018-01-03      4      30.5
3    1     2018-01-04      4      30.5
4    1     2018-01-03      4      26.3
5    1     2018-01-04      4      26.3
6    1     2018-01-05      4      26.3
7    1     2018-01-06      4      26.3
8    2     2018-01-01      3      12.7
9    2     2018-01-02      3      12.7
10   2     2018-01-03      3      12.7
11   2     2018-01-03      3       8.8
12   2     2018-01-04      3       8.8
13   2     2018-01-05      3       8.8
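
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same repeat-and-offset idea. The sample data is copied from the question. The helper name expand_by_units is hypothetical and not part of pandas. It also assumes that transaction_dt has been parsed with pd.to_datetime, and it drops the units column so the result matches the layout asked for in the question.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; dates parsed up front so that
# timedelta arithmetic works.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "transaction_dt": pd.to_datetime(
        ["2018-01-01", "2018-01-03", "2018-01-01", "2018-01-03"]
    ),
    "units": [4, 4, 3, 3],
    "measures": [30.5, 26.3, 12.7, 8.8],
})


def expand_by_units(frame):
    """Hypothetical helper: repeat each row `units` times, one day apart."""
    # Start from a fresh 0..n-1 index so every original row has a unique label.
    frame = frame.reset_index(drop=True)
    labels = np.repeat(frame.index.to_numpy(), frame["units"].to_numpy())
    out = frame.loc[labels]
    # cumcount() numbers the copies of each original row 0, 1, 2, ...
    day_offset = pd.to_timedelta(out.groupby(level=0).cumcount(), unit="D")
    out = out.assign(transaction_dt=out["transaction_dt"] + day_offset)
    return out.drop(columns="units").reset_index(drop=True)


expanded = expand_by_units(df)
print(expanded)
```

Resetting the index inside the helper is a defensive choice: the groupby(level=0) step relies on each original row having its own index label, which may not hold if the input frame was filtered or concatenated beforehand.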


1 Comment

Pylander Over a year ago

@jezrael Top notch! It is still running, but it seems much faster, with better memory utilization, etc. In addition, the code is unbelievably succinct and dense while still being very readable. Thanks as always for sharing your expertise.
