I have a dataframe, df, with two columns: IDs and Dates. It records events for IDs at different dates. Neither field is unique, but rows are unique by combination (no ID has more than one record on the same date).
I have the following function to add a new column to determine, at a given record/date, whether or not (TRUE/FALSE) the ID has another record at any future date:
import numpy as np
import pandas as pd

def f(df):
    # cumulative event count for this id, indexed by date
    count = pd.Series(np.arange(1, len(df) + 1), index=df["date"])
    day = count.index.shift(0, freq="D")
    next18month = count.index.shift(3000, freq="D")
    # number of events between each date and 3000 days later
    result = count.asof(next18month).fillna(0).values - count.asof(day).fillna(0).values
    if result[0] > 0:
        return pd.Series(1, df.index)
    else:
        return pd.Series(0, df.index)
Then I can apply the function to my dataframe, grouped by ID:
df["everagain"] = df.groupby("id").apply(f)
It doesn't work; I believe result[0] is the problem. It behaves correctly the first time an ID appears (the second record trips a TRUE return), but if an ID has a second record and no third, the function still returns 1 (TRUE) at that second record. Can someone help with the correct notation?
(Note: 3000 days is enough to count as forever given my dataset).
For example, if df looked like:
  | ID | Date
0 | A  | 2010-01-01
1 | A  | 2010-02-01
2 | A  | 2010-02-15
3 | B  | 2010-01-01
4 | C  | 2010-02-01
5 | C  | 2010-02-15
Then output would hopefully look like:
  | ID | Date       | everagain
0 | A  | 2010-01-01 | 1
1 | A  | 2010-02-01 | 1
2 | A  | 2010-02-15 | 0
3 | B  | 2010-01-01 | 0
4 | C  | 2010-02-01 | 1
5 | C  | 2010-02-15 | 0
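For anyone who wants to test a fix, here is a minimal reproduction of the example above (the lowercase column names are my assumption, based on the df["date"] and groupby("id") usage in the code):

```python
import pandas as pd

# Sample data matching the example table
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "C", "C"],
    "date": pd.to_datetime([
        "2010-01-01", "2010-02-01", "2010-02-15",
        "2010-01-01", "2010-02-01", "2010-02-15",
    ]),
})

# Desired result: 1 whenever the same id has a record at a later date, else 0
expected = pd.Series([1, 1, 0, 0, 1, 0], name="everagain")

# One way to state the desired semantics directly (not using asof):
# a row is 1 iff its date is strictly before the latest date for that id.
check = (df.groupby("id")["date"].transform("max") > df["date"]).astype(int)
```

The `check` line is only there to make the intended semantics unambiguous; the question is still about why the asof-based version returns 1 at the last record of multi-record IDs.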