Pandas Dataframe, Apply Function, Return Index

Question

I have a dataframe, df, with two columns: IDs and Dates. It records events for IDs at different dates. Neither field is unique, but rows are unique by combination (no ID has more than one record on the same date).

I have the following function to add a new column to determine, at a given record/date, whether or not (TRUE/FALSE) the ID has another record at any future date:

def f(df):
    count = pd.Series(np.arange(1, len(df)+1), index=df["date"])
    day = count.index.shift(0, freq="D")
    next18month = count.index.shift(3000, freq="D")
    result =  count.asof(next18month).fillna(0).values - count.asof(day).fillna(0).values
    if result[0] > 0:
        return pd.Series(1, df.index)
    else:
        return pd.Series(0, df.index)

Then I can apply the function to my dataframe, grouped by ID:

df["everagain"] = df.groupby("id").apply(f)

It doesn't work. I believe the result[0] is wrong. It works for the first time an ID appears (it counts the second time, tripping a true return), but if there is a second record for a given id, and no third record, it still returns a '1' (True) at the second record. Can someone help with the correct notation?

(Note: 3000 days is enough to count as forever given my dataset).

For example, if df looked like:

   |  ID  |  Date
0  |  A   |  2010-01-01
1  |  A   |  2010-02-01
2  |  A   |  2010-02-15
3  |  B   |  2010-01-01
4  |  C   |  2010-02-01
5  |  C   |  2010-02-15

Then output would hopefully look like:

   |  ID  |  Date        | everagain
0  |  A   |  2010-01-01  | 1
1  |  A   |  2010-02-01  | 1
2  |  A   |  2010-02-15  | 0
3  |  B   |  2010-01-01  | 0
4  |  C   |  2010-02-01  | 1
5  |  C   |  2010-02-15  | 0

Can you post a sample of your frame and expected output?

– Jeff, Commented Sep 24, 2013 at 0:45

DSM · Accepted Answer · 2013-09-24 01:35:51Z

I originally thought I could use .groupby("ID").last() but couldn't quite get it to work. (We could do it with transform, of course, but that feels like too much firepower.)
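For reference, the transform route mentioned above might look roughly like the sketch below. It assumes the column names ID and Date from the example frame, and it relies on the question's guarantee that no ID has two records on the same date, so a row has a later record exactly when its date is earlier than the latest date for its ID. The sample frame is rebuilt here only so the snippet runs on its own.

```python
import pandas as pd

# Sample frame from the question (rebuilt here for a self-contained example).
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "C", "C"],
    "Date": pd.to_datetime([
        "2010-01-01", "2010-02-01", "2010-02-15",
        "2010-01-01", "2010-02-01", "2010-02-15",
    ]),
})

# Broadcast each ID's latest date back onto every row of that ID.
last_date = df.groupby("ID")["Date"].transform("max")

# A row has a later record for the same ID when its date precedes that maximum.
df["everagain"] = (df["Date"] < last_date).astype(int)

print(df)
```

Unlike the comparison with the next row, this version does not need the frame to be sorted or the IDs to be contiguous.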

If your data is ordered by date and has contiguous IDs, however, you can simply compare whether ID is equal to the next ID. For example:

>>> df = df.sort_values(["ID", "Date"])  # df.sort(...) on the old pandas versions current in 2013
>>> df
  ID                Date
0  A 2010-01-01 00:00:00
1  A 2010-02-01 00:00:00
2  A 2010-02-15 00:00:00
3  B 2010-01-01 00:00:00
4  C 2010-02-01 00:00:00
5  C 2010-02-15 00:00:00
>>> df["everagain"] = df["ID"] == df["ID"].shift(-1)
>>> df
  ID                Date everagain
0  A 2010-01-01 00:00:00      True
1  A 2010-02-01 00:00:00      True
2  A 2010-02-15 00:00:00     False
3  B 2010-01-01 00:00:00     False
4  C 2010-02-01 00:00:00      True
5  C 2010-02-15 00:00:00     False

If you wanted ones and zeroes instead of True and False, you could use (df["ID"] == df["ID"].shift(-1))*1 or (df["ID"] == df["ID"].shift(-1)).astype(int) to convert them.

user1893148 Over a year ago

Really clever, and works well. Thanks. What if, however, I purely wanted to know whether the ID had another date in the next 3000 days. Is there not a way to make my function work?
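One possible way to handle the windowed variant asked about here, offered as a sketch rather than a fix to the original function: sort by ID and Date, take the next date within each ID using a grouped shift, and check whether the gap is at most 3000 days. The column name again_within_window, the extra ID "D" in the sample data, and treating a gap of exactly 3000 days as inside the window are all assumptions made for illustration.

```python
import pandas as pd

# Sample frame from the question, plus a hypothetical ID "D" whose two
# records are more than 3000 days apart.
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "C", "C", "D", "D"],
    "Date": pd.to_datetime([
        "2010-01-01", "2010-02-01", "2010-02-15",
        "2010-01-01", "2010-02-01", "2010-02-15",
        "2000-01-01", "2010-01-01",
    ]),
})

df = df.sort_values(["ID", "Date"])

# Next date for the same ID; NaT on the last record of each ID.
next_date = df.groupby("ID")["Date"].shift(-1)

# Time until the next record; NaT where there is no next record.
gap = next_date - df["Date"]

# Comparisons against NaT evaluate to False, so last records get 0.
window = pd.Timedelta(days=3000)
df["again_within_window"] = (gap <= window).astype(int)

print(df)
```

Because the frame is sorted by date within each ID, the next record is the nearest future one, so checking only that gap is enough to decide whether any record falls inside the window.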
