
I have a dataframe, df, with two columns: IDs and Dates. It records events for IDs at different dates. Neither field is unique, but rows are unique by combination (no ID has more than one record on the same date).

I have the following function to add a new column to determine, at a given record/date, whether or not (TRUE/FALSE) the ID has another record at any future date:

def f(df):
    count = pd.Series(np.arange(1, len(df)+1), index=df["date"])
    day = count.index.shift(0, freq="D")
    next18month = count.index.shift(3000, freq="D")
    result =  count.asof(next18month).fillna(0).values - count.asof(day).fillna(0).values
    if result[0] > 0:
        return pd.Series(1, df.index)
    else:
        return pd.Series(0, df.index)

Then I can apply the function to my dataframe, grouped by ID:

df["everagain"] = df.groupby("id").apply(f)

It doesn't work. I believe the result[0] is wrong. It works for the first time an ID appears (it counts the second time, tripping a true return), but if there is a second record for a given id, and no third record, it still returns a '1' (True) at the second record. Can someone help with the correct notation?

(Note: 3000 days is enough to count as forever given my dataset).

For example, if df looked like:

   |  ID  |  Date
0  |  A   |  2010-01-01
1  |  A   |  2010-02-01
2  |  A   |  2010-02-15
3  |  B   |  2010-01-01
4  |  C   |  2010-02-01
5  |  C   |  2010-02-15

Then output would hopefully look like:

   |  ID  |  Date        | everagain
0  |  A   |  2010-01-01  | 1
1  |  A   |  2010-02-01  | 1
2  |  A   |  2010-02-15  | 0
3  |  B   |  2010-01-01  | 0
4  |  C   |  2010-02-01  | 1
5  |  C   |  2010-02-15  | 0
  • Can you post a sample of your frame and expected output? Commented Sep 24, 2013 at 0:45

1 Answer


I originally thought I could use .groupby("ID").last() but couldn't quite get it to work. (We could do it with transform, of course, but that feels like too much firepower.)

If your data is ordered by date and has contiguous IDs, however, you can simply compare whether ID is equal to the next ID. For example:

>>> df = df.sort_values(["ID", "Date"])
>>> df
  ID                Date
0  A 2010-01-01 00:00:00
1  A 2010-02-01 00:00:00
2  A 2010-02-15 00:00:00
3  B 2010-01-01 00:00:00
4  C 2010-02-01 00:00:00
5  C 2010-02-15 00:00:00
>>> df["everagain"] = df["ID"] == df["ID"].shift(-1)
>>> df
  ID                Date everagain
0  A 2010-01-01 00:00:00      True
1  A 2010-02-01 00:00:00      True
2  A 2010-02-15 00:00:00     False
3  B 2010-01-01 00:00:00     False
4  C 2010-02-01 00:00:00      True
5  C 2010-02-15 00:00:00     False

If you wanted ones and zeroes instead of True and False, you could use (df["ID"] == df["ID"].shift(-1))*1 or (df["ID"] == df["ID"].shift(-1)).astype(int) to convert them.
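
For comparison, the transform route mentioned at the top might look roughly like the sketch below. It assumes the columns are named "ID" and "Date" as in the example frame, and it relies on the question's guarantee that no ID has two records on the same date; the sample frame is rebuilt here only so the snippet runs on its own. It does not need the frame to be sorted.

```python
import pandas as pd

# Sample frame rebuilt from the question's example
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "C", "C"],
    "Date": pd.to_datetime([
        "2010-01-01", "2010-02-01", "2010-02-15",
        "2010-01-01", "2010-02-01", "2010-02-15",
    ]),
})

# Latest date seen for each ID, broadcast back onto every row of that ID
last_date = df.groupby("ID")["Date"].transform("max")

# A row has a later record exactly when its date is before the ID's last date
# (valid because ID/Date pairs are unique)
df["everagain"] = (df["Date"] < last_date).astype(int)
print(df)
```

The shift comparison is shorter, but this version keeps working if the rows arrive in arbitrary order.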


1 Comment

Really clever, and works well. Thanks. What if, however, I purely wanted to know whether the ID had another date in the next 3000 days. Is there not a way to make my function work?
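
One possible way to handle the windowed variant raised in this comment, offered as a sketch rather than a repair of the original function: within each ID, look up the next date and check whether the gap fits inside the window. The helper name has_record_within and its days parameter are made up for illustration, and the column names "ID" and "Date" are assumed from the example frame.

```python
import pandas as pd

def has_record_within(df, days):
    """Return 1 where the same ID has a later record within `days` days, else 0.

    Hypothetical helper; assumes columns named "ID" and "Date".
    """
    ordered = df.sort_values(["ID", "Date"])
    # Next date for the same ID; NaT on each ID's last record
    next_date = ordered.groupby("ID")["Date"].shift(-1)
    gap = next_date - ordered["Date"]
    # Comparisons against NaT evaluate to False, so last records get 0
    return (gap <= pd.Timedelta(days=days)).astype(int)

# Sample frame rebuilt from the question's example
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "C", "C"],
    "Date": pd.to_datetime([
        "2010-01-01", "2010-02-01", "2010-02-15",
        "2010-01-01", "2010-02-01", "2010-02-15",
    ]),
})

# Assignment aligns on the index, so the original row order is preserved
df["everagain"] = has_record_within(df, 3000)
print(df)
```

Only the immediately following record needs checking: if any later record falls inside the window, the nearest one does too.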
