I'm trying to use one column containing the start index to subselect a string column.

import pandas as pd
df = pd.DataFrame({'string': ['abcdef', 'bcdefg'], 'start_index': [3, 5]})
expected = pd.Series(['def', 'g'])

I know that you can substring with the following

df['string'].str[3:]

However, in my case, the start index may vary, so I tried:

df['string'].str[df['start_index']:]

But it returns NaNs.

EDIT: What if I don't want to use a loop or list comprehension? A vectorized method is preferred.

EDIT2: In this small test case, the list comprehension seems to be faster than apply.

from itertools import islice
%timeit df.apply(lambda x: ''.join(islice(x.string, x.start_index, None)), 1)
%timeit pd.Series([x[y:] for x, y in zip(df.string, df.start_index)])

631 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
101 µs ± 233 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
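
Since %timeit only works inside IPython/Jupyter, here is a rough sketch of the same comparison using the standard timeit module. The helper names with_apply and with_comprehension are made up for illustration, and the numbers will differ by machine and pandas version.

```python
import timeit
from itertools import islice

import pandas as pd

df = pd.DataFrame({'string': ['abcdef', 'bcdefg'], 'start_index': [3, 5]})


def with_apply(frame):
    # Row-wise apply: islice skips the first start_index characters of each string.
    return frame.apply(
        lambda row: ''.join(islice(row['string'], row['start_index'], None)),
        axis=1,
    )


def with_comprehension(frame):
    # Plain Python loop over the two columns zipped together.
    return pd.Series([s[i:] for s, i in zip(frame['string'], frame['start_index'])])


# Time each approach over a fixed number of calls and report the mean per call.
for func in (with_apply, with_comprehension):
    seconds = timeit.timeit(lambda: func(df), number=100)
    print(f"{func.__name__}: {seconds / 100 * 1e6:.0f} µs per call")
```

With only two rows this mostly measures fixed overhead, so it is worth rerunning on a frame closer to the real data size before drawing conclusions.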
  • Do all strings have the same length? Commented Jun 14, 2019 at 21:59
  • No they do not. Commented Jun 14, 2019 at 22:00
  • Might take a look here: stackoverflow.com/questions/39042214/… Commented Jun 14, 2019 at 22:11

1 Answer

Use a for loop (list comprehension) over a zip of the two columns. For why a for loop is used here, you can check the link.

[x[y:] for x, y in zip(df.string, df.start_index)]
Out[328]: ['def', 'g']
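
For completeness, a self-contained sketch of the same idea that can be run as a script. The names suffixes and result are just illustrative, and passing index=df.index is an added assumption so the result lines up with the original frame if you assign it back as a column.

```python
import pandas as pd

df = pd.DataFrame({'string': ['abcdef', 'bcdefg'], 'start_index': [3, 5]})

# Slice each string from its own start index; zip pairs the two columns row by row.
suffixes = [s[i:] for s, i in zip(df['string'], df['start_index'])]

# Reuse the frame's index so the result aligns with df when assigned as a column.
result = pd.Series(suffixes, index=df.index)
print(result.tolist())  # ['def', 'g']
```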

2 Comments

This solution is so slow that it's impractical for larger data sets.
