I've searched for an answer to the following question but haven't found one yet. I have a large dataset that looks like this small example:

df =

A  B
1  I bought 3 apples in 2013
3  I went to the store in 2020 and got milk
1  In 2015 and 2019 I went on holiday to Spain
2  When I was 17, in 2014 I got a new car
3  I got my present in 2018 and it broke down in 2019

What I would like is to extract all numbers greater than 1950 and get this as the end result:

A  B                                                    C
1  I bought 3 apples in 2013                            2013
3  I went to the store in 2020 and got milk             2020
1  In 2015 and 2019 I went on holiday to Spain          2015_2019
2  When I was 17, in 2014 I got a new car               2014
3  I got my present in 2018 and it broke down in 2019   2018_2019

I tried to extract values first, but didn't get further than:

df["C"] = df["B"].str.extract('(\d+)').astype(int)
df["C"] = df["B"].apply(lambda x: re.search(r'\d+', x).group())

But all I get are error messages (I only started with Python and text processing a few weeks ago). Could someone help me?

  • Should 1950 be included? Do you want to also extract 19555 and more-digit numbers? Commented Jul 11, 2019 at 10:05
  • You can use this Commented Jul 11, 2019 at 10:06
  • @WiktorStribiżew I haven't got that far yet, but my thinking was: because I need the year in which it took place, filtering the extracted numbers with > 1950 will give me the years and drop the other, useless values. Commented Jul 11, 2019 at 10:08
  • I would use something like df["C"] = df["B"].str.findall(r'(?<!\d)(?:19[5-9]\d|[2-9]\d{3}|\d{5,})(?!\d)').str.join('_') that also includes 1950 and 5+ digit numbers. Commented Jul 11, 2019 at 10:14
  • If you only need 4 digit years, remove |\d{5,} from the above. To exclude 1950 add (?!1950) / (?!1950(?!\d)) after (?<!\d). Only use it if your input is completely messed up. Commented Jul 11, 2019 at 10:29
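The pattern suggested in the comments above can be tried on the sample data. This is a minimal sketch, assuming only four-digit years from 1950 onward are wanted (so the `|\d{5,}` branch is dropped); the DataFrame is rebuilt from the question and the name `year_pattern` is only illustrative:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": [1, 3, 1, 2, 3],
    "B": [
        "I bought 3 apples in 2013",
        "I went to the store in 2020 and got milk",
        "In 2015 and 2019 I went on holiday to Spain",
        "When I was 17, in 2014 I got a new car",
        "I got my present in 2018 and it broke down in 2019",
    ],
})

# Four-digit numbers from 1950 to 9999. The lookarounds reject matches
# that are part of a longer run of digits (e.g. 19555).
year_pattern = r"(?<!\d)(?:19[5-9]\d|[2-9]\d{3})(?!\d)"

# findall returns a list of matches per row; join glues them with "_".
df["C"] = df["B"].str.findall(year_pattern).str.join("_")
print(df)
```

Because the range check lives in the regex itself, no conversion to int is needed afterwards; the trade-off is that changing the threshold means rewriting the pattern.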

2 Answers

Here's one way, using str.findall and then joining those items from the resulting lists that are greater than 1950:

s = df["B"].str.findall(r'\d+')
df['C'] = s.apply(lambda x: '_'.join(i for i in x if int(i) > 1950))

   A                                                  B          C
0  1                          I bought 3 apples in 2013       2013
1  3           I went to the store in 2020 and got milk       2020
2  1        In 2015 and 2019 I went on holiday to Spain  2015_2019
3  2             When I was 17, in 2014 I got a new car       2014
4  3  I got my present in 2018 and it broke down in ...  2018_2019

3 Comments

So, I've got an additional question. What if I only want to keep the earliest year?
Try playing a bit with min @lotw
Yes, that much I got. My problem is how to do it neatly. Right now I have:

df2 = df['C'].str.split('_', expand=True)
df2 = df2.fillna(0).astype(int)
df2.columns = ['C{}'.format(col) for col in df2.columns]
df = df.join(df2)

which is a big workaround that splits C again. I would like to take the smallest number directly.
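For the follow-up about keeping only the earliest year, one possible approach is to take the minimum while the numbers are still a list, before anything is joined into a string. This is a minimal sketch, not the answerer's code: the helper `earliest_year`, its `threshold` parameter, and the column name `C_min` are made up for illustration.

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "B": [
        "I bought 3 apples in 2013",
        "I went to the store in 2020 and got milk",
        "In 2015 and 2019 I went on holiday to Spain",
        "When I was 17, in 2014 I got a new car",
        "I got my present in 2018 and it broke down in 2019",
    ],
})


def earliest_year(text, threshold=1950):
    """Return the smallest number above `threshold` in `text`, or None."""
    years = [int(n) for n in re.findall(r"\d+", text) if int(n) > threshold]
    return min(years) if years else None


# Hypothetical column name; each row gets its earliest qualifying year.
df["C_min"] = df["B"].apply(earliest_year)
print(df)
```

This avoids splitting C again, since the comparison happens on integers directly. Rows without any qualifying number get a missing value rather than 0.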

With single regex pattern (considering your comment "need the year it took place"):

In [268]: pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})\b')

In [269]: df['C'] = df['B'].apply(lambda x: '_'.join(pat.findall(x)))

In [270]: df
Out[270]: 
   A                                                  B          C
0  1                          I bought 3 apples in 2013       2013
1  3           I went to the store in 2020 and got milk       2020
2  1        In 2015 and 2019 I went on holiday to Spain  2015_2019
3  2             When I was 17, in 2014 I got a new car       2014
4  3  I got my present in 2018 and it broke down in ...  2018_2019
