Python - Extract multiple values from string in pandas df

Question

I've searched for an answer to the following question but haven't found one yet. I have a large dataset that looks like this small example:

df =

A  B
1  I bought 3 apples in 2013
3  I went to the store in 2020 and got milk
1  In 2015 and 2019 I went on holiday to Spain
2  When I was 17, in 2014 I got a new car
3  I got my present in 2018 and it broke down in 2019

What I would like is to extract all numbers greater than 1950 and get this as the end result:

A  B                                                    C
1  I bought 3 apples in 2013                            2013
3  I went to the store in 2020 and got milk             2020
1  In 2015 and 2019 I went on holiday to Spain          2015_2019
2  When I was 17, in 2014 I got a new car               2014
3  I got my present in 2018 and it broke down in 2019   2018_2019

I tried to extract values first, but didn't get further than:

df["C"] = df["B"].str.extract('(\d+)').astype(int)
df["C"] = df["B"].apply(lambda x: re.search(r'\d+', x).group())

But all I get are error messages (I only started working with Python and text data a few weeks ago). Could someone help me?

Should 1950 be included? Do you also want to extract 19555 and other numbers with more digits?
– Wiktor Stribiżew, Commented Jul 11, 2019 at 10:05
@WiktorStribiżew I haven't got that far yet, but my thinking was: since I need the year each event took place, filtering the extracted numbers with > 1950 will keep the years and drop the other, useless values.
– Lotw, Commented Jul 11, 2019 at 10:08
I would use something like df["C"] = df["B"].str.findall(r'(?<!\d)(?:19[5-9]\d|[2-9]\d{3}|\d{5,})(?!\d)').str.join('_'), which also includes 1950 and numbers with five or more digits.
– Wiktor Stribiżew, Commented Jul 11, 2019 at 10:14
If you only need 4-digit years, remove |\d{5,} from the above. To exclude 1950, add (?!1950) / (?!1950(?!\d)) after (?<!\d). Only use it if your input is completely messed up.
– Wiktor Stribiżew, Commented Jul 11, 2019 at 10:29
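
To make the difference between the pattern variants in these comments concrete, here is a small sketch using plain re.findall. The test sentence and the variable names are made up for illustration and are not from the question.

```python
import re

# Pattern from the comment: 1950-1999, 2000-9999, or any number with 5+ digits
with_long = r'(?<!\d)(?:19[5-9]\d|[2-9]\d{3}|\d{5,})(?!\d)'
# Same pattern with the |\d{5,} branch removed: four-digit values only
four_digit = r'(?<!\d)(?:19[5-9]\d|[2-9]\d{3})(?!\d)'
# Same pattern with (?!1950(?!\d)) added after (?<!\d): skips exactly 1950
no_1950 = r'(?<!\d)(?!1950(?!\d))(?:19[5-9]\d|[2-9]\d{3}|\d{5,})(?!\d)'

# Hypothetical sentence chosen to hit each edge case
text = "In 1950 I paid 19555 for 3 items, and again in 2019"

for name, pattern in [("with_long", with_long),
                      ("four_digit", four_digit),
                      ("no_1950", no_1950)]:
    print(name, re.findall(pattern, text))
# with_long ['1950', '19555', '2019']
# four_digit ['1950', '2019']
# no_1950 ['19555', '2019']
```

The (?<!\d) and (?!\d) guards are what stop the four-digit branches from matching inside a longer number such as 19555. The same pattern strings can be passed to df["B"].str.findall(...).str.join('_') as in the comment.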

yatu · Accepted Answer · 2019-07-11 10:08:58Z

4

Here's one way using str.findall, then joining the items of each resulting list that are greater than 1950:

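# Collect every run of digits in each row as a list of strings
s = df["B"].str.findall(r'\d+')
# Keep the numbers greater than 1950 and join them with underscores
df['C'] = s.apply(lambda x: '_'.join(i for i in x if int(i) > 1950))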

   A                                                  B          C
0  1                          I bought 3 apples in 2013       2013
1  3           I went to the store in 2020 and got milk       2020
2  1        In 2015 and 2019 I went on holiday to Spain  2015_2019
3  2             When I was 17, in 2014 I got a new car       2014
4  3  I got my present in 2018 and it broke down in ...  2018_2019
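
On a larger dataset some rows may be missing entirely. That is an assumption here, since the question does not show such rows. For a missing row str.findall returns a missing value rather than a list, and a lambda that iterates over the result would then raise an error. A minimal sketch of a guarded version follows. The helper name join_years and the sample rows are hypothetical.

```python
import re
import pandas as pd

def join_years(text, threshold=1950):
    """Join all numbers in `text` above `threshold` with '_'."""
    if not isinstance(text, str):  # missing value (NaN/None) instead of text
        return ''
    return '_'.join(n for n in re.findall(r'\d+', text) if int(n) > threshold)

# Hypothetical rows: one normal, one missing, one without any digits
df = pd.DataFrame({"B": ["In 2015 and 2019 I went on holiday", None, "no numbers here"]})
df["C"] = df["B"].apply(join_years)
print(df["C"].tolist())  # ['2015_2019', '', '']
```

Rows with no text or no qualifying number end up as an empty string, which keeps column C a plain string column.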

answered Jul 11, 2019 at 10:08


3 Comments

Lotw Over a year ago

So, I've got an additional question. What if I only want to keep the earliest year?

yatu Over a year ago

Try playing a bit with min @lotw

Lotw Over a year ago

Yes, I got that. My problem is how to do it neatly. Now I have:

df2 = df['C'].str.split('_', expand=True)
df2 = df2.fillna(0).astype(int)
df2.columns = ['C{}'.format(col) for col in df2.columns]
df = df.join(df2)

which is a big workaround to split C again. I would like it to take the smallest number directly.
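
One possible way to follow the "min" hint without splitting C again is to take the minimum while the numbers are still in a list. This is only a sketch. The helper name earliest_year is made up, and the frame is rebuilt from the question's sample rows so the snippet runs on its own.

```python
import re
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "A": [1, 3, 1, 2, 3],
    "B": [
        "I bought 3 apples in 2013",
        "I went to the store in 2020 and got milk",
        "In 2015 and 2019 I went on holiday to Spain",
        "When I was 17, in 2014 I got a new car",
        "I got my present in 2018 and it broke down in 2019",
    ],
})

def earliest_year(text, threshold=1950):
    """Return the smallest number above `threshold` in `text`, or None if there is none."""
    years = [int(n) for n in re.findall(r'\d+', text) if int(n) > threshold]
    return min(years, default=None)

df["C"] = df["B"].apply(earliest_year)
print(df["C"].tolist())  # [2013, 2020, 2015, 2014, 2018]
```

If any row has no qualifying number, the column will contain missing values and pandas will no longer store it as plain integers.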

RomanPerekhrest · Accepted Answer · 2019-07-11 10:13:25Z

3

With a single regex pattern (considering your comment "need the year it took place"):

import re

# Match 1951-1999 or any four-digit number from 2000 to 9999, on word boundaries
pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})\b')
# Join every match found in a row with underscores
df['C'] = df['B'].apply(lambda x: '_'.join(pat.findall(x)))

   A                                                  B          C
0  1                          I bought 3 apples in 2013       2013
1  3           I went to the store in 2020 and got milk       2020
2  1        In 2015 and 2019 I went on holiday to Spain  2015_2019
3  2             When I was 17, in 2014 I got a new car       2014
4  3  I got my present in 2018 and it broke down in ...  2018_2019

edited Jul 11, 2019 at 10:13

answered Jul 11, 2019 at 10:11
