Extract substring from string using Python and regex

Question

I have a pandas DataFrame whose 'Page' column contains very long strings, and I am trying to extract a substring from each of them:

Example string: /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0

Using regex, I am having a hard time working out how to extract the string between the first two ampersands and discard the rest of the larger string.

So far, my code looks like this:

import pandas as pd
import re

dataset = pd.read_excel(r'C:\Users\example.xlsx')
dataframe = pd.DataFrame(dataset)

dataframe['Page'] = format = re.search(r'&(.*)&',str(dataframe['Page']))

dataframe.to_excel(r'C:\Users\output.xlsx')

The code above runs but does not output anything to my new spreadsheet.

Thank you in advance.

Welcome to SO! It's always helpful to include some sample data as text. The easiest way to do this is to paste the output of df.head() into a code block in your question.
– Charles Landau, Commented Dec 11, 2018 at 16:26
Probably, dataframe['Page'].str.extract(r'&([^&]+)&') will do.
– Wiktor Stribiżew, Commented Dec 11, 2018 at 16:27
Also, parsing a string representation of the dataframe is just asking for trouble. Instead operate on the series of strings.
– Graipher, Commented Dec 11, 2018 at 16:34
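
The two suggestions in the comments above can be sketched as follows. This is only an illustration under assumptions: the Series name pages and its shortened URLs are hypothetical stand-ins for dataframe['Page'].

```python
import pandas as pd

# Hypothetical stand-in for dataframe['Page'], with shortened paths.
pages = pd.Series([
    "/ex/search/?s&search_query=example one&y=0&x=0",
    "/ex/search/?s&search_query=second&y=1&x=2",
])

# str() on a Series returns one printable summary of the whole column,
# so re.search on it cannot give one result per row.
as_text = str(pages)

# .str.extract applies the pattern to each cell separately;
# expand=False returns a Series instead of a one-column DataFrame.
between = pages.str.extract(r"&([^&]+)&", expand=False)
print(between.tolist())
```

Note that the text between the first two ampersands still includes the key, so each cell comes back as search_query=... rather than the bare value.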

Martin Frodl · Accepted Answer · 2018-12-11 16:55:28Z

4

You can extract the query string from the URL with urllib.parse.urlparse, then parse it with urllib.parse.parse_qs:

>>> from urllib.parse import urlparse, parse_qs
>>> path = '/ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0'
>>> query_string = urlparse(path).query
>>> parse_qs(query_string)
{'search_query': ['example one'], 'y': ['0'], 'x': ['0']}

EDIT: To extract the search_query value from every page in the Page column:

dataframe['Page'] = dataframe['Page'].apply(lambda page: parse_qs(urlparse(page).query)['search_query'][0])
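
The one-liner above raises a KeyError for any row whose URL has no search_query parameter. A minimal sketch of a more defensive variant, assuming that such rows should simply come back as missing (the helper name search_query and the sample Series are hypothetical):

```python
import pandas as pd
from urllib.parse import urlparse, parse_qs

def search_query(page):
    """Return the first search_query value in a URL, or None if it is absent."""
    params = parse_qs(urlparse(page).query)
    values = params.get("search_query")
    return values[0] if values else None

# Hypothetical sample data with shortened paths.
pages = pd.Series([
    "/ex/search/?s&search_query=example one&y=0&x=0",
    "/ex/search/?s&y=0&x=0",  # no search_query parameter
])

result = pages.apply(search_query)
print(result)
```

With a few thousand rows, apply with a plain Python function like this should be fast enough in practice; it was not benchmarked here.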

edited Dec 11, 2018 at 16:55

answered Dec 11, 2018 at 16:29

5 Comments

Graipher Over a year ago

You might want to add that this should be used like dataframe.Page.apply(parse_qs) or similar.

trapslinky Over a year ago

Is there a better way to do this programmatically? I have a few thousand rows of data, with every cell being a unique string.

Martin Frodl Over a year ago

So, you want to extract the value of query_string from each path?

trapslinky Over a year ago

Thank you so much for your help, sir. This is the solution I have had success with.

Martin Frodl Over a year ago

I updated the answer a bit. Turns out, the query string needs to be extracted first before passing it to parse_qs. Given the structure of your data, this was not necessary, but it would break for different inputs.

Code Maniac · Answer · 2018-12-11 17:14:02Z

1

You can try this pattern:

(?<=&).*?(?=&)

Explanation

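(?<=&) - Positive lookbehind. Asserts that the match is immediately preceded by &, without including it.
.*? - Matches any characters except newline, as few as possible (lazy quantifier).
(?=&) - Positive lookahead. Asserts that the match is immediately followed by &, without including it.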
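
As a rough illustration, here is how the pattern behaves with Python's re module on a shortened version of the example string. The second pattern is an assumed variant that is not part of the answer: it anchors the lookbehind on the key so that only the value is returned.

```python
import re

# Shortened version of the example string from the question.
url = "/ex/search/?s&search_query=example one&y=0&x=0"

# Pattern from the answer: the first run of text between two '&'.
first = re.search(r"(?<=&).*?(?=&)", url)
print(first.group(0))  # search_query=example one

# Assumed variant: look behind for the key itself, then take everything
# up to the next '&' (or the end of the string).
value = re.search(r"(?<=&search_query=)[^&]*", url)
print(value.group(0))  # example one
```

In pandas, Series.str.extract requires at least one capture group, so either pattern would need to be wrapped in parentheses before being used there.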

edited Dec 11, 2018 at 17:14

answered Dec 11, 2018 at 16:33

johnnyb · Answer · 2018-12-11 16:58:05Z

Fast and efficient pandas method.

Example data:

temp,page
1,  /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
2,  /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
3,  /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0

Code:

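import pandas as pd

# Assumes the example data above was saved as example.csv (hypothetical file name).
df = pd.read_csv("example.csv", skipinitialspace=True)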
df["query"] = df['page'].str.split("&", expand=True)[1].str.split("=", expand=True)[1]
print(df)

Example output:

   temp  \
0  1          
1  2          
2  3          

                                                                                                          page  \
0    /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0   
1    /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0   
2    /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0   

         query  
0  example one  
1  example one  
2  example one

If you would like to label your columns based on the key=value pairs, that would require a separate extraction step afterwards.
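
One possible way to do that labeling is sketched below. It swaps the string splitting for urllib.parse.parse_qsl, so it is an alternative under assumptions rather than a continuation of the split-based code; the sample frame and the names params and labeled are hypothetical.

```python
import pandas as pd
from urllib.parse import urlparse, parse_qsl

# Hypothetical sample frame with shortened paths.
df = pd.DataFrame({
    "temp": [1, 2],
    "page": [
        "/ex/search/?s&search_query=example one&y=0&x=0",
        "/ex/search/?s&search_query=second&y=1&x=2",
    ],
})

# parse_qsl returns (key, value) pairs; dict() keeps the last value per key.
# Fields without '=' (such as the bare 's') are dropped by default.
params = df["page"].apply(lambda page: dict(parse_qsl(urlparse(page).query)))

# Build one column per key, aligned on the original index, and attach it.
labeled = df.join(pd.DataFrame(params.tolist(), index=df.index))
print(labeled[["temp", "search_query", "y", "x"]])
```

Rows that lack a given key would get a missing value in that column, and the values stay strings unless converted explicitly.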
