I have a pandas dataframe with very long strings in the 'Page' column, and I am trying to extract a substring from each of them:

Example string: /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0

Using regex, I am having a hard time working out how to extract the string between the first two ampersands and discard the rest of the larger string.

So far, my code looks like this:

import pandas as pd
import re

dataset = pd.read_excel(r'C:\Users\example.xlsx')
dataframe = pd.DataFrame(dataset)

dataframe['Page'] = format = re.search(r'&(.*)&',str(dataframe['Page']))

dataframe.to_excel(r'C:\Users\output.xlsx')

The code above runs but does not output anything to my new spreadsheet.

Thank you in advance.

  • Welcome to SO! It's always helpful to include some sample data as text. The easiest way to do this is to paste the output of df.head() into a code block in your question. Commented Dec 11, 2018 at 16:26
  • Something like dataframe.Page.str.split("&").str[1]? Commented Dec 11, 2018 at 16:27
  • Probably, dataframe['Page'].str.extract(r'&([^&]+)&') will do. Commented Dec 11, 2018 at 16:27
  • Also, parsing a string representation of the dataframe is just asking for trouble. Instead, operate on the series of strings. Commented Dec 11, 2018 at 16:34
  • Ahh, I understand. Thank you for clearing that up. Commented Dec 11, 2018 at 16:42
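
To make the comments above concrete, here is a minimal sketch of why parsing str(dataframe['Page']) tends to fail and how the suggested str.extract call behaves. The one-row dataframe is an assumption built from the example string in the question, not the asker's real spreadsheet.

```python
import pandas as pd

# Stand-in for the real spreadsheet: one row holding the example string.
dataframe = pd.DataFrame({'Page': [
    '/ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/'
    'L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0',
]})

# str(Series) is the printed summary of the column. Long values are
# shortened for display, so the regex may never see the query part.
summary = str(dataframe['Page'])
print('&' in summary)

# Vectorised alternative from the comments: operate on each string and
# capture the text between the first two ampersands.
between = dataframe['Page'].str.extract(r'&([^&]+)&', expand=False)
print(between.iloc[0])  # search_query=example one
```

Note also that re.search returns a single match object (or None), so assigning its result to dataframe['Page'] puts that one object in every row rather than one extracted string per row.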

3 Answers

4

You can extract the query string from the URL with urllib.parse.urlparse, then parse it with urllib.parse.parse_qs:

>>> from urllib.parse import urlparse, parse_qs
>>> path = '/ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0'
>>> query_string = urlparse(path).query
>>> parse_qs(query_string)
{'search_query': ['example one'], 'y': ['0'], 'x': ['0']}

EDIT: To extract the search_query value from every entry in the Page column:

dataframe['Page'] = dataframe['Page'].apply(lambda page: parse_qs(urlparse(page).query)['search_query'][0])
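
The one-liner above raises a KeyError for any row whose URL has no search_query parameter, and it fails on empty cells. A more defensive variant might look like the sketch below. The helper name extract_search_query is hypothetical, and returning None for missing values is an assumption about what is wanted.

```python
from urllib.parse import urlparse, parse_qs


def extract_search_query(page):
    """Return the search_query value from a URL path, or None if absent."""
    # Empty spreadsheet cells arrive as NaN (a float), not as a string.
    if not isinstance(page, str):
        return None
    params = parse_qs(urlparse(page).query)
    # parse_qs maps each key to a list of values; take the first if present.
    values = params.get('search_query')
    return values[0] if values else None


print(extract_search_query('/ex/search/?s&search_query=example one&y=0&x=0'))
print(extract_search_query('/ex/other/?y=0&x=0'))
```

It could then be applied in the same way as the lambda, for example dataframe['Page'].apply(extract_search_query).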

5 Comments

You might want to add that this should be used like dataframe.Page.apply(parse_qs) or similar.
Is there a better way to do this programmatically? I have a few thousands of rows of data with every cell being a unique string.
So, you want to extract the value of query_string from each path?
Thank you so much for your help, sir. This is the solution I have had success with.
I updated the answer a bit. Turns out, the query string needs to be extracted first before passing it to parse_qs. Given the structure of your data, this was not necessary, but it would break for different inputs.
1

You can try this

(?<=&).*?(?=&)

Explanation

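  • (?<=&) - Positive lookbehind. Asserts that the match is immediately preceded by &, without including it in the match.
  • .*? - Matches any characters except newline, as few as possible (lazy).
  • (?=&) - Positive lookahead. Asserts that the match is immediately followed by &, without including it in the match.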
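
As a rough illustration of the pattern explained above, the sketch below runs it against the example string from the question using Python's re module. The final split on '=' is an added assumption about wanting only the value and is not part of the pattern itself.

```python
import re

# Lookaround pattern: the ampersands are asserted, not captured.
pattern = re.compile(r'(?<=&).*?(?=&)')

path = ('/ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/'
        'L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0')

match = pattern.search(path)
between = match.group(0) if match else None
print(between)  # search_query=example one

# Keep only the value by splitting once on the first '='.
value = between.split('=', 1)[1] if between and '=' in between else None
print(value)  # example one
```

In pandas, Series.str.extract requires at least one capture group, so the middle part would need parentheses there, for example r'(?<=&)(.*?)(?=&)'.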


0

Fast and efficient pandas method.

Example data:

temp,page
1,  /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
2,  /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0
3,  /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0

Code:

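import pandas as pd
# Assumes the example data above is saved as example.csv (hypothetical file name).
df = pd.read_csv("example.csv", skipinitialspace=True)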
df["query"] = df['page'].str.split("&", expand=True)[1].str.split("=", expand=True)[1]
print(df)

Example output:

   temp  \
0  1          
1  2          
2  3          

                                                                                                          page  \
0    /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0   
1    /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0   
2    /ex/search/!tu/p/z1/zVJdb4IwFP0r88HH0Sp-hK/dz/d5/L2dBISEvZ0FBIS9nQSEh/?s&search_query=example one&y=0&x=0   

         query  
0  example one  
1  example one  
2  example one  

If you would like to label your columns based on the key=value pairs, that would be a separate extraction step afterwards.
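
One possible way to do that labelling is sketched below. It swaps in urllib.parse.parse_qs rather than chained splits, and the helper name query_to_dict and the shortened sample paths are assumptions made for illustration.

```python
import pandas as pd
from urllib.parse import urlparse, parse_qs

# Shortened, hypothetical sample paths standing in for the real data.
df = pd.DataFrame({'page': [
    '/ex/search/?s&search_query=example one&y=0&x=0',
    '/ex/search/?s&search_query=example two&y=1&x=2',
]})


def query_to_dict(page):
    # parse_qs returns a list per key; keep the first value for each key.
    # Keys without a value (such as the bare "s") are dropped by default.
    parsed = parse_qs(urlparse(page).query)
    return {key: values[0] for key, values in parsed.items()}


# One column per key, aligned with the original rows.
params = pd.DataFrame(df['page'].map(query_to_dict).tolist(), index=df.index)
result = df.join(params)
print(result[['search_query', 'y', 'x']])
```

All extracted values are strings, so numeric parameters such as x and y would still need converting if they are to be used as numbers.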
