How to extract only one string from regex in Python?

Question

I have been trying to build a simple account manager sort of application for myself using Python which will read SMS from my phone and extract information based on some regex patterns.

I wrote a complex regex pattern and tested the same on https://pythex.org/. Example:

Text: 1.00 is debited from ******1234  for food

Pattern: (account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)

Result: from ******1234

However, when I try to do the same in Python using the str.extract() method, rather than getting a single result, I am getting a dataframe with a column for each group.

Python code looks like this:

all_sms=pd.read_csv("all_sms.csv")

pattern = '(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'

test = all_sms.extract(pattern, expand = False)

Output of the python code for the message above:

0           from
1               
2            NaN
3            NaN
4            NaN
5     ******1234
6           1234
7           1234
8               
9               
10

I am very new to Python and trying to learn through hands-on experience. It would be really helpful if someone could point out where I am going wrong.

Put ?: after each unescaped (. Remove redundant capturing groups.
– Wiktor Stribiżew, Commented Dec 5, 2017 at 11:03
Consider using a language tag if this is about a specific language.
– doctorlove, Commented Dec 5, 2017 at 11:03
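
A minimal sketch of what the first comment above suggests: make every inner group non-capturing with `(?:...)` and keep a single capturing group around the part you want. With exactly one capturing group, `Series.str.extract(pattern, expand=False)` returns one value per row instead of one column per group. The check below uses only the standard `re` module. The shortened pattern and the name `SINGLE_GROUP_PATTERN` are illustrative assumptions, not the asker's full expression.

```python
import re

# Simplified, illustrative pattern (an assumption, not the original regex):
# one outer capturing group, every inner group non-capturing via (?:...).
SINGLE_GROUP_PATTERN = re.compile(
    r"((?:account|a/c|acct|acc|ac|from)"  # keyword that introduces the account
    r"(?:\s|\.|-)*"                        # optional separators
    r"[nx*0-9]+)"                          # masked or plain account digits
)

text = "1.00 is debited from ******1234  for food"
match = SINGLE_GROUP_PATTERN.search(text)

# Only one capturing group remains, so there is only one value to extract.
print(SINGLE_GROUP_PATTERN.groups)
print(match.group(1))
```

On the sample message this leaves one group holding `from ******1234`.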

fmv1992 · Accepted Answer · 2017-12-05 11:21:31Z

2

Before diving into your regex pattern, you should understand why you are using pandas. Pandas is suited to data analysis (and so could handle your problem), but it seems like overkill here.

If you are a beginner, I advise you to stick with pure Python, not because pandas is complicated but because knowing the Python standard library will help you in the long run. Skipping the basics now may hurt you later.

Assuming you are going to use Python 3 (without pandas), I would proceed as follows:

# Needed imports from standard library.
import csv
import re

# Declare the constants of my tiny program.
# A raw string keeps the backslashes in the pattern intact.
PATTERN = r'(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'
COMPILED_REGEX = re.compile(PATTERN)

# This list will store the matched text.
found_regexes = list()

# Do the necessary loading to enable searching for the regex.
with open('mysmspath.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=' ', quotechar='"')
    # Iterate over rows in your csv file.
    for row in csv_reader:
        # csv.reader yields each row as a list of strings, so join the
        # fields back into one line before searching.
        line = ' '.join(row)
        match = COMPILED_REGEX.search(line)
        if match:
            # group(0) is the whole match, however many groups the pattern has.
            found_regexes.append(match.group(0))

print(found_regexes)

This will not necessarily solve your problem by copy-paste, but it might give you an idea of a simpler approach.
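
To try the idea on the sample message from the question without a CSV file, here is a small sketch using the question's pattern unchanged. The helper name `extract_account` is hypothetical. It relies on `match.group(0)`, which is the whole matched substring no matter how many capturing groups the pattern contains.

```python
import re

# The pattern from the question, written as a raw string.
PATTERN = r'(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'
COMPILED_REGEX = re.compile(PATTERN)


def extract_account(text):
    """Return the whole matched substring, or None when nothing matches.

    Hypothetical helper, for illustration only.
    """
    match = COMPILED_REGEX.search(text)
    # group(0) is the complete match; groups() would return one entry per
    # capturing group, which is what produced the multi-column output.
    return match.group(0).strip() if match else None


sample = "1.00 is debited from ******1234  for food"
print(extract_account(sample))
print(COMPILED_REGEX.groups)  # number of capturing groups in the pattern
```

The pattern has 11 capturing groups, which is why the pandas output in the question has rows 0 to 10, while `group(0)` gives the single string `from ******1234` once the trailing whitespace is stripped.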


1 Comment

Nadeem Hussain Over a year ago

Thanks for the detailed answer and explanation. Will try to stick to the basics for a while now :)
