I have been trying to build a simple account-manager application for myself in Python that reads SMS messages from my phone and extracts information using regex patterns.

I wrote a complex regex pattern and tested it on https://pythex.org/. Example:

Text: 1.00 is debited from ******1234  for food

Pattern: (account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)

Result: from ******1234

However, when I try to do the same in Python using the str.extract() method, rather than getting a single result, I am getting a dataframe with a column for each group.

Python code looks like this:

all_sms=pd.read_csv("all_sms.csv")

pattern = '(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'

test = all_sms.extract(pattern, expand = False)

Output of the python code for the message above:

0           from
1               
2            NaN
3            NaN
4            NaN
5     ******1234
6           1234
7           1234
8               
9               
10              

I am very new to Python and trying to learn through hands-on experience. It would be really helpful if someone could point out where I am going wrong.

  • Put ?: after each unescaped ( to make the groups non-capturing, and remove redundant capturing groups. Commented Dec 5, 2017 at 11:03
  • Consider using a language tag if this is about a specific language. Commented Dec 5, 2017 at 11:03
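
A minimal sketch of what the first comment suggests, using only the standard `re` module. It assumes the goal is one result per message: every inner group becomes non-capturing via `(?:...)`, the duplicate `a\/c` alternative is dropped, and the whole expression is wrapped in a single capturing group. The sample text is the one from the question, and the names `PATTERN`, `compiled`, and `match` are illustrative. Since pandas `Series.str.extract` returns one column per capturing group, a pattern shaped like this should yield a single column there as well.

```python
import re

# The question's pattern with all inner groups made non-capturing and one
# outer capturing group around the whole expression (illustrative rewrite).
PATTERN = (
    r'((?:account|a/c|ac|from|acct|savings|credit in|ac/|sb-|acc)'
    r'(?:\s|\.|-)*(?:no|number)*(?:\.|\s|:)*\s*(?:ending)*\s*'
    r'(?:n+|x+|[0-9]+|\*+)+-*(?:n+|x+|[0-9]+|\*+|\s)*-*[0-9]*)'
)

compiled = re.compile(PATTERN)
text = '1.00 is debited from ******1234 for food'
match = compiled.search(text)

# Only the outer group captures now.
print(compiled.groups)  # 1
# The pattern also consumes trailing whitespace, so strip it off.
print(match.group(1).strip())  # from ******1234
```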

1 Answer

Before diving into your regex pattern, consider whether you need pandas at all. Pandas is well suited to data analysis, and so could handle your problem, but it seems like overkill here.

If you are a beginner, I advise you to stick with pure Python, not because pandas is complicated but because knowing the Python standard library will help you in the long run. Skipping the basics now may hurt you later.

Assuming you are using Python 3 (without pandas), I would proceed as follows:

# Needed imports from the standard library.
import csv
import re

# Declare the constants of my tiny program.
# A raw string keeps the backslashes in the pattern intact.
PATTERN = r'(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'
COMPILED_REGEX = re.compile(PATTERN)

# This list will store the text matched by the regex.
found_matches = []

# Do the necessary loading to enable searching for the regex.
# The default comma delimiter is assumed here, as with pd.read_csv.
with open('mysmspath.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, quotechar='"')
    # Iterate over the rows in your csv file.
    for row in csv_reader:
        # Each row is a list of fields, so search every field separately.
        for field in row:
            match = COMPILED_REGEX.search(field)
            if match:
                # group(0) is the whole match, not the individual groups.
                found_matches.append(match.group(0))

print(found_matches)

This will not necessarily solve your problem by copy-paste, but it might give you an idea of a simpler approach.
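
To connect this back to the output in the question, here is a small sketch (standard library only, using the question's own pattern and sample text) contrasting `match.groups()` with `match.group(0)`. The pattern has eleven capturing groups, which is why the question's output has eleven entries; groups that did not take part in the match come back as `None`, which pandas displays as NaN. `group(0)` is the text matched by the pattern as a whole, which appears to be the single result shown by pythex.

```python
import re

# Pattern copied from the question, written as a raw string.
PATTERN = r'(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'

text = '1.00 is debited from ******1234  for food'
match = re.search(PATTERN, text)

# One entry per capturing group; unmatched groups are None.
groups = match.groups()
print(len(groups))  # 11
print(groups)

# The whole match; trailing whitespace is part of it, so strip it off.
whole_match = match.group(0).strip()
print(whole_match)  # from ******1234
```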

1 Comment

Thanks for the detailed answer and explanation. Will try to stick to the basics for a while now :)
