I am building a simple account-manager application for myself in Python. It reads SMS messages from my phone and extracts information using regex patterns.
I wrote a fairly complex regex pattern and tested it on https://pythex.org/. Example:
Text: 1.00 is debited from ******1234 for food
Pattern: (account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)
Result: from ******1234
However, when I try the same thing in Python using the str.extract() method, I get a DataFrame with one column per group instead of a single result.
The Python code looks like this:
import pandas as pd

all_sms = pd.read_csv("all_sms.csv")
pattern = r'(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'
# "message" is a placeholder for whichever column holds the SMS text
test = all_sms["message"].str.extract(pattern, expand=False)
Output of the Python code for the message above:
0 from
1
2 NaN
3 NaN
4 NaN
5 ******1234
6 1234
7 1234
8
9
10
I am very new to Python and trying to learn through hands-on experience. It would be really helpful if someone could point out where I am going wrong.
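str.extract() returns one column per capturing group. Add ?: after each unescaped ( to make that group non-capturing, and remove the redundant capturing groups.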
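
A minimal sketch of that suggestion, using only the standard re module so it runs without the CSV. It is a mechanical conversion of the pattern from the question: every inner group becomes (?:...), and one outer capturing group is kept so that a single value comes back. The names PATTERN and extract_account are made up for illustration, and the .strip() is an assumption about the desired output, since the pattern can also consume the whitespace after the account number.

```python
import re

# The question's pattern with every inner group made non-capturing via (?:...),
# wrapped in a single outer capturing group so only one value is returned.
PATTERN = (
    r"("
    r"(?:account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)"
    r"(?:\s|\.|\-)*(?:no|number)*(?:\.|\s|:)*\s*(?:ending)*\s*"
    r"(?:(?:(?:n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*"
    r"(?:(?:n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*(?:[0-9]*)"
    r")"
)


def extract_account(text):
    """Return the matched account fragment, or None if nothing matches."""
    match = re.search(PATTERN, text)
    # The pattern may swallow trailing whitespace, so strip it (assumption).
    return match.group(1).strip() if match else None


sample = "1.00 is debited from ******1234 for food"
result = extract_account(sample)
print(result)
```

With pandas, the same PATTERN could be passed as all_sms["message"].str.extract(PATTERN, expand=False), where "message" is a hypothetical column name. Because the pattern then has exactly one capturing group, expand=False returns a Series instead of a DataFrame.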