Extract only the specific value from string with Regex Using Python

Question

I am trying to extract specific values from a string using regex. The extracted text has no spaces between the end of one value and the start of the next keyword, and my patterns raise an error. I want to extract the values that follow certain keywords.

I tried extracting the text with PyPDF2 and with pdfminer, but I get the error shown below.

fr = PyPDF2.PdfFileReader(file)
data = fr.getPage(0).extractText()

Output: ['Date : 2020-09-06 20:43:00Ack No : 3320000266Original for RecipientInvoice No.: IN05200125634Date of Issue: 06.09.2015TAX INVOICE(Issued u/s 31(1) of GST Act, 2017)POLO INDUSTRIES LIMITEDCIN: K253648B85PLC015063GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K .....']

I want to capture Ack No, Date of Issue, and CIN from the above output.

Using the script:

    regex_ack_no = re.compile(r"Ack No(\d+)")
    regex_due_date = re.compile(r"Date of Issue(\S+ \d{1,2}, \d{4})")
    regex_CIN = re.compile(r"CIN(\$\d+\.\d{1,2})")

    ack_no = re.search(regex_ack_no, data).group(1)
    due_date = re.search(regex_due_date, data).group(1)
    cin = re.search(regex_CIN, data).group(1)

    return [ack_no, due_date, cin]

Error:

AttributeError: 'NoneType' object has no attribute 'group'

The same script works with another PDF file whose data is in a table format.

You're not matching the : between Ack No and the number. You're not matching the : after Date of Issue. You're not matching the : after CIN, and the format of CIN is not $ followed by a number with 1-2 decimal digits.
– Barmar, Commented Oct 6, 2020 at 19:24
In other words, the regular expressions don't seem to match the data format at all.
– Barmar, Commented Oct 6, 2020 at 19:24
@Barmar - I have tried the changes you mentioned as well, but they didn't work. Now I am trying to match on the keywords Ack No, Date of Issue, and CIN, as we have to capture values from multiple PDFs.
– Manz, Commented Oct 6, 2020 at 19:31
Date of issue is \d\d\.\d\d\.\d{4}, why are you matching \d{1,2}, \d{4}?
– Barmar, Commented Oct 6, 2020 at 19:35

Barmar · Accepted Answer · 2020-10-06 19:56:51Z

You need to change the regexp patterns to match the data format. The keywords are followed by optional spaces and a :, so you have to match those too. The date format is not what your pattern expects, and neither is the format of the CIN.

Before calling .group(1), check that the match was successful. In my code below I return default values when there's no match.

import re

data = 'Date : 2020-09-06 20:43:00Ack No : 3320000266Original for RecipientInvoice No.: IN05200125634Date of Issue: 06.09.2015TAX INVOICE(Issued u/s 31(1) of GST Act, 2017)POLO INDUSTRIES LIMITEDCIN: K253648B85PLC015063GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K .....'

regex_ack_no = re.compile(r"Ack No\s*:\s*(\d+)")
regex_due_date = re.compile(r"Date of Issue\s*:\s*(\d\d\.\d\d\.\d{4})")
regex_CIN = re.compile(r"CIN:\s*(\w+?)GSTIN:")

ack_no = re.search(regex_ack_no, data)
if ack_no:
    ack_no = ack_no.group(1)
else:
    ack_no = 'Ack No not found'
due_date = re.search(regex_due_date, data)
if due_date:
    due_date = due_date.group(1)
else:
    due_date = 'Due date not found'
cin = re.search(regex_CIN, data)
if cin:
    cin = cin.group(1)
else:
    cin = 'CIN not found'

print([ack_no, due_date, cin])
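
The three if/else blocks above repeat the same check, so they can be folded into a small helper. This is only a sketch of that idea: the name first_group, the dictionary keys, and the default of None are my own choices for illustration and are not part of the answer's code. The patterns are the same ones used above.

```python
import re

def first_group(pattern, text, default=None):
    """Return group 1 of the first match of pattern in text, or default if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else default

data = ('Date : 2020-09-06 20:43:00Ack No : 3320000266Original for Recipient'
        'Invoice No.: IN05200125634Date of Issue: 06.09.2015TAX INVOICE'
        '(Issued u/s 31(1) of GST Act, 2017)POLO INDUSTRIES LIMITED'
        'CIN: K253648B85PLC015063GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K .....')

# Same patterns as in the answer, keyed by a field name (the key names are hypothetical).
patterns = {
    'ack_no': r"Ack No\s*:\s*(\d+)",
    'issue_date': r"Date of Issue\s*:\s*(\d\d\.\d\d\.\d{4})",
    'cin': r"CIN:\s*(\w+?)GSTIN:",
}

# Fields that do not match come back as None instead of raising AttributeError.
result = {name: first_group(pattern, data) for name, pattern in patterns.items()}
print(result)
```

Returning None rather than a message string makes it easier for the caller to tell a missing field from a real value, but either default works.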


3 Comments

Manz Over a year ago

Thanks for the valuable answer. As I am new to this language, I just want to know why GSTIN is used in regex_CIN = re.compile(r"CIN:\s*(\w+?)GSTIN:"). Our PDFs come in different layouts, so what happens when GSTIN is not present? Do we have to change the script, or is there a way to find the CIN without using GSTIN in the regular expression?

Barmar Over a year ago

As I said in a comment above, there's no delimiter after the CIN value, so I used GSTIN: to detect the end.

Barmar Over a year ago

You could make GSTIN optional with (?:GSTIN:)?, but then it might include some other field in the CIN. Unless you can define straightforward rules for how to find the different bits, you're going to have a hard time with this.
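
To illustrate the last comment: the sketch below first shows what happens to the sample data when GSTIN is simply made optional, then tries one possible rule, a lookahead over the labels that may follow the CIN. The list NEXT_LABELS and the second sample string without GSTIN are assumptions made up for this example, not something taken from the real PDFs, so the list would need adjusting per layout.

```python
import re

data = ('POLO INDUSTRIES LIMITEDCIN: K253648B85PLC015063'
        'GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K .....')

# Optional GSTIN with a greedy \w+: the label leaks into the captured value.
greedy = re.search(r"CIN:\s*(\w+)(?:GSTIN:)?", data).group(1)

# Optional GSTIN with a lazy \w+?: the capture stops after a single character.
lazy = re.search(r"CIN:\s*(\w+?)(?:GSTIN:)?", data).group(1)

# Hypothetical list of labels that may come right after the CIN value.
NEXT_LABELS = ["GSTIN:", "PAN:"]
lookahead = "|".join(re.escape(label) for label in NEXT_LABELS)

# Stop at the first known label, at any non-word character, or at end of string.
regex_cin = re.compile(r"CIN:\s*(\w+?)(?=" + lookahead + r"|\W|$)")

# Made-up variant of the sample in which GSTIN is absent and PAN follows directly.
data_without_gstin = 'POLO INDUSTRIES LIMITEDCIN: K253648B85PLC015063PAN: BBB7653279K .....'

cin_with = regex_cin.search(data).group(1)
cin_without = regex_cin.search(data_without_gstin).group(1)
print(greedy, lazy, cin_with, cin_without)
```

This still depends on knowing which labels can follow the CIN, and a CIN whose own characters happen to spell one of those labels would be cut short, so it is a workaround and not a general solution.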
