I am trying to extract specific values from a string using regex. The extracted text has no spaces between the end of one value and the keyword that starts the next field, and I am getting an error. I want to extract the values that follow certain keywords.
I tried using PyPDF2 and pdfminer, but I am getting the error.
fr = PyPDF2.PdfFileReader(file)
data = fr.getPage(0).extractText()
Output: ['Date : 2020-09-06 20:43:00Ack No : 3320000266Original for RecipientInvoice No.: IN05200125634Date of Issue: 06.09.2015TAX INVOICE(Issued u/s 31(1) of GST Act, 2017)POLO INDUSTRIES LIMITEDCIN: K253648B85PLC015063GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K .....']
I want to capture Ack No, Date of Issue, and CIN from the above output.
I am using this script:
regex_ack_no = re.compile(r"Ack No(\d+)")
regex_due_date = re.compile(r"Date of Issue(\S+ \d{1,2}, \d{4})")
regex_CIN = re.compile(r"CIN(\$\d+\.\d{1,2})")
ack_no = re.search(regex_ack_no, data).group(1)
due_date = re.search(regex_due_date, data).group(1)
cin = re.search(regex_CIN, data).group(1)
return[ack_no, due_date, cin]
Error:
AttributeError: 'NoneType' object has no attribute 'group'
The same script works with another PDF file whose data is in a table format.
You're not matching the : between Ack No and the number. You're not matching the : after Date of Issue. You're not matching the : after CIN, and the format of CIN is not $ followed by a number with 1-2 decimal digits. The date format is \d\d\.\d\d\.\d{4}, why are you matching \d{1,2}, \d{4}?
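Putting those points together, here is a minimal sketch of patterns that would match the sample output shown in the question. The names PATTERNS and extract_fields are made up for illustration. It assumes the CIN value is uppercase letters and digits and is immediately followed by the GSTIN label, as in the sample, so a lazy match with a lookahead is used to stop before it; a different layout would need a different stop condition. It also returns None for a field that is not found instead of calling .group() on None, which is what raises the AttributeError.

```python
import re

# Sample text copied from the output in the question (truncated the same way).
data = (
    "Date : 2020-09-06 20:43:00Ack No : 3320000266Original for Recipient"
    "Invoice No.: IN05200125634Date of Issue: 06.09.2015TAX INVOICE"
    "(Issued u/s 31(1) of GST Act, 2017)POLO INDUSTRIES LIMITED"
    "CIN: K253648B85PLC015063GSTIN: 3451256132uuy668803E1Z9PAN: BBB7653279K"
)

# Hypothetical pattern table: each pattern allows optional spaces around the colon.
PATTERNS = {
    # Digits after "Ack No :"; \d+ stops at the first non-digit ("Original").
    "ack_no": re.compile(r"Ack No\s*:\s*(\d+)"),
    # Date in dd.mm.yyyy form, as it appears in the sample.
    "date_of_issue": re.compile(r"Date of Issue\s*:\s*(\d{2}\.\d{2}\.\d{4})"),
    # Assumption: the CIN runs up to the "GSTIN" label that follows it.
    # The lazy +? with a lookahead keeps "GSTIN" out of the captured value.
    "cin": re.compile(r"CIN\s*:\s*([A-Z0-9]+?)(?=GSTIN)"),
}


def extract_fields(text):
    """Return a dict of field name -> captured value, or None if not found."""
    results = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        # Guard against None so a missing field does not raise AttributeError.
        results[name] = match.group(1) if match else None
    return results


fields = extract_fields(data)
print(fields)
```

If a field comes back as None, print the raw extracted text for that PDF and compare it with the pattern; a table-style PDF may well produce different spacing than this one.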