1,483 questions
Best practices
1
vote
1
replies
85
views
How to build an OCR system in Flutter that can extract structured data from multiple bill formats?
I am trying to build an OCR feature in my Flutter app that can read hotel bills in multiple formats. The challenge is that these bills do not follow a fixed layout. From each bill, I need to extract ...
Advice
0
votes
1
replies
31
views
How to identify if a table has a header when extracted from Textract
I'm extracting tables from financial statement PDFs (like 10-Ks).
Textract has a feature which allows for extracting them in markdown format.
However, some of the tables don't have headers.
For example:...
0
votes
0
answers
171
views
How can I extract a table from a PDF document where the cells contain multi-line text that is vertically centered?
I'm working on extracting bank transaction data from a PDF shown below using Python. Each transaction includes two dates, amounts (Money In, Money Out, Balance), and a description. The challenge is ...
1
vote
0
answers
93
views
How to extract whole number as single token from pdf
An ASP.NET Core 9 MVC / C# controller extracts text from a PDF using PdfPig, based on code in the answer to "How to group text to lines if there is small difference in Y position".
If thousands are separated by ...
0
votes
0
answers
62
views
How to preserve text on single PDF to Image conversion?
Trying to import a user's daily duties, which are in PDF format. The PDF contains no more than 20 dates and airport/time combinations, which is what I need to capture from the document. I have the text ...
0
votes
0
answers
45
views
How to match and highlight text in a PDF with PyMuPDF when control characters are present between sentences?
I'm using PyMuPDF (fitz) to search for and highlight text in a PDF. However, the PDF text contains various control characters between sentences, which makes it difficult to match multi-sentence ...
0
votes
0
answers
109
views
PyMuPDF - Extract table contents
I try to extract the table text of a PDF:
With the following code I get:
page 0 of page-1-ocr.pdf
Tables rowsasf 49
texysdft [['', '', 'Staatlic', 'he Fische', 'rprüfung', 'in Bayern - Prü', '...
0
votes
0
answers
55
views
KeyFrame detection in python
I'm building a RAG system for a platform where the primary content consists of videos and slides. My approach involves extracting keyframes from videos using OpenCV
diff = cv2.absdiff(prev_image, ...
1
vote
0
answers
63
views
Preserve Empty Columns When Extracting Tables from PDF
I have 25–30 different types of PDF documents, each containing tables with varying structures. My ultimate goal is to extract table data from specific headings (i.e., between certain titles) and ...
0
votes
0
answers
110
views
Unable to Extract Text from a Specific Email Template in Gmail (Spring Boot & IMAP)
I am working on an email extraction process using Java, Spring Boot, and IMAP to read emails from Gmail. The process works fine for most emails, extracting only the text content. However, one specific ...
0
votes
1
answer
70
views
Extracting a Specific Text from PDF using RegEx
Without getting into too much detail about how we ended up in this situation (a lot of poor business decisions), I need to find the text: "SomeID=[Integer]" from a PDF file (e.g. SomeID=...
0
votes
4
answers
300
views
Parse video id and start time from differently formatted youtube URL strings
I needed to extract the video Id and the start time from any kind of youtube url that the users can input. I have a working solution but it is not right.
Questions:
Could someone help me to fix the ...
0
votes
1
answer
110
views
Extracting path and filename from a string containing both in vanilla SQLITE
I have a SQLITE DB that contains fullnames (i.e., parentpath\filename, e.g.
C:\Users\Public\My Music\Classic Queen\16 - Who Wants to Live Forever.mp3
I want to query and get the filename separate from ...
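One way this could be approached in plain SQLite is the `rtrim`/`replace` trick: strip every non-backslash character from the right of the string to get the parent path, then remove that prefix to get the filename. Below is a minimal sketch run through Python's `sqlite3` module; the table name `files` and column name `fullname` are assumptions made for illustration, not taken from the question.

```python
import sqlite3

# Hypothetical table and column names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (fullname TEXT)")
conn.execute(
    "INSERT INTO files VALUES (?)",
    (r"C:\Users\Public\My Music\Classic Queen\16 - Who Wants to Live Forever.mp3",),
)

# replace(fullname, '\', '') yields every character except the backslash.
# rtrim(fullname, <those characters>) then strips from the right until it
# reaches the last backslash, leaving the parent path (with trailing '\').
# Removing that prefix from fullname leaves the filename.
QUERY = r"""
SELECT
    rtrim(fullname, replace(fullname, '\', '')) AS parentpath,
    replace(fullname, rtrim(fullname, replace(fullname, '\', '')), '') AS filename
FROM files
"""

parentpath, filename = conn.execute(QUERY).fetchone()
print(parentpath)
print(filename)
```

The same `SELECT` should work in any SQLite shell, since it only uses the built-in `rtrim` and `replace` functions. Note that `replace` removes every occurrence of the parent path, so a filename that happens to repeat the full parent path string would be affected.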
0
votes
0
answers
124
views
Python: Concurrent execution of drission chromiumpage
I have created one api that calls the function whose task is to extract web content from url that has been shared as a parameter.
I am facing a problem when my api is getting multiple request, the ...
0
votes
2
answers
84
views
How to extract a word in a string between two similar characters with VBA?
I need to extract Wööörd_03 from this string:
"https://Word01.com/Word_02/Wööörd_03/Word_04/Word_05=0"
My code doesn't work, because I get different results:
Sub ExtractWord()
Dim sString As ...
0
votes
2
answers
184
views
How to tokenize a text in Google Spreadsheet
I have a cell (D2) in Google Sheets containing a title, and I want to extract everything up to the first punctuation mark (if any) and display it in another cell.
Example
For example, if D2 contains:
&...
0
votes
1
answer
87
views
Extract the numbers interspersed with ‘+’
I have a column in my table in string format that contains different types of discounts:
integers
decimal numbers
compound discounts, i.e. whole numbers interspersed with the + symbol (e.g. 10+3, 5+3+...
1
vote
1
answer
476
views
How to extract valuable information from the JSON output of Document AI custom extractors?
I am working with a simple custom extractor in Document AI, which tries to find the following fields in any pdf uploaded:
Country
Nombre
Adress
Country
Mail
Adress
City
And I am using the following ...
0
votes
1
answer
428
views
Related to Document Intelligence - Azure Cognitive Services
I have built a composed model in Document Intelligence Studio (formerly known as Form Recognizer). It is built to extract different fields from different types of documents with different patterns.
...
1
vote
2
answers
106
views
Extracting the UTM value that contains mixed characters from Google Sheets via REGEX [duplicate]
I have multiple lines like this one where I need to extract the value associated with the utm_campaign field. As you can see, the value comprises digits, letters and other characters (e.g. "-")
https:...
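A possible pattern for this kind of value is "everything after `utm_campaign=` up to the next `&`, `#`, or whitespace". Here is a minimal sketch in Python rather than Google Sheets; the sample URL is made up for illustration, since the one in the question is truncated.

```python
import re

# Capture everything after "utm_campaign=" up to the next "&", "#" or whitespace.
UTM_CAMPAIGN = re.compile(r"[?&]utm_campaign=([^&#\s]+)")


def extract_utm_campaign(url):
    """Return the utm_campaign value, or None if the parameter is absent."""
    match = UTM_CAMPAIGN.search(url)
    return match.group(1) if match else None


# Hypothetical URL, not the one from the question.
sample = (
    "https://example.com/page?utm_source=newsletter"
    "&utm_campaign=spring-sale_2024&utm_medium=email"
)
campaign = extract_utm_campaign(sample)
print(campaign)
```

The capture group does not list the allowed characters explicitly, so digits, letters, hyphens and underscores are all kept; only the delimiters end the match.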
0
votes
1
answer
73
views
How do I automatically scroll to a specific section in the DOM using Selenium?
I'm trying to use Selenium to scroll to a specific section on a webpage and retrieve the text from that section.
Context:
I’m working with a webpage that disables text highlighting through CSS ...
0
votes
1
answer
331
views
How can I compare two texts and extract only the difference into another cell
Column A | Column B | Column C
Iam18yearsold | Iam17yearsold | 7
thereisagirl | therearegirls | are,s
I need to compare two cells and then extract only the difference to the third cell. I want to have the result ...
0
votes
1
answer
86
views
Extract separated date from a text
So I have this text in Excel: Wed Aug 04 00:00:00 WIB 2021, and I need to extract the date to the cell beside it, like 04-Aug-21, which is kind of complicated for me. Can anyone help?
so I already can ...
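The string appears to follow a fixed token order (weekday, month, day, time, time zone, year), so one option is to pick the day, month and year tokens and reformat them. Below is a minimal sketch of that idea in Python rather than an Excel formula; it assumes the token order never changes and that month abbreviations are in English.

```python
from datetime import datetime


def extract_date(text):
    """Turn 'Wed Aug 04 00:00:00 WIB 2021' into '04-Aug-21'."""
    # Assumed token order: weekday, month, day, time, time zone, year.
    _, month, day, _, _, year = text.split()
    # Parse only the three date tokens; the time zone label is ignored.
    # %b depends on the locale; English month names are assumed here.
    parsed = datetime.strptime(f"{day} {month} {year}", "%d %b %Y")
    return parsed.strftime("%d-%b-%y")


result = extract_date("Wed Aug 04 00:00:00 WIB 2021")
print(result)
```

Skipping the time zone token avoids relying on `%Z`, which may not recognise abbreviations such as WIB.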
-1
votes
1
answer
148
views
How to create an Excel table from non-structured text using text formatting as the delimiter [closed]
I am trying to extract most of the information found on a government website (CFIA-CFIT Part I and Part II) and create a table in excel. This table is to have three columns; ID, Name, and Detail. The ...
0
votes
1
answer
122
views
Order arrangement of texts of docx documents in the document.xml
I am trying to extract text from docx files, but the text comes out in the wrong order: for example, text at the bottom of the page or in a random text box is extracted first, and then the texts from ...
1
vote
0
answers
56
views
Extracting text from PDF files with different structures fails: only a portion of the text is extracted. How to do it properly?
I am trying to extract text from a CV in PDF format. I came up with this script, but I have a problem. The script does not extract all the text, and I have trouble identifying the different blocks of the ...
0
votes
0
answers
202
views
Capturing Formatted Numbering from DOCX Files in Python
I'm working on a Python project where I need to extract text from DOCX files, preserving the formatted numbering. I've encountered a peculiar issue that I'm hoping someone can help me solve.
The ...
0
votes
0
answers
157
views
Guidance on Extracting Compliance Items from PDF documents by fine-tuning a LLM
Need some guidance on extracting large compliance items from raw PDF documents. I have csv with these compliance items and I want to fine-tune a LLM such that if it reads any new PDF documents it can ...
3
votes
1
answer
363
views
Parsing formulas efficiently using regex and Polars
I am trying to parse a series of mathematical formulas and need to extract variable names efficiently using Polars in Python.
Regex support in Polars seems to be limited, particularly with look-around ...
-1
votes
1
answer
123
views
Extracting Text from PDFs with Python Without Including Comments
I have been trying to extract text from PDF files to automate a significant and tedious part of my job using Python. With the help of ChatGPT, I have written multiple lines of code. However, I am ...
1
vote
1
answer
383
views
Accurately Detecting randomly rotated Text in Images
I'm trying to detect text from items, which may be rotated in various directions. I've tried using Tesseract, EasyOCR, and EAST for text detection and extraction, but I am encountering issues with ...
1
vote
0
answers
108
views
AWS Textract With AWS Signature Version 4 Using Go Lang
I have 3 credentials:
host
acckey
secretkey
These are from AWS. I am using the AWS Signature Version 4 method.
I want to use the Textract feature from AWS with Golang. I have built the code and have a ...
0
votes
2
answers
93
views
How to convert a string in python to separate strings [closed]
I have a pandas dataframe with only one column containing symbols. I need to separate those symbols in groups of 13 and 39 inside a single string.
symbol
3IINFOTECH
3MINDIA
3PLAND
20MICRONS
3RDROCK
...
1
vote
0
answers
130
views
Improving OCR accuracy with pytesseract for processing manga images
def get_string(img_path):
    img = cv2.imread(img_path)
    img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
...
2
votes
0
answers
236
views
lopdf RUST PDF - only getting text
Brand new to Rust and am trying to read a PDF file with lopdf.
I'm trying out various examples, but I am just getting characters. I need all the chars like spaces, tabs, line breaks, etc. for regex.
Is ...
0
votes
0
answers
93
views
How to use custom ToUnicode maps for text extraction
Trying to use iText7 to extract text from a PDF file outputs question marks only:
???????? ??????????
?
???????????????????????? ???????????????????
???????? ????????????????????????????
?????????????????...
0
votes
0
answers
138
views
How to extract text instead of question marks from PDF file
Trying to use iText7 8.0.4 to extract text from a PDF file outputs question marks only:
???????? ??????????
?
???????????????????????? ???????????????????
???????? ????????????????????????????
???????????...
0
votes
0
answers
58
views
How to select only necessary info in table extraction using UiPath?
How can I extract only the needed info from a web page using UiPath table extraction? When I try to select the specific info, the other unwanted info is also selected due to its similar pattern as the ...
0
votes
0
answers
128
views
How to use text extraction strategy
I am stuck on an iText7 custom strategy. My goal is to extract data from a PDF to a text file without losing the table format. My PDF has a different table structure, some table columns are horizontal ...
2
votes
2
answers
80
views
R function to extract two word phrases in a list which also contains the first word as a separate string
I have a data frame with a string column, and a list of words/phrases which I would like to extract from the column. I have used the following code.
df <- data.frame(string = c("A rose is a ...
0
votes
2
answers
54
views
Extracting datetime column from PDF document
I am trying to extract data from PDF documents in R using "str_extract_all" function. I am trying to look for a date time field, which is displayed in the document in the below format:
Est ...
1
vote
4
answers
503
views
Pandas Extract Phone Number if it is in Correct Format [closed]
I have a column that has phone numbers. They are usually formatted in (555) 123-4567 but sometimes they are in a different format or they are not proper numbers. I am trying to convert this field to ...
2
votes
1
answer
1k
views
How to print the table and lines in their reading order from a document analyzed by AWS Textract
I am using AWS Textract in order to extract text and tables from a pdf document.
I need code that can parse the text extracted, and tables extracted and print everything in one string in the order ...
0
votes
0
answers
144
views
Improving Text Alignment in OCR-Extracted Text Using pytesseract
I'm facing issues with misaligned text extraction from images. I suspect the problem lies in formatting rather than extraction. Can I utilize bounding box coordinates to improve text alignment?
see ...
0
votes
4
answers
902
views
Extract text having different length in excel
I have a list of address data as below. But they don't follow any pattern. Comma, dot or space was used to separate words. I applied the formula
=TRIM(RIGHT(A1,FIND(" ",A1,FIND(" ",...
0
votes
1
answer
896
views
Extract text from PDF in correct visual order
When using a Python library to extract text from a PDF, the order of the selected text doesn't match what you visually see on the screen. For instance, when I copy some text at the top of the page, then a ...
0
votes
1
answer
78
views
How to find the coordinates in a text?
This is some of my text:
PalHebron Governorate, Palestine31°31′27″N 35°6′32″E / 31.52417°N 35.10889°E / / 31.52417; 35.10889 (Hebron/Al-Khalil Old Town)
Cultural:(ii), (iv), (vi)
20.6 (51)
2017
2017–
...
0
votes
1
answer
39
views
How to extract and save a giant tar archive in Python
I am facing the following case:
I need to extract a very big tar.xc archive, which contains one .dat file with a size close to 15 GB, and save the file to a folder.
But if I'm using tarfile.open(path/to/archive)...
1
vote
1
answer
70
views
Using Tesseract not able to identify single character from the image
I tried to extract the number from the attached image
But I am not getting the number 8 as an output. I tried with different PSM values as well like 6, 10 etc.
This is what I have so far:
image = ...
0
votes
0
answers
287
views
Text extraction from pdf tables - differentiating between columns & rows
I'm using 'Fitz' library to extract text from a pdf file. Bounding boxes/rectangles will be drawn around tables from which text is supposed to be extracted.
The current extraction is returning the ...