Newest 'text-extraction' Questions

Best practices

1 vote

1 reply

85 views

How to build an OCR system in Flutter that can extract structured data from multiple bill formats?

I am trying to build an OCR feature in my Flutter app that can read hotel bills in multiple formats. The challenge is that these bills do not follow a fixed layout. From each bill, I need to extract ...

Manish sahu

406

asked Nov 28 at 8:33

Advice

0 votes

1 reply

31 views

How to identify if a table has a header when extracted from Textract

I'm extracting tables from financial statement PDFs (like 10-Ks). Textract has a feature that allows extracting them in Markdown format. However, some of the tables don't have headers. For example:...

Jason Pereira

33

asked Nov 3 at 5:26

0 votes

0 answers

171 views

How can I extract a table from a PDF document where the cells contain multi-line text that is vertically centered?

I'm working on extracting bank transaction data from a PDF shown below using Python. Each transaction includes two dates, amounts (Money In, Money Out, Balance), and a description. The challenge is ...

Sand

1

asked Jul 8 at 13:19

1 vote

0 answers

93 views

How to extract whole number as single token from pdf

An ASP.NET Core 9 MVC / C# controller extracts text from a PDF using PdfPig, based on code in the answer to "How to group text to lines if there is small difference in Y position". If thousands are separated by ...

Andrus

28.3k

asked Jun 10 at 18:32

0 votes

0 answers

62 views

How to preserve text on single PDF to Image conversion?

I'm trying to import a user's daily duties, which are in PDF format. The PDF contains no more than 20 dates and airport/time combinations, which is what I need to capture from the document. I have the text ...

Paul Wilson

33

asked Jun 10 at 1:29

0 votes

0 answers

45 views

How to match and highlight text in a PDF with PyMuPDF when control characters are present between sentences?

I'm using PyMuPDF (fitz) to search for and highlight text in a PDF. However, the PDF text contains various control characters between sentences, which makes it difficult to match multi-sentence ...

Shantanu

1

asked May 21 at 19:43

0 votes

0 answers

109 views

PyMuPDF - Extract table contents

I am trying to extract the table text of a PDF. With the following code I get: page 0 of page-1-ocr.pdf Tables rowsasf 49 texysdft [['', '', 'Staatlic', 'he Fische', 'rprüfung', 'in Bayern - Prü', '...

Marc

4,049

asked Apr 18 at 19:39

0 votes

0 answers

55 views

KeyFrame detection in python

I'm building a RAG system for a platform where the primary content consists of videos and slides. My approach involves extracting keyframes from videos using OpenCV diff = cv2.absdiff(prev_image, ...

Daniel

13

asked Mar 24 at 15:40

1 vote

0 answers

63 views

Preserve Empty Columns When Extracting Tables from PDF

I have 25–30 different types of PDF documents, each containing tables with varying structures. My ultimate goal is to extract table data from specific headings (i.e., between certain titles) and ...

Requiet

85

asked Mar 19 at 12:13

0 votes

0 answers

110 views

Unable to Extract Text from a Specific Email Template in Gmail (Spring Boot & IMAP)

I am working on an email extraction process using Java, Spring Boot, and IMAP to read emails from Gmail. The process works fine for most emails, extracting only the text content. However, one specific ...

Irfan Abdul Salam

1

asked Feb 19 at 3:10

0 votes

1 answer

70 views

Extracting a Specific Text from PDF using RegEx

Without getting into too much detail about how we ended up in this situation (a lot of poor business decisions), I need to find the text: "SomeID=[Integer]" from a PDF file (e.g. SomeID=...

user3121062

53

asked Feb 4 at 21:43
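
A minimal Python sketch of the pattern described in the question above, assuming the PDF text has already been extracted into a plain string (the extraction step is not shown, and the question does not say which language or PDF library is in use). The function name `find_some_ids` and the sample text are hypothetical.

```python
import re

# Capture the run of digits that follows the literal text "SomeID=".
SOME_ID_PATTERN = re.compile(r"SomeID=(\d+)")


def find_some_ids(text: str) -> list[int]:
    """Return every integer that follows 'SomeID=' in the given text."""
    return [int(match) for match in SOME_ID_PATTERN.findall(text)]


# Hypothetical sample standing in for text extracted from a PDF.
sample_text = "Header SomeID=12345 footer, later SomeID=7"
print(find_some_ids(sample_text))
```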

0 votes

4 answers

300 views

Parse video ID and start time from differently formatted YouTube URL strings

I need to extract the video ID and the start time from any kind of YouTube URL that users can input. I have a working solution, but it is not right. Questions: Could someone help me to fix the ...

Zoltán Süle

1,742

asked Jan 15 at 14:37

0 votes

1 answer

110 views

Extracting path and filename from a string containing both in vanilla SQLite

I have a SQLite DB that contains fullnames (i.e., parentpath\filename, e.g. C:\Users\Public\My Music\Classic Queen\16 - Who Wants to Live Forever.mp3). I want to query and get the filename separate from ...

Jason Blue

19

asked Dec 23, 2024 at 19:32
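
A sketch of one way to do the split described above using only built-in SQLite string functions, driven from Python's standard `sqlite3` module so it can be run as-is. The table name `files`, the column name `fullname`, and the helper `split_fullnames` are assumptions for illustration. The idea is that `rtrim(fullname, <every non-backslash character in fullname>)` strips characters from the right until it reaches the last backslash, which leaves the parent path.

```python
import sqlite3

# replace(fullname, '\', '') yields the set of characters to trim from the right;
# trimming stops at the last backslash, so what remains is the parent path.
SPLIT_SQL = r"""
SELECT
    rtrim(fullname, replace(fullname, '\', '')) AS parentpath,
    substr(fullname,
           length(rtrim(fullname, replace(fullname, '\', ''))) + 1) AS filename
FROM files
"""


def split_fullnames(fullnames: list[str]) -> list[tuple[str, str]]:
    """Load the names into an in-memory table and split them with plain SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (fullname TEXT)")
    conn.executemany("INSERT INTO files VALUES (?)", [(name,) for name in fullnames])
    rows = conn.execute(SPLIT_SQL).fetchall()
    conn.close()
    return rows


example = r"C:\Users\Public\My Music\Classic Queen\16 - Who Wants to Live Forever.mp3"
for parentpath, filename in split_fullnames([example]):
    print(parentpath)
    print(filename)
```

The parent path returned this way keeps its trailing backslash.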

0 votes

0 answers

124 views

Python: Concurrent execution of drission chromiumpage

I have created an API that calls a function whose task is to extract web content from a URL that has been shared as a parameter. I am facing a problem when my API gets multiple requests, the ...

Irfanali Shaikh

1

asked Dec 23, 2024 at 15:42

0 votes

2 answers

84 views

How to extract a word in a string between two similar characters with VBA?

I need to extract Wööörd_03 from this string: "https://Word01.com/Word_02/Wööörd_03/Word_04/Word_05=0" My code doesn't work, because I get different results: Sub ExtractWord() Dim sString As ...

Jasco

253

asked Dec 17, 2024 at 18:46
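
The question above is about VBA; the following is only a Python sketch of the underlying idea, which is to split the string on the repeated delimiter and pick a segment by position. The function name `segment_between_slashes` and the choice of index are assumptions.

```python
def segment_between_slashes(text: str, index: int) -> str:
    """Split on '/' and return the segment at the given zero-based position."""
    return text.split("/")[index]


url = "https://Word01.com/Word_02/Wööörd_03/Word_04/Word_05=0"
# Splitting gives: 'https:', '', 'Word01.com', 'Word_02', 'Wööörd_03', ...
print(segment_between_slashes(url, 4))  # Wööörd_03
```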

0 votes

2 answers

184 views

How to tokenize a text in Google Spreadsheet

I have a cell (D2) in Google Sheets containing a title, and I want to extract everything up to the first punctuation mark (if any) and display it in another cell. For example, if D2 contains: ...

Emanuele Benatti

1

asked Dec 13, 2024 at 11:43

0 votes

1 answer

87 views

Extract the numbers interspersed with ‘+’

I have a column in my table in string format that contains different types of discounts: integers, decimal numbers, and compound discounts, i.e. whole numbers interspersed with the + symbol (e.g. 10+3, 5+3+...

Matilde

53

asked Nov 29, 2024 at 9:12
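
A small Python sketch of splitting the three discount shapes mentioned above (integer, decimal, compound with `+`) into their numeric parts. The question does not say which tool or output format is wanted, so returning a list of floats and the name `parse_discount` are assumptions.

```python
def parse_discount(value: str) -> list[float]:
    """Split a discount string on '+' and convert each part to a number."""
    return [float(part) for part in value.split("+") if part.strip()]


for raw in ["10", "7.5", "10+3"]:  # sample values in the three shapes described
    print(raw, "->", parse_discount(raw))
```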

1 vote

1 answer

476 views

How to extract valuable information from the JSON output of Document AI custom extractors?

I am working with a simple custom extractor in Document AI, which tries to find the following fields in any PDF uploaded: Country Nombre Adress Country Mail Adress City And I am using the following ...

Javier Romero Garcia

11

asked Nov 26, 2024 at 16:37

0 votes

1 answer

428 views

Extracting the UTM value that contains mixed characters from Google Sheets via REGEX [duplicate]

I have multiple lines like this one where I need to extract the value associated with the utm_campaign field. As you can see, the value comprises digits, letters, and other characters (e.g. "-") https:...

Alan Benlolo

13

asked Oct 26, 2024 at 14:28

0 votes

1 answer

73 views

How do I automatically scroll to a specific section in the DOM using Selenium?

I'm trying to use Selenium to scroll to a specific section on a webpage and retrieve the text from that section. Context: I’m working with a webpage that disables text highlighting through CSS ...

poe trenton

3

asked Oct 23, 2024 at 12:35

0 votes

1 answer

331 views

How can I compare two texts and extract only the differing text into another cell

Column A Column B Column C Iam18yearsold Iam17yearsold 7 thereisagirl therearegirls are,s I need to compare two cells and then extract only the difference to the third cell. I want to have the result ...

jajangjaras

15

asked Oct 9, 2024 at 8:09

0 votes

1 answer

86 views

Extract separated date from a text

So I have this text in Excel: Wed Aug 04 00:00:00 WIB 2021, and I need to extract the date into the cell beside it, like 04-Aug-21, which is kind of complicated for me. Can anyone help? So I already can ...

Toya Tanaj

13

asked Sep 26, 2024 at 4:04
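
The question above is about Excel; this is a Python sketch of the same transformation, shown only to make the steps concrete. It assumes the text always has the six space-separated fields seen in the example (weekday, month, day, time, zone, year) and that month abbreviations are in English; the function name `extract_date` is hypothetical.

```python
from datetime import datetime


def extract_date(text: str) -> str:
    """Turn 'Wed Aug 04 00:00:00 WIB 2021' into '04-Aug-21'."""
    parts = text.split()
    # The zone token ("WIB") is skipped: only day, month and year are needed.
    day, month, year = parts[2], parts[1], parts[5]
    parsed = datetime.strptime(f"{day} {month} {year}", "%d %b %Y")
    return parsed.strftime("%d-%b-%y")


print(extract_date("Wed Aug 04 00:00:00 WIB 2021"))  # 04-Aug-21
```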

-1 votes

1 answer

148 views

How to create an Excel table from non-structured text using text formatting as the delimiter [closed]

I am trying to extract most of the information found on a government website (CFIA-CFIT Part I and Part II) and create a table in Excel. This table is to have three columns: ID, Name, and Detail. The ...

Feketenyek

29

asked Sep 23, 2024 at 19:18

0 votes

1 answer

122 views

Order arrangement of texts of docx documents in the document.xml

I am trying to extract text from docx files, but I am getting collapsed text from the document: for example, the text at the bottom or in a random text box is extracted first, and then the texts from ...

vignesh

3

asked Aug 28, 2024 at 12:27

1 vote

0 answers

56 views

Extracting text from PDF files with different structures fails: only a portion of the text is extracted. How to do it properly?

I am trying to extract text from CVs in PDF format. I came up with this script, but I have a problem. The script does not extract all the text, and I have trouble identifying the different blocks of the ...

emma

363

asked Aug 26, 2024 at 17:01

0 votes

0 answers

202 views

Capturing Formatted Numbering from DOCX Files in Python

I'm working on a Python project where I need to extract text from DOCX files, preserving the formatted numbering. I've encountered a peculiar issue that I'm hoping someone can help me solve. The ...

Anshuman Sharma

1

asked Aug 23, 2024 at 23:43

0 votes

0 answers

157 views

Guidance on Extracting Compliance Items from PDF documents by fine-tuning an LLM

I need some guidance on extracting large compliance items from raw PDF documents. I have a CSV with these compliance items, and I want to fine-tune an LLM such that if it reads any new PDF documents it can ...

Daremitsu

655

asked Aug 1, 2024 at 16:23

3 votes

1 answer

363 views

Parsing formulas efficiently using regex and Polars

I am trying to parse a series of mathematical formulas and need to extract variable names efficiently using Polars in Python. Regex support in Polars seems to be limited, particularly with look-around ...

Olibarer

423

asked Jul 23, 2024 at 21:34

-1 votes

1 answer

123 views

Extracting Text from PDFs with Python Without Including Comments

I have been trying to extract text from PDF files to automate a significant and tedious part of my job using Python. With the help of ChatGPT, I have written multiple lines of code. However, I am ...

MDMT

1

asked Jul 8, 2024 at 12:42

1 vote

1 answer

383 views

Accurately Detecting randomly rotated Text in Images

I'm trying to detect text from items, which may be rotated in various directions. I've tried using Tesseract, EasyOCR, and EAST for text detection and extraction, but I am encountering issues with ...

Agura

11

asked Jul 2, 2024 at 19:34

1 vote

0 answers

108 views

AWS Textract With AWS Signature Version 4 Using Go Lang

I have three credentials from AWS: host, acckey, and secretkey. I am using the AWS Signature Version 4 method, and I want to use the Textract feature from AWS with Go. I have built the code and have a ...

Hafi Ihza Farhana

13

asked Jul 2, 2024 at 11:20

0 votes

2 answers

93 views

How to convert a string in Python to separate strings [closed]

I have a pandas dataframe with only one column containing symbols. I need to separate those symbols in groups of 13 and 39 inside a single string. symbol 3IINFOTECH 3MINDIA 3PLAND 20MICRONS 3RDROCK ...

Hamza Ahmed

1,841

asked Jun 27, 2024 at 9:46

1 vote

0 answers

130 views

Improving OCR accuracy with pytesseract for processing manga images

def get_string(img_path): img = cv2.imread(img_path) img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC) gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) ...

Myat Thet

19

asked Jun 18, 2024 at 5:38

2 votes

0 answers

236 views

lopdf RUST PDF - only getting text

I'm brand new to Rust and am trying to read a PDF file with lopdf. I'm trying out various examples, but I am just getting characters. I need all the chars like spaces, tabs, line breaks, etc. for regex. Is ...

diogenes

2,191

asked Jun 9, 2024 at 5:15

0 votes

0 answers

93 views

How to use custom ToUnicode maps for text extraction

Trying to use iText7 to extract text from a PDF file outputs question marks only: ???????? ?????????? ? ???????????????????????? ??????????????????? ???????? ???????????????????????????? ?????????????????...

Andrus

28.3k

asked Jun 5, 2024 at 13:11

0 votes

0 answers

138 views

How to extract text instead of question marks from PDF file

Trying to use iText7 8.0.4 to extract text from a PDF file outputs question marks only: ???????? ?????????? ? ???????????????????????? ??????????????????? ???????? ???????????????????????????? ???????????...

Andrus

28.3k

asked Jun 4, 2024 at 12:32

0 votes

0 answers

58 views

How to select only necessary info in table extraction using UiPath?

How can I extract only the needed info from a web page using UiPath table extraction? When I try to select the specific info, the other unwanted info is also selected due to its similar pattern as the ...

Dilip

1

asked May 30, 2024 at 11:51

0 votes

0 answers

128 views

How to use text extraction strategy

I am stuck on an iText7 custom strategy. My goal is to extract data from a PDF to a text file without losing the table format. My PDF has a different table structure, some table columns are horizontal ...

Ibad Ur Rehman

1

asked May 28, 2024 at 10:10

2 votes

2 answers

80 views

R function to extract two word phrases in a list which also contains the first word as a separate string

I have a data frame with a string column, and a list of words/phrases which I would like to extract from the column. I have used the following code. df <- data.frame(string = c("A rose is a ...

ayeh

68

asked May 21, 2024 at 12:03

0 votes

2 answers

54 views

Extracting datetime column from PDF document

I am trying to extract data from PDF documents in R using "str_extract_all" function. I am trying to look for a date time field, which is displayed in the document in the below format: Est ...

Ram Subramanian

1

asked May 19, 2024 at 9:32

1 vote

4 answers

503 views

Pandas Extract Phone Number if it is in Correct Format [closed]

I have a column that has phone numbers. They are usually formatted in (555) 123-4567 but sometimes they are in a different format or they are not proper numbers. I am trying to convert this field to ...

Bijan

8,826

asked May 15, 2024 at 16:35

2 votes

1 answer

1k views

How to print the table and lines in their reading order from a document analyzed by AWS Textract

I am using AWS Textract in order to extract text and tables from a pdf document. I need code that can parse the text extracted, and tables extracted and print everything in one string in the order ...

diegofigueroa79

51

asked May 3, 2024 at 3:31

0 votes

0 answers

144 views

Improving Text Alignment in OCR-Extracted Text Using pytesseract

I'm facing issues with misaligned text extraction from images. I suspect the problem lies in formatting rather than extraction. Can I utilize bounding box coordinates to improve text alignment? see ...

code_comm

1

asked Apr 30, 2024 at 9:13

0 votes

4 answers

902 views

Extract text having different length in excel

I have a list of address data as below. But they don't follow any pattern. Comma, dot or space was used to separate words. I applied the formula =TRIM(RIGHT(A1,FIND(" ",A1,FIND(" ",...

Ngan Huynh

11

asked Apr 18, 2024 at 15:48

0 votes

1 answer

896 views

Extract text from PDF in correct visual order

While using a Python library to extract text from a PDF, the order of the selected text doesn't match what you visually see on the screen. For instance, when I copy some text at the top of the page, then a ...

Phalgun

1

asked Apr 15, 2024 at 17:09

0 votes

1 answer

78 views

How to find the coordinates in a text?

This is some of my text: PalHebron Governorate, Palestine31°31′27″N 35°6′32″E / 31.52417°N 35.10889°E / / 31.52417; 35.10889 (Hebron/Al-Khalil Old Town) Cultural:(ii), (iv), (vi) 20.6 (51) 2017 2017– ...

Midas Estanislao

9

asked Apr 15, 2024 at 3:46
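
A Python sketch for locating the degrees/minutes/seconds coordinates in text like the sample above. It assumes the coordinates always use the `°`, `′` and `″` characters followed by a hemisphere letter, as in that sample; the function name `find_dms_coordinates` and the conversion to signed decimal degrees are additions for illustration.

```python
import re

# Degrees, minutes, seconds and a hemisphere letter, e.g. 31°31′27″N.
DMS_PATTERN = re.compile(r"(\d+)°(\d+)′(\d+)″([NSEW])")


def find_dms_coordinates(text: str) -> list[float]:
    """Return each DMS coordinate found in the text as signed decimal degrees."""
    results = []
    for degrees, minutes, seconds, hemisphere in DMS_PATTERN.findall(text):
        value = int(degrees) + int(minutes) / 60 + int(seconds) / 3600
        if hemisphere in "SW":
            value = -value  # south and west are negative by convention
        results.append(value)
    return results


sample = "PalHebron Governorate, Palestine31°31′27″N 35°6′32″E / 31.52417°N 35.10889°E"
print([round(value, 5) for value in find_dms_coordinates(sample)])
```

The decimal forms already present in the sample (such as 31.52417°N) are not matched by this pattern, because they lack the minute and second marks.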

0 votes

1 answer

39 views

How to extract and save a giant tar archive in Python

I am facing the following case: I need to extract a very big tar.xc archive, which contains one .dat file close to 15 GB in size, and save the file to a folder. But if I'm using tarfile.open(path/to/archive)...

Amali Yarmukhametov

3

asked Apr 11, 2024 at 10:34

1 vote

1 answer

70 views

Using Tesseract not able to identify single character from the image

I tried to extract the number from the attached image, but I am not getting the number 8 as an output. I tried with different PSM values as well, like 6, 10, etc. This is what I have so far: image = ...

Mukul Saini

13

asked Apr 10, 2024 at 10:49

0 votes

0 answers

287 views

Text extraction from pdf tables - differentiating between columns & rows

I'm using 'Fitz' library to extract text from a pdf file. Bounding boxes/rectangles will be drawn around tables from which text is supposed to be extracted. The current extraction is returning the ...

Apoorva

115

asked Apr 8, 2024 at 6:51
