I have a PDF file whose contents are formatted as follows:

00:12 There once lived a man...

00:18 who was thought to have...

and so on, following the same pattern. I'm trying to write a program that uses a regex to read the file, remove all of the timestamps, and replace the line breaks with spaces. In other words, I want to make one big paragraph out of it.

This is what I came up with for the regular expression:

transcript.replace(transcript.matches("^[0-9:]+$"),"")

I expect that to get rid of any numbers and colons, meaning the timestamps. Now I'm not sure how to replace the line breaks. Would I do something like this?

transcript.replace(transcript.matches("^[\n]+$"), " ")

Any help would be appreciated. Thanks!

1 Answer

Couldn't you just check for a blank line, skip (or delete) those lines and use your transcript code to handle the timestamps?

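import re

kept = []
for line in file:  # file: an open text file or any iterable of lines
    line = line.strip()
    if not line:  # blank lines are read as "\n", so strip before testing
        continue
    # drop a leading timestamp such as "00:12 "
    kept.append(re.sub(r"^[0-9:]+\s*", "", line))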

This may return a block of text with the following appearance:

There once lived a man...

who was thought to have...

You would still need to join these lines into one continuous paragraph. Do the three dots appear at the end of each line as in your quoted text?
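For what it's worth, here is a minimal sketch of the whole cleanup in Python, assuming the PDF text has already been extracted into a plain string and that every timestamp sits at the start of a line as mm:ss or hh:mm:ss. The names TIMESTAMP and flatten_transcript are made up for illustration. Note that a pattern anchored with both ^ and $ only matches lines made up entirely of digits and colons, so the sketch drops the $ and matches just the leading timestamp.

```python
import re

# Assumed timestamp shape: mm:ss or hh:mm:ss at the start of a line,
# followed by optional whitespace. Adjust if the real file differs.
TIMESTAMP = re.compile(r"^\d{1,2}(?::\d{2}){1,2}\s*", re.MULTILINE)


def flatten_transcript(text: str) -> str:
    """Strip leading timestamps, then collapse all whitespace into single spaces."""
    without_stamps = TIMESTAMP.sub("", text)
    # str.split() with no arguments splits on any run of whitespace,
    # including blank lines, so joining with " " yields one paragraph.
    return " ".join(without_stamps.split())


sample = "00:12 There once lived a man...\n\n00:18 who was thought to have...\n"
print(flatten_transcript(sample))
# There once lived a man... who was thought to have...
```

Splitting on whitespace and re-joining avoids a second regex for the line breaks and also removes the blank lines between entries in one step.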
