How to extract text between certain patterns using regular expression (RegEx)?

Question

My text:

27/07/18, 12:02 PM - user_a: https://www.youtube.com/
 Watch this
27/07/18, 12:15 PM - user_b: <Media omitted>
27/07/18, 12:52 PM - user_b: Read this fully
some text
some text
.
some text
27/07/18, 12:56 PM - user_c: text ..

Here I want to extract the messages sent by each user. I tried two regexes, but neither gave me the result I wanted.

First regex:

re.findall(r''+user_name+ ':(.*)', data)

With this I couldn't extract messages that span multiple lines.

Second regex:

re.findall(r''+ user_name + ':[^(:)]*', data)

With this I couldn't extract the full text of a message containing a hyperlink, i.e. I only got "https". It treats the ":" character as the end of the message.
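To make the two failures concrete, here is a minimal sketch that runs both attempts against the sample text. The variable names are illustrative, and the user name is written directly into the pattern instead of being concatenated.

```python
import re

data = (
    "27/07/18, 12:02 PM - user_a: https://www.youtube.com/\n"
    " Watch this\n"
    "27/07/18, 12:15 PM - user_b: <Media omitted>\n"
    "27/07/18, 12:52 PM - user_b: Read this fully\n"
    "some text\n"
    "some text\n"
    ".\n"
    "some text\n"
    "27/07/18, 12:56 PM - user_c: text ..\n"
)

# Attempt 1: without re.DOTALL, "." stops at the end of the line,
# so only the first line of each message is captured.
first_attempt = re.findall(r"user_b:(.*)", data)
print(first_attempt)   # [' <Media omitted>', ' Read this fully']

# Attempt 2: [^(:)] is a character class that excludes "(", ":" and ")",
# so the match ends at the colon inside "https://".
second_attempt = re.findall(r"user_a:[^(:)]*", data)
print(second_attempt)  # ['user_a: https']
```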

How do I handle this? Any suggestions would be really helpful.

Try something like user_\w*: \s*(.*(?:\r?\n(?!\d+\/).+)*) (replace \w* with the name).
– bobble bubble, Commented Aug 31, 2018 at 14:23
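A rough sketch of how the pattern from this comment could be used in Python. The function name messages_from is hypothetical, re.escape is an addition to keep special characters in a user name from being read as regex syntax, and the slash is left unescaped because Python does not require escaping it.

```python
import re

def messages_from(user_name, text):
    """Return the messages sent by user_name (hypothetical helper)."""
    pattern = re.compile(
        # First line of the message, then any following lines that do
        # not start with digits and a slash (i.e. a new date stamp).
        re.escape(user_name) + r": \s*(.*(?:\r?\n(?!\d+/).+)*)"
    )
    return pattern.findall(text)

data = (
    "27/07/18, 12:02 PM - user_a: https://www.youtube.com/\n"
    " Watch this\n"
    "27/07/18, 12:15 PM - user_b: <Media omitted>\n"
    "27/07/18, 12:52 PM - user_b: Read this fully\n"
    "some text\n"
    "some text\n"
    ".\n"
    "some text\n"
    "27/07/18, 12:56 PM - user_c: text ..\n"
)

print(messages_from("user_b", data))
# ['<Media omitted>', 'Read this fully\nsome text\nsome text\n.\nsome text']
print(messages_from("user_c", data))
# ['text ..']
```

No flags are needed here because the pattern consumes line breaks explicitly. One limitation of this sketch: .+ requires at least one character per continuation line, so a blank line inside a message ends the match early.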

Paolo · Accepted Answer · 2018-08-31 15:38:35Z

1

You may use the following pattern:

user_b: (.*?)(?=^[0-9]{2}/[0-9]{2}/[0-9]{2})

Regex demo here.

Note the use of re.MULTILINE and re.DOTALL. re.MULTILINE makes ^ match at the start of every line rather than only at the start of the whole string, whereas re.DOTALL lets . match newlines too.
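As a quick illustration of what each flag changes, here is a small sketch on a made-up two-line string (the string and variable names are only for demonstration):

```python
import re

text = "first\n27 second"

# Without re.MULTILINE, ^ only matches at the very start of the string.
no_multiline = re.findall(r"^\d+", text)
with_multiline = re.findall(r"^\d+", text, re.MULTILINE)
print(no_multiline, with_multiline)   # [] ['27']

# Without re.DOTALL, "." never matches a newline.
no_dotall = re.findall(r"first.*", text)
with_dotall = re.findall(r"first.*", text, re.DOTALL)
print(no_dotall, with_dotall)         # ['first'] ['first\n27 second']
```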

In Python:

import re
data = '''
27/07/18, 12:02 PM - user_a: https://www.youtube.com/
 Watch this
27/07/18, 12:15 PM - user_b: <Media omitted>
27/07/18, 12:52 PM - user_b: Read this fully
some text
some text
.
some text
27/07/18, 12:56 PM - user_c: text ..
'''
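usern = 'user_b'

# re.escape guards against regex metacharacters in the user name.
# DOTALL lets . cross newlines; MULTILINE lets ^ match at each line start.
pattern = re.compile(re.escape(usern) + r": (.*?)(?=^[0-9]{2}/[0-9]{2}/[0-9]{2})", re.DOTALL | re.MULTILINE)
print(pattern.findall(data))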

Prints:

['<Media omitted>\n', 'Read this fully\nsome text\nsome text\n.\nsome text\n']
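One case the pattern above does not cover is a message that is the last entry in the text: no date line follows it, so the lookahead never succeeds and the message is skipped (user_c in the sample). A possible workaround, sketched below, is to also accept the end of the string via \Z. The function name extract_messages and the stripping of surrounding whitespace are choices made for this sketch and are not part of the answer.

```python
import re

DATE = r"[0-9]{2}/[0-9]{2}/[0-9]{2}"

def extract_messages(user_name, text):
    """Return messages by user_name, including a final message (sketch)."""
    pattern = re.compile(
        # Stop at the next line that starts with a date, or at end of text.
        re.escape(user_name) + r": (.*?)(?=^" + DATE + r"|\Z)",
        re.DOTALL | re.MULTILINE,
    )
    return [message.strip() for message in pattern.findall(text)]

data = (
    "27/07/18, 12:02 PM - user_a: https://www.youtube.com/\n"
    " Watch this\n"
    "27/07/18, 12:15 PM - user_b: <Media omitted>\n"
    "27/07/18, 12:52 PM - user_b: Read this fully\n"
    "some text\n"
    "some text\n"
    ".\n"
    "some text\n"
    "27/07/18, 12:56 PM - user_c: text ..\n"
)

# The date-only lookahead finds nothing for the last sender.
original = re.findall(r"user_c: (.*?)(?=^" + DATE + ")", data,
                      re.DOTALL | re.MULTILINE)
print(original)                          # []
print(extract_messages("user_c", data))  # ['text ..']
```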


6 Comments

Gary_W Over a year ago

Won't you have the same issue you raised in your comment to me, if the client starts a line with the same date/time format? :-) I think that to make it more reliable, the logging needs to include a string of characters unlikely to ever be entered that can be matched on (but if it's a field with no entry validation, it's a matter of time before some user enters the magic string).

Paolo Over a year ago

Yes, that's correct Gary. I'll modify the lookahead to make it even more restrictive when I get home, as (?=^[0-9]{2}/[0-9]{2}/[0-9]{2}, [0-9]{2}:[0-9]{2} [AP]M - user)

Gary_W Over a year ago

I was updating mine with that same pattern when I realized the problem.

Paolo Over a year ago

What problem is there if using (?=^[0-9]{2}/[0-9]{2}/[0-9]{2}, [0-9]{2}:[0-9]{2} [AP]M - user)?

Gary_W Over a year ago

I didn't include the text "user" as I suspect that was just an example and real user IDs won't match that pattern. Who knows though. Posters rarely supply all the info one needs to properly answer a question, so we work with what we are given :-/
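The stricter lookahead discussed in these comments could look roughly like the sketch below. It matches the full "dd/mm/yy, hh:mm AM/PM - " header and, following Gary_W's remark, leaves out the literal "user". The HEADER constant, the function name and the \Z alternative (so a final message is not dropped) are assumptions of this sketch, as is the two-digit hour, which is all the sample text shows.

```python
import re

# Assumed header layout, taken from the sample text: "27/07/18, 12:52 PM - "
HEADER = r"[0-9]{2}/[0-9]{2}/[0-9]{2}, [0-9]{2}:[0-9]{2} [AP]M - "

def extract_with_strict_header(user_name, text):
    """Return messages by user_name, ending only at a full header (sketch)."""
    pattern = re.compile(
        "^" + HEADER + re.escape(user_name)
        + r": (.*?)(?=^" + HEADER + r"|\Z)",
        re.DOTALL | re.MULTILINE,
    )
    return [message.rstrip("\n") for message in pattern.findall(text)]

# The second line of the message starts with a date but is not a header.
data = (
    "27/07/18, 12:52 PM - user_b: Read this fully\n"
    "27/07/18 was a long day\n"
    "some text\n"
    "27/07/18, 12:56 PM - user_c: text ..\n"
)

# The date-only lookahead cuts the message short at the second line.
date_only = re.findall(r"user_b: (.*?)(?=^[0-9]{2}/[0-9]{2}/[0-9]{2})",
                       data, re.DOTALL | re.MULTILINE)
print(date_only)   # ['Read this fully\n']

print(extract_with_strict_header("user_b", data))
# ['Read this fully\n27/07/18 was a long day\nsome text']
```

As Gary_W points out, this only narrows the problem: a user who types a complete header at the start of a line would still split the message.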


Gary_W · Accepted Answer · 2018-08-31 14:47:52Z

1

I believe your regex should be: user_b: (.*?)^[0-9]. After your user is found, match the rest of the text until a number is found as the first character of a line (the next entry). Make sure to turn on multi-line mode; in Python you also need re.DOTALL so that . can cross line breaks.

See a demo here.
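For reference, a minimal Python sketch of this pattern. The flags are spelled out here: re.MULTILINE for ^, plus re.DOTALL, without which .*? cannot cross a line break and the pattern never matches. The second sample string is made up to show the weakness raised in the comments on this answer, where a message line beginning with a digit ends the match early.

```python
import re

pattern = re.compile(r"user_b: (.*?)^[0-9]", re.MULTILINE | re.DOTALL)

sample = (
    "27/07/18, 12:15 PM - user_b: <Media omitted>\n"
    "27/07/18, 12:52 PM - user_b: Read this fully\n"
    "some text\n"
    "some text\n"
    ".\n"
    "some text\n"
    "27/07/18, 12:56 PM - user_c: text ..\n"
)
works = pattern.findall(sample)
print(works)
# ['<Media omitted>\n', 'Read this fully\nsome text\nsome text\n.\nsome text\n']

# Made-up input: the second line of the message starts with a digit.
tricky = (
    "27/07/18, 12:52 PM - user_b: Read this fully\n"
    "3 things to remember\n"
    "some text\n"
    "27/07/18, 12:56 PM - user_c: text ..\n"
)
truncated = pattern.findall(tricky)
print(truncated)   # ['Read this fully\n']
```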


3 Comments

Paolo Over a year ago

Will fail if a number is at the beginning of the string to be matched.

Gary_W Over a year ago

Good catch! I was then inclined to match the date/time instead of just the number, but since it's user input, that could be matched as well; who knows what they will enter! I would still do that if the logging could add a sequence of characters, or something similar to look for, that had a low chance of being entered by users.

Paolo Over a year ago

Yeah, a lookahead is needed for the exact dd/mm/yy pattern.
