My text:

27/07/18, 12:02 PM - user_a: https://www.youtube.com/
 Watch this
27/07/18, 12:15 PM - user_b: <Media omitted>
27/07/18, 12:52 PM - user_b: Read this fully
some text
some text
.
some text
27/07/18, 12:56 PM - user_c: text ..

Here I want to extract the messages sent by each user. I tried two regexes, but neither gave the result I wanted.

First regex:

re.findall(r''+user_name+ ':(.*)', data)

Here I couldn't extract messages that span multiple lines.

Second regex:

re.findall(r''+ user_name + ':[^(:)]*', data)

Here I couldn't extract the full text of a message containing a hyperlink, i.e., I only got "https", because the pattern treats the ":" symbol as an endpoint.

How do I handle this? Any suggestions would be really helpful.

2 Answers

You may use the following pattern:

user_b: (.*?)(?=^[0-9]{2}/[0-9]{2}/[0-9]{2})

Note the use of re.MULTILINE and re.DOTALL. re.MULTILINE makes ^ match at the start of every line rather than only at the start of the string, whereas re.DOTALL lets . match newlines too.
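To see what each flag changes in isolation, here is a small sketch on a made-up string (the text and variable names are only for illustration, they are not from the question):

```python
import re

text = "a: one\ntwo\nb: three\n"

# Default: '.' stops at a newline, so only the first line is captured.
plain = re.findall(r"a: (.*)", text)

# re.DOTALL: '.' also matches newlines, so the greedy group runs to the end.
dotall = re.findall(r"a: (.*)", text, re.DOTALL)

# Default: '^' matches only at the very start of the string.
start_only = re.findall(r"^\w+:", text)

# re.MULTILINE: '^' matches at the start of every line.
every_line = re.findall(r"^\w+:", text, re.MULTILINE)

print(plain)       # ['one']
print(dotall)      # ['one\ntwo\nb: three\n']
print(start_only)  # ['a:']
print(every_line)  # ['a:', 'b:']
```

The pattern in this answer needs both: DOTALL so the captured group can cross line breaks, and MULTILINE so the lookahead can find the date at the start of the next entry.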


In Python:

import re
data = '''
27/07/18, 12:02 PM - user_a: https://www.youtube.com/
 Watch this
27/07/18, 12:15 PM - user_b: <Media omitted>
27/07/18, 12:52 PM - user_b: Read this fully
some text
some text
.
some text
27/07/18, 12:56 PM - user_c: text ..
'''
usern = 'user_b'

# Escape the user name so any regex metacharacters in it are matched literally.
pattern = re.compile(re.escape(usern) + r": (.*?)(?=^[0-9]{2}/[0-9]{2}/[0-9]{2})", re.DOTALL | re.MULTILINE)
print(pattern.findall(data))

Prints:

['<Media omitted>\n', 'Read this fully\nsome text\nsome text\n.\nsome text\n']
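One caveat: the lookahead needs a following dated line, so a message that is the very last entry in the log would not be returned. A possible way around that, as a sketch only, is to also accept the end of the string (\Z) in the lookahead. The helper name messages_from and the shortened log below are my own, not part of the question:

```python
import re

DATE = r"[0-9]{2}/[0-9]{2}/[0-9]{2}"

def messages_from(user, log):
    """Return all messages by `user`; hypothetical helper for illustration."""
    pattern = re.compile(
        # Stop at the next line starting with a date, or at the end of the log.
        re.escape(user) + r": (.*?)(?=^" + DATE + r"|\Z)",
        re.DOTALL | re.MULTILINE,
    )
    return pattern.findall(log)

log = (
    "27/07/18, 12:52 PM - user_b: Read this fully\n"
    "some text\n"
    "27/07/18, 12:56 PM - user_c: text ..\n"
    "more text\n"
)

print(messages_from("user_b", log))  # ['Read this fully\nsome text\n']
print(messages_from("user_c", log))  # ['text ..\nmore text\n']
```

Without the \Z alternative, the call for user_c would return an empty list here, because nothing dated follows its message.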

6 Comments

Won't you have the same issue you raised in your comment to me, if the client starts a line with the same date/time format? :-) I think that to make it more reliable, the logging needs to include a string of characters unlikely to ever be entered, which can then be matched on (but if it's a field with no entry validation, it's only a matter of time before some user enters the magic string).
Yes, that's correct Gary. I'll modify the lookahead to make it even more restrictive when I get home, as (?=^[0-9]{2}/[0-9]{2}/[0-9]{2}, [0-9]{2}:[0-9]{2} [AP]M - user)
I was updating mine with that same pattern when I realized the problem.
What problem is there if using (?=^[0-9]{2}/[0-9]{2}/[0-9]{2}, [0-9]{2}:[0-9]{2} [AP]M - user)?
I didn't include the text "user" as I suspect that was just for example and real user IDs won't match that pattern. Who knows though. Posters rarely supply all the info one needs to properly answer a question, so we work with what we are given :-/
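For reference, the stricter lookahead discussed in these comments could look roughly like the sketch below. It assumes every entry starts with "dd/mm/yy, h:mm AM/PM - " as in the sample, leaves out the literal "user" for the reason given above, and allows one- or two-digit hours, which is my assumption (the sample only shows two-digit hours). The names HEADER and messages_from_strict are made up for illustration:

```python
import re

# Full entry header: date, time, AM/PM marker, and the " - " separator.
HEADER = r"^[0-9]{2}/[0-9]{2}/[0-9]{2}, [0-9]{1,2}:[0-9]{2} [AP]M - "

def messages_from_strict(user, log):
    """Return messages by `user`, splitting only on full entry headers."""
    pattern = re.compile(
        HEADER + re.escape(user) + r": (.*?)(?=" + HEADER + r"|\Z)",
        re.DOTALL | re.MULTILINE,
    )
    return pattern.findall(log)

log = (
    "27/07/18, 12:52 PM - user_b: Dates to remember:\n"
    "28/07/18 is the deadline\n"
    "27/07/18, 12:56 PM - user_c: text ..\n"
)

print(messages_from_strict("user_b", log))
# ['Dates to remember:\n28/07/18 is the deadline\n']
```

A message line that merely starts with a date no longer ends the match early. A user who types a complete header at the start of a line would still fool it, as noted above.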

I believe your regex should be: user_b: (.*?)^[0-9]. After your user is found, match everything up to the next line that starts with a digit (the next entry). Make sure to turn on multi-line mode, and dot-all mode as well so that . can cross line breaks.

3 Comments

Will fail if a line inside the message being matched starts with a number.
Good catch! I was then inclined to match the date/time instead of just the number but since it's user input, that could be matched as well as who knows what they will enter! I would still do that if the logging could add a sequence of characters or something to look for instead that had a low chance of being entered by users.
Yeah, a lookahead is needed for the exact dd/mm/yy pattern.
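A quick sketch of the failure mentioned in these comments, using a made-up log in which a message line starts with a digit (the log content and variable names are only for illustration):

```python
import re

log = (
    "27/07/18, 12:52 PM - user_b: Shopping list\n"
    "2 apples\n"
    "27/07/18, 12:56 PM - user_c: text ..\n"
)
flags = re.DOTALL | re.MULTILINE

# Stops at any line that begins with a digit, so the message is cut short.
digit_only = re.findall(r"user_b: (.*?)^[0-9]", log, flags)

# Stops only at a line that begins with a dd/mm/yy date.
date_lookahead = re.findall(
    r"user_b: (.*?)(?=^[0-9]{2}/[0-9]{2}/[0-9]{2})", log, flags
)

print(digit_only)      # ['Shopping list\n']
print(date_lookahead)  # ['Shopping list\n2 apples\n']
```

The date lookahead is narrower but not bulletproof: a message line that itself begins with a date in the same format would still end the match early.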
