3

I have a message which I am trying to split.

import re

message = "Aug 10, 17:04 UTCThis is update 1.Aug 10, 15:56 UTCThis is update 2.Aug 10, 15:55 UTCThis is update 3."

split_message = re.split(r'[a-zA-Z]{3} (0[1-9]|[1-2][0-9]|3[0-1]), ([0-1]?[0-9]|2[0-3]):[0-5][0-9] UTC', message)

print(split_message)

Expected Output:

["This is update 1", "This is update 2", "This is update 3"]

Actual Output:

['', '10', '17', 'This is update 1.', '10', '15', 'This is update 2.', '10', '15', 'This is update 3.']

Not sure what I am missing.

2 Answers

4

You are using capturing groups, which is why their contents also appear in the resulting list. Use non-capturing groups instead (they begin with ?:):

import re

message = "Aug 10, 17:04 UTCThis is update 1.Aug 10, 15:56 UTCThis is update 2.Aug 10, 15:55 UTCThis is update 3."

split_message = re.split(r"[a-zA-Z]{3} (?:0[1-9]|[1-2][0-9]|3[0-1]), (?:[0-1]?[0-9]|2[0-3]):[0-5][0-9] UTC", message)

print(split_message)

You will, however, always get an empty string as the first entry, because the message starts with a match and the text before that first match is empty:

['', 'This is update 1.', 'This is update 2.', 'This is update 3.']

As stated in the docs for re.split:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
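To see that rule in isolation, here is a minimal sketch on a toy string (the string "a-b-c" and the variable names are made up for illustration, not taken from the question):

```python
import re

text = "a-b-c"  # toy input, assumed for illustration

# Capturing group: the matched separators are kept in the result.
with_capture = re.split(r"(-)", text)

# Non-capturing group: only the text between separators is returned.
without_capture = re.split(r"(?:-)", text)

print(with_capture)     # ['a', '-', 'b', '-', 'c']
print(without_capture)  # ['a', 'b', 'c']
```

The same thing happens with the date pattern: each pair of parentheses around the day and the hour contributes one extra entry per match.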

1 Comment

print([x for x in split_message if x]) will remove the empty elements.
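Putting the non-capturing pattern and that filter together could look roughly like this. It is only a sketch: split_updates and TIMESTAMP are hypothetical names chosen here, and the pattern is the one from the answer.

```python
import re

# Same pattern as in the answer, with non-capturing groups.
TIMESTAMP = re.compile(
    r"[a-zA-Z]{3} (?:0[1-9]|[1-2][0-9]|3[0-1]), (?:[0-1]?[0-9]|2[0-3]):[0-5][0-9] UTC"
)


def split_updates(message):
    """Hypothetical helper: split on timestamps and drop empty pieces."""
    return [part for part in TIMESTAMP.split(message) if part]


message = "Aug 10, 17:04 UTCThis is update 1.Aug 10, 15:56 UTCThis is update 2.Aug 10, 15:55 UTCThis is update 3."

updates = split_updates(message)
print(updates)
# ['This is update 1.', 'This is update 2.', 'This is update 3.']
```

Note that each entry still ends with its period. If the expected output without periods is really what is wanted, something like part.rstrip(".") inside the comprehension might do, assuming a trailing period is never meaningful.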
0

This doesn't use regex, but I wanted to highlight the power of plain Python string splitting for tasks like this. It causes fewer headaches because it is easier to understand.

message = "Aug 10, 17:04 UTCThis is update 1.Aug 10, 15:56 UTCThis is update 2.Aug 10, 15:55 UTCThis is update 3."
values = message.split("UTC")
values = values[1:]
result = [v.split(".")[0] for v in values]

Note: this will not work if a message (e.g. "This is update 1.") contains a . anywhere before its end, or contains "UTC" itself.

2 Comments

Thanks for the reply. It won't work for my use case. As you mentioned, my message may contain multiple "." characters or none at all, and it can also contain "UTC". The only thing I am certain of is that it will have a date pattern like (Aug 10, 17:04 UTC), which is why I am going after a regex solution.
Ah yes, the value of context. Looks like Aram Becker has given a good answer. Will leave this answer here in case anyone finds it valuable in the future.
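For anyone landing here with the same constraints, the non-capturing pattern from the other answer appears to cope with them. Below is a minimal sketch; the message is made up to include a "." and a "UTC" inside one update and no trailing period on the other, and TIMESTAMP is a hypothetical name.

```python
import re

# Pattern from the accepted approach, with non-capturing groups.
TIMESTAMP = re.compile(
    r"[a-zA-Z]{3} (?:0[1-9]|[1-2][0-9]|3[0-1]), (?:[0-1]?[0-9]|2[0-3]):[0-5][0-9] UTC"
)

# Made-up message: the first update contains "." and "UTC",
# the second has no trailing period at all.
message = (
    "Aug 10, 17:04 UTCServer v1.2 restarted at 5 UTC."
    "Aug 10, 15:56 UTCNo trailing period here"
)

# Split on full timestamps only, then drop the leading empty string.
updates = [part for part in TIMESTAMP.split(message) if part]
print(updates)
# ['Server v1.2 restarted at 5 UTC.', 'No trailing period here']
```

Because the split happens only where the whole timestamp matches, a stray "UTC" or "." inside an update is left alone.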
