2

I need help with a regex in Python.

I've got a large HTML file (around 400 lines) with the following pattern:

text here(div,span,img tags)

<!-- 3GP||Link|| --> 

text here(div,span,img tags)

I am looking for a regular expression that extracts this:

Link

The given pattern occurs only once in the HTML file.

3 Answers

4
>>> d = """
... Some text here(div,span,img tags)
...
... <!-- 3GP||**Some link**|| -->
...
... Some text here(div,span,img tags)
... """
>>> import re
>>> re.findall(r'\<!-- 3GP\|\|([^|]+)\|\| --\>',d)
['**Some link**']
  • r'' is a raw string literal; it stops Python from interpreting backslash escapes in the string, so they reach the regex engine unchanged
  • \<!-- 3GP\|\| is an escaped pattern that matches the literal text <!-- 3GP||
  • ([^|]+) matches one or more characters up to the next | and captures them in a group
  • \|\| --\> is an escaped pattern that matches the literal text || -->
  • re.findall returns all non-overlapping matches of the pattern in a string; if the pattern contains a group, it returns the group's contents instead of the whole match
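
The points above can be sketched as a standalone script. This is only an illustration: the sample document and the variable names are made up for the example, and the third pattern simply checks that escaping < and > makes no difference in Python's re module.

```python
import re

# Sample document modelled on the question; the surrounding markup is made up.
doc = """
<div>Some text here</div>
<!-- 3GP||Some link|| -->
<span>Some text here</span>
"""

# With one capturing group, findall returns only the group's text.
with_group = re.findall(r'<!-- 3GP\|\|([^|]+)\|\| -->', doc)

# Without a group, findall returns the whole matched text instead.
without_group = re.findall(r'<!-- 3GP\|\|[^|]+\|\| -->', doc)

# Escaping < and > is optional in Python's re; both spellings match the same text.
escaped = re.findall(r'\<!-- 3GP\|\|([^|]+)\|\| --\>', doc)

print(with_group)             # ['Some link']
print(without_group)          # ['<!-- 3GP||Some link|| -->']
print(escaped == with_group)  # True
```

Since the pattern is unique in the file, the first element of the returned list is the link; an empty list means the comment was not found.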

4 Comments

Thanks, it worked. If you don't mind, could you please explain what you did there?
Strictly speaking, I think the < and > don't need escaping here, but it does no harm; they are metacharacters in some other regex implementations.
Thanks, a very nice explanation. Can you suggest a good tutorial for learning regex? The problem is that there are too many tutorials available.
Sadly, no. I'd recommend reading the Python docs for the re module, experimenting, and asking questions when you get stuck. A decent syntax highlighter might help too.
0
import re
re.match(r"<!-- 3GP\|\|(.+?)\|\| -->", "<!-- 3GP||Link|| -->").group(1)

yields "Link".
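
One caveat worth sketching: re.match only tries to match at the very start of the string, so it works on the bare comment above but returns None when the comment sits in the middle of a page. The sample page below is an assumption made up for illustration; re.search scans the whole string instead.

```python
import re

pattern = r"<!-- 3GP\|\|(.+?)\|\| -->"

# Hypothetical page: the comment is surrounded by other markup.
page = "<div>text</div>\n<!-- 3GP||Link|| -->\n<span>text</span>"

at_start = re.match(pattern, page)   # None: the page does not begin with the comment
anywhere = re.search(pattern, page)  # scans the whole string for the first match

# Guard against a missing comment before calling .group().
link = anywhere.group(1) if anywhere else None

print(at_start)  # None
print(link)      # Link
```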


0

In case you need to parse something else, you can also combine the regular expression with BeautifulSoup:

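import re
# BeautifulSoup 4 import; the original answer used the older BeautifulSoup 3 package
from bs4 import BeautifulSoup, Comment

# Replace this sample with your own HTML
html = "<p>text</p><!-- 3GP||Link|| --><p>text</p>"
soup = BeautifulSoup(html, "html.parser")
# Raw string, so the backslashes reach the regex engine unchanged
link_regex = re.compile(r'\s+3GP\|\|(.*)\|\|\s+')
# Find the first comment node whose contents match the pattern
comment = soup.find(string=lambda text: isinstance(text, Comment)
                    and link_regex.match(text))
link = link_regex.match(comment).group(1)
print(link)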

Note that in this case the regular expression only needs to match the comment's contents, because BeautifulSoup already takes care of extracting the text from the comments.

3 Comments

My HTML is too malformed; that's why I'm not using BeautifulSoup.
I see. In that case I agree that the best option is to use regular expressions to sanitize your data.
Yes, that's what I'm going to do.
