Extract text between pattern using REGEX

Question

I need help with a regex in Python.

I've got a large HTML file (around 400 lines) with the following pattern:

text here(div,span,img tags)

<!-- 3GP||Link|| --> 

text here(div,span,img tags)

I am looking for a regular expression that can extract this:

Link

The given pattern occurs only once in the HTML file.

MattH · Accepted Answer · 2011-12-20 12:15:53Z

4

>>> d = """
... Some text here(div,span,img tags)
...
... <!-- 3GP||**Some link**|| -->
...
... Some text here(div,span,img tags)
... """
>>> import re
>>> re.findall(r'\<!-- 3GP\|\|([^|]+)\|\| --\>',d)
['**Some link**']

r'' is a raw string literal; it stops Python from interpreting standard string escapes, so the backslashes reach the regex engine unchanged.
\<!-- 3GP\|\| is a regex-escaped match for <!-- 3GP||
([^|]+) matches one or more characters up to the next | and captures them in a group
\|\| --\> is a regex-escaped match for || -->
re.findall returns all non-overlapping matches of the pattern in a string. If the pattern contains a single group, it returns the text captured by that group instead of the whole match.
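
A minimal sketch of the same pattern wrapped in a function, assuming the marker occurs at most once, as the question states. The name `extract_link` and the sample URL are hypothetical, and `re.search` is swapped in for `re.findall` so that a missing marker gives `None` rather than an empty list.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
LINK_RE = re.compile(r'\<!-- 3GP\|\|([^|]+)\|\| --\>')

def extract_link(html):
    """Return the text between the || delimiters, or None if the marker is absent."""
    match = LINK_RE.search(html)
    return match.group(1) if match else None

sample = """
<div>Some text here</div>
<!-- 3GP||http://example.com/video.3gp|| -->
<span>Some text here</span>
"""
print(extract_link(sample))
print(extract_link("<div>no marker</div>"))
```

Because `[^|]+` stops at the first `|`, this only works while the link itself contains no pipe character.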

edited Dec 20, 2011 at 12:15

answered Dec 20, 2011 at 11:50

MattH


4 Comments

RanRag Over a year ago

Thanks. It worked. If you don't mind, can you please explain what you did there?

MattH Over a year ago

Strictly speaking, I think the < and > don't need escaping here, but it doesn't do any harm; they are metacharacters in some other regex implementations.
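
A quick check of that point, as a sketch: in Python's `re` module a backslash before a punctuation character simply matches that character literally, so the escaped and unescaped forms of the pattern should behave the same. The sample string and variable names are illustrative only.

```python
import re

text = "<div>x</div>\n<!-- 3GP||Link|| -->\n<span>y</span>"

escaped = re.findall(r'\<!-- 3GP\|\|([^|]+)\|\| --\>', text)
unescaped = re.findall(r'<!-- 3GP\|\|([^|]+)\|\| -->', text)

# The pipes still need escaping: an unescaped | means alternation.
print(escaped, unescaped, escaped == unescaped)
```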

RanRag Over a year ago

Thanks. A very nice explanation. Can you suggest a good tutorial for learning regex? The problem is that there are too many tutorials available.

MattH Over a year ago

Sadly, no. I'd recommend reading the Python docs for the re module, experimenting, and asking questions when you get stuck. A decent syntax highlighter might help too.

Jan Pöschko · Accepted Answer · 2011-12-20 11:52:17Z

0

import re
re.match(r"<!-- 3GP\|\|(.+?)\|\| -->", "<!-- 3GP||Link|| -->").group(1)

yields "Link".
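
One caveat worth illustrating: `re.match` only matches at the beginning of the string, so it works for the one-line example above but would return `None` when the comment sits in the middle of a larger document. A sketch with `re.search`, which scans the whole string (the sample `html` value is made up for illustration):

```python
import re

pattern = r"<!-- 3GP\|\|(.+?)\|\| -->"
html = "<div>text</div>\n<!-- 3GP||Link|| -->\n<span>text</span>"

# re.match anchors at position 0, where this string starts with <div>.
print(re.match(pattern, html))            # None

# re.search finds the first occurrence anywhere in the string.
print(re.search(pattern, html).group(1))  # Link
```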

answered Dec 20, 2011 at 11:52

Jan Pöschko


jcollado · Accepted Answer · 2011-12-20 12:20:39Z

0

In case you need to parse something else, you can also combine the regular expression with BeautifulSoup:

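import re
# BeautifulSoup 4 import; the original answer used the older
# "from BeautifulSoup import ..." form from BeautifulSoup 3.
from bs4 import BeautifulSoup, Comment

html = """<your html here>"""
soup = BeautifulSoup(html, "html.parser")
# Raw string so the backslashes reach the regex engine unchanged.
link_regex = re.compile(r'\s+3GP\|\|(.*)\|\|\s+')
# Find the first comment node whose contents match the pattern.
comment = soup.find(string=lambda text: isinstance(text, Comment)
                    and link_regex.match(text))
link = link_regex.match(comment).group(1)
print(link)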

Note that in this case the regular expression only needs to match the comment contents, because BeautifulSoup already takes care of extracting the text from the comments.
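
The same idea can be sketched with only the standard library, for readers who do not have BeautifulSoup installed. This swaps in `html.parser.HTMLParser`, which is not what the answer uses; `CommentCollector` and `find_link` are hypothetical names. How well it copes with badly malformed markup is not something this sketch establishes.

```python
import re
from html.parser import HTMLParser

class CommentCollector(HTMLParser):
    """Collects the contents of every HTML comment it encounters."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between <!-- and -->, delimiters excluded.
        self.comments.append(data)

# Same comment-contents pattern as in the answer above.
link_regex = re.compile(r'\s+3GP\|\|(.*)\|\|\s+')

def find_link(html):
    """Return the link from the first matching comment, or None."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    for comment in parser.comments:
        match = link_regex.match(comment)
        if match:
            return match.group(1)
    return None

html = "<div>text</div><!-- 3GP||Link|| --><span>text</span>"
print(find_link(html))
```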

answered Dec 20, 2011 at 12:20

jcollado


3 Comments

RanRag Over a year ago

My HTML is too malformed; that's why I'm not using BeautifulSoup.

jcollado Over a year ago

I see. Then I agree that the best option is to use regular expressions to sanitize your data.

RanRag Over a year ago

Yes, that's what I'm going to do.
