BeautifulSoup HTML Parsing using regex

Question

I'm trying to parse four HTML pages from IMDB.com. I would like to extract all of the IMDB IDs from each listing. Each ID appears in the HTML like this: href="/title/tt0080684/" title="Star Wars: Episode V - The Empire Strikes Back (1980)". But I can't get my regex below to work. Is something wrong with the regex, or with my BeautifulSoup syntax? Thank you!

import urllib2
from bs4 import BeautifulSoup
import re, json

for start_num in ('1', '2', '3', '4'):
   response = urllib2.urlopen('http://www.imdb.com/search/title?at=0&genres=sci_fi&sort=user_rating&start='+ start_num +'&title_type=feature')
   html_doc = response.read()
   soup = BeautifulSoup(html_doc, "html.parser")

   for movie in soup.find_all(re.compile('\"href=\"/title/\"')):
      print(movie.name)

alecxe · Accepted Answer · 2016-01-25 03:41:44Z

1

You are using find_all() with a regular expression incorrectly. A pattern passed as the first positional argument is matched against tag names, not against attributes. If you want BeautifulSoup to check the href attribute values against a regular expression, you need to pass an href keyword argument with the compiled regular expression as its value:

# Match every tag whose href attribute value contains /title/
for movie in soup.find_all(href=re.compile(r'/title/')):
    print(movie.name)
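
To go from matching tags to the IDs themselves, the same href filter can be combined with a capture group. This is only a minimal sketch: SAMPLE_HTML is made-up markup modeled on the link quoted in the question, and TITLE_ID and extract_imdb_ids are illustrative names, not part of BeautifulSoup. It assumes the IDs always look like "tt" followed by digits.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the link format quoted in the question.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/title/tt0080684/" title="Star Wars: Episode V - The Empire Strikes Back (1980)">The Empire Strikes Back</a></td></tr>
  <tr><td><a href="/title/tt0076759/" title="Star Wars (1977)">Star Wars</a></td></tr>
  <tr><td><a href="/name/nm0000184/">George Lucas</a></td></tr>
  <tr><td><a href="/title/tt0080684/">duplicate link</a></td></tr>
</table>
"""

# Assumption: an ID is "tt" followed by digits, right after /title/.
TITLE_ID = re.compile(r"^/title/(tt\d+)/")


def extract_imdb_ids(html):
    """Return the unique title IDs found in href attributes, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    ids = []
    # href=<compiled regex> filters on the attribute value, not the tag name.
    for link in soup.find_all("a", href=TITLE_ID):
        match = TITLE_ID.search(link["href"])
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids


if __name__ == "__main__":
    print(extract_imdb_ids(SAMPLE_HTML))
```

In the question's loop, html_doc would take the place of SAMPLE_HTML. The duplicate check is there because a listing page may link to the same title more than once (for example from a poster image and from the title text).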


SunnyMarkLiu · Accepted Answer · 2016-01-25 03:54:27Z

0

I guess you want to get the tag and its content, which is the movie name. The regular expression is wrong: there is no quotation mark to the left of href. You can try this one:

re.compile('href=\"/title/\"')

I hope this works.


1 Comment

SunnyMarkLiu Over a year ago

Oops, it should be re.compile('href=\"/title/')
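
A pattern that contains href=" describes raw markup rather than a tag name or an attribute value, so it would be applied to the HTML string with the re module rather than passed to find_all(). A rough sketch under that assumption follows. RAW_HTML, LINK_PATTERN, and ids_from_raw_html are illustrative names, and the capture group and closing quote are additions that are not in the comment above.

```python
import re

# Hypothetical fragment of raw page source, modeled on the link quoted in the question.
RAW_HTML = (
    '<a href="/title/tt0080684/" '
    'title="Star Wars: Episode V - The Empire Strikes Back (1980)">'
)

# The pattern spans the attribute name and its value, so it only makes sense
# against the unparsed markup string.
LINK_PATTERN = re.compile(r'href="/title/(tt\d+)/"')


def ids_from_raw_html(html):
    """Return every title ID matched in the raw markup, in order of appearance."""
    return LINK_PATTERN.findall(html)


if __name__ == "__main__":
    print(ids_from_raw_html(RAW_HTML))
```

This approach is brittle: single-quoted attributes or extra whitespace in the markup would not match, which is why filtering on the parsed href attribute is usually the safer choice.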
