I'm trying to parse four HTML pages from IMDB.com. I would like to extract all of the IMDB IDs from each listing. Each ID can be found within the HTML code and looks something like this: href="/title/tt0080684/" title="Star Wars: Episode V - The Empire Strikes Back (1980)". But I can't seem to get my regex below to work. Is something wrong with the regex, or with my BeautifulSoup syntax? Thank you!

import urllib2
from bs4 import BeautifulSoup
import re, json

for start_num in ('1', '2', '3', '4'):
   response = urllib2.urlopen('http://www.imdb.com/search/title?at=0&genres=sci_fi&sort=user_rating&start='+ start_num +'&title_type=feature')
   html_doc = response.read()
   soup = BeautifulSoup(html_doc, "html.parser")

   for movie in soup.find_all(re.compile('\"href=\"/title/\"')):
      print(movie.name)

2 Answers

You are using find_all() with a regular expression incorrectly. A compiled pattern passed as the first argument is matched against tag names, not against attributes. If you want BeautifulSoup to check the href attribute values against a regular expression, you need to provide an href keyword argument with a regular expression as its value:

for movie in soup.find_all(href=re.compile(r'/title/')):
    # the loop variable is movie, so use it here (tag was undefined)
    print(movie.name)
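
To go from matching tags to the IDs themselves, one option is to reuse the same pattern with a capture group. The following is only a minimal sketch: the inline html_doc is made-up sample markup modeled on the snippet in the question (not a real IMDB response), and the names title_re and imdb_ids are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the snippet in the question;
# the real page may be structured differently.
html_doc = '''
<a href="/title/tt0080684/" title="Star Wars: Episode V - The Empire Strikes Back (1980)">poster</a>
<a href="/title/tt0080684/">Star Wars: Episode V - The Empire Strikes Back</a>
<a href="/title/tt0076759/">Star Wars</a>
<a href="/name/nm0000148/">Harrison Ford</a>
'''

soup = BeautifulSoup(html_doc, "html.parser")

# The group captures the ID; the pattern is tested against each href value.
title_re = re.compile(r'^/title/(tt\d+)/')

imdb_ids = []
for link in soup.find_all('a', href=title_re):
    imdb_id = title_re.search(link['href']).group(1)
    if imdb_id not in imdb_ids:  # the same title may be linked more than once
        imdb_ids.append(imdb_id)

print(imdb_ids)
```

With this sample markup the /name/ link is skipped and the duplicate title link is collapsed, so only two IDs are collected.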

I guess you want to get the tag and its content, which is the movie name. The regular expression was wrong (there is no quote to the left of href). You can try this one:

re.compile('href=\"/title/\"')

I hope this works.

1 Comment

Oops, that should be re.compile('href=\"/title/')
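
A pattern that includes the literal text href=" describes the raw markup rather than an attribute value, so it is probably best suited to being run against the HTML string itself with the re module, not passed to find_all(). A rough sketch of that approach, assuming made-up sample markup and the illustrative names pattern and ids:

```python
import re

# Hypothetical sample markup modeled on the snippet in the question.
html_doc = '''
<a href="/title/tt0080684/" title="Star Wars: Episode V - The Empire Strikes Back (1980)">poster</a>
<a href="/title/tt0076759/">Star Wars</a>
<a href="/title/tt0080684/">Star Wars: Episode V - The Empire Strikes Back</a>
'''

# Match the attribute as it appears in the raw text and capture the ID.
pattern = re.compile(r'href="/title/(tt\d+)/"')

# findall returns only the captured group; set() drops repeated IDs.
ids = sorted(set(pattern.findall(html_doc)))
print(ids)
```

This skips the parser entirely, which is simpler but more brittle if the markup changes (for example, single-quoted attributes would not match).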
