I'm trying to parse four HTML pages from IMDB.com. I would like to extract all of the IMDb IDs from each listing. Each ID can be found within the HTML code, in a link that looks something like this: href="/title/tt0080684/" title="Star Wars: Episode V - The Empire Strikes Back (1980)". But I can't seem to get my regex below to work. Is something wrong with the regex, or with my BeautifulSoup syntax? Thank you!
import urllib2
from bs4 import BeautifulSoup
import re, json
for start_num in ('1', '2', '3', '4'):
    response = urllib2.urlopen('http://www.imdb.com/search/title?at=0&genres=sci_fi&sort=user_rating&start='+ start_num +'&title_type=feature')
    html_doc = response.read()
    soup = BeautifulSoup(html_doc, "html.parser")
    for movie in soup.find_all(re.compile('\"href=\"/title/\"')):
        print(tag.name)
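For reference, here is a minimal sketch of one way this kind of extraction might be done. It rests on two points about the code above: when a compiled regex is passed as the first argument to find_all, BeautifulSoup matches it against tag names rather than attributes, and the loop variable is movie while the body prints tag. The helper name extract_imdb_ids, the TITLE_HREF pattern, and the sample HTML are hypothetical and assume that title links always have the form /title/tt<digits>/; the real IMDb markup may differ.

```python
import re
from bs4 import BeautifulSoup

# Assumed pattern: title links look like /title/tt0080684/
TITLE_HREF = re.compile(r'^/title/(tt\d+)/')

def extract_imdb_ids(html_doc):
    """Return the unique IMDb IDs found in title links, in page order."""
    soup = BeautifulSoup(html_doc, "html.parser")
    ids = []
    # Passing href=<regex> filters <a> tags by the value of their href
    # attribute; a regex given as the first argument would be matched
    # against tag names instead.
    for link in soup.find_all('a', href=TITLE_HREF):
        imdb_id = TITLE_HREF.match(link['href']).group(1)
        if imdb_id not in ids:  # the same title may be linked more than once
            ids.append(imdb_id)
    return ids

# Hypothetical sample markup, not copied from the real page
sample = '''
<a href="/title/tt0080684/" title="Star Wars: Episode V - The Empire Strikes Back (1980)">img</a>
<a href="/title/tt0080684/">The Empire Strikes Back</a>
<a href="/name/nm0000148/">Harrison Ford</a>
'''
print(extract_imdb_ids(sample))  # ['tt0080684']
```

In the original loop, the same function could be called on html_doc for each of the four pages, with the results collected into one list.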