I am trying to scrape http://www.nseindia.com using urllib2 and BeautifulSoup. Unfortunately, I keep getting 403 Forbidden whenever I request the page through Python. I thought it was a user-agent issue, but changing the User-Agent header did not help. Then I suspected cookies, but loading the page in the links text-mode browser with cookies turned off works fine. What could be blocking requests made through urllib2?
1 Answer
http://www.nseindia.com/ seems to require an Accept header, for whatever reason. This should work:
import urllib2

# Build the request and attach the headers the server insists on.
r = urllib2.Request('http://www.nseindia.com/')
r.add_header('Accept', '*/*')  # the header that actually matters here
r.add_header('User-Agent', 'My scraping program <[email protected]>')

# Fetch the page through a default opener.
opener = urllib2.build_opener()
content = opener.open(r).read()
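If you are on Python 3, where urllib2 was folded into urllib.request, a rough equivalent would be the following (the header values are the same as above; the e-mail address is just the placeholder from the original snippet):

import urllib.request

req = urllib.request.Request('http://www.nseindia.com/')
req.add_header('Accept', '*/*')
req.add_header('User-Agent', 'My scraping program <[email protected]>')

# urlopen returns a response object usable as a context manager.
with urllib.request.urlopen(req) as resp:
    content = resp.read()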
Refusing requests without an Accept header is incorrect; RFC 2616 (section 14.1) clearly states:
If no Accept header field is present, then it is assumed that the client accepts all media types.
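If you want to verify this behaviour yourself, you can issue the same request with and without the Accept header and compare the outcomes. This is only a sketch; the server's behaviour may well have changed since this answer was written, and the User-Agent string below is an illustrative placeholder:

import urllib2

def fetch(with_accept):
    # Issue a request that differs only in the presence of the Accept header.
    r = urllib2.Request('http://www.nseindia.com/')
    r.add_header('User-Agent', 'My scraping program')
    if with_accept:
        r.add_header('Accept', '*/*')
    try:
        urllib2.build_opener().open(r)
        return 'OK'
    except urllib2.HTTPError as e:
        return 'HTTP %d' % e.code

print('with Accept:    %s' % fetch(True))   # expected: OK
print('without Accept: %s' % fetch(False))  # expected: HTTP 403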
Comments
RoundTower
Nice answer. Out of curiosity, how did you discover this?
phihag
@RoundTower I captured a working request (made by Chromium) and replayed the exact same headers in Python. Once that worked, I removed the headers one at a time until the request failed (see the sketch after these comments).
avi
@phihag - how did you capture a working request in Chromium? Can I do that in Chrome also?
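For anyone who wants to automate phihag's elimination approach, here is a rough sketch: start from a dict of captured headers (the values below are made up for illustration) and drop them one at a time, keeping only those whose removal breaks the request:

import urllib2

# Headers copied from a working browser request; purely illustrative values.
captured = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0',
    'Accept-Language': 'en-US,en;q=0.8',
}

def works(headers):
    # Return True if the server accepts a request carrying these headers.
    r = urllib2.Request('http://www.nseindia.com/', headers=headers)
    try:
        urllib2.build_opener().open(r)
        return True
    except urllib2.HTTPError:
        return False

required = dict(captured)
for name in list(captured):
    trial = dict(required)
    del trial[name]
    if works(trial):  # the request still succeeds without it, so drop it
        required = trial

print('Minimal header set: %r' % required)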