I am trying to scrape http://www.nseindia.com using urllib2 and BeautifulSoup. Unfortunately, I keep getting a 403 Forbidden error when I try to access the page through Python. I thought it was a user-agent issue, but changing that did not help. Then I thought it might have something to do with cookies, but apparently loading the page in the Links browser with cookies turned off works fine. What could be blocking requests made through urllib2?

1 Answer

http://www.nseindia.com/ seems to require an Accept header, for whatever reason. This should work:

import urllib2

# Build the request and add the headers the server insists on.
r = urllib2.Request('http://www.nseindia.com/')
r.add_header('Accept', '*/*')  # without this header the server returns 403
r.add_header('User-Agent', 'My scraping program <[email protected]>')

opener = urllib2.build_opener()
content = opener.open(r).read()

Refusing requests without an Accept header is incorrect; RFC 2616 clearly states:

If no Accept header field is present, then it is assumed that the client accepts all media types.
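Since the question also mentions BeautifulSoup, here is a minimal sketch of feeding the fetched page into it. This assumes the bs4 package (adjust the import for older BeautifulSoup versions); the title is printed only as a sanity check, and what you actually extract will depend on the page's markup:

import urllib2
from bs4 import BeautifulSoup

# Same request as above: the Accept header is what avoids the 403.
r = urllib2.Request('http://www.nseindia.com/')
r.add_header('Accept', '*/*')
r.add_header('User-Agent', 'My scraping program <[email protected]>')
content = urllib2.build_opener().open(r).read()

# Parse the HTML and print the page title to confirm the parse worked.
soup = BeautifulSoup(content, 'html.parser')
print soup.title.string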

4 Comments

Nice answer. Out of curiosity, how did you discover this?
@RoundTower I captured a working request made by Chromium and added the exact same headers in Python. Once it worked, I removed the headers one by one until the request stopped working; see the sketch after these comments.
@phihag - how did you capture a working request in Chromium? Can I do that in Chrome also?
I used Wireshark, but you can also use the Chromium developer tools: just press F12 and go to the Network tab. Chrome is just Chromium with Google branding, so it works there (and in many other modern browsers) as well.
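For anyone who wants to automate that remove-headers-until-it-breaks approach, here is a rough sketch. The header set below is an illustrative stand-in for whatever your browser actually sent:

import urllib2

# Headers copied from a working browser request (illustrative values).
browser_headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ...',
}

def fetch_ok(headers):
    # Return True if the request succeeds with the given header set.
    request = urllib2.Request('http://www.nseindia.com/', headers=headers)
    try:
        urllib2.build_opener().open(request)
        return True
    except urllib2.HTTPError:
        return False

# Drop one header at a time; any header whose removal breaks the
# request is one the server actually requires.
for name in sorted(browser_headers):
    trimmed = dict(browser_headers)
    del trimmed[name]
    if not fetch_ok(trimmed):
        print '%s is required' % name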
