Python authentication with requests. Why it doesn't work? .NET and wikipedia

Question

It has been few days that I tried to scrape this page: http://londoncoffeeguide.com/

I tried to use requests or scrapy, but I'm newby to the scrapin world and I cannot find a way to login. Is it possible to login to this website with requests and use BeautifulSoup to scrape it? Or is it possible to do it with scrapy?

Furthermore, I tried to test requests following this example, and to test it on wikipedia, using the same pages linked there I tried this:

import requests
from bs4 import BeautifulSoup as bs


def get_login_token(raw_resp):
    soup = bs(raw_resp.text, 'lxml')
    token = [n['value'] for n in soup.find_all('input')
    if n['name'] == 'wpLoginToken']
    return token[0]


payload = {
    'wpName': 'my_login',
    'wpPassword': 'my_pass!',
    'wpLoginAttempt': 'Log in',
    #'wpLoginToken': '',
    }

with requests.session() as s:
    resp = s.get('http://en.wikipedia.org/w/index.php?title=Special:UserLogin')
    payload['wpLoginToken'] = get_login_token(resp)
    print payload
    response_post = s.post('http://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login', data=payload)
    response = s.get('http://en.wikipedia.org/wiki/Special:Watchlist')

    r = bs(response.content)
    print r.get_text()

What I see is that I still get the suggestion to login in order to see the wishlist page.

Where is the mistake?

Don't worry about scraping and BeautifulSoup until you can get the page you want in the first place; you're just adding complexity that will make things harder to debug. — abarnert
– abarnert, Commented Nov 4, 2013 at 22:39
Anyway, I notice that you aren't looking at response_post at all. So… how do you know whether you logged in successfully? If you didn't, you obviously won't be logged in on subsequent pages… — abarnert
– abarnert, Commented Nov 4, 2013 at 22:46
Also, any particular reason you're trying to scrape the web interface instead of using the MediaWiki API? — abarnert
– abarnert, Commented Nov 4, 2013 at 22:51
Hello abarnet: my goal is not to scrape wikipedia, but to scrape londoncoffee. I'm trying to scrape wiki using the web interface in order to make some practice. Here I'm using beautifulsoup in order to understand if I'm logged in or not. Any other way to understand if I'm in or not? — foebu
– foebu, Commented Nov 4, 2013 at 22:53
Yeah, look at the response_post. Is it the same thing you get in the browser? If so, is there a redirect you have to follow? Or some JS code that the site is expecting you to run? — abarnert
– abarnert, Commented Nov 4, 2013 at 22:58

j011y · Accepted Answer · 2013-11-05 01:57:26Z

1

I got this to login (yes i created an account and tested it)

from mechanize import Browser
    br = Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Firefox')]
    br.open("http://www.londoncoffeeguide.com")
    for form in br.forms():
        if form.attrs['id'] == 'form':
            br.form = form
    br.form['p$lt$zoneContent$PagePlaceholder$p$lt$zoneRight$logonform$Login1$UserName'] = 'username goes here'
    br.form['p$lt$zoneContent$PagePlaceholder$p$lt$zoneRight$logonform$Login1$Password'] = 'password goes here'
    response = br.submit()

then you can pass response.read() to beautiful soup and do all kinds of stuff

answered Nov 5, 2013 at 1:57

j011y

1113 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

foebu Over a year ago

Good answer, j0lly! Can you tell me just few things more about mechanize? Are there modules which allow to do the same? Is it similar to Selenium? - for the sake of curiosity and completeness. Thank you! :)

foebu Over a year ago

Actually I have another question: how to proceed visiting another page of the same website without logging in again? I tried to "br.open" another url but it requires another login.

j011y Over a year ago

Thanks! I don't know a terrible lot about mechanize to be honest as I only used it for the first time the other day at work. after you submit the form (which will log you in) you should be able to follow additional links by using the br.follow_link(text="the actual link text").

foebu Over a year ago

Oh, great, thank you! I got confused by the fact that it works with the object link and not with the url. Thank you very much. (Even though everything is useless if I am not allowed to use those data)

j011y Over a year ago

No probs. You should also be able to loop over all the links until you find the url you want also. check my answer to this question stackoverflow.com/questions/19803075/…

Collectives™ on Stack Overflow

Python authentication with requests. Why it doesn't work? .NET and wikipedia

1 Answer 1

5 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related