I have been trying for a few days to scrape this page: http://londoncoffeeguide.com/

I tried using requests and scrapy, but I'm a newbie to the scraping world and I cannot find a way to log in. Is it possible to log in to this website with requests and then use BeautifulSoup to scrape it? Or is it possible to do it with scrapy?

Furthermore, I tried to test requests by following this example and running it against Wikipedia. Using the same pages linked there, I tried this:

import requests
from bs4 import BeautifulSoup as bs


def get_login_token(raw_resp):
    # Pull the hidden wpLoginToken field out of the login form.
    soup = bs(raw_resp.text, 'lxml')
    # n.get() avoids a KeyError on <input> tags that have no name attribute
    token = [n['value'] for n in soup.find_all('input')
             if n.get('name') == 'wpLoginToken']
    return token[0]


payload = {
    'wpName': 'my_login',
    'wpPassword': 'my_pass!',
    'wpLoginAttempt': 'Log in',
    # 'wpLoginToken' is filled in below from the login page
}

with requests.Session() as s:
    resp = s.get('http://en.wikipedia.org/w/index.php?title=Special:UserLogin')
    payload['wpLoginToken'] = get_login_token(resp)
    print(payload)
    response_post = s.post('http://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login', data=payload)
    response = s.get('http://en.wikipedia.org/wiki/Special:Watchlist')

    r = bs(response.content, 'lxml')
    print(r.get_text())

The result still shows the prompt to log in before I can see the watchlist page.

Where is the mistake?

  • Don't worry about scraping and BeautifulSoup until you can get the page you want in the first place; you're just adding complexity that will make things harder to debug. Commented Nov 4, 2013 at 22:39
  • Anyway, I notice that you aren't looking at response_post at all. So… how do you know whether you logged in successfully? If you didn't, you obviously won't be logged in on subsequent pages… Commented Nov 4, 2013 at 22:46
  • Also, any particular reason you're trying to scrape the web interface instead of using the MediaWiki API? Commented Nov 4, 2013 at 22:51
  • Hello abarnet: my goal is not to scrape Wikipedia but londoncoffee. I'm scraping Wikipedia through the web interface just to get some practice. Here I'm using BeautifulSoup to work out whether I'm logged in or not. Is there any other way to tell whether I'm in? Commented Nov 4, 2013 at 22:53
  • Yeah, look at the response_post. Is it the same thing you get in the browser? If so, is there a redirect you have to follow? Or some JS code that the site is expecting you to run? Commented Nov 4, 2013 at 22:58
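
The advice in the comments above, to inspect response_post before anything else, could be sketched roughly as follows. The helper name summarize_response and the fake_post stand-in are hypothetical and exist only so the sketch runs offline; the helper relies on nothing beyond the status_code, url, history and text attributes that a requests response provides. Which marker string reliably indicates a logged-in page is an assumption that has to be checked per site.

```python
from types import SimpleNamespace


def summarize_response(resp, marker):
    """Collect the facts worth checking after a login POST.

    `resp` is anything shaped like a requests.Response (status_code, url,
    history, text). `marker` is a string expected only on a logged-in page,
    for example the account name (an assumption to verify per site).
    """
    return {
        'status': resp.status_code,
        'final_url': resp.url,
        # requests keeps the redirects it followed in resp.history
        'redirects': [r.status_code for r in resp.history],
        'marker_found': marker in resp.text,
    }


# Offline stand-in for response_post so the sketch runs without a network.
fake_post = SimpleNamespace(
    status_code=200,
    url='http://example.invalid/wiki/Main_Page',
    history=[SimpleNamespace(status_code=302)],
    text='<html><body>Welcome, my_login</body></html>',
)

report = summarize_response(fake_post, 'my_login')
print(report)
```

With the code from the question, calling summarize_response(response_post, 'my_login') right after the s.post(...) line would show whether the POST was redirected and whether the account name appears in the returned page.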

1 Answer

I got this to log in (yes, I created an account and tested it):

from mechanize import Browser

br = Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
br.addheaders = [('User-agent', 'Firefox')]
br.open("http://www.londoncoffeeguide.com")

# Select the login form, which has id="form" on this page
for form in br.forms():
    if form.attrs.get('id') == 'form':
        br.form = form

br.form['p$lt$zoneContent$PagePlaceholder$p$lt$zoneRight$logonform$Login1$UserName'] = 'username goes here'
br.form['p$lt$zoneContent$PagePlaceholder$p$lt$zoneRight$logonform$Login1$Password'] = 'password goes here'
response = br.submit()

You can then pass response.read() to BeautifulSoup and parse the page from there.
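
As a rough illustration of that last step, the sketch below parses a page with BeautifulSoup. The HTML bytes are a made-up stand-in for what response.read() might return, and the helper name extract_links is hypothetical; the real page's markup will differ.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for response.read(); the real page will look different.
html = b"""
<html><head><title>Example guide</title></head>
<body>
  <a href="/venues/one">Venue One</a>
  <a href="/venues/two">Venue Two</a>
  <a>No href here</a>
</body></html>
"""


def extract_links(page):
    """Return (text, href) pairs for every anchor that has an href."""
    soup = BeautifulSoup(page, 'html.parser')
    return [(a.get_text(strip=True), a['href'])
            for a in soup.find_all('a', href=True)]


links = extract_links(html)
title = BeautifulSoup(html, 'html.parser').title.string
print(title)
print(links)
```

The built-in 'html.parser' is used here only to avoid an extra dependency; 'lxml' would work the same way if it is installed.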

5 Comments

Good answer, j0lly! Can you tell me a few more things about mechanize? Are there other modules that do the same thing? Is it similar to Selenium? Just for the sake of curiosity and completeness. Thank you! :)
Actually, I have another question: how do I visit another page of the same website without logging in again? I tried to "br.open" another URL, but it requires another login.
Thanks! To be honest, I don't know a great deal about mechanize, as I only used it for the first time the other day at work. After you submit the form (which will log you in), you should be able to follow additional links by using br.follow_link(text="the actual link text").
Oh, great, thank you! I got confused by the fact that it works with the link object and not with the URL. Thank you very much. (Even though it is all useless if I am not allowed to use that data.)
No probs. You should also be able to loop over all the links until you find the URL you want. Check my answer to this question: stackoverflow.com/questions/19803075/…
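
The idea of looping over the links until the wanted URL turns up could look roughly like this. The helper find_link and the FakeLink stand-in are hypothetical; the helper relies only on each link having a url attribute, which mechanize link objects expose alongside text.

```python
from collections import namedtuple

# Stand-in for mechanize link objects, which expose .url and .text.
FakeLink = namedtuple('FakeLink', ['url', 'text'])


def find_link(links, fragment):
    """Return the first link whose URL contains `fragment`, or None."""
    for link in links:
        if fragment in link.url:
            return link
    return None


sample_links = [
    FakeLink('/about', 'About'),
    FakeLink('/venues/espresso-bar', 'Espresso Bar'),
    FakeLink('/contact', 'Contact'),
]

match = find_link(sample_links, '/venues/')
print(match)
```

With the browser from the answer, something like match = find_link(br.links(), '/venues/') followed by br.follow_link(match) should keep using the same browser object and its cookies; the '/venues/' fragment is only a placeholder, not a path taken from the real site.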
