Web scraping python

Question

I've been trying to use this code to extract the URLs from a page, but I can't get the Google Maps URL that appears in the HTML below. I get 'None' when I try to find the URL in this segment.

import urllib
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from urllib.request import urlopen
url="http://www.example.com"
html=urlopen(url)
soup=BeautifulSoup(html)
for tag in soup.findAll('a',href=True):
    print(tag['href'])



<div class="map_container">
    <div id="map_canvas" style="width: 100%; height: 450px; margin-top: 10px; position: relative; background-color: rgb(229, 227, 223); overflow: hidden; -webkit-transform: translateZ(0px);">
        <div class="gm-style" style="position: absolute; left: 0px; top: 0px; overflow: hidden; width: 100%; height: 100%; z-index: 0;">
            <div style="position: absolute; left: 0px; top: 0px; overflow: hidden; width: 100%; height: 100%; z-index: 0;">...</div>
            <div style="margin-left: 5px; margin-right: 5px; z-index: 1000000; position: absolute; left: 0px; bottom: 0px;">
                <a target="_blank" href="http://maps.google.com/maps?ll=28.535959,77.146119&amp;z=14&amp;t=m&amp;hl=en&amp;gl=US&amp;mapclient=apiv3" title="Click to see this area on Google Maps" style="position: static; overflow: visible; float: none; display: inline;">
                    <div style="width: 62px; height: 26px; cursor: pointer;">...</div>
                </a>
            </div>
        </div>
    </div>
</div>

JavaScript is probably needed to render the page you are trying to scrape. In that case, a urllib request will not return the page as you see it in the browser. You will need to use Selenium for that.
– joemar.ct, Commented May 7, 2014 at 13:42
Does changing soup=BeautifulSoup(html) to soup=BeautifulSoup(html, 'html.parser') help?
– alecxe, Commented May 7, 2014 at 13:43
How are you trying to find the tag attribute? It looks like it's there to me. The <a> tag, right?
– aIKid, Commented May 7, 2014 at 14:13
@alecxe Changing soup=BeautifulSoup(html) to soup=BeautifulSoup(html, 'html.parser') didn't help.
– user3612315, Commented May 7, 2014 at 14:21

alecxe · Accepted Answer · 2014-05-07 14:44:53Z

The problem here is that this maps.google.com link is part of a div with id="map_canvas" whose contents are built by JavaScript in the browser. urllib (or urllib2) only downloads the HTML the server sends and does not run scripts, so it loads the page with an empty map_canvas div:

>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> url = "http://www.zomato.com/ncr/monkey-bar-vasant-kunj-delhi/maps#tabtop"
>>> doc = BeautifulSoup(urllib2.urlopen(url))
>>> print doc.find('div', id='map_canvas')
<div id="map_canvas" style="width:100%; height:450px; margin-top: 10px;"></div>

This means that you cannot easily get the link using the tools you are using now.
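
A quick way to confirm this for any page is to compare the served HTML with what the browser shows. The following is a minimal sketch that uses only the standard library, so it runs without installing anything. The helper names (`LinksInsideId`, `links_inside`) and the abbreviated `rendered` snippet are illustrative assumptions, not part of the page or of any library.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinksInsideId(HTMLParser):
    """Collect href values of <a> tags nested inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # 0 = outside the target element, >0 = nesting depth inside it
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
            if tag == "a" and attrs.get("href"):
                self.hrefs.append(attrs["href"])
        elif attrs.get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def links_inside(html, target_id):
    """Return the hrefs of all links found inside the element with id=target_id."""
    parser = LinksInsideId(target_id)
    parser.feed(html)
    parser.close()
    return parser.hrefs


# HTML as the server sends it (the empty div printed above).
served = '<div id="map_canvas" style="width:100%; height:450px; margin-top: 10px;"></div>'
# Abbreviated, hypothetical copy of what the browser shows after scripts have run.
rendered = ('<div id="map_canvas"><div class="gm-style"><div>'
            '<a href="http://maps.google.com/maps?ll=28.536562,77.147664&amp;z=14">map</a>'
            '</div></div></div>')

print(links_inside(served, "map_canvas"))    # no links in the served HTML
print(links_inside(rendered, "map_canvas"))  # the link exists only after rendering
```

If the first call returns an empty list for the HTML that urlopen fetched, no parser choice will help, because the link is simply not in the response.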

An alternative solution would be to use Selenium, which drives a real browser that executes the page's JavaScript:

>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> browser.get(url)
>>> link = browser.find_element_by_xpath('//div[@id="map_canvas"]//a')
>>> link.get_attribute('href')
u'http://maps.google.com/maps?ll=28.536562,77.147664&z=14&t=m&hl=en&gl=US&mapclient=apiv3'
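
Note that the find_element_by_* helpers used above were removed in Selenium 4. For current releases, and for a list of pages rather than a single one, something like the following sketch may work. It assumes Selenium 4 and Firefox are installed; the function names, the 10-second wait and the "maps.google.com" filter are illustrative choices, not something the original answer specifies. It starts one headless browser and reuses it for every page, which avoids paying the browser start-up cost per URL.

```python
MAP_LINK_XPATH = '//div[@id="map_canvas"]//a'


def make_headless_firefox(wait_seconds=10):
    """Start Firefox without a visible window (assumes selenium 4 and Firefox)."""
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    # Element lookups retry for up to wait_seconds, giving the map script time to run.
    driver.implicitly_wait(wait_seconds)
    return driver


def collect_map_links(driver, urls):
    """Return {page_url: [Google Maps hrefs]} using one shared browser session."""
    results = {}
    for url in urls:
        driver.get(url)
        # "xpath" is the string value of selenium's By.XPATH constant.
        anchors = driver.find_elements("xpath", MAP_LINK_XPATH)
        hrefs = [a.get_attribute("href") for a in anchors]
        results[url] = [h for h in hrefs if h and "maps.google.com" in h]
    return results


def main():
    # Page taken from the example above; replace with your own list of pages.
    pages = ["http://www.zomato.com/ncr/monkey-bar-vasant-kunj-delhi/maps#tabtop"]
    browser = make_headless_firefox()
    try:
        print(collect_map_links(browser, pages))
    finally:
        browser.quit()  # always close the browser, even if a page fails
```

Calling main() runs it against a real browser. Because collect_map_links only needs an object with get and find_elements, the same function should work with any other WebDriver, such as headless Chrome.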

2 Comments

user3612315 Over a year ago

Can it be made faster? Opening Firefox and loading each page takes too much time, and I have a list of pages from which the Google Maps URL has to be extracted.

alecxe Over a year ago

@user3612315 you can make use of a headless browser. See stackoverflow.com/questions/18539491/… and realpython.com/blog/python/…
