
I've been using the code below to extract the URLs from a page, but I can't get the Google Maps URL that is shown in the HTML. The code returns 'None' when I try to find the URL in the segment that follows.

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.example.com"
html = urlopen(url)
soup = BeautifulSoup(html)
# Print the href of every <a> tag that has one
for tag in soup.find_all('a', href=True):
    print(tag['href'])



<div class="map_container">
    <div id="map_canvas" style="width: 100%; height: 450px; margin-top: 10px; position: relative; background-color: rgb(229, 227, 223); overflow: hidden; -webkit-transform: translateZ(0px);">
        <div class="gm-style" style="position: absolute; left: 0px; top: 0px; overflow: hidden; width: 100%; height: 100%; z-index: 0;">
            <div style="position: absolute; left: 0px; top: 0px; overflow: hidden; width: 100%; height: 100%; z-index: 0;">...</div>
            <div style="margin-left: 5px; margin-right: 5px; z-index: 1000000; position: absolute; left: 0px; bottom: 0px;">
                <a target="_blank" href="http://maps.google.com/maps?ll=28.535959,77.146119&amp;z=14&amp;t=m&amp;hl=en&amp;gl=US&amp;mapclient=apiv3" title="Click to see this area on Google Maps" style="position: static; overflow: visible; float: none; display: inline;">
                    <div style="width: 62px; height: 26px; cursor: pointer;">...</div>
                </a>
            </div>
        </div>
    </div>
</div>
  • JavaScript is probably needed to render the page you are trying to scrape. In that case, a urllib request will not return the page as you see it in the browser, because urllib does not execute scripts. You will need to use Selenium for that. Commented May 7, 2014 at 13:42
  • Does changing soup=BeautifulSoup(html) to soup=BeautifulSoup(html, 'html.parser') help? Commented May 7, 2014 at 13:43
  • How are you trying to find the tag attribute? It looks like it's there to me.. The <a> tag, right? Commented May 7, 2014 at 14:13
  • @alecxe changing soup=BeautifulSoup(html) to soup=BeautifulSoup(html, 'html.parser') didn't help. Commented May 7, 2014 at 14:21
  • @aIKid Yes, I'm using the <a> tag. Commented May 7, 2014 at 14:22

1 Answer

The problem here is that the maps.google.com link is part of a div with id="map_canvas" whose contents are constructed by JavaScript after the page loads. urllib (or urllib2 in Python 2) does not run scripts, so it loads the page with an empty map_canvas div:

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> url = "http://www.zomato.com/ncr/monkey-bar-vasant-kunj-delhi/maps#tabtop"
>>> doc = BeautifulSoup(urlopen(url), 'html.parser')
>>> print(doc.find('div', id='map_canvas'))
<div id="map_canvas" style="width:100%; height:450px; margin-top: 10px;"></div>

This means that you cannot get the link with urllib and BeautifulSoup alone, because neither of them executes the JavaScript that fills in the div.
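
As a minimal sketch of why the original loop never prints the link, the snippet below parses the empty div exactly as the server returned it in the output above and looks for an anchor inside it. The variable names `static_html` and `canvas` are only illustrative.

```python
from bs4 import BeautifulSoup

# Static HTML as returned by the server (copied from the output above).
# The map's contents are only added later by JavaScript running in a browser.
static_html = (
    '<div id="map_canvas" '
    'style="width:100%; height:450px; margin-top: 10px;"></div>'
)

soup = BeautifulSoup(static_html, "html.parser")
canvas = soup.find("div", id="map_canvas")

# The div exists but has no children, so there is no <a> tag to find.
print(canvas.find("a"))                  # None
print(canvas.find_all("a", href=True))   # []
```

The div itself is found, but any search for an `<a>` inside it comes back empty, which matches the 'None' reported in the question.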

An alternative solution would be to use selenium:

>>> from selenium import webdriver
>>> from selenium.webdriver.common.by import By
>>> browser = webdriver.Firefox()
>>> browser.get(url)
>>> link = browser.find_element(By.XPATH, '//div[@id="map_canvas"]//a')
>>> link.get_attribute('href')
'http://maps.google.com/maps?ll=28.536562,77.147664&z=14&t=m&hl=en&gl=US&mapclient=apiv3'
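
If the goal is the map position rather than the link itself, the coordinates can probably be read from the `ll` query parameter of the href. The following is a sketch under that assumption; `map_coordinates` is a hypothetical helper name, and it assumes `ll` always holds `latitude,longitude` as it does in the href above.

```python
from urllib.parse import urlparse, parse_qs


def map_coordinates(href):
    """Return (latitude, longitude) from the 'll' query parameter, or None."""
    query = parse_qs(urlparse(href).query)
    if "ll" not in query:
        return None
    # Assumption: 'll' is formatted as "<latitude>,<longitude>"
    lat, lon = query["ll"][0].split(",")
    return float(lat), float(lon)


href = ("http://maps.google.com/maps?ll=28.536562,77.147664"
        "&z=14&t=m&hl=en&gl=US&mapclient=apiv3")
print(map_coordinates(href))
```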

2 Comments

Can it be made faster? Opening Firefox and loading each page takes too much time, and I have a list of pages from which the Google Maps URL is to be extracted.
@user3612315 You can make use of a headless browser. See stackoverflow.com/questions/18539491/… and realpython.com/blog/python/…
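
One way the headless suggestion might look is sketched below. It assumes Selenium 4 with geckodriver installed, and it reuses a single browser for the whole list of pages so that startup cost is paid only once. The names `make_headless_firefox`, `collect_map_links`, and `MAP_LINK_XPATH` are hypothetical, and the 10 second implicit wait is a guess at how long the map needs to render.

```python
MAP_LINK_XPATH = '//div[@id="map_canvas"]//a'


def make_headless_firefox():
    """Start Firefox without a visible window (needs selenium and geckodriver)."""
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    browser = webdriver.Firefox(options=options)
    # Assumption: 10 seconds is enough for the JavaScript map to render
    browser.implicitly_wait(10)
    return browser


def collect_map_links(urls, browser):
    """Map each page URL to the href of its map link, or None if not found."""
    links = {}
    for url in urls:
        browser.get(url)
        # "xpath" is the value of selenium's By.XPATH constant
        anchors = browser.find_elements("xpath", MAP_LINK_XPATH)
        links[url] = anchors[0].get_attribute("href") if anchors else None
    return links
```

To use it, create the browser once with `browser = make_headless_firefox()`, pass it to `collect_map_links(pages, browser)` along with the list of page URLs, and call `browser.quit()` when finished. Passing the browser in as an argument also means the loop can be tried out with a stand-in object before Selenium is involved.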
