I am trying to parse data from a Google interactive website. The page is rendered with JavaScript, so I use Qt to load the site before parsing it. I believe the site is loaded and rendered properly, but for some reason I get an empty list back when I execute the XPath parsing code.

Here is my full code:

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit() 

url = 'https://www.consumerbarometer.com/en/graph-builder/?question=M1&filter=country:singapore,canada,mexico,brazil,argentina,united_states,bulgaria,austria,belgium,croatia,czech_republic,denmark,estonia,finland,france,germany,greece,hungary,italy,ireland,latvia,lithuania,norway,netherlands,poland,portugal,russia,romania,serbia,slovakia,spain,slovenia,sweden,switzerland,ukraine,united_kingdom,australia,china,israel,hong_kong_sar,japan,korea,new_zealand,malaysia,taiwan,turkey,vietnam'  
#This does the magic.Loads everything
r = Render(url)  
#result is a QString.
result = r.frame.toHtml()

#QString should be converted to string before processed by lxml
formatted_result = str(result.toAscii())

#Next build lxml tree from formatted_result
tree = html.fromstring(formatted_result)

archive_links = tree.xpath('//*[@id="main-page-wrapper"]/div/section/div/section[1]/div/div/graph/div/div[4]/div/div/graph-bar-chart/div[2]/svg/g[1]/g[2]/g[1]/text()')
print archive_links

This is the HTML that I am trying to grab: <text class="bar-text-label" y="22" dy="10">Argentina</text>

Any thoughts on why I am getting [] returned to me?

1 Answer

You can use a shorter and more reliable XPath expression, and you have to pass namespaces:

tree.xpath('//text[@class="bar-text-label"]/text()', namespaces={'n': 'http://www.w3.org/2000/svg'})
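How a namespaces mapping takes effect can be easy to miss, so here is a minimal sketch of the mechanism. It uses the standard library's xml.etree.ElementTree rather than lxml (an assumption, made only so the snippet runs without extra packages), and the SVG string is a hypothetical stand-in for the rendered chart. In a namespace-aware parse, a mapping such as {'n': ...} only matters when the expression uses the prefix, as in n:text. If the parser does not namespace the SVG elements at all, which may be the case with lxml.html, the unprefixed expression above can match on its own.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical stand-in for the rendered chart markup (not the real page source).
snippet = (
    '<svg xmlns="http://www.w3.org/2000/svg"><g>'
    '<text class="bar-text-label" y="22" dy="10">Argentina</text>'
    '</g></svg>'
)
root = ET.fromstring(snippet)

# Unprefixed name: finds nothing here, because the parsed tag is
# "{http://www.w3.org/2000/svg}text", not plain "text".
unprefixed = root.findall('.//text[@class="bar-text-label"]')

# Prefixed name, resolved through the mapping: finds the label element.
prefixed = root.findall('.//n:text[@class="bar-text-label"]',
                        namespaces={'n': SVG_NS})

labels = [element.text for element in prefixed]
print(labels)
```

lxml's xpath() accepts the same kind of prefix-to-URI mapping through its namespaces argument, so the prefixed form should carry over when the tree is parsed as XML.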

An alternative solution is to use the selenium browser automation package:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get('https://www.consumerbarometer.com/en/graph-builder/?question=M1&filter=country:singapore,canada,mexico,brazil,argentina,united_states,bulgaria,austria,belgium,croatia,czech_republic,denmark,estonia,finland,france,germany,greece,hungary,italy,ireland,latvia,lithuania,norway,netherlands,poland,portugal,russia,romania,serbia,slovakia,spain,slovenia,sweden,switzerland,ukraine,united_kingdom,australia,china,israel,hong_kong_sar,japan,korea,new_zealand,malaysia,taiwan,turkey,vietnam')

# wait for the svg to appear
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'svg')))

for text in driver.find_elements_by_class_name('bar-text-label'):
    print(text.text)

driver.close()

7 Comments

I had actually just tried the shorter xpath expression above and even with the namespace addition, I am still getting an empty list returned.
@Meepl hm, I haven't tried using pyqt4, but I saved the page source into an html file, parsed it with lxml.html and used the provided xpath, and it worked for me. Anyway, would you be okay with an alternative "selenium" based solution? Thanks.
yes, absolutely. i have selenium installed, but i am quite unfamiliar with it
this worked perfectly! thank you. so now my issue is that I am attempting to get the data value for each country, which has the element type: <rect rx="3" ry="3" width="76%" height="40" transform="translate(0,40)" data-value="76" class="bar"></rect>. is it possible to grab the data-value attribute with selenium? I tried: for text in driver.find_elements_by_class_name('bar'): print(data_value.text) but it did not work.
I also tried this, which did not work: for data in driver.find_elements_by_xpath('//*[contains(@data-value)]/@data-value'): print(data.text)
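For the data-value question in the last two comments, here is a possible approach, offered as a sketch rather than a tested fix for this page. A rect has no text content, so .text is not the way to read it. Selenium's XPath lookups must also select elements, not attribute nodes such as /@data-value, and XPath's contains() takes two arguments. Selenium elements do expose get_attribute(), which reads an attribute by name. The helper name collect_data_values is hypothetical and introduced only for illustration.

```python
def collect_data_values(elements):
    """Return the data-value attribute of each element that has one."""
    values = []
    for element in elements:
        # get_attribute() returns None when the attribute is absent.
        value = element.get_attribute('data-value')
        if value is not None:
            values.append(value)
    return values

# Usage with the driver from the answer above, before driver.close():
#     bars = driver.find_elements_by_class_name('bar')
#     print(collect_data_values(bars))
# Newer Selenium releases spell the lookup as
#     driver.find_elements(By.CLASS_NAME, 'bar')
```

The values come back as strings (for example '76'), so convert them with int() or float() if you need numbers.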