Issue with parsing html with lxml by xpath

Question

I am trying to parse data from an interactive Google website. It is rendered in JS, so I use Qt to load the site before parsing it. I believe I have the site loaded and rendered properly, but for some reason I am getting an empty list back when I execute the xpath parsing code.

Here is my full code:

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit() 

url = 'https://www.consumerbarometer.com/en/graph-builder/?question=M1&filter=country:singapore,canada,mexico,brazil,argentina,united_states,bulgaria,austria,belgium,croatia,czech_republic,denmark,estonia,finland,france,germany,greece,hungary,italy,ireland,latvia,lithuania,norway,netherlands,poland,portugal,russia,romania,serbia,slovakia,spain,slovenia,sweden,switzerland,ukraine,united_kingdom,australia,china,israel,hong_kong_sar,japan,korea,new_zealand,malaysia,taiwan,turkey,vietnam'  
# This does the magic: loads and renders everything
r = Render(url)
# result is a QString
result = r.frame.toHtml()

# The QString should be converted to a string before it is processed by lxml
formatted_result = str(result.toAscii())

# Next, build the lxml tree from formatted_result
tree = html.fromstring(formatted_result)

archive_links = tree.xpath('//*[@id="main-page-wrapper"]/div/section/div/section[1]/div/div/graph/div/div[4]/div/div/graph-bar-chart/div[2]/svg/g[1]/g[2]/g[1]/text()')
print archive_links

This is the HTML that I am trying to grab: <text class="bar-text-label" y="22" dy="10">Argentina</text>

Any thoughts on why I am getting [] returned to me?

alecxe · Accepted Answer · 2015-02-04 15:53:06Z


You can write a shorter and more reliable XPath expression, and you have to use namespaces:

tree.xpath('//text[@class="bar-text-label"]/text()', namespaces={'n': 'http://www.w3.org/2000/svg'})
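As a minimal sketch of how namespaces interact with this expression (not part of the original answer): the snippet below is a made-up fragment modelled on the element from the question, and it assumes that lxml's HTML parser leaves inline SVG elements un-namespaced while its XML parser honours the xmlns declaration. Under that assumption the unprefixed expression matches an HTML-parsed tree, and the n: prefix only becomes necessary when the same markup is parsed as XML.

```python
from lxml import etree, html

SVG_NS = 'http://www.w3.org/2000/svg'

# Made-up fragment modelled on the <text> element from the question.
snippet = (
    '<div><svg xmlns="http://www.w3.org/2000/svg"><g>'
    '<text class="bar-text-label" y="22" dy="10">Argentina</text>'
    '</g></svg></div>'
)

# Parsed as HTML: the unprefixed name test is used, as in the expression above.
html_tree = html.fromstring(snippet)
html_labels = html_tree.xpath('//text[@class="bar-text-label"]/text()')

# Parsed as XML: the svg elements live in the SVG namespace, so the
# unprefixed name test no longer matches them ...
xml_tree = etree.fromstring(snippet)
xml_labels_unprefixed = xml_tree.xpath('//text[@class="bar-text-label"]/text()')

# ... and the prefix declared in the namespaces mapping has to appear in the path.
xml_labels = xml_tree.xpath(
    '//n:text[@class="bar-text-label"]/text()',
    namespaces={'n': SVG_NS},
)

print(html_labels)
print(xml_labels_unprefixed)
print(xml_labels)
```

If the list from the HTML-parsed tree is still empty for the real page, the label elements are probably missing from the markup handed to lxml, which would point at the rendering step rather than at the expression.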

An alternative solution is to use the selenium browser automation package:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get('https://www.consumerbarometer.com/en/graph-builder/?question=M1&filter=country:singapore,canada,mexico,brazil,argentina,united_states,bulgaria,austria,belgium,croatia,czech_republic,denmark,estonia,finland,france,germany,greece,hungary,italy,ireland,latvia,lithuania,norway,netherlands,poland,portugal,russia,romania,serbia,slovakia,spain,slovenia,sweden,switzerland,ukraine,united_kingdom,australia,china,israel,hong_kong_sar,japan,korea,new_zealand,malaysia,taiwan,turkey,vietnam')

# wait for the svg to appear
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'svg')))

# print the label of every bar
for text in driver.find_elements(By.CLASS_NAME, 'bar-text-label'):
    print(text.text)

driver.close()

edited Feb 4, 2015 at 15:53

answered Feb 4, 2015 at 15:32

alecxe


7 Comments

metersk Over a year ago

I had actually just tried the shorter xpath expression above and even with the namespace addition, I am still getting an empty list returned.

alecxe Over a year ago

@Meepl hm, I haven't tried using PyQt4, but I saved the page source into an HTML file, parsed it with lxml.html and used the provided xpath, and it worked for me. Anyway, would you be okay with an alternative "selenium" based solution? Thanks.

metersk Over a year ago

Yes, absolutely. I have selenium installed, but I am quite unfamiliar with it.

metersk Over a year ago

This worked perfectly, thank you! My issue now is that I am trying to get the data value for each country, which is stored on elements like this: <rect rx="3" ry="3" width="76%" height="40" transform="translate(0,40)" data-value="76" class="bar"></rect>. Is it possible to grab the data-value attribute with selenium? I tried for text in driver.find_elements_by_class_name('bar'): print(data_value.text) but it did not work.

metersk Over a year ago

I also tried this, which did not work: for data in driver.find_elements_by_xpath('//*[contains(@data-value)]/@data-value'): print(data.text)
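For the data-value question in the comments above, one possible approach is sketched here; it has not been run against the live site. A Selenium element's .text only returns visible text, which a <rect> does not have, whereas get_attribute() reads an attribute by name. XPath expressions given to Selenium also have to select elements rather than attribute nodes, so a path ending in /@data-value cannot be used there. The helper name collect_bar_values is hypothetical, and the string "class name" is the value of By.CLASS_NAME, written out so the snippet needs no imports.

```python
def collect_bar_values(driver):
    """Return the data-value attribute of every element with class "bar".

    `driver` is assumed to be a Selenium WebDriver that has already loaded
    the page and waited for the svg to appear.
    """
    # "class name" is the locator strategy string behind By.CLASS_NAME.
    bars = driver.find_elements("class name", "bar")
    # Read the attribute from each element instead of asking for its text.
    return [bar.get_attribute("data-value") for bar in bars]
```

With the driver from the answer still open, this could be called as for value in collect_bar_values(driver): print(value), before driver.close().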
