Parse HTML using lxml and XPath

Question

I am trying to use lxml with Python because, after reading around and searching Google, the recommendation is to use lxml over other parsing packages. I have the following DOM structure, and I managed to write the correct XPath; I double-checked it in XPath Checker to confirm that it is valid. The XPath works fine in XPath Checker, but when I use it with lxml in Python I don't get the result I expect. In fact, I get an object instead of the actual text.

Here is my dom structure:

<div class="pdsc-l">
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<td width="35%" valign="top">
<font size="2" face="Arial, Helvetica, sans-serif">Brand</font>
</td>
<td width="65%" valign="top">
<font size="2" face="Arial, Helvetica, sans-serif">HTC</font>
</td>
</tr>
<tr>
<td width="35%" valign="top">
<td width="65%" valign="top">

The following XPath that I wrote gives me what I want:

//td//font[text()='Brand']/following::td[1]

But with lxml I am not getting the result.

This is my code:
    rawPage = urllib2.urlopen(request)
    read = rawPage.read()
    #print read
    tree = etree.HTML(read)    
    for tr in tree.xpath("//tr"):
        print tr.xpath("//td//font[text()='Brand']/following::td[1]")

Here is the output:

[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]

I tried the following change, but I still don't get the result. The code below includes the URL; hopefully that will help in getting a better answer:

    import urllib2  # Python 2 standard library, as used in the question

    from lxml import etree
    from lxml.html import fromstring, tostring

    url = 'http://www.ebay.com/ctg/111176858'
    request = urllib2.Request(url)
    rawPage = urllib2.urlopen(request)
    read = rawPage.read()
    #print read
    tree = etree.HTML(read)
    for tr in tree.xpath("//tr"):
        # Take the first matching <td> and serialize it back to markup
        t = tr.xpath("//td//font[text()='Brand']/following::td[1]")[0]
        print tostring(t)

Maybe post the output you're getting so we can know a bit more about what's going on? – Emmett Butler, Commented Aug 28, 2012 at 18:47

Emmett Butler · Accepted Answer · 2012-08-28 20:06:46Z

9

Appending [0].text to the end of the print statement in your question should give you what you want. Basically, what's being printed in your question are single-element lists of lxml.etree._Element objects, which have attributes like tag and text that you can use to get different properties. So, try:

tr.xpath("//td//font[text()='Brand']/following::td[1]")[0].text
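To illustrate why the question's loop prints the same one-element list over and over, here is a minimal sketch. The SAMPLE markup is a hypothetical, simplified stand-in for the table in the question, not the real eBay page. It shows that xpath() returns a list of element objects, and that an expression starting with // is evaluated from the document root even when it is called on a <tr>.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the table in the question.
SAMPLE = (
    "<div class='pdsc-l'><table>"
    "<tr><td><font>Brand</font></td><td><font>HTC</font></td></tr>"
    "<tr><td><font>Model</font></td><td><font>One X</font></td></tr>"
    "</table></div>"
)

tree = etree.HTML(SAMPLE)
query = "//td//font[text()='Brand']/following::td[1]"

# xpath() returns a list of element objects, not strings.
from_root = tree.xpath(query)

# The leading // makes the expression absolute, so calling it on each <tr>
# searches the whole document and finds the same <td> every time.
per_row = [tr.xpath(query) for tr in tree.xpath("//tr")]
same_each_time = all(
    len(result) == 1
    and etree.tostring(result[0]) == etree.tostring(from_root[0])
    for result in per_row
)
```

If the intent is to search only inside the current row, a relative expression beginning with .// would be the usual choice; for this particular query, calling it once on the tree is probably enough.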

edited Aug 28, 2012 at 20:06

answered Aug 28, 2012 at 18:48


4 Comments

add-semi-colons Over a year ago

I am getting an index out of bounds error with your answer.

add-semi-colons Over a year ago

Hmm, I still get all None values.

Emmett Butler Over a year ago

I believe the text attribute of your <td> element is None because your <td> elements don't directly contain text. You might change your XPath to access the nested elements (e.g. <font>) that contain the text as a direct child.

add-semi-colons Over a year ago

I will try that, but when I try this in XPath Checker in Firefox I get the text.
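The point made in the comments about the text attribute being None can be sketched as follows. The SAMPLE markup is again a hypothetical, whitespace-free stand-in for the real page, and the variable names are only illustrative. With this markup the <td> has no text of its own because the value sits inside the nested <font>, so the sketch shows three ways to reach it: the XPath string() function, itertext(), and extending the expression down to the <font> text node.

```python
from lxml import etree

# Hypothetical, simplified stand-in for one row of the table in the question.
SAMPLE = (
    "<table><tr>"
    "<td><font>Brand</font></td><td><font>HTC</font></td>"
    "</tr></table>"
)

tree = etree.HTML(SAMPLE)
query = "//td//font[text()='Brand']/following::td[1]"
td = tree.xpath(query)[0]

# The <td> itself holds no text; the value is inside the nested <font>.
direct_text = td.text

# Option 1: the XPath string value of the element (all descendant text).
string_value = td.xpath("string(.)")

# Option 2: join every text node below the element in Python.
joined_text = "".join(td.itertext())

# Option 3: extend the expression so it selects the text node directly.
font_text = tree.xpath(query + "/font/text()")
```

On a real page the <td> may contain line breaks around the <font>, in which case td.text would be whitespace rather than None, and calling .strip() on the string value would likely be needed.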
