I am trying to use lxml with Python because, from what I have read, it is generally recommended over other parsing packages. I have the following DOM structure and wrote an XPath for it, which I double-checked in XPath Checker to confirm it is valid. The XPath works fine in XPath Checker, but when I use it with lxml in Python I do not get the result; in fact, I get an object instead of the actual text.

Here is my DOM structure:

<div class="pdsc-l">
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<td width="35%" valign="top">
<font size="2" face="Arial, Helvetica, sans-serif">Brand</font>
</td>
<td width="65%" valign="top">
<font size="2" face="Arial, Helvetica, sans-serif">HTC</font>
</td>
</tr>
<tr>
<td width="35%" valign="top">
<td width="65%" valign="top">

The following XPath that I wrote gives me what I want:

//td//font[text()='Brand']/following::td[1]

But with lxml I am not getting the result.

This is my code:
    rawPage = urllib2.urlopen(request)
    read = rawPage.read()
    #print read
    tree = etree.HTML(read)    
    for tr in tree.xpath("//tr"):
        print tr.xpath("//td//font[text()='Brand']/following::td[1]")

Here is the output:

[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]

I tried the following change, but I still don't get the result. The code below includes the URL; hopefully that will help in getting a better answer:

    import urllib2

    from lxml import etree
    from lxml.html import fromstring, tostring

    url = 'http://www.ebay.com/ctg/111176858'
    request = urllib2.Request(url)
    rawPage = urllib2.urlopen(request)
    read = rawPage.read()
    #print read
    tree = etree.HTML(read)
    for tr in tree.xpath("//tr"):
        t = tr.xpath("//td//font[text()='Brand']/following::td[1]")[0]
        print tostring(t)
Maybe post the output you're getting so we can see a bit more of what's going on? Commented Aug 28, 2012 at 18:47

1 Answer

Appending [0].text to the XPath call in the print statement in your question should give you what you want. Basically, what is being printed in your question are single-element lists of lxml.etree._Element objects, which have attributes like tag and text that you can use to get different properties. So, try:

tr.xpath("//td//font[text()='Brand']/following::td[1]")[0].text
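To see why the question's loop prints the same element for every row, here is a minimal sketch. It assumes a small inline table modeled on the DOM in the question (the `HTML` string, the `cell_texts` helper, and the second "Model" row are made up for illustration, not taken from the real page). An XPath that starts with `//` is evaluated against the whole document even when it is called on a row, while one that starts with `.//` is evaluated relative to that row. Since `xpath()` returns a list, an empty result should be checked before indexing with `[0]`, otherwise an IndexError is raised.

```python
from lxml import etree

# Hypothetical two-row table modeled on the question's DOM, not the real page.
HTML = """
<table>
<tr><td><font>Brand</font></td><td><font>HTC</font></td></tr>
<tr><td><font>Model</font></td><td><font>One X</font></td></tr>
</table>
"""

tree = etree.HTML(HTML)
absolute = "//td//font[text()='Brand']/following::td[1]"
relative = ".//font[text()='Brand']/following::td[1]"


def cell_texts(row, query):
    """Return the full text of every cell the query matches from this row."""
    return ["".join(cell.itertext()) for cell in row.xpath(query)]


rows = tree.xpath("//tr")

# Absolute path: every row reports the same match from the whole document.
absolute_hits = [cell_texts(row, absolute) for row in rows]

# Relative path: only the row that contains the 'Brand' label matches.
relative_hits = [cell_texts(row, relative) for row in rows]

print(absolute_hits)
print(relative_hits)

# Check for an empty list before indexing to avoid IndexError.
for hits in relative_hits:
    print(hits[0] if hits else "no match in this row")
```

If only one value is needed, calling the absolute XPath once on `tree` instead of inside the row loop is probably the simpler option.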

4 Comments

I am getting an index out of bounds error with your answer.
Hmm, I still get all None values.
I believe the text attribute of your <td> element is None because your <td> elements don't directly contain text. You might change your XPath to access the nested elements (e.g. <font>) that contain the text as a direct child.
I will try that, but when I try this in XPath Checker in Firefox I get the text.
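The point about nested text can be illustrated with a short sketch. It assumes an inline snippet modeled on the question's DOM rather than the real page (the `HTML` string and the variable names are made up for illustration). The `text` attribute of an element only holds the text that appears before its first child, so a `<td>` that immediately wraps a `<font>` has `text` set to None. Selecting the nested text node, or asking XPath for the string value of the cell, returns the visible text instead; the latter is likely closer to what a browser XPath tool displays, though that is an assumption about the tool.

```python
from lxml import etree

# Hypothetical snippet modeled on the question's DOM, not the real page.
HTML = """<table><tr>
<td width="35%"><font size="2">Brand</font></td>
<td width="65%"><font size="2">HTC</font></td>
</tr></table>"""

tree = etree.HTML(HTML)
base = "//td//font[text()='Brand']/following::td[1]"

cell = tree.xpath(base)[0]

# The <td> has no text of its own; its text lives inside the nested <font>.
direct_text = cell.text

# Option 1: select the text node of the nested <font> (returns a list).
font_text = tree.xpath(base + "/font/text()")

# Option 2: ask XPath for the string value of the first matching cell.
string_value = tree.xpath("string(" + base + ")")

print(direct_text)
print(font_text)
print(string_value)
```

On a real page the cell may contain surrounding whitespace, so calling `.strip()` on the result is usually worthwhile.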
