Using lxml in Python with XPath, I retrieved the values between my tags. I would like to get the HTML properties too, not just the text. My program works, but it skipped two lines.

Python:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import lxml.html
htmltree = lxml.html.parse('data.html')
res = htmltree.xpath("//table[@class='mainTable']/tr/td/text()")
print '\n'.join(res).encode("latin-1")

data.html sample

<table class='mainTable'>
         <TR>
                  <TD bgcolor="#cccccc">235</TD>
                  <TD bgcolor="#cccccc"> Windows XP / Office 2003.</TD>
                  <TD bgcolor="#cccccc">
                  G:\REMI\projets\Migration_XP_Office2003\Procedures\Installation Win XP et Office 2003.doc</TD>
                  <TD bgcolor="#cccccc">2005-10-18</TD>
                  <TD bgcolor="#cccccc">2010-12-30</TD></TR>
                  <TD bgcolor="#cccccc">
                  <P class="MsoBodyText" 
                    style="margin: 0cm 0cm 0pt;"><STRONG><FONT face="Times New Roman" size="5">blablablablablablbala<BR><BR></FONT></STRONG></FONT></P>
                  </TD>
                <TR>
                  <TD bgcolor="#cccccc">23</TD>
                  <TD bgcolor="#cccccc">XEROX/ MAC</TD>
                  <TD bgcolor="#cccccc">
                    <P>joint.</P>
                    <P>&nbsp;</P></TD>
                  <TD bgcolor="#cccccc">G:\DDTH_INF\REMI\bdcfiles\I098_Page_de_garde_MAC.doc</TD>
                  <TD bgcolor="#cccccc">2012-12-19</TD>
                  <TD bgcolor="#cccccc">2012-12-19</TD>
         </TR>
 </table>

Output:

 235 Windows XP / Office 2003.
 G:\REMI\projets\Migration_XP_Office2003\Procedures\Installation Win XP
 et Office 2003.doc 2005-10-18 2010-12-30

 23 XEROX/ MAC G:\DDTH_INF\REMI\bdcfiles\I098_Page_de_garde_MAC.doc
 2012-12-19 2012-12-19

I don't understand why the program skipped

<P class="MsoBodyText" 
                        style="margin: 0cm 0cm 0pt;"><STRONG><FONT face="Times New Roman" size="5">blablablablablablbala<BR><BR></FONT></STRONG></FONT></P>

and

 <P>joint.</P>
                        <P>&nbsp;</P>

Is it because the text is inside a <p> tag? I just want to get all the data inside each TD. I also tried /tr/td/p/, but that is not the solution.

Note: this code is a sample, so the HTML shown here may be broken, but my actual file is well structured.

2 Comments
  • try res = htmltree.xpath("//table[@class='mainTable']/tr/td//text()") to grab all descendant text nodes. Commented Nov 18, 2015 at 15:46
  • Just a simple remark: only use lxml for HTML pages that contain correct HTML. You could find in many places unclosed tags that lxml won't be able to process (in fact, only use it if you wrote the HTML...). If unsure, use BeautifulSoup, which is great at fixing such (minor) errors. Commented Nov 18, 2015 at 16:17

1 Answer

This happens because /td/text() selects only the text nodes located directly inside each td element. Text nested in child elements such as <p> is not a direct child of the td, so it is not matched.
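
To make the difference concrete, here is a minimal sketch written for Python 3. The two-cell table is a made-up sample loosely modeled on the question's data.html, not the asker's real file, and the variable names are only illustrative. It compares td/text() with the td//text() variant suggested in the comments.

```python
import lxml.html

# Hypothetical two-cell table, loosely modeled on the question's data.html.
SAMPLE = (
    "<html><body><table class='mainTable'><tr>"
    "<td>23</td>"
    "<td><p>joint.</p><p>note</p></td>"
    "</tr></table></body></html>"
)

tree = lxml.html.fromstring(SAMPLE)

# td/text(): only text nodes that are direct children of a td.
direct = tree.xpath("//table[@class='mainTable']/tr/td/text()")

# td//text(): text nodes at any depth below a td.
nested = tree.xpath("//table[@class='mainTable']/tr/td//text()")

print(direct)  # the text inside the <p> elements is missing here
print(nested)  # the text inside the <p> elements is included here
```

The first query skips the second cell entirely, because all of its text lives inside <p> children.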

Instead, call .text_content() on every td found:

texts = [td.text_content() for td in htmltree.xpath("//table[@class='mainTable']/tr/td")]
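
As a possible follow-up, the sketch below (Python 3, with a made-up one-cell sample rather than the asker's file) collapses the whitespace that text_content() keeps from the source markup, and reads the bgcolor attribute with .get(), on the assumption that this is what "HTML properties" meant in the question. The names SAMPLE and rows are hypothetical.

```python
import lxml.html

# Hypothetical cell with nested <p> elements and uneven whitespace.
SAMPLE = (
    "<html><body><table class='mainTable'><tr>"
    "<td bgcolor='#cccccc'><p>  joint.\n </p><p> second   line </p></td>"
    "</tr></table></body></html>"
)

tree = lxml.html.fromstring(SAMPLE)

rows = []
for td in tree.xpath("//table[@class='mainTable']/tr/td"):
    # text_content() concatenates every text node below the td.
    raw = td.text_content()
    # Collapse runs of spaces and newlines into single spaces.
    text = " ".join(raw.split())
    # .get() returns the attribute value, or None if it is absent.
    rows.append((td.get("bgcolor"), text))

for bgcolor, text in rows:
    print(bgcolor, text)
```

Whether to normalize whitespace is a design choice: the raw text_content() value preserves the line breaks of the source file, which may or may not be wanted in the output.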

1 Comment

Thanks a lot, my friend, I understand better now. I started using XPath just today and I'm still a beginner. Thanks for the precision about what the text() method means exactly.
