Parse html element with lxml / xpath

Question

Using lxml in Python with XPath, I retrieved the values between my tags. I would also like to get the HTML properties, not just the text. My program works, but it skips two lines.

Python:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import lxml.html
htmltree = lxml.html.parse('data.html')
res = htmltree.xpath("//table[@class='mainTable']/tr/td/text()")
print '\n'.join(res).encode("latin-1")

data.html sample

<table class='mainTable'>
         <TR>
                  <TD bgcolor="#cccccc">235</TD>
                  <TD bgcolor="#cccccc"> Windows XP / Office 2003.</TD>
                  <TD bgcolor="#cccccc">
                  G:\REMI\projets\Migration_XP_Office2003\Procedures\Installation Win XP et Office 2003.doc</TD>
                  <TD bgcolor="#cccccc">2005-10-18</TD>
                  <TD bgcolor="#cccccc">2010-12-30</TD></TR>
                  <TD bgcolor="#cccccc">
                  <P class="MsoBodyText" 
                    style="margin: 0cm 0cm 0pt;"><STRONG><FONT face="Times New Roman" size="5">blablablablablablbala<BR><BR></FONT></STRONG></FONT></P>
                  </TD>
                <TR>
                  <TD bgcolor="#cccccc">23</TD>
                  <TD bgcolor="#cccccc">XEROX/ MAC</TD>
                  <TD bgcolor="#cccccc">
                    <P>joint.</P>
                    <P>&nbsp;</P></TD>
                  <TD bgcolor="#cccccc">G:\DDTH_INF\REMI\bdcfiles\I098_Page_de_garde_MAC.doc</TD>
                  <TD bgcolor="#cccccc">2012-12-19</TD>
                  <TD bgcolor="#cccccc">2012-12-19</TD>
         </TR>
 </table>

Output:

 235 Windows XP / Office 2003.
 G:\REMI\projets\Migration_XP_Office2003\Procedures\Installation Win XP
 et Office 2003.doc 2005-10-18 2010-12-30

 23 XEROX/ MAC G:\DDTH_INF\REMI\bdcfiles\I098_Page_de_garde_MAC.doc
 2012-12-19 2012-12-19

I don't understand why the program skipped

<P class="MsoBodyText" 
                        style="margin: 0cm 0cm 0pt;"><STRONG><FONT face="Times New Roman" size="5">blablablablablablbala<BR><BR></FONT></STRONG></FONT></P>

and

 <P>joint.</P>
                        <P>&nbsp;</P>

Is it because the text is inside a <p> tag? I just want to get all the data inside each TD. I also tried /tr/td/p/, but that is not the solution.

Note: this is only a sample, so the HTML shown here may be broken, but my actual file is well structured.

Try res = htmltree.xpath("//table[@class='mainTable']/tr/td//text()") to grab all descendant text nodes.
– Learner, Commented Nov 18, 2015 at 15:46
Just a simple remark: only use lxml for HTML pages that contain correct HTML. In many places you will find unclosed tags that lxml won't be able to process (in fact, only use it if you wrote the HTML yourself). If unsure, use BeautifulSoup, which is great at fixing such (minor) errors.
– Serge Ballesta, Commented Nov 18, 2015 at 16:17

alecxe · Accepted Answer · 2015-11-18 16:17:28Z

This is because you are selecting text() on each td element, which means "give me the text nodes located directly inside the td element". Text nested inside child elements such as <p> is not included.

Instead, call .text_content() on every td found:

texts = [td.text_content() for td in htmltree.xpath("//table[@class='mainTable']/tr/td")]
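
To make the difference concrete, here is a minimal sketch in Python 3 (the question's script uses the Python 2 print statement). The SAMPLE string is a hypothetical, well-formed cut-down of the question's table, not the real data.html, and the variable names are illustrative only. The last line assumes that "HTML properties" in the question means element attributes such as bgcolor.

```python
import lxml.html

# Hypothetical, well-formed cut-down of the question's table (not the real data.html).
SAMPLE = """
<table class="mainTable">
  <tr>
    <td bgcolor="#cccccc">23</td>
    <td bgcolor="#cccccc">XEROX/ MAC</td>
    <td bgcolor="#cccccc"><p>joint.</p></td>
    <td bgcolor="#cccccc">2012-12-19</td>
  </tr>
</table>
"""

doc = lxml.html.document_fromstring(SAMPLE)
cells = "//table[@class='mainTable']/tr/td"

# text(): only text nodes that are direct children of each td.
direct = doc.xpath(cells + "/text()")

# //text(): text nodes at any depth below each td.
nested = doc.xpath(cells + "//text()")

# text_content(): one string per td, including text inside child elements.
per_cell = [td.text_content() for td in doc.xpath(cells)]

# Assumption: "HTML properties" means attributes, read here with Element.get().
colors = [td.get("bgcolor") for td in doc.xpath(cells)]

print(direct)
print(nested)
print(per_cell)
print(colors)
```

With this sample, direct has no entry for the third cell because "joint." sits inside a <p>, while nested and per_cell both include it. text_content() also keeps one string per td, which is convenient when the cells need to stay aligned with columns.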

edited Nov 18, 2015 at 16:17

answered Nov 18, 2015 at 15:46


1 Comment

xif Over a year ago

Thanks a lot, my friend, I understand better now. I only started using XPath today and I'm still a beginner. Thanks for the precise explanation of what text() actually means.
