Parsing HTML data with lxml

Question

I'm a beginner in coding and a friend of mine told me to use BeautifulSoup instead of HTMLParser. After running into some problems I got a tip to use lxml instead of BeautifulSoup because it's 10x better.

I'm hoping someone can give me a hint how to scrape the text I'm looking for.

What I want is to find a table with the following rows and data:

<tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>              
    <td><a href="spam1.com">spam1</a></td>
</tr>
<tr>
    <td><a href="website2.com">website2</a></td>
    <td>info1</td>
    <td>info2</td>              
    <td><a href="spam2.com">spam2</a></td>
</tr>

How do I use lxml to scrape the website URL together with info1 and info2, skipping the spam link, and get the following result?

[['url', 'info1', 'info2'], ['url', 'info1', 'info2']]

Acorn · Accepted Answer · 2011-12-26 13:47:46Z

4

import lxml.html as lh

# your_html holds the page source as a string
tree = lh.fromstring(your_html)

result = []
for row in tree.xpath("//tr"):
    # Keep only the first three cells; the fourth holds the spam link
    url, info1, info2 = row.xpath("td")[:3]
    result.append([url.xpath("a")[0].attrib['href'],
                   info1.text_content(),
                   info2.text_content()])

Result:

[['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]
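The loop above unpacks exactly three cells and indexes the first link, so it raises an error on any row that lacks them (a header row, for example). Below is a minimal, self-contained sketch of the same approach that skips such rows. The SAMPLE_HTML string, the surrounding <table>, the extra header row, and the extract_rows helper are assumptions added for illustration, not part of the original answer.

```python
import lxml.html as lh

# Rows from the question, wrapped in a <table> (assumed), plus a
# header row (also assumed) to show how incomplete rows are skipped.
SAMPLE_HTML = """
<table>
<tr><th>site</th><th>a</th><th>b</th><th>ad</th></tr>
<tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam1.com">spam1</a></td>
</tr>
<tr>
    <td><a href="website2.com">website2</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam2.com">spam2</a></td>
</tr>
</table>
"""


def extract_rows(html):
    """Return [href, info1, info2] for every row with at least 3 <td> cells."""
    tree = lh.fromstring(html)
    result = []
    for row in tree.xpath(".//tr"):
        cells = row.xpath("td")
        # The link in the first cell, if there is one
        links = cells[0].xpath("a/@href") if cells else []
        if len(cells) < 3 or not links:
            continue  # header row or malformed row: nothing to extract
        result.append([str(links[0]),
                       cells[1].text_content().strip(),
                       cells[2].text_content().strip()])
    return result


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Whether silently skipping incomplete rows is the right behaviour depends on the real page; raising an error may be preferable if every row is expected to be well formed.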

edited Dec 26, 2011 at 13:47

answered Dec 26, 2011 at 13:24


kev · Accepted Answer · 2011-12-26 14:04:43Z

4

I use the xpath: td/a[not(contains(.,"spam"))]/@href | td[not(a)]/text()

$ python3
>>> import lxml.html
>>> doc = lxml.html.parse('data.xml')
>>> [[j for j in i.xpath('td/a[not(contains(.,"spam"))]/@href | td[not(a)]/text()')] for i in doc.xpath('//tr')]
[['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]
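Note that contains(., "spam") tests the visible link text, not the URL. If the unwanted links are recognizable by their address instead, the same idea can be applied to @href. The sketch below compares both filters on an in-memory string; the sample markup, the "sponsored" link text, and the ad.website.com address are made up for illustration.

```python
import lxml.html

# Hypothetical rows: the unwanted link text does NOT contain "spam",
# but its href points at an (invented) ad domain.
SAMPLE_HTML = """
<table>
<tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="ad.website.com/1">sponsored</a></td>
</tr>
<tr>
    <td><a href="website2.com">website2</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="ad.website.com/2">sponsored</a></td>
</tr>
</table>
"""

doc = lxml.html.fromstring(SAMPLE_HTML)

# Filter on the link text, as in the answer above
by_text = 'td/a[not(contains(., "spam"))]/@href | td[not(a)]/text()'
# Variant (assumption): filter on the href attribute instead
by_href = 'td/a[not(contains(@href, "ad.website.com"))]/@href | td[not(a)]/text()'

rows_by_text = [[str(v) for v in tr.xpath(by_text)] for tr in doc.xpath('.//tr')]
rows_by_href = [[str(v) for v in tr.xpath(by_href)] for tr in doc.xpath('.//tr')]

print(rows_by_text)  # the ad links slip through the text-based filter
print(rows_by_href)  # the href-based filter drops them
```

Which predicate fits depends on what is actually consistent in the real data: the link text, the link address, or simply the column position.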

edited Dec 26, 2011 at 14:04

answered Dec 26, 2011 at 13:02


3 Comments

Retrace Over a year ago

All the table rows are the same within the table. I'm using Python 2.7.2+. Within each table row I only want the first 3 cells as the result, i.e. [['url(website1)', 'info1', 'info2'], ['url(website2)', 'info1', 'info2']]. Thank you for your reply.

Acorn Over a year ago

I think it can probably be safely assumed that the actual content will not contain the word "spam". Although only @Trees can really tell us what aspects of the data are consistent.

kev Over a year ago

@Acorn changed to contains(.,"spam"). spam can be replaced by patterns like ad.website.com.

unutbu · Accepted Answer · 2011-12-26 14:10:23Z

1

import lxml.html as LH

doc = LH.fromstring(content)
print([tr.xpath('td[1]/a/@href | td[position()=2 or position()=3]/text()')
       for tr in doc.xpath('//tr')])

The long XPath has the following meaning:

td[1]                                   find the first <td>  
  /a                                    find the <a>
    /@href                              return its href attribute value
|                                       or (union of both node sets)
td[position()=2 or position()=3]        find the second or third <td>
  /text()                               return its text value
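To see how the two halves of the expression combine, each branch can be evaluated on its own and compared with the union. This is a small sketch that assumes the question's rows are wrapped in a <table> and stored in a string named content; the variable names are illustrative.

```python
import lxml.html as LH

# The question's rows, wrapped in a <table> (assumed)
content = """
<table>
<tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam1.com">spam1</a></td>
</tr>
<tr>
    <td><a href="website2.com">website2</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam2.com">spam2</a></td>
</tr>
</table>
"""

doc = LH.fromstring(content)
first_row = doc.xpath('.//tr')[0]

# Left branch: href of the link in the first cell
href_branch = first_row.xpath('td[1]/a/@href')
# Right branch: text of the second and third cells
text_branch = first_row.xpath('td[position()=2 or position()=3]/text()')
# Union: both node sets merged, returned in document order
union = first_row.xpath('td[1]/a/@href | td[position()=2 or position()=3]/text()')

print(href_branch)  # ['website1.com']
print(text_branch)  # ['info1', 'info2']
print(union)        # ['website1.com', 'info1', 'info2']
```

Because the results come back in document order, the href appears first only because the link sits in the first cell; the order of the branches around | does not matter.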

edited Dec 26, 2011 at 14:10

answered Dec 26, 2011 at 13:37


1 Comment

Retrace Over a year ago

You just made my day with a few lines of code. And thank you for the explanation. Actually all of the answers are great. I was learning about XPath to get it with Firebug, but this is much easier: find the first table row and process the data within. Thank you all again and merry x-mas :)
