How can I parse HTML with lxml?

Question

I have this HTML:

<td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td>

I want to get the date (13.10.2016) and the time (17:00).

This is what I'm doing:

t = lxml.html.parse(url)
nextMatchDate = t.findall(".//td[@class='bordR']")[count].text

But I'm getting an error:

IndexError: list index out of range

I think it happens because there are HTML tags nested inside the tag.

Could you help me, please?

I use a for loop because I have several td class="bordR" cells.
– zagazat, Commented Oct 11, 2016 at 16:16
How many results are there from findall, and what is the value of count?
– Open AI - Opting Out, Commented Oct 11, 2016 at 16:16
I need to parse this piece of code and get something like nextMatch = "13.10.2016 at 17:00".
– zagazat, Commented Oct 11, 2016 at 16:20

Community · Accepted Answer · 2017-05-23 12:19:30Z

2

The problem is in the way you check for the bordR class. class is a multi-valued space-delimited attribute and you have to account for other classes on an element. In XPath you should be using "contains":

.//td[contains(@class, 'bordR')]
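A minimal sketch of why the original lookup returns an empty list, assuming the cell sits inside a table as it would on the real page (the wrapping table markup is an assumption). Note that contains() is an XPath function, so the sketch goes through .xpath(); findall() only understands a limited path subset and rejects it.

```python
from lxml.html import fromstring

# The cell from the question, wrapped in a table (the wrapper is assumed)
html = (
    '<table><tr>'
    '<td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td>'
    '</tr></table>'
)
root = fromstring(html)

# Exact comparison: the attribute value is "name-td alLeft bordR", not "bordR"
exact = root.findall(".//td[@class='bordR']")

# Substring comparison: matches because "bordR" occurs inside the attribute value
partial = root.xpath(".//td[contains(@class, 'bordR')]")

print(len(exact))    # 0, so exact[count] raises IndexError for any count
print(len(partial))  # 1
```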

Or, more reliably, add "concat" to the partial-match check so that only whole class names match.

Once you've located the element, you can use the .text_content() method to get the complete text, including the text of all its children:
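One common way to write the "concat" check is sketched below: pad the class attribute with spaces and search for the class name padded the same way. The second cell and its class name bordRight are hypothetical, added only to show the difference between the two checks.

```python
from lxml.html import fromstring

# "bordRight" is a made-up class used to show a false positive
root = fromstring(
    '<table><tr>'
    '<td class="name-td bordR">wanted</td>'
    '<td class="bordRight">not wanted</td>'
    '</tr></table>'
)

# Plain substring check: also matches "bordRight"
loose = root.xpath(".//td[contains(@class, 'bordR')]")

# Whole-word check: " bordR " must appear in " <class list> "
strict = root.xpath(
    ".//td[contains(concat(' ', normalize-space(@class), ' '), ' bordR ')]"
)

print([td.text for td in loose])   # ['wanted', 'not wanted']
print([td.text for td in strict])  # ['wanted']
```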

In [1]: from lxml.html import fromstring

In [2]: data = '<td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td>'

In [3]: td = fromstring(data)

In [4]: print(td.text_content())
13.10.2016, Thu|17:00

To take it a step further, you can load the date string into a datetime object:

In [5]: from datetime import datetime
In [6]: datetime.strptime(td.text_content(), "%d.%m.%Y, %a|%H:%M")
Out[6]: datetime.datetime(2016, 10, 13, 17, 0)
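Since the question mentions looping over several cells, here is a sketch that puts the pieces together and produces the "13.10.2016 at 17:00" form asked for in the comments. The second row is invented sample data, and the names html and next_matches are assumptions; %a also assumes English weekday abbreviations, which is what Python uses unless the locale has been changed.

```python
from datetime import datetime
from lxml.html import fromstring

# Two sample rows; the second one is invented for illustration
html = (
    '<table>'
    '<tr><td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td></tr>'
    '<tr><td class="name-td alLeft bordR">14.10.2016, Fri<span class="sp">|</span>19:30</td></tr>'
    '</table>'
)
root = fromstring(html)

next_matches = []
for td in root.xpath(".//td[contains(@class, 'bordR')]"):
    # text_content() flattens the cell, including the <span> separator
    when = datetime.strptime(td.text_content(), "%d.%m.%Y, %a|%H:%M")
    next_matches.append(when.strftime("%d.%m.%Y at %H:%M"))

print(next_matches)  # ['13.10.2016 at 17:00', '14.10.2016 at 19:30']
```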

edited May 23, 2017 at 12:19

CommunityBot

answered Oct 11, 2016 at 16:54

alecxe

3 Comments

zagazat Over a year ago

Yeah, it works. But I need to get the content of <td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td>

alecxe Over a year ago

@zagazat Wait, sorry, isn't that what I tried to answer? Could you be more specific about what the problem is? Thanks.

zagazat Over a year ago

Okay. I have an HTML document with the results of hockey matches, and I want to parse it. I can get the match info from <td class="match">Team1 vs Team2</td>, but I can't get the info from <td class="n-match">13.10.2016 <b>17:00</b></td> because of the "list index out of range" error. Do you have Telegram? Could you chat with me? You can find me at @zagazat.

skovorodkin · Accepted Answer · 2016-10-11 16:59:30Z

0

There's a method called .itertext that:

Iterates over the text content of a subtree.

So if you have an element td in a variable td, you can do this:

>>> text = list(td.itertext()); text
['13.10.2016, Thu', '|', '17:00']

>>> date, time = text[0].split(',')[0], text[-1]

>>> datetime_text = '{} at {}'.format(date, time)

>>> datetime_text
'13.10.2016 at 17:00'
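The same idea can be wrapped in a small helper that also copes with the n-match markup quoted in the comments above. This is only a sketch: the function name date_and_time is hypothetical, and it assumes the date is always the first text chunk and the time the last.

```python
from lxml.html import fromstring

def date_and_time(td):
    """Return 'DATE at TIME' from the first and last text chunks of a cell."""
    # Drop whitespace-only chunks and trim the rest
    parts = [p.strip() for p in td.itertext() if p.strip()]
    date = parts[0].split(',')[0].strip()
    return '{} at {}'.format(date, parts[-1])

# Both cell layouts that appear in this thread, wrapped in a table
root = fromstring(
    '<table>'
    '<tr><td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td></tr>'
    '<tr><td class="n-match">13.10.2016 <b>17:00</b></td></tr>'
    '</table>'
)
results = [date_and_time(td) for td in root.xpath('.//td')]
print(results)  # ['13.10.2016 at 17:00', '13.10.2016 at 17:00']
```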

answered Oct 11, 2016 at 16:59

skovorodkin
