1

I have this HTML:

<td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td>

I want to get a date (13.10.2016) and a time (17:00).

This is what I'm doing:

t = lxml.html.parse(url)
nextMatchDate = t.findall(".//td[@class='bordR']")[count].text

But I get an error:

IndexError: list index out of range

I think it happens because there are HTML tags nested inside the td tag.

Could you help me please?

5
  • How is count defined? What is its value? Commented Oct 11, 2016 at 16:13
  • I use a for loop, because I have several td elements with class="bordR" Commented Oct 11, 2016 at 16:16
  • How many results are there from findall and what is the value of count? Commented Oct 11, 2016 at 16:16
  • The problem isn't in count. For example, count=39 Commented Oct 11, 2016 at 16:17
  • I need to parse this piece of code and get something like nextMatch = "13.10.2016 at 17:00" Commented Oct 11, 2016 at 16:20

2 Answers

2

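The problem is in the way you check for the bordR class. class is a multi-valued, space-delimited attribute, and @class='bordR' is compared against the whole attribute value ("name-td alLeft bordR"), so nothing matches, findall returns an empty list, and indexing it raises the IndexError. You have to account for the other classes on the element. In XPath (through the .xpath() method, since findall only supports a limited path subset without functions) you should use contains():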

.//td[contains(@class, 'bordR')]

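Or, even more reliably, add concat() to the partial match check, padding the class value with spaces so that only the whole class name matches and not any class that merely contains "bordR" as a substring.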
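To make the difference between the three checks concrete, here is a minimal sketch. The first cell is the one from the question; the second cell and its class name bordRight are made up purely to show a false positive, and the variable names are mine.

```python
from lxml.html import fromstring

# First cell: markup from the question.
# Second cell: a hypothetical decoy whose class only contains "bordR" as a substring.
html = (
    '<table><tr>'
    '<td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td>'
    '<td class="bordRight">decoy</td>'
    '</tr></table>'
)
root = fromstring(html)

# Exact comparison against the whole attribute value: matches nothing,
# which is why indexing the result raises IndexError.
exact = root.xpath(".//td[@class='bordR']")

# Substring match: finds the wanted cell, but the decoy as well.
loose = root.xpath(".//td[contains(@class, 'bordR')]")

# Token match: pad both sides with spaces so only the whole class name matches.
strict = root.xpath(
    ".//td[contains(concat(' ', normalize-space(@class), ' '), ' bordR ')]"
)

print(len(exact), len(loose), len(strict))
print(strict[0].text_content())
```

Note that the sketch calls .xpath() rather than .findall(); as far as I know, findall only understands a restricted path syntax and rejects function calls such as contains().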

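Once you've located the element, you can use the .text_content() method to get the complete text, including the text of all its children (.text alone stops at the first child tag and would return only "13.10.2016, Thu"):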

In [1]: from lxml.html import fromstring

In [2]: data = '<td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td>'

In [3]: td = fromstring(data)

In [4]: print(td.text_content())
13.10.2016, Thu|17:00

To take it a step further, you can load the date string into a datetime object:

In [5]: from datetime import datetime
In [6]: datetime.strptime(td.text_content(), "%d.%m.%Y, %a|%H:%M")
Out[6]: datetime.datetime(2016, 10, 13, 17, 0)
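
If the goal is the "13.10.2016 at 17:00" string mentioned in the comments, the parsed datetime can be formatted back with strftime. A small sketch, where format_next_match is a hypothetical helper name and the input is assumed to look exactly like the cell text above (note that %a relies on English day abbreviations under the default locale):

```python
from datetime import datetime


def format_next_match(cell_text):
    """Turn text like '13.10.2016, Thu|17:00' into '13.10.2016 at 17:00'."""
    # Assumes the "date, weekday|time" layout shown in the question.
    parsed = datetime.strptime(cell_text.strip(), "%d.%m.%Y, %a|%H:%M")
    return parsed.strftime("%d.%m.%Y at %H:%M")


print(format_next_match("13.10.2016, Thu|17:00"))
```

Going through datetime also validates the values, so a malformed cell raises ValueError instead of silently producing a wrong string.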

3 Comments

Yeah, it works. But I need to get the content of <td class="name-td alLeft bordR">13.10.2016, Thu<span class="sp">|</span>17:00</td>
@zagazat wait, sorry, isn't that what I've tried to answer? Could you be more specific about what the problem is? Thanks.
Okay. I have an HTML doc with the results of hockey matches and I want to parse it. I can get the match info from <td class="match">Team1 vs Team2</td>, but I CAN'T get the info from <td class="n-match">13.10.2016 <b>17:00</b></td> because of the "list index out of range" error. Do you have Telegram? Could you chat with me? Find me at @zagazat
0

There's a method called .itertext that:

Iterates over the text content of a subtree.

So if you have an element td in a variable td, you can do this:

>>> text = list(td.itertext()); text
['13.10.2016, Thu', '|', '17:00']

>>> date, time = text[0].split(',')[0], text[-1]

>>> datetime_text = '{} at {}'.format(date, time)

>>> datetime_text
'13.10.2016 at 17:00'
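
The indexing above depends on the exact layout of the cell. A slightly more tolerant sketch is to join the fragments and pick the date and time out with regular expressions, which should also cope with the <td class="n-match">13.10.2016 <b>17:00</b></td> variant from the comments (for that cell itertext would presumably yield ['13.10.2016 ', '17:00']). The helper name and the patterns are my own assumptions about the data, not something from the page being parsed:

```python
import re

# Assumed formats: DD.MM.YYYY for the date and H:MM or HH:MM for the time.
DATE_RE = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}\b")


def next_match_from_fragments(fragments):
    """Build 'DATE at TIME' from text fragments such as list(td.itertext())."""
    text = " ".join(fragment.strip() for fragment in fragments)
    date = DATE_RE.search(text)
    time = TIME_RE.search(text)
    if date is None or time is None:
        return None  # the cell does not contain both a date and a time
    return "{} at {}".format(date.group(), time.group())


print(next_match_from_fragments(['13.10.2016, Thu', '|', '17:00']))
print(next_match_from_fragments(['13.10.2016 ', '17:00']))
print(next_match_from_fragments(['Team1 vs Team2']))
```

Returning None for cells without a date and time makes it easy to skip unrelated td elements inside a loop.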
