Simple Web Scraping with Python

Question

I haven't been able to find a simple way to do this. I have been following this and have written the following:

##just comments before this
import lxml,requests
page = requests.get('https://finalexams.rutgers.edu.html')

tree = html.fromstring(page.text)

tableRow = tree.xpath('//tr/text() ' )

print 'Rows' , tableRow

That script needs to parse table rows like the ones below and extract their contents, but there could be any number of rows. I don't know how to access nested tags, and the tags don't have unique names or IDs for me to look for.

How can I write a for loop that gets each of these table rows and lets me grab the individual bits of them?

  <tr>
    <td> 04264</td>
    <td>01:198:205</td>
    <td>01</td>
    <td>INTR DISCRET STRCT I</td>
    <td>C</td>
    <td>Dec 17, 2014:  8:00 AM - 11:00 AM </td>
  </tr>

  <tr>
    <td> 09907</td>
    <td>01:198:214</td>
    <td>01</td>
    <td>SYSTEMS PROGRAMMING</td>
    <td>C</td>
    <td>Dec 18, 2014:  8:00 PM - 11:00 PM </td>
  </tr>

tree = html.fromstring(page.text) isn't going to work with import lxml; did you do a from lxml import html somewhere?
– abarnert, Commented Dec 3, 2014 at 2:31

abarnert · Accepted Answer · 2014-12-03 02:34:12Z

3

If you want the tr elements themselves rather than their text (which is only whitespace here), search for the elements instead of their text:

rows = tree.xpath('//tr')

And then you can iterate them:

for row in rows:

And then you can either search each one for td elements (e.g., by using row.xpath, or row.findall, etc.), or just assume all their children are td elements (as they happen to be in this case):

    for column in row:

And then you can do whatever it is you wanted to do with each column, like extract its text:

        print column.text
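Put together, the steps above might look like the following minimal sketch. It assumes Python 3 (so print is a function) and parses an inline copy of the two rows from the question instead of fetching the page; the SAMPLE name and the wrapping table tag are additions for illustration only.

```python
from lxml import html  # gives html.fromstring; a bare "import lxml" does not

# Two rows copied from the question, wrapped in a <table> (an assumption
# about the surrounding page) so the fragment parses as one element.
SAMPLE = """
<table>
  <tr>
    <td> 04264</td>
    <td>01:198:205</td>
    <td>01</td>
    <td>INTR DISCRET STRCT I</td>
    <td>C</td>
    <td>Dec 17, 2014:  8:00 AM - 11:00 AM </td>
  </tr>
  <tr>
    <td> 09907</td>
    <td>01:198:214</td>
    <td>01</td>
    <td>SYSTEMS PROGRAMMING</td>
    <td>C</td>
    <td>Dec 18, 2014:  8:00 PM - 11:00 PM </td>
  </tr>
</table>
"""

tree = html.fromstring(SAMPLE)

rows = tree.xpath('//tr')      # the tr elements, not their text
for row in rows:
    for column in row:         # children of each tr; all td in this sample
        print(column.text)     # raw text, surrounding whitespace included
```

With a live page, tree would instead come from html.fromstring(page.text), as in the question.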


alecxe · Accepted Answer · 2014-12-03 02:34:08Z

0

Iterate over all tr elements, with an inner loop over the td elements of each row. For example:

from lxml.html import fromstring

data = """
your html here
"""

root = fromstring(data)
for index, row in enumerate(root.xpath('//table/tr')):
    print "Row #%s" % index

    for cell in row.findall('td'):
        print cell.text.strip()

    print "----"

Prints:

Row #0
04264
01:198:205
01
INTR DISCRET STRCT I
C
Dec 17, 2014:  8:00 AM - 11:00 AM
----
Row #1
09907
01:198:214
01
SYSTEMS PROGRAMMING
C
Dec 18, 2014:  8:00 PM - 11:00 PM
----
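If the cells need to be kept rather than printed, one possible variation is to collect each row into a dictionary. This is only a sketch under assumptions: it targets Python 3, the FIELDS names are hypothetical (the question does not name the columns), and text_content() is used in place of .text so that an empty cell yields an empty string instead of None.

```python
from lxml.html import fromstring

# Hypothetical column names, chosen for illustration only.
FIELDS = ["index", "course", "section", "title", "exam_code", "exam_time"]

data = """
<table>
  <tr>
    <td> 04264</td><td>01:198:205</td><td>01</td>
    <td>INTR DISCRET STRCT I</td><td>C</td>
    <td>Dec 17, 2014:  8:00 AM - 11:00 AM </td>
  </tr>
  <tr>
    <td> 09907</td><td>01:198:214</td><td>01</td>
    <td>SYSTEMS PROGRAMMING</td><td>C</td>
    <td>Dec 18, 2014:  8:00 PM - 11:00 PM </td>
  </tr>
</table>
"""

root = fromstring(data)
records = []
for row in root.xpath('//table/tr'):
    # text_content() always returns a string, so strip() is safe
    # even when a cell has no text at all.
    values = [cell.text_content().strip() for cell in row.findall('td')]
    records.append(dict(zip(FIELDS, values)))

for record in records:
    print(record["course"], record["title"], record["exam_time"])
```

Each record can then be looked up by name, for example records[0]["title"], instead of by position.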
