Import rows of a table from html file as a list in python

Question

I am a Python newbie, and I'm having some trouble that I can't resolve (even after about a million Google searches).

I have >100 HTML files, each of which has a couple of tables in it. Ultimately, I would like to have each row of the first HTML table in the file as a list in Python, but without the HTML tags. As a first step I'm trying to figure out how to get rid of the HTML tags, and then I need to figure out how to import the result as a list.

My HTML file looks like this:

 <tr><td>1</td><td>FORWARD</td><td>72</td><td>20</td><td>60.29</td><td>55.00</td><td>5.00</td><td>3.00</td></tr>
 <tr><td>&nbsp;</td><td>REVERSE</td><td>258</td><td>20</td><td>60.11</td><td>45.00</td><td>4.00</td><td>3.00</td></tr>
 <tr><td>2</td><td>FORWARD</td><td>77</td><td>20</td><td>60.08</td><td>50.00</td><td>5.00</td><td>2.00</td></tr>
 <tr><td>&nbsp;</td><td>REVERSE</td><td>258</td><td>20</td><td>60.11</td><td>45.00</td><td>4.00</td><td>3.00</td></tr>

And what I want is the values from the rows to be put in lists, similar to what you would get if you did this by hand:

 row1 = ['FORWARD', 72, 20, 60.29, 55.00, 5.00, 3.00]

I read that BeautifulSoup might be able to help, so I tried:

 from bs4 import BeautifulSoup

 def removeTags(html, *tags):
     soup = BeautifulSoup(html)
     for tag in tags:
         for tag in soup.findAll(tag):
             tag.replaceWith("")
     return soup


 testhtml = open('myfile.html', 'r')

 print removeTags(testhtml, 'tr', 'td')

But this seems to remove all of the information in the tables, not just the HTML tags. I've tried several other things as well, but I seem to be stuck. I would appreciate any suggestions.

Octipi · Accepted Answer · 2013-02-20 00:01:35Z

2

This is a little sloppy but it does the trick.

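with open('htmlfile.html', 'r') as file:
  rows = []
  for line in file:
    # Index of whichever keyword is on this line (-1 if neither is present).
    start = max(line.find('FORWARD'), line.find('REVERSE'))
    # Strip the markup characters; each 'td'/'tr' becomes a space, leaving
    # two spaces between cells, which is what split('  ') relies on.
    rows.append(line[start:].replace('<','').replace('>','').replace('/','').replace('td',' ').replace('tr',' ').strip().split('  '))
print(rows)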

edited Feb 20, 2013 at 0:01

answered Feb 19, 2013 at 23:56

Octipi


4 Comments

wRAR Over a year ago

It's a bad idea to assume that an HTML file has a given line structure.

Andreanna Over a year ago

This works great for me, even if it's a little messy. But I don't understand why the line start = max(line.find('FORWARD'),line.find('REVERSE')) works. print(start) returns a value of -1, which according to the Python documentation indicates that it could not find the string 'FORWARD' or the string 'REVERSE'.

Octipi Over a year ago

start == -1 means that neither FORWARD nor REVERSE is in the line. If either of them is in the line, then start is set to the index where it first appears.

Andreanna Over a year ago

Oh, I see. I was running the start = ... for the wrong line so that's why I was always getting -1 back. Makes much more sense when you do it right...Thanks!
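To illustrate the exchange above, here is a small sketch of how the max(...) trick behaves. The two sample lines and the helper name keyword_start are made up for illustration and are not part of the original answer. It also shows a caveat that seems worth knowing: when neither keyword is found, line[-1:] is just the last character of the line, so such lines are not skipped unless you filter them out yourself.

```python
# One line shaped like the question's sample rows, and one with neither keyword.
forward_line = "<tr><td>1</td><td>FORWARD</td><td>72</td></tr>\n"
other_line = "<table border='1'>\n"


def keyword_start(line):
    """Index of 'FORWARD' or 'REVERSE' in line, or -1 if neither occurs."""
    # str.find returns -1 on a miss, so max() picks the hit when there is one.
    return max(line.find('FORWARD'), line.find('REVERSE'))


print(keyword_start(forward_line))  # 18
print(keyword_start(other_line))    # -1

# Caveat: with start == -1, line[start:] is only the trailing newline,
# so lines without a keyword should be filtered out explicitly.
rows = [line for line in (forward_line, other_line) if keyword_start(line) != -1]
print(len(rows))  # 1
```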

Jon Clements · Accepted Answer · 2013-02-19 23:58:15Z

0

Given your sample data, and assuming soup is a BeautifulSoup object already built from the file's contents, you can get the first row as a list with the following code:

>>> list(soup.find('tr').strings)
[u'1', u'FORWARD', u'72', u'20', u'60.29', u'55.00', u'5.00', u'3.00']
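For readers who want to run this end to end, the following is a minimal sketch of where soup might come from. The html string reuses two rows from the question, wrapped in a table element as an assumption about the real files, and the explicit "html.parser" argument is my choice rather than something the answer specifies. Under Python 3 the strings print without the u prefix.

```python
from bs4 import BeautifulSoup

# Two rows copied from the question, wrapped in a <table> (an assumption
# about the surrounding markup in the real files).
html = (
    "<table>"
    "<tr><td>1</td><td>FORWARD</td><td>72</td><td>20</td><td>60.29</td>"
    "<td>55.00</td><td>5.00</td><td>3.00</td></tr>"
    "<tr><td>&nbsp;</td><td>REVERSE</td><td>258</td><td>20</td><td>60.11</td>"
    "<td>45.00</td><td>4.00</td><td>3.00</td></tr>"
    "</table>"
)

# Naming the parser explicitly avoids depending on whichever one is installed.
soup = BeautifulSoup(html, "html.parser")

# find('tr') returns the first row; .strings yields the text of each cell.
first_row = list(soup.find("tr").strings)
print(first_row)
# ['1', 'FORWARD', '72', '20', '60.29', '55.00', '5.00', '3.00']
```

For a real file, html would be the result of reading it, for example open('myfile.html').read().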

answered Feb 19, 2013 at 23:58

Jon Clements


askewchan · Accepted Answer · 2013-02-20 00:09:47Z

0

Try something like this:

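from bs4 import BeautifulSoup

# html holds the file's contents, e.g. html = open('myfile.html').read()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')  # first table in the document
rows = table.find_all('tr')
for row in rows:
    print([col.string for col in row.find_all('td')])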

Edit: You can call float on col.string if you want to get numbers back, but this will raise a ValueError for non-numeric cells such as 'FORWARD'. This should get you started, however.
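One way to handle the mixed cells, sketched here under my own assumptions, is to try float first and fall back to the text. The helper name to_number is hypothetical, and the cells list stands in for the strings pulled out of one row.

```python
def to_number(text):
    """Return text as a float when possible, otherwise the stripped text."""
    try:
        return float(text)
    except (TypeError, ValueError):
        # Non-numeric cells such as 'FORWARD' raise ValueError; a missing
        # string (None) raises TypeError. Both fall through to here.
        return text.strip() if text is not None else None


# Stand-in for the strings extracted from the first sample row.
cells = ['1', 'FORWARD', '72', '20', '60.29', '55.00', '5.00', '3.00']
row = [to_number(cell) for cell in cells]
print(row)
# [1.0, 'FORWARD', 72.0, 20.0, 60.29, 55.0, 5.0, 3.0]
```

A cell containing only a non-breaking space (the &nbsp; in the sample) comes back as an empty string, which you may want to drop or replace afterwards.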

edited Feb 20, 2013 at 0:09

answered Feb 19, 2013 at 23:50

askewchan


2 Comments

Octipi Over a year ago

My apologies. Your answer looked very similar to one I had posted and I was trying to edit mine. I accidentally edited yours.

askewchan Over a year ago

@Eric No worries. Rolled back.
