I am a Python newbie, and I'm having some trouble that I can't resolve (even after about a million Google searches).
I have more than 100 HTML files, each of which contains a couple of tables. Ultimately, I would like to have each row of the first HTML table in each file as a list in Python, without the HTML tags. As a first step, I'm trying to figure out how to get rid of the HTML tags; after that, I need to figure out how to import the result as a list.
My HTML file looks like this:
<tr><td>1</td><td>FORWARD</td><td>72</td><td>20</td><td>60.29</td><td>55.00</td><td>5.00</td><td>3.00</td></tr>
<tr><td> </td><td>REVERSE</td><td>258</td><td>20</td><td>60.11</td><td>45.00</td><td>4.00</td><td>3.00</td></tr>
<tr><td>2</td><td>FORWARD</td><td>77</td><td>20</td><td>60.08</td><td>50.00</td><td>5.00</td><td>2.00</td></tr>
<tr><td> </td><td>REVERSE</td><td>258</td><td>20</td><td>60.11</td><td>45.00</td><td>4.00</td><td>3.00</td></tr>
What I want is for the values from each row to be put in a list, similar to what you would get if you did this by hand:
row1 = [FORWARD, 72, 20, 60.29, 55.0, 5.00, 3.00].
I read that BeautifulSoup might be able to help, so I tried:
from bs4 import BeautifulSoup

def removeTags(html, *tags):
    soup = BeautifulSoup(html)
    for tag in tags:
        for tag in soup.findAll(tag):
            tag.replaceWith("")
    return soup

testhtml = open('myfile.html', 'r')
print removeTags(testhtml, 'tr', 'td')
But this seems to remove all of the information in the tables, not just the HTML tags. I've tried several other things as well, but I seem to be stuck. I would appreciate any suggestions.
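For reference, here is a minimal sketch of the result described above. It assumes Python 3 and uses the standard library's `html.parser` in place of BeautifulSoup, so it is not the approach attempted in the post. The names `TableRowParser` and `rows_from_html` are hypothetical, the cell values are kept as strings, and the sketch does not try to isolate the first table in a file.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Hypothetical helper: collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, one list of strings per <tr>
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def rows_from_html(html):
    """Return one list of cell strings per table row in `html`."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Two rows copied from the sample HTML in the post.
sample = (
    "<tr><td>1</td><td>FORWARD</td><td>72</td><td>20</td>"
    "<td>60.29</td><td>55.00</td><td>5.00</td><td>3.00</td></tr>"
    "<tr><td> </td><td>REVERSE</td><td>258</td><td>20</td>"
    "<td>60.11</td><td>45.00</td><td>4.00</td><td>3.00</td></tr>"
)
rows = rows_from_html(sample)
print(rows[0])
```

Each row comes back as a list of strings, with blank cells as empty strings. Converting the numeric cells with `int` or `float`, or dropping the first column, would be a separate step.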