I am a Python newbie, and I'm having some trouble that I can't resolve (even after about a million Google searches).

I have >100 HTML files, each of which contains a couple of tables. Ultimately, I would like to have each row of the first HTML table in each file as a list in Python, but without the HTML tags. As a first step, I'm trying to figure out how to get rid of the HTML tags; then I need to figure out how to import the result as a list.

My HTML file looks like this:

 <tr><td>1</td><td>FORWARD</td><td>72</td><td>20</td><td>60.29</td><td>55.00</td><td>5.00</td><td>3.00</td></tr>
 <tr><td>&nbsp;</td><td>REVERSE</td><td>258</td><td>20</td><td>60.11</td><td>45.00</td><td>4.00</td><td>3.00</td></tr>
 <tr><td>2</td><td>FORWARD</td><td>77</td><td>20</td><td>60.08</td><td>50.00</td><td>5.00</td><td>2.00</td></tr>
 <tr><td>&nbsp;</td><td>REVERSE</td><td>258</td><td>20</td><td>60.11</td><td>45.00</td><td>4.00</td><td>3.00</td></tr>

And what I want is for the values from the rows to be put in lists, similar to what you would get if you did this by hand:

 row1 = ['FORWARD', 72, 20, 60.29, 55.00, 5.00, 3.00]

I read that BeautifulSoup might be able to help, so I tried:

 from bs4 import BeautifulSoup

 def removeTags(html, *tags):
     soup = BeautifulSoup(html)
     for tag in tags:
         for tag in soup.findAll(tag):
             tag.replaceWith("")
     return soup


 testhtml = open('myfile.html', 'r')

 print removeTags(testhtml, 'tr', 'td')

But this seems to remove all of the information in the tables, not just the HTML tags. I've tried several other things as well, but I seem to be stuck. I would appreciate any suggestions.

3 Answers

This is a little sloppy but it does the trick.

with open('htmlfile.html','r') as file:
  rows = []
  for line in file:
    start = max(line.find('FORWARD'),line.find('REVERSE'))
    rows.append(line[start:].replace('<','').replace('>','').replace('/','').replace('td',' ').replace('tr',' ').strip().split('  '))
print(rows)

4 Comments

It's a bad idea to assume that an HTML file has a given line structure.
This works great for me, even if it's a little messy. But I don't understand why the line start = max(line.find('FORWARD'),line.find('REVERSE')) works. print(start) returns a value of -1, which according to the Python documentation indicates that it could not find the string 'FORWARD' or the string 'REVERSE'.
start == -1 would imply that neither FORWARD nor REVERSE is in the line. If either of them is in the line, then start is set to the index where it first appears.
Oh, I see. I was running the start = ... for the wrong line so that's why I was always getting -1 back. Makes much more sense when you do it right...Thanks!
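
To make the point in these comments concrete, here is a small sketch of how the max(...find...) trick behaves. The sample lines and the helper name keyword_start are made up for illustration; only str.find and max come from the answer above.

```python
# Sample lines modelled on the rows in the question (illustrative only).
forward_line = '<tr><td>1</td><td>FORWARD</td><td>72</td></tr>'
reverse_line = '<tr><td>&nbsp;</td><td>REVERSE</td><td>258</td></tr>'
other_line = '<table border="1">'


def keyword_start(line):
    # str.find returns -1 when the substring is absent, so max() yields the
    # index of whichever keyword is present, or -1 if neither one is.
    return max(line.find('FORWARD'), line.find('REVERSE'))


forward_start = keyword_start(forward_line)
reverse_start = keyword_start(reverse_line)
missing_start = keyword_start(other_line)

# Slicing from the returned index starts the line at the keyword.
print(forward_line[forward_start:])
print(reverse_line[reverse_start:])
print(missing_start)
```

Note that a line containing neither keyword is not skipped by the answer's loop: with start equal to -1, line[start:] is just the last character of the line, so it may be worth checking for -1 before appending.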

Given your sample data, and assuming soup is the BeautifulSoup object built from your file, you can get the first row as a list with the following code:

>>> list(soup.find('tr').strings)
[u'1', u'FORWARD', u'72', u'20', u'60.29', u'55.00', u'5.00', u'3.00']
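
The same idea might be extended to every row of the first table along these lines. This is only a sketch: the html string is made-up sample data modelled on the question (a real file would be read from disk), and the choice of the built-in 'html.parser' backend is an assumption.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; not the asker's actual file.
html = """
<table>
<tr><td>1</td><td>FORWARD</td><td>72</td><td>20</td><td>60.29</td></tr>
<tr><td>&nbsp;</td><td>REVERSE</td><td>258</td><td>20</td><td>60.11</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
first_table = soup.find('table')  # only the first table in the document

rows = []
for tr in first_table.find_all('tr'):
    # get_text() returns each cell's text with the tags removed;
    # strip=True also trims surrounding whitespace from that text.
    rows.append([td.get_text(strip=True) for td in tr.find_all('td')])

for row in rows:
    print(row)
```

Using get_text per cell rather than .strings per row keeps one list entry per td, even when a cell is empty or contains nested tags.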

Try something like this:

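from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html: the file contents or an open file object
table = soup.find('table')  # first table in the document
rows = table.find_all('tr')
for row in rows:
    print([col.string for col in row.find_all('td')])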

Edit: You can call float on col.string if you want to get numbers back, but this will raise a ValueError for non-numeric values such as 'FORWARD'. This should get you started, however.
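
One possible way to handle the mixed values is to try float first and fall back to the original text. The helper name to_number and the sample row are hypothetical, and the sketch assumes every cell is already a string (col.string can be None for cells with nested tags, which would need separate handling).

```python
def to_number(text):
    """Return text as a float when possible, otherwise the stripped string."""
    try:
        return float(text)
    except ValueError:
        # Non-numeric cells such as 'FORWARD' are kept as text.
        return text.strip()


# A row of cell strings as produced by the loop above (sample values).
raw_row = ['1', 'FORWARD', '72', '20', '60.29', '55.00', '5.00', '3.00']
row = [to_number(cell) for cell in raw_row]
print(row)
```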

2 Comments

My apologies. Your answer looked very similar to one I had posted and I was trying to edit mine. I accidentally edited yours.
@Eric No worries. Rolled back.
