Let's say I have a string that looks like this:

stuff = "<table><tr><td>Tuesday, January 15, 2019</td><td>2:44 PM EST</td><td>12</td><td>$530</td></tr><tr><td>Thursday, January 3, 2019</td><td>11:55 PM EST</td><td>11.5</td><td>$821</td></tr><tr><td>Friday, December 7, 2018</td><td>2:49 AM EST</td><td>11</td><td>$800</td></tr><tr><td>Wednesday, November 28, 2018</td><td>11:49 AM EST</td><td>9.5</td><td>$487</td></tr><tr><td>Monday, November 26, 2018</td><td>10:25 AM EST</td><td>11</td><td>$650</td></tr><tr><td>Thursday, November 22, 2018</td><td>5:52 PM EST</td><td>8.5</td><td>$792</td></tr><tr><td>Thursday, November 8, 2018</td><td>3:42 PM EST</td><td>11.5</td><td>$600</td></tr><tr><td>Saturday, September 29, 2018</td><td>9:40 PM EST</td><td>10</td><td>$470</td></tr><tr><td>Tuesday, September 4, 2018</td><td>4:11 PM EST</td><td>9.5</td><td>$649</td></tr><tr><td>Friday, July 13, 2018</td><td>2:07 PM EST</td><td>8</td><td>$650</td></tr><tr><td>Friday, July 6, 2018</td><td>1:21 PM EST</td><td>12</td><td>$495</td></tr><tr><td>Wednesday, June 13, 2018</td><td>5:14 PM EST</td><td>10</td><td>$450</td></tr><tr><td>Monday, June 4, 2018</td><td>4:24 PM EST</td><td>9.5</td><td>$476</td></tr><tr><td>Friday, April 13, 2018</td><td>9:16 AM EST</td><td>10.5</td><td>$650</td></tr><tr><td>Monday, March 5, 2018</td><td>7:23 AM EST</td><td>8.5</td><td>$560</td></tr><tr><td>Thursday, January 11, 2018</td><td>1:40 PM EST</td><td>12</td><td>$800</td></tr><tr><td>Saturday, January 6, 2018</td><td>3:13 PM EST</td><td>9</td><td>$600</td></tr><tr><td>Thursday, December 14, 2017</td><td>1:06 PM EST</td><td>7.5</td><td>$726</td></tr><tr><td>Thursday, November 9, 2017</td><td>6:10 PM EST</td><td>10.5</td><td>$601</td></tr><tr><td>Wednesday, September 20, 2017</td><td>9:40 AM EST</td><td>10.5</td><td>$850</td></tr><tr><td>Friday, July 6, 2018</td><td>1:21 PM EST</td><td>12</td><td>$495</td></tr><tr><td>Wednesday, June 13, 2018</td><td>5:14 PM EST</td><td>10</td><td>$450</td></tr><tr><td>Monday, June 4, 2018</td><td>4:24 PM 
EST</td><td>9.5</td><td>$476</td></tr><tr><td>Friday, April 13, 2018</td><td>9:16 AM EST</td><td>10.5</td><td>$650</td></tr><tr><td>Monday, March 5, 2018</td><td>7:23 AM EST</td><td>8.5</td><td>$560</td></tr><tr><td>Thursday, January 11, 2018</td><td>1:40 PM EST</td><td>12</td><td>$800</td></tr><tr><td>Saturday, January 6, 2018</td><td>3:13 PM EST</td><td>9</td><td>$600</td></tr><tr><td>Thursday, December 14, 2017</td><td>1:06 PM EST</td><td>7.5</td><td>$726</td></tr><tr><td>Thursday, November 9, 2017</td><td>6:10 PM EST</td><td>10.5</td><td>$601</td></tr><tr><td>Wednesday, September 20, 2017</td><td>9:40 AM EST</td><td>10.5</td><td>$850</td></tr><tr><td>Monday, July 24, 2017</td><td>12:22 PM EST</td><td>10.5</td><td>$600</td></tr><tr><td>Saturday, June 17, 2017</td><td>7:54 AM EST</td><td>11</td><td>$550</td></tr><tr><td>Saturday, June 10, 2017</td><td>7:32 PM EST</td><td>7.5</td><td>$750</td></tr><tr><td>Wednesday, May 24, 2017</td><td>3:10 PM EST</td><td>11</td><td>$741</td></tr><tr><td>Sunday, May 14, 2017</td><td>4:34 AM EST</td><td>10.5</td><td>$750</td></tr><tr><td>Monday, April 17, 2017</td><td>8:45 AM EST</td><td>10.5</td><td>$750</td></tr><tr><td>Saturday, April 1, 2017</td><td>9:44 PM EST</td><td>11</td><td>$750</td></tr><tr><td>Thursday, March 2, 2017</td><td>4:05 PM EST</td><td>11</td><td>$970</td></tr><tr><td>Thursday, February 23, 2017</td><td>3:03 PM EST</td><td>11.5</td><td>$675</td></tr><tr><td>Monday, January 23, 2017</td><td>3:29 PM EST</td><td>11</td><td>$726</td></tr><tr><td>Sunday, January 22, 2017</td><td>6:47 PM EST</td><td>11</td><td>$655</td></tr><tr><td>Friday, December 9, 2016</td><td>2:38 AM EST</td><td>10</td><td>$575</td></tr><tr><td>Thursday, December 8, 2016</td><td>5:23 PM EST</td><td>11.5</td><td>$1,200</td></tr><tr><td>Thursday, December 8, 2016</td><td>8:29 AM EST</td><td>12</td><td>$946</td></tr><tr><td>Saturday, November 26, 2016</td><td>3:09 PM EST</td><td>12</td><td>$1,031</td></tr><tr><td>Wednesday, November 23, 
2016</td><td>3:45 PM EST</td><td>7.5</td><td>$650</td></tr><tr><td>Monday, November 21, 2016</td><td>7:23 AM EST</td><td>11</td><td>$1,031</td></tr><tr><td>Friday, November 18, 2016</td><td>5:12 PM EST</td><td>11</td><td>$1,031</td></tr><tr><td>Thursday, November 17, 2016</td><td>9:11 AM EST</td><td>11</td><td>$660</td></tr><tr><td>Tuesday, November 8, 2016</td><td>7:17 AM EST</td><td>6.5</td><td>$777</td></tr><tr><td>Saturday, September 24, 2016</td><td>5:57 PM EST</td><td>8</td><td>$815</td></tr><tr><td>Thursday, August 25, 2016</td><td>3:52 PM EST</td><td>6.5</td><td>$750</td></tr><tr><td>Saturday, August 20, 2016</td><td>2:20 PM EST</td><td>10.5</td><td>$721</td></tr><tr><td>Saturday, August 20, 2016</td><td>1:39 PM EST</td><td>8</td><td>$721</td></tr><tr><td>Thursday, July 21, 2016</td><td>1:21 PM EST</td><td>10.5</td><td>$650</td></tr><tr><td>Wednesday, July 20, 2016</td><td>6:14 AM EST</td><td>7.5</td><td>$777</td></tr><tr><td>Saturday, June 25, 2016</td><td>10:00 AM EST</td><td>9.5</td><td>$950</td></tr><tr><td>Thursday, June 23, 2016</td><td>5:26 PM EST</td><td>10.5</td><td>$580</td></tr><tr><td>Tuesday, June 21, 2016</td><td>1:19 PM EST</td><td>12.5</td><td>$600</td></tr><tr><td>Tuesday, May 31, 2016</td><td>10:06 AM EST</td><td>9.5</td><td>$828</td></tr></table>"

How would I use something like .read_html() in Pandas to make sense of that string?

I am obtaining this string using Selenium:

stuff = html_table.get_attribute('innerHTML')

and I must do it this way because there is a bunch of JavaScript stopping me from accessing things.

1 Answer

Use read_html, which returns a list of DataFrames, so select the first one by indexing:

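import pandas as pd
from io import StringIO

# newer pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(stuff))[0]
print(df.head())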
                              0             1     2     3
0     Tuesday, January 15, 2019   2:44 PM EST  12.0  $530
1     Thursday, January 3, 2019  11:55 PM EST  11.5  $821
2      Friday, December 7, 2018   2:49 AM EST  11.0  $800
3  Wednesday, November 28, 2018  11:49 AM EST   9.5  $487
4     Monday, November 26, 2018  10:25 AM EST  11.0  $650
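
For a quick check without Selenium, here is a minimal sketch that parses a shortened two-row version of the table. The name `sample_html` is made up for illustration, and wrapping the string in `io.StringIO` is an assumption to avoid the warning newer pandas versions give for literal HTML input. `read_html` also needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row excerpt of the table from the question.
sample_html = (
    "<table>"
    "<tr><td>Tuesday, January 15, 2019</td><td>2:44 PM EST</td>"
    "<td>12</td><td>$530</td></tr>"
    "<tr><td>Thursday, January 3, 2019</td><td>11:55 PM EST</td>"
    "<td>11.5</td><td>$821</td></tr>"
    "</table>"
)

# read_html returns one DataFrame per <table> found in the input.
tables = pd.read_html(StringIO(sample_html))

# There is only one table here, so take the first element.
df = tables[0]

# The table has no header row, so the columns are numbered 0..3.
print(df)
```

Because the table has no `<th>` cells, the columns come back numbered rather than named, which is why renaming them is a sensible next step.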

Then, if necessary, you can do some data cleaning:

df.columns = ['date','time','val1','val2']

df['date'] = pd.to_datetime(df['date'] + '-' + df.pop('time').str[:-4], 
                            format='%A, %B %d, %Y-%I:%M %p')

# raw string avoids an invalid-escape warning for the literal dollar sign
df['val2'] = df['val2'].replace([r'\$', ','], '', regex=True).astype(int)
print(df.head())
                 date  val1  val2
0 2019-01-15 14:44:00  12.0   530
1 2019-01-03 23:55:00  11.5   821
2 2018-12-07 02:49:00  11.0   800
3 2018-11-28 11:49:00   9.5   487
4 2018-11-26 10:25:00  11.0   650
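
The cleaning steps can also be tried on their own, without parsing any HTML. The sketch below builds a hypothetical two-row frame by hand (the rows are copied from the question's data, the construction is not part of the original answer) and applies the same idea. It assumes every time value ends in the four-character suffix " EST", which is what the `[:-4]` slice relies on. It joins date and time with a space rather than a hyphen, and swaps in a character class `[$,]` for the list of patterns.

```python
import pandas as pd

# Hypothetical hand-built frame with already-renamed columns.
df = pd.DataFrame({
    "date": ["Tuesday, January 15, 2019", "Thursday, December 8, 2016"],
    "time": ["2:44 PM EST", "5:23 PM EST"],
    "val1": [12.0, 11.5],
    "val2": ["$530", "$1,200"],
})

# Drop the trailing " EST" (4 characters), join date and time,
# and parse the result into a single datetime column.
# pop() removes the "time" column from the frame as a side effect.
df["date"] = pd.to_datetime(
    df["date"] + " " + df.pop("time").str[:-4],
    format="%A, %B %d, %Y %I:%M %p",
)

# Strip the dollar sign and thousands separator, then convert to int.
df["val2"] = df["val2"].str.replace(r"[$,]", "", regex=True).astype(int)

print(df)
```

Note that the resulting timestamps are timezone-naive: the "EST" label is discarded, not interpreted. If the timezone matters, it would have to be attached separately.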