Python Web Scraper/Crawler - HTML Tables to Excel Spreadsheet

Question

I'm trying to make a web scraper that will pull tables from a website and then paste them onto an excel spreadsheet. I'm an EXTREME beginner at Python (and coding in general) - literally started learning a couple days ago.

So, how do I make this web scraper/crawler? Here's the code that I have:

import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'https://www.techpowerup.com/gpudb/?mobile=0&released%5B%5D=y14_c&released%5B%5D=y11_14&generation=&chipname=&interface=&ushaders=&tmus=&rops=&memsize=&memtype=&buswidth=&slots=&powerplugs=&sort=released&q='
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'processors'})

list_of_rows = []
for row in table.findAll('tr')[1:]:
list_of_cells = []
for cell in row.findAll('td'):
    text = cell.text.replace('&nbsp;', '')
    list_of_cells.append(text)
list_of_rows.append(list_of_cells)

outfile = open("./GPU.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Product Name", "GPU Chip", "Released", "Bus", "Memory", "GPU clock", "Memory clock", "Shaders/TMUs/ROPs"])
writer.writerows(list_of_rows)

Now the program WORKS for the website present in the code above.

Now, I want to scrape the tables from the following website: https://www.techpowerup.com/gpudb/2990/radeon-rx-560d

Note that there are several tables on this page. What should I add/change to get the program to work on this page? I'm trying to get all of the tables, but if anyone could help me even get one of them, I would appreciate it so much!

There are some very nice tutorials and documentation that will help you learn! [Like this one (Web Scraping Workshop)] (gist.github.com/bradmontgomery/1872970). — cosinepenguin
– cosinepenguin, Commented Jul 13, 2017 at 3:19
@cosinepenguin Thanks for the link! I'm trying to scrape data tables. Do you have any links that would cover something like that, or even better, do you know if there's a small fix in my code that'll help me scrape one of those tables in the second link that I mentioned? Thank you SO much for your help! — James Flanagin
– James Flanagin, Commented Jul 13, 2017 at 3:25
I hope the answer below helps! Otherwise there are scores of awesome tutorials that explain how to do everything with BeautifulSoup! One of the most thorough is from Stanford: Web Scraping with Beautiful Soup. This also highlights DOM models in websites and how to interact with them. Once you master these basics, doing everything else is amazingly simple! Good luck! — cosinepenguin
– cosinepenguin, Commented Jul 13, 2017 at 4:03

cosinepenguin · Accepted Answer · 2017-07-13 03:53:47Z

Essentially, you just need to modify the code you have in your question to account for the fact the site has several tables!

What is really neat (or, dare I say, beautiful) about BeautifulSoup (BS) is the findAll method! This creates a BS object that you can iterate over!

So, say you have 5 tables in your source. You could conceivably run tables = soup.findAll("table"), which would return a list of every table object in the source's code! You could then iterate over that BS object and pull information out of each respective table.

Your code could look something like this:

import csv
import requests
import bs4

url = 'https://www.techpowerup.com/gpudb/2990/radeon-rx-560d'
response = requests.get(url)
html = response.content

soup = bs4.BeautifulSoup(html, "lxml")

tables = soup.findAll("table")

tableMatrix = []
for table in tables:
    #Here you can do whatever you want with the data! You can findAll table row headers, etc...
    list_of_rows = []
    for row in table.findAll('tr')[1:]:
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace('&nbsp;', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    tableMatrix.append((list_of_rows, list_of_cells))
print(tableMatrix)

This code works, though I will note that I did not add any of the CSV file formatting that the original code had! You'll have to redesign that however it works for you. But I commented at the location where you have absolute liberty to do whatever you please for each table in the source. You could decide to findAll("th") elements in each table object and populate your CSV file like that, or you could extract the information from the cells themselves. Right now I save the cell data for each table in a tuple, which I append to the list tableMatrix.

I hope this helps you on your Python and BeautifulSoup adventure!

Sources:

Hi @cosinepenguin! Thank you so much for all of your help! What command would you use to put each element in its own cell in an Excel spreadsheet? You're guidance is highly appreciated! :)

Brad Martsberger · Accepted Answer · 2019-03-19 20:05:12Z

0

If you want the practice, this isn't a bad thing to do. If you just want something that works, pip install tablepyxl (github)

answered Mar 19, 2019 at 20:05

Brad Martsberger

1,96714 silver badges7 bronze badges

Comments

Farhan Hai Khan · Accepted Answer · 2020-08-14 19:48:02Z

I think I have already solved this issue! You can run the code below on Google Colaboratory. Feel free to ask questions further if needed. Cheers! This code saves tables from multiple URLs into Excel Sheets.

#for writing to excel(xlsx) we will be needing XlsxWriter, please install it first if you don't have it!
try:
  import XlsxWriter
except ModuleNotFoundError:
  print("XlsxWriter is not installed!!")
  get_ipython().system("pip install XlsxWriter")

#to scrape a table from a webpage
from urllib.parse import urlparse,urlsplit
import requests
import pandas as pd
import os


urls=["https://www.macrotrends.net/countries/IND/india/gdp-growth-rate",
      "http://www.inwea.org/wind-energy-in-india/wind-power-potential",
      "https://en.wikipedia.org/wiki/List_of_districts_in_India",
      "https://en.wikipedia.org/wiki/List_of_Indian_people_by_net_worth",
      "https://en.wikipedia.org/wiki/States_and_union_territories_of_India",
      "https://en.wikipedia.org/wiki/List_of_governors-general_of_India",
      "https://en.wikipedia.org/wiki/List_of_Indian_independence_activists",
      "https://en.wikipedia.org/wiki/List_of_Indian_Grammy_Award_winners_and_nominees",
      "https://en.wikipedia.org/wiki/List_of_Indian_Academy_Award_winners_and_nominees",
      "https://en.wikipedia.org/wiki/List_of_highest-grossing_Indian_films"
      ]


print(len(urls),"Urls Found")

#convert the sheetname- remove _ and - , put title case and remove spaces
def modify_name(my_str):
  replaced=my_str.replace("_", " ").replace("-", " ")
  return replaced.title().replace(" ","")


#get all tables from a url
def get_dataframes(url):
  html = requests.get(url).content
  df_list = pd.read_html(html)
  #print(len(df_list)," Dataframes Returned")
  return df_list

#if df is too small then don't add it
def filter_dfs(dfs_list,min_rows=10):
  new_dfs_list=[]
  for each_df in dfs_list:
    if(len(each_df)>min_rows):
      new_dfs_list.append(each_df)
  return new_dfs_list

#to avoid InvalidWorksheetName: Excel worksheet name 'StatesAndUnionTerritoriesOfIndia1' must be <= 31 chars.
def crop_name(name,thres=29):
  if len(name)<thres:
    return name
  else:
    return name[:thres]

#to get first n elements from list only
def crop_list(lst,thres=29):
  if len(lst)<thres:
    return lst
  else:
    return lst[:thres]

#converts urls to dataframes to excel sheets
#get_max= get the maximum number of tables from each url
#min_rows= the minimum number of rows in each table to save it to the excel sheet
#crop_name_thres= some excel sheets can get quite huge sheet names which blows up the code
#so crop the sheet name for the better purpose

def urls_to_excel(urls,excel_path=None,get_max=10,min_rows=0,crop_name_thres=29):
  excel_path=os.path.join(os.getcwd(),"Excel_Multiple_Sheets_Output.xlsx") if excel_path==None else excel_path
  writer = pd.ExcelWriter(excel_path, engine='xlsxwriter')
  i=0
  for url in urls:
    parsed=urlsplit(url)
    sheet_name=parsed.path.split('/')[-1]
    mod_sheet_name=crop_name(modify_name(sheet_name),thres=crop_name_thres)

    dfs_list=get_dataframes(url)
    filtered_dfs_list=filter_dfs(dfs_list,min_rows=min_rows)
    filtered_dfs_list=crop_list(filtered_dfs_list,thres=get_max)
    for each_df in filtered_dfs_list:
      print("Parsing Excel Sheet "," : ",str(i)+mod_sheet_name)
      i+=1
      each_df.to_excel(writer, sheet_name=str(i)+mod_sheet_name, index=True)
  writer.save()
urls_to_excel(urls,get_max=1,min_rows=10)

Obtained Output after running:

XlsxWriter is not installed!!
Collecting XlsxWriter
  Downloading https://files.pythonhosted.org/packages/2b/98/17875723b6814fc4d0fc03f0997ee00de2dbd78cf195e2ec3f2c9c789d40/XlsxWriter-1.3.3-py2.py3-none-any.whl (144kB)
     |████████████████████████████████| 153kB 9.4MB/s
Installing collected packages: XlsxWriter
Successfully installed XlsxWriter-1.3.3
10 Urls Found
Parsing Excel Sheet   :  0GdpGrowthRate
Parsing Excel Sheet   :  1WindPowerPotential
Parsing Excel Sheet   :  2ListOfDistrictsInIndia
Parsing Excel Sheet   :  3ListOfIndianPeopleByNetWorth
Parsing Excel Sheet   :  4StatesAndUnionTerritoriesOfIn
Parsing Excel Sheet   :  5ListOfGovernorsGeneralOfIndia
Parsing Excel Sheet   :  6ListOfIndianIndependenceActiv
Parsing Excel Sheet   :  7ListOfIndianGrammyAwardWinner
Parsing Excel Sheet   :  8ListOfIndianAcademyAwardWinne
Parsing Excel Sheet   :  9ListOfHighestGrossingIndianFi

Collectives™ on Stack Overflow

Python Web Scraper/Crawler - HTML Tables to Excel Spreadsheet

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related