After scraping a website, I have retrieved all the HTML links. I put them into a set() to remove duplicates, but I am still getting certain unwanted values. How do I remove the values '#', '#content', '#uscb-nav-skip-header', '/', and None from the set of links?
from bs4 import BeautifulSoup
import urllib.request
import re
#Gets the html code for scraping
r = urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()
#Creates a beautifulsoup object to run
soup = BeautifulSoup(r, 'html.parser')
#Set removes duplicates
lst2 = set()
for link in soup.find_all('a'):
    lst2.add(link.get('href'))
lst2
{'#',
'#content',
'#uscb-nav-skip-header',
'/',
'/data/tables/time-series/demo/popest/pre-1980-county.html',
'/data/tables/time-series/demo/popest/pre-1980-national.html',
'/data/tables/time-series/demo/popest/pre-1980-state.html',
'/en.html',
'/library/publications/2010/demo/p25-1138.html',
'/library/publications/2010/demo/p25-1139.html',
'/library/publications/2015/demo/p25-1142.html',
'/programs-surveys/popest/data.html',
'/programs-surveys/popest/data/tables.html',
'/programs-surveys/popest/geographies.html',
'/programs-surveys/popest/guidance-geographies.html',
None,
'https://twitter.com/uscensusbureau',
...}
In your for loop, check whether link.get('href') is something you don't want, and skip adding it to the set.
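A minimal sketch of that check, written as a standalone helper so it can be tried without hitting the network. The names clean_links and UNWANTED are made up for illustration, and the rule "skip anything that starts with '#'" is an assumption based on the values listed in the question; adjust it if you need to keep some fragment links.

```python
# Hypothetical helper: the name and the skip rules are assumptions for illustration.
UNWANTED = {'/'}

def clean_links(hrefs):
    """Return a set of hrefs without None, '/' or fragment-only links."""
    cleaned = set()
    for href in hrefs:
        # link.get('href') returns None for <a> tags that have no href attribute
        if href is None:
            continue
        # '#', '#content' and '#uscb-nav-skip-header' all start with '#'
        if href.startswith('#') or href in UNWANTED:
            continue
        cleaned.add(href)
    return cleaned

# Sample values copied from the output shown in the question
sample = ['#', '#content', '#uscb-nav-skip-header', '/', None,
          '/en.html', 'https://twitter.com/uscensusbureau']
print(clean_links(sample))
```

With the soup object from the question, this could be called as lst2 = clean_links(link.get('href') for link in soup.find_all('a')). The None check comes first because calling startswith on None would raise an AttributeError.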