Webscraping with Python 3

Question

After scraping a website, I have retrieved all of its HTML links. I put them into a set() to remove duplicates, but I am still getting some values I don't want. How do I remove the values '#', '#content', '#uscb-nav-skip-header', '/', and None from the set of links?

from bs4 import BeautifulSoup
import urllib.request
import re

#Gets the html code for scraping
r = urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()

#Creates a beautifulsoup object to run
soup = BeautifulSoup(r, 'html.parser')

#Set removes duplicates
lst2 = set()
for link in soup.find_all('a'):
    lst2.add(link.get('href'))
lst2

{'#',
 '#content',
 '#uscb-nav-skip-header',
 '/',
 '/data/tables/time-series/demo/popest/pre-1980-county.html',
 '/data/tables/time-series/demo/popest/pre-1980-national.html',
 '/data/tables/time-series/demo/popest/pre-1980-state.html',
 '/en.html',
 '/library/publications/2010/demo/p25-1138.html',
 '/library/publications/2010/demo/p25-1139.html',
 '/library/publications/2015/demo/p25-1142.html',
 '/programs-surveys/popest/data.html',
 '/programs-surveys/popest/data/tables.html',
 '/programs-surveys/popest/geographies.html',
 '/programs-surveys/popest/guidance-geographies.html',
 None,
 'https://twitter.com/uscensusbureau',
 ...}

In the for loop, check if link.get('href') is something you don't want, and skip adding it to the set.
– Barmar, Commented Oct 31, 2019 at 0:36
Try to simplify your question a little more. The example is nice, but the HTML piece isn't that relevant to the problem. You could start with just the set and then ask something like: how do I remove items from a set based on some criteria?
– DStauffman, Commented Oct 31, 2019 at 0:46
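
As a minimal sketch of what these comments suggest, the unwanted values can be kept in a second set and subtracted from the scraped one. The sample `hrefs` below is a hand-copied subset of the output shown in the question, and the name `unwanted` is made up for illustration.

```python
# Sample values copied from the question's output (not a live scrape).
hrefs = {
    '#',
    '#content',
    '#uscb-nav-skip-header',
    '/',
    '/en.html',
    '/programs-surveys/popest/data.html',
    None,
    'https://twitter.com/uscensusbureau',
}

# Hypothetical blocklist of the exact values to drop.
unwanted = {'#', '#content', '#uscb-nav-skip-header', '/', None}

# Set difference keeps everything in hrefs that is not in unwanted.
cleaned = hrefs - unwanted
print(sorted(cleaned))
```

This only removes exact matches, so any fragment link that is not listed in `unwanted` would still get through.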

kaya3 · Accepted Answer · 2019-10-31 00:39:28Z

2

The # character and everything after it (the fragment) in a URL matters to the browser but not to the server when making a web request, so it is fine to cut that part off. This leaves URLs like '#content' empty, and also turns '/about#contact' into just '/about', which is what you want. From there, an if statement adds only the non-empty strings to the set. Because a missing href (None) is treated as an empty string, it is filtered out at the same time:

lst2 = set()
for link in soup.find_all('a'):
    # link.get('href') is None for <a> tags without an href; treat that as ''
    url = link.get('href') or ''
    # keep only the part before the fragment
    url = url.split('#')[0]
    if url:
        lst2.add(url)

If you specifically want to exclude '/' (although it is a valid URL), you can simply write lst2.discard('/') at the end. Since lst2 is a set, this will remove it if it's there, or do nothing if it isn't.
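
The same idea can be tried without a network request. The sketch below uses `urllib.parse.urldefrag` from the standard library in place of `split('#')`. The sample list and the helper name `clean_links` are assumptions for illustration, not part of the original code.

```python
from urllib.parse import urldefrag

def clean_links(hrefs):
    """Return the set of non-empty URLs with any #fragment removed."""
    cleaned = set()
    for href in hrefs:
        if href is None:        # <a> tag without an href attribute
            continue
        url, _fragment = urldefrag(href)  # '/about#contact' -> '/about'
        if url:                 # '#content' becomes '' and is skipped
            cleaned.add(url)
    return cleaned

# Hypothetical sample input, loosely based on the question's output.
sample = ['#', '#content', '/', '/en.html', '/about#contact', None]
links = clean_links(sample)
links.discard('/')  # optional: drop the site root as well
print(sorted(links))
```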

ooi18 · Accepted Answer · 2019-10-31 00:37:51Z

0

You can loop through your set and use a regex to filter each element. For None, simply check whether the value is None.
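
A rough sketch of that idea, assuming the goal is to drop fragment-only links, the bare '/' and None. The pattern and the names `UNWANTED` and `filter_links` are illustrative choices, not something given in the question.

```python
import re

# Matches a bare '/' or any string that starts with '#'.
UNWANTED = re.compile(r'^(?:/|#.*)$')

def filter_links(links):
    """Keep only string values that do not match the UNWANTED pattern."""
    return {
        link for link in links
        if link is not None and not UNWANTED.match(link)
    }

# Sample values taken from the question's output.
lst2 = {'#', '#content', '/', '/en.html', None,
        'https://twitter.com/uscensusbureau'}
print(sorted(filter_links(lst2)))
```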

game0ver · Accepted Answer · 2019-10-31 00:38:27Z

0

Try the following, which skips any <a> tag that has no href attribute, so None is never added:

set(link.get('href') for link in soup.find_all('a') if link.has_attr('href'))

paltaa · Accepted Answer · 2019-10-31 00:42:42Z

0

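You can use a set comprehension. The condition goes after the for clause, and None has to be checked first because '#' in None raises a TypeError:

new_set = {link for link in lst2 if link is not None and '#' not in link}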

QHarr · Accepted Answer · 2019-10-31 00:46:40Z

0

You could examine the HTML and use the :not pseudo-class (bs4 4.7.1+) to exclude anchors based on their class values, then apply a final test on href length to drop single-character values such as '#' and '/':

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = bs(r.content, 'lxml')
links = [i['href'] for i in soup.select('a[href]:not([class*="-nav-"],[class*="-pagination-"])') if len(i['href']) > 1]
print(links)
