After scraping a website, I retrieved all of its HTML links. I put them into a set() to remove duplicates, but some unwanted values remain. How do I remove the values '#', '#content', '#uscb-nav-skip-header', '/', and None from the set of links?

from bs4 import BeautifulSoup
import urllib.request
import re

# Get the HTML code for scraping
r = urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()

# Create a BeautifulSoup object to parse the page
soup = BeautifulSoup(r, 'html.parser')

# A set removes duplicates
lst2 = set()
for link in soup.find_all('a'):
    lst2.add(link.get('href'))
lst2

{'#',
 '#content',
 '#uscb-nav-skip-header',
 '/',
 '/data/tables/time-series/demo/popest/pre-1980-county.html',
 '/data/tables/time-series/demo/popest/pre-1980-national.html',
 '/data/tables/time-series/demo/popest/pre-1980-state.html',
 '/en.html',
 '/library/publications/2010/demo/p25-1138.html',
 '/library/publications/2010/demo/p25-1139.html',
 '/library/publications/2015/demo/p25-1142.html',
 '/programs-surveys/popest/data.html',
 '/programs-surveys/popest/data/tables.html',
 '/programs-surveys/popest/geographies.html',
 '/programs-surveys/popest/guidance-geographies.html',
 None,
 'https://twitter.com/uscensusbureau',
 ...}
  • In the for loop, check if link.get('href') is something you don't want, and skip adding it to the set. Commented Oct 31, 2019 at 0:36
  • Try to simplify your question a little more. The example is nice, but the HTML piece isn't that relevant to the problem. You could start with just the set and then ask something like "how do I remove items from a set based on some criteria?" Commented Oct 31, 2019 at 0:46

5 Answers

The character # (and everything after it) in a URL is relevant to the browser, but it is not sent to the server when making a web request, so it is fine to cut that part out of each URL. This leaves URLs like '#content' empty, and it also turns '/about#contact' into just '/about', which is what you want. From there, an if statement adds only the non-empty strings to the set. Treating a missing href (None) as an empty string means the same check filters out None as well:

lst2 = set()
for link in soup.find_all('a'):
    # href is None for <a> tags that have no href attribute
    url = link.get('href') or ''
    # Drop the fragment: the '#' and everything after it
    url = url.split('#')[0]
    if url:
        lst2.add(url)

If you specifically want to exclude '/' (although it is a valid URL), you can simply write lst2.discard('/') at the end. Since lst2 is a set, this will remove it if it's there, or do nothing if it isn't.
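
To check this approach without fetching a live page, here is a minimal sketch that runs on a hard-coded list of href values. It swaps split('#') for urllib.parse.urldefrag from the standard library, which does the same fragment removal. The function name clean_links, its exclude parameter, and the sample list are made up for illustration.

```python
from urllib.parse import urldefrag


def clean_links(hrefs, exclude=('/',)):
    """Return a set of hrefs with fragments stripped and unwanted values dropped."""
    cleaned = set()
    for href in hrefs:
        # Skip None (no href attribute) and empty strings
        if not href:
            continue
        # urldefrag splits off the '#fragment' part; .url is what remains
        url = urldefrag(href).url
        if url:
            cleaned.add(url)
    # discard() removes a value if present and does nothing otherwise
    for unwanted in exclude:
        cleaned.discard(unwanted)
    return cleaned


# Hypothetical sample resembling the values in the question
sample = ['#', '#content', '/', None, '/en.html', '/about#contact',
          'https://twitter.com/uscensusbureau']
result = clean_links(sample)
print(sorted(result))
```

Passing exclude=() would keep '/' in the result, since it is a valid URL.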

You can loop through your set and use a regular expression to filter each element. For None, simply check whether the value is None.
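
A minimal sketch of that idea, assuming the unwanted entries are fragment-only links (anything starting with '#') and the bare '/'. The pattern, the name filter_links, and the sample set are illustrative choices, not part of the original answer.

```python
import re

# Assumed pattern: a whole string that is either '#...' or exactly '/'
UNWANTED = re.compile(r'#.*|/')


def filter_links(links):
    """Keep links that are not None and do not fully match the unwanted pattern."""
    return {
        link for link in links
        if link is not None and not UNWANTED.fullmatch(link)
    }


# Sample values taken from the set shown in the question
lst2 = {'#', '#content', '#uscb-nav-skip-header', '/', None,
        '/en.html', '/programs-surveys/popest/data.html'}
filtered = filter_links(lst2)
print(sorted(filtered))
```

Using fullmatch rather than search matters here: a search for '/' would also reject every path such as '/en.html'.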

Try the following, which skips <a> tags that have no href attribute (the source of None):

set(link.get('href') for link in soup.find_all('a') if link.has_attr('href'))

You can use a set comprehension. Note that this drops every link containing '#', including ones like '/about#contact':

new_set = {link for link in lst2 if link and '#' not in link}

You could examine the HTML and use the :not pseudo-class (bs4 4.7.1+) to exclude anchors whose class marks them as navigation or pagination links, then apply a final length test on href to drop single-character values such as '#' and '/':

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = bs(r.content, 'lxml')
links = [i['href'] for i in soup.select('a[href]:not([class*="-nav-"],[class*="-pagination-"])') if len(i['href']) > 1]
print(links)
