
I am making a webpage 'crawler' that parses a webpage and then searches it for a word or set of words. My problem is this: the data I am looking for is contained in the parsed webpage (I tested with a specific word that I know is there), yet the program reports that the data has not been found.

from html.parser import HTMLParser
from urllib import *

class dataFinder(HTMLParser):
    def open_webpage(self):
        import urllib.request
        request = urllib.request.Request('https://www.summet.com/dmsi/html/readingTheWeb.html')#Insert Webpage
        response = urllib.request .urlopen(request)
        web_page = response.read()
        self.webpage_text = web_page.decode()
        return self.webpage_text


    def handle_data(self, data):
        wordtofind = 'PaperBackSwap.com'
        if data == wordtofind:
            print('Match found:',data)
        else:
            print('No matches found')



p = dataFinder()
print(p.open_webpage())
p.handle_data(p.webpage_text)

When I run the program without the open_webpage function, passing the page to the parser's feed method instead, it works and finds the data. With the code above, however, it does not.

Any help in solving this problem is appreciated.

2
  • What exactly is it that you are aiming to extract from the website? Links from href tags? Commented Aug 14, 2017 at 10:35
  • I am just trying to find text from within the page, whether it be in href tags or in p tags Commented Aug 14, 2017 at 10:55

3 Answers

1

You are comparing the whole HTML page with a single word using ==, and of course they are not equal, so you get 'No matches found'. To search for a string inside another string you can use the str.find() method. It returns the index of the first occurrence of the text, or -1 if the text is not found.

Correct code:

from html.parser import HTMLParser
import urllib.request

class dataFinder(HTMLParser):
    def open_webpage(self):
        # Download the page and keep its decoded text on the instance
        request = urllib.request.Request('https://www.summet.com/dmsi/html/readingTheWeb.html')#Insert Webpage
        response = urllib.request.urlopen(request)
        web_page = response.read()
        self.webpage_text = web_page.decode()
        return self.webpage_text

    def handle_data(self, data):
        wordtofind = 'PaperBackSwap.com'
        if data.find(wordtofind) != -1:
            print('Match found position:', data.find(wordtofind))
        else:
            print('No matches found')

p = dataFinder()
print(p.open_webpage())
p.handle_data(p.webpage_text)
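
As a side note on why the original comparison worked with feed: HTMLParser.feed() calls handle_data once for each run of text between tags, so a single text node can be exactly equal to the word, whereas the whole page never is. The following is only a minimal sketch of that approach. The class name DataFinder, the matches list, and the inline sample_html string (standing in for the downloaded page) are assumptions made for illustration.

```python
from html.parser import HTMLParser

class DataFinder(HTMLParser):
    """Collects the text nodes that contain a target word (illustrative helper)."""

    def __init__(self, word_to_find):
        super().__init__()
        self.word_to_find = word_to_find
        self.matches = []

    def handle_data(self, data):
        # feed() calls this once per run of text between tags
        if self.word_to_find in data:
            self.matches.append(data.strip())

# Small inline document standing in for the downloaded page (assumption)
sample_html = '<p>Visit <a href="http://example.com">PaperBackSwap.com</a> today.</p>'

finder = DataFinder('PaperBackSwap.com')
finder.feed(sample_html)
finder.close()
print(finder.matches)
```

In the real program, the string returned by open_webpage() would be passed to feed() in place of sample_html.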

2 Comments

This does work, and I must thank you for introducing me to it. I am quite new to programming and have not had a chance to explore the documentation very thoroughly; if anyone can point me to where this is documented, I would be very appreciative. Also, you said that it returns the first position found. Is there any way of getting it to return all the positions of the word?
@S0lo You can use this function to get all positions of the substring: code.activestate.com/recipes/… Use it like this: allindices(data, wordtofind)
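
For reference, the same idea can be sketched with str.find() alone by restarting the search just past each hit. This is not the linked recipe, only a minimal illustration; the function name all_positions and the sample string are hypothetical.

```python
def all_positions(text, word):
    """Return the start index of every occurrence of word in text,
    including overlapping ones."""
    positions = []
    start = text.find(word)
    while start != -1:
        positions.append(start)
        # Resume the search one character after the previous hit
        start = text.find(word, start + 1)
    return positions

sample = 'PaperBackSwap.com and again PaperBackSwap.com'
print(all_positions(sample, 'PaperBackSwap.com'))
```

Searching from start + 1 rather than start + len(word) means overlapping occurrences are reported too; use the latter if only non-overlapping matches are wanted.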
0

I am able to parse HTML content and find text in it with BeautifulSoup; please see whether it works for you. Below is sample code for your case.

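import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

wordtofind = 'PaperBackSwap.com'
web_page = urlopen('https://www.summet.com/dmsi/html/readingTheWeb.html').read()

soup = BeautifulSoup(web_page, 'html.parser')
# Collect every text node that contains the word
matches = soup.find_all(string=re.compile(re.escape(wordtofind)))
if matches:
    for s in matches:
        print('Match found:', s.strip())
else:
    print('No matches found')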


0

Late to the party, but I would strongly advise using the requests module for HTTP interactions. It will make your life a lot easier.

import requests
from html.parser import HTMLParser

class dataFinder(HTMLParser):
    def open_webpage(self):
        request = requests.get('https://www.summet.com/dmsi/html/readingTheWeb.html')
        self.webpage_text = request.text
        return self.webpage_text

