Python HTML parser data not being found

Question

I am making a webpage 'crawler' that parses a webpage and then searches it for a word or set of words. My problem is that the data I am looking for is contained in the parsed webpage (I tested this with a specific word), yet the program reports that the data has not been found.

from html.parser import HTMLParser
from urllib import *

class dataFinder(HTMLParser):
    def open_webpage(self):
        import urllib.request
        request = urllib.request.Request('https://www.summet.com/dmsi/html/readingTheWeb.html')#Insert Webpage
        response = urllib.request .urlopen(request)
        web_page = response.read()
        self.webpage_text = web_page.decode()
        return self.webpage_text


    def handle_data(self, data):
        wordtofind = 'PaperBackSwap.com'
        if data == wordtofind:
            print('Match found:',data)
        else:
            print('No matches found')



p = dataFinder()
print(p.open_webpage())
p.handle_data(p.webpage_text)

When I run the program without the open_webpage function, using the feed method instead, it works and finds the data. With the code above, however, it does not.

Any help in solving this problem is appreciated.

What exactly are you aiming to extract from the website? Links from href tags?
– Luke, Commented Aug 14, 2017 at 10:35
I am just trying to find text within the page, whether it is in href tags or in p tags.
– S0lo, Commented Aug 14, 2017 at 10:55

t.m.adam · Accepted Answer · 2017-08-14 12:52:50Z

1

You are comparing the whole HTML page to a single string with ==, and since the two are not equal you get 'No matches found'. To find a string inside another string, use the str.find() method. It returns the index of the first occurrence of the text, or -1 if the text is not found.
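As a quick illustration of the difference, here is a minimal sketch that uses a short stand-in string instead of the downloaded page. The variable names and the sample markup are assumptions made for this example only.

```python
page = '<p>Visit PaperBackSwap.com today</p>'  # stand-in for the downloaded HTML
word = 'PaperBackSwap.com'

exact_match = (page == word)  # False: the page is much longer than the word
position = page.find(word)    # index of the first occurrence, or -1 if absent
contains = word in page       # plain True/False when the position is not needed

print(exact_match, position, contains)  # prints: False 9 True
```

If only a yes/no answer is needed, the `in` operator is usually the simpler choice; `str.find()` is useful when the position matters.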

Corrected code:

from html.parser import HTMLParser
import urllib.request

class dataFinder(HTMLParser):
    def open_webpage(self):
        request = urllib.request.Request('https://www.summet.com/dmsi/html/readingTheWeb.html')  # Insert webpage
        response = urllib.request.urlopen(request)
        web_page = response.read()
        self.webpage_text = web_page.decode()
        return self.webpage_text

    def handle_data(self, data):
        wordtofind = 'PaperBackSwap.com'
        # str.find() returns the index of the first match, or -1 if there is none
        position = data.find(wordtofind)
        if position != -1:
            print('Match found position:', position)
        else:
            print('No matches found')

p = dataFinder()
print(p.open_webpage())
p.handle_data(p.webpage_text)
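The code above calls handle_data directly on the raw HTML, so the search also sees tag names and attributes. An alternative, closer to how HTMLParser is normally used, is to pass the page to feed() and let the parser call handle_data for each chunk of text between tags. The sketch below assumes that approach; the class name WordFinder, its matches list, and the inline sample HTML are hypothetical and stand in for the real page so the example runs without a network connection.

```python
from html.parser import HTMLParser

class WordFinder(HTMLParser):
    """Collects text chunks that contain the search word (hypothetical helper)."""

    def __init__(self, word):
        super().__init__()
        self.word = word
        self.matches = []

    def handle_data(self, data):
        # feed() calls this for each chunk of text found between tags
        if self.word in data:
            self.matches.append(data.strip())

# Inline sample standing in for the downloaded page
sample_html = '<html><body><p>Try PaperBackSwap.com</p><a href="#">Other</a></body></html>'

finder = WordFinder('PaperBackSwap.com')
finder.feed(sample_html)
finder.close()
print(finder.matches or 'No matches found')
```

Note that handle_data only receives text content. Values inside attributes such as href would have to be checked in handle_starttag instead.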

edited Aug 14, 2017 at 12:52

t.m.adam


answered Aug 14, 2017 at 10:25

Mentos


2 Comments

S0lo Over a year ago

This does work, and I must thank you for introducing me to it. I am quite new to programming and have not had a chance to explore the documentation very thoroughly, so if anyone can point me to where this is covered in the documentation I would be very appreciative. Also, you said that it returns the first position found. Is there any way of getting it to return all the positions of the word?

Mentos Over a year ago

@S0lo you can use this function - code.activestate.com/recipes/… for getting all positions of the substring. You can use it like this: allindices(data, wordtofind)
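For reference, one way to collect every position with nothing but str.find() is sketched below. This is not the linked recipe, whose contents are not reproduced here; the function name find_all_positions is a hypothetical helper written for this example.

```python
def find_all_positions(text, word):
    """Return the start index of every occurrence of word in text (hypothetical helper)."""
    positions = []
    start = text.find(word)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping occurrences are also found
        start = text.find(word, start + 1)
    return positions

print(find_all_positions('spam and spam and spam', 'spam'))  # prints: [0, 9, 18]
```

In the answer's code this could be called as find_all_positions(data, wordtofind) inside handle_data.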

SeJaPy · Accepted Answer · 2017-08-14 10:30:06Z

0

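I am able to parse HTML content and find text in it with BeautifulSoup; please see whether it works for you. Below is sample code for your case.

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

wordtofind = 'PaperBackSwap.com'
web_page = urlopen('https://www.summet.com/dmsi/html/readingTheWeb.html').read()

soup = BeautifulSoup(web_page, 'html.parser')
# string= matches text nodes; passing the word alone would search for tags with that name
matches = soup.find_all(string=re.compile(re.escape(wordtofind)))
for text in matches:
    print('Match found:', text.strip())
if not matches:
    print('No matches found')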

answered Aug 14, 2017 at 10:30

SeJaPy


Luke · Accepted Answer · 2017-08-14 14:07:11Z

0

Late to the party, but I would strongly advise using the requests module for HTTP interactions. It will make your life a lot easier.

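import requests
from html.parser import HTMLParser

class dataFinder(HTMLParser):
    def open_webpage(self):
        # requests.get() returns a Response; .text is the body decoded to str
        response = requests.get('https://www.summet.com/dmsi/html/readingTheWeb.html')
        response.raise_for_status()  # raise an exception on HTTP errors such as 404
        self.webpage_text = response.text
        return self.webpage_text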

answered Aug 14, 2017 at 14:07

Luke
