Extracting Text / Parse Text with html.parser (Python)

Question

I want to extract text from an HTML file, specifically from the <p> and <h1> tags. I saw this example in the Python documentation:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

But I'm not sure where to go from here to extract only the text within certain tags (<p> and <h1>). Any hints or advice in the right direction are welcome! (I do not want to use Beautiful Soup or any other external libraries.)

Andrej Kesely · Accepted Answer · 2020-11-05 11:12:49Z

You can use this example to parse the text from <h1> and <p> tags:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):

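    def __init__(self):
        super().__init__()
        self.data = []        # text collected from <p> and <h1>
        self.capture = False  # True while inside a <p> or <h1>

    def handle_starttag(self, tag, attrs):
        # Start capturing when a wanted tag opens.
        if tag in ('p', 'h1'):
            self.capture = True

    def handle_endtag(self, tag):
        # Stop capturing when a wanted tag closes.
        if tag in ('p', 'h1'):
            self.capture = False

    def handle_data(self, data):
        # Keep text only while inside a wanted tag.
        if self.capture:
            self.data.append(data)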

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1><p>This is P tag</p></body></html>')

print(parser.data)

Prints:

['Parse me!', 'This is P tag']
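
If a <p> contains inline markup such as <b>, handle_data is called once per text fragment, so the parser above stores each fragment as a separate list entry. A possible variation, sketched here under assumptions, joins the fragments and records which tag they came from. The names TagTextParser, wanted, and results are made up for this illustration and are not part of html.parser. The sketch does not handle a wanted tag nested inside another wanted tag.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text of each wanted element as a single string."""

    def __init__(self, wanted=("p", "h1")):
        super().__init__()
        self.wanted = wanted  # hypothetical parameter: tags to extract
        self.results = []     # list of (tag, text) pairs
        self._tag = None      # wanted tag currently open, if any
        self._chunks = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        # Only start a new capture when no wanted tag is already open.
        if self._tag is None and tag in self.wanted:
            self._tag = tag
            self._chunks = []

    def handle_endtag(self, tag):
        # Close the capture only on the tag that opened it, so inline
        # tags such as </b> inside a <p> do not end it early.
        if tag == self._tag:
            self.results.append((tag, "".join(self._chunks).strip()))
            self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)


parser = TagTextParser()
parser.feed("<body><h1>Parse me!</h1><p>This is <b>bold</b> text</p></body>")
parser.close()
print(parser.results)
# [('h1', 'Parse me!'), ('p', 'This is bold text')]
```

To parse a file instead of a string literal, read its contents into a string and pass that string to feed().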

