How to handle self-closing tags without end-slash in html.parser.HTMLParser

Question

By default, html.parser.HTMLParser does not seem to handle self-closing tags correctly when they are not terminated with a slash. For example, it handles <img src="asdf"/> fine, but it treats <img src="asdf"> as a regular opening tag that still expects a closing tag.

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        print('|'*self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

html_content = """
<html>
<head>
<title>test</title>
</head>
<body>
<div>
    <img src="http://closed.example.com" />
    <div>1</div>
    <div>2</div>
    <img src="http://unclosed.example.com">
    <div>3</div> <!-- will be indented too far -->
    <div>4</div>
</div>
</body>
</html>
"""
parser = MyHTMLParser()
parser.feed(html_content)

Is there a way to change this behaviour so that it correctly handles self-closing tags without a slash, or is there a workaround?

For context: I'm writing a script for an environment where I only have access to a pure Python interpreter and can use only built-in libraries.

No self-closing tag has ever been specified or required to have a closing slash by any HTML standard since the beginning of time. While putting a slash there is allowed, it has no meaning, it does nothing and browsers ignore it. The W3C Validator now warns that, if anything, the slash can cause problems in certain instances.
– Rob, Commented Aug 5, 2024 at 16:04
@Rob Right! That makes it even weirder that this library cannot handle self-closing tags without a slash!
– flawr, Commented Aug 6, 2024 at 7:15

Andrej Kesely · Accepted Answer · 2024-08-04 23:09:08Z

You can try overriding some HTMLParser internals, such as the undocumented HTMLParser.check_for_whole_start_tag(), and handle an <img> tag that lacks the closing slash accordingly:

from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0

        # Keep a reference to the regular handler so it can be restored later.
        self._orig_handle_starttag = self.handle_starttag

    def handle_starttag(self, tag, attrs):
        print("|" * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

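    def _my_handle_starttag(self, tag, attrs):
        # One-shot handler for an <img> without "/>": open the tag, close it
        # immediately, then restore the regular handler.
        self._orig_handle_starttag(tag, attrs)
        self.handle_endtag(tag)
        self.handle_starttag = self._orig_handle_starttag

    def check_for_whole_start_tag(self, i):
        # Internal HTMLParser method: returns the end index of the start tag
        # beginning at rawdata[i], or -1 if the tag is still incomplete.
        rv = super().check_for_whole_start_tag(i)

        if rv > 0:  # skip incomplete tags (-1 would be truthy)
            tag = self.rawdata[i:rv].lower()
            if tag.startswith("<img") and not tag.endswith("/>"):
                # The next handle_starttag() call will also close the tag.
                self.handle_starttag = self._my_handle_starttag

        return rv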


html_content = """\
<html>
<head>
<title>test</title>
</head>
<body>
<div>
    <img src="http://closed.example.com" />
    <div>1</div>
    <div>2</div>
    <img src="http://unclosed.example.com">
    <div>3</div> <!-- will be indented too far -->
    <div>4</div>
</div>
</body>
</html>"""

parser = MyHTMLParser()
parser.feed(html_content)

Prints:

html
|head
||title
|body
||div
|||img
|||div
|||div
|||img
|||div
|||div
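
If you would rather not depend on parser internals, a possible alternative is to keep a set of void element names and simply skip the depth change for them in handle_starttag() and handle_endtag(). This is only a sketch, not something from the html.parser docs: the class name DepthParser, the lines attribute and the VOID_ELEMENTS set are names I made up for illustration, and the set is my reading of the void elements listed in the HTML Living Standard. It handles every void tag rather than just <img>, and it works for both <img ...> and <img ... /> because the default handle_startendtag() just calls the two handlers in turn.

```python
from html.parser import HTMLParser

# Assumption: void elements as listed in the HTML Living Standard.
# These never have an end tag, with or without a trailing slash.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class DepthParser(HTMLParser):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []  # collected output, one entry per start tag

    def handle_starttag(self, tag, attrs):
        self.lines.append("|" * self.depth + tag)
        # Only non-void elements open a new nesting level.
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        # Also reached for "<img ... />" via the default handle_startendtag();
        # void elements never changed the depth, so leave it alone.
        if tag not in VOID_ELEMENTS:
            self.depth -= 1


sample = '<div><img src="a" /><div>1</div><img src="b"><br><div>2</div></div>'
parser = DepthParser()
parser.feed(sample)
parser.close()
print("\n".join(parser.lines))
```

With this approach the slash and no-slash forms end up at the same depth, and the depth returns to zero once the outer div is closed.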
