By default, html.parser.HTMLParser does not seem to handle self-closing tags correctly unless they are terminated with a slash. For example, it handles <img src="asdf"/> fine, but it treats <img src="asdf"> as a regular opening tag that still needs a closing tag.

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        print('|'*self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

html_content = """
<html>
<head>
<title>test</title>
</head>
<body>
<div>
    <img src="http://closed.example.com" />
    <div>1</div>
    <div>2</div>
    <img src="http://unclosed.example.com">
    <div>3</div> <!-- will be indented too far -->
    <div>4</div>
</div>
</body>
</html>
"""
parser = MyHTMLParser()
parser.feed(html_content)

Is there a way to change this behaviour so it correctly handles self-closing tags without slash, or maybe a workaround?

For context: I'm writing a script for an environment where I only have access to a pure Python interpreter and can only use the standard library. I cannot install any third-party packages.

  • No self closing tag has ever been specified or required to have a closing slash by any HTML standard since the beginning of time. While putting a slash there is allowed, it has no meaning, it does nothing and browsers ignore it. The W3C Validator now warns that, if anything, the slash can cause problems in certain instances. Commented Aug 5, 2024 at 16:04
  • @Rob Right! That makes it even weirder that this library cannot handle self-closing tags without a slash! Commented Aug 6, 2024 at 7:15

1 Answer

You can try overriding some HTMLParser internals, such as HTMLParser.check_for_whole_start_tag() (an undocumented method, so this may break between Python versions), and handle an <img> tag without an explicit closing slash accordingly:

from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0

        self._orig_handle_starttag = self.handle_starttag

    def handle_starttag(self, tag, attrs):
        print("|" * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

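    def _my_handle_starttag(self, tag, attrs):
        # One-shot handler: treat the tag as opened and immediately closed,
        # then put the regular handler back.
        self._orig_handle_starttag(tag, attrs)
        self.handle_endtag(tag)
        self.handle_starttag = self._orig_handle_starttag

    def check_for_whole_start_tag(self, i):
        # Internal hook: returns the end index of the start tag,
        # or -1 if the tag is still incomplete.
        rv = super().check_for_whole_start_tag(i)

        # Only swap the handler for a complete tag (-1 is truthy).
        if rv > 0:
            tag = self.rawdata[i:rv].lower()
            if tag.startswith("<img") and not tag.endswith("/>"):
                self.handle_starttag = self._my_handle_starttag

        return rv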


html_content = """\
<html>
<head>
<title>test</title>
</head>
<body>
<div>
    <img src="http://closed.example.com" />
    <div>1</div>
    <div>2</div>
    <img src="http://unclosed.example.com">
    <div>3</div> <!-- now indented correctly -->
    <div>4</div>
</div>
</body>
</html>"""

parser = MyHTMLParser()
parser.feed(html_content)

Prints:

html
|head
||title
|body
||div
|||img
|||div
|||div
|||img
|||div
|||div
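
If you would rather not touch internals, a possible alternative is to stay within the documented handler methods and simply not change the depth for void elements. The following is only a minimal sketch under some assumptions: the names VOID_ELEMENTS, OutlineParser and outline are made up for illustration, the element list is the set of void elements as I understand it from the HTML Living Standard, and the lines are collected in a list instead of printed so they are easier to check.

```python
from html.parser import HTMLParser

# Assumption: the void elements listed in the HTML Living Standard.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})


class OutlineParser(HTMLParser):
    """Collects an indented outline of start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("|" * self.depth + tag)
        # Void elements never get an end tag, so they must not add a level.
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        # The default handle_startendtag() calls handle_starttag() and then
        # handle_endtag(), so "<img />" also ends up here; ignore void tags.
        if tag not in VOID_ELEMENTS:
            self.depth -= 1


def outline(html):
    """Parse html and return the indented list of start tags."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    sample = '<div><img src="a"><p>x</p><img src="b" /><p>y</p></div>'
    print("\n".join(outline(sample)))
```

Because this covers every void element by name, tags such as <br> or <input> are handled the same way as <img>, with or without the trailing slash. The trade-off is that the list has to be maintained by hand.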