<tr>
  <td style="color: #0000FF;text-align: center"><p>Sam<br/>John<br/></p></td>
</tr>

I am using Python's HTMLParser module to extract the values Sam and John from the HTML snippet above, but my handle_data method captures only Sam and not John.

How can I get both Sam and John?

  • Is using the HTMLParser module a requirement? Commented Aug 22, 2014 at 13:05
  • Thanks for your reply. Preferably, yes, because I have already parsed most of the HTML document and only this part remains. Commented Aug 22, 2014 at 13:07
  • Could you provide a minimal example that reproduces the issue? That would help pin down what might be wrong in your code. Commented Aug 22, 2014 at 13:26
  • Sorry, I might have gone wrong somewhere; it is working. I will get back on this. Thanks. Commented Aug 22, 2014 at 13:47

1 Answer

You can keep an instance-level boolean flag: set it to True when a p tag starts and to False when it ends. Because each <br/> splits the text, handle_data() is called once for Sam and once for John, so record the data in handle_data() whenever the flag is True:

# Python 3 import; on Python 2 use: from HTMLParser import HTMLParser
from html.parser import HTMLParser

data = """
<tr>
  <td style="color: #0000FF;text-align: center"><p>Sam<br/>John<br/></p></td>
</tr>
"""

class Parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # True while the parser is inside a <p> element
        self.recording = False

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.recording = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.recording = False

    def handle_data(self, data):
        # Called once per text chunk, so Sam and John arrive separately
        if self.recording:
            print(data)

parser = Parser()
parser.feed(data)

Prints:

Sam
John
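
If the values are needed for further processing rather than printed, the same flag approach can collect them in a list. The following is only a minimal sketch, assuming Python 3 (where the module is html.parser); the names ParagraphTextParser, extract_names, and names are made up for illustration and are not part of the library:

```python
from html.parser import HTMLParser  # Python 3 location of the module


class ParagraphTextParser(HTMLParser):
    """Hypothetical helper: collects text chunks found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.recording = False  # True while inside a <p> element
        self.names = []         # collected text chunks, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.recording = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.recording = False

    def handle_data(self, data):
        # Skip whitespace-only chunks such as newlines between tags
        text = data.strip()
        if self.recording and text:
            self.names.append(text)


def extract_names(html):
    """Hypothetical wrapper: feed the HTML and return the collected chunks."""
    parser = ParagraphTextParser()
    parser.feed(html)
    parser.close()
    return parser.names


snippet = """
<tr>
  <td style="color: #0000FF;text-align: center"><p>Sam<br/>John<br/></p></td>
</tr>
"""

print(extract_names(snippet))  # ['Sam', 'John']
```

Note that a single flag does not handle nested p tags or distinguish one paragraph from another; for this snippet that is not a concern.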