I have a local HTML file (not an online web page) that contains about 400 HTML tags, and I want to extract specific text from certain tags. For now I am testing with a single HTML file to confirm the logic. In the real requirement, I will run it over a batch of more than 50 HTML files.

What I want to extract is any text that sits between these tags:

<div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">Text I wanted</div>

This tag may appear more than once in the HTML file.

I tried to extract the text from the file using this code:

import re

count = 0  # number of lines that contain the target tag
text = '<div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">'

# file_path is the path to the local HTML file
with open(file_path, 'r', encoding="utf8") as fp:
    for line in fp:
        if line.find(text) != -1:
            count = count + 1
            # (.*) is greedy, so it runs to the last </div> on the line
            result = re.search(text + '(.*)</div>', line)
            print(result.group(1))
            print(count)

My problems are:

  1. line.find(...) only identifies the first occurrence of the tag on a line, not the later occurrences of the same tag.
  2. It can't extract the exact text I want. Because the closing </div> tag is repeated many times in the input file, the match covers the whole stretch of HTML that starts with the first <div class="th-choice-list-name headerViewModeElementLoc ... and ends with the last </div> on the line.
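
Both problems can be reproduced on a single line of input. The sketch below uses a made-up one-line sample (the names OPEN_TAG and line are illustrative, not from the original file) to show that re.search returns only the first match and that a greedy (.*) runs to the last </div>, while re.findall with a non-greedy (.*?) returns each occurrence separately. It assumes the target <div> never contains a nested <div>. For real HTML, a parser is the more robust choice.

```python
import re

# Opening tag of the <div> whose text is wanted.
OPEN_TAG = (
    '<div class="th-choice-list-name headerViewModeElementLoc '
    'th-choice-list-description-value">'
)

# Made-up sample: two target <div>s on one line with other tags between them.
line = (
    OPEN_TAG + 'Text 1</div><em>x</em>'
    + OPEN_TAG + 'Text 2</div><div>other</div>'
)

# Greedy search: one match only, and (.*) extends to the last </div>.
greedy = re.search(re.escape(OPEN_TAG) + '(.*)</div>', line).group(1)

# Non-greedy findall: every occurrence, each ending at the nearest </div>.
lazy = re.findall(re.escape(OPEN_TAG) + '(.*?)</div>', line)

print(greedy)
print(lazy)
```

The greedy result swallows the second opening tag and everything up to the final </div>, which matches the behaviour described in problem 2.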

This is a simplified version of the HTML file used as input (the text I want is in bold):

  </div><div id="th-templateEditor-section17-header" class="th-section" componentid="17"><div id="th-17-button_submenu" class="x-btn button_submenu inline_div x-btn-default-small"><div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">**Text I wanted 1**</div><em id="th-17-button_submenu-btnWrap" class=""><button id="th-17-button_submenu-btnEl" type="button" hidefocus="true" role="button" autocomplete="off" title="Menu" class="x-btn-center" aria-label="Menu"><span id="th-17-button_submenu-btnInnerEl" class="x-btn-inner" style="">&nbsp;</span><span id="th-17-button_submenu-btnIconEl" class="x-btn-icon  x-hide-display">&nbsp;</span></button></em></div><div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">**Text I wanted 2**</div><div id="th-templateEditor-section17-header-invalid-message" class="invalidElementMessage">
  • Don't use regular expressions to process HTML, use an HTML parser like Beautiful Soup. Commented Aug 3, 2023 at 3:42

1 Answer

As suggested by @Barmar, I think you should use a third-party HTML parser such as BeautifulSoup: parse the file, then use find_all() to find the tags that match the given criteria. This makes parsing and searching simpler than using a regex.

from bs4 import BeautifulSoup

html = '''
<div id="th-templateEditor-section17-header" class="th-section" componentid="17"><div id="th-17-button_submenu" class="x-btn button_submenu inline_div x-btn-default-small"><div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">**Text I wanted 1**</div><em id="th-17-button_submenu-btnWrap" class=""><button id="th-17-button_submenu-btnEl" type="button" hidefocus="true" role="button" autocomplete="off" title="Menu" class="x-btn-center" aria-label="Menu"><span id="th-17-button_submenu-btnInnerEl" class="x-btn-inner" style="">&nbsp;</span><span id="th-17-button_submenu-btnIconEl" class="x-btn-icon  x-hide-display">&nbsp;</span></button></em></div><div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">**Text I wanted 2**</div><div id="th-templateEditor-section17-header-invalid-message" class="invalidElementMessage"></div>
'''

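# The three classes that identify the target <div>.
search_classes = 'th-choice-list-name headerViewModeElementLoc th-choice-list-description-value'.split(' ')

parsed_html = BeautifulSoup(html, "html.parser")
# Passing a list matches any <div> that has at least one of these classes.
divs = parsed_html.find_all('div', {'class': search_classes})

for div in divs:
    print(div.text)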
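
Passing a list of classes to find_all() matches a <div> that carries any one of them, which may pick up extra elements in a larger file. The sketch below is one possible variation, not part of the original answer: it uses a CSS selector so that all three classes must be present, and loops over a folder to cover the batch of 50+ files mentioned in the question. The helper names extract_texts and extract_from_folder, and the assumption that the files sit in one folder with an .html extension, are hypothetical.

```python
from pathlib import Path

from bs4 import BeautifulSoup

# CSS selector that requires all three classes on the same <div>.
SELECTOR = (
    "div.th-choice-list-name"
    ".headerViewModeElementLoc"
    ".th-choice-list-description-value"
)


def extract_texts(html):
    """Return the text of every <div> that carries all three classes."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True) for div in soup.select(SELECTOR)]


def extract_from_folder(folder):
    """Map each .html file name in `folder` to its list of extracted texts."""
    results = {}
    for path in sorted(Path(folder).glob("*.html")):
        results[path.name] = extract_texts(path.read_text(encoding="utf8"))
    return results
```

Calling extract_from_folder("html_files"), where "html_files" is a placeholder folder name, would return a dictionary keyed by file name, so the results for each file stay separate.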