I have an HTML file that consists of 400 HTML tags, and I want to extract some specific text from certain tags. The file is a local file, not an online web page. For now I am testing with a single HTML file to check and confirm the logic. In the real requirement, I will run it on a batch of more than 50 HTML files.
What I want to extract is any text that sits between these tags:
<div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">Text I wanted</div>
This tag may appear more than once in the HTML file.
I tried to extract the text from the file using this code:
import re

count = 0  # number of lines that contain the target opening tag

# file_path is the path to the local HTML file
with open(file_path, 'r', encoding="utf8") as fp:
    lines = fp.readlines()

text = '<div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">'
for line in lines:
    if line.find(text) != -1:
        count = count + 1
        # capture whatever sits between the opening tag and a closing </div>
        result = re.search('<div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">(.*)</div>', line)
        print(result.group(1))
print(count)
My problems are:
- It only identifies the first occurrence found by line.find(...), not the following similar tags.
- It can't extract exactly the text I want. Because the <div> ... </div> tags repeat throughout the input file, the match takes the whole stretch of HTML that starts with the first <div class="th-choice-list-name headerViewModeElementLoc ... and ends at some later </div> instead of the matching one.
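Both symptoms are consistent with how the pattern is written, although this is a guess from the snippet rather than a confirmed diagnosis: re.search returns only the first match in a line, and a greedy (.*) runs to the last </div> on that line. The sketch here compares the greedy pattern with a non-greedy (.*?) used through re.findall. The sample line and the names CLASS_ATTR, open_tag, greedy and matches are made up for illustration, and it assumes the wanted text contains no nested <div> and does not span several lines.

```python
import re

# Hypothetical one-line sample, shortened from the simplified input in this post.
CLASS_ATTR = "th-choice-list-name headerViewModeElementLoc th-choice-list-description-value"
line = (
    f'<div class="{CLASS_ATTR}">Text I wanted 1</div>'
    '<em id="th-17-button_submenu-btnWrap" class=""></em></div>'
    f'<div class="{CLASS_ATTR}">Text I wanted 2</div>'
    '<div class="invalidElementMessage"></div>'
)

# Escape the opening tag so it is matched literally.
open_tag = re.escape(f'<div class="{CLASS_ATTR}">')

# Greedy (.*) keeps going until the LAST </div> on the line.
greedy = re.search(open_tag + r"(.*)</div>", line).group(1)

# Non-greedy (.*?) stops at the FIRST </div> after each opening tag,
# and findall returns every match on the line, not only the first one.
matches = re.findall(open_tag + r"(.*?)</div>", line)

print(matches)
```

With this sample, greedy swallows everything from the first wanted text up to the final </div>, while matches holds one entry per target tag. Counting len(matches) per line would count tags rather than lines.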
This is a simplified version of the HTML file used as input (the text I want is in bold):
</div><div id="th-templateEditor-section17-header" class="th-section" componentid="17"><div id="th-17-button_submenu" class="x-btn button_submenu inline_div x-btn-default-small"><div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">**Text I wanted 1**</div><em id="th-17-button_submenu-btnWrap" class=""><button id="th-17-button_submenu-btnEl" type="button" hidefocus="true" role="button" autocomplete="off" title="Menu" class="x-btn-center" aria-label="Menu"><span id="th-17-button_submenu-btnInnerEl" class="x-btn-inner" style=""> </span><span id="th-17-button_submenu-btnIconEl" class="x-btn-icon x-hide-display"> </span></button></em></div><div class="th-choice-list-name headerViewModeElementLoc th-choice-list-description-value">**Text I wanted 2**</div><div id="th-templateEditor-section17-header-invalid-message" class="invalidElementMessage">
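Because regular expressions tend to get fragile on nested markup, another option is an HTML parser. What follows is a minimal sketch using only the standard library's html.parser, not a definitive implementation. ChoiceTextParser, extract_texts and extract_from_folder are hypothetical names, the sample string is a shortened copy of the simplified input with the bold markers removed, and the batch helper assumes the files end in .html and sit in one folder.

```python
from html.parser import HTMLParser
from pathlib import Path

# Classes that a <div> must carry to count as a target (taken from the post).
TARGET_CLASSES = {
    "th-choice-list-name",
    "headerViewModeElementLoc",
    "th-choice-list-description-value",
}


class ChoiceTextParser(HTMLParser):
    """Collects the text of every <div> that carries all TARGET_CLASSES."""

    def __init__(self):
        super().__init__()
        self.texts = []    # one entry per matching <div>
        self._depth = 0    # > 0 while inside a matching <div>
        self._buffer = []  # text pieces of the <div> being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1  # nested <div> inside a match
        elif TARGET_CLASSES <= set((dict(attrs).get("class") or "").split()):
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.texts.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_texts(html):
    """Return the text of every target <div> found in an HTML string."""
    parser = ChoiceTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def extract_from_folder(folder):
    """Hypothetical batch helper: map each *.html file name to its texts."""
    return {
        path.name: extract_texts(path.read_text(encoding="utf8"))
        for path in sorted(Path(folder).glob("*.html"))
    }


CLASS_ATTR = "th-choice-list-name headerViewModeElementLoc th-choice-list-description-value"
sample = (
    '</div><div id="th-templateEditor-section17-header" class="th-section" componentid="17">'
    f'<div class="{CLASS_ATTR}">Text I wanted 1</div>'
    '<em id="th-17-button_submenu-btnWrap" class=""><button type="button">'
    '<span class="x-btn-inner"> </span></button></em></div>'
    f'<div class="{CLASS_ATTR}">Text I wanted 2</div>'
    '<div id="th-templateEditor-section17-header-invalid-message" class="invalidElementMessage">'
)

print(extract_texts(sample))
```

The parser reads the whole file content at once, so it does not depend on how the HTML is split into lines, and len(extract_texts(...)) gives the number of target tags in a file.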