0

How do I convert this text to being readable (removing all the </mtext> i.e. I already tried using html2text, but it only removed the < p >, and I need everything removed.'

I want it like on https://templates.mailchimp.com/resources/html-to-text/ not like on https://www.textfixer.com/html/html-to-text.php <p>Du kan g\u00f8re det s\u00e5dan her:<\/p><p><math><mrow><munder><mrow><munder><mrow><mtable><mtr><mtd><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mover><mrow><mtext> <\/mtext><mn>4297<\/mn><\/mrow><mrow><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mn>1<\/mn><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mn>1<\/mn><mo>\u2062<\/mo><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><\/mrow><\/mover><\/mtd><\/mtr><mtr><mtd><munder><mrow><mtable><mtr><mtd><mo>+<\/mo><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mn>1425<\/mn><\/mtd><\/mtr><\/mtable><\/mrow><mo>\u0332<\/mo><\/munder><\/mtd><\/mtr><mtr><mtd><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mn>5722<\/mn><\/mtd><\/mtr><\/mtable><\/mrow><mo>\u0332<\/mo><\/munder><\/mrow><mo>\u0332<\/mo><\/munder><\/mrow><\/math><\/p>

1

2 Answers 2

1

You can do this using BeautifulSoup.

from bs4 import BeautifulSoup

html = "<p>Du kan g\u00f8re det s\u00e5dan her:<\/p><p><math><mrow><munder><mrow><munder><mrow><mtable><mtr><mtd><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mover><mrow><mtext> <\/mtext><mn>4297<\/mn><\/mrow><mrow><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mn>1<\/mn><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mn>1<\/mn><mo>\u2062<\/mo><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><\/mrow><\/mover><\/mtd><\/mtr><mtr><mtd><munder><mrow><mtable><mtr><mtd><mo>+<\/mo><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mn>1425<\/mn><\/mtd><\/mtr><\/mtable><\/mrow><mo>\u0332<\/mo><\/munder><\/mtd><\/mtr><mtr><mtd><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mn>5722<\/mn><\/mtd><\/mtr><\/mtable><\/mrow><mo>\u0332<\/mo><\/munder><\/mrow><mo>\u0332<\/mo><\/munder><\/mrow><\/math><\/p>"
soup = BeautifulSoup(html)

# remove the script and style elements
for script in soup(["script", "style"]):
    script.extract()
    
# extract the text
text = soup.get_text()

print(text)
Sign up to request clarification or add additional context in comments.

3 Comments

It ended up giving me Du kan f.eks. g\u00f8re s\u00e5dan her:<\p> It completely skipped important numbers like these (bold doesn't work for some reason, so between ** and **, that's supposed to be bold): <\/mtext><mn>**4297**<\/mn><\/mrow><mrow><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mn>**1**<\/mn><mtext> <\/mtext><mtext> That's when I used soup.p.get_text().replace("\n", "") because I wan't to keep the whitespaces
my current code: soup = BeautifulSoup(questions[0]) for script in soup(["script", "style"]): script.decompose() ques = soup.p.get_text().replace("\n", "") questions[0] just equals the string you had above
Hmm, not sure I understand. What's your desired output? It gave me Du kan gøre det sådan her: 4297 1 1⁢ + 1425̲ 5722̲̲, which seems to preserve the spaces okay?
0

I don't know if there's anything you want here.

from simplified_scrapy import SimplifiedDoc,utils
html = '''
<p>Du kan g\u00f8re det s\u00e5dan her:<\/p><p><math><mrow><munder><mrow><munder><mrow><mtable><mtr><mtd><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mover><mrow><mtext> <\/mtext><mn>4297<\/mn><\/mrow><mrow><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mn>1<\/mn><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mn>1<\/mn><mo>\u2062<\/mo><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><\/mrow><\/mover><\/mtd><\/mtr><mtr><mtd><munder><mrow><mtable><mtr><mtd><mo>+<\/mo><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mn>1425<\/mn><\/mtd><\/mtr><\/mtable><\/mrow><mo>\u0332<\/mo><\/munder><\/mtd><\/mtr><mtr><mtd><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mtext> <\/mtext><mn>5722<\/mn><\/mtd><\/mtr><\/mtable><\/mrow><mo>\u0332<\/mo><\/munder><\/mrow><mo>\u0332<\/mo><\/munder><\/mrow><\/math><\/p>
'''
doc = SimplifiedDoc(html)
print (doc.text)
print (doc.removeHtml(html))
print (doc.replaceReg(html,'<[^>]*>','').strip())
print (doc.replaceReg(doc.replaceReg(html,'<[^>]*>',''),'[ ]+',' ').strip()

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.