Remove html entities and extract text content using regex

Question

I have text containing HTML entities such as &lt; and &nbsp;. I need to remove all of them and get just the text content:

&nbspHello there&lt;testdata&gt;

So, I need to get Hello there and testdata from this section. Is there any way of using negative lookahead to do this?

I tried the following: /((?!&.+;).)+/ig but this doesn't seem to work very well. So, how can I extract just the required text from there?

Kevin Doyon · Accepted Answer · 2021-03-18 22:30:48Z

25

A better syntax to find HTML entities is the following regular expression:

/&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});/ig

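This pattern ignores false entities: an & only counts when it is followed by a name, a decimal reference, or a hexadecimal reference, and then a semicolon.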
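For illustration, here is a minimal Python sketch of how this pattern could be applied to the sample from the question. The names ENTITY_RE and extract_text are hypothetical, the sample input assumes every entity ends with a semicolon, and the group is made non-capturing so that re.split does not return the entity names alongside the text.

```python
import re

# Same pattern as in the answer; the group is non-capturing
# (an adjustment made here so re.split returns only the text pieces).
ENTITY_RE = re.compile(
    r"&(?:[a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});", re.IGNORECASE
)


def extract_text(s: str) -> list[str]:
    """Split on entities and keep the non-empty pieces (hypothetical helper)."""
    return [piece for piece in ENTITY_RE.split(s) if piece]


print(extract_text("&nbsp;Hello there&lt;testdata&gt;"))  # ['Hello there', 'testdata']

# A bare ampersand that never forms an entity is left alone.
print(ENTITY_RE.sub("", "Tom & Jerry; AT&T"))  # Tom & Jerry; AT&T
```

As the comments on this answer point out, entities written without a trailing semicolon are not matched by this pattern.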

edited Mar 18, 2021 at 22:30

Kevin Doyon


answered Jun 7, 2019 at 8:39

Mahoor13


3 Comments

Grant Gryczan Over a year ago

This doesn't necessarily matter, but it's worth noting that this is technically not comprehensive. &amp, &#123, and &#000000000000000000123; are all valid HTML entities that won't be matched by this.

Mahoor13 Over a year ago

[a-z0-9]+ matches &amp and similar forms, and #[0-9]{1,6} matches all entities from &#0 to &#999999. I think other forms are not useful.

Grant Gryczan Over a year ago

It matches &amp;, not &amp. Your regex requires a semicolon, but &amp is a valid HTML entity. And I didn't say anything about whether those forms of entities are useful. I only said this regex is not comprehensive. If someone needed a comprehensive regex for their use case, this would not work.

Community · Accepted Answer · 2017-05-23 11:57:42Z

4

Here are 2 suggestions:

1) Match all the entities using /(&.+;)/ig. Then, in whatever programming language you are using, replace those matches with an empty string. For example, in PHP use preg_replace; in C# use Regex.Replace. See this SO question for a similar solution that accounts for more cases: How to remove html special chars?

2) If you really want to do this by matching the plain-text portions, you could try something like this: /(?:^|;)([^&;]+)(?:&|$)/ig. What it's actually trying to do is match the pieces between ; and &, with special cases for text at the start and end that has no entity next to it. This is probably not the way to go; you're likely to run into cases where it breaks.
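As a rough illustration of suggestion 2, the sketch below runs that pattern through Python's re.findall. The variable names are hypothetical, and the first input assumes that every entity is terminated by a semicolon. The second input is the string exactly as written in the question, where &nbsp has no semicolon, and it shows one way the approach breaks.

```python
import re

# Pattern from suggestion 2: capture the text between a ';' (or the start)
# and the next '&' (or the end).
PLAIN_TEXT_RE = re.compile(r"(?:^|;)([^&;]+)(?:&|$)", re.IGNORECASE)

# Assumed well-formed input: every entity ends with ';'.
well_formed = "&nbsp;Hello there&lt;testdata&gt;"
print(PLAIN_TEXT_RE.findall(well_formed))  # ['Hello there', 'testdata']

# Input as written in the question: "&nbsp" lacks its ';',
# so "Hello there" is never preceded by ';' and is lost.
missing_semicolon = "&nbspHello there&lt;testdata&gt;"
print(PLAIN_TEXT_RE.findall(missing_semicolon))  # ['testdata']
```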

edited May 23, 2017 at 11:57

CommunityBot


answered Sep 30, 2014 at 19:39

dtyler


1 Comment

Mkl Rjv Over a year ago

Thanks, tried 2-Just got back from the looney bin. I'll go with 1.

gneusch · Accepted Answer · 2020-10-14 16:31:18Z

1

It's language-specific, but in Python you can use html.unescape (see the manual). Like:

import html
print(html.unescape("This string contains &amp; and &gt;"))
# prints: This string contains & and >
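Applied to the sample from the question (assuming semicolon-terminated entities), a small sketch shows that html.unescape decodes the entities into the characters they stand for instead of removing them. If the entities should be dropped entirely, a regex substitution is still needed.

```python
import html

# Sample from the question, assuming "&nbsp;" carries its semicolon.
decoded = html.unescape("&nbsp;Hello there&lt;testdata&gt;")
print(repr(decoded))  # '\xa0Hello there<testdata>'

# Legacy references without a trailing semicolon are decoded as well.
print(html.unescape("&amp"))  # &
```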

answered Oct 14, 2020 at 16:31

gneusch


Eugen_R · Accepted Answer · 2023-07-13 13:34:36Z

After a short look at the Python documentation, one can come across the html.parser module: https://docs.python.org/3/library/html.parser.html#module-html.parser

After some short prototyping, one can come up with this fairly simple code:

from html.parser import HTMLParser

line_with_html = 'Data before tag with <span style="color:var(--md-font-color-green)">some green text</span> with a nice logo'


class CleanHTML(HTMLParser):
    def reset(self) -> None:
        self.extracted_data = ""
        return super().reset()

    def remove_tags(self, html_data: str) -> str:
        """
        Args:
            html_data (str): HTML data which might contain tags.

        Returns:
            str: Data without any HTML tags. Forces feeding of any buffered data.
        """
        self.reset()
        self.feed(html_data)
        self.close()
        return self.extracted_data

    def handle_data(self, data: str) -> None:
        """
        Args:
            data (str): Html data extracted from tags to be processed.
        """
        self.extracted_data += data


p = CleanHTML()
print(p.remove_tags(line_with_html))

No need to:

Use regular expression
Use third-party modules like BeautifulSoup
Use parsers which were not intended for HTML, like the XML parser

Sergio Leite · Accepted Answer · 2024-12-10 18:09:31Z

0

/(&.+?;)/ig works better. You may have multiple HTML entities in your string. If so, /(&.+;)/ig will match only once, taking everything between the first & and the last ; because + is greedy, while +? is lazy and stops at the first ; it finds.
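The difference can be checked with a short Python sketch. The variable names are hypothetical, and the sample input assumes semicolon-terminated entities.

```python
import re

text = "&nbsp;Hello there&lt;testdata&gt;"

greedy = re.compile(r"&.+;")   # '+' grabs as much as possible
lazy = re.compile(r"&.+?;")    # '+?' stops at the first ';'

# Greedy: one match from the first '&' to the last ';', so nothing is left.
print(repr(greedy.sub("", text)))  # ''

# Lazy: each entity is matched on its own, so the text survives.
print(lazy.sub("", text))  # Hello theretestdata

# Splitting on the lazy pattern keeps the two pieces separate.
pieces = [p for p in lazy.split(text) if p]
print(pieces)  # ['Hello there', 'testdata']
```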

answered Dec 10, 2024 at 18:09

Sergio Leite
