9

I have text containing HTML entities such as &lt; and &nbsp;. I need to remove all of them and get just the text content:

&nbsp;Hello there&lt;testdata&gt;

So, I need to get Hello there and testdata from this section. Is there any way of using negative lookahead to do this?

I tried the following: /((?!&.+;).)+/ig but this doesn't seem to work very well. So, how can I extract just the required text?

5 Answers

25

A better syntax to find HTML entities is the following regular expression:

/&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});/ig

This pattern ignores false entities, such as a bare & that is not followed by an entity name and a semicolon.
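As a minimal sketch of how this pattern could be applied to the question's input, here it is with Python's re module (Python is an assumption, since the question names no language). The helper names strip_entities and text_between_entities are made up for illustration, and the group is made non-capturing so that re.split does not return the entity names alongside the text.

```python
import re

# The answer's pattern, with a non-capturing group so re.split() keeps only the text.
ENTITY_RE = re.compile(r"&(?:[a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});", re.IGNORECASE)


def strip_entities(text):
    """Remove every entity-shaped token (hypothetical helper)."""
    return ENTITY_RE.sub("", text)


def text_between_entities(text):
    """Return the non-empty pieces of text found between entities (hypothetical helper)."""
    return [piece for piece in ENTITY_RE.split(text) if piece]


# Assumed form of the question's input, with the entities written out in full.
sample = "&nbsp;Hello there&lt;testdata&gt;"
print(strip_entities(sample))
print(text_between_entities(sample))
print(strip_entities("fish & chips"))
```

The last call shows the "false entity" case: a bare & followed by a space is left alone.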

3 Comments

This doesn't necessarily matter, but it's worth noting that this is technically not comprehensive. &amp, &#123, and &#x7B are all valid HTML entities that won't be matched by this.
[a-z0-9]+ matches &amp and similar forms, and #[0-9]{1,6} matches all entities from &#0 to &#999999 . I think other forms are not useful.
It matches &amp;, not &amp. Your regex requires a semicolon, but &amp is a valid HTML entity. And I didn't say anything about whether those forms of entities are useful. I only said this regex is not comprehensive. If someone needed a comprehensive regex for their use case, this would not work.
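To make the disagreement in these comments concrete, here is a small Python check (the choice of Python is an assumption). It compares the answer's pattern with the standard library's html.unescape on entities written with and without the trailing semicolon.

```python
import html
import re

# The pattern exactly as given in the answer; it requires a trailing semicolon.
ENTITY_RE = re.compile(r"&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});", re.IGNORECASE)

for candidate in ["&amp;", "&amp", "&#123;", "&#123"]:
    matched = ENTITY_RE.fullmatch(candidate) is not None
    decoded = html.unescape(candidate)
    print(repr(candidate), "regex match:", matched, "html.unescape:", repr(decoded))
```

The semicolon-less forms are skipped by the regex but are still decoded by html.unescape, which is the gap the first comment describes.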
4

Here are 2 suggestions:

1) Match all the entities using /(&.+;)/ig. Then, in whatever programming language you are using, replace those matches with an empty string. For example, in PHP use preg_replace; in C# use Regex.Replace. See this SO question for a similar solution that accounts for more cases: How to remove html special chars?

2) If you really want to do this using the plaintext portions, you could try something like this: /(?:^|;)([^&;]+)(?:&|$)/ig. What it's actually trying to do is match the pieces between ; and &, with special cases for the start and end of a string that has no entity there. This is probably not the way to go; you're likely to run into cases where it breaks.
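A quick way to see both the idea and the fragility of suggestion 2 is to run it on a couple of inputs. This is a sketch in Python (an assumption; the answer itself only mentions PHP and C#), and the second input is made up to show one way the pattern breaks: a literal semicolon in ordinary text.

```python
import re

# Suggestion 2: capture the text that sits between ';' (or start) and '&' (or end).
BETWEEN_RE = re.compile(r"(?:^|;)([^&;]+)(?:&|$)", re.IGNORECASE)

# Works when the text sits strictly between entities.
works = BETWEEN_RE.findall("&nbsp;Hello there&lt;testdata&gt;")

# Breaks: "Hello" is followed by ';' rather than '&' or the end of the string,
# so that piece is never captured.
breaks = BETWEEN_RE.findall("Hello; world")

print(works)
print(breaks)
```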

1 Comment

Thanks, tried 2. Just got back from the looney bin. I'll go with 1.
1

It's language-specific, but in Python you can use html.unescape (see the manual). For example:

import html
print(html.unescape("This string contains &amp; and &gt;"))
# prints: This string contains & and >

1

After a short look at the python documentation one can come across the html.parser module: https://docs.python.org/3/library/html.parser.html#module-html.parser

And after some short prototyping one can come up with the fairly simple code:

from html.parser import HTMLParser

line_with_html = 'Data before tag with <span style="color:var(--md-font-color-green)">some green text</span> with a nice logo'


class CleanHTML(HTMLParser):
    def reset(self) -> None:
        self.extracted_data = ""
        return super().reset()

    def remove_tags(self, html_data: str) -> str:
        """
        Args:
            html_data (str): HTML data which might contain tags.

        Returns:
            str: Data without any HTML tags. Forces feeding of any buffered data.
        """
        self.reset()
        self.feed(html_data)
        self.close()
        return self.extracted_data

    def handle_data(self, data: str) -> None:
        """
        Args:
            data (str): Html data extracted from tags to be processed.
        """
        self.extracted_data += data


p = CleanHTML()
print(p.remove_tags(line_with_html))

No need to:

  • Use regular expressions
  • Use third-party modules like BeautifulSoup
  • Use parsers which were not intended for HTML, like the XML parser
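For the question's entity-only input, a variation on the class above can return the text pieces separately. This is a sketch, assuming HTMLParser is constructed with convert_charrefs=False so that entities are reported to handle_entityref and handle_charref (ignored here) instead of being decoded into the text. The class name TextChunks is made up for illustration.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect the text found between entities and tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False keeps entities out of the data passed to handle_data().
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = TextChunks()
# Assumed form of the question's input, with the entities written out in full.
parser.feed("&nbsp;Hello there&lt;testdata&gt;")
parser.close()
print(parser.chunks)
```

With the default convert_charrefs=True, the same input would instead arrive as one decoded string that includes the non-breaking space and the angle brackets.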

0

/(&.+?;)/ig works better. You may have multiple HTML entities in your string. If so, /(&.+;)/ig will match only once, taking everything between the first & and the last ;, because + is greedy while +? is lazy.
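The difference is easy to check. Here is a short sketch in Python (an assumption, as the answer names no language), using an assumed form of the question's input with the entities written out in full:

```python
import re

sample = "&nbsp;Hello there&lt;testdata&gt;"

# Greedy: one match from the first '&' to the last ';', so all the text is removed too.
greedy = re.sub(r"(&.+;)", "", sample, flags=re.IGNORECASE)

# Lazy: each match stops at the nearest ';', so only the entities are removed.
lazy = re.sub(r"(&.+?;)", "", sample, flags=re.IGNORECASE)

print(repr(greedy))
print(repr(lazy))
```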
