PowerShell remove HTML tags in string content

Question

I have a large HTML data string separated into small chunks. I am trying to write a PowerShell script to remove all the HTML tags, but am finding it difficult to find the right regex pattern.

Example String:

<p>This is an example<br />of various <span style="color: #445444">html content</span>

I have tried using:

$string -replace '\<([^\)]+)\>',''

It works with simple examples, but with ones such as the above it captures the whole string.

Any suggestions on what's the best way to achieve this?

briantist · Accepted Answer · 2020-06-11 18:14:10Z

24

For a pure regex, it should be as easy as <[^>]+>:

$string -replace '<[^>]+>',''
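To make the difference concrete, here is a small sketch that runs both patterns against the example string from the question. The variable names `$attempt` and `$stripped` are only for illustration.

```powershell
$string = '<p>This is an example<br />of various <span style="color: #445444">html content</span>'

# Pattern from the question: [^\)]+ excludes only ")", so it greedily runs
# from the first "<" to the last ">" and swallows the whole string.
$attempt = $string -replace '\<([^\)]+)\>', ''

# Pattern from this answer: [^>]+ cannot cross a ">", so each tag is matched on its own.
$stripped = $string -replace '<[^>]+>', ''

$attempt    # empty string
$stripped   # This is an exampleof various html content
```

Note that "example" and "of" end up joined because `<br />` is replaced with nothing; if that matters, replace line-break tags with a space or newline before stripping the rest.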

Note that this could fail with certain HTML comments or the contents of <pre> tags.
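A rough illustration of how such failures can look, using made-up input strings. The comment-stripping step is one possible mitigation, not something the pattern above does by itself, and the `<pre>` case is only one way the problem can show up.

```powershell
$pattern = '<[^>]+>'

# A ">" inside a comment ends the match early and leaves debris behind.
$comment = '<!-- if 1 > 0 --><p>ok</p>'
$commentResult = $comment -replace $pattern, ''            # ' 0 -->ok'

# Possible mitigation (an assumption, not part of the answer's pattern):
# remove whole comments first. (?s) lets "." match newlines as well.
$commentFixed = $comment -replace '(?s)<!--.*?-->', '' -replace $pattern, ''   # 'ok'

# Unescaped "<" and ">" inside <pre> look like a tag to the regex,
# so real text between them is lost.
$pre = '<pre>if (a < b && c > d)</pre>'
$preResult = $pre -replace $pattern, ''                    # 'if (a  d)'
```

For input like this, a real parser is the safer choice.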

Instead, you could use the HTML Agility Pack (alternative link: nuget.org/packages/HtmlAgilityPack), which is designed for use in .NET code. I've used it successfully in PowerShell before:

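# Load the HTML Agility Pack assembly (adjust the path to wherever the package is installed)
Add-Type -Path 'C:\packages\HtmlAgilityPack.1.4.6\lib\Net40-client\HtmlAgilityPack.dll'

# Parse the HTML, then read only the text content of the document
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($string)
$doc.DocumentNode.InnerText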

HTML Agility Pack works well with non-perfect HTML.

edited Jun 11, 2020 at 18:14

answered Apr 28, 2015 at 21:27

5 Comments

briantist Over a year ago

Did you end up using the regex or the parser?

Arturski Over a year ago

I used the regex, and it is working well so far. My script is already quite big and I am parsing the HTML manually, but the library looks quite good and I will try it in my other projects. Thanks!

briantist Over a year ago

There are a fair number of other questions here on SO about the Agility Pack, so you can find more help here or post about it. Many of them are C#-specific, but they still apply to PowerShell usage. It's quite a nice library, but do brush up on XPath to get the most out of it.

donothingsuccessfully Over a year ago

The link for HTML Agility Pack was broken for me. It's available on nuget: nuget.org/packages/HtmlAgilityPack

briantist Over a year ago

@donothingsuccessfully The link still works for me, but I added yours as an alternative; thanks!

Ludwig Fichtinger · Accepted Answer · 2021-08-20 12:10:15Z

To resolve umlauts and other special characters, I used an HTML COM object. Here is my function:

Function ConvertFrom-Html
{
    <#
        .SYNOPSIS
            Converts an HTML string to plain text.

        .DESCRIPTION
            Creates an HTMLFile COM object and uses innerText to get the plain text.
            If that raises an error, it replaces several HTML entity placeholders and removes all <> tags via regex.

        .INPUTS
            String. HTML as a string.

        .OUTPUTS
            String. The HTML text as plain text.

        .EXAMPLE
        $html = "<p><strong>Nutzen:</strong></p><p>Der&nbsp;Nutzen ist &uuml;beraus gro&szlig;.<br />Test ob 3 &lt; als 5 ist &amp; &quot;4&quot; &gt; &apos;2&apos;?"
        ConvertFrom-Html -Html $html
        $html | ConvertFrom-Html

        Result:
        "Nutzen:
        Der Nutzen ist überaus groß.
        Test ob 3 < als 5 ist & "4" > '2'?"

        .NOTES
            Author: Ludwig Fichtinger FILU
            Initial Creation Date: 01.06.2021
            ChangeLog: v2 20.08.2021 try/catch with replace for systems without Internet Explorer

    #>

    [CmdletBinding(SupportsShouldProcess = $True)]
    Param(
        [Parameter(Mandatory = $true, Position = 0, ValueFromPipeline = $true, HelpMessage = "HTML as a string")]
        [AllowEmptyString()]
        [string]$Html
    )

    try
    {
        $HtmlObject = New-Object -Com "HTMLFile"
        $HtmlObject.IHTMLDocument2_write($Html)
        $PlainText = $HtmlObject.documentElement.innerText
    }
    catch
    {
        # Fallback for systems without the HTMLFile COM object (e.g. no Internet Explorer)
        $nl = [System.Environment]::NewLine
        $PlainText = $Html -replace '<br>',$nl
        $PlainText = $PlainText -replace '<br/>',$nl
        $PlainText = $PlainText -replace '<br />',$nl
        $PlainText = $PlainText -replace '</p>',$nl
        $PlainText = $PlainText -replace '&nbsp;',' '
        # -creplace is case-sensitive; plain -replace ignores case and would turn &auml; into Ä
        $PlainText = $PlainText -creplace '&Auml;','Ä'
        $PlainText = $PlainText -creplace '&auml;','ä'
        $PlainText = $PlainText -creplace '&Ouml;','Ö'
        $PlainText = $PlainText -creplace '&ouml;','ö'
        $PlainText = $PlainText -creplace '&Uuml;','Ü'
        $PlainText = $PlainText -creplace '&uuml;','ü'
        $PlainText = $PlainText -replace '&szlig;','ß'
        $PlainText = $PlainText -replace '&quot;','"'
        $PlainText = $PlainText -replace '&apos;',"'"
        # Strip tags before decoding &lt; / &gt;, so decoded angle brackets are not removed as tags
        $PlainText = $PlainText -replace '<.*?>',''
        $PlainText = $PlainText -replace '&gt;','>'
        $PlainText = $PlainText -replace '&lt;','<'
        # Decode &amp; last, so "&amp;lt;" becomes "&lt;" rather than "<"
        $PlainText = $PlainText -replace '&amp;','&'
    }

    return $PlainText
}

Example:

"<p><strong>Nutzen:</strong></p><p>Der&nbsp;Nutzen ist &uuml;beraus gro&szlig;.<br />Test ob 3 &lt; als 5 ist &amp; &quot;4&quot; &gt; &apos;2&apos;?" | ConvertFrom-Html

Result:

Nutzen:
Der Nutzen ist überaus groß.
Test ob 3 < als 5 ist & "4" > '2'?
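As a possible alternative to the hand-written entity list in the fallback branch, the sketch below decodes entities with `[System.Net.WebUtility]::HtmlDecode`. The function name `ConvertFrom-HtmlFallback` is hypothetical, and this assumes the `System.Net.WebUtility` class is available in your PowerShell host (it should be in Windows PowerShell 5.1 and PowerShell 7). It is not a replacement for a real HTML parser.

```powershell
function ConvertFrom-HtmlFallback {
    # Hypothetical helper: regex-based tag stripping plus framework entity decoding.
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, Position = 0, ValueFromPipeline)]
        [AllowEmptyString()]
        [string]$Html
    )
    process {
        # The process block runs once per pipeline item, so every piped string is converted.
        $nl = [System.Environment]::NewLine

        # Turn line-breaking tags into newlines before the tags are stripped.
        $text = $Html -replace '<br\s*/?>|</p>', $nl

        # Strip the remaining tags while entities are still encoded,
        # so an encoded "&lt;" is never mistaken for the start of a tag.
        $text = $text -replace '<[^>]+>', ''

        # Decode all entities in a single pass ("&amp;lt;" becomes "&lt;", not "<").
        $text = [System.Net.WebUtility]::HtmlDecode($text)

        # &nbsp; decodes to U+00A0; map it to a plain space like the original function does.
        $nbsp = [string][char]0x00A0
        $text.Replace($nbsp, ' ')
    }
}

# Usage: both calling styles from the original function still work.
ConvertFrom-HtmlFallback -Html '<p>3 &lt; 5 &amp; &quot;x&quot;</p>'
'<p>first</p>', '<p>second</p>' | ConvertFrom-HtmlFallback
```

Each result ends with a newline because `</p>` is converted to one; call `.Trim()` on the output if that is unwanted.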

Giedrius · Accepted Answer · 2015-04-28 21:27:40Z

1

You can try this:

$string -replace '<.*?>',''

answered Apr 28, 2015 at 21:27

1 Comment

Ashley Over a year ago

Careful using .* like this; it is a less efficient way of matching. If you know the end delimiter, then the negated character class ( [^>] ) in the accepted answer means the engine is just looking for one character to stop the match, rather than having to backtrack to match the '>' later.
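Besides efficiency, the two patterns also differ in behaviour: by default `.` does not match a newline, so `<.*?>` misses a tag that is split across lines, while `[^>]+` does not have that limit. A small sketch with a made-up input string:

```powershell
# A tag whose attributes continue on a second line (`n is a newline).
$multiLine = "<span`nclass='x'>hi</span>"

# "." stops at the newline, so the opening tag is never matched; only </span> is removed.
$lazy = $multiLine -replace '<.*?>', ''

# [^>] happily matches the newline, so both tags are removed.
$negated = $multiLine -replace '<[^>]+>', ''          # 'hi'

# With the (?s) option "." matches newlines too, and the lazy pattern catches up.
$lazySingleLine = $multiLine -replace '(?s)<.*?>', ''  # 'hi'
```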
