I have a large HTML data string separated into small chunks. I am trying to write a PowerShell script to remove all the HTML tags, but am finding it difficult to find the right regex pattern.

Example String:

<p>This is an example<br />of various <span style="color: #445444">html content</span>

I have tried using:

$string -replace '\<([^\)]+)\>',''

It works on simple examples, but on a string such as the one above it matches the whole string.

Any suggestions on the best way to achieve this?

3 Answers

24

For a pure regex, it should be as easy as <[^>]+>:

$string -replace '<[^>]+>',''
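As a quick check against the example string from the question, here is a small sketch comparing the two patterns (the variable names are only for illustration). The original attempt fails because `[^\)]+` excludes `)` rather than `>`, so the greedy match runs from the first `<` to the last `>`:

```powershell
# Example string from the question
$string = '<p>This is an example<br />of various <span style="color: #445444">html content</span>'

# Original attempt: [^\)]+ matches everything except ')', so it greedily
# swallows the text between the first '<' and the last '>'
$original = $string -replace '\<([^\)]+)\>',''

# Corrected pattern: [^>]+ stops at the first '>' after each '<'
$stripped = $string -replace '<[^>]+>',''

$original   # empty string
$stripped   # This is an exampleof various html content
```

Note that stripping `<br />` joins "example" and "of" without a space; replace `<br />` with a newline or a space first if that matters.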

Note that this could fail with certain HTML comments or the contents of <pre> tags.
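To make that caveat concrete, here is a minimal sketch with two made-up inputs (they are illustrative assumptions, not taken from the question). In both, a literal `>` or `<` inside the content confuses the pattern:

```powershell
# A comment that contains '>' ends the match too early
$comment = '<!-- a > b -->Hello'
$commentResult = $comment -replace '<[^>]+>',''
$commentResult   # ' b -->Hello'

# Unescaped '<' and '>' inside <pre> are treated as if they were a tag
$pre = '<pre>if (a < b && c > d)</pre>'
$preResult = $pre -replace '<[^>]+>',''
$preResult   # 'if (a  d)'
```

If input like this is possible, a real parser is the safer choice.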

Instead, you could use the HTML Agility Pack (also available on NuGet at nuget.org/packages/HtmlAgilityPack), which is designed for use in .NET code; I've used it successfully in PowerShell before:

Add-Type -Path 'C:\packages\HtmlAgilityPack.1.4.6\lib\Net40-client\HtmlAgilityPack.dll'

$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($string)
$doc.DocumentNode.InnerText

HTML Agility Pack works well with non-perfect HTML.

5 Comments

Did you end up using the regex or the parser?
I used the regex; it is working well so far. My script is already quite big and I am parsing the HTML manually, but the library looks quite good and I will try it in my other projects. Thanks.
There are a fair number of other questions here on SO about the Agility Pack, so you can find more help here or post about it. Many of them will be C#-specific, but they are still applicable to PowerShell usage. It's quite a nice library, but do brush up on XPath to get the most out of it.
The link for HTML Agility Pack was broken for me. It's available on NuGet: nuget.org/packages/HtmlAgilityPack
@donothingsuccessfully, the link still works for me, but I added yours as an alternative; thanks!
3

To resolve umlauts and special characters, I used an HTML COM object. Here is my function:

Function ConvertFrom-Html
{
    <#
        .SYNOPSIS
            Converts an HTML string to plain text.

        .DESCRIPTION
            Creates an HTMLFile COM object and uses innerText to get the plain text.
            If that throws an error, it replaces several HTML entities and removes all <> tags via regex.

        .INPUTS
            String. HTML as a string.

        .OUTPUTS
            String. The HTML text as plain text.

        .EXAMPLE
        $html = "<p><strong>Nutzen:</strong></p><p>Der&nbsp;Nutzen ist &uuml;beraus gro&szlig;.<br />Test ob 3 &lt; als 5 ist &amp; &quot;4&quot; &gt; &apos;2&apos;?"
        ConvertFrom-Html -Html $html
        $html | ConvertFrom-Html

        Result:
        "Nutzen:
        Der Nutzen ist überaus groß.
        Test ob 3 < als 5 ist & "4" > '2'?"


        .Notes
            Author: Ludwig Fichtinger FILU
            Initial Creation Date: 01.06.2021
            ChangeLog: v2 20.08.2021 try catch with replace for systems without Internet Explorer

    #>

    [CmdletBinding(SupportsShouldProcess = $True)]
    Param(
        [Parameter(Mandatory = $true, Position = 0, ValueFromPipeline = $true, HelpMessage = "HTML as a string")]
        [AllowEmptyString()]
        [string]$Html
    )

    try
    {
        $HtmlObject = New-Object -Com "HTMLFile"
        $HtmlObject.IHTMLDocument2_write($Html)
        $PlainText = $HtmlObject.documentElement.innerText
    }
    catch
    {
        # Fallback for systems without the HTMLFile COM object (no Internet Explorer)
        $nl = [System.Environment]::NewLine
        $PlainText = $Html -replace '<br\s*/?>',$nl
        $PlainText = $PlainText -replace '</p>',$nl
        $PlainText = $PlainText -replace '&nbsp;',' '
        # -creplace is case-sensitive; plain -replace ignores case and would turn &auml; into 'Ä'
        $PlainText = $PlainText -creplace '&Auml;','Ä'
        $PlainText = $PlainText -creplace '&auml;','ä'
        $PlainText = $PlainText -creplace '&Ouml;','Ö'
        $PlainText = $PlainText -creplace '&ouml;','ö'
        $PlainText = $PlainText -creplace '&Uuml;','Ü'
        $PlainText = $PlainText -creplace '&uuml;','ü'
        $PlainText = $PlainText -replace '&szlig;','ß'
        $PlainText = $PlainText -replace '&quot;','"'
        $PlainText = $PlainText -replace '&apos;',"'"
        # Strip tags before decoding &lt; and &gt;, so decoded brackets are not mistaken for tags
        $PlainText = $PlainText -replace '<.*?>',''
        $PlainText = $PlainText -replace '&gt;','>'
        $PlainText = $PlainText -replace '&lt;','<'
        # Decode &amp; last so that text such as &amp;lt; is not decoded twice
        $PlainText = $PlainText -replace '&amp;','&'
    }

    return $PlainText
}

Example:

"<p><strong>Nutzen:</strong></p><p>Der&nbsp;Nutzen ist &uuml;beraus gro&szlig;.<br />Test ob 3 &lt; als 5 ist &amp; &quot;4&quot; &gt; &apos;2&apos;?" | ConvertFrom-Html

Result:

Nutzen:
Der Nutzen ist überaus groß.
Test ob 3 < als 5 ist & "4" > '2'?
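If the COM object is not available, a different fallback is to let .NET decode the entities instead of listing them by hand. The sketch below is not the function above but a swapped-in alternative; the name `ConvertFrom-HtmlFallback` is hypothetical, and it assumes `[System.Net.WebUtility]::HtmlDecode` is available (.NET Framework 4 or later, or .NET Core):

```powershell
# Hypothetical helper name; a regex-plus-HtmlDecode sketch, not a full HTML parser
function ConvertFrom-HtmlFallback
{
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        [AllowEmptyString()]
        [string]$Html
    )

    process
    {
        $nl = [System.Environment]::NewLine
        # Turn line-breaking tags into newlines before stripping the rest
        $text = $Html -replace '<br\s*/?>|</p>', $nl
        # Remove the remaining tags while the entities are still encoded
        $text = $text -replace '<[^>]+>', ''
        # Decode entities exactly once, after tag removal
        [System.Net.WebUtility]::HtmlDecode($text)
    }
}

$result = ConvertFrom-HtmlFallback -Html '<p>gro&szlig; &amp;lt; <b>5</b></p>'
```

Because decoding happens once and after tag removal, `&amp;lt;` comes out as the literal text `&lt;` rather than `<`. One difference to be aware of: `HtmlDecode` turns `&nbsp;` into a non-breaking space (U+00A0), not a regular space.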

1

You can try this:

$string -replace '<.*?>',''

1 Comment

Careful using .* like this; it is a less efficient way of matching. If you know the end delimiter, the negated character class ( [^>] ) in the chosen answer means the engine just consumes characters until it reaches the one that stops the match, rather than having to backtrack to match the '>' later.
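Besides efficiency, the two patterns also behave differently when a tag spans more than one line, because `.` does not match a newline by default in .NET regular expressions. A minimal sketch with a made-up input string:

```powershell
# A tag whose attributes wrap onto a second line
$multiline = "<span`n  class='x'>text</span>"

# '.' does not match a newline, so the opening tag is left behind
$lazy = $multiline -replace '<.*?>',''

# [^>] matches any character except '>', including newlines
$negated = $multiline -replace '<[^>]+>',''

$negated   # text
```

Here `$lazy` still contains the opening `<span ...>` tag, while `$negated` is just `text`. Prefixing the lazy pattern with the `(?s)` option would also let `.` cross line breaks.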
