I want to extract this text:

Spectrum Mortis - Bit Meseri - The Incantation (2022)
Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)

from this HTML block:

<span id='tid-span-369523'><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>

I tried the following code, but nothing is written to output2.txt:

$html = Get-Content -Path 'C:\temp\html\metalarea2.html' -Raw

$pattern = '<span id="tid-span-\\d+"><a id="tid-link-\\d+" href=".+?" title=".+?">(.+?)</a></span>'

$matches = Select-String -InputObject $html -Pattern $pattern -AllMatches
$result = $matches | % { $_.Matches } | % { $_.Groups[1].Value }
$result | Out-File -FilePath "C:\temp\html\output2.txt"

I don't understand where the problem lies.

EDIT: SOLUTIONS

$pattern = '<span id=\x27tid-span-\d+\x27><a id="tid-link-\d+" href=".+?" title=".+?">(.+?)</a></span>'

OR

$pattern = '<a id="tid-link-\d+".+?>(.+?)</a>'
  • @mklement0 Nothing happens if I use $pattern = '<span id="tid-span-\d+"><a id="tid-link-\d+" href=".+?" title=".+?">(.+?)</a></span>'; output2.txt is still empty. Commented Jan 28, 2023 at 2:57
  • Take a closer look at your HTML: span id= is followed by a single-quoted string, not a double-quoted one. Use either two single quotes ('') or \x27 to match a single quote inside a single-quoted PowerShell string: $pattern = '<span id=\x27tid-span-\d+\x27><a id="tid-link-\d+" href=".+?" title=".+?">(.+?)</a></span>' Commented Jan 28, 2023 at 3:18
  • @Darin OK, thank you. I edited the pattern this way: $pattern = '<a id="tid-link-\d+".+?>(.+?)</a>' Commented Jan 28, 2023 at 3:20
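
The two problems discussed in these comments can be checked side by side. The following is only a minimal sketch: the variable names ($line, $broken, $fixed, and so on) are made up for illustration, and the sample line is copied from the question. It compares the original pattern with a corrected one that uses \d and a doubled single quote.

```powershell
# Sample line copied from the question; note the single quotes around the span id.
$line = @'
<span id='tid-span-369523'><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
'@

# Original pattern: inside a single-quoted string, '\\d' reaches the regex engine
# as an escaped backslash followed by a literal 'd', and id= expects double quotes.
$broken = '<span id="tid-span-\\d+"><a id="tid-link-\\d+" href=".+?" title=".+?">(.+?)</a></span>'

# Corrected pattern: '\d' for digits, and '' (two single quotes) for each
# literal single quote inside the single-quoted PowerShell string.
$fixed = '<span id=''tid-span-\d+''><a id="tid-link-\d+" href=".+?" title=".+?">(.+?)</a></span>'

$brokenHit = $line -match $broken   # $false
$fixedHit  = $line -match $fixed    # $true; -match fills the automatic $Matches

# Capture group 1 holds the link text when the corrected pattern matched.
$title = if ($fixedHit) { $Matches[1] } else { $null }
$title
```

The \x27 form from the comment is equivalent to the doubled quote; it just avoids PowerShell's quoting rules by letting the regex engine interpret the escape.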

2 Answers

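It is generally a bad idea to read or modify structured text such as HTML with regular expressions: small variations in the markup (for example, single versus double quotes around an attribute) easily break the pattern. It is better to use a proper HTML parser to handle your data.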

To give you an example using the IHTMLDocument2 interface:

$Html = @'
<html>
    <head>
        <title>Title</title>
    </head>
    <body>
        <span id="tid-span-369523"><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
        <span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
        <div id="something">Text within div</div>
    </body>
</html>
'@
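# Parse an HTML string into a document object using the 'HTMLFile' COM object.
function ParseHtml($String) {
    # Encode the markup as UTF-16 bytes before writing it to the document.
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    # Use IHTMLDocument2_Write when the COM object exposes it; otherwise fall back to write.
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    }
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    # Return the parsed document.
    $Html
}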

$Document = ParseHtml $Html
$Document.getElementsByTagName('a') |
    Where-Object { $_.id -Like 'tid-link-*' } |
    Foreach-Object { $_.innerText }

Output:

Spectrum Mortis - Bit Meseri - The Incantation (2022)
Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)

You can use the regular expression below to capture the plain text between HTML tags:

(<[^>]*>)+(?<plaintext>[^<]+)<\/[^>]*>

See the live sample on regex101.com.

Here is a full script example:

$html = @"
<span id="tid-span-369523"><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
<div id="something">Text within div</div>
"@

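$pattern = '(<[^>]*>)+(?<plaintext>[^<]+)<\/[^>]*>'

# Avoid the name $matches: it is an automatic variable set by the -match operator.
# The Multiline option is not needed, because the pattern has no ^ or $ anchors.
$found = [regex]::Matches($html, $pattern)

# Extract the named group from every match. Note that this pattern also
# returns "Text within div", because it matches the text of any element.
$results = $found | ForEach-Object { $_.Groups['plaintext'].Value }

$results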
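
Because the generic pattern above also captures the div text, a narrower variant may be closer to what the question asks for. The following is a sketch under assumptions, not part of the original answer: the pattern keys on the tid-link-N id from the question's markup, the group name title is made up, and the output file is written to the temporary folder instead of C:\temp\html so the snippet runs anywhere.

```powershell
# Sample markup copied from the script above (single-quoted here-string, so nothing is expanded).
$html = @'
<span id="tid-span-369523"><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
<div id="something">Text within div</div>
'@

# Only match <a> elements whose id starts with tid-link-; the div is ignored.
$pattern = '<a\s+id="tid-link-\d+"[^>]*>(?<title>[^<]+)</a>'

$titles = [regex]::Matches($html, $pattern) |
    ForEach-Object { $_.Groups['title'].Value }

# Hypothetical output location; replace with 'C:\temp\html\output2.txt' as needed.
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'output2.txt'
$titles | Set-Content -Path $outFile -Encoding UTF8

$titles
```

Matching on the a element alone also sidesteps the quoting difference on the surrounding span, which is what caused the original pattern to fail.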
