I am having trouble extracting some attributes from an HTML page and need some ideas to help me get unstuck.

I am using PowerShell with the HtmlAgilityPack to parse the HTML. I have a very crude version built on regex, but it doesn't always work, so I thought XPath would be the better way to parse the results. If regex is the way to go, please let me know.

So far I have been able to grab the page that I am interested in and split it apart by rows.

$results = $htmldoc.DocumentNode.SelectNodes("//p[@class='row']")

After the page is split up I am trying to iterate through each row using xpath to grab the information I am interested in.

ForEach ($item in $results) {
    $ID = $null
    $ID = $item.OuterHtml
}

This gets me close to what I want, but it also grabs a lot of other information that I don't need. Here is what $item.OuterHtml looks like at this point.

OuterHtml            : <p class="row" data-latitude="41.5937565437255" data-longitude="-93.6437636649079" data-pid="4184719674"> <a href="/mod/4184719674.html" class="i"></a> 
                   <span class="star"></span> <span class="pl"> <span class="date">Nov 27</span>  <a href="/mod/4184719674.html">iPhone and other Cell Phone Unlocks</a> 
                   </span> <span class="l2">   <span class="pnr"> <small> (Des Moines)</small> <span class="px"> <span class="p"> <a href="#" class="maptag" 
                   data-pid="4184719674">map</a></span></span> </span>  <a class="gc" href="/mod/" data-cat="mod">cell phones - by dealer</a> </span> </p>

I just want the data-pid attribute.

sorry for the crappy picture

I have tried a bunch of other ways to extract the data-pid attribute but haven't had any success. Here is one such method I have tried, but it keeps returning the same value over and over.

$ID = $Date.DocumentNode.SelectSingleNode("//p/@data-pid")

I have a feeling that this is something simple but have hit a roadblock. Let me know what other information I need to post.

  • You have to describe more clearly what "getting stuck" means. Do you get a compile/syntax error? Do you get a run-time error? Do you get an empty result set? Commented Nov 27, 2013 at 22:09
  • I am trying to extract the data-pid attribute for each row and store it in a variable, but I am having trouble getting anything to work. The code posted above grabs what I need, but I only want the data-pid attribute from it. Commented Nov 27, 2013 at 22:19

1 Answer

In your foreach loop you should be able to get the attribute's value like this:

$ID = $item.GetAttributeValue("data-pid", "")

To walk all the attributes on that node try:

$item.Attributes | Select Name,Value
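
Putting the pieces together, here is a minimal sketch of the whole loop that collects one data-pid per row. It assumes HtmlAgilityPack.dll is in the current directory (the $hapPath value is a placeholder to adjust), and the HTML is a trimmed-down stand-in for the rows in the question; the second data-pid is made up for illustration.

```powershell
# Assumption: location of your copy of HtmlAgilityPack.dll; adjust as needed.
$hapPath = '.\HtmlAgilityPack.dll'
Add-Type -Path $hapPath

# Trimmed sample rows in the shape shown in the question.
# The second data-pid is invented for this example.
$html = @'
<p class="row" data-pid="4184719674"> <a href="/mod/4184719674.html">first</a> </p>
<p class="row" data-pid="1111111111"> <a href="/mod/1111111111.html">second</a> </p>
'@

$htmldoc = New-Object HtmlAgilityPack.HtmlDocument
$htmldoc.LoadHtml($html)

# One node per row, as in the question.
$results = $htmldoc.DocumentNode.SelectNodes("//p[@class='row']")

# Read the attribute straight off each row node; '' is the fallback
# returned when a row has no data-pid attribute.
$ids = foreach ($item in $results) {
    $item.GetAttributeValue('data-pid', '')
}

$ids
```

Because the attribute is read from $item itself, no second XPath query is needed, so there is nothing that can accidentally match a different row.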

5 Comments

Do you know if it's possible to use wildcards with this?
Also, what is the best way to use this for nested tags? Thanks, so far it is working.
RE wildcards, I don't think so but you can use the Attributes property e.g.: $item.Attributes | Select Name,Value. RE nested tags, you can always use $item.SelectNodes('<xpath query>').
I will try this out when I get a chance, thanks for the info.
I'm getting the logic working with my xpath query but it is not working quite right. I can grab the info I'm interested in but it doesn't iterate through the items the right way. I can get the correct ID to work but when I try to grab a date it just keeps spitting out the same date. Should I open a new question or update my original question with details?
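
A likely cause of the repeating value, both here and in the //p/@data-pid attempt in the question, is that an XPath expression starting with // is evaluated from the document root even when it is called on a row node, so SelectSingleNode returns the first match in the whole page every time. Prefixing the expression with a dot scopes it to the current row. The sketch below shows the difference; the DLL path, the sample HTML, and the second row's values are assumptions made for illustration, since the actual query was not posted.

```powershell
# Assumption: location of your copy of HtmlAgilityPack.dll; adjust as needed.
$hapPath = '.\HtmlAgilityPack.dll'
Add-Type -Path $hapPath

# Trimmed sample rows; the second row's values are invented.
$html = @'
<p class="row" data-pid="4184719674"> <span class="pl"> <span class="date">Nov 27</span> </span> </p>
<p class="row" data-pid="1111111111"> <span class="pl"> <span class="date">Nov 26</span> </span> </p>
'@

$htmldoc = New-Object HtmlAgilityPack.HtmlDocument
$htmldoc.LoadHtml($html)

$rows = foreach ($item in $htmldoc.DocumentNode.SelectNodes("//p[@class='row']")) {
    # '//' searches from the document root, so this finds the first date on the page.
    $fromRoot = $item.SelectSingleNode("//span[@class='date']")

    # './/' searches only the descendants of the current row.
    $fromRow = $item.SelectSingleNode(".//span[@class='date']")

    [pscustomobject]@{
        ID           = $item.GetAttributeValue('data-pid', '')
        DateFromRoot = $fromRoot.InnerText
        DateFromRow  = $fromRow.InnerText
    }
}

$rows | Format-Table -AutoSize
```

With this sample input, DateFromRoot is "Nov 27" for both rows, while DateFromRow follows each row ("Nov 27", then "Nov 26"). The same leading dot applies to $item.SelectNodes when a row contains several nested matches.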
