14

Is there a way to parse an HTML string in .NET code-behind, similar to DOM parsing?

For example: GetElementByTagName("abc").GetElementByTagName("tag")

I have this code:

// Requires: using System.IO; using System.Net;
private void LoadProfilePage()
{
    string sURL = "http://www.abcd1234.com/abcd1234";
    WebRequest wrGETURL = WebRequest.Create(sURL);

    //WebProxy myProxy = new WebProxy("myproxy",80);
    //myProxy.BypassProxyOnLocal = true;

    //wrGETURL.Proxy = WebProxy.GetDefaultProxy();

    // Dispose the response, stream, and reader once the body has been read.
    using (WebResponse response = wrGETURL.GetResponse())
    using (Stream objStream = response.GetResponseStream())
    {
        if (objStream != null)
        {
            using (StreamReader objReader = new StreamReader(objStream))
            {
                string sLine = objReader.ReadToEnd();

                if (!String.IsNullOrEmpty(sLine))
                {
                    ....
                }
            }
        }
    }
}
1
  • If it is valid XHTML, perhaps you can load it into System.Xml.Linq.XDocument. Commented Feb 24, 2011 at 13:39

5 Answers

12

You can use the excellent HTML Agility Pack.

This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you actually don't HAVE to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams).
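Since the question already has the page in a string, here is a minimal sketch of parsing it without any XPath, assuming the HtmlAgilityPack NuGet package is referenced. The class name HtmlStringSketch and the helper NestedTexts are hypothetical names made up for illustration; LoadHtml, Descendants, and InnerText are the library members being relied on.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlStringSketch
{
    // Hypothetical helper: collects the inner text of every <innerTag>
    // element found anywhere beneath an <outerTag> element.
    public static List<string> NestedTexts(string html, string outerTag, string innerTag)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses an in-memory string; no file or URL needed

        return doc.DocumentNode
                  .Descendants(outerTag)                              // like GetElementsByTagName(outerTag)
                  .SelectMany(outer => outer.Descendants(innerTag))   // then the nested tag
                  .Select(node => node.InnerText)
                  .ToList();
    }
}
```

With the question's code, this could be called as HtmlStringSketch.NestedTexts(sLine, "abc", "tag"). If outer elements are nested inside each other, the same inner node can be returned more than once, so deduplicate if that matters for your page.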


1 Comment

Dangit Oded - I just posted this same thing, only I'm slow and you're a speed demon! :) +1 from me.
10

Take a look at the HTML Agility Pack.

Example of its use:

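 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm");
 // Select every <a> element that has an href attribute.
 foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
 {
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att); // FixLink is your own helper, not part of the library
 }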

1 Comment

xpath string should be "//a[@href]" ?
3

You can use the HTML Agility Pack and a little XPath (it can even download the document for you):

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.abcd1234.com/abcd1234");
HtmlNodeCollection tags = doc.DocumentNode.SelectNodes("//abc//tag");
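One thing to watch for with the XPath route: SelectNodes returns null, not an empty collection, when nothing matches. A minimal sketch of guarding against that, again assuming the HtmlAgilityPack package is referenced and the HTML is already in a string (XPathSketch and CountMatches are hypothetical names for illustration):

```csharp
using System;
using HtmlAgilityPack;

public static class XPathSketch
{
    // Hypothetical helper: counts the nodes matching an XPath expression
    // in an HTML string, treating "no match" as zero.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes yields null when the expression matches nothing.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }
}
```

For example, XPathSketch.CountMatches(sLine, "//abc//tag") would work on the string already downloaded in the question. The same null check applies before a foreach over the result.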

1 Comment

That link is outdated. They moved to html-agility-pack.net/?z=codeplex
2

I've used the HTML Agility Pack to do this exact thing and I think it's great. It has been really helpful to me.

1 Comment

That link is outdated. They moved to html-agility-pack.net/?z=codeplex
0

Maybe this can help: What is the best way to parse html in C#?
