Parsing HTML String [duplicate]

Question

Is there a way to parse an HTML string in .NET code-behind, similar to DOM parsing?

e.g. GetElementByTagName("abc").GetElementByTagName("tag")

I have this code chunk:

// Requires: using System.IO; using System.Net;
private void LoadProfilePage()
{
    string sURL = "http://www.abcd1234.com/abcd1234";

    WebRequest wrGETURL = WebRequest.Create(sURL);

    //WebProxy myProxy = new WebProxy("myproxy",80);
    //myProxy.BypassProxyOnLocal = true;

    //wrGETURL.Proxy = WebProxy.GetDefaultProxy();

    // Dispose the response, stream, and reader when done.
    using (WebResponse response = wrGETURL.GetResponse())
    using (Stream objStream = response.GetResponseStream())
    {
        if (objStream != null)
        {
            using (StreamReader objReader = new StreamReader(objStream))
            {
                // Read the whole response body (the HTML) into one string.
                string sLine = objReader.ReadToEnd();

                if (!String.IsNullOrEmpty(sLine))
                {
                    ....
                }
            }
        }
    }
}

If it is valid XHTML, perhaps you can load it into System.Xml.Linq.XDocument.
– Bazzz, Commented Feb 24, 2011 at 13:39

Oded · Accepted Answer · 2017-06-06 18:54:23Z

12

You can use the excellent HTML Agility Pack.

This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't actually have to understand XPath or XSLT to use it). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).
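Since the question starts from an HTML string rather than a file, here is a minimal sketch of how that might look, assuming the HtmlAgilityPack NuGet package is referenced. The markup in `html` is made up for illustration and stands in for the string read from the response; the nested `Descendants` calls are only a rough analogue of the chained GetElementByTagName calls in the question.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

class Program
{
    static void Main()
    {
        // Hypothetical markup standing in for the string read from the response.
        string html = "<abc><tag>first</tag><div><tag>second</tag></div></abc><tag>outside</tag>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file or URL

        // Rough analogue of GetElementByTagName("abc").GetElementByTagName("tag"):
        // every <tag> element nested anywhere under an <abc> element.
        foreach (HtmlNode abc in doc.DocumentNode.Descendants("abc"))
        {
            foreach (HtmlNode tag in abc.Descendants("tag"))
            {
                Console.WriteLine(tag.InnerText);
            }
        }
    }
}
```

`Descendants` returns an empty sequence when nothing matches, so the loops need no null check; `SelectNodes` with an XPath such as "//abc//tag" is an alternative, but it returns null when there are no matches.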

edited Jun 6, 2017 at 18:54

user6451184

answered Feb 24, 2011 at 13:39

Oded

1 Comment

phillip Over a year ago

Dang it, Oded - I just posted this same thing, only I'm slow and you're a speed demon! :) +1 from me.

Hossein Hadi · Accepted Answer · 2019-10-16 11:36:00Z

10

Take a look at the HTML Agility Pack.

Example of its use:

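 // Requires: using HtmlAgilityPack;
 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm");
 // Select every <a> element that has an href attribute.
 // Note: SelectNodes returns null when nothing matches.
 foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
 {
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att);
 }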

edited Oct 16, 2019 at 11:36

Hossein Hadi

answered Feb 24, 2011 at 13:39

Mark Coleman

1 Comment

Joachim Chapman Over a year ago

Shouldn't the XPath string be "//a[@href]"?

Kobi · Accepted Answer · 2021-06-09 20:19:05Z

3

You can use the HTML Agility Pack and a little XPath (it can even download the document for you):

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.abcd1234.com/abcd1234");
HtmlNodeCollection tags = doc.DocumentNode.SelectNodes("//abc//tag");

edited Jun 9, 2021 at 20:19

answered Feb 24, 2011 at 13:40

Kobi

1 Comment

Will Over a year ago

That link is outdated. They moved to html-agility-pack.net/?z=codeplex

phillip · Accepted Answer · 2011-02-24 13:41:13Z

2

I've used the HTML Agility Pack to do this exact thing and I think it's great. It has been really helpful to me.

answered Feb 24, 2011 at 13:41

phillip

1 Comment

Will Over a year ago

That link is outdated. They moved to html-agility-pack.net/?z=codeplex

Community · Accepted Answer · 2017-05-23 11:33:24Z

0

Maybe this can help: What is the best way to parse HTML in C#?

edited May 23, 2017 at 11:33

CommunityBot

answered Feb 24, 2011 at 13:40

alexl
