Parsing HTML document: Regular expression or LINQ?

Question

Trying to parse an HTML document and extract some elements (any links to text files).

The current strategy is to load an HTML document into a string. Then find all instances of links to text files. It could be any file type, but for this question, it's a text file.

The end goal is to have an IEnumerable list of string objects. That part is easy, but parsing the data is the question.

<html>
<head><title>Blah</title>
</head>
<body>
<br/>
<div>Here is your first text file: <a href="http://myServer.com/blah.txt"></div>
<span>Here is your second text file: <a href="http://myServer.com/blarg2.txt"></span>
<div>Here is your third text file: <a href="http://myServer.com/bat.txt"></div>
<div>Here is your fourth text file: <a href="http://myServer.com/somefile.txt"></div>
<div>Thanks for visiting!</div>
</body>
</html>

The initial approaches are:

load the string into an XML document, and attack it in a Linq-To-Xml fashion.
create a regex, to look for a string starting with href=, and ending with .txt

The question being:

what would that regex look like? I am a regex newbie, and this is part of my regex learning.
which method would you use to extract a list of tags?
which would be the most performant way?
which method would be the most readable/maintainable?

Update: Kudos to Matthew on the HTML Agility Pack suggestion. It worked just fine! The XPath suggestion works as well. I wish I could mark both answers as 'The Answer', but I obviously cannot. They are both valid solutions to the problem.

Here's a C# console app using the regex suggested by Jeff. It reads the string fine, and excludes any href that does not end with .txt. With the given sample, it correctly does NOT include the .txt.snarg file in the results (as provided in the HTML string function).

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace ParsePageLinks
{
    class Program
    {
        static void Main(string[] args)
        {
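            // Print each captured link so the result is visible.
            foreach (string link in GetAllLinksFromStringByRegex())
            {
                Console.WriteLine(link);
            }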
        }

        static List<string> GetAllLinksFromStringByRegex()
        {
            string myHtmlString = BuildHtmlString();
            string txtFileExp = "href=\"([^\\\"]*\\.txt)\"";

            List<string> foundTextFiles = new List<string>();

            MatchCollection textFileLinkMatches = Regex.Matches(myHtmlString, txtFileExp, RegexOptions.IgnoreCase);
            foreach (Match m in textFileLinkMatches)
            {
                foundTextFiles.Add( m.Groups[1].ToString()); // this is your captured group
            }

            return foundTextFiles; // was "files", which is not declared anywhere
        }

        static string BuildHtmlString()
        {
            return new StringReader(@"<html><head><title>Blah</title></head><body><br/>
<div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
<span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
<div>Here is your third text file: <a href=""http://myServer.com/bat.txt.snarg""></div>
<div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
<div>Thanks for visiting!</div></body></html>").ReadToEnd();
        }
    }
}

@JD: absolutely! As Matthew suggested, the HTML Agility Pack sounds worthy of a look. Were you going to suggest that or another?
– p.campbell, Commented May 25, 2009 at 18:13
@Philoushka I was going to suggest HTML Agility Pack... it rocks.
– Jeff, Commented May 25, 2009 at 19:55

Matthew Flaschen · Accepted Answer · 2009-05-25 19:19:49Z

13

Neither. Load it into an (X/HT)MLDocument and use XPath, which is a standard method of manipulating XML and very powerful. The functions to look at are SelectNodes and SelectSingleNode.

Since you are apparently using HTML (not XHTML), you should use HTML Agility Pack. Most of the methods and properties match the related XML classes.
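
To illustrate why the LINQ to XML route is risky for plain HTML, here is a minimal sketch (the class and method names are made up for this example). It assumes only the standard System.Xml.Linq API and checks whether a fragment can be loaded by XDocument.Parse. The first fragment mirrors the unclosed <a> tags in the question's sample; the second is the same markup with the anchor closed.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class XmlParseCheck
{
    // Returns true when the markup is well-formed XML that LINQ to XML can load.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Mismatched or unclosed tags end up here.
            return false;
        }
    }

    static void Main()
    {
        // Unclosed <a>, as in the question's sample HTML.
        string html = "<div>Here is your first text file: <a href=\"http://myServer.com/blah.txt\"></div>";
        // Same fragment with the anchor closed (XHTML-style).
        string xhtml = "<div>Here is your first text file: <a href=\"http://myServer.com/blah.txt\"></a></div>";

        Console.WriteLine(IsWellFormedXml(html));  // False
        Console.WriteLine(IsWellFormedXml(xhtml)); // True
    }
}
```

A tolerant HTML parser avoids this failure mode, which is the reason for reaching for HTML Agility Pack here.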

Sample implementation using XPath:

    HtmlDocument doc = new HtmlDocument();
    doc.Load(new StringReader(@"<html>
<head><title>Blah</title>
</head>
<body>
<br/>
<div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
<span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
<div>Here is your third text file: <a href=""http://myServer.com/bat.txt""></div>
<div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
<div>Thanks for visiting!</div>
</body>
</html>"));
    HtmlNode root = doc.DocumentNode;
    // 3 = ".txt".Length - 1.  See http://stackoverflow.com/questions/402211/how-to-use-xpath-function-in-a-xpathexpression-instance-programatically
    HtmlNodeCollection links = root.SelectNodes("//a[@href['.txt' = substring(., string-length(.)- 3)]]");
    IList<string> fileStrings;
    if (links != null)
    {
        fileStrings = new List<string>(links.Count);
        foreach (HtmlNode link in links)
            fileStrings.Add(link.GetAttributeValue("href", null));
    }
    else
        fileStrings = new List<string>(0);

edited May 25, 2009 at 19:19

answered May 25, 2009 at 18:00

Matthew Flaschen

2 Comments

p.campbell Over a year ago

@Matthew: The HTML Agility Pack gave me what I needed in about 5 minutes of implementation. It came with samples and source. Kudos to Simon Mourier!

Pete Montgomery Over a year ago

There's also now some support for "LINQ to HTML" in the Agility pack.
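
A rough sketch of what that LINQ style might look like, assuming the HtmlAgilityPack NuGet package is referenced. The class name LinqLinkFinder and the FindLinks helper are invented for illustration; only LoadHtml, Descendants and GetAttributeValue come from the library. The filter uses a plain string EndsWith instead of the XPath substring trick.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

static class LinqLinkFinder
{
    // Returns the href of every <a> element whose href ends with the given extension.
    public static List<string> FindLinks(string html, string extension)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("a")
            // Anchors without an href yield an empty string and are filtered out below.
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => href.EndsWith(extension, StringComparison.OrdinalIgnoreCase))
            .ToList();
    }

    static void Main()
    {
        string html = @"<html><body>
<div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
<div>Here is your third text file: <a href=""http://myServer.com/bat.txt.snarg""></div>
</body></html>";

        foreach (string link in FindLinks(html, ".txt"))
        {
            Console.WriteLine(link);
        }
    }
}
```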

Jeff Meatball Yang · Answer · 2009-05-25 18:25:26Z

1

I would recommend regex. Why?

Flexible (case-insensitivity, easy to add new file extensions, elements to check, etc.)
Fast to write
Fast to run

Regular expressions will not be hard to read, as long as you can WRITE regexes.

Use this as the regular expression:

href="([^"]*\.txt)"

Explanation:

It has parentheses around the filename, which will result in a "captured group" which you can access after each match has been found.
It has to escape the "." by using the regex escape character, a backslash.
It has to match any character EXCEPT double-quotes: [^"] until it finds the ".txt"

It translates into an escaped C# string like this:

string txtExp = "href=\"([^\\\"]*\\.txt)\"";

Then you can iterate over your Matches:

// Regex.Matches returns a MatchCollection; "input" is the HTML string to search.
MatchCollection txtMatches = Regex.Matches(input, txtExp, RegexOptions.IgnoreCase);
foreach(Match m in txtMatches) {
  string filename = m.Groups[1].Value; // this is your captured group
}
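
The "easy to add new file extensions" point above can be sketched as follows. This is only an illustration: the class ExtensionLinkFinder, the FindLinks helper and the sample URLs are invented here, and the pattern is the same href regex with the extension turned into an alternation. It returns an IEnumerable of strings, which is the shape the question asks for.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class ExtensionLinkFinder
{
    // Yields every href value that ends with one of the given extensions (passed without the dot).
    public static IEnumerable<string> FindLinks(string html, params string[] extensions)
    {
        // Escape each extension, then join them: "txt", "csv" becomes txt|csv.
        string alternatives = string.Join("|", extensions.Select(Regex.Escape));
        string pattern = "href=\"([^\"]*\\.(?:" + alternatives + "))\"";

        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            yield return m.Groups[1].Value; // the captured URL
        }
    }

    static void Main()
    {
        string html =
            "<a href=\"http://myServer.com/a.txt\">" +
            "<a href=\"http://myServer.com/b.CSV\">" +
            "<a href=\"http://myServer.com/c.txt.snarg\">" +
            "<a href=\"http://myServer.com/d.pdf\">";

        // Prints a.txt and b.CSV; c.txt.snarg and d.pdf are skipped.
        foreach (string link in FindLinks(html, "txt", "csv"))
        {
            Console.WriteLine(link);
        }
    }
}
```

Because the closing quote must follow the extension directly, a value such as c.txt.snarg is not matched. The sketch still only handles double-quoted href attributes, the same limitation as the original pattern.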

answered May 25, 2009 at 18:25

Jeff Meatball Yang

6 Comments

p.campbell Over a year ago

@Jeff: this is an excellent code sample. Thank you for the input!

Matthew Flaschen Over a year ago

That will match .txt anywhere in the href, when the OP explicitly said "ends with". In my opinion, regex is inappropriate here.

Dmitri Farkov Over a year ago

@Matthew: No, It will only match an HREF ending with (.txt"). I don't think HREF's contain quotes in the middle.

Svante Over a year ago

Don't try to use regular expressions to parse non-regular languages.

Jeff Meatball Yang Over a year ago

I understand the desire to approach this from a DOM/XPath point of view - but my rationale was that a regex implementation assumes very little about the input data. Obviously, if the OP can make assumptions, especially like well-formed documents, a DOM approach is much "cleaner". @Svante: I think regexes are GREAT at finding known patterns out of non-regular data. Think how many times you've grepped for something with a regex. Also, the OP wanted a regex example.

peterchen · Answer · 2009-05-25 18:28:13Z

0

As an alternative to Matthew Flaschen's suggestion: use the DOM (e.g. if you suffer from an X?L allergy outbreak).

It gets a bad rep sometimes - I guess because implementations are funny sometimes, and the native COM interfaces are a bit unwieldy without some (minor) smart helpers, but I've found it a robust, stable and intuitive / explorable way to parse and manipulate HTML.

answered May 25, 2009 at 18:28

peterchen

1 Comment

Matthew Flaschen Over a year ago

You're actually suggesting he use IE's HTML parser from .NET via COM interop?....

RichardTheKiwi · Answer · 2011-03-01 20:01:49Z

0

Regex is not fast; in fact, it's slower than native string parsing in .NET. Don't believe me? See for yourself.

None of the examples above are faster than going to the DOM directly.

// Assumption: wb is a System.Windows.Forms.WebBrowser that has finished loading the page.
HtmlDocument doc = wb.Document;
var links = doc.Links;
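
For anyone who wants to "see for yourself", here is one possible way to time the regex against a hand-rolled string scan. Everything in it is an assumption made for this sketch: the class name, the IndexOf-based scanner, and the synthetic input of 10,000 links. It reports timings only and makes no claim about which approach wins on a given machine or runtime.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

static class LinkTiming
{
    static readonly Regex TxtLink = new Regex("href=\"([^\"]*\\.txt)\"", RegexOptions.IgnoreCase);

    public static List<string> ByRegex(string html)
    {
        var found = new List<string>();
        foreach (Match m in TxtLink.Matches(html))
        {
            found.Add(m.Groups[1].Value);
        }
        return found;
    }

    // Hand-rolled scan: find each href=", read up to the closing quote, keep values ending in .txt.
    public static List<string> ByIndexOf(string html)
    {
        var found = new List<string>();
        const string marker = "href=\"";
        int pos = 0;
        while ((pos = html.IndexOf(marker, pos, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            int start = pos + marker.Length;
            int end = html.IndexOf('"', start);
            if (end < 0) break; // unterminated attribute, stop scanning
            string value = html.Substring(start, end - start);
            if (value.EndsWith(".txt", StringComparison.OrdinalIgnoreCase)) found.Add(value);
            pos = end + 1;
        }
        return found;
    }

    static void Main()
    {
        // Synthetic input: 10,000 links shaped like the question's sample.
        var sb = new StringBuilder();
        for (int i = 0; i < 10000; i++)
        {
            sb.Append("<div><a href=\"http://myServer.com/file").Append(i).Append(".txt\"></div>");
        }
        string html = sb.ToString();

        var sw = Stopwatch.StartNew();
        int regexCount = ByRegex(html).Count;
        sw.Stop();
        Console.WriteLine("Regex:   {0} links in {1} ms", regexCount, sw.ElapsedMilliseconds);

        sw.Restart();
        int scanCount = ByIndexOf(html).Count;
        sw.Stop();
        Console.WriteLine("IndexOf: {0} links in {1} ms", scanCount, sw.ElapsedMilliseconds);
    }
}
```

A single pass like this is a crude measurement; repeating each run several times and discarding the first (JIT warm-up) would give steadier numbers.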

edited Mar 1, 2011 at 20:01

RichardTheKiwi

answered Mar 1, 2011 at 19:52

JWP
