5

My C# site allows users to submit HTML to be displayed on the site. I would like to limit the tags and attributes allowed in that HTML, but I can't figure out how to do this in .NET.

I've tried using Html Agility Pack, but I don't see how to modify the HTML. I can see how to walk through the HTML and find certain data, but actually generating an output file is baffling me.

Does anyone have a good example of cleaning up HTML in .NET? The Agility Pack might be the answer, but its documentation is lacking.

1
  • Good question. This is at the top of my list whenever I allow HTML code to be submitted and displayed. Generally I use controls that format and sanitize the result for me (e.g. www.freetextbox.com in ASP.NET), but I should really confirm the result too. +1 for the question. Commented Jan 6, 2010 at 16:09

6 Answers

4

I would strongly recommend Microsoft's Anti-XSS Library for sanitizing input. It supports sanitizing HTML.


3

You should only accept well-formed HTML.

You can then use LINQ to XML to parse and modify it.

You can make a recursive function that takes an element from the user and returns a new element with a whitelisted set of tags and attributes.

For example:

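using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Maps allowed tags to the attributes allowed on each tag.
static readonly Dictionary<string, string[]> AllowedTags = new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase) {
    { "b",    new string[0] },
    { "img",  new string[] { "src", "alt" } },
    //...
};

// Returns a copy of dirtyElement that keeps only whitelisted child elements and attributes.
// dirtyElement itself must be a whitelisted tag. Text nodes are not copied.
static XElement CleanElement(XElement dirtyElement) {
    string[] allowedAttributes = AllowedTags[dirtyElement.Name.LocalName];
    return new XElement(dirtyElement.Name,
        // Recursively clean the child elements whose tags are allowed.
        dirtyElement.Elements()
            .Where(e => AllowedTags.ContainsKey(e.Name.LocalName))
            .Select<XElement, XElement>(CleanElement),
        // Keep only the attributes allowed for this tag.
        dirtyElement.Attributes()
            .Where(a => allowedAttributes.Contains(a.Name.LocalName, StringComparer.OrdinalIgnoreCase))
    );
}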

If you allow hyperlinks, make sure to disallow javascript: URLs; this code doesn't do that.
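To illustrate the hyperlink caveat, here is a minimal sketch of a variant that also copies text nodes and drops URL attributes with an unexpected scheme. The class name `HtmlWhitelist`, the whitelist contents, the `UrlAttributes` and `AllowedSchemes` lists, and the `IsSafeUrl` helper are all assumptions made for illustration, and the scheme check is not a complete XSS defence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class HtmlWhitelist
{
    // Hypothetical whitelist: tag name -> allowed attribute names.
    static readonly Dictionary<string, string[]> AllowedTags =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase) {
            { "div", new string[0] },
            { "b",   new string[0] },
            { "a",   new[] { "href" } },
            { "img", new[] { "src", "alt" } },
        };

    // Assumption: these attributes carry URLs and these schemes are acceptable.
    static readonly string[] UrlAttributes = { "href", "src" };
    static readonly string[] AllowedSchemes = { "http", "https", "mailto" };

    static bool IsSafeUrl(string value)
    {
        // Browsers ignore whitespace and control characters inside a scheme, so strip them first.
        string compact = new string(value.Where(c => !char.IsWhiteSpace(c) && !char.IsControl(c)).ToArray());
        int colon = compact.IndexOf(':');
        int delimiter = compact.IndexOfAny(new[] { '/', '?', '#' });
        // No colon before the first path, query, or fragment delimiter: treat as a relative URL.
        if (colon < 0 || (delimiter >= 0 && delimiter < colon)) return true;
        return AllowedSchemes.Contains(compact.Substring(0, colon), StringComparer.OrdinalIgnoreCase);
    }

    // Returns a cleaned copy. The caller must pass an element whose tag is whitelisted.
    public static XElement Clean(XElement dirty)
    {
        string[] allowedAttributes = AllowedTags[dirty.Name.LocalName];
        return new XElement(dirty.Name.LocalName,
            dirty.Attributes()
                .Where(a => allowedAttributes.Contains(a.Name.LocalName, StringComparer.OrdinalIgnoreCase))
                .Where(a => !UrlAttributes.Contains(a.Name.LocalName, StringComparer.OrdinalIgnoreCase)
                            || IsSafeUrl(a.Value))
                .Select(a => new XAttribute(a.Name.LocalName, a.Value)),
            dirty.Nodes().Select(n => CleanNode(n)).Where(n => n != null));
    }

    static XNode CleanNode(XNode node)
    {
        var text = node as XText;
        if (text != null) return new XText(text.Value);   // text (and CDATA) is kept as plain text
        var element = node as XElement;
        if (element != null && AllowedTags.ContainsKey(element.Name.LocalName)) return Clean(element);
        return null;   // comments, processing instructions, and disallowed elements are dropped
    }

    static void Main()
    {
        var dirty = XElement.Parse(
            "<div onclick=\"evil()\">Hi <b>there</b><script>alert(1)</script>" +
            "<a href=\"javascript:alert(1)\">x</a></div>");
        Console.WriteLine(Clean(dirty).ToString(SaveOptions.DisableFormatting));
    }
}
```

In this sample the onclick attribute, the script element, and the javascript: href are dropped, while the text and the b and a elements survive. Disallowed elements are removed together with their contents; whether to keep their inner text instead is a design choice this sketch does not make for you.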


2

With HtmlAgilityPack you can remove unwanted tags from the input:

node.ParentNode.RemoveChild(node);
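As a rough sketch of how that call could fit into a full pass, the following loads the HTML, removes every element that is not on a whitelist, strips attributes that are not allowed, and writes the result back out as a string. It assumes the HtmlAgilityPack NuGet package is referenced. The class name `AgilitySanitizer` and the whitelist contents are hypothetical, and the sketch does not validate URL values.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package

static class AgilitySanitizer
{
    // Hypothetical whitelist: tag name -> allowed attribute names.
    static readonly Dictionary<string, string[]> AllowedTags =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase) {
            { "p",   new string[0] },
            { "b",   new string[0] },
            { "img", new[] { "src", "alt" } },
        };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot with ToList() because the tree is modified while we iterate.
        foreach (HtmlNode node in doc.DocumentNode.Descendants().ToList())
        {
            if (node.NodeType == HtmlNodeType.Text) continue;   // keep plain text

            string[] allowed;
            if (node.NodeType != HtmlNodeType.Element
                || !AllowedTags.TryGetValue(node.Name, out allowed))
            {
                // Removes comments and disallowed elements, including everything inside them.
                node.ParentNode.RemoveChild(node);
                continue;
            }

            // Strip attributes that are not whitelisted for this tag.
            foreach (HtmlAttribute attribute in node.Attributes.ToList())
                if (!allowed.Contains(attribute.Name, StringComparer.OrdinalIgnoreCase))
                    attribute.Remove();
        }

        // Serialize the modified tree; doc.Save also accepts a file path.
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

A call such as AgilitySanitizer.Sanitize(userHtml) returns the cleaned markup as a string. Saving through doc.Save is also how you would produce an output file.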

1 Comment

That's the method I was looking for. Thanks.

0

SgmlReader, available on SourceForge, turns HTML into well-formed XML. You can read the result through an XmlReader or load it into an XmlDocument for further processing. I have used it before to parse web pages, which are not always well-formed HTML.
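A minimal sketch of that approach follows, assuming the SgmlReader library is referenced and exposes the `Sgml.SgmlReader` class with the `DocType`, `CaseFolding`, and `InputStream` properties as I remember them. The helper name `SgmlLoader.LoadHtml` is hypothetical, and the result is loaded into an XDocument rather than an XmlDocument so it can be combined with LINQ to XML.

```csharp
using System.IO;
using System.Xml.Linq;
using Sgml; // assumes the SgmlReader library is referenced

static class SgmlLoader
{
    // Parses possibly malformed HTML and returns it as a well-formed XML tree.
    public static XDocument LoadHtml(string html)
    {
        using (var reader = new SgmlReader())
        {
            reader.DocType = "HTML";                    // use the built-in HTML DTD
            reader.CaseFolding = CaseFolding.ToLower;   // normalize tag names to lower case
            reader.InputStream = new StringReader(html);
            return XDocument.Load(reader);
        }
    }
}
```

The returned tree can then be filtered with a whitelist before it is written back out.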


0

Have you had a look at MarkdownSharp? It is open source and was created by the Stack Overflow team.


0

Jeff Atwood posted his whitelist-based approach on Refactor My Code at http://refactormycode.com/codes/333-sanitize-html

I believe Stack Overflow combines that with the tag-balancing code at http://refactormycode.com/codes/360-balance-html-tags to sanitize posts and prepare them for display. And, of course, they use MarkdownSharp to enable Markdown in posts.
