12

I've tried using the Sanitize gem to clean a string which contains the HTML of a website.

It only removed the <script> tags, not the JavaScript inside the script tags.

What can I use to remove the JavaScript from a page?

Do you also want to remove all on* attributes? Commented Nov 28, 2011 at 17:07

7 Answers

13
require 'open-uri'      # included with Ruby; only needed to load HTML from a URL
require 'nokogiri'      # gem install nokogiri   read more at http://nokogiri.org

html = URI.open('http://stackoverflow.com').read     # Get the HTML source string
doc = Nokogiri.HTML(html)                            # Parse the document

doc.css('script').remove                             # Remove <script>…</script>
puts doc                                             # Source w/o script blocks

doc.xpath("//@*[starts-with(name(),'on')]").remove   # Remove on____ attributes
puts doc                                             # Source w/o any JavaScript

1 Comment

This seems like a really bad idea if your intention is to prevent XSS attacks. There are all sorts of edge cases you're missing. owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet
6

I am partial to the Loofah gem. Modified from an example in the docs:

1.9.3p0 :005 > Loofah.fragment("<span onclick='foo'>hello</span> <script>alert('OHAI')</script>").scrub!(:prune).to_s
 => "<span>hello</span> " 

You might be interested in the ActiveRecord extensions Loofah provides.


6

It turns out that Sanitize has an option built in (just not well documented)...

Sanitize.clean(content, :remove_contents => ['script', 'style'])

This removed all script and style tags (and their content) as I wanted.


1

So you need to add the sanitize gem to your Gemfile:

gem 'sanitize'

Then bundle

And then you can do Sanitize.clean(text, remove_contents: ['script', 'style'])


0

I use this regular expression to strip <script> and </script> tags from embedded content. It removes only the tags themselves, not the code between them, and it also catches variants with added whitespace such as < script> or < /script >.

post.content = post.content.gsub(/<\s*script\s*>|<\s*\/\s*script\s*>/, '')
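To see what stripping only the tags leaves behind, here is a small sketch that applies the same pattern to two made-up strings (the variable names and sample inputs are hypothetical, not from the answer). Note that an opening tag with attributes is not matched by this pattern at all.

```ruby
# Same pattern as in the answer above; everything else is illustrative.
tag_only = /<\s*script\s*>|<\s*\/\s*script\s*>/

# Hypothetical input with extra whitespace inside the tags.
sample   = "<p>Hi</p>< script>alert('x');< /script >"
stripped = sample.gsub(tag_only, '')
puts stripped.inspect    # the tags vanish, the JavaScript text stays

# Hypothetical input where the opening tag carries an attribute.
with_attr      = '<script type="text/javascript">alert(1)</script>'
attr_stripped  = with_attr.gsub(tag_only, '')
puts attr_stripped.inspect   # only the closing tag was removed
```

If the script body should disappear as well, or if tags may carry attributes, this pattern alone is not enough.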


0

Remove all script tags and their contents:

html_content = html_content.gsub(/<script\b[^>]*>[\s\S]*?<\/script\s*>/i, "")
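Whether the quantifier between the tags is greedy or lazy matters when a page has more than one script block. A minimal sketch on a hypothetical page (the variable names and the sample string are assumptions for illustration):

```ruby
# Hypothetical page with visible content between two script blocks.
page = "<script>a()</script><p>keep me</p><script>b()</script>"

greedy = /<script.*?>[\s\S]*<\/script>/i    # runs to the LAST </script>
lazy   = /<script.*?>[\s\S]*?<\/script>/i   # stops at the NEAREST </script>

greedy_result = page.gsub(greedy, '')
lazy_result   = page.gsub(lazy, '')

puts greedy_result.inspect   # the paragraph between the scripts is lost
puts lazy_result.inspect     # only the two script blocks are removed
```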



0

Remove all <script> tags and their contents:

regex = /<\s*s\s*c\s*r\s*i\s*p\s*t.*?>.*?<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>|<\s*s\s*c\s*r\s*i\s*p\s*t.*?>|<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>/im
while text =~ regex
  text.gsub!(regex, '')
end

This will even take care of cases like:

<scr<script></script>ipt>alert('hello');</scr</script>ipt>
<script class='blah'  >alert('hello');</script  >

And other tricks. It won't, however, remove JavaScript that is executed via onload= or onclick=.
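As a rough check of both claims, the loop can be wrapped in a method and run on the nested example and on an inline handler. The method name `strip_script_tags` and the `handler` sample are made up for this sketch; the regex is the one from the answer.

```ruby
# Hypothetical wrapper around the loop above; works on a copy of the input.
def strip_script_tags(text)
  regex = /<\s*s\s*c\s*r\s*i\s*p\s*t.*?>.*?<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>|<\s*s\s*c\s*r\s*i\s*p\s*t.*?>|<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>/im
  result = text.dup
  # Repeat until no match remains, so tags reassembled by a removal are caught too.
  result.gsub!(regex, '') while result =~ regex
  result
end

nested  = "<scr<script></script>ipt>alert('hello');</scr</script>ipt>"
handler = "<span onclick=\"alert('hi')\">click</span>"

nested_result  = strip_script_tags(nested)
handler_result = strip_script_tags(handler)

puts nested_result.inspect    # nothing is left of the nested script
puts handler_result.inspect   # the onclick handler is untouched
```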
