I've tried using the Sanitize gem to clean a string which contains the HTML of a website.
It only removed the <script> tags, not the JavaScript inside the script tags.
What can I use to remove the JavaScript from a page?
require 'open-uri' # included with Ruby; only needed to load HTML from a URL
require 'nokogiri' # gem install nokogiri read more at http://nokogiri.org
html = URI.open('http://stackoverflow.com') # Fetch the HTML (plain open() stopped accepting URLs in Ruby 3.0)
doc = Nokogiri.HTML(html) # Parse the document
doc.css('script').remove # Remove <script>…</script>
puts doc # Source w/o script blocks
doc.xpath("//@*[starts-with(name(),'on')]").remove # Remove on____ attributes
puts doc # Source w/o script blocks or on____ handlers (javascript: URLs in href/src remain)
I am partial to the Loofah gem. Modified from an example in the docs:
1.9.3p0 :005 > Loofah.fragment("<span onclick='foo'>hello</span> <script>alert('OHAI')</script>").scrub!(:prune).to_s
=> "<span>hello</span> "
You might be interested in the ActiveRecord extensions Loofah provides.
Remove all <script> tags and their contents:
regex = /<\s*s\s*c\s*r\s*i\s*p\s*t.*?>.*?<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>|<\s*s\s*c\s*r\s*i\s*p\s*t.*?>|<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>/im
while text =~ regex
  text.gsub!(regex, '')
end
This will even take care of cases like:
<scr<script></script>ipt>alert('hello');</scr</script>ipt>
<script class='blah' >alert('hello');</script >
And other tricks. It won't, however, remove JavaScript that is executed via onload= or onclick=.
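Here is the loop above as a self-contained method, run against the nested-tag case and against an on* handler so you can see both behaviours. The regex is the one from this answer; `SCRIPT_TAG` and `strip_script_tags` are just names picked for this sketch.

```ruby
# Same regex as above, stored in a constant (the name is made up for this sketch)
SCRIPT_TAG = /<\s*s\s*c\s*r\s*i\s*p\s*t.*?>.*?<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>|<\s*s\s*c\s*r\s*i\s*p\s*t.*?>|<\s*\/\s*s\s*c\s*r\s*i\s*p\s*t\s*>/im

# Hypothetical helper: strips <script> tags repeatedly until none are left.
def strip_script_tags(html)
  text = html.dup                        # Work on a copy so the caller's string is untouched
  nil while text.gsub!(SCRIPT_TAG, '')   # gsub! returns nil once nothing matched, ending the loop
  text
end

nested  = "<scr<script></script>ipt>alert('hello');</scr</script>ipt>"
handler = %q{<p onclick="alert(1)">hi</p><script>x()</script>}

p strip_script_tags(nested)    # => ""
p strip_script_tags(handler)   # => "<p onclick=\"alert(1)\">hi</p>"
```

The nested case needs two passes: the first removes the inner tags and leaves a freshly assembled `<script>…</script>` pair, which the second pass removes. The onclick attribute in the second input is left in place, as noted above.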
What about on* attributes?