
I need to crawl several URLs and store their contents in a DB.

The crawled data must contain both the HTML and the external CSS and JS files.

I used Nokogiri to grab the CSS with no problem, but I am unable to get the JavaScript as easily.
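For reference, the CSS part looks roughly like this (a sketch, not my exact code; arrCSS is just an illustrative name):

arrCSS = []
page = Nokogiri::HTML(open(url))
# grab the href of every external stylesheet link
page.css('link[rel="stylesheet"]').map {|link| arrCSS << link['href'].to_s}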

Here is my relevant code:

...

arrJS = []
page = Nokogiri::HTML(open(url)) 
page.css('script').map {|link| arrJS << link['src'].to_s}

...

When I use this on a site like yahoo.com, I get a weird arrJS array that bears no relevance to the scripts in the HTML.

Any thoughts?


1 Answer


You are confusing Array#map with Array#each. Also, inline script elements have no src attribute, so link['src'].to_s yields an empty string for each of them, which is why arrJS looks unrelated to the scripts on the page. Try this:

require 'nokogiri'
require 'open-uri'

arrJS = []
page = Nokogiri::HTML(open(url))

page.css('script').each do |script|
    src = script['src']
    # Inline scripts have no src attribute; skip them
    arrJS << src.to_s unless src.nil?
end

This will give you the src attribute of every script element that has one, while skipping the inline scripts.
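Note that this only gives you the URIs. Since you want to store the files themselves in a DB, you still have to download each one. A minimal sketch, assuming open-uri and using URI.join to resolve relative src paths against the page URL (jsContents is an illustrative name):

require 'open-uri'
require 'uri'

jsContents = {}
arrJS.each do |src|
    # Resolve relative paths like "/js/app.js" against the page URL
    absolute = URI.join(url, src).to_s
    jsContents[absolute] = open(absolute).read
end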

If, instead, you want the content of the inline scripts rather than the source URIs, you can use:

contentJS = []

# Only inline scripts have content; external ones carry a src instead
page.css('script').each do |script|
    contentJS << script.content if script['src'].nil?
end
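As an aside, you can push the filtering into the selectors themselves; Nokogiri's CSS support includes attribute selectors and :not(), so (if I'm not mistaken) both loops collapse into a proper use of map:

# External scripts: only script elements that have a src attribute
arrJS = page.css('script[src]').map { |script| script['src'] }

# Inline scripts: only script elements without a src attribute
contentJS = page.css('script:not([src])').map(&:content)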