
I need to crawl several URLs and store their contents in a DB.

The crawled data must contain both the HTML and the external CSS and JS files.

I used Nokogiri to grab the CSS with no problem, but I am unable to get the JavaScript as easily.
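For reference, the CSS part looks roughly like this (a sketch, not my exact code; arrCSS is just an illustrative name):

arrCSS = []
page = Nokogiri::HTML(open(url))
# grab the href of every external stylesheet link
page.css('link[rel="stylesheet"]').map {|link| arrCSS << link['href'].to_s}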

Here is my relevant code:

...

arrJS = []
page = Nokogiri::HTML(open(url)) 
page.css('script').map {|link| arrJS << link['src'].to_s}

...

When I use this on a site like yahoo.com, I get a weird arrJS array that bears no relevance to the scripts in the HTML.

Any thoughts?


1 Answer


You are confusing Array#map with Array#each. Also, inline script elements have no src attribute, so link['src'].to_s yields an empty string for each of them, which is why arrJS looks unrelated to the scripts on the page. Try this:

require 'nokogiri'
require 'open-uri'

arrJS = []
page = Nokogiri::HTML(open(url))

page.css('script').each do |script|
    src = script['src']
    # Inline scripts have no src attribute; skip them
    arrJS << src.to_s unless src.nil?
end

This will give you the src attribute of every script element that has one, while skipping the inline scripts.
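Note that this only gives you the URIs. Since you want to store the files themselves in a DB, you still have to download each one. A minimal sketch, assuming open-uri and using URI.join to resolve relative src paths against the page URL (jsContents is an illustrative name):

require 'open-uri'
require 'uri'

jsContents = {}
arrJS.each do |src|
    # Resolve relative paths like "/js/app.js" against the page URL
    absolute = URI.join(url, src).to_s
    jsContents[absolute] = open(absolute).read
end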

If, instead, you want the content of the inline scripts rather than the source URIs, you can use:

contentJS = []

# Only inline scripts have content; external ones carry a src instead
page.css('script').each do |script|
    contentJS << script.content if script['src'].nil?
end
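As an aside, you can push the filtering into the selectors themselves; Nokogiri's CSS support includes attribute selectors and :not(), so (if I'm not mistaken) both loops collapse into a proper use of map:

# External scripts: only script elements that have a src attribute
arrJS = page.css('script[src]').map { |script| script['src'] }

# Inline scripts: only script elements without a src attribute
contentJS = page.css('script:not([src])').map(&:content)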