Scraping JavaScript-loaded HTML with Ruby on Rails and Nokogiri

Question

I am trying to scrape a website for product names.

My controller does the following:

require "open-uri"
# Fetch the page and parse the HTML the server returns
page = Nokogiri::HTML(URI.open(PAGE_URL))
@items_array = page.css("li.item h3")

The view then displays it as:

<% @items_array.each do |item| %>
<%= item.text %><br /><br />
<% end %>

The problem is that the initial HTML contains only the first 10 items. The rest are generated by JavaScript, and I can't figure out exactly how.

Any ideas on how to scrape the rest of the content would be much appreciated!

Martin · Accepted Answer · 2014-05-30 06:57:58Z

1

It won't work. Nokogiri can only parse what is in the HTML the server returns, and from what I can see (using "view source" in my browser), a good part of the list is not in that HTML. How it is loaded (probably with JavaScript) is irrelevant in this case.

The best option would be to ask them whether they expose an API you could use; that would make your work much easier.

Scraping is very fragile, as it depends on the exact layout of the page.
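If an API is available, no HTML parsing is needed at all. Here is a minimal sketch of reading product names from a JSON response. The payload shape, the "items" and "name" fields, and the method name are all hypothetical; the real ones depend on whatever the site actually exposes.

```ruby
require "json"

# Hypothetical payload: the real field names depend on the site's API.
SAMPLE_RESPONSE = <<~JSON
  {"items": [{"name": "Widget"}, {"name": "Gadget"}, {"name": "Gizmo"}]}
JSON

# Extracts product names from a JSON body shaped like SAMPLE_RESPONSE.
# Returns an empty array when the "items" key is missing.
def product_names_from_json(body)
  JSON.parse(body).fetch("items", []).map { |item| item["name"] }
end

puts product_names_from_json(SAMPLE_RESPONSE)
```

In a real controller the body would come from an HTTP request (for example via Net::HTTP) instead of a hard-coded string.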


1 Comment

Yule Over a year ago

You should definitely ask whether you can use their information, even if they don't have an API. The likelihood is that they will welcome it as long as you link back to their page.

Abhiram · Accepted Answer · 2015-11-22 12:24:53Z

0

You need to use a web driver with a headless browser, such as watir-webdriver: https://github.com/watir/watir-webdriver

http://watirwebdriver.com/headless/
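As a rough sketch of what that could look like, assuming the current `watir` gem (the successor of watir-webdriver) with Chrome and chromedriver installed: the helper names, the `static_count` parameter, and the timeout are illustrative assumptions, and the selector is the one from the question.

```ruby
# Sketch only: assumes the `watir` gem plus Chrome and chromedriver are installed.

# Builds a headless Chrome session. The require lives inside the method so
# the extraction logic below does not depend on the gem being loaded.
def headless_browser
  require "watir"
  Watir::Browser.new(:chrome, headless: true)
end

# Visits the URL, waits until more than `static_count` items exist (that is,
# until JavaScript has added to the ones in the initial HTML), and returns
# the text of every matching element.
def product_names(browser, url, selector: "li.item h3", static_count: 10)
  browser.goto(url)
  browser.wait_until(timeout: 10) do
    browser.elements(css: selector).count > static_count
  end
  browser.elements(css: selector).map(&:text)
end

# Usage (needs a real browser, so it is left commented out):
#   browser = headless_browser
#   puts product_names(browser, PAGE_URL)
#   browser.close
```

The browser is passed in rather than created inside `product_names`, so the caller decides when to close it and can reuse one session for several pages.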

