
Is there any JavaScript web crawler framework?

  • Could you be more specific? Are you looking for a web crawler implemented in JavaScript? Server-side (Node.js) or client-side (in a browser)? Commented Apr 5, 2011 at 17:31
  • Is there a client-side web crawler framework? How would that work? Commented Apr 5, 2011 at 17:36
  • I wrote three APIs using server-side JavaScript. You can run Node.js from your command line as easily as you can Python. This is a perfectly valid question. Commented Mar 13, 2013 at 0:06

3 Answers


There's a new framework that was just released for Node.js called spider. It uses jQuery under the hood to crawl/index a website's HTML pages. The API and configuration are really nice, especially if you already know jQuery.

From the test suite, here's an example of crawling the New York Times website:

var spider = require('../main');

spider()
  .route('www.nytimes.com', '/pages/dining/index.html', function (window, $) {
    $('a').spider();
  })
  .route('travel.nytimes.com', '*', function (window, $) {
    $('a').spider();
    if (this.fromCache) return;

    var article = { title: $('nyt_headline').text(), body: '', photos: [] };
    $('div.articleBody').each(function () {
      article.body += this.outerHTML;
    });
    $('div#abColumn img').each(function () {
      var p = $(this).attr('src');
      if (p.indexOf('ADS') === -1) {
        article.photos.push(p);
      }
    })
    console.log(article);
  })
  .route('dinersjournal.blogs.nytimes.com', '*', function (window, $) {
    var article = {title: $('h1.entry-title').text()}
    console.log($('div.entry-content').html())
  })
  .get('http://www.nytimes.com/pages/dining/index.html')
  .log('info')
  ;

2 Comments

Spent a morning trying to get spider to work; it can't be run on the latest Node.js (0.6.6).
This is a good start, but it doesn't seem to handle meta redirects or document base overrides, so it will fail to crawl many sites. Still, it is the best implementation I've seen for Node, and with support for cookies it's better than other open-source crawlers.

Try PhantomJS. It's not exactly a crawler, but it could easily be used for that purpose. It has a fully functional WebKit engine built in, with the ability to save screenshots, etc., and it works as a simple command-line JS interpreter.
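To illustrate, here's a minimal sketch of using PhantomJS to load a page and list its links. Run it with the `phantomjs` binary (not node); the URL is just a placeholder:

```javascript
// Load a page in PhantomJS's headless WebKit and print its links.
// Run with: phantomjs crawl.js
var page = require('webpage').create();

page.open('http://www.example.com/', function (status) {
  if (status !== 'success') {
    console.log('failed to load page');
    phantom.exit(1);
  }
  // evaluate() runs inside the page's DOM, after scripts have executed
  var links = page.evaluate(function () {
    var anchors = document.querySelectorAll('a');
    return Array.prototype.map.call(anchors, function (a) { return a.href; });
  });
  console.log(links.join('\n'));
  page.render('screenshot.png'); // the screenshot support mentioned above
  phantom.exit();
});
```

To turn this into a crawler you'd feed the printed links back into a queue of pages to open next.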

Comments


Server-side?

Try node-crawler: https://github.com/joshfire/node-crawler
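A rough usage sketch, based on its README (treat the option names as assumptions rather than a definitive reference) — the callback receives a server-side jQuery-like handle, and you queue follow-up URLs yourself:

```javascript
// Sketch of node-crawler usage; requires the crawler package from npm.
var Crawler = require('crawler').Crawler;

var c = new Crawler({
  maxConnections: 10,
  // called once per downloaded page; $ is a jQuery-like handle on the DOM
  callback: function (error, result, $) {
    if (error) return;
    $('a').each(function (index, a) {
      c.queue(a.href); // you must queue discovered links yourself
    });
  }
});

c.queue('http://www.nytimes.com/');
```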

1 Comment

I wouldn't consider this a crawler, since it doesn't compile subsequent URLs to crawl on its own. It basically downloads the source of a given URL and triggers a callback on completion; it's up to the consumer to define the logic for crawling the links found on that page, which isn't very straightforward.
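To make that concrete, the missing piece is a frontier loop: a queue of discovered URLs plus a visited set. Here's a minimal sketch against an in-memory fake site (the pages and the naive regex extractor are invented for illustration; a real crawler would replace `fetchPage` with an HTTP GET):

```javascript
// Breadth-first crawl loop over an in-memory "site" (URL -> HTML map),
// so the queue/visited-set logic is visible without any network I/O.
var site = {
  '/index.html': '<a href="/a.html">a</a> <a href="/b.html">b</a>',
  '/a.html': '<a href="/index.html">home</a> <a href="/b.html">b</a>',
  '/b.html': '<p>no links here</p>'
};

function fetchPage(url) {        // stand-in for an HTTP request
  return site[url] || '';
}

function extractLinks(html) {    // naive href extraction; fine for a sketch
  var links = [], re = /href="([^"]+)"/g, m;
  while ((m = re.exec(html)) !== null) links.push(m[1]);
  return links;
}

function crawl(startUrl) {
  var queue = [startUrl], visited = {};
  while (queue.length > 0) {
    var url = queue.shift();
    if (visited[url]) continue;  // skip already-crawled pages
    visited[url] = true;
    extractLinks(fetchPage(url)).forEach(function (link) {
      if (!visited[link]) queue.push(link);
    });
  }
  return Object.keys(visited);
}

console.log(crawl('/index.html')); // visits all three pages
```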
