
Is there any JavaScript web crawler framework?

  • Could you be more specific? Are you looking for a web crawler implemented in JavaScript? Server-side (Node.js) or client-side (in a browser)? Commented Apr 5, 2011 at 17:31
  • Is there a client-side web crawler framework? How would that work? Commented Apr 5, 2011 at 17:36
  • I wrote three APIs using server-side JavaScript. You can run Node.js from your command line as easily as you can Python. This is a perfectly valid question. Commented Mar 13, 2013 at 0:06

3 Answers


There's a new framework that was just released for Node.js called spider. It uses jQuery under the hood to crawl/index a website's HTML pages. The API and configuration are really nice, especially if you already know jQuery.

From the test suite, here's an example of crawling the New York Times website:

var spider = require('../main');

spider()
  .route('www.nytimes.com', '/pages/dining/index.html', function (window, $) {
    $('a').spider();
  })
  .route('travel.nytimes.com', '*', function (window, $) {
    $('a').spider();
    if (this.fromCache) return;

    var article = { title: $('nyt_headline').text(), body: '', photos: [] };
    $('div.articleBody').each(function () {
      article.body += this.outerHTML;
    });
    $('div#abColumn img').each(function () {
      var p = $(this).attr('src');
      if (p.indexOf('ADS') === -1) {
        article.photos.push(p);
      }
    })
    console.log(article);
  })
  .route('dinersjournal.blogs.nytimes.com', '*', function (window, $) {
    var article = {title: $('h1.entry-title').text()}
    console.log($('div.entry-content').html())
  })
  .get('http://www.nytimes.com/pages/dining/index.html')
  .log('info')
  ;

2 Comments

Spent a morning trying to get spider to work; it can't be run on the latest Node.js (0.6.6).
This is a good start, but it doesn't seem to handle meta redirects or document base overrides, so it will fail to crawl many sites. Still, it is the best implementation I've seen for Node, and with support for cookies it's better than other open-source crawlers.

Try PhantomJS. It's not exactly a crawler, but it could easily be used for that purpose. It has a fully functional WebKit engine built in, with the ability to save screenshots, etc., and it works as a simple command-line JS interpreter.
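To illustrate, here's a minimal sketch of using PhantomJS to load a page and list its links. Run it with the `phantomjs` binary (not node); the URL is just a placeholder:

```javascript
// Load a page in PhantomJS's headless WebKit and print its links.
// Run with: phantomjs crawl.js
var page = require('webpage').create();

page.open('http://www.example.com/', function (status) {
  if (status !== 'success') {
    console.log('failed to load page');
    phantom.exit(1);
  }
  // evaluate() runs inside the page's DOM, after scripts have executed
  var links = page.evaluate(function () {
    var anchors = document.querySelectorAll('a');
    return Array.prototype.map.call(anchors, function (a) { return a.href; });
  });
  console.log(links.join('\n'));
  page.render('screenshot.png'); // the screenshot support mentioned above
  phantom.exit();
});
```

To turn this into a crawler you'd feed the printed links back into a queue of pages to open next.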

Comments


Server-side?

Try node-crawler: https://github.com/joshfire/node-crawler
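A rough usage sketch, based on its README (treat the option names as assumptions rather than a definitive reference) — the callback receives a server-side jQuery-like handle, and you queue follow-up URLs yourself:

```javascript
// Sketch of node-crawler usage; requires the crawler package from npm.
var Crawler = require('crawler').Crawler;

var c = new Crawler({
  maxConnections: 10,
  // called once per downloaded page; $ is a jQuery-like handle on the DOM
  callback: function (error, result, $) {
    if (error) return;
    $('a').each(function (index, a) {
      c.queue(a.href); // you must queue discovered links yourself
    });
  }
});

c.queue('http://www.nytimes.com/');
```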

1 Comment

I wouldn't consider this a crawler, since it doesn't compile subsequent URLs to crawl on its own. It basically downloads the source of a given URL and triggers a callback on completion; it's up to the consumer to define the logic for crawling the links found on that page, which isn't very straightforward.
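To make that concrete, the missing piece is a frontier loop: a queue of discovered URLs plus a visited set. Here's a minimal sketch against an in-memory fake site (the pages and the naive regex extractor are invented for illustration; a real crawler would replace `fetchPage` with an HTTP GET):

```javascript
// Breadth-first crawl loop over an in-memory "site" (URL -> HTML map),
// so the queue/visited-set logic is visible without any network I/O.
var site = {
  '/index.html': '<a href="/a.html">a</a> <a href="/b.html">b</a>',
  '/a.html': '<a href="/index.html">home</a> <a href="/b.html">b</a>',
  '/b.html': '<p>no links here</p>'
};

function fetchPage(url) {        // stand-in for an HTTP request
  return site[url] || '';
}

function extractLinks(html) {    // naive href extraction; fine for a sketch
  var links = [], re = /href="([^"]+)"/g, m;
  while ((m = re.exec(html)) !== null) links.push(m[1]);
  return links;
}

function crawl(startUrl) {
  var queue = [startUrl], visited = {};
  while (queue.length > 0) {
    var url = queue.shift();
    if (visited[url]) continue;  // skip already-crawled pages
    visited[url] = true;
    extractLinks(fetchPage(url)).forEach(function (link) {
      if (!visited[link]) queue.push(link);
    });
  }
  return Object.keys(visited);
}

console.log(crawl('/index.html')); // visits all three pages
```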
