Parse a website using NodeJs

Question

I'm trying to parse the website dl-protect: given a URL of this type, http://www.dl-protect.com/F469D615, the output should be the protected link itself (an uptobox link, for example).

I tried to figure out how this service works using the Chrome dev console.

First of all, there are two cases to consider:

You don't need to enter a captcha, you just need to click on the continue button. Then the NodeJs program should return the URL (uptobox here) found on the second page
You need to enter a captcha. In this case the NodeJs program should return the URL of the captcha

So far, here's my code (written in ES6) :

import request from 'request';
import cheerio from 'cheerio';

// send the same headers a browser would send
let options = {
  url: 'http://www.dl-protect.com/F469D615',
  headers: {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,en-US;q=0.8,en;q=0.6,fr-FR;q=0.4',
    'Cache-Control': 'max-age=0', 
    'Connection': 'keep-alive', 
    'Content-Type': 'application/x-www-form-urlencoded', 
    'Host': 'www.dl-protect.com', 
    'Origin': 'http://www.dl-protect.com', 
    'Referer': 'http://www.dl-protect.com/F469D615', 
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/49.0.2623.108 Chrome/49.0.2623.108 Safari/537.36'
  }
};

request.get(options, function (error, response, body) {
    if (!error && response.statusCode == 200) {
        // parse the body response with cheerio
        let $ = cheerio.load(body);

        // detect if a captcha is required
        let isCaptcha = !!$('#captcha').length;

        // url of the captcha if needed
        let captchaUrl = '';

        // display whether we need a captcha or not
        switch (isCaptcha) {
            case true:
                captchaUrl = $('#captcha').attr('src');
                console.log(`Captcha required, URL : ${captchaUrl}`);
                break;
            case false:
                console.log('No captcha required');
                break;
        }

        // get the key
        let formKey = $('form[name="ccerure"] input[name="key"]').attr('value');
        console.log(`key : ${formKey}`);

        // the "i" field is computed in the browser, so it is hard-coded here rather than read from the page
        // it only carries data about the browser, so I copied it once after it was generated
        let formIn = [
            '_UETCF0UJREfkVmbpZWZk5Wd7QXYtJ3bGBCduVWb1N2bEBSZsJWY0J3bQtj',
            'cldXZpZXLmRGctwWYuJXZ05Wa7IXZ3VWaWBiREBFItVXat9mcoNkJkVmbpZ',
            'WZk5Wd74CduVGdu92Yg8WZklmdv8WakVXYgwUTUhEIm9GIrNWYilXYsBHIy',
            '9mZgMXZz5WZjlGbgUmbpZXZkl2VgMXZsJWYuV0OvNnLyVGdwFGZh1GZjVmb',
            'pZXZkl2dilGb7UGb1R2bNBibvlGdwlncjVGRgQnblRnbvNEIl5Wa2VGZpdl',
            'JkVmbpZWZk5Wd7sTahpGall2ZmV2bo9mZvp2blFGciJmamN2Zk1mYmpGatt',
            'jcldXZpZFIGREUg0Wdp12byh2Q8ZzMuczM18SayFmZhNFI4ATMuMjM2IjLw',
            '4SO08SZt9mcoNEI4ATMuMjM2IjLw4SO08Sb1lWbvJHaDBSd05WdiVFIp82a',
            'jV2RgU2apxGIswUTUh0SoAiNz4yNzUzL0l2SiV2VlxGcwFEIpQjNfZDO4BC',
            'e15WaMByOxEDWoACMuUzLhxGbpp3bNxHNygHN0YDewMTN=='
        ].join('');

        // if no captcha
        if (!isCaptcha) {
            // override the initial options by adding the necessary form data
            options = Object.assign({}, options, {form: {key: formKey, i: formIn, submitform: 'Continuer'}});

            // reach the same page with a post containing the following data : key, i and submitform
            request.post(options, function (error, response, body) {
                console.log(body);
                // console.log(response);
                // console.log(error);
            });
        }
    }
});

When I look at the chrome dev panel (network tab + preserve log), as soon as I click on the continue button, it shows me this :

I really thought passing "key", "i" and "submitform" would be enough, but it's not. It just goes back to the first page instead of going to the second page with the URL.

Any clue about how to get as output the uptobox link (in this case) would be really nice.

Thanks !

The question is: do you have any idea why I cannot reach the page I want? Maybe I wasn't clear, so let me explain in other words :) There are two pages. Open the link and you'll see the first one (with the continue button); if you click it, you get the second page with the protected link. My code tries to simulate that. So basically I would like to understand why I cannot get the uptobox link as output.
– maxime1992, Commented May 17, 2016 at 19:17
Have a look at a scraping library such as osmosis or a headless browser framework like PhantomJs.
– Daniel B, Commented May 17, 2016 at 19:27
@DanielB osmosis seems cool and I'll definitely give it a try. As I said to Soren, I'd like to avoid PhantomJs at first, and if I can't avoid it I'll go that way. If someone has an idea in the meantime, I'd be glad to hear it because come on ... If the browser can do it, we can do it too :)
– maxime1992, Commented May 17, 2016 at 19:35
According to this, osmosis does not execute the JavaScript on the page you are downloading, and hence it may fail if the web server requires ajax calls as part of the validation. "If the browser can do it, we can do it too" -- yes, that is why PhantomJS was developed, not sure why you don't want to use that.
– Soren, Commented May 17, 2016 at 23:54

Soren · Accepted Answer · 2016-05-17 19:26:13Z

2

Most websites will try to protect themselves against people scraping their site. Their reasons vary and are their own, but typical protections rely on cookies, hidden fields and the like, each of which may be signed, timestamped and set to expire, and possibly even validated for single use in the backend.
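
As a minimal sketch of the cookie point, assuming the site sets a session cookie on the first GET and expects it back on the POST (this is not confirmed for dl-protect): the helper below turns the Set-Cookie headers of one response into a Cookie header for the next request. The name cookieHeaderFrom is hypothetical, and the helper deliberately ignores cookie attributes such as Path and Expires.

```javascript
// Build a Cookie request header from the Set-Cookie headers of a previous response.
// In Node, response.headers['set-cookie'] is an array of strings (or undefined).
function cookieHeaderFrom(setCookieHeaders) {
  return (setCookieHeaders || [])
    .map(line => line.split(';')[0].trim()) // keep "name=value", drop Path, Expires, ...
    .filter(pair => pair.includes('='))     // skip anything that is not a name=value pair
    .join('; ');
}

// Hypothetical use inside the first request.get callback from the question:
//   const cookie = cookieHeaderFrom(response.headers['set-cookie']);
//   const headers = Object.assign({}, options.headers, { Cookie: cookie });
//   request.post(Object.assign({}, options, { headers, form: { /* key, i, submitform */ } }), callback);
```

The request module can also keep cookies between calls by itself when it is created with request.defaults({jar: true}); the manual version above only makes the mechanism visible. If the POST still returns the first page with the cookie attached, cookies alone are probably not what the site checks.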

What this site does specifically is anyone's guess, and is part of their internal security engineering.
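
Since the exact checks are unknown, one cheap thing to try is sending back every hidden field the form contains instead of a hand-picked subset. The sketch below is dependency-free and uses a regular expression, which is brittle compared with a real HTML parser such as cheerio; hiddenFields is a hypothetical helper name. It only sees values present in the downloaded HTML, so a field filled in by client-side JavaScript (as the question's "i" field appears to be) would come back empty or missing.

```javascript
// Collect name/value pairs of all <input type="hidden"> tags in an HTML string.
// Regex-based and therefore approximate: good enough to inspect a form, not a full parser.
function hiddenFields(html) {
  // Read one attribute from a tag; handles double-quoted, single-quoted and bare values.
  const attr = (tag, name) => {
    const re = new RegExp('\\s' + name + '\\s*=\\s*(?:"([^"]*)"|\'([^\']*)\'|([^\\s>]+))', 'i');
    const m = re.exec(tag);
    return m ? (m[1] || m[2] || m[3]) : undefined;
  };

  const fields = {};
  for (const tag of html.match(/<input\b[^>]*>/gi) || []) {
    if ((attr(tag, 'type') || '').toLowerCase() !== 'hidden') continue;
    const name = attr(tag, 'name');
    if (name) fields[name] = attr(tag, 'value') || ''; // empty or missing value becomes ''
  }
  return fields;
}

// Hypothetical use with the body of the first response:
//   const form = Object.assign(hiddenFields(body), { submitform: 'Continuer' });
```

Comparing the resulting object with the form data shown in the browser's network tab shows which fields the script is not sending.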

So you are probably out of luck for simple crawling like what you are trying to do, and you will need a full browser to do the work -- fortunately (for you) there are headless browsers such as PhantomJs which may be of help.


1 Comment

maxime1992 Over a year ago

It was my first approach but it was more complicated and slower. Plus, integrating it with Node afterwards is a little bit harder ... I thought I would find a way but unfortunately I did not. I'm sure there's something missing here. Something obvious. I feel close (maybe I'm far from it ...) and I'd like to stay on this path. If after a few days I still haven't found anything, I'll go the browser way.
