1

I want to scrape the page "https://www.ukr.net/ua/news/sport.html" with Node.js. I'm trying to make a basic GET request with the 'request' npm module. Here is an example:

const inspect = require('eyespect').inspector();
const request = require('request');
const url = 'https://www.ukr.net/news/dat/sport/2/';
const options = {
    method: 'get',
    json: true,
    url: url
};

request(options,  (err, res, body) => {
    if (err) {
        inspect(err, 'error making GET request');
        return;
    }
    const headers = res.headers;
    const statusCode = res.statusCode;
    inspect(headers, 'headers');
    inspect(statusCode, 'statusCode');
    inspect(body, 'body');
});

But in the response body I only get:

body: '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 
Transitional//EN">\n<html>\n<head>\n<META HTTP-EQUIV="expires" 
CONTENT="Wed, 26 Feb 1997 08:21:57 GMT">\n<META HTTP-EQUIV=Refresh
CONTENT="10">\n<meta HTTP-EQUIV="Content-type" CONTENT="text/html; 
charset=utf-8">\n<title>www.ukr.net</title>\n</head>\n<body>\n
Идет загрузка, подождите .....\n</body>\n</html>'

If I make the same GET request from Postman, I get exactly the page I need.

Please help me guys.

1
  • "Идет загрузка, подождите ....." means "Loading, please wait .....". The page you are trying to scrape has elements that are loaded dynamically, so your initial request comes back with the "loading" message instead. Maybe you could use something like PhantomJS to render the page for you? See stackoverflow.com/a/31059035/459517. Postman is probably doing something like this automatically. Commented Feb 4, 2017 at 19:20

2 Answers

1

You might have been blocked by bot protection - this can be checked with curl.

curl -vL https://www.ukr.net/news/dat/sport/2/

curl seems to get the result, and if curl works then something is probably missing from the request made by Node. One solution is to mimic a browser of your choice.

For example, the headers of a Chrome request, as shown in the browser's developer tools, give the following options for the request:

const options = {
    method: 'get',
    json: true,
    url: url, // same `url` as in the question
    gzip: true, // let `request` decompress the response body
    headers: {
        "Host": "www.ukr.net",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        // Only advertise encodings that `request` can decode (gzip, deflate);
        // Chrome itself also sent "sdch, br".
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.8"
    }
};
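
To check whether a given set of headers actually gets past the "loading" page, one option is to look for the meta refresh tag that the placeholder body in the question contains. The following is only a minimal sketch, assuming the placeholder always carries such a tag; `getMetaRefreshDelay` is a hypothetical helper name and not part of the `request` module.

```javascript
// Hypothetical helper: returns the delay (in seconds) of a
// <meta http-equiv="refresh"> tag, or null if the HTML has none.
function getMetaRefreshDelay(html) {
    // Find the first <meta ...> tag whose http-equiv is "refresh"
    // (the value may be quoted or bare, in any letter case).
    const tag = /<meta\b[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*>/i.exec(html);
    if (!tag) {
        return null;
    }
    // Read the leading number of its content attribute, e.g. CONTENT="10".
    const content = /content\s*=\s*["']?\s*(\d+)/i.exec(tag[0]);
    return content ? Number(content[1]) : null;
}

// Shortened stand-in for the placeholder body shown in the question.
const placeholder = '<html><head><META HTTP-EQUIV=Refresh\nCONTENT="10"></head>' +
    '<body>loading</body></html>';

console.log(getMetaRefreshDelay(placeholder));                      // 10
console.log(getMetaRefreshDelay('<html><body>news</body></html>')); // null
```

Inside the `request` callback, a non-null result for `body` would suggest that the server still returned the placeholder rather than the news page.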

1

If you have experience with jQuery, there is a library, Cheerio, that lets you query the HTML in the same way. For example:

Markup example we'll be using:

<ul id="fruits">
  <li class="apple">Apple</li>
  <li class="orange">Orange</li>
  <li class="pear">Pear</li>
</ul>

First you need to load in the HTML. This step in jQuery is implicit, since jQuery operates on the one, baked-in DOM. With Cheerio, we need to pass in the HTML document.

const cheerio = require('cheerio');

const $ = cheerio.load('<ul id="fruits">...</ul>');

Selectors

$('ul .pear').attr('class')

You can probably do something like this:

request(options, (err, res, body) => {
  // `body` holds the HTML returned by the server
  const $ = cheerio.load(body);
});

https://github.com/cheeriojs/cheerio
