16

I'm fetching a page with the request library in Node.js and parsing the body using cheerio.

Calling $.html() on the parsed response body reveals that the title element for the page is:

<title>Le Relais de l'Entrec?te</title>

... when it should be:

<title>Le Relais de l'Entrecôte</title>

I've tried setting the options for the request library to include encoding: 'utf8', but that didn't seem to change anything.

How do I preserve these characters?

1
  • cheerio might also just exhibit this bug, which incorrectly outputs certain characters in certain situations Commented Aug 13, 2014 at 4:09

2 Answers

33

You can use iconv (or better iconv-lite) for the conversion itself, but to detect the encoding you should check out the charset and jschardet modules. Here's an example of them both in action:

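var request = require('request'),
    charset = require('charset'),
    jschardet = require('jschardet'),
    Iconv = require('iconv').Iconv;

request.get({url: 'http://www.example.com', encoding: 'binary'}, function(err, res, body) {
    if (err) throw err;

    // Prefer the charset declared by the response; fall back to statistical detection.
    var enc = charset(res.headers, body) || jschardet.detect(body).encoding || 'utf8';
    // 'binary' yields one character per byte, so rebuild the raw bytes first.
    var buf = Buffer.from(body, 'binary');

    // Normalize the label so that 'UTF-8', 'utf-8' and 'utf8' all compare equal.
    if (enc.toLowerCase().replace('-', '') !== 'utf8') {
        var iconv = new Iconv(enc, 'UTF-8//TRANSLIT//IGNORE');
        buf = iconv.convert(buf);
    }
    body = buf.toString('utf8');

    console.log(body);
});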
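The two detection steps above (read the declared charset, then compare it against UTF-8) can be sketched without any dependencies. This is only an illustration: `charsetFromContentType` and `isUtf8` are hypothetical helper names, not functions of the charset or jschardet modules, and the sketch looks only at the Content-Type header, not at `<meta>` tags or the bytes themselves.

```javascript
// Hypothetical helper: pull the charset label out of a Content-Type header value.
// Returns the label in lowercase, or null when no charset parameter is present.
function charsetFromContentType(contentType) {
    var match = /charset\s*=\s*"?([^\s;"]+)/i.exec(contentType || '');
    return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: treat 'UTF-8', 'utf-8' and 'utf8' as the same encoding.
function isUtf8(label) {
    return /^utf-?8$/i.test(label || '');
}

console.log(charsetFromContentType('text/html; charset=ISO-8859-1')); // iso-8859-1
console.log(isUtf8('UTF-8')); // true
```

Normalizing the label before comparing avoids running a conversion on a body that is already UTF-8 just because a detector spelled the name differently.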

4 Comments

I think this is a better answer as it takes response header into consideration.
Yes this is definitely a better answer and should be the accepted one
This should be the correct answer. It cleverly uses all available means (apart from asking the developer of the site) to detect the encoding and it succeeds!
Note that jschardet v2.2.1 will return 'UTF-8', which won't match 'utf8' after making it lowercase.
23

The page appears to be encoded as ISO-8859-1. You'll need to tell request to hand you back the raw, undecoded Buffer by passing encoding: null, and then use something like node-iconv to convert it.

If you're writing a generalized crawler, you'll have to detect the encoding of each page you encounter in order to decode it correctly. Otherwise, the following should work for your case:

var request = require('request');
var iconv = require('iconv');

request.get({
  url: 'http://www.relaisentrecote.fr',
  encoding: null // return the body as a raw Buffer instead of a decoded string
}, function(err, res, body) {
  if (err) throw err;
  // Convert the ISO-8859-1 bytes to UTF-8, then decode them into a string.
  var ic = new iconv.Iconv('iso-8859-1', 'utf-8');
  var buf = ic.convert(body);
  var utf8String = buf.toString('utf-8');
  // .. do something with utf8String ..
});
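To see why the character gets lost in the first place, here is a small dependency-free sketch. It assumes the page really is ISO-8859-1, as suggested above, and uses a hand-written byte array in place of a real response body. For this one encoding, Node's built-in 'latin1' Buffer encoding is enough to decode the bytes; other encodings still need a converter such as iconv.

```javascript
// Bytes of "Entrecôte" as an ISO-8859-1 server would send them:
// 'ô' is the single byte 0xF4.
var raw = Buffer.from([0x45, 0x6e, 0x74, 0x72, 0x65, 0x63, 0xf4, 0x74, 0x65]);

// Decoding as UTF-8 fails on 0xF4, because it is not a valid sequence here,
// so the character is replaced and the original byte is lost.
var wrong = raw.toString('utf8');

// Decoding as latin1 maps each byte to the code point of the same value.
var right = raw.toString('latin1');

console.log(wrong);
console.log(right); // Entrecôte
```

This is also why the body has to be requested as a raw Buffer: once it has been decoded as UTF-8, the original byte can no longer be recovered.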
