I am trying to parse a site, but the HTML is a mess. Can anyone with more experience parsing sites help me?

<tr>
<td><font FACE=Tahoma color='#CC0000' size=2><b>Date</b></font></td>    
<td><font FACE=Tahoma color='#CC0000' size=2><b>Place</b></font></td>
<td><font FACE=Tahoma color='#CC0000' size=2><b>Situation</b></font></td>
</tr> 

<tr><td rowspan=2>16/09/2011 10:11</td><td>New York</td><td><FONT COLOR="000000">Situation Red</font></td></tr>
<tr><td colspan=2>Optional comment hello new york</td></tr>
<tr><td rowspan=2>16/09/2011 10:08</td><td>Texas</td><td><FONT COLOR="000000">Situation Green</font></td></tr>
<tr><td colspan=2>Optional comment hello texas </td></tr>
<tr><td rowspan=1>06/09/2011 13:14</td><td>California</td><td><FONT COLOR="000000">Yellow Situation</font></td></tr>
</TABLE>

The strange part is that the comment has no column in the table header; it sits in its own row below each entry. Also, the starting point (California) has no comment. So the starting point will always look like this:

Date: 06/09/2011 13:14

Place: California

Situation: Yellow Situation

Comment: null

All other places have a comment and will look like this:

Date: 16/09/2011 10:11

Place: New York

Situation: Situation Red

Comment: Optional comment hello new york.

I have tried some approaches, but I don't have much experience with node.js, and even less with HTML parsing. I need a starting point for parsing messy markup like this.

1 Answer

I built a distributed scraper in node.js. I found it easier to parse HTML that had first been run through HTML Tidy.

Here is a module to run html through tidy:

var spawn = require('child_process').spawn;
var fs = require('fs');

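// Use a plain object instead of assigning to `this`: inside a non-strict
// function call `this` is the global object, so the original leaked `html`
// onto the global scope.
var tidy = {
    // Pipe `str` through the tidy command-line tool and pass the cleaned
    // markup to `callback`. The `tidy` binary must be on the PATH.
    html: function (str, callback) {
        var buffer = '';
        var error = '';

        if (!callback) {
            throw new Error('No callback provided for tidy.html');
        }
        var ptidy = spawn(
            'tidy',
            [
                '--quiet',
                'y',
                '--force-output',
                'y',
                '--bare',
                'y',
                '--break-before-br',
                'y',
                '--hide-comments',
                'y',
                '--output-xhtml',
                'y',
                '--fix-uri',
                'y',
                '--wrap',
                '0'
            ]);

        // Collect the cleaned markup as it arrives.
        ptidy.stdout.on('data', function (data) {
            buffer += data;
        });

        // Warnings from tidy are collected here but not reported.
        ptidy.stderr.on('data', function (data) {
            error += data;
        });

        // 'close' fires once the stdio streams have ended, so the buffer is
        // complete; 'exit' can fire before the last 'data' event.
        ptidy.on('close', function (code) {
            //fs.writeFileSync('last_tidy.html', buffer, 'binary');
            callback(buffer);
        });

        ptidy.stdin.write(str);
        ptidy.stdin.end();
    }
};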

module.exports = tidy;

Example (if saved as tidy.js):

var tidy = require('./tidy.js');
tidy.html('<table><tr><td>badly formatted html</tr>', function(html) { console.log(html); });

Result:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy for Linux/x86 (vers 25 March 2009), see www.w3.org" />
<title></title>
</head>
<body>
<table>
<tr>
<td>badly formatted html</td>
</tr>
</table>
</body>
</html>
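
Once every tag is closed, the rows from the question can be paired up in a single pass: a row with three cells whose first cell looks like a date starts a record, and a following row with a single colspan cell becomes that record's comment. Here is a minimal sketch of that idea, not a definitive implementation. The names `cellText` and `parseRows` are made up for illustration and are not part of the tidy module above, and the regexes assume a simple table with no nested tables. For anything more involved, a real HTML parser is the safer choice.

```javascript
// Hypothetical helper: strip tags and collapse whitespace in one cell.
function cellText(cellHtml) {
    return cellHtml.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: turn well-formed table rows into records.
// Assumes every <tr> and <td> is closed (e.g. after running tidy).
function parseRows(html) {
    var records = [];
    var datePattern = /^\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}$/;
    var rowRe = /<tr\b[^>]*>([\s\S]*?)<\/tr>/gi;
    var rowMatch;

    while ((rowMatch = rowRe.exec(html)) !== null) {
        var cells = [];
        var cellRe = /<td\b([^>]*)>([\s\S]*?)<\/td>/gi;
        var cellMatch;
        while ((cellMatch = cellRe.exec(rowMatch[1])) !== null) {
            cells.push({ attrs: cellMatch[1], text: cellText(cellMatch[2]) });
        }

        if (cells.length === 3 && datePattern.test(cells[0].text)) {
            // Data row: date, place, situation. The header row is skipped
            // because its first cell is not a date.
            records.push({
                date: cells[0].text,
                place: cells[1].text,
                situation: cells[2].text,
                comment: null
            });
        } else if (cells.length === 1 && /colspan/i.test(cells[0].attrs) &&
                   records.length > 0) {
            // Comment row: belongs to the record just before it.
            records[records.length - 1].comment = cells[0].text;
        }
    }
    return records;
}

// Usage with rows shaped like the ones in the question.
var sampleRows =
    '<tr><td><font color="#CC0000"><b>Date</b></font></td>' +
    '<td><b>Place</b></td><td><b>Situation</b></td></tr>' +
    '<tr><td rowspan="2">16/09/2011 10:11</td><td>New York</td>' +
    '<td><font color="000000">Situation Red</font></td></tr>' +
    '<tr><td colspan="2">Optional comment hello new york</td></tr>' +
    '<tr><td rowspan="1">06/09/2011 13:14</td><td>California</td>' +
    '<td><font color="000000">Yellow Situation</font></td></tr>';

console.log(parseRows(sampleRows));
```

The entry without a comment row (California in the question) simply keeps `comment: null`, which matches the output the question asks for.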

1 Comment

I was able to get this module working using require('htmltidy/htmltidy.js') - just FYI for anyone else who downloads it from npm :-)
