PHP Simple HTML DOM and cURL Not Working

Question

I can't get my scraper to return the specific content I'm looking for. If I return $output, I see Digg as though it were hosted on my server, so I know I'm reaching the site properly; I just can't access elements from the resulting DOM. What am I doing wrong?

<?php

include('simple_html_dom.php');


function curl_download($url) {

$ch = curl_init();                                              //creates a new cURL resource handle
curl_setopt($ch, CURLOPT_URL, "http://digg.com");               // Set URL to download
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);                 //  TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");          // Set the User-Agent header
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true );                // Should cURL return or print out the data? (true = return, false = print) 
curl_setopt($ch, CURLOPT_HEADER, 0);                            // Include header in result? (0 = no, 1 = yes)
curl_setopt($ch, CURLOPT_TIMEOUT, 10);                          // Timeout in seconds


$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);

}       

$html = new simple_html_dom();
$html->load($output, true, false );


    foreach($html->find('div.digg-story__kicker') as $article) {
        $article_title = $article->find('.digg-story__kicker')->innertext;
        return $article_title;
    }

    echo $article_title;


?>

Edit: Okay, dumb mistake, I'm calling the function now:

$html = curl_download('http://digg.com')

and if I echo $html I'm seeing the "mirrored site", but when I use str_get_html($html), which simple_html_dom.php says will "get html dom from string", I keep getting this error message:

Fatal error: Call to a member function str_get_html() on null in /home/andrew73124/public_html/scraper/scraper.php on line 31
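
The fatal error above says that str_get_html() was called as a method on a variable holding null. In simple_html_dom.php, str_get_html() is a standalone function, so it takes the HTML string as its argument and is not called through `->`. The sketch below reproduces the message without the library; `$parser` is a hypothetical stand-in for whatever variable was null on line 31 of scraper.php, which is not shown in the question.

```php
<?php
// Hypothetical stand-in for the variable that was null in the failing script.
$parser = null;

$message = '';
try {
    // Calling any method through a null variable throws an Error in PHP 7 and later.
    $parser->str_get_html('<div>example</div>');
} catch (Error $e) {
    $message = $e->getMessage();
}

echo $message, PHP_EOL;

// With simple_html_dom.php included, the call would instead be a plain function call:
// $dom = str_get_html($htmlString);
```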

The code snippets provided seem disjointed: there is a function curl_download, but it never gets called and it doesn't return a value either, so it is unclear where the $output variable comes from.
– Professor Abronsius, Commented Jun 7, 2017 at 22:11
Oh duh, I'm not even calling the function. Okay, so I need '$html = curl_download('digg.com');' to call the function. That returns a string, right? So now I need to convert it to a DOMDocument?
– warrenbuffering, Commented Jun 7, 2017 at 22:18
There is a double assignment of $html as a variable. Perhaps try $output=curl_download('http://digg.com') before $html = new simple_html_dom();$html->load($output, true, false );
– Professor Abronsius, Commented Jun 7, 2017 at 22:33
This works for me: <?php foreach(@DOMDocument::loadHTML(file_get_contents('http://digg.com/'))->getElementsByTagName("div") as $div){ if($div->getAttribute("class")!=='digg-story__kicker'){ continue; } var_dump($div->textContent); } Literally just that: no curl, no simple_html_dom.php, nothing else.
– hanshenrik, Commented Jun 7, 2017 at 22:40
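
The one-liner in the last comment calls DOMDocument::loadHTML() statically, which PHP 8 rejects with an Error. A minimal instance-based sketch of the same idea follows. The inline `$markup` string is made-up sample data standing in for the downloaded page, and `kicker_texts` is a hypothetical helper name, not something from the thread.

```php
<?php
// Collect the text of every <div> whose class attribute is exactly $class.
function kicker_texts(string $markup, string $class = 'digg-story__kicker'): array
{
    $dom = new DOMDocument();
    // Suppress parser warnings about imperfect real-world HTML, as the comment does with @.
    @$dom->loadHTML($markup);

    $texts = [];
    foreach ($dom->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('class') !== $class) {
            continue;
        }
        $texts[] = trim($div->textContent);
    }
    return $texts;
}

// Made-up sample markup; in the real script this would be file_get_contents() output.
$markup = '<html><body>'
        . '<div class="digg-story__kicker">First kicker</div>'
        . '<div class="other">Ignored</div>'
        . '<div class="digg-story__kicker">Second kicker</div>'
        . '</body></html>';

print_r(kicker_texts($markup));
```

Note that the exact comparison on the class attribute skips elements that carry additional classes, which is also true of the original one-liner.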

Professor Abronsius · Accepted Answer · 2017-06-08 06:40:39Z

The cURL function needed an additional setting, namely CURLOPT_FOLLOWLOCATION, and the function itself needs to return a value so that its result can be used. In the code below I return an object with both the response and the info, which allows you to test the http_code before attempting to process the response data. This uses the standard DOMDocument, but no doubt it will be easy to do with simple_html_dom.

function curl_download( $url ) {

    $ch = curl_init(); 
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );/* NEW */
    curl_setopt( $ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0" );
    curl_setopt( $ch, CURLOPT_HEADER, 0 );
    curl_setopt( $ch, CURLOPT_TIMEOUT, 10 );


    $output = curl_exec($ch);
    $info = curl_getinfo($ch);
    curl_close($ch);

    return (object)array(
        'response'  =>  $output,
        'info'      =>  $info
    );
}       


$output = curl_download( 'http://www.digg.com' );
if( $output->info['http_code']==200 ){

    libxml_use_internal_errors( true );

    $dom=new DOMDocument;

    $dom->preserveWhiteSpace = false;
    $dom->validateOnParse = false;
    $dom->standalone=true;
    $dom->strictErrorChecking=false;
    $dom->substituteEntities=true;
    $dom->recover=true;
    $dom->formatOutput=false;

    $dom->loadHTML( $output->response );

    libxml_clear_errors();

    $xp=new DOMXPath( $dom );
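    $col=$xp->query('//div[@class="digg-story__kicker"]');
    /* query() returns a DOMNodeList (or false on a bad expression), so test its length rather than empty() */
    if( $col && $col->length > 0 ){
        foreach( $col as $node )echo $node->nodeValue;
    }
} else {
    echo '<pre>',print_r($output->info,true),'</pre>';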
}

Updated the answer to include the error mitigation code offered by libxml. Weirdly, though, the code as originally written ran without issue locally before I added the libxml error handling.
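
As a rough illustration of the libxml error handling mentioned above: with libxml_use_internal_errors(true), parser complaints are buffered instead of being printed as warnings, and they can be inspected with libxml_get_errors(). The sample markup is invented, and whether a bare `&` actually produces an htmlParseEntityRef message depends on the libxml2 version in use, so the loop may print nothing.

```php
<?php
// Invented sample: the unescaped "&" in the link is the kind of markup
// that can trigger htmlParseEntityRef warnings on some libxml2 versions.
$markup = '<html><body>'
        . '<a href="/search?a=1&b=2">link</a>'
        . '<div class="digg-story__kicker">Sample kicker</div>'
        . '</body></html>';

// Buffer parser errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($markup);

// Inspect whatever the parser complained about, then empty the buffer.
foreach (libxml_get_errors() as $error) {
    echo 'line ', $error->line, ': ', trim($error->message), PHP_EOL;
}
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$kickers = [];
foreach ($xpath->query('//div[@class="digg-story__kicker"]') as $node) {
    $kickers[] = trim($node->nodeValue);
}

print_r($kickers);
```

Restoring the previous libxml_use_internal_errors() setting at the end keeps the change from leaking into other code that relies on the default behaviour.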

Without the CURLOPT_FOLLOWLOCATION set I get:

Array
(
    [url] => http://www.digg.com
    [content_type] => text/html
    [http_code] => 301
    [header_size] => 191
    [request_size] => 79
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 0.421
    [namelookup_time] => 0.031
    [connect_time] => 0.234
    [pretransfer_time] => 0.234
    [size_upload] => 0
    [size_download] => 185
    [speed_download] => 439
    [speed_upload] => 0
    [download_content_length] => 185
    [upload_content_length] => 0
    [starttransfer_time] => 0.421
    [redirect_time] => 0
    [certinfo] => Array
        (
        )
)

But with CURLOPT_FOLLOWLOCATION set as true I get

WE'VE SEEN BETTER ANIME TRIBUTE VIDEOS...<more>...RESIST THE URGE TO SUBTWEET A BAD APPLE
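
The 301 in the first dump is the signal that the request stopped at a redirect instead of reaching the page. One way to make that explicit is to classify the http_code before parsing anything. `classify_response` is a hypothetical helper, and the array passed to it below copies a few values from that dump.

```php
<?php
// Classify a curl_getinfo() result by its HTTP status code.
function classify_response(array $info): string
{
    $code = (int) ($info['http_code'] ?? 0);
    if ($code >= 200 && $code < 300) {
        return 'ok';
    }
    if ($code >= 300 && $code < 400) {
        // Nothing was followed: enable CURLOPT_FOLLOWLOCATION or request the new URL.
        return 'redirect';
    }
    return 'error';
}

// Values taken from the dump above (request made without CURLOPT_FOLLOWLOCATION).
$info = [
    'url'            => 'http://www.digg.com',
    'http_code'      => 301,
    'redirect_count' => 0,
];

echo classify_response($info), PHP_EOL;
```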

I get this error when I try to run your code verbatim: Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 126 in /home/andrew73124/public_html/scraper/scraper.php on line 32. Thank you all for helping, I really appreciate it! Maybe I need to look into it more. Is there a good, extensive resource covering scraping sites from beginning to end using only the cURL method?

Steve · Accepted Answer · 2017-06-07 22:10:54Z

1

Your loop is odd: you are already looping over the title elements, so just access the innertext property of each one:

foreach($html->find('div.digg-story__kicker') as $article) {

    echo $article->innertext;

}


1 Comment

warrenbuffering Over a year ago

Whoops, that was just supposed to be foreach($html->find('div.digg-story') as $article) {. Even when I have that correct, it tells me I'm returning null, which makes me think it isn't converting the returned string to a DOMDocument, right?
