I can't get my scraper to return the specific content I'm looking for. If I output $output, I see Digg as though it were hosted on my server, so I know I'm accessing the site properly; I'm just not able to then access elements from the new DOM. What am I doing wrong?

<?php

include('simple_html_dom.php');


function curl_download($url) {

$ch = curl_init();                                              //creates a new cURL resource handle
curl_setopt($ch, CURLOPT_URL, "http://digg.com");               // Set URL to download
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);                 //  TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");          // Set the User-Agent header
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true );                // Should cURL return or print out the data? (true = return, false = print) 
curl_setopt($ch, CURLOPT_HEADER, 0);                            // Include header in result? (0 = no, 1 = yes)
curl_setopt($ch, CURLOPT_TIMEOUT, 10);                          // Timeout in seconds


$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);

}       

$html = new simple_html_dom();
$html->load($output, true, false );


    foreach($html->find('div.digg-story__kicker') as $article) {
        $article_title = $article->find('.digg-story__kicker')->innertext;
        return $article_title;
    }

    echo $article_title;


?>

Edit: Okay, dumb mistake, I'm calling the function now:

$html = curl_download('http://digg.com')

and if I echo $html I'm seeing the "mirrored site", but when I use str_get_html($html), which simple_html_dom.php says will "get html dom from string", I keep getting this error message:

Fatal error: Call to a member function str_get_html() on null in /home/andrew73124/public_html/scraper/scraper.php on line 31

  • Digg still exists, wow. Commented Jun 7, 2017 at 22:11
  • The code snippets provided seem disjointed: there is a function curl_download, but it never gets called and it doesn't return any value either, so it is unclear where the $output variable comes from. Commented Jun 7, 2017 at 22:11
  • Oh duh, I'm not even calling the function. Okay, so I need '$html = curl_download('digg.com');' to call the function. That returns a string, right? So now I need to convert it to a DOMDocument? Commented Jun 7, 2017 at 22:18
  • There is a double assignment of $html as a variable. Perhaps try $output=curl_download('http://digg.com') before $html = new simple_html_dom();$html->load($output, true, false ); Commented Jun 7, 2017 at 22:33
  • This works for me: <?php foreach(@DOMDocument::loadHTML(file_get_contents('http://digg.com/'))->getElementsByTagName("div") as $div){ if($div->getAttribute("class")!=='digg-story__kicker'){ continue; } var_dump($div->textContent); } - literally just that, no curl, no simple_html_dom.php, no nothing, just that. Commented Jun 7, 2017 at 22:40

2 Answers

The curl function needed an additional setting, namely CURLOPT_FOLLOWLOCATION, and the function itself needs to return a value so that its result can be used. In the code below I return an object with both the response and the info, which lets you test the http_code before attempting to process the response data. This uses the standard DOMDocument, but no doubt adapting it to simple_html_dom will be easy.

function curl_download( $url ) {

    $ch = curl_init(); 
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );/* NEW */
    curl_setopt( $ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0" );
    curl_setopt( $ch, CURLOPT_HEADER, 0 );
    curl_setopt( $ch, CURLOPT_TIMEOUT, 10 );


    $output = curl_exec($ch);
    $info = curl_getinfo($ch);
    curl_close($ch);

    return (object)array(
        'response'  =>  $output,
        'info'      =>  $info
    );
}       


$output = curl_download( 'http://www.digg.com' );
if( $output->info['http_code']==200 ){

    libxml_use_internal_errors( true );

    $dom=new DOMDocument;

    $dom->preserveWhiteSpace = false;
    $dom->validateOnParse = false;
    $dom->standalone=true;
    $dom->strictErrorChecking=false;
    $dom->substituteEntities=true;
    $dom->recover=true;
    $dom->formatOutput=false;

    $dom->loadHTML( $output->response );

    libxml_clear_errors();

    $xp=new DOMXPath( $dom );
    $col=$xp->query('//div[@class="digg-story__kicker"]');
    /* query() returns false on error, otherwise a (possibly empty) DOMNodeList */
    if( $col && $col->length > 0 ){
        foreach( $col as $node )echo $node->nodeValue, PHP_EOL;
    }
} else {
    echo '<pre>',print_r($output->info,true),'</pre>';
}

Updated answer to include the error mitigation code offered by libxml. Weirdly, though, the code as originally written ran without issue locally before I added the libxml error handling code.
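
To illustrate what the libxml error handling does, here is a minimal sketch that parses a small inline string instead of the live page. The sample markup is hypothetical; its bare "&" is the kind of input that can make the parser complain about entities. With libxml_use_internal_errors( true ) those messages are collected rather than emitted as PHP warnings, and how many are reported may vary with the libxml version.

```php
<?php
// Hypothetical sample markup: the bare "&" is the kind of input that can
// trigger entity-related parser messages from libxml.
$sample = '<html><body><div class="digg-story__kicker">Cats & dogs</div></body></html>';

// Collect parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($sample);

// Keep a copy of the collected messages, then reset the error buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Same XPath query as in the answer above.
$xp = new DOMXPath($dom);
$kickers = [];
foreach ($xp->query('//div[@class="digg-story__kicker"]') as $node) {
    $kickers[] = trim($node->nodeValue);
}

echo implode(PHP_EOL, $kickers), PHP_EOL;
echo count($errors), ' parser message(s) collected', PHP_EOL;
```

The text is still extracted even though the markup is not strictly valid, which is usually what you want when scraping.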

Without the CURLOPT_FOLLOWLOCATION set I get:

Array
(
    [url] => http://www.digg.com
    [content_type] => text/html
    [http_code] => 301
    [header_size] => 191
    [request_size] => 79
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 0.421
    [namelookup_time] => 0.031
    [connect_time] => 0.234
    [pretransfer_time] => 0.234
    [size_upload] => 0
    [size_download] => 185
    [speed_download] => 439
    [speed_upload] => 0
    [download_content_length] => 185
    [upload_content_length] => 0
    [starttransfer_time] => 0.421
    [redirect_time] => 0
    [certinfo] => Array
        (
        )
)

But with CURLOPT_FOLLOWLOCATION set to true I get:

WE'VE SEEN BETTER ANIME TRIBUTE VIDEOS...<more>...RESIST THE URGE TO SUBTWEET A BAD APPLE

1 Comment

I get this error when I try to run your code verbatim: Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 126 in /home/andrew73124/public_html/scraper/scraper.php on line 32. Thank you all for helping, I really appreciate it! Maybe I need to look into it more. Is there a good, extensive resource covering scraping sites from beginning to end using only the cURL method?

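Your loop is weird: you are already looping over the kicker elements, so there is no need to call find() again inside the loop (and find() without an index returns an array, which has no innertext property). Just access the innertext property of each element: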

foreach($html->find('div.digg-story__kicker') as $article) {

    echo $article->innertext;

}
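
For comparison, the same single-loop pattern can be sketched with PHP's built-in DOM extension, which needs no extra library. This is only an illustration under stated assumptions: kicker_texts() is a hypothetical helper name, the sample markup is made up, and the class name is assumed to contain no quote characters. The XPath expression matches the class as a whole token, so elements that carry additional classes are also found.

```php
<?php
/**
 * Return the trimmed text of every <div> whose class list contains $class.
 * kicker_texts() is a hypothetical helper name, not part of any library.
 */
function kicker_texts(string $html, string $class = 'digg-story__kicker'): array
{
    // Collect parser errors quietly, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Match the class as a whole token so "digg-story__kicker featured" also matches.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $texts = [];
    foreach ((new DOMXPath($dom))->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Hypothetical sample markup standing in for the downloaded page.
$sample = '<div class="digg-story__kicker">First</div>'
        . '<div class="digg-story__kicker featured">Second</div>'
        . '<div class="other">Ignored</div>';

// One loop over the matched elements, reading the text of each.
foreach (kicker_texts($sample) as $text) {
    echo $text, PHP_EOL;
}
```

In real use, the string returned by the download function would be passed in place of $sample.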

1 Comment

Whoops, that was just supposed to be 'foreach($html->find('div.digg-story') as $article) {'. Even when I have that correct, it tells me I'm returning 'null', which makes me think it isn't converting the returned string to a DOMDocument, right?
