How to extract data from an HTML table using PHP

Question

I keep trying different methods of extracting the data from the HTML table, such as XPath. The tables do not contain any classes, so I am not sure how to use XPath without a class or ID. The data is retrieved from an RSS XML file. I am currently using DOM. After I extract the data, I will try to sort the tables by Job Title.

Here is my PHP code:

$html='';
$xml= simplexml_load_file($url) or die("ERROR: Cannot connect to url\n check if report still exist in the Gradleaders system");

/*What we do here in this loop is retrieve all content inside the encoded content, 
*which includes the CDATA information. This is where the HTML and styling is included.
*/

foreach($xml->channel->item as $cont){
    // append with .= so every item's HTML is kept, not only the last one
    $html .= $cont->children('content',true)->encoded.'<br>';   //actual tag name is encoded
}

$htmlParser= new DOMDocument();     //to parse html using DOMDocument
libxml_use_internal_errors(true);   // your HTML gives parser warnings, keep them internal
$htmlParser->preserveWhiteSpace = false;   // set before loadHTML(); setting it afterwards has no effect
$htmlParser->loadHTML($html);       //Loaded the html string we took from simple xml

$tables= $htmlParser->getElementsByTagName('table');
$rows= $tables->item(0)->getElementsByTagName('tr');

foreach($rows as $row){
    $cols = $row->getElementsByTagName('td');
    // $cols is a DOMNodeList and cannot be echoed directly; print each cell's text
    foreach($cols as $col){
        echo $col->textContent, "\n";
    }
}

This is the HTML I am extracting info from:

<table cellpadding='1' cellspacing='2'>
  <tr>
    <td><b>Job Title:</b></td>
    <td>Job Example </td>
  </tr>
  <tr>
    <td><b>Job ID:</b></td>
    <td>23992</td>
  </tr>
  <tr>
    <td><b>Job Description:</b></td>
    <td>Just a job example </td>
  </tr>
  <tr>
    <td><b>Job Category:</b></td>
    <td>Work-study Position</td>
  </tr>
  <tr>
    <td><b>Position Type:</b></td>
    <td>Work-study</td>
  </tr>
  <tr>
    <td><b>Applicant Type:</b></td>
    <td>Work-study</td>
  </tr>
  <tr>
    <td><b>Status:</b></td>
    <td>Active</td>
  </tr>
  <tr>
    <td colspan='2'><b><a href='https://www.myjobs.com/tuemp/job_view.aspx?token=I1iBwstbTs2pau+SjrYfWA%3d%3d'>Click to View More</a></b></td>
  </tr>
</table>

Well, I need to parse all the data inside the table. I have many tables like this since this is an RSS feed. The whole goal is to be able to reorganize all the tables into alphabetical order according to the Job Title.
– Jose Ortiz, Commented May 13, 2016 at 17:34
Do you need the text or the HTML inside the table? Please update your question with a sample of the desired output.
– Pedro Lobito, Commented May 13, 2016 at 17:38
I will need the HTML. I just need to be able to grab the td tag to see what Job Title it is, so I can sort accordingly. I will update.
– Jose Ortiz, Commented May 13, 2016 at 17:40

Pedro Lobito · Accepted Answer · 2018-01-23 02:26:37Z

8

You can use XPath to query('//td') and retrieve each td's HTML using C14N(), something like:

$dom = new DOMDocument();
$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach($x->query('//td') as $td){
    echo $td->C14N();
    // if you just need the text, use:
    //echo $td->textContent;
}

Output:

<td><b>Job Title:</b></td>
<td>Job Example </td>
<td><b>Job ID:</b></td>
...

C14N() returns canonicalized nodes as a string, or FALSE on failure.
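
C14N() serializes canonical XML. If plain HTML markup is preferred, DOMDocument::saveHTML() also accepts a node argument and returns just that node's markup. A minimal sketch, assuming a hypothetical one-row sample in $html rather than the real feed:

```php
<?php
// Hypothetical sample markup standing in for the feed's HTML.
$html = "<table><tr><td><b>Job ID:</b></td><td>23992</td></tr></table>";

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);

$cells = [];
foreach ($dom->getElementsByTagName('td') as $td) {
    // saveHTML($node) serializes only that node, as HTML
    $cells[] = $dom->saveHTML($td);
}

print_r($cells);
```

Collecting the strings in an array first makes it easy to reorder or filter them before printing.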

Update:

Another question, how can I grab individual Table Data? For example, just grab, Job ID

Use the XPath contains() function, e.g.:

foreach($x->query('//td[contains(., "Job ID:")]') as $td){
    echo $td->textContent;
}
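
If the label might carry stray whitespace, or if a substring match is too loose, the standard XPath 1.0 function normalize-space() allows an exact comparison instead. A small sketch, assuming hypothetical sample markup with padded label text:

```php
<?php
// Hypothetical sample: the "Job ID:" label is padded with spaces.
$html = "<table><tr><td><b> Job ID: </b></td><td>23992</td></tr>"
      . "<tr><td><b>Job Title:</b></td><td>Job Example </td></tr></table>";

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$x = new DOMXPath($dom);

$labels = [];
// normalize-space(.) trims and collapses whitespace before comparing
foreach ($x->query('//td[normalize-space(.) = "Job ID:"]') as $td) {
    $labels[] = trim($td->textContent);
}

print_r($labels);
```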

Update V2:

How can I get the next Table Data after that (to actually get the Job Id) ?

Use following-sibling::*[1], e.g.:

echo $x->query('//td[contains(*, "Job ID:")]/following-sibling::*[1]')->item(0)->textContent;
//23992
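
The same label-then-sibling lookup can be run once per table to reach the goal stated in the question, sorting the tables by Job Title. This is only a sketch under assumptions: the two sample tables and the $jobs array are hypothetical, and each table is assumed to hold a single "Job Title:" row.

```php
<?php
// Hypothetical sample: two job tables in arbitrary order.
$html = "<table><tr><td><b>Job Title:</b></td><td>Web Developer</td></tr></table>"
      . "<table><tr><td><b>Job Title:</b></td><td>Accountant</td></tr></table>";

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$x = new DOMXPath($dom);

$jobs = [];
foreach ($x->query('//table') as $table) {
    // Relative query (leading "."), evaluated with $table as the context node.
    $cell = $x->query(
        './/td[contains(., "Job Title:")]/following-sibling::td[1]',
        $table
    )->item(0);

    $jobs[] = [
        'title' => $cell ? trim($cell->textContent) : '',
        'html'  => $dom->saveHTML($table), // keep the table's markup for output
    ];
}

// Case-insensitive alphabetical sort by title.
usort($jobs, function ($a, $b) {
    return strcasecmp($a['title'], $b['title']);
});

foreach ($jobs as $job) {
    echo $job['html'], "\n";
}
```

Passing the table as the second argument to query() keeps each lookup inside its own table, so titles from other tables are never picked up by mistake.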

edited Jan 23, 2018 at 2:26

answered May 13, 2016 at 17:36

Pedro Lobito


5 Comments

Jose Ortiz Over a year ago

Excuse me, disregard my last message. Thank you so much. I've been researching for a week to solve this. Can you direct me to some good resources for this type of parsing? Another question: how can I grab individual Table Data? For example, just grab the Job ID?

Jose Ortiz Over a year ago

I'm sorry for asking so many questions, I just feel like you are the best resource I have come across here. My last question is about grabbing individual table data, such as Job ID: how can I get the next Table Data after that (to actually get the Job ID)? As I said before, I have many tables, and each Job ID is unique, so how do I go to the next table data in the table?

Pedro Lobito Over a year ago

NP, check the new update. I'll go grab some food now ;) GL

Keith Tyler Over a year ago

These are some really esoteric and heady XPaths and on top of it, not very portable at all. Woe to he who has to do this in more than one language -- or if the developer changes the name of a field, or adds a space between field name and colon, or eschews colons for something else...

Pedro Lobito Over a year ago

Which "other one language" are you talking about?

Keith Tyler · Answer · 2016-05-13 17:38:32Z

-2

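$xpathParser = new DOMXPath($htmlParser);
// evaluate() returns a DOMNodeList for a node-set expression
$tableDataNodes = $xpathParser->evaluate("//table/tr/td");
for ($x = 0; $x < $tableDataNodes->length; $x++) {
    // a DOMElement cannot be echoed directly, so print its text
    echo $tableDataNodes->item($x)->textContent, "\n";
}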
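
To depend less on the exact label text, the cells of each row can instead be paired by position into an associative array. A minimal sketch, assuming every data row has exactly two td cells as in the question's table; the sample markup and the $fields name are hypothetical:

```php
<?php
// Hypothetical sample based on the table in the question.
$html = "<table>"
      . "<tr><td><b>Job Title:</b></td><td>Job Example </td></tr>"
      . "<tr><td><b>Job ID:</b></td><td>23992</td></tr>"
      . "<tr><td colspan='2'><a href='#'>Click to View More</a></td></tr>"
      . "</table>";

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);

$fields = [];
foreach ($dom->getElementsByTagName('tr') as $row) {
    $cols = $row->getElementsByTagName('td');
    if ($cols->length !== 2) {
        continue; // skip rows that are not label/value pairs
    }
    // first cell is the label (trailing colon dropped), second is the value
    $label = rtrim(trim($cols->item(0)->textContent), ':');
    $fields[$label] = trim($cols->item(1)->textContent);
}

print_r($fields);
```

With the values keyed by label, a later lookup such as $fields['Job ID'] no longer needs an XPath expression per field.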

answered May 13, 2016 at 17:38

Keith Tyler


1 Comment

Jose Ortiz Over a year ago

Thank you, I will try your solution as soon as I can Keith
