5

I keep trying different methods of extracting the data from an HTML table, such as XPath. The tables do not have any classes or ids, so I am not sure how to target them with XPath. The HTML comes from an RSS XML file, and I am currently parsing it with DOM. After I extract the data, I want to sort the tables by Job Title.

Here is my PHP code:

$html='';
$xml= simplexml_load_file($url) or die("ERROR: Cannot connect to url\n check if report still exist in the Gradleaders system");

/*What we do here in this loop is retrieve all content inside the encoded content, 
*which includes the CDATA information. This is where the HTML and styling is included.
*/

foreach($xml->channel->item as $cont){
    $html=''.$cont->children('content',true)->encoded.'<br>';   //actual tag name is encoded 
}

$htmlParser= new DOMDocument();     //to parse html using DOMDocument
libxml_use_internal_errors(true);   // your HTML gives parser warnings, keep them internal
$htmlParser->loadHTML($html);       //Loaded the html string we took from simple xml

$htmlParser->preserveWhiteSpace = false;
$tables= $htmlParser->getElementsByTagName('table');
$rows= $tables->item(0)->getElementsByTagName('tr');

foreach($rows as $row){
    $cols = $row->getElementsByTagName('td');
    echo $cols;
}

This is the HTML I am extracting info from:

<table cellpadding='1' cellspacing='2'>
  <tr>
    <td><b>Job Title:</b></td>
    <td>Job Example </td>
  </tr>
  <tr>
    <td><b>Job ID:</b></td>
    <td>23992</td>
  </tr>
  <tr>
    <td><b>Job Description:</b></td>
    <td>Just a job example </td>
  </tr>
  <tr>
    <td><b>Job Category:</b></td>
    <td>Work-study Position</td>
  </tr>
  <tr>
    <td><b>Position Type:</b></td>
    <td>Work-study</td>
  </tr>
  <tr>
    <td><b>Applicant Type:</b></td>
    <td>Work-study</td>
  </tr>
  <tr>
    <td><b>Status:</b></td>
    <td>Active</td>
  </tr>
  <tr>
    <td colspan='2'><b><a href='https://www.myjobs.com/tuemp/job_view.aspx?token=I1iBwstbTs2pau+SjrYfWA%3d%3d'>Click to View More</a></b></td>
  </tr>
</table>

4 Comments
  • What do you need to extract? Commented May 13, 2016 at 17:32
  • Well, I need to parse all the data inside the table. I have many tables like this, since this is an RSS feed. The whole goal is to be able to reorder all the tables alphabetically by Job Title. Commented May 13, 2016 at 17:34
  • Do you need the text or the HTML inside the table? Please update your question with a sample of the desired output. Commented May 13, 2016 at 17:38
  • I will need the HTML. I just need to be able to grab the td tag to see what the Job Title is, so I can sort accordingly. I will update. Commented May 13, 2016 at 17:40

2 Answers

8

You can use XPath to query //td and retrieve each td's HTML with C14N(), something like:

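$dom = new DOMDocument();
libxml_use_internal_errors(true);   // the feed's HTML triggers parser warnings; keep them internal
$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach ($x->query('//td') as $td) {
    echo $td->C14N();               // the td's markup as a string
    // if you just need the text, use:
    // echo $td->textContent;
}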

Output:

<td><b>Job Title:</b></td>
<td>Job Example </td>
<td><b>Job ID:</b></td>
...

C14N() returns canonicalized nodes as a string, or FALSE on failure.
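To see the difference between the markup and the text of a single cell, here is a minimal sketch. It assumes a one-row sample cut down from the question's table; the variable names are only illustrative.

```php
<?php
// One row in the same shape as the question's table (trimmed for brevity).
$html = "<table><tr><td><b>Job Title:</b></td><td>Job Example </td></tr></table>";

libxml_use_internal_errors(true);   // keep parser warnings internal
$dom = new DOMDocument();
$dom->loadHTML($html);

// First td in the document: the label cell.
$firstTd = (new DOMXPath($dom))->query('//td')->item(0);

$asMarkup = $firstTd->C14N();        // the node serialized as canonical XML, tags included
$asText   = $firstTd->textContent;   // text of the node and its descendants, tags dropped

echo $asMarkup, "\n";   // <td><b>Job Title:</b></td>
echo $asText, "\n";     // Job Title:
```

C14N() is handy when the markup has to be kept, while textContent is usually enough for comparing or sorting values.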


Update:

Another question, how can I grab individual Table Data? For example, just grab, Job ID

Use the XPath contains() function, e.g.:

foreach($x->query('//td[contains(., "Job ID:")]') as $td){
    echo $td->textContent;
}

Update V2:

How can I get the next Table Data after that (to actually get the Job Id) ?

Use following-sibling::*[1] to select the element right after the matched td, e.g.:

echo $x->query('//td[contains(*, "Job ID:")]/following-sibling::*[1]')->item(0)->textContent;
//23992
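
Putting these pieces together for the stated goal (reordering the tables by Job Title), one possible approach is sketched below. It is not tested against the real feed: the two sample tables and their values are made up, and it assumes every table has a "Job Title:" label cell followed by the value cell, as in the question's HTML.

```php
<?php
// Two trimmed-down tables in the shape shown in the question (sample data is invented).
$html = "<table><tr><td><b>Job Title:</b></td><td>Tutor </td></tr>"
      . "<tr><td><b>Job ID:</b></td><td>200</td></tr></table>"
      . "<table><tr><td><b>Job Title:</b></td><td>Assistant </td></tr>"
      . "<tr><td><b>Job ID:</b></td><td>100</td></tr></table>";

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$x = new DOMXPath($dom);

$jobs = [];
foreach ($x->query('//table') as $table) {
    // Passing $table as the context node and starting the path with "."
    // keeps the search inside this one table.
    $cell = $x->query('.//td[contains(., "Job Title:")]/following-sibling::td[1]', $table)->item(0);
    $jobs[] = [
        'title' => $cell ? trim($cell->textContent) : '',
        'html'  => $dom->saveHTML($table),   // the table's own markup
    ];
}

// Case-insensitive alphabetical sort on the extracted title.
usort($jobs, function ($a, $b) {
    return strcasecmp($a['title'], $b['title']);
});

$titles = array_column($jobs, 'title');
foreach ($jobs as $job) {
    echo $job['html'], "\n";
}
```

Keeping the title next to the serialized markup means the sort only touches plain strings, and the tables can be echoed back in the new order afterwards.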

5 Comments

Excuse me, disregard my last message. Thank you so much. I've been researching for a week to solve this. Can you direct me to some good resources for this type of parsing? Another question, how can I grab individual Table Data? For example, just grab Job ID?
I'm sorry for asking so many questions, I just feel like you are the best resource I have come across here. About my last question on grabbing individual table data, such as Job Id: how can I get the next Table Data after that (to actually get the Job Id)? As I said before, I have many tables, and each Job Id is unique, so how do I go to the next table data in the table?
NP, check the new update. I'll go grab some food now ;) GL
These are some really esoteric and heady XPaths and on top of it, not very portable at all. Woe to he who has to do this in more than one language -- or if the developer changes the name of a field, or adds a space between field name and colon, or eschews colons for something else...
Which "other one language" are you talking about?
-2
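$xpathParser = new DOMXPath($htmlParser);
$tableDataNodes = $xpathParser->evaluate("//table/tr/td");
for ($x = 0; $x < $tableDataNodes->length; $x++) {
    // item() returns a DOMElement, which cannot be echoed directly; print its text instead
    echo $tableDataNodes->item($x)->textContent, "\n";
}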
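
If the label and value of each row are needed together, the same kind of loop can collect them into an array. This is only a sketch: the sample rows are cut down from the question's table, the link in the last row is a placeholder, and $fields is an illustrative name.

```php
<?php
// Rows copied in shape from the question; the href is a placeholder.
$html = "<table><tr><td><b>Job Title:</b></td><td>Job Example </td></tr>"
      . "<tr><td><b>Job ID:</b></td><td>23992</td></tr>"
      . "<tr><td colspan='2'><b><a href='#'>Click to View More</a></b></td></tr></table>";

libxml_use_internal_errors(true);
$htmlParser = new DOMDocument();
$htmlParser->loadHTML($html);
$xpathParser = new DOMXPath($htmlParser);

$fields = [];
foreach ($xpathParser->query('//tr') as $row) {
    $cols = $row->getElementsByTagName('td');
    if ($cols->length < 2) {
        continue;   // skip the single-cell "Click to View More" row
    }
    // First cell is the label (minus the trailing colon), second cell is the value.
    $label = rtrim(trim($cols->item(0)->textContent), ':');
    $fields[$label] = trim($cols->item(1)->textContent);
}

print_r($fields);
```

With the sample above, $fields maps "Job Title" to "Job Example" and "Job ID" to "23992".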

1 Comment

Thank you, I will try your solution as soon as I can Keith
