This was a cool question because it prompted some thought about the DOM.
I raised a question, "How do HTML Parsers process untagged text", which @sideshowbarker commented on generously. That made me think and improved my knowledge of the DOM, especially about text nodes.
Below is a DOM-based way of finding candidate text nodes and wrapping them in 'p' tags. There are lots of text nodes that we should leave alone, like the spaces, carriage returns and line feeds we use for formatting (which an "uglifier" may strip out).
<?php
$html = file_get_contents("nodeTest.html"); // read the test file
$dom = new DOMDocument(); // a new DOM object
$dom->loadHTML($html); // build the DOM
$bodyNodes = $dom->getElementsByTagName('body'); // returns a DOMNodeList object
foreach ($bodyNodes->item(0)->childNodes as $child) // assuming 1 <body> node
{
    // this tests for an untagged text node that contains more than formatting characters
    if ( ($child->nodeType == XML_TEXT_NODE) && ( strlen( $text = trim($child->nodeValue)) > 0 ) )
    {   // it's a candidate for adding tags
        // nodeValue is entity-decoded, so re-escape the text before echoing it
        $newText = "<p>".htmlspecialchars($text)."</p>";
        echo str_replace($text, $newText, $child->nodeValue);
    }
    else
    {   // not a candidate for adding tags
        echo $dom->saveHTML($child);
    }
}
nodeTest.html contains this:
<!DOCTYPE HTML>
<html>
<body>
<h2><b>Hello World</b></h2>
<p>First</p>
Second
<p>Third</p>
fourth
<p>Third</p>
<!-- comment -->
</body>
</html>
The output is below. I did not bother echoing the outer tags. Notice that comments and formatting whitespace pass through unchanged.
<h2><b>Hello World</b></h2>
<p>First</p>
<p>Second</p>
<p>Third</p>
<p>fourth</p>
<p>Third</p>
<!-- comment -->
Obviously you need to traverse the DOM and repeat the search/replace at each element node if you wish to make the thing more general. In this example we stop at the 'body' node and process only its direct child nodes.
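A minimal sketch of what that more general traversal might look like. The function name wrapBareText and the $containers list are hypothetical and not part of the code above. The list encodes an assumption about which elements may hold bare text that deserves a 'p' wrapper, since text inside 'p', 'b' or 'h2' must be left alone. Unlike the example above, this version edits the DOM in place instead of echoing strings.

```php
<?php
// Hypothetical helper: wrap bare text found directly inside "container"
// elements in <p> tags, descending through the whole subtree.
// The default $containers list is an assumption; adjust it to your markup.
function wrapBareText(DOMNode $node, array $containers = ['body', 'div']): void
{
    // Copy the child list first: replacing nodes changes the live DOMNodeList.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text = trim($child->nodeValue);
            // Skip formatting-only nodes and text whose parent is not a container.
            if ($text !== '' && in_array(strtolower($node->nodeName), $containers, true)) {
                $p = $node->ownerDocument->createElement('p');
                // createTextNode takes care of escaping when the DOM is serialized.
                $p->appendChild($node->ownerDocument->createTextNode($text));
                $node->replaceChild($p, $child);
            }
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            wrapBareText($child, $containers); // recurse into nested elements
        }
    }
}

// Small inline test document (made up for this sketch).
$html = '<!DOCTYPE html><html><body><div>One<p>Two</p>Three</div>Four<h2><b>Five</b></h2></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$body = $dom->getElementsByTagName('body')->item(0);
wrapBareText($body);

// Serialize the children of <body>, as in the example above.
$out = '';
foreach ($body->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
echo $out, "\n";
```

Here "One", "Three" and "Four" gain 'p' wrappers, while "Two" and "Five" are left as they were because their parents are not in the container list. Note that this sketch drops the whitespace around the wrapped text, which the echo-based version keeps.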
I'm not 100% sure the code is the most efficient possible and I may think some more on that and update if I find a better way.
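One candidate for a tidier selection step, offered as a sketch rather than a measured improvement: DOMXPath can pick out only the non-blank text nodes directly under 'body', so the loop never sees formatting whitespace, comments or elements. The inline test markup and the $found array are made up for illustration, and I have not benchmarked this against the loop above.

```php
<?php
// Inline test document (an abbreviated stand-in for nodeTest.html).
$html = "<!DOCTYPE html><html><body>\n<p>First</p>\nSecond\n<p>Third</p>\nfourth\n<!-- comment -->\n</body></html>";
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// text() matches text nodes only; the [normalize-space()] predicate is false
// for nodes that contain nothing but spaces, tabs, CRs and LFs.
$found = [];
foreach ($xpath->query('//body/text()[normalize-space()]') as $textNode) {
    $found[] = trim($textNode->nodeValue);
}
print_r($found);
```

The nodes returned are the same candidates the nodeType and trim test finds, so the wrapping step can stay as it is.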