html – How do you format DOM structures in PHP?

Posted by: admin July 12, 2020

Questions:

My first guess was the PHP DOM classes (with the formatOutput property). However, I cannot get this block of HTML to be formatted and output correctly. As you can see, the indentation and alignment are not correct.

$html = '
<html>
<body>
<div>

<div>

        <div>

                <p>My Last paragraph</p>
            <div>
                            This is another text block and some other stuff.<br><br>
                Again we will start a new paragraph
                            and some other stuff
                            <br>
        </div>
</div>
        <div>
                        <div>
                            <h1>Another Title</h1>
                                                    </div>
                        <p>Some text again <b>for sure</b></p>
                </div>
</div>
<div>
    <pre><code>
    <span>&lt;html&gt;</span>
        <span>&lt;head&gt;</span>
            <span>&lt;title&gt;</span>
                Page Title
            <span>&lt;/title&gt;</span>
            <span>&lt;/head&gt;</span>
    <span>&lt;/html&gt;</span>
    </code></pre>
</div>
</div>
</body>
</html>';

header('Content-Type: text/plain');
libxml_use_internal_errors(TRUE);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadHTML($html);
print $dom->saveHTML();

Update: I added a pre-formatted code block to the example.

Answers:

Here are some improvements over @hijarian's answer:

LibXML Errors

If you don’t call libxml_use_internal_errors(true), PHP will output all HTML errors found. If you do call that function, the errors aren’t discarded; instead they go to a pile that you can inspect by calling libxml_get_errors(). The problem with this is that the pile eats memory, and DOMDocument is known to be very picky. If you’re processing lots of files in batch, you will eventually run out of memory. There are two solutions for this. The first is to clear the pile yourself:

if (libxml_use_internal_errors(true) === true)
{
    libxml_clear_errors();
}

Since libxml_use_internal_errors(true) returns the previous value of this setting (default false), this has the effect of only clearing errors if you run it more than once (as in batch processing).

The other option is to pass the LIBXML_NOERROR | LIBXML_NOWARNING flags to the loadHTML() method. Unfortunately, for reasons that are unknown to me, this still leaves a couple of errors behind.

Bear in mind that DOMDocument will always output an error (even when using internal libxml errors and setting the suppressing flags) if you pass an empty (or blankish) string to the load*() methods.
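As a rough sketch of the batch case described above, the loop below inspects and then clears the error pile after every document. The $documents array and the $errorCounts variable are made up for illustration, and how many errors libxml reports for a given fragment depends on the libxml2 version.

```php
<?php
// Hypothetical batch: in real code these would come from files.
$documents = [
    '<p>First</p>',
    '<p>Second <bogus>tag</bogus></p>',
    '<p>Third</p>',
];

// Collect errors internally instead of printing warnings.
libxml_use_internal_errors(true);

$errorCounts = [];

foreach ($documents as $index => $html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Look at what libxml collected for this document only...
    $errorCounts[$index] = count(libxml_get_errors());

    // ...then empty the pile so it cannot grow across the batch.
    libxml_clear_errors();
}

print_r($errorCounts);
```

Clearing inside the loop keeps memory flat no matter how many documents you feed it.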

Regex

The regex />\s*</im doesn’t make a whole lot of sense. It’s better to use ~>[[:space:]]++<~m, which also catches \v (vertical tabs), only replaces when whitespace actually exists (+ instead of *), never gives back what it matched (the possessive ++, which is faster), and drops the case-insensitive overhead (since whitespace has no case).

You may also want to normalize newlines to \n and strip other control characters (especially if the origin of the HTML is unknown), since a \r will come back as &#13; after saveXML(), for instance.

DOMDocument::$preserveWhiteSpace is useless and unnecessary after running the above regex.

Oh, and I don’t see the need to protect blank pre-like tags here. Whitespace-only snippets are useless.
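A small sketch of those two steps on a made-up fragment with Windows line endings (the sample string is only an illustration). Note that this collapses whitespace between any two tags, including tags inside pre blocks.

```php
<?php
// Sample input with \r\n newlines, spaces and a tab between tags.
$html = "<div>\r\n    <p>Hello</p>\r\n\t<p>World</p>\r\n</div>";

// Step 1: normalize every newline sequence (\r\n, \r, ...) to \n.
$html = preg_replace('~\R~u', "\n", $html);

// Step 2: remove whitespace-only runs that sit between two tags.
$html = preg_replace('~>[[:space:]]++<~m', '><', $html);

echo $html, "\n"; // <div><p>Hello</p><p>World</p></div>
```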

Additional Flags for loadHTML()

  • LIBXML_COMPACT – "this may speed up your application without needing to change the code"
  • LIBXML_NOBLANKS – need to run more tests on this one
  • LIBXML_NOCDATA – need to run more tests on this one
  • LIBXML_NOXMLDECL – documented, but not implemented =(

UPDATE: Setting any of these options will have the effect of not formatting the output.

On saveXML()

The DOMDocument::saveXML() method will output the XML declaration. We need to manually purge it (since the LIBXML_NOXMLDECL isn’t implemented). To do that, we could use a combination of substr() + strpos() to look for the first line break or even use a regex to clean it up.

Another option, which seems to have an added benefit, is simply doing:

$dom->saveXML($dom->documentElement);

Another thing: if you have inline tags that are empty, such as the b, i or li in:

<b class="carret"></b>
<i class="icon-dashboard"></i> Dashboard
<li class="divider"></li>

The saveXML() method will seriously mangle them (placing the following element inside the empty one), messing up your whole HTML. Tidy also has a similar problem, except that it just drops the node.

To fix that, you can use the LIBXML_NOEMPTYTAG flag along with saveXML():

$dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

This option will convert empty (aka self-closing) tags to inline tags and allow empty inline tags as well.
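To see the three variants side by side, here is a minimal sketch. The sample markup and the variable names are mine; exact whitespace in the output may differ between libxml2 versions.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p><i class="icon-dashboard"></i> Dashboard</p></body></html>');

// Whole document: starts with an XML declaration.
$whole = $dom->saveXML();

// Root element only: no declaration, but the empty <i> is collapsed to <i .../>.
$default = $dom->saveXML($dom->documentElement);

// Root element with LIBXML_NOEMPTYTAG: the empty <i> keeps its closing tag.
$noEmpty = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

echo $whole, "\n", $default, "\n", $noEmpty, "\n";
```

The collapsed form is what browsers misread as an unclosed inline tag, which is where the mangling comes from.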

Fixing HTML[5]

With all the stuff we did so far, our HTML output has two major problems now:

  1. no DOCTYPE (it was stripped when we used $dom->documentElement)
  2. empty tags are now inline tags, meaning one <br /> turned into two (<br></br>) and so on

Fixing the first one is fairly easy, since HTML5 is pretty permissive:

"<!DOCTYPE html>\n" . $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

To get our empty tags back, which are the following:

  • area
  • base
  • basefont (deprecated in HTML5)
  • br
  • col
  • command
  • embed
  • frame (deprecated in HTML5)
  • hr
  • img
  • input
  • keygen
  • link
  • meta
  • param
  • source
  • track
  • wbr

We can either use str_[i]replace in a loop:

foreach (explode('|', 'area|base|basefont|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr') as $tag)
{
    // Match the "></tag>" pair that LIBXML_NOEMPTYTAG leaves behind.
    $html = str_ireplace('></' . $tag . '>', ' />', $html);
}

Or a regular expression:

$html = preg_replace('~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~i', ' />', $html);

This is a costly operation. I haven’t benchmarked the two approaches, so I can’t tell you which one performs better, but I would guess preg_replace(). Additionally, I wasn’t sure whether the case-insensitive version is needed, since I was under the impression that XML tags are always lowercased. UPDATE: Tags are always lowercased.
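Here is a quick sketch of the regex variant on a hand-written sample of the kind of string LIBXML_NOEMPTYTAG produces (the sample and the $void variable are just for illustration). The empty b element must survive, and only the void elements should be folded back.

```php
<?php
// Hand-written sample; real input would come from saveXML(..., LIBXML_NOEMPTYTAG).
$html = '<p>Line one<br></br>Line two</p><img src="a.png" alt=""></img><b class="carret"></b>';

// Same alternation as in the text above, kept in one place for readability.
$void = 'area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr';

// Replace "></tag>" with " />" for void elements only.
$html = preg_replace('~></(?:' . $void . ')>~', ' />', $html);

echo $html, "\n";
// <p>Line one<br />Line two</p><img src="a.png" alt="" /><b class="carret"></b>
```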

On <script> and <style> Tags

These tags will always have their content (if existent) encapsulated into (uncommented) CDATA blocks, which will probably break their meaning. You’ll have to replace those tokens with a regular expression.
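For example, something along these lines strips the markers. The input is a hand-written sample of what the serialized script might look like, and the approach assumes the script itself never contains a literal ]]> sequence.

```php
<?php
// Hand-written sample of a script element as serialized with a CDATA wrapper.
$xml = '<script><![CDATA[if (a < b) { run(); }]]></script>';

// Remove the opening and closing CDATA markers, leaving the content untouched.
$clean = preg_replace('~<!\[CDATA\[|\]\]>~', '', $xml);

echo $clean, "\n"; // <script>if (a < b) { run(); }</script>
```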

Implementation

function DOM_Tidy($html)
{
    $dom = new \DOMDocument();

    if (libxml_use_internal_errors(true) === true)
    {
        libxml_clear_errors();
    }

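    // Turn non-ASCII characters into entities so loadHTML() does not misread UTF-8.
    // Note: the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2.
    $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
    // Normalize newlines to \n, then drop whitespace-only runs between tags.
    $html = preg_replace(array('~\R~u', '~>[[:space:]]++<~m'), array("\n", '><'), $html);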

    if ((empty($html) !== true) && ($dom->loadHTML($html) === true))
    {
        $dom->formatOutput = true;

        if (($html = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG)) !== false)
        {
            $regex = array
            (
                '~' . preg_quote('<![CDATA[', '~') . '~' => '',
                '~' . preg_quote(']]>', '~') . '~' => '',
                '~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~' => ' />',
            );

            return '<!DOCTYPE html>' . "\n" . preg_replace(array_keys($regex), $regex, $html);
        }
    }

    return false;
}

Answer:

Here’s a relevant comment on php.net: http://ru2.php.net/manual/en/domdocument.save.php#88630

It looks like when you load HTML from the string (like you did) DOMDocument becomes lazy and does not format anything in it.

Here’s a working solution to your problem:

// Clean your HTML by hand first
$html = preg_replace('/>\s*</im', '><', $html);
$dom = new DOMDocument;
$dom->loadHTML($html);
$dom->formatOutput = true;
$dom->preserveWhiteSpace = false; // property names are case-sensitive
// Use saveXML(), not saveHTML()
print $dom->saveXML();

Basically, you throw out the whitespace between tags and use saveXML() instead of saveHTML().
saveHTML() just does not work in this situation. However, you get an XML declaration on the first line of the output.
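If that declaration bothers you, one way to drop it is a small regex on the result. This is only a sketch: the sample markup and the $withoutDeclaration name are made up, and it assumes the declaration is the very first thing in the string.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><p>Hello</p></body></html>');
$dom->formatOutput = true;

$xml = $dom->saveXML();

// Remove a leading XML declaration, plus the whitespace after it, if present.
$withoutDeclaration = preg_replace('~^<\?xml[^>]*\?>\s*~', '', $xml);

echo $withoutDeclaration;
```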