javascript – How to close unclosed HTML Tags?

Posted by: admin April 23, 2020

Questions:

When we fetch user-entered content (possibly edited or truncated) from a database or a similar source, we may get back a fragment that contains an opening tag but no matching closing tag.

This can break the layout of the rest of the page.

Is there a client-side or server-side way of fixing this?

Answers:

Found a great answer for this one:

Use PHP 5 or later and the loadHTML() method of the DOMDocument object. It parses badly formed HTML automatically, and a subsequent call to saveHTML() (or saveXML() for XML-style output) writes the repaired markup back out. The DOM functions are documented here:

http://www.php.net/dom

Usage:

$doc = new DOMDocument();
$doc->loadHTML($yourText);
$yourText = $doc->saveHTML();

Answer:

You can use Tidy:

Tidy is a binding for the Tidy HTML clean and repair utility which allows you to not only clean and otherwise manipulate HTML documents, but also traverse the document tree.

or HTML Purifier:

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C’s specifications.

Answer:

I have a solution for PHP:

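<?php
    // Close HTML tags that were opened but never closed.
    // Missing closing tags are appended to the end of the string.
    function closetags ( $html )
    {
        # void elements never take a closing tag
        $voidtags = array ( 'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                            'input', 'link', 'meta', 'param', 'source', 'track', 'wbr' );

        # put all opened tags into an array (tag names may contain digits, e.g. h1;
        # self-closed tags such as <br /> are not matched)
        preg_match_all ( '#<([a-z][a-z0-9]*)(\s[^>]*)?(?<!/)>#i', $html, $result );
        $openedtags = array_diff ( array_map ( 'strtolower', $result[1] ), $voidtags );

        # put all closed tags into an array
        preg_match_all ( '#</([a-z][a-z0-9]*)\s*>#i', $html, $result );
        $closedtags = array_map ( 'strtolower', $result[1] );
        $len_opened = count ( $openedtags );

        # all tags are closed
        if( count ( $closedtags ) == $len_opened )
        {
            return $html;
        }
        # array_reverse() also renumbers the keys left sparse by array_diff()
        $openedtags = array_reverse ( $openedtags );

        # close tags, innermost first
        for( $i = 0; $i < $len_opened; $i++ )
        {
            $pos = array_search ( $openedtags[$i], $closedtags );
            if ( $pos === false )
            {
                $html .= "</" . $openedtags[$i] . ">";
            }
            else
            {
                unset ( $closedtags[$pos] );
            }
        }
        return $html;
    }
?>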

You can use this function like this:

   <?php echo closetags("your content <p>test test"); ?>

Answer:

For HTML fragments, working from KJS’s answer, I have had success with the following when the fragment has a single root element:

$dom = new DOMDocument();
$dom->loadHTML($string);
$body = $dom->documentElement->firstChild->firstChild;
$string = $dom->saveHTML($body);

Without a single root element, the following works (although for input such as text <p>para</p> text it seems to wrap only the first text node in p tags):

$dom = new DOMDocument();
$dom->loadHTML($string);
$bodyChildNodes = $dom->documentElement->firstChild->childNodes;

$string = '';
foreach ($bodyChildNodes as $node){
   $string .= $dom->saveHTML($node);
}

Or better yet, from PHP >= 5.4 and libxml >= 2.7.8 (2.7.7 for LIBXML_HTML_NOIMPLIED):

$dom = new DOMDocument();

// Load with no html/body tags and do not add a default dtd
$dom->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$string = $dom->saveHTML();    

Answer:

In addition to server-side tools like Tidy, you can also use the user’s browser to do some of the cleanup for you. One of the really great things about innerHTML is that it will apply the same on-the-fly repair to dynamic content as it does to HTML pages. This code works pretty well (with two caveats) and nothing actually gets written to the page:

var divTemp = document.createElement('div');
divTemp.innerHTML = '<p id="myPara">these <i>tags aren\'t <strong> closed';
console.log(divTemp.innerHTML); 

The caveats:

  1. Different browsers will return different strings. This isn’t so bad, except in the case of IE, which will return capitalized tags and will strip the quotes from tag attributes, which will not pass validation. The solution here is to do some simple clean-up on the server side. But at least the document will be properly structured XML.

  2. I suspect that you may have to put in a delay before reading the innerHTML — give the browser a chance to digest the string — or you risk getting back exactly what was put in. I just tried on IE8 and it looks like the string gets parsed immediately, but I’m not so sure on IE6. It would probably be best to read the innerHTML after a delay (or throw it into a setTimeout() to force it to the end of the queue).
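
Part of the clean-up mentioned in caveat 1 can be sketched in a few lines. The following is only a rough illustration, assuming the serialized string has upper-case tag names and nothing else to fix; it does not attempt to re-quote attributes. The name lowercaseTagNames is a hypothetical helper, not something from the answers above.

```javascript
// Hypothetical helper: lower-cases tag names in a serialized HTML string,
// e.g. the upper-case output described for old IE versions.
// Assumption: only tag names need fixing; attribute quoting is left untouched.
function lowercaseTagNames(html) {
    return html.replace(/<(\/?)([a-z][a-z0-9]*)/gi, function (whole, slash, name) {
        return '<' + slash + name.toLowerCase();
    });
}

console.log(lowercaseTagNames('<P id=myPara>these <I>tags</I> are closed</P>'));
```

The same regular expression could be used in a server-side clean-up step instead; the point is only that the tag-case difference is cheap to normalize.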

I would recommend you take @Gordon’s advice and use Tidy if you have access to it (it takes less work to implement) and failing that, use innerHTML and write your own tidy function in PHP.

And though this isn’t part of your question, as this is for a CMS, consider also using the YUI 2 Rich Text Editor for stuff like this. It’s fairly easy to implement, somewhat easy to customize, the interface is very familiar to most users, and it spits out perfectly valid code. There are several other off-the-shelf rich text editors out there, but YUI has the best license and is the most powerful I’ve seen.

Answer:

A better PHP function, from webmaster-glossar.de (me), that deletes tags that were never opened or never closed:

function closetag($html){
    $html_new = $html;
    preg_match_all ( "#<([a-z]+)( .*)?(?!/)>#iU", $html, $result1);
    preg_match_all ( "#</([a-z]+)>#iU", $html, $result2);
    $results_start = $result1[1];
    $results_end = $result2[1];
    foreach($results_start AS $startag){
        if(!in_array($startag, $results_end)){
            $html_new = str_replace('<'.$startag.'>', '', $html_new);
        }
    }
    foreach($results_end AS $endtag){
        if(!in_array($endtag, $results_start)){
            $html_new = str_replace('</'.$endtag.'>', '', $html_new);
        }
    }
    return $html_new;
}

Use this function like this:

closetag('i <b>love</b> my <strike>cat'); 
#output: i <b>love</b> my cat

closetag('i <b>love</b> my cat</strike>'); 
#output: i <b>love</b> my cat

Answer:

Erik Arvidsson wrote a nice HTML SAX parser in 2004. http://erik.eae.net/archives/2004/11/20/12.18.31/

It keeps track of the open tags, so with a minimalistic SAX handler it’s possible to insert closing tags at the correct position:

function tidyHTML(html) {
    var output = '';
    HTMLParser(html, {
        comment: function(text) {
            // filter html comments
        },
        chars: function(text) {
            output += text;
        },
        start: function(tagName, attrs, unary) {
            output += '<' + tagName;
            for (var i = 0; i < attrs.length; i++) {
                output += ' ' + attrs[i].name + '=';
                if (attrs[i].value.indexOf('"') === -1) {
                    output += '"' + attrs[i].value + '"';
                } else if (attrs[i].value.indexOf('\'') === -1) {
                    output += '\'' + attrs[i].value + '\'';
                } else { // value contains " and ' so it cannot contain spaces
                    output += attrs[i].value;
                }
            }
            output += '>';
        },
        end: function(tagName) {
            output += '</' + tagName + '>';
        }
    });
    return output;
}
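
To show what "keeping track of the open tags" means without pulling in the parser library, here is a standalone sketch of the same idea. It is not the HTMLParser code and makes several assumptions: tags are found with a simple regular expression (so comments, script contents and a ">" inside an attribute value are not handled), and the name closeTagsInPlace and the VOID_TAGS list are illustrative only.

```javascript
// Void elements never take a closing tag.
var VOID_TAGS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
                 'link', 'meta', 'param', 'source', 'track', 'wbr'];

// Hypothetical helper: re-emits a fragment and inserts a closing tag
// wherever an element was left open, using a stack of open tag names.
function closeTagsInPlace(html) {
    var tagPattern = /<(\/?)([a-z][a-z0-9]*)(?:\s[^>]*)?>/gi;
    var open = [];   // stack of currently open tag names
    var output = '';
    var last = 0;    // index just past the previous tag
    var match;

    // Emit closing tags until only `depth` elements remain open.
    function closeDownTo(depth) {
        while (open.length > depth) {
            output += '</' + open.pop() + '>';
        }
    }

    while ((match = tagPattern.exec(html)) !== null) {
        output += html.slice(last, match.index); // text before this tag
        last = tagPattern.lastIndex;
        var name = match[2].toLowerCase();
        if (match[1] === '/') {
            var index = open.lastIndexOf(name);
            if (index !== -1) {
                closeDownTo(index + 1); // close anything still open inside it
                open.pop();
                output += match[0];
            }
            // a closing tag with no matching opener is dropped
        } else {
            output += match[0];
            var selfClosed = /\/>$/.test(match[0]);
            if (VOID_TAGS.indexOf(name) === -1 && !selfClosed) {
                open.push(name);
            }
        }
    }
    output += html.slice(last); // trailing text
    closeDownTo(0);             // close whatever is still open
    return output;
}

console.log(closeTagsInPlace('<p><b>bold</p> after'));
```

Unlike appending every missing tag at the end of the string, the stack lets the closing tag for the b element go in before the closing p tag that would otherwise leave it dangling.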

Answer:

I used the native DOMDocument method, but with a few improvements for safety.

Note that the other answers that use DOMDocument do not handle HTML fragments that start with a text node, such as

This is a <em>HTML</em> strand

The above will actually result in

<p>This is a <em>HTML</em> strand

My solution is below:

function closeDanglingTags($html) {
    if (strpos($html, '<') !== false || strpos($html, '>') !== false) {
        // There are definitely HTML tags (strpos() returns 0, which is falsy, for a match at the very start)
        $wrapped = false;
        if (strpos(trim($html), '<') !== 0) {
            // The HTML starts with a text node. Wrap it in an element with an id to prevent the software wrapping it with a <p>
            //  that we know nothing about and cannot safely retrieve
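            // Plain-string equivalent of the author's cHE::getDivHtml() helper, which is not shown in this answer
            $html = '<div id="closedanglingtagswrapper">' . $html . '</div>';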
            $wrapped = true;
        }
        $doc = new DOMDocument();
        $doc->encoding = 'utf-8';
        @$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
        if ($doc->firstChild) {
            // Test whether the firstchild is definitely a DOMDocumentType
            if ($doc->firstChild instanceof DOMDocumentType) {
                // Remove the added doctype
                $doc->removeChild($doc->firstChild);
            }
        }
        if ($wrapped) {
            // The contents originally started with a text node and were wrapped in a div#closedanglingtagswrapper. Take the contents
            //  out of that div
            $node = $doc->getElementById('closedanglingtagswrapper');
            $children = $node->childNodes;  // The contents of the div. Equivalent to $('selector').children()
            $doc = new DOMDocument();   // Create a new document to add the contents to, equiv. to "var doc = $('<html></html>');"
            foreach ($children as $childnode) {
                $doc->appendChild($doc->importNode($childnode, true)); // E.g. doc.append()
            }
        }
        // Remove the added html,body tags
        return trim(str_replace(array('<html><body>', '</body></html>'), '', html_entity_decode($doc->saveHTML())));
    } else {
        return $html;
    }
}