
Unescape HTML entities in Javascript?

Posted by: admin December 6, 2017

Questions:

I have some Javascript code that communicates with an XML-RPC backend.
The XML-RPC returns strings of the form:

&lt;img src='myimage.jpg'&gt;

However, when I use the Javascript to insert the strings into HTML, they render literally. I don’t see an image, I literally see the string:

<img src='myimage.jpg'>

My guess is that the HTML is being escaped over the XML-RPC channel.

How can I unescape the string in Javascript? I tried the techniques on this page, unsuccessfully: http://paulschreiber.com/blog/2008/09/20/javascript-how-to-unescape-html-entities/

What are other ways to diagnose the issue?
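
One way to check the guess about escaping is to inspect the raw string before inserting it into the page. A minimal sketch, assuming a hypothetical helper name `looksEscaped`; it only tests whether the string contains sequences shaped like HTML character references:

```javascript
// Hypothetical helper: reports whether a string contains something shaped
// like an HTML character reference, e.g. &lt; or &#60; or &#x3C;.
function looksEscaped(s) {
  return /&(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);/i.test(s);
}

console.log(looksEscaped("&lt;img src='myimage.jpg'&gt;")); // true
console.log(looksEscaped("<img src='myimage.jpg'>"));       // false
```

If this reports true for the value received from the backend, the string really is escaped in transit and needs decoding (or the backend needs changing).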

Answers:

I use the following method:

function htmlDecode(input){
  var e = document.createElement('div');
  e.innerHTML = input;
  // handle case of empty input
  return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}

htmlDecode("&lt;img src='myimage.jpg'&gt;"); 
// returns "<img src='myimage.jpg'>"

Basically I create a DOM element programmatically, assign the encoded HTML to its innerHTML and retrieve the nodeValue from the text node created on the innerHTML insertion. Since it just creates an element but never adds it, no site HTML is modified.

It will work cross-browser (including older browsers) and accept all the HTML Character Entities.

EDIT: The old version of this code did not work on IE with blank inputs, as evidenced here on jsFiddle (view in IE). The version above works with all inputs.

UPDATE: It appears this doesn’t work with large strings, and it also introduces a security vulnerability; see comments.

Answers:

Most answers given here have a huge disadvantage: if the string you are trying to convert isn’t trusted then you will end up with a Cross-Site Scripting (XSS) vulnerability. For the function in the accepted answer, consider the following:

htmlDecode("<img src='dummy' onerror='alert(/xss/)'>");

The string here contains an unescaped HTML tag, so instead of decoding anything the htmlDecode function will actually run JavaScript code specified inside the string.

This can be avoided by using DOMParser which is supported in all modern browsers:

function htmlDecode(input)
{
  var doc = new DOMParser().parseFromString(input, "text/html");
  return doc.documentElement.textContent;
}

// This returns "<img src='myimage.jpg'>"
htmlDecode("&lt;img src='myimage.jpg'&gt;");

// This returns ""
htmlDecode("<img src='dummy' onerror='alert(/xss/)'>");

This function is guaranteed not to run any JavaScript code as a side effect. Any HTML tags are ignored; only the text content is returned.

Compatibility note: Parsing HTML with DOMParser requires at least Chrome 30, Firefox 12, Opera 17, Internet Explorer 10, Safari 7.1 or Microsoft Edge. So all browsers without support are way past their EOL and as of 2017 the only ones that can still be seen in the wild occasionally are older Internet Explorer and Safari versions (usually these still aren’t numerous enough to bother).

Answers:

If you’re using jQuery:

function htmlDecode(value){ 
  return $('<div/>').html(value).text(); 
}

Otherwise, use Strictly Software’s Encoder Object, which has an excellent htmlDecode() function.

Answers:

Chris’ answer is nice and elegant, but it fails if value is undefined. A simple improvement makes it solid:

function htmlDecode(value) {
   return (typeof value === 'undefined') ? '' : $('<div/>').html(value).text();
}

Answers:

CMS’ answer works fine unless the HTML you want to unescape is very long (more than 65536 characters). In that case Chrome splits the inner HTML into many child nodes, each at most 65536 characters long, and you need to concatenate them. This function also works for very long strings:

function unencodeHtmlContent(escapedHtml) {
  var elem = document.createElement('div');
  elem.innerHTML = escapedHtml;
  var result = '';
  // Chrome splits innerHTML into many child nodes, each one at most 65536.
  // Whereas FF creates just one single huge child node.
  for (var i = 0; i < elem.childNodes.length; ++i) {
    result = result + elem.childNodes[i].nodeValue;
  }
  return result;
}

See this answer about innerHTML max length for more info: https://stackoverflow.com/a/27545633/694469
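
One detail of the loop above: a child that is an element rather than a text node has a `nodeValue` of `null`, and string concatenation turns that into the text "null". A minimal sketch of the concatenation step that skips such nodes, written against any array-like list of nodes so it can be tried outside a browser (the name `joinTextNodes` is illustrative, not part of the answer above):

```javascript
// Hypothetical helper: concatenate the values of text nodes only.
function joinTextNodes(nodes) {
  var result = '';
  for (var i = 0; i < nodes.length; ++i) {
    // nodeType 3 is a text node (Node.TEXT_NODE); everything else is skipped.
    if (nodes[i].nodeType === 3) {
      result += nodes[i].nodeValue;
    }
  }
  return result;
}

// In a browser (assumed usage): joinTextNodes(elem.childNodes)
```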

Answers:

Not a direct response to your question, but wouldn’t it be better for your RPC to return some structure (be it XML or JSON or whatever) with those image data (urls in your example) inside that structure?

Then you could just parse it in your javascript and build the <img> using javascript itself.

The structure you receive from the RPC could look like:

{"img" : ["myimage.jpg", "myimage2.jpg"]}

I think it’s better this way, as injecting code that comes from an external source into your page doesn’t look very secure. Imagine someone hijacking your XML-RPC script and putting something you wouldn’t want in there (even some JavaScript…).
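
The idea can be sketched roughly as follows, assuming the RPC response is the JSON shown above. The names `buildImages` and `doc` are hypothetical; the document is passed in as a parameter so the sketch can be tried outside a browser, and in a page you would pass `document`.

```javascript
// Hypothetical helper: turn the JSON payload into <img> elements.
// `doc` is the page's document (or any object with a createElement method).
function buildImages(json, doc) {
  var data = JSON.parse(json);
  var urls = Array.isArray(data.img) ? data.img : [];
  return urls.map(function (url) {
    var img = doc.createElement('img');
    // Assigning to src sets a property value; the string is never parsed as HTML.
    img.src = String(url);
    return img;
  });
}

// In a browser (assumed usage):
// buildImages('{"img" : ["myimage.jpg", "myimage2.jpg"]}', document)
//   .forEach(function (img) { document.body.appendChild(img); });
```

Because only URLs cross the wire, no markup from the backend is ever handed to the HTML parser.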

Answers:

This is a better way (CoffeeScript):

String::decode = ->
   $('<textarea />').html(this).text()

Usage:

"&lt;img src='myimage.jpg'&gt;".decode();

from: HTML Entity Decode

Answers:

The trick is to use the power of the browser to decode the special HTML characters, but not allow the browser to execute the results as if it was actual html… This function uses a regex to identify and replace encoded HTML characters, one character at a time.

function unescapeHtml(html) {
    var el = document.createElement('div');
    return html.replace(/\&[#0-9a-z]+;/gi, function (enc) {
        el.innerHTML = enc;
        return el.innerText
    });
}
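
The same replace-per-entity idea can also be done without touching the DOM at all, which is a different technique from the function above. A minimal sketch, assuming only numeric references and a small hand-picked table of named entities need decoding. The table `NAMED` and the name `decodeBasicEntities` are illustrative; the full HTML named-entity list is much longer.

```javascript
// Illustrative subset of named entities; not the full HTML list.
var NAMED = { lt: '<', gt: '>', amp: '&', quot: '"', apos: "'", nbsp: '\u00a0' };

function decodeBasicEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, function (match, body) {
    if (body.charAt(0) === '#') {
      var isHex = body.charAt(1) === 'x' || body.charAt(1) === 'X';
      var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range code points untouched.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Names missing from the table are left as they were.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

console.log(decodeBasicEntities("&lt;img src='myimage.jpg'&gt;")); // <img src='myimage.jpg'>
```

Since nothing is assigned to innerHTML, no markup is parsed and nothing in the input can run as script. The trade-off is that any entity missing from the table stays encoded.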

Answers:

I use this in my project. It is inspired by other answers but adds an extra safeEscape parameter, which can be useful when you deal with decorated characters:

var decodeEntities=(function(){

    var el=document.createElement('div');
    return function(str, safeEscape){

        if(str && typeof str === 'string'){

            str=str.replace(/\</g, '&lt;');

            el.innerHTML=str;
            if(el.innerText){

                str=el.innerText;
                el.innerText='';
            }
            else if(el.textContent){

                str=el.textContent;
                el.textContent='';
            }

            if(safeEscape)
                str=str.replace(/\</g, '&lt;');
        }
        return str;
    }
})();

And it’s usable like:

var label='safe <b> character &eacute;ntity</b>';
var safehtml='<div title="'+decodeEntities(label)+'">'+decodeEntities(label, true)+'</div>';

Answers:

All of the other answers here have problems.

The document.createElement('div') methods (including those using jQuery) execute any JavaScript passed into them (a security issue), and the DOMParser.parseFromString() method trims whitespace. Here is a pure JavaScript solution that has neither problem:

function htmlDecode(html) {
    var textarea = document.createElement("textarea");
    html= html.replace(/\r/g, String.fromCharCode(0xe000)); // Replace "\r" with reserved unicode character.
    textarea.innerHTML = html;
    var result = textarea.value;
    return result.replace(new RegExp(String.fromCharCode(0xe000), 'g'), '\r');
}

A textarea is used specifically to avoid executing JS code. It passes these:

htmlDecode('&lt;&amp;&nbsp;&gt;'); // returns "<& >" with non-breaking space.
htmlDecode('  '); // returns "  "
htmlDecode('<img src="dummy" onerror="alert(\'xss\')">'); // Does not execute alert()
htmlDecode('\r\n') // returns "\r\n", doesn't lose the \r like other solutions.
