
How do you convert Html to plain text?

Posted by: admin November 29, 2017

Questions:

I have snippets of Html stored in a table. Not entire pages, no tags or the like, just basic formatting.

I would like to be able to display that Html as text only, no formatting, on a given page (actually just the first 30 – 50 characters but that’s the easy bit).

How do I place the “text” within that Html into a string as straight text?

So this piece of code:

<b>Hello World.</b><br/><p><i>Is there anyone out there?</i></p>

Becomes:

Hello World. Is there anyone out there?

Answers:

If you are talking about tag stripping, it is relatively straightforward as long as you don’t have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression:

<[^>]*>
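
As a rough sketch of how that pattern might be applied in C# (the TagStripper class and StripTags method are made-up names for illustration, and the input is assumed to contain no <script> or <style> blocks):

```csharp
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper: removes anything that looks like a tag,
    // then collapses runs of whitespace into single spaces.
    public static string StripTags(string html)
    {
        // Replace each tag with a space so that words separated only by
        // markup (e.g. "World.</b><br/><p><i>Is") do not run together.
        string withoutTags = Regex.Replace(html, "<[^>]*>", " ");

        // Collapse the whitespace left behind and trim both ends.
        return Regex.Replace(withoutTags, @"\s+", " ").Trim();
    }
}
```

Called with a snippet like the one in the question, TagStripper.StripTags returns "Hello World. Is there anyone out there?". Replacing tags with a space is a design choice: it keeps block-level text apart, but it also splits a word that has a tag in the middle of it ("<b>Hel</b>lo" becomes "Hel lo").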

If you do have to worry about <script> tags and the like, then you’ll need something a bit more powerful than regular expressions because you need to track state, something more like a Context Free Grammar (CFG). Although you might be able to accomplish it with ‘Left To Right’ or non-greedy matching.

If you can use regular expressions, there are many web pages out there with good info.

If you need the more complex behaviour of a CFG, I would suggest using a third-party tool. Unfortunately, I don’t know of a good one to recommend.
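
One way the <script> problem might be approached without a full parser is to delete whole <script> and <style> elements with a non-greedy match before stripping the remaining tags. The following is only a sketch: ScriptAwareStripper is a made-up name, and a regular expression like this is still a heuristic that malformed markup can defeat.

```csharp
using System.Text.RegularExpressions;

public static class ScriptAwareStripper
{
    // Matches a whole <script>...</script> or <style>...</style> element.
    // Singleline lets '.' span line breaks, '.*?' is non-greedy, and the
    // backreference \1 pairs the closing tag with the opening one.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag.
    private static readonly Regex AnyTag = new Regex("<[^>]*>");

    // Hypothetical helper: drops script/style elements including their
    // content, then removes all other tags but keeps their text.
    public static string StripTags(string html)
    {
        string withoutBlocks = ScriptOrStyle.Replace(html, string.Empty);
        return AnyTag.Replace(withoutBlocks, string.Empty);
    }
}
```

The script content has to go first because it may itself contain a "<" (as in "a < b"), which the plain tag pattern would misread as the start of a tag.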

Answers:

The free and open source HtmlAgilityPack has in one of its samples a method that converts from HTML to plain text.

var plainText = ConvertToPlainText(html); // html is a string holding the markup

Feed it an HTML string like:

<b>hello world!</b><br /><i>it is me! !</i>

And you’ll get a plain text result like:

hello world!
it is me!

Answers:

I could not use HtmlAgilityPack, so I wrote a second-best solution for myself:

// Requires: using System; using System.Text.RegularExpressions;
private static string HtmlToPlainText(string html)
{
    const string tagWhiteSpace = @">\s+<"; // one or more whitespace characters (including line breaks) between '>' and '<'
    const string stripFormatting = @"<[^>]*(>|$)"; // any tag, even when the closing '>' is missing
    const string lineBreak = @"<br\s?/?>"; // <br>, <br/>, <br /> (matched case-insensitively)
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.IgnoreCase);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace);

    var text = html;
    // Remove whitespace and line breaks between tags
    text = tagWhiteSpaceRegex.Replace(text, "><");
    // Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    // Strip the remaining tags
    text = stripFormattingRegex.Replace(text, string.Empty);
    // Decode HTML entities last, so that escaped text such as &lt;b&gt; is not stripped as if it were a tag
    text = System.Net.WebUtility.HtmlDecode(text);

    return text;
}

Answers:

HttpUtility.HtmlEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN Documentation:

If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and >, are encoded as &lt; and &gt; for HTTP transmission.

The HttpUtility.HtmlEncode() method is declared as:

public static void HtmlEncode(
  string s,
  TextWriter output
)

Usage:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();
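
Outside ASP.NET, where no Server object is available, the same encoding could be sketched with System.Net.WebUtility instead. EncodingDemo is a made-up wrapper class used only for illustration. Note that this escapes the markup so the tags are shown literally; it does not remove them.

```csharp
using System.Net;

public static class EncodingDemo
{
    // Escapes characters such as '<', '>' and '&' as character entities.
    public static string Encode(string text)
    {
        return WebUtility.HtmlEncode(text);
    }

    // Reverses the encoding, turning entities back into characters.
    public static string Decode(string text)
    {
        return WebUtility.HtmlDecode(text);
    }
}
```

EncodingDemo.Encode("This is a <Test String>.") returns "This is a &lt;Test String&gt;.", and passing that result to EncodingDemo.Decode gives back the original string.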

Answers:

To add to vfilby’s answer, you can just perform a RegEx replace within your code; no new classes are necessary. Here it is spelled out, in case other newbies like myself stumble upon this question.

using System.Text.RegularExpressions;

Then…

private string StripHtml(string source)
{
    string output;

    // get rid of HTML tags
    output = Regex.Replace(source, "<[^>]*>", string.Empty);

    // get rid of multiple blank lines
    output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);

    return output;
}

Answers:

There is no method named ‘ConvertToPlainText’ in HtmlAgilityPack, but you can convert an HTML string to a clean string with:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
var textString = doc.DocumentNode.InnerText;
textString = Regex.Replace(textString, @"<(.|\n)*?>", string.Empty).Replace("&nbsp;", "");

That works for me, but I could not find a method named ‘ConvertToPlainText’ in HtmlAgilityPack.

Answers:

I think the easiest way is to make a ‘string’ extension method (based on what user Richard has suggested):

using System;
using System.Text.RegularExpressions;

public static class StringHelpers
{
    public static string StripHTML(this string HTMLText)
    {
        var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return reg.Replace(HTMLText, "");
    }
}

Then just use this extension method on any ‘string’ variable in your program:

var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></div>";
var yourTextString = yourHtmlString.StripHTML();

I use this extension method to convert HTML-formatted comments to plain text so that they are displayed correctly on a Crystal Report, and it works perfectly!

Answers:

If you have data that has HTML tags and you want to display it so that a person can SEE the tags, use HttpServerUtility.HtmlEncode.

If you have data that has HTML tags in it and you want the user to see the tags rendered, then display the text as is. If the text represents an entire web page, use an IFRAME for it.

If you have data that has HTML tags and you want to strip out the tags and just display the unformatted text, use a regular expression.
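
For the use case in the question (a short, text-only preview of a stored snippet), the regular-expression route might be sketched as follows. HtmlPreview and ToPreview are made-up names, and the sketch assumes simple formatting markup without <script> or <style> blocks.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlPreview
{
    // Hypothetical helper: strips tags, decodes entities, normalises
    // whitespace, and cuts the result to at most maxLength characters.
    public static string ToPreview(string html, int maxLength)
    {
        // Remove tags first, leaving a space so adjacent words stay apart.
        string text = Regex.Replace(html, "<[^>]*>", " ");

        // Decode entities only after the tags are gone, so that escaped
        // text such as &lt;b&gt; survives as the literal characters <b>.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace and trim both ends.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        return text.Length <= maxLength ? text : text.Substring(0, maxLength);
    }
}
```

With the question's example markup and a limit of 50, ToPreview returns "Hello World. Is there anyone out there?"; with a limit of 12 it returns "Hello World.".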

Answers:

The simplest way I found:

HtmlFilter.ConvertToPlainText(html);

The HtmlFilter class is located in Microsoft.TeamFoundation.WorkItemTracking.Controls.dll.

The dll can be found in a folder like this:
%ProgramFiles%\Common Files\microsoft shared\Team Foundation Server\14.0\

In VS 2015, the dll also requires a reference to Microsoft.TeamFoundation.WorkItemTracking.Common.dll, located in the same folder.

Answers:

Depends on what you mean by “html.” The most complex case would be complete web pages. That’s also the easiest to handle, since you can use a text-mode web browser. See the Wikipedia article listing web browsers, including text mode browsers. Lynx is probably the best known, but one of the others may be better for your needs.

Answers:

Three-step process for converting HTML into plain text

First, install the NuGet package for HtmlAgilityPack.
Second, create this class:

using System.IO;
using HtmlAgilityPack;

public class HtmlToText
{
    public HtmlToText()
    {
    }

    public string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach(HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText);
        }
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch(node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;

            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html));
                }
                break;

            case HtmlNodeType.Element:
                switch(node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                {
                    ConvertContentTo(node, outText);
                }
                break;
        }
    }
}

The class above is used with reference to Judah Himango’s answer.

Third, create an object of the class above and call its ConvertHtml(HTMLContent) method to convert HTML into plain text, rather than ConvertToPlainText(string html):

HtmlToText htt = new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);

Answers:

Here is my solution:

public string StripHTML(string html)
{
    var regex = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    return System.Web.HttpUtility.HtmlDecode((regex.Replace(html, "")));
}

Example:

StripHTML("<p class='test' style='color:red;'>Here is my solution:</p>");
// output -> Here is my solution:

Answers:

public static string StripTags2(string html)
{
    return html.Replace("<", "&lt;").Replace(">", "&gt;");
}

This escapes every “<” and “>” in a string. Is this what you want?