Home » Php » Ignore HTML characters when searching HTML stored content PHP/MySQL?

Ignore HTML characters when searching HTML stored content PHP/MySQL?

Posted by: admin July 12, 2020 Leave a comment

Questions:

We are storing blocks of HTML content in MySQL (it is formatted text created in CKeditor. Ckeidor adds inline CSS styles to format the text). We are storing the database in a column called “page_content”

We want a search feature that allows users to search the text (only). However we don’t want the search to return HTML characters which is what is happening right now. For example, if we search “font” we dont want the search results to return pages with the HTML styles <font>

Is there some way to ignore HTML/CSS characters when searching HTML stored content from MySQL?

How to&Answers:

Have you considered setting up a separate table for these searches? MySQL’s full text search only works with MyISAM tables so you probably don’t want to mix that the full text search with important data (unless, of course, you have some strange distaste for foreign keys and referential integrity).

The approach I’ve used in the past is basically this:

  • Set up a separate table with a simple (id, search_text) structure.
    • The id matches the id of the thing you’re searching.
    • The search_text is everything (body text, title, author name, …) that you want to search mashed into one blob of text.
  • Add full text indexing on the search table.
  • Update your database update process to construct the appropriate search_text string as plain text; this is where you would strip out the HTML and possibly apply some other mappings (such as expanding things like “A+” to something that the full text search will find).
  • When searching, you apply the same mappings as you apply to the searchable data and then go look in your search table for matches.

This solves your HTML problem, lets you search more than the HTML content easily, and allows you to tune the search results by weighting different components of the search text through repetition (e.g. if you want tags to be more important than the body text, just add the tags two or three times when building the search_text).

You’ll have to process your text to remove or ignore the HTML. This approach lets you do it just once rather than doing it on every search.

Answer:

I am guessing you want to do the searching on the database? (in that case, stripping HTML tags would mean that you would have to store the content twice).

Try looking into using MYSQLs full text search functionality, in natural language mode.

http://dev.mysql.com/doc/refman/5.0/en/fulltext-natural-language.html