I am currently designing a CMS for use on my website, and I am wondering whether there are any free libraries for creating tags based on the content. For example, the text:
I like trees. Trees are plants that have leaves. Leaves on tree can be
would produce the tags trees and leaves.
The library should be written in PHP or JavaScript.
I have found a simple library for half my task – http://www.cafewebmaster.com/get-top-100-words-keywords-text-php
I have revised the requirements for the library (thanks to guidance from @NullUserException):
Count all words (ignoring case and inflections), throw out stop words, and pick the ones with the highest frequency.
Give a higher value to words that are more specific to the subject, even if they occur less frequently. In the example, ‘multi-colored’ should rank higher because it is more specific to the subject. It should also carry a prefix indicating what it relates to (it would become leaves-multi-colored).
The algorithm should discard words with fewer than 3 characters unless they are in capitals or otherwise formatted.
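To make the first and last requirements concrete, here is a minimal sketch in JavaScript. It is only an illustration, not an existing library: the function name `extractTags`, the tiny `STOP_WORDS` list, and the `limit` parameter are all assumptions made for this example. It ignores case but does not handle inflections (so "tree" and "trees" are counted separately), and it does not attempt the subject-specific weighting.

```javascript
// Hypothetical stop-word list for illustration; a real one would be much longer.
const STOP_WORDS = new Set([
  "a", "an", "and", "are", "be", "can", "have", "i", "in", "is",
  "of", "on", "that", "the", "to",
]);

// Returns the `limit` most frequent words in `text`, most frequent first.
function extractTags(text, limit = 2) {
  const counts = new Map();
  // Split on anything that is not a letter, digit, or hyphen,
  // so hyphenated words such as "multi-colored" stay in one piece.
  const words = text.split(/[^A-Za-z0-9-]+/).filter(Boolean);
  for (const word of words) {
    const isAllCaps =
      word.length > 1 && word === word.toUpperCase() && /[A-Z]/.test(word);
    // Drop words shorter than 3 characters unless they are written in capitals.
    if (word.length < 3 && !isAllCaps) continue;
    const key = word.toLowerCase(); // count case-insensitively
    if (STOP_WORDS.has(key)) continue;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Sort by descending count; ties keep their order of first appearance.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([word]) => word);
}

const sample =
  "I like trees. Trees are plants that have leaves. Leaves on tree can be";
console.log(extractTags(sample)); // [ 'trees', 'leaves' ]
```

Handling inflections would need a stemmer on top of this, and the "formatted otherwise" exception would require access to the markup rather than the plain text.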
Are the tags in your CMS already defined? If so, you could index your text in memory and search it for every known tag, then pick the highest-scoring tags and present them to the user.
Indexing and searching could be done with http://lucene.apache.org/solr/
Edit: Note that I do suggest that your tags/keywords be defined and managed from an administration panel (as in WordPress, for example). Otherwise you would end up with thousands of keywords generated from your articles, which would not help the end user.
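For a small site, the idea of scoring a fixed tag list against the text can be tried without a search server. The sketch below is a plain in-memory stand-in, not Solr's API; the function name `scoreKnownTags` and its parameters are assumptions made for this example. It only matches single-word tags, case-insensitively.

```javascript
// Counts case-insensitive whole-word occurrences of each known tag in `text`
// and returns the tags that occur at least once, highest count first.
function scoreKnownTags(text, knownTags, limit = 5) {
  const words = text.toLowerCase().split(/[^a-z0-9-]+/).filter(Boolean);
  const counts = new Map();
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return knownTags
    .map((tag) => ({ tag, score: counts.get(tag.toLowerCase()) || 0 }))
    .filter((entry) => entry.score > 0) // drop tags that never occur
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

const article = "I like trees. Trees are plants that have leaves.";
console.log(scoreKnownTags(article, ["leaves", "trees", "cars"]));
// [ { tag: 'trees', score: 2 }, { tag: 'leaves', score: 1 } ]
```

A real index such as Solr adds stemming, phrase matching, and relevance ranking, which this sketch does not attempt.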