I need to take two text blocks with html tags and render a comparison – merge the two text blocks and then highlight what was added or removed from one version to the next.
I have used the PEAR Text_Diff class to successfully render comparisons of plain text, but when I feed it text containing HTML tags, it gets UGLY. Because the class uses word- and character-based comparison algorithms, the HTML tags get broken apart and I end up with ugly stuff like:
<p><span class="new"> </</span>p>. It slaughters the html.
Is there a way to generate a text comparison while preserving the original valid html markup?
Thanks for the help. I’ve been working on this for weeks :[
This is the best solution I could think of: find/replace each type of HTML tag with a single special non-standard character like the Apple logo (opt shift k), render the comparison with this kind of primitive markup, then convert the non-standard characters back into tags. Any feedback?
The problem seems to be that your diff program should be treating existing HTML tags as atomic tokens rather than as individual characters.
If your engine has the ability to limit itself to working on word boundaries, see if you can override the function that determines word boundaries so it recognizes and treats HTML tags as a single “word”.
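As a rough illustration of treating tags as single "words", here is a minimal sketch of such a tokenizer. The function name `tokenizeHtml` is made up for this example and is not part of Text_Diff. It assumes tags never contain a literal `>` inside an attribute value.

```php
<?php
// Hypothetical helper (not part of PEAR Text_Diff): split an HTML string
// into tokens where every tag is kept whole and the remaining text is
// split on whitespace.
function tokenizeHtml(string $html): array
{
    // The capturing group keeps each tag as its own token
    // (PREG_SPLIT_DELIM_CAPTURE); whitespace delimiters are discarded.
    // PREG_SPLIT_NO_EMPTY drops the empty pieces between adjacent delimiters.
    return preg_split(
        '/(<[^>]*>)|\s+/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

$tokens = tokenizeHtml('<p>Hello <span class="new">big</span> world</p>');
// $tokens is now:
// ['<p>', 'Hello', '<span class="new">', 'big', '</span>', 'world', '</p>']
```

Because a tag is never split, a token-level diff over these arrays cannot produce a half-wrapped tag. Since Text_Diff compares arrays of lines, one option might be to pass these token arrays in place of lines, though that is untested here.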
You could also do as you are saying and create a lookup dictionary of distinct HTML tags that replaces each with a distinct unused Unicode value (I think there are some user-defined ranges you can use). However, if you do this, any changes to markup will be treated as if they were a change to the previous or following word, because the Unicode character will become part of that word to the tokenizer. Adding a space before and after each of your token Unicode characters would keep the HTML tag changes separate from the plain text changes.
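The placeholder idea could be sketched roughly as follows. The names `encodeTags`, `decodeTags`, and `privateUseChar` are invented for this example. It assumes UTF-8 input, that the text does not already use Private Use Area characters, and that the diff step in between preserves the padding spaces.

```php
<?php
// Hypothetical helper: UTF-8 encode a code point in the Basic Multilingual
// Plane's Private Use Area (U+E000..U+F8FF), which always takes three bytes.
function privateUseChar(int $cp): string
{
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Replace each distinct tag with a distinct placeholder character, padded
// with one space on each side so a word tokenizer sees it as its own word.
// Returns [encoded text, map of placeholder => original tag].
function encodeTags(string $html): array
{
    $map = [];
    $next = 0xE000; // first code point of the Private Use Area
    $encoded = preg_replace_callback(
        '/<[^>]*>/',
        function (array $m) use (&$map, &$next) {
            $placeholder = array_search($m[0], $map, true);
            if ($placeholder === false) {
                $placeholder = privateUseChar($next++);
                $map[$placeholder] = $m[0];
            }
            return ' ' . $placeholder . ' ';
        },
        $html
    );
    return [$encoded, $map];
}

// Reverse the substitution, removing the padding spaces added by encodeTags().
function decodeTags(string $text, array $map): string
{
    foreach ($map as $placeholder => $tag) {
        $text = str_replace(' ' . $placeholder . ' ', $tag, $text);
    }
    return $text;
}

list($encoded, $map) = encodeTags('<p>Hi <b>there</b></p>');
// $encoded can now go through a word-based diff; afterwards the result
// is passed through decodeTags($result, $map) to restore the tags.
```

One limitation of this sketch: tags with different attributes get different placeholders, so a changed attribute shows up as one tag being removed and another added.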
Simple Diff, by Paul Butler, looks as though it’s designed to do exactly what you need: http://github.com/paulgb/simplediff/blob/5bfe1d2a8f967c7901ace50f04ac2d9308ed3169/simplediff.php
Notice in his php code that there’s an html wrapper: htmlDiff($old, $new)
(His blog post on it: http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/)
What about running an HTML tidier / formatter on each block first? This would give both blocks a standard “structure”, which your diff might find easier to swallow.
Try running your HTML blocks through htmlentities() first.
That should convert all of your “<”s and “>”s into their corresponding entity codes, perhaps fixing your problem.
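// Example:
require_once 'Text/Diff.php';
require_once 'Text/Diff/Renderer.php';

$html_1 = "<html><head></head><body>Something</body></html>";
$html_2 = "<html><head></head><body><p id='abc'>Something Else</p></body></html>";

// Below code adapted from http://www.go4expert.com/forums/showthread.php?t=4189.
// Not sure if/how it works exactly.
// Text_Diff compares arrays of lines, so each escaped string is split on newlines first.
$diff = new Text_Diff('auto', array(
    explode("\n", htmlentities($html_1)),
    explode("\n", htmlentities($html_2)),
));
$renderer = new Text_Diff_Renderer();
echo $renderer->render($diff);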
A copy of my own answer from here.
The following features are really nice:
- Works with badly formed HTML that can be found “in the wild”.
- The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
- In addition to the default visual diff, HTML source can be diffed coherently.
- Provides easy to understand descriptions of the changes.
- The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.