How to assess text similarity?

How to assess text similarity? - El Forum - 11-08-2010

[eluser]Steven Ross[/eluser]
I have a text A (the needle) and I want to find out how similar it is to each of several texts B (the haystack). To simplify processing, the documents in the haystack could be combined into one long text. By "similar" I mean: does the needle contain one or more text fragments (say, longer than 100 characters) that also appear verbatim in the haystack?

I could loop through the needle, cutting out 100-character substrings and moving forward one character each time, then pass each substring to strpos to look for it in the haystack. If there's a match, I can expand the substring to see how many characters beyond the first 100 also match.
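
For reference, the loop described above could look roughly like the following. This is only a sketch, assuming PHP 7 or later for the type declarations; the function name find_verbatim_matches() and its parameters are made up for illustration and do not come from any library.

```php
<?php
// Sketch of the sliding-window idea: take a window of $minLen characters
// from the needle, look for it in the haystack with strpos(), and extend
// the match while both texts keep agreeing.
// find_verbatim_matches() is a hypothetical helper name.
function find_verbatim_matches(string $needle, string $haystack, int $minLen = 100): array
{
    $matches     = [];
    $needleLen   = strlen($needle);
    $haystackLen = strlen($haystack);
    $i           = 0;

    while ($i + $minLen <= $needleLen) {
        $window = substr($needle, $i, $minLen);
        $pos    = strpos($haystack, $window);

        if ($pos === false) {
            $i++; // no match: slide the window forward by one character
            continue;
        }

        // Match found: grow it one character at a time.
        $len = $minLen;
        while ($i + $len < $needleLen
            && $pos + $len < $haystackLen
            && $needle[$i + $len] === $haystack[$pos + $len]) {
            $len++;
        }

        $matches[] = [
            'needle_offset'   => $i,
            'haystack_offset' => $pos,
            'length'          => $len,
        ];

        $i += $len; // continue after the matched fragment
    }

    return $matches;
}
```

Note that strpos() only reports the first occurrence of the window, so if the same fragment appears several times in the haystack, the extended match found here is not necessarily the longest one.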

But this seems slow and cumbersome. It also gets tripped up by double spaces, line breaks, etc. Is there a faster, more elegant algorithm? Ideally, it would find matches even if a word in the substring has changed, perhaps by assigning percentage similarities. For example, "abxdefghij" would not match "abcdefghij" at all with my strpos solution above, but a better algorithm could report a 90% match.

Any ideas?
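
Two of the concerns raised above, stray whitespace and "near" matches, can be sketched separately. The snippet below is an illustration only: normalize_whitespace() and fragment_similarity() are hypothetical helper names, and using PHP's built-in levenshtein() edit distance as the basis for the percentage is an assumption, not something proposed in the thread. As far as I know, before PHP 8.0 levenshtein() returned -1 for strings longer than 255 characters, so this only suits short fragments on older versions.

```php
<?php
// Collapse any run of whitespace (spaces, tabs, line breaks) into a single
// space so that formatting differences do not break verbatim comparison.
// Hypothetical helper name.
function normalize_whitespace(string $text): string
{
    // preg_replace() returns null on error; fall back to the input then.
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

// Percentage similarity of two short fragments, based on edit distance:
// 100 means identical, 0 means every character would have to change.
// Hypothetical helper name.
function fragment_similarity(string $a, string $b): float
{
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0; // two empty strings are treated as identical
    }

    $distance = levenshtein($a, $b);

    return 100 * ($maxLen - $distance) / $maxLen;
}

echo normalize_whitespace("some  text\n\nwith   gaps "), "\n"; // prints "some text with gaps"
echo fragment_similarity('abxdefghij', 'abcdefghij'), "\n";     // prints 90
```

With one substituted character out of ten, the edit distance is 1, which gives the 90% figure from the example in the question.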

How to assess text similarity? - El Forum - 11-08-2010

[eluser]bretticus[/eluser]
I've never used it, but PHP has had a built-in function for this for some time now: similar_text().
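
A minimal usage sketch, using the two strings from the original question. similar_text() returns the number of matching characters and, if a third argument is passed by reference, fills it with a percentage. The variable names here are just for illustration.

```php
<?php
$needle   = 'abxdefghij';
$haystack = 'abcdefghij';

// The third argument is passed by reference and receives the percentage.
$percent = 0.0;
$common  = similar_text($needle, $haystack, $percent);

printf("%d matching characters, %.1f%% similar\n", $common, $percent);
// prints "9 matching characters, 90.0% similar"
```

The PHP manual describes the algorithm as cubic in the string length, so it is probably better suited to comparing candidate fragments than to scanning a whole haystack in one call.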