11-08-2010, 07:29 PM

[eluser]Steven Ross[/eluser]
I have a text A (the needle) and I want to find out how similar it is to each of several texts B (the haystack). To simplify processing, the documents in the haystack could be combined into one long document. By "similar" I mean: does the needle contain one or more text fragments (say, longer than 100 characters) that also appear verbatim in the haystack?

I could loop through the needle, cutting out 100-character substrings and moving forward one character each time, and pass each substring to strpos to look for it in the haystack. If there's a match, I can then extend the substring to see how many characters beyond the first 100 also match.
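A rough sketch of that loop, to make the idea concrete. The function name findVerbatimMatches() and the $minLen parameter are made up for illustration, and since strpos only returns the first occurrence, the extended match is not guaranteed to be the longest one available.

```php
<?php
// Brute-force sketch of the sliding-window idea described above.
// findVerbatimMatches() and $minLen are hypothetical names, not an existing API.
function findVerbatimMatches(string $needle, string $haystack, int $minLen = 100): array
{
    $matches = [];
    $needleLen = strlen($needle);
    $haystackLen = strlen($haystack);
    $i = 0;

    while ($i + $minLen <= $needleLen) {
        $window = substr($needle, $i, $minLen);
        $pos = strpos($haystack, $window); // first occurrence only

        if ($pos === false) {
            $i++; // no match: slide the window forward one character
            continue;
        }

        // Match found: extend it one character at a time while both texts agree.
        $len = $minLen;
        while ($i + $len < $needleLen
            && $pos + $len < $haystackLen
            && $needle[$i + $len] === $haystack[$pos + $len]) {
            $len++;
        }

        $matches[] = [
            'needle_offset'   => $i,
            'haystack_offset' => $pos,
            'length'          => $len,
        ];
        $i += $len; // skip past the fragment that was just matched
    }

    return $matches;
}
```

Called as findVerbatimMatches($needle, $haystack), it returns one entry per matched fragment with its offsets in both texts and its length. It works on bytes, so multibyte text would need the mb_* functions instead.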

But this seems slow and cumbersome. It also gets tripped up by double spaces, line breaks, etc. Is there a faster, more elegant algorithm? Ideally, it would find matches even if a word in the substring has been changed, perhaps assigning a percentage similarity. For example, "abxdefghij" would not match "abcdefghij" at all with my strpos approach above, but a better algorithm could report a 90% match.
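For the whitespace problem at least, one option might be to normalize both texts before comparing them. A minimal sketch, where normalizeWhitespace() is a hypothetical helper name:

```php
<?php
// normalizeWhitespace() is a hypothetical helper, not a built-in function.
function normalizeWhitespace(string $text): string
{
    // Collapse any run of spaces, tabs and line breaks into a single space.
    // preg_replace() returns null on failure, so fall back to the input.
    $collapsed = preg_replace('/\s+/', ' ', $text) ?? $text;

    return trim($collapsed);
}
```

Running both needle and haystack through this first means "foo  bar" and "foo\nbar" compare as equal, at the cost of offsets no longer lining up with the original texts.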

Any ideas?

11-08-2010, 08:22 PM

[eluser]bretticus[/eluser]
I've never used it, but PHP has had a built-in function for this for some time now: similar_text.
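A minimal sketch of how it could be called on the two strings from the question. The optional third argument is passed by reference and receives the similarity as a percentage. The variable names are only for illustration.

```php
<?php
$a = 'abxdefghij';
$b = 'abcdefghij';

// similar_text() returns the number of matching characters and,
// through the third argument, the similarity as a percentage.
$common = similar_text($a, $b, $percent);

printf("%d matching characters, %.0f%% similar\n", $common, $percent);
// prints "9 matching characters, 90% similar"
```

It compares two whole strings rather than searching for fragments, so on long documents it may be too slow or too coarse by itself. It would probably fit best as a second pass over candidate fragments.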