Creating a cross-platform forum search function
#1

[eluser]TheFuzzy0ne[/eluser]
Hi, guys! I'm looking for some ideas on a forum search function. I believe I've touched on the subject before, but didn't really get the results I was looking for.

I'd like to create a forum search function that uses Active Record, so it's compatible with all databases that the Active Record class supports. I've looked into MySQL fulltext searching, and whilst it's nice, it's not compatible with all databases.

I'm looking for some ideas. I was thinking of using LIKE, but I'd imagine that might be resource intensive, and also it doesn't help when looking for words that may or may not have a preceding space.
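To make the word-boundary problem with LIKE concrete, here is a minimal sketch using Python's built-in sqlite3 module. The posts table and its rows are made up for illustration, and SQLite merely stands in for whichever database is actually in use.

```python
import sqlite3

# Hypothetical posts table, used only to illustrate the problem.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (body) VALUES (?)",
    [("I love my cat",), ("Concatenate the strings",), ("cat food is pricey",)],
)

def search(pattern):
    """Return the ids of posts whose body matches the LIKE pattern."""
    rows = conn.execute(
        "SELECT id FROM posts WHERE body LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]

loose = search("%cat%")     # substring match: also hits "Concatenate"
strict = search("% cat %")  # requires a space on both sides of the word
```

With this data the loose pattern returns all three posts, including the false hit on "Concatenate", while the strict pattern returns none, because "cat" sits at the very start or end of the posts that contain it. A leading wildcard also generally prevents the database from using an ordinary index on the column.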

I've noticed that some other mainstream forums out there seem to index words in a word list, and then search them. I've yet to understand how this works so if anyone can point me in the right direction, that would be great. More specifically, if you could give an example of how I might query the word list with Active Record and return a list of relevant posts, that would be great.

Basically, any suggestions that don't require the user to be running a specific database platform would be fantastic. Obviously I need to figure this out before I design my database layout, which is why I've been holding off for now. This is a project which I'd like to eventually release as a module for CodeIgniter users, hence why I'd like to make it work on the most common databases.

Thanks in advance for your suggestions.
#2

[eluser]drewbee[/eluser]
Zend Search Lucene. Insanely powerful and configurable. It stores the indexes in files. I couldn't even fathom using anything else.
#3

[eluser]Rob Steele[/eluser]
Alright, well, in my CS class we had to index words with a spider. In a strictly theoretical, dumbed-down version of it: when a new document is added to the db, that document is tokenized into words. A pointer to the document is then put into an array indexed by each word (we used a hash map for speed). To cut down on size, you ignore certain words that are too commonplace, e.g. 'the', 'an' and the other articles. But the easiest way is to just find a 3rd-party lib, unless you really want to learn how to do it the nitty-gritty way.
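The tokenize-and-index idea described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the stop-word list, the tokenizing regex and the function names are all made up here, and a dict of sets plays the role of the hash map.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an"}

def tokenize(text):
    """Lowercase the text, split it into words and drop the stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every non-stop word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

docs = {1: "The quick brown fox", 2: "A lazy brown dog", 3: "An owl and the fox"}
index = build_index(docs)
```

Here search(index, "brown") gives documents 1 and 2, and search(index, "brown fox") gives only document 1. A lookup is one hash access per query word instead of a scan over every document.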

Also, someone please tell me if I'm wrong, but storing that index would be a many-to-many pattern,
i.e.
a words table
word_id
word

a document table
document_id
document_location

and a word_doc table
word_id
document_id

That way you can check whether a word is already associated with a document, and it saves space since you don't have to write the document path hundreds of times.
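As a rough sketch of how those three tables could be filled and queried, here is a version using Python's built-in sqlite3 module. The table and column names follow the layout above (with the document table named documents); the helper names, the tokenizing regex and the sample posts are assumptions for illustration only. The SQL sticks to plain JOIN, IN, GROUP BY and HAVING, which should carry over to most databases.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL);
    CREATE TABLE documents (document_id INTEGER PRIMARY KEY,
                            document_location TEXT NOT NULL);
    CREATE TABLE word_doc (
        word_id INTEGER NOT NULL REFERENCES words(word_id),
        document_id INTEGER NOT NULL REFERENCES documents(document_id),
        PRIMARY KEY (word_id, document_id)
    );
""")

def split_words(text):
    """Return the distinct lowercase words in the text, sorted."""
    return sorted(set(re.findall(r"[a-z0-9']+", text.lower())))

def index_document(location, text):
    """Store the document and link it to every distinct word it contains."""
    doc_id = conn.execute(
        "INSERT INTO documents (document_location) VALUES (?)", (location,)
    ).lastrowid
    for word in split_words(text):
        # Reuse the word row if it exists, otherwise create it.
        row = conn.execute(
            "SELECT word_id FROM words WHERE word = ?", (word,)
        ).fetchone()
        if row is None:
            word_id = conn.execute(
                "INSERT INTO words (word) VALUES (?)", (word,)
            ).lastrowid
        else:
            word_id = row[0]
        conn.execute(
            "INSERT INTO word_doc (word_id, document_id) VALUES (?, ?)",
            (word_id, doc_id),
        )
    return doc_id

def search(query):
    """Return locations of documents that contain every word in the query."""
    terms = split_words(query)
    if not terms:
        return []
    placeholders = ", ".join("?" for _ in terms)
    rows = conn.execute(
        f"""
        SELECT d.document_location
        FROM documents d
        JOIN word_doc wd ON wd.document_id = d.document_id
        JOIN words w ON w.word_id = wd.word_id
        WHERE w.word IN ({placeholders})
        GROUP BY d.document_id, d.document_location
        HAVING COUNT(DISTINCT w.word_id) = ?
        ORDER BY d.document_id
        """,
        (*terms, len(terms)),
    )
    return [row[0] for row in rows]

index_document("/posts/1", "Searching forums with LIKE is slow")
index_document("/posts/2", "Lucene makes searching fast")
index_document("/posts/3", "Forums need fast searching")
```

With this data, search("fast searching") returns /posts/2 and /posts/3. In CodeIgniter the same query would presumably map onto Active Record's join(), where_in(), group_by() and having() calls, though that translation is untested here.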

-Rob
#4

[eluser]TheFuzzy0ne[/eluser]
Drew - Thanks for the suggestion. I'll be looking into that over the weekend. I assume that this is something I can easily make into part of my package, and not something I need admin privileges to install?

Rob - Interesting stuff! I've always wondered how search engines did it, and I had a vague idea, I just wasn't 100%. You suggested using a third party library. Please could you suggest the terms I should Google for? I'm sure there is a technical name for this type of indexing, but I've spent hours searching and never found what I was looking for. Obviously this is a sign I'm not using the right search terms.

Also, are there any third-party libraries you would recommend? I'm guessing you haven't actually used any, but instead implemented your own?

Thank you both for your comments.
#5

[eluser]pistolPete[/eluser]
[quote author="TheFuzzy0ne" date="1237505330"] I assume that this is something I can easily make into part of my package, and not something I need admin privileges to install?[/quote]

No need to install anything:
Quote: Zend_Search_Lucene is a general purpose text search engine written entirely in PHP 5.
Quoted from the documentation

There are also several topics on this forum on how to integrate it into CI, e.g.: http://ellislab.com/forums/viewthread/74616/
#6

[eluser]Rob Steele[/eluser]
Quote: Also, are there any third-party libraries you would recommend? I'm guessing you haven't actually used any, but instead implemented your own?
You assume correctly. The easiest way is to just get a third-party one, man. No need to reinvent the wheel.
#7

[eluser]Rob Steele[/eluser]
I refer you to another forum post that benchmarks adding and indexing documents using Zend:
Zend Search w/ CI post
#8

[eluser]drewbee[/eluser]
Yeah. Zend Search Lucene can easily be dropped on top of the CI framework.
#9

[eluser]TheFuzzy0ne[/eluser]
Thank you all for your comments.

drewbee, I'm concerned that this solution is not fully scalable. Can you put my mind at ease, please?
#10

[eluser]jedd[/eluser]
Are you asking if Lucene is scalable, or if plugging it into a framework is scalable?

If the former, consider that Wikipedia uses it. If the latter, consider that they migrated to it from MediaWiki's built-in search about 3 years ago.



