Creating a cross-platform forum search function
[eluser]TheFuzzy0ne[/eluser]
I was referring to [url="http://ellislab.com/forums/viewthread/74616/#518110"]this post[/url]. [quote author="Henry Weismann" date="1232335414"]Search lucene is meant to be implemented in large scale websites. For instance if you are using a tagging system of any kind you will find out that after you reach 1 million tagged entries your application will start bogging down and becoming very slow. That is because rdbms databases have scalability issues when dealing with extremely large data sets using "expensive" queries on those datasets. Search lucene based on the java library apache lucene is much faster. You can also look into hadoop. But if you are using it on a small dataset it's like using a sledge hammer to push in a thumb tack.[/quote]
[eluser]drewbee[/eluser]
Yes, it is highly scalable. You have to remember, a database is nothing more than files with data in them, organized for optimization, permissions, users etc. This is just skipping that part and going straight to files. There are a few limitations based on the operating system. The maximum index size on a 32-bit system is 2GB (2^31 - 1 bytes). This is not a fault of the index, but of the operating system; this is the case for any 32-bit Linux system. If you need a bigger index, 64-bit can provide a maximum index (file) size of 2^63 - 1 bytes, which is 8,589,934,592 GB. lol, you should be fine.

It uses segments to store the indexes, and the segments can be merged using the optimize() method. I used it as a spider for one of my test projects, and it worked just fine (including the slow speed of PHP crawling pages, adding and parsing data, etc.). Henry is correct as well. I tend to use Lucene for even basic searches (i.e. tags, blogs, forums, etc.). The only downside is that it takes twice as much disk space (database side plus Lucene index). As I was describing in another thread, though, the less I can hit my DB server, the better.
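For anyone following along, the add/commit/optimize cycle drewbee describes looks roughly like this with Zend_Search_Lucene (Zend Framework 1). This is a sketch; the index path and field values are made-up placeholders:

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Create a new index on disk (use Zend_Search_Lucene::open() for an
// existing one). '/tmp/forum_index' is just a placeholder path.
$index = Zend_Search_Lucene::create('/tmp/forum_index');

// Build a document and add it to the index.
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'Hello Lucene'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('body', 'Full post body, indexed but not stored.'));
$index->addDocument($doc);

// Each commit can write new segment files; optimize() merges them
// back down into a single segment.
$index->commit();
$index->optimize();

// find() returns Zend_Search_Lucene_Search_QueryHit objects.
$hits = $index->find('title:hello');
```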
[eluser]TheFuzzy0ne[/eluser]
I'm sorry for sounding so thick, but I don't understand how a 2GB index file can be any good for anything. Obviously it is, but I don't see how... Just how quickly can something be found in a file that's 2GB? Assuming, of course, that this is one big-a$$ index. Also, I was aware of the 2GB limit on Linux, but I was under the impression that it depended upon the file system in use. I have a 2.4GB executable happily living on my EXT3 partition.

EDIT: Crap! It's just occurred to me that I totally misread Henry's post. I thought he was saying that it's only good on medium-sized datasets. D'oh! Sorry for the confusion there... I think the two questions above are my final questions. If I'm happy with what you have to say, then I'm sold! Thanks again.
[eluser]drewbee[/eluser]
Well, I know it uses flock, so you can do concurrent inserts, deletes, and searches (there is no update doc; you have to delete and re-add). One of the features as well is the field types you give the documents you add. Some are tokenized, some are searchable, some are returnable, etc. So by using the correct types and attributes for each document added, it optimizes the index to hold only the data that needs to be searched, rather than everything that is displayed. I highly recommend looking over the documentation, as it is a great resource. http://framework.zend.com/manual/en/zend...ucene.html And the field types: http://framework.zend.com/manual/en/zend...ield-types And to throw gasoline on an already burning 'feature fire', it supports the indexing of Excel, Word, and PowerPoint documents. In my searches, even on my larger data sets, I have yet to see a decently complex query take more than 5 ms or so of search time.
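The field types drewbee mentions map onto static constructors on Zend_Search_Lucene_Field, and the delete-and-re-add pattern stands in for an update. A rough sketch; the field names, values, and index path here are invented for illustration:

```php
<?php
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/tmp/forum_index'); // placeholder path

$doc = new Zend_Search_Lucene_Document();
// Keyword: stored and indexed, but not tokenized -- good for IDs.
$doc->addField(Zend_Search_Lucene_Field::Keyword('post_id', '42'));
// Text: stored, indexed and tokenized -- good for short, displayable text.
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'My post title'));
// UnStored: indexed and tokenized but NOT stored -- keeps the index small.
$doc->addField(Zend_Search_Lucene_Field::UnStored('body', 'Long post body...'));
// UnIndexed: stored but not searchable -- pure payload.
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', '2009-01-19'));
$index->addDocument($doc);
$index->commit();

// There is no update: find the old document, delete it, then re-add.
foreach ($index->find('post_id:42') as $hit) {
    $index->delete($hit->id); // $hit->id is the internal document number
}
$index->addDocument($doc);
$index->commit();
```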
[eluser]TheFuzzy0ne[/eluser]
Sold! ...To the gentleman with the Lollerskates! Thanks for that. I think I'm going to be spending a lot of time reading this weekend. Thanks for your patience with regards to explaining what's what. Now I feel somewhat happier that I'm making a choice I'm not going to regret.
[eluser]TheFuzzy0ne[/eluser]
OK, I'm thinking about the implementation of the search engine. I'm not sure how often I need to optimise it, or how to trigger it. I guess I don't want to be optimising the index after every item is added. So how often should it be optimised? Should it be done manually, or automatically? How long would it take to optimise a very big index? Would it exceed PHP's max_execution_time?

Sorry to keep coming back with more questions, but I need to be sure that I can make this fit in with what the client needs, and that I'm not going to overlook anything that will cause problems in the future.

EDIT: I'm also starting to have second thoughts. I'm reading the documentation, and it mentions that optimising can be an expensive process in terms of resources. I don't know how expensive that is; it's a bit like asking "how long is a piece of string?". However, I'm concerned, as my Web host's TOS specifically prohibit me from running long, resource-intensive processes, since I am on a shared host. drewbee, am I just being paranoid here? Having not used Lucene before, I can't use experience to guide me, so I need to borrow some of yours.
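On the how-often-to-optimise question: the ZF1 manual exposes three tuning knobs (MaxBufferedDocs, MaxMergeDocs, MergeFactor) that control how aggressively segments are merged automatically during indexing, so a full optimize() can be left to an occasional scheduled script rather than run after every insert. A sketch, with placeholder path and values:

```php
<?php
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/tmp/forum_index'); // placeholder path

// How many documents are buffered in memory before a segment is written.
$index->setMaxBufferedDocs(100);
// Segments larger than this are never auto-merged.
$index->setMaxMergeDocs(50000);
// How many segments of a given size accumulate before they are merged.
// Lower = fewer files on disk (fewer open handles), slower indexing.
$index->setMergeFactor(5);

// Then, from a scheduled script run off-peak rather than per-request:
set_time_limit(0);      // optimize() on a big index can exceed max_execution_time
$index->optimize();     // merges every segment into one
```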
[eluser]TheFuzzy0ne[/eluser]
Can anyone recommend a search engine library that:

a) only uses a database;
b) I can freely port over to use Active Record; and
c) doesn't require compiling, or any command-line access to the server?
[eluser]TheFuzzy0ne[/eluser]
I've been trying to ascertain what the maximum open files limit is with my Web host (who my client will be using). After much confusion, I still can't get a decent answer out of them. Is it the way I've explained myself? I think it's quite clear. They seem to know even less about their file system than I do! My host is a reseller for another company (referred to as Heart). Quote:Tracking ID: *********** So this brings me back to one of my original questions. Is it suited for shared hosting?
[eluser]CtheB[/eluser]
Hi fuzzy, did you make any progress in getting your questions answered? I think you've read this part, but in case you didn't: Quote:UnStored fields are tokenized and indexed, but not stored in the index. Large amounts of text are best indexed using this type of field. Storing data creates a larger index on disk, so if you need to search but not redisplay the data, use an UnStored field. UnStored fields are practical when using a Zend_Search_Lucene index in combination with a relational database. You can index large data fields with UnStored fields for searching, and retrieve them from your relational database by using a separate field as an identifier. link: zend lucene So then you should have no problem with the shared host limit.
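The pattern that quote describes (index the big text as UnStored, keep only a database key in the index, and fetch the display copy from the database) looks roughly like this. The table, column, and id values are invented for illustration:

```php
<?php
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/tmp/forum_index'); // placeholder path

// Index time: only the DB primary key is stored in the index; the body
// is searchable but takes no storage space in the index files.
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Keyword('db_id', '123'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('body', 'Large searchable text...'));
$index->addDocument($doc);
$index->commit();

// Search time: pull the stored id out of each hit, then load the full
// rows from the database.
$ids = array();
foreach ($index->find('body:searchable') as $hit) {
    $ids[] = $hit->db_id; // stored fields are readable as hit properties
}
// e.g. with CodeIgniter Active Record (hypothetical model code):
// $this->db->where_in('id', $ids); $query = $this->db->get('posts');
```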
[eluser]TheFuzzy0ne[/eluser]
Unfortunately not. I'm just going to go ahead and implement it anyway. The problem is that the index fragments into several files and needs optimising (this is not an issue of space, as space is virtually unlimited). When optimising, all of the index files are opened, and systems are generally configured to allow only a certain number of open file handles per process. My ISP were hopeless at dealing with this request. Even an estimate would have helped me out, but rather than ask one of their engineers, they'd rather just tell me not to use it. I'm still very interested in finding some kind of guide on how often I should optimise, as I'm not sure.