CodeIgniter Forums
Creating a cross-platform forum search function - Printable Version


Creating a cross-platform forum search function - El Forum - 03-20-2009

[eluser]TheFuzzy0ne[/eluser]
I was referring to [url="http://ellislab.com/forums/viewthread/74616/#518110"]this post[/url].

[quote author="Henry Weismann" date="1232335414"]Search lucene is meant to be implemented in large scale websites. For instance if you are using a tagging system of any kind you will find out that after you reach 1 million tagged entries your application will start bogging down and becoming very slow. That is because rdbms databases have scalability issues when dealing with extremely large data sets using "expensive" queries on those datasets. Search lucene based on the java library apache lucene is much faster. You can also look into hadoop.

But if you are using it on a small dataset it's like using a sledge hammer to push in a thumb tack.[/quote]


Creating a cross-platform forum search function - El Forum - 03-20-2009

[eluser]drewbee[/eluser]
Yes, it is highly scalable. You have to remember, a database is nothing more than files with data in them, organized for optimization, permissions, users, etc.

This just skips that part and goes straight to files.

There are a few limitations based on the operating system. The maximum index size on a 32-bit system is 2 GB (2^31 - 1 bytes). This is not a fault of the index but of the operating system; this is the case for any 32-bit Linux system. If you need a bigger index, a 64-bit system allows a maximum index (file) size of 2^63 - 1 bytes, or about 8,589,934,592 GB. lol, you should be fine.
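The figures above can be checked with a few lines of arithmetic. This is only a sketch in Python (the thread itself is about PHP), and it assumes "GB" here means binary gigabytes of 2^30 bytes; the names `GIB`, `max_32`, `max_64` and `to_gib` are made up for the illustration.

```python
# File-size ceilings implied by signed 32-bit and 64-bit file offsets.
GIB = 2 ** 30  # bytes in one binary gigabyte (assumed meaning of "GB")

max_32 = 2 ** 31 - 1  # largest signed 32-bit offset, in bytes
max_64 = 2 ** 63 - 1  # largest signed 64-bit offset, in bytes


def to_gib(n_bytes):
    """Convert a byte count to binary gigabytes."""
    return n_bytes / GIB


print(f"32-bit: {max_32:,} bytes = {to_gib(max_32):.2f} GiB")
print(f"64-bit: {max_64:,} bytes = about {(max_64 + 1) // GIB:,} GiB")
```

The 64-bit line rounds 2^63 - 1 up by one byte before dividing, which is where the 8,589,934,592 figure comes from.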

It stores the index in segments, and the segments can be merged using the optimize() method.

I used it as a spider for one of my test projects, and it worked just fine, even with PHP being slow at crawling pages and parsing and adding the data.

Henry is correct as well. I tend to use Lucene even for basic searches (e.g. tags, blogs, forums, etc.).

The only downside to this is that it takes twice as much disk space (the database plus the Lucene index). As I was describing in another thread, though, the less I can hit my DB server, the better.


Creating a cross-platform forum search function - El Forum - 03-20-2009

[eluser]TheFuzzy0ne[/eluser]
I'm sorry for sounding so thick, but I don't understand how a 2GB index file can be any good for anything. Obviously, it is, but I don't see how... Just how quickly can something be found in a file that's 2GB? Assuming of course that this is one big-a$$ index.

Also, I was aware of the 2GB limit on Linux, but I was under the impression that it depended upon the file system in use. I have a 2.4GB executable happily living on my EXT3 partition.

EDIT: Crap! It's just occurred to me that I totally misread Henry's post. I thought he was saying that it's only good on medium sized datasets. D'oh! Sorry for the confusion there... I think the two questions above are my final questions. If I'm happy with what you have to say, then I'm sold! Thanks again.


Creating a cross-platform forum search function - El Forum - 03-20-2009

[eluser]drewbee[/eluser]
Well, I know it uses flock, so you can do concurrent inserts, deletes, and searches (there is no way to update a document; you have to delete it and re-add it).

Another feature is the field types you use when adding a document. Some are tokenized, some are indexed (searchable), some are stored (returnable), etc. By using the correct field type for each part of a document, you keep data that only needs to be searched out of the stored index and store only the data that is displayed. I highly recommend looking over the documentation, as it is very thorough.

http://framework.zend.com/manual/en/zend.search.lucene.html

And the fieldtypes:
http://framework.zend.com/manual/en/zend.search.lucene.html#zend.search.lucene.index-creation.understanding-field-types

And to throw gasoline on an already burning 'feature fire', it supports the indexing of Excel, Word, and PowerPoint documents.

In my searches, even on my larger data sets, I have yet to see a decently complex query take more than 5 ms or so of search time.


Creating a cross-platform forum search function - El Forum - 03-20-2009

[eluser]TheFuzzy0ne[/eluser]
Sold! ...To the gentleman with the Lollerskates!

Thanks for that. I think I'm going to be spending a lot of time reading this weekend. Thanks for your patience with regards to explaining what's what. Now I feel somewhat happier that I'm making a choice I'm not going to regret.


Creating a cross-platform forum search function - El Forum - 03-21-2009

[eluser]TheFuzzy0ne[/eluser]
OK, I'm thinking about the implementation of the search engine. I'm not sure how often I need to optimise it, and how to trigger it. I guess I don't want to be optimising the index after every item is added. So how often should it be optimised?

Should it be done manually, or automatically?

How long would it take to optimise a very big index?

Would it exceed PHP's max_execution_time?

Sorry to keep coming back with more questions, but I need to be sure that I can make this fit in with what the client needs, and that I'm not going to overlook anything that will cause problems in the future.

EDIT: I'm also starting to have second thoughts. I'm reading the documentation, and it mentions that optimizing can be an expensive process in terms of resources. I don't know how expensive that is, it's a bit like asking "how long is a piece of string?". However, I'm concerned, as my Web host's TOS specifically prohibit me from running long processes which are resource intensive, as I am running on a shared host. drewbee, am I just being paranoid here? Having not used Lucene before, I can't use experience to guide me, so I need to borrow some of yours.


Creating a cross-platform forum search function - El Forum - 03-23-2009

[eluser]TheFuzzy0ne[/eluser]
Can anyone recommend a search engine library that:
a) Only uses a database.
b) I can freely port over to use Active Record.
c) Doesn't require compiling, or any command line access to the server.


Creating a cross-platform forum search function - El Forum - 03-23-2009

[eluser]TheFuzzy0ne[/eluser]
I've been trying to ascertain what the maximum open files limit is with my Web host (who my client will be using). After much confusion, I still can't get a decent answer out of them. Is it the way I've explained myself? I think it's quite clear. They seem to know even less about their file system than I do! My host is a reseller for another company (referred to as Heart).

Quote:Tracking ID: ***********
Ticket status: Resolved [Open ticket]
Created on: 2009-03-21 19:34:59
Last update: 2009-03-23 18:39:27
Last replier: Staff
Category: Technical Support
Replies: 6
Priority: Low
Date: 2009-03-21 19:34:59
Name: Daz
E-mail: **********@gmail.com
IP: **.**.199.81
Customer Centre Email Address:
Your Domain Name: ************.co.uk
Message:

Hi. Please could you tell me what the maximum number of files allowed to be opened by PHP is? I'd like to implement Zend Lucene into my Web site, and it uses files for indexing. Too many files will cause a "Too many open files" error, and too few files may be resource intensive, so I need to try and find the happy medium.

Thanks in advance.

-------------------------------------------------------------------------

Date: 2009-03-21 22:32:54
Name: Tom ******
Message:

Hi Daz,

There shouldn't be any problems, Zend is fully supported on our servers under PHP 4 and PHP 5.

Please explain exactly what you want to achieve?

Regards,
Tom ******
********** Hosting Support
http://www.**********.co.uk/support/

-------------------------------------------------------------------------

Date: 2009-03-21 23:50:05
Name: Daz
Message:

Hi, Tom!

I'll quote a section of the manual for Zend_Search_Lucene, so you can make of it what you will.

Quoted from http://framework.zend.com/manual/en/zend.search.lucene.best-practice.html:

# START QUOTE #

Small segments generally decrease index quality. Many small segments may also trigger the "Too many open files" error determined by OS limitations.

In general, background index optimization should be performed for interactive indexing mode and MaxMergeDocs shouldn't be too low for batch indexing.

MergeFactor affects auto-optimization frequency. Lower values increase the quality of unoptimized indexes. Larger values increase indexing performance, but also increase the number of merged segments. This again may trigger the "Too many open files" error.

# END QUOTE #

Basically, if too many files are opened concurrently, an error will be thrown.

Thanks again!

-------------------------------------------------------------------------

Date: 2009-03-23 12:20:24
Name: Tom ******
Message:

Darren,

I can confirm that there shouldn't be any problems using Zend on our servers, it is supported under both PHP 4 and PHP 5.

Regards,
Tom ******
********** Hosting Support
http://www.**********.co.uk/support/

-------------------------------------------------------------------------

Date: 2009-03-23 12:29:24
Name: Daz
Message:

Tom,

Thanks for your reply, but unfortunately it doesn't answer my question. The question is not whether Zend is supported, but rather how many open file handles I can have at any one time before the server throws an error.

Lucene works by using files for an index. The files are referred to as "segments", and all of the segments are opened at once when searching. Optimising the segments concatenates the smaller segments into larger segments (so you end up with a smaller number of large files). I am trying to ascertain how many files can be opened at any one time, so I can find the right balance for optimisation. If I optimise too much, it will put a strain on the server; if I don't optimise enough, I run the risk of getting the "Too many open files" error.

I hope this makes more sense. Perhaps this is a question for Heart to deal with?

Thanks.

-------------------------------------------------------------------------

Date: 2009-03-23 13:48:09
Name: Tom ******
Message:

Hi Darren,

The replies I gave you came from Heart, they have assured me that Zend will work on their system, but I will pass your message on.

Regards,
Tom ******
********** Hosting Support
http://www.**********.co.uk/support/

-------------------------------------------------------------------------

Date: 2009-03-23 18:39:27
Name: Tom ******
Message:

Darren,

The message from Heart is as follows:

Thank you for your reply.

Unfortunately, we are unable to say exactly how many pages Lucene would be able to work with, however it sounds like what the client is trying to achieve would be very demanding on the server, and may not be suited to a shared server environment.

Apologies for the initial confusion.

Regards,
Tom *******
********** Hosting Support
http://www.**********.co.uk/support/

So this brings me back to one of my original questions. Is it suited for shared hosting?
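
On a Unix-like host where a script can be run, the per-process open-file limit that the ticket was asking about can usually be read directly rather than requested from support. The following is a minimal sketch in Python, not PHP, and it rests on assumptions: that the host is Unix-like (the `resource` module is not available on Windows) and that the limit seen by this script matches the one applied to the Web server's PHP processes, which is not guaranteed. The helper names `open_file_limits` and `describe` are invented for the example.

```python
import resource  # Unix-only standard-library module


def open_file_limits():
    """Return the (soft, hard) per-process limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, showing the 'no limit' sentinel as text."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = open_file_limits()
print("soft limit:", describe(soft))
print("hard limit:", describe(hard))
```

The soft limit is the one a process actually runs into; the hard limit is the ceiling the soft limit can be raised to without extra privileges.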


Creating a cross-platform forum search function - El Forum - 05-28-2009

[eluser]CtheB[/eluser]
Hi fuzzy,

Did you make any progress in getting your questions answered?

I think you've read this part, but in case you didn't:

Quote:UnStored fields are tokenized and indexed, but not stored in the index. Large amounts of text are best indexed using this type of field. Storing data creates a larger index on disk, so if you need to search but not redisplay the data, use an UnStored field. UnStored fields are practical when using a Zend_Search_Lucene index in combination with a relational database. You can index large data fields with UnStored fields for searching, and retrieve them from your relational database by using a separate field as an identifier.
link: zend lucene

So then you have no problem with the shared host limit ;)
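
The pattern in that quote (index the text for searching, keep only an identifier, fetch the text from the database) can be sketched without Lucene at all. This is an illustration in Python using an in-memory SQLite table and a plain dictionary as a stand-in index; it is not Zend_Search_Lucene's API, and the table name `posts`, the sample rows, and the `search` helper are all assumptions made up for the example.

```python
import sqlite3
from collections import defaultdict

# The relational database keeps the full post text.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
db.executemany("INSERT INTO posts (id, body) VALUES (?, ?)", [
    (1, "Lucene stores its index in segments"),
    (2, "Optimizing merges segments into one"),
    (3, "Shared hosts limit open files"),
])

# The toy index maps each token to post ids only; the body is not stored in it.
index = defaultdict(set)
for post_id, body in db.execute("SELECT id, body FROM posts").fetchall():
    for token in body.lower().split():
        index[token].add(post_id)


def search(term):
    """Look up matching ids in the index, then fetch the text from the database."""
    ids = sorted(index.get(term.lower(), ()))
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    rows = db.execute(
        f"SELECT id, body FROM posts WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    )
    return rows.fetchall()


print(search("segments"))
```

Because the index holds only tokens and ids, it stays small on disk, at the cost of one extra database query to display the results.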


Creating a cross-platform forum search function - El Forum - 05-29-2009

[eluser]TheFuzzy0ne[/eluser]
Unfortunately not. I'm just going to go ahead and implement it anyway. The problem is that the index fragments into several files, and needs optimizing (this is not an issue with space, as the space is virtually unlimited). When optimizing, all of the index files are opened and systems are generally configured to only allow a certain number of open file handles from a single caller. My ISP were hopeless at dealing with this request. Even an estimate would have helped me out, but rather than ask one of their engineers, they'd rather just tell me not to use it. I'm still very interested in finding some kind of guide telling me how often I should optimize, as I'm not sure.
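
One way to get a feel for how the number of index files grows between optimizations is a toy model of merge-factor behaviour. The sketch below is in Python and is only an approximation under stated assumptions: every flush writes one segment of size 1, and whenever `merge_factor` segments of equal size sit next to each other they are merged into one. It is not Zend_Search_Lucene's actual merge code, and `segment_sizes` is a hypothetical helper.

```python
def segment_sizes(num_flushes, merge_factor):
    """Toy model of segment growth.

    Each flush appends one segment of size 1. Whenever the newest
    merge_factor segments all have the same size, they merge into one.
    Returns the list of segment sizes left after num_flushes flushes.
    """
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    segments = []
    for _ in range(num_flushes):
        segments.append(1)
        # Cascade merges while the newest merge_factor segments match in size.
        while (len(segments) >= merge_factor
               and len(set(segments[-merge_factor:])) == 1):
            merged = sum(segments[-merge_factor:])
            del segments[-merge_factor:]
            segments.append(merged)
    return segments


for mf in (2, 10, 50):
    print("merge factor", mf, "->", len(segment_sizes(1000, mf)), "segments")
```

In this model, 1000 flushes leave 6 segments with a merge factor of 2, 1 with a factor of 10, and 20 with a factor of 50. The count never exceeds (merge factor - 1) segments per size level, so a modest merge factor keeps the number of open files bounded even if a full optimize is run only occasionally.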