Indexing HTML fragments with Zend Search Lucene
[eluser]TheFuzzy0ne[/eluser]
Hi everyone. I've got Lucene set up and ready to go - now it's just a case of indexing the data. I see you can index HTML files (one of the many reasons I chose to go with Lucene in the first place), but I can't figure out how the heck I can index HTML fragments or rather the markup that would appear within the body tags. I'm hoping that Lucene supports HTML fragments natively, but if it doesn't, can anyone who's been in this situation before comment on how I should go about indexing HTML fragments? Many thanks in advance.
[eluser]pistolPete[/eluser]
Quote: I'm hoping that Lucene supports HTML fragments natively
I haven't used Zend Lucene yet, but the documentation (zend.search.lucene.index-creation.html-documents) states the following:
Quote: Zend_Search_Lucene_Document_Html class recognizes document title, body and document header meta tags.
[eluser]TheFuzzy0ne[/eluser]
Exactly, but unfortunately I need to index just a fragment, not a whole document. If it comes to it, then I might have to build the fragment into a whole document first, but I'd like to avoid it if possible. I'm trying to index forum posts that are submitted in HTML, but obviously, they are just fragments, not an entire document. Thanks for your reply.
[eluser]jedd[/eluser]
Hey Fuzzy - I've been meaning to ask, and it's vaguely germane here - are you storing your forum posts in your own markup variant, or in HTML that you generate from whatever markup (if any?) that you allow when the poster presses submit? I've been wondering how to do this for my own rinky dinky forum software, and couldn't come up with a good answer. I've read DekiWiki's rationale for storing everything as HTML (and it makes some sense) but I also like the idea of storing things in a more fundamental format. I just see lots of little irritating problems cropping up in the future (much like this one) no matter which approach I adopt.
[eluser]TheFuzzy0ne[/eluser]
I'm using [url="htmlpurifier.org/"]HTMLPurifier[/url] to clean user-submitted markup into valid XHTML. The only configuration options I use are these:
Code:
$config->set('HTML', 'Doctype', 'XHTML 1.1'); # Set the doctype
I think that it actually removes script tags too, but I've yet to see that, since CodeIgniter's XSS filter strips those tags out anyway. [url="htmlpurifier.org/"]HTMLPurifier[/url] has been tested against every element of the [url="http://ha.ckers.org/xss.html"]XSS cheat sheet[/url] (and passes). I strongly suggest you check it out and [url="http://htmlpurifier.org/comparison"]compare[/url] it to similar libraries on the market. Also, check out the [url="http://htmlpurifier.org/demo.php"]demo[/url] - it knocked my socks off! Try firing some invalid HTML at it, and see the results. Whatever the result, it won't break the rest of the page, unless of course the element's style prevents it from wrapping. In my forum, I just shove it into a div with overflow-x set to auto.
So, everything is stored as pure (valid) HTML in the database, and printed as HTML entities, which NICEdit parses. As for indexing, I'm planning on using Zend_Search_Lucene. I'm having a few teething problems with it at the moment, but I don't think they'll be too difficult to get around - I'm just exploring my options for now. Hope this helps.
[eluser]jedd[/eluser]
Very much so, thank you. Bookmarked, and now slightly more motivated to start coding that bit of my app.
[eluser]TheFuzzy0ne[/eluser]
I've been thinking about simply stripping the tags and indexing the text that way. Can anyone think of any reason why I shouldn't? My other alternative might be to wrap the fragment in code to make it a complete document. Does anyone else actually use Lucene for indexing fragments?
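For what it's worth, here is a rough sketch of the strip-the-tags route in plain PHP. The helper name `fragment_to_text()` is made up for illustration, and it assumes the posts are already well-formed UTF-8 HTML (e.g. straight out of HTMLPurifier). One thing to watch is that `strip_tags()` alone glues neighbouring blocks together, so the sketch pads every tag with a space first.

```php
<?php
/**
 * Reduce an HTML fragment to plain text for indexing.
 * fragment_to_text() is a hypothetical helper, not part of Zend or CodeIgniter.
 * Assumes well-formed, UTF-8 encoded HTML.
 */
function fragment_to_text($html)
{
    // Put a space before every tag so adjacent blocks don't fuse into one word
    // ("<p>one</p><p>two</p>" would otherwise become "onetwo").
    $spaced = str_replace('<', ' <', $html);

    // Drop the tags, then turn entities such as &amp; back into characters.
    $text = html_entity_decode(strip_tags($spaced), ENT_QUOTES, 'UTF-8');

    // Collapse the runs of whitespace left behind by the removed markup.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

// prints: Hello world Second & last
echo fragment_to_text('<p>Hello <b>world</b></p><p>Second &amp; last</p>'), "\n";
```

The trade-off is that a word split by inline markup (say, `un<b>believ</b>able`) also gets broken into pieces. The resulting text could then presumably go into an ordinary `Zend_Search_Lucene_Document` as an unstored field rather than through the HTML document class, though I haven't tried that against a live index.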
[eluser]jedd[/eluser]
My initial response (as per the above revelation that Lucene prefers pages rather than fragments) is to wrap it in a very basic html/body pairing - have you tested if that's sufficient? It seems easy enough to strip off the first two (or even one) line and the matching pair at the end of any 'document' you handle this way, or not even bother if it's a one-way trip to the indexer. I think stripping the tags would be quite a complex task.
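A minimal sketch of the wrapping idea, assuming the fragment is already clean HTML. `wrap_fragment()` is a made-up helper name, and the UTF-8 default is an assumption about how the posts are stored. It also adds a Content-Type meta tag, since the parser is reportedly sensitive to the declared encoding.

```php
<?php
/**
 * Wrap an HTML fragment in a minimal document shell.
 * wrap_fragment() is a hypothetical helper; the default charset is an assumption.
 */
function wrap_fragment($fragment, $title = '', $charset = 'UTF-8')
{
    // Escape the title; the fragment itself is assumed to be already-clean HTML.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, $charset);

    return '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '" />'
        . '<title>' . $safeTitle . '</title>'
        . '</head><body>' . $fragment . '</body></html>';
}

$html = wrap_fragment('<p>Just a forum post.</p>', 'Post #42');
```

If it really is a one-way trip to the indexer, the wrapped string never needs unwrapping: it would just be handed to `Zend_Search_Lucene_Document_Html::loadHTML()` and the returned document added to the index, while the database keeps the original fragment.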
[eluser]TheFuzzy0ne[/eluser]
I'm wondering how reliable strip_tags() is with well-formed HTML. The indexer accepts a string containing HTML, but I've been given the impression that the encoding in the meta data has a lot to do with how the HTML is interpreted. I think I'm going to have to spend the night playing around with the index - ya know, just shoving in different bits of data, and seeing what I can do with it.