splitting messages into a word array for search indexing.
#1

[eluser]TheFuzzy0ne[/eluser]
Hi everyone. I'm trying to build a forum with CodeIgniter. I've decided that the best way to implement search, while keeping it portable between different database engines, is to index the words in each message, store them in a table, and search that index.

The trouble is, I'm not sure how to split the words. I've been looking at the phpBB forum software, and the regexes it uses to split words are totally beyond me. I'm looking for ideas on the criteria I should use to split words. I need to be sure, first time round, that no words get into the database that shouldn't be there.

Does anyone have any suggestions on how I should split the messages into words? I would have thought a simple regex like this /[\w\']+/ ought to do it, but phpBB seem to disagree.

Many thanks in advance.
#2

[eluser]pistolPete[/eluser]
What do you want to split? Every forum message or just the search term a user entered in the search form?

You probably want to split it by spaces. Why don't you just use:
Code:
$search_terms = explode(" ", $search_input);
#3

[eluser]TheFuzzy0ne[/eluser]
Thanks for your reply.

I need to split every posted message. Splitting by spaces won't work, as I would end up with hyphens, underscores, periods, commas, etc. in my word index.
#4

[eluser]fesweb[/eluser]
I'm not sure if this will help, but here is what the CI text helper word_limiter function looks like:
Code:
function word_limiter($str, $limit = 100, $end_char = '…')
{
    if (trim($str) == '')
    {
        return $str;
    }

    preg_match('/^\s*+(?:\S++\s*+){1,'.(int) $limit.'}/', $str, $matches);

    if (strlen($str) == strlen($matches[0]))
    {
        $end_char = '';
    }

    return rtrim($matches[0]).$end_char;
}
Maybe you can use that regex as a starting point?
#5

[eluser]darkhouse[/eluser]
I'm not 100% sure on this, but I would try using preg_replace to replace every run of non-alphabetic characters with a single space, and then split by spaces.
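Something along these lines might work as a starting point. This is only a rough sketch: the function name split_words_by_replace is made up for illustration, and it assumes ASCII text and that digits should be kept alongside letters.

```php
<?php
// Hypothetical helper (not part of CodeIgniter): collapse every run of
// characters that are not ASCII letters or digits into one space, then
// split the result on spaces.
function split_words_by_replace($text)
{
    // Assumption: only a-z, A-Z and 0-9 count as word characters.
    $clean = preg_replace('/[^a-z0-9]+/i', ' ', $text);

    // Trim so that leading/trailing separators do not yield empty entries.
    $clean = trim($clean);

    if ($clean === '')
    {
        return array();
    }

    return explode(' ', $clean);
}

print_r(split_words_by_replace("Hello, world -- a well-known test_case."));
```

Note that this treats hyphens, underscores and apostrophes as separators, so "well-known" becomes two words and "it's" would become "it" and "s".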
#6

[eluser]TheFuzzy0ne[/eluser]
Thanks for your replies everyone.

If I split by spaces, and the text contains more than one sequential space, this happens:
Code:
$str = "this is  a  test";
$arr = explode(' ', $str);
print_r($arr);
Array
(
    [0] => this
    [1] => is
    [2] =>
    [3] => a
    [4] =>
    [5] => test
)
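
As an aside, the empty entries on their own are easy to avoid. A minimal sketch, assuming preg_split with the PREG_SPLIT_NO_EMPTY flag is acceptable, splits on runs of whitespace and drops empty pieces:

```php
<?php
$str = "this is  a  test";

// Split on one or more whitespace characters; PREG_SPLIT_NO_EMPTY
// discards the empty strings that explode(' ', ...) would return.
$arr = preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);

print_r($arr);
```

That only solves the whitespace problem, though. Punctuation attached to a word still ends up in the array.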

Sorry for not making myself clear guys, but I'm not looking for different ways to split. I'm basically looking for a method that won't allow any characters into the database index that shouldn't be there. In other words, I can't decide what should and shouldn't be indexed.

I'm going to settle on this regex unless anyone has a better suggestion.

Code:
/([0-9\w\']{3,})/ig

The regex will extract any run of letters a-z, numbers 0-9, or apostrophes (and underscores too, since \w matches them). It will only extract words and numbers that are at least three characters long. The only catch I can see here is that the regex will extract ''', which isn't a word... I'm not sure whether I should even include apostrophes.
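
For reference, here is a rough sketch of how this might look in PHP. PHP's preg functions have no g modifier, so preg_match_all does the global matching instead. The function name extract_index_words and the $min_length parameter are made up for illustration, and the pattern is a variation on the regex above, not the same one: it assumes ASCII text, leaves out underscores, and only allows an apostrophe between letters or digits, so ''' never matches.

```php
<?php
// Hypothetical helper, not taken from phpBB or CodeIgniter.
function extract_index_words($text, $min_length = 3)
{
    // Runs of ASCII letters/digits, optionally joined by single
    // apostrophes: "don't" matches as one word, a bare "'''" does not.
    preg_match_all("/[a-z0-9]+(?:'[a-z0-9]+)*/i", $text, $matches);

    $words = array();

    foreach ($matches[0] as $word)
    {
        // Keep only words of at least $min_length characters.
        if (strlen($word) >= $min_length)
        {
            $words[] = strtolower($word);
        }
    }

    // Remove duplicates and re-index the array from 0.
    return array_values(array_unique($words));
}

print_r(extract_index_words("Don't index ''' or my_var, but do index PHP5."));
```

Because the length check runs after matching, "my_var" is split at the underscore into "my" (dropped as too short) and "var" (kept).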
#7

[eluser]xwero[/eluser]
Why don't you use the PHP function str_word_count?
#8

[eluser]TheFuzzy0ne[/eluser]
Hi, xwero.

Two reasons, really. First, I'd need to use the charlist parameter (for numbers), which isn't available until PHP 5.1.0, so my code would not be portable. Second, it includes hyphens, and I'm not sure if I want it to include hyphens. A hyphen usually glues two separate words together, but doesn't necessarily constitute a whole word.
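
To illustrate both points, here is a short sketch, assuming PHP 5.1.0 or later and plain ASCII input (the sample sentence is just an example):

```php
<?php
$text = "It's a well-known PHP5 trick";

// Format 1 returns the words as an array. Without a charlist,
// digits are not word characters, so "PHP5" is cut down to "PHP".
$without = str_word_count($text, 1);

// The third argument (PHP 5.1.0+) lists extra characters to treat
// as part of a word; here the digits 0-9.
$with = str_word_count($text, 1, '0123456789');

print_r($without);
print_r($with);
```

In both cases "well-known" comes back as a single word, which is the hyphen behaviour I'm unsure about.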



