Login

10-29-2008, 08:15 PM

[eluser]dmorin[/eluser]
There appears to be an issue with the auto_link helper function when there are multiple urls from the same domain. For example, take the following plain text:

Quote:Details: http://www.coolpatchpumpkins.com/ or more specifically here: http://www.coolpatchpumpkins.com/corn_maze.html

The function will parse these one at a time and then do a str_replace on each in order replacing the url with the full html link tag. The problem is that when it does a string replace for the first one, it also replaces the base url of the second one. Then when it does the second string replace, it doesn't find any matches. The result is:

Quote:<a href="http://www.coolpatchpumpkins.com/" target="_blank">http://www.coolpatchpumpkins.com/</a> or more specifically here: <a href="http://www.coolpatchpumpkins.com/" target="_blank">http://www.coolpatchpumpkins.com/</a>corn_maze.html

When the links are listed from shorter to longer as they are here, the affect isn't bad with just the base being a link in both cases. However, if the links are exactly the same, or the longer one is first and the shorter is second, there are larger issues since link tags are placed within other link tags:

Quote:<a href="a href=" target="_blank">http://www.coolpatchpumpkins.com/corn_maze.html</a>" target="_blank" >http://www.coolpatchpumpkins.com/corn_maze.html</a>
<a href="a href=" target="_blank">http://www.coolpatchpumpkins.com/corn_maze.html</a>" target="_blank" >http://www.coolpatchpumpkins.com/corn_maze.html</a>

10-30-2008, 07:25 AM

[eluser]sophistry[/eluser]
this is the code you are talking about?

Code:
if (preg_match_all("#(^|\s|\()((http(s?)://)|(www\.))(\w+[^\s\)\<]+)#i", $str, $matches))

            {

                $pop = ($popup == TRUE) ? " target=\"_blank\" " : "";

                for ($i = 0; $i < sizeof($matches['0']); $i++)

                {

                    $period = '';

                    if (preg_match("|\.$|", $matches['6'][$i]))

                    {

                        $period = '.';

                        $matches['6'][$i] = substr($matches['6'][$i], 0, -1);

                    }

                    $str = str_replace($matches['0'][$i],

                                        $matches['1'][$i].'<a href="http'.

                                        $matches['4'][$i].'://'.

                                        $matches['5'][$i].

                                        $matches['6'][$i].'">http'.

                                        $matches['4'][$i].'://'.

                                        $matches['5'][$i].

                                        $matches['6'][$i].'</a>'.

                                        $period, $str);

                }

            }

see that str_replace() in there? that's the root of the problem. it just does a blind str_replace() for each match that the preg function finds. i think this could be better written with a preg_replace(), but i don't have my thinking cap on this morning.

10-30-2008, 07:43 AM

[eluser]dmorin[/eluser]
Yes, that's exactly what I was describing.

10-30-2008, 07:48 AM

[eluser]sophistry[/eluser]
so, i haven't run the code with data you describe, but i can see the problem with this auto_link code. so, i'd say this is a bug. "auto_link() overwrites previously replaced text inside for loop - blind str_replace()."

10-30-2008, 08:43 AM

[eluser]dmorin[/eluser]
Sounds good, thanks for confirming. I'll add it to the bug tracker.

10-30-2008, 08:55 AM

[eluser]dmorin[/eluser]
Added as bug: http://codeigniter.com/bug_tracker/bug/5788/

10-30-2008, 11:15 AM

[eluser]sophistry[/eluser]
ok, got around to running the sample coolpatchpumpkins text you provided above and i can confirm this bug.

here's some starter code using preg_replace() that doesn't suffer from that problem, but it suffers from other problems! it may work in your case in the interim. still cooking on it, but needs some fixing to work nicely with a range of urls like the auto_link() already does.

Code:
function auto_link_text($str, $type = 'both', $popup = FALSE)

{

    $pop = ($popup === TRUE) ? ' target="_blank" ' : '';

    $p = '!https?://[^\s]+!';

    $r = '<a href="$0">$0</a>';

    $out = preg_replace($p,$r,$str);

    return $out;

}

EDIT: in my testing, i uncovered another bug? the auto_link() code will detect an url that starts with www. like www.domain.com but it won't detect plain old domain.com or even perfectly reasonable FQDs like: wiki.codeigniter.com

10-30-2008, 01:14 PM

[eluser]sophistry[/eluser]
EDIT: the EE forum strangles code! please beware. c'mon guys, a smiley conversion inside a code block...? since i can't post to the EE forum, i guess that's a bug report for the EE forum. :exasperated: because this is a code-sharing forum...

see next post for link to pastebin page where you can copy the code for testing...

tell me this is not beautiful.

Code:
/* DO NOT use this code, it got messed up by the EE forum

// see the next post for a link to a pastebin page that handles code properly.

function auto_link_text($str, $type = 'both', $popup = FALSE)

{

    // list of chars that cannot 

    // start or end an url (for the

    // purposes of detection)

    // may not be complete

    $not = '.<>:()';    

    $p = '`(https?://)?(?!['.$not.'])([[:alnum:][:punct:]]+)(?<!['.$not.'])`i';

    $cb = 'urlit_cb';

    $out = preg_replace_callback($p,$cb,$str);

    // since popups are all the same, just replace in one stroke

    $replacement = ($popup) ? ' target="_blank" ' : '';

    $out = str_replace('~!@popup@!~', $replacement, $out);

    return $out;

}

// callback for url replacement

// inserts code to be used for 

// popup replacement

function urlit_cb($matches)

{

    // if there is a scheme, keep it, if not add http...

    // but only to the href, not to the text inside the anchor tag

    if ($matches[1]==='')

    {

        $scheme = 'http://';

        $a_string = $matches[2];

    }

    else

    {

        $scheme = $matches[1];

        $a_string = $scheme.$matches[2];

    }

    $link_url = $scheme.$matches[2];

    // check if the url really is a valid domain or just a word

    // second stage check so we just look for a common domain pattern

    $url_p = '`[[:alnum:]]{2,}[.][a-z]{2,6}`';

    $is_url = preg_match($url_p, $matches[2]);

    $r = ($is_url) ? '<a href="'.$link_url.'">'.$a_string.'</a>' : $a_string;

    return $r;

}

*/

feed it this for testing. break it, please, if you can. ;-)

Code:
$text = "domainatstartofline.com (domaininparens.com) details: .dotstartsthis.com domain.com www.domain.com wiki.codeigniter.com http://www.rc.umd.edu/. specifically:  http://www.rc.umd.edu/index.html

Details: http://www.coolpatchpumpkins.com/ or more specifically here: http://www.coolpatchpumpkins.com/corn_maze.html";

$linked_text = auto_link_text($text);

EDIT: please note that i fixed the two bugs already mentioned in this thread and fixed a third bug/unexpected behavior so that this function does not insert an http:// in the displayed text of the anchor tag - the old function would gleefully insert an http:// if it found a bit of string starting with "www." which was unexpected (or a bug, whichever).

10-30-2008, 01:23 PM

[eluser]sophistry[/eluser]
WTF? the crazy forum munged the code. please do not copy/paste this and tell me that it doesn't work. it doesn't work because the forum messed up the code!

i'm going to try pastebin instead...http://pastebin.com/f4e738080

10-30-2008, 07:42 PM

[eluser]sophistry[/eluser]
got a new version - this one should be the end of it.

added a slew of new features - this one can even parse urls that are jammed together with no separator. try it and see if you can trick it. then let me know what you find.

again, on pastebin so that the forum keeps its munging paws off this delicate flower:
http://pastebin.com/ffc95afe

but, also here below so you can get the gist of the code (DO NOT COPY PASTE this code, it will not work, go to the pastebin link to get the code):

Code:
/* DO NOT USE THIS CODE, get it from pastebin link above

// new auto_link() replacement candidate

function auto_link_text($str, $type = 'both', $popup = FALSE)

{

    // list of chars that cannot start or end an url (for the

    // purposes of detection) this list may not be complete

    // notice, you only have to escape the square brackets. 

    // NOTE:

    // this list will cause some minor issues because it prevents

    // any URL that ends with one of these chars from getting

    // captured as part of the URL. good for periods after URLs,

    // but, bad for query strings that might use these as part

    // of the query itself in the URL. it only matters if the char

    // happens to be at the end of the query string. not too bad

    // as a trade off - e.g., periods, etc... are more likely 

    // to be part of a sentence that ends in a url than the last 

    // char in a query string of a URL

    $not = '-+.,<>:()|\[\]{}';

    // regex detail: do not start with char from the not list, 

    // capture an optional https?://, then again another not list,

    // any alphanumerics or dash (gather possesively ++), alnum or 

    // punct (optional because the first alnum-dash group might be 

    // enough), then do not end with any string from the not list

    // oh, and do it all caselessly

    $p = '#(?!['.$not.'])(https?://)?(?!['.$not.'])([-[:alnum:]]++[.]([[:alnum:][:punct:]]+)?)(?<!['.$not.'])#i';

    // send the captures to a second stage of processing

    $out = preg_replace_callback($p,'urlit_cb',$str);

    // since popups are all the same, just replace in one stroke

    $replacement = ($popup) ? ' target="_blank" ' : '';

    // popup replace string is the same as in urlit_cb()

    $out = str_replace('~!@popup@!~', $replacement, $out);

    return $out;

}

// callback for url replacement

// does extra tests on the captured strings and

// inserts a code to be used for popup replacement

function urlit_cb($matches)

{

    // there are still many things that will get here

    // that are not domains (and that are hard to catch in regex)

    // we can put any checks needed in here

    // to filter out things that are not urls

    // main check if the url really is a real-looking domain 

    // or just a string of alnums and puncts. this is a

    // second stage check so we just look for a common domain pattern

    // that should reject a lot of junk right up front

    $url_p = '#[-[:alnum:]]+[.][a-z]{2,6}#';

    $is_url = preg_match($url_p, $matches[2]);

    if (! $is_url) return $matches[0];

    // if we have http:// strings inside a domain name

    // that were caught by the first stage regex, split them

    // for example: http://domain.comhttp://anotherdom.com

    // use the PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE flags to make sure 

    // we can rebuild the scheme properly

    $url_parts = preg_split('#(https?://)#',$matches[0],-1,PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);

    if (count($url_parts) > 2) 

    {

        // rebuild them as space separated string and

        // call the auto_link_text() function again

        $new_urls = '';

        for($k = 0; $k < count($url_parts)-1; $k++)

        {

            $new_urls .= $url_parts[$k].$url_parts[$k+1].' ';

            $k++;

        }

        return auto_link_text($new_urls);

    }

    // now handle the "normal" urls

    // if there is a scheme, keep it, if not add http...

    // but only add http:// to the href, not to the 

    // displayed link text in the anchor tag - use 

    // the whole captured string for that

    $scheme = ($matches[1]==='') ? 'http://' : $matches[1];

    $link_url = $scheme.$matches[2];

    // return the link with a popup placeholder that will get replaced

    $r = '<a href="'.$link_url.'">'.$matches[0].'</a>';

    return $r;

}

*/