Welcome Guest, Not a member yet? Register   Sign In
Email addresses with a "+" are not parsed correctly
#1

[eluser]dmorin[/eluser]
Correct me if I'm wrong, but I believe that the "+" is a valid character for the left side of an email address (what is that side called, the username?). Using the auto_link() function, it does not include anything before and including the plus sign.

Example:

Code:
echo auto_link(' [email protected] ');

Only the [email protected] will actually get wrapped in the link tag. Please confirm so I can post the bug.
#2

[eluser]Derek Allard[/eluser]
line 403 of url_helper.php
Code:
if (preg_match_all("/([a-zA-Z0-9_\.\-]+)@([a-zA-Z0-9\-]+)\.([a-zA-Z0-9\-\.]*)/i", $str, $matches))

to

Code:
if (preg_match_all("/([a-zA-Z0-9_\.\-\+]+)@([a-zA-Z0-9\-]+)\.([a-zA-Z0-9\-\.]*)/i", $str, $matches))

That solve it for you?

I do believe that the + is in fact valid, so I'm tempted to call this a bug and drop in the code fix. Before I do, does anyone have anything else they'd like to add?
#3

[eluser]dmorin[/eluser]
Thanks Derek!

Just for reference, wikipedia lists the following for allowed characters:

Quote:
* Uppercase and lowercase letters
* Digits 0 through 9
* Characters ! # $ % * / ? | ^ { } ` ~ & ' + - = _
* Character . provided that it is not the first nor last character, nor may it appear two or more times consecutively.

Didn't actually look at the RFC though, so I'm not sure if this is accurate. Thanks again for the fix.
#4

[eluser]Derek Allard[/eluser]
My pleasure. RFC's are important, but only if they dictate real-world use. Sometimes they need to be broken for practicality's sake. Email sending is a good example of this. Since the + does tend to get used, it seems a logical choice for inclusion (thanks for raising this).

I'm going to not add in the fix for a few days, and leave this thread open in case anyone from the community wants to add more. If after a bit of time there's no more feedback, could I ask you to just bug me and I'll get it in there for you?
#5

[eluser]sophistry[/eluser]
i think it's a bug. here's the code in question (see below for full copy paste of auto_link() function).

Code:
preg_match_all("/([a-zA-Z0-9_\.\-]+)@([a-zA-Z0-9\-]+)\.([a-zA-Z0-9\-\.]*)/i", $str, $matches)

it only allows alphanumeric, underscore, period, and dash in the mailbox section of the regex.

i believe that you have uncovered some "expedited" programming in the url helper auto_link() function. the email parser here does not conform to RFC2822 (link to FAQS.org) - but it probably shouldn't conform completely. to see someone's attempt at implementing a complete email address parser see the comments on this PHP manual page:

http://us.php.net/manual/en/function.preg-match-all.php

but, a plus sign is an allowed mailbox character. here is the "official" list from the RFC:
Quote:atext = ALPHA / DIGIT / ; Any character except controls,
"!" / "#" / ; SP, and specials.
"$" / "%" / ; Used for atoms
"&" / "'" /
"*" / "+" /
"-" / "/" /
"=" / "?" /
"^" / "_" /
"`" / "{" /
"|" / "}" /
"~"

which would translate to a regex character class of:
Code:
// use dash first in character class and we don't have to escape it
// you don't have to escape repetition operators or other special regex symbols
// when they are inside a character class - makes it more readable
// the only one that needs escaping is the single quote
[-.a-zA-Z0-9#!$%&*+/=?^_`{}|~\']

i wrote a test controller to make sure that all the characters inside the class were actually specified correctly for the fix (meaning, i'd be crushed if the new regex has a bunch of escaped characters that make it unreadable :-) ):
Code:
<?php

class Test extends Controller {

    function Test()
    {
        parent::Controller();
    }
        
    function index()
    {
        $chars = '-.a-zA-Z0-9#!$%&*+/\'=?^_`{}|~';
        $len = strlen($chars);
        $i=0;
        print_r($chars); echo '<br>';
        while ($i<$len)
        {
            preg_match(":[$chars]:", $chars[$i], $matches);
            $i++;
            print_r($matches[0]);
        }
    }

}

/* End of file test.php */

the auto_link() function:
Code:
// ------------------------------------------------------------------------

/**
* Auto-linker
*
* Automatically links URL and Email addresses.
* Note: There's a bit of extra code here to deal with
* URLs or emails that end in a period.  We'll strip these
* off and add them after the link.
*
* @access    public
* @param    string    the string
* @param    string    the type: email, url, or both
* @param    bool     whether to create pop-up links
* @return    string
*/
if ( ! function_exists('auto_link'))
{
    function auto_link($str, $type = 'both', $popup = FALSE)
    {
        if ($type != 'email')
        {        
            if (preg_match_all("#(^|\s|\()((http(s?)://)|(www\.))(\w+[^\s\)\<]+)#i", $str, $matches))
            {
                $pop = ($popup == TRUE) ? " target=\"_blank\" " : "";
        
                for ($i = 0; $i < sizeof($matches['0']); $i++)
                {
                    $period = '';
                    if (preg_match("|\.$|", $matches['6'][$i]))
                    {
                        $period = '.';
                        $matches['6'][$i] = substr($matches['6'][$i], 0, -1);
                    }
            
                    $str = str_replace($matches['0'][$i],
                                        $matches['1'][$i].'<a href="http'.
                                        $matches['4'][$i].'://'.
                                        $matches['5'][$i].
                                        $matches['6'][$i].'">http'.
                                        $matches['4'][$i].'://'.
                                        $matches['5'][$i].
                                        $matches['6'][$i].'</a>'.
                                        $period, $str);
                }
            }
        }

        if ($type != 'url')
        {    
            if (preg_match_all("/([a-zA-Z0-9_\.\-]+)@([a-zA-Z0-9\-]+)\.([a-zA-Z0-9\-\.]*)/i", $str, $matches))
            {
                for ($i = 0; $i < sizeof($matches['0']); $i++)
                {
                    $period = '';
                    if (preg_match("|\.$|", $matches['3'][$i]))
                    {
                        $period = '.';
                        $matches['3'][$i] = substr($matches['3'][$i], 0, -1);
                    }
            
                    $str = str_replace($matches['0'][$i], safe_mailto($matches['1'][$i].'@'.$matches['2'][$i].'.'.$matches['3'][$i]).$period, $str);
                }
        
            }
        }
        return $str;
    }

EDIT: and finally... for easy, drop in... the line with the full fix so no one bothers about it any more (might as well fix it right, right?):
Code:
if (preg_match_all("/([-.a-z0-9#!$%&*+/\'=?^_`{}|~]+)@([-a-z0-9]+)\.([-.a-z0-9]*)/i", $str, $matches))

EDIT: i trimmed down the regex for better readability overall... since there is the i switch, you don't need A-Z and i rearranged the dash char and de-escaped the period (no need for escaping in a character class).
#6

[eluser]Pascal Kriete[/eluser]
Yes, the RFC allows all of those, but do you really want them in there?

I'm fine having a +, heck even a ~, but pound signs and curly brackets?

Imagine getting an email from someone like this:
!#$%&'*+-/=?^_`.{|}[email protected]

I sleep better at night knowing that email from mysterious people like that will never reach my inbox.

@Sophistry, your regular expression allows multiple periods, and strings starting with a period - both of which are not valid :lol: . Also you're missing the quoted strings.
#7

[eluser]sophistry[/eluser]
@inparo, thanks for the feedback! so you will decide for us what's acceptable in the RFC? sounds good to me - i wish someone would decide; email addresses are such a mess. :-) oh wait... they did decide: RFC2822.

seriously, the idea is to make it a complete listing of acceptable email address formats so that people don't keep filing bug reports for every character they want included in the "acceptable" list. this way, if someone wants to have auto_link() parse out an email with a colon in it, you just say 'sorry, your email is "out of scope of the RFC".' rather than 'sorry your email format is out of scope for CI.'

to address your insights into the regex: the current auto_link() code does not care about quoted strings so i didn't include code that deals with them. and as far as the other, ahem, shortcomings... those are also present in the current code: allows multiple periods and allows starting with a period.

i took a stab at fixing the regex to detect initial periods (regex gets kooky when you try to express something like multi-periods in the midst of a character class so i'm just going to leave it for now):

this will work to disallow initial periods (adds a "everything but a dot" character class for the first character)
Code:
$chars_not_dot = '-a-z0-9#!$%&*+/\'=?^_`{}|~';
if (preg_match_all("/([$chars_not_dot][$chars_not_dot.]+)@([-a-z0-9]+)\.([-.a-z0-9]*)/i", $str, $matches))
#8

[eluser]Pascal Kriete[/eluser]
You're right, of course, that it's the accepted standard. The problem is that they agreed on something so gosh darned complex.

If you really want to give it a shot though: here you go (or for those that fancy: perl). Derek made a very good point earlier - practicality.
#9

[eluser]sophistry[/eluser]
ha ha! wow, thanks for the link to the "impenetrable RFC 2822 email address parser." so, that's how you'd do it if you wanted to be completely impractical and gosh-darned complex! ;-)

the set of atext "visible chars" i suggested above could become the CI-approved email address standard which would cover 99% of email addresses anyone would ever really use. the gargantuan email address parser is really dealing mostly with edge cases and quoted literals (which would be a real bugbear in an autolink function).

maybe the accepted email chars should go into a config setting like the permitted uri setting? in fact, shouldn't the whole "email detector" regex go into its own function to make it more adaptable/extendable?

BTW, there is another problem with the current email detector: it detects a domain with a hostname ( e.g., [email protected] ) but, it puts the hostname in $matches[2] and the domain and tld in $matches[3]. But, with an email address with no hostname $matches[2] has the domain alone and $matches[3] has the tld alone.

i know you can easily explode('@',$matches[0]); to get the data but, the function should standardize what it captures in the sub-patterns. so that $matches contains a standard set of captured data.

here's some new test code that standardizes $matches:

Code:
&lt;?php

class Test extends Controller {

    function Test()
    {
        parent::Controller();
    }
        
    function index()
    {
        $chars = '-.a-zA-Z0-9#!$%&*+/\'=?^_`{}|~';
        $len = strlen($chars);
        $i=0;
        print_r($chars); echo '<br>';
        while ($i<$len)
        {
            preg_match(":[$chars]:", $chars[$i], $matches);
            $i++;
            print_r($matches[0]);
        }
        
        // test on "real" addresses
        $strs = array(
                    "back|to=school~w0w.does+this^[email protected]",
                    "back{to}school-does+this^[email protected]",
                    'back{to}school#[email protected]',
                    '[email protected]',
                    '[email protected]',
                    '[email protected]',
                    '[email protected]',
                    '[email protected]',
                    'h#[email protected]',
                    '[email protected]',
                    'h$[email protected]',
                    'h%[email protected]',
                    'h&r;@example.com',
                    'h*[email protected]',
                    '[email protected]',
                    'h/[email protected]',
                    "h'[email protected]",
                    '[email protected]',
                    '[email protected]',
                    'h^[email protected]',
                    '[email protected]',
                    'h`[email protected]',
                    'h{[email protected]',
                    'h}[email protected]',
                    'h|[email protected]',
                    '[email protected]',
                    '[email protected]',
                    );
        $chars_not_dot = '-a-z0-9#!$%&*+/\'=?^_`{}|~';
        foreach ($strs as $s)
        {
            preg_match_all(";([$chars_not_dot][$chars_not_dot.]+)@((?:[-a-z0-9]+)\.(?:[-.a-z0-9]*));i", $s, $matches);
            p($matches);
        }
    
    }


}

/* End of file test.php */
#10

[eluser]sophistry[/eluser]
yeah, it's me, back to back posts. but, i couldn't wait.

ok, i put on my REGEX thinking cap and came up with a way to exclude matching on multi-periods (a fix suggested by inparo):
Code:
preg_match_all(";([$chars_not_dot](?:[$chars_not_dot]|[.](?![.]))+)@((?:[-a-z0-9]+)\.(?:[-.a-z0-9]{2,}));i", $s, $matches);

it's pretty cool regex if i do say so myself; i used my first negative lookahead assertion.
Code:
[.](?![.]))

that little squeegee says "match a dot only if it is not followed by another dot."

EDIT: i changed the last part of the regex to look for at least 2 chars instead of the wildcard 0 or more * that was there on first post.




Theme © iAndrew 2016 - Forum software by © MyBB