Accented characters in URL

#1
[eluser]nico060475[/eluser]
Hi there,

I haven't found a solution in the forums for using accented characters in a URL such as http://mysite.com/index.php/search/tag/bébé

Even adding them to permitted_uri_chars in config.php does not work (and it is not a real solution, as I cannot list the characters of every language).

I found a workaround using $_GET parameters, but the beauty of the URL is lost.

Am I stuck with unaccented characters, or is there a hack that can save my day?

Thanks for your help.

#2
[eluser]xwero[/eluser]
Are you saying the $_GET global accepts accented characters but path info doesn't? Could I see your solution?

I think it's better to transliterate the URLs, because people who don't speak the language but want to navigate the site are not going to know where to place the accents.

Transliterating is easy with the strtr function, and to make it language-aware you could extend the URL helper with the following function:
Code:
function url_transliterate($str, $lang)
{
   // Characters shared by the western European languages
   $western = array('ä'=>'a','à'=>'a','á'=>'a','â'=>'a'); // and so on
   // Per-language tables; $lang must be 'fr' or 'de' to match one of them
   $fr = array_merge($western, array('ç'=>'c'));
   $de = array_merge($western, array('ß'=>'ss')); // I'm not sure the character is readable for everyone, but in German it's called Eszett.
   return strtr($str, ${$lang});
}

#3
[eluser]nico060475[/eluser]
This URL is produced when a form on the site is submitted.

Some JavaScript intercepts the form submission and rewrites the URL. If JavaScript is disabled, the classic form submission is used, with the values passed as GET parameters (and the accented characters work).

You can go to http://pixbreak.free.fr and type "vendée" in the search box, with and without JavaScript enabled, to see it in action.

So the transliteration could be done client side before submitting, but I don't like the idea of losing that information before doing the search.

Edit: The controller used with and without JavaScript is the same, but there is some URI routing involved.

#4
[eluser]xwero[/eluser]
If you store the transliterated segment in the database, the search term vendée will match both vendée and vendee. You can't be sure that whoever enters the word accents it correctly, and matching on the exact accents would then return the wrong results.

Most people will enter the search term in the search box, not in the address bar. The URL as a search box is a secondary route, not the primary one. Are people really going to complain if they see site.com/search/vendee as the URL after searching with the search box? I think most people won't even notice.

If you use url_title you lose the whole character, so I think transliterating is a good compromise.
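
To illustrate the idea of storing a transliterated copy next to the original, here is a minimal sketch. The names `fold_accents` and `search_tags` are hypothetical, the `$tags` array stands in for database rows, and the character table is deliberately tiny.

```php
<?php
// Hypothetical helper: fold accents so 'vendée' and 'vendee' compare equal.
function fold_accents($str)
{
    // Small table for the sketch only; extend it for real use.
    $map = array(
        'é' => 'e', 'è' => 'e', 'ê' => 'e', 'à' => 'a', 'â' => 'a', 'ç' => 'c',
        'É' => 'e', 'È' => 'e', 'Ê' => 'e', 'À' => 'a', 'Â' => 'a', 'Ç' => 'c',
    );
    // strtr with an array replaces whole substrings, so UTF-8 is safe here.
    // strtolower then lowercases the remaining ASCII letters.
    return strtolower(strtr($str, $map));
}

// Stand-in for database rows: each tag keeps its original spelling
// plus a folded copy that is used only for matching.
$tags = array();
foreach (array('Vendée', 'bébé', 'Paris') as $label) {
    $tags[] = array('label' => $label, 'folded' => fold_accents($label));
}

// Hypothetical search: fold the user's input the same way, then compare.
function search_tags(array $tags, $term)
{
    $needle = fold_accents($term);
    $hits = array();
    foreach ($tags as $tag) {
        if ($tag['folded'] === $needle) {
            $hits[] = $tag['label'];
        }
    }
    return $hits;
}
```

With this layout the original spelling is still available for display, while `search_tags($tags, 'vendee')` and `search_tags($tags, 'vendée')` find the same row.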

#5
[eluser]patos[/eluser]
If that doesn't sound too dangerous for your application, set the permitted_uri_chars entry of the config array in the config file to blank, which allows every character:
$config['permitted_uri_chars'] = '';

Hope it helps.

#6
[eluser]nico060475[/eluser]
Thanks for the reply.

In fact, I added a list of all accented characters instead. That's less dangerous. Here is my list:
ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ

Note that not all of these are French :-).

I also solved the issue that the problem only arose when the parameters were passed as URL segments. It was indeed my fault: my routing filters also needed to be aware of accents.

And finally, here are my thoughts on accented URLs:

The application should not generate URLs that contain accented characters, as they may lead to strange behaviour (I had problems with a DNS redirect) and are not SEO-friendly. BUT if someone types a URL with accents, like the one in the previous post, it should not produce buggy output or an error message.
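
As an alternative to enumerating accented characters one by one, a segment can be checked against Unicode character classes. This is only a sketch under assumptions: the function name `is_permitted_segment` is hypothetical, the input is assumed to be UTF-8, and this is a standalone check, not CodeIgniter's own permitted_uri_chars filter.

```php
<?php
// Hypothetical check: accept a URI segment made of letters from any script,
// digits, and a few safe punctuation characters. Assumes UTF-8 input.
function is_permitted_segment($segment)
{
    // Browsers usually percent-encode non-ASCII characters, so decode first.
    $decoded = rawurldecode($segment);

    // \p{L} = any letter, \p{M} = combining marks, \p{N} = any digit.
    // The u modifier makes PCRE treat pattern and subject as UTF-8;
    // preg_match returns false on invalid UTF-8, which is rejected as well.
    return preg_match('/^[\p{L}\p{M}\p{N}_.~-]+$/u', $decoded) === 1;
}
```

Under these assumptions both `bébé` and its percent-encoded form `b%C3%A9b%C3%A9` pass, while a segment such as `<script>` is rejected.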

#7
[eluser]Pascal Kriete[/eluser]
And here is the full transliteration code for nico's list (blatantly stolen from the OWASP PHP filter):
Code:
$string = strtr($string,
    "ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ",
    "SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy");

This is a little rough, but it works for most purposes (ä, ö, ü and ß should really use the proper replacements: ae, oe, ue, ss).
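
One caveat, assuming the script and its input are UTF-8: the two-string form of strtr translates byte by byte, so it only lines up when every character is a single byte (for example in ISO-8859-1). The array form replaces whole substrings and can also map one character to two letters. A minimal sketch follows; the function name `transliterate_utf8` is hypothetical and the table is a short excerpt, not the full list.

```php
<?php
// Hypothetical helper; the table is an excerpt for illustration only.
function transliterate_utf8($string)
{
    $map = array(
        // German digraphs, as suggested above
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss',
        'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
        // Single-letter replacements
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ç' => 'c',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'î' => 'i', 'ï' => 'i', 'ô' => 'o', 'ù' => 'u', 'û' => 'u',
        'Š' => 'S', 'š' => 's', 'Ž' => 'Z', 'ž' => 'z',
        // Ligatures
        'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae',
    );
    // With an array argument, strtr matches complete keys (multi-byte
    // sequences included) and never rescans text it has already replaced.
    return strtr($string, $map);
}
```

For example, `transliterate_utf8('Straße')` gives `Strasse` and `transliterate_utf8('vendée')` gives `vendee`.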

#8
[eluser]nico060475[/eluser]
I found the list here: http://ellislab.com/forums/viewthread/71139/ :-).

And I'm using a transliteration function very close to the one you gave here: http://ellislab.com/forums/viewthread/79752/#400374.

I use it as a replacement for the default url_title function.

Code:
function url_title($s, $separator = 'underscore', $chars_to_keep = '/[^a-z0-9_-]/' )
{
    // This way we don't have to mess with uppercase the rest of the time
    $s = strtolower(htmlentities(strip_tags(str_replace('?','',utf8_decode($s)))));
    
    // Keep only the base letter; the accent-name group is non-capturing. 'slash' was added for oslash
    $s = preg_replace ('/&([a-z])(?:uml|acute|grave|circ|tilde|cedil|ring|slash);/', '$1', $s);
    // Weird characters that don't get caught above - this also applies to ð and þ, but I don't know what the best replacement for those would be.
    // While we're at it, we'll also strip the http://
    $s = str_replace( array('ß', 'æ', 'œ', 'http://'), array('ss', 'ae', 'oe', ''), $s);
    
    $s = html_entity_decode($s);
    
    // Normalize multiple spaces, dashes, and underscores
    $s = preg_replace( array('/\&/', '/\s+/', '/-+/', '/_+/'), ($separator=='dash')?'-':'_', $s);

    // Remove unwanted chars
    $s = preg_replace($chars_to_keep, '', $s);
    
    return $s;
}
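
A note for anyone running this on a current PHP version: utf8_decode is deprecated as of PHP 8.2, and htmlentities has defaulted to UTF-8 since PHP 5.4, so the Latin-1 round trip above may not behave as it did originally. Below is a minimal sketch of the same entity trick applied directly to UTF-8 input. The function name `entity_transliterate` is hypothetical, and the sketch covers only the accent-stripping step, not the separator handling.

```php
<?php
// Hypothetical name; a sketch of the entity trick for UTF-8 input.
function entity_transliterate($s)
{
    // é becomes &eacute;, ß becomes &szlig;, œ becomes &oelig;, and so on.
    $s = htmlentities($s, ENT_QUOTES, 'UTF-8');

    // Handle ß first; the ligature rule below would turn it into "sz".
    $s = str_replace('&szlig;', 'ss', $s);

    // Keep only the base letter of accented-letter entities.
    $s = preg_replace('/&([a-zA-Z])(?:uml|acute|grave|circ|tilde|cedil|ring|slash|caron);/', '$1', $s);

    // Ligature entities carry both letters: &aelig; -> ae, &oelig; -> oe.
    $s = preg_replace('/&([a-zA-Z]{2})lig;/', '$1', $s);

    // Turn any remaining entities (such as &amp;) back into characters.
    return html_entity_decode($s, ENT_QUOTES, 'UTF-8');
}
```

The result could then be lowercased and passed through the separator and cleanup steps of url_title as before.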

#9
[eluser]Pascal Kriete[/eluser]
I actually went looking for that, but I couldn't remember in what context I had posted it. So thanks for finding it (and for using it, of course ;-) ).

