02-02-2009, 01:56 PM

[eluser]felyx[/eluser]
This is not a CI-related issue, but since there are so many clever guys and girls around here, I thought I would ask: how would you extract the meta tags and the title tag from a given URL? I wrote my own functions to do it, but I ran into all kinds of errors and problems. I solved most of them, but sometimes it just doesn't work as it should, so I figured I might need some new ideas. If you have any, please help me out. :)

02-02-2009, 02:18 PM

[eluser]jalalski[/eluser]
Can you show us what you are doing at present?
And then explain what it is that isn't working the way that you'd like and we could maybe improve it...

02-02-2009, 02:20 PM

[eluser]felyx[/eluser]
This is what I have at the moment:

Code:
function get_title_data($url)
{
    $title = "";
    $fp = @fopen( $url, 'r' );
    if ($fp) {
        $cont = "";
        while( !feof( $fp ) ) {
            $buf = trim(fgets( $fp, 4096 )) ;
            $cont .= $buf;
        }
        $title_exists = @preg_match( "/<title>([a-z 0-9]*)<\/title>/si", $cont, $match );
        if ($title_exists) {
            $title = strip_tags(@$match[ 1 ]);
            return $title;
        } else {
            return $url;
        }
    } else {
        return $url;
    }
}

function get_meta_data($url)
{
    $meta = array();
    $fp = @fopen( $url, 'r' );
    if ($fp) {
        $cont = "";
        while( !feof( $fp ) ) {
            $buf = trim(fgets( $fp, 4096 )) ;
            $cont .= $buf;
        }
        $meta_exists = @preg_match_all("|<meta[^>]+name=\"([^\"]*)\"[^>]+content=\"([^\"]*)\"[^>]+>|i", $cont, $out, PREG_PATTERN_ORDER);
        if ( ($meta_exists != FALSE) AND ($meta_exists > 0) ) {
            for ($i=0;$i < count($out[1]);$i++) {
                if (strtolower($out[1][$i]) == "keywords") $meta['keywords'] = $out[2][$i];
                if (strtolower($out[1][$i]) == "description") $meta['description'] = $out[2][$i];
            }
            foreach ($meta as $key => $value) {
                $meta[$key] = ( !empty($meta[$key]) ) ? $meta[$key] : $url;
            }
            return $meta;
        } else {
            $meta['keywords'] = $url;
            $meta['description'] = $url;
            return $meta;
        }
    } else {
        $meta['keywords'] = $url;
        $meta['description'] = $url;
        return $meta;
    }
}

02-02-2009, 02:24 PM

[eluser]jalalski[/eluser]
And what kind of errors/problems are you trying to solve?

02-02-2009, 02:29 PM

[eluser]felyx[/eluser]
[quote author="jalalski" date="1233627876"]And what kind of errors/problems are you trying to solve?[/quote]

Getting the title works almost every time, but meta extraction does not always work. When I tried http://test.com (at first I thought the site did not even exist), it returned neither the title nor the meta tags, yet when I checked the site's source code I saw both there. So, for one, sometimes it cannot extract the data from the URL even though it is there. Other times it extracts the title but not the meta tags, although the source contains both. Those are the kinds of problems I have.

02-02-2009, 02:39 PM

[eluser]jalalski[/eluser]
You may want to put the regex inside the read loop, then you can break out of the loop as soon as you have found the title tag.
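As a rough illustration of breaking out of the read loop early, here is a minimal sketch. The function name read_until_title_end and the 64 KB safety cap are assumptions made for the example, not part of the posted code.

```php
<?php
// Hypothetical helper: read a stream line by line and stop as soon as
// the closing title tag has been seen, instead of reading the whole page.
function read_until_title_end($fp, $max_bytes = 65536)
{
    $buffer = '';
    while (!feof($fp) && strlen($buffer) < $max_bytes) {
        $line = fgets($fp, 4096);
        if ($line === false) {
            break; // read error or end of stream
        }
        $buffer .= $line;
        // Search the accumulated buffer so a tag split across two reads is still found.
        if (stripos($buffer, '</title>') !== false) {
            break; // the title is complete; no need to read the rest
        }
    }
    return $buffer;
}
```

The returned buffer can then be handed to whatever title regex is in use; the cap keeps a page without a title tag from being read indefinitely.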

The problem is with the preg_match pattern that you are using: it only matches letters, spaces and digits, but titles contain punctuation as well. A better regex would search for '<title>', then any characters up to '</title>'; that will be the title... unless the title tag extends over more than one line. Then it gets a little more complex.
In that case you need to read lines until you see the opening title tag, and then keep reading further lines until you see the closing title tag.
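A minimal sketch of such a regex, assuming the page content is already in a string. The name extract_title is hypothetical, and the whitespace collapsing and entity decoding are extras that may or may not be wanted.

```php
<?php
// Hypothetical helper: pull the <title> text out of an HTML string.
function extract_title($html)
{
    // [^>]* allows attributes on the opening tag; (.*?) is non-greedy;
    // the "s" modifier lets "." match newlines, "i" ignores case.
    if (preg_match('/<title\b[^>]*>(.*?)<\/title>/si', $html, $match)) {
        // Collapse runs of whitespace (including line breaks) to single spaces.
        $title = preg_replace('/\s+/', ' ', $match[1]);
        return trim(html_entity_decode($title, ENT_QUOTES, 'UTF-8'));
    }
    return null; // no title tag found
}
```

Because of the "s" modifier, a title that spans several lines is matched in one go as long as the whole tag is inside the string being searched.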

The same principle applies to the META tags.

You may want to use file(...) to read the whole file into an array and then work with that. Depends on how big the files are.
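For the META tags, one possible sketch along the same lines, assuming the page is already in a string. As far as I can tell, the posted pattern expects name before content and needs at least one character between the closing quote of content and the '>', so a tag such as <meta name="keywords" content="x"> would be skipped. The name extract_meta is hypothetical.

```php
<?php
// Hypothetical helper: collect name => content pairs from <meta> tags.
function extract_meta($html)
{
    $meta = array();
    // Grab each whole <meta ...> tag first, then read its attributes
    // separately, so the order of name and content does not matter.
    // Limitation: a literal ">" inside an attribute value cuts the tag short.
    if (!preg_match_all('/<meta\b[^>]*>/i', $html, $tags)) {
        return $meta;
    }
    foreach ($tags[0] as $tag) {
        // (["\']) captures the opening quote, \1 requires the same closing quote.
        if (preg_match('/\sname\s*=\s*(["\'])(.*?)\1/is', $tag, $name)
            && preg_match('/\scontent\s*=\s*(["\'])(.*?)\1/is', $tag, $content)) {
            $meta[strtolower(trim($name[2]))] = trim($content[2]);
        }
    }
    return $meta;
}
```

Tags without a name attribute (http-equiv, for instance) are ignored, and unquoted attribute values are not handled in this sketch.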

02-02-2009, 02:42 PM

[eluser]jalalski[/eluser]
Of course, the META, there is:
http://php.net/get_meta_tags
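A small usage sketch of get_meta_tags(), with the same fallback behaviour as the posted get_meta_data(). The wrapper name meta_or_default is made up for the example, and passing a URL assumes allow_url_fopen is enabled.

```php
<?php
// Hypothetical wrapper around PHP's built-in get_meta_tags().
function meta_or_default($source, $default)
{
    // get_meta_tags() accepts a local path or a URL, returns the name
    // attributes as lowercased array keys, and stops parsing at </head>.
    $tags = @get_meta_tags($source);
    if ($tags === false) {
        $tags = array(); // could not open the source
    }
    return array(
        'keywords'    => !empty($tags['keywords']) ? $tags['keywords'] : $default,
        'description' => !empty($tags['description']) ? $tags['description'] : $default,
    );
}
```

Called as meta_or_default($url, $url), it returns the URL for any tag that is missing, which is what the original function does.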

02-02-2009, 02:43 PM

[eluser]felyx[/eluser]
Well, I just changed the reading method to file_get_contents(). The regex might be the problem for one; I need to work on those, and my only problem is that I am not so good at regex :/. The file can be any size; it really depends on the website. This is going to be a link submission script.
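Since the page can be any size, one option is to cap how much file_get_contents() reads. A minimal sketch, where the name fetch_head, the 64 KB cap and the 5 second timeout are all assumptions chosen for illustration; remote URLs again assume allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper: fetch at most $max_bytes from a URL or local path.
function fetch_head($source, $max_bytes = 65536, $timeout = 5)
{
    // The timeout only applies to the http wrapper; it is ignored for local files.
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeout),
    ));
    // The fifth argument limits the number of bytes read.
    $html = @file_get_contents($source, false, $context, 0, $max_bytes);
    return ($html === false) ? null : $html;
}
```

The title and meta tags normally sit in the head of the page, so a cap like this usually still covers them, though that is not guaranteed for every site.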

02-02-2009, 02:45 PM

[eluser]felyx[/eluser]
The title regex fix pretty much solved the problem for the title part, thanks :)

02-02-2009, 02:47 PM

[eluser]felyx[/eluser]
[quote author="jalalski" date="1233628972"]Of course, the META, there is:
http://php.net/get_meta_tags[/quote]

Yep, I used that function as well, but it does not work so well either and is a bit too slow. That's why I want to use the same method as for the title.