CodeIgniter Forums
web crawler - Printable Version

+- CodeIgniter Forums (https://forum.codeigniter.com)
+-- Forum: Archived Discussions (https://forum.codeigniter.com/forumdisplay.php?fid=20)
+--- Forum: Archived Libraries & Helpers (https://forum.codeigniter.com/forumdisplay.php?fid=22)
+--- Thread: web crawler (/showthread.php?tid=27352)

Pages: 1 2


web crawler - El Forum - 02-08-2010

[eluser]jcavard[/eluser]
Hi!

I developed a web crawler with CI. It does the job, but my host keeps blocking my IP for a 'Large connection amount' (max allowed: 25). I guess this is a direct effect of the crawler, but I would like your input on this...

Is PHP maybe not the best language for coding a crawler?
What causes concurrent connections?
Have any of you ever coded a crawler (or anything similar)?

The main goal is to parse 60,000+ HTML pages to retrieve specific product information. Have any of you ever had the same 'problem'?

thanks a lot!


web crawler - El Forum - 02-08-2010

[eluser]Sbioko[/eluser]
Yeah, I did that some years ago :-) But today I think PHP is not the best programming language for building a crawler, because it is a scripting language (itself implemented in C). Have you heard about HipHop for PHP from Facebook? Try it :-) It translates your PHP into C++ code and can increase performance by up to 50%. About the connections, I can't say anything concrete, because I haven't seen the code.


web crawler - El Forum - 02-08-2010

[eluser]jcavard[/eluser]
yeah, I read about HipHop for PHP, but I'm on shared hosting.
I'm looking at rewriting the whole thing in Java so it can be multithreaded.


web crawler - El Forum - 02-08-2010

[eluser]danmontgomery[/eluser]
Your host blocking your IP because of outgoing connections isn't going to be affected by whether or not the process is threaded... It sounds like you need to limit the number of concurrent connections or talk to your host.

[edit]

Concurrent connections happen when you don't wait for one connection to close before opening another; judging from the error message, you have 25 outgoing connections open at once.
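
As an illustration of the sequential approach just described, here is a minimal sketch in plain PHP with the cURL extension (the URL list and the pause length are placeholder assumptions, not from the thread):

Code:
<?php
// Hypothetical list of pages to crawl
$urls = array(
    'http://example.com/products/1',
    'http://example.com/products/2',
);

foreach ($urls as $url)
{
    $session = curl_init($url);
    curl_setopt($session, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($session, CURLOPT_TIMEOUT, 30);

    // curl_exec() blocks until the response arrives, and curl_close()
    // releases the connection before the next iteration opens a new
    // one, so only one connection is ever open at a time.
    $html = curl_exec($session);
    curl_close($session);

    // Optional pause between requests to stay well under the host's limit
    usleep(250000); // 0.25 seconds
}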


web crawler - El Forum - 02-08-2010

[eluser]jcavard[/eluser]
Thanks for pointing out that multithreading won't help with concurrent connections. That will save me lots of work, because concurrent connections are my main (and sole) concern at this time.

How can I limit the number of connections, then? Do you have any ideas? I use cURL... am I missing some configuration on the cURL object?


[quote author="noctrum" date="1265681936"]Your host blocking your IP because of outgoing connections isn't going to be affected by whether or not the process is threaded... It sounds like you need to limit the number of concurrent connections or talk to your host.

[edit]

Concurrent connections happen when you don't wait for one connection to close before opening another; judging from the error message, you have 25 outgoing connections open at once.
[/quote]
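
One aside on the configuration question: cURL has no per-handle option that caps the total number of connections a script opens; that limit has to be enforced in your own loop (or with a pool, as suggested later in the thread). What it does offer are per-handle options that stop keep-alive connections from lingering, which can otherwise look like extra open connections to a host. A minimal sketch (the URL is a placeholder):

Code:
<?php
$session = curl_init('http://example.com/');
curl_setopt($session, CURLOPT_RETURNTRANSFER, TRUE);

// Open a fresh connection for this request and close it when done,
// rather than leaving it open for keep-alive reuse.
curl_setopt($session, CURLOPT_FRESH_CONNECT, TRUE);
curl_setopt($session, CURLOPT_FORBID_REUSE, TRUE);

$html = curl_exec($session);
curl_close($session);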


web crawler - El Forum - 02-08-2010

[eluser]Sbioko[/eluser]
Why not rewrite it in C++? And an important question: why do you need this? (if it's not a secret)


web crawler - El Forum - 02-08-2010

[eluser]jcavard[/eluser]
Well, I am evaluating all the possibilities right now, and if C++ is the way to go, so be it! I'll rewrite it all!

There is nothing secret about the crawler's purpose. I need it to retrieve information from different auction sites into one database. It helps us speed up searching: instead of searching all the different sites, I query one database that contains all the available auctions.

This script runs nightly, and it is an internal tool on an intranet.




[quote author="Sbioko" date="1265683111"]Why not to rewrite it to C++. And, important question: why do you need this?(if not secret)[/quote]


*Sexy edit: this thread has 69 views!*


web crawler - El Forum - 02-08-2010

[eluser]danmontgomery[/eluser]
I can't tell what, if anything, you're missing without seeing the code... Are you calling curl_close()?


web crawler - El Forum - 02-08-2010

[eluser]jcavard[/eluser]
[quote author="noctrum" date="1265685829"]I don't know if or what you're missing without seeing the code... Are you calling curl_close()?[/quote]

Well, I use Phil Sturgeon's cURL lib for CI. In the execute() function there is a curl_close():

Code:
// End a session and return the results
public function execute()
{
    // Set some default options (unless already provided), and merge any extra ones in
    if(!isset($this->options[CURLOPT_TIMEOUT])) $this->options[CURLOPT_TIMEOUT] = 30;
    if(!isset($this->options[CURLOPT_RETURNTRANSFER])) $this->options[CURLOPT_RETURNTRANSFER] = TRUE;
    if(!isset($this->options[CURLOPT_FOLLOWLOCATION])) $this->options[CURLOPT_FOLLOWLOCATION] = TRUE;
    if(!isset($this->options[CURLOPT_FAILONERROR])) $this->options[CURLOPT_FAILONERROR] = TRUE;

    if(!empty($this->headers))
    {
        $this->option(CURLOPT_HTTPHEADER, $this->headers);
    }

    // Apply the accumulated options to the cURL session
    $this->options();

    // Execute the request and hide all output
    $this->response = curl_exec($this->session);

    // Request failed
    if($this->response === FALSE)
    {
        $this->error_code = curl_errno($this->session);
        $this->error_string = curl_error($this->session);
        
        curl_close($this->session);
        $this->session = NULL;
        return FALSE;
    }
    
    // Request successful
    else
    {
        $this->info = curl_getinfo($this->session);
        
        curl_close($this->session);
        $this->session = NULL;
        return $this->response;
    }
}


Maybe I should try plain PHP (without CI), but the thing is, this script has been running fine for the past month, only to cause connection problems today. I can try new code, but it might work fine a few times before causing the same problem again...
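
For context, a sequential use of the library from a CI controller could look like the sketch below. Since execute() closes the session each time (as in the code above), a loop like this should never hold more than one connection open. The create() method name and the $product_urls list are assumptions for illustration:

Code:
// Inside a CodeIgniter controller, assuming the curl library is loaded
foreach ($product_urls as $url)
{
    $this->curl->create($url);
    $this->curl->option(CURLOPT_TIMEOUT, 30);

    // execute() runs the request and closes the session (see above)
    $html = $this->curl->execute();

    if ($html !== FALSE)
    {
        // ... parse the product information out of $html ...
    }
}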


web crawler - El Forum - 02-08-2010

[eluser]Kamarg[/eluser]
If you want to continue with your PHP version, look into pooling. This link is about thread pooling, but the same idea applies, substituting sockets/connections for threads.
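
A minimal sketch of that pooling idea in plain PHP, using curl_multi with the pool capped well under the host's limit of 25 (the cap, URL list, and parsing step are placeholder assumptions):

Code:
<?php
$urls = array(/* ... the 60,000+ page URLs ... */);
$max_pool = 5; // stay well under the host's limit of 25

$mh = curl_multi_init();
$pool = array();

while ($urls || $pool)
{
    // Top the pool up to $max_pool open connections
    while (count($pool) < $max_pool && $urls)
    {
        $ch = curl_init(array_shift($urls));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_multi_add_handle($mh, $ch);
        $pool[(int) $ch] = $ch;
    }

    // Drive the transfers, waiting briefly for network activity
    curl_multi_exec($mh, $running);
    curl_multi_select($mh, 1.0);

    // Harvest finished transfers to free their pool slots
    while ($done = curl_multi_info_read($mh))
    {
        $ch = $done['handle'];
        $html = curl_multi_getcontent($ch);
        // ... parse $html here ...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        unset($pool[(int) $ch]);
    }
}

curl_multi_close($mh);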