Crawler logic.
#1

[eluser]jcavard[/eluser]
Hi,

I have to code a crawler at work. I am using Phil Sturgeon's Curl library and it works like a charm.

The main goal of this crawler is to parse a list of roughly 3500 HTML pages. The problem is that after some time, the server crashes...

These are the steps I go through:
Step #1: gather the list of URLs to parse and put it in a file named like this: list_YYYY-MM-DD.txt
Step #2: redirect to the controller Parse()
Step #3: open list_YYYY-MM-DD.txt, take the first line (which is a URL) out, and save the file.
Step #4: parse that URL

It repeats step #3 and step #4 99 times, then it refreshes the page, and continues doing so until the file list_YYYY-MM-DD.txt has no more lines in it, roughly as in the sketch below.
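Roughly, the Parse() controller does something like this (simplified sketch, not the exact code; the batch size and the 'crawler/parse' route in the redirect are placeholders):

Code:
// simplified sketch of the Parse() controller: handle a fixed batch of
// URLs per request, then reload the page so no single request runs too long
function parse()
{
    $this->load->helper(array('file', 'url'));

    $filename = 'list_' . date('Y-m-d') . '.txt';

    for ($i = 0; $i < 99; $i++)
    {
        // step #3: read the list, take the first URL out, save the file
        $lines = explode('|', read_file($filename));

        if (count($lines) == 0 OR $lines[0] == '')
        {
            return; // nothing left to parse, we are done
        }

        $url = array_shift($lines);

        // step #4: parse that URL
        $this->_parse('iaa', $url);

        write_file($filename, join('|', $lines));
    }

    // batch done: reload the page to start the next batch
    // ('crawler/parse' is a placeholder for the real route)
    redirect('crawler/parse');
}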

I would like some input from any of you guys who have built a crawler. What can I do to prevent it from crashing? Thanks
#2

[eluser]Ben Edmunds[/eluser]
Is it crashing in the same place every time? Is it crashing on a certain URL, or after a certain amount of time?
#3

[eluser]jcavard[/eluser]
It crashes with the infamous 500 Internal Server Error...

I hit refresh and it continues for a while before crashing again shortly after. I would like to prevent this.
#4

[eluser]@rno[/eluser]
Does your php log specify an error?
#5

[eluser]Ben Edmunds[/eluser]
So when you hit refresh, do you have it starting from the beginning again or does it continue where it left off?

OK, I read your original post again and it looks like you are deleting the URLs after you read them, is that right?

And how many does it read each time before it crashes? Is it always the same number of lines?
#6

[eluser]jcavard[/eluser]
[quote author="Ben Edmunds" date="1256844774"]So when you hit refresh, do you have it starting from the beginning again or does it continue where it left off?[/quote]

It continues where it left off, since I have the list in a file like this:

http://www.url2parse.com/1.html
http://www.url2parse.com/2.html
http://www.url2parse.com/3.html
http://www.url2parse.com/4.html
http://www.url2parse.com/5.html
...
http://www.url2parse.com/3500.html

Whenever I parse a URL, I shift it off the list in the file, so when the page refreshes the list is shorter than before.

The code explains it better:
Code:
// read the list of URLs to parse from the file on disk (eg: count($lines) == 3500)
$lines    = explode('|', read_file($filename));

// get the first URL out of the array
$url    = array_shift($lines);

// parse that URL
$attr    = $this->_parse('iaa', $url);

// save the remaining URLs back to the file (now count($lines) == 3499)
write_file($filename, join('|', $lines));
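I don't check any return values at the moment; if I hardened it, the save step would look something like this (sketch only, not what I currently run):

Code:
// sketch: same save step, but stop loudly if the write fails so the
// remaining URL list isn't silently lost
if ( ! write_file($filename, join('|', $lines)))
{
    log_message('error', 'Could not write ' . $filename . ' - stopping the crawl');
    show_error('Failed to save the URL list');
}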
#7

[eluser]jcavard[/eluser]
It reads 200 URLs from the file before reloading the page, then it goes for another 200, then reloads... until the file on disk is empty.

I refresh the page so the request doesn't take forever, but still...

The log files don't show any errors either...

How do you think the googlebot works? I mean, a crawler is basically an infinite loop, right?
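The way I picture it is something like this (very rough sketch; fetch_page, extract_links and save_results are made-up names, not real library functions):

Code:
// very rough idea of a crawler's main loop: keep pulling URLs off a queue
// and adding any new links found on each page back onto it
$queue = array('http://www.example.com/');   // seed URL(s)

while (count($queue) > 0)
{
    $url   = array_shift($queue);
    $html  = fetch_page($url);                   // eg: fetch with the cURL library
    $queue = array_merge($queue, extract_links($html));
    save_results($url, $html);
}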



