Crawler logic
[eluser]jcavard[/eluser]
Hi, I have to code a crawler at work. I am using Phil Sturgeon's cURL library and it works like a charm. The problem is that after some time the server crashes. The main goal of this crawler is to parse a list of roughly 3500 HTML pages, but at some point the server seems to crash. These are the steps I follow:

Step #1: gather the list of URLs to parse and put it in a file named like this: list_YYYY-MM-DD.txt
Step #2: redirect to the controller Parse()
Step #3: open list_YYYY-MM-DD.txt, take the first line (which is a URL) out, and save the file
Step #4: parse that URL

It repeats steps #3 and #4 99 times, then it refreshes the page, and continues doing so until the file list_YYYY-MM-DD.txt has no more lines in it.

I would like some input if any of you have written a crawler. What can I do to prevent it from crashing? Thanks
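Roughly, steps #3 and #4 look like this (a simplified sketch in plain PHP with the curl extension; the file name follows the steps above, everything else is illustrative rather than the actual code from the post):
Code:
&lt;?php
// Illustrative sketch only, not the poster's actual code.
$list_file = 'list_' . date('Y-m-d') . '.txt';

// Read the remaining URLs from disk.
$lines = file($list_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

for ($i = 0; $i < 99 && !empty($lines); $i++) {
    // Step #3: take the first URL off the list and save the shorter list,
    // so a refresh resumes where the previous request stopped.
    $url = array_shift($lines);
    file_put_contents($list_file, implode(PHP_EOL, $lines));

    // Step #4: fetch the page and parse it.
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html !== false) {
        // ... extract whatever is needed from $html ...
    }
}

// If URLs remain, the page is refreshed (e.g. a redirect back to Parse())
// so the next batch of 99 runs in a fresh request.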
[eluser]Ben Edmunds[/eluser]
Is it crashing in the same place every time? Is it crashing on a certain URL, or after a certain amount of time?
[eluser]jcavard[/eluser]
It crashes with the infamous 500 Internal Server Error... I hit refresh and it continues for a while before crashing again shortly after. I would like to prevent this.
[eluser]Ben Edmunds[/eluser]
So when you hit refresh, do you have it starting from the beginning again or does it continue where it left off? OK, I read your original post again and it looks like you are deleting the URLs after you read them, is that right? And how many does it read every time before it crashes? Is it always the same number of lines?
[eluser]jcavard[/eluser]
[quote author="Ben Edmunds" date="1256844774"]So when you hit refresh, do you have it starting from the beginning again or does it continue where it left off?[/quote] it continues where it left, since I have the list in a file like this: http://www.url2parse.com/1.html http://www.url2parse.com/2.html http://www.url2parse.com/3.html http://www.url2parse.com/4.html http://www.url2parse.com/5.html ... http://www.url2parse.com/3500.html Whenever I parse one url, I shift it from the file, so when it refreshes the list is shorter than previously. code explains better: Code: // read the list of URL to parse from file on disk. (eg: count($lines) == 3500)
[eluser]jcavard[/eluser]
It reads 200 URLs from the file before reloading the page, then it goes for another 200, then reloads... until the file on disk is empty. I refresh the page so the request doesn't take forever, but still... The log file doesn't show any errors either... How do you think the Googlebot works? I mean, a crawler is basically an infinite loop?