Web Spider / data extraction?
#1

[eluser]umbungo[/eluser]
I need to extract information from a few sites: basically product catalogs, where the information sits in paginated tables. Eventually I want to get it into a MySQL database.

How should I go about doing this?

My uninformed thought is that I would want to use some kind of spider, and then process the pages with something like regular expressions to pull out the information I want.

Any help greatly appreciated!
#2

[eluser]Vheissu[/eluser]
Have you thought of using YQL - http://developer.yahoo.com/yql/ and writing some PHP to store the results in a database?
#3

[eluser]umbungo[/eluser]
Thanks, this is really useful. I wasn't properly aware of YQL before.

I am at the point where I can retrieve the information from a web page (using the console), e.g.:

Select content from html where url="http://www.website.com/products/Category1/Big/Red" and xpath='//a[@class="productName"]'

However, how can I get results from website.com/products/*, i.e. many different pages? Or must I get a list of all the sub-pages and then run a query for every one separately? (If so, what is the best way of doing this?)
#4

[eluser]umbungo[/eluser]
^^^ Any advice on how to do this across many pages on a site would be much appreciated! Even just a way to get a list of all the sub-pages of a site, so that I could make something crude with loops/arrays.
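
To make the question concrete, the crude version I have in mind looks roughly like this (only a sketch: the starting URL, the xpaths, and the public YQL JSON endpoint are assumptions, and they depend on the site's markup):

[code]
<?php
// Rough sketch: pull the sub-page links first, then run the product query on each.

function yql($query)
{
    $url = 'http://query.yahooapis.com/v1/public/yql?q=' . urlencode($query) . '&format=json';
    return json_decode(file_get_contents($url), true);
}

// 1) Collect every link under /products/ from the listing page.
$data  = yql('select * from html where url="http://www.website.com/products/"'
           . ' and xpath=\'//a[contains(@href,"/products/")]\'');
$links = isset($data['query']['results']['a']) ? $data['query']['results']['a'] : array();
if (isset($links['href'])) {
    $links = array($links);   // a single match comes back as one object, not a list
}

// 2) Query each discovered page for the product names.
foreach ($links as $link) {
    $page = yql('select content from html where url="' . $link['href'] . '"'
              . ' and xpath=\'//a[@class="productName"]\'');
    print_r($page['query']['results']);
    sleep(1);   // stay polite to the site and within YQL's rate limits
}
[/code]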
#5

[eluser]ChrisMiller[/eluser]
Now, I have never built a spider robot, but I would say it is rather simple if you already know how to extract the data...

Simply put, you also build an extractor that grabs links, restricted to the site itself. It searches the homepage first for anything matching somewebsite . com/* and adds every page it finds to a queue list, say in a database table. The spider then slowly walks through that queue: it checks each page for your product data, and it also checks for more links, adding any new ones to the queue. That way the spider can walk the whole website, and you program it to keep going until it reaches the end of the list.
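
Something like this is what I mean by the link extractor and queue (just a rough sketch; the crawl_queue table, the DOMDocument parsing and the very naive link filtering are my own choices, not the only way to do it):

[code]
<?php
// Rough sketch of the link extractor + queue.
// Assumes a table like:
//   CREATE TABLE crawl_queue (url VARCHAR(255) PRIMARY KEY, processed TINYINT NOT NULL DEFAULT 0);

function enqueue_links(PDO $pdo, $html, $allowedHost)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);   // suppress warnings from real-world sloppy markup

    $insert = $pdo->prepare('INSERT IGNORE INTO crawl_queue (url, processed) VALUES (?, 0)');

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');

        if (strpos($href, 'http') === 0) {
            // absolute URL - keep as-is
        } elseif (strpos($href, '/') === 0) {
            $href = 'http://' . $allowedHost . $href;   // root-relative link
        } else {
            continue;   // skip other relative links in this sketch
        }

        // Only queue links that stay on the site being spidered.
        if (parse_url($href, PHP_URL_HOST) === $allowedHost) {
            $insert->execute(array($href));   // INSERT IGNORE skips URLs already queued
        }
    }
}
[/code]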


One very important thing is to add a feature to stop the robot if it runs away, for example a check before each link to make sure it is still allowed to process it. Also, to keep page load times down for the robot, I would have the script redirect to itself over and over in a loop, handling one page per request.

So a simple process would be: cron starts the script, it loads up the links, processes a page and marks it complete, reloads itself, does the next page in line, refreshes, and so on...
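
The cron-driven part could look roughly like this (again only a sketch, reusing the crawl_queue table and enqueue_links() from above; the 10,000-row cap is an arbitrary guard against a runaway crawl):

[code]
<?php
// Rough sketch of one cron run: take the next unprocessed URL from the queue,
// scrape it, queue any new links, and mark it done.

$pdo = new PDO('mysql:host=localhost;dbname=catalog', 'user', 'pass');

// Safety check: stop if the queue has grown past a sane limit.
if ($pdo->query('SELECT COUNT(*) FROM crawl_queue')->fetchColumn() > 10000) {
    exit("Queue too large - possible runaway, stopping.\n");
}

$row = $pdo->query('SELECT url FROM crawl_queue WHERE processed = 0 LIMIT 1')
           ->fetch(PDO::FETCH_ASSOC);

if ($row === false) {
    exit("Queue empty - crawl finished.\n");
}

$html = file_get_contents($row['url']);

// ... parse $html and insert the product data into your own table here ...

enqueue_links($pdo, $html, 'www.website.com');   // from the earlier sketch

$pdo->prepare('UPDATE crawl_queue SET processed = 1 WHERE url = ?')
    ->execute(array($row['url']));
[/code]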
#6

[eluser]Kamarg[/eluser]
While the basic theory is sound, there are a couple of issues to keep in mind. Wikipedia has a good entry covering the basics under "spider traps".
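
For example, a couple of cheap guards against the common traps (infinite calendars, session IDs in URLs, endlessly deep paths); the exact limits and the stripping of query strings here are example policies, not rules:

[code]
<?php
// Rough sketch of simple spider-trap guards.

function should_crawl($url, array $alreadyQueued)
{
    // Drop query strings / fragments so session IDs don't create "new" URLs forever.
    $clean = strtok($url, '?#');

    // Skip suspiciously deep paths - often a sign of a trap.
    $path = (string) parse_url($clean, PHP_URL_PATH);
    if (substr_count($path, '/') > 10) {
        return false;
    }

    // Skip anything already queued or visited.
    return !in_array($clean, $alreadyQueued, true);
}
[/code]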



