Scraping sites for more info (on the fly)
#1

[eluser]taewoo[/eluser]
Hi everyone.
I'm looking for ways to scrape sites for relevant info when someone edits his/her profile. I know there's the Simple HTML DOM parser, but the learning curve for its syntax is pretty steep. Other than regular expressions, is there another approach that you guys use?
#2

[eluser]zimco[/eluser]
I've successfully used the Snoopy class with CodeIgniter to help with some scraping. This thread explains how to get Snoopy working with CI: http://ellislab.com/forums/viewthread/73338/
#3

[eluser]taewoo[/eluser]
Thanks, zimco.
But I'm not sure how Snoopy (a browser emulator, which to me seems like just another version of cURL) can help with scraping. I need to not just fetch pages, but also parse them and extract the relevant information.
#4

[eluser]zimco[/eluser]
I don't know if they will help in your situation, but I also used a couple of really basic scraping and parsing classes:

Http.php, written by Troy Wolf: a screen-scraping class with caching. It includes an image_cache.php companion script and static methods to extract data from HTML tables into arrays or XML. It now supports sending XML requests and custom verbs, including WebDAV requests to Microsoft Exchange Server.

Parser.php: a parsing class with various functions to "help" parse an HTML file for data:
- Remove forbidden HTML tags using the PHP strip_tags function
- Remove unwanted attributes from the HTML source using the PHP preg_replace function
- Reformat an HTML document: this removes HTML tags, JavaScript sections and white space, and converts some common HTML entities to their text equivalents
- Split the page HTML
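
For anyone wondering what the first three operations in that list look like in plain PHP, here is a rough sketch using only built-ins (strip_tags, preg_replace, html_entity_decode). It is not the actual Parser.php code: the function names keep_tags, remove_attributes and html_to_text are made up for illustration, and the attribute regex is a simplification that could also match text outside of tags.

```php
<?php
// Hypothetical helpers for illustration; NOT the real Parser.php API.

// Keep only the whitelisted tags; every other tag is stripped
// (the text inside stripped tags is kept).
function keep_tags(string $html, string $allowed = '<p><a><b><i>'): string
{
    return strip_tags($html, $allowed);
}

// Drop the named attributes (for example style or onclick) wherever they appear.
function remove_attributes(string $html, array $names): string
{
    $quoted = array_map(function ($n) { return preg_quote($n, '/'); }, $names);
    // Whitespace, attribute name, "=", then a double-quoted, single-quoted or bare value.
    $pattern = '/\s+(?:' . implode('|', $quoted) . ')\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)/i';
    return preg_replace($pattern, '', $html);
}

// Reduce a document to plain text: remove script/style sections and tags,
// decode HTML entities, and collapse runs of white space.
function html_to_text(string $html): string
{
    $html = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1>/is', ' ', $html);
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

For example, remove_attributes('<p style="color:red" class="x">Hi</p>', ['style']) leaves the class attribute alone and drops only the style one.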

But I really found that writing the parsing part myself with regexes was easier than trying to figure out how to make somebody else's parser fit my situation.
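
To give an idea of the regex route, here is a minimal sketch that pulls the page title and meta description out of HTML that has already been fetched. The function name extract_page_info and the choice of fields are assumptions for illustration only, and the meta pattern assumes the name attribute comes directly before content; real pages vary, so treat the patterns as a starting point.

```php
<?php
// Hypothetical helper: extract the <title> and meta description from raw HTML.
// Returns null for a field that is not found.
function extract_page_info(string $html): array
{
    $info = ['title' => null, 'description' => null];

    // Lazy match between the opening and closing title tags.
    if (preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $html, $m)) {
        $info['title'] = trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }

    // Assumes name="description" is immediately followed by content="...".
    // The back-references \1 and \2 make each closing quote match its opening quote.
    $meta = '/<meta\s+name=(["\'])description\1\s+content=(["\'])(.*?)\2/is';
    if (preg_match($meta, $html, $m)) {
        $info['description'] = trim(html_entity_decode($m[3], ENT_QUOTES, 'UTF-8'));
    }

    return $info;
}
```

The HTML string can come from whatever fetcher is already in use (Snoopy, cURL, and so on); the parsing step does not care how the page was downloaded.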



