Help With Screen Scraping (Need To Find Date On Page)

#1
[eluser]Jay Logan[/eluser]
Hello. I'm working on an application that reads a web page and stores different parts of the page in an array. I can retrieve the easy pieces (code wrapped in DIV, P, etc.), but I'm having a hard time getting a piece of the page that sits directly in the BODY tag with no wrapper of its own. It is a date on the page, such as:

Code:
<hr>
Date: 2009-08-04,  1:23PM EDT<br>

I simply need 2009-08-04 and possibly the 1:23PM. I use this but it isn't working.

Code:
preg_match_all("/^[0-9]{4}-[0-9]{2}-[0-9]{2}$/", $html_of_page, $date);

It doesn't find any matches and spits out an empty array. Any help would be very much appreciated.

#2
[eluser]jedd[/eluser]
What if you dump the ^ and $ (I don't use preg_* functions, but from my understanding of regexps it'll look for a date that starts at the beginning of a line, and ends at the end of it - which isn't what you've got ...?) and perhaps just look for a space in front of your first numeric field, and a trailing comma.
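
To illustrate the point about the anchors, here is a minimal sketch. The sample string is an assumption modelled on the snippet in post #1. Note that in PHP's preg_* functions, without the m modifier, ^ and $ anchor to the start and end of the whole string rather than of each line, so the anchored pattern can only match when the page is nothing but the date.

```php
<?php
// Hypothetical sample input modelled on the snippet in post #1.
$html_of_page = "<html><body>\n<hr>\nDate: 2009-08-04,  1:23PM EDT<br>\n</body></html>";

// Anchored: ^ and $ pin the pattern to the whole subject string.
$anchored = preg_match_all('/^[0-9]{4}-[0-9]{2}-[0-9]{2}$/', $html_of_page, $a);

// Unanchored: the date may appear anywhere in the string.
$unanchored = preg_match_all('/[0-9]{4}-[0-9]{2}-[0-9]{2}/', $html_of_page, $b);

echo $anchored, "\n";   // 0 (no match)
echo $unanchored, "\n"; // 1 (one match)
echo $b[0][0], "\n";    // 2009-08-04
```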

#3
[eluser]zyzzzz[/eluser]
This is untested and frankly I suck at Regex (forgot how to do it until a week ago), but unless your html page consists of only the date, it wouldn't match that (I don't think?)

Try this:

Code:
$pattern = '/Date: ([\d]{4}-[\d]{2}-[\d]{2}), *([\d]{1,2}:[\d]{2}[AP]M) EDT/';
preg_match_all($pattern, $html_of_page, $matches);
// Now the results are in $matches, grouped by capture group (the default order):
// $matches[0] holds the full matches, $matches[1] the dates, $matches[2] the times
echo $matches[1][0].' '.$matches[2][0]; // This should echo '2009-08-04 1:23PM'

I have probably made quite a few mistakes as I'm still pretty new to them, but give that a whirl.
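
The indexing of the result array is easy to get backwards, so a short sketch may help. By default preg_match_all() groups results by capture group; passing PREG_SET_ORDER groups them by match instead. The two-line sample input and the variable names $byGroup and $byMatch are assumptions for illustration only.

```php
<?php
// Hypothetical sample input with two date lines.
$html_of_page = "<hr>\nDate: 2009-08-04,  1:23PM EDT<br>\n"
              . "<hr>\nDate: 2009-08-05, 11:05AM EDT<br>\n";
$pattern = '/Date: (\d{4}-\d{2}-\d{2}), *(\d{1,2}:\d{2}[AP]M) EDT/';

// Default (PREG_PATTERN_ORDER): $byGroup[1] holds every date, $byGroup[2] every time.
preg_match_all($pattern, $html_of_page, $byGroup);

// PREG_SET_ORDER: $byMatch[0] is the first match, $byMatch[1] the second, and so on.
preg_match_all($pattern, $html_of_page, $byMatch, PREG_SET_ORDER);

echo $byGroup[1][0] . ' ' . $byGroup[2][0] . "\n"; // 2009-08-04 1:23PM
echo $byMatch[1][1] . ' ' . $byMatch[1][2] . "\n"; // 2009-08-05 11:05AM
```

If only the first date on the page matters, preg_match() returns a flat array and avoids the nesting altogether.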

#4
[eluser]Sen Hu[/eluser]
Instead of regexp, perhaps extracting words may help.

Using the wex (word extractor) command in biterscripting ( http://www.biterscripting.com ),

Code:
var str page ; cat "http://www.something.com/somepage.xyz" > $page
stex -c "^Date:^]" $page > null
# Next word is date.
wex "1" $page
# Next word is time.
wex "1" $page



Thought I should suggest an alternate solution.

Sen

#5
[eluser]Jay Logan[/eluser]
Thanks for all the help. This is what I ended up doing.

First, needed to get rid of all new line characters and any characters that create white space.

Code:
$html = file_get_html($this->input->post('url'));
$new_lines = array("\t","\n","\r");
$content = str_replace($new_lines, "", html_entity_decode($html));

Next I find the first instance of a certain pattern (code where the date and time is displayed).

Code:
preg_match("|<hr>(.*)<br>|U", $content, $raw_date);

Then I remove the text/code before the date & time that I don't need.

Code:
$format_date_front = str_replace('<hr>Date: ', '', $raw_date[0]);
$format_date_back = str_replace('<br>', '', $format_date_front);

And finally I separate the date and time to insert in separate database fields.

Code:
$date_time = explode(', ', $format_date_back);
$date = trim($date_time[0]);
$time = trim($date_time[1]);
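
For comparison, the last three steps can be folded into one function. This is only a sketch under assumptions: extract_post_date() is a hypothetical helper, not part of the code above, and the sample string stands in for the real page. It reads capture group 1 instead of stripping the tags afterwards, and uses \s* to tolerate the newline after <hr>. As in the code above, the time keeps its zone suffix.

```php
<?php
/**
 * Hypothetical helper: pull the date and time out of a
 * "<hr>Date: ..., ...<br>" fragment.
 * Returns array($date, $time), or null when the fragment is not found.
 */
function extract_post_date($content)
{
    // Group 1 holds only the text between "Date:" and <br>.
    if (!preg_match('|<hr>\s*Date:\s*(.*?)<br>|s', $content, $raw_date)) {
        return null;
    }
    // Split on the first comma only; trim() removes the padding spaces.
    $date_time = explode(',', $raw_date[1], 2);
    if (count($date_time) < 2) {
        return null;
    }
    return array(trim($date_time[0]), trim($date_time[1]));
}

// Sample input modelled on the snippet in post #1.
$sample = "<hr>\nDate: 2009-08-04,  1:23PM EDT<br>\n";
list($date, $time) = extract_post_date($sample);
echo $date, "\n"; // 2009-08-04
echo $time, "\n"; // 1:23PM EDT
```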


And the final code (with other stuff) ends up giving me a pretty awesome tool that lets me simply paste the link of a Craigslist post and insert data from the post into my database - creating a nice list of all my posts from several different accounts. I track hits, posting duration, clicks to inserted links, ad status, and other stuff. CI is the best.

#6
[eluser]jcavard[/eluser]
If you only remove the ^ and $ from your regex, it will match the date, and the full match ends up in $date[0]. If you want to match the 1:23PM as well, add this: ',\s+\d{1,2}:\d{2}\w{2}'. To get the date and the time as separate pieces, you have to capture them with (). Wrap each part in () and you will get them in $date[1] and $date[2]:
Code:
preg_match_all("/(\d{4}-\d{2}-\d{2}),\s+(\d{1,2}:\d{2}\w{2})/", $html_of_page, $date);
(\d{4}-\d{2}-\d{2}) = this will capture everything that fits the pattern 9999-99-99
(\d{1,2}:\d{2}\w{2}) = this will capture 9:99AA or 99:99AA

9 = digit
A = word character (here, the letters of AM/PM)

Also note, in my example:
\d is the same as [0-9]
\w represents a word character (letter, digit or underscore)
\s represents whitespace (space, tab, newline)
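
One caveat worth sketching: \w also matches digits and the underscore, so \w{2} is looser than it looks. The comparison below sets the pattern from this post against a stricter variant using [AP]M. The variable names and the malformed sample line are assumptions for illustration.

```php
<?php
$loose  = '/(\d{4}-\d{2}-\d{2}),\s+(\d{1,2}:\d{2}\w{2})/';
$strict = '/(\d{4}-\d{2}-\d{2}),\s+(\d{1,2}:\d{2}[AP]M)/';

$good = "Date: 2009-08-04,  1:23PM EDT<br>";
$odd  = "Date: 2009-08-04,  1:2345 EDT<br>"; // hypothetical malformed time

$looseGood = preg_match($loose, $good, $m);
$looseOdd  = preg_match($loose, $odd);   // \w{2} accepts the digits "45"
$strictOdd = preg_match($strict, $odd);  // [AP]M rejects them

echo $m[1], ' ', $m[2], "\n"; // 2009-08-04 1:23PM
echo $looseGood, ' ', $looseOdd, ' ', $strictOdd, "\n"; // 1 1 0
```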

This download is useful http://www.ultrapico.com/ExpressoDownload.htm

#7
[eluser]jcavard[/eluser]

Apparently I had my browser window open for a while, so I hadn't seen this reply. Well, you went the hard way ;)
If you need any help with the regex, I'll be around. I've had my share of hair pullin' with that.

#8
[eluser]Jay Logan[/eluser]
I absolutely will. Regex is the worst!! Lol.

