11-17-2009, 05:28 PM

[eluser]jimtomas[/eluser]
I'm creating a small library to use on my site. I'd like to import a small bit of information (with the proper shout out) from Wikipedia, based on the US state the visitor is looking to get a job in. I've cobbled together parsers from different sources to make a decent library, but I'm still having trouble filtering out odd bits of formatting in the Wikipedia markup. Can anyone recommend a good open-source parser for Wikipedia or point me in the right direction?

Thanks,
Jim

11-19-2009, 04:16 PM

[eluser]jimtomas[/eluser]
To answer my own question a little: I dove into the API and found a way to have Wikipedia parse its own markup. It's a bit bizarre and incredibly slow, but combined with caching, this could be my solution until a better (faster) parser turns up.

It goes something like this:

http://en.wikipedia.org/w/api.php?action...APAGE&text={{: WIKIPEDIAPAGE}}&format=xml&prop=text

For example:
http://en.wikipedia.org/w/api.php?action...&prop=text

I'll work this out and hopefully contribute some code for others to use.
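
Since the API call is slow, the caching idea mentioned above could be sketched roughly like this. This is only a minimal sketch, assuming a writable cache directory; `cached_fetch`, its parameters, and the one-day lifetime are hypothetical names and values for illustration, not part of CodeIgniter or of the Wikipedia API.

```php
<?php
// Hypothetical helper: cache the string result of a slow callable on disk.
// $cache_dir and $ttl are assumptions; adjust them for your own setup.
function cached_fetch($key, $producer, $cache_dir, $ttl = 86400)
{
    $file = rtrim($cache_dir, '/') . '/' . md5($key) . '.cache';

    // Serve the cached copy while it is younger than $ttl seconds
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    // Otherwise run the slow call and store any non-empty string result
    $value = call_user_func($producer, $key);
    if (is_string($value) && $value !== '') {
        file_put_contents($file, $value, LOCK_EX);
    }

    return $value;
}
```

The second argument can be any callable that takes the key and returns a string, so the slow Wikipedia request only runs when there is no fresh copy on disk.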

11-20-2009, 03:21 PM

[eluser]jimtomas[/eluser]
OK, here is the helper I created. Let me know if you have any questions, and feel free to build on it and make it better if you can.

Thanks!

Code:
<?php

// Return the text between $start_tag and $end_tag, or '' if either is missing
function strbetween($string, $start_tag, $end_tag)
{
    // Figure out where the specified start tag is
    $start_position = strpos($string, $start_tag);
    if ($start_position === false) {
        return '';
    }

    // Skip past the start tag so it is not part of the result
    $start_position += strlen($start_tag);

    // Figure out where the specified end tag is
    $end_position = strpos($string, $end_tag, $start_position);
    if ($end_position === false) {
        return '';
    }

    // Return the content in between the start and end tag
    return substr($string, $start_position, $end_position - $start_position);
}

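// Download $url and return the part from $start through $end (both included),
// or '' if the request fails or either marker is missing
function fetch($url, $start, $end)
{
    $page = file_get_contents($url);
    if ($page === false) {
        return '';
    }

    $start_position = strpos($page, $start);
    if ($start_position === false) {
        return '';
    }

    $end_position = strpos($page, $end, $start_position + strlen($start));
    if ($end_position === false) {
        return '';
    }

    return substr($page, $start_position, $end_position + strlen($end) - $start_position);
}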

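function grabWikipedia($pagetograb)
{
    // Ask the API to parse a transclusion of the page;
    // http_build_query() URL-encodes each value
    $earl = "http://en.wikipedia.org/w/api.php?" . http_build_query(array(
        'action' => 'parse',
        'title'  => ':' . $pagetograb,
        'text'   => '{{:' . $pagetograb . '}}',
        'format' => 'xml',
        'prop'   => 'text',
    ), '', '&');

    // Pull the <text> element out of the XML response
    $xml = fetch($earl, "<text xml:space=\"preserve\">", "</text>");

    // The HTML inside <text> is entity-escaped, so decode it first
    $xml = html_entity_decode($xml);

    // Keep only the first paragraph and drop its markup
    $xml = strbetween($xml, "<p>", "</p>");
    $xml = strip_tags($xml);

    // Remove bracketed leftovers such as footnote markers like [1]
    return preg_replace('/\[[^\]]*\]/', '', $xml);
}

?>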
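
To try the cleanup steps without hitting Wikipedia, something like the following could be run against a hard-coded snippet. This is only a sketch under assumptions: `first_paragraph_text` is a hypothetical name, the sample HTML is made up, and the input is assumed to be already-decoded HTML (unlike the escaped XML the API returns). It uses a regular expression for the first paragraph so that a `<p>` tag with attributes is still matched.

```php
<?php
// Hypothetical helper: reduce a chunk of HTML to its first paragraph as plain text.
function first_paragraph_text($html)
{
    // Grab the contents of the first <p>...</p>, allowing attributes on the tag
    if (!preg_match('/<p\b[^>]*>(.*?)<\/p>/s', $html, $m)) {
        return '';
    }

    // Drop markup first, then bracketed footnote markers such as [1]
    $text = strip_tags($m[1]);
    $text = preg_replace('/\[[^\]]*\]/', '', $text);

    // Decode entities last so a decoded "<" is never mistaken for a tag
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace left behind by removed markup
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample input for illustration
$sample = '<div><p>Ohio is a <a href="/wiki/U.S._state">state</a> in the Midwest.<sup>[1]</sup></p><p>More.</p></div>';
echo first_paragraph_text($sample), "\n"; // prints "Ohio is a state in the Midwest."
```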