Wikipedia Parsing Library
#1

[eluser]jimtomas[/eluser]
I'm creating a small library to use on my site. I'd like to import a small bit of information (with proper attribution) from Wikipedia, based on the US state the visitor is looking to get a job in. I've cobbled together parsers from different sources to make a decent library, but I'm still having trouble filtering out odd bits of formatting from Wikipedia's markup. Can anyone recommend a good open-source parser for Wikipedia or point me in the right direction?

Thanks,
Jim
#2

[eluser]jimtomas[/eluser]
To answer my own question a little: I dove into the API and found a way to have Wikipedia parse its own markup. It's a bit bizarre and incredibly slow, but combined with caching, this could be my solution until a better (faster) parser turns up.

It goes something like this:

http://en.wikipedia.org/w/api.php?action...APAGE&text={{: WIKIPEDIAPAGE}}&format=xml&prop=text

For example:
http://en.wikipedia.org/w/api.php?action...&prop=text

I'll work this out and hopefully contribute some code for others to use.
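
Since the API call is slow, the caching idea mentioned above could be sketched roughly like this. It is only a minimal file-based sketch under my own assumptions: the function name `cached`, the `wiki_` file prefix, and the use of the system temp directory are all hypothetical and not part of any library.

```php
<?php
// Hypothetical helper: keep the result of a slow call on disk for $ttl seconds.
// $producer is any callable that returns the string to cache.
function cached($key, $ttl, callable $producer, $cacheDir = null)
{
    // Assumption: the system temp directory is an acceptable cache location
    $cacheDir = $cacheDir ?? sys_get_temp_dir();
    $file = $cacheDir . DIRECTORY_SEPARATOR . 'wiki_' . md5($key) . '.cache';

    // Make sure PHP does not reuse stale file metadata
    clearstatcache(true, $file);

    // Serve the stored copy while it is still fresh
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    // Otherwise do the slow work and store the result for next time
    $value = $producer();
    if (is_string($value) && $value !== '') {
        file_put_contents($file, $value, LOCK_EX);
    }
    return $value;
}
```

A call might look like `cached('Texas', 86400, function () { return grabWikipedia('Texas'); })`, which would hit Wikipedia at most once a day per state. Empty results are deliberately not stored, so a failed request is retried on the next visit.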
#3

[eluser]jimtomas[/eluser]
OK, here is the helper I created. Let me know if you have any questions, and feel free to add to it and make it better if you can.

Thanks!

Code:
<?php
// Return the text between $start_tag and $end_tag, or '' if either is missing
function strbetween($string, $start_tag, $end_tag)
{
    // Figure out where the specified start tag is
    $start_position = strpos($string, $start_tag);
    if ($start_position === false) {
        return '';
    }

    // The content begins right after the start tag
    $start_position += strlen($start_tag);

    // Figure out where the specified end tag is
    $end_position = strpos($string, $end_tag, $start_position);
    if ($end_position === false) {
        return '';
    }

    // Return the content in between the start and end tag
    return substr($string, $start_position, $end_position - $start_position);
}

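// Download $url and return the part from $start through $end (inclusive), or '' on failure
function fetch($url, $start, $end)
{
    $page = file_get_contents($url);
    if ($page === false) {
        return '';
    }

    // Locate the first $start and the first $end that follows it
    $from = strpos($page, $start);
    $to   = ($from === false) ? false : strpos($page, $end, $from + strlen($start));
    if ($from === false || $to === false) {
        return '';
    }
    return substr($page, $from, $to + strlen($end) - $from);
}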


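function grabWikipedia($pagetograb)
{
    $pagetograb = urlencode($pagetograb);
    $earl = "http://en.wikipedia.org/w/api.php?action=parse&title=:".$pagetograb."&text={{:".$pagetograb."}}&format=xml&prop=text";

    // Pull the <text> element out of the API response
    $xml = fetch($earl, "<text xml:space=\"preserve\">", "</text>");
    // The HTML inside <text> arrives entity-escaped, so decode it first
    $xml = html_entity_decode($xml);
    // Keep only the first paragraph and drop its markup
    $xml = strbetween($xml, "<p>", "</p>");
    $xml = strip_tags($xml);
    // Remove bracketed leftovers such as footnote markers like [1]
    return preg_replace('/\[[^\]]*\]/', '', $xml);
}
?>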
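
To see what the cleanup steps in the helper do without hitting the network, here is a small offline sketch. The sample string is made up (it only imitates the escaped HTML the API returns), and `firstParagraphText` is a hypothetical name used just for this illustration.

```php
<?php
// Hypothetical sample imitating the entity-escaped HTML inside the API's <text> element
$sample = '&lt;p&gt;&lt;b&gt;Texas&lt;/b&gt; is a state in the '
        . '&lt;a href="/wiki/South"&gt;South&lt;/a&gt;.&lt;sup&gt;[1]&lt;/sup&gt;&lt;/p&gt;'
        . '&lt;p&gt;Second paragraph.&lt;/p&gt;';

// Decode, keep the first <p>...</p>, strip tags, and drop bracketed markers
function firstParagraphText($escapedHtml)
{
    // Turn &lt;p&gt; back into <p> and so on
    $html = html_entity_decode($escapedHtml);

    // Find the first paragraph; give up with '' if there is none
    $start = strpos($html, '<p>');
    $end   = ($start === false) ? false : strpos($html, '</p>', $start);
    if ($start === false || $end === false) {
        return '';
    }
    $paragraph = substr($html, $start, $end - $start);

    // Remove the remaining markup, then anything in square brackets
    $text = strip_tags($paragraph);
    return trim(preg_replace('/\[[^\]]*\]/', '', $text));
}

echo firstParagraphText($sample), "\n";
// prints: Texas is a state in the South.
```

Note that the bracket regex removes anything in square brackets, not just footnote numbers, so legitimate bracketed text in a summary would be lost too.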



