Identifying images referenced in HTML
#1

[eluser]Myles Wakeham[/eluser]
I'm sure that someone has done this, or knows of a script to do it, but I need something that can read through HTML code, looking for IMG references and identify the full image that is returned (ideally returning an array of all of the unique images referenced in the HTML code). Since I don't really want to re-invent the wheel here, does anyone know of any open source code floating around that does this?

M
#2

[eluser]TheFuzzy0ne[/eluser]
What do you mean by "identify"? Exactly what information are you wanting to glean from the source code? Just the src attribute?
#3

[eluser]Myles Wakeham[/eluser]
[quote author="TheFuzzy0ne" date="1234236388"]What do you mean by "identify"? Exactly what information are you wanting to glean from the source code? Just the href?[/quote]

No, actually the image file name. For example, if the HTML contains <IMG SRC="/images/test.jpg"> I want to get back the test.jpg part of it. If there are 20 images referenced, I need to get back an array of 20 images.

M
#4

[eluser]TheFuzzy0ne[/eluser]
Sorry, I meant to say src attribute. No idea where href came from...

What would happen if the script were to hit a src attribute that doesn't point directly at an image file, but rather at a server-side script with a query string? For example <img src="get_image.php?h467jdjk" />
#5

[eluser]Myles Wakeham[/eluser]
[quote author="TheFuzzy0ne" date="1234237823"]Sorry, I meant to say src attribute. No idea where href came from...

What would happen if the script were to hit a src attribute where it doesn't link to a direct image, but rather a server query string? For example <img src="get_image.php?h467jdjk" />[/quote]

Your question really just supports why I thought it wise to go to the list with this question, rather than try and carve out a solution myself. For my particular needs, all I need to know is the name of any image files that are resident on a page. If the page contains a script call, then I really don't need to know about it - just skip over it. I only really want to find all JPG, GIF, PNG, etc. file names, and only those that are specifically in IMG tags (i.e. not referenced in BODY tags, CSS, etc.).

M
#6

[eluser]SitesByJoe[/eluser]
It'd be much easier to get that info with javascript being that PHP doesn't know the DOM. Is that out of the question?

If not, you're gonna have some beefy regex-type work ahead of you.
#7

[eluser]Myles Wakeham[/eluser]
[quote author="SitesByJoe" date="1234249784"]It'd be much easier to get that info with javascript being that PHP doesn't know the DOM. Is that out of the question?

If not, you're gonna have some beefy regex-type work ahead of you.[/quote]

Thanks for the input, but unfortunately this isn't a matter of live web page scraping. I'm doing this based on a user pasting raw HTML code into a form var, then saving that var and processing it on the back-end with PHP. Although I agree with you that it would be easier to do this in JS, the problem is that the HTML isn't part of a DOM that I can work with. It's just raw text, so regex is a likely candidate for this.

Myles
#8

[eluser]TheFuzzy0ne[/eluser]
This function should be a good start. It's only been briefly tested. Please let me know if you want anything else added. Simply pass the function the HTML source for the page. If no images are found, then an empty array is returned (but this won't necessarily mean there aren't any on the page). Please be advised that some Web sites use JavaScript to display their images, and they do this so you can't scrape them. I'm relying on you to test the function out.

Code:
function get_img_names($html_source) {
        $arr = array();
        // Case-insensitive; src may appear anywhere in the tag, and the tag need not be self-closing.
        $pattern = '/<img\b[^>]*?\bsrc\s*=\s*[\'"]?([^\'"\s>]+)/i';

        preg_match_all($pattern, $html_source, $matches);

        foreach ($matches[1] as $match)
        {
                $filename = basename($match);
                if (in_array($filename, $arr)) { continue; } // skip names already collected
                $arr[] = $filename;
        }

        return $arr;
}

One more note. The script doesn't check what the file extension is, or whether the src attribute value is a server query string. I trust you can modify it a bit to do that.
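For what it's worth, the extension and query string checks mentioned above could be sketched roughly like this. This is only a minimal sketch: the function name filter_image_names and the default list of extensions are made up for illustration and are not part of the function posted above. It takes the raw src values, skips anything with a query string, and keeps only unique file names with an allowed extension.

```php
<?php
// Hypothetical helper (name and defaults are assumptions, not from the thread):
// keep only src values that look like static image files.
function filter_image_names(array $srcs, array $allowed = array('jpg', 'jpeg', 'gif', 'png'))
{
    $names = array();

    foreach ($srcs as $src)
    {
        // Skip script calls such as get_image.php?h467jdjk
        if (strpos($src, '?') !== false) { continue; }

        // Compare extensions case-insensitively (TEST.JPG counts as jpg)
        $ext = strtolower(pathinfo($src, PATHINFO_EXTENSION));
        if (!in_array($ext, $allowed, true)) { continue; }

        $names[] = basename($src);
    }

    // array_unique keeps the first occurrence; array_values reindexes from 0
    return array_values(array_unique($names));
}

print_r(filter_image_names(array('/images/test.jpg', 'get_image.php?h467jdjk', 'foo.JPG', '/other/test.jpg')));
```

Note that two different paths ending in the same file name are treated as one image here, since only the file name is compared.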
#9

[eluser]cwt137[/eluser]
[quote author="SitesByJoe" date="1234249784"]It'd be much easier to get that info with javascript being that PHP doesn't know the DOM. Is that out of the question?

If not, you're gonna have some beefy regex-type work ahead of you.[/quote]

What? PHP doesn't know DOM? http://us2.php.net/manual/en/book.dom.php

Below is an example I put together in a few minutes on how to get every file name contained in the src attribute of an img tag. I don't use PHP's DOM API much, so this solution might not be the best, but it gets the job done.

Code:
<?php

$string = <<<XML
<html>
    <head>
        <title>Test Page</title>
    </head>
    <body>
        <h1>Test For Images</h1>
        <hr />
        <img src="foo.jpg" />
        <div>
            <img src="/some_dir/bar.png" />
        </div>
    </body>
</html>
XML;

$doc = new DOMDocument;
$doc->loadHTML($string);
$img_tags = $doc->getElementsByTagName('img');

foreach ($img_tags as $img_tag) {
    echo $img_tag->attributes->getNamedItem("src")->nodeValue . "<br />\n";

}

?>
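To get back an array of unique file names (as originally asked) rather than echoing each src, the same DOM approach could be wrapped in a function along these lines. Treat it as a sketch under assumptions: the name get_img_names_dom is made up for illustration, and it does no extension filtering, so a script src such as get_image.php?h467jdjk would still come back as get_image.php.

```php
<?php
// Hypothetical wrapper (the name is an assumption, not from the thread):
// return the unique file names found in the src attributes of img tags.
function get_img_names_dom($html_source)
{
    $names = array();

    // loadHTML() complains about empty input, so bail out early
    if (trim($html_source) === '') { return $names; }

    $doc = new DOMDocument;

    // Pasted HTML is often malformed; keep libxml warnings out of the output
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html_source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img_tag) {
        // getAttribute() returns an empty string when there is no src
        $src = $img_tag->getAttribute('src');

        // Take only the path part, dropping any query string
        $path = parse_url($src, PHP_URL_PATH);
        if (!is_string($path) || $path === '') { continue; }

        $names[] = basename($path);
    }

    // Remove duplicates and reindex from 0
    return array_values(array_unique($names));
}

print_r(get_img_names_dom('<html><body><img src="foo.jpg" /><div><img src="/some_dir/bar.png" /></div></body></html>'));
```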
#10

[eluser]TheFuzzy0ne[/eluser]
Nicely done. I'll remember that next time before I spend 30 minutes dreaming up a regex.



