
Extract Image Links From The Webpage Using Python

So I wanted to get all of the pictures (of the NBA teams) on this page: http://www.cbssports.com/nba/draft/mock-draft. However, my code gives me a bit more than that.

Solution 1:

I know this can be "traumatic", but for those automatically generated pages, where you just want to grab the damn images and never come back, a quick-n-dirty regular expression that captures the desired pattern tends to be my choice (having no Beautiful Soup dependency is a great advantage):

import urllib, re

source = urllib.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read()

# every image name is an abbreviation composed of capital letters, so...
for link in re.findall('http://sports.cbsimg.net/images/nba/logos/30x30/[A-Z]*.png', source):
    print link
    # the line above just prints the link;
    # if you want to actually download, set the flag below to True
    actually_download = False
    if actually_download:
        filename = link.split('/')[-1]
        urllib.urlretrieve(link, filename)

Hope this helps!
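If you are on Python 3, the same quick-n-dirty idea carries over with urllib.request instead of urllib. The sketch below is a minimal adaptation, assuming the image URLs on that page still follow the same 30x30 logo pattern used in the answer above:

import re
import urllib.request

# Python 3 sketch of the same regex approach; the URL pattern is copied
# from the answer above and may no longer match the live page.
source = urllib.request.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read().decode('utf-8', errors='ignore')

for link in re.findall(r'http://sports\.cbsimg\.net/images/nba/logos/30x30/[A-Z]*\.png', source):
    print(link)
    filename = link.split('/')[-1]
    urllib.request.urlretrieve(link, filename)  # saves next to the script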

Solution 2:

To save all the images on http://www.cbssports.com/nba/draft/mock-draft,

import urllib2
import os
from BeautifulSoup import BeautifulSoup

URL = "http://www.cbssports.com/nba/draft/mock-draft"
default_dir = os.path.join(os.path.expanduser("~"), "Pictures")

opener = urllib2.build_opener()
urllib2.install_opener(opener)

soup = BeautifulSoup(urllib2.urlopen(URL).read())
# grab every <img> tag that has both an alt and a src attribute
imgs = soup.findAll("img", {"alt": True, "src": True})
for img in imgs:
    img_url = img["src"]
    # save each image under ~/Pictures, keeping its original file name
    filename = os.path.join(default_dir, img_url.split("/")[-1])
    img_data = opener.open(img_url)
    f = open(filename, "wb")
    f.write(img_data.read())
    f.close()
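The code above targets Python 2 (urllib2 and the old BeautifulSoup 3 import). A rough Python 3 equivalent, assuming the requests and beautifulsoup4 packages are installed, might look like the sketch below; urljoin is used so relative src values still resolve against the page URL:

import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

URL = "http://www.cbssports.com/nba/draft/mock-draft"
default_dir = os.path.join(os.path.expanduser("~"), "Pictures")

soup = BeautifulSoup(requests.get(URL).text, "html.parser")
for img in soup.find_all("img", alt=True, src=True):
    # resolve relative src attributes against the page URL
    img_url = urljoin(URL, img["src"])
    filename = os.path.join(default_dir, img_url.split("/")[-1])
    with open(filename, "wb") as f:
        f.write(requests.get(img_url).content)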

To save any particular image on http://www.cbssports.com/nba/draft/mock-draft, use

soup.find("img",{"src":"image_name_from_source"})

Solution 3:

You can use these functions to get a list of all the image URLs found at a given URL.

import re
import requests

##
# get_url_images_in_text()
#
# @param html - the HTML to extract image URLs from.
# @param protocol - the protocol of the website, prepended to URLs that do not start with a protocol.
#
# @return list of image URLs.
##
def get_url_images_in_text(html, protocol):
    urls = []
    all_urls = re.findall(r'((http\:|https\:)?\/\/[^"\' ]*?\.(png|jpg))', html, flags=re.IGNORECASE | re.MULTILINE | re.UNICODE)
    for url in all_urls:
        if not url[0].startswith("http"):
            urls.append(protocol + url[0])
        else:
            urls.append(url[0])

    return urls

##
# get_images_from_url()
#
# @param url - the URL to extract image URLs from.
#
# @return list of image URLs.
##
def get_images_from_url(url):
    protocol = url.split('/')[0]
    resp = requests.get(url)
    return get_url_images_in_text(resp.text, protocol)
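A quick usage sketch (the call below is illustrative and not part of the original answer):

# Example usage: print every image URL found on the mock-draft page.
if __name__ == "__main__":
    for image_url in get_images_from_url("http://www.cbssports.com/nba/draft/mock-draft"):
        print(image_url)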
