Extracting Images In Scrapy

I've read through a few other answers here but I'm missing something fundamental. I'm trying to extract the images from a website with a CrawlSpider. settings.py BOT_NAME = 'healt

Solution 1:

If you want to use the standard ImagesPipeline, you need to change your parse_items method to something like:

from urllib.parse import urljoin  # Python 3; on Python 2 use "from urlparse import urljoin"
...

    def parse_items(self, response):
        content = Selector(response=response).xpath('//body')
        for nodes in content:

            # build absolute URLs (the pipeline cannot download relative ones)
            img_urls = [urljoin(response.url, src)
                        for src in nodes.xpath('//img/@src').extract()]

            item = HealthycommItem()
            item['page_heading'] = nodes.xpath("//title").extract()
            item["page_title"] = nodes.xpath("//h1/text()").extract()
            item["page_link"] = response.url
            item["page_content"] = nodes.xpath('//div[@class="CategoryDescription"]').extract()

            # use "image_urls" instead of "image_url"
            item['image_urls'] = img_urls 

            yield item

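Your item definition also needs "images" and "image_urls" fields (plural, not singular). By default the pipeline reads the URLs to download from "image_urls" and writes the download results to "images".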

Alternatively, set IMAGES_URLS_FIELD and IMAGES_RESULT_FIELD in settings.py to the field names your item definition already uses.
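A minimal settings.py fragment for that second approach might look like this. The field names "image_url" and "image" are assumptions (the original item definition is not shown), as is the storage directory; the pipeline path is the one used by Scrapy 1.0 and later, so older releases may need a different module path.

```python
# settings.py (fragment)

# enable the built-in images pipeline (module path for Scrapy 1.0 and later)
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

# directory where downloaded images are saved (hypothetical path)
IMAGES_STORE = 'images'

# item field the pipeline reads the URL list from (assumed existing field name)
IMAGES_URLS_FIELD = 'image_url'
# item field the pipeline writes download results to (assumed existing field name)
IMAGES_RESULT_FIELD = 'image'
```

Even with a singular field name, the value stored in it still has to be a list of URLs.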
