
Scrapy Crawl Spider Only Scrape Certain Number Of Layers

Hi, I want to crawl all the pages of a website using the Scrapy CrawlSpider class (documentation here):

    from scrapy.spiders import CrawlSpider

    class MySpider(CrawlSpider):
        name = 'abc.com'
        allowed_domains = ['abc.com']

Solution 1:

  1. Set the DEPTH_LIMIT setting (see the combined sketch after this list):

    DEPTH_LIMIT

    Default: 0

    The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.

  2. No, you don't need to add an additional URL check. If you don't pass allow_domains to the Rule's link extractor, the spider will still only follow URLs on the abc.com domain, because requests outside allowed_domains are filtered out automatically.

  3. If you don't define rules, no URLs will be extracted at all (the spider will behave like a plain BaseSpider).
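Putting the three points together, here is a minimal sketch of how the spider could look. The start URL, the DEPTH_LIMIT value of 2, and the parse_item callback are illustrative assumptions, not taken from the question:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        name = 'abc.com'
        allowed_domains = ['abc.com']
        start_urls = ['http://www.abc.com/']  # assumed entry page

        # Point 1: stop following links more than 2 "layers" from the start URL.
        custom_settings = {'DEPTH_LIMIT': 2}

        # Points 2 and 3: a rules tuple is required for any links to be
        # followed, but allow_domains is not; offsite requests are dropped
        # automatically because allowed_domains is set above.
        rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # Placeholder extraction: record each visited page's URL and title.
            yield {'url': response.url, 'title': response.css('title::text').get()}

DEPTH_LIMIT could equally go in the project's settings.py; custom_settings is used here only to keep the example self-contained.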

Hope that helps.
