
Scrapy Crawl Spider Only Scrape Certain Number Of Layers

Hi, I want to crawl all the pages of a website using the Scrapy CrawlSpider class (documentation here):

    from scrapy.spiders import CrawlSpider

    class MySpider(CrawlSpider):
        name = 'abc.com'
        allowed_domains = ['abc.com']

Solution 1:

  1. Set the DEPTH_LIMIT setting (see the combined sketch after this list):

    DEPTH_LIMIT

    Default: 0

    The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.

  2. No, you don't need to add an additional URL check. If you don't pass allow_domains to the Rule's link extractor, the spider will still only follow URLs on the abc.com domain, because requests outside allowed_domains are filtered out automatically.

  3. If you don't define rules, no URLs will be extracted at all (the spider will behave like a plain BaseSpider).
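Putting the three points together, here is a minimal sketch of how the spider could look. The start URL, the DEPTH_LIMIT value of 2, and the parse_item callback are illustrative assumptions, not taken from the question:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        name = 'abc.com'
        allowed_domains = ['abc.com']
        start_urls = ['http://www.abc.com/']  # assumed entry page

        # Point 1: stop following links more than 2 "layers" from the start URL.
        custom_settings = {'DEPTH_LIMIT': 2}

        # Points 2 and 3: a rules tuple is required for any links to be
        # followed, but allow_domains is not; offsite requests are dropped
        # automatically because allowed_domains is set above.
        rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # Placeholder extraction: record each visited page's URL and title.
            yield {'url': response.url, 'title': response.css('title::text').get()}

DEPTH_LIMIT could equally go in the project's settings.py; custom_settings is used here only to keep the example self-contained.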

Hope that helps.
