Scrapy Crawl Spider Only Scrape Certain Number Of Layers
Hi, I want to crawl all the pages of a website using the Scrapy CrawlSpider class (documentation here). Is there a way to limit the crawl to only a certain number of layers (link depth) from the start page?

```python
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
```
Solution 1:
Set the DEPTH_LIMIT setting. From the Scrapy docs:

DEPTH_LIMIT
Default: 0
The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.
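As a minimal sketch of how this might look for the spider in the question: the start URL, the depth value of 2, and the parse_item callback below are illustrative assumptions; only the name and allowed_domains come from the question. DEPTH_LIMIT can also be set globally in settings.py or per run with `-s DEPTH_LIMIT=2`.

```python
# Sketch: a CrawlSpider limited to two layers of links via DEPTH_LIMIT.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com/']  # assumed start URL

    # Per-spider override; equivalently set DEPTH_LIMIT in settings.py
    # or pass -s DEPTH_LIMIT=2 on the command line.
    custom_settings = {'DEPTH_LIMIT': 2}

    rules = (
        # Follow every in-domain link and hand each page to parse_item.
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Placeholder extraction logic.
        yield {'url': response.url}
```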
No, you don't need to add an additional URL check:

- If you don't specify allow_domains at the Rule level, it will extract only URLs within the abc.com domain, because the spider-level allowed_domains filter already applies (see the sketch after this list).
- If you don't define any rules, it won't extract any URLs at all (it will behave like a BaseSpider).
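A minimal sketch of the first point, assuming the same spider as above (the rule and start URL are illustrative, not from the original answer):

```python
# Sketch: allow_domains on the Rule's LinkExtractor narrows extraction,
# but the spider-level allowed_domains filter applies either way; a
# CrawlSpider with no rules follows no links at all.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']  # offsite requests are filtered here
    start_urls = ['http://www.abc.com/']  # assumed start URL

    rules = (
        # allow_domains here is optional: without it, this rule still
        # only follows abc.com links because of allowed_domains above.
        Rule(LinkExtractor(allow_domains=['abc.com']), follow=True),
    )
```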
Hope that helps.