
How To Work With A Very Large "allowed_domains" Attribute In Scrapy?

The following is my Scrapy code:

def get_host_regex(self, spider):
    '''Override this method to implement a different offsite policy'''
    allowed_domains = getattr(spider, 'allowed_domains', None)
    ...
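For context, in the Scrapy versions this question targets, the rest of that stock method joins every entry of allowed_domains into a single alternation regex, roughly like this (an approximate sketch of the upstream code, not a verbatim copy):

import re

def get_host_regex(self, spider):
    '''Override this method to implement a different offsite policy'''
    allowed_domains = getattr(spider, 'allowed_domains', None)
    if not allowed_domains:
        return re.compile('')  # allow all if no list is set
    # one pattern with ~50,000 alternatives: huge to compile,
    # and slow to match against every request's hostname
    domains = [re.escape(d) for d in allowed_domains]
    return re.compile(r'^(.*\.)?(%s)$' % '|'.join(domains))

This is why a very large allowed_domains list is painful with the default middleware.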

Solution 1:

You can build your own OffsiteMiddleware variation with a different implementation for filtering out requests to domains not in the spider's allowed_domains.

For example, add this in a middlewares.py file:

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.utils.httpobj import urlparse_cached


class SimpleOffsiteMiddleware(OffsiteMiddleware):

    def spider_opened(self, spider):
        # don't build a regex, just use the list as-is
        self.allowed_hosts = getattr(spider, 'allowed_domains', [])
        self.domains_seen = set()

    def should_follow(self, request, spider):
        if not self.allowed_hosts:
            return True
        host = urlparse_cached(request).hostname or ''
        # accept the domain itself or any of its subdomains:
        # 'www.example.com' and 'example.com' pass for 'example.com',
        # but 'notexample.com' does not (a bare endswith check
        # would wrongly accept it)
        return any(host == h or host.endswith('.' + h)
                   for h in self.allowed_hosts)

then change your settings to disable the default OffsiteMiddleware and add yours:

SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    'myproject.middlewares.SimpleOffsiteMiddleware': 500,
}

Warning: this middleware is not tested. It is a very naive and inefficient implementation: it runs a string suffix test against each of the ~50,000 allowed domains for every single request. You could store the list in a backend better suited to membership tests and query it with the hostname, sqlite for example.
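One cheap alternative along those lines (my own untested sketch, not part of the original answer; the class name is made up) is to keep the allowed domains in a set and probe only the dotted suffixes of each request's hostname, so the per-request cost depends on the number of labels in the hostname rather than on the 50,000 allowed domains:

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.utils.httpobj import urlparse_cached


class SetLookupOffsiteMiddleware(OffsiteMiddleware):

    def spider_opened(self, spider):
        # a set gives O(1) membership tests, so the check no longer
        # scales with the number of allowed domains
        self.allowed_hosts = set(getattr(spider, 'allowed_domains', []))
        self.domains_seen = set()

    def should_follow(self, request, spider):
        if not self.allowed_hosts:
            return True
        host = urlparse_cached(request).hostname or ''
        # probe every dotted suffix of the hostname:
        # for 'a.b.example.com' try 'a.b.example.com',
        # 'b.example.com', 'example.com', 'com' --
        # only a handful of lookups per request
        parts = host.split('.')
        return any('.'.join(parts[i:]) in self.allowed_hosts
                   for i in range(len(parts)))

It registers in SPIDER_MIDDLEWARES exactly like SimpleOffsiteMiddleware above.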

