Using Threads Within A Scrapy Spider
Solution 1:
The marked answer is not 100% correct.
Scrapy runs on Twisted, and Twisted supports returning a Deferred from the pipeline's process_item method.
This means you can create a Deferred in the pipeline, for example with threads.deferToThread, which runs your CPU-bound code in the reactor thread pool. Be careful to make correct use of callFromThread where appropriate. I use a semaphore to avoid exhausting the thread pool, but setting good values for the settings mentioned below might also work.
http://twistedmatrix.com/documents/13.2.0/core/howto/threading.html
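As a minimal sketch of the pattern that howto describes (the worker function and its data are stand-ins of my own, not part of Scrapy or Twisted):

from twisted.internet import reactor, threads

def log_progress(msg):
    # Runs on the reactor thread -- safe to touch reactor-side state here.
    print(msg)

def blocking_work(data):
    # Runs in a reactor thread-pool thread. Never call reactor APIs
    # directly from here; hand work back via callFromThread instead.
    result = sum(x * x for x in data)  # stand-in for CPU-bound work
    reactor.callFromThread(log_progress, "done with %r" % (data,))
    return result

d = threads.deferToThread(blocking_work, [1, 2, 3])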
Here is a method from one of my item pipelines:
from twisted.internet import threads
from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    def handle_error(failure):
        raise DropItem("error processing %s" % item)

    # self.sem is a semaphore that limits concurrent thread-pool jobs.
    d = self.sem.run(threads.deferToThread, self.do_cpu_intense_work, item)
    d.addCallback(lambda _: item)
    d.addErrback(handle_error)
    return d
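For context, self.sem above is the semaphore mentioned earlier. A minimal sketch of how the pipeline might be wired up, using Twisted's DeferredSemaphore (the class name and the limit of 4 are arbitrary examples, not from the original answer):

from twisted.internet import defer

class CpuIntensePipeline(object):  # hypothetical pipeline class
    def __init__(self):
        # DeferredSemaphore caps how many deferToThread calls run at
        # once, so the reactor thread pool is never fully drained.
        self.sem = defer.DeferredSemaphore(4)

    def do_cpu_intense_work(self, item):
        # Placeholder for the blocking / CPU-bound work; this runs in
        # a thread-pool thread via deferToThread above.
        return item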
You may want to keep an eye on two settings:

REACTOR_THREADPOOL_MAXSIZE, as described here: http://doc.scrapy.org/en/latest/topics/settings.html#reactor-threadpool-maxsize
CONCURRENT_ITEMS, as described here: http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-items
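As an illustration only (the numbers below are arbitrary examples, not recommendations), both settings go in your project's settings.py:

# settings.py -- example values; tune them for your workload
REACTOR_THREADPOOL_MAXSIZE = 20   # size of the Twisted reactor thread pool
CONCURRENT_ITEMS = 100            # max items processed in parallel per response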
You are still subject to the Python GIL, though, which means CPU-intensive tasks will not truly run in parallel on multiple CPUs; they will only appear to. The GIL is released only for I/O. You can, however, use this method to call an I/O-blocking third-party library (e.g. web-service calls) inside your item pipeline without blocking the reactor thread.
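For example, a blocking HTTP client such as requests could be pushed into the thread pool the same way (the enrichment URL below is hypothetical):

from twisted.internet import threads
import requests  # a blocking third-party HTTP library

def process_item(self, item, spider):
    # requests releases the GIL while waiting on the socket, so this
    # call runs usefully in a thread-pool thread and the reactor
    # thread stays free.
    d = threads.deferToThread(
        requests.post,
        "https://api.example.com/enrich",  # hypothetical web service
        json=dict(item),
    )
    d.addCallback(lambda response: item)
    return d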
Solution 2:
Scrapy itself is single-threaded, so you cannot use multiple threads within a spider. You can, however, run multiple spiders at the same time and tune CONCURRENT_REQUESTS, which may help you (see Common Practices in the Scrapy docs).
Scrapy does not use multithreading because it is built on Twisted, which is an asynchronous networking framework.
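As a sketch of what the Common Practices page shows for running several spiders in one process (MySpider1 and MySpider2 are placeholder spider classes):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(MySpider1)   # placeholder spider classes
process.crawl(MySpider2)
process.start()            # blocks until all crawls finish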