Slow Django Database Operations On Large (ish) Dataset.
I set up a system to filter the Twitter real-time stream sample. Obviously, the database writes are too slow to keep up with anything more complex than a couple of low-volume keyw
Solution 1:
I eventually managed to cobble together an answer from some redditors and a couple of other things.
Fundamentally, though, I was doing a double lookup on the id_str field, which wasn't indexed. I added an index (db_index=True) to that field for both read_tweet and read_user, and moved read_tweet to a try/except around Tweet.objects.create, falling back to get_or_create if there's a problem. That gave a 50-60x speed improvement, and the workers now scale: if I add 10 workers, I get 10x speed.
I currently have one worker that's happily processing 6 or so tweets a second. Next up I'll add a monitoring daemon to check the queue size and add extra workers if it's still increasing.
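A minimal sketch of the check such a monitoring daemon might run, assuming it samples the queue size at a fixed interval. The function name, the window parameter, and the "strictly increasing" rule are all hypothetical choices for illustration; how the samples are collected and how a worker is actually started depends on the queue and process manager in use.

```python
def queue_is_growing(samples, window=3):
    """Return True if the last `window` queue-size samples strictly increase.

    `samples` is a list of queue sizes, oldest first.
    """
    recent = samples[-window:]
    if len(recent) < window:
        # Not enough history yet to call it a trend.
        return False
    return all(a < b for a, b in zip(recent, recent[1:]))


# Example decisions for a few hypothetical sample histories.
growing = queue_is_growing([10, 20, 35])
draining = queue_is_growing([10, 20, 15])
too_short = queue_is_growing([5, 6])
```

Requiring several consecutive increases, rather than reacting to a single reading, avoids adding workers on a brief spike.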
tl;dr - REMEMBER INDEXING!