Slow Django Database Operations On Large (ish) Dataset.

I set up a system to filter the Twitter real-time stream sample. Obviously, the database writes are too slow to keep up with anything more complex than a couple of low-volume keyw

Solution 1:

I eventually managed to cobble together an answer from some redditors' suggestions and a couple of other sources.

Fundamentally, though, I was doing a double lookup on the id_str field, which wasn't indexed. I added an index (db_index=True) to that field for both read_tweet and read_user, and moved read_tweet to a try/except around Tweet.objects.create, falling back to get_or_create if there's a problem. That gave a 50-60x speed improvement, and the workers now scale: if I add 10 workers, I get 10x the speed.

I currently have one worker that's happily processing 6 or so tweets a second. Next up I'll add a monitoring daemon to check the queue size and add extra workers if it's still increasing.
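The "is the queue still increasing" check could be sketched roughly like this. It is only an illustration of the idea: QueueGrowthMonitor, record and is_growing are hypothetical names, and how the queue size is read and how workers are started depends on the queue backend, which is left out here.

```python
from collections import deque


class QueueGrowthMonitor:
    """Keeps the last few queue-size samples and reports sustained growth."""

    def __init__(self, window=5):
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)

    def record(self, queue_size):
        self.samples.append(queue_size)

    def is_growing(self):
        # Wait for a full window so a single spike does not trigger scaling.
        if len(self.samples) < self.samples.maxlen:
            return False
        sizes = list(self.samples)
        # Growing means every sample is larger than the one before it.
        return all(later > earlier for earlier, later in zip(sizes, sizes[1:]))


monitor = QueueGrowthMonitor(window=3)
for size in (10, 25, 40):
    monitor.record(size)
growing_now = monitor.is_growing()  # the daemon would add a worker if this is True
```

Requiring a strictly increasing full window is a deliberately conservative choice. A real daemon might prefer a threshold or a moving average.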

tl;dr - REMEMBER INDEXING!
