« Terracotta working on top of Coherence | Main | Answering a reader's question: When to offload the DB with Terracotta »
April 16, 2009
Know your use case and optimize accordingly
posted by ari
Ok, this one WAS going to be short. But here it is anyways.
<soap-box>
Let's not get caught up in technology for technology's sake
I was reading about Twitter-this and Twitter-that for the past few weeks. People keep writing about its "architecture" which is really a way of writing about how they would solve the problem had they been a web giant with a killer idea, like Twitter. Again, they are not, so I have to hearken back to Werner Voegel's tweet last week that said something like...
those who have built massively scaled architectures do not comment on the challenges of those in the middle of building such things
So, lots of armchair quarterbacking going on here. But it has reached the point of absurdity because now everyone seems to be pulling in their favorite technology du jour. "I can do it this way with Scala." "No no. Hadoop, you idiot!" "Erlang would save them!" Don't get me wrong, I am not about to dig at these technologies nor am I about to claim Terracotta is better or competes or that Terracotta should be used for Twitter.
I read a post yesterday where the author asserts that on every visit to twitter.com in order to view the tweets awaiting you from friends and general twitter community, you should do a MapReduce operation over a grid of twitter users' tweet data, looking for the ones that match your personal subscription list. Yes, folks. That's 1 MapReduce of the entire tweetset per pageview. And, by the way, when a tweet occurs it can be super fast because it just as to write my tweet to my personal tweet bucket. Others will get my tweet when they come and check for it.
So, let's break this down. O(1) for the write operation. And that constant is very small, and efficient. Good. But wait. for n twitter readers, I need to look at n-1 twitter users' accounts for tweets about which I care. That's O(n(n-1)) which is O(n^2). ICK! I have millions of people viewing twitter an hour and that's an O(N^2) MapReduce (because MapReduce r3w1z!) yet I have hundreds of tweets per second which is O(1) because why? I don't know.
I think you would want to optimize the read path, not the write path. Thus, instead of a bulletin board pattern, you want to use a mailbox pattern (and something like Scala) to send tweets to each individual user's mailbox for viewing whenever he or she returns.
By using the mailbox pattern, we are trading off space for time. Let's break it down again. A tweet should go to all the interested parties who are following me. That would be my 10 - 1000 friends plus the general twitter mailbox that all can see. So the write would be O(n) for n friends (ignoring optimizations like the listener pattern which could arguably make this O(1) as well). My message, in the worst case, will visit every inbox in the cluster. But the read is now simple. Just open my inbox and display all the tweets that have arrived since last check; O(1).
There. Now the operation I do millions of times an hour through the Twitter API and direct through the web interface work and the writes aren't that slow either.
By the way, you will notice that the read-optimized solution we just built is pub/sub essentially so I can create a massive grid of point-to-point communications sockets to send messages between individuals and I can push instead of pull updates. By getting a push-model going, I can easily send the SMS update, send the AIR-clients their update, and do a Comet-style conversation with browser-based users to push the updates. This push probably helps eliminate 50% plus of my traffic by keeping the manual polling to a minimum. A MapReduce-based solution must be invoked by the user...it shouldn't run on a regular schedule in my opinion.
UPDATE: If you run the MapReduce on behalf of all twitter users, I just realized you have an O(n^n) Algorithm. Worst I have EVER seen. Think about it...n users each visiting n-1 other users tweet buckets looking for tweets. Yes. On a regular schedule, you will attempt an n^n data operation.
Thus, meta-refresh for browsers and polling-based architectures would flood my system with more and more inefficient traffic.
In sum, know your use case. Optimize for the read path versus the write path. Optimize for push versus pull. Don't just use MapReduce or Scala or Erlang or even Terracotta because you can. Use us because we solve the business problem top-down the most appropriate way.
</soap-box>
There is no silver bullet,
--Ari
Trackback Pings
TrackBack URL for this entry:
http://blog.terracottatech.com/cgi-bin/mt/mt-tb.cgi/87
Comments
Hi Ari,
very interesting post!
Just a question: I don't get why the map-reduce complexity is in the end O(n^n).
Am I missing something?
May you give a more detailed, even short, explanation?
Thanks!
Sergio B.
Posted by: Sergio Bossa at April 17, 2009 1:32 AM
Hey Ari, nice post!
I think your analysis is about right, though I did not read it *very* carefully. Yes, read optimisation and push is good design, especially since there are a lot more folks hitting page refresh on the twitter web site, than there are actually tweeting.
BUT... here's the problem. The Twitter app isn't exactly a pubsub/push/messaging system. Here's why: go to your Twitter page on their web site. Pick any new person to follow, who has tweeted recently. After you follow them, you will not only see anything they subsequently tweet, but you will also see their old tweets.
* Seeing the old tweets is a 'pull' operation and only possible if the HTML is generated from an archive (or cache or both).
* If you only saw 'new' tweets created after the follow (subscribe) operation, then you could have a pure pubsub model.
alexis
[ For possibly obvious reasons we thought about this a little bit over at RabbitMQ. See eg http://www.rabbitmq.com/resources/BayFP_RabbitMQ_talk_20090408.pdf and http://www.veodia.com/player.php?vid=ddiakWIOfFU ]
Posted by: alexis at April 17, 2009 2:24 AM
Sergio,
You might be right. I am not sure what I was thinking. To do the on-demand operation seems like O(n) to me now and the batch update is O(n^2). Hmm.
Let's reason it out. For my @replies, I need to check _every_ other mailbox in the system looking at all their historical tweets. for n twitter accts and m tweets each, that's n * m where m
So, you are right.
--Ari
Posted by: ARI ZILKA at April 18, 2009 9:57 AM
Alexis,
Totally see your point. Moreover, when I was contemplating this architecture I was thinking Rabbit + Terracotta.
We have several customers in production using the 2 together--in part because we tell them to :)
Good stuff!
--Ari
Posted by: ARI ZILKA at April 18, 2009 7:18 PM
Tell me more!
Posted by: alexis at April 19, 2009 2:03 AM
Message deletion is another Twitter peculiarity, orthogonal to delivery semantics.
Posted by: alexis at April 19, 2009 11:56 AM