« March 2009 | Main | May 2009 »

April 2009 Archives

April 1, 2009

Terracotta working on top of Coherence

After quite a lot of work I got it to function. The thing that might shock you is there is not a lot of value in this config. It doesn't do anything special. It doesn't do anything new. It doesn't make anything better.

Why?

Because all I did is rename the host I was running the Terracotta Server on to "Coherence." There. Terracotta running on Coherence.

April Fool's :)

--Ari

April 16, 2009

Know your use case and optimize accordingly

Ok, this one WAS going to be short. But here it is anyways.

<soap-box>
Let's not get caught up in technology for technology's sake

I was reading about Twitter-this and Twitter-that for the past few weeks. People keep writing about its "architecture" which is really a way of writing about how they would solve the problem had they been a web giant with a killer idea, like Twitter. Again, they are not, so I have to hearken back to Werner Voegel's tweet last week that said something like...


those who have built massively scaled architectures do not comment on the challenges of those in the middle of building such things

So, lots of armchair quarterbacking going on here. But it has reached the point of absurdity because now everyone seems to be pulling in their favorite technology du jour. "I can do it this way with Scala." "No no. Hadoop, you idiot!" "Erlang would save them!" Don't get me wrong, I am not about to dig at these technologies nor am I about to claim Terracotta is better or competes or that Terracotta should be used for Twitter.

I read a post yesterday where the author asserts that on every visit to twitter.com in order to view the tweets awaiting you from friends and general twitter community, you should do a MapReduce operation over a grid of twitter users' tweet data, looking for the ones that match your personal subscription list. Yes, folks. That's 1 MapReduce of the entire tweetset per pageview. And, by the way, when a tweet occurs it can be super fast because it just as to write my tweet to my personal tweet bucket. Others will get my tweet when they come and check for it.

So, let's break this down. O(1) for the write operation. And that constant is very small, and efficient. Good. But wait. for n twitter readers, I need to look at n-1 twitter users' accounts for tweets about which I care. That's O(n(n-1)) which is O(n^2). ICK! I have millions of people viewing twitter an hour and that's an O(N^2) MapReduce (because MapReduce r3w1z!) yet I have hundreds of tweets per second which is O(1) because why? I don't know.

I think you would want to optimize the read path, not the write path. Thus, instead of a bulletin board pattern, you want to use a mailbox pattern (and something like Scala) to send tweets to each individual user's mailbox for viewing whenever he or she returns.

By using the mailbox pattern, we are trading off space for time. Let's break it down again. A tweet should go to all the interested parties who are following me. That would be my 10 - 1000 friends plus the general twitter mailbox that all can see. So the write would be O(n) for n friends (ignoring optimizations like the listener pattern which could arguably make this O(1) as well). My message, in the worst case, will visit every inbox in the cluster. But the read is now simple. Just open my inbox and display all the tweets that have arrived since last check; O(1).

There. Now the operation I do millions of times an hour through the Twitter API and direct through the web interface work and the writes aren't that slow either.

By the way, you will notice that the read-optimized solution we just built is pub/sub essentially so I can create a massive grid of point-to-point communications sockets to send messages between individuals and I can push instead of pull updates. By getting a push-model going, I can easily send the SMS update, send the AIR-clients their update, and do a Comet-style conversation with browser-based users to push the updates. This push probably helps eliminate 50% plus of my traffic by keeping the manual polling to a minimum. A MapReduce-based solution must be invoked by the user...it shouldn't run on a regular schedule in my opinion.

UPDATE: If you run the MapReduce on behalf of all twitter users, I just realized you have an O(n^n) Algorithm. Worst I have EVER seen. Think about it...n users each visiting n-1 other users tweet buckets looking for tweets. Yes. On a regular schedule, you will attempt an n^n data operation.

Thus, meta-refresh for browsers and polling-based architectures would flood my system with more and more inefficient traffic.

In sum, know your use case. Optimize for the read path versus the write path. Optimize for push versus pull. Don't just use MapReduce or Scala or Erlang or even Terracotta because you can. Use us because we solve the business problem top-down the most appropriate way.
</soap-box>

There is no silver bullet,

--Ari

April 18, 2009

Answering a reader's question: When to offload the DB with Terracotta

I have been remiss in not answering a reader's very intelligent question. Here's the question.

Write-behind would be wonderful for performance, but how do you handle db transaction errors? What if someone places a bet (in your real-life example), and the bet fails to commit? How do we recover from that? Was the user originally told (when the bet was marked dirty) that the bet was successfully placed (and committed)? Or is the user told that the bet is “posted” or something else that implies that the bet is not truly committed yet?

If the application considers the objected committed once it is marked dirty (and/or placed on the write-behind queue), then I assume that there could be potential consistency issues with write-behind, where one box has the bet placed (on the queue, but not in the db), but the other boxes do not see the bet placed?

And even on the box that placed the bet, if the db is queried with a projection that includes the bet (rather than the Bet domain object; eg. Some report of all bets, joined in with the user and their account status), then I assume the query will return inconsistent results (ie. The report of bets will not include the bet on the write-behind queue)?

Not that any of this is an argument against write-behind; I just want to understand the use cases where this can be applied safely.

Thank you.

Now to attempt an answer.

Write-behind and detaching are not a panacea. I mean that it is true that this pattern cannot always be used. But at the same time it does not suffer from the issues you raise, at least not in this use case.

Always start from the top down as we would in this online betting use case. What is the user experience we want and why? Make design decisions from there. In the questions you ask, I sense a desire to keep the database as the master of all data which is a bottom-up approach. Let's cover the pitfalls of a bottom-up or database-up approach first.

As an example, I was on a panel with Brian Goetz and several "grid" vendors in 2008 at QCon San Francisco. A similar question to yours was asked--"Can I really always build asynchronous apps?" The answer from all the grid vendors was a thundering, "Yes!" They need you to believe you can. My answer was, "No. You can't always afford it. Asynchronous nature might be hard to add given your business requirements." Brian got much more concrete and asked the audience to think about EBay and Amazon. Amazon can be simplified down to an e-catalog site. This is not to take away from their monumental achievements in scale and reliability. I only mean to say that they have a massive need for caching where people buy maybe 5 - 10% of the time so 90% of visits are readers. EBay has a massive write rate and those writes are highly localized to auctions ending in the next hour or minute.

EBay and Amazon need to make almost entirely inverse architecture decisions yet both are called eCommerce sites. Both have a catalog. Both have a user account management function. And, both have payment clearing capabilities. Amazon also has fulfillment capabilities and warehousing concerns that EBay does not but that is beside the point I am making.

Amazon might tell you that Sleepycat or voldemort or memcache rule (they only use Sleepycat, BTW. Voldemort is an OSS clone of their Dynamo approach). EBay would tell you that partitioned Oracle works great. The two are both correct. EBay needed transactional ACID updates in many of their pageviews whereas Amazon needed a different optimization (read Amazon's Dynamo paper for their approach to eventually consistent data storage).

Back to our topic. In the case of betting, you have:


  1. a catalog of games / matches / things on which to place bets

  2. User account info

  3. payment clearing system

  4. placed bets which are like items you have sold that you have yet to fulfill / deliver


and more.

We can surmise together that when I place a bet and I get an email or web-based confirmation saying my bet has been placed for, let's say, $100, at 2:1 odds in the affirmative (meaning I think the results will be in favor of the bet direction), I fully expect to get paid $200 after the game if I win and to have my $100 debited if I lose. If I win and I do not get the $200, there will be no excuses the bookmaker can make that will have me come back. A bookmaker who does not know the commitments he has made will soon be out of business.

So as an architect, I start to think about coherence and transaction isolation, and databases sound really good. But are they necessary? No. What I really want is:


  1. all my commitments on disk so I never risk losing anything

  2. make that on more than 1 disk in case I even lose my disks

  3. coherent recording of the odds and the amounts of the bet

  4. a transactional view of my users' account balances so I don't extend credit when I do not intend to.

  5. a fast, efficient way to search for, display matches and also a coherent way to update odds as my staff learn new info (say the quarterback breaks his leg).


Mapping Terracotta to this business requirement I would suggest:

  1. well, Terracotta is disk-based just like the DB so I can in theory use it in more places than I would use a cache

  2. Yes, Terracotta server array will have at least 2 copies or more if I configure properly, on 2 or more separate _machines_. That's lots of redundancy, more than the DB in fact.

  3. Terracotta operations have isolation levels just like a DB. There are read locks, write locks, synch-write locks and concurrent locks. I would likely use a combination of these to record a bet that is being placed. Specifically, read lock the catalog to bet against the currently-available odds, and write-lock the user's account or my list of placed bets or both to debit the account and write the placed bet.

  4. I can use readwritelocks in util.concurrent to compose a transaction across a debit + credit operation against a user's account. But, I want the accounts recorded in the DB and cannot use Terracotta in a 2PC manner, so I might choose to cache the account in Terracotta but keep it in the DB, using instead a 1.5PC (I covered 1.5PC in the original blog and will not cover it again here).

  5. Perhaps put the catalog of games in the DB, and cache the catalog in Terracotta and when my back office content management tools change the DB, also send a JMS message to a node in my app cluster to update the in-memory cache as well, or I could just make my back-office updater a part of my production app cache in Terracotta.

In summary, cache all the read only data in Terracotta to offload the DB of its otherwise wasted usage under web-onlookers. Put user financial info in the DB and manage account balances there, using DB transactions to update. Then put the bets into Terracotta, lazily flushing to the DB so that a back office SQL script can run payouts and collections at the end of each match, against people's accounts. This will offload not just read-only onlookers but actual updates and business transactions as well, at least for a time. At Terracotta we call this "shaving the peak load" where the DB does not have to be sized large enough to handle traffic spikes.

To make this all work, we need a few more details though. We have to make sure that we write an "end of match" eventing system that makes sure to flush all placed bets to the DB so that we can clear payments accurately at the end of a match from entirely within the DB. Could you put accounting in Terracotta? Yes, but all your back office tools would have to be rewritten to work against Java web services of some sort instead of against the DB they most likely work against today.

As for a stable view of bets and odds that are constantly being changed by the back office team members who are trying to optimize company revenue and there are a few challenges. First, the odds. Easiest way to update them in a transaction with the DB is to have the back office app that changes the odds, work within a Terracotta transaction.

Transactional%20writer.png

Now, the odds will be updated or fail and the business will know what risk the failure poses. Our employees will always get clear messages from the Office updater app that they successfully changed the odds for a match or that the change failed. They can even choose to stop all betting if something goes horribly wrong. But they will always know the status of the games and odds.

Now, let's secure our betting model. Bets are being placed into Terracotta and onto 2 disks so they are safe. And the odds at which they are placed were valid odds at the time given we made a good back office tool for updating odds. We just need to make sure all bets are flushed to the DB before the DB executes its payout process. Should be simple. When a match begins, flip the DB state to "INCOMPLETE" for that match in some table. the payout script will refuse to run payouts against incomplete matches. Now, have the app cluster flip the match to complete using a timer task of some sort. Put the COMPLETION marker in the asynch write-behind queue with bets. When the app tier decides to COMPLETE the match, all subsequent bet attempts will be rejected. And a COMPLETION marker will be placed in the queue that will flip the DB state. But it is placed in sequence or in order with all the bets. So all you need is an in-order asynch write behind and bets can be placed asynch with the DB commit. Let's take a look at in-order queuing in a picture. It is a kewl technique that can help offload the db with fewer worries.

inorder%20queuing.png

We have 2 recipes. First, adding Terracotta and 1.5PC updates to the DB-based application. Second, using write behind queuing and sometimes guaranteeing the queue's processing order. There are more but I just want to give you a taste of the sort of thinking you must embark upon when offloading the DB using Terracotta, in a by-hand manner.

As you can see, these recipes won't apply in all use cases. And now I think we close in on an answer to your question. Bets are not shared data amongst multiple users. Bets are per-user, you see? And thus bets can be asynchronously written back to the DB and we can even build db tasks that should only run when the asynch queue is drained and have no polling. If bets were in fact as you suggest and predicated on what bets others have placed we couldn't do asynch write behind. We would instead have to do write through. If we were not a bookmaker, by the way, and instead a betting platform, then the pattern would shift from asynch write behind to matching engine. Matching engines can be built in Terracotta that are 100X faster than Oracle RDBMS, but that's a different blog entry. Nonetheless, I would still not need to write through to a DB nor do I have to give up on offloading the db in such use cases where I share data amongst user threads. I would just need a different pattern. In fact, I hear from many customers that they have moved highly contended multi-user write operations to Terracotta from the DB because the DB lock manager deadlocks regularly and Terracotta can handle much much higher write rates (check out the quote from Guy Moller, CTO of Brands4Friends on our terracottatech.com homepage--that is one such use case).


If I were to give you 1 rule it is this: use the freeze/thaw method. If you are using an ORMapper and thawing data out of a DB record back into memory every time you use that data in the app, consider keeping that data thawed in Terracotta. If the thaw occurs, but the data is only thawed once a day, or once a week or once in a month or once in a year, you could leave it frozen. Thawed data format is good for data being accessed once a second or once a minute or once an hour. Once you leave data in thawed form, you eventually have to freeze it back if your database is supporting reporting or other back office operations that your app cannot support on its own. Freezing data that has been thawed for a long time is a write-behind pattern.

Hope this helps. If not, consider just using Hibernate, Spring, and Terracotta together as in our Examinator reference app. You can then plug Terracotta in as a 2nd level cache providor and offload the DB as much as possible w/o major app surgery.

Cheers,

--Ari

About April 2009

This page contains all entries posted to POJO Mojo in April 2009. They are listed from oldest to newest.

March 2009 is the previous archive.

May 2009 is the next archive.

Many more can be found on the main index page or by looking through the archives.

Powered by
Movable Type 3.34