May 1, 2009

Sun's Wacky JavaOne idea...which I actually like

posted by ari

Sun is offering folks the opportunity to have a nice catered lunch with Alex Miller and me. I was quite flattered when they approached with the concept. And now I am just excited and eagerly anticipating the opportunity to meet 8 Java developers and discuss in detail what you guys are working on over a nice lunch, all on Sun.

To sign up, you have to pay for your JavaOne badge through an EBay auction. (Don't ask me why.) But here's the URL:
http://shop.ebay.com/merchant/javaone2009_W0QQ_nkwZQQ_armrsZ1QQ_fromZQQ_mdoZ

Expect other speakers to show up there soon as well.

Cheers!

--ARi

Comments (0) | TrackBack (0) | Permalink

April 18, 2009

Answering a reader's question: When to offload the DB with Terracotta

posted by ari

I have been remiss in not answering a reader's very intelligent question. Here's the question.

Write-behind would be wonderful for performance, but how do you handle db transaction errors? What if someone places a bet (in your real-life example), and the bet fails to commit? How do we recover from that? Was the user originally told (when the bet was marked dirty) that the bet was successfully placed (and committed)? Or is the user told that the bet is “posted” or something else that implies that the bet is not truly committed yet?

If the application considers the objected committed once it is marked dirty (and/or placed on the write-behind queue), then I assume that there could be potential consistency issues with write-behind, where one box has the bet placed (on the queue, but not in the db), but the other boxes do not see the bet placed?

And even on the box that placed the bet, if the db is queried with a projection that includes the bet (rather than the Bet domain object; eg. Some report of all bets, joined in with the user and their account status), then I assume the query will return inconsistent results (ie. The report of bets will not include the bet on the write-behind queue)?

Not that any of this is an argument against write-behind; I just want to understand the use cases where this can be applied safely.

Thank you.

Now to attempt an answer.

Write-behind and detaching are not a panacea. I mean that it is true that this pattern cannot always be used. But at the same time it does not suffer from the issues you raise, at least not in this use case.

Always start from the top down as we would in this online betting use case. What is the user experience we want and why? Make design decisions from there. In the questions you ask, I sense a desire to keep the database as the master of all data which is a bottom-up approach. Let's cover the pitfalls of a bottom-up or database-up approach first.

As an example, I was on a panel with Brian Goetz and several "grid" vendors in 2008 at QCon San Francisco. A similar question to yours was asked--"Can I really always build asynchronous apps?" The answer from all the grid vendors was a thundering, "Yes!" They need you to believe you can. My answer was, "No. You can't always afford it. Asynchronous nature might be hard to add given your business requirements." Brian got much more concrete and asked the audience to think about EBay and Amazon. Amazon can be simplified down to an e-catalog site. This is not to take away from their monumental achievements in scale and reliability. I only mean to say that they have a massive need for caching where people buy maybe 5 - 10% of the time so 90% of visits are readers. EBay has a massive write rate and those writes are highly localized to auctions ending in the next hour or minute.

EBay and Amazon need to make almost entirely inverse architecture decisions yet both are called eCommerce sites. Both have a catalog. Both have a user account management function. And, both have payment clearing capabilities. Amazon also has fulfillment capabilities and warehousing concerns that EBay does not but that is beside the point I am making.

Amazon might tell you that Sleepycat or voldemort or memcache rule (they only use Sleepycat, BTW. Voldemort is an OSS clone of their Dynamo approach). EBay would tell you that partitioned Oracle works great. The two are both correct. EBay needed transactional ACID updates in many of their pageviews whereas Amazon needed a different optimization (read Amazon's Dynamo paper for their approach to eventually consistent data storage).

Back to our topic. In the case of betting, you have:


  1. a catalog of games / matches / things on which to place bets

  2. User account info

  3. payment clearing system

  4. placed bets which are like items you have sold that you have yet to fulfill / deliver


and more.

We can surmise together that when I place a bet and I get an email or web-based confirmation saying my bet has been placed for, let's say, $100, at 2:1 odds in the affirmative (meaning I think the results will be in favor of the bet direction), I fully expect to get paid $200 after the game if I win and to have my $100 debited if I lose. If I win and I do not get the $200, there will be no excuses the bookmaker can make that will have me come back. A bookmaker who does not know the commitments he has made will soon be out of business.

So as an architect, I start to think about coherence and transaction isolation, and databases sound really good. But are they necessary? No. What I really want is:


  1. all my commitments on disk so I never risk losing anything

  2. make that on more than 1 disk in case I even lose my disks

  3. coherent recording of the odds and the amounts of the bet

  4. a transactional view of my users' account balances so I don't extend credit when I do not intend to.

  5. a fast, efficient way to search for, display matches and also a coherent way to update odds as my staff learn new info (say the quarterback breaks his leg).


Mapping Terracotta to this business requirement I would suggest:

  1. well, Terracotta is disk-based just like the DB so I can in theory use it in more places than I would use a cache

  2. Yes, Terracotta server array will have at least 2 copies or more if I configure properly, on 2 or more separate _machines_. That's lots of redundancy, more than the DB in fact.

  3. Terracotta operations have isolation levels just like a DB. There are read locks, write locks, synch-write locks and concurrent locks. I would likely use a combination of these to record a bet that is being placed. Specifically, read lock the catalog to bet against the currently-available odds, and write-lock the user's account or my list of placed bets or both to debit the account and write the placed bet.

  4. I can use readwritelocks in util.concurrent to compose a transaction across a debit + credit operation against a user's account. But, I want the accounts recorded in the DB and cannot use Terracotta in a 2PC manner, so I might choose to cache the account in Terracotta but keep it in the DB, using instead a 1.5PC (I covered 1.5PC in the original blog and will not cover it again here).

  5. Perhaps put the catalog of games in the DB, and cache the catalog in Terracotta and when my back office content management tools change the DB, also send a JMS message to a node in my app cluster to update the in-memory cache as well, or I could just make my back-office updater a part of my production app cache in Terracotta.

In summary, cache all the read only data in Terracotta to offload the DB of its otherwise wasted usage under web-onlookers. Put user financial info in the DB and manage account balances there, using DB transactions to update. Then put the bets into Terracotta, lazily flushing to the DB so that a back office SQL script can run payouts and collections at the end of each match, against people's accounts. This will offload not just read-only onlookers but actual updates and business transactions as well, at least for a time. At Terracotta we call this "shaving the peak load" where the DB does not have to be sized large enough to handle traffic spikes.

To make this all work, we need a few more details though. We have to make sure that we write an "end of match" eventing system that makes sure to flush all placed bets to the DB so that we can clear payments accurately at the end of a match from entirely within the DB. Could you put accounting in Terracotta? Yes, but all your back office tools would have to be rewritten to work against Java web services of some sort instead of against the DB they most likely work against today.

As for a stable view of bets and odds that are constantly being changed by the back office team members who are trying to optimize company revenue and there are a few challenges. First, the odds. Easiest way to update them in a transaction with the DB is to have the back office app that changes the odds, work within a Terracotta transaction.

Transactional%20writer.png

Now, the odds will be updated or fail and the business will know what risk the failure poses. Our employees will always get clear messages from the Office updater app that they successfully changed the odds for a match or that the change failed. They can even choose to stop all betting if something goes horribly wrong. But they will always know the status of the games and odds.

Now, let's secure our betting model. Bets are being placed into Terracotta and onto 2 disks so they are safe. And the odds at which they are placed were valid odds at the time given we made a good back office tool for updating odds. We just need to make sure all bets are flushed to the DB before the DB executes its payout process. Should be simple. When a match begins, flip the DB state to "INCOMPLETE" for that match in some table. the payout script will refuse to run payouts against incomplete matches. Now, have the app cluster flip the match to complete using a timer task of some sort. Put the COMPLETION marker in the asynch write-behind queue with bets. When the app tier decides to COMPLETE the match, all subsequent bet attempts will be rejected. And a COMPLETION marker will be placed in the queue that will flip the DB state. But it is placed in sequence or in order with all the bets. So all you need is an in-order asynch write behind and bets can be placed asynch with the DB commit. Let's take a look at in-order queuing in a picture. It is a kewl technique that can help offload the db with fewer worries.

inorder%20queuing.png

We have 2 recipes. First, adding Terracotta and 1.5PC updates to the DB-based application. Second, using write behind queuing and sometimes guaranteeing the queue's processing order. There are more but I just want to give you a taste of the sort of thinking you must embark upon when offloading the DB using Terracotta, in a by-hand manner.

As you can see, these recipes won't apply in all use cases. And now I think we close in on an answer to your question. Bets are not shared data amongst multiple users. Bets are per-user, you see? And thus bets can be asynchronously written back to the DB and we can even build db tasks that should only run when the asynch queue is drained and have no polling. If bets were in fact as you suggest and predicated on what bets others have placed we couldn't do asynch write behind. We would instead have to do write through. If we were not a bookmaker, by the way, and instead a betting platform, then the pattern would shift from asynch write behind to matching engine. Matching engines can be built in Terracotta that are 100X faster than Oracle RDBMS, but that's a different blog entry. Nonetheless, I would still not need to write through to a DB nor do I have to give up on offloading the db in such use cases where I share data amongst user threads. I would just need a different pattern. In fact, I hear from many customers that they have moved highly contended multi-user write operations to Terracotta from the DB because the DB lock manager deadlocks regularly and Terracotta can handle much much higher write rates (check out the quote from Guy Moller, CTO of Brands4Friends on our terracottatech.com homepage--that is one such use case).


If I were to give you 1 rule it is this: use the freeze/thaw method. If you are using an ORMapper and thawing data out of a DB record back into memory every time you use that data in the app, consider keeping that data thawed in Terracotta. If the thaw occurs, but the data is only thawed once a day, or once a week or once in a month or once in a year, you could leave it frozen. Thawed data format is good for data being accessed once a second or once a minute or once an hour. Once you leave data in thawed form, you eventually have to freeze it back if your database is supporting reporting or other back office operations that your app cannot support on its own. Freezing data that has been thawed for a long time is a write-behind pattern.

Hope this helps. If not, consider just using Hibernate, Spring, and Terracotta together as in our Examinator reference app. You can then plug Terracotta in as a 2nd level cache providor and offload the DB as much as possible w/o major app surgery.

Cheers,

--Ari

Comments (1) | TrackBack (0) | Permalink

April 16, 2009

Know your use case and optimize accordingly

posted by ari

Ok, this one WAS going to be short. But here it is anyways.

<soap-box>
Let's not get caught up in technology for technology's sake

I was reading about Twitter-this and Twitter-that for the past few weeks. People keep writing about its "architecture" which is really a way of writing about how they would solve the problem had they been a web giant with a killer idea, like Twitter. Again, they are not, so I have to hearken back to Werner Voegel's tweet last week that said something like...


those who have built massively scaled architectures do not comment on the challenges of those in the middle of building such things

So, lots of armchair quarterbacking going on here. But it has reached the point of absurdity because now everyone seems to be pulling in their favorite technology du jour. "I can do it this way with Scala." "No no. Hadoop, you idiot!" "Erlang would save them!" Don't get me wrong, I am not about to dig at these technologies nor am I about to claim Terracotta is better or competes or that Terracotta should be used for Twitter.

I read a post yesterday where the author asserts that on every visit to twitter.com in order to view the tweets awaiting you from friends and general twitter community, you should do a MapReduce operation over a grid of twitter users' tweet data, looking for the ones that match your personal subscription list. Yes, folks. That's 1 MapReduce of the entire tweetset per pageview. And, by the way, when a tweet occurs it can be super fast because it just as to write my tweet to my personal tweet bucket. Others will get my tweet when they come and check for it.

So, let's break this down. O(1) for the write operation. And that constant is very small, and efficient. Good. But wait. for n twitter readers, I need to look at n-1 twitter users' accounts for tweets about which I care. That's O(n(n-1)) which is O(n^2). ICK! I have millions of people viewing twitter an hour and that's an O(N^2) MapReduce (because MapReduce r3w1z!) yet I have hundreds of tweets per second which is O(1) because why? I don't know.

I think you would want to optimize the read path, not the write path. Thus, instead of a bulletin board pattern, you want to use a mailbox pattern (and something like Scala) to send tweets to each individual user's mailbox for viewing whenever he or she returns.

By using the mailbox pattern, we are trading off space for time. Let's break it down again. A tweet should go to all the interested parties who are following me. That would be my 10 - 1000 friends plus the general twitter mailbox that all can see. So the write would be O(n) for n friends (ignoring optimizations like the listener pattern which could arguably make this O(1) as well). My message, in the worst case, will visit every inbox in the cluster. But the read is now simple. Just open my inbox and display all the tweets that have arrived since last check; O(1).

There. Now the operation I do millions of times an hour through the Twitter API and direct through the web interface work and the writes aren't that slow either.

By the way, you will notice that the read-optimized solution we just built is pub/sub essentially so I can create a massive grid of point-to-point communications sockets to send messages between individuals and I can push instead of pull updates. By getting a push-model going, I can easily send the SMS update, send the AIR-clients their update, and do a Comet-style conversation with browser-based users to push the updates. This push probably helps eliminate 50% plus of my traffic by keeping the manual polling to a minimum. A MapReduce-based solution must be invoked by the user...it shouldn't run on a regular schedule in my opinion.

UPDATE: If you run the MapReduce on behalf of all twitter users, I just realized you have an O(n^n) Algorithm. Worst I have EVER seen. Think about it...n users each visiting n-1 other users tweet buckets looking for tweets. Yes. On a regular schedule, you will attempt an n^n data operation.

Thus, meta-refresh for browsers and polling-based architectures would flood my system with more and more inefficient traffic.

In sum, know your use case. Optimize for the read path versus the write path. Optimize for push versus pull. Don't just use MapReduce or Scala or Erlang or even Terracotta because you can. Use us because we solve the business problem top-down the most appropriate way.
</soap-box>

There is no silver bullet,

--Ari

Comments (6) | TrackBack (0) | Permalink

April 1, 2009

Terracotta working on top of Coherence

posted by ari

After quite a lot of work I got it to function. The thing that might shock you is there is not a lot of value in this config. It doesn't do anything special. It doesn't do anything new. It doesn't make anything better.

Why?

Because all I did is rename the host I was running the Terracotta Server on to "Coherence." There. Terracotta running on Coherence.

April Fool's :)

--Ari

Comments (0) | TrackBack (0) | Permalink

March 6, 2009

The fallacy of Peer to Peer

posted by ari

And one more image. The fallacy, BTW, is that you don't care if you can push to all your peers in parallel. The fact that you have to is crazy and unscalable. Data mastering / partitioning is way more scalable w/o a loss in reliability, but way more availability than a mesh (peer-to-peer or multicast).

grids%20n%20fabrics.png

--Ari

Comments (1) | TrackBack (0) | Permalink

March 5, 2009

2-tier coherent clustering

posted by ari

Contemplate this one for a while

Terracotta

2-tier%20coherence.png


Relational DB


1-tier%20coherence.png

I will tell you more about it all soon.

--Ari

Comments (1) | TrackBack (0) | Permalink

February 16, 2009

Offloading a DB even when updates need to be transactional

posted by ari

We have a lot of active deployments where people need to integrate Terracotta into a use case where the goal is to offload the database. Transactional updates between Terracotta, working as a cache, and the database would seemingly slow the cache updates down to the speed of the database. This line of reasoning is not only common but quite logical. However, it's not true at all. There are two ways that using Terracotta as a cache in front of the database can offload that database, and offload it significantly. The first mechanism for offloading the database and the most common is using Terracotta in write-behind mode. The second is to use Terracotta as a coherent read cache writing through to the database. In both cases, the database can either be made smaller or can handle many more transactions per unit time.

When writing behind to the database, Terracotta's integration module for asynchronous updates, (Async TIM) provides a simple, scalable pure Java way to offloading I/O. The Async TIM provides an API for flagging objects in any collection as "dirty" and needing update to any backing store. The module guarantees that objects dirtied by a particular JVM do not migrate across JVMs in the process of getting flushed to the DB. This means that each JVM is working with only the changes it makes--all in a highly localized fashion. The Async TIM thus provides a mechanism for flushing data that changes from application JVMs back to the database; a way to flush those changes at a rate the database can support, keeping data in the Terracotta array until it reaches the db; and a way to make sure the entire system is reentrant and that no data or events will be lost or duplicated between the cache and the database, even in the case of an entire cluster restart. Async TIM's work queues are all Java queues running on top of Terracotta so the queues are durable, transactional data backed transparently by the Terracotta server array. A large online gaming site and a social networking customer both use Async TIM to write fast changing information directly to the cache. The online gaming site writes odds and game data, as well as all bets being placed by end users into the DB for eventual settlement / payment. The social networking site writes highly contented inventory information to a central cache of product availability. Both have decreased their database from an Oracle instance running on Sun hardware from >12 CPUs down to 4, which means a really nice savings in Oracle licenses.

When writing through to the database, Terracotta's configuration-based locking semantics layered on top of a pure Java collections prove sufficient for our customers in offloading excessive I/O. Start with a simple HashMap. Execute all reads from the cache and all writes straight to the database. In order to execute a transactional update with the DB, use a Terracotta synchronous lock to flag the cache as stale, cluster-wide. Then the database write is done in a normal, fast Terracotta asynchronous lock (using just Java synchronization but flagging the object locks as asynchronous at runtime), finishing with reflagging the cache as clean. We at Terracotta call this a 1.5 phase commit. Whenever reading from cache, we are guaranteed that the database is the master of all data and that the cache is subservient because the cache will not be valid until set clean, and it cannot be clean unless the database and cache got updated without error. A major telco uses Terracotta collections with a 1.5 phase commit locking scheme in order to offload their database, increasing the capacity of their existing cluster by 7.5X. At time when you want to do more with less, that's a good return.

As you can see, Terracotta helps both asynchronous write-behind users and synchronous write-through users offload the DB quite significantly. In one case, customers use durable, transactional POJO queues to write to the DB in a throttled manner. In the other, customers use coherent cluster-wide read caching and a 1.5 phase locking scheme to all but eliminate read traffic to the database.

Comments (5) | TrackBack (0) | Permalink

February 12, 2009

Atlassian Crowd Clustered with Terracotta

posted by ari

Apparently, a famous technical author and editor just recently completed a clustering production use case on top of Terracotta. Use case? Crowd single sign on service from Atlassian was being held back by the user database underneath. So he put Terracotta into the mix for clustered caching across many Crowd instances. Now it scales easily and w/o a lot of investment.

To my friend, the editor, I say cheers and job well done.

To everyone else interested in this use case, I will see if I can get this contributed back as a TIM.

More importantly though, with Shibboleth and Crowd both clustered on top of Terracotta pretty much seamlessly, perhaps its time to take a look at adding single sign on to your app, and to make sure that SSO service is running with Terracotta under it; dare I call it a standard or best practice?

Cheers!

--Ari

Comments (0) | TrackBack (0) | Permalink

December 17, 2008

Clustered Fork / Join

posted by ari

By the way,

I checked with the Devoxx book store last week while at the conference in Antwerp. 2 books sold out by the end of Thursday: Scala by Bill Venners and company and our own Terracotta book from Apress.

This is very exciting to me as some smart concurrency folks and I just agreed to get a well-built and fast clustered Fork / Join done ASAP. This will lead to something for Scala + Terracotta, I am sure. (I am being a bit secretive on purpose because I don't want to promise too much till we get going.)

Stay tuned,

--Ari

Comments (1) | TrackBack (0) | Permalink

The man asked for SQL but he wanted POJO

posted by ari

Was talking to some Terracotta users last week. They had a classic Business Process Mgmt challenge.

An upstream system would create work for downstream systems to execute. But the work was spread across many of these downstream services and could take days to complete before reporting successful results to the upstream server.

The existing system is built via JPA and Oracle underneath. There are several special cases handled in the production system. For example, any job that requires invoking a downstream service that is down ends up getting timed out and gets stuck in the db as DEAD (some column in a table somewhere is updated when the job times out). Jobs that are attempted to be run against services that do not yet exist also end up DEAD in the jobs table in the DB.

So the production operations team and the developers regularly run SQL scripts that find all jobs created between certain date ranges who's status is DEAD and kick those jobs by resetting them to INITIAL state. Then the job processor will see them again (simply SELECT * from JOBS_TABLE where JOB_STATUS=INITIAL type thing).

Now, the developers are annoyed with the constant maintenance of this system and implement a replacement using Terracotta and POJOs and collections. There is now a map of jobs and a job is some sort of POJO (I don't have the details, sorry). Problem is, the production team asked, "how do we kick DEAD jobs without a SQL interface to this thing?" And so the developers came to me and asked, "when will Terracotta have a SQL interface and are there workarounds using, for example, Compass or Lucene, or Hibernate Search that you can help us with?"

My answer: step back a moment. The requirement is not for a SQL interface here. It is for a well-designed system that can restart DEAD jobs that die due to temporary timeout or due to misconfiguration / non-existent service definitions. Well, the system is easily designed without any SQL requirement or search requirement whatsoever! (I said triumphantly...)

I don't think I did a good job explaining myself to the users but I want to do so now. First, move to an event driven architecture. Forgo the notion of a simple map and a POJO job and iterating on the map once a minute via a daemon thread. This is very DB like in its nature--scan all the data over and over again looking for rows or objects that match a pattern.

Instead I think this use case should be implemented in an event driven manner. Write each service as sitting behind a clean interface with perhaps an invoke() method or something simple. Build a pipelining system that creates a queue (LinkedBlockingQueue perhaps) of work per service-type. Then have a thread pool (ExecutorService perhaps) that sits on the tail end of each work queue take()ing work off that queue and firing UnitOfWork.invoke() (from the interface). If the UnitOfWork times out, re-queue it. If the service-type does not exist, create its queue but do not instantiate its thread pool. Work will just pile up until you OOME on that queue (don't worry about that at the moment...Terracotta will take care of spilling the queue to disk so you don't OOME). Last, write a servlet to update service definitions in the application (turn them on, turn them off, start or pause processing, reroute a service to another service, etc.).

Now what have we achieved? We have a single-JVM multi-threaded way using SEDA-type constructs to process many different types of service requests. We have a sorting mechanism that keeps each service request in a queue where only requests of that service reside. We have a way to turn services on and off without touching each task itself but instead updating some metadata about that service. And we have a multi-tasking architecture that can invoke many different services in parallel each in its own thread pool.

Now the only problems with our architecture are:

1. It doesn't scale past 1 JVM without replacing our java.util.Queue interfaces with JMS (ick!)
2. It doesn't persist the jobs in case of JVM failure
3. The status of a service--its metadata--needs to be consistent across all threads across all JVMs so that I can flip a service on / off simply by updating the map of services inside our servlet. We might even want to notify() someone in a thread pool that the service changed. But you can't notify() across JVMs.
4. If a service is defined weeks or months before it is activated, then we would have jobs sitting waiting to execute for that long period of time. Surely a POJO-based job mgmt solution would OOME

Terracotta eliminates these 4 issues for us transparently if we build the event driven system:
1. LinkedBlockingQueue and ExecutorService are transparently clustered with Terracotta. Create jobs on one JVM, take() them and invoke() their invokable interface on another. This allows us to go so far as to create a cluster of JVMs working together on a single service or on any number of services. E.g., Service A is backed by 10 JVMs while Service B is on 2 and Services C, D, and E are all on one JVM.

2. Jobs are POJOs in this UnitOfWork / LBQ model. The queues need to be flagged as Terracotta roots and thus all jobs inside them are persisted to the Terracotta array and to the array's disk transparently.

3. Service status map is another Terracotta root and wait() and notify() will work across threads _across_ JVMs with Terracotta.

4. A queue of service requests can just sit in memory as far as you, the developer are concerned. Terracotta will transparently flush requests from your JVM to our array and from our array's memory down to disk in case of failure.

So the development team asserted, "Terracotta gives us durable job state and spill-to-disk transactionally underneath our workload processor."

The operations team retorted, "How do I update services and re-run DEAD tasks without SQL."

I respond to the operators by saying, "Forget the notion of embedding a service's state inside each request. Go to thread pools, work queues, metadata structures, and wait() / notify() all built in a single JVM. Its not easy but it will prove very powerful and scalable...well beyond what a DB can support and without a rewrite once built."

So now I hear they are beginning implementation over the holidays of a LinkedBlockingQueue based solution.

What do you all think. Should I write up such a container's implementation in the Terracotta forge? Would it help you with anything you are working on? And, BTW, would you want to compose multiple services together through an orchestration service like Apache ServiceMix?

Let me know via comments or email.

Cheers,

--Ari

Comments (9) | TrackBack (0) | Permalink