
September 12, 2005

Object Identity, Tradition and DSO - Part 2

posted by pcal

This article is the second in a three-part series that examines the notion of object identity in a distributed cache. Terracotta DSO preserves object identity; traditional API-based cache services don't. This is a key differentiator for Terracotta and a big win for developers.

Overview

Traditional API-based clustered caches present developers with an API, typically a hashmap. Java developers who wish to use this API to distribute their domain objects face a difficult task. It is impossible to efficiently distribute a natural domain model: inevitably, the model has to be able to slice itself apart and glue itself back together using relational keys. This means developers have to write and maintain a lot of extra code just to make their domain distributable. And writing that extra code forces Java developers to think like relational database designers.

Terracotta believes that something is wrong with this picture. Distribution and replication should be transparent to the domain design. This article will explore how Terracotta DSO hides the machinery of distribution. It frees developers to design their domain model in the manner most natural to them.

Review

In the previous article, we explored the concept of object identity using a simple OnlineStore consisting of an Inventory of Products. The OnlineStore also contains Departments that defined arbitrary subsets of the Products in the Inventory.

We're going to expand the OnlineStore just a little bit by adding a new Department (Books) and a new Product (War & Peace):

Remember that our ultimate goal is to provide a cached view of the OnlineStore that is replicated across multiple servers in a cluster:

Effecting Repairs

Previously, when we used a distributed hashmap to replicate the OnlineStore, we placed the entire OnlineStore in the cache. Admittedly, this was a coarse-grained and naive approach. It meant that any change to an item in the OnlineStore would require the entire OnlineStore (and all items in it) to be re-serialized to the other nodes in the cluster.

Now, we're going to try to fix that by taking a more fine-grained approach to replicating the OnlineStore in a traditional clustered hashmap. A fine-grained approach is the only way our OnlineStore will scale to hold thousands or millions of items. We need to be able to update one item in the replicated OnlineStore without having to touch any other items.

This suggests pretty strongly that we should make each Product an entry in the hashmap. We'll need a field on the Product to use as a unique key in the hashmap. That's not really a problem - we presumably will have some kind of unique Product_ID field that fits the bill:

This all seems to work pretty well. The clustered hashmap captures the notion of the Inventory object (we can get all of the hashmap entries), and we can also look up any individual item by Product_ID (a key in the hashmap):
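As a rough sketch of the fine-grained layout described above, here is what the Inventory might look like in code. A plain java.util.HashMap stands in for the clustered hashmap, and the Product fields, the "P-1001" key format, and the InventoryExample class are all assumptions made for illustration; none of them come from a real cache API.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical Product type; the field names are illustrative assumptions.
class Product implements Serializable {
    final String productId;   // the unique Product_ID used as the hashmap key
    final String name;
    double price;

    Product(String productId, String name, double price) {
        this.productId = productId;
        this.name = name;
        this.price = price;
    }
}

class InventoryExample {
    // A plain HashMap stands in for the clustered hashmap here.
    static Map<String, Product> buildInventory() {
        Map<String, Product> inventory = new HashMap<>();
        Product book = new Product("P-1001", "War & Peace", 19.95);
        inventory.put(book.productId, book);   // one hashmap entry per Product
        return inventory;
    }

    public static void main(String[] args) {
        Map<String, Product> inventory = buildInventory();
        // Look up a single item by its key...
        System.out.println(inventory.get("P-1001").name);
        // ...or walk all of the entries to see the whole Inventory.
        for (Product p : inventory.values()) {
            System.out.println(p.productId + " " + p.name);
        }
    }
}
```

Because each Product is its own entry, changing one Product only requires that one entry to be written back.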

But what happens when Departments enter the picture? The simplest thing to do would be to create a new Department abstraction that includes a list of Product objects:

Unfortunately, there is a problem here. Remember that with a clustered hashmap (such as our new Inventory), the objects you work with are copies of the replicated data. The clustered hashmap does not preserve object identity; if we were to add the Product instances from our inventory into the Departments' lists of Products, we would be adding different instances from the ones that are in the Inventory.

In effect, the picture would look like this:

This is all a bit surprising considering that our Department code would be adding Java references to its list of Products. It's also clearly problematic for our OnlineStore: we're wasting memory and we'll need a lot of fragile logic to ensure that all of the copies of the Products stay in sync.
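To make the problem concrete, here is a small simulation. CopyingCache is a hypothetical stand-in, not a real product's API: it assumes the cache stores serialized bytes and deserializes on every get(), which is one simple way a cache can end up handing out copies. The Product fields are likewise illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a clustered hashmap: values are kept as bytes and
// deserialized on every get(), so each caller receives a fresh copy.
class CopyingCache {
    private final Map<String, byte[]> store = new HashMap<>();

    void put(String key, Serializable value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            store.put(key, bytes.toByteArray());
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    Object get(String key) {
        byte[] data = store.get(key);
        if (data == null) return null;
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}

// Illustrative Product; only the fields needed for the demo.
class Product implements Serializable {
    final String productId;
    double price;

    Product(String productId, double price) {
        this.productId = productId;
        this.price = price;
    }
}

class IdentityDemo {
    // Returns the price the Department sees after the Inventory entry changes.
    static double departmentPriceAfterUpdate() {
        CopyingCache inventory = new CopyingCache();
        inventory.put("P-1001", new Product("P-1001", 19.95));

        // The Department keeps whatever instance get() handed back: a copy.
        List<Product> booksDepartment = new ArrayList<>();
        booksDepartment.add((Product) inventory.get("P-1001"));

        // Update the Product through the Inventory and write it back.
        Product fromInventory = (Product) inventory.get("P-1001");
        fromInventory.price = 9.95;
        inventory.put("P-1001", fromInventory);

        // The Department's copy never saw the change.
        return booksDepartment.get(0).price;
    }
}
```

In this simulation the Department still reports the old price after the Inventory has been updated, which is exactly the kind of drift we would need extra logic to prevent.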

The JVM as RDBMS

If our design were, instead, a database schema, we would say it is an example of extremely poor normalization. It is perhaps not surprising, then, that we're going to have to borrow a tool from relational database design to fix it: foreign keys.

Rather than storing instances of Product in the Department product lists, we need to store Product_IDs:
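A minimal sketch of that change might look like the following. The Department class, its field names, and the addProduct helper are hypothetical; the only point is that the list now holds keys instead of Product references.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical Department that refers to Products by key, not by reference.
class Department {
    final String name;
    // Foreign keys into the Inventory hashmap rather than Product instances.
    final List<String> productIds = new ArrayList<>();

    Department(String name) {
        this.name = name;
    }

    void addProduct(String productId) {
        productIds.add(productId);
    }
}
```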

The picture now looks like this:

...which is at least logically equivalent to the design we started with. So we're done, right?

Well, stop for a moment to consider the unnatural act we just performed: we had to create a foreign key relationship between our Java types. This is not the way we would prefer to model our domain using the Java language. Unfortunately, the semantics of the clustered hashmap leave us no other option.

Moreover, consider how we're actually going to code against this design in common use cases. Say we want to display a table of all of the products in the department: we need to iterate through the list of Product_IDs in the Department and, for each one, look up the Product instance in the Inventory hashmap. If we want to update one of those Products, we have to locate the Inventory hashmap again and put() the Product back into it. That's a fair amount of extra code to have to write, and it will only get worse as our simple OnlineStore becomes more complex.
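Here is roughly what that bookkeeping looks like in code. This is only a sketch: a java.util.Map stands in for the clustered Inventory hashmap, and DepartmentQueries, productsIn, and updatePrice are names invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative domain types; field names are assumptions.
class Product {
    final String productId;
    final String name;
    double price;

    Product(String productId, String name, double price) {
        this.productId = productId;
        this.name = name;
        this.price = price;
    }
}

class Department {
    // Foreign keys into the Inventory hashmap.
    final List<String> productIds = new ArrayList<>();
}

class DepartmentQueries {
    // Resolve every Product_ID against the Inventory to build the display list.
    static List<Product> productsIn(Department dept, Map<String, Product> inventory) {
        List<Product> result = new ArrayList<>();
        for (String id : dept.productIds) {
            Product p = inventory.get(id);
            if (p != null) {
                result.add(p);   // skip keys whose Product is no longer in the Inventory
            }
        }
        return result;
    }

    // Change a price, then put() the Product back into the Inventory hashmap.
    static void updatePrice(String id, double newPrice, Map<String, Product> inventory) {
        Product p = inventory.get(id);
        if (p == null) {
            throw new IllegalArgumentException("Unknown product: " + id);
        }
        p.price = newPrice;
        inventory.put(id, p);
    }
}
```

Notice that none of this code has anything to do with selling books. It exists only to follow keys and write entries back.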

The crux of the problem here is that the clustered hashmap does not preserve object identity. The hashmap can't respect Java references as first-class citizens in our domain model. It forces us to rip our domain model apart and then manually stitch it back together with keys.

State of Nature

With Terracotta DSO, all of this business with foreign keys in our Java types goes away. Unlike the clustered hashmap, DSO preserves object identity when it distributes our OnlineStore to other cluster members. This means we can rely on plain-old Java references to express relationships between our domain objects. We can model our domain exactly the way we want using the full facilities of the Java language. We don't have to add any extra stuff to the domain in order to enable clustered replication.

Thus, our OnlineStore abstractions can be as simple as:
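The sketch below shows the kind of plain Java model we mean. The class and field names are illustrative assumptions, and the code shows only ordinary single-JVM Java: how DSO is told to share the OnlineStore across the cluster is not shown here and nothing about its configuration is assumed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative domain model using ordinary Java references only.
class Product {
    final String name;
    double price;

    Product(String name, double price) {
        this.name = name;
        this.price = price;
    }
}

class Department {
    final String name;
    // Direct references to the same Product instances held by the Inventory.
    final List<Product> products = new ArrayList<>();

    Department(String name) {
        this.name = name;
    }
}

class OnlineStore {
    final List<Product> inventory = new ArrayList<>();
    final List<Department> departments = new ArrayList<>();
}
```

There are no Product_IDs and no lookups: a Department reaches its Products by following references, and a change made through one reference is visible through every other reference to the same object.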

The logical result is fully normalized. It's exactly what we started with in our initial design:

At design time, there are no mysterious dashed lines. We don't have to adorn our model with a bunch of extra key-based relationships. Unlike the traditional cache service, Terracotta DSO doesn't need those things to distribute the domain across the cluster.

At development time, there is no extra code to write. No extra code to test. No extra code to debug. The developer implements the logic for her domain and nothing else - Terracotta DSO takes care of distribution transparently.

And at runtime, there are no surprises. We can locate and change objects in the OnlineStore without worrying that other copies of them are lurking in some other context. Moreover, everything we see and do is kept in sync with the other replicas of the OnlineStore in our cluster. We don't have to write any special code and we don't have to mangle our domain model.

Summary

This article has demonstrated some of the perverse effects that traditional clustered cache services can have on the design of domain objects. They force Java developers to think like RDBMS designers.

Terracotta believes that this is unnatural. Developers of distributed Java apps want to return to a state of natural development. Terracotta DSO is the tool that can help them get there.

Looking Ahead

In the third and final article in this series, we'll drill down into the OnlineStore example using some real code. We'll see just how easy it is to write and deploy a distributed application using Terracotta DSO.

Comments

"September 12, 2005": is this a forward-looking article :-?

But seriously, I think Billy is right on point here:
http://www.devwebsphere.com/devwebsphere/2005/09/opinion_on_apil.html

Posted by: Charly at September 10, 2005 05:31 PM

Actually,

I don't think that Billy is correct. And I think architects of Billy's caliber have a greater responsibility than to assert, "it probably won't work."

He asserts that (1) multi-threaded programming is hard--TRUE and (2) performance can't be optimized w/o an API--FALSE. (BTW, he also asserts that any benchmark or test can be created to prove a point and I agree 100%--numbers can lie. But I will tell you some of our performance numbers anyway.)

With respect to multi-threaded programming, a user of API-based or API-less caching HAS to understand some notion of transactionality; otherwise the program is incorrect. Just because I can do cache.get() and cache.put() without acquiring a lock of any kind doesn't mean that piece of code is correct; just that it compiles and runs--with race conditions. The situation for API-less and API-based caching is IDENTICAL. Think about GC and JIT. They make life easier during dev and faster at runtime. Yet, memory leaks can plague applications. Just because someone can have trouble with GC and references and application design doesn't mean GC is bad or not possible.

As for performance, when one builds a solution to safely and correctly cluster applications automatically, she has to do 2 things:
1. find changes where they occur
2. find a batching mechanism / boundary

With an API, you are leaving that heavy lifting to the developer. There is no performance advantage to an API when trying to cluster those changes. Performance is arguably ONLY achieved by an API-less solution (again, the example of GC comes to mind). I have trouble believing that with my operating system, for example, I as the user should be keeping track of and moving the most popular blocks of files into cache to avoid I/O. Another example: when is the last time you wrote code to optimize what data or instructions were in your CPU's L2 cache when writing a Java app? Seems like an API forces you to ask the wrong person w/ the wrong tools to get the task done.

As for performance, customers have been very satisfied, finding, for example, that TC can cluster the TPC-W test and run almost as fast (5% overhead) as an unclustered application server, whereas native app server clustering or object caching costs >10% overhead (and in some cases 50%).

Customers also have had recent success dropping Terracotta into their application and clustering custom domain models in under 1 day, and then finding that TC's solution outruns the custom integration of performance-oriented APIs.

Posted by: Ari Zilka at September 14, 2005 07:35 AM