September 14, 2005
Scalability vs. Correctness
posted by ari
I recently had a chance to spend time with a major enterprise software vendor's R&D leaders. They have been closely watching Terracotta and our space. The VP of Engineering for this company summarized the space by saying, "Terracotta's competition leaves correctness as an exercise for the reader."
What did he mean? I think it comes down to scalability versus correctness. Fundamentally, we cannot sell "Enterprise Software" until we have something with which a developer or operator cannot shoot themselves in the foot--we call this "do no harm." But performance vendors offer up tools that enable asynchronicity and, with it, potential race conditions. Do correctness and scalability, therefore, conflict? Is it just too expensive to get a clustered lock from a network resource, and should I thus allow "last one in wins" semantics? Well, the answer is mostly no. I should not dictate such semantics. I should not offer up a tool for scalability that focuses on "fast" in exchange for "right."
In this article I will explore the shortcomings of a speed-centric API-based approach and offer up an explanation of what Terracotta does instead that I think is both more correct and, thus, more useful in your day-to-day task of building highly reliable and scalable business applications. I start with some assertions that we believe are fundamental to helping you get your job done. In a follow-up blog entry, I will then add some color through retelling a few of our experiences out in the market to help validate my claim that we can provide correctness without sacrificing scalability.
My assertions
- API-based caching helps with only part of the problem
- APIs are unnecessary AND harmful because they add complexity (much like memory management used to be an API but has been factored away)
- API-based caches violate basic engineering principles: tuning during design is just too hard
- API-based caches do not address the complexity introduced by caching the database
API-based caching helps with only part of the problem
First, I should define API-based caching as cache.get() and cache.put(). Yes, there is more to it, including transactions, replication schemes, fault tolerance, etc. But what happened to the signaling across VMs from RMI and JMS that makes clustering useful? If I define clustering as sharing object data exclusively, then I have a very elegant solution that allows me to put objects in a bucket on one server and see those objects on any other server. This is good; very good; better than JMS in many ways. No developer writing business applications should be moving object data around the network just to share it between copies of the SAME application--surely we can all agree on that.
But, back to the shortcomings of API-based caching. If I have get() and put(), how do I assemble a logical usage of these APIs to inform a second server that the server doing the update just placed something in the bucket and that the data is ready for retrieval? JMS? RMI? SOAP? As soon as I introduce those technologies, what then has the API really helped with? I started by using a clustering API in order to eliminate unnecessary use of heavier and more complex APIs like RMI, but now those complicated tools have returned. This is why Terracotta says that clustering != caching. Our horizontal I/O product, DSO, takes care of all the distributed signals like wait() and notify() as well as replication or distribution of the data. Without these, a solution cannot drop in transparently underneath an existing application and, critically, cannot significantly help alleviate the developer's challenges around code complexity and code tangling (as Jonas Bonér would call it) of infrastructure with business logic. With API-based clustering, signaling and coordination turn out to leave correctness as an exercise for the reader.
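To make the signaling point concrete, here is a minimal single-JVM sketch of the plain Java idiom in question: one thread places a value in a shared bucket and calls notifyAll(), while another blocks in wait() until the data is ready. The names SignalDemo and Bucket are hypothetical, and nothing here uses a Terracotta API. It only illustrates the code shape that a get()/put() API alone does not express and that a transparent clustering layer would have to preserve across VMs.

```java
import java.util.HashMap;
import java.util.Map;

public class SignalDemo {
    /** Hypothetical shared bucket; its own monitor carries the signal. */
    static class Bucket {
        private final Map<String, String> data = new HashMap<>();

        synchronized void put(String key, String value) {
            data.put(key, value);
            notifyAll(); // tell waiters that something is ready for retrieval
        }

        synchronized String awaitValue(String key) throws InterruptedException {
            while (!data.containsKey(key)) { // loop guards against spurious wakeups
                wait();
            }
            return data.get(key);
        }
    }

    /** Starts a waiting consumer, publishes one value, returns what the consumer saw. */
    static String runOnce() {
        Bucket bucket = new Bucket();
        String[] seen = new String[1];
        Thread consumer = new Thread(() -> {
            try {
                seen[0] = bucket.awaitValue("order-1");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        bucket.put("order-1", "ready");
        try {
            consumer.join(); // join() makes the consumer's write to seen[0] visible here
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(runOnce()); // prints "ready"
    }
}
```

With only get() and put() available across servers, the consumer side of this sketch would have to poll or fall back on a separate messaging technology to learn that the value had arrived.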
APIs are unnecessary AND harmful
I view the risks associated with traditional API-based caching as both operational and development-oriented. Operational challenges should not be underestimated but are not germane to this discussion.
The more critical risk in my opinion is the development risk of introducing an unnatural set of requirements to the application. First and foremost is the violation of object identity [through serialization] and lack of ability to preserve the domain model. Patrick Calahan recently wrote a great blog entry explaining why this is important; I will avoid rehashing the discussion here.
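The identity problem can be seen with nothing more than the JDK. The sketch below, using hypothetical names (IdentityDemo, Address, roundTrip), serializes an object and reads it back, which is assumed here to approximate what a copy-based cache does between put() and get(). The data survives, but the object that comes back is a different instance.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class IdentityDemo {
    static class Address implements Serializable {
        private static final long serialVersionUID = 1L;
        String street;
        Address(String street) { this.street = street; }
    }

    /** Serialize then deserialize, standing in for a trip through a copy-based cache. */
    static Address roundTrip(Address original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Address) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Address original = new Address("Townsend Street");
        Address copy = roundTrip(original);
        System.out.println(copy == original);                    // false: a different object
        System.out.println(copy.street.equals(original.street)); // true: the same data
    }
}
```

Any code that relies on ==, on a shared reference held in two places, or on synchronizing on the object itself sees two unrelated objects after the round trip.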
The second requirement is the violation of the very notion of object oriented programming and pass-by-reference. With API-based clustering, I must be careful to whom I pass references. Before we get to an example, let's jump to the core issue: if I, as a developer, neglect to put changes back in the cache, those changes will be made locally only and will result in incoherence with respect to the cluster. Now let us look at the details; a code fragment such as:
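import java.util.HashMap;
import java.util.Map;

public class CustomerCache
{
    // Stands in for the clustered cache behind get() and put()
    private final Map<Integer, Customer> cache = new HashMap<Integer, Customer>( );

    // Body elided in the original; a plain lookup is assumed here
    public Customer loadCustomer( int customerId ) { return cache.get( customerId ); }

    // Body elided in the original; this is the put() the caller must remember
    public void storeCustomer( Customer c ) { cache.put( c.getId( ), c ); }

    // Hands a reference to an intermediate object out of the caching class
    public Address getAddressByCustomerId( int customerId )
    {
        Customer c = cache.get( customerId );
        return c.getAddress( );
    }
    // some more stuff
}

class Customer
{
    int id;
    Address a;   // Address is defined elsewhere
    // do some stuff
    // getId( ) is assumed here so that storeCustomer( ) has a key
    public int getId( ) { return id; }
    public Address getAddress( )
    {
        return a;
    }
}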
leaves the consumer of the address object to do something like the following:
Address a = customerCache.getAddressByCustomerId( 1234 );
a.setStreet( "Townsend Street" );
a.setNumber( 650 );
customerCache.storeCustomer( a.getCustomer( ) ); // this is not natural
The example illustrates that we have two problems with object oriented pass-by-reference:
- We must always remember to call put() after all changes (the storeCustomer() call at the end of the sample code above).
- We should no longer pass objects to callers outside our caching class. There is no way to ensure that if we provide access to normal Java references that anyone will remember to put the TOP-LEVEL object (customer in this case) back in the cache, regardless of the object reference they actually edited (address). Nor is it easy to get back to the containing root of the object graph (customer, in our example) when we just wanted to pass back an intermediate object (address).
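The forgotten put() can be reproduced in a few lines. The following is a sketch under stated assumptions: CopyingCache is a hypothetical stand-in for an API-based cache that hands out copies (as serialization would), and the class names are not taken from any real product. An edit made through a plain reference stays local until the top-level object is put back.

```java
import java.util.HashMap;
import java.util.Map;

public class ForgottenPutDemo {
    static class Address {
        String street;
        Address(String street) { this.street = street; }
    }

    static class Customer {
        final int id;
        final Address address;
        Customer(int id, Address address) { this.id = id; this.address = address; }
        /** Deep copy, standing in for a serialization round trip. */
        Customer copy() { return new Customer(id, new Address(address.street)); }
    }

    /** Hypothetical copy-on-put, copy-on-get cache. */
    static class CopyingCache {
        private final Map<Integer, Customer> store = new HashMap<>();
        void put(Customer c) { store.put(c.id, c.copy()); }
        Customer get(int id) { return store.get(id).copy(); }
    }

    public static void main(String[] args) {
        CopyingCache cache = new CopyingCache();
        cache.put(new Customer(1234, new Address("Main Street")));

        // The caller edits the address through a normal reference and forgets put().
        Customer c = cache.get(1234);
        c.address.street = "Townsend Street";
        System.out.println(cache.get(1234).address.street); // prints "Main Street"

        // Only an explicit put() of the top-level object publishes the change.
        cache.put(c);
        System.out.println(cache.get(1234).address.street); // prints "Townsend Street"
    }
}
```

Nothing in the type system stops the first half of main() from being the whole program, which is the sense in which the cache cannot enforce the right behavior.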
Correctness or performance? API-less clustering is built upon a low-level ability to connect into the virtual machine and be notified of changes as those changes occur. Object identity works because serialization is not needed. Pass-by-reference works because the put() is not used to signal to the cache that something has changed. With these challenges addressed under API-less clustering, the challenge for Terracotta becomes rapidly distributing change amongst servers. With API-based clustering, the right behavior cannot be guaranteed unless the developer finds a way to ensure it. The exercise of getting those changes rapidly distributed amongst servers is factored behind the API (abstracted away as with API-less clustering), but more data can be pushed than necessary due to serialization. And the cache may not see the changes due to pass-by-reference. Getting the proper changes into the cluster leaves correctness as an exercise for the reader.
API-based caches violate basic engineering principles: tuning during design is just too hard
Simple, performant domain modeling requires:
- Object identity in order to avoid object copying
- Avoidance of serialization to avoid memory over-utilization
- Pass-by-reference to enable good object oriented design.
We have discussed many reasons why API-based caches cannot meet these requirements for the application. As a result, the developer must design the domain model in accordance with the restrictions of the clustering environment. Learning how to cache and how to [re-]model my domain in a cache-friendly manner will take time and will require experience before I can build applications as quickly as I could build a prototype of that application.
Developers should be able to assume that clustering works in an efficient and natural manner. I can build my model my way and get things to cluster efficiently at runtime. The skilled application of API-based clustering technology provides for performance and scalability but leaves correctness as an exercise for the reader.
API-based caches do not address the complexity introduced by caching the database
Last, but not least, is the hardest problem for API-based caches to solve--synchronization with database updates for entities or business objects (customers, orders, etc.). We should explore this assertion by way of example. The example DB with which we will interact is as follows:
Assume that:
- Third-party business partners update the inventory_table through an application outside my control
- As a result of the third-party changes, I designed the application not to cache the database but to query the merchandising and pricing information on every transaction
- The DBA is complaining about the number of queries my application is launching against the database
- My query does a join on inventory_table with merchandising_table and department_table to get the department name, the product name, and the merchandising hierarchy all in a single query.
My schema is designed so that the updates can change on_hand or price data without perturbing the merchandising data in other tables.
With my API-based "hammer", I could assume that the DB bottleneck is a "nail" waiting to be pounded upon. So I decide to place database information in my cache, including inventory information and merchandising information. How will I place it in cache? I have 2 options:
- Easiest option--throw the DB resultsets into my API-based cache and deal with the third party changes separately
- Better option--refactor my database JOIN to instead be a logical join in memory and query inventory normalized from departments and merchandising metadata.
There is a problem with option one. Inventory items can show up under multiple places in the merchandising query [due to the many-to-many relationship depicted in the ERD] and if the third-party app changes price information, I will have to walk the entire cache looking for that product's price anywhere it may have been cached.
The second option is preferred; there is only one copy of each product in cache so I can update it in a single location. But I have to rewrite the entire application to accommodate this change [normalization] of domain model.
Regardless of which scheme I choose, every change to the data must either go through my API, so that my cache can update itself, or be relayed to my API via a trigger inside the DB (or both).
With API-based caches, these design and development burdens are placed on the developer; API-based caching is too high level for the lower-level task of caching the database. Thus, database data should not be allowed into API-based in-memory caches unless the cache can detect changes to that data transparently to application code. When leveraging API-based caches to cache database information, the developer bridges the impedance mismatch on his own...thus, the caching vendor leaves correctness as an exercise for the reader yet again.
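The difference between the two options shows up in what a single price change costs. The sketch below uses hypothetical names and sample data (PriceUpdateDemo, Row, the "garden" and "outdoor" departments, product 7); none of it comes from the schema described above. Under option one the same product is cached once per department it appears in, so an update walks every cached row; under option two the price lives in exactly one entry.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PriceUpdateDemo {
    /** One cached result-set row from the join (option one). */
    static class Row {
        final int productId;
        double price;
        Row(int productId, double price) { this.productId = productId; this.price = price; }
    }

    /** Option one sample data: product 7 is copied under two departments. */
    static Map<String, List<Row>> sampleDenormalizedCache() {
        Map<String, List<Row>> rowsByDepartment = new HashMap<>();
        rowsByDepartment.put("garden", Arrays.asList(new Row(7, 9.99), new Row(8, 4.50)));
        rowsByDepartment.put("outdoor", Arrays.asList(new Row(7, 9.99)));
        return rowsByDepartment;
    }

    /** Option one: walk the entire cache; returns how many copies had to be touched. */
    static int updateDenormalized(Map<String, List<Row>> cache, int productId, double newPrice) {
        int touched = 0;
        for (List<Row> rows : cache.values()) {
            for (Row row : rows) {
                if (row.productId == productId) {
                    row.price = newPrice;
                    touched++;
                }
            }
        }
        return touched;
    }

    /** Option two: one entry per product, so a price change is a single write. */
    static void updateNormalized(Map<Integer, Double> priceByProductId, int productId, double newPrice) {
        priceByProductId.put(productId, newPrice);
    }

    public static void main(String[] args) {
        System.out.println(updateDenormalized(sampleDenormalizedCache(), 7, 8.49)); // prints 2

        Map<Integer, Double> prices = new HashMap<>();
        prices.put(7, 9.99);
        updateNormalized(prices, 7, 8.49);
        System.out.println(prices.get(7)); // prints 8.49
    }
}
```

In either layout the update still has to be triggered by something: the cache has no way of knowing that the third-party application changed the row.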
Okay, so I have hit you over the head with what I think that senior R&D leadership meant. And, clearly, it is in Terracotta's best interest to agree with them. But our findings in the marketplace are proof positive that these things should matter to you. Please continue to visit this site over the next few days as I start to discuss some of our findings and successes.
My questions for the reader
- Hundreds have downloaded the product to date. Will you join the others and download the product and help us get better?
- Will you share your current successes and failures on our support forum?
Comments
Interesting!
Posted by: Mats Henricson at September 22, 2005 5:52 AM