« 2006, The Year of the True POJO Entity | Main | Comments on Spring Scalability »

January 15, 2006

Conflicting Galactic Forces in Java

posted by ari

After spending the last several years running the JRockit JVM business at BEA and the last year working closely with Terracotta's customers and partners, it has become totally clear that there are two big galactic forces in Java today - namely: scale-out and complexity - and these two forces are actually conflicting with each other and reinforcing the difficulties facing Java developers and application operators in the market today. Layered on top of this is the need for the managed runtime (the JVM) to handle more of the operational responsibilities in Java applications today (see my previous post on Developer vs. Runtime Responsibilities). Let me explain myself a bit more clearly:

Galactic Force #1: Scale-out of Java Applications

If you walk into any major, large scale data center running mission critical Java applications today, whether it be a Wall Street trading system, an airline reservation system or a telco provisioning system, you are more likely to see racks and racks of small 2 or 4 way Intel/AMD servers running Linux or Windows than you will see large 28/64/128-way SMP boxes. The economics of commodity systems today are driving unprecedented scale out of Java applications. Some shops still scale up their systems on large SMP boxes, but by and large scale out is winning in the market today.

The problem with scaling out Java is that it breaks the original contract that Java made with the developer 10 years ago. The original promise of Java was that the developer was supposed to just write the application to target this magic runtime, the Java Virtual Machine, and the JVM was supposed to handle everything "below" it automatically. Of course, this works great for abstracting away the particular operating system or chip architecture, or for handling memory management for the developer. In fact, this works almost perfectly as long as the application runs contained within a single JVM. If the application is intended to be deployed on multiple JVM's, in other words - if it is intended to be clustered or distributed, this promise falls apart. The developer absolutely needs to explicitly code for scale-out today - the managed runtime does not take care of this automatically (at least not until Terracotta!).

Today there are three common approaches to clustering Java applications:

  1. Using complicated Java API's: Clustering a Java application means that the various nodes must share in-memory object data and be able to signal & coordinate amongst the members of the nodes. Java developers can effect this behavior by using complicated Java API's like JMS, RMI, or RPC to share state and behavior among nodes. This greatly increases the complexity of Java applications - resulting in larger, more error-prone source code, longer development times, and difficult management of deployed applications.
  2. Using a relational database: Object data can be serialized and stored in a relational database to make the data available to other nodes. This is a relatively common development pattern, however it greatly increases the load on the database and forces the developer to think about database design and deployment details during the development process. Application code is far more complex, and performance suffers greatly as the database now is in the critical path of the Java application.
  3. Using one of the many "shared space" products: Over the past 5 years, many ISV products have popped up to help relieve the developer from writing complicated communication code and eliminating the database as a temporary store for transient object data. Products based on the still preliminary standards like JCache or JavaSpaces do indeed reduce database load, but they all come with their own set of API's, and force the developer to do some very unnatural things like putting objects back into a shared hashmap when changes are made. Breaking object identity (see Patrick Calahan's excellent series of articles in this blog on breaking object identity) means that pass by reference no longer works, which forces the developer to use things like primary keys with their objects - which, when you think about it, is not that different than using a relational database to share objects. These solutions are complicated, unnatural, and are essentially simply scaled down in-memory databases for objects.

All of these above "solutions" are unnecessarily complex - exacerbating the second Galactic Force happening in Java today.

Galactic Force #2: Complexity of Enterprise Java

It will come as no surprise to any readers of this blog that enterprise Java has become overly complex in the last few years. The Java Community Process has stepped up to simplify some things, as evidenced by the new EJB 3.0 spec. However even with these innovations, Java EE is still far more complex than most developers can handle. To deal with this complexity, some excellent development frameworks like the Spring Framework have cropped up in recent years - all with the goal of vastly simplifying the task of creating full-fledged enterprise Java applications. These frameworks have been encouraging developers to think more about their business application, implemented to the extent possible using POJO's, and less about the underlying infrastructure needed to actually deploy these applications in production.

One issue with most of these frameworks is that they don't help much with the problem of scaling the application out to multiple nodes. In some respects, they actually make the task of scaling out harder than it would have been if the application were written with full J2EE specifications and run within an expensive, mission critical application server like WebLogic Server. Now, some people are actually advocating the deployment of Spring applications on a clustered J2EE server like WebLogic. I'm not sure how much sense that makes if one of the reasons people are using Spring is to get away from these heavy-duty containers in the first place.

Conflicting Galactic Forces

So, we have scale out happening today, but all current scale out solutions increase application complexity. And we have simplification solutions out there that actually hinder scale out. These two forces are conflicting with each other, making the situation ever worse for both the Java developer/architect as well as those in charge of managing/operating Java applications. Clearly, there is tremendous value in breaking this conflict and eliminating these compromises, if possible. The solution here rests with the managed runtime itself.

The Managed Runtime as a solution

As I argued in my earlier post, Developer vs. Runtime Responsibilities, the Java managed runtime (JVM) has a long and impressive heritage of abstracting unnecessary infrastructure concerns for the developer - why shouldn't scale out and clustering be added to those concerns? The Java language already has built into it an impressive set of structures/API's/principles to enable true parallel multi-threaded computing. We already have wait()/notify(), sychronized(), locking, threading, etc. already built into the language - isn't this enough for true multi-node computing as well? After all, there is nothing theoretically different between multi-threaded and multi-node computing other than bothersome things like where the physical servers begin and end and how to communicate over a network. It would seem that these are things that the managed runtime should take care of for us.

This is, of course, how Terracotta was conceived. We do enable true multi-node distributed computing with the ease and simplicity of single-node Java development. I would encourage anyone interested in learning more to
download our product
and give it a try.

Comments

> We already have wait()/notify(), sychronized(), locking, threading, etc. already built into the language - isn't this enough for true multi-node computing as well?

does wait/notify handle server failure and network delays?

Posted by: SD at February 27, 2006 12:06 PM

Hi. Yes, failure is handled. If a client node (L1) becomes unavaible due to process or network failure, it eventually gets timed out. To ensure consistency of data across the cluster, re-entry to the cluster is disallowed in the case where a network failure that subsequently corrects itself.

In the case of server (L2) failure, we provide a failover mechanism which ensures that a second 'standby' L2 can take over and maintain the correct state with respect to both data and locking - the L1s never notice the difference.

Hope that helps, please let me know if you have more questions.

Posted by: Patrick Calahan at March 10, 2006 11:31 AM

Post a comment




Remember Me?