Good article on the rise of Glassfish
Darryl Taft never fails to deliver a good read on an important topic. This time: Glassfish's rise (and the corresponding fall of expensive app servers).
--Ari
« September 2008 | Main | December 2008 »
Darryl Taft never fails to deliver a good read on an important topic. This time: Glassfish's rise (and the corresponding fall of expensive app servers).
--Ari
I was thinking last week...the database can prove extremely hard to live with for the application developer when working with hierarchical data. This is something with which many experts and seasoned veterans of enterprise Java app development to whom I know readily agree. But, when spending time out in the market with customers and users, once in a wile I land upon a few folks asking, "why Terracotta? Why can't the database keep up?" Here's an attempt at an answer to the underlying question of where the database doesn't work and, as a result, where Terracotta can help the otherwise database-centric application.
The following diagram is of a classic computer science state machine. These devices are sometimes used to define languages / grammars or business processes and workflow. In our case, let's treat the diagram as both.

Starting with a language specification, let's use the diagram together for a second. Assume we have a method named
boolean valid( final String s );
| input | returns |
|---|---|
| abe | true |
| abcde | true |
| ac | false |
valid( "ac" ) returns false because there is no arrow connecting 'a' directly to 'c' in the state machine. A few longer ones would include:
| input | returns |
|---|---|
| abcbcbcbcbcdbe | true |
| abcdbcdbcdc | false |
So, implementing valid() in Java naively would look like this (I am sure I have typos...I didn't try to compile this so I apologize up front):
int STARTED = 0; // only triggered if 'a' is encountered
int ENDED = 0; // only triggered if 'e' is encountered
int A = 0; // last token was 'a'
int B = 1; // last token was 'b'
int C = 2; // last token was 'c'
int D = 3; // last token was 'd'
int e = 4; // last token was 'e'
int lastToken = -1;boolean valid( final String s ) {
int index = 0;
while( index < s.length() ) {
char c = s.charAt( index++ );
switch( c ) {
case 'a': if( STARTED == 1 ) {
System.out.println( "error at" + index ); break; }
else { STARTED = 1; lastToken = A; }
case 'b': if( STARTED == 0 || ENDED == 1 ) {
System.out.println( "error at" + index ); break; }
else lastToken = B;
case 'c': if( STARTED == 0 || ENDED == 1 || lastToken != B ) {
System.out.println( "error at" + index); break; }
else { lastToken = C; }
case 'd': if( STARTED == 0 || ENDED == 1 || lastToken != C ) {
System.out.println( "error at" + index); break; }
else { lastToken = D; }
case 'e': if( STARTED == 0 || ENDED == 1 || lastToken == C || lastToken == A ) {
System.out.println( "error at" + index); break; }
else { lastToken = E; ENDED = 1; }
}
}return ENDED;
}
Okay. That was ugly but it sorta works. The next step is to implement this same logic in a database with SQL. I don't know how. More accurately I know several ways which I have no intention of trying to write down as the solution requires several parts. Part one--a table with some columns to track intermediate states. Part two--some fancy SQL UPDATES with JOINs onto oneself and maybe some validation rules. Perhaps I would create a limitation of, say, 256 characters on the input to keep things simple. Then I would create 256 columns in a table and I could insert the String one character at a time, looking at the previous column's value in a query before inserting the current column. Then the String would only make it into the table in its entirety if the String passed my parsing logic. This way I don't need much work and I could insert / update a new row for each input and thus report on which Strings passed and failed validation by looking for rows that do not end in 'e'. Surely my easiest option would be to write a stored procedure in the database--in Java even--to represent this state machine but that is somehow cheating.
Let's add a twist. If I want to analyze how a particular String made its way through the state machine (say I want to draw the state transitions and changes over time), in the Java version, the String is itself the history of the String. Example: "abcbe" tells me that the state started at "a" and then added a "b", "c", back to "b", and then ended in "e". I just need to take characters off the String in order to see how they got there.
I believe this state machine suggests that in an object-oriented world, the object graph itself is the hierarchy--the relationship--and I can easily represent the object's history too. I recently helped a user create a VersionedTree by hiding LinkedLists at each level in the Tree. Every time I update a level in the Tree, I create a new entry in the LinkedList representing the new path in the Tree. There VersionedTree with Copy-On-Write semantics. Kewl, right?
In the database, history or state transitions have to be modeled separately. I have to write a transaction log or history table and I have to restrict all developers to write to it via my interfaces and rules or I have to update this history via triggers. I have to do lots of work to shoehorn my "machine" into the database.
Hierarchies and directed graphs like a state machine (which are hierarchies that are allowed to loop on themselves randomly) are hard to represent in the database. Hopefully I have proven my point even though I used more than a little hand-waving.
If hierarchies and directed graphs are hard to represent and manipulate in the database, then a few things come to mind:
1. ORMappers can't help because the SQL is not related to schema or metadata and cannot be generated automatically.
2. The database round-trip is not the only thing the slows performance but so does all the weird querying the database has to do to try to assemble flow from a tabular storage format.
3. The only reason to put state machines / workflows into a database is because it is "durable" meaning it stores everything to disk.
Where I used to work, we fought with the database developers who demanded that all these workflows be modeled in stored procedures. This meant that all states and transitions needed to be modeled in tabular form. No Java-based flow or state. Their argument was that the database is persistent and workflows that crash / stop can be repaired and restarted using SQL command-line tools; in Java these broken flows are stuck in memory and must be dropped on the floor. I couldn't fight this easily before Terracotta existed but now I can:
Terracotta is persistent like a database but without the burden of ORM and tables. Entities in the database, I get. Customers and users, products and sales transactions, and other such objects belong in the database. When the database cannot keep up, then we can partition the database or cache that database. Workflow in the database, I don't believe is going to work though.
Just a random thought.
--Ari
Someone finally cleaned up JRuby + Terracotta integration!!!!
OMG OMG OMG.
I have not looked at it in detail but this represents so much to me and here's why:
1. Transparency makes it much easier.
2. One guy can compete with whole product teams based on our stuff.
3. OSS makes it possible. This makes me especially happy
What do you all think?
--Ari
Was reading a back-and-forth between our product manager and some folks from Oracle.
Its interesting because we are 100% sure that Oracle Coherence is not coherent and they are sure they are. Engineers definitely don't like be wrong so I am sure neither will budge here, but it is clear to me that we are talking past each other.
Does Coherence run in an async mode in order to scale? Absolutely not. It runs in a partitioned mode in order to scale. Async != partitioned and so I agree they are not incoherent.
Does Terracotta do something different though? Yes.
Let's go back to the CAP theorem from those wicked-smart Amazon.com guys.
Consistency
Availability
Partitioning
Can't get all 3 at the same time. (I think a few people would debate this and maybe some day someone will find an error in the mathematical proof but for now let's treat this as our foundation.)
According to CAP consistency is orthogonal to partitioning. Does this make sense? Yes. Consistency to me is "keep all nodes who have the data in sync at all times" and partitioning is "keep running even if you can't talk to any other node." Can I have consistency and partitioning? Sure. This is Oracle Coherence.
They split the data amongst JVMs (10 JVMs would each have 1/10th of the data) and then they make sure all data is spread around the JVMs so if I lose one JVM there is a backup of his data. Consistency is then assured by synchronously updating the copy of the data whenever a change comes into one of my JVMs. Kewl.
Terracotta does not split the data. Terracotta is all about Availability and Consistency. Any node can get at any and all data at all times. The ideal situation in Terracotta is to keep work balanced in some way. If you do (say HTTP requests sticky by session, for example) then Terracotta will scale linearly via a weak partitioning scheme (nothing is rigid...anything can move from 1 JVM to another but in ideal circumstances nothing is moving around the cluster). Also, with Terracotta all my writes are sent to the Terracotta array in case the JVM working with my objects dies. That's availability. And consistency for Terracotta means that--going back to the 10 JVM scenario--if all my JVMs happen to have my object resident in memory than whenever I attempt to mutate the object in a particular JVM, I am first guaranteed that anything that happened before I started reading the object in any and all other JVMs will be applied locally. My JVMs won't see stale data.
So, Terracotta is Available + Consistent meaning all shared data is backed up to the Terracotta array and JVMs will stay in synch if they end up sharing objects. Partitioning is weakly guaranteed where, if I partition my workload to my Java cluster, objects will only be resident in JVMs that are actively using those objects. partitioning is done implicitly. Kinda kewl.
In Coherence, developers partition the data by using an Oracle interface. Then the partitions are automatically backed up. Consistency and partitioning with a weaker availability guarantee in that if too many nodes go down, subsets of my data get lost.
I think this is a more accurate, albeit potentially more confusing discussion.
At the end of the day, Terracotta is the only technology capable of keeping an arbitrary number of JVMs consistent across updates. It is built ground up to leverage transparency and field-level updates to be able to provide reasonable scale w/ consistency guarantees. Nothing else I know of can do this.
That being said, if you want to partition manually, then you can get manual control over scalability and that's not necessarily bad. Its just not always necessary to partition deep inside your code in order to get scale.
Which you will prefer will depend on the size of your use case. If you are Amazon and have 10,000 JVMs, you plan to leverage partitioning at any and all levels so you don't need a system with weak / implicit partitioning. If you have 10 or 20 JVMs, I think you might prefer a system that does not require you to partition and instead lets you start with consistency and availability. After all, you can almost always add partitioning as long as your use case makes sense, right?
Think about that one...
update: Now that I think about it, Coherence is not quite focused on partitioning. in CAP, the P means an arbitrary runtime partition comes into existence...not that a developer divides up the data using an interface. If the network were to partition the cluster at runtime, Coherence would take what data surviving nodes can still reach and run with that subset as the working set. Once the network came back, you would have a split brain for some subset of data.
While Terracotta could split the brain, the manifestation would be different. The Terracotta server array would have to partition internally in addition to your application JVMs. Example: if you lose a network switch and 5 of 10 JVMs can reach 1 half of the Terracotta server array and 5 the other half, and the server array internally splits in half the exact same way, then we would keep going. If, however, the server array stays internally intact, no split brain even if clients cannot reach each other. If the server array splits and the clients do not, the clients will inform the array that this is occurring. Only in the scenario when both your cluster and our array split evenly down the middle (exactly in half) will we split.
--Ari
Ok. I thought more about it.
Remember the CAP theorem
Consistency
Availability
Partitioning
Flaw in my thinking is partitioning is all about "surviving a partial network outage".
So, Terracotta is Consistency + Availability with weak partitioning guarantees. Weak means that we will do our best to survive network outages by trying to find a path through the cluster from one half to the other. If we can reach through to the other side from some route, we will then shoot the smaller cluster in the head. If we cannot reach to the other side of a partition till after the network has been restored, we will shoot the smaller cluster in the head at the time of restoring the network. This being said, there are very few scenarios under which we end up partitioned. We do not care of your app nodes can reach each other; this is not a partition to us. We do not care if the stripe groups or partitions in our array cannot reach each other as they share nothing anyways. We only care if our mirror groups inside our array cannot reach each other--this gives a reasonable ability to survive network partitions.
In Coherence they provide Consistency + Availability with weak partitioning guarantees. Their definition of weak is different; simpler on the surface. They let partitions occur and simply merge back together when the network is healed. This is seemingly elegant and simple except "merge" is defined as last change in wins. While last one in semantics works great for WAN-based use cases where traffic is well partitioned in the first place, it doesn't help much for a single app cluster where an internal switch has failed and users are running roughshod all over your cluster, making some updates to one partition and other updates to the other. Your data is pretty much trashed if the system doesn't shoot one half of the cluster in the head. No?
So, if both are C+A what's the difference?
Terracotta's consistency guarantees are for app nodes trying to share data. Take a round robin load balanced web app with HTTP session replication. Terracotta will end up with the session resident in all JVMs and will broadcast deltas to sessions as sessions get updated. This means the session objects are _consistent_ in all JVMs at all times. (Yes it is fully ordered in this degenerate round-robin case.)
Coherence would have lots of trouble in this round-robin scenario because all changes would serialize up and broadcast the session. So they ask you to partition your data. Regardless which Tomcat instance I hit, my session is stored on a fixed node in the cluster and thus I pay higher latency update price to hop servers to my session, but there is no need to try to keep the cluster consistent inside Coherence which would be slow and not scalable.
Don't get me wrong. This is a degenerate case and round-robin will not scale well for either vendor. It just makes the point clear and tangible--the point is that consistency in Terracotta is provided for any an all use cases as designed in the app. Coherence meanwhile will never scale in a P2P broadcast everything everywhere mode and thus falls back to a model where objects get assigned to nodes and stay there.
Partitioning in CAP is all about network partitions. Partitioning in Coherence is about performance (divide up my data so I don't share too much info in too many places). partitioning in Terracotta is about performance as well but we ask you to load balance your traffic so that we can inherit your organic partitioning automatically.
In summary:
C+A == good hard problem to solve and the most valuable use case for most enterprise apps as surviving network partitions is something that quickly degrades to involving human intervention anyways.
C for Terracotta is all about pushing deltas amongst JVMs that are sharing objects
C for Coherence is all about asking the developer to partition the data set so that Coherence can efficiently go back to a fixed node for all updates.
A for Terracotta is all about keeping all data on disk at all times and mirroring data to a 2nd disk or third for backup.
A for Coherence is all about keeping all data on a fixed node plus a buddy in case of node failure. Lose both nodes and the data is lost.
Terracotta consistency is automatic and works in many use cases. Terracotta's availability is a much higher guarantee (on disk at high throughput). And Terracotta's partitioning resilience is easier to live with (Shoot the other node in the head rather than "last one in wins" merge).
I would pick Terracotta's balance of CAP almost every time. But you decide for yourself.
--Ari
In yesterday's panel on designing for scale, I polled the audience:
1. How many know what "eventually correct [or consistent] is" - 3 people
2. How many know when to use EHCache vs. Sleepycat - same 3 people
3. How many know the advantages of EHCache async replication vs. JMS - 2 of 3 people
4. How many know how to make memcache transactional - none
5. How many leverage async, event-driven designs in their apps on a daily basis - 1 person
There were about 60 people in the room.
This tells me that there is a lot of danger in putting a solution in the market and leaving the developer to figure out where to use your engine.
So I wanted to make a short table that you can use as a guide:
1. Memcache - stores key/blob pairs. Off-host caching that gets rid of the impact of caching on your heap. Linearly scalable but operationally very fragile.
2. EHCache - stores object-oriented key/value graphs. Looks like a Map but has more features for evicting data. Also has replication to share data across JVMs. (Ignoring JCS, OSCache, Whirlycache, etc. as they are somewhat variants of this same solution)
3. Sleepycat (Java Edition) - stores key/blob data to disk. Unlike the C-edition, Java is log-forward as opposed to a B-Tree on disk. This makes it very fast at write and slow at lookup. C-Edition is slow to write, fast to lookup.
4. JMS - used across apps to pass requests for data and associated responses. Very good in this use case. Used inside apps to replicate data. Bad at this use case.
5. JGroups - group communications that, unlike JMS, is for intra-app communications. JGroups takes care of nodes coming and going from the cluster, and all sorts of goodness there.
6. Terracotta - network attached memory. Your heap is still your heap. Just build up objects and Terracotta will transparently persist those objects to disk as well as coordinate updates across JVMs using the Java Memory Model.
Consider the following as to when to use each:
1. Is the solution storing to disk
2. Is the solution coherent where all JVMs see a consistent view of objects and you don't have to think about too much or is it "last one in wins"
3. Is the solution serialization-based / blob-based (also known as copy-on-read / copy-on-write)
4. heap-free caching with minimal GC impact
| product | Avoid copy-on-read / write | coherent clustering | storing to disk | minimal GC impact |
|---|---|---|---|---|
| Memcache | x | |||
| EHCache | x | x | ||
| Sleepycat | x | x | ||
| JMS | x | |||
| JGroups | - | |||
| Terracotta | x | x | x |
I view this as telling me that Terracotta gives me a coherent on-disk clustered data management solution. Its advantage over all others is that it scales out well while simultaneously putting everything on disk and keeping my JVMs in sync w/o my having to explciitly lock everything. Its short-coming is that it puts pressure on the heap and requires me to revisit my GC tuning. Most other solutions seem to deal with copy-based semantics and are not clustered or are clustered via "last one in wins" semantics. This means they will go fast, but they have lots of sharp edges for me to cut myself. In this case, cutting myself means I can lose data, I can corrupt data, and users can get very frustrated with my application.
Don't get me wrong. I am not saying, "Terracotta r3wlz and all others dr3wl!" I am saying that async is an awesome approach but no one seemed to know it. Its dangerous and should be approached with care. I would instead recommend coherent clustering via Terracotta and then partitioning / sharding to scale.
(NOTE TO SELF: What's interesting is if I run most of the paid / non-OSS solutions through this framework, they look remarkable like things I can get for free.)
Cheers,
--Ari
This page contains all entries posted to POJO Mojo in November 2008. They are listed from oldest to newest.
September 2008 is the previous archive.
December 2008 is the next archive.
Many more can be found on the main index page or by looking through the archives.