November 21, 2008
QCon Panel: Which clustering to use when
posted by ari
In yesterday's panel on designing for scale, I polled the audience:
1. How many know what "eventually correct" [or consistent] is - 3 people
2. How many know when to use EHCache vs. Sleepycat - same 3 people
3. How many know the advantages of EHCache async replication vs. JMS - 2 of 3 people
4. How many know how to make memcache transactional - none
5. How many leverage async, event-driven designs in their apps on a daily basis - 1 person
There were about 60 people in the room.
This tells me that there is a lot of danger in putting a solution on the market and leaving developers to figure out where to use your engine.
So I wanted to make a short table that you can use as a guide:
1. Memcache - stores key/blob pairs. Off-host caching that gets rid of the impact of caching on your heap. Linearly scalable but operationally very fragile.
2. EHCache - stores object-oriented key/value graphs. Looks like a Map but has more features for evicting data. Also has replication to share data across JVMs. (Ignoring JCS, OSCache, Whirlycache, etc., as they are more or less variants of this same solution.)
3. Sleepycat (Java Edition) - stores key/blob data to disk. Unlike the C edition, the Java edition is log-forward as opposed to a B-Tree on disk. This makes it very fast at writes and slow at lookups. The C edition is slow to write, fast to look up.
4. JMS - used across apps to pass requests for data and associated responses. Very good in this use case. Used inside apps to replicate data. Bad at this use case.
5. JGroups - group communications that, unlike JMS, is for intra-app communications. JGroups takes care of nodes coming and going from the cluster, and all sorts of goodness there.
6. Terracotta - network attached memory. Your heap is still your heap. Just build up objects and Terracotta will transparently persist those objects to disk as well as coordinate updates across JVMs using the Java Memory Model.
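To make the "looks like a Map but has more features for evicting data" description concrete, here is a minimal sketch of a Map with a size-bound LRU eviction policy. It uses only the JDK's `LinkedHashMap`; the class name `LruCacheSketch` is made up for illustration, and this is not the EHCache API, just the general idea of a Map that throws entries away on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: a Map that evicts its least recently used entry
// once it grows past a fixed size. Not EHCache code.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        // The third argument (true) selects access order, so get() counts as "use".
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after every insertion; returning true drops
    // the eldest (least recently used) entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // touch "a", so "b" is now the eldest entry
        cache.put("c", "3"); // exceeds the bound and evicts "b"
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

A real cache adds more on top of this (time-to-live, overflow to disk, replication), but the caller still sees plain put and get.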
Consider the following as to when to use each:
1. Does the solution store to disk?
2. Is the solution coherent, where all JVMs see a consistent view of objects and you don't have to think about it too much, or is it "last one in wins"?
3. Is the solution serialization-based / blob-based (also known as copy-on-read / copy-on-write)?
4. Does it offer heap-free caching with minimal GC impact?
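The copy-on-read / copy-on-write question is easier to see in code. The sketch below is an assumption-laden illustration, not any product's API: `BlobStore` and `Account` are hypothetical names, and the store simply uses JDK serialization to mimic a blob-based engine. The point is that with a blob store, mutating what you read changes only your copy until you explicitly put it back, while with plain heap references the change is visible immediately.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

class CopySemanticsDemo {
    // Hypothetical value type used only for this illustration.
    static class Account implements Serializable {
        private static final long serialVersionUID = 1L;
        int balance;
        Account(int balance) { this.balance = balance; }
    }

    // Hypothetical blob-style store: serializes on put, deserializes on get.
    static class BlobStore {
        private final Map<String, byte[]> blobs = new HashMap<>();

        void put(String key, Serializable value) {
            try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                 ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
                out.flush();
                blobs.put(key, bytes.toByteArray()); // copy-on-write
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        Object get(String key) {
            byte[] blob = blobs.get(key);
            if (blob == null) return null;
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob))) {
                return in.readObject(); // copy-on-read: a fresh object every time
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (ClassNotFoundException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    // Mutates the object read from the blob store, then reads the stored value again.
    static int balanceAfterMutatingBlobCopy() {
        BlobStore store = new BlobStore();
        store.put("acct", new Account(100));
        Account copy = (Account) store.get("acct");
        copy.balance += 50; // changes only the local copy
        return ((Account) store.get("acct")).balance;
    }

    // Same mutation against a plain on-heap Map holding references.
    static int balanceAfterMutatingHeapReference() {
        Map<String, Account> heap = new HashMap<>();
        heap.put("acct", new Account(100));
        heap.get("acct").balance += 50; // mutates the one shared object
        return heap.get("acct").balance;
    }

    public static void main(String[] args) {
        System.out.println(balanceAfterMutatingBlobCopy());      // prints 100
        System.out.println(balanceAfterMutatingHeapReference()); // prints 150
    }
}
```

Neither behavior is wrong. The copy-based store forces you to remember the put, and the reference-based one forces you to think about who else can see the object.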
| product | Avoid copy-on-read / write | coherent clustering | storing to disk | minimal GC impact |
|---|---|---|---|---|
| Memcache | x | | | |
| EHCache | x | x | | |
| Sleepycat | x | x | | |
| JMS | x | | | |
| JGroups | - | | | |
| Terracotta | x | x | x | |
I read this as telling me that Terracotta gives me a coherent, on-disk, clustered data management solution. Its advantage over all the others is that it scales out well while simultaneously putting everything on disk and keeping my JVMs in sync without my having to explicitly lock everything. Its shortcoming is that it puts pressure on the heap and requires me to revisit my GC tuning. Most other solutions seem to deal in copy-based semantics and are either not clustered or clustered via "last one in wins" semantics. This means they will go fast, but they have lots of sharp edges for me to cut myself on. In this case, cutting myself means I can lose data, I can corrupt data, and users can get very frustrated with my application.
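Here is a small, deterministic sketch of how "last one in wins" loses data. Everything in it is hypothetical: `Node` stands in for one JVM with its own replica, and the "replication" is just a method call that overwrites the peer's value. The coherent variant uses an ordinary in-process lock as a stand-in for whatever cluster-wide coordination a coherent product provides; it is not Terracotta code.

```java
import java.util.HashMap;
import java.util.Map;

class LastOneInWinsDemo {
    // Hypothetical stand-in for one JVM holding its own copy of the data.
    static class Node {
        final Map<String, Integer> replica = new HashMap<>();

        // An incoming replication message simply overwrites the local value.
        void applyRemote(String key, int value) {
            replica.put(key, value);
        }
    }

    // Two nodes increment the same counter before either sees the other's update.
    static int lastOneInWins() {
        Node a = new Node();
        Node b = new Node();
        a.replica.put("hits", 10);
        b.replica.put("hits", 10);

        // Both nodes do a local read-modify-write on their own replica.
        int fromA = a.replica.merge("hits", 1, Integer::sum);
        int fromB = b.replica.merge("hits", 1, Integer::sum);

        // Async replication delivers whole values afterwards; each overwrites the other.
        b.applyRemote("hits", fromA);
        a.applyRemote("hits", fromB);
        return a.replica.get("hits");
    }

    // The same two increments against one shared view, each made under a lock.
    static int coherent() {
        Map<String, Integer> shared = new HashMap<>();
        Object lock = new Object();
        shared.put("hits", 10);
        for (int node = 0; node < 2; node++) {
            synchronized (lock) {
                shared.merge("hits", 1, Integer::sum);
            }
        }
        return shared.get("hits");
    }

    public static void main(String[] args) {
        System.out.println(lastOneInWins()); // prints 11: one increment is lost
        System.out.println(coherent());      // prints 12
    }
}
```

Nothing crashes and nothing is logged in the first case; the counter is just quietly wrong, which is the kind of sharp edge I mean.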
Don't get me wrong. I am not saying, "Terracotta r3wlz and all others dr3wl!" I am saying that async is an awesome approach, but hardly anyone in the room seemed to know it. It's dangerous and should be approached with care. I would instead recommend coherent clustering via Terracotta and then partitioning / sharding to scale.
(NOTE TO SELF: What's interesting is that if I run most of the paid / non-OSS solutions through this framework, they look remarkably like things I can get for free.)
Cheers,
--Ari
Comments
Hi,
could you comment on this one please:
4. JMS - used across apps to pass requests for data and associated responses. Very good in this use case. Used inside apps to replicate data. Bad at this use case.
I am referring to an eventually consistent architecture. I have systemA with its own DB, which serves web/WS requests. The updates to the systemA DB are sent via JMS messages (replicating data) to a set of other systems, which take the messages and save the data in their own DBs using a different model. There is, of course, an inconsistency window in which a user who knows the data state in systemA cannot yet use some of the services provided by the other systems (because the async replication has not finished), even though it seems they should be able to.
I'm using non-persistent JMS messages for this replication and I believe it is the best solution for me.
Could you comment please?
Posted by: bodrin at March 13, 2009 4:38 AM