« August 2010 | Main | October 2010 »

September 2010 Archives

September 7, 2010

What is d178021a17b9c8800d8f45da53b3bd04?

More news coming soon!

--Ari

September 14, 2010

OK...so the news is out on BigMemory

BigMemory == d178021a17b9c8800d8f45da53b3bd04

How? that MD5 sum is of the integer number value for 256GB. 256GB is the maximum size of cache we had tested up to the time Greg Luck and I decided to start this little campaign. Since then we have gone past 350GB. But ah well. 350GB is a HUGE cache to load in a single JVM, without any full GC pauses and without any disk I/O (swap / flush to disk), and without any C / C++ hacks or weak reference tricks or what have you.

I will blog more about the value but for now alls [sic] you need to know is that we can load hundreds of gigabytes of cached data into a JVM without affecting GC at all. And, we can do it inside Ehcache and inside Terracotta servers. That's right. Both Ehcache and Terracotta transparently offer off-heap in-process storage for their cache data. Guess what else...that makes our server nearly pauseless.

More later on how to make a tiered storage system between heap, off-heap, local disk, and Terracotta arrays that let's you balance performance vs. offload and scale up vs. scale out!

--Ari

September 15, 2010

BigMemory Explained a bit...

Ok. Its important to explain why we did what we did relative to market need and other vendors' state of the art.

First you need some context. BigMemory is designed to provide what we call a tiered storage engine. Tierd storage looks like this:

What it implies is a system that automatically and transparently works behind the cache to keep data as close to the app as possible. You can have a cache with 350GB of data in it, keeping the most frequently used (calculated at runtime) portion in memory with read / write latency under 1 microsecond. You can keep the remaining 350GB in memory but a bit further away from Java heap, and hidden from the Garbage Collector so that it never causes a pause in the JVM while it sits there resident, and that 350GB can be accessed at 100 microseconds. You have seen this before in the hardware. In that sense it is not new. On-chip cache is faster than RAM and on-chip has multiple levels (L1, L2, L3). And, the caches transparently pull and push data from RAM to the core that needs the data. It works.

In Ehcache there is the disk which you can use to snapshot your 350GB cache locally and avoid having to re-warm the cache if you restart the JVM (with BigMemory running inside it). So BigMemory is in-RAM. It is in-process with the JVM. And it is hidden from the garbage collector--won't be collected or examined at all. Several folks seemed to miss that and a few even asserted that BigMemory writes to disk. We shall ignore their incorrect assumptions and move on. BigMemory is in RAM, in-process and 100% pure Java.

Just remember. if you want 1GB or 10GB or 100GB of BigMemory for Ehcache, what you do is set -Xmx and -Xms to, say, 500MB and then in Ehcache.xml you ask for a 1, 10, or 100GB cache. The operating system process will report as being 1G + 500MB or 10G+500M or 100G + 500M meaning as large as 101GB of physical memory being consumed. But the JVM will only be GCing 500MB! That's right. At 500MB you can pretty much not worry about GC tuning anymore, right? Well, almost. You could have other memory problems in your app other than the data inside Ehcache...I understand that.

Next, you need to remember that BigMemory is a product from Terracotta, just like our server array or Ehcache. BigMemory is a plug-in to everything else we do. In other words, you can get Ehcache with BigMemory plugged in. You can get your Terracotta server array with BigMemory plugged in. And, maybe soon, you can get HTTP session replication with BigMemory--imagine your session store inside your container not bothering your garbage collector at all. Wouldn't that be nice? So do you need to cluster via Terracotta to get BigMemory? No.

As I explained last night to the Java User Group, you could have a 1TB dataset you want to cache and run 1 JVM with a 1TB BigMemory store or 1000 JVMs each with a 1GB BigMemoryStore (still hiding the cache from the collector), or you can run anywhere in between. This is important. I was talking to a huge Telecom company last week and leaked them the news about BigMemory. They have 48GB heaps right now and talked to one of the datagrid vendors and to us about fixing their pause times. The machines they use are Sun X4800's with 256GB of RAM. With 48GB heaps, this Telecom user was seeing Java pause times over 5 minutes. The datagrid guys said to rewrite the entire app to use their grid. Then, run 80 JVMs at 2GB each after tuning for 2 weeks.

We came in and suggested offloading the 48GB of index data into Ehcache on 1 JVM. The customer was intrigued. Surely 1 JVM is better than 80 from a management perspective, they thought. Then they said, "I don't want 1 JVM...its a single point of failure." So we offered them 10GB JVMs...2 of them running Ehcahe with the 48GB dataset stored in a smallish Terracotta array using BigMemory to keep the data all in RAM and faulting in only 10GB of the index into any Ehcache node at any point in time. They went this route.

So now we are on the same page. BigMemory is in-memory and just a hair slower than Java heap. Its in process with the JVM so there is no management complexity and it is pure Java so there is no deployment complexity or sensitivity to JVM version, etc. No tricks. And no significant restrictions--save the fact that it works behind Ehcache, not all Java objects in the heap. Put another way, if it is not in cache, we cannot help yet.

What value is all this?

Interestingly enough, when I talk to Java User Groups, not all people know about 64-bit JVMs and the nightmare these things can cause. Contrary to what other vendors are claiming, most shops such as Unibet, PartyPoker, Expedia, Sabre Holdings, Intercontinental Hotels Group, JP Morgan, Goldman, and more will tell you a 64-bit JVM pauses unpredictably and for minutes at a time. even when a 64 bit JVM is small (<2GB) it takes 30+% more RAM than a 32-bit equivalent JVM running under the same app.

The pause times (stop-the-world) can be minutes and we have even seen hours in some cases. So while our Dell and HP and Cisco UCS machines are shipping with upwards of 16GB of RAM, VMWare and Xen are offering us to chop that ram into 4 or 6 virtual machines where we run smallish 2GB instances of our apps. Its a vicious cycle where the HW is now outpacing the JVM and can provide way more RAM to us than we can effectively use. So we virtualize and run many copies which adds complexity. So we buy more HW to run management, elastic, and other services as we scale out. More HW more HW more HW, all because Java cannot use the RAM that comes in the boxes we can buy off the shelf!

So, the value of BigMemory is:

1. Simplicity...write to Ehcache and flip on the BigMemory switch. It will take care of hiding all your cache data from the GC on its own.

2. Minimal GC tuning...if you have ever tuned GC, you know it is a black art. And the CPU has only a fixed amount of cycles in 1 second with which to do work. The more time you spend in GC, the less you spend on your app logic. People tune and tune and tune for weeks or months to get rid of GC pauses. Last night, I heard from our team that a customer struggling with GC issues for 6 months dropped in BigMemory alpha and the pauses went away. Of course this was an app with a 6GB heap and 4GB+ of that was cached data, but still, the point is there.

3. Density...only with EHcache on BigMemory can you chose to run 1 JVM @ 1TB of cache in RAM or 1000 at 1GB each, all the while not having to tune GC. This means you can buy 32GB of RAM per box and use all that ram with only 1 JVM running on the OS! That's density. That's operational simplicity. That's a capability you haven't had before.

4. Clustered or unclustered. Terracotta--a scaling company--now lets you scale up and out as you see fit. You should no longer wait till you have a big multi-JVM cluster to call upon our technology. If you have Ehcache (as most of you do), you can snap this in and just scale your JVM up until the point of breaking and then scale it out as well. Don't worry, our server array uses BigMemory too which means our cluster won't need any GC tuning, just like your app. Randy Shoup -- distinguished engineer at EBay and all around smartie -- loves to tell people to always weight the trade-offs and do a proper cost analysis of any architectural decision. Don't just cluster because you can. Don't just shard because you ran out of capacity in that server you own. Weigh the cost of new hardware vs. app changes and added complexity. And make sure you weigh all the costs from equipment to development to operations. If you follow Randy's advice you will find BigMemory a critical weapon in your kit. Scaling up can often be the fastest, easiest path to more capacity.

We took BigMemory and made sure it really solves the GC problem in the context of caching inside your JVM. To that end we needed to make sure it made customer applications that we had previously manually tuned run smoothly even when we removed all the delicate GC settings. We needed to make sure it handled heavy read/write scenarios and that our MemoryManager didn't just push the garbage collection problem from JVM-GC to our own. This second point is where we focused much testing effort.

See, as some have rightly questioned, the GC is years-mature now and has to clean up behind apps. Given that our cache is used in read/write scenarios, don't we essentially have to reinvent the GC-wheel? No. Caches are not as powerful as generic object heaps. You can't just store arbitrary data in a cache that is interconnected. I mean, if (k1, v1) is in cache in Ehcache BigMemory, it has no references to (k2, v2). They cannot refer to each other. Which means I don't have to walk the cache looking for cross-references in the data like a GC does. No mark-and-sweep. I merely wait for the dev to call cache.remove( element ) or for the evictor to evict stale data. The evictor and the remove() API are signaling my MemoryManager explicitly just like new() and delete() or malloc() and free(). There is no notion of garbage or collection. Only memory re-use and fragmentation over time.

If I read from the cache 7MM times / second I miss the point by doing a read-only test. 7MM reads per second might as well be 7 trillion. reads don't create any garbage or any challenge for anything but the CPU. But if I write to the cache as fast as I can, replacing (k,v) pairs with other (k,v) pairs of significantly varying sizes, then I create holes all over the heap that have to be reorganized so that I don't end up with a huge, fragmented heap that takes forever to write any objects. This was our test. Write, write write. And change the size of the payload causing different-sized holes in RAM to be managed by our system. And we can do 500,000 writes per second with a 350GB heap and we can do it for 24 hours straight without a single full pause and with the variance in transactions per second essentially ZERO. The duration is important too BTW. ConcurrentMarkSweep collectors can run for hours w/ no full pauses and then suddenly stop the world for 30 minutes trying to catch up to the mess they have made. Don't believe anyone who tells you they routinely run 10GB+ JVMs that never have pause problems. Only once have I seen a 54GB heap with CMS collector with no full pauses and even Sun engineers cannot explain to me how that's possible except if the objects are all of a certain data type (I will spare you the details here).

In Summary

We built BigMemory to help Java catch up to hardware. We got sick of customers asking us to help them max out 32GB machines and forcing them to run 2, 4, or 10 JVMs per machine to do it. We got sick of the GC pausing the JVM in which sensitive clustering code was running and thus causing all sorts of failure scenarios that wouldn't exist in a predictable realtime or near-realtime runtime environment. So we solved the problem for ourselves and our customers; then we realized we were on to something. We realized we didn't have to force people to cluster / scale-out to get more performance and more data into Java apps.

Our customers' apps proved to us (taking months of tuning and being able to delete it all and go back to a vanilla JVM configuration) that we solved the problem. And our benchmark is a well-crafted benchmark that proves we didn't punt the GC problem to another subsystem. It is something no vendor has done before in the distributed caching space--it is a heavy-write benchmark with massively varying object sizes. And, most critically, we built BigMemory to offer people the most important capability we can: scale up or out. It is now your choice.

Forget grids of thousands of machines--unless you are NASA Ames or a Wall St. firm. Avoid the need to scale to 100's of machines and instead run on 1, 2, or as many JVMs as you want. And, buy a Dell R910 with >200GB of RAM like I did and cache your 190GB data table from your MySQL box, as we did in a single JVM. That's the point. Its new. Its different. And its what people needed.

Try it out today: http://www.terracotta.org/bigmemory

September 16, 2010

Pedal to the Metal. How Car Town Grew to One Million Users in Days on EC2

Jeff Barr, the Car Town team, and I are doing a joint webinar. I think you might find it valuable...so here's the description and the link.

October 13, 11:00am PST
Ever wondered how an application can scale from zero to millions of users in a matter of weeks? Cie Games, the makers of Car Town, will tell you how they did it running on Amazon EC2 and using Java and Enterprise Ehcache to scale on-demand. Cie Games' lead-level engineers will share their experience for growing Java apps running on EC2. The discussion will cover RDS, EBS, S3, ELB, Cloudfront, Ehcache, Terracotta, and how to choose between and optimize your use of these powerful scale-out technologies.


The following link will take you straight to the event registration page. There are other events there too if you just go to terracotta.webex.com.

https://terracotta.webex.com

Cheers,

--Ari

About September 2010

This page contains all entries posted to POJO Mojo in September 2010. They are listed from oldest to newest.

August 2010 is the previous archive.

October 2010 is the next archive.

Many more can be found on the main index page or by looking through the archives.

Powered by
Movable Type 3.34