September 15, 2010

BigMemory Explained a bit...

posted by ari

OK. It's important to explain why we did what we did relative to market need and other vendors' state of the art.

First you need some context. BigMemory is designed to provide what we call a tiered storage engine. Tiered storage looks like this:

What it implies is a system that automatically and transparently works behind the cache to keep data as close to the app as possible. You can have a cache with 350GB of data in it, keeping the most frequently used portion (calculated at runtime) in memory with read/write latency under 1 microsecond. You can keep the rest of that 350GB in memory but a bit further away from the Java heap, hidden from the garbage collector so that it never causes a pause in the JVM while it sits there resident, and that data can be accessed in 100 microseconds. You have seen this before in hardware. In that sense it is not new. On-chip cache is faster than RAM, and on-chip cache has multiple levels (L1, L2, L3). And the caches transparently pull and push data between RAM and the core that needs the data. It works.
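To make the tiering idea concrete, here is a minimal sketch in Java. It is not BigMemory's implementation: the class TieredStoreSketch and its methods are hypothetical, and it assumes String keys and values, a plain LRU policy for the hot tier, and one direct ByteBuffer per demoted value (a real store would manage a shared off-heap region instead).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical two-tier store: a small on-heap LRU tier in front of an off-heap tier. */
final class TieredStoreSketch {
    private final int hotCapacity;
    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, String> hot = new LinkedHashMap<>(16, 0.75f, true);
    // Off-heap tier: values live in direct buffers, outside the garbage-collected heap.
    // (In this sketch the keys and buffer handles are still ordinary heap objects.)
    private final Map<String, ByteBuffer> offHeap = new HashMap<>();

    TieredStoreSketch(int hotCapacity) {
        this.hotCapacity = hotCapacity;
    }

    void put(String key, String value) {
        offHeap.remove(key);          // drop any stale off-heap copy
        hot.put(key, value);
        demoteIfNeeded();
    }

    String get(String key) {
        String value = hot.get(key);  // fast path: on-heap hit
        if (value != null) {
            return value;
        }
        ByteBuffer buf = offHeap.remove(key);
        if (buf == null) {
            return null;              // not cached in either tier
        }
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        value = new String(bytes, StandardCharsets.UTF_8);
        hot.put(key, value);          // promote back to the hot tier
        demoteIfNeeded();
        return value;
    }

    /** Moves least recently used entries off-heap until the hot tier fits its capacity. */
    private void demoteIfNeeded() {
        while (hot.size() > hotCapacity) {
            Iterator<Map.Entry<String, String>> it = hot.entrySet().iterator();
            Map.Entry<String, String> eldest = it.next();
            byte[] bytes = eldest.getValue().getBytes(StandardCharsets.UTF_8);
            ByteBuffer buf = ByteBuffer.allocateDirect(bytes.length);
            buf.put(bytes);
            buf.flip();
            offHeap.put(eldest.getKey(), buf);
            it.remove();
        }
    }

    boolean isHot(String key) { return hot.containsKey(key); }

    boolean isOffHeap(String key) { return offHeap.containsKey(key); }
}
```

The caller only ever sees put() and get(); which tier a value sits in is decided at runtime by how recently it was used, which is the "automatic and transparent" part.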

In Ehcache there is also the disk, which you can use to snapshot your 350GB cache locally and avoid having to re-warm the cache if you restart the JVM (with BigMemory running inside it). BigMemory itself, though, is in RAM. It is in-process with the JVM. And it is hidden from the garbage collector--it won't be collected or examined at all. Several folks seemed to miss that and a few even asserted that BigMemory writes to disk. We shall ignore their incorrect assumptions and move on. BigMemory is in RAM, in-process and 100% pure Java.

Just remember: if you want 1GB or 10GB or 100GB of BigMemory for Ehcache, what you do is set -Xmx and -Xms to, say, 500MB and then in ehcache.xml you ask for a 1, 10, or 100GB cache. The operating system will report the process as being 1GB + 500MB, 10GB + 500MB, or 100GB + 500MB, meaning as much as 100.5GB of physical memory being consumed. But the JVM will only be GCing 500MB! That's right. At 500MB you can pretty much not worry about GC tuning anymore, right? Well, almost. You could have memory problems in your app other than the data inside Ehcache...I understand that.
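If you want to see the "small heap, big process" effect for yourself, the sketch below is one way to look at it. It does not use Ehcache at all; it only assumes that off-heap data ends up in direct buffers (java.nio.ByteBuffer.allocateDirect) and that your JVM exposes a buffer pool named "direct" through BufferPoolMXBean (available since Java 7). The class name OffHeapFootprintSketch is made up for illustration.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

/** Hypothetical demo: direct buffers add to the process footprint but not to the GC-managed heap. */
final class OffHeapFootprintSketch {
    private static final long MB = 1024L * 1024L;

    /** Bytes currently held by direct buffers as reported by the JVM, or -1 if no such pool is exposed. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Example launch (sizes are arbitrary):
        //   java -Xms500m -Xmx500m -XX:MaxDirectMemorySize=1g OffHeapFootprintSketch
        long before = directBytesUsed();
        ByteBuffer cacheArea = ByteBuffer.allocateDirect((int) (64 * MB));
        long after = directBytesUsed();

        // The heap ceiling stays at whatever -Xmx says; the direct allocation is accounted separately.
        System.out.println("max heap the GC manages: " + Runtime.getRuntime().maxMemory() / MB + " MB");
        System.out.println("direct memory added:     " + (after - before) / MB + " MB");
        System.out.println("buffer capacity:         " + cacheArea.capacity() / MB + " MB");
    }
}
```

The operating system sees the sum of the two numbers (plus JVM overhead), while the collector only ever has to deal with the first one.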

Next, you need to remember that BigMemory is a product from Terracotta, just like our server array or Ehcache. BigMemory is a plug-in to everything else we do. In other words, you can get Ehcache with BigMemory plugged in. You can get your Terracotta server array with BigMemory plugged in. And, maybe soon, you can get HTTP session replication with BigMemory--imagine your session store inside your container not bothering your garbage collector at all. Wouldn't that be nice? So do you need to cluster via Terracotta to get BigMemory? No.

As I explained last night to the Java User Group, you could have a 1TB dataset you want to cache and run 1 JVM with a 1TB BigMemory store or 1000 JVMs each with a 1GB BigMemory store (still hiding the cache from the collector), or you can run anywhere in between. This is important. I was talking to a huge Telecom company last week and leaked them the news about BigMemory. They have 48GB heaps right now and talked to one of the datagrid vendors and to us about fixing their pause times. The machines they use are Sun X4800s with 256GB of RAM. With 48GB heaps, this Telecom user was seeing Java pause times over 5 minutes. The datagrid guys said to rewrite the entire app to use their grid. Then, run 80 JVMs at 2GB each after tuning for 2 weeks.

We came in and suggested offloading the 48GB of index data into Ehcache on 1 JVM. The customer was intrigued. Surely 1 JVM is better than 80 from a management perspective, they thought. Then they said, "I don't want 1 JVM...it's a single point of failure." So we offered them 10GB JVMs...2 of them running Ehcache with the 48GB dataset stored in a smallish Terracotta array using BigMemory to keep the data all in RAM and faulting in only 10GB of the index into any Ehcache node at any point in time. They went this route.

So now we are on the same page. BigMemory is in-memory and just a hair slower than Java heap. It's in-process with the JVM so there is no management complexity, and it is pure Java so there is no deployment complexity or sensitivity to JVM version, etc. No tricks. And no significant restrictions--save the fact that it works behind Ehcache, not on all Java objects in the heap. Put another way, if it is not in cache, we cannot help yet.

What value is all this?

Interestingly enough, when I talk to Java User Groups, not all people know about 64-bit JVMs and the nightmare these things can cause. Contrary to what other vendors are claiming, most shops such as Unibet, PartyPoker, Expedia, Sabre Holdings, Intercontinental Hotels Group, JP Morgan, Goldman, and more will tell you a 64-bit JVM pauses unpredictably and for minutes at a time. Even when a 64-bit JVM is small (<2GB) it takes 30+% more RAM than an equivalent 32-bit JVM running the same app.

The pause times (stop-the-world) can be minutes and we have even seen hours in some cases. So while our Dell and HP and Cisco UCS machines are shipping with upwards of 16GB of RAM, VMware and Xen are offering to chop that RAM into 4 or 6 virtual machines where we run smallish 2GB instances of our apps. It's a vicious cycle where the HW is now outpacing the JVM and can provide way more RAM to us than we can effectively use. So we virtualize and run many copies, which adds complexity. So we buy more HW to run management, elastic, and other services as we scale out. More HW more HW more HW, all because Java cannot use the RAM that comes in the boxes we can buy off the shelf!

So, the value of BigMemory is:

1. Simplicity...write to Ehcache and flip on the BigMemory switch. It will take care of hiding all your cache data from the GC on its own.

2. Minimal GC tuning...if you have ever tuned GC, you know it is a black art. And the CPU has only a fixed number of cycles in 1 second with which to do work. The more time you spend in GC, the less you spend on your app logic. People tune and tune and tune for weeks or months to get rid of GC pauses. Last night, I heard from our team that a customer struggling with GC issues for 6 months dropped in BigMemory alpha and the pauses went away. Of course this was an app with a 6GB heap and 4GB+ of that was cached data, but still, the point is there.

3. Density...only with Ehcache on BigMemory can you choose to run 1 JVM @ 1TB of cache in RAM or 1000 at 1GB each, all the while not having to tune GC. This means you can buy 32GB of RAM per box and use all that RAM with only 1 JVM running on the OS! That's density. That's operational simplicity. That's a capability you haven't had before.

4. Clustered or unclustered. Terracotta--a scaling company--now lets you scale up and out as you see fit. You should no longer wait till you have a big multi-JVM cluster to call upon our technology. If you have Ehcache (as most of you do), you can snap this in and just scale your JVM up until the point of breaking and then scale it out as well. Don't worry, our server array uses BigMemory too, which means our cluster won't need any GC tuning, just like your app. Randy Shoup -- distinguished engineer at eBay and all around smartie -- loves to tell people to always weigh the trade-offs and do a proper cost analysis of any architectural decision. Don't just cluster because you can. Don't just shard because you ran out of capacity in that server you own. Weigh the cost of new hardware vs. app changes and added complexity. And make sure you weigh all the costs from equipment to development to operations. If you follow Randy's advice you will find BigMemory a critical weapon in your kit. Scaling up can often be the fastest, easiest path to more capacity.

We took BigMemory and made sure it really solves the GC problem in the context of caching inside your JVM. To that end we needed to make sure it made customer applications that we had previously manually tuned run smoothly even when we removed all the delicate GC settings. We needed to make sure it handled heavy read/write scenarios and that our MemoryManager didn't just push the garbage collection problem from JVM-GC to our own. This second point is where we focused much testing effort.

See, as some have rightly questioned, the GC is years-mature now and has to clean up behind apps. Given that our cache is used in read/write scenarios, don't we essentially have to reinvent the GC-wheel? No. Caches are not as powerful as generic object heaps. You can't just store arbitrary data in a cache that is interconnected. I mean, if (k1, v1) is in cache in Ehcache BigMemory, it has no references to (k2, v2). They cannot refer to each other. Which means I don't have to walk the cache looking for cross-references in the data like a GC does. No mark-and-sweep. I merely wait for the dev to call cache.remove( element ) or for the evictor to evict stale data. The evictor and the remove() API are signaling my MemoryManager explicitly just like new() and delete() or malloc() and free(). There is no notion of garbage or collection. Only memory re-use and fragmentation over time.
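For readers who want to picture what "explicit, like malloc() and free()" means in pure Java, here is a toy sketch. It is emphatically not our MemoryManager: the class ExplicitSlabSketch is hypothetical, and it assumes a single direct ByteBuffer as the slab, a first-fit policy, and merging of adjacent free regions as the only defense against fragmentation.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical explicit allocator over one off-heap slab: no marking, no sweeping. */
final class ExplicitSlabSketch {
    private final ByteBuffer slab;
    // Free regions, offset -> length, kept in address order so neighbours can be merged.
    private final TreeMap<Integer, Integer> free = new TreeMap<>();
    // Live regions, offset -> length, so free(offset) knows how much to give back.
    private final Map<Integer, Integer> live = new HashMap<>();

    ExplicitSlabSketch(int capacity) {
        slab = ByteBuffer.allocateDirect(capacity);
        free.put(0, capacity);
    }

    /** First-fit allocation; returns the offset, or -1 if no single free region is big enough. */
    int allocate(int size) {
        if (size <= 0) throw new IllegalArgumentException("size must be positive");
        for (Map.Entry<Integer, Integer> e : free.entrySet()) {
            int off = e.getKey();
            int len = e.getValue();
            if (len >= size) {
                free.remove(off);
                if (len > size) free.put(off + size, len - size); // keep the unused tail free
                live.put(off, size);
                return off;                                       // return before iterating further
            }
        }
        return -1;
    }

    /** Called on cache.remove() or eviction: returns the region and merges it with free neighbours. */
    void free(int offset) {
        Integer len = live.remove(offset);
        if (len == null) throw new IllegalArgumentException("not allocated: " + offset);
        int start = offset;
        int length = len;
        Map.Entry<Integer, Integer> prev = free.lowerEntry(offset);
        if (prev != null && prev.getKey() + prev.getValue() == offset) {
            start = prev.getKey();                                // merge with region just before
            length += prev.getValue();
            free.remove(prev.getKey());
        }
        Integer nextLen = free.remove(offset + len);              // merge with region just after
        if (nextLen != null) length += nextLen;
        free.put(start, length);
    }

    /** Copies a serialized value into a region previously returned by allocate(). */
    void write(int offset, byte[] data) {
        Integer len = live.get(offset);
        if (len == null || data.length > len) throw new IllegalArgumentException("bad region");
        ByteBuffer view = slab.duplicate();
        view.position(offset);
        view.put(data);
    }

    int freeBytes() {
        int total = 0;
        for (int len : free.values()) total += len;
        return total;
    }

    int largestFreeRegion() {
        int max = 0;
        for (int len : free.values()) max = Math.max(max, len);
        return max;
    }
}
```

Notice the gap between freeBytes() and largestFreeRegion(): when the first is large and the second is small, the slab is fragmented, and that is the problem a real manager has to keep under control under heavy writes.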

If I read from the cache 7MM times / second I miss the point by doing a read-only test. 7MM reads per second might as well be 7 trillion. Reads don't create any garbage or any challenge for anything but the CPU. But if I write to the cache as fast as I can, replacing (k,v) pairs with other (k,v) pairs of significantly varying sizes, then I create holes all over the heap that have to be reorganized so that I don't end up with a huge, fragmented heap that takes forever to write any objects. This was our test. Write, write, write. And change the size of the payload, causing different-sized holes in RAM to be managed by our system. And we can do 500,000 writes per second with a 350GB heap, and we can do it for 24 hours straight without a single full pause and with the variance in transactions per second essentially ZERO. The duration is important too BTW. ConcurrentMarkSweep collectors can run for hours w/ no full pauses and then suddenly stop the world for 30 minutes trying to catch up to the mess they have made. Don't believe anyone who tells you they routinely run 10GB+ JVMs that never have pause problems. Only once have I seen a 54GB heap with the CMS collector with no full pauses, and even Sun engineers cannot explain to me how that's possible except if the objects are all of a certain data type (I will spare you the details here).
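The shape of that workload is easy to sketch, even if the scale is not. The code below is not our benchmark: WriteChurnSketch is a hypothetical harness, the 16-byte to 64KB size range and the log-uniform size distribution are assumptions chosen only to get widely varying payloads, and the store under test is whatever you plug in as a BiConsumer (a plain on-heap map in main).

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

/** Hypothetical write-heavy harness: overwrite keys with payloads of widely varying size. */
final class WriteChurnSketch {

    /** Picks a size spread across several orders of magnitude (log-uniform between the bounds). */
    static int nextPayloadSize(Random rnd, int minBytes, int maxBytes) {
        double lo = Math.log(minBytes);
        double hi = Math.log(maxBytes);
        int size = (int) Math.round(Math.exp(lo + rnd.nextDouble() * (hi - lo)));
        return Math.max(minBytes, Math.min(maxBytes, size));
    }

    /** Runs the churn and returns the measured writes per second for each interval. */
    static double[] run(BiConsumer<Integer, byte[]> store, int keys,
                        int intervals, int writesPerInterval, long seed) {
        Random rnd = new Random(seed);
        double[] rates = new double[intervals];
        for (int i = 0; i < intervals; i++) {
            long start = System.nanoTime();
            for (int w = 0; w < writesPerInterval; w++) {
                // Replacing an existing value with a different-sized one is what leaves holes behind.
                store.accept(rnd.nextInt(keys), new byte[nextPayloadSize(rnd, 16, 64 * 1024)]);
            }
            long elapsed = Math.max(1, System.nanoTime() - start);
            rates[i] = writesPerInterval * 1e9 / elapsed;
        }
        return rates;
    }

    /** (max - min) / mean of the per-interval rates; closer to zero means steadier throughput. */
    static double relativeSpread(double[] rates) {
        double min = Double.MAX_VALUE, max = 0, sum = 0;
        for (double r : rates) {
            min = Math.min(min, r);
            max = Math.max(max, r);
            sum += r;
        }
        return (max - min) / (sum / rates.length);
    }

    public static void main(String[] args) {
        Map<Integer, byte[]> onHeap = new ConcurrentHashMap<>();
        double[] rates = run(onHeap::put, 1_000, 10, 50_000, 42L);
        for (double r : rates) System.out.printf("%.0f writes/s%n", r);
        System.out.printf("relative spread: %.3f%n", relativeSpread(rates));
    }
}
```

What matters is less the peak number than how flat the per-interval rates stay over a long run; a store that is quietly building up fragmentation or deferred collection work will show it as a growing spread.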

In Summary

We built BigMemory to help Java catch up to hardware. We got sick of customers asking us to help them max out 32GB machines and forcing them to run 2, 4, or 10 JVMs per machine to do it. We got sick of the GC pausing the JVM in which sensitive clustering code was running and thus causing all sorts of failure scenarios that wouldn't exist in a predictable realtime or near-realtime runtime environment. So we solved the problem for ourselves and our customers; then we realized we were on to something. We realized we didn't have to force people to cluster / scale-out to get more performance and more data into Java apps.

Our customers' apps proved to us (taking months of tuning and being able to delete it all and go back to a vanilla JVM configuration) that we solved the problem. And our benchmark is a well-crafted benchmark that proves we didn't punt the GC problem to another subsystem. It is something no vendor has done before in the distributed caching space--it is a heavy-write benchmark with massively varying object sizes. And, most critically, we built BigMemory to offer people the most important capability we can: scale up or out. It is now your choice.

Forget grids of thousands of machines--unless you are NASA Ames or a Wall St. firm. Avoid the need to scale to 100's of machines and instead run on 1, 2, or as many JVMs as you want. And buy a Dell R910 with >200GB of RAM like I did and cache your 190GB data table from your MySQL box in a single JVM, as we did. That's the point. It's new. It's different. And it's what people needed.

Try it out today: http://www.terracotta.org/bigmemory

Comments

Speechless!! Nice work guyz... It was nice that you took the time to explain things. I really hope other problems plaguing the Java Platform will also be addressed.

Posted by: Dayo at September 15, 2010 10:53 AM

Hey Ari,

So BigMemory only works with Ehcache, and not just pure TC pojo clustering? Is that coming someday?

Posted by: David Hooker at September 15, 2010 12:56 PM

I definitely buy this argument. I actually think that the JVM should consider including manual memory cleanup support rather than try to build a better GC. I'd say this avoidance of "1 approach fits all" is what good architecture and design is all about.

Posted by: Simon Reavely at September 15, 2010 4:48 PM

Very interesting and incredibly overdue. The java heap & GC limitations have been a joke compared to what the hardware can provide.

I'm especially interested in seeing this and Azul working together. Azul is tackling the issue from the foundation by modifying the JVM and in some cases the OS which is great for non-cache heap. I believe they are contributing some of their JVM work back to the OpenJDK project too.

Ari - maybe you could take a minute to compare/contrast the approaches.

Great work!

Posted by: Austin at September 16, 2010 5:59 AM

Just curious if this problem is specific to the JVM? Does the .NET CLR suffer from the same problem you describe? If so, how do they get around it?

Thanks.

Posted by: Kris at September 16, 2010 10:35 AM

Hi,

Sounds very interesting and apropos to much of the tuning work I've been engaged on lately.

I'm curious as to how you "hide" the cache memory from the GC and still be "100% Java". I don't want you to reveal any proprietary secrets, but consultants working on configs with this technology will need to know how it affects the heap concepts with which we are familiar, since presumably this will not erase all need for heap tuning. Many objects just are not appropriate for a cache.

Posted by: Channing Benson at September 16, 2010 11:11 AM

Well I just discovered the BigMemory today, that's a nice approach to the problem.

@Kris
Well if it's pure java, there's not many things that will trick the gc to not look into this data. And as pointed out, relations between keys are not managed at this point in time.

So it might be a huge byte array, where you store your binary serialized objects. And the huge difficulty here is to manage this binary array, to allow distribution, to avoid the fragmentation, etc.

It's more of a hack, but a nifty one.

Posted by: Brice at September 17, 2010 6:46 AM

@Brice

This is still problematic because with the generational garbage collector everything gets allocated out of Eden and eventually needs to be moved to tenured space. It's not clear to me how some fragmentation (at least initially) would be avoided. As you note this would be a "huge difficulty".

I suppose I should sign on to be a beta site and at least see how it seems to affect the "regular" heap.

Posted by: Channing Benson at September 17, 2010 8:26 PM

Silly me. I should've read a little further and learned about maxdirectmemorysize. That would seem to be the direction in which to look.

At any rate, this seems eminently cool.

Posted by: Channing Benson at September 17, 2010 8:52 PM

@Channing Benson

This one is on me, I should have said "byte buffer" instead of "byte array".

I learned a few things about the allocation of native memory when reading a javacore dump from the J9 JVM.

And as you pointed out, -XX:MaxDirectMemorySize affects the memory available to the call java.nio.ByteBuffer.allocateDirect(int).

However, I'm not familiar with NIO at all, so I'm not really sure how it can be done effectively. I just know that this memory is not managed by the GC, which can lead to OOM.

Posted by: Brice at September 20, 2010 8:49 AM

In past years, I spent a lot of time on GC / memory tuning and I agree with some of your points. Tuning GC really is like a black art.

In my case, most garbage in server applications comes from temporary objects allocated inside methods.

Take a look at Double.parseDouble(String): it will create many String objects (for string parsing) which become garbage once the method returns. So-called short-lived objects.

So, I cannot understand how BigMemory works, since a memory footprint is always a by-product (one that you have to clear, otherwise OOME) as long as your application operates.

Please advise.

Ken

Posted by: dragonken at September 22, 2010 12:21 AM

They might have implemented that cache in C++ and called it through the Java Native Interface ... you can disable GC in native code AFAIK ( correct me if I am wrong ) ...

but what is OFF HEAP ... what part of memory is it then ?? any idea :S

Posted by: Asad Abbas at October 8, 2010 12:08 PM

@Ken
Objects on the stack are not garbage collected; reclaiming them is automatic. The JVM can optimize new allocations: the object is put either on the heap or on the stack. That's the purpose of escape analysis.

You are right that a memory footprint is always a by-product, but with BigMemory, this footprint is only related to your code's memory footprint, not to the cache.

@Asad
No, they said it was 100% pure Java. You seem to be mixing up native memory with JNI (aka native code), and that's a different story.

Native memory is native: it means that it's the system that manages this memory segment or buffer. You indicate to the JVM process how much memory the system should give you (if possible). The JVM just allows you to access it in your Java code. As the system manages this memory, the GC threads NEVER run on this memory segment. But OOMEs are still possible in this memory zone; ensuring the memory does not fill up is probably one of the tasks BigMemory needs to do.

By the way, the memory performance is explained by the simple fact that native memory is managed by the system.

Posted by: Brice at October 18, 2010 7:58 AM