July 6, 2007
Network Attached Memory: concurrency and performance tuning
posted by ari
Testing concurrency just proves more and more challenging the deeper into it you get. I wanted to pick up my head and show everyone a few (mostly well known) tuning and testing tricks with Terracotta. Several are simple:
1. Make sure your server heap is not so constrained that it is stuck in GC a lot. I have a bunch of RAM in my machine so I just set -Xmx and -Xms to 1 gig. Out of the box, Terracotta doesn't specify heap settings though, so you may want to tune this before pounding on Terracotta. Also try tuning GC behavior (young generation size, collector algorithm, etc.).
2. Pay attention to your CPU. What do iostat / vmstat report for user / system / idle CPU? If you are 0% idle, you are no longer measuring what you think you are. You are now measuring the OS implementation of threading more so than Terracotta. Also, if system utilization (or I/O wait) is high, something is wrong with your disks or network or both. I ran a simple tight-looped locking test two ways: with my JVM and my Terracotta server together on the same machine, and with the two separated across Gbit networking. I found significant improvement over the network. This is because, on the shared machine, my CPU / OS were busy dancing with each other running code other than my test app on the processor. If you look back one blog entry, I asserted this stuff is hard. I purposely designed this tight-looped test and concluded an average latency of 330 microseconds. Closer measurement reveals that best-case latency looks more like 24 microseconds for a single 8 byte update sent to Terracotta over the network.
3. Batching / windowing / L1 / L2 concepts. Turns out, if I set a field to the same value once or n times with Terracotta in the mix, that all gets compressed down to the last field-level change (no sense in sending interim changes when last one in wins). So I changed the test to change a set of fields on a set of objects. And I still get optimizations happening underneath. With Terracotta, my JVM has in-VM caching and is allowed to trust its data in the native Java heap if it at all can. For example, a cluster of 1 node will always operate against its local heap for reading field values because there is no other JVM making changes. Terracotta is still being notified of writes to object fields just in case my JVM dies and I want to be able to restart it, having it pick up where it left off. Terracotta terminology would call my JVM's cache the L1 and the Terracotta Server the L2. It's sort of like network attached memory on a giant network / 3-tier motherboard. Terracotta keeps all my JVM memory in sync and makes sure there are no races / inconsistencies across JVMs. (Think of my JVM as a processor and it sort of makes sense in an SMP context.) The short of it: latency is low because the L1 is working on its own heap and batching up changes to send to the L2. In fact, it proves quite difficult to even come up with a way to test latency of sending field updates to Terracotta without ending up testing lots of subsystems you didn't mean to.
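To make the "last one in wins" idea from point 3 concrete, here is a minimal sketch of a change buffer that keeps only the latest value per field and hands back one batch on flush. This is not Terracotta's implementation; the ChangeBuffer class, its method names, and the "objectId.field" key format are all hypothetical, made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical change buffer: illustrates coalescing only, not Terracotta internals. */
class ChangeBuffer {
    // Key is "objectId.fieldName"; only the latest value per key is kept.
    private final Map<String, Object> pending = new LinkedHashMap<>();
    private int writesSeen = 0;

    /** Record a field write; a later write to the same field replaces the earlier one. */
    void recordWrite(String objectId, String field, Object value) {
        writesSeen++;
        pending.put(objectId + "." + field, value);
    }

    /** Return the coalesced batch and reset the buffer, the way a flush to the L2 might. */
    Map<String, Object> flush() {
        Map<String, Object> batch = new LinkedHashMap<>(pending);
        pending.clear();
        return batch;
    }

    int writesSeen() {
        return writesSeen;
    }

    public static void main(String[] args) {
        ChangeBuffer buffer = new ChangeBuffer();
        // 1000 writes to the same field collapse into a single pending change.
        for (int i = 1; i <= 1000; i++) {
            buffer.recordWrite("product-0", "price", i * 1.0);
        }
        buffer.recordWrite("product-1", "price", 9.99);
        Map<String, Object> batch = buffer.flush();
        System.out.println(buffer.writesSeen() + " writes, " + batch.size() + " changes sent");
    }
}
```

With 1001 writes across two fields, the flushed batch holds just two entries, which is why hammering one field in a tight loop tells you little about per-update network latency.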
Set all these concepts aside and focus for a moment on the programming model. I really enjoyed 2 things. First, I wanted to coordinate 2 JVMs to get into a method, right before a for-loop and then start into the for-loop together. I did this with a barrier:
private CyclicBarrier startBarrier;
...
this.startBarrier = new CyclicBarrier( 2 );
...
try {
    startBarrier.barrier();
} catch( InterruptedException ie ) {}
If you think about it, that is pretty kewl. A CyclicBarrier (OSWEGO construct but you could use 1.5 java.util.concurrent as well) gets instantiated and the argument to the constructor (2 in this case) means wait for 2 threads to hit this barrier. Flagging that barrier as clustered in Terracotta takes 2 lines of config:
<root>
<field-name>demo.inventory.Main.startBarrier</field-name>
</root>
<include>
<class-expression>EDU.oswego.cs.dl.util.concurrent..*</class-expression>
</include>
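For anyone on Java 5 or later, the same start-together pattern can be sketched with java.util.concurrent.CyclicBarrier, where the call is await() rather than barrier(). The version below runs two threads inside a single JVM, so it only shows the barrier semantics, not the clustering. The BarrierStart class and runTogether method are hypothetical names, and the lambda needs Java 8 or later.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

class BarrierStart {
    /** Starts `parties` threads that meet at one barrier, then loop; returns total iterations run. */
    static int runTogether(int parties, int iterations) {
        CyclicBarrier startBarrier = new CyclicBarrier(parties);
        AtomicInteger total = new AtomicInteger();
        Thread[] workers = new Thread[parties];
        for (int t = 0; t < parties; t++) {
            workers[t] = new Thread(() -> {
                try {
                    startBarrier.await(); // blocks until all parties have arrived
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                } catch (BrokenBarrierException e) {
                    return; // another party was interrupted while waiting
                }
                // Every worker enters this loop only after the barrier trips.
                for (int i = 0; i < iterations; i++) {
                    total.incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(runTogether(2, 100_000));
    }
}
```

Unlike the empty catch block in the snippet above, this sketch restores the interrupt flag and bails out, which is usually the safer default when a thread never makes it past the barrier.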
Ok, so now I have 2 JVMs coming to the same point in the code (the try{} block above) and then racing into the for-loop and iterating at the exact same time. Very kewl. But what about locking and object identity? Check this out. There is nothing but Java objects with no serialization, no get() / put(), and some simple synchronization. BTW, one of my pet peeves is get() / put() caches. They are sort of a performance side-effect on my domain and I don't like them. A few problems exist. What if I forget to put an object back in cache after mutating it? What if I forget to get() the object from cache and trust a stale reference? For example:
Customer c = new Customer( );
int id = 1234;
c.load( id );
// side effect of performance
customerCache.put( c );
// later on I don't want to have to do this...
c = customerCache.get( id );
c.addBillingAddress( billingAddr );
// ...just to set a new field
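The forgotten put() is easy to reproduce. Here is a minimal sketch, assuming a cache that copies objects on the way in and out the way a serializing cache would. CopyingCache and this stripped-down Customer are hypothetical stand-ins, not any real cache API.

```java
import java.util.HashMap;
import java.util.Map;

class CopyingCacheDemo {
    static class Customer {
        final int id;
        String billingAddress;

        Customer(int id, String billingAddress) {
            this.id = id;
            this.billingAddress = billingAddress;
        }

        Customer copy() {
            return new Customer(id, billingAddress);
        }
    }

    /** Hypothetical cache that copies on put and get, mimicking a serialization boundary. */
    static class CopyingCache {
        private final Map<Integer, Customer> store = new HashMap<>();

        void put(Customer c) {
            store.put(c.id, c.copy());
        }

        Customer get(int id) {
            Customer c = store.get(id);
            return c == null ? null : c.copy();
        }
    }

    public static void main(String[] args) {
        CopyingCache cache = new CopyingCache();
        cache.put(new Customer(1234, "old address"));

        Customer c = cache.get(1234);
        c.billingAddress = "new address"; // mutated locally, never put() back
        System.out.println(cache.get(1234).billingAddress); // prints "old address"

        cache.put(c); // the step that is easy to forget
        System.out.println(cache.get(1234).billingAddress); // prints "new address"
    }
}
```

With shared object identity there is no copy to drift out of date, so the mutation and the "publish" are the same line of code.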
In my performance harness I did the following...
for( i = 0; i < 100000; ) {
latency = System.currentTimeMillis();
// Lock on only 1 of the objects so that the 2nd for-loop's changes travel as 1 batch.
// If we were to lock on each object as we change it, we would flush each time through the 2nd loop.
synchronized ( productArray.get( 0 ) ) {
for( int j = 0; j < 10; j++ ) {
d += j;
( ( Product ) productArray.get( j % 4 ) ).setPrice(d);
}
}
latencyVector.add(i, new Long( System.currentTimeMillis() - latency ) );
d = d / ++i; // ++i also advances the outer loop, which has no increment clause of its own
}
The whole power of object identity means 2 key things in the above code. First, I want 2 nodes racing at this for-loop but I don't need to constantly issue get() and put() calls. This is AWESOME. It would otherwise have looked like this (note productArray is just an ArrayList but I would have to change it to a map in this case):
for( int j = 0; j < 10; j++ ) {
d += j;
Product p = (Product) productCache.get( j % 4 );
productCache.lock( p );
p.setPrice(d);
productCache.put( p );
productCache.unlock( p );
}
The other key point though is that Terracotta is acting like Network Attached Memory and I, as a developer, get some control over the size of the update batch going out on the network. How did I do it? The nested for-loop. See, if I just iterated through 1 million times picking a product and changing a field or two, each update becomes its own transaction to NAM (network attached memory) because the Java Memory Model dictates that any change made in a synchronized{} block be available to other threads when they enter the monitor. I wanted to get better than 300 microseconds of latency so I sent 10 changes at a time, 100,000 times. A total of a million changes done in batches of 10 at a time. And I dropped latency to 24 microseconds. Pretty kewl, huh? Think about it. Pure POJO programming and pure POJO performance optimization. Sure, some of my code looks different than it would in the naive implementation, but I am doing network Java, after all.
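The arithmetic behind that batching can be sketched in plain Java. The code below counts how many synchronized blocks are exited for a given batch size, on the assumption (taken from the description above) that each exit corresponds to one transaction going to the server. It runs in a single JVM with no Terracotta involved, and BatchSizeSketch, countFlushes, and this bare-bones Product are hypothetical names. It also assumes totalChanges is a multiple of batchSize.

```java
import java.util.ArrayList;
import java.util.List;

class BatchSizeSketch {
    static class Product {
        private double price;

        void setPrice(double price) {
            this.price = price;
        }

        double getPrice() {
            return price;
        }
    }

    /**
     * Applies totalChanges price updates, batchSize at a time under one lock.
     * Returns the number of synchronized blocks exited; each exit is assumed
     * to stand for one transaction flushed to the server.
     */
    static long countFlushes(long totalChanges, int batchSize) {
        List<Product> products = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            products.add(new Product());
        }
        long flushes = 0;
        double d = 0;
        for (long done = 0; done < totalChanges; done += batchSize) {
            // One lock for the whole batch, as in the harness above.
            synchronized (products.get(0)) {
                for (int j = 0; j < batchSize; j++) {
                    d += j;
                    products.get(j % 4).setPrice(d);
                }
            }
            flushes++;
        }
        return flushes;
    }

    public static void main(String[] args) {
        System.out.println("batch of 1:  " + countFlushes(1_000_000, 1) + " flushes");
        System.out.println("batch of 10: " + countFlushes(1_000_000, 10) + " flushes");
    }
}
```

A million changes one at a time means 1,000,000 flushes, while batches of 10 cut that to 100,000. If you time something like this yourself, System.nanoTime() is generally a better fit than System.currentTimeMillis(), whose resolution is a millisecond at best.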
Short answer: network attached memory rules. It makes programming simpler. Performance can be tuned discreetly. And, I get high scalability and high application availability.
Long answer: VPs, directors, and managers are always beating up on developers to have the features we build run without risk of slowing the system down or taking it down altogether. And, trust me, I have seen some relatively harmless changes cause catastrophic problems in production. Well, network attached memory means I write simpler code. I do not go out to the network by hand to send updates to the cluster. Performance optimizations leave less of an impact on my business logic. And it also means I have total control over what gets clustered and when. It just means I do it with code that I can test at a unit level on my own machine, and I can also rapidly spin up 2, 3 or n nodes (in my own sandbox if I want) to make sure I don't have race conditions. I no longer need a message queue or a database to test my app on multiple machines. I can even have multiple JVMs race starting at the exact same place in the code. Kewl, indeed.