« How to profile locks with Terracotta (stay tuned) | Main | The Book Is Almost Done »

April 17, 2008

Lock profiling across JVMs in a cluster

posted by ari

I've wanted to write this for a while. I finally just sat down and went ahead and wrote it.

First, the concept--visualization tools for building apps. Sun is building the VisualVM. We are building a suite of tools we call the Visualization Tools. Cluster Visualization. Snapshot Visualization. Both runtime GUI graphs and snapshots stored in a DB for offline view via GUI graphs. Anyways, let me explain why visualization is important.

Think back to your Calculus 1A class. (You may have been in high school, college, or maybe grade school if you are mathematically inclined :P.) I took lots of math in my life. But some things stick better than others. My favorite example is integrals / integration. I think we all remember that integrals are conceptually defined as "the area under the curve between two points a and b." But, while I remember that and I can still do integration (and even differential equations) I still need to look up the process for integration by parts. Even worse, I never remember the integral of tan (x) dx. Why is this?

Well, some of think in visual terms and others in textual terms. 3-D vs. 2-D. Essentially, people think in different ways yet integrals are orthogonal to all this. We can remember an integral as the area under a curve as defined by some equation simply because it is a concept. We cannot remember the integral of tan (x) dx because it is a rote-memorized formula. Only those of us who use such equations day to day can remember such things. At least I personally believe this is true. (As an example, my brother has many advanced engineering degrees and he can rattle off all these equations at any time because of deeper use in his day to day.)

In Java there is a ton of coverage in the press about multi-threaded programming and how multi-core is taking us in new uncharted directions. I think what makes multi-threading difficult is not the concepts. Everyone I speak with who codes Java knows what a thread is and what synchronized{} is for sure. They do not know, however how to avoid deadlocks in their design nor how to track them down reliably. It is a familiarity thing more than a conceptual challenge.

This is why we built visualization tools: to take the notion of locking, lock tuning (striping and the like), and deadlocks from an experienced technician's hands to a conceptual thing we can all work with. Let's look at an example:

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.BrokenBarrierException;

class Main {
static AtomicInteger a = new AtomicInteger( 1000 );
static AtomicInteger b = new AtomicInteger( 1 );
static CyclicBarrier barrier = new CyclicBarrier( 2 );

public static void main( String args[ ] ) {
AtomicInteger ref1, ref2;

System.out.println( "Waiting for my partner..." );
try {
barrier.await( );
} catch( BrokenBarrierException e ) {
System.err.println( e.getMessage( ) );
} catch( InterruptedException e ) {
System.err.println( e.getMessage( ) );
}


if( args[ 0 ].equals( "a_first" ) ) {
ref1 = a;
ref2 = b;
} else {
ref1 = b;
ref2 = a;
}

synchronized( ref1 ) {
System.out.println( "ref1 =[" + ref1.getAndIncrement() + "]. Sleeping." );
try {
Thread.sleep( 2000 );
} catch( InterruptedException e ) {
System.err.println(e.getMessage());
}
synchronized( ref2 ) {
System.out.println( "ref2 =[" + ref2.getAndIncrement() + "]. Sleeping." );
}
}
}
}



The Terracotta config to cluster this looks like the following:



<instrumented-classes>
<include>
<class-expression>*..*</class-expression>
</include>
</instrumented-classes>

<roots>
<root>
<field-name>Main.a</field-name>
</root>
<root>
<field-name>Main.b</field-name>
</root>
<root>
<field-name>Main.barrier</field-name>
</root>
</roots>

<locks>
<autolock>
<method-expression>* *..*.*(..)>/method-expression<
</autolock>
</locks>


Now, the CyclicBarrier in there guarantees that when we run Main, 2 JVMs must start up before we go racing for locks. Note that the AtomicIntegers are locked in an order defined on the command line. Passing "a_first" to Main will cause the order of lock acquisition to be:
1. Lock A
2. Lock B
3. Unlock B
4. Unlock A

Whereas passing "b_first" to Main will cause the order to be:
1. Lock B
2. Lock A
3. Unlock A
4. Unlock B

Essentially, we have nested locks in inverted order--a classic and simple deadlock. The Terracotta console will show this to us. Let's go ahead and run the application:

$TC_HOME/bin/dso-java.sh Main a_first in one window
$_TC_HOME/bin/dso-java.sh Main b_first in a second window.

The two JVMs will each print

Waiting for my partner...

Then they will increment one reference and print it before getting trapped in the deadlock.

ref1 =[1000]. Sleeping.

The only way to get the program to print ref2 would be to kill one of the JVMs at this point. Kill both now as we will need to rerun everything with lock profiling enabled inside Terracotta. Now start:

$TC_HOME/bin/admin.sh

Start the lock profiler tool in the admin console. Here's a screen shot in case you can't find how to do it:

LockProfiler.png
(click the image to enlarge)


Make sure you set the stack trace depth to 10, and that you click "start" as shown in the image above. Now run both programs. Come back to this profiling tool and continually hit "refresh" in the lock profiler. I got the following interesting results after a few seconds, and then a minute later. Note that I double-clicked the column named "Average Held Time" to sort by that column. Double click it again if it is sorted min-to-max (you want max-to-min order).

SortingLocks.png
(click the image to enlarge)


Killing the "b_first" JVM and returning to the console to refresh the display showed the 60 second wait time for one thread and vice versa.

BlockingEachOther.png
(click the image to enlarge)

Now I took the lock ID, in this case Terracotta gave it 1001, and I went back to "clients" tab to see the stack traces and find the different paths each JVM had taken to get to this lock. The answer was at Main.java, code lines 31 and 38.

CodeLocations.png
(click the image to enlarge)

Now go back to your editor and look at Main.java. Those 2 lines of code are the lines where the synchronized{} calls are in inverse order at runtime. The tool took me straight to the lines of code that were the problem!!!

Trivial example, true. Powerful tool, nonetheless.

--Ari


Trackback Pings

TrackBack URL for this entry:
http://blog.terracottatech.com/cgi-bin/mt/mt-tb.cgi/51

Comments

It is a shame that the JVM is not obligated to resolve deadlocks for you. Deadlocks in databases normally are not issues to worry about because the database can infer if there are any, or even more simple: when a transaction times out (the consequence of a deadlock) the transaction can be aborted. Although livelocks can be an issue as well after the deadlocks have been 'resolved'.

But having a tool to deal with the limitations of Java is really helpful. Especially when you are working in more than 1 JVM.

Posted by: Peter Veentjer at April 17, 2008 11:32 PM

Nice Job, Ari! I remember talking with you about this many moons ago in the old Wily offices.

Anyway, i wrote up a response.

http://theabstracttruth.wordpress.com/2008/04/18/jiedistributed-lock-management/

Posted by: bob at April 18, 2008 6:48 AM

Peter,

Well, I agree. But we are hard at work with Sun. Stay tuned!

Posted by: ARI ZILKA at April 22, 2008 10:50 PM

Post a comment




Remember Me?

(you may use HTML tags for style)