July 2007 Archives

July 6, 2007

Network Attached Memory: concurrency and performance tuning

Testing concurrency just proves more and more challenging the deeper into it you get. I wanted to pick up my head and show everyone a few (mostly well known) tuning and testing tricks with Terracotta. Several are simple:

1. Make sure your server heap is not so constrained that the server is stuck in GC much of the time. I have a bunch of RAM in my machine so I just set -Xmx and -Xms to 1 gig. Out of the box, Terracotta doesn't specify heap settings, though, so you may want to tune this before pounding on Terracotta. Also try tuning GC behavior (young generation size, sweep algorithm, etc.).

2. Pay attention to your CPU. What do iostat / vmstat report for user / system / idle CPU? If you are 0% idle, you are no longer measuring what you think you are. You are now measuring the OS implementation of threading more so than Terracotta. Also, if system utilization (or I/O wait) is high, something is wrong with your disks or network or both. I ran a simple tight-looped locking test two ways: with my JVM and my Terracotta server together on the same machine, and with the two separated across Gbit networking. The networked setup was significantly faster. This is because, on the shared machine, my CPU / OS were busy dancing with each other running code other than my test app on the processor. If you look back one blog entry, I asserted this stuff is hard. I purposely designed this tight-looped test and concluded an average latency of 330 microseconds. Closer measurement reveals that best-case latency looks more like 24 microseconds for a single 8 byte update sent to Terracotta over the network.

3. Batching / windowing / L1 / L2 concepts. Turns out, if I set a field to the same value once or n times with Terracotta in the mix, that all gets compressed down to the last field-level change (no sense in sending interim changes when last one in wins). So I changed the test to change a set of fields on a set of objects. And I still get optimizations happening underneath. With Terracotta, my JVM has in-VM caching and is allowed to trust its data in the native Java heap if it at all can. For example: a cluster of 1 node will always operate against its local heap for reading field values because there is no other JVM making changes. Terracotta is still being notified of writes to object fields just in case my JVM dies and I want to be able to restart it, having it pick up where it left off. Terracotta terminology would call my JVM's cache the L1 and the Terracotta Server the L2. It's sort of like network attached memory on a giant network / 3-tier motherboard. Terracotta keeps all my JVM-memory in sync and makes sure there are no races / inconsistencies across JVMs. (Think of my JVM as a processor and it sort of makes sense in an SMP context.) The short of it: latency is low because the L1 is working on its own heap and batching up changes to send to the L2. In fact, it proves quite difficult to even come up with a way to test latency of sending field updates to Terracotta without ending up testing lots of subsystems you didn't mean to.
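To make the "last one in wins" idea from item 3 concrete, here is a minimal sketch of a change buffer that keeps only the latest write per field until it is flushed. It is a toy model under my own assumptions, not Terracotta's implementation: the ChangeBuffer class, its method names, and the "objectId.field" key format are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model of a client-side (L1) change buffer; not Terracotta code.
class ChangeBuffer {
    // Key is "objectId.fieldName"; a later write to the same field replaces the earlier one.
    private final Map<String, Object> pending = new LinkedHashMap<String, Object>();

    void recordWrite(String objectId, String field, Object value) {
        pending.put(objectId + "." + field, value);
    }

    // Pretend to send the batch to the server (L2); returns how many changes went out.
    int flush() {
        int sent = pending.size();
        pending.clear();
        return sent;
    }

    static int demo() {
        ChangeBuffer buffer = new ChangeBuffer();
        for (int i = 0; i < 1000; i++) {
            // Same field written 1000 times before any flush
            buffer.recordWrite("product-0", "price", Double.valueOf(i));
        }
        buffer.recordWrite("product-1", "price", Double.valueOf(9.99));
        return buffer.flush(); // one surviving change per distinct field
    }
}
```

In this model the 1001 writes collapse to 2 outgoing changes, which is why a test that hammers a single field ends up measuring very little network traffic.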

Set all these concepts aside and focus for a moment on the programming model. I really enjoyed 2 things. First, I wanted to coordinate 2 JVMs to get into a method, right before a for-loop and then start into the for-loop together. I did this with a barrier:


   private CyclicBarrier startBarrier;
...
      this.startBarrier = new CyclicBarrier( 2 );
...
      try {
        startBarrier.barrier();
      } catch( InterruptedException ie ) {}

If you think about it, that is pretty kewl. A CyclicBarrier (OSWEGO construct but you could use 1.5 java.util.concurrent as well) gets instantiated and the argument to the constructor (2 in this case) means wait for 2 threads to hit this barrier. Flagging that barrier as clustered in Terracotta takes 2 lines of config:



<root>
<field-name>demo.inventory.Main.startBarrier</field-name>
</root>
<include>
<class-expression>EDU.oswego.cs.dl.util.concurrent..*</class-expression>
</include>
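For readers on Java 5 or later, the same rendezvous can be sketched with java.util.concurrent.CyclicBarrier, where the call is await() rather than barrier(). This version runs two threads inside one JVM, with no Terracotta involved, purely to show the idiom; the BarrierDemo class and its method name are made up for illustration.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

class BarrierDemo {
    // Starts two workers that meet at a barrier; returns how many got past it.
    static int runTwoWorkers() {
        final CyclicBarrier startBarrier = new CyclicBarrier(2); // wait for 2 parties
        final AtomicInteger passed = new AtomicInteger();

        Runnable worker = new Runnable() {
            public void run() {
                try {
                    startBarrier.await();     // blocks until both threads have arrived
                    passed.incrementAndGet(); // the for-loop would start here
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (BrokenBarrierException e) {
                    // the other party was interrupted or timed out; give up
                }
            }
        };

        Thread a = new Thread(worker);
        Thread b = new Thread(worker);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return passed.get();
    }
}
```

Neither thread increments the counter until both have called await(), so the two workers enter the section after the barrier together.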

Ok, so now I have 2 JVMs coming to the same point in the code (the try{} block above) and then racing into the for-loop and iterating at the exact same time. Very kewl. But what about locking and object identity? Check this out. There is nothing but Java objects with no serialization, no get() / put(), and some simple synchronization. BTW, one of my pet peeves is get() / put() caches. They are sort of a performance side-effect on my domain and I don't like them. A few problems exist: what if I forget to put an object back in the cache after mutating it? What if I forget to get() the object from the cache and trust a stale reference?



Customer c = new Customer( );
int id = 1234;

c.load( id );

//side effect of performance
customerCache.put( c );

//later on I don't want to have to do this...
c = customerCache.get( id );
c.addBillingAddress( billingAddr );
// to set a new field...


In my performance harness I did the following...



for( i = 0; i < 100000; ) {
    latency = System.currentTimeMillis();
    // Lock on only 1 of the objects so that the 2nd for loop's changes travel as 1 batch.
    // If we were to lock on each object as we change it, we would flush each time through the 2nd loop.
    synchronized ( productArray.get( 0 ) ) {
        for( int j = 0; j < 10; j++ ) {
            d += j;
            ( ( Product ) productArray.get( j % 4 ) ).setPrice(d);
        }
    }
    latencyVector.add(i, new Long( System.currentTimeMillis() - latency ) );
    d = d / ++i;
}


The whole power of object identity means 2 key things in the above code. I want 2 nodes racing at this for-loop but I don't need to constantly issue get() and put() calls. This is AWESOME. It would otherwise have looked like this (note productArray is just an ArrayList but I would have to change it to a map in this case):



for( int j = 0; j < 10; j++ ) {
d += j;
Product p = (Product) productCache.get( j % 4 );
productCache.lock( p );
p.setPrice(d);
productCache.put( p );
productCache.unlock( p );
}

The other key point, though, is that Terracotta is acting like Network Attached Memory and I, as a developer, get some control over the size of the update batch going out on the network. How did I do it? The nested for-loop. See, if I just iterated through 1MM times picking a product and changing a field or two, each update becomes its own transaction to NAM (network attached memory) because the Java Memory Model dictates that any change made in a synchronized{} block be available to other threads when they enter the same monitor. I wanted to get better than 300 microseconds of latency so I sent 10 changes at a time, 100,000 times. A total of a million changes done in batches of 10 at a time. And I dropped latency to 24 microseconds. Pretty kewl, huh? Think about it. Pure POJO programming and pure POJO performance optimization. Sure, some of my code looks different than it would in the naive implementation, but I am doing network Java, after all.
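One measurement detail worth a sketch: System.currentTimeMillis() only has millisecond resolution, so per-batch latencies in the tens of microseconds are better captured with System.nanoTime(). The harness below mirrors the batch-of-10 loop under that assumption. It runs in a single JVM with no Terracotta attached, so the numbers it produces reflect local monitor cost only; BatchTimer and PricedItem are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class BatchTimer {
    static class PricedItem {
        private double price;
        synchronized void setPrice(double p) { price = p; }
    }

    // Runs `batches` outer iterations of `batchSize` writes each and
    // returns the elapsed nanoseconds measured for every batch.
    static long[] timeBatches(int batches, int batchSize) {
        List<PricedItem> items = new ArrayList<PricedItem>();
        for (int k = 0; k < 4; k++) {
            items.add(new PricedItem());
        }
        long[] nanos = new long[batches];
        double d = 0;
        for (int i = 0; i < batches; i++) {
            long start = System.nanoTime();
            synchronized (items.get(0)) { // one monitor per batch, as in the post
                for (int j = 0; j < batchSize; j++) {
                    d += j;
                    items.get(j % 4).setPrice(d);
                }
            }
            nanos[i] = System.nanoTime() - start;
            d = d / (i + 1);
        }
        return nanos;
    }

    // Mean latency per batch, in microseconds.
    static double meanMicros(long[] nanos) {
        long total = 0;
        for (long n : nanos) {
            total += n;
        }
        return total / 1000.0 / nanos.length;
    }
}
```

Calling BatchTimer.meanMicros(BatchTimer.timeBatches(100000, 10)) gives a per-batch average that can be compared against a run with a batch size of 1.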

Short answer: network attached memory rules. It makes programming simpler. Performance can be tuned discreetly. And, I get high scalability and high application availability.

Long answer: VPs, directors, and managers are always beating up on developers to have the features we build run without risk of slowing the system down or taking it down altogether. And, trust me, I have seen some relatively harmless changes cause catastrophic problems in production. Well, network attached memory means I write simpler code. I do not go out to the network by hand to send updates to the cluster. Performance optimizations leave less of an impact on my business logic. And it also means I have total control over what gets clustered and when. It just means I do it with code that I can test at a unit level on my own machine, and I can also rapidly spin up 2, 3 or n nodes (on my own sandbox if I want) to make sure I don't have race conditions. I no longer need a message queue or a database to test my app on multiple machines. I can even have multiple JVMs race starting at the exact same place in the code. Kewl, indeed.

July 12, 2007

Since when did the 'P' in POJO come to stand for "Pretend"

POJO seems so prevalent in the past year that I think, even fear, that my grandmother knows what a POJO is. But, then again, I have to wonder if I even know what one is. Seems like recently POJO has come to mean “not EJB”. We as a community need a better POJO definition; one that comes complete with a razor like that of Occam with which we can rapidly conclude what is and is not a POJO. So, bear with me while I set out to define the components of a POJO, an accompanying razor, and a software test harness that only POJOs can work under.

I asked a friend what he thought about the POJO hype and he suggests that it boils down to object identity. He has a long explanation but this excerpt summarizes the need for POJO quite well:

I think the problem lies implicitly in the language. equals is very nearly a blemish on the language.

Here is the problem - how can two separate distinct heap objects be "equal" as in the sense of equals? What does that mean? For example, if a.equals(b) && a != b is true, then we have some kind of mysterious object here. If I call a.set(foo) does that mean that a.equals(b) continues to hold? Or has it now flipped from being true to being false??!? For "true" objects where it continues to hold, I would argue that the programmer expectations were met, but for ones where it does not hold, programmer expectations are horribly broken.

The problem is that a.equals(b) should be symmetric, per the javadocs (b.equals(a) also must hold) but mutations of state are not transitive - thus a.set(foo) or b.set(foo) invalidates the prior "equality" of a and b that are not a == b.

You cannot do this for example:


a.equals(b); // true
c = b;
c == b; // true
c.set(foo);
c == b; // true
a.equals(b); // false!
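The sequence in the excerpt can be run as-is with any mutable class whose equals() compares state. The Money class below is a hypothetical stand-in chosen only to make the point executable.

```java
class Money {
    private int cents;

    Money(int cents) { this.cents = cents; }

    void set(int cents) { this.cents = cents; }

    @Override
    public boolean equals(Object o) {
        return o instanceof Money && ((Money) o).cents == cents;
    }

    @Override
    public int hashCode() { return cents; }

    // Returns { a.equals(b) before, c == b after mutation, a.equals(b) after }.
    static boolean[] demo() {
        Money a = new Money(100);
        Money b = new Money(100);
        boolean equalBefore = a.equals(b); // distinct heap objects, same state

        Money c = b;   // c and b are the same object
        c.set(250);    // mutate through the alias

        boolean sameReference = (c == b);
        boolean equalAfter = a.equals(b);
        return new boolean[] { equalBefore, sameReference, equalAfter };
    }
}
```

Identity (==) survives the mutation while state-based equality does not, which is the asymmetry the excerpt is complaining about.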

1. Components of POJO

From Martin Fowler, according to Wikipedia (http://en.wikipedia.org/wiki/POJO):

"We wondered why people were so against using regular objects in their systems and concluded that it was because simple objects lacked a fancy name. So we gave them one, and it's caught on very nicely." ... As of November 2005, the term "POJO" is mainly used to denote a Java object which does not follow any of the (major) Java object models, conventions, or frameworks such as EJB.

So, here's a list of components of the definition of POJO that should matter. Note that in the rest of this discussion, any reference to "components of the definition" refers to this table.

| Name | Description |
|---|---|
| Object Identity | Simply put, the equality operator must work. |
| Clean Business Interface | No beans with getters and setters. No Serialization. Only interfaces that I choose to define are acceptable here. |
| Proxy-free | Proxies break .equals() and == and, thus, object identity. |
| [de]referencing | Following references, keeping handles to them, etc. is all OK. If we don't have references, and everything is passed by value and treated as values-only, then how truly object oriented can our software be? |
| Annotation-optional | If annotations are used in place of configuration, that’s OK. If annotations are used to complete the functionality of the class and the class ceases to function without aspects, this is not POJO (see http://jonasboner.com/2006/04/24/domain-driven-pointcut-design). |
| System Aspects / Concerns dependency | System concerns are implementations of frameworks that are not simply an abstraction of objects on heap but require resources outside the heap, such as sockets, IPCs, or files, to function. Interestingly enough, there is a recursive aspect to this. If I write a POJO but that POJO depends on a system concern, I might feel the impact of that system concern (serialization of session attributes and all objects in the attribute map, for example). If I inherit from a class that is not a POJO, I am likely not going to be able to code a POJO. If I delegate to a non-POJO, however, I can absolutely remain POJO (dependency injection is an example here). |
| [Im/Ex]plicit Identity fields | Using an identity field to map an object to a store is not POJO. A POJO's identity is already defined in the JVM by that object's reference. Another notion of ID is inherently an indicator of lack of POJO nature. |
| No Manager | Having to get objects from a management context that manages their lifecycle and transparently calls APIs and interfaces on get() / put() is not POJO. |
| Free-typing | No restrictions on data types I can use. If it only works for maps, or doesn't work for arrays because they are special-cased in bytecode, I shouldn't have to know. |
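The "Proxy-free" row can be illustrated with the JDK's own java.lang.reflect.Proxy. In this sketch a forwarding proxy wraps a plain object; Priced, PlainPriced, and ProxyIdentityDemo are hypothetical names, and a real framework's proxies may behave differently from this simple forwarder.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

interface Priced {
    double getPrice();
}

class PlainPriced implements Priced {
    public double getPrice() { return 47.99; }
}

class ProxyIdentityDemo {
    // Wraps the target in a JDK dynamic proxy that forwards every call, including equals().
    static Priced wrap(final Priced target) {
        InvocationHandler forward = new InvocationHandler() {
            public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                return method.invoke(target, args);
            }
        };
        return (Priced) Proxy.newProxyInstance(
                Priced.class.getClassLoader(), new Class<?>[] { Priced.class }, forward);
    }

    // Returns { proxy == target, proxy.equals(target), target.equals(proxy) }.
    static boolean[] demo() {
        Priced target = new PlainPriced();
        Priced proxy = wrap(target);
        return new boolean[] {
            proxy == target,      // two different heap objects
            proxy.equals(target), // forwarded, so this is target.equals(target)
            target.equals(proxy)  // Object.equals compares references
        };
    }
}
```

With this forwarder, == fails and equals() becomes asymmetric: the proxy claims to equal the target, but the target does not claim to equal the proxy.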

2. Accompanying Razor

All these components amount to breaking down POJO into "plain" meaning just the Java language without restrictions on its use, "Java" meaning the Java language, and "object" meaning object-oriented design with pass-by-reference semantics. The simplest razor I could come up with is based on the fact that although all the technical components of the definition of POJO seem overwhelming, they all focus on keeping our Java classes free of dependencies and as modular and well-factored as possible. Basically, it's about simplicity:

In order to be a POJO, a class must support strict object identity by operating directly on heap, cannot operate on system resources, and cannot expose system concerns.

Note that Hibernate fails this test, but that is okay. We want it to. We are not working with plain Java objects but intentionally working with database rows; they happen to be abstracted as Java objects. As for Hibernate's attempt to simplify the database, it absolutely succeeds.

Spring passes the POJO razor due to the fact that dependency injection of POJOs into other POJOs implies that all those objects operate only on heap. When a non-POJO framework such as an O/R-mapper or a clustering library like JGroups or a JMS queue gets injected into an application, the fact that the exposed system concerns spill into my application code cannot be avoided. This is because the framework / library / queue does not operate on heap but on system resources such as sockets. The non-POJO nature comes from the framework and not from dependency injection itself.

Actually, to me both Spring and Hibernate are sort of orthogonal to this discussion in that they are designed to help developers factor complex business applications so as to remove the code smell of databases and scalability / tuning--all system concerns--from our code. They deliver well-factored code in that most of our code looks just like plain old Java and all the dependencies and assumptions are abstracted into a handful of core classes and XML configuration (or annotations). This discussion will cover Spring and Hibernate as compared to a poorly factored sample but it does not seek to pass judgment on those frameworks.

3. Software Test Harness

I think a good harness that can be manipulated to test all the above technologies is our own Inventory demo located in the Terracotta download kit in $TC_HOME/samples/pojo/inventory. (It seems I have been obsessed with it lately, but let's ignore that for now.) The basic construct is a domain model in which I need trees of objects and maps at the same time. The business driver in the demo comes from the real world need to update inventory by SKU (stock keeping unit -- a unique ID for each product a store might sell) but to sell that inventory in multiple departments. In the example, we have a 1 gigabyte flash card that is both in computers and electronics.

A product is defined as follows:


public class Product {
    public double price;
    public final String name;
    public final String sku;

    public Product(String n, double p, String s) {
        name = n;
        price = p;
        sku = s;
    }

    public void setPrice(double p) {
        synchronized (this) {
            price = p;
        }
    }

    public int hashCode() {
        return sku.hashCode();
    }
}

And with POJOs my store's domain model is as follows:

public class Store {
public List departments = new ArrayList();
public Map inventory = new HashMap();
...

In the demo, the Store constructor initializes our tiny little store for testing purposes:

1 public Store() {
2 Product warandpeace = new Product("War and Peace", 7.99, "WRPC");
3 Product tripod = new Product("Camera Tripod", 78.99, "TRPD");
4 Product usbmouse = new Product("USB Mouse", 19.99, "USBM");
5 Product flashram = new Product("1GB FlashRAM card", 47.99, "1GFR");
6
7 Department housewares = new Department("B", "Books", new Product[]{warandpeace});
8 Department photography = new Department("P", "Photography", new Product[]{tripod, flashram});
9 Department computers = new Department("C", "Computers", new Product[]{usbmouse, flashram,});
10
11 departments.add(housewares);
12 departments.add(photography);
13 departments.add(computers);
14
15 inventory.put(warandpeace.sku, warandpeace);
16 inventory.put(tripod.sku, tripod);
17 inventory.put(usbmouse.sku, usbmouse);
18 inventory.put(flashram.sku, flashram);
19 }

Note the reference to "flashram" above on lines 8 and 9. In a real store, this sort of thing happens all the time. For that matter, in real applications this will happen. Now, we want to start up copies of this application because, after all, scaling Java (and PHP, .Net, and most other languages) applications by running them on multiple machines is pretty commonplace now. What needs to happen to make this work with various POJO technologies?

3.A. Serialization approach (any of DB blobs, proprietary clustering, JGroups, RMI, or JMS)

First, we turn everything serializable:


public class Product implements Serializable {
...
public class Store implements Serializable {
...

Now, in the body of our main application code, anywhere we update a product or the store, we need to [de]serialize and [get or] send that change back to our storage / clustering mechanism:


1 private void updatePrice() {
2 Product p = null;
3 {
4 out.println("\nEnter SKU of product to update:");
5 out.print("> ");
6 out.flush();
7 String s = getInput().toUpperCase();
8 p = (Product) store.inventory.get(s);
9 if (p == null) {
10 out.print("[ERR] No such product with SKU '" + s + "'\n");
11 return;
12 }
13 }
14 double d = -1;
15 out.println();
16 do {
17 out.println("Enter new price for '" + p.name + "': ");
18 out.print("> ");
19 out.flush();
20 String s = getInput().toUpperCase();
21 try {
22 d = Double.valueOf(s).doubleValue();
23 }
24 catch (NumberFormatException nfe) {
25 continue;
26 }
27 synchronized (p) {
28 p.setPrice(d);
29 }
30 ;
31 } while (d < 0);
32 out.println("\nPrice updated:");
33 printProduct(p);
34 }

We must change line 8 and lines 28/29. Specifically, line 8 has to change from a map.get() call to a lookup of some sort, perhaps a SQL SELECT query using the String s as a key and retrieving a serialized blob. Or, if we are using proprietary serialization, JGroups, or JMS, we would not have to change line 8. We would instead have some code elsewhere that asynchronously updates our inventory map so that we can trust our local map representation to be as accurate as we need it to be. Lines 28/29 need a SQL UPDATE call or some such code:


27 PreparedStatement select = connection.prepareStatement( "SELECT * FROM INVENTORY_TABLE WHERE PRODUCT_ID = ? FOR UPDATE");
28 try {
29 select.setString(1, p.sku); // lock the row for this product
30 select.execute();
31 } catch( SQLException e ) { }
32 try {
33 ByteArrayOutputStream bos = new ByteArrayOutputStream();
34 ObjectOutputStream oos = new ObjectOutputStream(bos);
35 oos.writeObject(p); // serialize the whole product, not just the changed field
36 oos.close();
37 byte[] buf = bos.toByteArray();
38 PreparedStatement update = connection.prepareStatement( "UPDATE INVENTORY_TABLE SET BLOB=? WHERE PRODUCT_ID=?");
39 update.setBytes(1, buf);
40 update.setString(2, p.sku);
41 update.execute();
42 } catch( ...

Without going into the rest of the gory details, you can see that we added lots of code to snapshot our changes to product back down to storage or snapshot those changes around our cluster. The important thing to note, though, is not just the changes to class Main, which does all the input / output of changes to our domain model, but the changes to the Store design. The store was made up of an ArrayList of departments and a HashMap of inventory (which is a map of products). So the above code does not even work, because we have only updated the product in inventory and ignored the references to it in the ArrayList. (Look back at the code where "flashram" is added to the Store in its constructor, both in "photography" and in "computers.") So flashram breaks: when we update its price using the above code fragment, we would not see any changes in the 2 departments. So, I guess I should now redefine lines 7 - 9 of my Store constructor to not just add product references to the departments but to instead add product.sku (a String), which I can use as a pseudo-reference to look up products by ID / SKU. But this means I have to rewrite all of Main.java to get departments out of the ArrayList and then work with Strings representing product keys that I then go get from the Inventory HashMap. It might look like this (printDepartments is an actual method in Main.java in the sample):


1 private void printDepartments() {
2 out.println("+-----------------------+");
3 out.println("| Inventory Listing by Departments |");
4 out.println("+-----------------------+");
5 out.println();
6 for (Iterator i = store.departments.iterator(); i.hasNext(); ) {
7 Department d = (Department) i.next();
8 out.println("Department: " + d.getName());
9 String[] product_skus = d.getProductKeys();
10 for (int j = 0; j < product_skus.length; j++) {
11 Product nextProduct = (Product) store.inventory.get( product_skus[ j ] );
12 printProduct(nextProduct);
13 }
14 out.println();
15 }
16 }

That works. Good. And I only had to alter lines 9 - 12 to use my Inventory HashMap to look up actual Serializable product references. So, I can definitely make this approach work but it leaves a code smell based on my scalability architecture (proprietary or OSS clustering or database blob storage). And this is clearly not POJO by the razor's definition since without the database, JMS provider, JGroups, etc. that gets wired in to updatePrice() and all my setter methods, I cannot run this application.
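The broken-reference problem described in this section can be reproduced with nothing but java.io serialization. The sketch below snapshots a department list and an inventory map separately, the way two independent blobs or messages would be, and then checks whether they still share the product. SerialProduct and SnapshotDemo are hypothetical names; the real demo classes are not used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SerialProduct implements Serializable {
    private static final long serialVersionUID = 1L;
    final String sku;
    double price;

    SerialProduct(String sku, double price) {
        this.sku = sku;
        this.price = price;
    }
}

class SnapshotDemo {
    // Serialize then deserialize, as a blob store or message bus would.
    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(o);
            oos.close();
            ObjectInputStream ois =
                    new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
            return ois.readObject();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns { shared before, shared after round trip, department saw the update }.
    @SuppressWarnings("unchecked")
    static boolean[] demo() {
        SerialProduct flashram = new SerialProduct("1GFR", 47.99);
        ArrayList<SerialProduct> computers = new ArrayList<SerialProduct>();
        computers.add(flashram);
        HashMap<String, SerialProduct> inventory = new HashMap<String, SerialProduct>();
        inventory.put(flashram.sku, flashram);
        boolean sharedBefore = computers.get(0) == inventory.get("1GFR");

        // Snapshot the two structures independently.
        List<SerialProduct> computersCopy = (List<SerialProduct>) roundTrip(computers);
        Map<String, SerialProduct> inventoryCopy =
                (Map<String, SerialProduct>) roundTrip(inventory);
        inventoryCopy.get("1GFR").price = 39.99;

        boolean sharedAfter = computersCopy.get(0) == inventoryCopy.get("1GFR");
        boolean departmentSawUpdate = computersCopy.get(0).price == 39.99;
        return new boolean[] { sharedBefore, sharedAfter, departmentSawUpdate };
    }
}
```

After the round trip there are two flashram objects, so a price change made through the inventory copy never reaches the department copy. That is the gap the SKU-as-pseudo-reference rewrite papers over.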

3.B. Spring

Spring has several values, one of which is removing all of the code smell and implementation dependencies from the serialization-type approach. I can actually take all my getters and setters where products are added and deleted, and pricing and inventory info changed, and inject a Product instance as a Spring Bean where the bean's lifecycle is abstracted from the getters and setters. I can change my Store constructor and populate it via dependency injection so that the issue with passing references between my ArrayList and HashMap is hidden; Spring can actually map my String lookups to beans on the fly so that I don't have to see the impacts of my scalability abstractions (database, JGroups, JMS, etc.) in my code. So Spring's dependency injection engine seems to get all the above frameworks to pass the POJO razor. In reality, this is perception. The code still behaves as in the naive serialized blob example in section 3.A. in that all my objects are getting serialized and passed across the network to a database or another application instance. The difference is that I can factor the smell such that it is not visible in Main.java, Store.java, Product.java, etc. This is important because without Spring, this application will not function on a single node, nor will it scale out to multiple nodes. If the dependencies cannot get injected then the instances and references will all be null at runtime. Thus, by our POJO razor, the Spring version of this app, when clustered using a database, JGroups, or JMS, is no more POJO than it was when hand-coded. It is far superior in maintainability, extensibility, and more, but it is no more POJO than it ever was. (Note that I am in no way suggesting that Spring violates POJO, but more on that later.)
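To separate "dependency injection" from "what gets injected", here is a hand-wired sketch in plain Java with no Spring involved. The PriceStore interface, both classes, and the wiring in demo() are hypothetical; they only illustrate that the service class stays the same whether the injected store lives on heap or talks to a database.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical seam between the domain code and whatever stores prices.
interface PriceStore {
    void save(String sku, double price);
    Double load(String sku);
}

// Heap-only implementation; a JDBC- or JMS-backed one would implement the same interface.
class InMemoryPriceStore implements PriceStore {
    private final Map<String, Double> prices = new HashMap<String, Double>();

    public void save(String sku, double price) {
        prices.put(sku, price);
    }

    public Double load(String sku) {
        return prices.get(sku);
    }
}

class PricingService {
    private final PriceStore store; // injected; the service never constructs it

    PricingService(PriceStore store) {
        this.store = store;
    }

    double reprice(String sku, double newPrice) {
        store.save(sku, newPrice);
        return store.load(sku);
    }

    static double demo() {
        // Wiring done by hand here; a container would do the same from configuration.
        PricingService service = new PricingService(new InMemoryPriceStore());
        return service.reprice("1GFR", 12.34);
    }
}
```

PricingService never changes, but whether the running application depends on system resources is decided entirely by which PriceStore is passed in.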

3.C. O/R Mappers

This is where things get interesting. Specifically, object proxying and lazy-loading of object relationships seem like they would keep this application better factored than the naive-serialization approach. And, in fact, they do. The following bit of Hibernate config:


<class name="demo.Inventory.Product" table="INVENTORY_TABLE">
  <id name="id" column="PRODUCT_ID">
    <generator class="native"/>
  </id>
  <property name="price"/>
  <property name="name"/>
  <property name="sku"/>
  <set name="departments" table="DEPARTMENT_TABLE">
    <key column="DEPARTMENT_ID"/>
    <many-to-many column="PRODUCT_ID" class="demo.Inventory.Product"/>
  </set>
</class>

now implies that my departments can remain an ArrayList and that the multiple references to the "flashram" product will get resolved correctly by PRODUCT_ID. Again, like Spring, this is great. But it still fails the POJO razor because all the calls we will be making to Hibernate.getSessionFactory().* will actually not run outside the presence of Hibernate. While our application will be very well factored and the database dependency will be modularly tucked away (either in some Spring config or in just a few lines of extra code in our getters and setters) it will not run without its underlying database and data tables.

This is not to say that the Inventory example in the Terracotta download kit should not store its data in a database. In fact, I believe it should. This is merely to say that the razor holds true to my expectations that an application that uses Hibernate and O/R-mapping to scale to multiple application instances by sharing a common database instance is not a POJO app.

3.D. Terracotta

With Terracotta, we have one key line of configuration. It is as follows (visit http://www.terracotta.org/ to learn more):


<root>
<field-name>demo.inventory.Main.store</field-name>
</root>

It says that the field named "store" in Main.java should be clustered. That's it. Which means Terracotta has a chance of passing the razor. But not quite yet because our getters and setters just naively update products with no assumption that Terracotta needs to be told that the objects changed. In other words, the code in Main.java all assumes that object references and identity are not getting violated and if I do something like:


String s = "1GFR";
Product p = (Product) Inventory.get( s );
double newPrice = 12.34;
p.setPrice( newPrice );

then I do not need to do Inventory.put( p ) because p is already in the Inventory HashMap. This is true with Terracotta because it plugs in to the JVM and replicates field-level changes at a heap level. It does not require object serialization and it works with normal Java thread coordination like the synchronized() call on line 27 in updatePrice() in one of the code samples above. In fact, here is the configuration snippet that makes Terracotta work with that code, natively. This configuration dictates that my application's use of synchronization will be used by Terracotta to push heap changes from my app to Terracotta and around to my other app instances as they need it. Basically, it tells Terracotta to use the sync-calls in all methods as lock acquisition and release points:


<locks>
<autolock>
<method-expression>* *..*.*(..)</method-expression>
</autolock>
</locks>
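The programming model this configuration preserves is ordinary heap semantics: get a reference, mutate it under a lock, and every holder of that reference sees the change with no put(). The single-JVM sketch below shows only that plain-Java behavior. It does not exercise Terracotta, and HeapProduct and NoPutDemo are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeapProduct {
    final String sku;
    private double price;

    HeapProduct(String sku, double price) {
        this.sku = sku;
        this.price = price;
    }

    synchronized void setPrice(double p) { price = p; }

    synchronized double getPrice() { return price; }
}

class NoPutDemo {
    // Returns the price as seen through the inventory map and through the department list.
    static double[] demo() {
        HeapProduct flashram = new HeapProduct("1GFR", 47.99);
        Map<String, HeapProduct> inventory = new HashMap<String, HeapProduct>();
        inventory.put(flashram.sku, flashram);
        List<HeapProduct> computers = new ArrayList<HeapProduct>();
        computers.add(flashram);

        HeapProduct p = inventory.get("1GFR");
        synchronized (p) { // the lock boundary an autolock config would key on, per the text
            p.setPrice(12.34);
        }
        // No inventory.put(p): the map and the list already hold this same object.

        return new double[] {
            inventory.get("1GFR").getPrice(),
            computers.get(0).getPrice()
        };
    }
}
```

Both views report the new price because there is only one object on the heap.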

So, by the razor's edge, Terracotta is POJO because the app compiles with /usr/bin/javac and it runs whether or not Terracotta is present. When Terracotta is present, then many instances of this application demo will work together on a shared Inventory HashMap and shared departments ArrayList. If Terracotta is not present, each copy will run stand-alone and changes to pricing in one JVM will not impact any others.

4. Aside: Why Vendors Say “POJO” When they are not

So if so many things get cut by the POJO razor's edge, why is POJO important? The value of POJO is in simplicity, and control. When the developer is in control of his object graph, from data types through object passing and references, he is in control of his domain model. Any framework that calls itself a “POJO framework” does so to connote simplicity and control. In the example above, Hibernate gave us control of our data types but we couldn't pass object references around. We needed to allow Hibernate to maintain the relationship between Departments and Inventory. In the example above, Spring gave us a way to factor out the impacts of serialization on our code, but we still could not pass objects by reference. We had to rely on dependency injection and Spring Beans to do the heavy lifting for us. And, when writing clustering code by hand, the code began to look nothing like its original form and we fear the long term maintainability and extensibility of that code base.

Without passing judgement on the value behind or the validity of any framework, most frameworks are not POJO because most frameworks fail the razor. Most frameworks are, however, trying to copy Spring's success in the market and assert that they help deliver cleanly factored code. The reality is that most frameworks that help factor out infrastructure and operational concerns do so with a combination of Spring and Hibernate, both of which fail the razor. Frameworks that use Spring or Hibernate do not produce any greater POJO-ness in applications than Spring or Hibernate themselves can. Quite the contrary. Spring makes non-POJO and otherwise leaky framework abstractions appear to be as POJO as any other Spring application.

| Framework | POJO? | Gaps |
|---|---|---|
| Clustering Summary | NO | - |
| JCache implementations | No | fails on all components of the POJO definition |
| JGroups | No | same as above |
| JavaSpaces implementations | No | Object Identity, Free-typing, Identity Fields, Clean Business Interface |
| O/R-Mapping Summary | NO | - |
| Hibernate | No | Identity Fields |
| iBatis | No | Identity Fields |
| OpenJPA | No | Identity Fields, annotations |
| Dependency Injection | Can Be | |
| Spring | Can Be | Proxy-free (before Spring 2.0) |
| Others… | | |
| Messaging | No | ALL |
| App Server Clustering (Tomcat, WLS, WAS) | No | Object Identity, Clean Business Interface, No Manager |
| Terracotta | Yes | |

It would seem that the "P" in POJO now tends to stand for "pretend" Java object. Most of the pretenders are vendors and frameworks who produce tools that abstract system concerns. There is no longer any regard for the dependencies and the quality of application factoring that a framework can provide. More specifically, if good design requires flexibility, reuse, and lack of fragility, most non-POJO frameworks that wrap system concerns in fact introduce a rigid nature to our application. It seems that the basic plan is that if my framework can be dependency injected (like when Spring wraps a framework), those frameworks want to call themselves POJO. The problem is of course that they are the opposite of POJO. And dependency injection is only hiding their bootstrap and boilerplate code...but not the dependencies themselves.

The question I have is when do we as a community adopt something such as the POJO razor and hold our entire community to that yardstick? Or do we even bother? Does claiming POJO matter as much as _being_ POJO? Or should we treat it like "Free trial software" versus "Open Source"? A bait and switch is made up of 2 parts: bait and switch. Bait in this case is saying dependency injection begets POJO. Switch in this case is the reality of framework dependencies and tight coupling. I suppose time will tell. One thing is for certain: POJO in Martin Fowler's definition is highly valuable in keeping our day-to-day as developers sane.

July 18, 2007

Why should I look at Terracotta? Because bean-style gets tiresome

I saw an email to our sales team today that made me very happy. The gist is that a customer is using Terracotta not just because it saves them money but because it makes their application "more elegant" and they like the code they are writing in conjunction with Terracotta.

For those who don't know, Terracotta is a plug-in to your JVM to help share objects and thread state across JVMs without extra code or programming models. It makes enterprise programming lighter weight.

Bean-style programming seems to be the alternative. In most alternatives to Terracotta, be they JMS, JGroups, DAOs, EJB2, or custom clustering solutions, there is some sort of get() / put() abstraction where serialization might even get called and magically copy appropriate object graphs around a bunch of JVMs.

Terracotta is quite different from bean-style programming. As an aside for those who are wondering why they should listen to a vendor like me, we are open source so you don't have to pay Terracotta to get these benefits. What's more, the license is Mozilla-based which means you can do anything you want with it (no restriction on number of JVMs, size of dataset, and you can edit the source if you really want to) but you cannot redistribute it. As for technical differences, they fall along 3 dimensions: faster, simpler, more reliable apps result from a Terracotta-based approach.

1. Terracotta is faster because it hooks into the heap as a JVM plug-in whereas bean-style programming that doesn't use bytecode manipulation uses Java serialization. Terracotta sees fine grained changes to your object graph and pushes only where needed. For example, if you have a web app with 100Kbyte sessions and you change a 25 byte string, Terracotta will push only 25 bytes. If that web app is running 100 nodes under a sticky load balancer, and 2 app instances have been sent session requests, then that 25 bytes will only be sent to those 2 nodes out of 100. Thus, Terracotta is typically 10X or more faster than other solutions, at least according to our customers.

2. Terracotta is simpler, again because it hooks into the heap as a JVM plug-in and it honors the Java Memory Model, whereas bean-style interfaces that serialize rely on you to redesign your app and all the frameworks you use to get at your domain model through beans. Terracotta works not just for sessions but for any POJO you write and those inside the open source and 3rd party frameworks you use (Spring, Wicket, Rife, EHCache, Struts, Jetty, Tomcat, and more). For example, you can use Weblogic, Websphere, Tomcat, Jetty, or Geronimo without any session clustering and bolt in Terracotta to cluster instead. You can download the kit at http://www.terracotta.org/ and look inside $TC_HOME/samples/pojo. You will see MANY examples that cannot be run on multiple JVMs with any other solution. For example, there are demos that cluster java.util.concurrent. There are demos that share objects across maps and lists, all without Java Serialization.

3. Terracotta is more reliable because it runs outside your process in its own JVM whereas most bean-style solutions run inside your application's process space. Terracotta plugs in to your JVM with some jars / libraries on your machine, but it is also a separate process that acts like a NETWORK ATTACHED MEMORY SERVICE (runs as pure Java on most HW you would want to run it on) and our software can persist everything to disk as well as to any number of backup Terracotta servers. It is more of a non-stop computing platform than others we have seen.
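To put rough numbers on the 100Kbyte session example in point 1, here is a minimal sketch in plain Java. It does not use any Terracotta API. It only compares the size of a whole session copied with Java serialization against the size of the one string that changed. The class name, the session keys, and the sample string are all made up for illustration, and the delta figure counts payload bytes only, ignoring whatever framing a real wire protocol would add.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;

public class DeltaVersusSerialization {

    // Builds a hypothetical session: about 100 KB of state plus one short string.
    public static HashMap<String, Serializable> buildSession() {
        HashMap<String, Serializable> session = new HashMap<>();
        session.put("cart", new byte[100 * 1024]);
        session.put("status", "abcdefghijklmnopqrstuvwxy"); // 25 characters
        return session;
    }

    // Bytes needed to copy the whole graph with Java serialization.
    public static int serializedSize(Serializable graph) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(graph);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Bytes in just the changed value (payload only, no protocol overhead).
    public static int deltaSize(String newValue) {
        return newValue.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        HashMap<String, Serializable> session = buildSession();
        String changed = "abcdefghijklmnopqrstuvwxy";
        session.put("status", changed);

        System.out.println("whole session, serialized: " + serializedSize(session) + " bytes");
        System.out.println("changed field only:        " + deltaSize(changed) + " bytes");
    }
}
```

The gap between the two numbers is what a get() / put() style API pays every time it re-serializes the session after a small change.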

In short: with Terracotta you are not forced into a programming model. And, Terracotta works at a lower level than bean-style solutions and could, in theory, cluster an OSS asynchronous API framework such as CommonJ, SEDA-style frameworks, Blitz, Spring Eventing, etc.

I know we already cluster Quartz and Lucene simply by configuring those two frameworks to use RAM for storing state and then flagging that state as shared across JVMs through Terracotta. I want to take a look at Spring Batch soon. That should be fun.

Simple, eh? I clearly am biased but I do think this approach is a great tool in your arsenal.

July 19, 2007

FUD OF THE WEEK

DEFINITION

According to Wikipedia, FUD is:

Fear, uncertainty, and doubt (FUD) is a sales or marketing strategy of disseminating negative (and vague) information on a competitor's product. The term originated to describe disinformation tactics in the computer hardware industry and has since been used more broadly. FUD is a manifestation of the appeal to fear.

Each week, I will attempt to seek out common misconceptions about Terracotta and explain why they are, in fact, misconceptions. I suspect that over time a pattern will emerge for all of us.

THIS WEEK

http://www.mail-archive.com/plug@lists.linux.org.ph/msg13454.html

In short, this mailing list entry hits 2 at once: "Terracotta is a single point of failure" and "Terracotta is a single point of bottleneck".

Terracotta clusters today in active / passive mode. This does not have ANYTHING to do with my application cluster (not sure if the original post is suggesting otherwise by use of the term "two-way cluster"). If I have 1 or 1000 nodes running Java, they can communicate using a Terracotta Server Cluster which acts as NETWORK ATTACHED MEMORY for my app cluster.

The passive Terracotta server is, in fact, an exact data replica of the then-active (replicating either via TCP or via shared disk). Failover for my application from speaking through the Terracotta Active Server to a new Active server is less than half a second. This is because Terracotta's Network attached memory concept works like network attached storage (file servers, but for RAM). When I mount a filesystem, I don't have to page it all into local memory in order to read part of a file, right? The same is true with Terracotta. Just connecting to a Terracotta server does not require my JVM to fault anything in. I can instead fault lazily. So, active / passive is indeed a viable strategy and Terracotta places no limits on my cluster's configuration...just on my Terracotta server cluster config.
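The lazy faulting idea can be sketched in a few lines of plain Java. This is a toy model, not Terracotta code: LocalHeapCache, the remoteFetch function, and the object ids are hypothetical names made up for the example. The point is only that attaching costs nothing and that an object is pulled over once, on first touch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

public class LazyFaultingSketch {

    // Hypothetical local cache that faults objects in from a remote store on first read.
    static class LocalHeapCache {
        private final Map<Long, String> resident = new HashMap<>();
        private final LongFunction<String> remoteFetch;
        int faults = 0;

        LocalHeapCache(LongFunction<String> remoteFetch) {
            this.remoteFetch = remoteFetch; // "connecting" fetches nothing
        }

        String get(long objectId) {
            String value = resident.get(objectId);
            if (value == null) {
                faults++;                          // first touch: go to the server
                value = remoteFetch.apply(objectId);
                resident.put(objectId, value);     // later reads stay local
            }
            return value;
        }
    }

    // Reads the given ids in order and reports how many remote fetches were needed.
    public static int faultsAfterReading(long... objectIds) {
        LocalHeapCache cache = new LocalHeapCache(id -> "object-" + id);
        for (long id : objectIds) {
            cache.get(id);
        }
        return cache.faults;
    }

    public static void main(String[] args) {
        System.out.println("faults right after connecting: " + faultsAfterReading());
        System.out.println("faults after reading 7, 7, 9:  " + faultsAfterReading(7, 7, 9));
    }
}
```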

Terracotta as a single point of bottleneck. Logically, network attached services can become a bottleneck. Sure. But, network attached services tend to have runtime caching and other optimizations that make them easier to scale than peer to peer solutions. My favorite is "greedy locks." Terracotta transparently replicates my lock state to the rest of my cluster so that, when I say synchronized(clustered object) {}, that carries through on the network. Java synchronization is pessimistic, but I cannot assume Terracotta must therefore go out on the network every time to acquire this mutex. In fact, Terracotta doesn't. It keeps central score of who is using which locks and allows any one of my application JVMs to check out a lock and keep it local-only until recalled back to the cluster. That means tight locking in a for-loop can be networked and still scale.
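Here is a rough counting model of the greedy lock idea, written in plain Java. It is an illustration only: LockServer, Client, and the round-trip counter are hypothetical names, the code is single-threaded, and it does not enforce real mutual exclusion or reflect Terracotta's actual implementation. It just counts how often a client would have to talk to the central server.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class GreedyLockSketch {

    // Hypothetical central scorekeeper: remembers which client has each lock checked out.
    static class LockServer {
        private final Map<String, Client> holderByLock = new HashMap<>();
        int roundTrips = 0;

        void checkOut(String lockId, Client requester) {
            roundTrips++;                                   // one trip over the network
            Client previous = holderByLock.put(lockId, requester);
            if (previous != null && previous != requester) {
                previous.recall(lockId);                    // pull the lease back
            }
        }
    }

    // Hypothetical application JVM that keeps a checked-out lock local until recalled.
    static class Client {
        private final LockServer server;
        private final Set<String> leases = new HashSet<>();

        Client(LockServer server) {
            this.server = server;
        }

        void lockAndRun(String lockId, Runnable criticalSection) {
            if (leases.add(lockId)) {                       // true only if no lease was held
                server.checkOut(lockId, this);
            }
            criticalSection.run();                          // lease held: no network needed
        }

        void recall(String lockId) {
            leases.remove(lockId);
        }
    }

    // One client locking in a tight loop: how many server round trips does it cost?
    public static int roundTripsForTightLoop(int iterations) {
        LockServer server = new LockServer();
        Client client = new Client(server);
        int[] counter = {0};
        for (int i = 0; i < iterations; i++) {
            client.lockAndRun("counter", () -> counter[0]++);
        }
        return server.roundTrips;
    }

    public static void main(String[] args) {
        System.out.println("round trips for 1000 uncontended locks: "
                + roundTripsForTightLoop(1000));
    }
}
```

In this model the cost only goes up when a second client asks for the same lock and the lease has to be recalled, which is the contended case where a network hop is unavoidable anyway.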

Most people would be amazed at the scale Terracotta has achieved under some of the world's largest websites. Also note that in many cases, it is possible to chop up your domain model so that it runs on more than 1 Terracotta Server. We have a use case in the lab right now with 100 million clustered objects (totaling 100GB) and the application is randomly mutating objects as fast as it can. The net result: 1200 object mutations per second per JVM and totally linear scale. That doesn't sound like a bottleneck to me.

Keep 'em coming guys! This should get interesting!

July 23, 2007

Attributed And Redistributed: more Terracotta, in more places

I was informed last week about a community member who has integrated Terracotta inside their product and attributes use of us inside their product.

This is VERY EXCITING NEWS. It means our thesis is proving true: Terracotta needed to be open not just for business reasons, but because the application design approach it fosters is beneficial to the entire community.

After all, Terracotta is all about simplifying Java development by helping avoid database scaling issues and clustering logic as much as possible.

Nice.

--Ari

July 30, 2007

A much better definition of POJO / usability than my own

This blog represents an actual user solving an actual problem with Spring, Terracotta, and Camel. His conclusion: A framework is good if I can't really tell I am using it.

I LIKE IT!

--Ari

July 31, 2007

FUD of the Week: Spill to Disk cannot work fast enough

Recently worked through this one with several customers. The basic premise is sort of "my database is bottlenecked by its disk I/O subsystem" and so the logical conclusion is that Terracotta will do roughly the same.

Most people are shocked, however, to learn that Terracotta can run in persistent mode at least as fast as it runs without it. For those who do not know, persistent mode is how Terracotta delivers on our concept of Network Attached Memory for Java apps, where NAM is highly available (100% uptime type of architecture). I think this is really our core value to the Java community--keeping application development simple and object oriented without sacrificing availability or scalability.

Anyways, I was talking about persistent mode as sufficiently scalable to rely on. See, Terracotta is not writing to disk like a relational database would. Those systems are designed for data normalized into a tabular format. They also provide architects and administrators with ways to lay out tables based on access patterns to the data. For example (as I learned from the founders of Veritas years ago when working with my first SAN), tables are usually appended to as new records such as users / customers or sales orders get created. If not appended to, those tables are queried against for blocks of similar records and random individual records. In the latter case, indexes and query plans help minimize the number of blocks pulled off of disk and, thus, the number of disk head moves that have to occur.

This basic tuning concept allows data architects to make intelligent decisions about striping tables across disks (random access, aggregate throughput advantages) or not (append-mostly data where one part of one disk will be hot at any one time as a block is getting loaded up with database rows). And let's not forget the transaction logs, which are usually round-robin, append-only files on disk. Quick I/O was my favorite trade-off tool that made it possible to set up storage in a highly scalable manner for both backup and OLTP.

Well, Terracotta's Network Attached Memory isn't accessed like a table or file on disk. There is no append-only I/O. There is no sequential I/O. At least, if there is, it is purely coincidence derived from disk fragmentation algorithms. There is only one pattern (at least conceptually) for which Terracotta needs to be optimized:

Lots of Random I/O at a fine-grained level making up chunks of updates applied all over the disk. This is because objects are laid out on disk in a mostly random pattern which leads to quite predictable latency, by the way.

Terracotta has optimized itself for object oriented data. It is also an infrastructure service meant to help you share application state without the need for explicit APIs so it has to be designed to scale, right?

The entire internals of the Terracotta infrastructure are asynchronous and multithreaded. Coupling the asynchronous nature of the engine with the fine-grained object orientation, we found that there is time to compress updates (eliminate duplicates in-stream) and to figure out which chunks of I/O go where, and when. Sort of like query-planning--sure--but on a much simpler scale, since all reads and writes are for a handful of fields, and a linear or sequential scan through the dataset is not a worry, so sorting and indexing are not necessary either.
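Compressing updates in-stream is easy to picture with a small sketch. The class below is hypothetical and is not how Terracotta is implemented internally. It assumes updates are keyed by an object id plus a field name, and that only the last value written to a field before a flush needs to go anywhere.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CoalescingWriteBuffer {

    // Pending field updates, keyed by "objectId.fieldName"; a later write replaces an earlier one.
    private final Map<String, Object> pending = new LinkedHashMap<>();
    private int writesSeen = 0;

    // Called for every field write the application makes.
    public void recordFieldWrite(long objectId, String field, Object value) {
        writesSeen++;
        pending.put(objectId + "." + field, value);   // last one in wins
    }

    public int writesSeen() {
        return writesSeen;
    }

    // Hands back the de-duplicated batch and starts a new one.
    public Map<String, Object> flush() {
        Map<String, Object> batch = new LinkedHashMap<>(pending);
        pending.clear();
        return batch;
    }

    public static void main(String[] args) {
        CoalescingWriteBuffer buffer = new CoalescingWriteBuffer();
        for (int i = 0; i < 1000; i++) {
            buffer.recordFieldWrite(42L, "hits", i);  // same field, written 1000 times
        }
        buffer.recordFieldWrite(42L, "name", "ari");

        Map<String, Object> batch = buffer.flush();
        System.out.println(buffer.writesSeen() + " writes seen, "
                + batch.size() + " updates in the batch: " + batch);
    }
}
```

Because the engine is asynchronous, there is a window between the application's write and the I/O in which this kind of collapsing can happen without the application waiting on it.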

What does this mean for you? It means you can write apps that get distributed across JVMs using our approach instead of sharing and managing state by hand. It also means that data that those JVMs place in Terracotta's NAM can be highly available.


  • Your JVMs can crash, restart, and reattach to NAM, picking up where they left off

  • Our JVMs can crash, restart or failover to backups, and pick up where they left off as well

No Single Point of Failure.

One last thing. Since Terracotta can spill to disk, this means that multiple Terracotta Server processes can share disk to share state and work as a cluster. Pretty robust and kewl but also pretty expensive. For those of us who use OSS to avoid complex infrastructure, Terracotta also has TCP-networked server clusters that don't require shared disk at all. These actually can run with each server using commodity-disk-based persistence because we tee all transactions destined for NAM to other servers in the cluster. Guess what? This mode also scales because each NAM server writes simultaneously to its local storage, which means the secondaries are usually done updating the memory model at the same time as the primary.

'Nuf said on that FUD.

About July 2007

This page contains all entries posted to POJO Mojo in July 2007. They are listed from oldest to newest.
