
January 23, 2006

Why do clustered hashmaps force me to think in relational terms?

posted by ari

Most solutions that claim to enable scale-out for Java applications are really nothing more than in-memory databases, forcing the OO developer to think in relational terms. What is needed is a pure OO model to enable scale-out.

It seems to me that relational theory is of tremendous value (reasonably established by the multi-billion-dollar-per-year industry based on it). OO theory is of tremendous value as well. The two are somewhat orthogonal and needn't be compared. So why then do clustering vendors for Java (an OO environment) ask me to use relational theory when working in an OO environment? Is it because it is easier for me, or is it actually just easier for them?

First I should justify my assertion that clustered hashmaps are in-memory databases. If a product follows these few tenets, it should be considered a database:

1. SELECT / get() as a mechanism for looking up data
2. UPDATE / put() as a mechanism for making changes to that data
3. A primary key construct for accessing and cross-referencing unique instances of data
4. A query language to safely select and/or update individual items or groups of data
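
To make the first three tenets concrete, here is a minimal sketch in plain Java. It assumes nothing about any vendor's product: two ordinary `HashMap`s stand in for clustered maps, and all class and method names (`CustomerRow`, `OrderRow`, `customerNameViaKeys`, and so on) are hypothetical. The key-based version has to store a customer ID in the order and look it up again, which is a foreign-key join by another name; the object version just follows a reference.

```java
import java.util.HashMap;
import java.util.Map;

class KeyedVersusReferenced {
    // Relational style: rows that point at each other by key (hypothetical types).
    static final class CustomerRow {
        final String customerId;
        final String name;
        CustomerRow(String customerId, String name) {
            this.customerId = customerId;
            this.name = name;
        }
    }

    static final class OrderRow {
        final String orderId;
        final String customerId; // "foreign key" into the customers map
        OrderRow(String orderId, String customerId) {
            this.orderId = orderId;
            this.customerId = customerId;
        }
    }

    // Object style: an order holds a direct reference to its customer.
    static final class Customer {
        final String name;
        Customer(String name) { this.name = name; }
    }

    static final class Order {
        final Customer customer;
        Order(Customer customer) { this.customer = customer; }
    }

    // Two "tables", each addressed by primary key.
    static final Map<String, CustomerRow> customers = new HashMap<>();
    static final Map<String, OrderRow> orders = new HashMap<>();

    static String customerNameViaKeys(String orderId) {
        OrderRow order = orders.get(orderId);                    // get() by primary key
        CustomerRow customer = customers.get(order.customerId);  // second get(): the join
        return customer.name;
    }

    static String customerNameViaReference(Order order) {
        return order.customer.name; // plain Java: follow the reference
    }

    public static void main(String[] args) {
        customers.put("c1", new CustomerRow("c1", "Acme"));
        orders.put("o1", new OrderRow("o1", "c1"));
        System.out.println(customerNameViaKeys("o1"));

        Order order = new Order(new Customer("Acme"));
        System.out.println(customerNameViaReference(order));
    }
}
```

Both calls return the same name. The difference is in what the domain model has to carry around: IDs and lookups in one case, references in the other.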

Is such a product solving problems for Java developers in the most natural, expedient way? And is such a product destined to become an [in-memory] database? If it walks like a duck and quacks like a duck...then it's a duck, right?

I believe the Java development community needs to stand together and insist on keeping the object-oriented / POJO programming paradigm as much as possible. Bob's recent post on Galactic forces explains our belief in more detail, but suffice it to say that Spring and Hibernate (and EJB3) are on to something--a return to the simpler ways of single-JVM Java.

Read on if you are interested in my assessment of the performance [dis]advantages of clustered hashmaps / in-memory databases...

I want to preempt any claims that representing your domain model in relational form provides performance, scalability, or simplicity benefits that should drive the community in the relational direction. Yes, Oracle, Sybase, IBM, and others have spent years designing relational databases that are fast, scalable, and highly available.

But those systems are just that--systems. They are not a paradigm in and of themselves. They require a proper interface and proper usage to work at their best. Java is an OO environment--both a development paradigm and a runtime. We have JDBC and O/R mappers to converse with persistent relational systems. We have SQL to work with bags and sets of data, either in memory or otherwise (I believe TimesTen did lots of kewl work in this space). And, when writing business logic in Java, we have the language itself.

So what then is the clustered hashmap? It is not OO, for sure: I can't use Java references, and object identity is violated. It is closer to a database. Is it faster just because databases are some of the largest systems in the world? No. It is inherently slower than pure Java, just like Oracle. We all know that writing to Oracle is more expensive than keeping objects in memory. We write to a database because the data we are writing has business value (sales orders, customers, etc.). The fundamental performance and scalability rule of leaving unnecessary data out of the database does not change just because the database is now in memory instead of across the network. We only reduce I/O overhead, but we still have an impedance mismatch.
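
Here is a small sketch of what "object identity is violated" looks like in practice. `CopyingMap` is a hypothetical stand-in, not any vendor's API: it assumes the clustered map moves values across a serialization boundary on every `put()` and `get()`, which I model with standard Java serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

class CopyingMapDemo {
    /** Hypothetical stand-in for a clustered map: values are stored as serialized bytes. */
    static final class CopyingMap<K, V extends Serializable> {
        private final Map<K, byte[]> store = new HashMap<>();

        void put(K key, V value) {
            try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                 ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
                out.flush();
                store.put(key, bytes.toByteArray());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @SuppressWarnings("unchecked")
        V get(K key) {
            byte[] data = store.get(key);
            if (data == null) {
                return null;
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
                return (V) in.readObject(); // every get() builds a fresh copy
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (ClassNotFoundException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    static final class Account implements Serializable {
        private static final long serialVersionUID = 1L;
        int balance;
        Account(int balance) { this.balance = balance; }
    }

    public static void main(String[] args) {
        CopyingMap<String, Account> map = new CopyingMap<>();
        map.put("a1", new Account(100));

        Account first = map.get("a1");
        Account second = map.get("a1");
        System.out.println(first == second);        // false: two copies, no shared identity

        first.balance = 50;                          // changes only the local copy
        System.out.println(map.get("a1").balance);   // 100: the map never saw the change

        map.put("a1", first);                        // the explicit "UPDATE"
        System.out.println(map.get("a1").balance);   // 50
    }
}
```

With an ordinary in-heap `HashMap`, the two `get()` calls would return the same object, and the assignment to `balance` would be visible to every holder of that reference without a second `put()`.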

Why, then, write business logic in an object-oriented fashion while implementing the in-memory representation of your domain model in a relational fashion in the same Java process? Because such solutions are easier for the folks who write the clustering. Developers tell a database when a SQL UPDATE needs to occur. Developers do not have to tell the OS when to flush an mmap()ed memory page. The difference is the proximity of the data and memory to the processing context.

If mmap() were not just "memory" like any other heap returned by a malloc() call, the developer would have to call different APIs to write to mmap()ed pages. But someone smart decided it should look just like normal memory to the developer. Similarly, databases look like databases because they are focused on preserving valuable data and thus require the developer to signal when data is in a correct state for preservation.
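
The closest thing to mmap() in Java itself is `FileChannel.map`, so the same point can be sketched there. In this minimal example (the class and method names are my own, and the temp file is only for illustration), `roundTrip` is written against the plain `ByteBuffer` API and has no idea whether the buffer it receives lives on the heap or is mapped to a file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedLikeMemory {
    /** Writes text into the buffer and reads it back; unaware of what backs the buffer. */
    static String roundTrip(ByteBuffer buffer, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        buffer.clear();
        buffer.put(bytes);   // same call for heap memory and for a mapped file
        buffer.flip();
        byte[] back = new byte[buffer.remaining()];
        buffer.get(back);
        return new String(back, StandardCharsets.UTF_8);
    }

    /** Maps a fresh temporary file of the given size into memory. */
    static MappedByteBuffer mapTempFile(int size) {
        try {
            Path file = Files.createTempFile("mapped", ".bin");
            file.toFile().deleteOnExit();
            try (FileChannel channel = FileChannel.open(
                    file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // The mapping stays valid after the channel is closed.
                return channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(ByteBuffer.allocate(64), "hello"));
        System.out.println(roundTrip(mapTempFile(64), "hello"));
    }
}
```

Nothing in `roundTrip` signals when the mapped bytes should reach the file; the OS decides. `MappedByteBuffer.force()` exists for callers who want to request a flush explicitly, but using it is optional rather than part of every write.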

Clustered hashmaps are thinly veiled databases. We need to stick to Java when we are working with objects and to databases when we are working with business data. Don't you agree?

Comments

I do not agree. An in-memory design may be used for real-time work (with its pros and cons), but a properly designed relational model must also be usable historically: not a 2D design with simplistic cross references, but a 3D one that considers the state of its data along the timeline. Maybe that timeline is not considered in your approach.

Posted by: machin at April 6, 2006 4:49 AM

Jordi,

Good point. I did not consider time when writing the blog entry. But I have seen time accounted for in in-memory designs that customers are building with clustered hashmaps. Yes, that third dimension is more difficult to account for in memory than in a proper DB, but it is usually achieved by manually partitioning the dataset across multiple servers. Essentially, if a set of servers is responsible for the data as it existed, for example, "yesterday", then time is accounted for, because the time dimension requires looking up which server to query before looking up the data on that particular server.

Posted by: Ari Zilka at April 6, 2006 6:28 AM