July 31, 2007

FUD of the Week: Spill to Disk cannot work fast enough

posted by ari

I recently worked through this one with several customers. The basic premise is something like "my database is bottlenecked by its disk I/O subsystem," and so the supposedly logical conclusion is that Terracotta will hit roughly the same bottleneck.

Most people are shocked, however, to learn that Terracotta can run in persistent mode at least as fast as it runs without it. For those who do not know, persistent mode is how Terracotta delivers on our concept of Network Attached Memory for Java apps, where NAM is highly available (100% uptime type of architecture). I think this is really our core value to the Java community--keeping application development simple and object oriented without sacrificing availability or scalability.

Anyways, I was talking about persistent mode as sufficiently scalable to rely on. See, Terracotta is not writing to disk like a relational database would. Those systems are designed for data normalized into a tabular format. They also provide architects and administrators with ways to lay out tables based on access patterns to the data. For example (as I learned from the founders of Veritas years ago when working with my first SAN), tables are usually appended to as new records such as users / customers or sales orders get created. If not appended to, those tables are queried against for blocks of similar records and random individual records. In the latter case, indexes and query plans help minimize the number of blocks pulled off of disk and, thus, the number of disk head moves that have to occur.

This basic tuning concept allows data architects to make intelligent decisions about striping tables across disks (random access, aggregate throughput advantages) or not (append-mostly data where one part of one disk will be hot at any one time as a block is getting loaded up with database rows). And let's not forget the transaction logs, which are usually round-robin, append-only files on disk. Quick I/O was my favorite trade-off tool that made it possible to set up storage in a highly scalable manner for both backup and OLTP.

Well, Terracotta's Network Attached Memory isn't accessed like a table or file on disk. There is no append-only I/O. There is no sequential I/O. At least, if there is, it is purely coincidence derived from disk fragmentation algorithms. There is only one pattern (at least conceptually) for which Terracotta needs to be optimized:

Lots of random I/O at a fine-grained level, making up chunks of updates applied all over the disk. This is because objects are laid out on disk in a mostly random pattern, which, by the way, leads to quite predictable latency.

Terracotta has optimized itself for object oriented data. It is also an infrastructure service meant to help you share application state without the need for explicit APIs so it has to be designed to scale, right?

The entire internals of the Terracotta infrastructure are asynchronous and multithreaded. Couple the asynchronous nature of the engine with the fine-grained object orientation, and we found that there is time to compress updates (eliminate duplicates in-stream) and to figure out which chunks of I/O go where, and when. Sort of like query planning, sure, but on a much simpler scale: all reads and writes are for a handful of fields, and linear or sequential scans through the dataset are not a worry, so sorting and indexing are not necessary either.
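
To make "eliminate duplicates in-stream" concrete, here is a minimal sketch in Java. It is not Terracotta's code: the class name, the `record`/`drain` methods, and the "objectId/field" key format are all made up for illustration. The only idea it shows is that when writes are asynchronous, a later update to the same field can replace an earlier one before anything reaches disk.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of in-stream update coalescing; not Terracotta's implementation. */
final class CoalescingWriteBuffer {

    /** One pending write: which object, which field, and the latest value seen. */
    static final class FieldUpdate {
        final long objectId;
        final String field;
        final Object value;

        FieldUpdate(long objectId, String field, Object value) {
            this.objectId = objectId;
            this.field = field;
            this.value = value;
        }
    }

    // Keyed by "objectId/field" (an assumed format). Writing the same key again
    // replaces the value but keeps the position of the first write.
    private final Map<String, FieldUpdate> pending = new LinkedHashMap<>();

    /** Called by application threads; never touches the disk. */
    synchronized void record(long objectId, String field, Object value) {
        pending.put(objectId + "/" + field, new FieldUpdate(objectId, field, value));
    }

    /** Called by a background flusher; returns one batch and resets the buffer. */
    synchronized List<FieldUpdate> drain() {
        List<FieldUpdate> batch = new ArrayList<>(pending.values());
        pending.clear();
        return batch;
    }
}
```

If the same field of the same object is recorded three times between flushes, the drained batch contains one entry for it, carrying the last value. A real server would also have to respect transaction boundaries and ordering, which this sketch ignores.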

What does this mean for you? It means you can write apps that get distributed across JVMs using our approach instead of sharing and managing state by hand. It also means that data that those JVMs place in Terracotta's NAM can be highly available.

No Single Point of Failure.

One last thing. Since Terracotta can spill to disk, multiple Terracotta Server processes can share disk to share state and work as a cluster. Pretty robust and kewl but also pretty expensive. For those of us who use OSS to avoid complex infrastructure, Terracotta also has TCP-networked server clusters that don't require shared disk at all. These can actually run with each server using commodity-disk-based persistence, because we tee all transactions destined for NAM to the other servers in the cluster. Guess what? This mode also scales, because each NAM server writes simultaneously to its own local storage, which means the secondaries are usually done updating the memory model at the same time as the primary.
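
Here is a rough sketch in Java of what "tee all transactions to the other servers" means. Everything in it is an assumption made for illustration: `TeeingPrimary`, `LocalStore`, and `commit` are invented names, the stores are in-memory maps standing in for each server's local disk, and waiting for every copy before returning is a choice made for this sketch, not a statement about Terracotta's actual acknowledgement protocol.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of teeing writes to secondaries; not Terracotta's implementation. */
final class TeeingPrimary {

    /** Stand-in for one server's local persistent store (an in-memory map here). */
    static final class LocalStore {
        private final Map<String, String> data = new ConcurrentHashMap<>();

        void apply(String key, String value) {
            data.put(key, value);
        }

        String get(String key) {
            return data.get(key);
        }
    }

    private final LocalStore local;
    private final List<LocalStore> secondaries;

    TeeingPrimary(LocalStore local, List<LocalStore> secondaries) {
        this.local = local;
        this.secondaries = secondaries;
    }

    /** Starts the local write and every secondary write together, then waits for all of them. */
    void commit(String key, String value) {
        List<CompletableFuture<Void>> writes = new ArrayList<>();
        writes.add(CompletableFuture.runAsync(() -> local.apply(key, value)));
        for (LocalStore secondary : secondaries) {
            // In a real cluster this would be a TCP send to another server process.
            writes.add(CompletableFuture.runAsync(() -> secondary.apply(key, value)));
        }
        CompletableFuture.allOf(writes.toArray(new CompletableFuture[0])).join();
    }
}
```

Because the writes are started together rather than one after another, adding a secondary costs roughly the time of the slowest single write rather than the sum of all of them, which is the scaling point made above.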

'Nuf said on that FUD.

Comments

How does disk spill in the Terracotta server actually work? What is the on-disk format? Is there a whitepaper that actually explains it, or do we have to read the source code?

Posted by: Brian Slesinsky at January 5, 2008 10:39 AM

Brian,

Sorry for the late approval. I missed your comment amongst all the spam. Short answer is that it is not documented. But I just wrote an email explaining it to a customer. I may just publish that here.

It is based on Sleepycat. We just use it as a log-forward hash to disk. Objects are packed into blocks and written in tunable batches (the default is 10MB per write). Each field is stored separately and can be updated as a unit.

Posted by: ARI ZILKA at January 19, 2008 7:43 AM

Ari,

I too would be very interested to know more about how Terracotta's persistent storage works. One of the things I like about Terracotta is that it performs its magic at the JVM level, rather than intruding into the code. As I understand it, I can keep my objects' fields nice and private, maintaining encapsulation whilst still having persistent storage and clustering. Marvelous! Why is no-one else doing this?

However, the doubters amongst my clients have asked a few pointed questions about their data and how safely it is stored. I'd really like to be able to answer their questions and put their minds at ease.

Thanks again for a good article,

Cheers,
Ben.

Posted by: Ben Hathaway at July 15, 2008 1:24 AM
