MapReduce vs. the RDBMS
I just read this article over at Infoq by Scott Delap.
I must say that I find some of the things the "gods of relational databases" say completely based on assumptions that ignore (or show ignorance for) the application tier. They seem mired in thoughts that the data tier can do everything for everyone. If my assertion is correct, I would have to say that such thoughts are a major step backwards.
(I wish I could ignore the assertion that IBM and others were doing commodity-based grids tens upon tens of years ago. The desktop machine was nothing more than a dumb terminal back then so such thoughts are convenient restatements of fact. What IBM worked on was basic clustering theory--indexed sequential access vs. divide & conquer across servers. Also, the network latency and disk latency, etc. were totally different 25 - 30 years ago whereas today we can talk across machines faster than to our own disks. And today, we have lots of memory and multi-core to do amazing amounts of work per <$2K hardware node.)
First an aside, I spoke to one of the major contributors to these Mapreduce-style frameworks on the phone just last week. And I asked him, "what does one do with Hadoop or Google Mapreduce-type APIs other than query / search / read-only operations." His answer was, "well, people distribute read/write tasks and data with this stuff. The Google FS or HDFS are cornerstones of the technology in that they help distribute data. But, in all honesty, most uses of Mapreduce are misuses." Maybe Stonebraker is focused on these misuses.
In my work here at Terracotta, I am not worried about a particular interface such as Hadoop or CommonJ. The original article completely ignores the key point which is "divide & conquer." To further that point, I see divide&conquer as asking us to evaluate if commodity scale-out is better than scale up for certain tasks? Ask Google or Yahoo how much data they are working with and then ask Oracle about its largest single instance and I am willing to bet you a nice lunch which one will be bigger (Google's).
So, at the end of all this, what's the answer? I think it is simple: commodity scale-out can be linear and divide&conquer is a powerful pattern that everyone--including Oracle, across threads or processes--uses to scale.
1. Google is working with more data per day than Oracle instances see in a year of processing (of course, I am speaking in hyperbole here...I have no idea what the aggregate dataset that all Oracle instances work on). Google had to go custom to handle this. Spun another way, if Google had indeed purchased Oracle licenses to do what they do with MapReduce, Oracle's entire dev and support teams would be consumed doing what Google's own IT staff do today. Oracle's product would have changed to compensate for Google's demands, and Oracle would have had less time to build the repeatable business. Oracle would run from the challenge of building Google's core filesystem and web indexing / search engine.
2. Commodity-server based scale out is good for two things. First, data storage. HDFS / GFS help you use all this commodity hardware to get perfectly linear increases in throughput. Specifically, the memory bus, the disk controllers, the disks themselves the network cards, the CPU, everything is discreet and unshared (unlike SMP) so the bottlenecks Oracle faces when scaling up will not appear. This brings us to the 2nd point. The processors will all be spinning as fast as they can. I/O bottlenecks are more manageable at small scales. Keeping a 2.6GHz core spinning on the business problem as opposed to waiting for the disks means fewer clock cycles wasted. I would love to do a study on a Mapreduce-style cluster's CPU utilization versus an Oracle server's. Which one would spend more time in I/O wait? Which would spend more system time? And which would spend more in user space? Its not obvious to me off the cuff, but would help fundamentally answer the question of which technology is better.
3. This is the most important point IMO. Nothing actually requires "divide&conquer" to be tasked with search and query. The more generic concept at play leading to the popularity of MapReduce is that of partitioning. (BTW, even the database requires partitioning--at least it did at the top-ten eCommerce site at which I worked.) Partition data on some key, and then partition the processing along the same boundaries and scalability can become much easier to deliver. Example: if you are working with customer data, you could split on a hash of the last name, or a simple mod of the first letter of the last name. 'A' on one server, 'B' on another, and so on. You store all the customers in a map to get to them by name quickly. But you later realize you are frequently operating on the customers by zipcode so you ask each server to index its subset of the data into a 2nd map of sortedlists by zipcode. There. Flexibility & linear scale in one solution.
What's in all this for me? For Terracotta? I am confident that database is woefully inadequate for many workloads. I think that partitioning in the app tier is important. I also think it is too hard. I want us at Terracotta to work this year on taking transparency and declarative-clustering to the next level. I want our system to provide transparency of both clustering as well as partitioning. Specifically, this year, we should be producing software that allows someone to say, "cluster my existing app with no changes" and then "spread all the clustered data and corresponding I/O across many Terracotta instances without changing anything." We shall see...
--Ari