While the era of parallel computing has been coming our way slowly, the era of big data has arrived suddenly. We knew it would come eventually. The convergence of scads of real-time data (GPS, RFID, audio, video) with data extracted from the Web, combined with incredibly inexpensive storage capacity, has led to vast amounts of data sitting on systems waiting to be processed. As this surge of data has crested, software technologies to process it efficiently have emerged. Some of these are entirely new (map/reduce), while others expand tools and techniques that were previously known but had garnered less attention (dataflow, non-SQL databases, etc.).
However, these data-handling tools face a rather under-discussed problem: efficient data movement. Suppose you have 2TB of unstructured data to process. Given plenty of processing horsepower, what are the bottlenecks you’ll face? Processing might be a minor gating factor, but I suspect it won’t be nearly as much of an obstacle as I/O latency.
Moving 2TB of data takes a long time. The two primary constraints, disk access speed and network bandwidth, limit throughput in compounding ways. Their limitations derive from the fact that I/O systems simply cannot keep pace with the growth in data volumes.
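Some back-of-envelope arithmetic shows why. The throughput figures below are rough, illustrative assumptions (roughly 100 MB/s sustained from a single disk, roughly 110 MB/s of practical throughput on a 1 GbE link), not measurements:

```python
TB = 10**12  # bytes

def transfer_hours(bytes_total, mb_per_sec):
    """Hours needed to move bytes_total at a sustained MB/s rate."""
    return bytes_total / (mb_per_sec * 10**6) / 3600

data = 2 * TB
disk_hours = transfer_hours(data, 100)  # assumed single-disk sequential rate
gbe_hours = transfer_hours(data, 110)   # assumed practical 1 GbE rate

print(f"single disk: {disk_hours:.1f} h")  # roughly 5.6 hours
print(f"1 GbE link:  {gbe_hours:.1f} h")   # roughly 5.1 hours
```

Even before a single byte is processed, just reading or shipping the data costs hours, and real workloads rarely sustain these peak rates.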
If I/O is the gating factor, what is the solution? In simple terms, it is moving the computation to the data, rather than the traditional path of moving the data to the computation engine. This concept, which is analogous to stored procedures running on the database server, takes a different form when processing unstructured data.
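The idea can be illustrated with a toy sketch (all names here are invented for illustration): instead of pulling a node's data across the network to a central engine, we send a small function to each node and pull back only the much smaller results.

```python
class Node:
    """A compute node with data resident locally (hypothetical model)."""

    def __init__(self, name, local_data):
        self.name = name
        self.local_data = local_data  # data that lives on this node

    def run(self, fn):
        """'Move the computation': execute fn against this node's data."""
        return fn(self.local_data)

nodes = [
    Node("n1", [3, 1, 4, 1, 5]),
    Node("n2", [9, 2, 6, 5, 3]),
]

# Ship a summarizing function to each node; only (count, sum) pairs move.
partials = [n.run(lambda chunk: (len(chunk), sum(chunk))) for n in nodes]
count = sum(c for c, _ in partials)
total = sum(s for _, s in partials)
print(count, total / count)  # 10 values, mean 3.9
```

The network carries a few bytes of results per node rather than the entire dataset, which is the whole point when the data runs to terabytes.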
One implementation is to use virtualization and move the VM to the data. This approach has benefits—VMs are much smaller than the blobs of data we’re discussing—and drawbacks. The principal limitation is that this model moves data off a storage fabric and onto the machines that perform the computing. Clusters with data resident on compute nodes, for example, are a model that does not currently have wide usage. But it may soon be the preferred way of handling large data.
One feature this model needs is a new kind of filesystem. Filesystems will now need to communicate not only the path to the data, but exactly where the data physically resides. In a very direct way, this violates the goal of virtualization, which aims to break the physical/virtual linkage. Such future filesystems will need to know which physical system they live on and be able to announce that system’s location as part of the file’s URI or designation.
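A minimal sketch of what such a namespace might look like, with entirely hypothetical names: a catalog that maps a logical path not just to a file, but to the physical host holding it, and that can emit that host as part of the file's URI so schedulers can send work to the right machine.

```python
class LocationAwareNamespace:
    """Hypothetical catalog: logical path -> (host, physical path)."""

    def __init__(self):
        self._placement = {}

    def register(self, logical_path, host, physical_path):
        self._placement[logical_path] = (host, physical_path)

    def locate(self, logical_path):
        """Return the host holding the data, so work can be shipped to it."""
        return self._placement[logical_path][0]

    def uri(self, logical_path):
        """Announce physical location as part of the file's designation."""
        host, phys = self._placement[logical_path]
        return f"file://{host}{phys}"

ns = LocationAwareNamespace()
ns.register("/blobs/sensor-feed", "node17.cluster.local", "/data/feed.bin")
print(ns.locate("/blobs/sensor-feed"))  # node17.cluster.local
print(ns.uri("/blobs/sensor-feed"))     # file://node17.cluster.local/data/feed.bin
```

The interesting design questions—replication, blocks split across hosts, data that migrates—are exactly what makes this a research problem rather than a weekend project.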
Such a filesystem is still a topic of research, but it is not a pie-in-the-sky project. It is very much a response to present or soon-to-be-present challenges for IT organizations. Racing to solve the problem is akin to racing to find a vaccine to thwart a flu epidemic: if a solution is not found in time, the problem is inevitable.
Returning to the question of putting data on computing servers, there are other problems that result from this design—problems that themselves are research topics. The first of these is power consumption. Nodes that hold data that could be of use to other nodes cannot be shut down to save power. Hence, such clusters cannot adopt one of the emerging best practices in power reduction, which is to shut down unused or little-needed nodes when cluster loads are light.
One possible solution is simple planning. 2TB data blobs are rarely processed in their entirety at unscheduled, unexpected moments. These are jobs that can be scheduled and run in batch mode, so that the appropriate resources can be made available at the pre-set time.
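The planning idea can be sketched concretely (the data layout and schedule below are invented for illustration): given which nodes hold each dataset and when batch jobs are scheduled, one can compute the set of nodes that must be powered on in any given hour, and everything else can sleep.

```python
placement = {                      # dataset -> nodes holding its blocks
    "clickstream": {"n1", "n2"},
    "sensor-logs": {"n2", "n3"},
    "archive":     {"n4"},
}
schedule = [                       # (start hour, end hour, dataset)
    (0, 4, "clickstream"),
    (2, 6, "sensor-logs"),
    (12, 14, "archive"),
]

def nodes_needed(hour):
    """Union of nodes holding data for jobs active at this hour."""
    needed = set()
    for start, end, dataset in schedule:
        if start <= hour < end:
            needed |= placement[dataset]
    return needed

print(sorted(nodes_needed(3)))  # ['n1', 'n2', 'n3']
print(sorted(nodes_needed(8)))  # [] -- the whole cluster can power down
```

This recovers the power-saving practice for batch work: nodes are woken just before their data is needed and shut down afterward.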
That’s fine for batch computing, but it doesn’t work as well for streaming data applications. These apps typically are unscheduled and require low latency and fast access. For these apps, there is as yet no magical solution to the I/O problem.
Fortunately, if streaming data apps are constrained today, it’s primarily at the processing level, not the I/O level. And for them, the parallel capabilities of today’s processors come just in time. Long term, we may well have to rethink algorithms in light of the performance needs of big data.
For the moment, we can duck the issues presented here. Multicore CPUs, 10GbE adapters, and SSDs are all helping to accelerate processing and diminish the I/O overhead of these kinds of applications. However, it’s clear we are living on borrowed time. The amount of data is growing faster than I/O bandwidth and faster than our research capabilities. There is no doubt that successful data processing in the near future will depend on new forms of intelligent planning and design for big data apps, not only at the processing level, but also in the architecture of physical and virtual solutions.
Andrew Binstock is the principal analyst at Pacific Data Works. Read his blog at binstock.blogspot.com.