ITAPS

MapReduce and Transaction-Based Computing Experiments

Given the current trend toward computer systems that provide petascale computing capabilities composed of hundreds of thousands to millions of processing elements, we have to ask the question, “Is continuing to scale our traditional approaches to parallel computing going to let us take advantage of these new architectures?” One of the major issues facing the community at these processing-element counts is the recovery of user application codes after CPU, communication, and I/O related failures. Over the past twenty years, considerable effort has gone into the design of "state recovery" at both the system and user code levels, allowing applications to back up and restart at a point before the failure.

Since the Cray machines of the mid-1980s, these attempts have met with limited success, and the problem becomes more difficult as systems grow larger. To address this, the ITAPS team is researching the redesign of mesh-based computational algorithms so that they are inherently tolerant to component failure. Our approach leverages transaction-based parallel computing techniques using a MapReduce formulation within a Hadoop framework.

The MapReduce approach works by decomposing algorithms into a series of map and reduce phases that take (key, value) pairs as input and produce new (key, value) lists as output. Hadoop is an open-source implementation of MapReduce based on Google's published description of how it uses MapReduce for web indexing and internet search. One of the major advantages of the MapReduce paradigm is that it is inherently tolerant to failures of CPU, communication, and I/O subsystems through the creative use of redundant computation and storage strategies. Within ITAPS, we have been exploring MapReduce using Hadoop to generate computational kernels designed to operate on mesh data (vertices, edges, and regions). These kernels form the basis of mesh-based PDE operations that perform volume, gradient, and divergence calculations, and the results of these calculations are used to advance the solution from one time level to the next.
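To make the (key, value) formulation concrete, the following is a minimal sketch of how a mesh kernel of this kind might be cast as a Hadoop map/reduce pair. It is an illustration only, not the ITAPS kernels: the class names and the assumed input record layout ("elementId vertexId contribution", one element-vertex pair per line) are invented for the example. The map phase emits one (vertexId, contribution) pair per record, and the reduce phase sums the contributions that share a vertex, the kind of gather step that underlies volume, gradient, and divergence accumulation.

// Hypothetical sketch (not the ITAPS implementation): accumulate a
// per-vertex quantity from element records using Hadoop MapReduce.
// Input lines are assumed to be "elementId vertexId contribution".
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class VertexAccumulate {

    // Map phase: emit a (vertexId, contribution) pair for each element record.
    public static class ElementMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length < 3) return;                    // skip malformed lines
            String vertexId = fields[1];
            double contribution = Double.parseDouble(fields[2]);
            context.write(new Text(vertexId), new DoubleWritable(contribution));
        }
    }

    // Reduce phase: sum all contributions that share a vertex key.
    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            for (DoubleWritable v : values) {
                sum += v.get();
            }
            context.write(key, new DoubleWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "vertex accumulate");
        job.setJarByClass(VertexAccumulate.class);
        job.setMapperClass(ElementMapper.class);
        job.setCombinerClass(SumReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the framework re-executes any map or reduce task whose node fails and replicates the intermediate and final data, a kernel written in this style inherits the fault tolerance described above without any application-level checkpointing.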


Figure 12: The left image (a) is an example plume calculation run using a Hadoop MapReduce implementation; the middle image (b) shows the execution time for mesh problems ranging from 10 million to 100 million elements on a dual-core Dell XPS laptop; the right image (c) shows the execution time for the Map (red) and Reduce (green) phases of a calculation on a 10 million element mesh.

Besides fault tolerance, another advantage of a MapReduce/Hadoop infrastructure is its ability to scale to very large problem sizes, from high-performance supercomputers down to laptops. For example, on the PNNL Chinook supercomputer (18K processing elements) with a high-performance Lustre file system attached, Hadoop has been shown to scale well when running a seismic wave calculation with 500 billion elements (Figure 12 (a)). Figure 12 (b) shows the timings for a series of calculations run on a Dell XPS laptop (dual core, 8 GB of memory) with problem sizes ranging from 10 million to 100 million tetrahedral elements; the scaling is relatively linear throughout this range. Figure 12 (c) shows a breakdown of the execution time for the Map and Reduce phases of a mesh calculation. For efficient parallel computing it is important to overlap the various phases of a calculation to minimize latency, and this plot shows that the Hadoop implementation of MapReduce has that overlap built into the framework.
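As a brief aside on that overlap, Hadoop exposes a job property that controls how early reduce tasks begin copying map output, so the shuffle of the Reduce phase can proceed while map tasks are still running. The sketch below is illustrative only, not taken from the ITAPS runs; the class name and the 0.5 threshold are assumptions, while the property name is the standard Hadoop 2.x one.

// Hypothetical tuning sketch: start reduce tasks (and their shuffle/copy)
// once roughly half of the map tasks have completed, so the two phases overlap.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class OverlapTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Fraction of map tasks that must finish before reducers are scheduled;
        // lowering this value increases the overlap between the Map and Reduce phases.
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.5f);
        Job job = Job.getInstance(conf, "mesh kernel with overlapped phases");
        // ... the mapper, reducer, and I/O paths would be configured here as in the
        //     earlier sketch before submitting with job.waitForCompletion(true).
    }
}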