The End of Programming as We Know It
Future growth in computing performance will have to come from software parallelism that can exploit hardware parallelism. Programs will need to be expressed by dividing work into multiple computations that execute on separate processors and that communicate infrequently or, better yet, not at all. This chapter first explains how current software reaped the benefits of Moore’s law and how much of the resulting software is not well suited to parallelism. It then explores the challenges of programming parallel systems. The committee explores examples of software and programming parallelism successes and possibilities for leveraging these successes, as well as examples of limitations of parallelism and challenges to programming parallel systems. The sudden shift from single-core to multiple-core processor chips requires a dramatic change in programming, but software developers are also challenged by the continuously widening gap between memory system and processor performance. That gap is often referred to as the “memory wall,” but it reflects a continuous rather than discrete shift in the balance between the costs of computational and memory operations and adds to the difficulty of obtaining high performance. To optimize for locality, software must be written to minimize both communication between processors and data transfers between processors and memory.
MOORE’S BOUNTY: SOFTWARE ABSTRACTION
Moore’s bounty is a portable sequential-programming model.1 Programmers did not have to rewrite their software—software just ran faster on the next generation of hardware. Programmers therefore designed and built new software applications that executed correctly on current hardware but were often too compute-intensive to be useful (that is, it took too long to get a useful answer in many cases), anticipating the next generation of faster hardware. That software pressure built demand for next-generation hardware. The sequential-programming model evolved in that ecosystem as well. To build innovative, more capable, and sophisticated software, software developers turned increasingly to higher-level sequential-programming languages and higher levels of abstraction (that is, the reuse of libraries and component software for common tasks). Moore’s law helped to drive the progression in sequential-language abstractions because increasing processor speed hid their costs.
For example, early sequential computers were programmed in assembly language. Assembly-language statements have a one-to-one mapping to the instructions that the computer executes. In 1957, Backus and his colleagues at IBM recognized that assembly-language programming was arduous, and they introduced the first implementation of a high-level sequential scientific computing language, called Fortran, for the IBM 704 computer.2 Instead of writing assembly language, programmers wrote in Fortran, and then a compiler translated Fortran into the computer’s assembly language. The IBM team made the following claims for that approach:
- Programs will contain fewer errors because writing, reading, and understanding Fortran is easier than performing the same tasks in assembly language.
- It will take less time to write a correct program in Fortran because each Fortran statement is an abstraction of many assembly instructions.
- The performance of the program will be comparable with that of assembly language given good compiler technology.
1Jim Larus makes this argument in an article in 2009, Spending Moore’s dividend, Communications of the ACM 52(5): 62-69.
2J.W. Backus, R.I. Beeber, S. Best, R. Goldberg, L.M. Haibt, H.L. Herrick, R.A. Nelson, D. Sayre, P.B. Sheridan, H. Stern, Ziller, R.A. Hughes, and R. Nutt, 1957, The Fortran automatic coding system, Proceedings of the Western Joint Computer Conference, Los Angeles, Cal., pp. 187-198, available online at http://archive.computerhistory.org/resources/text/Fortran/102663113.05.01.acc.pdf.
That pattern of increases in processor performance coupled with increases in the use of programming-language abstractions has played out repeatedly. The discussion above describes the coupling for general-purpose computing devices, and it is at various stages for other hardware devices, such as graphics hardware, cell phones, personal digital assistants, and other embedded devices.
High-level programming languages have made it easier to create capable, large, and sophisticated sequential programs. During the height of the synergy between software and increasing single-processor speeds in 1997, Nathan Myhrvold, former chief technology officer of Microsoft, postulated four “laws of software”:3
- Software is a gas. Software always expands to fit whatever container it is stored in.
- Software grows until it becomes limited by Moore’s law. The growth of software is initially rapid, like gas expanding, but is inevitably limited by the rate of increase in hardware speed.
- Software growth makes Moore’s law possible. People buy new hardware because the software requires it.
- Software is limited only by human ambition and expectation. We will always find new algorithms, new applications, and new users.
3These laws were described in a 1997 presentation that the Association for Computing Machinery hosted on the next 50 years of computing (Nathan P. Myhrvold, 1997, The next fifty years of software, Presentation, available at http://research.microsoft.com/en-us/um/siliconvalley/events/acm97/nmNoVid.ppt).
Myhrvold’s analysis explains both the expansion of existing applications and the explosive growth in innovative applications. Some of the code expansion can be attributed to a lack of attention to performance and memory use: it is often easier to leave old and inefficient code in a system than to optimize it and clean it up. But the growth in performance also enabled the addition of new features into existing software systems and new paradigms for computing. For example, Vincent Maraia reports that in 1993, the Windows NT 3.1 operating system (OS) consisted of 4-5 million lines of code and by 2003, the Windows Server OS had 50 million lines, 10 times as many.4 Similarly, from 2000 to 2007, the Debian 2.2 Linux OS grew from about 59 to 283 million lines in version 4.0, about 5 times as many.5 Those operating systems added capabilities, such as better reliability, without slowing down the existing features, and users experienced faster operating system startup time and improvement in overall performance. Furthermore, the improvements described by Moore’s law enabled new applications in such domains as science, entertainment, business, and communication. Thus, the key driver in the virtuous cycle of exploiting Moore’s law is that applications benefited from processor performance improvements without those applications having to be adapted to changes in hardware. Programs ran faster on successive generations of hardware, allowing new features to be added without slowing the application performance.
The problem is that much of the innovative software is sequential and is designed to execute on only one processor, whereas the previous chapters explained why all future computers will contain multiple processors. Thus, current programs will not run faster on successive generations of hardware.6 The shift in the hardware industry has broken the performance-portability connection in the virtuous cycle—sequential programs will not benefit from increases in processor performance that stem from the use of multiple processors. There were and are many problems—for example, in search, Web applications, graphics, and scientific computing—that require much more processing capability than a single processor provides. The developers of applications and programming systems have made much progress in providing appropriate abstractions (discussed in detail below), but not enough: most developers and programming systems still use the sequential model. Conventional sequential programs and programming systems are ill equipped to support parallel programming because they lack abstractions to deal with the problems of extracting parallelism, synchronizing computations, managing locality, and balancing load. In the future, however, all software must be able to exploit multiple processors to enter into a new virtuous cycle with successive generations of parallel hardware that expands software capabilities and generates new applications.7
4Vincent Maraia, The Build Master: Microsoft’s Software Configuration Management Best Practices, Addison-Wesley Professional, 2005.
5“Debian,” Wikipedia, available at http://en.wikipedia.org/wiki/Debian.
6Successive generations of hardware processors will not continue to increase in performance as they have in the past; this may be an incentive for programmers to develop tools and methods to optimize and extract the most performance possible from sequential programs. In other words, working to eliminate the inefficiencies in software may yield impressive gains, given that past progress in hardware performance encouraged work on new functions rather than optimization of existing functions. However, optimizing deployed software for efficiency will ultimately reach a point of diminishing returns and is not a long-term alternative to moving to parallel systems for more performance.
Finding: There is no known alternative to parallel systems for sustaining growth in computing performance; however, no compelling programming paradigms for general parallel systems have yet emerged.
To develop parallel applications, future developers must invent new parallel algorithms and build new parallel applications. The applications will require new parallel-programming languages, abstractions, compilers, debuggers, execution environments, operating systems, and hardware virtualization systems. We refer to those tools collectively as a programming system. Future programming systems will need to take advantage of all those features to build applications whose performance will be able to improve on successive generations of parallel hardware that increase their capabilities by increasing the number of processors. In contrast, what we have today are conventional sequential-programming systems based on two abstractions that are fundamentally at odds with parallelism and locality. First, they tie the ordering of statements in a program to a serial execution order of the statements. Any form of parallelism violates that model unless it is unobservable. Second, conventional programs are written on the assumption of a flat, uniform-cost global memory system. Coordinating locality (minimizing the number of expensive main memory references) is at odds with the flat model of memory that does not distinguish between fast and slow memory (for example, on and off chip). Parallelism and locality are also often in conflict in that
7Indeed, the compiler community is bracing for the challenges ahead. On p. 62 of their article, Mary Hall et al. observe that “exploiting large-scale parallel hardware will be essential for improving an application’s performance or its capabilities in terms of execution speed and power consumption. The challenge for compiler research is how to enable the exploitation of the power [that is, performance, not thermals or energy] of the target machine, including its parallelism, without undue programmer effort.” (Mary Hall, David Padua, and Keshav Pingali, 2009, Compiler research: The next 50 years, Communications of the ACM 52(2): 60-67.)
locality will encourage designs that put all data close to a single processor to avoid expensive remote references, whereas performing computations in parallel requires spreading data among processors.
SOFTWARE IMPLICATIONS OF PARALLELISM
There are five main challenges to increasing performance and energy efficiency through parallelism:
- Finding independent operations.
- Communicating between operations.
- Preserving locality between operations.
- Synchronizing operations.
- Balancing the load represented by the operations among the system resources.
The first challenge in making an application parallel is to design a parallel algorithm to solve the problem at hand that provides enough independent operations to keep the available parallel resources busy. Some demanding problems have large amounts of data parallelism—that is, a single operation can be performed for every data element of a set, and the operations are independent of one another (or can be made so via transformations). Some problems also have moderate amounts of control or task parallelism in which different operations can be performed in parallel on different data items. In both task and data parallelism, an operation may comprise a sequence of instructions. For some applications, the parallelism is limited by a sequence of dependent operations, and performance is limited not by throughput but by the latency along this critical path.8
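The critical-path limit (Amdahl’s law) is easy to quantify. The following Python sketch, with illustrative numbers of our own choosing, computes the best possible speedup when a fraction of the work must remain sequential:

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Amdahl's law: overall speedup when only part of a program
    can be spread across the available processors."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

# Even a 5% sequential critical path caps speedup below 20x,
# no matter how many processors are added:
for n in (4, 16, 256, 65536):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

The speedup approaches, but never reaches, the reciprocal of the sequential fraction, which is why reducing the critical path matters as much as adding processors.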
The second challenge, communication, occurs when computations that execute in parallel are not entirely independent and must communicate. Some demanding problems cannot be divided into completely independent parallel tasks, but they can be divided into parallel tasks that communicate to find a solution cooperatively. For example, to search for a particular object in an image, one may divide the image into pieces that are searched by independent tasks. If the object crosses pieces, the tasks will need to communicate. The programming system can implement communication along dependences (the inputs and outputs of operations) either by reading and writing shared data structures or by explicitly sending messages between parallel operations. Even in the implicit case, some data will need to be transferred between processors to allow access to shared data.
8This limit on parallelism is often called Amdahl’s law, after Gene Amdahl. For more on this law, see Box 2.4.
Locality, the third challenge, reduces the costs associated with communication by placing two operations that access the same data near each other in space or in time. Scheduling operations nearby in space on the same processor avoids communication entirely, and placing them on nearby processors may reduce the distance that data need to travel. Scheduling operations nearby in time shortens the lifetime of data produced by one operation and consumed by another; this reduces the volume of live data and allows the data to be captured in small on-chip memories. Locality avoids the need for communication between processors, but is also critical for avoiding another form of communication: the movement of data between memory and processors.
The fourth challenge is synchronization, which is also needed for cooperation between parallel computations. Some operations must be performed in a particular order to observe dependences. Other operations may be performed in an arbitrary order but must be grouped so that some sequences execute atomically (without interference from other sequences). Synchronization serializes parts of an otherwise parallel execution, and there is often a tension between the performance gained from parallelism and the correctness ensured by synchronization. For example, barrier synchronization forces a set of parallel computations to wait until all of them reach the barrier. Locks control access to shared data structures by allowing only one thread to hold a given lock at a time. Unnecessary synchronization may occur when an entire data structure is locked to manipulate one element or when a barrier is placed on every loop iteration even when the iterations are independent.
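These primitives are easy to see in miniature. The sketch below is a hypothetical example, not drawn from the report, using Python’s threading module: a barrier releases all workers at once, and a lock makes each read-modify-write of a shared counter atomic:

```python
import threading

NUM_THREADS, INCREMENTS = 4, 10_000
counter = 0
counter_lock = threading.Lock()         # serializes access to counter
start = threading.Barrier(NUM_THREADS)  # no thread proceeds until all arrive

def worker():
    global counter
    start.wait()                        # barrier synchronization
    for _ in range(INCREMENTS):
        with counter_lock:              # lock: the read-modify-write is atomic
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```

With the lock, the result is always NUM_THREADS × INCREMENTS, but every increment has been serialized: exactly the tension between correctness and parallel performance described above.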
Finally, load balancing involves distributing operations evenly among processors. If the load becomes unbalanced because some processors have more work than others or take more time to perform their work, other processors will be idle at barrier synchronization or when program execution ends. The difficulty of load balancing depends on the characteristics of the application. Load balancing is trivial if all parallel computations have the same cost, more difficult if they have different costs that are known in advance, and even more difficult if the costs are not known until the tasks execute.
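That difficulty gradient can be illustrated with a small simulation (the task costs are hypothetical): a static block partition suffers when one task is much heavier than the rest, whereas greedily assigning each task to the least-loaded processor evens the load out:

```python
# Simulated task costs; in the hardest case these are unknown until execution.
costs = [8, 1, 1, 1, 1, 1, 1, 1]  # one task is much heavier than the others
P = 4                             # number of processors

# Static partition: each processor gets a contiguous block of tasks.
block = len(costs) // P
static_loads = [sum(costs[i * block:(i + 1) * block]) for i in range(P)]

# Dynamic (work-queue-like): each task goes to the least-loaded processor.
dynamic_loads = [0] * P
for c in costs:
    dynamic_loads[dynamic_loads.index(min(dynamic_loads))] += c

# The makespan is the load of the busiest processor.
print(max(static_loads), max(dynamic_loads))  # 9 vs 8
```

The greedy rule here stands in for a real work queue, in which an idle processor pulls the next task; both avoid idling the other processors behind the heavy task.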
Locality’s Increasing Importance
Effective parallel computation is tied to coordinating computations and data; that is, the system must collocate computations with their data. Data are stored in memory. Main-memory bandwidth, access energy, and latency have all scaled at a lower rate than the corresponding characteristics of processor chips for many years. In short, there is an ever-widening gap between processor and memory performance. On-chip cache memories partially bridge that gap. However, even a cache with the best algorithm for predicting the next operands needed by the processor does not have a success rate high enough to close the gap effectively. The advent of chip multiprocessors means that the bandwidth gap will probably continue to widen in that the aggregate rate of computation on a single chip will continue to outpace improvements in main-memory capacity and performance. The gap between memory latency and computation also limits software performance, although this gap will not grow with multicore technology, inasmuch as clock rates are relatively constant. In addition to performance concerns, the movement of data between cores and between processor and memory chips consumes a substantial fraction of a system’s total power budget. Hence, to keep memory from severely limiting system power and performance, applications must exhibit locality, and that locality must increase. In other words, the mapping of data and computation should minimize the distance that data must travel.
To see the importance of locality in future systems, it is instructive to examine the relative energy per operation for contemporary systems and how it is expected to scale with technology. In a contemporary 40-nm CMOS process, performing a 64-bit floating-point multiply-add (FMA) operation requires an operation energy, Eop, of 100 pJ. The energy consumed in moving data over 1 mm of wire, Ew, is 200 fJ/bit-mm, or 12.8 pJ per 64-bit word per millimeter. Moving data off-chip takes energy, Ep, of 2 pJ/bit (128 pJ per word) or more. Supplying the four operands of the FMA operation (three inputs and one output) from even 1 mm away takes 51.2 pJ—half as much energy as the operation itself. Supplying the data globally on-chip—say, over a distance of 20 mm—takes about 1 nJ, an order of magnitude more energy than the operation. Moving data off-chip is comparably expensive. Thus, to avoid having the vast majority of all energy be spent in moving data, it is imperative that data be kept local.
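The arithmetic above is easy to reproduce; this Python fragment simply recomputes the quoted figures:

```python
# Energy figures quoted above for a contemporary 40-nm process.
E_OP      = 100.0     # pJ for a 64-bit floating-point multiply-add (FMA)
E_WIRE    = 0.2 * 64  # pJ to move one 64-bit word 1 mm (200 fJ/bit-mm)
E_OFFCHIP = 2.0 * 64  # pJ to move one 64-bit word off-chip (2 pJ/bit)

OPERANDS = 4          # three inputs + one output per FMA

nearby      = OPERANDS * E_WIRE * 1    # operands supplied from 1 mm away
across_chip = OPERANDS * E_WIRE * 20   # supplied from 20 mm across the chip
off_chip    = OPERANDS * E_OFFCHIP     # supplied from off-chip

print(nearby, across_chip, off_chip)   # ~51.2 pJ, ~1 nJ, ~0.5 nJ
print(nearby / E_OP)                   # ~0.5: half the cost of the FMA itself
```

Even the cheapest case, fetching operands from 1 mm away, costs half as much as the arithmetic, and the on-chip global case costs an order of magnitude more.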
Locality is inherently present in many algorithms, but the computation must be properly ordered to express locality. For dense matrix computations, ordering is usually expressed by blocking the algorithm. For example, consider multiplying two 10,000 × 10,000 matrices. The straightforward algorithm performs 2 × 10¹² arithmetic operations. If we perform the operations in a random order, there is little locality, and 4 × 10¹² memory references will be required to compute the result, so both arithmetic operations and data accesses grow with the cube of the matrix dimension. Even with a natural implementation based on three nested loops, data accesses will grow with the cube of the matrix dimension, because one of the matrices will be accessed in an order that allows little reuse of data in the cache. However, if we decompose the problem into smaller matrix multiplication problems, we can capture locality, reusing each word fetched from memory many times.
Suppose we have a memory capable of holding 256 kB (32 kW) 1 mm from our floating-point unit. The local memory is large enough to hold three 100 × 100 submatrices, one for each input operand and one for the partial result. We can perform a 100 × 100 matrix multiplication entirely out of the local memory, performing 2 × 10⁶ operations with only 4 × 10⁴ memory references—a ratio of 50 operations per reference. We can apply this blocking recursively. If there is aggregate on-chip memory of 32 MB (4 MW), we can hold three 1,000 × 1,000 submatrices at this level of the storage hierarchy. In a seminal paper, Hong and Kung proved that idea optimal for matrix multiplication in the sense that this kind of blocked algorithm moves the minimum amount of data possible between processor and memory system.9 Other array computations, including convolutions and fast Fourier transformations, can be blocked in this manner—although with different computation-to-communication ratios—and there are theoretical results on the optimality of communication for several linear algebra problems for both parallel and serial machines.
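The operations-per-reference arithmetic above generalizes: under the counting used in the text (three blocks loaded and one stored per block multiply), the ratio works out to b/2 for b × b blocks. A small sketch:

```python
def blocked_ratio(n: int, b: int) -> float:
    """Arithmetic operations per word moved for an n x n matrix multiply
    blocked into b x b submatrices, counting 4 blocks of b*b words
    (3 loaded + 1 stored) per block multiply.  Simplifies to b / 2."""
    ops = 2 * n**3                              # each multiply-add is 2 ops
    block_multiplies = (n // b) ** 3
    words_moved = block_multiplies * 4 * b * b  # 4 b x b blocks per step
    return ops / words_moved

print(blocked_ratio(10_000, 100))    # the 50-to-1 ratio quoted in the text
print(blocked_ratio(10_000, 1_000))  # the larger on-chip tier reuses 10x more
```

The ratio grows linearly with the block size, which is why each larger level of the memory hierarchy buys another factor of reuse.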
The recursive nature of the blocked algorithm also led to the notion of “cache-oblivious” algorithms, in which the recursive subdivision produces successively smaller subproblems that eventually fit into a cache or other fast memory layer.10 Whereas other blocked algorithms are implemented to match the size of a cache, the oblivious algorithms are optimized for locality without having specific constants, such as cache size, in their implementation. Locality optimizations for irregular codes, such as graph algorithms, can be much more difficult because the data structures are built with pointers or indexed structures that lead to random memory accesses. Even so, some graph algorithms have considerable locality that can be realized by partitioning the graph into subgraphs that fit into a local memory and reorganizing the computations to operate on each subgraph with reuse before moving on to the next subgraph. There are many algorithms and software libraries for performing graph partitioning that minimize edge cuts for locality while keeping subgraph sizes equal for load balancing.11
9See Hong Jia-Wei and H.T. Kung, 1981, I/O complexity: The red-blue pebble game, Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, Milwaukee, Wis., May 11-13, 1981, pp. 326-333.
10Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran, 1999, Cache-oblivious algorithms, Proceedings of the 40th IEEE Symposium on Foundations of Computer Science, New York, N.Y., October 17-19, 1999, pp. 285-297.
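The cache-oblivious idea described above can be sketched in a few lines: a divide-and-conquer matrix multiply that recurses all the way down, with no cache-size constant anywhere. This is an illustrative sketch (n must be a power of 2 here), not a tuned implementation:

```python
def co_matmul(A, B, C, ri, rj, rk, n):
    """Cache-oblivious multiply-accumulate on n x n subblocks: adds the
    product of A's block at (ri, rk) and B's block at (rk, rj) into C's
    block at (ri, rj).  No cache size appears; the recursion produces
    subproblems that eventually fit every level of the memory hierarchy."""
    if n == 1:
        C[ri][rj] += A[ri][rk] * B[rk][rj]
        return
    h = n // 2
    for di in (0, h):            # split each dimension in half and recurse
        for dj in (0, h):
            for dk in (0, h):
                co_matmul(A, B, C, ri + di, rj + dj, rk + dk, h)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
co_matmul(A, B, C, 0, 0, 0, 2)
print(C)  # [[19, 22], [43, 50]]
```

Because the subdivision never consults a cache size, the same code exploits whatever hierarchy the machine happens to have, which is exactly the portability argument made for such algorithms.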
A key challenge in exploiting locality is developing abstractions that allow a programmer to express the locality in a program independently of any particular target machine. One promising approach, used by the Sequoia programming system,12 is to present the programmer with an abstract memory hierarchy. The programmer views the machine as a tree of memories; the number of levels in the tree and the size of the memory at each level are unspecified. The programmer describes a decomposition method, which subdivides a problem at one level into smaller problems at the next level and combines the partial solutions, and a leaf method, which solves the subproblem at the lowest level of the hierarchy. An autotuner then determines the number of times to apply the decomposition method and the appropriate data sizes at each level to map the program optimally onto a specific machine. The result is a programming approach that gives good locality with portability among diverse target machines.
Software Abstractions and Hardware Mechanisms Needed
Simplifying the task of parallel programming requires software abstractions that provide powerful mechanisms for synchronization, load balance, communication, and locality, as described above, while hiding the underlying details. Most current mechanisms for these operations are low-level and architecture-specific. The mechanisms must be carefully programmed to obtain good performance with a given parallel architecture, and the resulting programs are typically not performance-portable; that is, they do not exhibit better performance with a similar parallel architecture that has more processors. Successful software abstractions are needed to enable programmers to express the parallelism that is inherent in a program and the dependences between operations and to structure a program to enhance locality without being bogged down in low-level architectural details. Which abstractions make parallel programming convenient and result in performance-portable programs is an open research question. Successful abstractions will probably involve global address spaces, accessible ways to describe or invoke parallel operations over
11For one example of a graph-partitioning library, see George Karypis and Vipin Kumar, 1995, METIS: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Technical report, Minneapolis, Minn.: University of Minnesota.
12See Kayvon Fatahalian, Timothy J. Knight, Mike Houston, Mattan Erez, Daniel Reiter Horn, Larkhoon Leem, Ji Young Park, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan, 2006, Sequoia: Programming the memory hierarchy, Proceedings of the ACM/IEEE Conference on Supercomputing, Tampa, Fla., November 11-17, 2006.
collections of data, and constructs for atomic operations. Abstractions may also involve abstract machine models that capture resource costs and locality while hiding details of particular machines. Abstractions for parallelism are typically encapsulated in a programming system and execution model.
At the same time, reasonable performance requires efficient underlying hardware mechanisms, particularly in cases that need fine-grained communication and synchronization. Some parallel machines require interactions between processors to occur by means of high-overhead message transfers or by passing data via shared memory locations. Such mechanisms are useful but can be cumbersome and restrict the granularity of parallelism that can be efficiently exploited. Resolving those details will require research, but successful mechanisms will enable low-overhead communication and synchronization and will facilitate migration of data and operations to balance load. There are several emerging directions in hardware to support parallel computations. It is too early to know which hardware architecture or architectures will prove most successful, but several trends are evident:
- Multiple processors sharing a memory. This direction was taken by chip multiprocessors and was the primary approach used by semiconductor companies once they could not continue to increase their single-processor products.
- Multiple computers interconnected via a high-speed communication network. When very large computation facilities are needed for research or business, it is impractical for all the processors to share a memory, and a high-speed interconnect is used to tie the hundreds or thousands of processors together in a single system. Data centers use this model.
- A single processor containing multiple execution units. In this architecture, a single processor, or instruction stream, controls an array of similar execution units. This is sometimes termed single-instruction stream multiple-data (SIMD) architecture.
- Array of specialized processors. This approach is effective for executing a specialized task, such as a graphics or video-processing algorithm. Each individual processor and its interconnections can be tailored and simplified for the target application.
- Field-programmable gate arrays (FPGAs). FPGAs with execution units embedded in their fabric are used in some parallel computing systems and can yield high performance because they exploit locality and program their on-chip interconnects to match the data flow of the application.
That list of parallel architectures is not exhaustive, and some systems will use a combination of them. We expect current versions of the architectures to evolve substantially to support the most promising programming systems, and we may see entirely new hardware architectures in support of not-yet-developed programming approaches to parallel computing.
An encouraging development is that programs of research in parallelism are being initiated or revived in a few research universities. Some research efforts already under way are aimed at some of the challenges that this report outlines. For example, in 2008, the University of California, Berkeley, and the University of Illinois at Urbana–Champaign were awarded research grants from Microsoft and Intel to establish Universal Parallel Computing Research Centers. In 2009, Stanford University—with industrial funding from Sun, AMD, NVIDIA, and other companies—started the Pervasive Parallelism Laboratory. Those centers at leading research universities are a good beginning to address the broad and challenging research agenda that we outline below, but they are just a beginning. History shows that the development of technology similar to that needed for parallelism often takes a decade or more. The results of such research are needed now, so the research is starting a decade late. Moreover, there is no guarantee that there is an answer to the challenges. If there is not a good answer, we need to know that as soon as possible so that we can push innovation in some other direction in a timely way.
THE CHALLENGES OF PARALLELISM
Parallelism has long been recognized as a promising route to greater computing performance. Research on parallel hardware architectures began in earnest in the 1960s.13 Many ways of organizing computers have been investigated, including vector machines, SIMD machines, shared-memory multiprocessors, very-long-instruction-word machines, data-flow machines, distributed-memory machines, nonuniform-memory architectures, and multithreaded architectures. As described elsewhere in this report, single-processor performance has historically increased exponentially, making it difficult for companies promoting specialized parallel architectures to succeed. Over the years, however, ideas that originated or were refined in the parallel-computer-architecture research community have become standard features on PC processors, such as SIMD instructions, a small degree of instruction-level parallelism, and multiple cores on a chip. In addition, higher performance has been obtained by using a network of such PC or server processors both for
13W. J. Bouknight, Stewart A. Denenberg, David F. McIntyre, J.M. Randal, Amed H. Sameh, and Daniel L. Slotnick, 1972, The Illiac IV system, Proceedings of the IEEE 60(4): 369-388.
scientific computing and to serve an aggregate workload of independent tasks, such as Web services. The recent graphical-processing-unit chips also borrow ideas from the body of work on parallel hardware.
As noted previously, it has long been clear that one of the major hurdles in parallel computing is software development. Even if there were sufficient and appropriate software abstractions to enable parallel programming (Google’s MapReduce, discussed below, is an example of a successful approach for a particular class of problems), characteristics of the application under consideration can still pose challenges. To exploit parallelism successfully, several things are necessary:
- The application under consideration must inherently have parallelism. Not all programs are amenable to parallelization, but many computationally intensive problems either have high-level tasks that are largely independent or process large datasets in which the operations on individual items are mostly independent. Scientific simulations and graphics applications, for example, often have substantial parallelism to exploit because they perform operations on large arrays of data. Web servers process requests from a large set of users, and those requests involve mostly independent operations.
- Assuming that the application under consideration has sufficient parallelism, the parallelism must be identified. Either the programmer explicitly specifies the parallel tasks when developing the application or the system needs to infer the parallelism and automatically take advantage of it. If the parallelism involves tasks that are not entirely independent, the programmer or system also needs to identify communication and synchronization between tasks.
- Efficiency needs to be taken into account, inasmuch as it is not unusual for an initial parallel implementation to run more slowly than its serial counterpart. Parallelism inevitably incurs overhead costs, including the time to create parallelism and to communicate and synchronize between parallel components. Some applications do not divide neatly into tasks that entail equal amounts of work, so the load must be balanced and any overhead associated with load-balancing managed. Locality is no longer a question of working within a single memory hierarchy but one of managing the distribution of data between parallel tasks. It is important that multiprocessors exploit coarse-grained parallelism to minimize synchronization and communication overhead and to exploit locality. It is this phase that naturally turns programmers' attention to performance tuning.
- Last, but definitely not least, the parallel program must be correct. Parallelism introduces a new class of errors due to the creation of parallel computations for work that is not independent or to failure to communicate or synchronize correctly between parallel tasks. Parallelism also introduces new problems into testing and debugging, in that program behavior can depend on the schedule of execution of different processes. Those dependences make it difficult to test programs thoroughly and to reproduce faulty behavior when it is observed. Parallel programming approaches can be restricted to eliminate some of the problems by requiring, for example, that programs communicate only through synchronous messaging or that a compiler verify the independence of loop iterations before running them in parallel. But those approaches can limit the effectiveness of parallel computing by adding overhead or restricting its use. Writing correct sequential code is hard enough, but the complexity of parallel programming is so high that only a small percentage of the programmers in the industry today are competent at it.
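The correctness hazard described in the last requirement can be made concrete with a small sketch. The example below is plain Python (standing in for any shared-memory threaded language); the `time.sleep(0)` call is an artificial device, inserted only to widen the window in which the unsynchronized read-modify-write can interleave and lose updates:

```python
import threading
import time

ITERS = 200
counter_unsafe = 0
counter_safe = 0
lock = threading.Lock()

def unsafe_worker():
    # Read-modify-write without synchronization: another thread can run
    # between the read and the write, so increments can be lost.
    global counter_unsafe
    for _ in range(ITERS):
        tmp = counter_unsafe
        time.sleep(0)              # yield, making the race easy to observe
        counter_unsafe = tmp + 1

def safe_worker():
    # The same update guarded by a lock is deterministic.
    global counter_safe
    for _ in range(ITERS):
        with lock:
            counter_safe += 1

def run(worker, nthreads=4):
    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

run(unsafe_worker)
run(safe_worker)
print(counter_safe)        # always 800 (4 threads x 200 increments)
print(counter_unsafe)      # usually less than 800: lost updates
```

The unsafe result varies from run to run, which is exactly why testing and reproducing faulty behavior are so difficult in parallel programs.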
The software industry has invested a lot of time and effort in creating the existing software base. In the past, when growth in computing performance was on its exponentially rising curve (see previous chapters), most applications would automatically run faster on a faster machine.
There has been a great deal of research aimed at minimizing the cost of software development for parallel machines. There will be a major prize if we succeed in doing so in a way that allows reuse of the large software-code base that has been developed over many years. Automatic parallelization has had some successes, such as instruction-level parallelism and fine-grained loop-level parallelism in FORTRAN programs operating on arrays. The theory for automatically transforming code of this sort is well understood, although compilers often must perform substantial code restructuring to make programs run effectively. In practice, the performance of the resulting programs is quite sensitive to the particular details of how the program is written, and these approaches are more effective for fine-grained parallelism than for the more useful coarse-grained parallelism. Moreover, demand has not been sufficient to sustain research and development in a parallel-software tool industry.
Most programs, once coded sequentially, have many data dependences that prevent automatic parallelization. Studies that analyzed the inherent dependences in sequential programs have confirmed that such dependences abound. Sometimes, a data dependence is
a result of a reuse of memory locations, which may be eliminated through analysis. It is perhaps not surprising that programs written with a sequential-machine model cannot automatically be parallelized. Researchers have also explored whether expressing computation in a different way may expose the parallelism inherent in a program more readily. In data-flow and functional programs, the memory is not reused, and computation can proceed as soon as the operands are ready. That translates to abundant parallelism but adds substantial cost in memory use and copying overhead because data structures cannot be updated in place. Analysis is then necessary to determine when memory can be reused, which is the case if the program will no longer touch the old structure. Optimizing a functional language then becomes a problem of replacing the creation of new data structures with in-place updates of old ones. The analysis requires the discovery of all potential read accesses to a data structure before it can be reclaimed, which in turn necessitates analyzing aliases to detect whether two expressions can refer to the same value. Such analysis is no easier than automatic parallelization in that both require accurate aliasing information, which is not practical for problems with complex pointer-based data structures.
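The distinction between parallelizable and dependence-bound loops, and the functional restatement that avoids in-place updates, can be illustrated in a few lines. The sketch below is plain Python and the function names are ours, chosen only for exposition:

```python
# Independent iterations: each output element is computed from b[i]
# alone, so the iterations could run on separate processors.
def scale(b, c):
    return [c * x for x in b]

# A loop-carried dependence: iteration i reads the value written by
# iteration i-1, so the iterations cannot naively run in parallel.
def prefix_sum(b):
    out = []
    total = 0
    for x in b:
        total += x              # depends on the previous iteration
        out.append(total)
    return out

# A "functional" restatement that never updates memory in place: every
# step builds a new list. The dependence is now explicit in the data
# flow, but at the cost of extra allocation and copying, as the text
# describes.
def prefix_sum_functional(b):
    if not b:
        return []
    rest = prefix_sum_functional(b[:-1])
    return rest + [(rest[-1] if rest else 0) + b[-1]]

print(scale([1, 2, 3], 10))                  # [10, 20, 30]
print(prefix_sum([1, 2, 3, 4]))              # [1, 3, 6, 10]
print(prefix_sum_functional([1, 2, 3, 4]))   # [1, 3, 6, 10]
```

An optimizer for the functional version would try to prove that `rest` is never referenced again and reuse its storage in place, which is the alias-analysis problem the paragraph above describes.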
THE STATE OF THE ART OF PARALLEL PROGRAMMING
Notwithstanding the challenges presented by parallelism, there have been some success stories over the years. This section describes several parallel approaches to illustrate the array of applications to which parallelism has been applied and the array of approaches that are encompassed under the term parallelism. None of the approaches described here constitutes a general-purpose solution, and none meets all the emerging requirements (described above) that the performance slowdown and new architectures will require; but they may offer lessons for moving forward. Historically, the success of any given programming approach has been strongly influenced by the availability of hardware that is well matched to it. Although there are cases of programming systems that run on hardware that is not a natural fit, the trends in parallel hardware have largely determined which approaches are successful. The specific examples discussed here are thread programming for shared memory, message-passing interface, MapReduce (used to exploit data parallelism and distributed computation), and ad hoc distributed computation (as in such efforts as SETI@home). The latter is not normally thought of as a parallel-programming approach, but it is offered here to demonstrate the variety of approaches that can be considered parallelism.
Thread Programming for Shared Memory
The concept of threads, independent computations within a shared memory space, is popular for programming parallel shared-memory systems and for writing applications that involve asynchronous interaction with the environment, for example, user interfaces in which one thread of a computation is waiting for a response from the user while another thread is updating a display and a third may be performing calculations on the basis of earlier input. In the latter case, there may be only a single processor in the system and therefore no real parallelism, but the thread execution is interleaved, making the computations appear to be concurrent. And the performance advantage is real, in the same sense that allowing someone with only one item to go ahead in a supermarket line can improve net “system throughput” for all concerned. The word thread is used in a variety of ways in different programming systems, but in general two properties are associated with threads: the ability to create parallel work dynamically, so the number of threads in a given execution may vary over time, and the ability of threads to read and write shared variables.
Threads require little or no language modification: only a small set of primitives to create and destroy parallelism, plus synchronization to control access to shared variables. The most common system-level library for threads is the POSIX Thread, or “PThread,” library, which allows a programmer to create a parallel computation by providing a function and an argument that will be passed to that function when it begins executing. Threads are first-class values in the language, so they can be named, stored in data structures, and passed as arguments, and one can wait for completion of a thread by performing a “join” operation on the thread. The PThread library contains synchronization primitives to acquire and release locks, which are used to give one thread exclusive access to shared data structures. There are other features of thread creation and synchronization, but the set of primitives is relatively small and easy to learn.
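The create-with-a-function-and-argument, join, and lock workflow just described can be sketched with Python's `threading` module, whose API mirrors those primitives. This is an illustration of the pattern, not PThreads itself; the worker function and chunk names are ours:

```python
import threading

results = {}
results_lock = threading.Lock()

def worker(name, data):
    # The "function plus argument" supplied at thread creation, as with
    # pthread_create; the lock plays the role of a pthread mutex,
    # giving one thread at a time exclusive access to shared state.
    total = sum(data)
    with results_lock:
        results[name] = total

# Threads are first-class values: they can be stored in a list and
# joined later.
threads = [
    threading.Thread(target=worker,
                     args=(f"chunk{i}", range(i * 10, (i + 1) * 10)))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()          # analogous to pthread_join

print(sum(results.values()))   # 780, the sum of 0..39
```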
Although the set of primitives in PThreads is small, it is a low-level programming interface that involves function pointers, loss of type information on the arguments, and manual error-checking. To address those issues, several language-level versions of threads provide a more convenient interface for programmers. For example, the Java thread model and the more recent Threading Building Blocks (TBB) library for C++ use object-oriented programming abstractions to provide thread-management capabilities in those languages. Java threads are widely used for programming user interfaces and other concurrent programming problems, as described above, but the runtime support for true parallelism is more recent, so there is less experience in using Java threads for parallel programming. TBB is also relatively recent but has demonstrated support in both industry and academe. In the 1980s, when shared-memory hardware was especially popular, the functional-language community introduced the notion of a “future”14 that wraps a function invocation and results in an implicit synchronization of thread completion; any attempt to access the return value of the function waits until the thread has completed. The closely related idea of a “promise”15 also wraps a function invocation but uses a special return type that must be explicitly unwrapped, making the wait for thread completion explicit.
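Python's standard `concurrent.futures` module gives the explicit-unwrap flavor of this idea: the wrapped invocation returns immediately, and calling `result()` blocks until the underlying computation has completed. A minimal sketch (`slow_square` is an arbitrary placeholder function):

```python
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    return x * x

with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns at once with a future; the computation runs
    # concurrently with whatever the caller does next.
    fut = pool.submit(slow_square, 7)
    # ... other work could proceed here in parallel ...
    value = fut.result()   # explicit unwrap: blocks until completion

print(value)   # 49
```

An implicit future, as in the original MULTILISP design, would make even the `result()` call invisible, blocking automatically at the first attempt to use the value.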
Two issues with dynamic thread creation are the overhead of thread creation and the policy for load-balancing threads among processors. A program written to create a thread everywhere that one could be used will typically overwhelm the scheduler. Several research efforts address such problems, including one extension of C called Cilk,16 which is now supported by the Cilk Arts company. Cilk uses the syntactic block structure of the language to restrict thread creation and completion to simple nested patterns. It also uses a lazy thread creation model, which allows many threads to execute with low overhead as simple function calls if no processor is available to execute the thread. Instead, the runtime system on an idle processor steals work randomly from other processors. Allowing lazy-thread creation affects the semantics of the threading model; the PThread semantics require that each thread eventually execute even if there are enough other threads to keep the processors busy. In contrast, Cilk makes no such guarantee, so in a Cilk program, if one thread waits for a variable to be set by another thread, it may wait forever. Waiting for a variable to be set or a data structure to be updated without some explicit synchronization is generally considered dubious programming practice in parallel code although it is a popular technique for avoiding the overhead associated with system-provided synchronization primitives.
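The random work stealing at the heart of Cilk's scheduler can be sketched in a few dozen lines. The toy below is not Cilk: it uses plain Python threads, one double-ended queue per worker, and tasks that spawn no subtasks, which keeps termination detection trivial. Owners pop work from the bottom of their own deque; idle workers steal from the top of a randomly chosen victim's deque:

```python
import random
import threading
from collections import deque

NWORKERS = 3
deques = [deque() for _ in range(NWORKERS)]
deque_locks = [threading.Lock() for _ in range(NWORKERS)]
results = []
results_lock = threading.Lock()

# Seed worker 0 with all the tasks; the others start idle and must steal.
for n in range(20):
    deques[0].append(n)

def worker(my_id):
    while True:
        task = None
        with deque_locks[my_id]:
            if deques[my_id]:
                task = deques[my_id].pop()       # take from own bottom
        if task is None:
            # Idle: steal from the top of a random victim's deque.
            victim = random.randrange(NWORKERS)
            with deque_locks[victim]:
                if deques[victim]:
                    task = deques[victim].popleft()
        if task is None:
            # No new tasks are ever spawned here, so globally empty
            # deques mean the computation is finished.
            if all(not d for d in deques):
                return
            continue
        with results_lock:
            results.append(task * task)          # "execute" the task

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NWORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))   # the squares 0, 1, 4, ..., 361, each exactly once
```

A real Cilk runtime adds the lazy-creation trick described above: a stolen task is promoted to a genuine parallel thread, while unstolen tasks execute as cheap ordinary function calls.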
In scientific computing, the most popular programming interface for shared-memory programming is OpenMP, a standard that emphasizes loop-level parallelism but also has support for more general task parallelism. OpenMP addresses the thread-overhead issue by dividing a set of iterations into groups so that each thread handles a set of iterations, and the programmer is able to control the load-balancing policy. It also gives more flexibility in restricting data that are private to a thread, as opposed to allowing them to be shared by all threads; and controlled forms of syn-
14Robert Halstead, 1985, MULTILISP: A language for concurrent symbolic computation, ACM Transactions on Programming Languages and Systems 7(4): 501-538.
15Barbara Liskov, 1988, Distributed programming in Argus, Communications of the ACM 31(3): 300-312.
16For more on Cilk, see its project page at The Cilk Project, MIT website, at http://supertech.csail.mit.edu/cilk/index.html.
chronization and parallelism avoid some kinds of programming errors. OpenMP is sometimes used with message-passing to create a hybrid programming model for large-scale computing clusters in which individual nodes have hardware support for shared memory.
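OpenMP's core mechanism, dividing the iterations of a loop into per-thread chunks, can be sketched without OpenMP itself. The helper below is a plain-Python stand-in (its name and signature are ours) for a statically scheduled `parallel for`:

```python
import threading

def parallel_for(n, body, nthreads=4):
    # Static schedule: split the n iterations into one contiguous
    # chunk per thread, as OpenMP does by default for a parallel loop.
    def run_chunk(lo, hi):
        for i in range(lo, hi):
            body(i)

    chunk = (n + nthreads - 1) // nthreads
    threads = [
        threading.Thread(target=run_chunk,
                         args=(t * chunk, min((t + 1) * chunk, n)))
        for t in range(nthreads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Each iteration writes a distinct element, so no synchronization is
# needed inside the loop body.
out = [0] * 10
parallel_for(10, lambda i: out.__setitem__(i, i * i))
print(out)   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

OpenMP's other schedules (dynamic, guided) address the load-balancing concern raised earlier by handing out smaller chunks on demand instead of fixing the partition up front.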
The biggest drawbacks of thread programming are the potential for uncontrolled access to shared variables and the lack of locality control. Shared-variable accesses produce race conditions, in which two threads access a variable and at least one writes it; the result is indeterminate behavior that depends on the access order and may vary from one run to the next. Such accesses make testing and debugging especially difficult. Synchronization primitives, such as the locks used to avoid races, have their own forms of subtle error: threads that acquire multiple locks can deadlock, each of two threads waiting for a lock held by the other. The research community has developed tools to detect those kinds of parallel-programming errors, but they have not reached the level of generality, accuracy, and speed that would encourage widespread deployment. Thread programming remains an error-prone process best handled by expert programmers, not the broader programming community of persons who have little formal training in programming, who would find it extremely challenging to create and maintain reliable code with these models. Yet it is that broader community that fueled the growth in computing applications and the associated economic and social effects.
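The deadlock pattern just described, and the standard discipline that defeats it, can be shown concretely. In this sketch (plain Python; the `acquire_both` helper and the global ordering table are ours, not a library API), every thread acquires locks in one fixed global order, so two threads can never each hold a lock the other is waiting for:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
# Impose a global order on locks; every thread must acquire in this order.
ORDER = {id(lock_a): 0, id(lock_b): 1}

def acquire_both(first, second):
    # Sort by the global order before acquiring. Without this, one
    # thread holding lock_a while waiting for lock_b, and another
    # holding lock_b while waiting for lock_a, would deadlock.
    lo, hi = sorted((first, second), key=lambda l: ORDER[id(l)])
    lo.acquire()
    hi.acquire()
    return lo, hi

log = []

def transfer(name, l1, l2):
    lo, hi = acquire_both(l1, l2)
    try:
        log.append(name)       # critical section touching both resources
    finally:
        hi.release()
        lo.release()

# One thread asks for (a, b), the other for (b, a): the classic
# deadlock-prone interleaving, rendered harmless by the ordering.
t1 = threading.Thread(target=transfer, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=transfer, args=("t2", lock_b, lock_a))
t1.start(); t2.start(); t1.join(); t2.join()
print(sorted(log))   # ['t1', 't2']
```

The discipline is simple to state but easy to violate in a large code base, which is one reason the text reserves this style of programming for experts.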
The lack of locality support in threaded models limits the scalability of the underlying architecture and calls for some form of cache coherence, which traditionally has been a hardware challenge that grows exponentially harder as the number of processors grows.17 On the scale of chip multiprocessor systems available today, the problem is tractable, but even with on-chip data transfer rates, it is unclear how performance will be affected as core counts grow. Further complicating the programming problem for shared memory, many shared-memory machines with coherent caches use a relaxed consistency model; that is, some memory operations performed by one thread may appear to be performed in a different order by another thread. There is some research on mapping OpenMP and Cilk to distributed-memory systems or building shared-memory support with cache coherence in software, but the locality-aware model has often proved superior in performance even on systems with hardware-supported shared memory. Savvy thread programmers will use system mechanisms to control data layout and thread affinity to processors, but
17More recent parallel languages, such as Chapel and X10, have explicitly included support for locality.
in the end, this model is best reserved for a relatively small set of compiler writers, runtime-system developers, and low-level library programmers.
The combination of scalability limits of shared-memory architectures and the cost and performance benefits of building parallel machines from commodity processors made distributed-memory multiprocessors a popular architecture for high-end parallel computing. Those systems can vary from generic clusters built from personal computers and an Ethernet network to more specialized supercomputers with low-latency high-bandwidth networks that are more closely integrated into the processor nodes. As this architectural model became dominant in the early 1990s, several message-passing systems were developed by scientific programming communities, computer scientists, and industry. In the early 1990s, a group with representatives from those communities began a process to specify the message-passing interface (MPI). MPI has since emerged as the de facto standard programming model for high-performance computing and has nearly ubiquitous support among machines, including open-source implementations, such as MPICH and OpenMPI, that can be ported to new interconnection networks with modest effort.18 Although the standardization decisions in defining MPI were far from obvious, the relative ease with which application developers moved code from one of the previous models to MPI reflects the commonality of base concepts that already existed in the community. MPI has also proved to be a highly scalable programming model and is used today in applications that regularly run on tens of thousands of processor cores, and sometimes over 100,000, in the largest supercomputers in the world. MPI is generally used to program a single cluster or supercomputer that resides at one site, but “grid-computing” variations of MPI and related libraries support programming among machines at multiple sites. It is also a low-level programming model, however, with analogues to the challenges presented by the machine-language programming mentioned earlier.
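The send/receive/reduce style that MPI embodies can be sketched without MPI itself. The toy below is plain Python with one message queue per "rank" standing in for the network; the layout and names are illustrative, not MPI's API. Each rank sums its own partition of a distributed array and sends the partial result to rank 0, which performs the reduction:

```python
import queue
import threading

NRANKS = 4
# One inbox per rank; put() models a send, get() a blocking receive.
inboxes = [queue.Queue() for _ in range(NRANKS)]
# The "distributed" array: each rank owns one contiguous partition.
data = [list(range(r * 25, (r + 1) * 25)) for r in range(NRANKS)]
result = []

def rank(r):
    local = sum(data[r])             # each rank works on its own data
    if r != 0:
        inboxes[0].put(local)        # send the partial sum to rank 0
    else:
        total = local
        for _ in range(NRANKS - 1):  # rank 0 receives and reduces
            total += inboxes[0].get()
        result.append(total)

threads = [threading.Thread(target=rank, args=(r,)) for r in range(NRANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(result[0])   # 4950, the sum of 0..99
```

In real MPI the same pattern is a few calls (send, receive, or a single collective reduce), but each rank is a separate process on a separate node, and the data never share an address space, which is what forces the programmer to think explicitly about distribution and communication.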
MPI has enabled tremendous scientific breakthroughs in nearly every scientific domain with some of the largest computations in climate change, chemistry, astronomy, and various aspects of defense. Computer simulations have demonstrated human effects on climate change and are critical
18For more on the MPI standard, see the final version of the draft released in May 1994, available online at http://www.mcs.anl.gov/Projects/mpi/standard.html. See also Open MPI: Open source high performance computing at http://www.open-mpi.org, and Peter Pacheco, 1997, Parallel Programming with MPI, fourth edition, San Francisco, Cal.: Morgan Kaufmann.
for international environmental-policy negotiations, as recognized in the 2007 Nobel prize awarded to the Intergovernmental Panel on Climate Change. The climate-modeling problem is far from solved as researchers attempt to identify particular phenomena, such as the disappearance of polar ice caps; effects on the frequency or severity of hurricanes, floods, or droughts; and the effects of various mitigation proposals. The size and accuracy of the computations continue to push the limits of available computing systems, consuming tens of millions of processor-hours each year. The Community Climate Simulation Model and nearly all other major codes used for global-scale long-term climate simulations are written in MPI. A related problem that also relies heavily on MPI is weather-forecasting, which is more detailed but of shorter term and on smaller regional scales than climate modeling. MPI programs have been used to understand the origins of the universe on the basis of a study of cosmic microwave background (CMB) and to study quantum chromodynamics in particle physics; both uses are based on Nobel prize-winning work in physics. MPI is used in the design of new materials, in crash simulations in the automobile industry, in simulating earthquakes to improve building designs and standards, and in fluid-dynamics simulations for improving aircraft design, engine design, and biomedical simulations.
There are limitations, however. MPI is designed for comparatively coarse-grained parallelism—among clusters of computers—not for parallelism between cores on a chip. For example, most supercomputers installed in the last few years have dual-core or quad-core processing chips, and most applications use an MPI process per core. The model is shifting as application scientists strive to make more effective use of shared-memory chip multiprocessors by exploiting a mixture of MPI with OpenMP or PThreads. MPI-3 is a recent effort to implement MPI on multicore processors although memory-per-core constraints are still a barrier to MPI-style weak scaling. Indeed, the motivation to mix programming models is often driven more by memory footprint concerns than by performance itself because the shared-memory model exploits fine-grained parallelism better in that it requires less memory per thread. Moreover, there is a broad class of applications of particular interest to the defense and intelligence communities for which MPI is not appropriate, owing to its mismatch with the computational patterns of particular problems.
MapReduce: Exploiting Data Parallelism and Distributed Computation
MapReduce is a data-processing infrastructure19 developed internally by Google and later popularized in the Hadoop open-source version.20 MapReduce is targeted at large-scale data-parallel workloads in which the input is represented as a set of key-value pairs and computation is expressed as a sequence of two user-provided functions: map and reduce. The map function processes the input to create intermediate key-value pairs. Intermediate pairs with the same key are aggregated by the system and fed to the reduce function, which produces the final output.
What makes MapReduce particularly compelling is that it frees the programmer from the need to worry about much of the complexity required to run large-scale computations. The programmer needs only to produce the body of the two MapReduce methods, and the system takes care of parallelization, data distribution, fault tolerance, and load-balancing.
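The division of labor just described can be sketched in miniature. Below, `map_fn` and `reduce_fn` are the two user-provided functions (here, the classic word count), and `map_reduce` stands in for the system's role; a real implementation would also parallelize, distribute data, balance load, and tolerate failures, none of which the user's two functions need to mention:

```python
from collections import defaultdict

# User-provided map function: emit (word, 1) for each word in a document.
def map_fn(doc):
    for word in doc.split():
        yield (word, 1)

# User-provided reduce function: combine all values for one key.
def reduce_fn(key, values):
    return (key, sum(values))

def map_reduce(inputs, map_fn, reduce_fn):
    # The "system" part: run map over the inputs, shuffle intermediate
    # pairs by key, then run reduce once per key.
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = map_reduce(docs, map_fn, reduce_fn)
print(counts["the"], counts["fox"])   # 3 2
```

Because each map call and each reduce call is independent, the system may run them on thousands of machines, re-executing any call whose machine fails, without changing the user-visible result.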
The class of problems that can be mapped to this relatively simple processing framework is surprisingly broad. MapReduce was conceived to simplify the data-processing involved in creating a Web index from a set of crawled documents, but developers have also used it for large-scale graph algorithms (for example, finding citation occurrences among scientific articles), computing query-term popularity (for example, Google Trends21), and creating language models for statistical machine translation (for example, finding and keeping a count for every unique sequence of five or more terms in a large corpus), among other applications. Within Google itself, MapReduce’s popularity has exploded since its introduction.
MapReduce has also been found to be useful for systems much smaller than the large distributed clusters for which it was designed. Ranger et al. examined the adequacy of the MapReduce framework for multicore and multiprocessor systems22 and found that it was equally compelling as a programming system for this class of machines. In a set of eight applications both coded with a version of MapReduce (the Phoenix runtime) and hand-coded directly, the authors found that MapReduce-coded ver-
19See Jeffrey Dean and Sanjay Ghemawat, 2008, MapReduce: Simplified data processing on large clusters, Communications of the ACM 51(1): 107-113, and Michael Noth, 2006, Building a computing system for the world’s information, Invited talk, University of Iowa, IT Tech Forum Presentations, April 20, 2006.
20See The Apache Hadoop project, available at http://hadoop.apache.org.
21See Google trends, available at http://trends.google.com.
22Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, Christos Kozyrakis, 2007, Evaluating MapReduce for multi-core and multiprocessor systems, Proceedings of the IEEE 13th International Symposium on High-Performance Computer Architecture, Phoenix, Ariz., February 10-14, 2007.
sions roughly matched the performance of the hand-coded versions for five. The remaining three applications did not fit well with MapReduce’s key-value data-stream model, and the hand-coded versions performed significantly better.
Despite MapReduce’s success as a programming system beyond its initial application domain and machine-architecture context, it is far from a complete solution for extracting and managing parallelism in general. It remains limited to, for example, batch-processing systems and is therefore not suitable for many on-line serving systems. MapReduce also does not extract implicit parallelism from an otherwise sequential algorithm but instead facilitates the partitioning, distribution, and runtime management of an application that is already essentially data-parallel. Such systems as MapReduce and NVIDIA’s CUDA,23 however, point to a solution strategy for the general programming challenge of large-scale parallel systems. The solutions are not aimed at a single programming paradigm for all possible cases but are based on a small set of programming systems that can be specialized for particular types of applications.
Distributed Computation—Harnessing the World’s Spare Computing Capacity
The increasing quality of Internet connectivity around the globe has led researchers to contemplate the possibility of harnessing the unused computing capability of the world’s personal computers and servers to perform extremely compute-intensive parallel tasks. The most notable examples have come in the form of volunteer-based initiatives, such as SETI@home (http://setiathome.berkeley.edu/) and Folding@home (http://folding.stanford.edu/). The model consists of breaking a very large-scale computation into subtasks that can operate on relatively few input and result data, require substantial processing over that input set, and do not require much (or any) communication between subtasks other than passing of inputs and results. Those initiatives attract volunteers (individuals or organizations) that sympathize with their scientific goals to donate computing cycles on their equipment to take on and execute a number of subtasks and return the results to a coordinating server that coalesces and combines final results.
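The shape of such a computation (small inputs and outputs, substantial independent work per subtask, no communication between subtasks) can be sketched in one process. The coordinator, volunteer, and subtask functions below are ours, and the workload, counting primes in a range, is merely a stand-in with the right structure:

```python
# Coordinator splits a large job into independent subtasks, "volunteers"
# compute them (here, sequentially in a single process), and the
# coordinator coalesces the returned results.

def make_subtasks(lo, hi, n):
    # Partition [lo, hi) into n contiguous ranges: tiny task descriptions.
    step = (hi - lo) // n
    return [(lo + i * step, lo + (i + 1) * step if i < n - 1 else hi)
            for i in range(n)]

def volunteer_compute(subtask):
    # Substantial computation over a small input; returns a small result.
    lo, hi = subtask
    def is_prime(k):
        if k < 2:
            return False
        return all(k % d for d in range(2, int(k ** 0.5) + 1))
    return sum(1 for k in range(lo, hi) if is_prime(k))

def coordinate(lo, hi, nvolunteers):
    subtasks = make_subtasks(lo, hi, nvolunteers)
    # Results arrive independently; the order of completion is irrelevant.
    return sum(volunteer_compute(s) for s in subtasks)

print(coordinate(2, 100, 4))   # 25 primes below 100
```

A real system such as BOINC adds what the toy omits: redundant assignment of subtasks to detect erroneous or malicious results, retry of subtasks whose volunteers disappear, and the security measures discussed below.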
The relatively few success cases of the model have relied not only on the friendly nature of the computation to be performed (vast data parallelism with very low communication requirements for each unit of
23 CUDA is a parallel computing approach aimed at taking advantages of NVIDIA graphical processing units. For more, see CUDA zone, NVIDIA.com, available at http://www.nvidia.com/object/cuda_home.html.
computing) but also on the trust of the volunteers that the code is safe to execute on their machines. For more widespread adoption, this particular programming model would require continuing improvements in secure-execution technologies, an economic model that gives users an incentive to donate their spare computing capacity, and improvements in Internet connectivity. To some extent, one can consider large illegitimately assembled networks of hijacked computers (botnets) to be an exploitation of this computing model; they demonstrate, albeit for malicious ends, the potential value of harnessing a large number of well-connected computing systems, a capacity that could be directed toward nobler aims.
These success stories show that there are already a wide variety of computational models for parallel computation and that science and industry are successfully harnessing parallelism in some domains. The parallelism success stories bode well for the future if we can find ways to map more applications to the models or, for computations that do not map well to the models, if we can develop new models.
PARALLEL-PROGRAMMING SYSTEMS AND THE PARALLEL SOFTWARE “STACK”
The general problem of designing parallel algorithms and programming them to exploit parallelism is an extremely important, timely, and unsolved problem. The vast majority of software in use today is sequential. Although the previous section described examples of parallel approaches that work in particular domains, general solutions are still lacking. Many successful parallel approaches are tied to a specific type of parallel hardware (MPI and distributed clusters; the storage-cluster architecture that heavily influenced MapReduce; OpenGL and SIMD graphics processors; and so on). The looming crisis that is the subject of this report comes down to the question of how to continue to improve performance scalability as architectures continue to change and as more and more processors are added. There has been some limited success, but there is not yet an analogue of the sequential-programming models that have been so successful in software for decades.
We know some things about what new parallel-programming approaches will need. A high-level performance-portable programming model is the only way to restart the virtuous cycle described in Chapter 2. The new model will need to be portable over successive generations of chips, multiple architectures, and different kinds of parallel hardware, and it will need to scale well. For all of those goals to be achieved, the
entire software stack will need to be rethought, and architectural assumptions will need to be included in the stack. Indeed, in the future, the term software stack will be a misnomer. A “software-hardware stack” will be the norm. The hardware, the programming model, and the applications will all need to change.
A key part of modern programming systems is the software stack that executes the program on the hardware. The stack must also allow reasoning about the five main challenges to scalable and efficient performance: parallelism, communication, locality, synchronization, and load-balancing. The components of a modern software stack include
- Libraries: Generic and domain-specific libraries provide application programmers with predefined software components that can be included in applications. Because library software may be reused in many applications, it is often highly optimized by hand or with automated tools.
- Compiler: An ahead-of-time or a just-in-time compiler translates the program into assembly code and optimizes it for the underlying hardware. Just-in-time compilers combine profile data from the current execution with static program analysis to perform optimizations.
- Runtime system or virtual machine: These systems manage fine-grain memory resources, application-thread creation and scheduling, runtime profiling, and runtime compilation.
- Operating system: The operating system manages processes and their resources, including coarse-grain memory management.
- Hypervisors: Hypervisors abstract the hardware context to provide performance portability for operating systems among hardware platforms.
Because programming systems are mostly sequential, the software stack mostly optimizes and manages sequential programs. Optimizing and understanding the five challenges at all levels of the stack for parallel approaches will require substantial changes in these systems and their interfaces, and perhaps researchers should reconsider whether the overall structure of the stack is a good one for parallel systems.
Researchers have made some progress in system support for parallel-programming models.24 Over the years, researchers and industry have developed parallel-programming system tools, including languages, compilers, runtime environments, libraries, components, and frameworks, to assist programmers and software developers in managing parallelism. We list some examples below.
- Runtime abstractions: multiprogramming and virtualization. The operating system can exploit chip multiprocessors at a coarse granularity immediately because operating systems can run multiple user and kernel processes in parallel. Virtualization runs multiple operating systems in parallel. However, it remains challenging to manage competition for shared resources, such as caches, when a particular application load varies dramatically.
- Components: database transactions and Web applications. Database-transaction systems provide an extremely effective abstraction in which programs use a sequential model, without the need to worry about synchronization and communication, and the database coordinates all the parallelism between user programs. Success in this regard emerged from over 20 years of research in parallelizing database systems.
- Frameworks: three-tiered Web applications and MapReduce. Such frameworks as J2EE and Websphere make it easy to create large-scale parallel Web applications. For example, MapReduce (described above) simplifies the development of a large class of distributed applications that combine the results of the computation of distributed nodes. Web applications follow the database-transaction model in which users write sequential tasks and the framework manages the parallelism.
- Libraries: graphics libraries. Graphics libraries, such as those for DirectX 10 and OpenGL, hide the details of parallelism in graphics hardware from the user.
- Languages: CUDA, Fortress, Cilk, X10, and Chapel. These languages seek to provide an array of high-level and low-level abstractions that help programmers to develop classes of efficient parallel software faster.
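The division of labor in the framework style listed above can be illustrated with a toy word-count job in the MapReduce model. The programmer writes only sequential map and reduce functions; a small driver (standing in here for the framework, using a standard-library thread pool rather than a cluster of machines) decides how the map tasks are run in parallel. This is a sketch for illustration, not any particular MapReduce implementation:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_words(document):
    """Sequential user code: emit (word, 1) pairs for one document."""
    return [(word, 1) for word in document.split()]

def reduce_counts(word, counts):
    """Sequential user code: combine all counts for one word."""
    return word, sum(counts)

def run_job(documents, workers=4):
    """Toy driver standing in for the framework: the driver, not the
    programmer, decides how the map tasks are parallelized.  A real
    framework would spread them over many machines; a thread pool
    stands in for that here."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(map_words, documents)    # parallel map phase
    groups = defaultdict(list)                     # shuffle phase
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return dict(reduce_counts(w, c) for w, c in groups.items())

print(run_job(["the quick fox", "the lazy dog", "the fox"]))
# {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The essential point is that `map_words` and `reduce_counts` contain no synchronization or communication at all; the parallelism lives entirely in the driver, which is written once and reused across applications.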
What those tools suggest is that managing parallelism is another, more challenging facet of software engineering; it can be thought of as akin to a complex version of the problem of resource management. Parallel-program productivity can be improved if we can develop languages that provide useful software-engineering abstractions for parallelism, parallel components and libraries that programmers can reuse, and a software-hardware stack that facilitates reasoning about all of them.
24One recent example is a parallel debugger, STAT, from the Lawrence Livermore National Laboratory, available at http://www.paradyn.org/STAT/STAT.html, presented at Supercomputing 2008. See Gregory L. Lee, Dong H. Ahn, Dorian C. Arnold, Bronis R. de Supinski, Matthew Legendre, Barton P. Miller, Martin Schulz, and Ben Liblit, 2008, Lessons learned at 208K: Towards debugging millions of cores, available online at ftp://ftp.cs.wisc.edu/paradyn/papers/Lee08ScalingSTAT.pdf, last accessed on November 8, 2010.
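One example of a reusable software-engineering abstraction for parallelism is the fork-join pattern popularized by Cilk: the programmer marks independent subcomputations, and the runtime decides how to map them onto processors. The Python sketch below (illustrative only, not drawn from any system cited above) imitates that division of labor, with the standard-library thread pool standing in for a parallel runtime:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, grain=1_000, workers=4):
    """Fork-join reduction in the style Cilk popularized: fork one
    independent subtask per chunk, then join by combining the partial
    results.  The executor, not the caller, maps subtasks to threads."""
    chunks = [data[i:i + grain] for i in range(0, len(data), grain)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum, chunks)   # fork: independent subtasks
        return sum(partials)               # join: sequential combine

print(parallel_sum(list(range(10_000))))   # 49995000
```

In CPython the global interpreter lock limits the speedup for pure-Python work, so this illustrates the programming model rather than its performance; the `grain` parameter shows the kind of granularity decision such abstractions let programmers tune without rewriting the algorithm.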
MEETING THE CHALLENGES OF PARALLELISM
The need for robust, general, and scalable parallel-software approaches presents challenges that affect the entire computing ecosystem. There are numerous possible paths toward a future that exploits abundant parallelism while managing locality. Parallel programming and parallel computers have been around since the 1960s, and much progress has been made. Much of the challenge of parallel programming lies in making parallel programs efficient and portable without requiring heroic efforts on the part of the programmer. No subfield or niche will be able to solve the problem of sustaining growth in computing performance on its own. The uncertainty about the best way forward is inhibiting investment; in other words, there is currently no parallel-programming approach compelling enough to drive hardware development. Historically, a vendor might have taken on the risk and invested heavily in developing such an ecosystem, but given the current uncertainty there is not enough of this investment, which requires innovation and entails risk. Research investment along multiple fronts, as described in this report, is essential.25
Software lasts a long time. The huge entrenched base of legacy software is part of the reason that people resist change and resist investment in new models, which may or may not take advantage of the capital investment represented by legacy software. Rewriting software is expensive. The economics of software thus creates pressure against innovative models and approaches; it also explains why the approaches we have seen have had relatively narrow applications or have been incremental. Industry, for example, has turned to chip multiprocessors (CMPs) that replicate existing cores a few times (and many times in the future). Care is taken to maintain backward compatibility to bring forward the existing multibillion-dollar installed software base. With prospects dim for repeated doublings of single-core performance, CMPs inherit the mantle as the most obvious alternative, and industry is motivated to devote substantial resources to moving compatible CMPs forward. The downside is that the core designs being replicated are optimized for serial code, with support for dynamic parallelism discovery, such as speculation and out-of-order execution, that may waste area and energy on programs that are already parallel. At the same time, they may be missing some of the features needed for highly efficient parallel programming, such as lightweight synchronization, global communication, and locality control in software. A great deal of research remains to be done on on-chip networking, cache coherence, and distributed cache and memory management.
25A recent overview in Communications of the ACM articulates the view that developing software for parallel cores needs to become as straightforward as writing software for traditional processors: Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick, 2009, A view of the parallel computing landscape, Communications of the ACM 52(10): 56-67, available online at http://cacm.acm.org/magazines/2009/10/42368-a-view-of-the-parallel-computing-landscape/fulltext.
One important role for academe is to explore CMP designs that are more aggressive than industry’s designs. Academics should project both hardware and software trends much further into the future to seek possible inflection points, even if they are not sure when, or even whether, a technology will transition from academe to industry. Moreover, researchers have the opportunity to break the shackles of strict backward compatibility. Promising ideas should be nurtured to see whether they can either create enough benefit to be adopted without portability or enable portability strategies to be developed later. There needs to be an intellectual ecosystem that enables ideas to be proposed, cross-fertilized, and refined and, ultimately, the best approaches to be adopted. Such an ecosystem requires sufficient resources to enable contributions from many competing and cooperating research teams.
Meeting the challenges will involve essentially all aspects of computing. Focusing on a single component—assuming a CMP architecture or a particular number of transistors, focusing on data parallelism or on heterogeneity, and so on—will be insufficient to the task. Chapter 5 discusses recommendations for research aimed at meeting the challenges.