4 The End of Programming as We Know It
Pages 105-131

From page 105...
... The committee explores examples of successes in parallel software and programming and possibilities for leveraging those successes, as well as examples of the limitations of parallelism and the challenges of programming parallel systems. The sudden shift from single-core to multiple-core processor chips requires a dramatic change in programming, but software developers are also challenged by the continuously widening gap between memory system and processor performance.
From page 106...
... The sequential-programming model evolved in that ecosystem as well. To build innovative, more capable, and sophisticated software, software developers turned increasingly to higher-level sequential-programming languages and higher levels of abstraction (that is, the reuse of libraries and component software for common tasks)
From page 107...
... Although higher levels of abstraction often result in performance penalties, the initial transition away from hand-coded assembly language came with performance gains, in that compilers manage the complexity of low-level code generation, such as register allocation and instruction scheduling, better than most programmers. That pattern of increases in processor performance coupled with increases in the use of programming-language abstractions has played out repeatedly.
From page 108...
... The problem is that much of the innovative software is sequential and is designed to execute on only one processor, whereas the previous chapters explained why all future computers will contain multiple processors. Thus, current programs will not run faster on successive generations of hardware.6 The shift in the hardware industry has broken the performance-portability connection in the virtuous cycle -- sequential programs will not benefit from increases in processor performance that stem from the use of multiple processors.
From page 109...
... In the future, however, all software must be able to exploit multiple processors to enter into a new virtuous cycle with successive generations of parallel hardware that expands software capabilities and generates new applications.7 Finding: There is no known alternative to parallel systems for sustaining growth in computing performance; however, no compelling programming paradigms for general parallel systems have yet emerged. To develop parallel applications, future developers must invent new parallel algorithms and build new parallel applications.
From page 110...
... If the object crosses pieces, the tasks will need to communicate. The programming system can perform communication through inputs and outputs along dependences by reading and writing to shared data structures or by explicitly sending messages. (Footnote 8: This limit on parallelism is often called Amdahl's law, after Gene Amdahl.)
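The footnote's reference to Amdahl's law can be made concrete with the standard statement of the bound; the notation below (f for the parallelizable fraction, N for the number of processors) is ours, not the report's.

```latex
% Amdahl's law: if a fraction f of the work can be spread over N processors
% and the remaining (1 - f) must run serially, the speedup is
%   S(N) = 1 / ((1 - f) + f/N),
% which is bounded above by 1/(1 - f) no matter how many processors are used.
S(N) = \frac{1}{(1 - f) + \dfrac{f}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - f}
```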
From page 111...
... For example, barrier synchronization forces a set of parallel computations to wait until all of them reach the barrier. Locks are used to control access to shared data structures by allowing only one thread to hold a lock at a given time.
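As a concrete illustration of barrier synchronization, the minimal POSIX-threads sketch below makes every thread finish phase 1 before any thread begins phase 2; the thread count and the printed phases are illustrative assumptions, not details from the report. (Lock usage is illustrated in the PThreads sketch accompanying page 120 below.)

```c
/* Sketch of barrier synchronization with POSIX threads: no thread starts
 * phase 2 until every thread has finished phase 1. NTHREADS is illustrative. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    long id = (long)arg;

    printf("thread %ld: phase 1\n", id);   /* work done before the barrier */
    pthread_barrier_wait(&barrier);        /* wait until all NTHREADS arrive */
    printf("thread %ld: phase 2\n", id);   /* runs only after everyone arrives */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];

    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```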
From page 112...
... The advent of chip multiprocessors means that the bandwidth gap will probably continue to widen in that the aggregate rate of computation on a single chip will continue to outpace main-memory capacity and performance improvements. The gap between memory latency and computation is also a limitation in software performance, although this gap will not grow with multicore technology, inasmuch as clock rates are relatively constant.
From page 113...
... The recursive nature of the blocked algorithm also led to the notion of "cache-oblivious" algorithms, in which the recursive subdivision produces successively smaller subproblems that eventually fit into a cache or other fast memory layer.10 Whereas other blocked algorithms are implemented to match the size of a cache, the oblivious algorithms are optimized for locality without having specific constants, such as cache size, in their implementation. Locality optimizations for irregular codes, such as graph algorithms, can be much more difficult because the data structures are built with pointers or indexed structures that lead to random memory accesses.
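A minimal sketch of the cache-oblivious idea is a recursive, out-of-place matrix transpose: the recursion keeps splitting the larger dimension until the subproblem is small, so at some depth the working set fits in whatever cache exists, yet the code never names a cache size. The matrix size N and the base-case cutoff below are illustrative choices, not values from the report.

```c
/* Cache-oblivious style recursive blocked transpose: B = transpose(A).
 * The recursion subdivides until the subproblem is small; no cache size
 * appears anywhere in the code. N and CUTOFF are illustrative. */
#include <stdio.h>

#define N      512
#define CUTOFF 16   /* small base case; any small constant works */

static double A[N][N], B[N][N];

/* Transpose the block of A with rows [r0, r1) and columns [c0, c1) into B. */
static void transpose(int r0, int r1, int c0, int c1)
{
    if (r1 - r0 <= CUTOFF && c1 - c0 <= CUTOFF) {
        for (int i = r0; i < r1; i++)
            for (int j = c0; j < c1; j++)
                B[j][i] = A[i][j];
    } else if (r1 - r0 >= c1 - c0) {
        int rm = (r0 + r1) / 2;          /* split the longer dimension */
        transpose(r0, rm, c0, c1);
        transpose(rm, r1, c0, c1);
    } else {
        int cm = (c0 + c1) / 2;
        transpose(r0, r1, c0, cm);
        transpose(r0, r1, cm, c1);
    }
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = i * N + j;
    transpose(0, N, 0, N);
    printf("B[3][7] = %g (expect %g)\n", B[3][7], A[7][3]);
    return 0;
}
```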
From page 114...
... Software Abstractions and Hardware Mechanisms Needed Simplifying the task of parallel programming requires software abstractions that provide powerful mechanisms for synchronization, load balance, communication, and locality, as described above, while hiding the underlying details. Most current mechanisms for these operations are low-level and architecture-specific.
From page 115...
... Resolving those details will require research, but successful mechanisms will enable low-overhead communication and synchronization and will facilitate migration of data and operations to balance load. There are several emerging directions in hardware to support parallel computations.
From page 116...
... Research on parallel hardware architectures began in earnest in the 1960s.13 Many ways of organizing computers have been investigated, including vector machines, SIMD machines, shared-memory multiprocessors, very-long-instruction-word machines, data-flow machines, distributed-memory machines, nonuniform-memory architectures, and multithreaded architectures. As described elsewhere in this report, single-processor performance has historically increased exponentially, making it difficult for companies promoting specialized parallel architectures to succeed.
From page 117...
... As noted previously, it has long been clear that one of the major hurdles in parallel computing is software development. Even if there were sufficient and appropriate software abstractions to enable parallel programming (Google's MapReduce, discussed below, is an example of a successful approach for a particular class of problems)
From page 118...
... But those approaches can limit the effectiveness of parallel computing by adding overhead or restricting its use. Writing correct sequential code is hard enough, but the complexity of parallel programming is so high that only a small percentage of the programmers in the industry today are competent at it.
From page 119...
... Such analysis is no easier than automatic parallelization in that both require accurate aliasing information, which is not practical for problems with complex pointer-based data structures. THE STATE OF THE ART OF PARALLEL PROGRAMMING Notwithstanding the challenges presented by parallelism, there have been some success stories over the years.
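To make the aliasing difficulty concrete, the hedged C sketch below (the functions and their arguments are illustrative, not from the report) shows a loop that a compiler cannot safely parallelize or vectorize unless it can prove the pointers do not overlap; the C99 restrict qualifier is one way a programmer can supply that information.

```c
/* Illustrative only: why aliasing information matters for automatic
 * parallelization. If dst and src may overlap, loop iterations are not
 * independent and the compiler must keep them in order. */
void scale_may_alias(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];   /* dst[i] might be src[i+1]: unsafe to reorder */
}

/* With C99 'restrict', the programmer asserts that the arrays do not overlap,
 * so the iterations are independent and can be parallelized or vectorized. */
void scale_no_alias(float *restrict dst, const float *restrict src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}
```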
From page 120...
... Threads are first-class values in the language, so they can be named, stored in data structures, and passed as arguments, and one can wait for completion of a thread by performing a "join" operation on the thread. The PThread library contains synchronization primitives to acquire and release locks, which are used to give one thread exclusive access to shared data structures.
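The following minimal C sketch illustrates the operations the passage describes: thread handles stored in a data structure, a join to wait for each thread's completion, and a PThread mutex giving one thread at a time exclusive access to a shared total. The thread count and the per-thread work are illustrative assumptions.

```c
/* Minimal PThreads sketch: thread values kept in an array, join to wait for
 * completion, and a lock protecting shared data. The work is illustrative. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long total = 0;

static void *partial_sum(void *arg)
{
    long id = (long)arg;
    long local = 0;

    for (long k = id * 1000; k < (id + 1) * 1000; k++)
        local += k;                    /* private work: no lock needed */

    pthread_mutex_lock(&lock);         /* exclusive access to shared data */
    total += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];       /* thread handles stored in an array */

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, partial_sum, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);   /* wait for each thread to finish */

    printf("total = %ld\n", total);
    return 0;
}
```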
From page 121...
... Waiting for a variable to be set or a data structure to be updated without some explicit synchronization is generally considered dubious programming practice in parallel code, although it is a popular technique for avoiding the overhead associated with system-provided synchronization primitives. In scientific computing, the most popular programming interface for shared-memory programming is OpenMP, a standard that emphasizes loop-level parallelism but also has support for more general task parallelism.
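A minimal sketch of OpenMP's loop-level parallelism follows: the pragma asks the runtime to split the loop iterations among threads, and the reduction clause combines the per-thread partial sums. The array size and contents are illustrative assumptions.

```c
/* Sketch of OpenMP loop-level parallelism: iterations of the loop are divided
 * among threads, and the reduction clause combines per-thread partial sums. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0 / (i + 1);

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f (using up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```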
From page 122...
... Further complicating the programming problem for shared memory, many shared-memory machines with coherent caches use a relaxed consistency model; that is, some memory operations performed by one thread may appear to be performed in a different order by another thread. There is some research on mapping OpenMP and Cilk to distributed-memory systems or building shared-memory support with cache coherence in software, but the locality-aware model has often proved superior in performance even on systems with hardware-supported shared memory.
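The sketch below illustrates why relaxed consistency matters: one thread writes a data value and then sets a flag, while another waits on the flag and reads the data. Under a relaxed model the two plain stores could appear reordered to the reader, so C11 atomics with release/acquire ordering are used here to make the handoff well defined. Names and values are illustrative, not from the report.

```c
/* Illustration of a publish/subscribe handoff under relaxed consistency,
 * made safe with C11 release/acquire atomics. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data = 0;
static atomic_int ready = 0;

static void *producer(void *arg)
{
    (void)arg;
    data = 42;                                               /* write payload */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* then publish */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                        /* spin until published */
    printf("data = %d\n", data);                 /* guaranteed to see 42 */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```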
From page 123...
... and Peter Pacheco, 1997, Parallel Programming with MPI, fourth edition, San Francisco, Cal.: Morgan Kaufmann.
From page 124...
... MPI-3 is a recent effort to implement MPI on multicore processors, although memory-per-core constraints are still a barrier to MPI-style weak scaling. Indeed, the motivation to mix programming models is often driven more by memory-footprint concerns than by performance itself, because the shared-memory model exploits fine-grained parallelism better in that it requires less memory per thread.
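For readers unfamiliar with the message-passing style discussed here, the minimal MPI sketch below has each process compute a private partial result and then combine the results with a collective reduction; the per-rank work is an illustrative placeholder, not an example from the report.

```c
/* Minimal MPI sketch: each rank computes a private partial result and
 * rank 0 combines them with a collective reduction. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = (double)rank;      /* each rank's private partial result */
    double global = 0.0;

    /* Explicit communication: combine partial results across processes. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %g\n", nprocs, global);

    MPI_Finalize();
    return 0;
}
```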
From page 125...
... MapReduce was conceived to simplify the data processing involved in creating a Web index from a set of crawled documents, but developers have also used it for large-scale graph algorithms (for example, finding citation occurrences among scientific articles), computing query-term popularity (for example, Google Trends21)
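A minimal, single-machine sketch of the MapReduce programming model is the classic word count: a map function emits (word, 1) pairs for each input document, the runtime groups pairs by key, and a reduce function sums the counts per key. This is only an in-memory illustration of the model, not Google's implementation; a real MapReduce runtime distributes the map, shuffle, and reduce phases over many machines. The documents, array sizes, and helper names are illustrative assumptions.

```c
/* In-memory sketch of the MapReduce model: map emits (word, 1) pairs,
 * reduce sums values per key. Sizes and inputs are illustrative. */
#include <stdio.h>
#include <string.h>

#define MAX_PAIRS 1024
#define MAX_WORD  32

struct pair { char key[MAX_WORD]; int value; };

static struct pair pairs[MAX_PAIRS];
static int npairs = 0;

/* map: emit a (word, 1) pair for every word in one document. */
static void map(const char *document)
{
    char buf[256];
    strncpy(buf, document, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *w = strtok(buf, " "); w != NULL && npairs < MAX_PAIRS;
         w = strtok(NULL, " ")) {
        strncpy(pairs[npairs].key, w, MAX_WORD - 1);
        pairs[npairs].key[MAX_WORD - 1] = '\0';
        pairs[npairs].value = 1;
        npairs++;
    }
}

/* reduce: for one key, sum the values of all pairs with that key. */
static int reduce(const char *key)
{
    int sum = 0;
    for (int i = 0; i < npairs; i++)
        if (strcmp(pairs[i].key, key) == 0)
            sum += pairs[i].value;
    return sum;
}

int main(void)
{
    const char *documents[] = { "the quick brown fox", "the lazy dog the end" };

    for (int d = 0; d < 2; d++)
        map(documents[d]);                 /* map phase over all documents */

    /* Group-by-key and reduce phase: report each distinct key once. */
    for (int i = 0; i < npairs; i++) {
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (strcmp(pairs[j].key, pairs[i].key) == 0) { seen = 1; break; }
        if (!seen)
            printf("%s: %d\n", pairs[i].key, reduce(pairs[i].key));
    }
    return 0;
}
```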
From page 126...
... The solutions are not aimed at a single programming paradigm for all possible cases but are based on a small set of programming systems that can be specialized for particular types of applications. Distributed Computation -- Harnessing the World's Spare Computing Capacity The increasing quality of Internet connectivity around the globe has led researchers to contemplate the possibility of harnessing the unused computing capability of the world's personal computers and servers to perform extremely compute-intensive parallel tasks.
From page 127...
... A high-level performance-portable programming model is the only way to restart the virtuous cycle described in Chapter 2. The new model will need to be portable over successive generations of chips, multiple architectures, and different kinds of parallel hardware, and it will need to scale well.
From page 128...
... A key part of modern programming systems is the software stack that executes the program on the hardware. The stack must also allow reasoning about the five main challenges to scalable and efficient performance: parallelism, communication, locality, synchronization, and load balancing.
From page 129...
... Such frameworks as J2EE and Websphere make it easy to create large-scale parallel Web applications. For example, MapReduce (described above)
From page 130...
... Parallel programming and parallel computers have been around since the 1960s, and much progress has been made. Much of the challenge of parallel programming deals with making parallel programs efficient and portable without requiring heroic efforts on the part of the programmer.
From page 131...
... At the same time, they may be missing some of the features needed for highly efficient parallel programming, such as lightweight synchronization, global communication, and locality control in software. A great deal of research remains to be done on on-chip networking, cache coherence, and distributed cache and memory management.

