5 Today's Supercomputing Technology
Pages 104-156

From page 104...
... Solutions have a higher utility if provided earlier: A weather forecast is much less valuable after the storm starts. The aggressiveness of the effort to advance supercomputing technology depends on how much added utility and how much added cost come from solving the problem faster.
From page 105...
... In particular, the arithmetic performance increases much faster than the local and global bandwidth of the system. Latency to local memory or to a remote node is decreasing only very slowly.
From page 106...
... Manufacturers are expected to compensate for this drop in the scaling of single-processor performance by placing several processors on a single chip. The aggregate performance of such chip multiprocessors is expected to scale at least as rapidly as the curve shown in Figure 5.1.
From page 107...
... Per calendar year, the memory bandwidth of commodity microprocessor memory interfaces and DRAM chips grows at a much slower rate than processor performance.
From page 108...
... As the gap between processor and memory performance continues to grow, more applications that now make good use of a cache will become limited by memory bandwidth. The evolution of DRAM row access latency (total memory latency ...
From page 109...
... Memory latency is also increasing when measured in terms of memory bandwidth, that is, in the number of memory operations that must be in flight at once, as shown in Figure 5.6. This graph plots the front-side bus bandwidth of Figure 5.2 multiplied by the memory latency of Figure 5.4.
From page 110...
... FIGURE 5.6 Increase in the number of simultaneous memory operations in flight needed to sustain front-side bus bandwidth, plotted from January 1984 through January 2002.
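The quantity plotted in Figure 5.6 follows from Little's law: the number of memory operations that must be in flight equals bandwidth multiplied by latency. A minimal sketch of that calculation, with illustrative numbers that are assumptions rather than figures from the report:

```python
# Little's law applied to memory: concurrency = bandwidth x latency.
# The parameters below are illustrative assumptions, not data from the report.

def operations_in_flight(bandwidth_words_per_s: float, latency_s: float) -> float:
    """Concurrent word-sized memory operations needed to keep the bus busy."""
    return bandwidth_words_per_s * latency_s

# Example: sustaining 1 Gword/s against a 100 ns memory latency requires
# on the order of 100 outstanding accesses at all times.
print(operations_in_flight(1e9, 100e-9))  # -> 100.0
```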
From page 111...
... However, because commodity processors are optimized for applications with memory access patterns different from those found in many scientific applications, they realize a small fraction of their nominal performance on scientific applications. Many of these scientific applications are important for national security.
From page 112...
... of commodity processors while taking advantage of custom interconnect (and possibly a custom processor-memory interface) to overcome the global (and local)
From page 113...
... , it does not use a commodity processor chip but rather integrates this processor as part of a system on a chip. The processor used is almost three times less powerful than single-chip commodity processors5 (because it operates at a much lower clock rate and with little instruction-level parallelism)
From page 114...
... . Scientific applications that have high spatial and temporal locality, and hence make most of their accesses from the cache, perform extremely well on commodity processors, and commodity cluster machines represent the most cost-effective platforms for such applications.
From page 115...
... While some commodity processors provide limited multithreading, they fall short of the tens to hundreds of threads needed to hide main memory latency -- currently hundreds of cycles and growing. Vectors or streams use data parallelism9 to hide latency.
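As a rough illustration of why tens to hundreds of threads are needed, one can divide the memory latency, in cycles, by the independent work each thread can issue between misses; the numbers below are assumptions chosen only to show the scale:

```python
# Rough estimate of how many hardware threads are needed to hide main
# memory latency. All parameters are illustrative assumptions.

def threads_to_hide_latency(latency_cycles: int, work_cycles_per_miss: int) -> int:
    """Threads needed so other threads' work covers one thread's memory wait."""
    return -(-latency_cycles // work_cycles_per_miss)   # ceiling division

print(threads_to_hide_latency(300, 4))   # ~75 threads at 4 cycles of work per miss
print(threads_to_hide_latency(600, 4))   # growing latency pushes this toward 150
```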
From page 116...
... Because the cost of bandwidth increases with distance, it is prohibitively expensive to provide flat memory bandwidth across a supercomputer. Even the best custom machines have a bandwidth taper with a local-to-global bandwidth ratio of about 10:1.
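A toy model of the taper's effect, using assumed numbers and a simple per-word cost model, shows how quickly even a small fraction of global references erodes effective bandwidth:

```python
# Effective memory bandwidth under a local:global taper. Each remote word is
# assumed to cost 'taper' times a local word; both the numbers and the cost
# model are illustrative assumptions, not figures from the report.

def effective_bandwidth(local_bw_gbs: float, taper: float, remote_fraction: float) -> float:
    global_bw = local_bw_gbs / taper
    return 1.0 / ((1.0 - remote_fraction) / local_bw_gbs + remote_fraction / global_bw)

# 100 GB/s of local bandwidth, a 10:1 taper, and 10 percent remote references
# already cut effective bandwidth roughly in half.
print(effective_bandwidth(100.0, 10.0, 0.10))   # ~52.6 GB/s
```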
From page 117...
... Memory latency hiding is becoming increasingly important as processor speed increases faster than memory access time. Global latency hiding is becoming increasingly important as global latency becomes constrained by the speed of light (see Table 5.1)
From page 118...
... Trade-offs It is important to understand the trade-offs among various supercomputer architectures. The use of custom processors with higher memory bandwidth and effective latency-hiding mechanisms leads to higher processor performance for the many scientific codes that have poor temporal and spatial locality.
From page 119...
... Because of their limited volumes, custom processors are significantly more expensive than commodity processors. Thus, in many cases, the reduction in execution time is achieved at the expense of an increase in cost per solution.
From page 120...
... For codes where data caches are not effective, performance is determined by the rate at which operands are brought from memory. The main memory of custom processors has similar latency to the main memory of commodity processors; in order to achieve a given level of performance, both need to sustain the same number of concurrent memory accesses.
From page 121...
... The memory-centric discussion does not change the basic conclusions reached on the relative advantages of custom or hybrid supercomputers, but it introduces some caveats: To take advantage of custom supercomputers, one needs problems where the level of intrinsic parallelism available is much higher than the number of processors and where most communications are local. One often needs a multilevel problem decomposition and different mechanisms for extracting intranode and internode parallelism.
From page 122...
... For example, if processor speed increases but the interconnect is not improved, then global communication may become a bottleneck. At some point, parametric evolution breaks down and qualitative changes to hardware and software are needed.
From page 123...
... It is not clear if commodity processors will provide the required innovations to overcome this "memory wall." While the PC and server applications for which commodity processors are tuned also suffer from the increased gap between arithmetic and memory performance, they exhibit enough spatial and temporal locality that aggressive cache memory systems largely solve the problem. If commodity processors do not offer latency-hiding and/or locality-enhancing mechanisms, it is likely that a smaller fraction of scientific applications will be adequately addressed by these processors as the processor-memory performance gap grows.
From page 124...
... However, even if the numbers are grossly inaccurate, they clearly show that a parametric evolution of current communication architectures is not sustainable.
From page 125...
... SUPERCOMPUTING ALGORITHMS
An algorithm is the sequence of basic operations (arithmetic, logic, branches, and memory accesses) that must be performed to solve the user's task.
From page 126...
... The committee first describes the nature of the algorithms in common use, including their demands on the underlying hardware, and then summarizes some of their shortcomings and future challenges.
Solving Partial and Ordinary Differential Equations
Differential equations are the fundamental equations for many problems governed by the basic laws of physics and chemistry.
From page 127...
... . In contrast to elliptic equations, time-dependent equations (e.g., parabolic PDEs arising in diffusion or heat flow, or their approximations by systems of ordinary differential equations [ODEs]) may ...
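As a concrete instance of the parabolic case, the 1-D heat equation can be discretized in space (yielding a system of ODEs) and advanced with explicit time steps; the sketch below is purely illustrative and not drawn from the report:

```python
# Explicit finite-difference time stepping for the 1-D heat equation
# u_t = alpha * u_xx, a parabolic, time-dependent PDE. Discretizing in space
# (method of lines) turns it into a system of ODEs advanced step by step.
import numpy as np

alpha, nx, nt = 1.0, 101, 500
dx = 1.0 / (nx - 1)
dt = 0.4 * dx * dx / alpha            # within the explicit stability limit
u = np.zeros(nx)
u[nx // 2] = 1.0                      # initial heat spike

for _ in range(nt):
    # Each point depends only on its two neighbors: a local, regular stencil.
    u[1:-1] += alpha * dt / dx**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])

print(u.max(), u.sum())               # the spike spreads out and decays
```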
From page 128...
... When the mesh represents a deforming material, algorithms are needed to deform the mesh as ...
17. Based on excerpts from the white paper "Computational Challenges in Nuclear Weapons Simulation," by Charles F. McMillan et al., LLNL, prepared for the committee's Santa Fe, N.M., applications workshop, September 2003.
From page 129...
... . It is critical to exploit this mathematical structure to reduce memory and arithmetic operations, rather than using dense linear algebra.
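A small sketch of what exploiting that structure looks like in practice: a tridiagonal operator stored in compressed sparse form is applied with work and memory proportional to its nonzeros, whereas a dense treatment of the same matrix would be hopeless at scale (illustrative code, not from the report):

```python
# Exploiting sparsity: the matrix-vector product touches only the nonzeros.
import numpy as np
from scipy.sparse import diags

n = 100_000
# Tridiagonal (1-D Laplacian-like) operator: about 3n nonzeros.
A = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
x = np.random.rand(n)

y = A @ x                 # O(nonzeros) work and memory
# Storing A densely would take n*n = 10^10 doubles, roughly 80 GB, before
# a single arithmetic operation is performed.
print(A.nnz, y.shape)
```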
From page 130...
... , or even other parallel computing algorithms (balancing the workload or partitioning a sparse matrix among different parallel processors)
From page 131...
... New Algorithmic Demands Arising from Supercomputing
In addition to opportunities to improve algorithms (as described above in the categories of differential equations, mesh generation, linear algebra, discrete algorithms, and fast transforms), there are new, crosscutting algorithmic needs driven by supercomputing that are common to many application areas.
From page 132...
... . It is sometimes possible to use physics-based algorithms (like the fast multipole method)
From page 133...
... operations. However, the most expensive operation on a machine is not arithmetic but, rather, fetching data from memory, especially remote memory.
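One common way to make this point quantitative is arithmetic intensity, the ratio of operations performed to bytes fetched; the short calculation below uses assumed machine parameters to show how a simple vector update is bound by memory traffic rather than arithmetic:

```python
# Arithmetic intensity of y = a*x + y: 2 flops per element against 24 bytes
# moved (load x, load y, store y as doubles). Machine numbers are assumptions.

intensity = 2 / (3 * 8)                 # ~0.083 flop/byte
peak_flops = 10e9                       # assumed 10 Gflop/s processor
mem_bw = 10e9                           # assumed 10 GB/s memory bandwidth

attainable = min(peak_flops, intensity * mem_bw)
print(f"{attainable / 1e9:.2f} Gflop/s attainable of {peak_flops / 1e9:.0f} Gflop/s peak")
# Fetching operands, not arithmetic, limits this kernel to well under 10% of
# peak; remote fetches over the interconnect would lower the bound further.
```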
From page 134...
... The system software -- the operating system, the scheduler, the accounting system, for example -- provides the infrastructure for using the machine, independently of the particular applications for which it is used. The programming languages and tools help the user in writing and debugging applications and in understanding their performance.
From page 135...
... Today's largest systems typically have on the order of 10,000 processors to keep busy concurrently. Future systems may push this degree of concurrency to 100,000 or 1 million processors and beyond, and the concurrency level within each processor will need to increase in order to hide the larger memory latency.
From page 136...
... Management software for supercomputing typically uses straightforward extensions or improvements to software for smaller systems, together with policies tailored to their user community. It is challenging to scale an operating system to a large number of processors.
From page 137...
... Most existing parallel programming models implicitly assume that the application controls a dedicated set of processors executing at the same speed. Thus, many parallel codes consist of an alternation of compute phases, in which an equal amount of computation work is performed by each process, and global communication and synchronization phases.
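A minimal sketch of that alternation, written with mpi4py (the workload and sizes are invented for illustration): every process does its equal share of compute, then all processes meet in a global reduction, so the slowest process sets the pace.

```python
# Bulk-synchronous structure: equal local compute, then global synchronization.
# Run with, e.g.: mpiexec -n 4 python bsp_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local = np.random.rand(1_000_000)                    # this process's share of the data

for step in range(10):
    local = np.sqrt(local) + 0.5                     # compute phase, equal work per process
    total = comm.allreduce(local.sum(), op=MPI.SUM)  # global communication/synchronization
    if comm.rank == 0:
        print(f"step {step}: global sum = {total:.1f}")
# If any process is slowed (e.g., its processor is shared with another job),
# every other process waits at the allreduce.
```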
From page 138...
... . The use of a given programming model requires that the operating system, the programming languages, and the software tools provide the services that support that abstraction.
From page 139...
... One exception is the TotalView debugger,34 which supports Fortran, C, OpenMP, and MPI. Parallel programming languages and parallel programming models are necessarily compromises between conflicting requirements.
From page 140...
... 1999. "C and tcc: A Language and Compiler for Dynamic Code Generation." ACM Transactions on Programming Languages and Systems 21(2)
From page 141...
... If support for shared memory is deemed important for good software productivity, then it may be necessary to forsake porting to clusters that use LAN interconnects.41 Different forms of parallelism operate not only on different supercomputers but at different levels within one supercomputer. For instance, the Earth Simulator uses vector parallelism on one processor, shared memory parallelism within one node, and message passing parallelism across nodes.42 If each hardware mechanism is directly reflected by a similar software mechanism, then the user has to manage three different parallel programming models within one application and manage the interaction among these models, a difficult task.
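Schematically, the three levels map onto code as in the sketch below: message passing between nodes, a shared-memory model within a node, and data-parallel inner loops. This is an illustrative pattern in Python, not Earth Simulator code, and the shared-memory level is only indicated by a comment.

```python
# Three levels of parallelism, each visible to and managed by the programmer.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD                     # level 1: message passing across nodes
chunk = np.linspace(comm.rank, comm.rank + 1, 1_000_000)

# level 2: within a node, a shared-memory model (e.g., OpenMP threads inside a
# compiled kernel) would further split 'chunk' among cores -- elided here.

partial = float(np.dot(chunk, chunk))     # level 3: vectorized, data-parallel inner loop
total = comm.allreduce(partial, op=MPI.SUM)
if comm.rank == 0:
    print(total)
```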
From page 142...
... Key examples used for supercomputing include mathematical libraries such as LAPACK43 for linear algebra, templates such as the C++ Standard Template Library,44 run-time support such as MPI for message passing, and visualization packages such as the Visualization Tool Kit (VTK).45 The libraries of most interest to supercomputing involve mathematical functions, including linear algebra (e.g., LAPACK and its kin)
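For instance, a dense solve written against such a library stays a few lines long because the numerical work is delegated to LAPACK underneath; this is a generic illustration rather than one of the report's codes:

```python
# Reusing a tuned math library instead of hand-writing linear algebra:
# scipy.linalg.solve dispatches to LAPACK routines underneath.
import numpy as np
from scipy.linalg import solve

n = 500
A = np.random.rand(n, n) + n * np.eye(n)   # well-conditioned test matrix
b = np.random.rand(n)

x = solve(A, b)                            # LU factorization and solve via LAPACK
print(np.linalg.norm(A @ x - b))           # residual near machine precision
```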
From page 143...
... Scalability of applications is a major challenge. One issue already discussed is that of appropriate programming languages and programming models for the development of supercomputing applications.
From page 144...
... Finally, implementation effort is a major consideration given the limited resources available for HPC software. One important reason that MPI is so successful is that simple MPI implementations can be created quickly by supplying device drivers for a public-domain MPI implementation like MPICH.50 Moreover, that MPI implementation can be improved incrementally by improving those drivers and by tuning higher-level routines for the particular architecture.
From page 145...
... There is little incentive to reduce failure rates of commodity processors to less than one error every few years of operation. Failure rates can be reduced using suitable fault-tolerant hardware in a custom processor or by using triplicated processors in hybrid supercomputers.
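The scale of the problem is clear from a one-line reliability estimate; the rates below are assumptions used only to show orders of magnitude:

```python
# If each node fails about once every three years, the aggregate failure rate
# grows linearly with node count. All numbers are illustrative assumptions.

node_mtbf_hours = 3 * 365 * 24            # ~26,000 hours between failures per node
for nodes in (1_000, 10_000, 100_000):
    system_mtbf_minutes = node_mtbf_hours / nodes * 60
    print(f"{nodes:>7} nodes -> a failure roughly every {system_mtbf_minutes:,.0f} minutes")
# At 100,000 nodes that is a failure every ~16 minutes, hence the need for
# fault detection and handling that preserves the illusion of a fault-free machine.
```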
From page 146...
... Industry performance benchmarks include Linpack, SPEC, NAS, and Stream, among many others.51 By their nature they can only measure limited ...
51. Other industrial benchmark efforts include Real Applications on Parallel Systems (RAPS)
From page 147...
... that is relatively insensitive to memory and network bandwidth and so cannot accurately predict the performance of more irregular or sparse algorithms. Stream measures peak memory bandwidth, but slight changes in the memory access pattern might result in a far lower attained bandwidth in a particular application due to poor spatial locality.
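The sensitivity to access pattern is easy to reproduce: the same triad-style computation run with unit stride and with a strided, poor-locality pattern attains very different useful bandwidth. The numpy sketch below is a rough illustration, not the official STREAM benchmark, and interpreter overhead makes the absolute numbers approximate:

```python
# STREAM-triad-like measurement (a = b + s*c) under two access patterns.
import time
import numpy as np

n, s = 20_000_000, 3.0
b, c = np.random.rand(n), np.random.rand(n)

def triad_gbs(bi, ci):
    t0 = time.perf_counter()
    a = bi + s * ci
    dt = time.perf_counter() - t0
    return 3 * a.nbytes / dt / 1e9        # useful bytes moved: read b, read c, write a

print(f"unit stride : {triad_gbs(b, c):.1f} GB/s")
# Stride-8 views touch the same cache lines but use only one word from each,
# so the attained useful bandwidth collapses even though the code is "the same."
print(f"stride of 8 : {triad_gbs(b[::8], c[::8]):.1f} GB/s")
```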
From page 148...
... It is sometimes difficult for the application programmer to relate the results to source code and to understand how to use the monitoring information to improve performance.
Performance Modeling and Simulation
There has been a great deal of interest recently in mathematically modeling the performance of an application with enough accuracy to predict its behavior either on a rather different problem size or on a rather different computer system, typically much larger than now available.
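At its simplest, such a model combines counted work from the application with a few measured machine parameters. The skeleton below is a hypothetical example in which every parameter is an assumption; it is not a model used in the report:

```python
# Skeleton of an analytic performance model: predicted time is built from a
# compute term, a memory term, and a communication term.

def predicted_time(flops, mem_bytes, msgs, msg_bytes,
                   peak_flops, mem_bw, net_latency, net_bw):
    compute = flops / peak_flops
    memory = mem_bytes / mem_bw
    network = msgs * net_latency + msg_bytes / net_bw
    # Assume compute overlaps with memory traffic but not with communication.
    return max(compute, memory) + network

# Extrapolating the same application to a larger (hypothetical) machine is a
# matter of changing the machine parameters, not re-measuring.
print(predicted_time(flops=1e15, mem_bytes=4e14, msgs=1e6, msg_bytes=1e12,
                     peak_flops=1e13, mem_bw=1e12, net_latency=5e-6, net_bw=1e11))
```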
From page 149...
... 2003. "A Performance Prediction Framework for Scientific Applications." Workshop on Performance Modeling and Analysis, 2003 ICCS.
From page 150...
... Measuring performance on existing systems can certainly identify current bottlenecks, but it is not adequate to guide investments to solve future problems. For example, current hardware trends are for processor speeds to increasingly outstrip local memory bandwidth (the memory wall63)
From page 151...
... Lower purchase cost may bias the supercomputing market toward commodity supercomputers if organizations do not account properly for the total cost of ownership and are more sensitive to hardware cost.
THE IMPERATIVE TO INNOVATE AND BARRIERS TO INNOVATION
Systems Issues
The committee summarizes trends in parallel hardware in Table 5.1.
From page 152...
... Such failure rates require innovation in both fault detection and fault handling to give the user the illusion of a fault-free machine. The growing gap between processor performance and global bandwidth and latency is also expected to force innovation.
From page 153...
... At present, there is no high-level programming model that exposes essential performance characteristics of parallel algorithms. Consequently, much of the transfer of such knowledge is done by personal relationships, a mechanism that does not scale and that cannot reach a large enough user community.
From page 154...
... Report of the June 3-5 Science Networking Workshop, conducted by the Energy Sciences Network Steering Committee at the request of the Office of Advanced Scientific Computing Research of the DOE Office of Science.
69. From the white paper "Computational Fluid Dynamics for Multiphysics and Multiscale Problems," by Phillip Colella, LBNL, prepared for the committee's Santa Fe, N.M., applications workshop, September 2003.
From page 155...
... is justified. The main expense in large supercomputing programs such as ASC is software related: In FY 2004, 40 percent of the ASC budget was allocated for application development; in addition, a significant fraction of the acquisition budget also goes, directly or indirectly, to software purchase.70 A significant fraction of the time to solution is spent developing, tuning, verifying, and validating codes.
From page 156...
... The DARPA HPCS effort emphasizes software productivity, but it is vendor driven and hardware focused and has not generated a broad, coordinated community effort for new programming models. Meanwhile, larger and more complex hardware systems continue to be put in production, and larger and more complex application packages are developed.

