‎"Behind every stack of books there is a flood of knowledge."

Designing and Building Parallel Programs



Welcome to Designing and Building Parallel Programs ! My goal in this book is to provide a practitioner’s guide for students, programmers, engineers, and scientists who wish to design and build efficient and cost-effective programs for parallel and distributed computer systems. I cover both the techniques used to design parallel programs and the tools used to implement these programs. I assume familiarity with sequential programming, but no prior exposure to parallel computing.


Designing and Building Parallel Programs promotes a view of parallel programming as an engineering discipline, in which programs are developed in a methodical fashion and both cost and performance are considered in a design. This view is reflected in the structure of the book, which is divided into three parts. The first part, Concepts, provides a thorough discussion of parallel algorithm design, performance analysis, and program construction, with numerous examples to illustrate fundamental principles. The second part, Tools, provides an in-depth treatment of four parallel programming tools: the parallel languages Compositional C++ (CC++ ), Fortran M (FM), and High Performance Fortran (HPF), and the Message Passing Interface (MPI) library. HPF and MPI are standard parallel programming systems, and CC++ and FM are modern languages particularly well-suited for parallel software engineering. Part II also describes tools for collecting and analyzing performance data. The third part,Resources surveys some fundamental parallel algorithms and provides many pointers to other sources of information.


How to Use This Book

In writing this book, I chose to decouple the presentation of fundamental parallel programming concepts from the discussion of the parallel tools used to realize these concepts in programs. This separation allowed me to present concepts in a tool-independent manner; hence, commonalities between different approaches are emphasized, and the book does not become a manual for a particular programming language.

However, this separation also has its dangers. In particular, it may encourage you to think that the concepts introduced in Part I can be studied independently of the practical discipline of writing parallel programs. This assumption would be a serious mistake. Parallel programming, like most engineering activities, is best learned by doing. Practical experience is essential! Hence, I recommend that chapters from Parts I and II be studied concurrently. This approach will enable you to acquire the hands-on experience needed to translate knowledge of the concepts introduced in the book into the intuition that makes a good programmer. For the same reason, I also recommend that you attempt as many of the end-of-chapter exercises as possible.

Designing and Building Parallel Programs can be used as both a textbook for students and a reference book for professionals. Because the hands-on aspects of parallel programming are so important, professionals may find it useful to approach the book with a programming problem in mind and make the development of a solution to this problem part of the learning process. The basic materials have been classroom tested. For example, I have used them to teach a two-quarter graduate-level course in parallel computing to students from both computer science and noncomputer science backgrounds. In the first quarter, students covered much of the material in this book; in the second quarter, they tackled a substantial programming project. Colleagues have used the same material to teach a one-semester undergraduate introduction to parallel computing, augmenting this book’s treatment of design and programming with additional readings in parallel architecture and algorithms.



It is a pleasure to thank the colleagues with whom and from whom I have gained the insights that I have tried to distill in this book: in particular Mani Chandy, Bill Gropp, Carl Kesselman, Ewing Lusk, John Michalakes, Ross Overbeek, Rick Stevens, Steven Taylor, Steven Tuecke, and Patrick Worley. In addition, I am grateful to the many people who reviewed the text. Enrique Castro-Leon, Alok Choudhary, Carl Kesselman, Rick Kendall, Ewing Lusk, Rob Schreiber, and Rick Stevens reviewed one or more chapters. Gail Pieper, Brian Toonen, and Steven Tuecke were kind enough to read the entire text. Addison-Wesley’s anonymous reviewers also provided invaluable comments. Nikos Drakos provided the latex2html software used to construct the online version, and Cris Perez helped run it. Brian Toonen tested all the programs and helped in other ways too numerous to mention. Carl Kesselman made major contributions to Chapter 5. Finally, all the staff at Addison-Wesley, and in particular editor Tom Stone and editorial assistant Kathleen Billus, were always a pleasure to work with.


  All logarithms in this book are to base 2; hence  should be read as .

  The notation  is used in the formal sense: A problem has size  if and only if there exists some constant c and some minimum problem size  such that for all , size.

Various symbols are assumed to have the following conventional meanings, unless stated otherwise.


Parallelism and Computing

  A parallel computer is a set of processors that are able to work cooperatively to solve a computational problem. This definition is broad enough to include parallel supercomputers that have hundreds or thousands of processors, networks of workstations, multiple-processor workstations, and embedded systems. Parallel computers are interesting because they offer the potential to concentrate computational resources—whether processors, memory, or I/O bandwidth—on important computational problems.

Parallelism has sometimes been viewed as a rare and exotic subarea of computing, interesting but of little relevance to the average programmer. A study of trends in applications, computer architecture, and networking shows that this view is no longer tenable. Parallelism is becoming ubiquitous, and parallel programming is becoming central to the programming enterprise.


1.1.1 Trends in Applications

As computers become ever faster, it can be tempting to suppose that   they will eventually become “fast enough” and that appetite for increased computing power will be sated. However, history suggests that as a particular technology satisfies known applications, new applications will arise that are enabled by that technology and that will demand the development of new technology. As an amusing illustration of this phenomenon, a report prepared for the British government in the late 1940s concluded that Great Britain’s computational requirements could be met by two or perhaps three computers. In those days, computers were used primarily for computing ballistics tables. The authors of the report did not consider other applications in science and engineering, let alone the commercial applications that would soon come to dominate computing. Similarly, the initial prospectus for Cray Research predicted a market for ten supercomputers; many hundreds have since been sold.

Traditionally, developments at the high end of computing have been motivated by numerical simulations of complex systems such as weather,   climate, mechanical devices, electronic circuits, manufacturing  processes, and chemical reactions. However, the most significant forces driving the development of faster computers today are emerging commercial applications that require a computer to be able to process large amounts of data in sophisticated ways. These applications   include video conferencing, collaborative work environments,   computer-aided diagnosis in medicine, parallel databases used for   decision support, and advanced graphics and virtual reality,   particularly in the entertainment industry. For example, the integration of parallel computation, high-performance networking, and multimedia technologies is leading to the development of video   servers, computers designed to serve hundreds or thousands of simultaneous requests for real-time video. Each video stream can involve both data transfer rates of many megabytes per second and large amounts of processing for encoding and decoding. In graphics, three-dimensional data sets are now approaching  volume elements (1024 on a side). At 200 operations per element, a display updated 30 times per second requires a computer capable of 6.4 operations per second.

Although commercial applications may define the architecture of most future parallel computers, traditional scientific applications will remain important users of parallel computing technology. Indeed, as nonlinear effects place limits on the insights offered by purely theoretical investigations and as experimentation becomes more costly or impractical, computational studies of complex systems are becoming ever more important. Computational costs typically increase as the fourth power or more of the “resolution” that determines accuracy, so these studies have a seemingly insatiable demand for more computer power. They are also often characterized by large memory and input/output requirements. For example, a ten-year simulation of the earth’s climate using a state-of-the-art model may involve    floating-point operations—ten days at an execution speed of  floating-point operations per second (10 gigaflops). This same simulation can easily generate a hundred gigabytes ( bytes) or more of data. Yet as Table 1.1 shows, scientists can easily imagine refinements to these models that would increase these computational requirements 10,000 times.


Table 1.1: Various refinements proposed to climate models, and the increased computational requirements associated with these refinements. Altogether, these refinements could increase computational requirements by a factor of between  and 


In summary, the need for faster computers is driven by the demands of both data-intensive applications in commerce and computation-intensive applications in science and engineering. Increasingly, the requirements of these fields are merging, as scientific and engineering applications become more data intensive and commercial applications perform more sophisticated computations.


1.1.2 Trends in Computer Design

The performance of the fastest computers has grown exponentially from   1945 to the present, averaging a factor of 10 every five years. While the first computers performed a few tens of floating-point  operations per second, the parallel computers of the mid-1990s achieve tens of billions of operations per second (Figure 1.1). Similar trends can be observed in the low-end computers of different eras: the calculators, personal computers, and workstations. There is little to suggest that this growth will not continue. However, the computer architectures used to sustain this growth are changing radically—from sequential to parallel.


Figure 1.1: Peak performance of some of the fastest supercomputers, 1945–1995. The exponential growth flattened off somewhat in the 1980s but is accelerating again as massively parallel supercomputers become available. Here, “o” are uniprocessors, “+” denotes modestly parallel vector computers with 4–16 processors, and “x” denotes massively parallel computers with hundreds or thousands of processors. Typically, massively parallel computers achieve a lower proportion of their peak performance on realistic applications than do vector computers.


  The performance of a computer depends directly on the time required to perform a basic operation and the number of these basic operations   that can be performed concurrently. The time to perform a basic   operation is ultimately limited by the “clock cycle” of the processor, that is, the time required to perform the most primitive operation. However, clock cycle times are decreasing slowly and appear to be approaching physical limits such as the speed of light (Figure 1.2). We cannot depend on faster processors to provide increased computational performance.



Figure 1.2: Trends in computer clock cycle times. Conventional vector supercomputer cycle times (denoted “o”) have decreased only by a factor of 3 in sixteen years, from the CRAY-1 (12.5 nanoseconds) to the C90 (4.0). RISC microprocessors (denoted “+”) are fast approaching the same performance. Both architectures appear to be approaching physical limits.


To circumvent these limitations, the designer may attempt to utilize internal concurrency in a chip, for example, by operating simultaneously on all 64 bits of two numbers that are to be multiplied. However, a fundamental result in Very Large Scale   Integration (VLSI) complexity theory says that this strategy is expensive. This result states that for certain transitive computations (in which any output may depend on any input), the chip area A and the time T required to perform this computation are related so that  must exceed some problem-dependent function of problem size. This result can be explained informally by assuming that a computation must move a certain amount of information from one side of a square chip to the other. The amount of information that can be moved in a time unit is limited by the cross section of the chip, . This gives a transfer rate of , from which the  relation is obtained. To decrease the time required to move the information by a certain factor, the cross section must be increased by the same factor, and hence the total area must be increased by the square of that factor.

This  result means that not only is it difficult to build individual components that operate faster, it may not even be desirable to do so. It may be cheaper to use more, slower components. For example, if we have an area  of silicon to use in a computer, we can either build  components, each of size A and able to perform an operation in time T , or build a single component able to perform the same operation in time T/n . The multicomponent system is potentially n times faster.

Computer designers use a variety of techniques to overcome these   limitations on single computer performance, including pipelining (different stages of several instructions execute concurrently) and multiple function units (several multipliers, adders, etc., are controlled by a single instruction stream). Increasingly, designers are incorporating multiple “computers,” each with its own processor, memory, and associated interconnection logic. This approach is   facilitated by advances in VLSI technology that continue to decrease the number of components required to implement a computer. As the cost of a computer is (very approximately) proportional to the number of components that it contains, increased integration also increases the number of processors that can be included in a computer for a particular cost. The result is continued growth in processor counts (Figure 1.3).



Figure 1.3: Number of processors in massively parallel computers (“o”) and vector multiprocessors (“+”). In both cases, a steady increase in processor count is apparent. A similar trend is starting to occur in workstations, and personal computers can be expected to follow the same trend.


1.1.3 Trends in Networking

  Another important trend changing the face of computing is an enormous   increase in the capabilities of the networks that connect computers. Not long ago, high-speed networks ran at 1.5 Mbits per second; by the end of the 1990s, bandwidths in excess of 1000 Mbits per second will be commonplace. Significant improvements in reliability are also expected. These trends make it feasible to develop applications that use physically distributed resources as if they were part of the same computer. A typical application of this sort may utilize processors on multiple remote computers, access a selection of remote databases, perform rendering on one or more graphics computers, and provide real-time output and control on a workstation.

  We emphasize that computing on networked computers (“distributed computing”) is not just a subfield of parallel computing. Distributed computing is deeply concerned with problems such as reliability, security, and heterogeneity that are generally regarded as tangential in parallel computing. (As Leslie Lamport has observed, “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”) Yet the basic task of developing programs that can run on many computers at once is a parallel computing problem. In this respect, the previously distinct worlds of parallel and distributed computing are converging.


1.1.4 Summary of Trends

This brief survey of trends in applications, computer architecture, and networking suggests a future in which parallelism pervades not only supercomputers but also workstations, personal computers, and networks. In this future, programs will be required to exploit the multiple processors located inside each computer and the additional processors available across a network. Because most existing algorithms  are specialized for a single processor, this situation implies a need   for new algorithms and program structures able to perform many operations at once. Concurrency becomes a fundamental requirement for algorithms and programs.

This survey also suggests a second fundamental lesson. It appears likely that processor counts will continue to increase—perhaps, as they do in some environments at present, by doubling each year or two. Hence, software systems can be expected to experience substantial increases in processor count over their lifetime. In this   environment, scalability —resilience to increasing   processor counts—is as important as portability for protecting software investments. A program able to use only a fixed number of processors is a bad program, as is a program able to execute on only a single computer. Scalability is a major theme that will be stressed throughout this book.




One comment on “Designing and Building Parallel Programs

  1. mestreseo
    February 25, 2013

    thanks for posting your articles so often, every day i access your website and check for updates. so i am always informed. mestreseo mestreseo mestreseo mestreseo mestreseo

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

Virtual Fashion Technology

Virtual Fashion Education


"chúng tôi chỉ là tôi tớ của anh em, vì Đức Kitô" (2Cr 4,5b)


News About Tech, Money and Innovation


Modern art using the GPU

Theme Showcase

Find the perfect theme for your blog.


Learn to Learn

Gocomay's Blog

Con tằm đến thác vẫn còn vương tơ

Toán cho Vật lý

Khoa Vật lý, Đại học Sư phạm Tp.HCM - ĐT :(08)-38352020 - 109

Maths 4 Physics & more...

Blog Toán Cao Cấp (M4Ps)

Bucket List Publications

Indulge- Travel, Adventure, & New Experiences


‎"Behind every stack of books there is a flood of knowledge."

The Blog

The latest news on and the WordPress community.

%d bloggers like this: