Programming Massively Parallel Processors: A Hands-on Approach


This edition doesn't have a description yet.

Publisher: Morgan Kaufmann
Pages: 514


Edition Availability

  • Programming Massively Parallel Processors: A Hands-On Approach. 2022, Elsevier Science & Technology, Morgan Kaufmann. In English.
  • Programming Massively Parallel Processors: A Hands-on Approach. Dec 28, 2012, Morgan Kaufmann.
  • Programming Massively Parallel Processors. December 28, 2012, Morgan Kaufmann.
  • Programming Massively Parallel Processors: A Hands-On Approach. 2010, Elsevier Science & Technology Books. In English.


Book Details


Table of Contents

Preface (p. xiii)
Acknowledgements (p. xix)
Dedication (p. xxi)
Chapter 1. Introduction (p. 1)
  1.1. Heterogeneous Parallel Computing (p. 2)
  1.2. Architecture of a Modern GPU (p. 8)
  1.3. Why More Speed or Parallelism? (p. 10)
  1.4. Speeding Up Real Applications (p. 12)
  1.5. Parallel Programming Languages and Models (p. 14)
  1.6. Overarching Goals (p. 16)
  1.7. Organization of the Book (p. 17)
  References (p. 21)
Chapter 2. History of GPU Computing (p. 23)
  2.1. Evolution of Graphics Pipelines (p. 23)
    The Era of Fixed-Function Graphics Pipelines (p. 24)
    Evolution of Programmable Real-Time Graphics (p. 28)
    Unified Graphics and Computing Processors (p. 31)
  2.2. GPGPU: An Intermediate Step (p. 32)
  2.3. GPU Computing (p. 34)
    Scalable GPUs (p. 35)
    Recent Developments (p. 36)
    Future Trends (p. 37)
  References and Further Reading (p. 37)
Chapter 3. Introduction to Data Parallelism and CUDA C (p. 41)
  3.1. Data Parallelism (p. 42)
  3.2. CUDA Program Structure (p. 43)
  3.3. A Vector Addition Kernel (p. 45)
  3.4. Device Global Memory and Data Transfer (p. 48)
  3.5. Kernel Functions and Threading (p. 53)
  3.6. Summary (p. 58)
    Function Declarations (p. 59)
    Kernel Launch (p. 59)
    Predefined Variables (p. 60)
    Runtime API (p. 60)
  3.7. Exercises (p. 60)
  References (p. 62)
Chapter 4. Data-Parallel Execution Model (p. 63)
  4.1. CUDA Thread Organization (p. 64)
  4.2. Mapping Threads to Multidimensional Data (p. 68)
  4.3. Matrix-Matrix Multiplication—A More Complex Kernel (p. 74)
  4.4. Synchronization and Transparent Scalability (p. 81)
  4.5. Assigning Resources to Blocks (p. 83)
  4.6. Querying Device Properties (p. 85)
  4.7. Thread Scheduling and Latency Tolerance (p. 87)
  4.8. Summary (p. 91)
  4.9. Exercises (p. 91)
Chapter 5. CUDA Memories (p. 95)
  5.1. Importance of Memory Access Efficiency (p. 96)
  5.2. CUDA Device Memory Types (p. 97)
  5.3. A Strategy for Reducing Global Memory Traffic (p. 105)
  5.4. A Tiled Matrix-Matrix Multiplication Kernel (p. 109)
  5.5. Memory as a Limiting Factor to Parallelism (p. 115)
  5.6. Summary (p. 118)
  5.7. Exercises (p. 119)
Chapter 6. Performance Considerations (p. 123)
  6.1. Warps and Thread Execution (p. 124)
  6.2. Global Memory Bandwidth (p. 132)
  6.3. Dynamic Partitioning of Execution Resources (p. 141)
  6.4. Instruction Mix and Thread Granularity (p. 143)
  6.5. Summary (p. 145)
  6.6. Exercises (p. 145)
  References (p. 149)
Chapter 7. Floating-Point Considerations (p. 151)
  7.1. Floating-Point Format (p. 152)
    Normalized Representation of M (p. 152)
    Excess Encoding of E (p. 153)
  7.2. Representable Numbers (p. 155)
  7.3. Special Bit Patterns and Precision in IEEE Format (p. 160)
  7.4. Arithmetic Accuracy and Rounding (p. 161)
  7.5. Algorithm Considerations (p. 162)
  7.6. Numerical Stability (p. 164)
  7.7. Summary (p. 169)
  7.8. Exercises (p. 170)
  References (p. 171)
Chapter 8. Parallel Patterns: Convolution (p. 173)
  8.1. Background (p. 174)
  8.2. 1D Parallel Convolution—A Basic Algorithm (p. 179)
  8.3. Constant Memory and Caching (p. 181)
  8.4. Tiled 1D Convolution with Halo Elements (p. 185)
  8.5. A Simpler Tiled 1D Convolution—General Caching (p. 192)
  8.6. Summary (p. 193)
  8.7. Exercises (p. 194)
Chapter 9. Parallel Patterns: Prefix Sum (p. 197)
  9.1. Background (p. 198)
  9.2. A Simple Parallel Scan (p. 200)
  9.3. Work Efficiency Considerations (p. 204)
  9.4. A Work-Efficient Parallel Scan (p. 205)
  9.5. Parallel Scan for Arbitrary-Length Inputs (p. 210)
  9.6. Summary (p. 214)
  9.7. Exercises (p. 215)
  Reference (p. 216)
Chapter 10. Parallel Patterns: Sparse Matrix-Vector Multiplication (p. 217)
  10.1. Background (p. 218)
  10.2. Parallel SpMV Using CSR (p. 222)
  10.3. Padding and Transposition (p. 224)
  10.4. Using Hybrid to Control Padding (p. 226)
  10.5. Sorting and Partitioning for Regularization (p. 230)
  10.6. Summary (p. 232)
  10.7. Exercises (p. 233)
  References (p. 234)
Chapter 11. Application Case Study: Advanced MRI Reconstruction (p. 235)
  11.1. Application Background (p. 236)
  11.2. Iterative Reconstruction (p. 239)
  11.3. Computing FHD (p. 241)
    Step 1: Determine the Kernel Parallelism Structure (p. 243)
    Step 2: Getting Around the Memory Bandwidth Limitation (p. 249)
    Step 3: Using Hardware Trigonometry Functions (p. 255)
    Step 4: Experimental Performance Tuning (p. 259)
  11.4. Final Evaluation (p. 260)
  11.5. Exercises (p. 262)
  References (p. 264)
Chapter 12. Application Case Study: Molecular Visualization and Analysis (p. 265)
  12.1. Application Background (p. 266)
  12.2. A Simple Kernel Implementation (p. 268)
  12.3. Thread Granularity Adjustment (p. 272)
  12.4. Memory Coalescing (p. 274)
  12.5. Summary (p. 277)
  12.6. Exercises (p. 279)
  References (p. 279)
Chapter 13. Parallel Programming and Computational Thinking (p. 281)
  13.1. Goals of Parallel Computing (p. 282)
  13.2. Problem Decomposition (p. 283)
  13.3. Algorithm Selection (p. 287)
  13.4. Computational Thinking (p. 293)
  13.5. Summary (p. 294)
  13.6. Exercises (p. 294)
  References (p. 295)
Chapter 14. An Introduction to OpenCL (p. 297)
  14.1. Background (p. 297)
  14.2. Data Parallelism Model (p. 299)
  14.3. Device Architecture (p. 301)
  14.4. Kernel Functions (p. 303)
  14.5. Device Management and Kernel Launch (p. 304)
  14.6. Electrostatic Potential Map in OpenCL (p. 307)
  14.7. Summary (p. 311)
  14.8. Exercises (p. 312)
  References (p. 313)
Chapter 15. Parallel Programming with OpenACC (p. 315)
  15.1. OpenACC Versus CUDA C (p. 315)
  15.2. Execution Model (p. 318)
  15.3. Memory Model (p. 319)
  15.4. Basic OpenACC Programs (p. 320)
    Parallel Construct (p. 320)
    Loop Construct (p. 322)
    Kernels Construct (p. 327)
    Data Management (p. 331)
    Asynchronous Computation and Data Transfer (p. 335)
  15.5. Future Directions of OpenACC (p. 336)
  15.6. Exercises (p. 337)
Chapter 16. Thrust: A Productivity-Oriented Library for CUDA (p. 339)
  16.1. Background (p. 339)
  16.2. Motivation (p. 342)
  16.3. Basic Thrust Features (p. 343)
    Iterators and Memory Space (p. 344)
    Interoperability (p. 345)
  16.4. Generic Programming (p. 347)
  16.5. Benefits of Abstraction (p. 349)
  16.6. Programmer Productivity (p. 349)
    Robustness (p. 350)
    Real-World Performance (p. 350)
  16.7. Best Practices (p. 352)
    Fusion (p. 353)
    Structure of Arrays (p. 354)
    Implicit Ranges (p. 356)
  16.8. Exercises (p. 357)
  References (p. 358)
Chapter 17. CUDA Fortran (p. 359)
  17.1. CUDA Fortran and CUDA C Differences (p. 360)
  17.2. A First CUDA Fortran Program (p. 361)
  17.3. Multidimensional Array in CUDA Fortran (p. 363)
  17.4. Overloading Host/Device Routines With Generic Interfaces (p. 364)
  17.5. Calling CUDA C Via Iso_C_Binding (p. 367)
  17.6. Kernel Loop Directives and Reduction Operations (p. 369)
  17.7. Dynamic Shared Memory (p. 370)
  17.8. Asynchronous Data Transfers (p. 371)
  17.9. Compilation and Profiling (p. 377)
  17.10. Calling Thrust from CUDA Fortran (p. 378)
  17.11. Exercises (p. 382)
Chapter 18. An Introduction to C++ AMP (p. 383)
  18.1. Core C++ AMP Features (p. 384)
  18.2. Details of the C++ AMP Execution Model (p. 391)
    Explicit and Implicit Data Copies (p. 391)
    Asynchronous Operation (p. 393)
    Section Summary (p. 395)
  18.3. Managing Accelerators (p. 395)
  18.4. Tiled Execution (p. 398)
  18.5. C++ AMP Graphics Features (p. 401)
  18.6. Summary (p. 405)
  18.7. Exercises (p. 405)
Chapter 19. Programming a Heterogeneous Computing Cluster (p. 407)
  19.1. Background (p. 408)
  19.2. A Running Example (p. 408)
  19.3. MPI Basics (p. 410)
  19.4. MPI Point-to-Point Communication Types (p. 414)
  19.5. Overlapping Computation and Communication (p. 421)
  19.6. MPI Collective Communication (p. 431)
  19.7. Summary (p. 432)
  19.8. Exercises (p. 433)
  Reference
Chapter 20. CUDA Dynamic Parallelism (p. 435)
  20.1. Background (p. 436)
  20.2. Dynamic Parallelism Overview (p. 438)
  20.3. Important Details (p. 439)
    Launch Environment Configuration (p. 439)
    API Errors and Launch Failures (p. 439)
    Events (p. 439)
    Streams (p. 440)
    Synchronization Scope (p. 441)
  20.4. Memory Visibility (p. 442)
    Global Memory (p. 442)
    Zero-Copy Memory (p. 442)
    Constant Memory (p. 442)
    Texture Memory (p. 443)
  20.5. A Simple Example (p. 444)
  20.6. Runtime Limitations (p. 446)
    Memory Footprint (p. 446)
    Nesting Depth (p. 448)
    Memory Allocation and Lifetime (p. 448)
    ECC Errors (p. 449)
    Streams (p. 449)
    Events (p. 449)
    Launch Pool (p. 449)
  20.7. A More Complex Example (p. 449)
    Linear Bezier Curves (p. 450)
    Quadratic Bezier Curves (p. 450)
    Bezier Curve Calculation (Predynamic Parallelism) (p. 450)
    Bezier Curve Calculation (with Dynamic Parallelism) (p. 453)
  20.8. Summary (p. 456)
  Reference (p. 457)
Chapter 21. Conclusion and Future Outlook (p. 459)
  21.1. Goals Revisited (p. 459)
  21.2. Memory Model Evolution (p. 461)
  21.3. Kernel Execution Control Evolution (p. 464)
  21.4. Core Performance (p. 467)
  21.5. Programming Environment (p. 467)
  21.6. Future Outlook (p. 468)
  References (p. 469)
Appendix A: Matrix Multiplication Host-Only Version Source Code (p. 471)
Appendix B: GPU Compute Capabilities (p. 481)
Index (p. 487)

Classifications

Library of Congress: QA76.58, QA76.642 .K57 2013eb

Edition Identifiers

Open Library: OL26838891M
Internet Archive: programmingmassi0000kirk
ISBN 10: 0124159923
ISBN 13: 9780124159921
OCLC/WorldCat: 841331948

Work Identifiers

Work ID: OL25666151W

