
Compiler Optimizations for Scalable Parallel Systems: Languages, Compilation Techniques, and Run Time Systems

by Santosh Pande; Dharma P. Agrawal
  • ISBN13: 9783540419457
  • ISBN10: 3540419454

  • Format: Paperback
  • Copyright: 2001-11-01
  • Publisher: Springer-Verlag

Summary

Scalable parallel systems or, more generally, distributed memory systems offer a challenging model of computing and pose fascinating problems regarding compiler optimization, ranging from language design to run time systems. Research in this area is foundational to many challenges, from memory hierarchy optimizations to communication optimization. This unique, handbook-like monograph assesses the state of the art in the area in a systematic and comprehensive way. The 21 coherent chapters by leading researchers provide complete coverage of all relevant aspects of compiler optimization for scalable parallel systems. The book is divided into five parts covering languages, analysis, communication optimizations, code generation, and run time systems. It will serve as a landmark source of education, information, and reference for students, practitioners, professionals, and researchers who are interested in updating their knowledge of, or are active in, parallel computing.

Table of Contents

Preface v
Santosh Pande
Dharma P. Agrawal
Introduction xxi
Santosh Pande
Dharma P. Agrawal
Compiling for Distributed Memory Multiprocessors xxi
Motivation xxi
Complexity xxii
Outline of the Monograph xxii
Future Directions xxvii
Section I: Languages
High Performance Fortran 2.0
3(42)
Ken Kennedy
Charles Koelbel
Introduction
3(1)
History and Overview of HPF
3(4)
Data Mapping
7(11)
Basic Language Features
7(6)
Advanced Topics
13(5)
Data Parallelism
18(16)
Basic Language Features
19(10)
Advanced Topics
29(5)
Task Parallelism
34(5)
EXTRINSIC Procedures
34(3)
The TASK_REGION Directive
37(2)
Input and Output
39(2)
Summary and Future Outlook
41(4)
The Sisal Project: Real World Functional Programming
45(28)
Jean-Luc Gaudiot
Tom DeBoni
John Feo
Wim Bohm
Walid Najjar
Patrick Miller
Introduction
45(1)
The Sisal Language: A Short Tutorial
46(3)
An Early Implementation: The Optimizing Sisal Compiler
49(4)
Update in Place and Copy Elimination
49(1)
Build in Place
50(1)
Reference Counting Optimization
51(1)
Vectorization
51(1)
Loop Fusion, Double Buffering Pointer Swap, and Inversion
51(2)
Sisal90
53(5)
The Foreign Language Interface
54(4)
A Prototype Distributed-Memory SISAL Compiler
58(4)
Base Compiler
59(1)
Rectangular Arrays
59(1)
Block Messages
60(1)
Multiple Alignment
60(1)
Results
61(1)
Further Work
62(1)
Architecture Support for Multithreaded Execution
62(7)
Blocking and Non-blocking Models
63(1)
Code Generation
64(4)
Summary of Performance Results
68(1)
Conclusions and Future Research
69(4)
HPC++ and the HPC++Lib Toolkit
73(36)
Dennis Gannon
Peter Beckman
Elizabeth Johnson
Todd Green
Mike Levine
Introduction
73(1)
The HPC++ Programming and Execution Model
74(4)
Level 1 HPC++
75(1)
The Parallel Standard Template Library
76(1)
Parallel Iterators
77(1)
Parallel Algorithms
77(1)
Distributed Containers
78(1)
A Simple Example: The Spanning Tree of a Graph
78(4)
Multi-threaded Programming
82(14)
Synchronization
84(8)
Examples of Multi-threaded Computations
92(4)
Implementing the HPC++ Parallel Loop Directives
96(3)
Multi-context Programming and Global Pointers
99(6)
Remote Function and Member Calls
101(2)
Using Corba IDL to Generate Proxies
103(2)
The SPMD Execution Model
105(1)
Barrier Synchronization and Collective Operations
105(1)
Conclusion
106(3)
A Concurrency Abstraction Model for Avoiding Inheritance Anomaly in Object-Oriented Programs
109(32)
Sandeep Kumar
Dharma P. Agrawal
Introduction
109(4)
Approaches to Parallelism Specification
113(2)
Issues in Designing a COOPL
113(1)
Issues in Designing Libraries
114(1)
What Is the Inheritance Anomaly?
115(5)
State Partitioning Anomaly (SPA)
116(2)
History Sensitiveness of Acceptable States Anomaly (HSASA)
118(1)
State Modification Anomaly (SMA)
118(1)
Anomaly A
119(1)
Anomaly B
120(1)
What Is the Reusability of Sequential Classes?
120(1)
A Framework for Specifying Parallelism
121(1)
Previous Approaches
122(1)
The Concurrency Abstraction Model
123(3)
The CORE Language
126(3)
Specifying a Concurrent Region
126(1)
Defining an AC
126(1)
Defining a Parallel Block
127(2)
Synchronization Schemes
129(1)
Illustrations
129(4)
Reusability of Sequential Classes
130(1)
Avoiding the Inheritance Anomaly
131(2)
The Implementation Approach
133(1)
Conclusions and Future Directions
134(7)
Section II: Analysis
Loop Parallelization Algorithms
141(32)
Alain Darte
Yves Robert
Frederic Vivien
Introduction
141(1)
Input and Output of Parallelization Algorithms
142(3)
Input: Dependence Graph
143(1)
Output: Nested Loops
144(1)
Dependence Abstractions
145(4)
Dependence Graphs and Distance Sets
145(2)
Polyhedral Reduced Dependence Graphs
147(1)
Definition and Simulation of Classical Dependence Representations
148(1)
Allen and Kennedy's Algorithm
149(3)
Algorithm
150(1)
Power and Limitations
151(1)
Wolf and Lam's Algorithm
152(4)
Purpose
153(1)
Theoretical Interpretation
153(1)
The General Algorithm
154(1)
Power and Limitations
155(1)
Darte and Vivien's Algorithm
156(11)
Another Algorithm Is Needed
156(2)
Polyhedral Dependences: A Motivating Example
158(2)
Illustrating Example
160(2)
Uniformization Step
162(1)
Scheduling Step
162(3)
Schematic Explanations
165(1)
Power and Limitations
166(1)
Feautrier's Algorithm
167(2)
Conclusion
169(4)
Array Dataflow Analysis
173(48)
Paul Feautrier
Introduction
173(3)
Exact Array Dataflow Analysis
176(14)
Notations
176(1)
The Program Model
176(5)
Data Flow Analysis
181(8)
Summary of the Algorithm
189(1)
Related Work
190(1)
Approximate Array Dataflow Analysis
190(14)
From ADA to FADA
191(4)
Introducing Parameters
195(2)
Taking Properties of Parameters into Account
197(4)
Eliminating Parameters
201(1)
Related Work
202(2)
Analysis of Complex Statements
204(4)
What Is a Complex Statement?
204(2)
ADA in the Presence of Complex Statements
206(1)
Procedure Calls as Complex Statements
206(2)
Applications of ADA and FADA
208(6)
Program Comprehension and Debugging
209(2)
Parallelization
211(1)
Array Expansion and Array Privatization
212(2)
Conclusions
214(7)
Appendix: Mathematical Tools
214(1)
Polyhedra and Polytopes
214(1)
Z-modules
215(1)
Z-polyhedra
216(1)
Parametric Problems
216(5)
Interprocedural Analysis Based on Guarded Array Regions
221(26)
Zhiyuan Li
Junjie Gu
Gyungho Lee
Introduction
221(2)
Preliminary
223(3)
Traditional Flow-Insensitive Summaries
223(2)
Array Data Flow Summaries
225(1)
Guarded Array Regions
226(6)
Operations on GARs
228(2)
Predicate Operations
230(2)
Constructing Summary GARs Interprocedurally
232(6)
Hierarchical Supergraph
232(1)
Summary Algorithms
233(2)
Expansions
235(3)
Implementation Considerations
238(2)
Symbolic Analysis
238(1)
Region Numbering
239(1)
Range Operations
240(1)
Application to Array Privatization and Preliminary Experimental Results
240(3)
Array Privatization
241(1)
Preliminary Experimental Results
241(2)
Related Work
243(1)
Conclusion
244(3)
Automatic Array Privatization
247(38)
Peng Tu
David Padua
Introduction
247(1)
Background
248(2)
Algorithm for Array Privatization
250(11)
Data Flow Framework
250(2)
Inner Loop Abstraction
252(4)
An Example
256(1)
Profitability of Privatization
257(1)
Last Value Assignment
258(3)
Demand-Driven Symbolic Analysis
261(16)
Gated Single Assignment
263(1)
Demand-Driven Backward Substitution
264(2)
Backward Substitution in the Presence of Gating Functions
266(1)
Examples of Backward Substitution
267(2)
Bounds of Symbolic Expression
269(1)
Comparison of Symbolic Expressions
269(3)
Recurrence and the μ Function
272(1)
Bounds of Monotonic Variables
273(1)
Index Array
274(1)
Conditional Data Flow Analysis
275(1)
Implementation and Experiments
276(1)
Related Work
277(8)
Section III: Communication Optimizations
Optimal Tiling for Minimizing Communication in Distributed Shared-Memory Multiprocessors
285(54)
Anant Agarwal
David Kranz
Rajeev Barua
Venkat Natarajan
Introduction
285(4)
Contributions and Related Work
286(2)
Overview of the Paper
288(1)
Problem Domain and Assumptions
289(3)
Program Assumptions
289(2)
System Model
291(1)
Loop Partitions and Data Partitions
292(3)
A Framework for Loop and Data Partitioning
295(19)
Loop Tiles in the Iteration Space
296(2)
Footprints in the Data Space
298(2)
Size of a Footprint for a Single Reference
300(4)
Size of the Cumulative Footprint
304(7)
Minimizing the Size of the Cumulative Footprint
311(3)
General Case of G
314(4)
G Is Invertible, but Not Unimodular
314(2)
Columns of G Are Dependent and the Rows Are Independent
316(1)
The Rows of G Are Dependent
316(2)
Other System Environments
318(4)
Coherence-Related Cache Misses
318(2)
Effect of Cache Line Size
320(1)
Data Partitioning in Distributed-Memory Multicomputers
320(2)
Combined Loop and Data Partitioning in DSMs
322(6)
The Cost Model
322(3)
The Multiple Loops Heuristic Method
325(3)
Implementation and Results
328(6)
Algorithm Simulator Experiments
330(1)
Experiments on the Alewife Multiprocessor
330(4)
Conclusions
334(5)
A Formulation of Loop Tiles Using Bounding Hyperplanes
337(1)
Synchronization References
337(2)
Communication-Free Partitioning of Nested Loops
339(46)
Kuei-Ping Shih
Chua-Huang Huang
Jang-Ping Sheu
Introduction
339(2)
Fundamentals of Array References
341(6)
Iteration Spaces and Data Spaces
342(1)
Reference Functions
343(1)
Properties of Reference Functions
343(4)
Loop-Level Partitioning
347(18)
Iteration and Data Spaces Partitioning - Uniformly Generated References
347(6)
Hyperplane Partitioning of Data Space
353(6)
Hyperplane Partitioning of Iteration and Data Spaces
359(6)
Statement-Level Partitioning
365(12)
Affine Processor Mapping
366(6)
Hyperplane Partitioning
372(5)
Comparisons and Discussions
377(4)
Conclusions
381(4)
Solving Alignment Using Elementary Linear Algebra
385(28)
Vladimir Kotlyar
David Bau
Induprakas Kodukula
Keshav Pingali
Paul Stodghill
Introduction
385(3)
Linear Alignment
388(5)
Equational Constraints
388(2)
Reduction to Null Space Computation
390(1)
Remarks
391(1)
Reducing the Solution Basis
392(1)
Affine Alignment
393(3)
Encoding Affine Constraints as Linear Constraints
393(3)
Replication
396(2)
Formulation of Replication
397(1)
Heuristics
398(4)
Lessons from Some Common Computational Kernels
399(3)
Implications for Alignment Heuristic
402(1)
Conclusion
402(11)
Reducing the Solution Matrix
404(1)
Unrelated Constraints
404(1)
General Procedure
405(3)
A Comment on Affine Encoding
408(5)
A Compilation Method for Communication-Efficient Partitioning of DOALL Loops
413(32)
Santosh Pande
Tareq Bali
Introduction
413(1)
DOALL Partitioning
414(7)
Motivating Example
415(4)
Our Approach
419(2)
Terms and Definitions
421(2)
Example
422(1)
Problem
423(4)
Compatibility Subsets
423(1)
Cyclic Directions
424(3)
Communication Minimization
427(2)
Algorithm: Maximal Compatibility Subsets
427(1)
Algorithm: Maximal Fibonacci Sequence
428(1)
Data Partitioning
428(1)
Partition Merging
429(3)
Granularity Adjustment
431(1)
Load Balancing
431(1)
Mapping
432(1)
Example: Texture Smoothing Code
432(3)
Performance on Cray T3D
435(10)
Conclusions
440(5)
Compiler Optimization of Dynamic Data Distributions for Distributed-Memory Multicomputers
445(40)
Daniel J. Palermo
Eugene W. Hodges IV
Prithviraj Banerjee
Introduction
445(2)
Related Work
447(2)
Dynamic Distribution Selection
449(13)
Motivation for Dynamic Distributions
449(1)
Overview of the Dynamic Distribution Approach
450(1)
Phase Decomposition
451(6)
Phase and Phase Transition Selection
457(5)
Data Redistribution Analysis
462(3)
Reaching Distributions and the Distribution Flow Graph
462(1)
Computing Reaching Distributions
463(1)
Representing Distribution Sets
464(1)
Interprocedural Redistribution Analysis
465(7)
Distribution Synthesis
467(1)
Redistribution Synthesis
468(3)
Static Distribution Assignment (SDA)
471(1)
Results
472(8)
Synthetic HPF Redistribution Example
473(2)
2-D Alternating Direction Implicit (ADI2D) Iterative Method
475(3)
Shallow Water Weather Prediction Benchmark
478(2)
Conclusions
480(5)
A Framework for Global Communication Analysis and Optimizations
485(40)
Manish Gupta
Introduction
485(2)
Motivating Example
487(1)
Available Section Descriptor
488(6)
Representation of ASD
490(2)
Computing Generated Communication
492(2)
Data Flow Analysis
494(11)
Data Flow Variables and Equations
495(3)
Decomposition of Bidirectional Problem
498(1)
Overall Data-Flow Procedure
499(6)
Communication Optimizations
505(3)
Elimination of Redundant Communication
505(1)
Reduction in Volume of Communication
506(1)
Movement of Communication for Subsumption and for Hiding Latency
507(1)
Extensions: Communication Placement
508(2)
Operations on Available Section Descriptors
510(6)
Operations on Bounded Regular Section Descriptors
512(2)
Operations on Mapping Function Descriptors
514(2)
Preliminary Implementation and Results
516(3)
Related Work
519(2)
Global Communication Optimizations
519(1)
Data Flow Analysis and Data Descriptors
520(1)
Conclusions
521(4)
Tolerating Communication Latency through Dynamic Thread Invocation in a Multithreaded Architecture
525(28)
Andrew Sohn
Yuetsu Kodama
Jui-Yuan Ku
Mitsuhisa Sato
Yoshinori Yamaguchi
Introduction
525(2)
Multithreading Principles and Their Realization
527(8)
The Principle
527(3)
The EM-X Multithreaded Distributed-Memory Multiprocessor
530(3)
Architectural Support for Fine-Grain Multithreading
533(2)
Designing Multithreaded Algorithms
535(5)
Multithreaded Bitonic Sorting
535(3)
Multithreaded Fast Fourier Transform
538(2)
Overlapping Analysis
540(4)
Analysis of Switches
544(3)
Conclusions
547(6)
Section IV: Code Generation
Advanced Code Generation for High Performance Fortran
553(44)
Vikram Adve
John Mellor-Crummey
Introduction
553(3)
Background: The Code Generation Problem for HPF
556(5)
Communication Analysis and Code Generation for HPF
556(2)
Previous Approaches to Communication Analysis and Code Generation
558(3)
An Integer Set Framework for Data-Parallel Compilation
561(4)
Primitive Components of the Framework
561(1)
Implementation of the Framework
562(3)
Computation Partitioning
565(8)
Computation Partitioning Models
565(2)
Code Generation to Realize Computation Partitions
567(6)
Communication Code Generation
573(11)
Communication Generation with Message Vectorization and Coalescing
577(4)
Recognizing In-Place Communication
581(1)
Implementing Loop-Splitting for Reducing Communication Overhead
582(2)
Control Flow Simplification
584(6)
Motivation
584(4)
Overview of Algorithm
588(1)
Evaluation and Discussion
589(1)
Conclusions
590(7)
Integer Lattice Based Methods for Local Address Generation for Block-Cyclic Distributions
597(52)
J. Ramanujam
Introduction
597(2)
Background and Related Work
599(4)
Related Work on One-Level Mapping
600(2)
Related Work on Two-Level Mapping
602(1)
A Lattice Based Approach for Address Generation
603(2)
Assumptions
603(1)
Lattices
604(1)
Determination of Basis Vectors
605(9)
Basis Determination Algorithm
607(2)
Extremal Basis Vectors
609(3)
Improvements to the Algorithm for s > k
612(1)
Complexity
613(1)
Address Sequence Generation by Lattice Enumeration
614(2)
Optimization of Loop Enumeration: GO-LEFT and GO-RIGHT
616(4)
Implementation
620(1)
Experimental Results for One-Level Mapping
620(6)
Address Sequence Generation for Two-Level Mapping
626(2)
Problem Statement
626(2)
Algorithms for Two-Level Mapping
628(7)
Itable: An Algorithm That Constructs a Table of Offsets
629(2)
Optimization of the Itable Method
631(3)
Search-Based Algorithms
634(1)
Experimental Results for Two-Level Mapping
635(3)
Other Problems in Code Generation
638(3)
Communication Generation
639(1)
Union and Difference of Regular Sections
640(1)
Code Generation for Complex Subscripts
640(1)
Data Structures for Runtime Efficiency
640(1)
Array Redistribution
641(1)
Summary and Conclusions
641(8)
Section V: Task Parallelism, Dynamic Data Structures and Run Time Systems
A Duplication Based Compile Time Scheduling Method for Task Parallelism
649(34)
Sekhar Darbha
Dharma P. Agrawal
Introduction
649(3)
STDS Algorithm
652(12)
Complexity Analysis
663(1)
Illustration of the STDS Algorithm
664(6)
Performance of the STDS Algorithm
670(10)
CRC Is Satisfied
670(2)
Application of Algorithm for Random Data
672(2)
Application of Algorithm to Practical DAGs
674(1)
Scheduling of Diamond DAGs
675(5)
Comparison with Other Algorithms
680(1)
Conclusions
680(3)
SPMD Execution in the Presence of Dynamic Data Structures
683(26)
Rajiv Gupta
Introduction
683(1)
Language Support for Regular Data Structures
684(9)
Processor Structures
685(1)
Dynamic Data Structures
685(3)
Name Generation and Distribution Strategies
688(1)
Examples
689(4)
Compiler Support for Regular Data Structures
693(10)
Representing Pointers and Data Structures
693(1)
Translation of Pointer Operations
694(9)
Supporting Irregular Data Structures
703(2)
Compile-Time Optimizations
705(1)
Related Work
706(3)
Supporting Dynamic Data Structures with Olden
709(42)
Martin C. Carlisle
Anne Rogers
Introduction
709(2)
Programming Model
711(4)
Programming Language
711(1)
Data Layout
711(3)
Marking Available Parallelism
714(1)
Execution Model
715(7)
Handling Remote References
715(3)
Introducing Parallelism
718(1)
A Simple Example
719(3)
Selecting Between Mechanisms
722(9)
Using Local Path Lengths
723(1)
Update Matrices
724(2)
The Heuristic
726(5)
Experimental Results
731(4)
Comparison with Other Published Work
733(1)
Heuristic Results
733(2)
Summary
735(1)
Profiling in Olden
735(4)
Verifying Local Path Lengths
736(3)
Related Work
739(6)
Gupta's Work
741(1)
Object-Oriented Systems
741(2)
Extensions of C with Fork-Join Parallelism
743(1)
Other Related Work
744(1)
Conclusions
745(6)
Runtime and Compiler Support for Irregular Computations
751(28)
Raja Das
Yuan-Shin Hwang
Joel Saltz
Alan Sussman
Introduction
751(2)
Overview of the CHAOS Runtime System
753(5)
Compiler Transformations
758(11)
Transformation Example
759(4)
Definitions
763(2)
Transformation Algorithm
765(4)
Experiments
769(6)
Hand Parallelization with CHAOS
769(4)
Compiler Parallelization Using CHAOS
773(2)
Conclusions
775(4)
Author Index 779
