Optimizing Compilers for Modern Architectures

by Randy Allen and Ken Kennedy
  • ISBN13: 9781558602861
  • ISBN10: 1558602860

  • Edition: 1st
  • Format: Hardcover
  • Copyright: 2001-09-26
  • Publisher: Elsevier Science

Summary

Modern computer architectures built around high-performance microprocessors offer tremendous potential gains in performance over previous designs. Yet their very complexity makes it increasingly difficult to generate efficient code and to realize that potential. This landmark text from two leaders in the field focuses on the pivotal role that compilers can play in addressing this critical issue.

The basis for all the methods presented in the book is data dependence, a fundamental compiler analysis tool for optimizing programs on high-performance microprocessors and parallel architectures. Data dependence enables compiler writers to build compilers that automatically transform simple, sequential programs into forms that can exploit the special features of these modern architectures. The text provides a broad introduction to data dependence, to the many transformation strategies it supports, and to its application to important optimization problems such as parallelization, compiler memory hierarchy management, and instruction scheduling.

The authors demonstrate the importance and wide applicability of dependence-based compiler optimizations and give the compiler writer the basics needed to understand and implement them. For computational scientists and engineers driven to obtain the best possible performance from their complex applications, they also offer cookbook explanations for transforming programs by hand. The approaches presented are based on research conducted over the past two decades, emphasizing the strategies implemented in research prototypes at Rice University and in several associated commercial systems. Randy Allen and Ken Kennedy have provided an indispensable resource for researchers, practicing professionals, and graduate students engaged in designing and optimizing compilers for modern computer architectures.
  • Offers a guide to the simple, practical algorithms and approaches that are most effective in real-world, high-performance microprocessor and parallel systems.
  • Demonstrates each transformation in worked examples.
  • Examines how two case-study compilers implement the theories and practices described in each chapter.
  • Presents the most complete treatment of memory hierarchy issues of any compiler text.
  • Illustrates ordering relationships with dependence graphs throughout the book.
  • Applies the techniques to a variety of languages, including Fortran 77, C, hardware definition languages, Fortran 90, and High Performance Fortran.
  • Provides extensive references to the most sophisticated algorithms known in research.
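The central idea the summary describes, data dependence, can be illustrated with a small sketch (this example is not from the book; the function names and data are hypothetical). A statement such as `a[i] = a[i-1] + b[i]` carries a true dependence of distance 1 from one iteration to the next, so a transformation that reads all right-hand sides before performing any write (the semantics of a Fortran 90 array assignment, or of naive vectorization) changes the result:

```python
def sequential(a, b):
    """Run the loop in original, sequential order."""
    a = a[:]
    for i in range(1, len(a)):
        a[i] = a[i - 1] + b[i]  # reads the value written on iteration i-1
    return a

def vectorized(a, b):
    """Array-assignment semantics: every RHS is read before any write."""
    a = a[:]
    rhs = [a[i - 1] + b[i] for i in range(1, len(a))]  # all old values
    a[1:] = rhs
    return a

a = [1, 0, 0, 0]
b = [0, 1, 1, 1]
print(sequential(a, b))  # [1, 2, 3, 4]
print(vectorized(a, b))  # [1, 2, 1, 1]
```

Because the two versions disagree, the loop-carried dependence makes the vectorization illegal as-is; dependence analysis is what lets a compiler detect this and either reject the transformation or repair it with the techniques the book covers (loop distribution, node splitting, and so on).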

Table of Contents

Preface xvii
Compiler Challenges for High-Performance Architectures 1(34)
Overview and Goals 1(3)
Pipelining 4(7)
Pipelined Instruction Units 4(2)
Pipelined Execution Units 6(1)
Parallel Functional Units 7(1)
Compiling for Scalar Pipelines 8(3)
Vector Instructions 11(3)
Vector Hardware Overview 11(1)
Compiling for Vector Pipelines 12(2)
Superscalar and VLIW Processors 14(3)
Multiple-Issue Instruction Units 14(1)
Compiling for Multiple-Issue Processors 15(2)
Processor Parallelism 17(4)
Overview of Processor Parallelism 17(2)
Compiling for Asynchronous Parallelism 19(2)
Memory Hierarchy 21(2)
Overview of Memory Systems 21(1)
Compiling for Memory Hierarchy 22(1)
A Case Study: Matrix Multiplication 23(5)
Advanced Compiler Technology 28(3)
Dependence 28(2)
Transformations 30(1)
Summary 31(1)
Case Studies 31(1)
Historical Comments and References 32(3)
Exercises 33(2)
Dependence: Theory and Practice 35(38)
Introduction 35(1)
Dependence and Its Properties 36(20)
Load-Store Classification 37(1)
Dependence in Loops 38(3)
Dependence and Transformations 41(4)
Distance and Direction Vectors 45(4)
Loop-Carried and Loop-Independent Dependences 49(7)
Simple Dependence Testing 56(3)
Parallelization and Vectorization 59(10)
Parallelization 59(1)
Vectorization 60(3)
An Advanced Vectorization Algorithm 63(6)
Summary 69(1)
Case Studies 70(1)
Historical Comments and References 70(3)
Exercises 71(2)
Dependence Testing 73(62)
Introduction 73(6)
Background and Terminology 74(5)
Dependence Testing Overview 79(2)
Subscript Partitioning 80(1)
Merging Direction Vectors 81(1)
Single-Subscript Dependence Tests 81(30)
ZIV Test 82(1)
SIV Tests 82(12)
Multiple Induction-Variable Tests 94(17)
Testing in Coupled Groups 111(10)
The Delta Test 112(8)
More Powerful Multiple-Subscript Tests 120(1)
An Empirical Study 121(2)
Putting It All Together 123(6)
Summary 129(1)
Case Studies 130(1)
Historical Comments and References 131(4)
Exercises 132(3)
Preliminary Transformations 135(36)
Introduction 135(3)
Information Requirements 138(1)
Loop Normalization 138(3)
Data Flow Analysis 141(14)
Definition-Use Chains 142(3)
Dead Code Elimination 145(1)
Constant Propagation 146(2)
Static Single-Assignment Form 148(7)
Induction-Variable Exposure 155(11)
Forward Expression Substitution 156(2)
Induction-Variable Substitution 158(4)
Driving the Substitution Process 162(4)
Summary 166(1)
Case Studies 167(1)
Historical Comments and References 168(3)
Exercises 169(2)
Enhancing Fine-Grained Parallelism 171(68)
Introduction 171(2)
Loop Interchange 173(11)
Safety of Loop Interchange 174(3)
Profitability of Loop Interchange 177(2)
Loop Interchange and Vectorization 179(5)
Scalar Expansion 184(11)
Scalar and Array Renaming 195(7)
Node Splitting 202(3)
Recognition of Reductions 205(4)
Index-Set Splitting 209(5)
Threshold Analysis 209(2)
Loop Peeling 211(1)
Section-Based Splitting 212(2)
Run-Time Symbolic Resolution 214(2)
Loop Skewing 216(5)
Putting It All Together 221(5)
Complications of Real Machines 226(4)
Summary 230(1)
Case Studies 231(5)
PFC 231(1)
Ardent Titan Compiler 232(2)
Vectorization Performance 234(2)
Historical Comments and References 236(3)
Exercises 237(2)
Creating Coarse-Grained Parallelism 239(80)
Introduction 239(1)
Single-Loop Methods 240(31)
Privatization 240(5)
Loop Distribution 245(1)
Alignment 246(3)
Code Replication 249(5)
Loop Fusion 254(17)
Perfect Loop Nests 271(17)
Loop Interchange for Parallelization 271(4)
Loop Selection 275(4)
Loop Reversal 279(1)
Loop Skewing for Parallelization 280(4)
Unimodular Transformations 284(1)
Profitability-Based Parallelization Methods 285(3)
Imperfectly Nested Loops 288(9)
Multilevel Loop Fusion 289(3)
A Parallel Code Generation Algorithm 292(5)
An Extended Example 297(1)
Packaging of Parallelism 298(11)
Strip Mining 300(1)
Pipeline Parallelism 301(3)
Scheduling Parallel Work 304(2)
Guided Self-Scheduling 306(3)
Summary 309(1)
Case Studies 310(5)
PFC and ParaScope 310(1)
Ardent Titan Compiler 311(4)
Historical Comments and References 315(4)
Exercises 316(3)
Handling Control Flow 319(62)
Introduction 319(1)
If-Conversion 320(30)
Definition 321(1)
Branch Classification 322(1)
Forward Branches 323(4)
Exit Branches 327(6)
Backward Branches 333(3)
Complete Forward Branch Removal 336(2)
Simplification 338(5)
Iterative Dependences 343(5)
If-Reconstruction 348(2)
Control Dependence 350(26)
Constructing Control Dependence 352(3)
Control Dependence in Loops 355(1)
An Execution Model for Control Dependences 356(3)
Application of Control Dependence to Parallelization 359(17)
Summary 376(1)
Case Studies 376(1)
Historical Comments and References 377(4)
Exercises 378(3)
Improving Register Usage 381(88)
Introduction 381(1)
Scalar Register Allocation 381(6)
Data Dependence for Register Reuse 383(1)
Loop-Carried and Loop-Independent Reuse 384(2)
A Register Allocation Example 386(1)
Scalar Replacement 387(16)
Pruning the Dependence Graph 387(5)
Simple Replacement 392(1)
Handling Loop-Carried Dependences 392(1)
Dependences Spanning Multiple Iterations 393(1)
Eliminating Scalar Copies 394(1)
Moderating Register Pressure 395(1)
Scalar Replacement Algorithm 396(4)
Experimental Data 400(3)
Unroll-and-Jam 403(12)
Legality of Unroll-and-Jam 406(3)
Unroll-and-Jam Algorithm 409(3)
Effectiveness of Unroll-and-Jam 412(3)
Loop Interchange for Register Reuse 415(5)
Considerations for Loop Interchange 417(2)
Loop Interchange Algorithm 419(1)
Loop Fusion for Register Reuse 420(33)
Profitable Loop Fusion for Reuse 421(3)
Loop Alignment for Fusion 424(4)
Fusion Mechanics 428(6)
A Weighted Loop Fusion Algorithm 434(16)
Multilevel Loop Fusion for Register Reuse 450(3)
Putting It All Together 453(3)
Ordering the Transformations 453(1)
An Example: Matrix Multiplication 454(2)
Complex Loop Nests 456(9)
Loops with If Statements 457(2)
Trapezoidal Loops 459(6)
Summary 465(1)
Case Studies 465(1)
Historical Comments and References 466(3)
Exercises 467(2)
Managing Cache 469(42)
Introduction 469(2)
Loop Interchange for Spatial Locality 471(6)
Blocking 477(14)
Unaligned Data 478(2)
Legality of Blocking 480(2)
Profitability of Blocking 482(1)
A Simple Blocking Algorithm 483(2)
Blocking with Skewing 485(1)
Fusion and Alignment 486(3)
Blocking in Combination with Other Transformations 489(1)
Effectiveness 490(1)
Cache Management in Complex Loop Nests 491(4)
Triangular Cache Blocking 491(1)
Special-Purpose Transformations 492(3)
Software Prefetching 495(12)
A Software Prefetching Algorithm 496(10)
Effectiveness of Software Prefetching 506(1)
Summary 507(1)
Case Studies 508(1)
Historical Comments and References 509(2)
Exercises 510(1)
Scheduling 511(38)
Introduction 511(1)
Instruction Scheduling 512(24)
Machine Model 514(1)
Straight-Line Graph Scheduling 515(1)
List Scheduling 516(2)
Trace Scheduling 518(6)
Scheduling in Loops 524(12)
Vector Unit Scheduling 536(7)
Chaining 537(3)
Coprocessors 540(3)
Summary 543(1)
Case Studies 543(3)
Historical Comments and References 546(3)
Exercises 547(2)
Interprocedural Analysis and Optimization 549(56)
Introduction 549(1)
Interprocedural Analysis 550(42)
Interprocedural Problems 550(6)
Interprocedural Problem Classification 556(4)
Flow-Insensitive Side Effect Analysis 560(8)
Flow-Insensitive Alias Analysis 568(5)
Constant Propagation 573(5)
Kill Analysis 578(3)
Symbolic Analysis 581(4)
Array Section Analysis 585(3)
Call Graph Construction 588(4)
Interprocedural Optimization 592(3)
Inline Substitution 592(2)
Procedure Cloning 594(1)
Hybrid Optimizations 595(1)
Managing Whole-Program Compilation 595(4)
Summary 599(1)
Case Studies 600(2)
Historical Comments and References 602(3)
Exercises 604(1)
Dependence in C and Hardware Design 605(50)
Introduction 605(1)
Optimizing C 606(11)
Pointers 608(2)
Naming and Structures 610(2)
Loops 612(1)
Scoping and Statics 613(1)
Dialect 613(2)
Miscellaneous 615(2)
Hardware Design 617(34)
Hardware Description Languages 619(3)
Optimizing Simulation 622(17)
Synthesis Optimization 639(12)
Summary 651(1)
Case Studies 652(1)
Historical Comments and References 652(3)
Exercises 653(2)
Compiling Array Assignments 655(34)
Introduction 655(1)
Simple Scalarization 656(4)
Scalarization Transformations 660(8)
Loop Reversal 660(1)
Input Prefetching 661(5)
Loop Splitting 666(2)
Multidimensional Scalarization 668(15)
Simple Scalarization in Multiple Dimensions 670(1)
Outer Loop Prefetching 671(3)
Loop Interchange for Scalarization 674(3)
General Multidimensional Scalarization 677(4)
A Scalarization Example 681(2)
Considerations for Vector Machines 683(1)
Postscalarization Interchange and Fusion 684(2)
Summary 686(1)
Case Studies 687(1)
Historical Comments and References 687(2)
Exercises 688(1)
Compiling High Performance Fortran 689(46)
Introduction 689(5)
HPF Compiler Overview 694(4)
Basic Loop Compilation 698(12)
Distribution Propagation and Analysis 698(2)
Iteration Partitioning 700(4)
Communication Generation 704(6)
Optimization 710(18)
Communication Vectorization 710(6)
Overlapping Communication and Computation 716(1)
Alignment and Replication 717(2)
Pipelining 719(2)
Identification of Common Recurrences 721(1)
Storage Management 722(4)
Handling Multiple Dimensions 726(2)
Interprocedural Optimization for HPF 728(1)
Summary 729(1)
Case Studies 729(3)
Historical Comments and References 732(3)
Exercises 733(2)
Appendix Fundamentals of Fortran 90 735(8)
Introduction 735(1)
Lexical Properties 735(1)
Array Assignment 736(5)
Library Functions 741(1)
Further Reading 742(1)
References 743(22)
Index 765
