1 Prologue

1.1 Introduction

2 Phase Ordering in Optimising Compilation

2.1 Introduction to the Phase Ordering Problem

2.2 Background on Phase Ordering

2.2.1 Performance Modelling and Prediction

2.2.2 Some Attempts in Phase Ordering

2.3 Towards a Theoretical Model for Phase Ordering Problem

2.3.1 Decidability Results

2.3.2 Another Formulation of the Phase Ordering Problem

2.4 Examples of Decidable Simplified Cases

2.4.1 Models with Compilation Costs

2.4.2 One-pass Generative Compilers

2.5 Compiler Optimisation Parameters Space Exploration

2.5.1 Towards a Theoretical model

2.5.2 Examples of Simplified Decidable Cases

2.6 Conclusion on Phase Ordering

3 The Register Need

3.1 Data Dependence Graph and Processor Models

3.2 The Acyclic Register Need

3.3 The Periodic Register Need

3.3.1 Software Pipelining, Periodic Scheduling, Cyclic Scheduling

3.3.2 The Circular Lifetime Intervals

3.4 Computing the Periodic Register Need

3.5 Some Results on the Periodic Register Need

3.5.1 Minimal Periodic Register Need vs. Initiation Interval

3.5.2 Computing the Periodic Register Sufficiency

3.5.3 Stage Scheduling under Register Constraints

3.6 Conclusion on the Register Requirement

4 The Register Saturation

4.1 Motivations on the Register Saturation Concept

4.2 Computing the Acyclic Register Saturation

4.2.1 Characterising the Register Saturation

4.2.2 Efficient Algorithmic Heuristic for RS Computation

4.2.3 Experimental Efficiency of Greedy-k

4.3 Computing the Periodic Register Saturation

4.4 Conclusion on the Register Saturation

5 Spill Code Reduction

5.1 Introduction on Register Constraints in Software Pipelining

5.2 Related Work in Periodic Register Allocation

5.3 SIRA: Schedule Independant Register Allocation

5.3.1 Reuse Graphs

5.3.2 DDG Associated to Reuse Graph

5.3.3 Exact SIRA with Integer Linear Programming

5.3.4 SIRA with Fixed Reuse Edges

5.4 SIRALINA: An Efficient Polynomial Heuristic for SIRA

5.5 Experimental Results with SIRA

5.6 Conclusion on Spill Code Reduction

6 Exploiting the Register Access Delays

6.1 Problem Description of DDG Cycles with Non-positive Distances .

6.2 Eliminating Non-Positive Cycles

6.3 Experimental Results on Eliminating Non-Positive Cycles

6.4 Conclusion on Non-Positive Cycles Elimination

7 Loop Unrolling Degree Minimisation

7.1 Introduction

7.2 Unroll Degree Minimisation of Unscheduled Loops

7.2.1 Problem Description of Unroll Factor Minimisation for Unscheduled Loops

7.2.2 Algorithmic Solution for Unroll Factor Minimisation: Single Register Type

7.2.3 Solution for LCM Problem

7.2.4 Unroll Factor Minimisation in Presence of Multiple Register Types

7.2.5 Solution for Minimal Loop Unrolling

7.3 Unroll Degree Minimisation of Scheduled Loops

7.4 Experimental Results

7.5 Conclusion on Loop Unroll Degree Minimisation

8 Memory Hierarchy E↵ects and ILP

8.1 Problem of Memory Disambiguation at Runtime

8.1.1 Introduction

8.1.2 Related Work

8.1.3 Experimental Environment

8.1.4 Experimentation Methodology

8.1.5 Experimental Study of Cache Behavior

8.1.6 The E↵ectiveness of Load/Store Vectorisation

8.1.7 Conclusion on Memory Disambiguation Mechanisms

8.2 Data Preloading and Prefetching

8.2.1 Introduction

8.2.2 Related Work

8.2.3 Problems of Optimising Cache E↵ects at the Instruction Level

8.2.4 Target Processor Description

8.2.5 Our Methodology of Instruction-Level Code Optimisation

8.2.6 Experimental Results

8.2.7 Conclusion on Pre-fetching and Pre-Loading

9 Statistical Performance Analysis

9.1 Code Performance Variation

9.2 The Speedup-Test Protocole

9.2.1 The Observed Speedups

9.2.2 The Speedup of the Observed Average Execution Time

9.2.3 The Speedup of the Observed Median Execution Time, as well as Individual Runs

9.3 Discussion and Conclusion on the Speedup-Test

10 Epilogue

10.1 Problem of Instruction Selection

10.2 Perspectives on Code Optimisation for Multi-Core Processors

10.3 General Conclusion

A Benchmarks Presentation 125

A.1 Qualitative Benchmarks Presentation

A.2 Quantitative Benchmarks Presentation

A.3 Changing the Architectural Configuration of the Processor

B Experiments on Register Saturation

B.1 The Acyclic Register Saturation

B.1.1 On the Oprimal RS Computation

B.1.2 On the Accuracy of Greedy-k Heuristic vs. Optimal RS

B.1.3 Greedy-k Execution Times

B.2 The Periodic Register Saturation

B.2.1 Optimal PRS Computation

B.2.2 Approximate PRS Computation with Heuristic

C Experiments on SIRA

C.1 Efficiency of SIRALINA on Standalone DDG

C.1.1 Naming conventions for register optimisation orders

C.1.2 Experimental efficiency of SIRALINA

C.1.3 Measuring the Increase of the MII

C.1.4 Efficiency of SIRALINA Execution Times

C.2 Efficiency of SIRALINA plugged Inside Industrial Compiler

C.2.1 Static Performance Results

C.2.2 Execution Time Performance Results

D Experiments on Non-Positive Cycles Elimination 151

D.1 Experimental Setup

D.1.1 Heuristics Nomenclature

D.1.2 Empirical Efficiency Measures

D.2 Comparison of the Heuristics Execution Times

D.2.1 Time to Minimise Register Pressure for a fixed II

D.3 Convergence of the Proactive Heuristic (Iterative SIRALINA)

D.4 Qualitative Analysis of the Heuristics

D.4.1 Number of Saved Registers

D.4.2 Proportion of Success

D.4.3 Increase of the MII when Success

D.5 Conclusion on Non-Positive Cycles Elimination Strategy

E Experiments on Unroll Degree Minimisation

E.1 Standalone Experiments with Single Register types

E.1.1 Experiments with Unscheduled Loops

E.1.2 Results on Randomly Generated DDG

E.1.3 Experiments on Real DDG

E.1.4 Experiments with Scheduled Loops

E.2 Experiments with Multiple Register Types

F Experiments on Preloading and Prefetching