Introduction | p. 1 |
Basic goals of the book | p. 1 |
What do I get for one Watt today? | p. 1 |
Main memory bottleneck | p. 3 |
Optimize resource usage | p. 3 |
Application design | p. 4 |
Organization of the book | p. 4 |
Historical aspects | p. 4 |
Parameterization | p. 5 |
Models | p. 5 |
Core optimization | p. 6 |
Node optimization | p. 6 |
Cluster optimization | p. 6 |
Grid-brokering to save energy | p. 7 |
Historical highlights | p. 9 |
Evolution of computing | p. 9 |
The first computer companies | p. 14 |
ERA, EMCC and Univac | p. 14 |
Control Data Corporation, CDC | p. 14 |
Cray Research | p. 15 |
Thinking Machines Corporation | p. 16 |
International Business Machines (IBM) | p. 17 |
The ASCI effort | p. 18 |
The Japanese efforts | p. 19 |
The computer generations | p. 20 |
The evolution in computing performance | p. 20 |
Performance/price evolution | p. 22 |
Evolution of basic software | p. 22 |
Evolution of algorithmic complexity | p. 23 |
The TOP500 list | p. 25 |
Outlook with the TOP500 curves | p. 27 |
The GREEN500 List | p. 28 |
Proposal for a REAL500 list | p. 30 |
Parameterization | p. 31 |
Definitions | p. 31 |
Parameterization of applications | p. 35 |
Application parameter set | p. 35 |
Parameterization of BLAS library routines | p. 36 |
SMXV: Parameterization of sparse matrix*vector operation | p. 38 |
Parameterization of a computational node Pi ∈ ri | p. 39 |
Parameterization of the interconnection networks | p. 41 |
Types of networks | p. 41 |
Parameterization of clusters and networks | p. 42 |
Parameters related to running applications | p. 44 |
Conclusion | p. 47 |
Models | p. 49 |
The performance prediction model | p. 49 |
The execution time evaluation model (ETEM) | p. 53 |
A network performance model | p. 53 |
The extended ¿ - ¿ model | p. 55 |
Validation of the models | p. 56 |
Methodology | p. 56 |
Example: The full matrix*matrix multiplication DGEMM | p. 57 |
Example: Sparse matrix*vector multiplication SMXV | p. 59 |
Core optimization | p. 63 |
Some useful notions | p. 63 |
Data hierarchy | p. 63 |
Data representation | p. 64 |
Floating point operations | p. 67 |
Pipelining | p. 68 |
Single core optimization | p. 70 |
Single core architectures | p. 70 |
Memory conflicts | p. 70 |
Indirect addressing | p. 74 |
Unrolling | p. 75 |
Dependency | p. 76 |
Inlining | p. 78 |
If statement in a loop | p. 78 |
Code porting aspects | p. 79 |
How to develop application software | p. 83 |
Application to plasma physics codes | p. 84 |
Tokamaks and Stellarators | p. 84 |
Optimization of VMEC | p. 88 |
Optimization of TERPSICHORE | p. 91 |
Conclusions for single core optimization | p. 94 |
Node optimization | p. 95 |
Shared memory computer architectures | p. 95 |
SMP/NUMA architectures | p. 95 |
The Cell | p. 99 |
GPGPU for HPC | p. 100 |
Node comparison and OpenMP | p. 105 |
Race condition with OpenMP | p. 109 |
Application optimization with OpenMP: the 3D Helmholtz solver | p. 110 |
Fast Helmholtz solver for parallelepipedic geometries | p. 111 |
NEC SX-5 reference benchmark | p. 113 |
Single processor benchmarks | p. 114 |
Parallelization with OpenMP | p. 115 |
Parallelization with MPI | p. 115 |
Conclusion | p. 119 |
Application optimization with OpenMP: TERPSICHORE | p. 119 |
Cluster optimization | p. 121 |
Introduction on parallelization | p. 121 |
Internode communication networks | p. 121 |
Network architectures | p. 121 |
Comparison between network architectures | p. 129 |
Distributed memory parallel computer architectures | p. 131 |
Integrated parallel computer architectures | p. 131 |
Commodity cluster architectures | p. 134 |
Energy consumption issues | p. 136 |
The issue of resilience | p. 137 |
Type of parallel applications | p. 138 |
Embarrassingly parallel applications | p. 138 |
Applications with point-to-point communications | p. 138 |
Applications with multicast communication needs | p. 139 |
Shared memory applications (OpenMP) | p. 139 |
Component-based applications | p. 139 |
Domain decomposition techniques | p. 139 |
Test example: The Gyrotron | p. 140 |
The geometry and the mesh | p. 142 |
Connectivity conditions | p. 142 |
Parallel matrix solver | p. 143 |
The electrostatic precipitator | p. 145 |
Scheduling of parallel applications | p. 146 |
Static scheduling | p. 146 |
Dynamic scheduling | p. 146 |
SpecuLOOS | p. 147 |
Introduction | p. 147 |
Test case description | p. 147 |
Complexity on one node | p. 149 |
Wrong complexity on the Blue Gene/L | p. 150 |
Fine results on the Blue Gene/L | p. 151 |
Conclusions | p. 151 |
TERPSICHORE | p. 153 |
Parallelization of the LEMan code with MPI and OpenMP | p. 154 |
Introduction | p. 154 |
Parallelization | p. 154 |
CPU time results | p. 156 |
Conclusions | p. 159 |
Grid-level Brokering to save energy | p. 161 |
About Grid resource brokering | p. 161 |
An Introduction to ïanos | p. 162 |
Job Submission Scenario | p. 164 |
The cost model | p. 165 |
Mathematical formulation | p. 165 |
CPU costs Ke | p. 167 |
License fees Kl | p. 169 |
Costs due to waiting time Kw | p. 169 |
Energy costs Keco | p. 169 |
Data transfer costs Kd | p. 171 |
Example: The Pleiades clusters' CPU cost per hour | p. 171 |
Different currencies in a Grid environment | p. 173 |
The implementation | p. 173 |
Architecture & Design | p. 174 |
The Grid Adapter | p. 174 |
The Meta Scheduling Service (MSS) | p. 175 |
The Resource Broker | p. 176 |
The System Information | p. 177 |
The Data Warehouse | p. 177 |
The Monitoring Service | p. 177 |
The Monitoring Module VAMOS | p. 178 |
Integration with UNICORE Grid System | p. 179 |
Scheduling algorithm | p. 179 |
User Interfaces to the ïanos framework | p. 181 |
DVS-able processors | p. 182 |
Power consumption of a CPU | p. 183 |
An algorithm to save energy | p. 184 |
First results with SMXV | p. 185 |
A first implementation | p. 186 |
Conclusions | p. 188 |
Recommendations | p. 189 |
Application oriented recommendations | p. 189 |
Code development | p. 189 |
Code validation | p. 189 |
Porting codes | p. 190 |
Optimizing parallelized applications | p. 190 |
Race condition | p. 190 |
Hardware and basic software aspects | p. 191 |
Basic software | p. 191 |
Choice of system software | p. 192 |
Energy reduction | p. 192 |
Processor frequency adaptation | p. 192 |
Improved cooling | p. 193 |
Choice of optimal resources | p. 193 |
Best choice of new computer | p. 193 |
Last but not least | p. 194 |
Miscellaneous | p. 194 |
Course material | p. 194 |
A new REAL500 List | p. 194 |
Glossary | p. 197 |
References | p. 205 |
About the authors | p. 213 |
Index | p. 215 |
Table of Contents provided by Ingram. All Rights Reserved. |