High performance embedded architectures and compilers : Third International Conference, HiPEAC 2008, Göteborg, Sweden, January 27-29, 2008 : proceedings

This book constitutes the refereed proceedings of the Third International Conference on High Performance Embedded Architectures and Compilers, HiPEAC 2008, held in Göteborg, Sweden, January 27-29, 2008. The 25 revised full papers presented together with 1 invited keynote paper were carefully review...

Bibliographic Details
Main Authors	Hutchison, David, Kanade, Takeo, Kittler, Josef, Gupta, Rajiv K, Ungerer, Theo
Format	eBook Book
Language	English
Published	Berlin Springer 2008 Springer Berlin / Heidelberg
Edition	1
Series	Lecture Notes in Computer Science
Subjects	Compilers (Computer programs) Computer architecture Congresses Embedded computer systems Embedded computer systems-Congresses
ISBN	3540775595 9783540775591

Table of Contents:

Intro -- Preface -- Organization -- Table of Contents -- Supercomputing for the Future, Supercomputing from the Past (Keynote) -- MIPS MT: A Multithreaded RISC Architecture for Embedded Real-Time Processing -- rMPI: Message Passing on Multicore Processors with On-Chip Interconnect -- Modeling Multigrain Parallelism on Heterogeneous Multi-core Processors: A Case Study of the Cell BE -- BRAM-LUT Tradeoff on a Polymorphic DES Design -- Architecture Enhancements for the ADRES Coarse-Grained Reconfigurable Array -- Implementation of an UWB Impulse-Radio Acquisition and Despreading Algorithm on a Low Power ASIP -- Fast Bounds Checking Using Debug Register -- Studying Compiler Optimizations on Superscalar Processors Through Interval Analysis -- An Experimental Environment Validating the Suitability of CLI as an Effective Deployment Format for Embedded Systems -- Compilation Strategies for Reducing Code Size on a VLIW Processor with Variable Length Instructions -- Experiences with Parallelizing a Bio-informatics Program on the Cell BE -- Drug Design Issues on the Cell BE -- COFFEE: COmpiler Framework for Energy-Aware Exploration -- Integrated CPU Cache Power Management in Multiple Clock Domain Processors -- Variation-Aware Software Techniques for Cache Leakage Reduction Using Value-Dependence of SRAM Leakage Due to Within-Die Process Variation -- The Significance of Affectors and Affectees Correlations for Branch Prediction -- Turbo-ROB: A Low Cost Checkpoint/Restore Accelerator -- LPA: A First Approach to the Loop Processor Architecture -- Complementing Missing and Inaccurate Profiling Using a Minimum Cost Circulation Algorithm -- Using Dynamic Binary Instrumentation to Generate Multi-platform SimPoints: Methodology and Accuracy -- Phase Complexity Surfaces: Characterizing Time-Varying Program Behavior -- MLP-Aware Dynamic Cache Partitioning
Compiler Techniques for Reducing Data Cache Miss Rate on a Multithreaded Architecture -- Code Arrangement of Embedded Java Virtual Machine for NAND Flash Memory -- Aggressive Function Inlining: Preventing Loop Blockings in the Instruction Cache -- Author Index
Intro -- Title Page -- Preface -- Organization -- Table of Contents -- Invited Program -- Supercomputing for the Future, Supercomputing from the Past (Keynote) -- Part I Multithreaded and Multicore Processors -- MIPS MT: A Multithreaded RISC Architecture for Embedded Real-Time Processing -- Multithreading and Embedded Processors -- The Hierarchy of Multithreaded Entities -- VPEs as Exception Domains -- VPEs as Scheduling Domains -- Thread Creation and Destruction -- Thread Creation with FORK -- Thread Termination with YIELD -- Inter-thread Synchronization -- Hybrid Scheduling Control -- Zero-Latency Event Service Using YIELD Instructions -- Hierarchically Programmable Scheduling -- Gating Storage as a Peripheral Interface -- Virtualization of MIPS MT Resources -- Thread Context Virtualization -- ITC Storage Virtualization -- YIELD Qualifier Virtualization -- Software Use Models -- Asymmetric Multiple Virtual Processors (AMVP) -- Symmetric Multiple Virtual Processors (SMVP) -- Symmetric Multiple TC (SMTC) -- The ROPE Kernel -- Experimental Results -- Synchronization -- Thread Creation/Destruction -- Latency Tolerance -- Conclusions -- References -- rMPI: Message Passing on Multicore Processors with On-Chip Interconnect -- Introduction -- Background -- Design -- Evaluation and Analysis -- Related Work -- Conclusion -- Modeling Multigrain Parallelism on Heterogeneous Multi-core Processors: A Case Study of the Cell BE -- Introduction -- Modeling Abstractions -- Hardware Abstraction -- Program Abstraction -- Model of Multi-grain Parallelism -- Modeling Sequential Execution -- Modeling Parallel Execution on APUs -- Modeling Parallel Execution on HPUs -- Experimental Validation and Results -- Parameter Approximation -- PBPI Outline -- PBPI with One Dimension of Parallelism -- PBPI with Two Dimensions of Parallelism -- RAxML Outline
RAxML with Two Dimensions of Parallelism -- Related Work -- Conclusions -- Part IIa Reconfigurable - ASIP -- BRAM-LUT Tradeoff on a Polymorphic DES Design -- Introduction -- DES Computation -- Proposed DES Structure -- Polymorphic Implementation -- Performance Analysis -- Conclusions -- Architecture Enhancements for the ADRES Coarse-Grained Reconfigurable Array -- Introduction -- ADRES Base Architecture and Programming Model -- Tool Flow -- Architectural Explorations -- Distributing Local Data Register Files -- Interconnection Topology -- Register File Size Modification -- Final Results -- Putting It All Together -- Final Architecture Power Decomposition -- Energy-Delay Architectures Analysis of Different Array Sizes -- Conclusions -- Implementation of an UWB Impulse-Radio Acquisition and Despreading Algorithm on a Low Power ASIP -- Introduction -- UWB Impulse-Radio -- Introduction -- Transmitter -- Receiver -- Reference Application and Processor -- Application Code -- Processor -- Reference Results -- Optimizing for Power Reduction -- Reduction of the Instruction Word Size -- Loopcache -- Input Operand Isolation -- Clock Gating -- Power Gating -- Custom Operations -- Constructing the Optimized Application and Processor -- Application Code Modifications -- Processor Modifications -- Optimized Results -- Performance Comparison ASIP vs. ASIC Implementation -- Conclusions -- Part IIb Compiler Optimizations -- Fast Bounds Checking Using Debug Register -- Introduction -- Related Work -- The Boud Approach -- Association Between References and Referents -- Debug Register in Intel X86 Architecture -- Detecting Bounds Violations Using Debug Registers -- Optimizations -- Performance Evaluation -- Methodology -- Batch Programs -- Network Applications -- Conclusion -- Studying Compiler Optimizations on Superscalar Processors Through Interval Analysis
Introduction -- Decomposing Execution Time into Cycle Components -- Interval Analysis -- Evaluating Cycle Count Components -- Experimental Setup -- The Impact of Compiler Optimizations -- Out-of-Order Processor Performance -- Compiler Optimization Analysis Case Studies -- Comparison with In-Order Processors -- Related Work -- Conclusion and Impact on Future Work -- An Experimental Environment Validating the Suitability of CLI as an Effective Deployment Format for Embedded Systems -- Introduction -- Implementation -- GCC Structure -- CLI Code Generator -- CLI to Native Translation -- Tools -- Experiments and Results -- Setup -- Experiments -- Analysis -- Related Work -- Not CLI-Based -- CLI-Based -- Conclusion -- Part III Industrial Processors and Application Parallelization -- Compilation Strategies for Reducing Code Size on a VLIW Processor with Variable Length Instructions -- Introduction -- The TMS320C6000 VLIW DSP Core -- C64+ Compact Instructions -- Compact 16-bit Instructions -- The Fetch Packet Header -- Branch and Call Instructions -- The Compressor -- Compiler Strategies -- Instruction Selection -- Register Allocation -- Instruction Scheduling -- Calling Convention Customization -- Results -- Conclusions -- Experiences with Parallelizing a Bio-informatics Program on the Cell BE -- Introduction -- The Cell BE Architecture -- Clustal W -- Analysis of Clustal W -- Optimization of Clustal W -- Optimizing for the SPU -- Modifications to Data Structures -- Parallelization of Pairwise Alignment -- Parallelization of Progressive Alignment -- Evaluation -- Pairwise Alignment -- Progressive Alignment -- Scaling with Multiple SPUs -- Discussion -- Related Work -- Conclusion -- Drug Design Issues on the Cell BE -- Introduction -- Related Work -- Experimental Setup -- FTDock Algorithm -- Algorithm Description -- Cell BE Implementation of FTDock
3D FFT/iFFT -- Complex Multiplication Function for F_{C}=(F_{A}^{*})(F_B) -- Scoring Filter Function -- Discretize Function -- Performance Evaluation -- Memory Bandwidth Issues and 3D FFT -- Function Offloading and SIMDization -- Parallelization Using the Two Cell BE of the Blade -- Parallelization Using Dual-Thread PPU Feature -- Comparison with a POWER5 Multicore Platform -- Conclusions -- Part IV Power-Aware Techniques -- COFFEE: COmpiler Framework for Energy-Aware Exploration -- Introduction and Motivation -- Related Work -- Compiler and Simulator Flow -- Memory Architecture Subsystem -- Processor Core Subsystem -- Loop Transformations -- Energy Estimation Flow -- Experimental Setup and Results -- Benchmark Driver: WCDMA -- Processor Architectures -- Loop Transformations -- Results and Analysis -- Use Cases -- Conclusion and Future Work -- Integrated CPU Cache Power Management in Multiple Clock Domain Processors -- Introduction -- Application and MCD Chip Models -- DVS in Multiple Clock Domains -- Domain Interaction-Aware DVS -- MCD Inter-domain Interactions -- Integrated Core and L2 Cache DVS Policy -- Evaluation -- Related Work -- Conclusion -- Variation-Aware Software Techniques for Cache Leakage Reduction Using Value-Dependence of SRAM Leakage Due to Within-Die Process Variation -- Introduction -- Related Works -- Motivation and Our Approach -- Our Approach -- Problem Formulation -- Experimental Results -- Conclusion -- References -- Part V High-Performance Processors -- The Significance of Affectors and Affectees Correlations for Branch Prediction -- Introduction -- Affectors and Affectees -- Definitions and Intuition -- Memory Instructions -- How to Use Affectors and Affectees for Prediction -- Experimental Framework -- Results -- Characterization of Affectors and Affectees -- GTL Results -- L-TAGE Results -- Related Work
Conclusions and Future Work -- Turbo-ROB: A Low Cost Checkpoint/Restore Accelerator -- Introduction -- Turbo-ROB Recovery -- Mechanism: Structure and Operation -- Recovery Example -- Selective Repair Point Initiation -- Eliminating the ROB -- TROB and In-RAT Global Checkpoints -- Related Work -- Experimental Results -- Methodology -- Performance Metric -- TROB as a ROB Replacement -- Selective Repair Point Initiation -- Turbo-ROB with GCs -- Conclusions -- LPA: A First Approach to the Loop Processor Architecture -- Introduction -- The LPA Architecture -- The Renaming Mechanism -- Loop Detection and Storage -- Fetching from the Loop Window -- Experimental Methodology -- LPA Evaluation -- Front-End Activity -- Performance Evaluation -- Related Work -- Conclusions -- Part VI Profiles: Collection and Analysis -- Complementing Missing and Inaccurate Profiling Using a Minimum Cost Circulation Algorithm -- Introduction -- Related Work for Profiling Techniques -- Formulating the Problem -- Polynomial Algorithm for Finding the Optimal Fixup Vector -- Constructing the Fixup Graph -- Complexity of the Algorithm -- Estimating Vertex and Edge Frequencies -- Experimental Results -- Filling Edge Profile from Vertex Profile -- Approximating Dynamic Control Flow for External Sampling -- Future Directions -- Summary -- References -- Using Dynamic Binary Instrumentation to Generate Multi-platform SimPoints: Methodology and Accuracy -- Introduction -- Generating Simulation Points -- BBV Generation -- Pin -- Qemu -- Valgrind -- Evaluation -- The Rep Prefix -- The Art Benchmark -- Results -- Related Work -- Conclusions and Future Work -- Phase Complexity Surfaces: Characterizing Time-Varying Program Behavior -- Introduction -- Related Work -- Phase Complexity Surfaces -- Basic Block Vector (BBV) -- Phase Classification -- Phase Count Surfaces
Phase Predictability Surfaces