Performance Evaluation of a 3D-Stencil Library for Distributed Memory Array Accelerators

Bibliographic Details
Published in: 2014 Second International Symposium on Computing and Networking, pp. 388-393
Main Authors: Inagaki, Yoshikazu; Takamaeda-Yamazaki, Shinya; Yao, Jun; Nakashima, Yasuhiko
Format: Conference Proceeding
Language: English
Published: IEEE, 01.12.2014
ISSN: 2379-1888
DOI: 10.1109/CANDAR.2014.100

Summary: EMAX (Energy-aware Multimode Accelerator Extension) is equipped with distributed single-port local memories and ring-formed interconnections. The accelerator is designed to achieve extremely high throughput for scientific computations, big data and image processing while also achieving low power consumption. However, before mapping algorithms onto the accelerator, application developers need sufficient knowledge of the hardware organization and its specially designed instructions. Furthermore, when no well-designed compiler or library is available, they must make significant efforts to tune the code to improve execution efficiency. To address this problem, we focus on library support for stencil (nearest-neighbor) computations, a class of algorithms widely used in partial differential equation (PDE) solvers. In this research, we take up the following topics: (1) system configuration, features and mnemonics of EMAX, (2) instruction mapping techniques that reduce the amount of data to be read from main memory, and (3) performance evaluation of the library for PDE solvers. Because the library can reuse local data across outer-loop iterations and can map many instructions by unrolling outer loops, the amount of data read from main memory is reduced to as little as 1/7 of that of hand-tuned code. In addition, the stencil library was found to reduce execution time by 23% compared with a general-purpose processor.
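As a rough illustration of the data-reuse idea mentioned in the summary, the following Python sketch compares two ways of evaluating a 7-point 3D stencil: one that fetches every operand from "main memory" for each output point, and one that keeps three neighboring planes in a local buffer and loads only one new plane per outer-loop iteration. This is not EMAX code and not the library's API. The function names, the plane-buffer scheme, the averaging kernel and the read counters are all assumptions made for illustration, and the counts are not meant to reproduce the paper's measurements.

```python
def make_grid(nz, ny, nx):
    """Build a small 3D grid (nested lists) with distinct values (hypothetical test data)."""
    return [[[float(z * ny * nx + y * nx + x) for x in range(nx)]
             for y in range(ny)] for z in range(nz)]


def stencil_naive(grid):
    """7-point average; every operand is counted as a main-memory read."""
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    reads = 0
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                out[z][y][x] = (grid[z][y][x]
                                + grid[z - 1][y][x] + grid[z + 1][y][x]
                                + grid[z][y - 1][x] + grid[z][y + 1][x]
                                + grid[z][y][x - 1] + grid[z][y][x + 1]) / 7.0
                reads += 7  # seven operands fetched per output point
    return out, reads


def stencil_plane_reuse(grid):
    """Same kernel, but three planes stay in a local buffer across outer-loop
    iterations, so each plane is read from main memory only once.
    Assumes the grid has at least three planes."""
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    reads = 0

    def load_plane(z):
        # Copy one plane into the "local memory" and count its elements as reads.
        nonlocal reads
        reads += ny * nx
        return [row[:] for row in grid[z]]

    below = load_plane(0)
    centre = load_plane(1)
    for z in range(1, nz - 1):
        above = load_plane(z + 1)  # the only new main-memory traffic this iteration
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                out[z][y][x] = (centre[y][x]
                                + below[y][x] + above[y][x]
                                + centre[y - 1][x] + centre[y + 1][x]
                                + centre[y][x - 1] + centre[y][x + 1]) / 7.0
        below, centre = centre, above  # slide the window; no reload needed
    return out, reads


if __name__ == "__main__":
    g = make_grid(6, 6, 6)
    naive_out, naive_reads = stencil_naive(g)
    reuse_out, reuse_reads = stencil_plane_reuse(g)
    print("results match:", naive_out == reuse_out)
    print("main-memory reads (naive, reuse):", naive_reads, reuse_reads)
```

In this toy model the naive version reads seven values per interior point, while the buffered version reads each grid element once, so the ratio of reads moves toward 1/7 as the grid grows. Whether this matches the mechanism behind the figure reported in the summary is not stated in this record.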