Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture

Bibliographic Details
Published in	Journal of Computer Science and Technology, Vol. 30, no. 1, pp. 145-162
Main Author	郑方 (Fang Zheng); 李宏亮 (Hong-Liang Li); 吕晖 (Hui Lv); 过锋 (Feng Guo); 许晓红 (Xiao-Hong Xu); 谢向辉 (Xiang-Hui Xie)
Format	Journal Article
Language	English
Published	Boston: Springer US, 2015; Springer Nature B.V.
Subjects	Architecture (computers); Artificial Intelligence; Bandwidths; Communication; Computation; Computer architecture; Computer Science; Computer simulation; Data Structures and Information Theory; Data transmission; Fast Fourier transformations; Finite difference method; Fourier transforms; High performance computing; Information Systems Applications (incl. Internet); Instruction sets (computers); Microprocessors; Optimization; Optimization techniques; Processors; R&D; Regular Paper; Research & development; Semiconductors; Software Engineering; Studies; Synchronism; Theory of Computation; Walls; shared memory; cooperative computing; multi-core processor; heterogeneous processor; fast Fourier transform; architecture; computing techniques; high performance computing systems; heterogeneous many-core processor; register-level communication mechanism; data stream transfer; hardware synchronization technique; processor prototype
ISSN	1000-9000 1860-4749
DOI	10.1007/s11390-015-1510-9

Summary:	Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot run efficiently because of the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores suited to different application features, with a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor alleviates the memory wall problem by combining a series of cooperative computing techniques for CPEs, such as multi-pattern data stream transfer, an efficient register-level communication mechanism, and a fast hardware synchronization technique. These techniques improve on-chip data reuse and optimize memory access performance. This paper describes an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS, and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.
Bibliography:	11-2296/TP