Weak execution ordering - exploiting iterative methods on many-core GPUs

Bibliographic Details
Published in	2010 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 154-163
Main Authors	Jianmin Chen, Zhuo Huang, Feiqi Su, Jih-Kwon Peir, Jeff Ho, Lu Peng
Format	Conference Proceeding
Language	English
Published	IEEE 01.03.2010
Subjects	Application software; Chaotic communication; Computer vision; Data communication; Graphics processing unit; Iterative methods; Large-scale systems; Multicore processing; Partial differential equations; Shape measurement
ISBN	1424460239 9781424460236
DOI	10.1109/ISPASS.2010.5452028

Summary:	NVIDIA's many-core GPUs provide no synchronization function among parallel thread blocks. When a large-scale parallel program executed by multiple thread blocks requires fine-grained data communication and synchronization, frequent host synchronizations are necessary, and they incur significant overhead. In this paper, we investigate a class of applications that use a chaotic version of iterative methods [5], [22] to obtain numerical solutions of partial differential equations (PDEs). Such a fast PDE solver is parallelized on GPUs with multiple thread blocks. In this parallel implementation, frequent data communication is needed between adjacent thread blocks, but a precise order of that communication is not. Separate communication threads periodically exchange boundary values with adjacent thread blocks through global memory. Because no precise communication order is required, the computation and communication threads can be overlapped to alleviate the communication overhead. Performance measurements of two popular applications, Poisson image editing from computer graphics and shape from shading from computer vision, on a Tesla C1060 show that a speedup of 4-5 times is achievable for both applications compared with the solution that uses host synchronization.
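The idea in the summary, that blocks may keep iterating on stale boundary values and still reach the right answer, can be illustrated with a small sequential simulation. This is a sketch under stated assumptions and not the paper's GPU implementation: the 1D Laplace problem, the function name `chaotic_relaxation`, and all parameters are hypothetical choices made for illustration, and the "blocks" are plain Python loops rather than CUDA thread blocks with communication threads.

```python
# Sequential simulation of block-wise relaxation with stale boundary values
# for the 1D Laplace equation u'' = 0, u(0) = 0, u(1) = 1 (exact: u(x) = x).
# Illustrative only: names and parameters are assumptions, not from the paper.

def chaotic_relaxation(n_points=33, n_blocks=4, inner_sweeps=5, outer_rounds=4000):
    u = [0.0] * n_points
    u[-1] = 1.0  # Dirichlet boundary values stay fixed at both ends

    # Split the interior grid points into contiguous blocks.
    interior = list(range(1, n_points - 1))
    size = (len(interior) + n_blocks - 1) // n_blocks
    blocks = [interior[i:i + size] for i in range(0, len(interior), size)]

    for _ in range(outer_rounds):
        # The snapshot stands in for global memory: each block sees its
        # neighbours' boundary values only as of the last exchange.
        snapshot = u[:]
        new_u = u[:]
        for block in blocks:
            lo, hi = block[0], block[-1]
            # Local copy of the block plus one halo point on each side.
            local = {i: snapshot[i] for i in range(lo - 1, hi + 2)}
            # Several sweeps run without refreshing the halo values.
            for _ in range(inner_sweeps):
                for i in block:
                    local[i] = 0.5 * (local[i - 1] + local[i + 1])
            for i in block:
                new_u[i] = local[i]
        u = new_u  # periodic exchange: publish every block's results
    return u


if __name__ == "__main__":
    u = chaotic_relaxation()
    max_err = max(abs(u[i] - i / (len(u) - 1)) for i in range(len(u)))
    print("max error vs. exact solution:", max_err)
```

Each block performs several sweeps between exchanges, so the values it reads from its neighbours are out of date for most of its updates. For this problem the iteration still converges to the exact solution, which is the property that lets communication proceed without a precise order.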