MapReduce across Distributed Clusters for Data-intensive Applications

Bibliographic Details
Published in	2012 26th IEEE International Parallel and Distributed Processing Symposium Workshops, pp. 2004-2011
Main Authors	Lizhe Wang, Jie Tao, Marten, H., Streit, A., Khan, S. U., Kolodziej, J., Chen, D.
Format	Conference Proceeding
Language	English
Published	IEEE 01.05.2012
Subjects	Computational modeling; Computer architecture; Data Intensive Computing; Data processing; Distributed databases; Hadoop; MapReduce; Servers; Software; Torque
ISBN	1467309745 9781467309745
DOI	10.1109/IPDPSW.2012.249

Summary:	Recently, the computational requirements for large-scale, data-intensive analysis of scientific data have grown significantly. In High Energy Physics (HEP), for example, the Large Hadron Collider (LHC) produced 13 petabytes of data in 2010. This huge amount of data is processed at more than 140 computing centers distributed across 34 countries. The MapReduce paradigm has emerged as a highly successful programming model for large-scale data-intensive computing applications. However, current MapReduce implementations are developed to operate in single-cluster environments and cannot be leveraged for large-scale distributed data processing across multiple clusters. Workflow systems, on the other hand, are used for distributed data processing across data centers, but the workflow paradigm has been reported to have limitations in reliability and efficiency for this purpose. In this paper, we present the design and implementation of G-Hadoop, a MapReduce framework that aims to enable large-scale distributed computing across multiple clusters. G-Hadoop uses the Gfarm file system as its underlying file system and executes MapReduce tasks across distributed clusters. Experiments with the G-Hadoop framework on distributed clusters show encouraging results.