Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments

We present a monitoring system for large-scale parallel and distributed computing environments that allows accuracy to be traded off in a tunable fashion to gain scalability without compromising fidelity. The approach relies on classifying each gathered monitoring metric based on individual needs and on...

Bibliographic Details
Published in 2010 IEEE 12th International Conference on High Performance Computing and Communications, pp. 72-78
Main Authors Bohm, S; Engelmann, C; Scott, S L
Format Conference Proceeding
Language English
Published IEEE 01.09.2010
ISBN 9781424483358
1424483352
DOI 10.1109/HPCC.2010.32

Abstract We present a monitoring system for large-scale parallel and distributed computing environments that allows accuracy to be traded off in a tunable fashion to gain scalability without compromising fidelity. The approach relies on classifying each gathered monitoring metric according to individual needs and on aggregating messages that contain classes of individual monitoring metrics using a tree-based overlay network. The MRNet-based prototype significantly reduces the amount of gathered and stored monitoring data, e.g., by a factor of ~56 in comparison to the Ganglia distributed monitoring system. A simple scaling study reveals, however, that further effort is needed to reduce the amount of data enough to monitor future-generation extreme-scale systems with up to 1,000,000 nodes. The implemented solution did not have a measurable performance impact, as the 32-node test system did not produce enough monitoring data to interfere with running applications.
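
The abstract's core idea, classifying each metric and then aggregating the classes through a tree-based overlay, can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the MRNet API; the class boundaries, the `TreeNode` type, and the function names are all hypothetical and chosen only for illustration. Each leaf maps its raw reading to a coarse class, and every inner node forwards only per-class counts, so the message size at the root depends on the number of classes rather than the number of nodes.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical class boundaries for a CPU-load metric in percent
# (an assumption for this sketch, not taken from the paper).
CPU_CLASSES = [(0.0, 50.0, "low"), (50.0, 80.0, "medium"), (80.0, 101.0, "high")]


def classify(value: float) -> str:
    """Map a raw reading to a coarse class label (lower bound inclusive)."""
    for lower, upper, label in CPU_CLASSES:
        if lower <= value < upper:
            return label
    raise ValueError(f"reading out of range: {value}")


@dataclass
class TreeNode:
    """One node of the overlay tree; only leaves carry a raw reading."""
    name: str
    children: List["TreeNode"] = field(default_factory=list)
    reading: Optional[float] = None


def aggregate(node: TreeNode) -> Counter:
    """Return per-class counts for the subtree rooted at `node`.

    Inner nodes merge the counters of their children, so raw
    readings never travel further up than the leaf that took them.
    """
    counts: Counter = Counter()
    if node.reading is not None:
        counts[classify(node.reading)] += 1
    for child in node.children:
        counts.update(aggregate(child))
    return counts


if __name__ == "__main__":
    leaves = [TreeNode(f"n{i}", reading=r) for i, r in enumerate([10.0, 60.0, 95.0, 20.0])]
    root = TreeNode("root", children=[
        TreeNode("a", children=leaves[:2]),
        TreeNode("b", children=leaves[2:]),
    ])
    print(dict(aggregate(root)))
```

Coarser classes shrink the aggregated message further at the cost of accuracy, which is the kind of tunable trade-off the abstract describes.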
Author Affiliations
Bohm, S (swen.boehm@bnc.info): Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., Oak Ridge, TN, USA
Engelmann, C (engelmannc@ornl.gov): Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., Oak Ridge, TN, USA
Scott, S L (scottsl@ornl.gov): Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., Oak Ridge, TN, USA
EISBN 9780769542140
076954214X
Subjects Gallium nitride
Measurement
Monitoring
Peer to peer computing
Real time systems
Scalability
URI https://ieeexplore.ieee.org/document/5581330