Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments
We present a monitoring system for large-scale parallel and distributed computing environments that allows accuracy to be traded off in a tunable fashion to gain scalability without compromising fidelity. The approach relies on classifying each gathered monitoring metric based on individual needs and on...
| Published in | 2010 IEEE 12th International Conference on High Performance Computing and Communications, pp. 72-78 |
|---|---|
| Main Authors | Bohm, S; Engelmann, C; Scott, S L |
| Format | Conference Proceeding | 
| Language | English | 
| Published | IEEE, 01.09.2010 |
| Subjects | |
| ISBN | 9781424483358 1424483352  | 
| DOI | 10.1109/HPCC.2010.32 | 
| Abstract | We present a monitoring system for large-scale parallel and distributed computing environments that allows accuracy to be traded off in a tunable fashion to gain scalability without compromising fidelity. The approach relies on classifying each gathered monitoring metric based on individual needs and on aggregating messages containing classes of individual monitoring metrics using a tree-based overlay network. The MRNet-based prototype significantly reduces the amount of gathered and stored monitoring data, e.g., by a factor of ~56 in comparison to the Ganglia distributed monitoring system. A simple scaling study reveals, however, that further efforts are needed to reduce the amount of data in order to monitor future-generation extreme-scale systems with up to 1,000,000 nodes. The implemented solution did not have a measurable performance impact, as the 32-node test system did not produce enough monitoring data to interfere with running applications. |
    
|---|---|
| Author | Scott, S L; Engelmann, C; Bohm, S |
    
| Author_xml | – sequence: 1 givenname: S surname: Bohm fullname: Bohm, S email: swen.boehm@bnc.info organization: Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., Oak Ridge, TN, USA – sequence: 2 givenname: C surname: Engelmann fullname: Engelmann, C email: engelmannc@ornl.gov organization: Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., Oak Ridge, TN, USA – sequence: 3 givenname: S L surname: Scott fullname: Scott, S L email: scottsl@ornl.gov organization: Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., Oak Ridge, TN, USA  | 
    
| ContentType | Conference Proceeding | 
    
| DBID | 6IE 6IL CBEJK RIE RIL  | 
    
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present  | 
    
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Xplore Digital Library (LUT) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher  | 
    
| DeliveryMethod | fulltext_linktorsrc | 
    
| EISBN | 9780769542140 076954214X  | 
    
| EndPage | 78 | 
    
| ExternalDocumentID | 5581330 | 
    
| Genre | orig-research | 
    
| GroupedDBID | 6IE 6IF 6IH 6IK 6IL 6IN AAJGR AAWTH ADFMO ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK IEGSK IERZE OCL RIE RIL  | 
    
| ID | FETCH-LOGICAL-i90t-84c4d790abb2dfcceecda04cda27721ada2f639e10f1cbb3c3eae229f1d71a0c3 | 
    
| IEDL.DBID | RIE | 
    
| IngestDate | Wed Aug 27 03:03:44 EDT 2025 | 
    
| IsPeerReviewed | false | 
    
| IsScholarly | false | 
    
| Language | English | 
    
| LinkModel | DirectLink | 
    
| MergedId | FETCHMERGED-LOGICAL-i90t-84c4d790abb2dfcceecda04cda27721ada2f639e10f1cbb3c3eae229f1d71a0c3 | 
    
| PageCount | 7 | 
    
| ParticipantIDs | ieee_primary_5581330 | 
    
| PublicationCentury | 2000 | 
    
| PublicationDate | 2010-Sept. | 
    
| PublicationDateYYYYMMDD | 2010-09-01 | 
    
| PublicationDate_xml | – month: 09 year: 2010 text: 2010-Sept.  | 
    
| PublicationDecade | 2010 | 
    
| PublicationTitle | 2010 IEEE 12th International Conference on High Performance Computing and Communications | 
    
| PublicationTitleAbbrev | HPCC | 
    
| PublicationYear | 2010 | 
    
| Publisher | IEEE | 
    
| Publisher_xml | – name: IEEE | 
    
| SSID | ssib030063413 ssj0000452073  | 
    
| Score | 1.5055449 | 
    
| SourceID | ieee | 
    
| SourceType | Publisher | 
    
| StartPage | 72 | 
    
| SubjectTerms | Gallium nitride; Measurement; Monitoring; Peer to peer computing; Real time systems; Scalability |
    
| Title | Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments | 
    
| URI | https://ieeexplore.ieee.org/document/5581330 | 
    
| hasFullText | 1 | 
    
| inHoldings | 1 | 
    
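
The abstract describes two steps: classifying each monitoring metric, and aggregating messages that carry those classes through a tree-based overlay network. The following is a minimal sketch of that general idea, not the authors' implementation and not the MRNet API. The metric names, threshold values, class labels, and the `Node` type are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical (low, high) thresholds per metric; assumptions for illustration only.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "cpu_temp_c": (40.0, 70.0),
    "load_1min": (1.0, 8.0),
}


def classify(metric: str, value: float) -> str:
    """Map a raw reading to a coarse class label (trading accuracy for volume)."""
    low, high = THRESHOLDS[metric]
    if value < low:
        return "low"
    if value < high:
        return "normal"
    return "high"


@dataclass
class Node:
    """A node in a hypothetical aggregation tree; leaves hold raw readings."""
    name: str
    readings: Dict[str, float] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

    def aggregate(self) -> Dict[str, Counter]:
        """Return per-metric class counts for this node's whole subtree."""
        summary: Dict[str, Counter] = {}
        # Classify this node's own readings (non-empty only for leaves here).
        for metric, value in self.readings.items():
            summary.setdefault(metric, Counter())[classify(metric, value)] += 1
        # Merge the already-reduced summaries sent up by each child.
        for child in self.children:
            for metric, counts in child.aggregate().items():
                summary.setdefault(metric, Counter()).update(counts)
        return summary


# Four leaves under two intermediate aggregators under one root.
samples = [(35.0, 0.5), (55.0, 2.0), (72.0, 9.5), (60.0, 0.2)]
leaves = [
    Node(f"n{i}", readings={"cpu_temp_c": temp, "load_1min": load})
    for i, (temp, load) in enumerate(samples)
]
root = Node(
    "root",
    children=[
        Node("agg0", children=leaves[:2]),
        Node("agg1", children=leaves[2:]),
    ],
)
summary = root.aggregate()
print(summary)
```

In this sketch the root receives one small table of class counts per metric rather than one raw value per node, which is the kind of reduction the abstract refers to. How coarse the classes are is the tunable part: fewer classes mean less data but less detail.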