Fault Monitoring with Sequential Matrix Factorization
For real-world distributed systems, the knowledge component at the core of the MAPE-K loop has to be inferred, as it cannot be realistically assumed to be defined a priori. Accordingly, this paper considers fault monitoring as a latent factors discovery problem. In the context of end-to-end probing,...
    
| Published in | ACM Transactions on Autonomous and Adaptive Systems, Vol. 10, no. 3, pp. 1-25 |
|---|---|
| Main Authors | Feng, Dawei; Germain, Cecile |
| Format | Journal Article | 
| Language | English | 
| Published | Association for Computing Machinery (ACM), 01.10.2015 |
| ISSN | 1556-4665, 1556-4703 |
| DOI | 10.1145/2797141 | 
| Abstract | For real-world distributed systems, the knowledge component at the core of the MAPE-K loop has to be inferred, as it cannot realistically be assumed to be defined a priori. Accordingly, this paper treats fault monitoring as a latent-factor discovery problem. In the context of end-to-end probing, the goal is to devise an efficient sampling policy that makes the best use of a constrained sampling budget. Previous work addresses fault monitoring in a collaborative prediction framework, where the information is a snapshot of the probe outcomes. Here, we take into account the fact that the system evolves dynamically at various time scales. We propose and evaluate Sequential Matrix Factorization (SMF), which exploits both recent advances in matrix factorization for the instantaneous information and a new sampling heuristic based on historical information. The effectiveness of the SMF approach is demonstrated on datasets of increasing difficulty and compared with state-of-the-art history-based and snapshot-based methods. In all cases, strong adaptivity under the specific flavor of active learning is required to unleash the full potential of coupling the most confident and the most uncertain sampling heuristics, which is the cornerstone of SMF. |
    
| Author | Feng, Dawei; Germain, Cecile |

| Author affiliations | 1. Feng, Dawei (National University of Defense Technology, Université Paris Sud, INRIA and CNRS, Changsha, China); 2. Germain, Cecile (Université Paris Sud, INRIA and CNRS, Orsay Cedex) |
    
| BackLink | https://inria.hal.science/hal-01176013 (View record in HAL) |
    
    
| CitedBy_id | crossref_primary_10_1145_3469440 | 
    
    
| ContentType | Journal Article | 
    
| Copyright | Distributed under a Creative Commons Attribution 4.0 International License | 
    
    
| DBID | AAYXX CITATION 7SC 8FD JQ2 L7M L~C L~D 1XC VOOES ADTOC UNPAY  | 
    
| DOI | 10.1145/2797141 | 
    
| DatabaseName | CrossRef Computer and Information Systems Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts  Academic Computer and Information Systems Abstracts Professional Hyper Article en Ligne (HAL) Hyper Article en Ligne (HAL) (Open Access) Unpaywall for CDI: Periodical Content Unpaywall  | 
    
| DatabaseTitle | CrossRef Computer and Information Systems Abstracts Technology Research Database Computer and Information Systems Abstracts – Academic Advanced Technologies Database with Aerospace ProQuest Computer Science Collection Computer and Information Systems Abstracts Professional  | 
    
| DatabaseTitleList | Computer and Information Systems Abstracts CrossRef  | 
    
    
| DeliveryMethod | fulltext_linktorsrc | 
    
| Discipline | Computer Science | 
    
| EISSN | 1556-4703 | 
    
| EndPage | 25 | 
    
| ExternalDocumentID | oai:HAL:hal-01176013v1 10_1145_2797141  | 
    
    
    
| IEDL.DBID | UNPAY | 
    
| ISSN | 1556-4665 1556-4703  | 
    
    
| IsDoiOpenAccess | true | 
    
| IsOpenAccess | true | 
    
| IsPeerReviewed | true | 
    
| IsScholarly | true | 
    
| Issue | 3 | 
    
| Keywords | Machine Learning; Fault Inference; Matrix Factorization; Grids and Clouds; Active Learning. Categories and Subject Descriptors: [Computer systems organization]: Dependable and fault-tolerant systems and networks (Reliability; Availability) |
    
| Language | English | 
    
| License | Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0 other-oa  | 
    
| LinkModel | DirectLink | 
    
    
| Notes | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23  | 
    
| OpenAccessLink | https://proxy.k.utb.cz/login?url=https://inria.hal.science/hal-01176013 | 
    
| PQID | 1770364068 | 
    
| PQPubID | 23500 | 
    
| PageCount | 25 | 
    
    
| ProviderPackageCode | CITATION AAYXX  | 
    
| PublicationCentury | 2000 | 
    
| PublicationDate | 2015-10-01 | 
    
| PublicationDateYYYYMMDD | 2015-10-01 | 
    
| PublicationDate_xml | – month: 10 year: 2015 text: 2015-10-01 day: 01  | 
    
| PublicationDecade | 2010 | 
    
| PublicationTitle | ACM transactions on autonomous and adaptive systems | 
    
| PublicationYear | 2015 | 
    
| Publisher | Association for Computing Machinery (ACM) | 
    
| Publisher_xml | – name: Association for Computing Machinery (ACM) | 
    
    
    
    
| SourceID | unpaywall hal proquest crossref  | 
    
| SourceType | Open Access Repository Aggregation Database Enrichment Source Index Database  | 
    
| StartPage | 1 | 
    
| SubjectTerms | Autonomous; Computer Science; Distributed, Parallel, and Cluster Computing; Dynamical systems; Factorization; Faults; Heuristic; Machine Learning; Monitoring; Policies; Sampling |
    
| Title | Fault Monitoring with Sequential Matrix Factorization | 
    
| URI | https://www.proquest.com/docview/1770364068 https://inria.hal.science/hal-01176013  | 
    
| UnpaywallVersion | submittedVersion | 
    
| Volume | 10 | 
    
| hasFullText | 1 | 
    
| inHoldings | 1 | 
    
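
The abstract describes coupling "the most confident and the most uncertain sampling heuristics" under a constrained sampling budget, but this record does not give the actual SMF selection rule. The following is only a minimal sketch of that general idea, not the authors' method. It assumes a hypothetical model has already produced a real-valued score for each unobserved (prober, target) pair, and that the score's magnitude can be read as confidence. The names `select_probes`, `scores`, and `uncertain_share` are illustrative and do not come from the paper.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # hypothetical (prober, target) index pair


def select_probes(scores: Dict[Pair, float], budget: int,
                  uncertain_share: float = 0.5) -> List[Pair]:
    """Pick up to `budget` candidate pairs, mixing uncertain and confident ones.

    `scores` maps each candidate pair to a predicted real-valued score.
    Treating |score| as confidence is an assumption of this sketch.
    """
    if budget <= 0 or not scores:
        return []
    budget = min(budget, len(scores))
    # Rank candidates from least to most confident; ties are broken by pair.
    ranked = sorted(scores, key=lambda p: (abs(scores[p]), p))
    n_uncertain = int(budget * uncertain_share)
    n_confident = budget - n_uncertain
    uncertain = ranked[:n_uncertain]
    # Take confident picks from the other end, skipping pairs already chosen.
    confident = [p for p in reversed(ranked) if p not in uncertain][:n_confident]
    return uncertain + confident
```

In a sequential setting, a function of this kind would be called once per time step: the selected probes are run, their outcomes are added to the observed matrix, and the model is refit before the next selection. How the budget should be split between the two heuristics is a design choice that this sketch leaves as a plain parameter.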