Self-Adaptive Root Cause Diagnosis for Large-Scale Microservice Architecture

Bibliographic Details
Published in IEEE transactions on services computing Vol. 15; no. 3; pp. 1399-1410
Main Authors Ma, Meng; Lin, Weilan; Pan, Disheng; Wang, Ping
Format Journal Article
Language English
Published Piscataway IEEE 01.05.2022
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN 1939-1374, 2372-0204
DOI 10.1109/TSC.2020.2993251

More Information
Summary: The emergence of microservice architecture in Cloud systems poses new challenges for reliable operation and maintenance. Because of the large number of services and the diverse types of metrics, identifying the root cause of an anomaly in a large-scale microservice architecture is time-consuming and challenging. To solve this issue, this article presents a multi-metric, self-adaptive root cause diagnosis framework named MS-Rank. MS-Rank decomposes the task into four phases: impact graph construction, random walk diagnosis, result precision evaluation, and metric weight update. First, we introduce the concept of implicit metrics and propose a composite impact graph construction algorithm that uses multiple types of metrics to discover causal relationships between services. Next, we propose a diagnostic algorithm in which forward, selfward and backward transitions are designed to heuristically identify the root cause services. In addition, we establish a self-adaptive mechanism that dynamically updates the confidence of each metric according to its diagnostic precision. Lastly, we develop a prototype system and integrate MS-Rank into a real production system, IBM Cloud. Experimental results show that MS-Rank achieves high diagnostic precision and outperforms several selected benchmarks. Through multiple rounds of diagnosis, MS-Rank can optimize itself effectively. MS-Rank requires no predefined knowledge and can be rapidly deployed in various microservice-based systems and applications. It also allows expert experience to be introduced into the framework to improve diagnostic efficiency and precision.
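
The random walk phase named in the summary can be illustrated with a minimal Python sketch. It is not the MS-Rank algorithm itself: the summary does not give the transition formulas, so the weighting below is an assumption made for illustration. The sketch assumes an edge points from a service to a service it depends on, that each service already has a single hypothetical anomaly score in `score`, and that backward moves are damped by a hypothetical factor `rho`. The names `transition_weights` and `rank_root_causes` are also invented here.

```python
import random


def transition_weights(node, out_edges, in_edges, score, rho=0.2):
    """Return {next_node: weight} for one step of the walk from `node`.

    Assumed weighting (not taken from the paper):
      forward  -> a dependency, weighted by its anomaly score
      backward -> a caller, weighted by rho times its anomaly score
      selfward -> stay put, weighted by how much the node's own score
                  exceeds the best forward score
    """
    forward = {n: score[n] for n in out_edges.get(node, [])}
    backward = {n: rho * score[n] for n in in_edges.get(node, [])}
    best_forward = max(forward.values(), default=0.0)
    weights = dict(backward)
    weights.update(forward)  # a forward edge takes precedence over a backward one
    weights[node] = max(0.0, score[node] - best_forward)
    return weights


def rank_root_causes(start, out_edges, score, steps=10_000, rho=0.2, seed=0):
    """Walk the graph from `start` and rank services by visit count."""
    rng = random.Random(seed)
    # Derive the reverse adjacency needed for backward transitions.
    in_edges = {}
    for src, dsts in out_edges.items():
        for dst in dsts:
            in_edges.setdefault(dst, []).append(src)
    visits = {n: 0 for n in score}
    node = start
    for _ in range(steps):
        visits[node] += 1
        weights = transition_weights(node, out_edges, in_edges, score, rho)
        if sum(weights.values()) > 0:  # otherwise the walker stays where it is
            node = rng.choices(list(weights), weights=list(weights.values()))[0]
    return sorted(visits, key=visits.get, reverse=True)


# Hypothetical three-service chain: frontend calls cart, cart calls db.
out_edges = {"frontend": ["cart"], "cart": ["db"]}
score = {"frontend": 0.2, "cart": 0.5, "db": 0.9}
ranking = rank_root_causes("frontend", out_edges, score)
print(ranking)
```

In this toy graph the walker spends most of its steps on `db`, the service with the highest score and no further dependencies, so `db` is ranked first. A self-adaptive variant in the spirit of the summary would recompute `score` from several metrics and adjust each metric's weight after every diagnosis round; that step is not shown here.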