Software Analytics for Incident Management of Online Services: An Experience Report Jian-Guang Lou, Qingwei Lin, Rui Ding, Tao Xie Qiang Fu, Dongmei Zhang University of Illinois at Urbana-Champaign Microsoft Research Asia, Beijing, P. R. China Urbana, IL, USA {jlou, qlin, juding, qifu, dongmeiz}@microsoft.com
[email protected] Abstract—As online services become more and more popular, incident management needs to be efficient and effective in incident management has become a critical task that aims to order to ensure high availability and reliability of the services. minimize the service downtime and to ensure high quality of the A typical procedure of incident management in practice provided services. In practice, incident management is conducted (e.g., at Microsoft and other service-provider companies) goes through analyzing a huge amount of monitoring data collected at as follow. When the service monitoring system detects a ser- runtime of a service. Such data-driven incident management faces several significant challenges such as the large data scale, vice violation, the system automatically sends out an alert and complex problem space, and incomplete knowledge. To address makes a phone call to a set of On-Call Engineers (OCEs) to these challenges, we carried out two-year software-analytics trigger the investigation on the incident in order to restore the research where we designed a set of novel data-driven techniques service as soon as possible. Given an incident, OCEs need to and developed an industrial system called the Service Analysis understand what the problem is and how to resolve it. In ideal Studio (SAS) targeting real scenarios in a large-scale online cases, OCEs can identify the root cause of the incident and fix service of Microsoft.