aiops-handbook

A collection of slides, repositories, and papers about AIOps, organized by scenario.

Chinese edition: <README.md>.

Anomaly Detection

Metric

single series

multiple series

Large Model Methods

  • GPT4TS large model based on GPT2, from Tsinghua University: https://github.com/DAMO-DI-ML/NeurIPS2023-One-Fits-All. A pioneering work in this field that serves as a benchmark for later progress; a minimal sketch of its frozen-backbone idea appears after this list.
  • MOMENT model and Time-Series Pile dataset from Carnegie Mellon University: https://arxiv.org/pdf/2402.03885.pdf. Analogous to the Pile dataset for large language models, it collects the five most commonly used metric datasets, covering single- and multi-dimensional metrics for tasks such as classification, long- and short-term forecasting, and anomaly detection. It pre-trains moment-small, moment-base, and moment-large metric models in a manner similar to T5. The paper mainly compares against baselines such as TimesNet and GPT4TS.
  • TimesFM, an open-source time-series forecasting foundation model from Google: https://github.com/google-research/timesfm
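
As a concrete illustration of the "frozen LM as time-series backbone" idea behind GPT4TS (One Fits All), the sketch below reuses a pre-trained GPT2, freezes its attention and FFN weights, and trains only a patch-embedding layer, the layer norms and positional embeddings, and a forecasting head. The patch length, horizon, and head design are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the frozen-backbone recipe behind GPT4TS (One Fits All).
# Patch length, horizon, and the forecasting head are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import GPT2Model

class FrozenGPT2Forecaster(nn.Module):
    def __init__(self, patch_len=16, n_patches=32, horizon=96, d_model=768):
        super().__init__()
        self.patch_len = patch_len
        self.backbone = GPT2Model.from_pretrained("gpt2")  # d_model must match GPT2 hidden size (768)
        # Freeze attention/FFN weights; keep layer norms and positional
        # embeddings trainable, roughly following the paper's recipe.
        for name, p in self.backbone.named_parameters():
            p.requires_grad = ("ln" in name) or ("wpe" in name)
        self.embed = nn.Linear(patch_len, d_model)            # patch -> pseudo-token embedding
        self.head = nn.Linear(n_patches * d_model, horizon)   # flattened hidden states -> forecast

    def forward(self, x):
        # x: (batch, n_patches * patch_len) raw metric values
        patches = x.unfold(-1, self.patch_len, self.patch_len)    # (batch, n_patches, patch_len)
        tokens = self.embed(patches)                               # (batch, n_patches, d_model)
        hidden = self.backbone(inputs_embeds=tokens).last_hidden_state
        return self.head(hidden.flatten(1))                        # (batch, horizon)

# Toy usage: forecast 96 future points of one metric from 512 past points.
model = FrozenGPT2Forecaster()
past = torch.randn(4, 512)
print(model(past).shape)  # torch.Size([4, 96])
```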

Log Data

Traditional Methods

Large Model Methods

  • LogQA paper from Beihang University, using the T5 large model and manually labeled [training data](https://github.com/LogQA-dataset/LogQA/tree/main/data) to enable natural language question answering over logs: https://arxiv.org/pdf/2303.11715.pdf
  • LogPPT open-source project from the University of Newcastle, Australia, using the RoBERTa large model and loghub dataset. The most interesting point is that although the loghub dataset contains 80G of logs, only 2k logs per class are labeled. This paper takes a reverse approach and uses the 2k labeled logs as prompts: https://github.com/LogIntelligence/LogPPT
  • DivLog paper from The Chinese University of Hong Kong, using the GPT3 large model and comprehensively outperforming LogPPT. It also explores the ICL method, where 5-shot may be optimal (see the prompt-construction sketch after this list): https://arxiv.org/pdf/2307.09950v3.pdf
    • The subsequent LILAC open-source project, through carefully designed sampling and caching, approaches Drain's template inference speed! Its comparison with LogPPT/DivLog also verifies that growing the base model from the 110M-parameter RoBERTa to the 13B Curie to the 176B ChatGPT brings only modest improvement; for template recognition tasks, the language understanding of mid-sized LMs may already be good enough: https://github.com/logpai/LILAC
  • BERTOps open-source project from IBM, using the BERT large model and some manually labeled data, attempting three classification tasks in the log domain: log format classification, golden signal classification, and fault classification (however, the repository is only a demonstration and cannot be run; the pretrain.txt file referenced in train.sh is missing, and only the cleaned Excel annotation file is provided): https://github.com/BertOps/bertops
  • Log anomaly detection model based on language models from IBM Research, comparing the effects of fasttext and BERT: https://www.researchgate.net/publication/344693315_Using_Language_Models_to_Pre-train_Features_for_Optimizing_Information_Technology_Operations_Management_Tasks
  • KTeleBERT open-source project from Zhejiang University/Huawei, integrating knowledge graphs and the BERT large model, and utilizing product manuals, device alert logs, and KPIs for fault analysis in the telecommunications domain: https://github.com/hackerchenzhuo/KTeleBERT
  • Biglog large model from Huawei/USTC, based on BERT and unsupervised pre-training on 450 million logs from 16 projects: https://github.com/BiglogOpenSource/PretrainedModel. Corresponding paper: https://ieeexplore.ieee.org/document/10188759/
  • LogPrompt paper from Huawei/Beijing University of Posts and Telecommunications, using ChatGPT and Vicuna-13B to verify the effects of zero-shot, CoT, and ICL prompt strategies for log template extraction and anomaly detection: https://arxiv.org/pdf/2308.07610.pdf. The baseline for comparison is the aforementioned LogPPT. The conclusion is that even in the zero-shot setting ChatGPT slightly outperforms LogPPT, while the open-source Vicuna-13B performs poorly zero-shot but improves greatly with ICL, approaching a usable level.
  • "Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models" paper from Microsoft, studying whether GPT models have an advantage over BERT models in fault diagnosis by analyzing 40,000 internal fault incidents at Microsoft. The rough conclusion is that there is an advantage, but it is still not very useful: https://arxiv.org/pdf/2301.03797.pdf
  • "Assess and Summarize: Improve Outage Understanding with Large Language Models" paper from Microsoft Asia Research/Nankai University, comparing GPT2 (local single-GPU fine-tuning), GPT3 (6.7b), and GPT3.5 (175b) in generating alert summaries. The difference between 3 and 2 is indeed very significant, but the improvement from 6.7b to 175b is not substantial: https://arxiv.org/pdf/2305.18084.pdf
  • Owl Operations Large Model Dataset from Beihang University/Yunzhihu, including question-answering and multiple-choice questions: https://github.com/HC-Guo/Owl. The corresponding paper also evaluates the differences in MoA fine-tuning, NBCE long context support, and log pattern recognition on the loghub dataset, although the advantages are very marginal.
  • OpsEval paper from Tsinghua University/Mustshowme, with a similar scenario to Owl, but only comparing the performance of open-source models and distinguishing between Chinese and English. Practice has shown that the quality of Chinese question answering is much poorer: https://arxiv.org/pdf/2310.07637.pdf.
  • CodeFuse-DevOpsEval evaluation dataset from Peking University/Ant Financial, covering 12 scenarios in DevOps and AIOps: https://github.com/codefuse-ai/codefuse-devops-eval/blob/main/README_zh.md. However, qwen scores abnormally high on the AIOps root cause analysis scenario, raising the suspicion that its pretraining may have used Alibaba-internal data.
  • UniLog paper from The Chinese University of Hong Kong/Microsoft, applying the ICL method of LLMs to log enhancement: https://www.computer.org/csdl/proceedings-article/icse/2024/021700a129/1RLIWpCelqg
  • KnowLog open-source project from Fudan University, crawling descriptions of log templates from the public documentation of Cisco, New H3C, and Huawei network devices, and building pre-trained models based on BERT and RoBERTa: https://github.com/LeaperOvO/KnowLog
  • Xpert paper from Microsoft, generating Microsoft Azure's proprietary Kusto Query Language from alert messages used as context: https://arxiv.org/pdf/2312.11988.pdf. The paper proposes an Xcore evaluation method that jointly evaluates text, symbol, and field name matching. However, the error examples given in the paper show no overlap between the alert context and the correct output, making it impossible to generate the correct query; this suggests that, at the current stage, purely relying on chat-style prompting to generate query languages is too challenging given the lack of context information.
  • RCACopilot paper from Microsoft: https://yinfangchen.github.io/assets/pdf/rcacopilot_paper.pdf. It first summarizes the alert information, then uses a pre-trained fasttext embedding model to perform a vector search over historical faults, and includes the summary plus the matched fault's classification and description in the final prompt, so the LLM can judge whether it is a recurrence of an old fault and, if so, how to handle it (see the retrieve-then-prompt sketch after this list). The paper provides a fair amount of evaluation data, but the approach depends heavily on the team and business environment being evaluated, making it hard to judge how well it transfers.
  • Another technical report from Microsoft on using the ReAct framework for RCA: https://arxiv.org/pdf/2403.04123.pdf. The rough conclusion is that without developing a specific Tool, relying on a generic document retrieval tool, ReAct performs worse than directly using RAG or CoT. Even with a specific Tool developed, the quality of the analysis plans written in the knowledge base is the most influential factor. Once multiple knowledge base documents are involved, ReAct tends to fail continuously from the second or third round onwards.
  • A technical report from Flip.AI, a company that developed its own DevOps large model. It adopts a 1 encoder -> N decoder MoE architecture, with incremental pre-training on 80B tokens; the fine-tuning training data is mainly from simulated data based on RAG, supplemented by 18 months of human double-blind filtering; the reinforcement learning stage is RLHAIF, building a fault injection environment for the model to generate RCA reports: https://assets-global.website-files.com/65379657a6e8b5a6ad9463ed/65a6ec298f8b53c8ddb87408_System%20of%20Intelligent%20Actors_FlipAI.pdf
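
For illustration, here is a minimal sketch of the in-context-learning template extraction described for DivLog/LILAC above: pick the k labeled logs most similar to the new log, lay them out as demonstrations, and ask the model for the template. The labeled examples, the token-overlap similarity, and the prompt wording are all stand-in assumptions; the papers use purpose-built embedding and sampling strategies, and LILAC additionally caches returned templates so repeated patterns never reach the model.

```python
# Hedged sketch of DivLog/LILAC-style in-context log template extraction.
# Demonstration selection here is plain token-overlap (Jaccard); the returned
# prompt would be sent to whichever chat-completion API you actually use.
from typing import List, Tuple

LABELED: List[Tuple[str, str]] = [  # (raw log, template) pairs, illustrative
    ("Connection from 10.0.0.5 closed", "Connection from <*> closed"),
    ("Failed password for root from 10.0.0.9 port 22", "Failed password for <*> from <*> port <*>"),
    ("Received block blk_3587 of size 67108864", "Received block <*> of size <*>"),
]

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(1, len(ta | tb))

def build_prompt(log: str, k: int = 5) -> str:
    # Select the k most similar labeled logs as few-shot demonstrations.
    demos = sorted(LABELED, key=lambda d: jaccard(log, d[0]), reverse=True)[:k]
    lines = ["Extract the log template, replacing variable fields with <*>.", ""]
    for raw, tpl in demos:
        lines += [f"Log: {raw}", f"Template: {tpl}", ""]
    lines += [f"Log: {log}", "Template:"]
    return "\n".join(lines)

print(build_prompt("Connection from 192.168.1.77 closed"))
```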
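
Similarly, a minimal sketch of the retrieve-then-prompt flow described for RCACopilot above: embed the new incident summary, retrieve the most similar historical incident, and pack both into a prompt asking whether this is a recurrence. TF-IDF stands in for the pre-trained fasttext embedding used in the paper, and the incident records and prompt wording are illustrative.

```python
# Hedged sketch of retrieval-augmented incident triage in the RCACopilot style.
# TF-IDF + cosine similarity is a stand-in for the paper's fasttext embeddings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

HISTORY = [  # illustrative historical incidents
    {"summary": "API gateway 5xx spike after config rollout",
     "category": "bad config push", "mitigation": "roll back gateway config"},
    {"summary": "Checkout latency high, database CPU saturated",
     "category": "DB overload", "mitigation": "fail over to read replica"},
]

def build_rca_prompt(new_summary: str) -> str:
    corpus = [h["summary"] for h in HISTORY] + [new_summary]
    vecs = TfidfVectorizer().fit_transform(corpus).toarray()
    sims = cosine_similarity(vecs[-1:], vecs[:-1])[0]   # similarity to each historical incident
    best = HISTORY[int(sims.argmax())]                   # most similar past incident
    return (
        "New incident summary:\n" + new_summary + "\n\n"
        "Most similar historical incident:\n"
        f"- summary: {best['summary']}\n"
        f"- category: {best['category']}\n"
        f"- mitigation: {best['mitigation']}\n\n"
        "Is the new incident a recurrence of the historical one? "
        "If so, recommend a mitigation; if not, say what extra data is needed."
    )

print(build_rca_prompt("Payment API latency spike, database CPU at 95%"))
```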

Label

Label tools for timeseries

Prediction

KPI

Capacity Planning

Network

Event Correlation

Root Cause Analysis

tracing

bottleneck localization

timeseries correlation

Solution Relevance Recommendation

Alert Grouping

Knowledge Graph

Behavior Anomaly Detection

Further Reading