My learning notes and code for ML systems. English versions are under development and currently available for only some of the notes.
-
Intro to HybridFlow/veRL
[English TODO] | [中文版]:SGLang's hybrid RLHF engine design and implementation. -
Extending OpenRLHF's Inference Engine
[English TODO] | [中文版]:Notes on integrating SGLang with OpenRLHF, an exhausting process with frequent NCCL hang bugs. -
SWE-Bench: How to Construct a Great Benchmark for the LLM Era
[中文版] -
Intro to Workflow in OpenRLHF-like Post-Training Systems
[中文版] -
The Illustrated PPO: Theory and Source Code Explanation
[中文版]
Also see The Computational Flow of RLHF (RLHF 的计算流). -
Latency Optimization for Weight Updates
[English TODO] | [中文版]:Notes from debugging weight-loading efficiency. -
Intro to Alignment Algorithms and NeMo-Aligner Framework
[中文版]
-
Concepts and Optimization of Constrained Decoding
[English TODO] | [中文版] -
SGLang Code Walkthrough
[English version]:The lifecycle of a request in the SGLang Engine, a good start for SGLang beginners. -
Walk Through SGLang / vLLM Worker
[English version]:Demystifying the SGLang worker (model executor). -
Reward / Embed Model Server Engine
[English TODO] | [中文版] -
SGLang Backend Analysis
[English TODO] | [中文版] -
Using vLLM to Serve New Embedding Models
[English TODO] | [中文版] -
Using SGL to Serve Embedding Models
[English TODO] | [中文版] -
From vLLM to SGLang: A User's Perspective
[English TODO] | [中文版]
-
Mooncake: Maximizing PD Disaggregation
[中文版]:Taking prefill and decode separation to the extreme. -
Should Prefill and Decode Be Separated onto Different Cards?
[中文版]:A discussion on separating prefill and decode tasks. -
Understanding Prefill and Decode Computational Characteristics Based on Chunked Prefill
[中文版]:Analyzing computational characteristics using chunked prefill. -
ModelServer: A Frontend Distribution System Based on SGLang
[中文版]:A frontend distribution system built on SGLang.
-
NCCL and NVIDIA TOPO
[English TODO] | [中文版]:An introduction to NCCL and NVIDIA topology. -
PyTorch Distributed
[English TODO] | [中文版]:Practical communication in torch.distributed. -
Give Me BF16 or Give Me Death: A Comprehensive Evaluation of Current Quantization Methods
[中文版]:A detailed evaluation of current quantization methods. -
AWQ: Model Quantization Should Focus on Activation Values
[中文版]:Why activation values should be the focus of model quantization. -
Deep Dive into PyTorch DDP Series Part 1: Beginner's Tutorial
[中文版]:A beginner's guide to PyTorch Distributed Data Parallel (DDP). -
Detailed Explanation of nvidia-smi Command and Some Advanced Techniques
[中文版]:Advanced techniques for using nvidia-smi.
-
Setting Up a Clean Development Environment
[English TODO] | [中文版]:How to set up a clean and efficient development environment. -
Understanding Special Tokens and Chat Templates
[English TODO] | [中文版]:A guide to understanding special tokens and chat templates. -
Compiling Jupyter Notebooks on CI and Deploying as Documentation
[中文版]:A guide on compiling Jupyter notebooks in CI and deploying them as documentation.