For experiments and research on Applied AI.
Housing a variety of Triton and CUDA kernels for training and inference.
Inference kernels do not include backward-pass support.
1 - Triton - MoE (Mixtral) GEMM for accelerating inference. Uses a column-major access pattern to increase cache locality (a minimal sketch of the idea follows the benchmark figures below).
![moe_gemm_a100](https://private-user-images.githubusercontent.com/46302957/320303886-9eece843-b5e1-4250-a98a-3ae79dff1bc3.png)
![softmax_fused](https://private-user-images.githubusercontent.com/46302957/320303912-de11686b-4c17-4696-857a-4f56488d6df3.png)
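The column-major trick is independent of the MoE routing itself, so a plain GEMM is enough to illustrate it. Below is a minimal, hypothetical Triton sketch (not the repo's kernel): program IDs are remapped so consecutive programs walk down a *column* of output tiles, letting the current tile column of B stay hot in L2. Names and block sizes are illustrative assumptions, and for brevity the loads assume M and N are multiples of the block sizes.

```python
# Hypothetical sketch, not the applied-ai kernel: a GEMM whose program IDs
# are remapped column-major so consecutive programs reuse the same B tiles.
import torch
import triton
import triton.language as tl


@triton.jit
def gemm_col_major_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    num_pid_m = tl.cdiv(M, BLOCK_M)
    # Column-major remap: a row-major launch would instead compute
    #   pid_m = pid // num_pid_n; pid_n = pid % num_pid_n.
    # Walking down a column of C tiles means the B tiles for column
    # pid_n are re-read from L2, not DRAM, by the next program.
    pid_n = pid // num_pid_m
    pid_m = pid % num_pid_m

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        # Assumes M, N are multiples of BLOCK_M / BLOCK_N; only K is masked.
        a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_K, other=0.0)
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_K, other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)


def gemm_col_major(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    BLOCK_M, BLOCK_N, BLOCK_K = 64, 64, 32
    # 1D launch; the kernel itself decides the (pid_m, pid_n) ordering.
    grid = (triton.cdiv(M, BLOCK_M) * triton.cdiv(N, BLOCK_N),)
    gemm_col_major_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
    )
    return c
```

In the MoE setting the same remapping pays off because the many tokens routed to one expert all read that expert's weight tiles; grouped/swizzled program orderings (as in the Triton matmul tutorial) refine the same locality idea further.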
- CUDA Mode - Reading group for learning CUDA programming - (Discord, Lecture Materials, Lecture recordings)
- llama-recipes - Recipes for fine-tuning and inference for the Llama model series
- NeurIPS'23 LLM Efficiency Challenge - 1 LLM + 1 GPU + 1 Day competition - (website, code, NeurIPS Workshop recordings)
- PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation paper
- Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK Work Decomposition paper
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel paper
- Sustainable AI: Environmental Implications, Challenges and Opportunities paper
The applied-ai repo is released under the BSD 3-Clause license.