diff --git a/docs/blog/gpu.mdx b/docs/blog/gpu.mdx
index 581e72b0..935089f7 100644
--- a/docs/blog/gpu.mdx
+++ b/docs/blog/gpu.mdx
@@ -1,6 +1,6 @@
---
title: 'Compiling fast GPU kernels'
-description: 'Bringing support for Apple and Nvidia GPUs to Luminal through compilers'
+description: 'Bringing support for Nvidia and Apple GPUs to Luminal through compilers'
'og:image': '/images/gpu_notext.png'
'twitter:image': '/images/gpu_notext.png'
---
@@ -8,14 +8,14 @@ description: 'Bringing support for Apple and Nvidia GPUs to Luminal through comp
Image Credit: https://exxactcorp.com/
 **Luminal compilers can now generate CUDA and Metal kernels on the fly, yielding specialized GPU compute for each model.**
 
-In our day-to-day lives most computing is done on general purpose CPUs. The combination of ubuquity and flexibility makes them an attractive option for most software. However, certian types of software like graphics are very compute-intensive. CPUs execute a single stream of instructions, and therefore have very little (or no) parallelism, leading to slow performance and high power usage.
+In our day-to-day lives, most computing is done on general-purpose CPUs. The combination of ubiquity and flexibility makes them an attractive option for most software. However, certain types of software, like graphics, are very compute-intensive, and since CPUs execute a single stream of instructions, they have very little (or no) parallelism, leading to slow performance and high power usage.
 
 As graphics improved in the 80s and 90s, especially with the onset of 3D graphics, specialized hardware was required to render complex scenes at reasonable speed. Companies like Nvidia began releasing specialized chips able to do massively parallel compute, which served graphics applications well since individual pixels tend not to depend on other pixels.
@@ -39,16 +39,62 @@ This kernel gets ran for each element of the input arrays, all in parallel.
 
 ## Compiler flow
 
 The typical approach in Luminal for supporting new backends would be:
-1) Swap out each primop with a backend-specific primop.
-2) Add in operations to copy to device and copy from device before and after Function ops.
-3) Pattern-match to swap out chunks of ops with specialized variants.
+1) Swap out each primitive operation with a backend-specific operation.
+2) Add operations that copy data to and from the device before and after Function operations.
+3) Pattern-match to swap out chunks of operations with specialized variants.
 4) All other optimizations.
 
-So let's go through how we do this for the Metal backend to support Apple GPUs.
+Since we looked at a CUDA kernel above, let's go through how we do this for the Metal backend to support Apple GPUs.
 
 ### Step 1: Metal Primops
 
 We want to generically support all possible models in Luminal, so our first step is to replicate all primitive operations with a Metal version. Since there are 11 primitive operations, we need 11 Metal ops. You can see these in `crates/luminal_metal/src/prim.rs`. The compiler simply loops through all ops in the graph and swaps them out with the Metal variant.
+
+These primitive operations are very simple. Here's the [MetalExp2](https://github.com/jafioti/luminal/blob/d3178b3443ee7fc887f8f0988a77736b73e618d0/crates/luminal_metal/src/prim.rs#L346) op, slightly simplified for clarity:
+```rust
+#[derive(Clone)]
+pub struct MetalExp2
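+{
+    // NOTE: fields sketched from context for illustration only. See the
+    // linked prim.rs for the exact definition; the types come from the
+    // `metal` crate.
+    pipeline: ComputePipelineState, // pre-compiled pipeline for the exp2 kernel
+    queue: CommandQueue,            // command queue the kernel is dispatched on
+    device: Device,                 // handle to the GPU
+}
+
+// Illustrative (not verbatim) shader source: one GPU thread per element,
+// mirroring the CUDA kernel shown earlier.
+const EXP2_SRC: &str = "
+kernel void metal_exp2(device float *inp [[buffer(0)]],
+                       device float *out [[buffer(1)]],
+                       device int& n_elements [[buffer(2)]],
+                       uint idx [[thread_position_in_grid]]) {
+    if (idx < n_elements) out[idx] = exp2(inp[idx]);
+}";
+```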
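+
+With a Metal variant of each primop defined, the swap pass from Step 1 is just a loop over the graph. Here's a minimal sketch of the idea (the type and function names here are illustrative stand-ins, not the exact Luminal API):
+
+```rust
+use std::any::Any;
+
+use petgraph::Graph;
+
+// Stand-ins for Luminal's operator trait and op types, just to keep the
+// sketch self-contained.
+trait Operator: Any {
+    fn as_any(&self) -> &dyn Any;
+}
+
+struct Exp2; // CPU primop
+impl Operator for Exp2 {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+}
+
+struct MetalExp2; // Metal replacement
+impl Operator for MetalExp2 {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+}
+
+// Walk every node in the op graph and swap each CPU primop for its Metal
+// variant.
+fn metalize(graph: &mut Graph<Box<dyn Operator>, ()>) {
+    for op in graph.node_weights_mut() {
+        if op.as_any().is::<Exp2>() {
+            *op = Box::new(MetalExp2);
+        }
+        // ...one arm per primitive op, 11 in total...
+    }
+}
+```
+
+The real pass also has to construct each Metal op with the device and queue handles it needs, but the shape is the same: identify which primop a node holds, then replace it in place.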