diff --git a/docs/blog/gpu.mdx b/docs/blog/gpu.mdx
index 935089f7..a3f35b19 100644
--- a/docs/blog/gpu.mdx
+++ b/docs/blog/gpu.mdx
@@ -116,23 +116,53 @@ This is actually taken much further, fusing unary operations, binary operations,
 Here's an example of how many ops fusion can merge together. On the left is the unfused graph, on the right is the functionally identical fused graph:
+The Luminal graph with and without kernel fusion
 
 ### Step 5: Buffer Sharing
 
 #### Storage Buffers
 Since we're reading buffers which may not be read again, it makes sense to re-use that memory directly. And since we know in advance how big all the buffers are, and when we'll be using them, we can decide at compile time which buffers should be used where.
-That's the concept behind Storage Buffer sharing. If for instance we have `a_buffer -> cos -> b_buffer -> exp -> c_buffer`, we can just re-use `a_buffer` as `c_buffer`, rather than allocate a third buffer. So the Metal StorageBufferCompiler does just this, computes the optimal buffer assignments to minimise memory usage, and adds a single op at the beginning of the graph to allocate all required buffers once. This op also keeps track of previously allocated buffers on earlier runs of the graph, and tries to re-use those buffers. Ideally each time a graph is ran, zero allocations need to happen!
+Shared Storage Buffers
+
+That's the concept behind Storage Buffer sharing. If for instance we have `a_buffer -> cos -> b_buffer -> exp -> c_buffer`, we can re-use `a_buffer` as `c_buffer`, rather than allocate a third buffer.
+
+The Metal StorageBufferCompiler does just this: it computes the optimal buffer assignments to minimise memory usage, and adds a single op at the beginning of the graph to allocate all required buffers once.
+
+This op also keeps track of buffers allocated on earlier runs of the graph, and tries to re-use those. Ideally, each time a graph is run, zero allocations need to happen!
 
 #### Command Buffers
 Another important concept in Metal is the Command Buffer. This is where we queue all our kernels to run.
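The compile-time buffer-assignment idea above can be sketched as a small greedy allocator. This is a minimal illustration, not Luminal's actual StorageBufferCompiler; the `OpInfo` struct and `assign_buffers` function are hypothetical names. It reuses any physical buffer whose last reader has already run, which is what lets `c_buffer` land in `a_buffer`'s memory:

```rust
// Sketch of compile-time storage-buffer assignment (illustrative only).
// Each op produces one output buffer; a buffer is free once the last op
// that reads it has executed.

#[derive(Debug)]
struct OpInfo {
    size: usize,     // bytes the output buffer needs
    last_use: usize, // index of the last op that reads this output
}

/// For each op, pick the index of the physical buffer its output lives in.
fn assign_buffers(ops: &[OpInfo]) -> Vec<usize> {
    // (capacity, index of last op still reading this buffer)
    let mut buffers: Vec<(usize, usize)> = Vec::new();
    let mut assignment = Vec::with_capacity(ops.len());
    for (i, op) in ops.iter().enumerate() {
        // Reuse a buffer whose last reader already ran and which is big enough.
        let slot = buffers
            .iter()
            .position(|&(cap, free_after)| free_after < i && cap >= op.size);
        let idx = match slot {
            Some(idx) => idx,
            None => {
                buffers.push((op.size, op.last_use));
                buffers.len() - 1
            }
        };
        buffers[idx] = (buffers[idx].0.max(op.size), op.last_use);
        assignment.push(idx);
    }
    assignment
}

fn main() {
    // a_buffer -> cos -> b_buffer -> exp -> c_buffer, all 4 KiB.
    // `a` is dead once `cos` (op 1) has run, so `exp`'s output reuses it.
    let ops = [
        OpInfo { size: 4096, last_use: 1 }, // a: last read by cos
        OpInfo { size: 4096, last_use: 2 }, // b: last read by exp
        OpInfo { size: 4096, last_use: 2 }, // c: final output
    ];
    println!("{:?}", assign_buffers(&ops)); // [0, 1, 0]: two buffers, three tensors
}
```

Running this assigns `c` to the same physical buffer as `a`, so the three-tensor chain needs only two allocations.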
 Naively, we could run the command buffer after queueing a single kernel, and do that for each op we run. But running the command buffer has latency, transferring the kernel to the GPU has latency, and this is a very inefficient use of the CommandBuffer. Instead, wouldn't it be great if we could build one massive command buffer with all of our kernels, with inputs and outputs already set up, and run the whole thing at once?
+Shared Command Buffer
+
 That's exactly what the CommandBufferCompiler does. It creates groups of ops that share a command buffer, and runs the command buffer only when we actually need the outputs. Usually we can put an entire model on a single command buffer. For instance, the entire Llama 3 8B uses just one command buffer!
 
 ### What does this get us?
diff --git a/docs/images/command_buffer_dark.png b/docs/images/command_buffer_dark.png
new file mode 100644
index 00000000..c2049f5a
Binary files /dev/null and b/docs/images/command_buffer_dark.png differ
diff --git a/docs/images/command_buffer_light.png b/docs/images/command_buffer_light.png
new file mode 100644
index 00000000..a2406a7a
Binary files /dev/null and b/docs/images/command_buffer_light.png differ
diff --git a/docs/images/fusion_dark.png b/docs/images/fusion_dark.png
new file mode 100644
index 00000000..82afe7fe
Binary files /dev/null and b/docs/images/fusion_dark.png differ
diff --git a/docs/images/gpu.png b/docs/images/gpu.png
deleted file mode 100644
index 9618395e..00000000
Binary files a/docs/images/gpu.png and /dev/null differ
diff --git a/docs/images/storage_buffers_dark.png b/docs/images/storage_buffers_dark.png
new file mode 100644
index 00000000..d89fa432
Binary files /dev/null and b/docs/images/storage_buffers_dark.png differ
diff --git a/docs/images/storage_buffers_light.png b/docs/images/storage_buffers_light.png
new file mode 100644
index 00000000..629fb0c3
Binary files /dev/null and b/docs/images/storage_buffers_light.png differ
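The command-buffer argument above can be made concrete with a toy cost model. This is an illustrative sketch, not the real Metal API (real submission goes through `MTLCommandQueue`); the fixed `SUBMIT_LATENCY_US` constant and the `CommandBuffer` type here are invented for the example. The point is simply that one commit amortizes its fixed latency over every encoded kernel:

```rust
// Toy model: per-kernel submission vs. one shared command buffer.
// Every commit pays a fixed submission latency; batching pays it once.

const SUBMIT_LATENCY_US: u64 = 100; // hypothetical cost of committing a buffer

struct CommandBuffer {
    kernels: Vec<&'static str>,
}

impl CommandBuffer {
    fn new() -> Self {
        Self { kernels: Vec::new() }
    }
    fn encode(&mut self, kernel: &'static str) {
        self.kernels.push(kernel);
    }
    /// One fixed submission cost covers every kernel encoded so far.
    fn commit(&self) -> u64 {
        SUBMIT_LATENCY_US
    }
}

/// Naive: commit once per kernel, so latency scales with the op count.
fn naive_cost(kernels: &[&'static str]) -> u64 {
    kernels
        .iter()
        .map(|k| {
            let mut cb = CommandBuffer::new();
            cb.encode(k);
            cb.commit()
        })
        .sum()
}

/// Shared: encode everything into one buffer, commit once.
fn batched_cost(kernels: &[&'static str]) -> u64 {
    let mut cb = CommandBuffer::new();
    for k in kernels {
        cb.encode(k);
    }
    cb.commit()
}

fn main() {
    let kernels = ["matmul", "rmsnorm", "rope", "attention", "ffn"];
    println!("naive:   {} us", naive_cost(&kernels)); // five commits
    println!("batched: {} us", batched_cost(&kernels)); // one commit
}
```

With five kernels the naive path pays the submission latency five times while the shared buffer pays it once, which is why putting all of Llama 3 8B on a single command buffer is such a win.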