Update README.md
jafioti authored Dec 30, 2023
1 parent 542f74f commit 517124b
Showing 1 changed file, README.md, with 7 additions and 7 deletions.
## Getting Started
**Mistral 7B**
```bash
sh examples/mistral/setup/setup.sh  # Download the model weights
cargo run --release --example mistral  # Run the model
```

## Why does this look so different from other DL libraries?
A consequence of this is that the actual computation that gets run can be radically different from the code that was written.
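To make the deferred-execution idea concrete, here is a tiny, self-contained sketch in plain Rust. It is not luminal's API; `Node` and `eval` are invented names purely to illustrate recording a computation as data first and running it later:

```rust
// Toy illustration of deferred execution: the expression is recorded as a
// graph of nodes, and nothing is computed until eval() is called.
#[derive(Debug)]
enum Node {
    Constant(f32),
    Add(Box<Node>, Box<Node>),
    Mul(Box<Node>, Box<Node>),
}

impl Node {
    // Walking the graph is the "execution" step; a compiler could rewrite the
    // graph (fuse nodes, switch devices, ...) before this ever runs.
    fn eval(&self) -> f32 {
        match self {
            Node::Constant(v) => *v,
            Node::Add(a, b) => a.eval() + b.eval(),
            Node::Mul(a, b) => a.eval() * b.eval(),
        }
    }
}

fn main() {
    // "Writing the model" only builds the graph; no math happens yet.
    let graph = Node::Add(
        Box::new(Node::Mul(
            Box::new(Node::Constant(2.0)),
            Box::new(Node::Constant(3.0)),
        )),
        Box::new(Node::Constant(1.0)),
    );
    // "Running the model" evaluates the recorded graph in one shot.
    println!("Result: {}", graph.eval()); // prints 7
}
```

Because the whole expression exists as data before anything runs, a compiler is free to rewrite it, fuse it, or retarget it before execution.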

Of course, if we want to insert dynamic control flow partway through, we can still split the network into multiple separate graphs, which means this method doesn't preclude optimizations like KV caching, because the KV-cached forward pass is just a separate graph! See `examples/llama` for an example of a KV cache.
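As a rough picture of why the KV cache fits this model so naturally, here is a toy sketch in plain Rust (not luminal code; `prefill`, `decode_step`, and the scalar "attention" are all made up for illustration). The prompt pass and the cached decode pass are simply two separate functions, i.e. two separate graphs:

```rust
// Toy single-head "attention" over scalar keys/values, just to show the shape
// of the two passes; real models use matrices, softmax, and many heads.
struct KvCache {
    keys: Vec<f32>,
    values: Vec<f32>,
}

// Pass 1 (the "prefill" graph): process the whole prompt once and fill the cache.
fn prefill(prompt: &[f32]) -> KvCache {
    KvCache {
        keys: prompt.to_vec(),
        values: prompt.iter().map(|x| x * 0.5).collect(),
    }
}

// Pass 2 (the "decode" graph): one new token reuses everything cached so far
// instead of recomputing keys/values for the whole sequence.
fn decode_step(cache: &mut KvCache, token: f32) -> f32 {
    cache.keys.push(token);
    cache.values.push(token * 0.5);
    cache
        .keys
        .iter()
        .zip(&cache.values)
        .map(|(k, v)| k * token * v)
        .sum()
}

fn main() {
    let mut cache = prefill(&[1.0, 2.0, 3.0]);
    println!("decode output: {}", decode_step(&mut cache, 4.0));
}
```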

Now we can do:
- Aggressive kernel fusion
- Shape-specific kernels compiled at runtime
- Devices and Dtypes are handled through compilers (just run the CUDA compiler to convert the graph to use CUDA kernels, then the fp16 compiler to convert to half-precision kernels); see the sketch below for the general shape of chaining passes like this
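The sketch below shows the idea of stacking compilers as independent passes over the same graph. It is purely conceptual plain Rust: `Graph`, `Op`, `FusionCompiler`, and `HalfPrecisionCompiler` are invented for illustration and are not luminal's real interfaces:

```rust
// Purely conceptual: Graph, Op, and Compiler here are invented for illustration.
#[derive(Debug, Clone)]
enum Op {
    MatMul,
    Add,
    Relu,
    FusedAddRelu, // produced by the fusion pass below
}

#[derive(Debug)]
struct Graph {
    ops: Vec<Op>,
    half_precision: bool,
}

// Each compiler is an independent pass that rewrites the graph in place.
trait Compiler {
    fn compile(&self, graph: &mut Graph);
}

// Fuses adjacent Add + Relu ops into a single fused op.
struct FusionCompiler;
impl Compiler for FusionCompiler {
    fn compile(&self, graph: &mut Graph) {
        let mut fused = Vec::new();
        let mut i = 0;
        while i < graph.ops.len() {
            match (&graph.ops[i], graph.ops.get(i + 1)) {
                (Op::Add, Some(Op::Relu)) => {
                    fused.push(Op::FusedAddRelu);
                    i += 2;
                }
                (op, _) => {
                    fused.push(op.clone());
                    i += 1;
                }
            }
        }
        graph.ops = fused;
    }
}

// Marks the graph to emit half-precision kernels.
struct HalfPrecisionCompiler;
impl Compiler for HalfPrecisionCompiler {
    fn compile(&self, graph: &mut Graph) {
        graph.half_precision = true;
    }
}

fn main() {
    let mut graph = Graph {
        ops: vec![Op::MatMul, Op::Add, Op::Relu],
        half_precision: false,
    };
    // Run the passes in sequence; each one transforms the same graph.
    let passes: Vec<Box<dyn Compiler>> = vec![
        Box::new(FusionCompiler),
        Box::new(HalfPrecisionCompiler),
    ];
    for pass in &passes {
        pass.compile(&mut graph);
    }
    println!("{:?}", graph); // ops: [MatMul, FusedAddRelu], half_precision: true
}
```

Each pass only needs to know how to rewrite the graph, so device lowering, precision conversion, and fusion can be developed and applied independently.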
Once you've written all your computation code, run `cx.display()` to see the entire computation graph.
Currently luminal is extremely alpha. Please don't use this in prod.

- Metal and Cuda are supported for running models on Macs and Nvidia GPUs respectively, in both full and half precision.
- Mistral 7B and Llama 7B are implemented in `examples/`. See instructions above for running.
- The llama example shows how to implement a loader for a custom format (a conceptual sketch follows this list). Safetensors loaders are already implemented, and are the recommended way to load a model.
- We have a small library of NN modules in `nn`, including transformers.
- A significant number of high-level ops are implemented in `hl_ops`. We are aiming to match the tinygrad / pytorch API.
- Next release will bring a significant number of compilers which should fuse primops into much faster ops. The aim for 0.3 is to be usably fast, not SOTA yet (10-20 tok/s in fp16).
- The aim for 0.4 is to achieve SOTA performance on Macs (50 tok/s) and near-SOTA on single Nvidia GPUs (>200 tok/s), as well as support all mainstream models (Whisper, Stable Diffusion, Yolo v8, etc.)
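As mentioned in the list above, a custom-format loader boils down to mapping names in a weight file onto named parameters in the model. Here is a minimal conceptual sketch in plain Rust with invented `WeightFile` and `Model` types (not luminal's actual loader interface):

```rust
use std::collections::HashMap;

// Stand-in types: a "weight file" is just name -> data here, and a "model" is a
// set of named parameter slots that need to be filled.
struct WeightFile {
    tensors: HashMap<String, Vec<f32>>,
}

struct Model {
    params: HashMap<String, Vec<f32>>,
}

// A custom-format loader only has to answer one question per parameter:
// which data in the file does this parameter name map to?
fn load(model: &mut Model, file: &WeightFile) {
    for (name, slot) in model.params.iter_mut() {
        match file.tensors.get(name) {
            Some(data) => *slot = data.clone(),
            None => eprintln!("warning: no weights found for {name}"),
        }
    }
}

fn main() {
    let file = WeightFile {
        tensors: HashMap::from([("embed.weight".to_string(), vec![0.1, 0.2, 0.3])]),
    };
    let mut model = Model {
        params: HashMap::from([("embed.weight".to_string(), Vec::new())]),
    };
    load(&mut model, &file);
    println!("{:?}", model.params["embed.weight"]);
}
```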

Some things on the roadmap:
- Optimize cuda and metal matmul kernels
- Fine-grained metal and cuda IR
- Build benchmarking suite to test against other libs
- Write specialized Cuda and Metal kernels for full transformer architecture (FlashAttention, etc.)
- Autograd engine
- Beat PT 2.0 perf on LLM training
- Write compiler for quantum photonic retro encabulator