This repository provides an implementation of a FastAPI-based server to serve a compiled Vicuna-7B model using Llama.cpp. The server implements dynamic batching to efficiently manage multiple concurrent requests, reducing the overall response time by grouping requests into a single batch before sending them to the model for inference.
To produce the compiled Vicuna-7B model used by the server, follow these steps to convert and quantize it with llama.cpp:
- Download the Base Model:
  - Obtain the base Vicuna-7B model from the LMSYS repository.
- Install Llama.cpp:
  - Clone the llama.cpp repository and install the required dependencies.
  - Llama.cpp supports quantization to reduce memory usage and offers a simple API for loading and running models.
  - It's compatible with various Llama-based models, including Vicuna.
- Convert the Base Model to GGUF Format:
  - Convert the downloaded model to the GGUF format, which is optimized for llama.cpp inference.
- Quantize the GGUF Model:
  - Use llama.cpp's quantization tooling to reduce the model size for faster inference and lower memory consumption (example commands for the conversion and quantization steps follow this list).
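For reference, a typical conversion and quantization flow looks like the sketch below. The paths are placeholders, and the exact script and binary names vary between llama.cpp versions, so check the llama.cpp documentation for your checkout.

```bash
# Convert the downloaded Hugging Face checkpoint to GGUF (FP16).
# Depending on the llama.cpp version, the converter may be named convert.py or convert_hf_to_gguf.py.
python convert_hf_to_gguf.py ./vicuna-7b --outfile ./vicuna-7b-f16.gguf --outtype f16

# Quantize the FP16 GGUF file to Q4_K_M for faster inference and lower memory use.
# Older builds ship this binary as ./quantize rather than ./llama-quantize.
./llama-quantize ./vicuna-7b-f16.gguf ./vicuna-7b-q4_k_m.gguf Q4_K_M
```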
Memory/Disk Requirements (for the 7B Vicuna model):
- Original size: 13 GB
- Quantized size (Q4_K_M): 3.9 GB
The FastAPI server is designed to handle multiple incoming requests, batch them together, and perform inference on the compiled Vicuna model in an optimized manner. This ensures efficient GPU usage and reduces overall latency by processing several requests in a single forward pass.
- Dynamic Batching: Collects multiple requests within a short window (e.g., 50 ms) and sends them to the model as a single batch for inference.
- Efficient Memory Management: Takes advantage of llama.cpp's ability to load quantized models, reducing the memory footprint.
- Concurrency Handling: Manages multiple requests asynchronously using FastAPI and `asyncio` for high throughput.
- Vicuna Model: Ensure you have a quantized version of the Vicuna-7B model (in GGUF format) compiled with llama.cpp.
- Python Libraries:
  - `llama_cpp` (installed via the `llama-cpp-python` package)
  - `fastapi`
  - `uvicorn`
  - `aiohttp`
- Install the necessary Python libraries (the `llama_cpp` module is provided by the `llama-cpp-python` package on PyPI): `pip install fastapi uvicorn llama-cpp-python aiohttp`
- Place the quantized Vicuna model in a directory (e.g., `./quantized_model/vicuna_7b_FP16_K_M.gguf`).
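Before starting the server, you can confirm that the quantized file loads with the `llama_cpp` bindings. This is just a quick sanity check; the context size and prompt below are arbitrary.

```python
from llama_cpp import Llama

# Load the quantized GGUF model; adjust model_path and n_ctx to match your setup.
llm = Llama(model_path="./quantized_model/vicuna_7b_FP16_K_M.gguf", n_ctx=2048)

# Run a single short completion to verify inference works.
output = llm("Q: What is the capital of France? A:", max_tokens=32)
print(output["choices"][0]["text"])
```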
The FastAPI app runs a server that handles incoming requests, batches them, and sends them to the model for inference.

- Dynamic Batching: Requests are batched dynamically based on the configured batch size and timeout values (`BATCH_SIZE`, `BATCH_TIMEOUT`); a sketch of this loop follows the list.
- Async Processing: The server uses asynchronous request handling to maximize concurrency.
- Vicuna Inference: The compiled Vicuna model is loaded through the `llama_cpp.Llama` interface.
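The repository's server script is not reproduced here, but the following sketch illustrates how such a batching loop can be wired together with FastAPI, `asyncio`, and `llama_cpp`. The request schema, the default values for `BATCH_SIZE` and `BATCH_TIMEOUT`, and the model path are assumptions for illustration, not the repository's exact configuration.

```python
import asyncio

from fastapi import FastAPI
from llama_cpp import Llama
from pydantic import BaseModel

BATCH_SIZE = 8        # assumed default: maximum requests per batch
BATCH_TIMEOUT = 0.05  # assumed default: 50 ms collection window

app = FastAPI()
llm = Llama(model_path="./quantized_model/vicuna_7b_FP16_K_M.gguf", n_ctx=2048)

# Queue of (request, future) pairs waiting to be grouped into a batch.
request_queue: asyncio.Queue = asyncio.Queue()


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 64


@app.on_event("startup")
async def start_batch_worker() -> None:
    asyncio.create_task(batch_worker())


async def batch_worker() -> None:
    """Collect requests for up to BATCH_TIMEOUT seconds (or until BATCH_SIZE items
    arrive), then run the whole batch through the model and resolve each future."""
    while True:
        batch = [await request_queue.get()]  # block until at least one request arrives
        loop = asyncio.get_running_loop()
        deadline = loop.time() + BATCH_TIMEOUT
        while len(batch) < BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(request_queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break

        # The high-level llama_cpp API generates one prompt at a time, so this sketch
        # runs the collected batch sequentially in a worker thread; the gain here is
        # amortized scheduling and keeping the event loop free for new connections.
        for req, future in batch:
            result = await asyncio.to_thread(llm, req.prompt, max_tokens=req.max_tokens)
            future.set_result(result["choices"][0]["text"])


@app.post("/generate/")
async def generate(req: GenerateRequest) -> dict:
    future: asyncio.Future = asyncio.get_running_loop().create_future()
    await request_queue.put((req, future))
    return {"text": await future}
```

Collecting up to `BATCH_SIZE` requests or waiting at most `BATCH_TIMEOUT` seconds is the core of dynamic batching: under load the queue fills quickly, while a lone request only waits for the timeout.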
To start the FastAPI server, run the following command:
`uvicorn your_fastapi_script:app --reload`

This will serve the model at `http://localhost:8000/generate/`.
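You can then send a single request with `curl`; the JSON fields below assume the request schema from the sketch above, so adjust them to match your actual FastAPI script:

```bash
curl -X POST http://localhost:8000/generate/ \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain dynamic batching in one sentence.", "max_tokens": 64}'
```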
You can test the dynamic batching feature by sending multiple requests concurrently. The test script (`test_batch_inference.py`) uses `asyncio` and `aiohttp` to send requests in parallel to the FastAPI server.
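A concurrent client along these lines can serve as a reference if you want to adapt or extend `test_batch_inference.py`; the endpoint path, payload fields, and request count are assumptions carried over from the sketches above.

```python
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/generate/"
PROMPTS = [f"Question {i}: what is dynamic batching?" for i in range(8)]


async def send_request(session: aiohttp.ClientSession, prompt: str) -> dict:
    # Payload fields are assumptions; match them to the server's request model.
    async with session.post(URL, json={"prompt": prompt, "max_tokens": 64}) as resp:
        return await resp.json()


async def main() -> None:
    start = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        # Fire all requests at once so they land inside the same batching window.
        results = await asyncio.gather(*(send_request(session, p) for p in PROMPTS))
    elapsed = time.perf_counter() - start
    for result in results:
        print(result)
    print(f"{len(PROMPTS)} requests completed in {elapsed:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```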
After running the FastAPI server, execute the test script with:
`python test_batch_inference.py`
The script sends multiple requests concurrently and checks whether the server batches them. The responses should indicate that batching is working correctly, and the total time taken should reflect the efficiency gains from batching.