feat(knowledge-base): Add document processing service with vector search
- Implement document upload and processing (PDF, DOCX, TXT)
- Add vector search using LanceDB and Cohere embeddings
- Implement document deletion with vector cleanup
- Add comprehensive test suite
- Add Docker configuration for service and testing
- Update documentation with setup and usage instructions

Showing 10 changed files with 676 additions and 0 deletions.

Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,7 @@ | ||
rust-backend/target | ||
|
||
# Knowledge Base | ||
knowledge-base/.env | ||
knowledge-base/uploads/ | ||
knowledge-base/lancedb/ | ||
knowledge-base/venv/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
# Git | ||
.git | ||
.gitignore | ||
|
||
# Python | ||
__pycache__/ | ||
*.py[cod] | ||
*$py.class | ||
*.so | ||
.Python | ||
env/ | ||
build/ | ||
develop-eggs/ | ||
dist/ | ||
downloads/ | ||
eggs/ | ||
.eggs/ | ||
lib/ | ||
lib64/ | ||
parts/ | ||
sdist/ | ||
var/ | ||
*.egg-info/ | ||
.installed.cfg | ||
*.egg | ||
|
||
# Virtual Environment | ||
venv/ | ||
ENV/ | ||
|
||
# IDE | ||
.idea/ | ||
.vscode/ | ||
*.swp | ||
*.swo | ||
|
||
# Local development | ||
.env | ||
.env.local | ||
.env.* | ||
|
||
# Docker | ||
Dockerfile | ||
Dockerfile.test | ||
docker-compose.yml | ||
.dockerignore | ||
|
||
# Database | ||
lancedb/* | ||
!lancedb/.gitkeep | ||
|
||
# Uploads | ||
uploads/* | ||
!uploads/.gitkeep |
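
The `!lancedb/.gitkeep` and `!uploads/.gitkeep` entries above depend on ordering: the last pattern that matches a path decides whether it is excluded, so a later `!pattern` re-includes a file that an earlier pattern dropped. The sketch below models only that rule. It is a simplification, not Docker's actual matcher: it uses Python's `fnmatch` (whose `*` also crosses `/`), and `is_ignored` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase


def is_ignored(path: str, patterns: list[str]) -> bool:
    """Return True if `path` ends up excluded; the last matching pattern wins."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(path, glob):
            # A plain pattern excludes the path, a "!" pattern re-includes it.
            ignored = not negated
    return ignored


# The two "Uploads" entries from the ignore file above.
PATTERNS = ["uploads/*", "!uploads/.gitkeep"]

print(is_ignored("uploads/report.pdf", PATTERNS))
print(is_ignored("uploads/.gitkeep", PATTERNS))
```

With these two patterns, uploaded documents stay out of the build context while the placeholder file keeps the directory present.
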
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
# Use Python 3.11 slim image | ||
FROM python:3.11-slim | ||
|
||
# Set working directory | ||
WORKDIR /app | ||
|
||
# Install system dependencies | ||
RUN apt-get update && \ | ||
apt-get install -y --no-install-recommends \ | ||
build-essential \ | ||
&& rm -rf /var/lib/apt/lists/* | ||
|
||
# Copy requirements first to leverage Docker cache | ||
COPY requirements.txt . | ||
|
||
# Install Python dependencies | ||
RUN pip install --no-cache-dir -r requirements.txt | ||
|
||
# Copy the rest of the application | ||
COPY app/ app/ | ||
COPY uploads/ uploads/ | ||
|
||
# Create necessary directories | ||
RUN mkdir -p uploads lancedb | ||
|
||
# Set environment variables | ||
ENV PYTHONPATH=/app | ||
ENV UPLOAD_DIR=/app/uploads | ||
ENV DB_PATH=/app/lancedb | ||
|
||
# Expose the port the app runs on | ||
EXPOSE 8000 | ||
|
||
# Command to run the application | ||
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"] |
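
The image sets `UPLOAD_DIR` and `DB_PATH`, which suggests the application reads its storage locations from the environment. `app/main.py` is not shown in this view, so the following is only a sketch of how such settings could be loaded. The `Settings` class, the `load_settings` function, and the fallback values `uploads` and `lancedb` are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    upload_dir: Path  # where uploaded documents are written
    db_path: Path     # where the LanceDB files live


def load_settings(env=None) -> Settings:
    """Build settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    return Settings(
        # Fallbacks are hypothetical defaults for running outside Docker.
        upload_dir=Path(env.get("UPLOAD_DIR", "uploads")),
        db_path=Path(env.get("DB_PATH", "lancedb")),
    )


settings = load_settings({"UPLOAD_DIR": "/app/uploads", "DB_PATH": "/app/lancedb"})
print(settings.upload_dir, settings.db_path)
```

Accepting the mapping as an argument keeps the function easy to test without mutating `os.environ`.
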
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# Use Python 3.11 slim image | ||
FROM python:3.11-slim | ||
|
||
# Set working directory | ||
WORKDIR /app | ||
|
||
# Install system dependencies | ||
RUN apt-get update && \ | ||
apt-get install -y --no-install-recommends \ | ||
build-essential \ | ||
&& rm -rf /var/lib/apt/lists/* | ||
|
||
# Copy requirements first to leverage Docker cache | ||
COPY requirements.txt . | ||
|
||
# Install Python dependencies | ||
RUN pip install --no-cache-dir -r requirements.txt | ||
|
||
# Copy the application and test files | ||
COPY app/ app/ | ||
COPY tests/ tests/ | ||
COPY uploads/ uploads/ | ||
|
||
# Create necessary directories | ||
RUN mkdir -p uploads lancedb | ||
|
||
# Set environment variables | ||
ENV PYTHONPATH=/app | ||
ENV UPLOAD_DIR=/app/uploads | ||
ENV DB_PATH=/app/lancedb | ||
ENV COHERE_API_KEY=dummy_key_for_testing | ||
|
||
# Command to run tests | ||
CMD ["pytest", "tests/", "-v"] |
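
Setting `COHERE_API_KEY=dummy_key_for_testing` implies the test suite is not expected to reach the real Cohere API. The test files are not shown in this view, so the sketch below only illustrates one common way to achieve that: replacing the embedding client with a stub. `embed_chunks` and the `client.embed(chunks)` interface are hypothetical and do not reflect the Cohere SDK's actual signature.

```python
from unittest.mock import MagicMock


def embed_chunks(client, chunks):
    """Hypothetical helper: request one vector per chunk and pair them up."""
    vectors = client.embed(chunks)
    if len(vectors) != len(chunks):
        raise ValueError("expected one embedding per chunk")
    return list(zip(chunks, vectors))


# Stand-in for the real embedding client; no network call is made.
fake_client = MagicMock()
fake_client.embed.return_value = [[0.1, 0.2], [0.3, 0.4]]

pairs = embed_chunks(fake_client, ["first chunk", "second chunk"])

# The stub records how it was called, so the test can verify the interaction.
fake_client.embed.assert_called_once_with(["first chunk", "second chunk"])
```
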
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
# Knowledge Base Service | ||
|
||
This service handles document uploads (PDF, DOCX, and TXT files), processes them, and stores their content in a vector database using LanceDB with Cohere embeddings and reranking. | ||
|
||
## Features | ||
|
||
- Upload PDF, DOCX, and TXT files | ||
- Automatic text extraction from documents | ||
- Text chunking using LangChain's RecursiveCharacterTextSplitter | ||
- Vector embedding generation using Cohere's multilingual model | ||
- Vector storage using LanceDB with LangChain integration | ||
- Semantic search with Cohere reranking | ||
- Document deletion with corresponding vector data cleanup | ||
|
||
## Docker Setup | ||
|
||
### Prerequisites | ||
|
||
- Docker and Docker Compose installed | ||
- Cohere API key | ||
|
||
### Running the Service | ||
|
||
1. Create a `.env` file with your Cohere API key: | ||
```bash | ||
COHERE_API_KEY=your_cohere_api_key_here | ||
``` | ||
|
||
2. Build and start the service: | ||
```bash | ||
docker-compose up app | ||
``` | ||
|
||
The service will be available at `http://localhost:8000` | ||
|
||
### Running Tests | ||
|
||
Run the tests in a Docker container: | ||
```bash | ||
docker-compose up test | ||
``` | ||
|
||
### Development Setup | ||
|
||
If you prefer to run the service without Docker: | ||
|
||
1. Create a virtual environment and activate it: | ||
```bash | ||
python -m venv venv | ||
source venv/bin/activate # On Windows: venv\Scripts\activate | ||
``` | ||
|
||
2. Install dependencies: | ||
```bash | ||
pip install -r requirements.txt | ||
``` | ||
|
||
3. Run the service: | ||
```bash | ||
cd app | ||
uvicorn main:app --reload | ||
``` | ||
|
||
## API Endpoints | ||
|
||
### Upload Document | ||
- **URL**: `/upload` | ||
- **Method**: `POST` | ||
- **Content-Type**: `multipart/form-data` | ||
- **Parameter**: `file` (PDF, DOCX, or TXT file) | ||
|
||
### Search Documents | ||
- **URL**: `/search` | ||
- **Method**: `GET` | ||
- **Parameters**: | ||
- `query` (string): The search query | ||
- `limit` (integer, optional): Maximum number of results (default: 5) | ||
|
||
### Delete Document | ||
- **URL**: `/document/{filename}` | ||
- **Method**: `DELETE` | ||
- **Parameter**: `filename` (name of the file to delete) | ||
|
||
## Project Structure | ||
``` | ||
knowledge-base/ | ||
├── app/ | ||
│ └── main.py | ||
├── tests/ | ||
│ ├── conftest.py | ||
│ └── test_main.py | ||
├── uploads/ # Directory for stored documents | ||
├── lancedb/ # Vector database storage | ||
├── Dockerfile # Main service Dockerfile | ||
├── Dockerfile.test # Testing Dockerfile | ||
├── docker-compose.yml | ||
├── requirements.txt | ||
└── .env # Environment variables | ||
``` | ||
|
||
## Notes | ||
- Uploaded files are stored in the `uploads` directory | ||
- Vector embeddings are stored in LanceDB | ||
- The service uses Cohere's embed-multilingual-v3.0 model for embeddings | ||
- Search results are reranked using Cohere's rerank-v3.5 model | ||
- Text is split into chunks with 200-token overlap for better context preservation |
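
The README's "API Endpoints" section lists the search and delete routes with their methods and parameters. As a rough illustration, the sketch below builds matching requests with Python's standard library only. The base URL comes from the README; the helper names `search_request` and `delete_request` are hypothetical, and the response format is not documented in this commit, so nothing is assumed about it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"


def search_request(query: str, limit: int = 5) -> Request:
    """GET /search with the documented `query` and `limit` parameters."""
    params = urlencode({"query": query, "limit": limit})
    return Request(f"{BASE_URL}/search?{params}", method="GET")


def delete_request(filename: str) -> Request:
    """DELETE /document/{filename}, percent-encoding the file name."""
    return Request(f"{BASE_URL}/document/{quote(filename, safe='')}", method="DELETE")


req = search_request("vector search", limit=3)
print(req.get_method(), req.full_url)
```

With the service running, either request can be sent via `urllib.request.urlopen(req)`. The upload route is left out because `multipart/form-data` bodies need extra encoding that an HTTP client library would normally handle.
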