feat(knowledge-base): Add document processing service with vector search
- Implement document upload and processing (PDF, DOCX, TXT)
- Add vector search using LanceDB and Cohere embeddings
- Implement document deletion with vector cleanup
- Add comprehensive test suite
- Add Docker configuration for service and testing
- Update documentation with setup and usage instructions
Jss-on committed Dec 3, 2024
1 parent e72fe69 commit cdd1ad9
Showing 10 changed files with 676 additions and 0 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -1 +1,7 @@
rust-backend/target

# Knowledge Base
knowledge-base/.env
knowledge-base/uploads/
knowledge-base/lancedb/
knowledge-base/venv/
54 changes: 54 additions & 0 deletions knowledge-base/.dockerignore
@@ -0,0 +1,54 @@
# Git
.git
.gitignore

# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Virtual Environment
venv/
ENV/

# IDE
.idea/
.vscode/
*.swp
*.swo

# Local development
.env
.env.local
.env.*

# Docker
Dockerfile
Dockerfile.test
docker-compose.yml
.dockerignore

# Database
lancedb/*
!lancedb/.gitkeep

# Uploads
uploads/*
!uploads/.gitkeep
35 changes: 35 additions & 0 deletions knowledge-base/Dockerfile
@@ -0,0 +1,35 @@
# Use Python 3.11 slim image
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first to leverage Docker cache
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application
COPY app/ app/
COPY uploads/ uploads/

# Create necessary directories
RUN mkdir -p uploads lancedb

# Set environment variables
ENV PYTHONPATH=/app
ENV UPLOAD_DIR=/app/uploads
ENV DB_PATH=/app/lancedb

# Expose the port the app runs on
EXPOSE 8000

# Command to run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
34 changes: 34 additions & 0 deletions knowledge-base/Dockerfile.test
@@ -0,0 +1,34 @@
# Use Python 3.11 slim image
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first to leverage Docker cache
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application and test files
COPY app/ app/
COPY tests/ tests/
COPY uploads/ uploads/

# Create necessary directories
RUN mkdir -p uploads lancedb

# Set environment variables
ENV PYTHONPATH=/app
ENV UPLOAD_DIR=/app/uploads
ENV DB_PATH=/app/lancedb
ENV COHERE_API_KEY=dummy_key_for_testing

# Command to run tests
CMD ["pytest", "tests/", "-v"]
106 changes: 106 additions & 0 deletions knowledge-base/README.md
@@ -0,0 +1,106 @@
# Knowledge Base Service

This service accepts document uploads (PDF, DOCX, and TXT files), extracts and chunks their text, and stores the chunks in a LanceDB vector database using Cohere embeddings. Search results are reranked with Cohere.

## Features

- Upload PDF, DOCX, and TXT files
- Automatic text extraction from documents
- Text chunking using Langchain's RecursiveCharacterTextSplitter
- Vector embeddings generation using Cohere's multilingual model
- Vector storage using LanceDB with Langchain integration
- Semantic search with Cohere reranking
- Document deletion with corresponding vector data cleanup

## Docker Setup

### Prerequisites

- Docker and Docker Compose installed
- Cohere API key

### Running the Service

1. Create a `.env` file with your Cohere API key:
```bash
COHERE_API_KEY=your_cohere_api_key_here
```

2. Build and start the service:
```bash
docker-compose up app
```

Once the container is running, the service is available at `http://localhost:8000`.
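
As an illustration of how the service side might consume this configuration, here is a minimal sketch that reads the settings from the environment and fails early when the API key is missing. `COHERE_API_KEY` comes from the step above, and `UPLOAD_DIR` and `DB_PATH` are the variable names the service's Dockerfile sets. The `Settings` class, the `load_settings` helper, and the fallback paths are hypothetical and not taken from `app/main.py`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    cohere_api_key: str
    upload_dir: str
    db_path: str


def load_settings(env=None) -> Settings:
    """Read service configuration from environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("COHERE_API_KEY", "").strip()
    if not api_key:
        # Fail at startup rather than at the first embedding request.
        raise RuntimeError("COHERE_API_KEY is not set")
    return Settings(
        cohere_api_key=api_key,
        # Fallback paths are assumptions for local runs outside Docker.
        upload_dir=env.get("UPLOAD_DIR", "uploads"),
        db_path=env.get("DB_PATH", "lancedb"),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise in tests without touching the real environment.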

### Running Tests

Run the tests in a Docker container:
```bash
docker-compose up test
```

### Development Setup

If you prefer to run the service without Docker:

1. Create a virtual environment and activate it:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Run the service:
```bash
cd app
uvicorn main:app --reload
```

## API Endpoints

### Upload Document
- **URL**: `/upload`
- **Method**: `POST`
- **Content-Type**: `multipart/form-data`
- **Parameter**: `file` (PDF, DOCX, or TXT file)
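
A client-side sketch of this call, using only the Python standard library. It builds a `multipart/form-data` request with a single `file` field, as described above. The helper name `build_upload_request` and the base URL in the usage note are assumptions, and the response format is not documented here, so the sketch does not parse it.

```python
import mimetypes
import urllib.request
import uuid


def build_upload_request(base_url: str, filename: str, content: bytes) -> urllib.request.Request:
    """Build a POST /upload request carrying one file in the `file` form field."""
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    # Each part starts with "--boundary"; the body ends with "--boundary--".
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        f"Content-Type: {content_type}".encode(),
        b"",
        content,
        f"--{boundary}--".encode(),
        b"",
    ])
    return urllib.request.Request(
        f"{base_url}/upload",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

To send it against a locally running service, pass the result to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_upload_request("http://localhost:8000", "notes.txt", data))`.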

### Search Documents
- **URL**: `/search`
- **Method**: `GET`
- **Parameters**:
  - `query` (string): The search query
  - `limit` (integer, optional): Maximum number of results (default: 5)
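
The query string for this endpoint can be assembled as in the sketch below, which URL-encodes `query` and `limit` with the standard library. The helper name `build_search_url` is hypothetical, and the shape of the JSON (or other) response is not specified here, so it is left out.

```python
import urllib.parse


def build_search_url(base_url: str, query: str, limit: int = 5) -> str:
    """Return the GET /search URL with properly encoded parameters."""
    # urlencode escapes spaces and special characters in the query text.
    params = urllib.parse.urlencode({"query": query, "limit": limit})
    return f"{base_url}/search?{params}"
```

For example, `build_search_url("http://localhost:8000", "vector search")` yields `http://localhost:8000/search?query=vector+search&limit=5`.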

### Delete Document
- **URL**: `/document/{filename}`
- **Method**: `DELETE`
- **Parameter**: `filename` (name of the file to delete)
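
Because the filename travels in the URL path, it should be percent-encoded before the request is built. A minimal sketch with the standard library follows; the helper name `build_delete_request` is hypothetical.

```python
import urllib.parse
import urllib.request


def build_delete_request(base_url: str, filename: str) -> urllib.request.Request:
    """Build a DELETE /document/{filename} request."""
    # safe="" also encodes "/" so the name stays a single path segment.
    encoded = urllib.parse.quote(filename, safe="")
    return urllib.request.Request(f"{base_url}/document/{encoded}", method="DELETE")
```

A name such as `annual report.pdf` becomes `/document/annual%20report.pdf`; the request can then be sent with `urllib.request.urlopen`.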

## Project Structure
```
knowledge-base/
├── app/
│   └── main.py
├── tests/
│   ├── conftest.py
│   └── test_main.py
├── uploads/            # Directory for stored documents
├── lancedb/            # Vector database storage
├── Dockerfile          # Main service Dockerfile
├── Dockerfile.test     # Testing Dockerfile
├── docker-compose.yml
├── requirements.txt
└── .env                # Environment variables
```

## Notes
- Uploaded files are stored in the `uploads` directory
- Vector embeddings are stored in LanceDB
- The service uses Cohere's embed-multilingual-v3.0 model for embeddings
- Search results are reranked using Cohere's rerank-v3.5 model
- Text is split into chunks with a 200-token overlap so that context carries across chunk boundaries
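
To illustrate what chunk overlap means, the sketch below splits text with a fixed-width sliding window in which each chunk repeats the tail of the previous one. This is a simplified stand-in, not Langchain's `RecursiveCharacterTextSplitter`, which also prefers to break on separators such as paragraphs and sentences. The sketch counts characters, the function name `split_with_overlap` is hypothetical, and the sizes in the usage note are illustrative only.

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

With `split_with_overlap("abcdefghij", 4, 2)` the result is `["abcd", "cdef", "efgh", "ghij"]`: every chunk begins with the last two characters of the one before it, which is what lets a sentence that straddles a boundary remain retrievable from either chunk.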