EmbedExplorer is a Python application that processes text documents, generates embeddings with the all-MiniLM-L6-v2 model from Sentence Transformers, and stores them in a local vector database. It uses SQLite for metadata storage and FAISS for vector storage. Document processing runs entirely locally and offline; only the real-time chatbot calls the OpenAI GPT API.
Key features:

- Extracts text from PDF, TXT, and Markdown files.
- Chunks text into smaller, overlapping segments (see the sketch below).
- Generates embeddings using the all-MiniLM-L6-v2 model.
- Stores embeddings in a local FAISS vector database.
- Manages document metadata in a local SQLite database.
- Supports CRUD operations on the document metadata.
- Provides a real-time chatbot that answers queries based on the embedded documents using OpenAI GPT.
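A minimal sketch of the chunking step, assuming a simple word-based splitter driven by the `CHUNK_SIZE` and `CHUNK_OVERLAP` values from `config.py` (the actual splitter lives in `vector_db/document_processor.py` and may work differently):

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split `text` into word-based chunks of up to `chunk_size` words,
    with `overlap` words shared between consecutive chunks.
    Hypothetical helper; not the repository's actual implementation."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```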
Project structure:

```
EmbedExplorer/
│
├── vector_db/
│   ├── __init__.py
│   ├── database.py
│   └── document_processor.py
│
├── chatbot/
│   ├── __init__.py
│   ├── model_handler.py
│   ├── query_handler.py
│   └── response_generator.py
│
├── knowledge/
│   └── text_documents/   # Place your text documents here
│
├── database/             # Local database files
│
├── tests/
│   └── test_database.py  # Unit tests
│
├── config.py             # Global configuration
├── main.py               # Main entry point for document processing
├── start_chatbot.py      # Integrated chatbot and query example
├── .env                  # Environment variables for secrets
├── venv/                 # Virtual environment
└── requirements.txt      # Dependencies
```
Installation:

1. Clone the repository:

   ```bash
   git clone https://github.com/d-zienke/EmbedExplorer.git
   cd EmbedExplorer
   ```

2. Set up and activate a virtual environment:

   ```bash
   python -m venv venv
   venv\Scripts\activate  # Windows; on macOS/Linux: source venv/bin/activate
   ```

3. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Configure the environment. Create a `.env` file in the root directory and add your Hugging Face token and/or OpenAI token. (If you don't intend to use OpenAI's model, set `OPENAI_API_KEY` to any placeholder value.)

   ```
   HUGGINGFACE_TOKEN="your_huggingface_token_here"
   OPENAI_API_KEY="your_openai_token_here"
   ```

5. Edit the `config.py` file to configure the chunk size, overlap size, paths for the SQLite and FAISS databases, and model settings:
```python
import os

from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

class Config:
    # General settings
    CHUNK_SIZE = 300
    CHUNK_OVERLAP = 50
    EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
    SQLITE_DB_PATH = "database/metadata.db"
    FAISS_INDEX_PATH = "database/faiss.index"
    EMBEDDING_DIMENSION = 384  # Dimension of the embeddings used

    # Chatbot settings
    MODEL_TYPE = "gpt-4o-mini"
    LLAMA_MODEL_NAME = "meta-llama/Meta-Llama-3-8B"
    GPT4_MODEL_NAME = "gpt-4o-mini"
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    TEMPERATURE = 0.7
    MAX_TOKENS = 300
    TOP_P = 0.9
    FREQUENCY_PENALTY = 0.2
    PRESENCE_PENALTY = 0.2
    SYSTEM_PROMPT = (
        "You are a knowledgeable assistant. Your primary function is to provide "
        "information strictly based on the embedded documents. When answering queries, "
        "ensure your responses are concise and directly related to the content of the "
        "documents. If possible, always include the title of the source document in your "
        "response to indicate the origin of the information. If a query cannot be answered "
        "from the documents, state that explicitly. If a query requests general information "
        "or opinions, make it clear that your primary function is to provide information "
        "based on the embedded documents."
    )
```
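As a quick sanity check that `EMBEDDING_DIMENSION` matches the model, you can encode a sample sentence with Sentence Transformers (an illustrative snippet, not part of the repository):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embedding = model.encode("A quick sanity-check sentence.")
print(embedding.shape)  # (384,), matching Config.EMBEDDING_DIMENSION
```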
Usage:

1. Place your text documents in the `knowledge/text_documents/` directory.

2. Run the main application:

   ```bash
   python main.py
   ```

   The application will automatically create the necessary directories (`database/` and `knowledge/text_documents/`) if they don't exist.

3. Run the chatbot:

   ```bash
   python start_chatbot.py --mode chatbot
   ```

4. Test the query mechanism:

   ```bash
   python start_chatbot.py --mode test
   ```
To run the unit tests, execute:

```bash
python -m unittest discover -s tests
```
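A test along these lines might live in `tests/test_database.py`; the constructor and method names below are hypothetical placeholders, so adapt them to the actual `VectorDatabase` API:

```python
import unittest

from vector_db.database import VectorDatabase  # class under test

class TestVectorDatabase(unittest.TestCase):
    def test_insert_and_list(self):
        # insert_document/list_documents are assumed names, not the real API
        db = VectorDatabase()
        doc_id = db.insert_document("example.txt")
        self.assertIn(doc_id, [doc["id"] for doc in db.list_documents()])

if __name__ == "__main__":
    unittest.main()
```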
Module overview:

- `vector_db/database.py`: Manages the SQLite and FAISS database operations.
- `vector_db/document_processor.py`: Processes documents; extracts text, chunks it, generates embeddings, and stores them in the database.
- `chatbot/model_handler.py`: Manages the language models (GPT-4o-mini and LLaMA), generates embeddings using SBERT, and generates responses with whichever model the configuration selects.
- `chatbot/query_handler.py`: Handles document retrieval based on query embeddings.
- `chatbot/response_generator.py`: Generates chatbot responses using the LLaMA model or the OpenAI GPT-4o-mini model.
- `tests/test_database.py`: Contains unit tests for the VectorDatabase class.
- `main.py`: The main entry point of the application; processes all documents in the `knowledge/text_documents/` directory.
- `start_chatbot.py`: Integrates the chatbot functionality and provides a mechanism to test the query pipeline.
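To make the query flow concrete, here is a hedged sketch of how retrieval and response generation plausibly fit together (embed the query, search the FAISS index, and pass the top-ranked chunks to GPT); the wiring and helper names are illustrative, not the repository's actual API:

```python
import faiss
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
index = faiss.read_index("database/faiss.index")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(query: str, chunks: list[str], k: int = 5) -> str:
    """Illustrative pipeline. `chunks` maps FAISS row ids to chunk text;
    the real code would load this mapping from the SQLite metadata store."""
    query_vec = model.encode([query]).astype(np.float32)
    _, ids = index.search(query_vec, k)  # ids of the k nearest chunks
    context = "\n\n".join(chunks[i] for i in ids[0] if i != -1)
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer strictly from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return reply.choices[0].message.content
```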
Feel free to submit issues, fork the repository, and send pull requests. For major changes, please open an issue first to discuss what you would like to change.

The `main` branch is locked and read-only, so always create a new branch for your changes. Keep your branch names and commit messages clear, consistent, and easy to understand, and adhere to the following naming conventions.
Branch naming:

- Feature branches:
  - Use the prefix `feature_` followed by a brief, hyphen-separated description of the feature.
  - Example: `feature_user-authentication`
- Bugfix branches:
  - Use the prefix `bugfix_` followed by a brief, hyphen-separated description of the bug.
  - Example: `bugfix_fix-login-error`
- Hotfix branches:
  - Use the prefix `hotfix_`, with hyphen-separated words, for urgent fixes in production.
  - Example: `hotfix_security-patch`
- Release branches:
  - Use the prefix `release_` followed by the version number.
  - Example: `release_v1.2.0`
- Experimental branches:
  - Use the prefix `exp_`, with hyphen-separated words, for experimental features or spikes.
  - Example: `exp_new-ui-experiment`
- Documentation branches:
  - Use the prefix `docs_` followed by a brief, hyphen-separated description of the documentation update.
  - Example: `docs_update-readme`
Commit messages follow the `type: description` format:

- Type:
  - `feat`: a new feature.
  - `fix`: a bug fix.
  - `docs`: documentation changes.
  - `style`: code style changes (formatting, missing semicolons, etc.).
  - `refactor`: code refactoring without changing functionality.
  - `perf`: performance improvements.
  - `test`: adding or updating tests.
  - `chore`: other changes that don't modify source or test files.
- Description:
  - Keep the first line (summary) under 50 characters.
  - Use the imperative mood (e.g., "Add", "Fix", "Update").
- Body (optional):
  - Use it when a more detailed explanation is necessary.
  - Separate it from the summary with a blank line.
  - Explain the motivation for the change and contrast it with the previous behavior.
Example branch names:

- `feature_user-authentication`
- `bugfix_fix-login-error`
- `hotfix_security-patch`
- `release_v1.2.0`
- `exp_new-ui-experiment`
Example commit messages:

- `feat: Add user authentication`
- `fix: Correct login error when user is inactive`
- `docs: Update API documentation`
- `style: Format code according to new guidelines`
- `refactor: Reorganize user model`
- `perf: Improve query performance for dashboard`
- `test: Add unit tests for login functionality`
- `chore: Update dependencies`
This project is licensed under the Apache License Version 2.0. See the LICENSE file for details.