This repository demonstrates how to use Docling and LlamaIndex to build a RAG (Retrieval-Augmented Generation) system with personal documents in PDF format. The goal is to allow language models to access and process information stored locally in personal documents to answer questions accurately and efficiently.
- RAG for personal documents: The code allows you to ask questions directly about the content of personal PDF documents.
- Docling: Used for document processing and analysis.
- LlamaIndex: Responsible for indexing and building an efficient data retrieval pipeline.
- README.md: This documentation file.
- run_first_prepare_data.ipynb: A Jupyter Notebook dedicated to preparing the data for the Retrieval-Augmented Generation (RAG) system.
- run_second_qa.ipynb: A Jupyter Notebook that implements the question-answering (QA) capabilities of the RAG system.
- Loading PDFs: Docling loads and processes the files in PDF format.
- Content Indexing: Documents are processed and indexed using LlamaIndex.
- Query and Response Generation: Ask questions about the content of the documents and get answers based on the indexed text.
- Simple Interface: Implemented in Jupyter Notebook to facilitate execution and understanding of the workflow.
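The workflow listed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code from the notebooks: it assumes the `docling` and `llama-index` packages are installed and that LlamaIndex has an LLM and an embedding model available (by default it looks for an OpenAI API key). The function names `list_pdfs`, `build_index`, and `ask`, and the `file_name` metadata key, are illustrative.

```python
from pathlib import Path


def list_pdfs(pdf_dir: str) -> list[Path]:
    """Return the PDF files found directly in pdf_dir, sorted by path."""
    return sorted(
        p for p in Path(pdf_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def build_index(pdf_dir: str = "input/pdfs"):
    """Convert each PDF to Markdown with Docling and index it with LlamaIndex."""
    # Imported lazily so list_pdfs works even before the packages are installed.
    from docling.document_converter import DocumentConverter
    from llama_index.core import Document, VectorStoreIndex

    converter = DocumentConverter()  # reuse one converter for all files
    documents = []
    for path in list_pdfs(pdf_dir):
        markdown = converter.convert(str(path)).document.export_to_markdown()
        # Keep the file name so answers can be traced back to a document.
        documents.append(Document(text=markdown, metadata={"file_name": path.name}))

    # In-memory vector index; assumes an embedding model is configured.
    return VectorStoreIndex.from_documents(documents)


def ask(index, question: str) -> str:
    """Retrieve relevant chunks and let the configured LLM answer the question."""
    return str(index.as_query_engine().query(question))
```

With PDFs in place and a model configured, `ask(build_index(), "What is this document about?")` would return the generated answer as a string. The sketch keeps the index in memory; persisting it to disk is left out for brevity.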
Make sure you have the following items installed in your environment:
- Python: Version 3.12 or higher.
- Miniconda:
  - Install Miniconda according to your operating system.
  - After downloading, follow the installation instructions available on the official website.
- VSCode:
  - Download and install Visual Studio Code.
  - Install the recommended extensions:
    - Python Extension: For Python support.
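The prerequisites above can be checked quickly from Python. This is a small sketch rather than part of the repository: it assumes that `conda` and the VSCode command-line launcher `code` are on your `PATH`, which depends on how they were installed, and the function name `check_prerequisites` is illustrative.

```python
import shutil
import sys


def check_prerequisites() -> dict[str, bool]:
    """Report which prerequisites are visible from the current interpreter."""
    return {
        # The README asks for Python 3.12 or higher.
        "python>=3.12": sys.version_info >= (3, 12),
        # shutil.which returns None when the executable is not on PATH.
        "conda": shutil.which("conda") is not None,
        "code": shutil.which("code") is not None,
    }


for name, found in check_prerequisites().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" entry for `conda` or `code` only means the command is not on `PATH`; the tool may still be installed.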
- Clone this repository:

      git clone https://github.com/homerokzam/rag-docling-llamaindex.git
      cd rag-docling-llamaindex

- Create and activate the virtual environment using Miniconda:

      conda create -n venv-rag-docling-llamaindex python=3.12.7
      conda activate venv-rag-docling-llamaindex

- Install the Jupyter kernel and dependencies:

      pip install ipykernel
      pip install -r requirements.txt

- Open the repository in VSCode:

      code .

- Ensure that the Python and Jupyter extensions are installed in VSCode.
- In the Jupyter Notebook, select the kernel of the virtual environment you created (venv-rag-docling-llamaindex).
- Create the directories: database, input/pdfs, and input/mds.
- Copy your PDF files to the directory: input/pdfs.
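The last two steps above can also be done from a notebook cell. The sketch below assumes it runs from the repository root. The directory names come from the steps above, while the helper name `prepare_workspace` is illustrative.

```python
from pathlib import Path

# Directories named in the setup steps.
REQUIRED_DIRS = ("database", "input/pdfs", "input/mds")


def prepare_workspace(root: str = ".") -> list[Path]:
    """Create the expected directories under root and return the PDFs found."""
    base = Path(root)
    for name in REQUIRED_DIRS:
        # parents=True creates "input" as needed; exist_ok avoids errors on reruns.
        (base / name).mkdir(parents=True, exist_ok=True)
    # Matches "*.pdf" only; on case-sensitive file systems "*.PDF" is not included.
    return sorted((base / "input" / "pdfs").glob("*.pdf"))
```

Calling `prepare_workspace()` is safe to repeat. An empty result means no PDF files have been copied to input/pdfs yet.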