BioRAG is a specialized Retrieval-Augmented Generation (RAG) system designed to assist researchers in oocyte studies. It combines OpenAI embeddings, a Chroma vector store, and GPT-3.5-turbo with PDF processing of scientific literature to answer research questions with citations.
- 🔍 Semantic search across scientific papers
- 💬 Interactive research-focused chat interface
- 📚 Real-time citation tracking
- 🎨 Intuitive user interface
- 📊 Persistent session management
- 🔄 System reset functionality
BioRAG uses a representation-based similarity approach for document retrieval. Key characteristics:
- Documents and queries are independently encoded into dense vector representations
- Retrieval is performed through vector similarity matching (e.g., cosine similarity)
- Efficient for large-scale scientific document retrieval
- Optimized for research paper processing
- Implemented using OpenAI embeddings and Chroma vector store
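To make the retrieval idea above concrete, here is a minimal sketch of cosine-similarity ranking over dense vectors. It uses hand-written 3-dimensional toy vectors instead of real OpenAI embeddings, and the helper names (`cosine_similarity`, `top_k`) are illustrative assumptions, not part of the BioRAG code. In the actual system this ranking is delegated to Chroma.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, score) pairs for the k documents most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; real embedding vectors have far more dimensions.
doc_vecs = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.6, 0.8, 0.0],
    "paper_c": [0.0, 0.0, 1.0],
}
query_vec = [0.8, 0.6, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.2f}")
```

Because documents and queries are encoded independently, the document vectors can be computed once and stored, and only the query needs to be embedded at search time.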
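from langchain.text_splitter import RecursiveCharacterTextSplitter


class DocumentProcessor:
    def __init__(self):
        # Split papers into chunks of up to 1000 characters, overlapping by 200
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len,
        )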
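As a rough illustration of what `chunk_size=1000` and `chunk_overlap=200` mean, the sketch below cuts text into fixed windows that share 200 characters with their neighbour. This is a deliberately simplified stand-in: the real `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, so its chunk boundaries will differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows where consecutive windows overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 2500 characters of distinguishable filler text
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)

print([len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.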
- Specialized PDF processor for scientific literature
- Recursive character text splitting
- Optimized chunk size and overlap for research papers
- ChromaDB integration for similarity search
- Efficient document embedding storage
- Persistent vector database
- Optimized for research paper embeddings
- Language Model: GPT-3.5-turbo
- Conversational memory for context retention
- Scientific context-aware retrieval
- Citation-aware response generation
flowchart LR
docs[Documents] --> loader[DocumentLoader]
loader --> splitter[TextSplitter]
splitter --> embeddings[OpenAIEmbeddings]
embeddings --> chroma[ChromaDB]
query[User Query] --> retriever[Retriever]
chroma --> retriever
retriever --> chain[ConversationalRetrievalChain]
memory[ConversationMemory] --> chain
llm[ChatOpenAI] --> chain
chain --> response[Response]
style docs fill:#f9d5e5
style chroma fill:#eeac99
style llm fill:#84b6f4
style response fill:#77dd77
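The query-time half of the diagram (retriever, memory, and LLM feeding one chain) can be sketched without any external services. The class below is a toy stand-in, not LangChain's `ConversationalRetrievalChain`: `ToyConversationalChain`, `fake_retrieve`, and `fake_llm` are all hypothetical names, and the prompt layout is an assumption made for illustration. LangChain's chain also rewrites follow-up questions into standalone ones before retrieval, which is omitted here.

```python
from typing import Callable, List, Tuple


class ToyConversationalChain:
    """Illustrative stand-in for the chain in the diagram."""

    def __init__(self, retrieve: Callable[[str], List[str]], llm: Callable[[str], str]):
        self.retrieve = retrieve
        self.llm = llm
        self.history: List[Tuple[str, str]] = []  # plays the role of the memory box

    def ask(self, question: str) -> str:
        # 1. Fetch relevant chunks, 2. build a prompt with history, 3. call the model
        context = "\n".join(self.retrieve(question))
        past = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.history)
        prompt = f"History:\n{past}\n\nContext:\n{context}\n\nQuestion: {question}"
        answer = self.llm(prompt)
        self.history.append((question, answer))  # remember this turn for the next one
        return answer


# Stub components so the sketch runs without an API key or a vector database.
seen_prompts: List[str] = []


def fake_retrieve(question: str) -> List[str]:
    return ["chunk about oocyte maturation"]


def fake_llm(prompt: str) -> str:
    seen_prompts.append(prompt)
    return f"stub answer #{len(seen_prompts)}"


chain = ToyConversationalChain(fake_retrieve, fake_llm)
first = chain.ask("What affects oocyte maturation?")
second = chain.ask("Which of those factors is best studied?")
print(second)
```

The second prompt contains the first question and answer, which is what lets the model resolve "those factors" in the follow-up.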
project/
├── app.py                    # Streamlit application
├── process_pdfs.py           # PDF processing
├── src/
│   ├── document_loader.py    # Document processing
│   ├── embeddings.py         # Vector embeddings
│   └── rag_pipeline.py       # RAG implementation
The application is deployed on Streamlit Cloud with:
- Automated GitHub-based deployment
- Secure environment variable management
- Continuous availability
- Protected API key handling
- Python 3.8+
- OpenAI API key
- Dependencies: streamlit, langchain, chromadb, openai
git clone https://github.com/qianxu05172019/biorag-oocyte.git
cd biorag-oocyte
pip install -r requirements.txt
- Create .env in the project root:
OPENAI_API_KEY=your-api-key
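Python does not read `.env` files on its own, so the key has to reach the process environment somehow (for example via a loader such as python-dotenv; whether this project uses one is an assumption). A small startup check like the sketch below can catch a missing or placeholder key before any OpenAI call is made. The function name `require_api_key` is hypothetical.

```python
import os


def require_api_key(env=None):
    """Return OPENAI_API_KEY from the given mapping, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == "your-api-key":
        raise RuntimeError(
            "OPENAI_API_KEY is missing or still set to the placeholder value"
        )
    return key


# Passing a dict makes the check easy to try without touching the real environment.
example_key = require_api_key({"OPENAI_API_KEY": " sk-example "})
print(example_key)
```

Calling `require_api_key()` with no argument checks the real process environment.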
from document_loader import DocumentProcessor
from embeddings import VectorStoreManager
from rag_pipeline import RAGPipeline
# Process research papers
processor = DocumentProcessor()
docs = processor.load_pdfs("path/to/papers")
# Create vector store
vector_store_manager = VectorStoreManager()
vector_store = vector_store_manager.create_vector_store(docs)
# Initialize RAG pipeline
rag = RAGPipeline(vector_store)
# Research query
response = rag.ask("What are the key factors affecting oocyte maturation?")
- Enhanced Citations
  - Detailed tracking system
  - Export functionality
  - Citation network visualization
- Analytics Integration
  - Research trend analysis
  - Document clustering
  - Knowledge graph visualization
- System Enhancements
  - Multi-model support (GPT-4, Claude)
  - Automated metadata extraction
  - Enhanced conversation memory
  - Chat history export
- Knowledge Base Updates
  - Meeting notes integration
  - Experiment results tracking
  - Oocyte stage imaging
Access the live application: BioRAG Oocyte Expert
We welcome contributions! Please submit pull requests for any improvements.
This project is licensed under the MIT License - see the LICENSE file for details.
Developed as a showcase of Machine Learning Engineering and Data Science capabilities, with focus on NLP, RAG systems, and LLM integration in biomedical research.