BioRAG: Intelligent Research Assistant for Oocyte Studies

Overview

BioRAG is a specialized Retrieval-Augmented Generation (RAG) system designed to assist researchers in oocyte studies. It combines advanced NLP technologies with scientific literature processing to provide intelligent research assistance.

Features

🔍 Semantic search across scientific papers
💬 Interactive research-focused chat interface
📚 Real-time citation tracking
🎨 Intuitive user interface
📊 Persistent session management
🔄 System reset functionality

Technical Architecture

Retrieval Mechanism

BioRAG implements a Representation-based Similarity approach for document retrieval:

Key characteristics:

Documents and queries are independently encoded into dense vector representations
Retrieval is performed through vector similarity matching (e.g., cosine similarity)
Efficient for large-scale scientific document retrieval
Optimized for research paper processing
Implemented using OpenAI embeddings and Chroma vector store

Core Components

1. Document Processing

class DocumentProcessor:
    def __init__(self):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )

Specialized PDF processor for scientific literature
Recursive character text splitting
Optimized chunk size and overlap for research papers

2. Vector Store Management

ChromaDB integration for similarity search
Efficient document embedding storage
Persistent vector database
Optimized for research paper embeddings

3. RAG Pipeline

Language Model: GPT-3.5-turbo
Conversational memory for context retention
Scientific context-aware retrieval
Citation-aware response generation

System Architecture

flowchart LR
    docs[Documents] --> loader[DocumentLoader]
    loader --> splitter[TextSplitter]
    splitter --> embeddings[OpenAIEmbeddings]
    embeddings --> chroma[ChromaDB]
    
    query[User Query] --> retriever[Retriever]
    chroma --> retriever
    retriever --> chain[ConversationalRetrievalChain]
    
    memory[ConversationMemory] --> chain
    llm[ChatOpenAI] --> chain
    chain --> response[Response]

    style docs fill:#f9d5e5
    style chroma fill:#eeac99
    style llm fill:#84b6f4
    style response fill:#77dd77

Loading

Project Structure

project/
├── app.py                 # Streamlit application
├── process_pdfs.py        # PDF processing
├── src/
│   ├── document_loader.py # Document processing
│   ├── embeddings.py      # Vector embeddings
│   └── rag_pipeline.py    # RAG implementation

Deployment

Cloud Deployment

The application is deployed on Streamlit Cloud with:

Automated GitHub-based deployment
Secure environment variable management
Continuous availability
Protected API key handling

Local Development

Prerequisites

Python 3.8+
OpenAI API key
Dependencies: streamlit, langchain, chromadb, openai

Installation

git clone https://github.com/qianxu05172019/biorag-oocyte.git
cd biorag
pip install -r requirements.txt

Configuration

Create .env in project root:

OPENAI_API_KEY=your-api-key

Usage Example

from document_loader import DocumentProcessor
from embeddings import VectorStoreManager
from rag_pipeline import RAGPipeline

# Process research papers
processor = DocumentProcessor()
docs = processor.load_pdfs("path/to/papers")

# Create vector store
vector_store_manager = VectorStoreManager()
vector_store = vector_store_manager.create_vector_store(docs)

# Initialize RAG pipeline
rag = RAGPipeline(vector_store)

# Research query
response = rag.ask("What are the key factors affecting oocyte maturation?")

Future Development

Planned Features

Enhanced Citations
- Detailed tracking system
- Export functionality
- Citation network visualization
Analytics Integration
- Research trend analysis
- Document clustering
- Knowledge graph visualization
System Enhancements
- Multi-model support (GPT-4, Claude)
- Automated metadata extraction
- Enhanced conversation memory
- Chat history export
Knowledge Base Updates
- Meeting notes integration
- Experiment results tracking
- Oocyte stage imaging

Live Demo

Access the live application: BioRAG Oocyte Expert

Contributing

We welcome contributions! Please submit pull requests for any improvements.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Developed as a showcase of Machine Learning Engineering and Data Science capabilities, with focus on NLP, RAG systems, and LLM integration in biomedical research.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

BioRAG: Intelligent Research Assistant for Oocyte Studies

Overview

Features

Technical Architecture

Retrieval Mechanism

Core Components

1. Document Processing

2. Vector Store Management

3. RAG Pipeline

System Architecture

Project Structure

Deployment

Cloud Deployment

Local Development

Prerequisites

Installation

Configuration

Usage Example

Future Development

Planned Features

Live Demo

Contributing

License

Files

README.md

Latest commit

History

README.md

File metadata and controls

BioRAG: Intelligent Research Assistant for Oocyte Studies

Overview

Features

Technical Architecture

Retrieval Mechanism

Core Components

1. Document Processing

2. Vector Store Management

3. RAG Pipeline

System Architecture

Project Structure

Deployment

Cloud Deployment

Local Development

Prerequisites

Installation

Configuration

Usage Example

Future Development

Planned Features

Live Demo

Contributing

License