A collection of scrapers and tools for accessing Indian court data.
IMPORTANT NOTICE: The author takes no responsibility for the quality and performance of this code. All data obtained through these scrapers remains the copyright of its respective owners. The author claims ownership only of the code, not the data. Use at your own risk.
The following table shows the implementation status of scrapers for various Indian courts:
Court | Cause Lists | Judgments | Orders | Case Status |
---|---|---|---|---|
Delhi High Court | ✅ | ❌ | ❌ | ❌ |
Supreme Court | ❌ | ❌ | ❌ | ❌ |
Bombay High Court | ❌ | ❌ | ❌ | ❌ |
Madras High Court | ❌ | ❌ | ❌ | ❌ |
Calcutta High Court | ❌ | ❌ | ❌ | ❌ |
Karnataka High Court | ❌ | ❌ | ❌ | ❌ |
Allahabad High Court | ❌ | ❌ | ❌ | ❌ |
Gujarat High Court | ❌ | ❌ | ❌ | ❌ |
✅ - Implemented | ❌ - Not yet implemented
open-court-data-india/
├── data/ # Directory for downloaded data (not tracked by Git)
│ └── delhi_hc/ # Delhi High Court data
│ └── cause_lists/ # Delhi HC cause lists
├── db/ # Database-related files
│ ├── connector.py # Database connector class
│ └── schema.sql # PostgreSQL database schema
├── scrapers/ # Individual court scrapers
│ ├── delhi_hc/ # Delhi High Court scraper
│ │ └── cause_lists/ # Delhi HC cause list scraper
│ ├── supreme_court/ # Supreme Court scraper
│ └── ... # Other court scrapers
├── utils/ # Utility functions and classes
│ ├── __init__.py # Package initialization
│ ├── common.py # Common utility functions
│ ├── pdf_utils.py # PDF processing utilities
│ ├── html_utils.py # HTML processing utilities
│ ├── scraper_utils.py # Base scraper and scraper utilities
│ ├── config.py # Configuration utilities
│ ├── logger.py # Logging utilities
│ ├── cache.py # Caching utilities
│ ├── data_processor.py # Data processing utilities
│ └── gemini_utils.py # Google Gemini API utilities
├── examples/ # Example scripts
├── tests/ # Unit tests
├── config.yaml # Configuration file
├── requirements.txt # Python dependencies
├── run_pipeline.py # Complete data pipeline script
├── query_db.py # Database query script
├── tag_cases.py # Case tagging script
├── .env # Environment variables (not tracked by Git)
└── README.md # This file
- Standardized Data Structure: All scraped data follows a consistent format for easy integration
- Metadata Extraction: Automatically extracts metadata from court documents
- PDF Processing: Downloads and processes PDF documents from court websites
- Structured Data Extraction: Converts unstructured court data into structured formats using Google's Gemini API
- PostgreSQL Database Integration: Stores structured court data in a PostgreSQL database for efficient querying and analysis
- Case Tagging System: Automatically categorizes cases based on patterns in case numbers and titles
- Complete Data Pipeline: End-to-end workflow from scraping to processing, storage, and analysis
- Configurable Behavior: Customize scraper behavior through configuration files
- Parallel Processing: Efficiently downloads and processes multiple documents concurrently
  - Configurable number of worker threads for downloads and API processing
  - Thread-safe operations to prevent race conditions
  - Significant performance improvements for bulk operations
- Healthcheck Dashboard: Monitor the status of all scrapers with a visual dashboard
  - Real-time status indicators for each scraper
  - Run scrapers directly from the dashboard
  - API endpoints for integration with monitoring systems
The project includes a healthcheck dashboard to monitor the status of all scrapers:
# Install dashboard dependencies
pip install -r requirements-dashboard.txt
# Run the dashboard
python healthcheck.py
Then open your browser to http://localhost:5001 to access the dashboard.
- Visual Status Indicators: Green/red indicators show the status of each scraper
- Detailed Status Information: View last check time, last success time, and error messages
- Run Controls: Run scrapers directly from the dashboard
- Integrated Status Logging: Scrapers automatically log their status when they run
- Auto-refresh: Dashboard automatically refreshes every 30 seconds
- Uptime Tracking: Visual representation of uptime history with accurate percentage calculations
  - Shows 100% uptime when all runs are successful
  - Excludes "running" status from uptime calculations
  - Color-coded badges indicate health status (OK, Warning, Critical)
The healthcheck system has been designed to be lightweight and integrated:
- No Scheduled Pings: Unlike traditional monitoring systems, there are no automatic pings that could create unnecessary load
- Integrated Status Updates: Scrapers automatically update their status at key points:
  - When a scraper starts running (status: running)
  - When a scraper completes successfully (status: ok)
  - When a scraper encounters an error (status: error, with error details)
- Historical Data: The system maintains a history of scraper runs, allowing you to track performance over time
- Uptime Calculation: Automatically calculates uptime percentages based on successful runs
  - Excludes "running" status from calculations
  - Shows 100% when all completed runs are successful
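To illustrate how this integrated status logging might look, here is a minimal sketch. The `update_status` helper, its JSON-file backend, and the `run_with_status` wrapper are illustrative assumptions, not the project's actual API:

```python
# Illustrative sketch only: the helper below and its JSON-file backend are
# assumptions, not the project's actual status-logging API.
import json
import time
from pathlib import Path
from typing import Optional

STATUS_FILE = Path("healthcheck_status.json")  # hypothetical storage location


def update_status(scraper_id: str, status: str, error: Optional[str] = None) -> None:
    """Append a status record (running / ok / error) for a scraper."""
    history = json.loads(STATUS_FILE.read_text()) if STATUS_FILE.exists() else {}
    history.setdefault(scraper_id, []).append(
        {"status": status, "error": error, "timestamp": time.time()}
    )
    STATUS_FILE.write_text(json.dumps(history, indent=2))


def run_with_status(scraper_id: str, scraper) -> None:
    """Run a scraper and record its status at the key points described above."""
    update_status(scraper_id, "running")
    try:
        scraper.run()
        update_status(scraper_id, "ok")
    except Exception as exc:
        update_status(scraper_id, "error", error=str(exc))
        raise
```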
The healthcheck dashboard also provides API endpoints for integration with monitoring systems:
- `GET /api/scrapers`: List all available scrapers
- `GET /api/status`: Get status of all scrapers
- `GET /api/status/<scraper_id>`: Get status of a specific scraper
- `GET /api/check/<scraper_id>`: Check the health of a specific scraper
- `GET /api/run/<scraper_id>`: Run a specific scraper
Example:
# Get status of all scrapers
curl http://localhost:5001/api/status
# Check health of Delhi HC cause list scraper
curl http://localhost:5001/api/check/delhi_hc_cause_lists
# Run Delhi HC cause list scraper
curl http://localhost:5001/api/run/delhi_hc_cause_lists
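The same endpoints can be called from Python. A minimal sketch using the `requests` library; the shape of the JSON payloads depends on the dashboard implementation and is not documented here:

```python
# Assumes the dashboard is running locally on port 5001 and that the
# requests library is installed.
import requests

BASE_URL = "http://localhost:5001"

# Fetch the status of all scrapers
status = requests.get(f"{BASE_URL}/api/status", timeout=10).json()
print(status)

# Trigger a health check for the Delhi HC cause list scraper
check = requests.get(f"{BASE_URL}/api/check/delhi_hc_cause_lists", timeout=60)
print(check.status_code, check.json())
```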
1. Clone the repository:

   git clone https://github.com/varun-heman/open-court-data-india.git
   cd open-court-data-india

2. Create a virtual environment and install dependencies:

   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
from scrapers import DelhiHCScraper, DelhiHCCauseListScraper
# Initialize the scraper
scraper = DelhiHCScraper()
# Run the scraper
scraper.run()
# Or use the specialized cause list scraper
cause_list_scraper = DelhiHCCauseListScraper()
cause_list_scraper.run()
The project includes command-line interfaces for running the scrapers:
Each scraper can be run directly from the command line:
# Run the Delhi High Court cause list scraper
python -m scrapers.delhi_hc.cause_lists.cause_list_scraper
# With custom options
python -m scrapers.delhi_hc.cause_lists.cause_list_scraper --output /path/to/output --config /path/to/config.yaml --debug
Available options:
- `--output`, `-o`: Specify a custom output directory
- `--config`, `-c`: Specify a custom configuration file
- `--debug`, `-d`: Enable debug logging
The project includes example scripts to run multiple scrapers:
# Run all Delhi High Court scrapers
python examples/run_delhi_hc_scraper.py
# Run only the cause list scraper
python examples/run_delhi_hc_scraper.py --type cause_list
# Run only the base scraper
python examples/run_delhi_hc_scraper.py --type base
# With custom options
python examples/run_delhi_hc_scraper.py --output /path/to/output --config /path/to/config.yaml --debug
Available options:
- `--output`, `-o`: Specify a custom output directory
- `--config`, `-c`: Specify a custom configuration file
- `--debug`, `-d`: Enable debug logging
- `--type`, `-t`: Specify which scraper to run (`all`, `base`, or `cause_list`)
By default, the scrapers save data to the following directory structure:
data/
└── delhi_hc/ # Delhi High Court data
└── cause_lists/ # Cause lists for Delhi HC
└── YYYY-MM-DD/ # Date-specific directory
├── metadata.json # Metadata for all documents
├── *.pdf # Original PDF documents
└── *.md # Structured markdown files
Each specialized scraper (like cause lists, judgments, etc.) creates its own subdirectory under the court directory to maintain a clean organization of different document types.
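A small sketch of how downloaded data can be inspected. The path follows the layout above (with an example date), and the exact fields inside `metadata.json` are not documented here, so the snippet only loads and prints whatever it finds:

```python
import json
from pathlib import Path

# Date-specific directory following the layout above (example date)
date_dir = Path("data/delhi_hc/cause_lists/2025-03-23")

# Load the per-date metadata file; its schema depends on the scraper
metadata = json.loads((date_dir / "metadata.json").read_text())
print(metadata)

# List the downloaded PDFs and generated markdown files
for path in sorted(date_dir.glob("*")):
    print(path.name)
```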
The behavior of scrapers can be customized through the `config.yaml` file or by passing parameters directly to the scraper constructors:
# Example configuration
output_dir: "data"
log_level: "INFO"
cache_enabled: true
cache_expiry: 86400 # 24 hours in seconds
# Parallel processing options
parallel_downloads: true
download_workers: 5
parallel_processing: true
processing_workers: 3
# API configuration
gemini_api_key: "your-api-key"
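As an alternative to editing `config.yaml`, parameters can be passed directly to a scraper constructor. A minimal sketch; the keyword arguments shown follow the `BaseScraper` pattern used later in this README (`output_dir`, `config_file`), and other options may vary per scraper:

```python
from scrapers import DelhiHCCauseListScraper

# output_dir and config_file mirror the BaseScraper constructor arguments;
# treat any other keyword arguments as scraper-specific.
scraper = DelhiHCCauseListScraper(
    output_dir="data",
    config_file="config.yaml",
)
scraper.run()
```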
The project now includes PostgreSQL database integration for storing structured court data, with a complete data pipeline:
- Automated data structuring with Gemini API
- PostgreSQL database storage
- Query tools for data retrieval and analysis
- Case tagging functionality
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Delhi HC │ │ Gemini API │ │ PostgreSQL │ │ Frontend UI │
│ Scraper │───>│ Processing │───>│ Database │<───│ (React) │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
▲
│
┌────┴────┐
│ FastAPI │
│ Backend │
└─────────┘
- Install PostgreSQL if not already installed:
# macOS (using Homebrew)
brew install postgresql
brew services start postgresql
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install postgresql postgresql-contrib
sudo systemctl start postgresql
- Create database and user:
# Connect to PostgreSQL
psql postgres
# Create user and database
CREATE USER ecourts WITH PASSWORD 'your_password';
CREATE DATABASE ecourts OWNER ecourts;
\q
- Initialize the database schema:
# Set environment variables
export DB_USER=ecourts
export DB_PASSWORD=your_password
export DB_HOST=localhost
export DB_PORT=5432
export DB_NAME=ecourts
# Initialize database schema
psql -U ecourts -d ecourts -f db/schema.sql
The `run_pipeline.py` script runs the entire workflow in a single command:
# Run the complete pipeline for today
python run_pipeline.py
# Run the pipeline for a specific date
python run_pipeline.py --date 2025-03-23
# Run the pipeline for multiple days
python run_pipeline.py --days 7
# Skip specific steps
python run_pipeline.py --no-scrape # Skip scraping, only process and tag
python run_pipeline.py --no-process # Skip processing, only scrape and tag
python run_pipeline.py --no-tag # Skip tagging, only scrape and process
# Enable parallel processing
python run_pipeline.py --parallel --download-workers 5 --processing-workers 3
# Enable debug logging
python run_pipeline.py --debug
The pipeline handles:
- Scraping Delhi HC cause lists (with optional date filtering)
- Processing PDFs with Gemini API
- Storing structured data in PostgreSQL
- Automatically tagging cases based on patterns
The `query_db.py` script provides a command-line interface for querying the database:
# List all available dates
python query_db.py --list-dates
# Query by date
python query_db.py --date 2025-03-23 --pretty
# Filter by bench number
python query_db.py --date 2025-03-23 --bench "COURT NO. 01" --pretty
# Search for a specific case
python query_db.py --case "W.P.(C)-6483/2021" --pretty
# Filter by tags
python query_db.py --tag "education" --pretty
# Output as JSON
python query_db.py --date 2025-03-23 --json
The `tag_cases.py` script allows you to add tags to cases in the database:
# List all tags in the database
python tag_cases.py --list-tags
# Manually tag a specific case
python tag_cases.py --case-id "case_uuid" --add-tags "important,urgent"
# Auto-tag all cases based on patterns
python tag_cases.py --auto-tag
# Remove tags from a case
python tag_cases.py --case-id "case_uuid" --remove-tags "urgent"
Auto-tagging rules include:
- Case numbers with "W.P." are tagged as "writ_petition"
- Case numbers with "CRL.M.C." are tagged as "criminal"
- Case numbers with "CS(COMM)" are tagged as "commercial"
- Titles containing "education" are tagged as "education"
- And many more predefined patterns
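As an illustration of how such pattern-based rules work, here is a minimal sketch; the actual rules live in `tag_cases.py` and cover many more patterns than the few shown here:

```python
import re
from typing import List

# Illustrative patterns only; the real rule set in tag_cases.py is larger.
CASE_NUMBER_TAGS = {
    r"W\.P\.": "writ_petition",
    r"CRL\.M\.C\.": "criminal",
    r"CS\(COMM\)": "commercial",
}
TITLE_TAGS = {
    r"\beducation\b": "education",
}


def auto_tag(case_number: str, title: str) -> List[str]:
    """Return tags derived from patterns in the case number and title."""
    tags = [tag for pattern, tag in CASE_NUMBER_TAGS.items()
            if re.search(pattern, case_number, re.IGNORECASE)]
    tags += [tag for pattern, tag in TITLE_TAGS.items()
             if re.search(pattern, title, re.IGNORECASE)]
    return tags


print(auto_tag("W.P.(C)-6483/2021", "ABC vs Department of Education"))
# ['writ_petition', 'education']
```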
The database connection requires the following environment variables:
DB_USER=ecourts
DB_PASSWORD=your_password
DB_HOST=localhost
DB_PORT=5432
DB_NAME=ecourts
GEMINI_API_KEY=your_gemini_api_key
You can set these in a `.env` file in the project root or export them directly in your shell.
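A minimal connection sketch using `psycopg2` and the variables above; the project ships its own connector class in `db/connector.py`, so this bypasses it and is only meant to show how the variables map onto a connection:

```python
# Assumes psycopg2 is installed and the DB_* environment variables are set.
import os
import psycopg2

conn = psycopg2.connect(
    user=os.environ.get("DB_USER", "ecourts"),
    password=os.environ["DB_PASSWORD"],
    host=os.environ.get("DB_HOST", "localhost"),
    port=os.environ.get("DB_PORT", "5432"),
    dbname=os.environ.get("DB_NAME", "ecourts"),
)
with conn, conn.cursor() as cur:
    # Simple sanity check that the connection works
    cur.execute("SELECT current_database(), current_user;")
    print(cur.fetchone())
conn.close()
```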
The project includes several utility modules to simplify common tasks:
General utility functions for directory management, filename cleaning, date extraction, etc.
from utils import ensure_directory, clean_filename, extract_date_from_text
Functions for extracting text and structured data from PDF files.
from utils import extract_text_from_pdf, extract_date_from_pdf, is_cause_list_pdf
Functions for parsing HTML pages and extracting relevant information.
from utils import extract_links_from_html, is_navigation_page, extract_table_data_from_html
Base class for all court scrapers with common functionality.
from utils import BaseScraper
class MyCourtScraper(BaseScraper):
    def __init__(self, config_file=None):
        super().__init__("My Court", "https://mycourt.gov.in", config_file=config_file)

    def run(self):
        # Implement scraper logic here
        pass
Configuration management for scrapers.
from utils import ScraperConfig
config = ScraperConfig("config.yaml", "delhi_hc")
timeout = config.get("timeout", 30)
Configurable logging with context.
from utils import setup_logger, get_logger_with_context
setup_logger(level="INFO")
logger = get_logger_with_context("my_scraper", {"court": "Delhi HC"})
logger.info("Starting scraper")
Caching for expensive operations.
from utils import ScraperCache, cached
cache = ScraperCache(".cache", 86400)
@cached(key_prefix="my_function")
def expensive_function(arg1, arg2):
    # Expensive operation
    return result
To add a new court scraper:
1. Create a new directory in the `scrapers` directory:

   scrapers/
   └── new_court/
       ├── __init__.py
       ├── new_court_scraper.py
       └── cause_lists/
           ├── __init__.py
           └── cause_list_scraper.py

2. Implement the base scraper by extending `BaseScraper`:

   from utils import BaseScraper

   class NewCourtScraper(BaseScraper):
       def __init__(self, output_dir=None, config_file=None):
           super().__init__("New Court", "https://newcourt.gov.in", output_dir=output_dir, config_file=config_file)

       def run(self):
           # Implement scraper logic here
           pass

3. Implement specialized scrapers for different document types:

   from ..new_court_scraper import NewCourtScraper

   class NewCourtCauseListScraper(NewCourtScraper):
       def __init__(self, output_dir=None, config_file=None):
           super().__init__(output_dir=output_dir, config_file=config_file)
           # Set up cause list specific configuration

       def run(self):
           # Implement cause list scraper logic
           pass

4. Update the imports in the package `__init__.py` files to expose the new scrapers.

5. Add court-specific settings to `config.yaml`.
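Once the `__init__.py` files export the new classes, the scraper can be used like the existing ones. A short sketch using the illustrative class name from the steps above:

```python
from scrapers import NewCourtCauseListScraper

# Constructor arguments follow the BaseScraper pattern shown above.
scraper = NewCourtCauseListScraper(output_dir="data", config_file="config.yaml")
scraper.run()
```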
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.