AI Observability & Evaluation
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Langchain, Autogen, AG2, and CamelAI.
Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with PostgreSQL or SQLite
Test your LLM-powered apps with TypeScript. No API key required.
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding
Evalica, your favourite evaluation toolkit
Benchmarking Large Language Models for FHIR
An implementation of Anthropic's paper and essay "A statistical approach to model evaluations"; a minimal worked sketch of the statistical idea follows this list.
The OAIEvals Collector: A robust, Go-based metric collector for EVALS data. Supports Kafka, Elastic, Loki, InfluxDB, TimescaleDB integrations, and containerized deployment with Docker. Streamlines OAI-Evals data management efficiently with a low barrier of entry!
Develop better LLM apps by testing different models and prompts in bulk (a sketch of this grid-testing pattern also appears after this list).
Open Source Video Understanding API and Large Vision Model Observability Platform.
The Modelmetry Python SDK allows developers to easily integrate Modelmetry’s advanced guardrails and monitoring capabilities into their LLM-powered applications.
A GitHub Action to parse LLM eval results, display and aggregate them.
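
A minimal, illustrative sketch of the statistical idea behind "A statistical approach to model evaluations": treat each eval question as a sample, report accuracy with a standard error and a 95% confidence interval, and compare two models with a paired difference on the same questions. The data and function names below are made up for illustration and do not come from the linked repository.

```python
import math

def mean_and_ci(scores, z=1.96):
    """Mean score with a normal-approximation 95% confidence interval.

    Each entry in `scores` is one eval question scored 0/1 (or any bounded value).
    """
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance (ddof=1) and standard error of the mean.
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sem = math.sqrt(var / n)
    return mean, (mean - z * sem, mean + z * sem)

def paired_difference(scores_a, scores_b, z=1.96):
    """Compare two models on the same questions via per-question score differences.

    Pairing usually shrinks the standard error relative to comparing two
    independent means, because per-question difficulty cancels out.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean_and_ci(diffs, z=z)

# Toy data: 0/1 correctness for two models on the same 10 questions.
model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
model_b = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

acc_a, ci_a = mean_and_ci(model_a)
diff, ci_diff = paired_difference(model_a, model_b)
print(f"Model A accuracy: {acc_a:.2f}, 95% CI {ci_a}")
print(f"A - B difference: {diff:.2f}, 95% CI {ci_diff}")
```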
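
Also illustrative only: a sketch of the "models and prompts in bulk" testing pattern mentioned above, assuming a hypothetical `call_model` stub; swap in a real client call and a real scoring function.

```python
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    prompt_id: str
    output: str
    passed: bool

def call_model(model: str, prompt: str) -> str:
    # Hypothetical stub: replace with a real client call (OpenAI, Anthropic, local, ...).
    return f"[{model}] echo: {prompt}"

def run_grid(models, prompts, check) -> list[Result]:
    """Run every prompt against every model and score each output."""
    results = []
    for model in models:
        for prompt_id, prompt in prompts.items():
            output = call_model(model, prompt)
            results.append(Result(model, prompt_id, output, check(prompt_id, output)))
    return results

# Toy grid: two models, two prompts, and a trivial keyword check.
models = ["model-small", "model-large"]
prompts = {"greet": "Say hello.", "sum": "What is 2 + 2?"}
check = lambda prompt_id, output: "echo" in output  # replace with a real assertion or judge

for r in run_grid(models, prompts, check):
    print(f"{r.model:12s} {r.prompt_id:6s} passed={r.passed}")
```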