Metaprise Agent OS v4.2 — now with built-in observability for every agent.
FREE WITH EVERY PLAN

Built-In Observability Layer

Full-chain tracing, real-time evaluation, AI-powered debugging, and human-in-the-loop annotation — all built into the Agent OS at zero extra cost.

TRACE VIEWER — LIVE PREVIEW
mission.start AGENT 0ms
llm.call GPT-4O 234ms
tool.execute SQL 89ms
eval.score PASS 12ms
llm.call CLAUDE 312ms
response.final 200 647ms
EVERY STEP · EVERY TOKEN · EVERY DECISION
9 Core Tools
100% Full-Chain Tracing
Real-Time LLM-as-Judge Eval
$0 Extra Cost

Nine tools for complete agent visibility

From production tracing to offline regression testing, every tool you need to understand, evaluate, and debug your agents — built in, not bolted on.

01

Full-Chain Trace Viewer

Every decision, visible

Visualize every decision, tool call, and model response for every step of every agent mission. Drill into token usage, latency, and cost at each node.

Debugging · Latency · Cost · Audit
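The per-node metrics the trace viewer surfaces roll up into mission-level totals. A minimal sketch of that aggregation, using the latencies from the live preview above (the token and cost figures are invented for the example, and `Span` is an illustrative record type, not the Metaprise SDK):

```python
from dataclasses import dataclass

@dataclass
class Span:
    name: str          # e.g. "llm.call", "tool.execute"
    latency_ms: int
    tokens: int
    cost_usd: float

# Spans mirroring the live-preview trace; token/cost values are illustrative.
spans = [
    Span("mission.start", 0, 0, 0.0),
    Span("llm.call", 234, 512, 0.0051),
    Span("tool.execute", 89, 0, 0.0),
    Span("eval.score", 12, 64, 0.0006),
    Span("llm.call", 312, 720, 0.0072),
]

def mission_rollup(spans):
    """Aggregate per-node metrics into mission-level totals."""
    return {
        "latency_ms": sum(s.latency_ms for s in spans),
        "tokens": sum(s.tokens for s in spans),
        "cost_usd": round(sum(s.cost_usd for s in spans), 4),
    }

print(mission_rollup(spans))  # latency sums to the 647ms shown at response.final
```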
02

Online Evaluation

LLM-as-Judge, real-time

Real-time scoring of production traffic across dimensions: hallucination, reasoning quality, safety, and more. Catch regressions the moment they appear.

Hallucination · Reasoning · Safety
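The shape of online LLM-as-Judge scoring can be sketched as follows. The judge call here is a stub standing in for an evaluation-model request; the dimension names come from the card above, and the threshold is an assumption, not the product's default:

```python
DIMENSIONS = ("hallucination", "reasoning", "safety")

def judge(dimension, response_text):
    # Stub: stand-in for an LLM judge returning a 0.0-1.0 score.
    if dimension == "hallucination" and "unverified" in response_text:
        return 0.0
    return 1.0

def score_trace(response_text, threshold=0.8):
    """Score one production response across all dimensions; any dimension
    below the threshold fails the trace."""
    scores = {d: judge(d, response_text) for d in DIMENSIONS}
    verdict = "PASS" if min(scores.values()) >= threshold else "FAIL"
    return scores, verdict
```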
03

Offline Eval & Regression Tests

Ship with confidence

Run offline evaluation and regression test suites before every deploy. Compare versions side-by-side to ensure updates never degrade quality.

CI/CD · Regression · Version Compare
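A side-by-side version comparison reduces to checking, per test case, whether the candidate scored worse than the baseline. A minimal sketch (the tolerance and score values are illustrative assumptions):

```python
def compare_versions(baseline, candidate, tolerance=0.02):
    """Flag test cases where the candidate version scores worse than
    baseline by more than the tolerance — the deploy gate fails if any."""
    return {case: (baseline[case], candidate.get(case, 0.0))
            for case in baseline
            if candidate.get(case, 0.0) < baseline[case] - tolerance}

baseline = {"billing_faq": 0.92, "refund_flow": 0.88}
candidate = {"billing_faq": 0.93, "refund_flow": 0.81}
print(compare_versions(baseline, candidate))  # flags refund_flow only
```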
04

Insights Agent

AI-powered failure analysis

An AI agent that automatically analyzes your traces, clusters failure patterns, and surfaces emerging issues before they become systemic problems.

Clustering · Pattern Detection · Proactive
05

AI Debug Assistant

Ask your traces a question

Query your trace data with natural language. Ask "Why did this agent retry three times?" and get a structured answer with the exact spans that caused the issue.

NL Query · Root Cause · Suggestions
06

Annotation Queue

Human-in-the-loop review

Low-scoring traces are automatically routed to human reviewers. Build a continuous quality feedback loop that makes your agents smarter over time.

HITL · Quality · Labeling
07

Prompt Playground

Debug and iterate online

Online prompt debugging and version comparison. Test prompt variations side-by-side, compare outputs, and iterate without redeploying your agents.

Prompt Debug · A/B Test · Version
08

A/B Experiment Comparison

Data-driven optimization

Run A/B experiments comparing different prompts, models, and strategies. Statistical significance testing ensures you ship the variation that actually performs better.

Experiment · Statistics · Optimization
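The significance check behind an A/B comparison can be sketched with a standard two-proportion z-test (stdlib only; the sample counts are illustrative, and this is a generic statistical sketch, not the product's actual test):

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-proportion z-test: is variant B's success rate significantly
    different from variant A's? |z| > 1.96 is significant at the 5% level."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)          # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # pooled standard error
    return (p_b - p_a) / se

# Variant A: 80/100 missions succeed; variant B: 90/100.
z = two_proportion_z(80, 100, 90, 100)
print(z > 1.96)  # just clears the 5% significance bar
```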
09

Cost / Latency / Error Dashboards

Monitor what matters

Pre-built monitoring dashboards for cost, latency, and error rates with configurable alert thresholds. Know the health of every agent at a glance.

Monitoring · Cost · Alerts · SLA
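Configurable alert thresholds boil down to rules evaluated against a metrics snapshot. A minimal sketch (the rule schema, metric names, and threshold values are assumptions for illustration):

```python
# Hypothetical alert rules; field names and thresholds are illustrative.
ALERT_RULES = [
    {"metric": "error_rate", "op": "gt", "threshold": 0.05},
    {"metric": "p95_latency_ms", "op": "gt", "threshold": 4000},
]

def fired_alerts(snapshot, rules=ALERT_RULES):
    """Return the metrics whose current value breaches a rule threshold."""
    return [r["metric"] for r in rules
            if r["op"] == "gt" and snapshot.get(r["metric"], 0) > r["threshold"]]

print(fired_alerts({"error_rate": 0.08, "p95_latency_ms": 1200}))
```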

DEEP DIVE 01

Insights Agent

An AI agent that watches your AI agents. It continuously analyzes trace data, clusters failure patterns, and proactively surfaces emerging issues — so you fix problems before users notice them.

CORE CAPABILITIES

Failure Clustering: Groups similar errors by root cause, not just error code — distinguishing timeout failures from logic bugs from model hallucinations
Pattern Detection: Identifies systemic issues across thousands of traces, like a specific tool consistently failing on edge-case inputs
Trend Analysis: Tracks quality metrics over time and alerts when a metric starts degrading, even before it crosses a threshold
Impact Scoring: Ranks issues by real business impact — cost, user-facing errors, mission completion rate — not just frequency

HOW IT WORKS

Runs continuously on your production trace stream
Generates weekly insight reports with actionable recommendations
Integrates with Alert Rules to auto-create tickets for high-severity patterns
Exportable summaries for compliance and stakeholder review
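The failure-clustering idea above can be sketched naively by grouping on a (failing step, error class) key as a root-cause proxy. The real Insights Agent presumably uses semantic similarity rather than exact keys; the trace fields here are assumptions:

```python
from collections import defaultdict

def cluster_failures(failed_traces):
    """Group failed traces by a (step, error_class) root-cause proxy and
    surface the largest cluster first."""
    clusters = defaultdict(list)
    for t in failed_traces:
        clusters[(t["step"], t["error_class"])].append(t["trace_id"])
    return sorted(clusters.items(), key=lambda kv: -len(kv[1]))

failures = [
    {"trace_id": "t1", "step": "tool.execute", "error_class": "timeout"},
    {"trace_id": "t2", "step": "tool.execute", "error_class": "timeout"},
    {"trace_id": "t3", "step": "llm.call", "error_class": "hallucination"},
]
print(cluster_failures(failures)[0])  # the timeout cluster surfaces first
```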

DEEP DIVE 02

AI Debug Assistant

Stop reading JSON traces manually. Ask questions in plain English — "Why did this mission fail?" — and get a structured root-cause analysis with the exact spans, tool calls, and model decisions that led to the issue.

EXAMPLE QUERIES

"Why did the agent retry three times?" — Surfaces the exact tool failure and retry logic that triggered the loop
"Show me all traces where latency exceeds 5s" — Returns filtered traces with latency breakdown per step
"What changed between v2.1 and v2.2?" — Compares trace patterns across agent versions to pinpoint behavioral differences
"Which model is causing the most hallucinations?" — Correlates eval scores with model usage across production traffic

CAPABILITIES

Natural language to trace query translation
Automatic root-cause identification with confidence scoring
Suggested fixes based on similar resolved issues
Session sharing — share a debug session with your team via link
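From the caller's side, the assistant is a natural-language query over trace data. A toy sketch of that translation step, pattern-matching just one of the example questions above (the class and method names are hypothetical, not the actual Metaprise SDK):

```python
class DebugAssistant:
    """Toy stand-in for NL -> trace-query translation; handles only the
    'latency exceeds Ns' question shape from the examples above."""

    def __init__(self, traces):
        self.traces = traces

    def ask(self, question):
        if "latency exceeds" in question:
            # e.g. "... latency exceeds 5s" -> 5000 ms
            limit_ms = float(question.split("exceeds")[1].strip().rstrip("s?")) * 1000
            return [t for t in self.traces if t["latency_ms"] > limit_ms]
        raise NotImplementedError("a real assistant handles arbitrary questions")

traces = [{"trace_id": "a", "latency_ms": 6200}, {"trace_id": "b", "latency_ms": 900}]
print(DebugAssistant(traces).ask("Show me all traces where latency exceeds 5s"))
```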

DEEP DIVE 03

Annotation Queue

The bridge between automated evaluation and human judgment. Low-scoring traces are automatically routed to a review queue where domain experts label, correct, and feed data back into your evaluation pipeline — creating a continuous improvement loop.

WORKFLOW

Auto-Routing: Traces scoring below configurable thresholds are automatically added to the queue — no manual triage needed
Priority Scoring: Queue items ranked by business impact, uncertainty, and recency so reviewers focus on what matters most
Annotation Interface: Purpose-built UI for labeling trace quality, flagging hallucinations, and adding corrective examples
Feedback Loop: Annotations automatically update eval criteria and fine-tuning datasets, closing the quality loop
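The auto-routing and priority-scoring steps above can be sketched in a few lines. The weighting formula and field names are illustrative assumptions, not the product's actual scoring:

```python
def route_to_queue(traces, threshold=0.7):
    """Auto-route traces scoring below the threshold, ranked by a simple
    weighted priority over impact, uncertainty, and recency."""
    queued = [t for t in traces if t["eval_score"] < threshold]
    def priority(t):
        return (0.5 * t["business_impact"]
                + 0.3 * t["uncertainty"]
                + 0.2 * t["recency"])
    return sorted(queued, key=priority, reverse=True)

traces = [
    {"id": 1, "eval_score": 0.9, "business_impact": 1.0, "uncertainty": 0.1, "recency": 0.5},
    {"id": 2, "eval_score": 0.4, "business_impact": 0.2, "uncertainty": 0.9, "recency": 0.9},
    {"id": 3, "eval_score": 0.5, "business_impact": 1.0, "uncertainty": 0.2, "recency": 0.1},
]
print([t["id"] for t in route_to_queue(traces)])  # high-impact item first
```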

TEAM FEATURES

Role-based access — assign reviewers to specific agent types or domains
Inter-annotator agreement tracking for quality assurance
SLA tracking for annotation turnaround time
Export annotations for external training pipelines

The Observability Pipeline

Every agent mission flows through a six-stage observability pipeline — from trace collection to continuous optimization.

01
Agent Mission
Agent executes a user request
02
Trace Collection
Every step auto-captured
03
Online Eval
Real-time LLM-as-Judge
04
Insights Analysis
AI clusters failure patterns
05
Alert & Annotate
Route to humans or alerts
06
Debug & Optimize
Close the feedback loop
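The six stages above compose into a single flow over each mission. A toy sketch with stand-in stage functions (all internals here are placeholders for the services described above):

```python
from functools import reduce

# Stand-in stage functions; real stages are the services described above.
def trace_collection(m):  return {**m, "trace": ["mission.start", "llm.call"]}
def online_eval(m):       return {**m, "score": 0.92}
def insights(m):          return {**m, "clusters": []}
def alert_or_annotate(m): return {**m, "queued": m["score"] < 0.7}
def debug_optimize(m):    return {**m, "done": True}

PIPELINE = [trace_collection, online_eval, insights, alert_or_annotate, debug_optimize]

def run(mission):
    """Thread one agent mission through every pipeline stage in order."""
    return reduce(lambda m, stage: stage(m), PIPELINE, mission)

print(run({"id": 1}))  # high-scoring mission is not queued for review
```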

Connects to your existing stack

Export traces and metrics to your existing observability infrastructure via OpenTelemetry, or use the built-in dashboards and alerts to monitor everything in one place.

OpenTelemetry Export

Native OTel support — send traces and metrics to any OTel-compatible backend
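For a sense of what an OTel-compatible backend receives, here is a stdlib-only sketch of a span in the OTLP/JSON encoding (a field subset; the attribute keys and timestamps are invented, and a real export would go through an OTel SDK exporter rather than hand-built JSON):

```python
import json

def otlp_span(name, start_ns, end_ns, attributes):
    """Build a span dict in (a subset of) the OTLP/JSON encoding."""
    return {
        "name": name,
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
        "attributes": [{"key": k, "value": {"stringValue": str(v)}}
                       for k, v in attributes.items()],
    }

span = otlp_span("llm.call",
                 1_700_000_000_000_000_000, 1_700_000_000_234_000_000,
                 {"agent.model": "gpt-4o", "agent.tokens": 512})
payload = json.dumps(span)  # ready for an OTLP/HTTP JSON body
```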

Alert Rules

Anomaly detection with PagerDuty, Webhook, and Slack integrations

Monitoring Dashboards

Pre-built cost, latency, and error rate views with configurable thresholds

APM Integration

Works alongside Datadog, New Relic, Grafana, and other APM tools

Free with every plan: The entire Observability layer is included at no additional cost across Cloud, Hybrid, and Enterprise deployments. All tools, all agents, unlimited traces. No per-seat pricing, no usage caps.

See inside every agent mission — for free.

Full-chain tracing, AI debugging, and continuous evaluation. Included with every plan.