How to Monitor OpenClaw Agents: Dashboards, Heartbeats, and Progress Tracking
Running an AI agent in a demo is forgiving. Running one in production is a different game entirely. Without proper monitoring, your OpenClaw agent could silently fail mid-task, burn through your token budget, or stall indefinitely — and you would have no idea until a customer complaint lands in your inbox.
This guide walks through everything you need to monitor OpenClaw agents properly: the five metrics that matter, heartbeat setup, progress tracking patterns, error alerting, cost monitoring, and how to bring it all together in a dashboard.
Why Agent Monitoring Is Non-Negotiable in Production
Unlike a traditional API endpoint that either returns a 200 or fails fast, an AI agent operates over an extended, multi-step reasoning loop. Each step invokes tools, consumes tokens, calls external services, and makes branching decisions. Any one of those can go wrong silently.
The three failure modes that hurt most in production:
- Silent stalls — the agent is "running" but waiting on a tool call that never returns
- Runaway costs — no token ceiling means a single agent session can exhaust your monthly budget
- Invisible errors — an exception is swallowed, the agent halts, and the user sees nothing
Monitoring turns these invisible problems into actionable signals. It also gives you the audit trail you need when a customer asks "why did your agent do that?"
The Five Metrics That Matter
Before building any dashboard, define what you are measuring. For OpenClaw agents in production, these five metrics cover 95% of the observability surface:
1. Liveness (Is the agent still running?)
A heartbeat timestamp updated every N seconds. If it goes stale, the agent has stalled or crashed.
2. Progress (What step is it on?)
Which tool was last invoked, how many steps completed vs. total expected, and what the current intent is.
3. Errors (Did something fail?)
Tool call failures, API errors, unexpected exceptions, and any agent-surfaced error messages. Tracked as a count and a log.
4. Cost (How many tokens consumed?)
Input tokens, output tokens, and total estimated cost per session. Aggregated daily and per-agent.
5. Latency (How long per step?)
Time per tool call and total session duration. Helps identify slow external dependencies before users complain.
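These five can be collapsed into a single per-session record that the rest of this guide reads and writes. The shape below is illustrative (the field names are assumptions, not an OpenClaw API), with a small helper that rolls the record up into a coarse health status:

```typescript
// Illustrative per-session metrics record; field names are assumptions,
// not part of the OpenClaw SDK.
interface SessionMetrics {
  lastHeartbeatMs: number; // liveness: epoch ms of the last heartbeat
  stepsCompleted: number;  // progress
  expectedSteps: number;
  errorCount: number;      // errors
  tokensIn: number;        // cost
  tokensOut: number;
  totalDurationMs: number; // latency
}

// Coarse health rollup: staleness outranks errors, errors outrank ok.
function healthStatus(
  m: SessionMetrics,
  nowMs: number,
  staleMs = 60_000,
): 'stale' | 'erroring' | 'ok' {
  if (nowMs - m.lastHeartbeatMs > staleMs) return 'stale';
  if (m.errorCount > 0) return 'erroring';
  return 'ok';
}
```

The sections below show where each field's value comes from.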
OpenClaw's Built-In Logging
OpenClaw exposes structured logs out of the box. Every agent run emits events in JSON format to stdout:
{
  "event": "tool_call",
  "agent_id": "agent_abc123",
  "session_id": "sess_xyz789",
  "tool": "web_search",
  "input": { "query": "Q3 SaaS churn benchmarks" },
  "timestamp": "2026-03-15T09:14:22.341Z",
  "tokens_in": 420,
  "tokens_out": 0,
  "duration_ms": 0
}
What it provides:
- Tool invocation events with input and output
- Per-step token counts
- Session start and end events
- Error events with stack traces
Limitations to know:
- No aggregation — raw event stream only
- No built-in alerting
- Log retention depends on your infrastructure setup
- No cost rollup (you calculate from token counts)
For anything beyond basic debugging, you need to build on top of this stream.
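As a starting point, the stream can be consumed line by line. The sketch below parses each stdout line and keeps a running token tally, skipping non-JSON noise; only the fields shown in the example event above are assumed to exist:

```typescript
// Minimal consumer for the OpenClaw stdout event stream. Only the
// fields from the example event above are assumed.
type LogEvent = { event: string; tokens_in?: number; tokens_out?: number };

function parseLogLine(line: string): LogEvent | null {
  try {
    const parsed = JSON.parse(line);
    return typeof parsed.event === 'string' ? (parsed as LogEvent) : null;
  } catch {
    return null; // skip non-JSON lines (library noise, blank lines)
  }
}

// Running token totals across a batch of log lines.
function tallyTokens(lines: string[]): { tokensIn: number; tokensOut: number } {
  let tokensIn = 0;
  let tokensOut = 0;
  for (const line of lines) {
    const ev = parseLogLine(line);
    if (!ev) continue;
    tokensIn += ev.tokens_in ?? 0;
    tokensOut += ev.tokens_out ?? 0;
  }
  return { tokensIn, tokensOut };
}
```

In production you would pipe the container's stdout through this (for example via readline over a child process) and write each parsed event to your store.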
Setting Up Heartbeat Monitoring
A heartbeat is the simplest and most important monitoring primitive. Here is how to implement one for OpenClaw agents.
Step 1 — Emit heartbeats from your agent loop
import { OpenClawAgent } from '@openclaw/sdk';

const agent = new OpenClawAgent({ agentId: 'research-agent' });

// Use an absolute URL: the agent runs in Node, where a relative path
// like '/api/heartbeat' will not resolve. Set HEARTBEAT_URL to your
// monitoring endpoint in production.
const HEARTBEAT_URL = process.env.HEARTBEAT_URL ?? 'http://localhost:4321/api/heartbeat';

agent.on('step', async (step) => {
  await fetch(HEARTBEAT_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      agentId: step.agentId,
      sessionId: step.sessionId,
      stepNumber: step.number,
      timestamp: new Date().toISOString(),
    }),
  });
});
Step 2 — Store heartbeats in PocketBase
// api/heartbeat.ts (Astro SSR route)
import PocketBase from 'pocketbase';

export const prerender = false;

export async function POST({ request }) {
  const body = await request.json();
  const pb = new PocketBase(import.meta.env.PUBLIC_POCKETBASE_URL);
  await pb.collection('heartbeats').create({
    agent_id: body.agentId,
    session_id: body.sessionId,
    step_number: body.stepNumber,
    timestamp: body.timestamp,
  });
  return new Response(JSON.stringify({ ok: true }), { status: 200 });
}
Step 3 — Set up a staleness check
const staleThresholdMs = 60_000;
const cutoff = new Date(Date.now() - staleThresholdMs).toISOString();

// Query the sessions collection rather than raw heartbeat rows: the
// status field lives on the session record, which keeps a
// last_heartbeat timestamp updated on every heartbeat POST.
const stale = await pb.collection('agent_sessions').getList(1, 50, {
  filter: `last_heartbeat < "${cutoff}" && status = "running"`,
});

for (const session of stale.items) {
  await alertSlack(`Agent ${session.agent_id} may be stalled. Last heartbeat: ${session.last_heartbeat}`);
}
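One practical wrinkle: if this check runs on a schedule (say every 30 seconds), a stalled session would fire a fresh Slack message on every run. A minimal in-memory deduper, assuming a single check process, keeps it to one alert per stall:

```typescript
// One alert per stalled session; state lives in this process only, so
// this assumes a single staleness-check worker.
const alertedSessions = new Set<string>();

function shouldAlert(sessionId: string): boolean {
  if (alertedSessions.has(sessionId)) return false;
  alertedSessions.add(sessionId);
  return true;
}

// Call when the session recovers or ends so a later stall re-alerts.
function clearAlert(sessionId: string): void {
  alertedSessions.delete(sessionId);
}
```

Guard the alertSlack call in the loop above with shouldAlert(session.id).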
Task Progress Tracking Patterns
Heartbeats tell you the agent is alive. Progress tracking tells you where it is.
Pattern 1 — Structured step events
agent.on('step', async (step) => {
  await pb.collection('agent_progress').create({
    session_id: step.sessionId,
    step_number: step.number,
    tool_called: step.tool,
    status: 'completed',
    tokens_in: step.usage.inputTokens,
    tokens_out: step.usage.outputTokens,
    duration_ms: step.durationMs,
    timestamp: new Date().toISOString(),
  });
});
Pattern 2 — Intent labeling
const TOOL_INTENT_MAP: Record<string, string> = {
  web_search: 'Searching the web',
  read_file: 'Reading document',
  write_file: 'Saving output',
  send_email: 'Sending notification',
  code_execution: 'Running code',
};

const intent = TOOL_INTENT_MAP[step.tool] ?? `Using ${step.tool}`;
Pattern 3 — Estimated completion
const EXPECTED_STEPS = 8;
const progressPct = Math.min(100, Math.round((step.number / EXPECTED_STEPS) * 100));
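The three patterns compose naturally into one payload for a dashboard row. A sketch, with an illustrative payload shape:

```typescript
// Combine step number, intent label, and completion estimate into the
// record a dashboard row would render. The shape is an assumption.
function buildProgress(
  stepNumber: number,
  tool: string,
  expectedSteps: number,
  intentMap: Record<string, string>,
) {
  return {
    stepNumber,
    intent: intentMap[tool] ?? `Using ${tool}`,
    progressPct: Math.min(100, Math.round((stepNumber / expectedSteps) * 100)),
  };
}
```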
Error Alerting Setup
Slack Alerts
export async function alertSlack(message: string, level: 'info' | 'warn' | 'error' = 'info') {
  const emoji = { info: ':robot_face:', warn: ':warning:', error: ':red_circle:' }[level];
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `${emoji} *OpenClaw Agent Alert*\n${message}`,
    }),
  });
}

agent.on('error', async (error) => {
  await alertSlack(
    `Agent ${error.agentId} failed on step ${error.stepNumber}\nTool: ${error.tool}\nError: ${error.message}`,
    'error'
  );
});
Email Alerts
import { Resend } from 'resend';

const resend = new Resend(process.env.RESEND_API_KEY);

export async function sendDailyDigest(stats: AgentDailyStats) {
  await resend.emails.send({
    from: 'alerts@getclawhosting.com',
    to: process.env.ALERT_EMAIL!,
    subject: `OpenClaw Daily Report — ${stats.date}`,
    html: `<h2>Agent Activity Report</h2>
      <ul>
        <li>Sessions run: ${stats.sessionsCount}</li>
        <li>Errors: ${stats.errorsCount}</li>
        <li>Total tokens used: ${stats.totalTokens.toLocaleString()}</li>
        <li>Estimated cost: $${stats.estimatedCostUsd.toFixed(4)}</li>
      </ul>`,
  });
}
Cost and Token Usage Monitoring
Token cost is the most commonly ignored production concern for AI agents. A single runaway session on a long-context model can cost more than your entire day's planned spend.
const COST_PER_M_INPUT = 3.00;
const COST_PER_M_OUTPUT = 15.00;

function calculateSessionCost(steps: AgentStep[]): number {
  const totalInput = steps.reduce((sum, s) => sum + s.tokens_in, 0);
  const totalOutput = steps.reduce((sum, s) => sum + s.tokens_out, 0);
  return (totalInput / 1_000_000) * COST_PER_M_INPUT +
    (totalOutput / 1_000_000) * COST_PER_M_OUTPUT;
}
const COST_CEILING_USD = 0.50;
const allSessionSteps: AgentStep[] = [];

agent.on('step', async (step) => {
  allSessionSteps.push(step); // accumulate steps so the running total is defined
  const sessionCost = calculateSessionCost(allSessionSteps);
  if (sessionCost > COST_CEILING_USD) {
    agent.abort(`Cost ceiling reached: $${sessionCost.toFixed(4)}`);
    await alertSlack(`Session ${step.sessionId} aborted — cost limit hit`, 'warn');
  }
});
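With step events stored per the progress tracker above, a daily rollup is a straightforward aggregation. A sketch using the same per-million pricing, grouping by UTC day:

```typescript
// Daily cost rollup over stored step events. Uses the step fields the
// progress tracker writes; pricing defaults mirror the constants above.
interface StoredStep {
  timestamp: string; // ISO 8601
  tokens_in: number;
  tokens_out: number;
}

function dailyCost(
  steps: StoredStep[],
  costPerMInput = 3.0,
  costPerMOutput = 15.0,
): Map<string, number> {
  const byDay = new Map<string, number>();
  for (const s of steps) {
    const day = s.timestamp.slice(0, 10); // "YYYY-MM-DD" (UTC day)
    const cost =
      (s.tokens_in / 1_000_000) * costPerMInput +
      (s.tokens_out / 1_000_000) * costPerMOutput;
    byDay.set(day, (byDay.get(day) ?? 0) + cost);
  }
  return byDay;
}
```

Run this in a nightly job and feed the result into the daily digest email above.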
Building a Simple Monitoring Dashboard
With data flowing into PocketBase, you can build a lightweight real-time dashboard in Astro in under 50 lines.
const activeSessions = await pb.collection('agent_sessions').getList(1, 20, {
  filter: 'status = "running"',
  sort: '-created',
  expand: 'latest_heartbeat',
});

const recentErrors = await pb.collection('agent_errors').getList(1, 10, {
  sort: '-timestamp',
  filter: `timestamp > "${oneDayAgo}"`,
});
<table class="w-full text-sm">
  <thead>
    <tr>
      <th scope="col">Agent</th>
      <th scope="col">Session</th>
      <th scope="col">Step</th>
      <th scope="col">Last Heartbeat</th>
      <th scope="col">Status</th>
    </tr>
  </thead>
  <tbody>
    {activeSessions.items.map(session => (
      <tr>
        <td>{session.agent_id}</td>
        <td class="font-mono text-xs">{session.id.slice(0, 8)}</td>
        <td>{session.current_step}</td>
        <td>{formatRelativeTime(session.last_heartbeat)}</td>
        <td>
          <span class={session.isStale ? 'text-red-400' : 'text-lime-400'}>
            {session.isStale ? 'Stale' : 'Running'}
          </span>
        </td>
      </tr>
    ))}
  </tbody>
</table>
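The table calls a formatRelativeTime helper that the snippet leaves undefined; a minimal sketch:

```typescript
// Render an ISO timestamp as "Ns ago" / "Nm ago" / "Nh ago".
// nowMs is injectable for testing; it defaults to the current time.
function formatRelativeTime(iso: string, nowMs = Date.now()): string {
  const diffS = Math.max(0, Math.floor((nowMs - new Date(iso).getTime()) / 1000));
  if (diffS < 60) return `${diffS}s ago`;
  if (diffS < 3600) return `${Math.floor(diffS / 60)}m ago`;
  return `${Math.floor(diffS / 3600)}h ago`;
}
```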
For real-time updates without a full page reload, poll the PocketBase REST API every 10 seconds from a React island or a setInterval in a script tag. PocketBase's realtime subscription API is an alternative if you prefer pushed updates over polling.
External Monitoring Tools Integration
Grafana and Loki
Ship your OpenClaw container logs to Loki using Promtail or Alloy. Then build Grafana panels on log rate by agent_id, error events, and step duration histograms.
sum by (agent_id) (rate({app="openclaw"} |= `"event":"error"` [5m]))
Datadog
Use the Datadog Agent's log pipeline to parse OpenClaw's JSON logs automatically. Create monitors for:
- No-data alert on heartbeat metric (agent stalled)
- Anomaly detection on token usage (runaway agent)
- Error rate threshold on the "event":"error" log pattern
Both integrations require no changes to your agent code — just log forwarding configuration.
GetClaw Hosting Monitoring Dashboard
If you are running OpenClaw through GetClaw Hosting, all of the above is built in and configured on day one.
What the GetClaw dashboard provides:
- Real-time agent status panel — every running session with live heartbeat indicator, current step, and intent label
- Task history timeline — full step-by-step log for every completed session, searchable by agent ID or date range
- Error center — grouped errors with stack traces, affected sessions, and one-click Slack alert configuration
- Cost tracker — daily and monthly token usage by agent, with configurable cost ceiling alerts
- Latency heatmap — per-tool latency distribution across all agents, identifying slow dependencies at a glance
You configure your alert thresholds once in the dashboard — stale heartbeat window, cost ceiling, error rate threshold — and GetClaw handles the rest. No Grafana instance to manage, no Loki setup, no Promtail config.
For teams on the Team plan and above, the monitoring dashboard includes 90-day history retention and webhook-based alerting for integration with PagerDuty, Linear, or any tool that accepts webhooks.
Frequently Asked Questions
How often should I emit heartbeats from an OpenClaw agent?
Every 15 to 30 seconds is a reasonable default for most agents. If your agent runs very fast steps (under 5 seconds each), emit on every step instead. If steps are slow, keep the heartbeat on a fixed interval separate from the step loop.
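That rule of thumb can be encoded directly. The thresholds below come from the answer above; the function names are illustrative:

```typescript
// Pick a heartbeat strategy from observed step latency: per-step for
// fast agents (under 5s per step), fixed interval otherwise.
function heartbeatStrategy(avgStepMs: number): 'per-step' | 'fixed-interval' {
  return avgStepMs < 5_000 ? 'per-step' : 'fixed-interval';
}

// For the fixed-interval case: is a heartbeat due on this tick?
function heartbeatDue(lastEmitMs: number, nowMs: number, intervalMs = 20_000): boolean {
  return nowMs - lastEmitMs >= intervalMs;
}
```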
What is the best way to track OpenClaw agent costs in production?
Store tokens_in and tokens_out on every step event in your database, then calculate cost server-side using the model's published pricing. Set a per-session ceiling and abort and alert when it is hit. GetClaw Hosting does this automatically.
Can I monitor multiple OpenClaw agents from a single dashboard?
Yes. Tag every event with both agent_id (the agent definition) and session_id (the individual run). This lets you see all agents on one dashboard while drilling into a specific session when debugging.
How do I detect if an OpenClaw agent is in an infinite loop?
Track cumulative step count per session. If a session exceeds your expected maximum step count, trigger an alert and consider aborting the session. Combine this with the cost ceiling for a belt-and-suspenders approach.
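The guard itself is a few lines, mirroring the cost ceiling pattern; the maximum is workload-specific (25 here is an arbitrary example):

```typescript
// Abort decision for runaway sessions. MAX_STEPS is an assumption;
// set it comfortably above your longest legitimate run.
const MAX_STEPS = 25;

function loopGuard(stepNumber: number, maxSteps = MAX_STEPS): 'ok' | 'abort' {
  return stepNumber > maxSteps ? 'abort' : 'ok';
}
```

Wire loopGuard into the same step handler as the cost ceiling and call agent.abort when it returns 'abort'.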
What should I log to have a full audit trail for an OpenClaw agent session?
At minimum: session start with task description, every tool call with input/output summary and token counts, every error, and session end with final status and total cost. Store this in a structured database like PocketBase rather than flat log files.
Start Monitoring Your Agents Today
Building agent monitoring from scratch takes days of infrastructure work — heartbeat stores, staleness jobs, cost calculators, Slack integrations, dashboard queries. It is important work, but it is not your product.
GetClaw Hosting gives you production-grade OpenClaw monitoring out of the box. Real-time status, full session history, cost alerts, and error center — all configured and running before you deploy your first agent.
Start your free trial on GetClaw Hosting and go from "I hope it's still running" to "I know exactly what every agent is doing."