How-To Guide · 7 min read · by GetClaw Hosting Team


How to Monitor OpenClaw Agents: Dashboards, Heartbeats, and Progress Tracking

Running an AI agent in a demo is forgiving. Running one in production is a different game entirely. Without proper monitoring, your OpenClaw agent could silently fail mid-task, burn through your token budget, or stall indefinitely — and you would have no idea until a customer complaint lands in your inbox.

This guide walks through everything you need to monitor OpenClaw agents properly: the five metrics that matter, heartbeat setup, progress tracking patterns, error alerting, cost monitoring, and how to bring it all together in a dashboard.


Why Agent Monitoring Is Non-Negotiable in Production

Unlike a traditional API endpoint that either returns a 200 or fails fast, an AI agent operates over an extended, multi-step reasoning loop. Each step invokes tools, consumes tokens, calls external services, and makes branching decisions. Any one of those can go wrong silently.

The three failure modes that hurt most in production:

  1. Silent stalls — the agent is "running" but waiting on a tool call that never returns
  2. Runaway costs — no token ceiling means a single agent session can exhaust your monthly budget
  3. Invisible errors — an exception is swallowed, the agent halts, and the user sees nothing

Monitoring turns these invisible problems into actionable signals. It also gives you the audit trail you need when a customer asks "why did your agent do that?"


The Five Metrics That Matter

Before building any dashboard, define what you are measuring. For OpenClaw agents in production, these five metrics cover 95% of the observability surface:

1. Liveness (Is the agent still running?)

A heartbeat timestamp updated every N seconds. If it goes stale, the agent has stalled or crashed.

2. Progress (What step is it on?)

Which tool was last invoked, how many steps completed vs. total expected, and what the current intent is.

3. Errors (Did something fail?)

Tool call failures, API errors, unexpected exceptions, and any agent-surfaced error messages. Tracked as a count and a log.

4. Cost (How many tokens consumed?)

Input tokens, output tokens, and total estimated cost per session. Aggregated daily and per-agent.

5. Latency (How long per step?)

Time per tool call and total session duration. Helps identify slow external dependencies before users complain.
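All five metrics fall out of per-step records. As one illustrative sketch (assuming step rows shaped like the progress events stored later in this guide — `tool` and `duration_ms` fields are the only requirements), here is a p95 latency rollup per tool:

```typescript
interface StepRecord {
  tool: string;
  duration_ms: number;
}

// Group step durations by tool and compute the p95 for each,
// so slow external dependencies stand out at a glance.
function p95ByTool(steps: StepRecord[]): Record<string, number> {
  const byTool: Record<string, number[]> = {};
  for (const s of steps) {
    (byTool[s.tool] ??= []).push(s.duration_ms);
  }
  const result: Record<string, number> = {};
  for (const [tool, durations] of Object.entries(byTool)) {
    durations.sort((a, b) => a - b);
    // Nearest-rank p95: index of the value at or above the 95th percentile.
    const idx = Math.min(durations.length - 1, Math.ceil(0.95 * durations.length) - 1);
    result[tool] = durations[idx];
  }
  return result;
}
```

Run this over a day's steps and the tool with the worst p95 is usually the dependency to investigate first.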


OpenClaw's Built-In Logging

OpenClaw exposes structured logs out of the box. Every agent run emits events in JSON format to stdout:

{
  "event": "tool_call",
  "agent_id": "agent_abc123",
  "session_id": "sess_xyz789",
  "tool": "web_search",
  "input": { "query": "Q3 SaaS churn benchmarks" },
  "timestamp": "2026-03-15T09:14:22.341Z",
  "tokens_in": 420,
  "tokens_out": 0,
  "duration_ms": 0
}

What it provides:

  • Tool invocation events with input and output
  • Per-step token counts
  • Session start and end events
  • Error events with stack traces

Limitations to know:

  • No aggregation — raw event stream only
  • No built-in alerting
  • Log retention depends on your infrastructure setup
  • No cost rollup (you calculate from token counts)

For anything beyond basic debugging, you need to build on top of this stream.
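A minimal consumer of that stream might look like the following sketch. It assumes events arrive one JSON object per line (as in the sample above) and adds the per-session token rollup the raw stream lacks; the function names are illustrative, not part of the OpenClaw SDK:

```typescript
interface AgentEvent {
  event: string;
  session_id?: string;
  tokens_in?: number;
  tokens_out?: number;
}

// Parse one stdout line; returns null for non-JSON noise (startup banners, etc.).
function parseEvent(line: string): AgentEvent | null {
  try {
    const parsed = JSON.parse(line);
    return parsed && typeof parsed.event === 'string' ? (parsed as AgentEvent) : null;
  } catch {
    return null;
  }
}

// Roll tokens up per session as events arrive — the aggregation the raw stream lacks.
function tallyTokens(lines: string[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const line of lines) {
    const ev = parseEvent(line);
    if (!ev || !ev.session_id) continue;
    const tokens = (ev.tokens_in ?? 0) + (ev.tokens_out ?? 0);
    totals.set(ev.session_id, (totals.get(ev.session_id) ?? 0) + tokens);
  }
  return totals;
}
```

In production you would feed `parseEvent` from the container's stdout line by line rather than from a buffered array, but the parsing and aggregation logic is the same.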


Setting Up Heartbeat Monitoring

A heartbeat is the simplest and most important monitoring primitive. Here is how to implement one for OpenClaw agents.

Step 1 — Emit heartbeats from your agent loop

import { OpenClawAgent } from '@openclaw/sdk';

const agent = new OpenClawAgent({ agentId: 'research-agent' });

agent.on('step', async (step) => {
  await fetch('/api/heartbeat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      agentId: step.agentId,
      sessionId: step.sessionId,
      stepNumber: step.number,
      timestamp: new Date().toISOString(),
    }),
  });
});

Step 2 — Store heartbeats in PocketBase

// api/heartbeat.ts (Astro SSR route)
import PocketBase from 'pocketbase';

export const prerender = false;

export async function POST({ request }) {
  const body = await request.json();
  const pb = new PocketBase(import.meta.env.PUBLIC_POCKETBASE_URL);

  await pb.collection('heartbeats').create({
    agent_id: body.agentId,
    session_id: body.sessionId,
    step_number: body.stepNumber,
    timestamp: body.timestamp,
  });

  return new Response(JSON.stringify({ ok: true }), { status: 200 });
}

Step 3 — Set up a staleness check

// Run this on a schedule (e.g. a cron job every minute).
const staleThresholdMs = 60_000;
const cutoff = new Date(Date.now() - staleThresholdMs).toISOString();

// Find running sessions whose most recent heartbeat is older than the cutoff.
const stale = await pb.collection('agent_sessions').getList(1, 50, {
  filter: `last_heartbeat < "${cutoff}" && status = "running"`,
});

for (const session of stale.items) {
  await alertSlack(`Agent ${session.agent_id} may be stalled. Last heartbeat: ${session.last_heartbeat}`);
}

Task Progress Tracking Patterns

Heartbeats tell you the agent is alive. Progress tracking tells you where it is.

Pattern 1 — Structured step events

agent.on('step', async (step) => {
  await pb.collection('agent_progress').create({
    session_id: step.sessionId,
    step_number: step.number,
    tool_called: step.tool,
    status: 'completed',
    tokens_in: step.usage.inputTokens,
    tokens_out: step.usage.outputTokens,
    duration_ms: step.durationMs,
    timestamp: new Date().toISOString(),
  });
});

Pattern 2 — Intent labeling

const TOOL_INTENT_MAP: Record<string, string> = {
  web_search: 'Searching the web',
  read_file: 'Reading document',
  write_file: 'Saving output',
  send_email: 'Sending notification',
  code_execution: 'Running code',
};

const intent = TOOL_INTENT_MAP[step.tool] ?? `Using ${step.tool}`;

Pattern 3 — Estimated completion

const EXPECTED_STEPS = 8;
const progressPct = Math.min(100, Math.round((step.number / EXPECTED_STEPS) * 100));

Error Alerting Setup

Slack Alerts

export async function alertSlack(message: string, level: 'info' | 'warn' | 'error' = 'info') {
  const emoji = { info: ':robot_face:', warn: ':warning:', error: ':red_circle:' }[level];

  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `${emoji} *OpenClaw Agent Alert*\n${message}`,
    }),
  });
}

agent.on('error', async (error) => {
  await alertSlack(
    `Agent ${error.agentId} failed on step ${error.stepNumber}\nTool: ${error.tool}\nError: ${error.message}`,
    'error'
  );
});

Email Alerts

import { Resend } from 'resend';

const resend = new Resend(process.env.RESEND_API_KEY);

export async function sendDailyDigest(stats: AgentDailyStats) {
  await resend.emails.send({
    from: 'alerts@getclawhosting.com',
    to: process.env.ALERT_EMAIL!,
    subject: `OpenClaw Daily Report — ${stats.date}`,
    html: `<h2>Agent Activity Report</h2>
      <ul>
        <li>Sessions run: ${stats.sessionsCount}</li>
        <li>Errors: ${stats.errorsCount}</li>
        <li>Total tokens used: ${stats.totalTokens.toLocaleString()}</li>
        <li>Estimated cost: $${stats.estimatedCostUsd.toFixed(4)}</li>
      </ul>`,
  });
}
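Building the `AgentDailyStats` payload is a straightforward rollup over the day's step rows. A hedged sketch (field names follow the progress records above; the pricing constants are illustrative and should match your model's published rates):

```typescript
interface AgentDailyStats {
  date: string;
  sessionsCount: number;
  errorsCount: number;
  totalTokens: number;
  estimatedCostUsd: number;
}

interface StepRow {
  session_id: string;
  tokens_in: number;
  tokens_out: number;
}

// Aggregate one day of step rows plus an error count into the digest payload.
function buildDailyStats(date: string, steps: StepRow[], errorsCount: number): AgentDailyStats {
  const sessions = new Set(steps.map((s) => s.session_id));
  const inTokens = steps.reduce((sum, s) => sum + s.tokens_in, 0);
  const outTokens = steps.reduce((sum, s) => sum + s.tokens_out, 0);
  return {
    date,
    sessionsCount: sessions.size,
    errorsCount,
    totalTokens: inTokens + outTokens,
    // Illustrative pricing: $3/M input, $15/M output.
    estimatedCostUsd: (inTokens / 1_000_000) * 3.0 + (outTokens / 1_000_000) * 15.0,
  };
}
```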


Cost and Token Usage Monitoring

Token cost is the most commonly ignored production concern for AI agents. A single runaway session on a long-context model can cost more than your entire day's planned spend.

const COST_PER_M_INPUT = 3.00;
const COST_PER_M_OUTPUT = 15.00;

function calculateSessionCost(steps: AgentStep[]): number {
  const totalInput = steps.reduce((sum, s) => sum + s.tokens_in, 0);
  const totalOutput = steps.reduce((sum, s) => sum + s.tokens_out, 0);
  return (totalInput / 1_000_000) * COST_PER_M_INPUT +
         (totalOutput / 1_000_000) * COST_PER_M_OUTPUT;
}

const COST_CEILING_USD = 0.50;
const allSessionSteps: AgentStep[] = [];

agent.on('step', async (step) => {
  allSessionSteps.push(step);
  const sessionCost = calculateSessionCost(allSessionSteps);
  if (sessionCost > COST_CEILING_USD) {
    agent.abort(`Cost ceiling reached: $${sessionCost.toFixed(4)}`);
    await alertSlack(`Session ${step.sessionId} aborted — cost limit hit`, 'warn');
  }
});

Building a Simple Monitoring Dashboard

With data flowing into PocketBase, you can build a lightweight real-time dashboard in Astro in under 50 lines.

const activeSessions = await pb.collection('agent_sessions').getList(1, 20, {
  filter: 'status = "running"',
  sort: '-created',
  expand: 'latest_heartbeat',
});

const oneDayAgo = new Date(Date.now() - 86_400_000).toISOString();

const recentErrors = await pb.collection('agent_errors').getList(1, 10, {
  sort: '-timestamp',
  filter: `timestamp > "${oneDayAgo}"`,
});

<table class="w-full text-sm">
  <thead>
    <tr>
      <th scope="col">Agent</th>
      <th scope="col">Session</th>
      <th scope="col">Step</th>
      <th scope="col">Last Heartbeat</th>
      <th scope="col">Status</th>
    </tr>
  </thead>
  <tbody>
    {activeSessions.items.map(session => (
      <tr>
        <td>{session.agent_id}</td>
        <td class="font-mono text-xs">{session.id.slice(0, 8)}</td>
        <td>{session.current_step}</td>
        <td>{formatRelativeTime(session.last_heartbeat)}</td>
        <td>
          <span class={session.isStale ? 'text-red-400' : 'text-lime-400'}>
            {session.isStale ? 'Stale' : 'Running'}
          </span>
        </td>
      </tr>
    ))}
  </tbody>
</table>

For real-time updates without a full page reload, poll the PocketBase REST API every 10 seconds using a React island or a setInterval in a script tag.
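One hedged sketch of that polling loop, framework-agnostic (the `fetchSessions` and `renderSessions` callbacks are placeholders for your own PocketBase fetch and UI update code):

```typescript
// Client-side polling: refetch active sessions on an interval and hand
// them to whatever render function your island or script tag uses.
// Returns a stop function to call when the dashboard unmounts.
function startPolling(
  fetchSessions: () => Promise<unknown[]>,
  renderSessions: (sessions: unknown[]) => void,
  intervalMs = 10_000,
): () => void {
  let stopped = false;
  const tick = async () => {
    if (stopped) return;
    try {
      renderSessions(await fetchSessions());
    } catch {
      // Swallow transient fetch errors; the next tick retries.
    }
  };
  void tick(); // render immediately, then on every interval
  const id = setInterval(tick, intervalMs);
  return () => {
    stopped = true;
    clearInterval(id);
  };
}
```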


External Monitoring Tools Integration

Grafana and Loki

Ship your OpenClaw container logs to Loki using Promtail or Alloy. Then build Grafana panels on log rate by agent_id, error events, and step duration histograms.

sum(rate({app="openclaw"} |= `"event":"error"` [5m])) by (agent_id)

Datadog

Use the Datadog Agent's log pipeline to parse OpenClaw's JSON logs automatically. Create monitors for:

  • No-data alert on heartbeat metric (agent stalled)
  • Anomaly detection on token usage (runaway agent)
  • Error rate threshold on event:error log pattern

Both integrations require no changes to your agent code — just log forwarding configuration.


GetClaw Hosting Monitoring Dashboard

If you are running OpenClaw through GetClaw Hosting, all of the above is built in and configured on day one.

What the GetClaw dashboard provides:

  • Real-time agent status panel — every running session with live heartbeat indicator, current step, and intent label
  • Task history timeline — full step-by-step log for every completed session, searchable by agent ID or date range
  • Error center — grouped errors with stack traces, affected sessions, and one-click Slack alert configuration
  • Cost tracker — daily and monthly token usage by agent, with configurable cost ceiling alerts
  • Latency heatmap — per-tool latency distribution across all agents, identifying slow dependencies at a glance

You configure your alert thresholds once in the dashboard — stale heartbeat window, cost ceiling, error rate threshold — and GetClaw handles the rest. No Grafana instance to manage, no Loki setup, no Promtail config.

For teams on the Team plan and above, the monitoring dashboard includes 90-day history retention and webhook-based alerting for integration with PagerDuty, Linear, or any tool that accepts webhooks.


Frequently Asked Questions

How often should I emit heartbeats from an OpenClaw agent?

Every 15 to 30 seconds is a reasonable default for most agents. If your agent runs very fast steps (under 5 seconds each), emit on every step instead. If steps are slow, keep the heartbeat on a fixed interval separate from the step loop.
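A fixed-interval heartbeat decoupled from the step loop can be sketched like this (the `postHeartbeat` callback is a placeholder for the `POST /api/heartbeat` call shown earlier in this guide):

```typescript
// Fixed-interval heartbeat, independent of step timing: useful when a
// single tool call can take longer than your staleness window.
// Returns a stop function to call when the session ends.
function startHeartbeat(
  postHeartbeat: () => Promise<void>,
  intervalMs = 20_000,
): () => void {
  const id = setInterval(() => {
    postHeartbeat().catch(() => {
      // A single missed beat is fine; the staleness check tolerates gaps
      // shorter than its threshold.
    });
  }, intervalMs);
  return () => clearInterval(id);
}
```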

What is the best way to track OpenClaw agent costs in production?

Store tokens_in and tokens_out on every step event in your database, then calculate cost server-side using the model's published pricing. Set a per-session ceiling and abort and alert when it is hit. GetClaw Hosting does this automatically.

Can I monitor multiple OpenClaw agents from a single dashboard?

Yes. Tag every event with both agent_id (the agent definition) and session_id (the individual run). This lets you see all agents on one dashboard while drilling into a specific session when debugging.

How do I detect if an OpenClaw agent is in an infinite loop?

Track cumulative step count per session. If a session exceeds your expected maximum step count, trigger an alert and consider aborting the session. Combine this with the cost ceiling for a belt-and-suspenders approach.
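The step-count guard can be kept as a small piece of state alongside the agent loop. A sketch under the same assumptions as the earlier handlers (the class name and 50-step default are illustrative):

```typescript
// Track cumulative steps per session and flag sessions past their budget.
class StepBudgetGuard {
  private counts = new Map<string, number>();

  constructor(private maxSteps = 50) {}

  // Record one step; returns true when the session should be aborted.
  recordStep(sessionId: string): boolean {
    const next = (this.counts.get(sessionId) ?? 0) + 1;
    this.counts.set(sessionId, next);
    return next > this.maxSteps;
  }
}
```

Call `guard.recordStep(step.sessionId)` inside the same `agent.on('step', …)` handler shown earlier, and abort plus alert when it returns true.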

What should I log to have a full audit trail for an OpenClaw agent session?

At minimum: session start with task description, every tool call with input/output summary and token counts, every error, and session end with final status and total cost. Store this in a structured database like PocketBase rather than flat log files.


Start Monitoring Your Agents Today

Building agent monitoring from scratch takes days of infrastructure work — heartbeat stores, staleness jobs, cost calculators, Slack integrations, dashboard queries. It is important work, but it is not your product.

GetClaw Hosting gives you production-grade OpenClaw monitoring out of the box. Real-time status, full session history, cost alerts, and error center — all configured and running before you deploy your first agent.

Start your free trial on GetClaw Hosting and go from "I hope it's still running" to "I know exactly what every agent is doing."


About the Author

GetClaw Hosting Team

The GetClaw Hosting team writes guides and articles to help you get the most from our product. All articles are fact-checked and regularly updated.
