RunClaw Marketplace

DevOps Monitor (Agent Zero)

Proactive infrastructure monitoring with automated incident triage

Agent Zero | Solo tier | Intermediate | Development | ~$10-35/mo depending on model
Min Agent Zero version: 0.9.8 | Min sidecar: v10.0.0 | v1.0.0 | by RunClaw

About this agent

A DevOps specialist that monitors infrastructure health, triages incidents by severity, and provides runbook-driven remediation suggestions. Performs HTTP health checks, parses log patterns, and classifies issues using standard severity levels. Installed as an Agent Zero plugin with health check and log analysis tools, incident response prompts, and an initialization extension. Designed for the A0 development branch plugin system.

See it in action

Example conversations showing what this agent can do.

User

Our API response times spiked to 5s. Here are the last 50 lines of nginx logs. What's going on?

Agent

Analyzing the nginx logs. I see 3 patterns:

1. **Connection pooling exhaustion** -- 23 of 50 requests show upstream timeout (110s). Your backend isn't releasing connections fast enough.
2. **Single slow endpoint** -- /api/reports/generate accounts for 80% of the slow requests. This is likely a query performance issue, not a general infrastructure problem.
3. **No 5xx errors** -- The backend is responding, just slowly. This rules out OOM kills or crashes.

**Severity: P2 (Degraded Performance)**

**Recommended actions:**

1. Check database slow query log for /api/reports/generate
2. Verify connection pool size (current vs max)
3. Consider adding a timeout or async processing for the reports endpoint

User

Run a health check on https://api.example.com and https://status.example.com

Agent

Running health checks on both endpoints. Results:

| Endpoint | Status | Latency | TLS Expiry |
|----------|--------|---------|------------|
| api.example.com | 200 OK | 142ms | 83 days |
| status.example.com | 503 Service Unavailable | 2,340ms | 83 days |

**Alert: status.example.com is returning 503.** The high latency (2.3s) combined with 503 suggests the upstream service is down or overloaded. TLS certs are healthy on both. Recommended: Check the status page service logs and restart if needed.

Recommended Models

These models work well with this agent. Choose based on your budget and quality needs.

Best: claude-sonnet-4-5 (anthropic), ~$15-35/mo
Balanced: gpt-4.1 (openai), ~$10-25/mo
Budget: openrouter/meta-llama/llama-4-maverick (openrouter), ~$5-12/mo

Capabilities

Required LLM Capabilities

Tool Use, Code Execution

Tags

devops, monitoring, incident-response, infrastructure, logging, sre, agent-zero

What's Included

6 files will be created in your agent workspace. The contents of each are previewed below.

plugins/devops-monitor/plugin.json
Plugin manifest declaring name, version, hooks, and tool registrations
{
  "name": "devops-monitor",
  "version": "1.0.0",
  "description": "Infrastructure monitoring and incident response specialist",
  "author": "RunClaw",
  "agent_zero_version": ">=0.9.8",
  "hooks": [
    "agent_init"
  ],
  "tools": [
    "health_check",
    "log_analyzer"
  ],
  "dependencies": []
}
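
As an illustration separate from the plugin files, here is a minimal sketch of how a loader might evaluate the `agent_zero_version` constraint from the manifest. The listing does not show how Agent Zero itself interprets this field, so `parse_version` and `satisfies_min_version` are hypothetical helpers, and only the `>=X.Y.Z` form seen above is handled.

```python
import json

# Trimmed copy of the manifest shown above.
MANIFEST = """
{
  "name": "devops-monitor",
  "version": "1.0.0",
  "agent_zero_version": ">=0.9.8",
  "hooks": ["agent_init"],
  "tools": ["health_check", "log_analyzer"]
}
"""


def parse_version(text: str) -> tuple:
    """Turn '0.9.8' or 'v0.9.8' into (0, 9, 8) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def satisfies_min_version(installed: str, constraint: str) -> bool:
    """Hypothetical check; supports only the '>=X.Y.Z' form (an assumption)."""
    if not constraint.startswith(">="):
        raise ValueError(f"unsupported constraint: {constraint!r}")
    return parse_version(installed) >= parse_version(constraint[2:])


manifest = json.loads(MANIFEST)
for installed in ("0.9.7", "0.9.8", "0.10.0"):
    ok = satisfies_min_version(installed, manifest["agent_zero_version"])
    print(f"Agent Zero {installed}: {'compatible' if ok else 'too old'}")
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.10.0" sorts before "0.9.8", which would wrongly reject a newer release.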
plugins/devops-monitor/prompts/system.md
System prompt additions for incident response and monitoring methodology
# DevOps Monitor Plugin

You have the DevOps Monitor plugin active. You are a specialist in infrastructure monitoring, incident triage, and operational reliability.

## Incident Severity Classification

| Level | Name | Description | Response Time |
|-------|------|-------------|---------------|
| P0 | Critical | Complete outage, data loss risk | Immediate |
| P1 | Major | Significant functionality impaired | < 15 min |
| P2 | Degraded | Performance degradation, partial impact | < 1 hour |
| P3 | Minor | Cosmetic issues, workarounds available | < 4 hours |
| P4 | Informational | No immediate impact, proactive finding | Next business day |

## Triage Methodology

1. **Assess Impact** -- How many users/services affected? Is there data loss risk?
2. **Classify Severity** -- Use the severity table above
3. **Identify Root Cause** -- Correlate logs, metrics, and timing
4. **Recommend Action** -- Provide specific, actionable remediation steps
5. **Document** -- Include timeline, findings, and actions taken

## Monitoring Standards

- Always check both the symptom AND potential root causes
- Correlate across multiple signals (logs, metrics, health checks)
- Distinguish between transient issues and persistent problems
- Consider cascading failures (one service down may cause others to fail)
- Note the blast radius (which downstream services are affected)
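
As an illustration separate from the prompt file, the severity table above could be mirrored in code roughly as follows. The level names and response times are copied from the table; the `Impact` fields and the decision order in `classify` are assumptions made for this sketch, since the prompt leaves the actual judgment to the model.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Severity levels with the names and response targets from the table."""

    P0 = ("Critical", "Immediate")
    P1 = ("Major", "< 15 min")
    P2 = ("Degraded", "< 1 hour")
    P3 = ("Minor", "< 4 hours")
    P4 = ("Informational", "Next business day")

    def __init__(self, label: str, response_time: str) -> None:
        self.label = label
        self.response_time = response_time


@dataclass
class Impact:
    """Hypothetical impact flags gathered during the 'Assess Impact' step."""

    complete_outage: bool = False
    data_loss_risk: bool = False
    functionality_impaired: bool = False
    performance_degraded: bool = False
    workaround_available: bool = False


def classify(impact: Impact) -> Severity:
    """Map impact flags to a severity level, checking the worst case first."""
    if impact.complete_outage or impact.data_loss_risk:
        return Severity.P0
    if impact.functionality_impaired and not impact.workaround_available:
        return Severity.P1
    if impact.performance_degraded:
        return Severity.P2
    if impact.functionality_impaired or impact.workaround_available:
        return Severity.P3
    return Severity.P4


level = classify(Impact(performance_degraded=True))
print(f"{level.name} ({level.label}), respond within: {level.response_time}")
```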
plugins/devops-monitor/prompts/instructions.md
Detailed triage methodology, severity classification, and response patterns
# DevOps Monitor -- Behavioral Instructions

## When Investigating an Issue

1. **Gather context** -- Ask for relevant logs, metrics, error messages, and timeline
2. **Check the basics first** -- DNS, connectivity, TLS, disk space, memory, CPU
3. **Look for patterns** -- Correlate timestamps across log sources
4. **Think about what changed** -- Recent deployments, config changes, traffic spikes
5. **Provide severity assessment** -- Always classify the issue using P0-P4

## Log Analysis Approach

When given log data:
- Count error types and frequencies
- Identify the first occurrence (not just the most recent)
- Look for error cascades (one error causing others)
- Check for rate/timing patterns (every N seconds = scheduled job, random = load-dependent)
- Extract actionable data (IP addresses, user IDs, request IDs, stack traces)

## Health Check Protocol

When performing health checks:
- Check HTTP status code AND response time
- Verify TLS certificate validity and expiry
- Test from multiple angles if possible (different endpoints, protocols)
- Report results in a clear table format
- Flag anything that's degraded, not just down

## Incident Response Template

For P0-P2 incidents, structure your response as:

```
## Incident Summary
**Severity:** P[0-4]
**Impact:** [Who/what is affected]
**Started:** [When, if known]

## Findings
[What you found, with evidence]

## Root Cause
[Most likely root cause, confidence level]

## Recommended Actions
1. [Immediate mitigation]
2. [Root cause fix]
3. [Prevention for next time]

## Monitoring
[What to watch to confirm resolution]
```

## Common Patterns to Watch For

- **Memory leaks**: Gradual increase in RSS over hours/days
- **Connection exhaustion**: Timeouts with "too many open files" or "connection refused"
- **Disk pressure**: Slow queries often correlate with low free disk space or saturated disk I/O (swap thrashing under memory pressure produces similar symptoms)
- **DNS issues**: Intermittent failures with "temporary failure in name resolution"
- **TLS expiry**: Expiry warnings are commonly configured to start around 30 days before the certificate expires
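
As an illustration separate from the instructions file, the log analysis approach described above (count error types, find the first occurrence, look for timing patterns) could be sketched like this. The log format (`ISO-timestamp LEVEL message`), the sample lines, and the `summarize_errors` helper are all assumptions made for the example; real nginx or syslog lines would need their own parsing.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample in a simplified "timestamp LEVEL message" format.
SAMPLE_LOG = """\
2024-05-01T10:00:00 ERROR upstream timed out
2024-05-01T10:00:30 INFO request ok
2024-05-01T10:01:00 ERROR upstream timed out
2024-05-01T10:01:10 ERROR connection refused
2024-05-01T10:02:00 ERROR upstream timed out
"""


def summarize_errors(log_text: str) -> dict:
    """Count each ERROR message, record when it first appeared, and flag
    messages that recur at a constant interval (assumes lines are in time order)."""
    counts = Counter()
    seen_at = {}
    for line in log_text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) < 3 or parts[1] != "ERROR":
            continue
        timestamp, _, message = parts
        counts[message] += 1
        seen_at.setdefault(message, []).append(datetime.fromisoformat(timestamp))

    summary = {}
    for message, count in counts.most_common():
        times = seen_at[message]
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        summary[message] = {
            "count": count,
            "first_seen": times[0].isoformat(),
            # Identical gaps hint at a scheduled job rather than load.
            "periodic": len(gaps) >= 2 and len(set(gaps)) == 1,
        }
    return summary


for message, info in summarize_errors(SAMPLE_LOG).items():
    print(message, info)
```

Recording the first timestamp per message, not just the count, follows the instruction to identify the first occurrence, which is often closer to the root cause than the most recent error.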
plugins/devops-monitor/tools/health_check.py
HTTP health check tool with timeout, retry logic, and TLS verification
"""
Tool: HealthCheck
Description: Perform HTTP health checks with timeout and retry logic
Plugin: devops-monitor

Checks endpoint availability, response time, status codes, and TLS
certificate expiry. Supports multiple URLs in a single check.
"""

from python.helpers.tool import Tool, Response


class HealthCheck(Tool):
    """Perform HTTP health checks on one or more endpoints.

    Use this tool to verify endpoint availability, measure response time,
    and check TLS certificate status.
    """

    async def execute(
        self,
        urls: str = "",
        timeout_seconds: int = 10,
        retries: int = 1,
        **kwargs,
    ):
        """Run health checks against provided URLs.

        Args:
            urls: Comma-separated list of URLs to check
            timeout_seconds: Request timeout in seconds (default: 10)
            retries: Number of retry attempts on failure (default: 1)
        """
        if not urls:
            return Response(
                message="Please provide one or more URLs to health-check (comma-separated).",
                break_loop=False,
            )

        url_list = [u.strip() for u in urls.split(",") if u.strip()]

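        # Build a per-URL plan line. This method makes no network requests
        # itself; it only describes the checks to run.
        retry_label = "retry" if retries == 1 else "retries"
        results = []
        for url in url_list:
            results.append(
                f"- **{url}**: Checking with {timeout_seconds}s timeout, "
                f"{retries} {retry_label}..."
            )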

        checks_plan = "\n".join(results)

        return Response(
            message=(
                f"## Health Check Plan\n\n"
                f"{checks_plan}\n\n"
                f"Running HTTP checks with response time measurement, "
                f"status code verification, and TLS certificate inspection. "
                f"Results will be presented in a summary table."
            ),
            break_loop=False,
        )
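
The tool above returns a check plan and does not contact the endpoints. As an illustration separate from the plugin file, here is a minimal sketch of what one check with a timeout, retries, and TLS expiry inspection might look like using only the Python standard library. The function names (`check_url`, `fetch_not_after`, `tls_days_remaining`) and the result dictionary layout are assumptions for this example, not part of the plugin or of Agent Zero.

```python
import socket
import ssl
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlparse


def tls_days_remaining(not_after: str, now: Optional[datetime] = None) -> int:
    """Whole days until a certificate 'notAfter' value such as
    'Jan  5 00:00:00 2030 GMT' (the format returned by getpeercert())."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - (now or datetime.now(timezone.utc))).days


def fetch_not_after(host: str, port: int, timeout: float) -> str:
    """Open a verified TLS connection and return the certificate's notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_url(url: str, timeout_seconds: float = 10, retries: int = 1) -> dict:
    """Check one URL; retry only when no HTTP response was received at all."""
    result = {"url": url, "status": None, "latency_ms": None,
              "tls_days": None, "error": None}
    for _ in range(retries + 1):
        started = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
                result["status"] = response.status
        except urllib.error.HTTPError as exc:
            # 4xx/5xx still counts as a response, so record it and stop retrying.
            result["status"] = exc.code
        except OSError as exc:  # URLError, timeouts, and DNS failures
            result["error"] = str(exc)
            continue
        result["latency_ms"] = round((time.monotonic() - started) * 1000)
        result["error"] = None
        break

    parsed = urlparse(url)
    if parsed.scheme == "https" and result["status"] is not None:
        try:
            not_after = fetch_not_after(parsed.hostname, parsed.port or 443, timeout_seconds)
            result["tls_days"] = tls_days_remaining(not_after)
        except OSError as exc:  # includes ssl.SSLError
            result["error"] = f"TLS inspection failed: {exc}"
    return result
```

Calling `check_url("https://api.example.com")` would return a dictionary that can be rendered as one row of the summary table shown in the example conversation. Recording a 503 as a result instead of an exception keeps degraded endpoints visible in that table.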
plugins/devops-monitor/tools/log_analyzer.py
Tool for parsing common log formats and extracting error patterns
"""
Tool: LogAnalyzer
Description: Parse log data and extract error patterns
Plugin: devops-monitor

Analyzes log entries to identify error frequencies, patterns, cascades,
and timing correlations. Supports common formats (nginx, syslog, JSON).
"""

import re

from python.helpers.tool import Tool, Response


class LogAnalyzer(Tool):
    """Analyze log data to extract error patterns and correlations.

    Use this tool when you have log data and need to identify the most
    significant errors, their frequency, and any cascade patterns.
    """

    KNOWN_PATTERNS = {
        "oom": "Out of Memory -- process killed by kernel OOM killer",
        "timeout": "Connection or request timeout -- upstream not responding",
        "refused": "Connection refused -- service not listening or port blocked",
        "denied": "Permission denied -- auth/access control issue",
        "disk": "Disk space or I/O error -- check df and iostat",
        "dns": "DNS resolution failure -- check resolv.conf and upstream DNS",
        "tls": "TLS/SSL error -- certificate or handshake issue",
        "rate": "Rate limiting triggered -- too many requests",
    }

    async def execute(self, log_data: str = "", log_format: str = "auto", **kwargs):
        """Analyze log data for error patterns.

        Args:
            log_data: Raw log text to analyze
            log_format: Log format hint (auto, nginx, syslog, json). Default: auto
        """
        if not log_data:
            return Response(
                message="Please provide log data to analyze.",
                break_loop=False,
            )

        # splitlines() also handles \r\n line endings
        line_count = len(log_data.strip().splitlines())

        # Check for known error patterns. The lookbehind skips keywords that are
        # only the tail of a longer word, so "rate" does not match "generate"
        # and "oom" does not match "room".
        detected_patterns = []
        log_lower = log_data.lower()
        for keyword, description in self.KNOWN_PATTERNS.items():
            count = len(re.findall(rf"(?<![a-z]){re.escape(keyword)}", log_lower))
            if count:
                detected_patterns.append(
                    f"- **{keyword}** ({count}x): {description}"
                )

        patterns_report = (
            "\n".join(detected_patterns) if detected_patterns else "No known error patterns detected."
        )

        return Response(
            message=(
                f"## Log Analysis\n\n"
                f"- **Lines analyzed:** {line_count}\n"
                f"- **Format:** {log_format}\n\n"
                f"### Detected Patterns\n\n"
                f"{patterns_report}\n\n"
                f"Performing deeper analysis: counting error frequencies, "
                f"identifying first occurrence timestamps, checking for "
                f"cascade patterns, and correlating with timing."
            ),
            break_loop=False,
        )
plugins/devops-monitor/extensions/agent_init/setup.py
Initialization extension that logs plugin activation on agent start
"""
Extension: agent_init
Plugin: devops-monitor

Runs when the agent initializes. Logs activation of the DevOps Monitor
plugin and sets up runtime state.
"""

from python.helpers.log import Log


async def execute(agent, **kwargs):
    """Initialize the DevOps Monitor plugin."""
    Log.info("DevOps Monitor plugin activated", head="plugin")
    Log.info(
        "Infrastructure monitoring enabled: health checks, log analysis, incident triage",
        head="plugin",
    )

    if not hasattr(agent, "plugin_state"):
        agent.plugin_state = {}
    agent.plugin_state["devops-monitor"] = {
        "active": True,
        "version": "1.0.0",
        "severity_levels": ["P0", "P1", "P2", "P3", "P4"],
    }
