DevOps Monitor (Agent Zero) — Manual Install Guide

Agent Zero · Intermediate

Prerequisites

✓ A running Agent Zero instance (v0.9.8 or later)
✓ SSH access to your VPS
✓ Docker installed on your VPS

Estimated time: ~20 minutes

Installation Steps

Connect to your VPS

SSH into the server where your Agent Zero instance is running.

Terminal

ssh root@your-vps-ip

Create the plugin directory

Create the directory for the DevOps Monitor plugin inside the Agent Zero container. The commands in this guide assume the container is named agent-zero.

Terminal

docker exec agent-zero mkdir -p /a0/usr/plugins/devops-monitor/

Create plugins/devops-monitor/plugin.json

Plugin manifest declaring name, version, hooks, and tool registrations

Terminal

docker exec -i agent-zero sh -c 'cat > "/a0/usr/plugins/devops-monitor/plugin.json"' << 'BUNDLEOF'
{
  "name": "devops-monitor",
  "version": "1.0.0",
  "description": "Infrastructure monitoring and incident response specialist",
  "author": "RunClaw",
  "agent_zero_version": ">=0.9.8",
  "hooks": [
    "agent_init"
  ],
  "tools": [
    "health_check",
    "log_analyzer"
  ],
  "dependencies": []
}
BUNDLEOF
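
A typo in the manifest is easy to miss, so a quick sanity check before restarting can save a round trip. The sketch below is one possible check, not part of the plugin: it assumes a local copy of the plugin directory, treats `name`, `version`, `hooks`, and `tools` as the minimum keys (an assumption, not an official schema), and assumes each tool name maps to `tools/<name>.py`, as the files in this guide do. The names `check_manifest` and `REQUIRED_KEYS` are hypothetical.

```python
import json
from pathlib import Path

# Assumed minimum set of keys; this is not an official Agent Zero schema.
REQUIRED_KEYS = {"name", "version", "hooks", "tools"}


def check_manifest(plugin_dir):
    """Return a list of problems found in plugin_dir/plugin.json (empty if none)."""
    plugin_dir = Path(plugin_dir)
    try:
        manifest = json.loads((plugin_dir / "plugin.json").read_text())
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot read plugin.json: {exc}"]
    if not isinstance(manifest, dict):
        return ["plugin.json is not a JSON object"]

    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - manifest.keys())]
    # Assumption taken from this guide's layout: tool "x" lives in tools/x.py.
    for tool in manifest.get("tools", []):
        if not (plugin_dir / "tools" / f"{tool}.py").is_file():
            problems.append(f"missing tool file: tools/{tool}.py")
    return problems
```

An empty list means the manifest parsed and every listed tool has a matching file. It says nothing about whether Agent Zero accepts the plugin.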

Create plugins/devops-monitor/prompts/system.md

System prompt additions for incident response and monitoring methodology

Terminal

docker exec agent-zero mkdir -p "/a0/usr/plugins/devops-monitor/prompts" && docker exec -i agent-zero sh -c 'cat > "/a0/usr/plugins/devops-monitor/prompts/system.md"' << 'BUNDLEOF'
# DevOps Monitor Plugin

You have the DevOps Monitor plugin active. You are a specialist in infrastructure monitoring, incident triage, and operational reliability.

## Incident Severity Classification

| Level | Name | Description | Response Time |
|-------|------|-------------|---------------|
| P0 | Critical | Complete outage, data loss risk | Immediate |
| P1 | Major | Significant functionality impaired | < 15 min |
| P2 | Degraded | Performance degradation, partial impact | < 1 hour |
| P3 | Minor | Cosmetic issues, workarounds available | < 4 hours |
| P4 | Informational | No immediate impact, proactive finding | Next business day |

## Triage Methodology

1. **Assess Impact** -- How many users/services affected? Is there data loss risk?
2. **Classify Severity** -- Use the severity table above
3. **Identify Root Cause** -- Correlate logs, metrics, and timing
4. **Recommend Action** -- Provide specific, actionable remediation steps
5. **Document** -- Include timeline, findings, and actions taken

## Monitoring Standards

- Always check both the symptom AND potential root causes
- Correlate across multiple signals (logs, metrics, health checks)
- Distinguish between transient issues and persistent problems
- Consider cascading failures (one service down may cause others to fail)
- Note the blast radius (which downstream services are affected)

BUNDLEOF
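
The severity table in this prompt can also be expressed as data, which is handy if you want to compute a response deadline for an incident. The following is a small illustrative sketch and not part of the plugin; `RESPONSE_TARGETS` and `response_deadline` are hypothetical names, and P4 is left without a clock because "Next business day" depends on a calendar that the sketch does not model.

```python
from datetime import datetime, timedelta

# Response targets copied from the severity table. None means no fixed clock.
RESPONSE_TARGETS = {
    "P0": timedelta(0),           # Immediate
    "P1": timedelta(minutes=15),  # < 15 min
    "P2": timedelta(hours=1),     # < 1 hour
    "P3": timedelta(hours=4),     # < 4 hours
    "P4": None,                   # Next business day
}


def response_deadline(level, detected_at):
    """Return the latest acceptable response time, or None for P4."""
    target = RESPONSE_TARGETS[level]
    return None if target is None else detected_at + target
```

For example, `response_deadline("P1", datetime(2024, 1, 1, 12, 0))` gives 12:15 on the same day.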

Create plugins/devops-monitor/prompts/instructions.md

Detailed triage methodology, severity classification, and response patterns

Terminal

docker exec agent-zero mkdir -p "/a0/usr/plugins/devops-monitor/prompts" && docker exec -i agent-zero sh -c 'cat > "/a0/usr/plugins/devops-monitor/prompts/instructions.md"' << 'BUNDLEOF'
# DevOps Monitor -- Behavioral Instructions

## When Investigating an Issue

1. **Gather context** -- Ask for relevant logs, metrics, error messages, and timeline
2. **Check the basics first** -- DNS, connectivity, TLS, disk space, memory, CPU
3. **Look for patterns** -- Correlate timestamps across log sources
4. **Think about what changed** -- Recent deployments, config changes, traffic spikes
5. **Provide severity assessment** -- Always classify the issue using P0-P4

## Log Analysis Approach

When given log data:
- Count error types and frequencies
- Identify the first occurrence (not just the most recent)
- Look for error cascades (one error causing others)
- Check for rate/timing patterns (every N seconds = scheduled job, random = load-dependent)
- Extract actionable data (IP addresses, user IDs, request IDs, stack traces)

## Health Check Protocol

When performing health checks:
- Check HTTP status code AND response time
- Verify TLS certificate validity and expiry
- Test from multiple angles if possible (different endpoints, protocols)
- Report results in a clear table format
- Flag anything that's degraded, not just down

## Incident Response Template

For P0-P2 incidents, structure your response as:

```
## Incident Summary
**Severity:** P[0-4]
**Impact:** [Who/what is affected]
**Started:** [When, if known]

## Findings
[What you found, with evidence]

## Root Cause
[Most likely root cause, confidence level]

## Recommended Actions
1. [Immediate mitigation]
2. [Root cause fix]
3. [Prevention for next time]

## Monitoring
[What to watch to confirm resolution]
```

## Common Patterns to Watch For

- **Memory leaks**: Gradual increase in RSS over hours/days
- **Connection exhaustion**: Timeouts with "too many open files" or "connection refused"
- **Disk pressure**: Slow queries often correlate with low disk space (swap thrashing)
- **DNS issues**: Intermittent failures with "temporary failure in name resolution"
- **TLS expiry**: Certificate warnings start appearing 30 days before expiry

BUNDLEOF
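
The log analysis guidance in this prompt distinguishes errors that recur at a fixed interval (likely a scheduled job) from errors that arrive at random (likely load-dependent). One rough way to make that distinction is to compare the spread of the gaps between errors with their average. This is a sketch under stated assumptions, not part of the plugin: `classify_timing` is a hypothetical name, timestamps are plain seconds, and the 10% `tolerance` is an arbitrary starting point.

```python
from statistics import mean, pstdev


def classify_timing(timestamps, tolerance=0.1):
    """Label error timestamps (seconds) as 'periodic', 'irregular', or 'insufficient data'.

    tolerance is a hypothetical threshold: the gaps count as periodic when their
    standard deviation is at most this fraction of the mean gap.
    """
    ordered = sorted(timestamps)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(intervals) < 2:
        return "insufficient data"  # one gap cannot show a pattern
    average = mean(intervals)
    if average == 0:
        return "irregular"  # all events share one timestamp
    return "periodic" if pstdev(intervals) / average <= tolerance else "irregular"
```

Errors at roughly 60-second spacing classify as periodic, while bursts with widely varying gaps classify as irregular.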

Create plugins/devops-monitor/tools/health_check.py

HTTP health check tool with timeout, retry logic, and TLS verification

Terminal

docker exec agent-zero mkdir -p "/a0/usr/plugins/devops-monitor/tools" && docker exec -i agent-zero sh -c 'cat > "/a0/usr/plugins/devops-monitor/tools/health_check.py"' << 'BUNDLEOF'
"""
Tool: HealthCheck
Description: Perform HTTP health checks with timeout and retry logic
Plugin: devops-monitor

Checks endpoint availability, response time, status codes, and TLS
certificate expiry. Supports multiple URLs in a single check.
"""

from python.helpers.tool import Tool, Response


class HealthCheck(Tool):
    """Perform HTTP health checks on one or more endpoints.

    Use this tool to verify endpoint availability, measure response time,
    and check TLS certificate status.
    """

    async def execute(
        self,
        urls: str = "",
        timeout_seconds: int = 10,
        retries: int = 1,
        **kwargs,
    ):
        """Run health checks against provided URLs.

        Args:
            urls: Comma-separated list of URLs to check
            timeout_seconds: Request timeout in seconds (default: 10)
            retries: Number of retry attempts on failure (default: 1)
        """
        if not urls:
            return Response(
                message="Please provide one or more URLs to health-check (comma-separated).",
                break_loop=False,
            )

        url_list = [u.strip() for u in urls.split(",") if u.strip()]

        results = []
        for url in url_list:
            results.append(
                f"- **{url}**: Checking with {timeout_seconds}s timeout, "
                f"{retries} retries..."
            )

        checks_plan = "\n".join(results)

        return Response(
            message=(
                f"## Health Check Plan\n\n"
                f"{checks_plan}\n\n"
                f"Running HTTP checks with response time measurement, "
                f"status code verification, and TLS certificate inspection. "
                f"Results will be presented in a summary table."
            ),
            break_loop=False,
        )

BUNDLEOF
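
As written, this tool returns a plan message and leaves the actual requests to the agent. If you want to see what a probe with timeout and retry logic could look like, the following is a minimal standalone sketch using only Python's standard library. It is not the plugin's implementation: `check_url` and its `opener` parameter are hypothetical names, and the choice to retry transport errors and 5xx responses but not 4xx responses is an assumption. For HTTPS URLs, `urllib.request.urlopen` verifies the certificate by default, so an invalid or expired certificate shows up as an error. The sketch does not read the expiry date.

```python
import time
import urllib.error
import urllib.request


def check_url(url, timeout_seconds=10, retries=1, opener=urllib.request.urlopen):
    """Probe url and return a dict with status, elapsed_ms, attempts, and error."""
    attempts = max(int(retries), 0) + 1  # first try plus the retries
    for attempt in range(1, attempts + 1):
        started = time.monotonic()
        try:
            with opener(url, timeout=timeout_seconds) as response:
                status, error = response.status, None
        except urllib.error.HTTPError as exc:
            status, error = exc.code, None  # the server answered with 4xx/5xx
        except (urllib.error.URLError, OSError) as exc:
            status, error = None, str(exc)  # DNS, TLS, refused, or timeout
        elapsed_ms = round((time.monotonic() - started) * 1000)
        if status is not None and status < 500:
            break  # a definite answer; retrying will not change it
    return {"url": url, "status": status, "elapsed_ms": elapsed_ms,
            "attempts": attempt, "error": error}
```

Calling `check_url("https://example.com/health", timeout_seconds=5, retries=2)` makes up to three attempts. The `opener` parameter exists so the function can be exercised without network access.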

Create plugins/devops-monitor/tools/log_analyzer.py

Tool for parsing common log formats and extracting error patterns

Terminal

docker exec agent-zero mkdir -p "/a0/usr/plugins/devops-monitor/tools" && docker exec -i agent-zero sh -c 'cat > "/a0/usr/plugins/devops-monitor/tools/log_analyzer.py"' << 'BUNDLEOF'
"""
Tool: LogAnalyzer
Description: Parse log data and extract error patterns
Plugin: devops-monitor

Analyzes log entries to identify error frequencies, patterns, cascades,
and timing correlations. Supports common formats (nginx, syslog, JSON).
"""

from python.helpers.tool import Tool, Response


class LogAnalyzer(Tool):
    """Analyze log data to extract error patterns and correlations.

    Use this tool when you have log data and need to identify the most
    significant errors, their frequency, and any cascade patterns.
    """

    KNOWN_PATTERNS = {
        "oom": "Out of Memory -- process killed by kernel OOM killer",
        "timeout": "Connection or request timeout -- upstream not responding",
        "refused": "Connection refused -- service not listening or port blocked",
        "denied": "Permission denied -- auth/access control issue",
        "disk": "Disk space or I/O error -- check df and iostat",
        "dns": "DNS resolution failure -- check resolv.conf and upstream DNS",
        "tls": "TLS/SSL error -- certificate or handshake issue",
        "rate": "Rate limiting triggered -- too many requests",
    }

    async def execute(self, log_data: str = "", log_format: str = "auto", **kwargs):
        """Analyze log data for error patterns.

        Args:
            log_data: Raw log text to analyze
            log_format: Log format hint (auto, nginx, syslog, json). Default: auto
        """
        if not log_data:
            return Response(
                message="Please provide log data to analyze.",
                break_loop=False,
            )

        line_count = len(log_data.strip().split("\n"))

        # Check for known error patterns
        detected_patterns = []
        log_lower = log_data.lower()
        for keyword, description in self.KNOWN_PATTERNS.items():
            if keyword in log_lower:
                count = log_lower.count(keyword)
                detected_patterns.append(
                    f"- **{keyword}** ({count}x): {description}"
                )

        patterns_report = (
            "\n".join(detected_patterns) if detected_patterns else "No known error patterns detected."
        )

        return Response(
            message=(
                f"## Log Analysis\n\n"
                f"- **Lines analyzed:** {line_count}\n"
                f"- **Format:** {log_format}\n\n"
                f"### Detected Patterns\n\n"
                f"{patterns_report}\n\n"
                f"Performing deeper analysis: counting error frequencies, "
                f"identifying first occurrence timestamps, checking for "
                f"cascade patterns, and correlating with timing."
            ),
            break_loop=False,
        )

BUNDLEOF
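
The tool above counts raw substrings, so a keyword such as `rate` also matches inside words like "generate", and it does not record where a pattern first appears. The sketch below shows one possible refinement and is not part of the plugin: `summarize_log` and `KEYWORDS` are hypothetical names, the keyword list mirrors `KNOWN_PATTERNS`, and matching on whole words is a trade-off, since it also skips variants such as "timeouts".

```python
import re
from collections import Counter

# Same keywords as KNOWN_PATTERNS in the tool above.
KEYWORDS = ("oom", "timeout", "refused", "denied", "disk", "dns", "tls", "rate")


def summarize_log(log_data):
    """Count lines matching each keyword and record the first matching line number."""
    counts = Counter()
    first_seen = {}
    for number, line in enumerate(log_data.splitlines(), start=1):
        for keyword in KEYWORDS:
            # \b keeps "rate" from matching inside words such as "generate".
            if re.search(rf"\b{keyword}\b", line, re.IGNORECASE):
                counts[keyword] += 1
                first_seen.setdefault(keyword, number)
    return counts, first_seen
```

The first-occurrence line number supports the prompt's advice to identify where an error started, not just its most recent appearance.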

Create plugins/devops-monitor/extensions/agent_init/setup.py

Initialization extension that logs plugin activation on agent start

Terminal

docker exec agent-zero mkdir -p "/a0/usr/plugins/devops-monitor/extensions/agent_init" && docker exec -i agent-zero sh -c 'cat > "/a0/usr/plugins/devops-monitor/extensions/agent_init/setup.py"' << 'BUNDLEOF'
"""
Extension: agent_init
Plugin: devops-monitor

Runs when the agent initializes. Logs activation of the DevOps Monitor
plugin and sets up runtime state.
"""

from python.helpers.log import Log


async def execute(agent, **kwargs):
    """Initialize the DevOps Monitor plugin."""
    Log.info("DevOps Monitor plugin activated", head="plugin")
    Log.info(
        "Infrastructure monitoring enabled: health checks, log analysis, incident triage",
        head="plugin",
    )

    if not hasattr(agent, "plugin_state"):
        agent.plugin_state = {}
    agent.plugin_state["devops-monitor"] = {
        "active": True,
        "version": "1.0.0",
        "severity_levels": ["P0", "P1", "P2", "P3", "P4"],
    }

BUNDLEOF

Restart Agent Zero

Restart the Agent Zero container to load the new plugin.

Terminal

docker restart agent-zero

Verify installation

Open your Agent Zero interface and verify the plugin loads correctly.

Note: Look for "DevOps Monitor (Agent Zero)" in the available plugins or extensions.
