DevOps Monitor (Agent Zero) — Manual Install Guide
Prerequisites
- A running Agent Zero instance (v0.9.8 or later)
- SSH access to your VPS
- Docker installed on your VPS
Estimated time: ~20 minutes
Installation Steps
Connect to your VPS
SSH into the server where your Agent Zero instance is running.
ssh root@your-vps-ip
Create the plugin directory
Create the plugin directory inside the Agent Zero container for the DevOps Monitor (Agent Zero) plugin.
docker exec agent-zero mkdir -p /a0/usr/plugins/devops-monitor/
Create plugins/devops-monitor/plugin.json
Plugin manifest declaring name, version, hooks, and tool registrations
docker exec -i agent-zero sh -c 'cat > "/a0/usr/plugins/devops-monitor/plugin.json"' << 'BUNDLEOF'
{
  "name": "devops-monitor",
  "version": "1.0.0",
  "description": "Infrastructure monitoring and incident response specialist",
  "author": "RunClaw",
  "agent_zero_version": ">=0.9.8",
  "hooks": [
    "agent_init"
  ],
  "tools": [
    "health_check",
    "log_analyzer"
  ],
  "dependencies": []
}
BUNDLEOF
Create plugins/devops-monitor/prompts/system.md
System prompt additions for incident response and monitoring methodology
docker exec agent-zero mkdir -p "/a0/usr/plugins/devops-monitor/prompts" && docker exec -i agent-zero sh -c 'cat > "/a0/usr/plugins/devops-monitor/prompts/system.md"' << 'BUNDLEOF'
# DevOps Monitor Plugin
You have the DevOps Monitor plugin active. You are a specialist in infrastructure monitoring, incident triage, and operational reliability.
## Incident Severity Classification
| Level | Name | Description | Response Time |
|-------|------|-------------|---------------|
| P0 | Critical | Complete outage, data loss risk | Immediate |
| P1 | Major | Significant functionality impaired | < 15 min |
| P2 | Degraded | Performance degradation, partial impact | < 1 hour |
| P3 | Minor | Cosmetic issues, workarounds available | < 4 hours |
| P4 | Informational | No immediate impact, proactive finding | Next business day |
## Triage Methodology
1. **Assess Impact** -- How many users/services affected? Is there data loss risk?
2. **Classify Severity** -- Use the severity table above
3. **Identify Root Cause** -- Correlate logs, metrics, and timing
4. **Recommend Action** -- Provide specific, actionable remediation steps
5. **Document** -- Include timeline, findings, and actions taken
## Monitoring Standards
- Always check both the symptom AND potential root causes
- Correlate across multiple signals (logs, metrics, health checks)
- Distinguish between transient issues and persistent problems
- Consider cascading failures (one service down may cause others to fail)
- Note the blast radius (which downstream services are affected)
BUNDLEOF
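The severity table in system.md can also be expressed as data, which may be convenient if you want to sort or filter findings by urgency outside the prompt. This is a minimal sketch and not part of the plugin: `Severity`, `RESPONSE_TARGET`, and `most_urgent` are hypothetical names, and the targets only mirror the table.

```python
from datetime import timedelta
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Severity levels from the table; a lower value is more urgent."""
    P0 = 0
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4


# Response targets copied from the table. None marks P4, whose
# "next business day" target is not a fixed duration.
RESPONSE_TARGET: dict[Severity, Optional[timedelta]] = {
    Severity.P0: timedelta(0),  # immediate
    Severity.P1: timedelta(minutes=15),
    Severity.P2: timedelta(hours=1),
    Severity.P3: timedelta(hours=4),
    Severity.P4: None,
}


def most_urgent(findings: list[tuple[Severity, str]]) -> tuple[Severity, str]:
    """Return the finding with the most urgent severity."""
    return min(findings, key=lambda finding: finding[0])
```

Using an `IntEnum` keeps the levels comparable, so `min` picks the most urgent finding without extra lookup logic.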
Create plugins/devops-monitor/prompts/instructions.md
Detailed triage methodology, severity classification, and response patterns
docker exec agent-zero mkdir -p "/a0/usr/plugins/devops-monitor/prompts" && docker exec -i agent-zero sh -c 'cat > "/a0/usr/plugins/devops-monitor/prompts/instructions.md"' << 'BUNDLEOF'
# DevOps Monitor -- Behavioral Instructions
## When Investigating an Issue
1. **Gather context** -- Ask for relevant logs, metrics, error messages, and timeline
2. **Check the basics first** -- DNS, connectivity, TLS, disk space, memory, CPU
3. **Look for patterns** -- Correlate timestamps across log sources
4. **Think about what changed** -- Recent deployments, config changes, traffic spikes
5. **Provide severity assessment** -- Always classify the issue using P0-P4
## Log Analysis Approach
When given log data:
- Count error types and frequencies
- Identify the first occurrence (not just the most recent)
- Look for error cascades (one error causing others)
- Check for rate/timing patterns (every N seconds = scheduled job, random = load-dependent)
- Extract actionable data (IP addresses, user IDs, request IDs, stack traces)
## Health Check Protocol
When performing health checks:
- Check HTTP status code AND response time
- Verify TLS certificate validity and expiry
- Test from multiple angles if possible (different endpoints, protocols)
- Report results in a clear table format
- Flag anything that's degraded, not just down
## Incident Response Template
For P0-P2 incidents, structure your response as:
```
## Incident Summary
**Severity:** P[0-4]
**Impact:** [Who/what is affected]
**Started:** [When, if known]
## Findings
[What you found, with evidence]
## Root Cause
[Most likely root cause, confidence level]
## Recommended Actions
1. [Immediate mitigation]
2. [Root cause fix]
3. [Prevention for next time]
## Monitoring
[What to watch to confirm resolution]
```
## Common Patterns to Watch For
- **Memory leaks**: Gradual increase in RSS over hours/days
- **Connection exhaustion**: Timeouts with "too many open files" or "connection refused"
- **Disk pressure**: Slow queries and failed writes often correlate with low disk space or saturated disk I/O (heavy swapping points to memory pressure instead)
- **DNS issues**: Intermittent failures with "temporary failure in name resolution"
- **TLS expiry**: Many monitoring tools start warning about 30 days before a certificate expires
BUNDLEOF
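The log analysis guidance in instructions.md distinguishes errors that recur at a fixed interval (likely a scheduled job) from errors that arrive irregularly (likely load-dependent). One way to make that check concrete is to compare the gaps between consecutive error timestamps. The sketch below is illustrative only: `classify_timing` and its `tolerance` parameter are hypothetical and not part of the plugin.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def classify_timing(timestamps: list[datetime], tolerance: float = 0.1) -> str:
    """Label error timestamps as 'periodic' or 'irregular'.

    The gaps between consecutive events are compared: if their spread is
    small relative to the average gap, the errors look scheduled.
    """
    if len(timestamps) < 3:
        return "insufficient data"
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        # All events share one timestamp, so there is no interval to measure.
        return "irregular"
    relative_spread = pstdev(gaps) / average_gap
    return "periodic" if relative_spread <= tolerance else "irregular"


base = datetime(2024, 1, 1, 12, 0, 0)
every_minute = [base + timedelta(seconds=60 * i) for i in range(5)]
print(classify_timing(every_minute))  # periodic
```

The threshold is a judgment call; real schedulers have jitter, so a looser `tolerance` may suit noisy logs better.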
Create plugins/devops-monitor/tools/health_check.py
HTTP health check tool with timeout, retry logic, and TLS verification
docker exec agent-zero mkdir -p "/a0/usr/plugins/devops-monitor/tools" && docker exec -i agent-zero sh -c 'cat > "/a0/usr/plugins/devops-monitor/tools/health_check.py"' << 'BUNDLEOF'
"""
Tool: HealthCheck
Description: Perform HTTP health checks with timeout and retry logic
Plugin: devops-monitor
Checks endpoint availability, response time, status codes, and TLS
certificate expiry. Supports multiple URLs in a single check.
"""
from python.helpers.tool import Tool, Response


class HealthCheck(Tool):
    """Perform HTTP health checks on one or more endpoints.

    Use this tool to verify endpoint availability, measure response time,
    and check TLS certificate status.
    """

    async def execute(
        self,
        urls: str = "",
        timeout_seconds: int = 10,
        retries: int = 1,
        **kwargs,
    ):
        """Run health checks against provided URLs.

        Args:
            urls: Comma-separated list of URLs to check
            timeout_seconds: Request timeout in seconds (default: 10)
            retries: Number of retry attempts on failure (default: 1)
        """
        if not urls:
            return Response(
                message="Please provide one or more URLs to health-check (comma-separated).",
                break_loop=False,
            )

        # Split the comma-separated input and drop empty entries.
        url_list = [u.strip() for u in urls.split(",") if u.strip()]

        # Note: this builds a plan for the agent to act on; it does not
        # issue the HTTP requests itself.
        results = []
        for url in url_list:
            results.append(
                f"- **{url}**: Checking with {timeout_seconds}s timeout, "
                f"{retries} retries..."
            )
        checks_plan = "\n".join(results)

        return Response(
            message=(
                f"## Health Check Plan\n\n"
                f"{checks_plan}\n\n"
                f"Running HTTP checks with response time measurement, "
                f"status code verification, and TLS certificate inspection. "
                f"Results will be presented in a summary table."
            ),
            break_loop=False,
        )
BUNDLEOF
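The health_check.py tool returns a plan and leaves the actual requests to the agent. If you want to see what a check with a timeout and retries could look like, here is a minimal sketch using only the Python standard library. It is not the plugin's implementation: `CheckResult`, `fetch_status`, and `check_url` are hypothetical names, and only exceptions (timeouts, refused connections, TLS failures) trigger a retry. `urllib.request.urlopen` verifies TLS certificates by default; inspecting certificate expiry is not covered here.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    ok: bool
    status: Optional[int]
    elapsed_ms: float
    attempts: int
    error: Optional[str] = None


def fetch_status(url: str, timeout_seconds: int) -> int:
    """Return the HTTP status code of a GET request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses raise HTTPError; the code is still a valid result.
        return exc.code


def check_url(
    url: str,
    timeout_seconds: int = 10,
    retries: int = 1,
    fetch: Callable[[str, int], int] = fetch_status,
) -> CheckResult:
    """Try the URL up to retries + 1 times and report the first response."""
    attempts = 0
    last_error = None
    for _ in range(retries + 1):
        attempts += 1
        started = time.monotonic()
        try:
            status = fetch(url, timeout_seconds)
        except Exception as exc:  # network errors, timeouts, TLS failures
            last_error = f"{type(exc).__name__}: {exc}"
            continue
        elapsed_ms = (time.monotonic() - started) * 1000
        return CheckResult(url, 200 <= status < 400, status, elapsed_ms, attempts)
    return CheckResult(url, False, None, 0.0, attempts, last_error)
```

The `fetch` parameter exists so the retry logic can be exercised with a stand-in function instead of a live endpoint.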
Create plugins/devops-monitor/tools/log_analyzer.py
Tool for parsing common log formats and extracting error patterns
docker exec agent-zero mkdir -p "/a0/usr/plugins/devops-monitor/tools" && docker exec -i agent-zero sh -c 'cat > "/a0/usr/plugins/devops-monitor/tools/log_analyzer.py"' << 'BUNDLEOF'
"""
Tool: LogAnalyzer
Description: Parse log data and extract error patterns
Plugin: devops-monitor
Analyzes log entries to identify error frequencies, patterns, cascades,
and timing correlations. Supports common formats (nginx, syslog, JSON).
"""
from python.helpers.tool import Tool, Response


class LogAnalyzer(Tool):
    """Analyze log data to extract error patterns and correlations.

    Use this tool when you have log data and need to identify the most
    significant errors, their frequency, and any cascade patterns.
    """

    KNOWN_PATTERNS = {
        "oom": "Out of Memory -- process killed by kernel OOM killer",
        "timeout": "Connection or request timeout -- upstream not responding",
        "refused": "Connection refused -- service not listening or port blocked",
        "denied": "Permission denied -- auth/access control issue",
        "disk": "Disk space or I/O error -- check df and iostat",
        "dns": "DNS resolution failure -- check resolv.conf and upstream DNS",
        "tls": "TLS/SSL error -- certificate or handshake issue",
        "rate": "Rate limiting triggered -- too many requests",
    }

    async def execute(self, log_data: str = "", log_format: str = "auto", **kwargs):
        """Analyze log data for error patterns.

        Args:
            log_data: Raw log text to analyze
            log_format: Log format hint (auto, nginx, syslog, json). Default: auto
        """
        if not log_data:
            return Response(
                message="Please provide log data to analyze.",
                break_loop=False,
            )

        line_count = len(log_data.strip().split("\n"))

        # Check for known error patterns. This is a case-insensitive
        # substring match, so "rate" also counts words such as "generate".
        detected_patterns = []
        log_lower = log_data.lower()
        for keyword, description in self.KNOWN_PATTERNS.items():
            if keyword in log_lower:
                count = log_lower.count(keyword)
                detected_patterns.append(
                    f"- **{keyword}** ({count}x): {description}"
                )

        patterns_report = (
            "\n".join(detected_patterns) if detected_patterns else "No known error patterns detected."
        )

        return Response(
            message=(
                f"## Log Analysis\n\n"
                f"- **Lines analyzed:** {line_count}\n"
                f"- **Format:** {log_format}\n\n"
                f"### Detected Patterns\n\n"
                f"{patterns_report}\n\n"
                f"Performing deeper analysis: counting error frequencies, "
                f"identifying first occurrence timestamps, checking for "
                f"cascade patterns, and correlating with timing."
            ),
            break_loop=False,
        )
BUNDLEOF
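Because log_analyzer.py matches keywords as plain substrings, a short keyword such as "rate" is also counted inside unrelated words like "generate". One possible refinement is to match on word boundaries. The sketch below is a suggestion rather than the plugin's behavior; `count_keywords` is a hypothetical helper and `KEYWORDS` copies the keys of `KNOWN_PATTERNS`.

```python
import re
from collections import Counter

# Keys copied from KNOWN_PATTERNS in log_analyzer.py.
KEYWORDS = ["oom", "timeout", "refused", "denied", "disk", "dns", "tls", "rate"]


def count_keywords(log_data: str, keywords: list[str] = KEYWORDS) -> Counter:
    """Count whole-word, case-insensitive occurrences of each keyword."""
    counts: Counter = Counter()
    for keyword in keywords:
        pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
        hits = len(pattern.findall(log_data))
        if hits:
            counts[keyword] = hits
    return counts


sample = "generate report\nrate limit hit\nDNS lookup failed"
print(count_keywords(sample))  # Counter({'dns': 1, 'rate': 1})
```

The trade-off is that whole-word matching misses plurals and underscore-joined names such as "timeouts" or "rate_limit", so the keyword list may need extra entries for those forms.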
Create plugins/devops-monitor/extensions/agent_init/setup.py
Initialization extension that logs plugin activation on agent start
docker exec agent-zero mkdir -p "/a0/usr/plugins/devops-monitor/extensions/agent_init" && docker exec -i agent-zero sh -c 'cat > "/a0/usr/plugins/devops-monitor/extensions/agent_init/setup.py"' << 'BUNDLEOF'
"""
Extension: agent_init
Plugin: devops-monitor
Runs when the agent initializes. Logs activation of the DevOps Monitor
plugin and sets up runtime state.
"""
from python.helpers.log import Log


async def execute(agent, **kwargs):
    """Initialize the DevOps Monitor plugin."""
    Log.info("DevOps Monitor plugin activated", head="plugin")
    Log.info(
        "Infrastructure monitoring enabled: health checks, log analysis, incident triage",
        head="plugin",
    )

    if not hasattr(agent, "plugin_state"):
        agent.plugin_state = {}

    agent.plugin_state["devops-monitor"] = {
        "active": True,
        "version": "1.0.0",
        "severity_levels": ["P0", "P1", "P2", "P3", "P4"],
    }
BUNDLEOF
Restart Agent Zero
Restart the Agent Zero container to load the new plugin.
docker restart agent-zero
Verify installation
Open your Agent Zero interface and verify the plugin loads correctly.
Note: Look for "DevOps Monitor (Agent Zero)" in the available plugins or extensions.
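If the plugin does not appear, it can help to confirm that every file the manifest refers to was actually written. The sketch below checks a plugin directory against its plugin.json. The mapping it assumes (each tool name to `tools/<name>.py`, each hook to at least one `.py` file under `extensions/<hook>/`) is inferred from the layout used in this guide, not from documented Agent Zero behavior, and `missing_plugin_files` is a hypothetical helper.

```python
import json
from pathlib import Path


def missing_plugin_files(plugin_dir: Path) -> list[str]:
    """List files the manifest implies but that are absent on disk."""
    manifest_path = plugin_dir / "plugin.json"
    if not manifest_path.is_file():
        return ["plugin.json"]
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

    # Assumed convention: one Python file per declared tool.
    expected = [f"tools/{name}.py" for name in manifest.get("tools", [])]
    missing = [rel for rel in expected if not (plugin_dir / rel).is_file()]

    # Assumed convention: one directory per hook holding at least one script.
    for hook in manifest.get("hooks", []):
        hook_dir = plugin_dir / "extensions" / hook
        if not (hook_dir.is_dir() and any(hook_dir.glob("*.py"))):
            missing.append(f"extensions/{hook}/*.py")
    return missing


if __name__ == "__main__":
    # Path used throughout this guide; run this inside the container.
    print(missing_plugin_files(Path("/a0/usr/plugins/devops-monitor")))
```

An empty list means all expected files are present; it does not prove that Agent Zero loaded them, so the check in the interface is still worth doing.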