Introduction
A system that runs smoothly in production is no accident. Behind the scenes there is solid monitoring, structured logging, and the ability to debug issues quickly when something goes wrong.
This article covers monitoring and debugging OpenClaw end to end: log management, performance monitoring, error tracking, troubleshooting common issues, and setting up a proactive alerting system.
💡 Philosophy: "You can't fix what you can't see." Monitoring is not optional; it is foundational for a production system.
Log Management
Logs are the first line of defense when troubleshooting.
1. Structured Logging
Set up a structured log format:
// config/logging.json
{
  "logging": {
    "level": "info",
    "format": "json",
    "outputs": [
      {
        "type": "file",
        "path": "./logs/openclaw.log",
        "maxSize": "100m",
        "maxFiles": 10
      },
      {
        "type": "console",
        "colorize": true
      }
    ],
    "fields": {
      "timestamp": true,
      "level": true,
      "message": true,
      "taskId": true,
      "userId": true,
      "ip": true,
      "duration": true,
      "error": true
    }
  }
}
2. Log Levels
- ERROR: critical errors that need immediate action
- WARN: potential issues that are not critical yet
- INFO: general informational messages
- DEBUG: detailed information for debugging (disable in production)
- TRACE: very verbose output (development only)
3. View Logs
# Real-time log monitoring
tail -f ~/openclaw/logs/openclaw.log
# Filter by level
grep "ERROR" ~/openclaw/logs/openclaw.log
# Last 100 lines
tail -n 100 ~/openclaw/logs/openclaw.log
# Search for specific task
grep "taskId.*backup" ~/openclaw/logs/openclaw.log
# Count errors
grep -c "ERROR" ~/openclaw/logs/openclaw.log
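A plain grep for "ERROR" also matches lines where that word only appears inside a message. Because the log format is JSON, the same filters can be applied to parsed fields instead. The following is a minimal sketch, assuming one JSON object per line with the `level` and `taskId` fields from the logging config above; the function name `filter_logs` is hypothetical and not part of OpenClaw.

```python
import json
from typing import Iterable, Iterator, Optional


def filter_logs(lines: Iterable[str],
                level: Optional[str] = None,
                task_id_contains: Optional[str] = None) -> Iterator[dict]:
    """Yield parsed log entries matching a level and/or a taskId substring."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as stack-trace fragments
        if not isinstance(entry, dict):
            continue
        # Compare levels case-insensitively; the exact casing in the file is an assumption.
        if level and str(entry.get("level", "")).upper() != level.upper():
            continue
        if task_id_contains and task_id_contains not in str(entry.get("taskId", "")):
            continue
        yield entry


if __name__ == "__main__":
    demo = [
        '{"level": "error", "message": "backup failed", "taskId": "backup-database"}',
        '{"level": "info", "message": "ERROR page rendered", "taskId": "render"}',
    ]
    for entry in filter_logs(demo, level="ERROR"):
        print(entry["taskId"], entry["message"])
```

To run it against the real file, pass an open file handle as `lines`, for example `filter_logs(open("logs/openclaw.log"), level="ERROR")`.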
4. Log Rotation
sudo nano /etc/logrotate.d/openclaw
/home/user/openclaw/logs/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0640 user user
    postrotate
        pm2 reloadLogs
    endscript
}
Performance Monitoring
1. Built-in Metrics Endpoint
curl http://localhost:3000/metrics
Response:
{
  "uptime": 86400,
  "memory": {
    "rss": "150MB",
    "heapTotal": "80MB",
    "heapUsed": "65MB",
    "external": "2MB"
  },
  "cpu": {
    "user": 12500,
    "system": 3200
  },
  "tasks": {
    "total": 150,
    "completed": 145,
    "failed": 3,
    "running": 2
  },
  "apiCalls": {
    "total": 1250,
    "successful": 1200,
    "failed": 50
  }
}
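Raw counters are easier to act on once they are turned into rates. The sketch below derives a task failure rate and an API failure rate from a response shaped like the one above. The function name `summarize` is hypothetical, and counting only finished tasks (completed plus failed) in the denominator is a choice made for this example, not something the endpoint defines.

```python
def summarize(metrics: dict) -> dict:
    """Derive failure rates from the counters in a /metrics response."""
    tasks = metrics["tasks"]
    api = metrics["apiCalls"]
    # Running tasks have no outcome yet, so they are left out of the denominator.
    finished = tasks["completed"] + tasks["failed"]
    return {
        "task_failure_rate": tasks["failed"] / finished if finished else 0.0,
        "api_failure_rate": api["failed"] / api["total"] if api["total"] else 0.0,
    }


if __name__ == "__main__":
    sample = {
        "tasks": {"total": 150, "completed": 145, "failed": 3, "running": 2},
        "apiCalls": {"total": 1250, "successful": 1200, "failed": 50},
    }
    for name, value in summarize(sample).items():
        print(f"{name}: {value:.2%}")
```

Tracking these two ratios over time tends to be more informative than the absolute counts, which only ever grow.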
2. PM2 Monitoring
# Real-time monitoring
pm2 monit
# Status
pm2 status
# Resource usage
pm2 list
pm2 show openclaw
# Logs
pm2 logs openclaw --lines 50
3. System Resource Monitoring
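# CPU usage (pgrep -d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, -f openclaw)"
# Memory usage (the [o] keeps grep from matching its own process)
ps aux | grep '[o]penclaw'
# Network connections
netstat -tulpn | grep 3000
# Disk I/O (iotop takes one PID per -p, so use the oldest matching process)
iotop -p "$(pgrep -o -f openclaw)"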
4. Custom Metrics Collection
Set up automated metrics collection:
{
  "name": "collect-metrics",
  "schedule": "*/5 * * * *",
  "actions": [
    {
      "type": "http",
      "method": "GET",
      "url": "http://localhost:3000/metrics",
      "output": "metrics"
    },
    {
      "type": "database",
      "driver": "postgresql",
      "query": "INSERT INTO metrics (timestamp, data) VALUES (NOW(), $1)",
      "params": ["{{metrics}}"]
    },
    {
      "type": "condition",
      "if": "metrics.memory.heapUsed > '500MB'",
      "then": [
        {
          "type": "notification",
          "message": "⚠️ High memory usage: {{metrics.memory.heapUsed}}"
        }
      ]
    }
  ]
}
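One thing to verify about the condition above: the metrics endpoint reports sizes as strings such as "65MB". How OpenClaw's condition engine compares them is not stated here. If it compares them as plain strings, "65MB" sorts after "500MB" and the alert would fire at the wrong time. The sketch below shows the safer approach of converting to bytes before comparing. The helper names and the choice of 1 MB = 1024 * 1024 bytes are assumptions made for this example.

```python
import re

# Assumption for this sketch: binary multiples (1 MB = 1024 * 1024 bytes).
_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size string such as '65MB' or '1.5GB' to a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]?B)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


def heap_exceeds(metrics: dict, limit: str) -> bool:
    """Return True when heapUsed is numerically above the limit."""
    return parse_size(metrics["memory"]["heapUsed"]) > parse_size(limit)


if __name__ == "__main__":
    sample = {"memory": {"heapUsed": "65MB"}}
    print("string comparison:", "65MB" > "500MB")
    print("numeric comparison:", heap_exceeds(sample, "500MB"))
```

If the condition engine only supports simple comparisons, exposing the heap size as a plain number of bytes in the metrics payload would avoid the problem altogether.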
Error Tracking
1. Error Log Analysis
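# Count errors by type (the cut field index depends on the key order of your JSON lines; adjust -f8 to match)
grep "ERROR" logs/openclaw.log | cut -d'"' -f8 | sort | uniq -c | sort -nr
# Recent errors
grep "ERROR" logs/openclaw.log | tail -n 20
# Errors from the previous clock hour (GNU date; assumes UTC ISO 8601 timestamps)
grep "ERROR" logs/openclaw.log | grep "$(date -u -d '1 hour ago' '+%Y-%m-%dT%H')"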
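Matching on an hour prefix selects a clock hour, not a rolling 60-minute window, and cutting on a quote position breaks if the key order changes. A small script can do both jobs on parsed fields instead. This is a sketch under assumptions: each line is a JSON object with `timestamp` (ISO 8601), `level`, and `message` fields, timestamps without an offset are treated as UTC, and `count_recent_errors` is a hypothetical helper name.

```python
import json
from collections import Counter
from datetime import datetime, timedelta, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; naive values are assumed to be UTC."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def count_recent_errors(lines, now: datetime,
                        window: timedelta = timedelta(hours=1)) -> Counter:
    """Count ERROR entries per message within the rolling window ending at now."""
    counts = Counter()
    cutoff = now - window
    for line in lines:
        try:
            entry = json.loads(line)
            when = parse_timestamp(entry["timestamp"])
        except (ValueError, KeyError, TypeError, AttributeError):
            continue  # not JSON, not an object, or no usable timestamp
        if str(entry.get("level", "")).upper() != "ERROR":
            continue
        if cutoff <= when <= now:
            counts[entry.get("message", "(no message)")] += 1
    return counts


if __name__ == "__main__":
    demo = ['{"timestamp": "2026-02-09T10:30:00Z", "level": "ERROR", "message": "db timeout"}']
    now = datetime(2026, 2, 9, 11, 0, tzinfo=timezone.utc)
    for message, count in count_recent_errors(demo, now).most_common():
        print(count, message)
```

In real use, pass `datetime.now(timezone.utc)` as `now` and an open log file as `lines`.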
2. Integration with Sentry (Optional)
# .env
SENTRY_DSN=https://YOUR_PUBLIC_KEY@YOUR_SENTRY_HOST/YOUR_PROJECT_ID
SENTRY_ENVIRONMENT=production
Sentry will then automatically capture errors and send them to its dashboard.
3. AI-Powered Error Analysis
{
  "name": "analyze-errors",
  "schedule": "0 * * * *",
  "actions": [
    {
      "type": "shell",
      "command": "grep 'ERROR' logs/openclaw.log | tail -100",
      "output": "recent_errors"
    },
    {
      "type": "ai-analyze",
      "prompt": "Analyze these error logs and identify: 1) Most common error types, 2) Root causes, 3) Suggested fixes",
      "input": "{{recent_errors}}",
      "output": "analysis"
    },
    {
      "type": "notification",
      "channel": "telegram",
      "message": "📊 Error Analysis:\n{{analysis.summary}}"
    }
  ]
}
Health Checks
1. Basic Health Check
curl http://localhost:3000/health
Response:
{
  "status": "ok",
  "uptime": 86400,
  "timestamp": "2026-02-09T10:30:00Z",
  "version": "1.0.0",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "ai_api": "ok",
    "disk_space": "ok"
  }
}
2. Automated Health Monitoring
{
  "name": "health-check-monitor",
  "schedule": "*/2 * * * *",
  "actions": [
    {
      "type": "http",
      "method": "GET",
      "url": "http://localhost:3000/health",
      "timeout": 5000,
      "output": "health"
    },
    {
      "type": "condition",
      "if": "health.status != 'ok' || health.response_time > 3000",
      "then": [
        {
          "type": "notification",
          "channel": "telegram",
          "message": "🚨 Health check failed!\nStatus: {{health.status}}\nResponse time: {{health.response_time}}ms"
        },
        {
          "type": "shell",
          "command": "pm2 restart openclaw"
        }
      ]
    }
  ]
}
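The same check can also run from outside OpenClaw, which is useful when the scheduler itself is the part that is down. The sketch below probes the health URL, measures the response time, and applies the same rule as the condition above (status not "ok", or slower than 3000 ms). The function names and the "unreachable" status value are hypothetical choices for this example.

```python
import http.client
import json
import time
import urllib.request
from typing import Tuple


def is_unhealthy(status: str, response_time_ms: float,
                 max_response_ms: float = 3000.0) -> bool:
    """Mirror the task condition: bad status or a slow response."""
    return status != "ok" or response_time_ms > max_response_ms


def probe(url: str, timeout_s: float = 5.0) -> Tuple[str, float]:
    """Return (status, elapsed milliseconds) for one health request."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            body = json.loads(response.read().decode("utf-8"))
        status = body.get("status", "unknown") if isinstance(body, dict) else "unknown"
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a non-JSON body.
        status = "unreachable"
    elapsed_ms = (time.monotonic() - started) * 1000
    return status, elapsed_ms


if __name__ == "__main__":
    status, elapsed_ms = probe("http://localhost:3000/health")
    print(f"status={status} response_time={elapsed_ms:.0f}ms "
          f"unhealthy={is_unhealthy(status, elapsed_ms)}")
```

Running it from cron on a second machine gives a signal that does not depend on the monitored process being alive.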
Debugging Techniques
1. Enable Debug Mode
# .env
NODE_ENV=development
LOG_LEVEL=debug
DEBUG=openclaw:*
pm2 restart openclaw
pm2 logs openclaw
2. Interactive Debugging
# Stop PM2
pm2 stop openclaw
# Run directly with the inspector enabled
node --inspect dist/index.js
# Or pause on the first line until a debugger attaches
node --inspect-brk dist/index.js
Connect with Chrome DevTools: chrome://inspect
3. Trace Specific Task
# Enable task tracing
curl -X POST http://localhost:3000/admin/trace \
-H "Authorization: Bearer your-token" \
-H "Content-Type: application/json" \
-d '{"taskId": "backup-database", "level": "verbose"}'
# Run task
curl -X POST http://localhost:3000/api/tasks/backup-database/run
# View trace
grep "taskId.*backup-database" logs/openclaw.log | jq .
4. Memory Leak Detection
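# Take a heap snapshot on demand (Node.js 12+; without this flag SIGUSR2 terminates the process)
node --heapsnapshot-signal=SIGUSR2 dist/index.js &
kill -USR2 $!
# Analyze the resulting .heapsnapshot file with Chrome DevTools, or profile with clinic.js
npm install -g clinic
clinic doctor -- node dist/index.js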
Common Issues & Solutions
🔴 Issue: OpenClaw Crashes Randomly
Symptoms: Process exits unexpectedly
Diagnosis:
pm2 logs openclaw --err
grep "FATAL\|SIGTERM\|SIGKILL" logs/openclaw.log
Common Causes:
- Out of memory (check with dmesg | grep -i kill)
- Unhandled promise rejection
- Segmentation fault (native module issue)
Solution: Enable auto-restart, add more RAM, fix unhandled promises
🟡 Issue: High Memory Usage
Diagnosis:
pm2 monit
node --max-old-space-size=4096 dist/index.js
Solutions:
- Increase Node.js memory limit
- Implement cache eviction
- Check for memory leaks
- Optimize large data processing
🔵 Issue: API Timeouts
Diagnosis:
grep "timeout" logs/openclaw.log
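# curl-format.txt is a --write-out template you create yourself, e.g. a line such as: time_total: %{time_total}s\n
curl -w "@curl-format.txt" http://localhost:3000/api/tasks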
Solutions:
- Increase timeout values
- Optimize slow queries
- Add request queuing
- Check AI API latency
Alerting System
Set up alerts to stay informed about issues:
{
  "alerts": [
    {
      "name": "cpu-high",
      "condition": "cpu_usage > 80",
      "duration": "5m",
      "channels": ["telegram", "email"]
    },
    {
      "name": "memory-high",
      "condition": "memory_usage > 85",
      "duration": "3m",
      "channels": ["telegram"]
    },
    {
      "name": "disk-full",
      "condition": "disk_usage > 90",
      "channels": ["telegram", "email", "slack"]
    },
    {
      "name": "task-failures",
      "condition": "failed_tasks > 5",
      "duration": "1h",
      "channels": ["telegram"]
    }
  ]
}
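The `duration` field is what keeps a short spike from paging you. Its exact semantics in OpenClaw are not spelled out here; a common interpretation is that the condition must hold continuously for that long before the alert fires. The sketch below implements that interpretation with a hypothetical `DurationAlert` class, where a missing duration (as in disk-full) is modeled as zero seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DurationAlert:
    """Fire only after the value stays above the threshold for duration_s seconds."""
    name: str
    threshold: float
    duration_s: float = 0.0
    breach_started: Optional[float] = None

    def observe(self, value: float, now: float) -> bool:
        """Record one sample taken at time `now` (seconds); return True to fire."""
        if value <= self.threshold:
            self.breach_started = None  # condition cleared, reset the timer
            return False
        if self.breach_started is None:
            self.breach_started = now   # first sample of a new breach
        return now - self.breach_started >= self.duration_s


if __name__ == "__main__":
    cpu_high = DurationAlert("cpu-high", threshold=80, duration_s=300)
    for second, usage in [(0, 85), (120, 90), (180, 70), (240, 95), (540, 95)]:
        print(second, usage, cpu_high.observe(usage, now=second))
```

In the demo, the dip to 70 at second 180 resets the timer, so the alert only fires once usage has stayed above 80 from second 240 through second 540.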
Monitoring Dashboard
Set up a simple dashboard to visualize metrics:
1. Grafana + Prometheus (Advanced)
# Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64
./prometheus --config.file=prometheus.yml
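Prometheus scrapes a plain-text exposition format, while the /metrics response shown earlier in this article is JSON. Unless OpenClaw also offers a Prometheus-compatible endpoint (not covered here), a small translation step is needed in between. The sketch below flattens the numeric fields of such a JSON document into `name value` lines. The `openclaw` prefix and the function names are assumptions, and string values such as "65MB" are skipped rather than converted.

```python
import re


def snake_case(key: str) -> str:
    """Convert camelCase keys such as heapUsed to heap_used."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()


def to_prometheus(metrics: dict, prefix: str = "openclaw") -> str:
    """Flatten numeric JSON fields into Prometheus text exposition lines."""
    lines = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, path + [snake_case(str(key))])
        elif isinstance(node, bool):
            return  # bool is a subclass of int; skip it explicitly
        elif isinstance(node, (int, float)):
            # Replace any character that is not valid in a metric name.
            name = re.sub(r"[^a-zA-Z0-9_]", "_", "_".join([prefix] + path))
            lines.append(f"{name} {node}")
        # Anything else (strings, lists, null) is ignored in this sketch.

    walk(metrics, [])
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    sample = {"uptime": 86400, "tasks": {"failed": 3}, "apiCalls": {"total": 1250}}
    print(to_prometheus(sample), end="")
```

Serving this text from a small HTTP handler and adding that address as a scrape target in prometheus.yml would complete the loop.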
2. Simple Web Dashboard
OpenClaw has a built-in dashboard at http://localhost:3000/dashboard that shows:
- Task execution history
- Success/failure rates
- Resource usage charts
- Recent errors
- API call statistics
VPS for Monitoring Workloads
Comprehensive monitoring needs stable resources:
- CPU headroom for metrics collection
- Enough storage for logs and metrics data
- A stable network for alerting
- Uptime guarantee 99.9%
FAQ
How long should logs be kept?
It depends on your compliance requirements. A common baseline is 30 days for access logs, 90 days for error logs, and 1 year for audit logs.
Which monitoring tool is the most lightweight?
PM2's built-in monitoring is the lightest option. If you want something more advanced that is still lightweight, use Netdata (overhead < 3% CPU).
How do you debug an issue that has already been resolved?
Check archived logs, the audit trail, and metrics history. If you use Grafana/Prometheus, you can set the dashboard time range back to when the issue occurred.
Conclusion
Monitoring and debugging are not reactive activities; they are proactive practices that prevent issues before they become critical. With the right setup:
- Structured logs → easy troubleshooting
- Performance metrics → catch bottlenecks early
- Error tracking → fix issues faster
- Health checks → detect failures instantly
- Alerting → stay informed 24/7
👉 Next Steps
⚡ Performance Optimization
Tuning, caching, load balancing, and scaling to handle large workloads
🏛️ Production Architecture
Enterprise-grade setup with high availability and disaster recovery
A reliable system needs solid monitoring.