Observability Plugin
A comprehensive observability plugin for the s9s SLURM management interface that integrates with Prometheus to provide real-time monitoring, historical analysis, and intelligent resource optimization recommendations.
Table of Contents
- Features
- Installation
- Configuration
- Usage
- API Reference
- Architecture
- Metrics
- Efficiency Scoring
- Troubleshooting
- Development
Features
Core Monitoring
- Real-time Metrics: Live CPU, memory, storage, and network utilization
- Prometheus Integration: Native connection to existing Prometheus infrastructure
- Cached Queries: Intelligent caching system to reduce Prometheus load
- Visual Overlays: Seamless metric overlays on existing s9s views
Historical Analysis
- Time Series Collection: Automated collection and storage of historical metrics
- 30-Day Retention: Configurable data retention with automatic cleanup
- Statistical Analysis: Comprehensive trend analysis with linear regression
- Anomaly Detection: Z-score based anomaly detection with configurable sensitivity
- Seasonal Patterns: Daily, weekly, and custom seasonal pattern analysis
Resource Efficiency
- Comprehensive Scoring: Multi-factor efficiency scoring (0-100 scale)
- Resource Analysis: Individual analysis for CPU, memory, storage, network, and GPU
- Optimization Recommendations: AI-driven recommendations with cost impact analysis
- Cluster-wide Insights: Aggregate efficiency analysis across the entire cluster
- ROI Calculations: Return on investment analysis for optimization suggestions
Data Subscriptions
- Real-time Updates: Subscribe to metric updates with customizable intervals
- Persistent Subscriptions: Subscriptions survive plugin restarts
- Change Detection: Intelligent notification system for significant metric changes
- Callback System: Flexible callback system for custom integrations
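The callback mechanism is internal to the plugin, so the sketch below is only an illustration of the pattern: register a function per query, and have it invoked whenever an update arrives. The `MetricUpdate`, `Subscriber`, `Subscribe`, and `Publish` names are hypothetical, not the plugin's actual API.

```go
// Illustrative only: MetricUpdate, Subscriber, Subscribe, and Publish are
// hypothetical names sketching the callback pattern, not the plugin's API.
package main

import (
	"fmt"
	"time"
)

// MetricUpdate is a hypothetical payload delivered to subscribers.
type MetricUpdate struct {
	Query     string
	Value     float64
	Timestamp time.Time
}

// Subscriber maps a query to the callbacks registered for it.
type Subscriber struct {
	callbacks map[string][]func(MetricUpdate)
}

// Subscribe registers a callback for a query.
func (s *Subscriber) Subscribe(query string, cb func(MetricUpdate)) {
	if s.callbacks == nil {
		s.callbacks = make(map[string][]func(MetricUpdate))
	}
	s.callbacks[query] = append(s.callbacks[query], cb)
}

// Publish fans an update out to every callback registered for its query.
func (s *Subscriber) Publish(u MetricUpdate) {
	for _, cb := range s.callbacks[u.Query] {
		cb(u)
	}
}

func main() {
	var s Subscriber
	s.Subscribe("up", func(u MetricUpdate) {
		fmt.Printf("%s = %.1f at %s\n", u.Query, u.Value, u.Timestamp.Format(time.RFC3339))
	})
	s.Publish(MetricUpdate{Query: "up", Value: 1, Timestamp: time.Now()})
}
```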
External API
- HTTP REST API: Complete RESTful API for external integrations
- Authentication: Optional bearer token authentication
- JSON Responses: Structured JSON responses for all endpoints
- Rate Limiting: Built-in protection against excessive requests
Installation
- Place the observability plugin directory in your s9s plugins folder:

  ```bash
  cp -r plugins/observability /path/to/s9s/plugins/
  ```

- Configure your s9s instance to load the plugin:

  ```yaml
  plugins:
    - name: observability
      enabled: true
      config:
        prometheus.endpoint: "http://your-prometheus:9090"
        prometheus.timeout: "10s"
        display.refreshInterval: "30s"
        display.showOverlays: true
        alerts.enabled: true
  ```
Configuration
Basic Configuration
```yaml
observability:
  # Prometheus connection settings
  prometheus:
    endpoint: "http://localhost:9090"
    timeout: "10s"

    # Authentication (optional)
    auth:
      type: "basic"            # or "bearer"
      username: "admin"
      password: "secret"
      # token: "bearer-token"  # for bearer auth

    # TLS settings (optional)
    tls:
      enabled: true
      insecureSkipVerify: false
      caFile: "/path/to/ca.pem"
      certFile: "/path/to/cert.pem"
      keyFile: "/path/to/key.pem"

  # Display configuration
  display:
    refreshInterval: "30s"
    showOverlays: true
    showSparklines: true
    sparklinePoints: 20
    colorScheme: "default"
    decimalPrecision: 2

  # Alert settings
  alerts:
    enabled: true
    checkInterval: "60s"
    loadPredefinedRules: true
    showNotifications: true

  # Caching configuration
  cache:
    enabled: true
    defaultTTL: "1m"
    maxSize: 1000
    cleanupInterval: "5m"

  # API configuration
  api:
    enabled: false
    port: 8080
    auth_token: "your-secret-token"
```
Advanced Configuration
```yaml
observability:
  # Historical data collection
  historical:
    dataDir: "./data/historical"
    retention: "720h"       # 30 days
    collectInterval: "5m"
    maxDataPoints: 10000

    # Custom queries for data collection
    queries:
      node_cpu: '100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
      node_memory: '(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100'
      node_load: 'node_load1'
      job_count: 'slurm_job_total'
      queue_length: 'slurm_queue_pending_jobs'

  # Metric collection settings
  metrics:
    node:
      nodeLabel: "instance"
      rateRange: "5m"
      enabledMetrics: ["cpu", "memory", "disk", "network"]
    job:
      enabled: true
      cgroupPattern: "/slurm/uid_%d/job_%d"
      enabledMetrics: ["cpu", "memory", "io"]
```
Usage
Interactive Interface
- Observability View: Access the main observability dashboard by pressing 'o' in the s9s interface
- Metric Overlays: View real-time metrics overlaid on jobs and nodes views
- Historical Charts: Access time-series charts and trend analysis
- Efficiency Dashboard: Review resource efficiency scores and recommendations
External API
The plugin exposes a comprehensive REST API when api.enabled is set to true in the configuration.
Authentication
All API requests require a Bearer token when authentication is enabled:
curl -H "Authorization: Bearer your-token" http://localhost:8080/api/v1/status
Metrics Endpoints
Query Metrics
Instant query:
curl "http://localhost:8080/api/v1/metrics/query?query=up"
Range query:
curl "http://localhost:8080/api/v1/metrics/query_range?query=node_cpu&start=2023-01-01T00:00:00Z&end=2023-01-01T23:59:59Z&step=15m"
Historical Data
Get historical data:
curl "http://localhost:8080/api/v1/historical/data?metric=node_cpu&start=2023-01-01T00:00:00Z&end=2023-01-02T00:00:00Z"
Get statistics:
curl "http://localhost:8080/api/v1/historical/statistics?metric=node_cpu&duration=24h"
Analysis Endpoints
Trend Analysis
curl "http://localhost:8080/api/v1/analysis/trend?metric=node_cpu&duration=7d"
Anomaly Detection
curl "http://localhost:8080/api/v1/analysis/anomaly?metric=node_cpu&duration=24h&sensitivity=2.0"
Seasonal Analysis
curl "http://localhost:8080/api/v1/analysis/seasonal?metric=node_cpu&duration=168h"
Efficiency Analysis
Resource Efficiency
curl "http://localhost:8080/api/v1/efficiency/resource?type=cpu&duration=168h" curl "http://localhost:8080/api/v1/efficiency/resource?type=memory&duration=168h"
Cluster Efficiency
curl "http://localhost:8080/api/v1/efficiency/cluster?duration=168h"
Subscription Management
List Subscriptions
curl "http://localhost:8080/api/v1/subscriptions"
Create Subscription
```bash
curl -X POST "http://localhost:8080/api/v1/subscriptions/create" \
  -H "Content-Type: application/json" \
  -d '{"provider_id": "prometheus-metrics", "params": {"query": "up", "update_interval": "30s"}}'
```
Delete Subscription
curl -X DELETE "http://localhost:8080/api/v1/subscriptions/delete?id=subscription-id"
Architecture
Component Overview
```
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│  s9s Interface  │   │  External Apps  │   │   Prometheus    │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
    ┌────▼─────────────────────▼─────────────────────▼────┐
    │                Observability Plugin                 │
    │                                                     │
    │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
    │  │    Views    │  │ External API│  │ Prometheus  │  │
    │  │             │  │             │  │   Client    │  │
    │  └─────────────┘  └─────────────┘  └─────────────┘  │
    │                                                     │
    │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
    │  │  Overlays   │  │Subscription │  │ Historical  │  │
    │  │             │  │   Manager   │  │  Collector  │  │
    │  └─────────────┘  └─────────────┘  └─────────────┘  │
    │                                                     │
    │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
    │  │ Efficiency  │  │   Config    │  │    Cache    │  │
    │  │  Analyzer   │  │   Manager   │  │   Manager   │  │
    │  └─────────────┘  └─────────────┘  └─────────────┘  │
    └─────────────────────────────────────────────────────┘
```
Data Flow
- Metric Collection: Prometheus client queries metrics based on configured intervals
- Caching: Frequently accessed metrics are cached to reduce Prometheus load (a minimal sketch follows this list)
- Historical Storage: Time-series data is collected and stored locally for analysis
- Analysis Pipeline: Historical data feeds into trend, anomaly, and efficiency analyzers
- Subscription System: Real-time updates are distributed to subscribers
- API Exposure: External API provides programmatic access to all functionality
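The caching step is essentially a TTL cache keyed by query string. Below is a minimal sketch of that idea, assuming a single-threaded caller; the plugin's actual cache also enforces maxSize and periodic cleanup per the cache configuration shown earlier.

```go
// Minimal TTL cache sketch keyed by query string. Assumes a single
// goroutine; the plugin's real cache also enforces maxSize and cleanup.
package main

import (
	"fmt"
	"time"
)

type entry struct {
	value   string
	expires time.Time
}

type ttlCache struct {
	ttl   time.Duration
	items map[string]entry
}

func newTTLCache(ttl time.Duration) *ttlCache {
	return &ttlCache{ttl: ttl, items: make(map[string]entry)}
}

func (c *ttlCache) get(query string) (string, bool) {
	e, ok := c.items[query]
	if !ok || time.Now().After(e.expires) {
		return "", false
	}
	return e.value, true
}

func (c *ttlCache) put(query, value string) {
	c.items[query] = entry{value: value, expires: time.Now().Add(c.ttl)}
}

func main() {
	cache := newTTLCache(time.Minute) // mirrors defaultTTL: "1m"
	if _, ok := cache.get("up"); !ok {
		// Cache miss: this is where the plugin would query Prometheus.
		cache.put("up", `{"status":"success"}`)
	}
	v, _ := cache.get("up")
	fmt.Println("cached result:", v)
}
```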
Storage Structure
```
data/
├── observability/           # Subscription persistence
│   ├── subscriptions.json
│   └── notifications.json
└── historical/              # Historical data storage
    ├── node_cpu.json
    ├── node_memory.json
    ├── node_load.json
    └── ...
```
Metrics
Default Collected Metrics
- node_cpu: CPU utilization percentage per node
- node_memory: Memory utilization percentage per node
- node_load: System load average per node
- job_count: Total number of SLURM jobs
- queue_length: Number of pending jobs in queue
Custom Metrics
Add custom metrics by extending the historical collector configuration:
```yaml
historical:
  queries:
    custom_metric: 'your_prometheus_query_here'
    gpu_usage: 'nvidia_gpu_utilization_percent'
    network_io: 'rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m])'
```
Efficiency Scoring
The efficiency analyzer uses a multi-factor scoring system:
Scoring Components
- Utilization Score (50%): Optimal range 70-85%
- Stability Score (30%): Lower standard deviation is better
- Waste Score (20%): Penalty for unused allocated resources
Resource-Specific Multipliers
- CPU: 1.1x (performance critical)
- Memory: 1.05x (stability critical)
- Storage: 1.0x (baseline)
- Network: 0.95x (less critical for most workloads)
Efficiency Levels
- Excellent (90-100): Optimal resource utilization
- Good (75-89): Minor optimization opportunities
- Fair (60-74): Moderate inefficiencies detected
- Poor (40-59): Significant waste or instability
- Critical (0-39): Severe inefficiencies requiring attention
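As a rough worked example under the weights and multipliers above: a CPU series that scores 85 for utilization, 90 for stability, and 80 for waste yields 0.5·85 + 0.3·90 + 0.2·80 = 85.5, then 85.5 × 1.1 ≈ 94 after the CPU multiplier (capped at 100), which falls in the Excellent band. How each component score is derived from the raw series is internal to the analyzer; the sketch below only shows the weighting and banding arithmetic.

```go
// Weighting and banding arithmetic only; the component scores (0-100) are
// assumed to come from the analyzer's utilization/stability/waste logic.
package main

import "fmt"

func compositeScore(utilization, stability, waste, multiplier float64) float64 {
	score := 0.5*utilization + 0.3*stability + 0.2*waste
	score *= multiplier
	if score > 100 {
		score = 100
	}
	if score < 0 {
		score = 0
	}
	return score
}

func level(score float64) string {
	switch {
	case score >= 90:
		return "Excellent"
	case score >= 75:
		return "Good"
	case score >= 60:
		return "Fair"
	case score >= 40:
		return "Poor"
	default:
		return "Critical"
	}
}

func main() {
	cpu := compositeScore(85, 90, 80, 1.1) // CPU multiplier from the list above
	fmt.Printf("CPU efficiency: %.0f (%s)\n", cpu, level(cpu))
}
```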
Troubleshooting
Common Issues
Plugin fails to start
- Verify Prometheus endpoint is accessible
- Check authentication credentials
- Ensure required directories are writable
No data in historical views
- Confirm data collection is enabled
- Check historical collector is running
- Verify Prometheus queries return data
API authentication failures
- Ensure correct bearer token format
- Check token matches configuration
- Verify API is enabled in configuration
Performance issues
- Increase cache TTL to reduce Prometheus load
- Reduce collection frequency for large clusters
- Consider increasing maxDataPoints for longer retention
Debug Mode
Enable debug logging by setting log level to debug:
export LOG_LEVEL=debug
Health Checks
Monitor plugin health through the API:
curl http://localhost:8080/health
Or use the plugin's internal health check:
- Plugin status shows "healthy" when Prometheus is accessible
- Cache statistics indicate query performance
- Subscription statistics show active data flows
Development
Building
```bash
cd plugins/observability
go build -o observability.so -buildmode=plugin .
```
Testing
Unit tests:
go test ./...
Integration tests with mock Prometheus:
go test -v ./integration_test.go
Benchmark tests:
go test -bench=. -benchmem
Contributing
- Follow Go coding standards
- Add comprehensive tests for new features
- Update documentation for configuration changes
- Ensure backward compatibility
License
This plugin is licensed under the MIT License. See LICENSE file for details.