# Performance Monitoring Guide
Monitor, analyze, and optimize your SLURM cluster performance with S9S's comprehensive performance monitoring capabilities.
## Overview
S9S provides real-time performance monitoring for:
- Cluster-wide resource utilization
- Job efficiency and performance trends
- Node health and capacity metrics
- Queue performance and wait times
- Storage and network I/O statistics
## Real-Time Dashboard

### Performance Views
Access performance monitoring views:
| Key | View | Description |
|-----|------|--------------|
|     | Performance Dashboard | Overall cluster performance |
|     | Metrics View | Detailed resource metrics |
|     | Trend Analysis | Historical performance trends |
|     | Alert Dashboard | Performance alerts and warnings |
### Dashboard Widgets
Cluster Overview:
```
CPU Utilization:  ███████████░░░░░  68.5%
Memory Usage:     ███████░░░░░░░░░  45.2%
GPU Utilization:  █████████████░░░  82.3%

Active Jobs:    2,847 / 3,500
Queue Length:   156 jobs
Avg Wait Time:  12.5 minutes
```
Resource Heatmap:
```
Node     CPU      Memory   GPU      Load  Status
node001  ██████   ██████   ██████   2.1   MIXED
node002  ██████   ██████   ██████   3.8   ALLOC
node003  ██████   ██████   ██████   0.1   IDLE
```
## Job Performance Analysis

### Job Efficiency Metrics
Analyze job performance:
```
# View job efficiency
:efficiency --job 12345

Job 12345 Efficiency Analysis:
  CPU Efficiency:     85.2%  (Good)
  Memory Efficiency:  67.8%  (Fair)
  GPU Utilization:    92.1%  (Excellent)
  I/O Wait:            3.2%  (Good)
  Runtime Efficiency: 78.5%  (Good)

Recommendations:
  - Consider reducing memory allocation (32GB → 24GB)
  - Job is CPU-bound, memory overprovisioned
```
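The efficiency figures compare what a job actually consumed against what it was allocated, in the same spirit as SLURM's `seff` utility. A minimal sketch of that arithmetic is shown below; the input numbers are illustrative (chosen to roughly match the report above), and in practice they would come from accounting data such as `sacct` fields (TotalCPU, Elapsed, AllocCPUS, MaxRSS, ReqMem), not from the S9S API.

```python
# Minimal sketch of job efficiency math, in the spirit of SLURM's seff.
# Input values are illustrative; real values come from accounting data.

def cpu_efficiency(total_cpu_seconds: float, elapsed_seconds: float, alloc_cpus: int) -> float:
    """Fraction of the allocated CPU time the job actually used."""
    return total_cpu_seconds / (elapsed_seconds * alloc_cpus)

def memory_efficiency(max_rss_gb: float, requested_gb: float) -> float:
    """Fraction of the requested memory the job actually touched."""
    return max_rss_gb / requested_gb

# Numbers chosen to roughly match the report above (assumed, not measured):
print(f"CPU efficiency:    {cpu_efficiency(49_080, 3_600, 16):.1%}")  # -> 85.2%
print(f"Memory efficiency: {memory_efficiency(21.7, 32):.1%}")        # -> 67.8%
```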
### Performance Trends
```
# Historical performance trends
:trends --user alice --period month

Alice's Performance Trends (Last 30 Days):
  Avg CPU Efficiency:  82.3%    (▲ +5.2% vs last month)
  Avg Wait Time:       8.5 min  (▼ -2.1 min vs last month)
  Jobs Completed:      247      (▲ +18% vs last month)
  Success Rate:        96.8%    (▲ +1.2% vs last month)
```
### Resource Utilization Analysis
```
# Detailed resource analysis
:analyze --partition gpu --time-range 7d

GPU Partition Analysis (Last 7 Days):

Resource Utilization:
  - Average GPU Usage: 78.5%
  - Peak GPU Usage:    98.2% (Tuesday 14:30)
  - Lowest Usage:      12.3% (Sunday 03:00)

Bottlenecks Identified:
  1. Memory bandwidth limitation on older nodes
  2. I/O contention during peak hours (14:00-16:00)
  3. Network saturation on rack 3 nodes

Optimization Recommendations:
  1. Upgrade memory on node[025-048]
  2. Stagger large I/O jobs
  3. Balance network traffic across racks
```
## Performance Monitoring

### Real-Time Metrics
Monitor live cluster metrics:
```
# Live performance monitoring
:monitor --refresh 5s

Live Cluster Metrics (Updated every 5s):
  Total Nodes: 256    (248 up, 8 maint)
  CPU Cores:   8,192  (5,647 alloc, 2,545 free)
  Memory:      32TB   (18.5TB used, 13.5TB free)
  GPUs:        512    (420 busy, 92 idle)
  Jobs:        2,847 run, 156 pend, 23 fail
  Load:        1.85 avg, 3.21 peak

Top Resource Consumers:
  1. job_98765 (alice)   - 128 cores, 512GB RAM, 8 GPUs
  2. job_98712 (bob)     -  64 cores, 256GB RAM, 4 GPUs
  3. job_98698 (charlie) -  96 cores, 384GB RAM, 6 GPUs
```
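S9S collects these numbers from the SLURM controller. If you want a comparable roll-up outside the TUI, a small polling script against the standard SLURM CLIs works; the sketch below is an assumption-laden approximation (parsing simplified, nodes in multiple partitions may be counted twice), not how S9S itself gathers data.

```python
# Rough sketch: poll standard SLURM CLIs for a cluster summary similar to
# the dashboard above. Parsing is simplified; adjust to your site.
import subprocess
import time

def slurm_lines(cmd: list[str]) -> list[str]:
    """Run a SLURM CLI command and return its non-empty output lines."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line.strip()]

def summarize() -> dict:
    # Per-node CPU counts in "allocated/idle/other/total" form (sinfo %C).
    cpu_alloc = cpu_total = 0
    for line in slurm_lines(["sinfo", "-h", "-N", "-o", "%C"]):
        alloc, idle, other, total = (int(x) for x in line.split("/"))
        cpu_alloc += alloc
        cpu_total += total
    # Job counts by state (squeue %T).
    states = [line.strip() for line in slurm_lines(["squeue", "-h", "-o", "%T"])]
    return {
        "cpu_alloc": cpu_alloc,
        "cpu_total": cpu_total,
        "running": states.count("RUNNING"),
        "pending": states.count("PENDING"),
    }

while True:
    s = summarize()
    print(f"CPUs {s['cpu_alloc']}/{s['cpu_total']} alloc | "
          f"jobs {s['running']} run, {s['pending']} pend")
    time.sleep(5)  # matches the 5s refresh used above
```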
### Historical Data
View performance history:
```
# Performance history
:history --metric utilization --period month

Cluster Utilization History:
  Dec 2023  █████████████░░░  78.5%
  Nov 2023  ████████████░░░░  72.3%
  Oct 2023  ████████████░░░░  75.1%
  Sep 2023  ███████████░░░░░  68.9%

Trend: ▲ +5.6% utilization improvement
Projected Jan 2024: 82.1% utilization
```
## Performance Alerts

### Alert Configuration
Set up performance alerts:
```yaml
# ~/.s9s/config.yaml
performance:
  alerts:
    # Resource alerts
    high_cpu_usage:
      condition: "cpu_utilization > 90%"
      duration: "5m"
      severity: warning

    low_gpu_efficiency:
      condition: "gpu_efficiency < 50% AND runtime > 1h"
      severity: info
      notification: user_email

    memory_pressure:
      condition: "free_memory < 10%"
      severity: critical
      actions: ["drain_node", "notify_admin"]

    # Queue alerts
    long_queue_wait:
      condition: "avg_wait_time > 30m"
      severity: warning

    queue_backlog:
      condition: "pending_jobs > 500"
      severity: critical
```
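Conceptually, each alert is a predicate over the current metrics plus an optional duration the condition must hold before it fires. The sketch below illustrates that evaluation loop; the metric names and the `Alert` shape are assumptions for illustration, not S9S internals.

```python
# Illustrative sketch of duration-gated alert evaluation; not S9S internals.
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Alert:
    name: str
    condition: Callable[[dict], bool]   # predicate over a metrics snapshot
    duration_s: float = 0.0             # how long the condition must hold before firing
    severity: str = "warning"
    _since: float = field(default=-1.0, repr=False)  # when the condition first became true

    def evaluate(self, metrics: dict, now: float) -> bool:
        if not self.condition(metrics):
            self._since = -1.0
            return False
        if self._since < 0:
            self._since = now
        return now - self._since >= self.duration_s

alerts = [
    Alert("high_cpu_usage", lambda m: m["cpu_utilization"] > 0.90, duration_s=300),
    Alert("memory_pressure", lambda m: m["free_memory_frac"] < 0.10, severity="critical"),
]

snapshot = {"cpu_utilization": 0.93, "free_memory_frac": 0.05}  # example values
for a in alerts:
    if a.evaluate(snapshot, time.time()):
        print(f"[{a.severity.upper()}] {a.name} fired")
```

Note that `high_cpu_usage` does not fire on the first snapshot: its condition must stay true for the full 5-minute window, which is exactly what the `duration` key in the YAML expresses.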
### Alert Dashboard
```
:alerts

Active Performance Alerts:
  CRITICAL: Node node042 - Memory usage 95.2% (2m ago)
  WARNING:  GPU efficiency below 60% on 12 jobs (5m ago)
  INFO:     Queue wait time increased to 18.5 minutes (1m ago)

Alert Actions:
  [D] Drain node042
  [A] Acknowledge alerts
  [S] Snooze for 1 hour
  [C] Configure alert thresholds
```
## Performance Optimization

### Resource Right-Sizing
Optimize resource allocation:
```
# Analyze resource usage patterns
:optimize --user alice --recommend

Resource Optimization Recommendations for alice:

Memory Over-allocation:
  Current avg: 64GB, Used avg: 18GB
  Recommended: 24GB (-62% allocation)
  Savings: 40GB/job × avg 12 jobs = 480GB

CPU Under-utilization:
  Current: 16 cores, Avg usage: 73%
  Recommended: 12 cores (+optimal binning)

Runtime Patterns:
  82% of jobs finish within 6 hours
  Consider shorter time limits for faster scheduling
  and better resource turnover
```
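The memory recommendation boils down to sizing the request from observed peak usage plus headroom, rounded up to an allocation granularity. A minimal sketch of that rule follows; the 30% headroom and 4 GB rounding are assumptions for illustration, not S9S's actual policy.

```python
# Minimal right-sizing rule: recommend peak observed usage plus headroom,
# rounded up to the scheduler's allocation granularity. The 30% headroom
# and 4 GB granularity are illustrative assumptions.
import math

def recommend_memory_gb(peak_used_gb: float, headroom: float = 0.30,
                        granularity_gb: int = 4) -> int:
    target = peak_used_gb * (1 + headroom)
    return math.ceil(target / granularity_gb) * granularity_gb

# alice's jobs above: ~18 GB actually used against 64 GB requested
print(recommend_memory_gb(18))  # -> 24
```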
### Queue Optimization
```
# Queue performance analysis
:queue-analysis --partition gpu

GPU Queue Analysis:

Queue Statistics:
  Pending Jobs:     47
  Avg Wait Time:    22.3 minutes
  Queue Efficiency: 78.5%

Bottlenecks:
  1. Large jobs blocking smaller ones
  2. Memory fragmentation on 8-GPU nodes
  3. Priority inversion for long-queued jobs

Recommendations:
  1. Enable backfill scheduling
  2. Add priority aging (1pt/hour)
  3. Consider job preemption for urgent jobs
```
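"Priority aging (1pt/hour)" simply means a job's effective priority grows linearly with time spent in the queue, so long-waiting jobs are not starved by a stream of newer, higher-priority submissions. A sketch of the idea, with illustrative base priorities:

```python
# Sketch of linear priority aging: effective priority grows with queue wait.
# The 1 point/hour rate mirrors the recommendation above; base priorities
# are illustrative.

def effective_priority(base_priority: float, hours_waiting: float,
                       aging_rate: float = 1.0) -> float:
    return base_priority + aging_rate * hours_waiting

# A lower-priority job that has waited 36 hours overtakes a fresh,
# higher-priority submission:
print(effective_priority(100, 36))  # -> 136.0
print(effective_priority(120, 0))   # -> 120.0
```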
### Capacity Planning
```
# Capacity planning analysis
:capacity-plan --horizon 6months

Capacity Planning (6-month projection):
  Current Utilization: 78.5%
  Growth Trend:        +2.3%/month
  Projected Peak:      92.1% (April 2024)

Capacity Recommendations:
  - Add 32 GPU nodes by March 2024
  - Upgrade memory on existing nodes
  - Consider job migration to cloud burst

Cost Impact:
  - Hardware: $850K (32 nodes)
  - ROI: 18 months based on current demand
```
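The projection above is essentially linear extrapolation: current utilization plus the monthly growth trend, checked against a target ceiling. A small sketch of that arithmetic, using the numbers from the report:

```python
# Linear capacity projection: current utilization plus a fixed monthly
# growth rate, reporting when a target ceiling would be crossed.

def months_until(ceiling: float, current: float, growth_per_month: float) -> float:
    """Months until utilization reaches the ceiling, assuming linear growth."""
    return (ceiling - current) / growth_per_month

current, growth = 78.5, 2.3  # % utilization and % per month, from the report above
for month in range(1, 7):
    print(f"month +{month}: {current + growth * month:.1f}%")

print(f"~{months_until(90.0, current, growth):.1f} months until 90% utilization")  # -> ~5.0
```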
## Performance Reports

### Standard Reports
Generate comprehensive performance reports:
```
# Weekly performance report
:report performance --period week --format pdf

# User efficiency report
:report user-efficiency --users alice,bob,charlie

# Resource utilization report
:report utilization --partitions all --time-range month

# Cost analysis report
:report cost-analysis --include-energy --include-maintenance
```
### Custom Performance Metrics
Define custom performance indicators:
```yaml
# ~/.s9s/metrics/custom.yaml
custom_metrics:
  job_efficiency_score:
    formula: "(cpu_efficiency * 0.4) + (memory_efficiency * 0.3) + (runtime_efficiency * 0.3)"
    threshold_good: 0.8
    threshold_fair: 0.6

  cluster_health_index:
    formula: "(available_nodes/total_nodes) * (1 - (failed_jobs/total_jobs)) * utilization"
    range: [0, 1]
    target: 0.85

  user_productivity:
    formula: "completed_jobs / (submitted_jobs + failed_jobs)"
    period: "30d"
    benchmark: 0.95
```
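Each custom metric is a formula over per-job statistics plus thresholds for classification. The sketch below shows how `job_efficiency_score` could be computed and bucketed; it mirrors the YAML definition above, not S9S's actual evaluation engine, and the sample job statistics are illustrative.

```python
# Sketch of the job_efficiency_score metric defined above: a weighted sum
# of per-job efficiencies, bucketed against the configured thresholds.

WEIGHTS = {"cpu_efficiency": 0.4, "memory_efficiency": 0.3, "runtime_efficiency": 0.3}
THRESHOLD_GOOD, THRESHOLD_FAIR = 0.8, 0.6

def job_efficiency_score(stats: dict) -> float:
    return sum(weight * stats[name] for name, weight in WEIGHTS.items())

def classify(score: float) -> str:
    if score >= THRESHOLD_GOOD:
        return "good"
    if score >= THRESHOLD_FAIR:
        return "fair"
    return "poor"

# Sample job statistics (illustrative):
job = {"cpu_efficiency": 0.852, "memory_efficiency": 0.678, "runtime_efficiency": 0.785}
score = job_efficiency_score(job)
print(f"score={score:.2f} ({classify(score)})")  # -> score=0.78 (fair)
```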
## Performance Troubleshooting

### Diagnostic Tools
```
# Performance diagnostics
:diag performance --comprehensive

Performance Diagnostic Report:
  OK    CPU utilization healthy (78.5%)
  WARN  Memory fragmentation detected on 12 nodes
  CRIT  I/O bottleneck on storage system (>80% utilization)
  OK    Network performance within normal range
  WARN  GPU memory leaks detected in 3 jobs
  OK    Scheduler performance optimal

Recommended Actions:
  1. Restart jobs with memory leaks: job[98756,98721,98698]
  2. Balance I/O load across storage systems
  3. Consider memory compaction on affected nodes
```
### Performance Profiling
```
# Profile specific job performance
:profile --job 12345 --detailed

Job 12345 Performance Profile:

Phase Analysis:
  Initialization:  2.3s      (0.1%)
  Data Loading:    245.7s    (8.2%)
  Computation:     2,547.1s  (85.1%)
  I/O Operations:  198.9s    (6.6%)

Resource Usage Timeline:
  CPU:      Consistent 95% (well utilized)
  Memory:   Peak 18.5GB of 32GB allocated
  GPU:      92% average, 3 brief idle periods
  Disk I/O: Peaks during data loading/saving
```
## Best Practices

### Monitoring Strategy
- Set appropriate baselines - Establish normal performance ranges
- Monitor trends, not just snapshots - Focus on changes over time
- Use predictive alerts - Alert before problems become critical
- Regular performance reviews - Weekly/monthly performance assessments
- Correlate metrics - Analyze CPU, memory, I/O, and network together
### Performance Optimization
- Right-size resources - Match allocation to actual usage
- Optimize job placement - Consider node capabilities and current load
- Balance workloads - Spread I/O and CPU-intensive jobs
- Use performance profiles - Create templates for common job types
- Continuous improvement - Regular analysis and optimization cycles
### Resource Management
- Proactive maintenance - Address issues before they impact users
- Capacity planning - Plan for growth before hitting limits
- Load balancing - Distribute work evenly across resources
- Performance budgets - Set and enforce efficiency targets
- Cost optimization - Balance performance with cost considerations
## Next Steps
- Configure Advanced Filtering for performance analysis
- Set up Export & Reporting for performance data
- Learn Batch Operations for bulk optimizations
- Explore Enterprise Features for advanced monitoring