Performance Monitoring Guide

Monitor, analyze, and optimize SLURM cluster performance with S9S's comprehensive monitoring capabilities.

šŸ“ˆ Overview

S9S provides real-time performance monitoring for:

  • Cluster-wide resource utilization
  • Job efficiency and performance trends
  • Node health and capacity metrics
  • Queue performance and wait times
  • Storage and network I/O statistics

šŸ“Š Real-Time Dashboard

Performance Views

Access performance monitoring views:

Key       View                    Description
:perf     Performance Dashboard   Overall cluster performance
:metrics  Metrics View            Detailed resource metrics
:trends   Trend Analysis          Historical performance trends
:alerts   Alert Dashboard         Performance alerts and warnings

Dashboard Widgets

Cluster Overview:

CPU Utilization:     ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–‘ā–‘ā–‘ā–‘ā–‘ā–‘ā–‘ā–‘ 68.5%
Memory Usage:        ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–‘ā–‘ā–‘ā–‘ā–‘ā–‘ā–‘ā–‘ā–‘ā–‘ 45.2%
GPU Utilization:     ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–‘ā–‘ā–‘ā–‘ 82.3%
Active Jobs:         2,847 / 3,500
Queue Length:        156 jobs
Avg Wait Time:       12.5 minutes

Resource Heatmap:

Node        CPU    Memory   GPU    Load   Status
node001     ā–ˆā–ˆā–ˆā–ˆā–‘ā–‘  ā–ˆā–ˆā–ˆā–‘ā–‘ā–‘  ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ  2.1   MIXED
node002     ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ  ā–ˆā–ˆā–ˆā–ˆā–‘ā–‘  ā–‘ā–‘ā–‘ā–‘ā–‘ā–‘  3.8   ALLOC
node003     ā–‘ā–‘ā–‘ā–‘ā–‘ā–‘  ā–‘ā–‘ā–‘ā–‘ā–‘ā–‘  ā–‘ā–‘ā–‘ā–‘ā–‘ā–‘  0.1   IDLE
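
The gauges above are plain proportional fills. A minimal sketch of the same rendering idea in Python (the 16-segment width and block characters are assumptions for illustration, not S9S internals):

def usage_bar(percent, width=16):
    # Render a proportional text gauge like the dashboard widgets above.
    filled = round(width * percent / 100)
    return "ā–ˆ" * filled + "ā–‘" * (width - filled)

print(f"CPU Utilization:     {usage_bar(68.5)} 68.5%")
print(f"Memory Usage:        {usage_bar(45.2)} 45.2%")
print(f"GPU Utilization:     {usage_bar(82.3)} 82.3%")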

šŸƒ Job Performance Analysis

Job Efficiency Metrics

Analyze job performance:

# View job efficiency
:efficiency --job 12345

Job 12345 Efficiency Analysis:
CPU Efficiency:      85.2% (Good)
Memory Efficiency:   67.8% (Fair)
GPU Utilization:     92.1% (Excellent)
I/O Wait:           3.2% (Good)
Runtime Efficiency:  78.5% (Good)

Recommendations:
- Consider reducing memory allocation (32GB → 24GB)
- Job is CPU-bound, memory overprovisioned
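
For reference, the efficiency figures above boil down to simple ratios of used versus allocated resources. A rough sketch (field names are assumptions, loosely modeled on what SLURM accounting exposes via sacct, e.g. TotalCPU, Elapsed, AllocCPUS, MaxRSS, ReqMem; the inputs are chosen to reproduce the example job):

def cpu_efficiency(total_cpu_seconds, elapsed_seconds, alloc_cpus):
    # Fraction of the allocated core-time the job actually spent computing.
    return total_cpu_seconds / (elapsed_seconds * alloc_cpus)

def memory_efficiency(max_rss_gb, requested_gb):
    # Fraction of the requested memory the job actually touched.
    return max_rss_gb / requested_gb

print(f"CPU Efficiency:    {cpu_efficiency(41_600, 3_050, 16):.1%}")   # 85.2%
print(f"Memory Efficiency: {memory_efficiency(21.7, 32):.1%}")         # 67.8%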

Performance Trends

# Historical performance trends
:trends --user alice --period month

Alice's Performance Trends (Last 30 Days):

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│ Avg CPU Efficiency: 82.3% (ā–² +5.2% vs last month) │
│ Avg Wait Time: 8.5 min (ā–¼ -2.1 min vs last month)  │
│ Jobs Completed: 247 (ā–² +18% vs last month)       │
│ Success Rate: 96.8% (ā–² +1.2% vs last month)      │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Resource Utilization Analysis

# Detailed resource analysis
:analyze --partition gpu --time-range 7d

GPU Partition Analysis (Last 7 Days):

Resource Utilization:
- Average GPU Usage: 78.5%
- Peak GPU Usage: 98.2% (Tuesday 14:30)
- Lowest Usage: 12.3% (Sunday 03:00)

Bottlenecks Identified:
1. Memory bandwidth limitation on older nodes
2. I/O contention during peak hours (14:00-16:00)
3. Network saturation on rack 3 nodes

Optimization Recommendations:
1. Upgrade memory on node[025-048]
2. Stagger large I/O jobs
3. Balance network traffic across racks

šŸ“Š Performance Monitoring

Real-Time Metrics

Monitor live cluster metrics:

# Live performance monitoring
:monitor --refresh 5s

Live Cluster Metrics (Updated every 5s):

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│ Total Nodes: 256 (248 up, 8 maint)               │
│ CPU Cores: 8,192 (5,647 alloc, 2,545 free)      │
│ Memory: 32TB (18.5TB used, 13.5TB free)          │
│ GPUs: 512 (420 busy, 92 idle)                    │
│ Jobs: 2,847 run, 156 pend, 23 fail              │
│ Load: 1.85 avg, 3.21 peak                        │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Top Resource Consumers:
1. job_98765 (alice)    - 128 cores, 512GB RAM, 8 GPUs
2. job_98712 (bob)      - 64 cores,  256GB RAM, 4 GPUs  
3. job_98698 (charlie)  - 96 cores,  384GB RAM, 6 GPUs
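
The live counters above have to come from somewhere; a bare-bones polling loop against standard SLURM tools gives a feel for it. This is a sketch, not S9S's actual data path, and assumes sinfo is available on the PATH:

import subprocess
import time

def node_state_counts():
    # Ask sinfo for "<state> <node count>" pairs, one partition/state per line.
    out = subprocess.run(["sinfo", "-h", "-o", "%T %D"],
                         capture_output=True, text=True, check=True).stdout
    counts = {}
    for line in out.splitlines():
        state, nodes = line.split()
        counts[state] = counts.get(state, 0) + int(nodes)
    return counts

while True:
    print(node_state_counts())   # e.g. {'allocated': 204, 'idle': 36, 'mixed': 8}
    time.sleep(5)                # matches the 5-second refresh above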

Historical Data

View performance history:

# Performance history
:history --metric utilization --period month

Cluster Utilization History:

Dec 2023    ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–‘ 78.5%
Nov 2023    ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–‘ā–‘ā–‘ 72.3%
Oct 2023    ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–‘ā–‘ 75.1%
Sep 2023    ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–‘ā–‘ā–‘ā–‘ā–‘ 68.9%

Trend: ā–² +5.6% utilization improvement
Projected Jan 2024: 82.1% utilization

šŸ”” Performance Alerts

Alert Configuration

Set up performance alerts:

# ~/.s9s/config.yaml
performance:
  alerts:
    # Resource alerts
    high_cpu_usage:
      condition: "cpu_utilization > 90%"
      duration: "5m"
      severity: warning
      
    low_gpu_efficiency:
      condition: "gpu_efficiency < 50% AND runtime > 1h"
      severity: info
      notification: user_email
      
    memory_pressure:
      condition: "free_memory < 10%"
      severity: critical
      actions: ["drain_node", "notify_admin"]
      
    # Queue alerts
    long_queue_wait:
      condition: "avg_wait_time > 30m"
      severity: warning
      
    queue_backlog:
      condition: "pending_jobs > 500"
      severity: critical
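
Conceptually, a duration-gated alert like high_cpu_usage only fires after the condition has held for the whole window. A minimal sketch of that evaluation logic (the evaluator itself is an assumption; S9S's alert engine is not described here):

import time

THRESHOLD = 90.0          # mirrors condition "cpu_utilization > 90%"
DURATION_S = 5 * 60       # mirrors duration "5m"
_breach_started = None

def high_cpu_alert(cpu_pct, now=None):
    # Fire only once utilization has stayed above THRESHOLD for DURATION_S.
    global _breach_started
    now = time.time() if now is None else now
    if cpu_pct <= THRESHOLD:
        _breach_started = None            # condition cleared, reset the timer
        return False
    if _breach_started is None:
        _breach_started = now             # first sample over the threshold
    return now - _breach_started >= DURATION_S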

Alert Dashboard

:alerts

Active Performance Alerts:

šŸ”“ CRITICAL: Node node042 - Memory usage 95.2% (2m ago)
🟔 WARNING: GPU efficiency below 60% on 12 jobs (5m ago)
šŸ”µ INFO: Queue wait time increased to 18.5 minutes (1m ago)

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│ Alert Actions:                                   │
│ [D] Drain node042                               │
│ [A] Acknowledge alerts                          │
│ [S] Snooze for 1 hour                          │
│ [C] Configure alert thresholds                  │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

šŸ“‰ Performance Optimization

Resource Right-Sizing

Optimize resource allocation:

# Analyze resource usage patterns
:optimize --user alice --recommend

Resource Optimization Recommendations for alice:

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│ Memory Over-allocation:                         │
│   Current avg: 64GB, Used avg: 18GB            │
│   Recommended: 24GB (-62% allocation)          │
│   Savings: 40GB/job Ɨ avg 12 jobs = 480GB      │
│                                               │
│ CPU Under-utilization:                        │
│   Current: 16 cores, Avg usage: 73%           │
│   Recommended: 12 cores (+optimal binning)     │
│                                               │
│ Runtime Patterns:                             │
│   82% of jobs finish within 6 hours           │
│   Consider shorter time limits for faster      │
│   scheduling and better resource turnover      │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
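
The memory recommendation above follows a simple rule of thumb: observed usage plus headroom, rounded up to a convenient step. The 30% headroom and 8 GB step below are assumptions for illustration, not S9S's documented policy:

import math

def recommend_memory_gb(observed_gb, headroom=0.30, step_gb=8):
    # Pad observed usage (peak or high percentile), then round up to a
    # scheduling-friendly step size.
    padded = observed_gb * (1 + headroom)
    return math.ceil(padded / step_gb) * step_gb

print(recommend_memory_gb(18))   # -> 24, matching the "Recommended: 24GB" above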

Queue Optimization

# Queue performance analysis
:queue-analysis --partition gpu

GPU Queue Analysis:

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│ Queue Statistics:                              │
│   Pending Jobs: 47                            │
│   Avg Wait Time: 22.3 minutes                 │
│   Queue Efficiency: 78.5%                     │
│                                               │
│ Bottlenecks:                                  │
│   1. Large jobs blocking smaller ones         │
│   2. Memory fragmentation on 8-GPU nodes      │
│   3. Priority inversion for long-queued jobs  │
│                                               │
│ Recommendations:                              │
│   1. Enable backfill scheduling               │
│   2. Add priority aging (1pt/hour)            │
│   3. Consider job preemption for urgent jobs  │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
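
The "priority aging (1pt/hour)" recommendation above means pending jobs gain priority the longer they wait, which counters starvation behind large or high-priority jobs. A sketch of the idea (in stock SLURM the equivalent knob is the multifactor priority plugin's age factor rather than a literal per-hour point value):

from datetime import datetime, timezone

AGE_POINTS_PER_HOUR = 1.0   # the 1pt/hour rate suggested above

def effective_priority(base_priority, submitted_at):
    # Pending jobs accrue priority linearly with time spent in the queue.
    hours_waiting = (datetime.now(timezone.utc) - submitted_at).total_seconds() / 3600
    return base_priority + AGE_POINTS_PER_HOUR * hours_waiting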

Capacity Planning

# Capacity planning analysis
:capacity-plan --horizon 6months

Capacity Planning (6-month projection):

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│ Current Utilization: 78.5%                     │
│ Growth Trend: +2.3%/month                      │
│ Projected Peak: 92.1% (April 2024)            │
│                                               │
│ Capacity Recommendations:                      │
│   - Add 32 GPU nodes by March 2024             │
│   - Upgrade memory on existing nodes            │
│   - Consider job migration to cloud burst      │
│                                               │
│ Cost Impact:                                   │
│   - Hardware: $850K (32 nodes)                 │
│   - ROI: 18 months based on current demand     │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
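
A plain linear extrapolation of the growth trend above is a reasonable first pass; the projected April peak presumably folds in seasonality, whereas this sketch assumes constant monthly growth:

def months_until(current_pct, growth_per_month, ceiling_pct):
    # Months until utilization crosses a planning threshold, growing linearly.
    return (ceiling_pct - current_pct) / growth_per_month

current, growth = 78.5, 2.3
print(f"Months until 90% utilization: {months_until(current, growth, 90.0):.1f}")   # 5.0
print(f"Utilization in 6 months:      {current + 6 * growth:.1f}%")                 # 92.3%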

šŸ“„ Performance Reports

Standard Reports

Generate comprehensive performance reports:

# Weekly performance report
:report performance --period week --format pdf

# User efficiency report
:report user-efficiency --users alice,bob,charlie

# Resource utilization report  
:report utilization --partitions all --time-range month

# Cost analysis report
:report cost-analysis --include-energy --include-maintenance

Custom Performance Metrics

Define custom performance indicators:

# ~/.s9s/metrics/custom.yaml
custom_metrics:
  job_efficiency_score:
    formula: "(cpu_efficiency * 0.4) + (memory_efficiency * 0.3) + (runtime_efficiency * 0.3)"
    threshold_good: 0.8
    threshold_fair: 0.6
    
  cluster_health_index:
    formula: "(available_nodes/total_nodes) * (1 - (failed_jobs/total_jobs)) * utilization"
    range: [0, 1]
    target: 0.85
    
  user_productivity:
    formula: "completed_jobs / (submitted_jobs + failed_jobs)"
    period: "30d"
    benchmark: 0.95
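
For reference, here is the job_efficiency_score formula and its thresholds evaluated in Python (the weights and cutoffs come from the YAML above; the rating labels are assumptions):

def job_efficiency_score(cpu_eff, mem_eff, runtime_eff):
    # Weighted blend defined in custom.yaml above.
    return 0.4 * cpu_eff + 0.3 * mem_eff + 0.3 * runtime_eff

def rating(score, threshold_good=0.8, threshold_fair=0.6):
    if score >= threshold_good:
        return "good"
    if score >= threshold_fair:
        return "fair"
    return "poor"

score = job_efficiency_score(0.852, 0.678, 0.785)   # figures from the example job
print(f"{score:.2f} -> {rating(score)}")            # 0.78 -> fair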

šŸ”§ Performance Troubleshooting

Diagnostic Tools

# Performance diagnostics
:diag performance --comprehensive

Performance Diagnostic Report:

āœ… CPU utilization healthy (78.5%)
āš ļø  Memory fragmentation detected on 12 nodes
šŸ”“ I/O bottleneck on storage system (>80% utilization)
āœ… Network performance within normal range
āš ļø  GPU memory leaks detected in 3 jobs
āœ… Scheduler performance optimal

Recommended Actions:
1. Restart jobs with memory leaks: job[98756,98721,98698]
2. Balance I/O load across storage systems
3. Consider memory compaction on affected nodes

Performance Profiling

# Profile specific job performance
:profile --job 12345 --detailed

Job 12345 Performance Profile:

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│ Phase Analysis:                                │
│   Initialization: 2.3s (0.1%)                  │
│   Data Loading: 245.7s (8.2%)                  │
│   Computation: 2,547.1s (85.1%)                │
│   I/O Operations: 198.9s (6.6%)                │
│                                               │
│ Resource Usage Timeline:                      │
│   CPU: Consistent 95% (well utilized)          │
│   Memory: Peak 18.5GB of 32GB allocated        │
│   GPU: 92% average, 3 brief idle periods       │
│   Disk I/O: Peaks during data loading/saving   │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
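
The phase percentages in the profile are simply each phase's share of total wall time, as a quick check confirms:

phases = {"Initialization": 2.3, "Data Loading": 245.7,
          "Computation": 2547.1, "I/O Operations": 198.9}
total = sum(phases.values())
for name, seconds in phases.items():
    print(f"{name:15} {seconds:8.1f}s ({seconds / total:.1%})")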

šŸš€ Best Practices

Monitoring Strategy

  1. Set appropriate baselines - Establish normal performance ranges
  2. Monitor trends, not just snapshots - Focus on changes over time
  3. Use predictive alerts - Alert before problems become critical
  4. Regular performance reviews - Weekly/monthly performance assessments
  5. Correlate metrics - CPU, memory, I/O, and network together

Performance Optimization

  1. Right-size resources - Match allocation to actual usage
  2. Optimize job placement - Consider node capabilities and current load
  3. Balance workloads - Spread I/O and CPU-intensive jobs
  4. Use performance profiles - Create templates for common job types
  5. Continuous improvement - Regular analysis and optimization cycles

Resource Management

  1. Proactive maintenance - Address issues before they impact users
  2. Capacity planning - Plan for growth before hitting limits
  3. Load balancing - Distribute work evenly across resources
  4. Performance budgets - Set and enforce efficiency targets
  5. Cost optimization - Balance performance with cost considerations

šŸš€ Next Steps