Node Operations Guide
Manage cluster nodes in S9S with operations for monitoring, maintenance, and direct SSH access.
🖥️ Node View Overview
Press 2 or use the :view nodes command to open the nodes view, where you can:
- Monitor all cluster nodes in real-time
- View detailed node specifications and utilization
- Perform maintenance operations
- Access nodes directly via SSH
- Drain and resume nodes for maintenance
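S9S surfaces the same node data that Slurm itself exposes. If you want to cross-check what the view shows, the standard Slurm commands below (plain Slurm, not S9S-specific) report the equivalent raw information:

```bash
# Cross-check the node view with plain Slurm commands
sinfo -N -l            # one row per node: state, CPUs, memory, reason
scontrol show nodes    # full node records, including features and GRES
```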
📊 Node Information
Node States
S9S displays nodes in various states:
| State | Description | Color |
|---|---|---|
| IDLE | Available for new jobs | Green |
| MIXED | Some CPUs allocated, some free | Yellow |
| ALLOCATED | Fully utilized by jobs | Blue |
| DOWN | Node is offline | Red |
| DRAIN | Drained for maintenance; no jobs remain | Orange |
| DRAINING | Draining; running jobs still completing | Orange |
| FAIL | Node has failed | Red |
| MAINT | In maintenance mode | Gray |
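For reference, nodes in a given state can also be listed outside S9S with standard `sinfo` options; the commands below are plain Slurm, shown only as a cross-check:

```bash
# List nodes by state with plain Slurm
sinfo -N -t idle              # nodes currently IDLE
sinfo -N -t drain,draining    # nodes drained or draining
sinfo -R                      # DOWN/DRAIN nodes with the recorded reason
```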
Node Details
View detailed information by pressing Enter or d on a selected node:

```
Node: node001.cluster.edu
State: MIXED (4/16 CPUs allocated)
Features: gpu,nvme,infiniband
OS: Linux 5.4.0-74-generic
Architecture: x86_64
Real Memory: 128 GB
Allocated Memory: 32 GB (25%)
Free Memory: 96 GB
GPUs: 2x NVIDIA A100-SXM4-40GB
Jobs: 2 running, 0 pending
Boot Time: 2023-12-01 08:30:15
Last Seen: 2023-12-15 14:23:42 (5s ago)
```
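These fields map onto Slurm's node record; if you need the raw values, standard `scontrol` (hostname below is an example) prints them directly:

```bash
# Raw node record from Slurm (example hostname)
scontrol show node node001
# Relevant fields include: State, CPUAlloc/CPUTot, RealMemory, AllocMem, FreeMem,
# AvailableFeatures, Gres, BootTime, and SlurmdStartTime
```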
🔧 Node Operations
Basic Operations
| Key | Action | Description |
|---|---|---|
| Enter / d | Show details | View comprehensive node information |
|  | View logs | Show node logs and messages |
|  | View jobs | List all jobs on this node |
| s | SSH to node | Direct SSH access |
|  | Refresh | Update node information |
Maintenance Operations
| Key | Action | Description |
|---|---|---|
| D | Drain node | Prepare node for maintenance |
|  | Resume node | Return node to service |
|  | Update state | Force state update |
|  | Maintenance mode | Put node in maintenance |
Advanced Operations
| Key | Action | Description |
|---|---|---|
|  | Reboot node | Restart node (admin only) |
|  | Power cycle | Hard power cycle |
|  | Force drain | Immediate drain with job termination |
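Reboots ultimately go through Slurm's own mechanism. As a hedged sketch of the underlying command on recent Slurm versions (admin privileges required; hostname and reason are examples):

```bash
# ASAP drains the node and reboots it as soon as running jobs finish,
# then returns it to service
scontrol reboot ASAP nextstate=RESUME reason="Firmware update" node001
```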
🚧 Maintenance Workflows
Planned Maintenance
1. Drain the node (an equivalent plain-`scontrol` sequence is sketched after this workflow):

   ```
   # In nodes view, select the node and press D
   # Or use command mode:
   :drain node001 --reason="Planned maintenance" --timeout=1h
   ```

2. Wait for jobs to complete:
   - Monitor the draining progress
   - Running jobs finish naturally
   - New jobs won't be scheduled

3. Perform maintenance:

   ```
   # SSH to the node for maintenance
   s   # Press 's' on the selected node
   ```

4. Resume the node:

   ```
   # After maintenance, resume the node
   :resume node001
   ```
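If you prefer to drive the same steps with plain Slurm commands instead of S9S command mode, a minimal equivalent sequence (example hostname; requires Slurm admin rights) looks like this:

```bash
# Drain, watch, and resume with plain scontrol/squeue
scontrol update NodeName=node001 State=DRAIN Reason="Planned maintenance"
squeue -w node001                # watch the jobs still running on the node
scontrol update NodeName=node001 State=RESUME
```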
Emergency Maintenance
1. Force drain immediately:

   ```
   :drain node001 --force --reason="Emergency maintenance"
   ```

2. Jobs are immediately terminated.
3. The node is ready for maintenance.
Batch Maintenance
Maintain multiple nodes efficiently:
```
# Drain multiple nodes
:drain node[001-010] --reason="OS update" --timeout=2h

# Resume multiple nodes
:resume node[001-010]

# Check status of node range
/node:node[001-010]
```
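`scontrol` and `sinfo` accept the same hostlist ranges, so a plain-Slurm equivalent of this batch workflow (node names are examples) is:

```bash
# Batch drain/resume with plain Slurm; hostlist ranges expand automatically
scontrol update NodeName=node[001-010] State=DRAIN Reason="OS update"
sinfo -n node[001-010] -N -o "%N %T %E"    # node, state, reason for the range
scontrol update NodeName=node[001-010] State=RESUME
```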
🖧 SSH Integration
Direct SSH Access
Press s on a selected node to open an interactive SSH session:

```
# Automatically connects with your configured SSH settings
ssh <user>@node001.cluster.edu
```
SSH Configuration
Configure SSH in ~/.s9s/config.yaml:

```yaml
ssh:
  defaultUser: ${USER}
  keyFile: ~/.ssh/id_rsa
  knownHostsFile: ~/.ssh/known_hosts
  compression: true
  forwardAgent: true
  extraArgs: "-o StrictHostKeyChecking=ask"
```
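These settings correspond to standard OpenSSH client options. As an illustration only (not necessarily how S9S builds the command; hostname is an example), an equivalent manual invocation would be:

```bash
# Roughly equivalent ssh call using the options configured above
ssh -i ~/.ssh/id_rsa \
    -o UserKnownHostsFile=~/.ssh/known_hosts \
    -o Compression=yes -o ForwardAgent=yes \
    -o StrictHostKeyChecking=ask \
    "$USER@node001.cluster.edu"
```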
SSH Operations
| Key | Action | Description |
|---|---|---|
| s | SSH to node | Interactive SSH session |
|  | SSH with options | Choose user, key, options |
|  | Background SSH | SSH in new terminal |
Bulk SSH Operations
Execute commands across multiple nodes:
```
# Run command on all idle nodes
:ssh --filter="state:idle" "uptime"

# Update all nodes in maintenance
:ssh --nodes="node[001-010]" "sudo apt update"

# Check disk space on GPU nodes
:ssh --filter="features:gpu" "df -h"
```
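Outside S9S, the same fan-out is commonly done with `pdsh` (assuming pdsh is installed; node names are examples):

```bash
# Parallel shell across a node range with pdsh
pdsh -w node[001-010] "uptime"
pdsh -w node[001-010] "df -h /" | dshbak -c    # fold identical output together
```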
📈 Node Monitoring
Resource Utilization
Monitor real-time resource usage:
```
# CPU utilization
CPU:    ████████░░░░░░░░ 8/16 cores (50%)

# Memory usage
Memory: ██████░░░░░░░░░░ 32/128 GB (25%)

# GPU utilization
GPU 0:  ████████████████ 100% (job_12345)
GPU 1:  ░░░░░░░░░░░░░░░░ 0% (idle)
```
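The allocation figures behind these gauges come from Slurm's per-node accounting; a plain `sinfo` format string (standard fields, shown as a cross-check) reports the same numbers:

```bash
# CPUs as allocated/idle/other/total, plus total and free memory per node
sinfo -N -o "%N %C %m %e"
```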
Health Monitoring
S9S monitors node health indicators:
- Load Average: System load over 1, 5, 15 minutes
- Memory Pressure: Available vs allocated memory
- Disk Space: Available disk space on filesystems
- Network: Network connectivity and bandwidth
- Temperature: Hardware temperature sensors
- Jobs: Running and pending job counts
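After SSHing to a node, the same indicators can be spot-checked with standard Linux tools (the `sensors` call assumes lm-sensors is installed):

```bash
# Manual health spot-check on a node
uptime              # load averages over 1, 5, 15 minutes
free -h             # memory: used vs available
df -h               # disk space per filesystem
sensors | head -20  # hardware temperatures (requires lm-sensors)
```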
Alerts and Notifications
Configure alerts for node issues:
```yaml
# In config.yaml
notifications:
  nodeAlerts:
    - condition: "load > 32"
      severity: warning
      message: "High load on {node}"
    - condition: "memory < 10%"
      severity: critical
      message: "Low memory on {node}"
    - condition: "state == DOWN"
      severity: critical
      message: "Node {node} is down"
```
🔍 Node Filtering and Search
Find Specific Nodes
```
# Find nodes by name pattern
/node:compute*

# Find nodes by state
/state:idle

# Find GPU nodes
/features:gpu

# Find nodes with high memory
/memory:>64GB

# Find nodes with specific job count
/jobs:>4
```
Complex Node Queries
```
# Idle GPU nodes with >100GB RAM
/state:idle features:gpu memory:>100GB

# Nodes that haven't been seen recently
/lastseen:>1h state:!DOWN

# Overutilized nodes
/load:>16 cpus:<=16

# Nodes ready for maintenance
/jobs:0 state:idle
```
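The filter syntax above is S9S-specific; rough plain-Slurm equivalents for a few of these queries (standard `sinfo`/`squeue` options, example hostname) are:

```bash
# Idle nodes advertising a gpu feature, with memory
sinfo -N -t idle -o "%N %f %m" | grep -w gpu

# Down or drained nodes, with reasons
sinfo -R

# Number of jobs currently on a given node
squeue -w node001 -h | wc -l
```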
🛠️ Troubleshooting Node Issues
Common Node Problems
Node shows as DOWN:
- Check network connectivity
- Verify SLURM daemon is running
- Check system logs
- Restart slurmd if needed
Node not accepting jobs:
- Check if node is drained
- Verify available resources
- Check job constraints vs node features
- Review partition configuration
High load but no jobs:
- Check for system processes
- Look for hung or zombie processes
- Check for I/O wait issues
- Review system logs
Node Diagnostics
```
# View detailed node diagnostics
:diag node001

# Check node connectivity
:ping node001

# View system logs
:logs node001 --system --lines=100

# Check SLURM daemon status
:slurm-status node001
```
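Where those S9S commands aren't available, the same checks can be run directly (example hostname; the SSH calls assume you have shell access to the node):

```bash
# Equivalent raw diagnostics
scontrol show node node001                            # state, reason, resources
ping -c 3 node001                                     # basic connectivity
ssh node001 'systemctl status slurmd'                 # Slurm daemon status
ssh node001 'journalctl -u slurmd -n 100 --no-pager'  # recent slurmd logs
```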
🎯 Best Practices
Node Management
- Plan maintenance windows - Use drain with timeout
- Monitor during drainage - Ensure jobs complete cleanly
- Verify after maintenance - Test functionality before resuming
- Document changes - Use descriptive drain reasons
- Batch operations - Maintain multiple nodes efficiently
Resource Monitoring
- Set up alerts - Proactive monitoring prevents issues
- Regular health checks - Monitor trends over time
- Capacity planning - Track utilization patterns
- Performance baselines - Know normal vs abnormal behavior
SSH Security
- Use SSH keys - Avoid password authentication
- Limit access - Restrict SSH to necessary users
- Audit connections - Log and monitor SSH usage
- Keep keys secure - Rotate and protect SSH keys
🚀 Next Steps
- Learn about Job Management to manage jobs on nodes
- Explore Advanced Filtering for powerful node queries
- Set up SSH Integration for seamless node access
- Configure Performance Monitoring for proactive management