Node Operations Guide
Master node management in S9S with powerful operations for monitoring, maintenance, and direct access to cluster nodes.
Node View Overview
Press 2 or :nodes to access the nodes view, where you can:
- Monitor all cluster nodes in real-time
- View detailed node specifications and utilization
- Perform maintenance operations
- Access nodes directly via SSH
- Drain and resume nodes for maintenance
Node Information
Node States
S9S displays nodes in various states:
| State | Description | Color |
|---|---|---|
| IDLE | Available for new jobs | Green |
| MIXED | Some CPUs allocated, some free | Blue |
| ALLOCATED | Fully utilized by jobs | Blue |
| DOWN | Node is offline | Red |
| DRAIN | Being drained for maintenance | Red |
| DRAINING | Actively draining jobs | Red |
| RESERVED | Reserved for specific use | Yellow |
| MAINTENANCE | In maintenance mode | Orange |
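The first three states in the table follow from how many of a node's CPUs are allocated. The sketch below illustrates that rule together with the color lookup from the table. `cpu_state` and `STATE_COLORS` are hypothetical names chosen for this example, not S9S internals; the states S9S actually shows are whatever Slurm reports.

```python
# Hypothetical helpers for illustration; not part of S9S.
STATE_COLORS = {
    "IDLE": "Green",
    "MIXED": "Blue",
    "ALLOCATED": "Blue",
    "DOWN": "Red",
    "DRAIN": "Red",
    "DRAINING": "Red",
    "RESERVED": "Yellow",
    "MAINTENANCE": "Orange",
}


def cpu_state(allocated: int, total: int) -> str:
    """Classify a healthy node by CPU allocation alone."""
    if total <= 0 or not 0 <= allocated <= total:
        raise ValueError("expected total > 0 and 0 <= allocated <= total")
    if allocated == 0:
        return "IDLE"
    if allocated == total:
        return "ALLOCATED"
    return "MIXED"


state = cpu_state(4, 16)
print(state, STATE_COLORS[state])  # MIXED Blue
```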
Node Details
View detailed information with Enter:
    Node: node001.cluster.edu
    State: MIXED (4/16 CPUs allocated)
    Features: gpu,nvme,infiniband
    OS: Linux 5.4.0-74-generic
    Architecture: x86_64
    Real Memory: 128 GB
    Allocated Memory: 32 GB (25%)
    Free Memory: 96 GB
    GPUs: 2x NVIDIA A100-SXM4-40GB
    Jobs: 2 running, 0 pending
    Boot Time: 2023-12-01 08:30:15
    Last Seen: 2023-12-15 14:23:42 (5s ago)
Node Operations
Basic Operations
| Key | Action | Description |
|---|---|---|
| Enter | Show details | View comprehensive node information |
| d/D | Drain | Prepare node for maintenance |
| r | Resume | Return node to service |
| R | Refresh | Update node information |
| s | SSH to node | Direct SSH access |
State Filter Shortcuts
| Key | Action | Description |
|---|---|---|
| a/A | All states | Clear state filter |
| i/I | Idle filter | Toggle idle state filter |
| m/M | Mixed filter | Toggle mixed state filter |
| p/P | Partition filter | Prompt for partition filter |
| g/G | Group by | Group nodes by partition, state, or features |
| e/E | Export | Open export dialog |
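Grouping amounts to bucketing node rows by the value of one field. A rough sketch of the idea, assuming each node is a dictionary with `name`, `state`, and `partition` keys (field names invented for this example, not taken from S9S):

```python
from collections import defaultdict


def group_nodes(nodes, key):
    """Bucket node names by the value of one field, e.g. 'state'."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node[key]].append(node["name"])
    return dict(groups)


# Made-up sample rows; the field names are assumptions of this sketch.
nodes = [
    {"name": "node001", "state": "MIXED", "partition": "gpu"},
    {"name": "node002", "state": "IDLE", "partition": "gpu"},
    {"name": "node003", "state": "IDLE", "partition": "cpu"},
]

print(group_nodes(nodes, "state"))
print(group_nodes(nodes, "partition"))
```

Grouping by features would need one extra step, since a node can carry several feature tags and would belong to more than one group.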
Maintenance Workflows
Planned Maintenance
- Drain the node:
      # In nodes view, select node and press D
      # Or use command mode
      :drain node001 Planned maintenance
- Wait for jobs to complete:
  - Monitor draining progress
  - Jobs will finish naturally
  - New jobs won't be scheduled
- Perform maintenance:
      # SSH to node for maintenance
      s  # Press 's' on selected node
- Resume the node:
      # After maintenance, resume the node
      :resume node001
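For reference, the same drain and resume cycle can be expressed with Slurm's own `scontrol update` command. The sketch below only builds and prints the argument lists. It assumes `scontrol` is available to an account allowed to change node state, and it makes no claim about how S9S issues these requests internally.

```python
import shlex


def drain_cmd(node: str, reason: str) -> list[str]:
    """Argument list for draining one node; Slurm expects a reason here."""
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


def resume_cmd(node: str) -> list[str]:
    """Argument list for returning a drained node to service."""
    return ["scontrol", "update", f"NodeName={node}", "State=RESUME"]


# Print the commands instead of running them, so the sketch is safe to try.
print(shlex.join(drain_cmd("node001", "Planned maintenance")))
print(shlex.join(resume_cmd("node001")))
```

To actually run one of these, the list could be passed to `subprocess.run(cmd, check=True)` on a host with Slurm administrator rights.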
Emergency Maintenance
- Drain the node with a reason:
      :drain node001 Emergency maintenance
- Monitor draining progress; jobs will finish naturally
- Perform maintenance once drained
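The drain command above is a node name followed by a free-text reason. A minimal sketch of how such a line could be split, assuming the first word after `:drain` is the node and everything after it is the reason; `parse_drain` is a hypothetical helper and S9S's real parser may behave differently.

```python
def parse_drain(line: str) -> tuple[str, str]:
    """Split ':drain <node> [reason...]' into (node, reason)."""
    parts = line.strip().split(maxsplit=2)
    if len(parts) < 2 or parts[0] != ":drain":
        raise ValueError("usage: :drain <node> [reason]")
    node = parts[1]
    reason = parts[2] if len(parts) == 3 else ""
    return node, reason


print(parse_drain(":drain node001 Emergency maintenance"))
# ('node001', 'Emergency maintenance')
```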
Batch Maintenance
Drain individual nodes using the command mode:
    # Drain nodes one at a time
    :drain node001 OS update
    :drain node002 OS update

    # Resume nodes after maintenance
    :resume node001
    :resume node002

    # Filter to see specific nodes
    /node001
Note: The :drain command accepts a single node name and an optional reason string. Node ranges and --reason/--timeout flags are planned. See #119.
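Until ranges are supported, a larger batch means one command per node. The sketch below expands a simple bracket range such as `node[001-003]` (the hostlist style Slurm uses) into single-node `:drain` lines that can be entered one at a time. Both helper names are hypothetical, and only a single `start-end` range is handled, not comma-separated lists.

```python
import re


def expand_range(spec: str) -> list[str]:
    """Expand 'node[001-003]' into individual names; plain names pass through."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if match is None:
        return [spec]
    prefix, start, end, suffix = match.groups()
    if int(start) > int(end):
        raise ValueError(f"descending range in {spec!r}")
    width = len(start)  # keep the zero padding of the lower bound
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(int(start), int(end) + 1)]


def drain_lines(spec: str, reason: str) -> list[str]:
    """One ':drain' command per node, since ranges are not accepted yet."""
    return [f":drain {name} {reason}".rstrip() for name in expand_range(spec)]


for line in drain_lines("node[001-003]", "OS update"):
    print(line)
```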
SSH Integration
Direct SSH Access
Press s on any node to SSH directly:
    # Automatically connects with your configured SSH settings
    ssh [email protected]
SSH Operations
| Key | Action | Description |
|---|---|---|
| s | SSH to node | Interactive SSH session to selected node |
Press s on a selected node in the Nodes view to open an SSH session. See SSH Integration Guide for configuration.
Node Monitoring
Resource Utilization
Monitor real-time resource usage:
    # CPU utilization
    CPU: ████████░░░░░░░░ 8/16 cores (50%)

    # Memory usage
    Memory: ████░░░░░░░░░░░░ 32/128 GB (25%)

    # GPU utilization
    GPU 0: ████████████████ 100% (job_12345)
    GPU 1: ░░░░░░░░░░░░░░░░ 0% (idle)
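The bars are proportional: the number of filled cells is the used fraction times the bar width. A small sketch of that arithmetic, assuming a 16-cell bar; `usage_bar` is a hypothetical helper, not the S9S rendering code.

```python
def usage_bar(used: float, total: float, width: int = 16) -> str:
    """Render used/total as a fixed-width bar followed by a rounded percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = min(max(used / total, 0.0), 1.0)  # clamp to the 0..1 range
    filled = round(fraction * width)
    return "█" * filled + "░" * (width - filled) + f" {fraction:.0%}"


print("CPU:   ", usage_bar(8, 16))    # 8 of 16 cells filled, 50%
print("Memory:", usage_bar(32, 128))  # 4 of 16 cells filled, 25%
```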
Resource Display
S9S displays the following node resource information:
- CPU Usage: Allocated vs total CPUs, CPU load
- Memory Usage: Allocated vs total memory
- State: Current node state with color coding
- Partitions: Which partitions the node belongs to
- Features: Node feature tags
- Reason: Drain/down reason if applicable
Node Filtering and Search
Find Specific Nodes
The / quick filter performs plain text search across all visible columns:
    # Find nodes by name
    /compute    # Nodes containing "compute"
    /gpu        # Nodes with "gpu" in any column
    /node001    # Specific node
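A plain text search across columns can be pictured as a substring test against every cell of a row. The sketch below shows the idea on made-up rows. Whether the S9S filter ignores case is not stated in this guide, so the case-insensitive match here is an assumption, and `quick_filter` is a hypothetical name.

```python
def quick_filter(rows, query):
    """Keep rows where any column contains the query text."""
    needle = query.lower()  # case-insensitivity is an assumption of this sketch
    return [row for row in rows if any(needle in str(cell).lower() for cell in row)]


# Made-up rows: name, state, partition, features.
rows = [
    ("node001", "MIXED", "gpu", "gpu,nvme,infiniband"),
    ("compute01", "IDLE", "cpu", "nvme"),
    ("compute02", "DOWN", "cpu", ""),
]

print([row[0] for row in quick_filter(rows, "compute")])  # ['compute01', 'compute02']
print([row[0] for row in quick_filter(rows, "gpu")])      # ['node001']
```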
Press Ctrl+F for global search across all entity types, or use keyboard shortcuts for specific state filters:
    # State filter shortcuts
    i/I    # Toggle idle filter
    m/M    # Toggle mixed filter
    a/A    # Show all states
    p/P    # Filter by partition
Troubleshooting Node Issues
Common Node Problems
Node shows as DOWN:
- Check network connectivity
- Verify SLURM daemon is running
- Check system logs
- Restart slurmd if needed
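The first two checks can be combined into one quick probe: try to open a TCP connection to the port slurmd listens on. This is a sketch only. It assumes the node uses Slurm's default `SlurmdPort` of 6818 (your `slurm.conf` may set another value), and `port_open` is a hypothetical helper, not an S9S feature.

```python
import socket

SLURMD_PORT = 6818  # Slurm's default SlurmdPort; check slurm.conf for your site


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `port_open("node001", SLURMD_PORT)` from the controller host returns False when the node is unreachable or slurmd is not listening. It cannot tell those two cases apart, so a failed probe still needs a look at the network and the system logs.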
Node not accepting jobs:
- Check if node is drained
- Verify available resources
- Check job constraints vs node features
- Review partition configuration
High load but no jobs:
- Check for system processes
- Look for hung or zombie processes
- Check for I/O wait issues
- Review system logs
Node Diagnostics
View node details by selecting a node and pressing Enter in the Nodes view. For deeper diagnostics, SSH to the node with s.
Note: Command-mode diagnostic commands (:diag, :ping, :logs, :slurm-status) are planned. See #119.
Best Practices
Node Management
- Plan maintenance windows - Use drain with descriptive reasons
- Monitor during drainage - Ensure jobs complete cleanly
- Verify after maintenance - Test functionality before resuming
- Document changes - Use descriptive drain reasons
- Batch operations - Maintain multiple nodes efficiently
Resource Monitoring
- Set up alerts - Proactive monitoring prevents issues
- Regular health checks - Monitor trends over time
- Capacity planning - Track utilization patterns
- Performance baselines - Know normal vs abnormal behavior
SSH Security
- Use SSH keys - Avoid password authentication
- Limit access - Restrict SSH to necessary users
- Audit connections - Log and monitor SSH usage
- Keep keys secure - Rotate and protect SSH keys
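As an illustration of the first practice, the sketch below builds an `ssh` command line that authenticates with a key only, using standard OpenSSH client options. The user, host, and key path are hypothetical, and this is not how S9S itself is configured; see the SSH Integration Guide for the actual settings.

```python
def ssh_argv(user: str, host: str, key_path: str) -> list[str]:
    """Build an ssh command that relies on public-key authentication."""
    return [
        "ssh",
        "-i", key_path,                     # identity (private key) file
        "-o", "IdentitiesOnly=yes",         # offer only the key given with -i
        "-o", "PasswordAuthentication=no",  # never fall back to a password prompt
        f"{user}@{host}",
    ]


# Hypothetical user, host, and key path for illustration.
print(" ".join(ssh_argv("admin", "node001", "/home/admin/.ssh/id_ed25519")))
```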
Next Steps
- Learn about Batch Operations to manage multiple nodes efficiently
- Explore Advanced Filtering for powerful node queries
- Set up Export capabilities to analyze node data