# Troubleshooting Guide
This guide helps you resolve common issues with S9S. If you can't find a solution here, please check our GitHub Issues or join our Discord community.
## Common Issues

### Installation Problems

#### "Command not found" after installation

**Problem:** S9S is installed but not in `PATH`.

**Solutions:**

```bash
# Check if S9S is installed
which s9s
ls -la ~/.local/bin/s9s

# Add to PATH (bash)
echo 'export PATH=$PATH:~/.local/bin' >> ~/.bashrc
source ~/.bashrc

# Add to PATH (zsh)
echo 'export PATH=$PATH:~/.local/bin' >> ~/.zshrc
source ~/.zshrc

# Or use the full path
~/.local/bin/s9s
```

#### Permission denied during installation

**Problem:** Cannot write to the installation directory.

**Solutions:**

```bash
# Install to user directory (recommended)
mkdir -p ~/.local/bin
mv s9s ~/.local/bin/
chmod +x ~/.local/bin/s9s
export PATH=$PATH:~/.local/bin

# Add to shell rc file for persistence
echo 'export PATH=$PATH:~/.local/bin' >> ~/.bashrc
```
### Connection Issues

#### Cannot connect to SLURM cluster

**Problem:** S9S cannot reach the SLURM REST API.

**Important:** S9S requires `slurmrestd` (the SLURM REST API daemon, default port 6820). Having `slurmctld` (port 6817) and `slurmdbd` (port 6819) running is not sufficient. Check whether `slurmrestd` is running:

```bash
ss -tlnp | grep 6820
```

**Diagnostics:**

```bash
# Check if slurmrestd is running
ss -tlnp | grep 6820

# Test connection
s9s --debug
s9s config validate

# Check API endpoint
curl -k https://your-slurm-api.com/slurm/v0.0.43/ping

# Verify credentials
echo $SLURM_JWT
```

**Solutions:**

1. Start `slurmrestd` if it's not running:

   ```bash
   # Start slurmrestd (as root or SlurmUser)
   slurmrestd 0.0.0.0:6820

   # Or with systemd (if configured)
   sudo systemctl start slurmrestd

   # Verify it's listening
   ss -tlnp | grep 6820
   curl http://localhost:6820/slurm/v0.0.43/ping
   ```

2. Check the endpoint format:

   ```yaml
   # Correct
   endpoint: "https://slurm.example.com:6820"

   # Incorrect: missing protocol
   endpoint: "slurm.example.com"

   # Incorrect: trailing slash
   endpoint: "https://slurm.example.com:6820/"
   ```
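To catch these endpoint mistakes quickly, a tiny shell function (hypothetical, written for this guide and not part of S9S) can lint an endpoint string before you put it in the config:

```bash
# check_endpoint: warn about the common endpoint formatting mistakes.
# Hypothetical helper for this guide, not an S9S command.
check_endpoint() {
  ep=$1
  case $ep in
    http://*|https://*) ;;                      # protocol present
    *) echo "missing protocol"; return 1 ;;
  esac
  case $ep in
    */) echo "trailing slash"; return 1 ;;
  esac
  case $ep in
    *:[0-9]*) ;;                                # port present
    *) echo "missing port (slurmrestd default is 6820)"; return 1 ;;
  esac
  echo "ok"
}

check_endpoint "https://slurm.example.com:6820"   # ok
check_endpoint "slurm.example.com"                # missing protocol
```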
3. Verify network access:

   ```bash
   # Test connectivity
   ping slurm.example.com
   telnet slurm.example.com 6820

   # Check firewall
   sudo iptables -L | grep 6820
   ```

4. Handle SSL/TLS issues:

   ```yaml
   # For self-signed certificates
   clusters:
     - name: "default"
       cluster:
         endpoint: "https://slurm.example.com:6820"
         insecure: true
   ```
#### Authentication failures

**Problem:** Invalid credentials or token.

**Solutions:**

1. Verify token authentication:

   ```bash
   # Verify token
   echo $SLURM_JWT

   # Test token directly
   curl -H "X-Auth-Token: $SLURM_JWT" \
     https://slurm.example.com/slurm/v0.0.43/jobs

   # Refresh token
   scontrol token
   ```

2. Generate a new token:

   ```bash
   # Generate a new SLURM JWT token
   scontrol token

   # Set it in your environment
   export SLURM_JWT="<new-token>"
   ```
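Authentication failures are often just an expired token. Assuming your token is a standard JWT, a small helper (hypothetical, written for this guide) can decode its `exp` claim to show when it lapses. Note this only reads the payload; it does not verify the signature:

```bash
# jwt_exp: print the "exp" (expiry, Unix seconds) claim of a JWT payload.
# Hypothetical helper for this guide; does NOT validate the token.
jwt_exp() {
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the base64 padding stripped by the JWT encoding
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  printf '%s' "$payload" | base64 -d | grep -o '"exp":[0-9]*' | cut -d: -f2
}

# Compare the expiry against the current time
exp=$(jwt_exp "${SLURM_JWT:-}")
now=$(date +%s)
if [ -n "$exp" ] && [ "$exp" -gt "$now" ]; then
  echo "token valid (expires at epoch $exp)"
else
  echo "token missing or expired"
fi
```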
### Display Issues

#### Corrupted or garbled display

**Problem:** Terminal compatibility issues.

**Solutions:**

1. Check terminal capabilities:

   ```bash
   # Verify 256-color support
   tput colors

   # Test UTF-8 support
   echo $LANG
   locale

   # Set a proper locale
   export LANG=en_US.UTF-8
   export LC_ALL=en_US.UTF-8
   ```

2. Try a different terminal:

   - Recommended: iTerm2, Alacritty, kitty
   - Avoid: Windows Command Prompt
   - On Windows, use Windows Terminal or WSL2

3. Adjust S9S settings:

   ```yaml
   ui:
     skin: "default"
     noIcons: true  # Disable icons if they render incorrectly
   ```
#### Screen flickering or slow updates

**Problem:** Performance issues.

**Solutions:**

1. Adjust the refresh rate: change the `refreshRate` setting in your configuration file (`~/.s9s/config.yaml`):

   ```yaml
   refreshRate: "10s"  # Slower refresh (default is 2s)
   ```

2. Customize visible columns in your configuration file:

   ```yaml
   views:
     jobs:
       columns: [id, name, state, time]
   ```

3. Check system resources:

   ```bash
   # Monitor S9S resource usage
   top -p $(pgrep s9s)

   # Check network latency
   ping -c 10 slurm.example.com
   ```
### Data Issues

#### Jobs not showing up

**Problem:** Missing or filtered jobs.

**Diagnostics:**

- Press `/` to check if a filter is active, then press `Esc` to clear it
- Press `F5` to force a manual refresh

**Solutions:**

1. Check permissions:

   ```bash
   # Verify the user can see jobs
   sacctmgr show user $USER

   # Check account associations
   sacctmgr show associations user=$USER
   ```

2. Fix an API version mismatch:

   ```yaml
   # Update the API version
   clusters:
     - name: "default"
       cluster:
         apiVersion: v0.0.43  # or latest
   ```

3. Check partition visibility:

   ```bash
   # Switch to the partitions view in S9S
   :partitions

   # Or check partition access from the command line
   sinfo -s
   ```
#### Incorrect job states

**Problem:** Stale or wrong job information.

**Solutions:**

1. Force a refresh: press `F5` in any view.

2. Check time sync:

   ```bash
   # Verify time sync
   timedatectl status

   # Sync time
   sudo ntpdate -s time.nist.gov
   ```
### Performance Problems

#### S9S is slow or unresponsive

**Problem:** Performance degradation.

**Solutions:**

1. Limit displayed jobs:

   ```yaml
   views:
     jobs:
       maxJobs: 100  # Limit results (default 1000)
   ```

2. Analyze in debug mode:

   ```bash
   # Enable debug logging
   s9s --debug

   # Check the debug log (written to ./s9s-debug.log in the current directory)
   tail -f ./s9s-debug.log

   # Check the app log (general application log)
   tail -f ~/.s9s/s9s.log
   ```

3. Increase the cluster timeout:

   ```yaml
   clusters:
     - name: "default"
       cluster:
         endpoint: "https://slurm.example.com:6820"
         timeout: "60s"  # Increase timeout
   ```
### SSH Issues

#### Cannot SSH to nodes

**Problem:** SSH connection fails from S9S.

**Solutions:**

1. Configure SSH via your system settings (`~/.ssh/config`):

   ```
   Host node*
     User your-username
     IdentityFile ~/.ssh/id_rsa
     StrictHostKeyChecking no
   ```

2. Test SSH manually:

   ```bash
   # Test connection
   ssh node001

   # Check the SSH agent
   ssh-add -l

   # Add a key to the agent
   ssh-add ~/.ssh/id_rsa
   ```

3. Check node name resolution:

   ```bash
   # Check DNS
   nslookup node001

   # Add to the hosts file
   echo "10.0.0.1 node001" | sudo tee -a /etc/hosts
   ```
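OpenSSH can also show exactly which settings it would apply to a node without actually connecting, which is useful for verifying the `Host node*` block above. Here `node001` is a placeholder node name:

```bash
# Print the effective SSH client configuration for a host (no connection
# is made); node001 is a placeholder from the examples above.
ssh -G node001 | grep -iE '^(user|identityfile|stricthostkeychecking) '
```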
## Advanced Troubleshooting

### Debug Mode

Enable debug logging:

```bash
# Start with debug logging
s9s --debug

# Save debug output to a file
s9s --debug 2>&1 | tee debug.log
```

### API Testing

Test the SLURM REST API directly to isolate connection issues:

```bash
# Test the API endpoint
curl -k -H "X-Auth-Token: $SLURM_JWT" \
  https://slurm.example.com/slurm/v0.0.43/ping

# List jobs via the API
curl -k -H "X-Auth-Token: $SLURM_JWT" \
  https://slurm.example.com/slurm/v0.0.43/jobs
```

### Log Analysis

The `--debug` flag writes a debug log to `./s9s-debug.log` in the current working directory. The general app log is at `~/.s9s/s9s.log`.

```bash
# View the recent debug log (created by the --debug flag)
tail -n 100 ./s9s-debug.log

# Search for errors in the debug log
grep ERROR ./s9s-debug.log

# Monitor the debug log in real time
tail -f ./s9s-debug.log

# View the general app log
tail -n 100 ~/.s9s/s9s.log
```

### Diagnostic Information

To view cluster health information, use the built-in health view:

```
# Switch to the health view
:health

# Or press 9 to switch to the health view
```

For configuration issues, use the config view:

```
# Open the configuration view
:config
```
**Note:** Additional diagnostic commands are planned; see #119.
## Getting Help

### Collect Debug Information

When reporting issues, include:

```bash
# Check the S9S version
s9s --version

# Run with debug logging
s9s --debug 2>&1 | tee debug.log

# Collect the logs if available
tar czf s9s-debug.tar.gz ./s9s-debug.log ~/.s9s/s9s.log
```
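Before attaching your config to a public issue, strip secrets from it. A hypothetical bundling script (not shipped with S9S) that gathers the logs plus a token-redacted copy of the config might look like:

```bash
# Hypothetical report-bundling script for this guide: collects version
# info, logs, and a token-redacted config into one tarball.
out=$(mktemp -d)
mkdir -p "$out/report"

# Version info (ignore failure if s9s is not on PATH)
s9s --version > "$out/report/version.txt" 2>&1 || true

# Copy logs if they exist
for f in ./s9s-debug.log "$HOME/.s9s/s9s.log"; do
  [ -f "$f" ] && cp "$f" "$out/report/" || true
done

# Redact anything that looks like a token before sharing the config
if [ -f "$HOME/.s9s/config.yaml" ]; then
  sed 's/\(token[^:]*:\).*/\1 REDACTED/' "$HOME/.s9s/config.yaml" \
    > "$out/report/config.yaml"
fi

tar czf s9s-report.tar.gz -C "$out" report
echo "wrote s9s-report.tar.gz"
```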
### Community Support

- **Discord**: Join our server
- **GitHub Issues**: Report bugs
- **Discussions**: GitHub Discussions

### Enterprise Support

For enterprise support:

- Email: [email protected]
- Priority support available
- SLA guarantees
- Custom development
## Recovery Procedures

### Reset S9S

Complete reset:

```bash
# Back up the configuration
cp -r ~/.s9s ~/.s9s.backup

# Remove all S9S state
rm -rf ~/.s9s

# S9S will recreate defaults on next launch
s9s
```

### Clear Cache

```bash
rm -rf ~/.s9s/cache/
```

### Reinstall S9S

```bash
# Back up the config
cp ~/.s9s/config.yaml ~/s9s-config-backup.yaml

# Remove S9S
rm ~/.local/bin/s9s
rm -rf ~/.s9s

# Reinstall
curl -sSL https://get.s9s.dev | bash

# Restore the config
mkdir -p ~/.s9s
cp ~/s9s-config-backup.yaml ~/.s9s/config.yaml
```
## Prevention Tips

- **Keep S9S updated**: check for updates regularly
- **Monitor logs**: set up log rotation and monitoring
- **Test changes**: use mock mode for testing
- **Backup config**: version-control your configuration
- **Document issues**: keep notes on resolved problems
## Next Steps

- Review the Configuration Reference for optimization
- Join our Community for help