Self-Hosted Agents - Maintenance and Monitoring¶
Overview¶
Regular maintenance and monitoring ensure self-hosted agents remain healthy, performant, and available for builds. This guide covers maintenance procedures, monitoring strategies, and best practices.
Maintenance Schedule¶
Daily¶
- Automated: Monitor agent status via Azure DevOps
- Automated: Check for failed builds using self-hosted agents
Weekly¶
- Manual: Review agent health and status
- Manual: Check disk space usage
- Manual: Review build performance metrics
Monthly¶
- Manual: Update agent software
- Manual: Update system packages (Linux) or install Windows updates
- Manual: Review and optimize agent configuration
- Manual: Clean up old build artifacts and caches
Quarterly¶
- Manual: Review and optimize agent configuration
- Manual: Assess scaling needs based on usage
- Manual: Security audit and updates
Monitoring Agent Health¶
Azure DevOps Monitoring¶
Agent Status¶
- Navigate to Organization Settings → Agent Pools
- Select agent pool (e.g., Hetzner-Linux)
- Go to Agents tab
- Check agent status:
  - Green (Online): Agent is running and available
  - Gray (Offline): Agent is not running or cannot connect
  - Yellow (Busy): Agent is currently running a job
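The same status information is available from the Azure DevOps REST API, which is one way to automate the daily status check. The following is a minimal sketch, assuming `curl` and `jq` are installed and a PAT with Agent Pools (Read) scope is exported as `AZDO_PAT`. The function names, the variable name, and the `api-version=7.1` value are assumptions for illustration.

```shell
# Build the REST URL that lists the agents in a pool.
# $1 = organization name, $2 = numeric pool ID (both supplied by you).
agents_url() {
  local org="$1" pool_id="$2"
  printf 'https://dev.azure.com/%s/_apis/distributedtask/pools/%s/agents?api-version=7.1\n' \
    "$org" "$pool_id"
}

# Print "name<TAB>status" for every agent in the pool.
# Assumes curl and jq are available and AZDO_PAT holds a valid PAT.
list_agent_status() {
  curl -fsS -u ":${AZDO_PAT:?set AZDO_PAT first}" "$(agents_url "$1" "$2")" |
    jq -r '.value[] | "\(.name)\t\(.status)"'
}

# Example (not run here): list_agent_status my-org 12 | grep -v 'online$'
```

Filtering the output for anything other than `online` gives a simple offline-agent alert that can be run from cron.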
Agent History¶
- Click on an agent name
- Go to History tab
- Review:
  - Recent job executions
  - Success/failure rates
  - Average job duration
  - Last job execution time
Build Queue Monitoring¶
Monitor the build queue to identify scaling needs:
- Navigate to Pipelines → Runs
- Filter by Queued status
- Monitor queue wait times
- If queue wait time consistently exceeds 5 minutes, consider adding agents
Server-Level Monitoring¶
Linux Agents¶
Connect to Server:
Using SSH Key:
# Connect using SSH key
ssh azdevops@<server-ip>
# Or with specific key file
ssh -i ~/.ssh/id_rsa azdevops@<server-ip>
Using Password:
# Same command; enter the account password when prompted
ssh azdevops@<server-ip>
Check Agent Service Status:
# SSH into server
ssh azdevops@<server-ip>
# Check service status (quote the pattern so systemd, not the shell, expands it)
sudo systemctl status 'vsts.agent.*.service'
# View recent logs
sudo journalctl -u 'vsts.agent.*.service' -n 50 --no-pager
Check Disk Space:
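A minimal sketch of a disk check for a Linux agent, assuming the agent is installed in `~/azagent` as in the update steps in this guide. The function names and the 80% figure in the example are illustrative assumptions, not part of the agent tooling.

```shell
# Print the used-space percentage (integer, no % sign) of the filesystem
# that holds the given path. `df -P` keeps each filesystem on one line.
disk_used_percent() {
  df -P "${1:-/}" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Print the total size of the agent work directory, if it exists.
# Defaults to ~/azagent/_work, the location used elsewhere in this guide.
work_dir_size() {
  local work_dir="${1:-$HOME/azagent/_work}"
  if [ -d "$work_dir" ]; then
    du -sh "$work_dir" 2>/dev/null | cut -f1
  fi
}

# Example (not run here):
#   used=$(disk_used_percent /)
#   [ "$used" -ge 80 ] && echo "Disk usage is ${used}%; work dir: $(work_dir_size)"
```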
Check System Resources:
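A minimal sketch of a resource check for a Linux agent, reading the kernel's `/proc` files directly so that no extra packages are needed. The function names are hypothetical, and the "above 1.0" reading of load per core is a common rule of thumb rather than a hard limit.

```shell
# Available memory in MiB, taken from the MemAvailable line of /proc/meminfo.
mem_available_mb() {
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' /proc/meminfo
}

# 1-minute load average divided by the number of CPU cores.
# Values that stay above 1.0 suggest the CPU is saturated.
load_per_core() {
  local cores
  cores=$(nproc)
  awk -v cores="$cores" '{ printf "%.2f\n", $1 / cores }' /proc/loadavg
}

# Example (not run here):
#   echo "free memory: $(mem_available_mb) MiB, load per core: $(load_per_core)"
```

For an interactive view, `top` or `free -h` show the same figures.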
Windows Agents¶
Connect to Server via RDP:
On Windows:
# Open Remote Desktop Connection
mstsc /v:<server-ip>
# Or use GUI:
# Press Win+R, type mstsc, press Enter
# Enter server IP and credentials
On Mac/Linux:
- Use Microsoft Remote Desktop (Mac) or Remmina (Linux)
- See Windows Setup Guide for detailed instructions
Check Agent Service Status:
# After RDP connection, open PowerShell
# Check service status
Get-Service | Where-Object {$_.Name -like "*vsts*"}
# View recent event logs
Get-EventLog -LogName Application -Source "vsts*" -Newest 50
Check Disk Space:
# Check disk usage (percent free = free / total capacity)
Get-PSDrive C | Select-Object Used, Free, @{Name="PercentFree";Expression={[math]::Round(($_.Free/($_.Used+$_.Free))*100, 1)}}
# Check agent work directory
Get-ChildItem C:\azagent\_work -Recurse | Measure-Object -Property Length -Sum
Check System Resources:
# Top 10 processes by cumulative CPU time, then current CPU and memory counters
Get-Process | Sort-Object CPU -Descending | Select-Object -First 10
Get-Counter '\Processor(_Total)\% Processor Time'
Get-Counter '\Memory\Available MBytes'
Maintenance Procedures¶
Update Agent Software¶
Linux¶
# SSH into server
ssh azdevops@<server-ip>
cd ~/azagent
# Stop service
sudo ./svc.sh stop
# Download new version (check https://github.com/microsoft/azure-pipelines-agent/releases)
AGENT_VERSION="3.248.0" # Update to latest version
curl -fLO "https://download.agent.dev.azure.com/agent/${AGENT_VERSION}/vsts-agent-linux-x64-${AGENT_VERSION}.tar.gz"
tar xzf vsts-agent-linux-x64-${AGENT_VERSION}.tar.gz
# Restart service
sudo ./svc.sh start
# Verify status
sudo ./svc.sh status
Windows¶
# RDP into server, then open PowerShell as Administrator
cd C:\azagent
# Stop service (the Windows agent runs as a Windows service named vstsagent.*)
Get-Service -Name "vstsagent*" | Stop-Service
# Download new version
$AgentVersion = "3.248.0" # Update to latest version
$AgentUrl = "https://download.agent.dev.azure.com/agent/$AgentVersion/vsts-agent-win-x64-$AgentVersion.zip"
Invoke-WebRequest -Uri $AgentUrl -OutFile "agent.zip"
Expand-Archive -Path "agent.zip" -DestinationPath . -Force
Remove-Item "agent.zip"
# Restart service
Get-Service -Name "vstsagent*" | Start-Service
# Verify status
Get-Service | Where-Object {$_.Name -like "*vsts*"}
Update System Packages¶
Linux¶
# Update package lists
sudo apt update
# Upgrade packages
sudo apt upgrade -y
# Update .NET SDK (if needed)
sudo apt install --only-upgrade dotnet-sdk-9.0
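# Reboot only if required (Debian/Ubuntu create this flag file, e.g. after a kernel update)
[ -f /var/run/reboot-required ] && sudo reboot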
Windows¶
# Check for updates (requires the PSWindowsUpdate module: Install-Module PSWindowsUpdate)
Get-WindowsUpdate
# Install updates
Install-WindowsUpdate -AcceptAll -AutoReboot
# Or use Windows Update GUI
# Settings → Update & Security → Windows Update
Clean Up Build Artifacts¶
Linux¶
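# Clean agent work directory (be careful - this removes all build artifacts;
# run it only while the agent is idle or stopped)
rm -rf ~/azagent/_work/*
# Or remove only top-level work directories not modified in the last 30 days
find ~/azagent/_work -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
# Clean Docker (if used); this also deletes all unused images and volumes
docker system prune -a --volumes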
Windows¶
# Clean agent work directory
Remove-Item C:\azagent\_work\* -Recurse -Force
# Or clean specific old directories
Get-ChildItem C:\azagent\_work -Directory | Where-Object {$_.LastWriteTime -lt (Get-Date).AddDays(-30)} | Remove-Item -Recurse -Force
# Clean Docker (if used)
docker system prune -a --volumes
Rotate PAT Tokens¶
- Create new PAT in Azure DevOps:
  - User Settings → Personal Access Tokens → New Token
  - Same scopes as existing token
- Update agent configuration:
  Linux:
  Windows:
- Revoke old PAT token in Azure DevOps
Scaling Strategy¶
When to Scale Up¶
- Build queue wait time consistently > 5 minutes
- Multiple builds queued simultaneously
- Agent utilization consistently > 80%
- Build deadlines being missed due to queue times
When to Scale Down¶
- Agents idle for extended periods (> 1 week)
- Low build queue wait times (< 1 minute)
- Agent utilization consistently < 20%
- Cost optimization needed
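The thresholds in the two lists above can be folded into a small decision helper. This is a sketch only: the function name is hypothetical, and treating the scale-up conditions as "any one is enough" while requiring both scale-down conditions is an assumption chosen to err on the side of keeping capacity.

```shell
# Suggest a scaling action from two observed metrics:
#   $1 = typical queue wait in seconds, $2 = agent utilization in percent.
# Scale up if wait > 5 min OR utilization > 80%.
# Scale down only if wait < 1 min AND utilization < 20%.
scaling_advice() {
  local wait_s="$1" util_pct="$2"
  if [ "$wait_s" -gt 300 ] || [ "$util_pct" -gt 80 ]; then
    echo "scale-up"
  elif [ "$wait_s" -lt 60 ] && [ "$util_pct" -lt 20 ]; then
    echo "scale-down"
  else
    echo "hold"
  fi
}

# Example (not run here): scaling_advice 420 65
```

Feed it averages taken over the 1-2 week observation window rather than single readings, so that one busy afternoon does not trigger a purchase.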
Scaling Process¶
- Monitor metrics for 1-2 weeks
- Identify scaling need based on metrics
- Provision new server (or use Terraform)
- Install and configure agent (see Linux Setup or Windows Setup)
- Register to existing pool
- Monitor new agent performance
- Document new agent in inventory
Performance Optimization¶
Build Cache Management¶
Linux:
# Configure NuGet cache location
export NUGET_PACKAGES=~/.nuget/packages
# Configure npm cache
npm config set cache ~/.npm
Windows:
# Configure NuGet cache location
$env:NUGET_PACKAGES = "$env:USERPROFILE\.nuget\packages"
# Configure npm cache
npm config set cache "$env:USERPROFILE\.npm"
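On Linux, `export` only affects the current shell session. When the agent runs as a systemd service, it takes its environment from the `.env` file in the agent root, so cache settings need to be written there and the service restarted. The sketch below shows one way to do that; `set_agent_env` is a hypothetical helper name, and the `~/azagent` path is the one assumed throughout this guide.

```shell
# Add or replace NAME=value in the agent's .env file (Linux agents).
#   $1 = agent root directory, $2 = variable name, $3 = value
set_agent_env() {
  local agent_dir="$1" name="$2" value="$3"
  local env_file="$agent_dir/.env" tmp
  tmp=$(mktemp) || return 1
  # Keep every existing line except a previous assignment of the same name.
  if [ -f "$env_file" ]; then
    grep -v "^${name}=" "$env_file" > "$tmp" || true
  fi
  printf '%s=%s\n' "$name" "$value" >> "$tmp"
  # Write back in place so the file keeps its owner and permissions.
  cat "$tmp" > "$env_file"
  rm -f "$tmp"
}

# Example (not run here):
#   set_agent_env ~/azagent NUGET_PACKAGES "$HOME/.nuget/packages"
#   (cd ~/azagent && sudo ./svc.sh stop && sudo ./svc.sh start)
```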
Disk Space Management¶
- Monitor disk usage regularly
- Set up automated cleanup scripts
- Use separate volumes for agent work directories
- Consider increasing disk size if needed
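An automated cleanup script can combine the disk check with the age-based removal shown earlier, so that nothing is deleted while there is still plenty of space. The following is a sketch under stated assumptions: the function names, the default 80% threshold, and the `DRY_RUN` switch are all illustrative, and it relies on GNU `find`. Run it only while the agent is idle or stopped.

```shell
# List top-level directories under work_dir not modified in the last N days.
stale_work_dirs() {
  local work_dir="$1" days="${2:-30}"
  find "$work_dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" -print
}

# Delete stale directories only when disk usage is at or above a threshold.
#   $1 = work directory, $2 = threshold percent (default 80), $3 = age in days (default 30)
# Set DRY_RUN=1 to print what would be removed instead of deleting it.
cleanup_if_full() {
  local work_dir="$1" threshold="${2:-80}" days="${3:-30}" used dir
  used=$(df -P "$work_dir" | awk 'NR == 2 { gsub("%", "", $5); print $5 }')
  [ "$used" -ge "$threshold" ] || return 0
  stale_work_dirs "$work_dir" "$days" | while IFS= read -r dir; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would remove $dir"
    else
      rm -rf -- "$dir"
    fi
  done
}

# Example (not run here): DRY_RUN=1 cleanup_if_full ~/azagent/_work 80 30
```

Starting with `DRY_RUN=1` in a cron job for a week is a cheap way to confirm the selection before allowing real deletions.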
Network Optimization¶
- Ensure agents have stable internet connection
- Use local package repositories when possible
- Configure proxy if needed
- Monitor network latency to Azure DevOps
Security Maintenance¶
Regular Security Updates¶
- Apply OS security patches promptly
- Update installed software regularly
- Review and rotate credentials (PAT tokens, SSH keys)
- Monitor for security advisories
Access Control¶
- Use dedicated agent user accounts (not root/administrator)
- Limit SSH/RDP access to authorized personnel
- Use strong passwords or SSH keys
- Regularly review access logs
Audit and Compliance¶
- Review agent logs for suspicious activity
- Monitor failed authentication attempts
- Keep audit trail of agent changes
- Document security configurations
Backup and Recovery¶
Agent Configuration Backup¶
Linux:
# Backup agent configuration (the archive contains credentials; restrict access to it)
tar -czf ~/azagent-backup-$(date +%Y%m%d).tar.gz -C ~/azagent .agent .credentials
Windows:
# Backup agent configuration
Compress-Archive -Path C:\azagent\.agent, C:\azagent\.credentials -DestinationPath "C:\backups\azagent-backup-$(Get-Date -Format 'yyyyMMdd').zip"
Disaster Recovery Plan¶
- Document agent configuration and setup
- Maintain server images or Terraform configurations
- Test recovery procedures regularly
- Keep backups of critical configurations
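Testing recovery can start with confirming that a backup archive actually contains the files it should. A minimal sketch for the Linux `tar.gz` backups described above; `verify_agent_backup` is a hypothetical helper, and it checks only that the two entries are present, not that their contents are valid.

```shell
# Return 0 if the archive lists both .agent and .credentials, otherwise 1.
# Matches the entries whether they were stored with or without a leading path.
verify_agent_backup() {
  local archive="$1" listing
  listing=$(tar -tzf "$archive" 2>/dev/null) || return 1
  printf '%s\n' "$listing" | grep -Eq '(^|/)\.agent$' &&
    printf '%s\n' "$listing" | grep -Eq '(^|/)\.credentials$'
}

# Example (not run here):
#   verify_agent_backup ~/azagent-backup-20250101.tar.gz || echo "backup incomplete"
```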
Monitoring Tools¶
Azure DevOps Built-in Monitoring¶
- Agent status dashboard
- Build queue monitoring
- Agent history and statistics
Third-Party Monitoring (Optional)¶
- Prometheus + Grafana: For detailed metrics
- Nagios/Zabbix: For infrastructure monitoring
- Azure Monitor: If using Azure services
- Custom scripts: For specific monitoring needs
Troubleshooting Common Issues¶
Agent Goes Offline¶
- Check agent service status
- Verify network connectivity
- Review agent logs
- Check for system updates or reboots
- Verify PAT token is valid
High Disk Usage¶
- Clean up old build artifacts
- Remove unused Docker images
- Clear package caches
- Consider increasing disk size
Slow Build Performance¶
- Check system resources (CPU, memory, disk I/O)
- Review build logs for bottlenecks
- Optimize build cache usage
- Consider upgrading server resources
Next Steps¶
- Review Troubleshooting Guide for specific issues
- Set up automated monitoring alerts
- Document your specific maintenance procedures
- Schedule regular maintenance windows