Self-Hosted Agents - Maintenance and Monitoring

Overview

Regular maintenance and monitoring ensure self-hosted agents remain healthy, performant, and available for builds. This guide covers maintenance procedures, monitoring strategies, and best practices.

Maintenance Schedule

Daily

  • Automated: Monitor agent status via Azure DevOps
  • Automated: Check for failed builds using self-hosted agents

Weekly

  • Manual: Review agent health and status
  • Manual: Check disk space usage
  • Manual: Review build performance metrics

Monthly

  • Manual: Update agent software
  • Manual: Update system packages (Linux) or install Windows updates
  • Manual: Review and optimize agent configuration
  • Manual: Clean up old build artifacts and caches

Quarterly

  • Manual: Review and optimize agent configuration
  • Manual: Assess scaling needs based on usage
  • Manual: Security audit and updates

Monitoring Agent Health

Azure DevOps Monitoring

Agent Status

  1. Navigate to Organization Settings → Agent Pools
  2. Select agent pool (e.g., Hetzner-Linux)
  3. Go to Agents tab
  4. Check agent status:
       • Green (Online): Agent is running and available
       • Gray (Offline): Agent is not running or cannot connect
       • Yellow (Busy): Agent is currently running a job

Agent History

  1. Click on an agent name
  2. Go to History tab
  3. Review:
       • Recent job executions
       • Success/failure rates
       • Average job duration
       • Last job execution time

Build Queue Monitoring

Monitor build queue to identify scaling needs:

  1. Navigate to Pipelines → Runs
  2. Filter by Queued status
  3. Monitor queue wait times
  4. If queue time consistently > 5 minutes, consider adding agents
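
The five-minute rule above can be checked mechanically once queue wait times have been collected. The following is a minimal sketch, assuming the wait times (in seconds) are available one per line; the function names and the input format are illustrative and not part of Azure DevOps.

```shell
# Average of the numeric values on stdin, truncated to whole seconds.
avg_wait_seconds() {
  awk '{ sum += $1; n++ } END { if (n > 0) printf "%d\n", sum / n; else print 0 }'
}

# Succeeds (exit 0) when the average wait on stdin exceeds the threshold
# given as $1 (default 300 seconds, i.e. the 5-minute rule).
needs_more_agents() {
  local avg threshold="${1:-300}"
  avg="$(avg_wait_seconds)"
  [ "$avg" -gt "$threshold" ]
}
```

For example, `needs_more_agents < waits.txt && echo "Consider adding agents"`, where `waits.txt` is a hypothetical export of recent queue times.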

Server-Level Monitoring

Linux Agents

Connect to Server:

Using SSH Key:

# Connect using SSH key
ssh azdevops@<server-ip>

# Or with specific key file
ssh -i ~/.ssh/id_rsa azdevops@<server-ip>

Using Password:

# Connect using password
ssh azdevops@<server-ip>
# Enter password when prompted

Check Agent Service Status:

# SSH into server
ssh azdevops@<server-ip>

# Check service status (quote the pattern so the shell does not expand it)
sudo systemctl status 'vsts.agent.*.service'

# View recent logs
sudo journalctl -u 'vsts.agent.*.service' -n 50 --no-pager

Check Disk Space:

# Check disk usage
df -h

# Check specific directory usage
du -sh /home/azdevops/azagent/_work/*
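
The disk check can be turned into a pass/fail test for use in a scheduled job. This is a sketch only: the 80% default threshold and the function names are assumptions, not values from the agent documentation.

```shell
# Print the used percentage (without the % sign) of the filesystem holding $1.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report usage for path $1; return non-zero once it reaches threshold $2
# (default 80, an assumed value).
check_disk() {
  local path="$1" threshold="${2:-80}" used
  used="$(disk_used_pct "$path")"
  if [ "$used" -ge "$threshold" ]; then
    echo "WARNING: $path is ${used}% full"
    return 1
  fi
  echo "OK: $path is ${used}% full"
}
```

A typical call would be `check_disk /home/azdevops/azagent/_work 80`.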

Check System Resources:

# CPU and memory usage
top
# or
htop

# System load
uptime
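
The load average reported by `uptime` is easier to interpret relative to the number of CPU cores. A small sketch, assuming that a 1-minute load above the core count is a reasonable "overloaded" signal for a build agent; the function name is illustrative.

```shell
# Succeeds when load average $1 exceeds core count $2.
load_exceeds_cores() {
  awk -v l="$1" -v c="$2" 'BEGIN { exit !(l > c) }'
}

# Usage on the agent host (Linux):
#   load_exceeds_cores "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" && echo "overloaded"
```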

Windows Agents

Connect to Server via RDP:

On Windows:

# Open Remote Desktop Connection
mstsc /v:<server-ip>

# Or use GUI:
# Press Win+R, type mstsc, press Enter
# Enter server IP and credentials

On Mac/Linux:

  • Use Microsoft Remote Desktop (Mac) or Remmina (Linux)
  • See Windows Setup Guide for detailed instructions

Check Agent Service Status:

# After RDP connection, open PowerShell
# Check service status
Get-Service | Where-Object {$_.Name -like "*vsts*"}

# View recent event logs
Get-EventLog -LogName Application -Source "vsts*" -Newest 50

Check Disk Space:

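# Check disk usage (percent free = free / total, where total = used + free)
Get-PSDrive C | Select-Object Used, Free, @{Name="PercentFree";Expression={[math]::Round(($_.Free / ($_.Used + $_.Free)) * 100, 1)}}

# Check agent work directory (total size of all files, in bytes)
Get-ChildItem C:\azagent\_work -Recurse -File | Measure-Object -Property Length -Sum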

Check System Resources:

# CPU and memory usage
Get-Process | Sort-Object CPU -Descending | Select-Object -First 10
Get-Counter '\Processor(_Total)\% Processor Time'
Get-Counter '\Memory\Available MBytes'

Maintenance Procedures

Update Agent Software

Linux

# SSH into server
ssh azdevops@<server-ip>
cd ~/azagent

# Stop service
sudo ./svc.sh stop

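# Download new version (check https://github.com/microsoft/azure-pipelines-agent/releases for the current version and download URL)
AGENT_VERSION="3.248.0"  # Update to latest version
# -f makes curl fail on an HTTP error instead of saving the error page as the archive
curl -fLO "https://vstsagentpackage.azureedge.net/agent/${AGENT_VERSION}/vsts-agent-linux-x64-${AGENT_VERSION}.tar.gz"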
tar xzf vsts-agent-linux-x64-${AGENT_VERSION}.tar.gz

# Restart service
sudo ./svc.sh start

# Verify status
sudo ./svc.sh status
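
Before stopping the service it can help to confirm that the target version is actually newer than the installed one. A minimal sketch, assuming GNU `sort -V` is available; `version_lt` is an illustrative helper, and the installed version has to be supplied by hand (for example from the agent's capabilities in Azure DevOps).

```shell
# Succeeds when version $1 is strictly older than version $2.
# Relies on GNU sort's -V (version sort) option.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

For example, `version_lt "$INSTALLED_VERSION" "$AGENT_VERSION" || echo "Already up to date"`, where `INSTALLED_VERSION` is a hypothetical variable holding the current agent version.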

Windows

# RDP into server
cd C:\azagent

# Stop service
.\svc.cmd stop

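# Download new version (check https://github.com/microsoft/azure-pipelines-agent/releases for the current version and download URL)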
$AgentVersion = "3.248.0"  # Update to latest version
$AgentUrl = "https://vstsagentpackage.azureedge.net/agent/$AgentVersion/vsts-agent-win-x64-$AgentVersion.zip"
Invoke-WebRequest -Uri $AgentUrl -OutFile "agent.zip"
Expand-Archive -Path "agent.zip" -DestinationPath . -Force
Remove-Item "agent.zip"

# Restart service
.\svc.cmd start

# Verify status
Get-Service | Where-Object {$_.Name -like "*vsts*"}

Update System Packages

Linux

# Update package lists
sudo apt update

# Upgrade packages
sudo apt upgrade -y

# Update .NET SDK (if needed)
sudo apt install --only-upgrade dotnet-sdk-9.0

# Reboot only if required (Debian/Ubuntu create this file when a reboot is pending)
[ -f /var/run/reboot-required ] && sudo reboot

Windows

# Check for updates (requires the PSWindowsUpdate module: Install-Module PSWindowsUpdate)
Get-WindowsUpdate

# Install updates
Install-WindowsUpdate -AcceptAll -AutoReboot

# Or use Windows Update GUI
# Settings → Update & Security → Windows Update

Clean Up Build Artifacts

Linux

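# Clean agent work directory (be careful - this removes all build artifacts;
# stop the agent service first so no running job loses its files)
cd ~/azagent/_work && rm -rf -- ./*

# Or clean only top-level directories not modified in the last 30 days
find ~/azagent/_work -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf -- {} +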

# Clean Docker (if used)
docker system prune -a --volumes
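
Because the cleanup is destructive, listing the candidates before deleting them is a useful safety step. The sketch below separates the two operations; the function names are illustrative, and only top-level directories of the given path are considered.

```shell
# List top-level directories under $1 not modified in more than $2 days.
list_stale_dirs() {
  find "$1" -mindepth 1 -maxdepth 1 -type d -mtime +"$2" -print
}

# Delete the same set of directories. Run list_stale_dirs first to review.
remove_stale_dirs() {
  find "$1" -mindepth 1 -maxdepth 1 -type d -mtime +"$2" -exec rm -rf -- {} +
}
```

For example, run `list_stale_dirs ~/azagent/_work 30`, review the output, then run `remove_stale_dirs ~/azagent/_work 30`.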

Windows

# Clean agent work directory
Remove-Item C:\azagent\_work\* -Recurse -Force

# Or clean specific old directories
Get-ChildItem C:\azagent\_work -Directory | Where-Object {$_.LastWriteTime -lt (Get-Date).AddDays(-30)} | Remove-Item -Recurse -Force

# Clean Docker (if used)
docker system prune -a --volumes

Rotate PAT Tokens

  1. Create new PAT in Azure DevOps:
       • User Settings → Personal Access Tokens → New Token
       • Same scopes as existing token

  2. Update agent configuration:

Linux:

cd ~/azagent
sudo ./svc.sh stop
./config.sh --token <NEW_PAT_TOKEN> --replace
sudo ./svc.sh start

Windows:

cd C:\azagent
.\svc.cmd stop
.\config.cmd --token <NEW_PAT_TOKEN> --replace
.\svc.cmd start

  3. Revoke old PAT token in Azure DevOps
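
Rotation is easier to schedule when the token's expiry date is tracked. The following sketch assumes GNU `date` and that the expiry date is recorded by hand when the PAT is created; the function names and the 14-day warning window are illustrative.

```shell
# Whole days from date $2 (default: today, UTC) until date $1.
# Both dates are given as YYYY-MM-DD. Requires GNU date (-d).
days_until() {
  local target today
  target="$(date -u -d "$1" +%s)"
  today="$(date -u -d "${2:-$(date -u +%F)}" +%s)"
  echo $(( (target - today) / 86400 ))
}

# Succeeds when the PAT expiring on $1 has $2 days or fewer left
# (default 14). $3 optionally overrides "today" for testing.
pat_expiring_soon() {
  [ "$(days_until "$1" "${3:-}")" -le "${2:-14}" ]
}
```

For example, `pat_expiring_soon 2026-01-31 && echo "Rotate the agent PAT"`, with the date replaced by the real expiry.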

Scaling Strategy

When to Scale Up

  • Build queue wait time consistently > 5 minutes
  • Multiple builds queued simultaneously
  • Agent utilization consistently > 80%
  • Build deadlines being missed due to queue times

When to Scale Down

  • Agents idle for extended periods (> 1 week)
  • Low build queue wait times (< 1 minute)
  • Agent utilization consistently < 20%
  • Cost optimization needed
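
The thresholds in the two lists above can be combined into a single recommendation. This is one possible reading, not a rule from the guide: it assumes that either scale-up signal is enough to scale up, while scaling down requires both low utilization and short queue times. The function name and inputs are illustrative.

```shell
# Map agent utilization (percent, $1) and average queue wait (seconds, $2)
# to a recommendation using the thresholds from this guide.
scaling_recommendation() {
  local util="$1" wait_s="$2"
  if [ "$util" -gt 80 ] || [ "$wait_s" -gt 300 ]; then
    echo "scale-up"
  elif [ "$util" -lt 20 ] && [ "$wait_s" -lt 60 ]; then
    echo "scale-down"
  else
    echo "hold"
  fi
}
```

The result is meant as input to the 1-2 week observation period described in the scaling process, not as a trigger for immediate action.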

Scaling Process

  1. Monitor metrics for 1-2 weeks
  2. Identify scaling need based on metrics
  3. Provision new server (or use Terraform)
  4. Install and configure agent (see Linux Setup or Windows Setup)
  5. Register to existing pool
  6. Monitor new agent performance
  7. Document new agent in inventory

Performance Optimization

Build Cache Management

Linux:

# Configure NuGet cache location
export NUGET_PACKAGES=~/.nuget/packages

# Configure npm cache
npm config set cache ~/.npm
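
An `export` in an interactive shell does not reach an agent that runs as a service. On Linux the service reads its environment from the `.env` file in the agent root, so one option is to set the variable there and restart the service. The helper below is a sketch; `set_env_var` is an illustrative name, and the agent path is the one used elsewhere in this guide.

```shell
# Set KEY=VALUE in an env file, replacing any existing line for KEY
# and leaving all other lines untouched.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  if [ -f "$file" ]; then
    grep -v "^${key}=" "$file" > "$tmp" || true
  fi
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  cat "$tmp" > "$file"   # overwrite in place to keep the file's permissions
  rm -f "$tmp"
}
```

For example, `set_env_var ~/azagent/.env NUGET_PACKAGES "$HOME/.nuget/packages"`, followed by `sudo ./svc.sh stop` and `sudo ./svc.sh start` in the agent directory.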

Windows:

# Configure NuGet cache location
$env:NUGET_PACKAGES = "$env:USERPROFILE\.nuget\packages"

# Configure npm cache
npm config set cache "$env:USERPROFILE\.npm"

Disk Space Management

  • Monitor disk usage regularly
  • Set up automated cleanup scripts
  • Use separate volumes for agent work directories
  • Consider increasing disk size if needed

Network Optimization

  • Ensure agents have stable internet connection
  • Use local package repositories when possible
  • Configure proxy if needed
  • Monitor network latency to Azure DevOps
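
Latency to Azure DevOps can be sampled with `curl`'s timing output. A rough sketch, assuming `curl` is installed and that total request time to `https://dev.azure.com` is an acceptable proxy for latency; the function names and any threshold are illustrative.

```shell
# Print the total request time in seconds for URL $1
# (requires curl and network access).
request_seconds() {
  curl -o /dev/null -s -w '%{time_total}\n' --max-time 10 "$1"
}

# Succeeds when measured time $1 exceeds limit $2 (both in seconds).
slower_than() {
  awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t > limit) }'
}
```

For example, `slower_than "$(request_seconds https://dev.azure.com)" 1.0 && echo "High latency to Azure DevOps"`.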

Security Maintenance

Regular Security Updates

  • Apply OS security patches promptly
  • Update installed software regularly
  • Review and rotate credentials (PAT tokens, SSH keys)
  • Monitor for security advisories

Access Control

  • Use dedicated agent user accounts (not root/administrator)
  • Limit SSH/RDP access to authorized personnel
  • Use strong passwords or SSH keys
  • Regularly review access logs

Audit and Compliance

  • Review agent logs for suspicious activity
  • Monitor failed authentication attempts
  • Keep audit trail of agent changes
  • Document security configurations
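
Failed authentication attempts on Linux agents can be summarized from the SSH daemon's log. The sketch below assumes the OpenSSH "Failed password ... from <ip>" message format and a Debian/Ubuntu-style log at `/var/log/auth.log`; both are assumptions about the host, and the function name is illustrative.

```shell
# Count "Failed password" entries per source address in log file $1,
# highest count first. Output format: "<count> <address>".
failed_logins_by_ip() {
  grep 'Failed password' "$1" |
    grep -oE 'from [0-9a-fA-F:.]+' |
    awk '{ print $2 }' | sort | uniq -c | sort -rn
}
```

For example, `sudo cat /var/log/auth.log > /tmp/auth.log && failed_logins_by_ip /tmp/auth.log | head`.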

Backup and Recovery

Agent Configuration Backup

Linux:

# Backup agent configuration
tar -czf ~/azagent-backup-$(date +%Y%m%d).tar.gz ~/azagent/.agent ~/azagent/.credentials
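
Dated backups accumulate over time, so a retention step is worth pairing with the backup command. A minimal sketch, assuming GNU `xargs`, backups named as in the command above, and a directory path without spaces; `prune_backups` and the retention count are illustrative.

```shell
# Keep the newest $2 files named azagent-backup-*.tar.gz in directory $1
# and delete the rest. The date stamp in the name makes a reverse
# lexical sort equal to newest-first.
prune_backups() {
  local dir="$1" keep="$2"
  ls -1 "$dir"/azagent-backup-*.tar.gz 2>/dev/null | sort -r |
    tail -n +"$((keep + 1))" | xargs -r rm -f --
}
```

For example, `prune_backups ~ 4` keeps the four most recent archives in the home directory.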

Windows:

# Backup agent configuration
Compress-Archive -Path C:\azagent\.agent, C:\azagent\.credentials -DestinationPath "C:\backups\azagent-backup-$(Get-Date -Format 'yyyyMMdd').zip"

Disaster Recovery Plan

  1. Document agent configuration and setup
  2. Maintain server images or Terraform configurations
  3. Test recovery procedures regularly
  4. Keep backups of critical configurations

Monitoring Tools

Azure DevOps Built-in Monitoring

  • Agent status dashboard
  • Build queue monitoring
  • Agent history and statistics

Third-Party Monitoring (Optional)

  • Prometheus + Grafana: For detailed metrics
  • Nagios/Zabbix: For infrastructure monitoring
  • Azure Monitor: If using Azure services
  • Custom scripts: For specific monitoring needs

Troubleshooting Common Issues

Agent Goes Offline

  1. Check agent service status
  2. Verify network connectivity
  3. Review agent logs
  4. Check for system updates or reboots
  5. Verify PAT token is valid

High Disk Usage

  1. Clean up old build artifacts
  2. Remove unused Docker images
  3. Clear package caches
  4. Consider increasing disk size

Slow Build Performance

  1. Check system resources (CPU, memory, disk I/O)
  2. Review build logs for bottlenecks
  3. Optimize build cache usage
  4. Consider upgrading server resources

Next Steps

  • Review Troubleshooting Guide for specific issues
  • Set up automated monitoring alerts
  • Document your specific maintenance procedures
  • Schedule regular maintenance windows

References