Self-Hosted Agents - Troubleshooting Guide

Overview

This guide covers common issues encountered with self-hosted Azure DevOps agents and their solutions.

Agent Not Appearing in Azure DevOps

Symptoms

  • Agent does not appear in agent pool
  • Agent shows as "Offline" immediately after installation

Possible Causes

  1. Incorrect PAT token permissions
  2. Network connectivity issues
  3. Agent configuration errors
  4. Service not running

Solutions

Check PAT Token Permissions

  1. Verify PAT token has Agent Pools (Read & Manage) scope
  2. Check token expiration date
  3. Create new PAT if needed
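You can also check the token from the agent host itself: the Azure DevOps REST API confirms both validity and the Agent Pools scope. A minimal sketch, assuming `AZDO_ORG` and `AZDO_PAT` environment variables (placeholders, not part of the agent setup):

```shell
# Sketch: probe the Azure DevOps agent-pools REST API with a PAT.
# AZDO_ORG / AZDO_PAT are assumed placeholders for your organization and token.

# Azure DevOps accepts Basic auth with an empty username and the PAT as password.
build_auth_header() {
  # base64 of ":<pat>", trailing newline stripped
  token=$(printf ':%s' "$1" | base64 | tr -d '\n')
  printf 'Authorization: Basic %s' "$token"
}

if [ -n "${AZDO_ORG:-}" ] && [ -n "${AZDO_PAT:-}" ]; then
  # 200 means the PAT can read agent pools; 401/403 points at expiry or scope.
  curl -s -o /dev/null -w '%{http_code}\n' \
    -H "$(build_auth_header "$AZDO_PAT")" \
    "https://dev.azure.com/$AZDO_ORG/_apis/distributedtask/pools?api-version=7.1"
fi
```

An HTTP 200 from this call rules out token problems before you dig into networking.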

Verify Network Connectivity

Linux:

# Test Azure DevOps connectivity
curl -I https://dev.azure.com

# Test DNS resolution
nslookup dev.azure.com

Windows:

# Test Azure DevOps connectivity
Test-NetConnection -ComputerName dev.azure.com -Port 443

# Test DNS resolution
Resolve-DnsName dev.azure.com

Check Agent Configuration

Linux:

# View agent configuration
cat ~/azagent/.agent

# Check service status
sudo systemctl status vsts.agent.*.service

Windows:

# View agent configuration
Get-Content C:\azagent\.agent

# Check service status
Get-Service | Where-Object {$_.Name -like "*vsts*"}

Review Agent Logs

Linux:

# View recent logs
sudo journalctl -u vsts.agent.*.service -n 100 --no-pager

# Follow logs in real-time
sudo journalctl -u vsts.agent.*.service -f

Windows:

# View recent event logs
Get-EventLog -LogName Application -Source "vsts*" -Newest 50

# View specific error events
Get-EventLog -LogName Application -Source "vsts*" -EntryType Error -Newest 20

Agent Goes Offline

Symptoms

  • Agent was online but now shows as offline
  • Agent status changes to offline intermittently

Possible Causes

  1. Service stopped
  2. Network connectivity lost
  3. Server rebooted
  4. PAT token expired

Solutions

Check Service Status

Linux:

# Check service status
sudo systemctl status vsts.agent.*.service

# Start service if stopped
sudo systemctl start vsts.agent.*.service

# Enable auto-start
sudo systemctl enable vsts.agent.*.service

Windows:

# Check service status
Get-Service | Where-Object {$_.Name -like "*vsts*"}

# Start service if stopped
Get-Service | Where-Object {$_.Name -like "*vsts*"} | Start-Service

Verify Network Connectivity

# Linux
ping -c 4 dev.azure.com  # ICMP may be blocked even when HTTPS works
curl -I https://dev.azure.com

# Windows
Test-Connection dev.azure.com
Test-NetConnection -ComputerName dev.azure.com -Port 443

Check for Server Reboots

Linux:

# Check last reboot time
last reboot

# Check system uptime
uptime

Windows:

# Check last shutdown/restart events (event ID 1074 is logged by USER32)
Get-EventLog -LogName System -Source "USER32" | Where-Object {$_.EventID -eq 1074} | Select-Object -First 1

# Check system uptime
(Get-CimInstance Win32_OperatingSystem).LastBootUpTime

Docker Not Found Error

Symptoms

  • Pipeline fails with error: ##[error]File not found: 'docker'
  • Container services fail to start
  • Error: docker: command not found

Possible Causes

  1. Docker not installed on agent
  2. Docker not in PATH
  3. Agent user not in docker group
  4. Docker service not running

Solutions

Install Docker (Linux)

# Add Docker's official GPG key
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

# Set up repository
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Start Docker service
sudo systemctl start docker
sudo systemctl enable docker

# Add agent user to docker group
sudo usermod -aG docker azdevops

# Verify installation (may need to log out and back in for group changes)
sudo docker run hello-world

Verify Docker Installation

# Check Docker version
docker --version

# Check Docker service status
sudo systemctl status docker

# Test Docker (as azdevops user)
docker run hello-world

# If permission denied, log out and back in, or restart agent service

Fix Docker Permissions

# Add user to docker group (if not already)
sudo usermod -aG docker azdevops

# Verify user is in docker group
groups azdevops

# Restart agent service to apply group changes
cd ~/azagent
sudo ./svc.sh stop
sudo ./svc.sh start

Verify Docker is Accessible to Agent

# Check if agent user can run Docker
sudo -u azdevops docker ps

# If permission denied, ensure:
# 1. User is in docker group: groups azdevops
# 2. Docker socket has correct permissions: ls -la /var/run/docker.sock
# 3. Agent service is restarted after group changes

Note: If your pipelines use container services (redis, mssql, mongodb, etc.), Docker is required, not optional. See the Linux Setup Guide for complete Docker installation instructions.

Build Failures on Agent

Symptoms

  • Builds fail with tool not found errors
  • Builds fail with permission errors
  • Builds fail with disk space errors

Possible Causes

  1. Required tools not installed
  2. Insufficient permissions
  3. Disk space full
  4. Incorrect agent capabilities

Solutions

Verify Required Tools

Linux:

# Check .NET SDK
dotnet --version

# Check Docker
docker --version

# If Docker is not found, install it (see Linux Setup Guide)

# Check Node.js
node --version
npm --version

Windows:

# Check .NET SDK
dotnet --version

# Check Git
git --version

# Check Node.js
node --version
npm --version

Check Permissions

Linux:

# Check agent user permissions
id azdevops

# Check directory permissions
ls -la ~/azagent/_work

Windows:

# Check agent user permissions
whoami /groups

# Check directory permissions
Get-Acl C:\azagent\_work

Check Disk Space

Linux:

# Check disk usage
df -h

# Check specific directory
du -sh ~/azagent/_work/*

Windows:

# Check disk usage
Get-PSDrive C | Select-Object Used, Free

# Check specific directory
Get-ChildItem C:\azagent\_work -Recurse | Measure-Object -Property Length -Sum

Verify Agent Capabilities

  1. In Azure DevOps, navigate to agent pool
  2. Select agent → Capabilities tab
  3. Verify required capabilities are present
  4. Add missing capabilities if needed
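Most capabilities are auto-detected from the agent's environment; the agent also reads KEY=VALUE lines from a .env file in its root directory and advertises them as capabilities. A hedged sketch for adding one (the azagent path is an assumption carried over from this guide):

```shell
# Sketch: add a user-defined capability to a self-hosted Linux agent.
# The agent reads KEY=VALUE lines from .env in its root directory at startup.
# AGENT_DIR is an assumed path.

AGENT_DIR="${AGENT_DIR:-$HOME/azagent}"

add_capability() {
  key="$1"; value="$2"; env_file="$AGENT_DIR/.env"
  # Replace any existing entry for the same key, then append the new one.
  if [ -f "$env_file" ]; then
    grep -v "^${key}=" "$env_file" > "$env_file.tmp" || true
    mv "$env_file.tmp" "$env_file"
  fi
  printf '%s=%s\n' "$key" "$value" >> "$env_file"
}

# Example: advertise a DotNet capability, then restart the service to apply:
#   add_capability DotNet 9.0.x
#   cd "$AGENT_DIR" && sudo ./svc.sh stop && sudo ./svc.sh start
```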

Code Coverage Not Found by Build Quality Checks

Symptoms

  • Build Quality Checks shows 0% coverage: Total lines: 0, Covered lines: 0
  • Coverage reports are published successfully but Build Quality Checks can't find them
  • Error: The code coverage value (0%, 0 lines) is lower than the minimum value

Possible Causes

  1. Case sensitivity on Linux - File paths are case-sensitive on Linux
  2. Coverage XML files not found by PublishCodeCoverageResults - The glob pattern might not match on Linux
  3. Coverage files in wrong location - Files might be in a different directory than expected

Solutions

Verify Coverage Files Exist

Add a diagnostic step before Build Quality Checks to verify coverage files:

- script: |
    echo "Checking for coverage files..."
    find "$(Agent.TempDirectory)" -name "coverage.cobertura.xml" -type f
    find "$(Agent.TempDirectory)" -name "*coverage*" -type f
  displayName: 'Diagnose coverage file locations'

Ensure PublishCodeCoverageResults Finds Files

The PublishCodeCoverageResults@2 task uses:

summaryFileLocation: '$(Agent.TempDirectory)/**/coverage.cobertura.xml'

On Linux, ensure:

  1. The file name is exactly coverage.cobertura.xml (case-sensitive)
  2. The file is in a subdirectory of $(Agent.TempDirectory)
  3. The file is readable by the agent user
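Because the glob is case-sensitive on Linux, a case-insensitive search is a quick way to spot coverage files that exist under a different casing. A small sketch:

```shell
# Sketch: detect coverage files whose names differ only by case from the
# pattern PublishCodeCoverageResults expects (coverage.cobertura.xml).

check_coverage_case() {
  dir="$1"
  exact=$(find "$dir" -type f -name 'coverage.cobertura.xml' | wc -l)
  any=$(find "$dir" -type f -iname 'coverage.cobertura.xml' | wc -l)
  if [ "$exact" -gt 0 ]; then
    echo "OK"
  elif [ "$any" -gt 0 ]; then
    # Files exist but with different casing; rename them to match the glob.
    echo "CASE MISMATCH"
    find "$dir" -type f -iname 'coverage.cobertura.xml'
  else
    echo "NO COVERAGE FILES"
  fi
}

# Example: check_coverage_case "$AGENT_TEMPDIRECTORY"
```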

Fix Coverage File Paths

If coverage files are in a different location, you may need to:

  1. Copy files to expected location:

    # Find coverage files
    find . -name "coverage.cobertura.xml" -type f
    
    # Copy to expected location if needed
    mkdir -p "$(Agent.TempDirectory)/TestResults"
    cp path/to/coverage.cobertura.xml "$(Agent.TempDirectory)/TestResults/"
    

  2. Update PublishCodeCoverageResults path (if you can modify the template):

    - task: PublishCodeCoverageResults@2
      inputs:
        codeCoverageTool: 'Cobertura'
        summaryFileLocation: '$(Agent.TempDirectory)/TestResults/**/coverage.cobertura.xml'
        # Or use absolute path if known
    

Verify Build Quality Checks Configuration

Ensure Build Quality Checks is configured correctly:

- task: mspremier.BuildQualityChecks.QualityChecks-task.BuildQualityChecks@10
  inputs:
    checkCoverage: true
    coverageFailOption: fixed
    coverageType: lines
    coverageThreshold: '76'

Note: Build Quality Checks reads coverage data from PublishCodeCoverageResults, not from file artifacts. The coverage must be published successfully before Build Quality Checks can read it.

Pipeline Cannot Find Agent

Symptoms

  • Pipeline shows "No agent found" error
  • Pipeline waits indefinitely for agent

Possible Causes

  1. Pool name mismatch
  2. Demand requirements not met
  3. All agents busy or offline
  4. Agent capabilities don't match demands

Solutions

Verify Pool Name

Ensure pool name in pipeline YAML matches exactly (case-sensitive):

pool:
  name: 'Hetzner-Linux'  # Must match exactly

Check Agent Demands

Verify agent capabilities match pipeline demands:

pool:
  name: 'Hetzner-Linux'
  demands:
    - Agent.OS -equals Linux
    - DotNet -equals 9.0.x  # Agent must have this capability

Verify Agent Availability

  1. Check agent pool in Azure DevOps
  2. Verify at least one agent is online
  3. Check if agents are busy with other jobs
  4. Consider adding more agents if all are busy

High Disk Usage

Symptoms

  • Builds fail with "No space left on device" errors
  • Disk usage shows > 90%

Solutions

Clean Up Build Artifacts

Linux:

# Stop the agent service first so no running job is using the work directory
cd ~/azagent/_work
rm -rf -- *

# Clean only top-level directories older than 30 days
find ~/azagent/_work -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
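Before running a destructive cleanup, it can help to preview what the age filter would remove. A hedged sketch with a dry-run default:

```shell
# Sketch: list (dry run) or delete top-level work directories older than N days.

clean_old_dirs() {
  work_dir="$1"; days="$2"; mode="${3:-dry-run}"
  if [ "$mode" = "delete" ]; then
    find "$work_dir" -mindepth 1 -maxdepth 1 -type d -mtime "+$days" -exec rm -rf {} +
  else
    # Dry run: only print what would be removed.
    find "$work_dir" -mindepth 1 -maxdepth 1 -type d -mtime "+$days"
  fi
}

# Usage: clean_old_dirs ~/azagent/_work 30            # preview
#        clean_old_dirs ~/azagent/_work 30 delete     # actually delete
```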

Windows:

# Clean agent work directory
Remove-Item C:\azagent\_work\* -Recurse -Force

# Clean old directories
Get-ChildItem C:\azagent\_work -Directory | Where-Object {$_.LastWriteTime -lt (Get-Date).AddDays(-30)} | Remove-Item -Recurse -Force

Clean Package Caches

Linux:

# Clean NuGet cache
rm -rf ~/.nuget/packages/*

# Clean npm cache
npm cache clean --force

# Clean Docker
docker system prune -a --volumes

Windows:

# Clean NuGet cache
Remove-Item "$env:USERPROFILE\.nuget\packages\*" -Recurse -Force

# Clean npm cache
npm cache clean --force

# Clean Docker
docker system prune -a --volumes

Increase Disk Size

If using Hetzner Cloud, you can increase disk size:

  1. Navigate to Hetzner Cloud Console
  2. Select server → Resize → Increase disk size
  3. Follow instructions to resize filesystem

Slow Build Performance

Symptoms

  • Builds take longer than expected
  • Agent CPU/memory usage is high

Solutions

Check System Resources

Linux:

# Check CPU and memory
top
htop

# Check disk I/O
iostat -x 1

# Check system load
uptime

Windows:

# Check CPU and memory
Get-Process | Sort-Object CPU -Descending | Select-Object -First 10
Get-Counter '\Processor(_Total)\% Processor Time'
Get-Counter '\Memory\Available MBytes'

# Check disk I/O
Get-Counter '\PhysicalDisk(*)\Disk Reads/sec'
Get-Counter '\PhysicalDisk(*)\Disk Writes/sec'

Optimize Build Cache

  • Configure persistent NuGet cache
  • Use Docker layer caching
  • Cache npm/node_modules
  • Cache build artifacts between runs

Upgrade Server Resources

If resources are consistently maxed out:

  1. Consider upgrading to larger server type
  2. Add more agents to distribute load
  3. Optimize build processes

Authentication Errors

Symptoms

  • "401 Unauthorized" errors
  • "403 Forbidden" errors
  • PAT token errors

Solutions

Verify PAT Token

  1. Check token expiration date
  2. Verify token has correct scopes:
       • Agent Pools (Read & Manage)
       • Build (Read & Execute)
  3. Create new PAT if needed

Update Agent Configuration

Linux:

cd ~/azagent
sudo ./svc.sh stop
./config.sh --token <NEW_PAT_TOKEN> --replace
sudo ./svc.sh start

Windows:

cd C:\azagent
# The Windows agent has no svc.cmd; stop the Windows service instead
Get-Service "vstsagent*" | Stop-Service
.\config.cmd --token <NEW_PAT_TOKEN> --replace
Get-Service "vstsagent*" | Start-Service

Service Won't Start

Symptoms

  • Agent service fails to start
  • Service shows as "Failed" status

Solutions

Check Service Logs

Linux:

# View service logs
sudo journalctl -u vsts.agent.*.service -n 100 --no-pager

# Check service status
sudo systemctl status vsts.agent.*.service

Windows:

# View service logs
Get-EventLog -LogName Application -Source "vsts*" -Newest 50

# Check service status
Get-Service | Where-Object {$_.Name -like "*vsts*"}

Verify Agent Configuration

Linux:

# Check configuration file
cat ~/azagent/.agent

# Verify credentials file exists
ls -la ~/azagent/.credentials

Windows:

# Check configuration file
Get-Content C:\azagent\.agent

# Verify credentials file exists
Test-Path C:\azagent\.credentials

Reinstall Service

Linux:

cd ~/azagent
sudo ./svc.sh uninstall
sudo ./svc.sh install azdevops
sudo ./svc.sh start

Windows:

cd C:\azagent
# The Windows agent has no svc.cmd; remove and re-register the service via config.cmd
.\config.cmd remove
.\config.cmd --runAsService

Network Connectivity Issues

Symptoms

  • Agent cannot connect to Azure DevOps
  • Timeout errors
  • SSL/TLS errors

Solutions

Test Connectivity

# Linux
curl -v https://dev.azure.com
ping -c 4 dev.azure.com  # ICMP may be blocked even when HTTPS works

# Windows
Test-NetConnection -ComputerName dev.azure.com -Port 443
Test-Connection dev.azure.com  # ICMP may be blocked even when HTTPS works

Check Firewall Rules

Linux:

# Check firewall status
sudo ufw status

# Allow outbound HTTPS
sudo ufw allow out 443/tcp

Windows:

# Check firewall rules
Get-NetFirewallRule | Where-Object {$_.DisplayName -like "*HTTPS*"}

# Allow outbound HTTPS (usually enabled by default)

Check Proxy Settings

If behind a proxy:

  1. Configure proxy in agent environment
  2. Set HTTP_PROXY and HTTPS_PROXY variables
  3. Update agent configuration if needed
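The first two steps can be sketched as follows; the agent reads a .proxy file (proxy URL) and a .env file (environment variables) from its root directory at startup. The azagent path and proxy URL are placeholder assumptions:

```shell
# Sketch: point a self-hosted agent at an outbound proxy via its .proxy and
# .env files. AGENT_DIR and PROXY_URL are assumed placeholders.

AGENT_DIR="${AGENT_DIR:-$HOME/azagent}"
PROXY_URL="${PROXY_URL:-http://proxy.example.com:8080}"

configure_proxy() {
  printf '%s\n' "$PROXY_URL" > "$AGENT_DIR/.proxy"
  # Note: this overwrites any existing .env; merge instead if you already use one.
  {
    printf 'HTTP_PROXY=%s\n' "$PROXY_URL"
    printf 'HTTPS_PROXY=%s\n' "$PROXY_URL"
  } > "$AGENT_DIR/.env"
}

# After writing the files, restart the agent:
#   cd "$AGENT_DIR" && sudo ./svc.sh stop && sudo ./svc.sh start
```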

Git Authentication Errors on Self-Hosted Agents

Symptoms

  • Error: fatal: unable to access 'https://dev.azure.com/...': The requested URL returned error: 400
  • Git fetch fails with exit code 128
  • Repository checkout fails on self-hosted Linux agents
  • Works on Microsoft-hosted agents but fails on self-hosted
  • Error occurs after manually installing Git on the agent

Possible Causes

  1. Missing explicit checkout with credentials - Default checkout doesn't persist credentials on self-hosted agents
  2. Git configuration conflicts - Manual Git installation may have changed global Git config
  3. Agent permissions - Agent user doesn't have proper repository access
  4. Stale Git credentials - Old credentials cached in Git config

Solutions

Add Explicit Checkout with Credentials (Required)

In your pipeline YAML, add explicit checkout step:

steps:
  - checkout: self
    persistCredentials: true
    displayName: 'Checkout repository with credentials'
  # ... rest of your steps

This is required for self-hosted agents to authenticate properly. Without this, the agent cannot authenticate to fetch from Azure DevOps repositories.

Clear Git Configuration on Agent

If Git was manually installed and causing issues:

# Connect to agent server
ssh azdevops@<server-ip>

# Check current Git config
git config --global --list

# Remove problematic credentials
git config --global --unset-all http.extraheader
git config --global --unset-all http.https://dev.azure.com.extraheader

# Verify Git version
git --version

Verify Agent Repository Permissions

  1. In Azure DevOps, go to Project Settings → Repositories
  2. Select your repository
  3. Go to Security tab
  4. Ensure Project Collection Build Service has Read permission
  5. Ensure Project Build Service has Read permission

Configure Git Authentication Manually (If Needed)

If persistCredentials: true doesn't work, configure Git manually in pipeline:

- script: |
    git config --global http.extraheader "AUTHORIZATION: bearer $SYSTEM_ACCESSTOKEN"
    git config --global http.version HTTP/1.1
  displayName: 'Configure Git authentication'
  env:
    SYSTEM_ACCESSTOKEN: $(System.AccessToken)

Restart Agent Service

After making changes, restart the agent:

cd ~/azagent
sudo ./svc.sh stop
sudo ./svc.sh start

Note: The persistCredentials: true option is the standard solution for self-hosted agents. Always include this in your pipeline YAML when using self-hosted agents.

Global Git Config Conflicts with Pipeline Authentication

Symptoms

  • Error: fatal: unable to access 'https://dev.azure.com/...': The requested URL returned error: 400
  • Pipeline logs show: ##[warning]Git config still contains extraheader keys. It may cause errors.
  • Pipeline logs show: ##[warning]An unsuccessful attempt was made using git command line to remove "http.extraheader" from the git config.
  • Git fetch fails even with persistCredentials: true configured
  • Works on Microsoft-hosted agents but fails on self-hosted agents

Root Cause

The Azure DevOps checkout task attempts to manage Git authentication by:

  1. Removing existing http.extraheader configuration to avoid conflicts
  2. Setting its own authentication per repository using pipeline tokens

If a global Git config has http.extraheader set (e.g., in ~/.gitconfig), the checkout task cannot remove it cleanly, causing authentication conflicts. The pipeline tries to use its own token, but Git still uses the stale global token, resulting in HTTP 400 errors.
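To find which config scope still carries the stale header, you can scan the usual Git config file locations directly. A sketch that needs only grep (the paths listed in the usage comment are the common defaults and may differ on your agent):

```shell
# Sketch: report which Git config files still carry an extraheader entry.
# Works on the raw files, so it needs no git binary.

find_extraheader() {
  for cfg in "$@"; do
    [ -f "$cfg" ] || continue
    if grep -qi 'extraheader' "$cfg"; then
      echo "FOUND: $cfg"
    fi
  done
}

# Typical usage on an agent:
#   find_extraheader /etc/gitconfig "$HOME/.gitconfig" "$HOME/.config/git/config"
# Any file reported here should have the key removed with:
#   git config --file <cfg> --unset-all http.extraheader
```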

Solutions

Remove Global http.extraheader Configuration

On Linux Agent:

# Connect to agent server as azdevops user
ssh azdevops@<server-ip>

# Check current global Git config
git config --global --list

# Remove the problematic http.extraheader
git config --global --unset-all http.extraheader

# Verify it's removed
git config --global --list

# Restart agent service
cd ~/azagent
sudo ./svc.sh stop
sudo ./svc.sh start

On Windows Agent:

# Connect via RDP or PowerShell

# Check current global Git config
git config --global --list

# Remove the problematic http.extraheader
git config --global --unset-all http.extraheader

# Verify it's removed
git config --global --list

# Restart agent service (the Windows agent has no svc.cmd; manage the service directly)
Get-Service "vstsagent*" | Restart-Service

Keep User Configuration, Remove Only Authentication

You can keep user name and email in global config, but remove authentication settings:

# Keep these (optional)
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

# Remove these (required)
git config --global --unset-all http.extraheader
git config --global --unset-all http.https://dev.azure.com.extraheader
git config --global --unset-all credential.helper

# Remove credential files
rm -f ~/.git-credentials
rm -rf ~/.git-credential-cache

Add Explicit Checkout with persistCredentials

Even after removing global config, add explicit checkout to your pipeline:

steps:
  - checkout: self
    persistCredentials: true
    displayName: 'Checkout repository with credentials'
  # ... rest of your steps

This ensures the pipeline manages authentication correctly.

Verify Configuration

After making changes, verify:

# Check global config (should not contain http.extraheader)
git config --global --list

# Check system config (if exists)
cat /etc/gitconfig 2>/dev/null || echo "No system gitconfig"

# Check for credential files
ls -la ~/.git-credentials 2>/dev/null
ls -la ~/.git-credential-cache 2>/dev/null

Prevention

Best Practices:

  1. Never set http.extraheader globally - Let the pipeline manage authentication
  2. Use persistCredentials: true - Always include explicit checkout in pipeline YAML for self-hosted agents
  3. Keep user.name and user.email - These are safe to set globally
  4. Avoid credential helpers in global config - Let the pipeline handle credentials

Why This Happens

The Azure DevOps checkout task:

  1. Tries to remove existing http.extraheader to avoid conflicts
  2. Sets its own authentication per repository using System.AccessToken
  3. If global config has http.extraheader, it conflicts with the pipeline's token
  4. Git uses the stale global token instead of the fresh pipeline token
  5. This results in HTTP 400 because the global token may be expired or invalid

Solution: Remove global http.extraheader and let the pipeline manage authentication per repository.

SQL Server Connection Timeouts in Docker Containers

Symptoms

  • Error: Microsoft.Data.SqlClient.SqlException: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
  • Error: System.ComponentModel.Win32Exception: Unknown error 258
  • Tests fail during TestInitializeAsync when initializing NServiceBus SQL Persistence
  • Saga table creation scripts timeout
  • Tests run successfully on Microsoft-hosted agents but fail on self-hosted Linux agents
  • Error occurs when running tests in Docker containers

Root Cause

SQL Server connection timeouts in Docker containers on self-hosted agents can occur due to:

  1. SQL Server container not ready - Container may not be fully initialized when tests start
  2. Network connectivity issues - Containers may not be able to communicate properly
  3. Resource constraints - Self-hosted agent may have limited CPU/memory, causing SQL Server to respond slowly
  4. Connection string issues - Wrong hostname or port in connection string
  5. SQL Server startup time - SQL Server 2025 may take longer to start on resource-constrained systems
  6. Container health check not working - Pipeline may start tests before SQL Server is ready

Solutions

Verify SQL Server Container is Running

Check container status in pipeline:

- script: |
    echo "Checking SQL Server container status..."
    docker ps -a | grep mssql
    docker logs mssql --tail 50
  displayName: 'Check SQL Server container status'

Or add a wait step before tests:

- script: |
    echo "Waiting for SQL Server to be ready..."
    timeout 120 bash -c 'until docker exec mssql /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "Password@123" -Q "SELECT 1" -C; do sleep 2; done'
  displayName: 'Wait for SQL Server to be ready'

Add Health Check to SQL Server Container

Update pipeline YAML to include health check:

- container: mssql
  image: mcr.microsoft.com/mssql/server:2025-latest
  options: --name mssql --hostname mssql --health-cmd "/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P Password@123 -Q 'SELECT 1' -C" --health-interval 10s --health-timeout 5s --health-retries 10
  ports:
    - 1433:1433
  env:
     SA_PASSWORD: "Password@123"
     ACCEPT_EULA: "Y"
     MSSQL_PID: "Express"

Reduce SQL Server Memory Limit

If the agent has limited resources, cap SQL Server memory so it fits:

- container: mssql
  image: mcr.microsoft.com/mssql/server:2025-latest
  options: --name mssql --hostname mssql
  ports:
    - 1433:1433
  env:
     SA_PASSWORD: "Password@123"
     ACCEPT_EULA: "Y"
     MSSQL_PID: "Express"
     MSSQL_MEMORY_LIMIT_MB: "2048"  # Cap SQL Server memory at 2 GB

Verify Connection String

Ensure connection string uses correct hostname:

  • In Docker containers, use container name: Server=mssql,1433;...
  • On host machine, use: Server=localhost,1433;...
  • Check appsettings.Development.Docker.json for Docker-specific connection strings

Example connection string for Docker:

{
  "ConnectionStrings": {
    "ConnectSoft.BaseTemplateSqlServer": "Server=mssql,1433;Database=TestDb;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=60;"
  }
}
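The host choice can also be made at runtime. A hedged sketch that keys off the /.dockerenv marker Docker creates inside containers (the mssql container name and credentials come from the pipeline examples above):

```shell
# Sketch: choose the SQL Server host for a connection string depending on
# whether the code runs inside a container.

sql_server_host() {
  # /.dockerenv exists inside Docker containers; the path is parameterized
  # here only so the function can be exercised outside a container.
  if [ -f "${1:-/.dockerenv}" ]; then
    echo "mssql,1433"
  else
    echo "localhost,1433"
  fi
}

# Example:
#   CS="Server=$(sql_server_host);Database=TestDb;User Id=sa;Password=Password@123;"
```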

Increase Connection Timeout

Add connection timeout to connection string:

{
  "ConnectionStrings": {
    "ConnectSoft.BaseTemplateSqlServer": "Server=mssql,1433;Database=TestDb;User Id=sa;Password=Password@123;Connection Timeout=120;Command Timeout=120;"
  }
}

Or in code:

var connectionString = "Server=mssql,1433;Database=TestDb;User Id=sa;Password=Password@123;Connection Timeout=120;Command Timeout=120;";

Add Retry Logic

Add retry logic in test initialization:

[TestInitialize]
public async Task TestInitializeAsync()
{
    var maxRetries = 5;
    var delay = TimeSpan.FromSeconds(5);

    for (int i = 0; i < maxRetries; i++)
    {
        try
        {
            // Your initialization code
            await InitializeServices();
            return;
        }
        catch (SqlException ex) when (ex.Message.Contains("timeout") && i < maxRetries - 1)
        {
            await Task.Delay(delay);
            delay = TimeSpan.FromSeconds(delay.TotalSeconds * 2); // Exponential backoff
        }
    }
}

Check Container Network

Verify containers are on the same network:

- script: |
    echo "Checking Docker network..."
    docker network ls
    docker network inspect bridge | grep -A 10 mssql
  displayName: 'Check Docker network'

If using Docker Compose, ensure services are on the same network:

services:
  sql:
    networks:
      - backend
  app:
    networks:
      - backend

Monitor SQL Server Performance

Check SQL Server resource usage:

- script: |
    echo "SQL Server container stats:"
    docker stats mssql --no-stream
    echo "SQL Server logs (last 20 lines):"
    docker logs mssql --tail 20
  displayName: 'Check SQL Server performance'

Use SQL Server 2022 Instead of 2025

If SQL Server 2025 is causing issues, try 2022:

- container: mssql
  image: mcr.microsoft.com/mssql/server:2022-latest  # Use 2022 instead of 2025
  options: --name mssql --hostname mssql
  ports:
    - 1433:1433
  env:
     SA_PASSWORD: "Password@123"
     ACCEPT_EULA: "Y"
     MSSQL_PID: "Express"

Verify Agent Resources

Check if agent has sufficient resources:

# On agent server
free -h
df -h
nproc
docker system df

If resources are low:

  • Increase agent VM size (e.g., from CPX32 to CPX41)
  • Reduce the number of concurrent containers
  • Stop unnecessary services on the agent
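Those checks can be wrapped into a single warning script. The thresholds below (2 GB free RAM, 90% disk usage) are illustrative assumptions, not documented limits:

```shell
# Sketch: warn when agent resources look too tight for containerized SQL Server.

check_resources() {
  free_mb="$1"; disk_used_pct="$2"
  status="OK"
  [ "$free_mb" -lt 2048 ] && status="LOW MEMORY"
  [ "$disk_used_pct" -gt 90 ] && status="LOW DISK"
  echo "$status"
}

# On a real agent, feed it live numbers:
#   check_resources "$(free -m | awk '/^Mem:/ {print $7}')" \
#                   "$(df --output=pcent / | tail -1 | tr -dc '0-9')"
```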

Prevention

Best Practices:

  1. Always add health checks to SQL Server containers in pipeline YAML
  2. Wait for SQL Server to be ready before starting tests
  3. Use appropriate connection timeouts (60-120 seconds for containerized SQL Server)
  4. Monitor agent resources - Ensure sufficient CPU and memory
  5. Use connection pooling to reduce connection overhead
  6. Test container startup - Verify SQL Server starts within expected time

Common Issues on Self-Hosted Agents

Issue: SQL Server 2025 takes longer to start on resource-constrained agents - Solution: Use SQL Server 2022 or increase agent resources

Issue: Multiple containers competing for resources - Solution: Reduce number of containers or increase agent VM size

Issue: Network latency between containers - Solution: Ensure containers are on the same Docker network

Issue: SQL Server memory limit too high - Solution: Set MSSQL_MEMORY_LIMIT_MB to match available agent memory

Issue: NServiceBus saga table creation times out - Solution: Add Connection Timeout=120;Command Timeout=120; to NServiceBus connection strings in test configuration files

Issue: Orleans AdoNetReminderTable initialization times out - Solution: Add Connection Timeout=120;Command Timeout=120; to Orleans AdoNetGrainReminderTable and GrainPersistence.AdoNet connection strings in test configuration files

NServiceBus Saga Table Creation Timeouts

Symptoms

  • Error: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
  • Error occurs during TestInitializeAsync when initializing NServiceBus
  • Stack trace shows: NServiceBus.Persistence.Sql.ScriptRunner.InstallSagas
  • Saga table creation script fails to complete
  • Error happens specifically on self-hosted Linux agents
  • Works on Microsoft-hosted agents but fails on self-hosted

Root Cause

NServiceBus saga table creation involves executing complex SQL scripts that:

  1. Create tables with multiple columns
  2. Add correlation properties
  3. Create indexes
  4. Verify column types
  5. Purge obsolete indexes and properties

On resource-constrained self-hosted agents, these operations can take longer than the default 30-second timeout, especially when:

  • SQL Server is running in a container
  • The agent has limited CPU/memory
  • Multiple tests are running concurrently
  • SQL Server is still initializing

Solutions

Add Connection and Command Timeouts to NServiceBus Connection Strings

Update test configuration files (appsettings.Development.Docker.json, appsettings.RateLimitTests.json, etc.):

{
  "NServiceBus": {
    "SqlServerTransport": {
      "ConnectionString": "Server=localhost,1433;Database=BASETEMPLATE_NSERVICEBUS_DATABASE;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=120;Command Timeout=120;"
    },
    "SqlServerPersistence": {
      "ConnectionString": "Server=localhost,1433;Database=BASETEMPLATE_NSERVICEBUS_DATABASE;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=120;Command Timeout=120;"
    }
  }
}

Key additions:

  • Connection Timeout=120; - allows 120 seconds to establish a connection
  • Command Timeout=120; - allows 120 seconds for each SQL command to execute
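If several settings files need the same change, a small script can patch every connection string that lacks a timeout. A hedged sed sketch that assumes the one-line "Key": "Server=..." layout shown above (GNU sed):

```shell
# Sketch: append Connection Timeout/Command Timeout to every connection
# string in a JSON settings file that does not already set them.

add_timeouts() {
  file="$1"
  # Only touch lines holding a "...": "Server=..." value with no timeout yet;
  # the trailing comma, if any, is preserved.
  sed -i \
    '/: *"Server=/ { /Connection Timeout/! s/;\{0,1\}"\(,\{0,1\}\)$/;Connection Timeout=120;Command Timeout=120;"\1/ }' \
    "$file"
}

# Usage: add_timeouts appsettings.Development.Docker.json
```

Running it twice is safe: lines that already contain a timeout are skipped.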

Increase SQL Server Resources

If timeouts persist even with 120-second timeout, increase SQL Server resources:

- container: mssql
  image: mcr.microsoft.com/mssql/server:2025-latest
  options: --name mssql --hostname mssql
  ports:
    - 1433:1433
  env:
     SA_PASSWORD: "Password@123"
     ACCEPT_EULA: "Y"
     MSSQL_PID: "Express"
     MSSQL_MEMORY_LIMIT_MB: "4096"  # Increase from 2GB to 4GB if agent has resources

Wait for SQL Server Before Tests

Add a wait step in pipeline before tests start:

- script: |
    echo "Waiting for SQL Server to be ready for NServiceBus..."
    for i in {1..30}; do
      if docker exec mssql /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "Password@123" -Q "SELECT 1" -C > /dev/null 2>&1; then
        echo "SQL Server is ready!"
        # Additional check: ensure SQL Server can execute catalog queries
        docker exec mssql /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "Password@123" -Q "SELECT COUNT(*) FROM sys.tables" -C > /dev/null 2>&1
        exit 0
      fi
      echo "Waiting for SQL Server... (attempt $i/30)"
      sleep 2
    done
    echo "SQL Server did not become ready in time"
    exit 1
  displayName: 'Wait for SQL Server to be ready for NServiceBus'

Prevention

Best Practices:

  1. Always set timeouts - Use Connection Timeout=120;Command Timeout=120; for NServiceBus connection strings in test configurations
  2. Wait for SQL Server - Add explicit wait step in pipeline before tests start
  3. Monitor resource usage - Ensure agent has sufficient CPU/memory for SQL Server
  4. Use appropriate SQL Server edition - SQL Server Express may have limitations; consider Developer edition if needed

Common Scenarios

Scenario 1: Saga table creation times out on first test run - Solution: Increase timeouts to 120 seconds and ensure SQL Server is fully ready

Scenario 2: Timeout occurs intermittently - Solution: Check agent resource usage; may need to reduce concurrent tests or increase agent resources

Scenario 3: Works on Microsoft-hosted but fails on self-hosted - Solution: Self-hosted agents may have fewer resources; increase timeouts and ensure SQL Server has adequate memory

Orleans AdoNetReminderTable Initialization Timeouts

Symptoms

  • Error: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
  • Error occurs during TestInitializeAsync when initializing Orleans
  • Stack trace shows: OrleansExtensions.ConfigureAdoNetReminderService or SqlServerDatabaseHelper.CreateIfNotExists
  • Orleans reminder table creation fails
  • Error happens specifically on self-hosted Linux agents
  • Works on Microsoft-hosted agents but fails on self-hosted

Root Cause

Orleans AdoNetReminderTable initialization involves:

  1. Creating the database if it doesn't exist (CreateIfNotExists)
  2. Executing Orleans SQL scripts (SQLServer-Main.sql and SQLServer-Reminders.sql)
  3. Creating reminder tables and indexes

On resource-constrained self-hosted agents, these operations can take longer than the default 30-second timeout, especially when:

  • SQL Server is running in a container
  • Agent has limited CPU/memory
  • Multiple tests are running concurrently
  • SQL Server is still initializing

Solutions

Add Connection and Command Timeouts to Orleans Connection Strings

Update test configuration files (appsettings.Development.Docker.json, appsettings.RateLimitTests.json, etc.):

{
  "Orleans": {
    "GrainPersistence": {
      "AdoNet": {
        "ConnectionString": "Server=localhost,1433;Database=BASETEMPLATE_ORLEANS_DATABASE;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=120;Command Timeout=120;"
      }
    },
    "AdoNetGrainReminderTable": {
      "ConnectionString": "Server=localhost,1433;Database=BASETEMPLATE_ORLEANS_DATABASE;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=120;Command Timeout=120;"
    }
  }
}

Key additions:

  • Connection Timeout=120; - Allows 120 seconds to establish a connection
  • Command Timeout=120; - Allows 120 seconds for each SQL command to execute
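As a quick guard, a pipeline step can verify that the test configuration files actually carry both keywords before the suite runs. This is a sketch; the file names mirror the examples above and are assumptions about your repository layout:

```shell
# Verify that each test appsettings file includes both timeout keywords.
# File names follow the examples above; adjust to your repository layout.
for f in appsettings.Development.Docker.json appsettings.RateLimitTests.json; do
  [ -f "$f" ] || continue   # skip files that don't exist in this checkout
  if ! grep -q "Connection Timeout=120" "$f" || ! grep -q "Command Timeout=120" "$f"; then
    echo "Missing timeout settings in $f" >&2
    exit 1
  fi
done
echo "All present configuration files include the timeout settings"
```

Failing fast here is cheaper than waiting for a 30-second timeout deep inside a test run.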

Increase SQL Server Resources

If timeouts persist even with the 120-second settings, increase SQL Server resources (same as the NServiceBus solution above).
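If SQL Server runs in a Docker container on the agent (the readiness check earlier in this guide uses a container named mssql), one hedged sketch is to recreate the container with explicit limits. The 4 GB/2 CPU figures, image tag, and password are illustrative, not prescriptions:

```shell
# Sketch: recreate the SQL Server container with explicit resource limits.
# Container name, password, image tag, and limits are illustrative.
if command -v docker > /dev/null 2>&1; then
  docker rm -f mssql 2> /dev/null
  docker run -d --name mssql \
    --memory=4g --cpus=2 \
    -e ACCEPT_EULA=Y \
    -e MSSQL_SA_PASSWORD="Password@123" \
    -p 1433:1433 \
    mcr.microsoft.com/mssql/server:2022-latest
else
  echo "docker not found on this agent" >&2
fi
```

Without an explicit --memory limit the container competes with the agent and the tests themselves for RAM, which is a common source of intermittent timeouts.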

Wait for SQL Server Before Tests

Add a wait step in pipeline before tests start (same as NServiceBus solution above).

Prevention

Best Practices:

  1. Always set timeouts - Use Connection Timeout=120;Command Timeout=120; for Orleans connection strings in test configurations
  2. Wait for SQL Server - Add explicit wait step in pipeline before tests start
  3. Monitor resource usage - Ensure agent has sufficient CPU/memory for SQL Server
  4. Use appropriate SQL Server edition - SQL Server Express may have limitations; consider Developer edition if needed

Common Scenarios

Scenario 1: Orleans reminder table creation times out on first test run - Solution: Increase timeouts to 120 seconds and ensure SQL Server is fully ready

Scenario 2: Timeout occurs intermittently - Solution: Check agent resource usage; may need to reduce concurrent tests or increase agent resources

Scenario 3: Works on Microsoft-hosted but fails on self-hosted - Solution: Self-hosted agents may have fewer resources; increase timeouts and ensure SQL Server has adequate memory

Ollama 500 Internal Server Error

Symptoms

  • Error: Response status code does not indicate success: 500 (Internal Server Error)
  • Error occurs when calling Ollama API for chat completions or tool invocation
  • Stack trace shows: OllamaSharp.OllamaApiClient.ChatAsync or GetStreamingResponseAsync
  • Test fails during Ollama chat completion or tool invocation

Root Cause

Ollama typically returns 500 errors when:

  1. Model not loaded: The specified model is not available or not loaded in memory
  2. Insufficient memory: The model is too large for available system memory
  3. Model name mismatch: The model name in configuration doesn't match the actual model name
  4. Ollama service issues: The Ollama service is having internal problems

Solutions

Verify Model is Available

On the agent server, check available models:

# List all installed models
ollama list

# Expected output should include your model:
# NAME                       ID              SIZE      MODIFIED
# mistral:7b-instruct        6577803aa9a0    4.4 GB    13 days ago

If model is missing, pull it:

# Pull the model (this may take several minutes)
ollama pull mistral:7b-instruct

# Verify it's available
ollama list | grep mistral

Verify Model Name Format

Check the exact model name format:

# List models with exact names
ollama list

# Test the model directly
ollama run mistral:7b-instruct "Hello"

Common model name formats:

  • mistral:7b-instruct (with tag)
  • mistral (without tag, uses default)
  • mistral:7b (shorter tag)

Update configuration if the model name doesn't match:

{
  "Ollama": {
    "Model": "qwen3:0.6b"  // Use exact name from 'ollama list'
  }
}

Note: The default model qwen3:0.6b (~522 MB) is recommended for basic chat completions. For tool invocation, use mistral:7b-instruct (~4.4 GB) but ensure you have 6-8 GB free RAM.

Check Ollama Service Status and Logs

Check service status:

# Check if Ollama is running
sudo systemctl status ollama

# Check recent logs for errors (this is critical for diagnosing 500 errors)
sudo journalctl -u ollama -n 100 --no-pager | tail -50

# Check for specific error patterns
sudo journalctl -u ollama -n 100 --no-pager | grep -i "error\|fail\|500\|memory\|timeout"

Common log errors:

  • out of memory - Model too large for available RAM
  • model not found - Model name incorrect or not pulled
  • context length exceeded - Request too long for model
  • failed to load model - Model file corrupted or incomplete
  • connection refused - Ollama service not running or port blocked
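These patterns can be mapped onto likely causes in one pass over the journal. A sketch, assuming Ollama runs as the systemd unit ollama and that the log phrases match the list above:

```shell
# Map the most recent Ollama journal entries onto the common causes above.
# Assumes the systemd unit is named "ollama"; patterns are from the list above.
LOGS=$(journalctl -u ollama -n 200 --no-pager 2> /dev/null)
case "$LOGS" in
  *"out of memory"*)           echo "Likely cause: model too large for available RAM" ;;
  *"model not found"*)         echo "Likely cause: wrong model name or model not pulled" ;;
  *"context length exceeded"*) echo "Likely cause: request too long for the model" ;;
  *"failed to load model"*)    echo "Likely cause: corrupted or incomplete model file" ;;
  *)                           echo "No known pattern matched; inspect the logs manually" ;;
esac
```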

To see real-time logs during test execution:

# Watch Ollama logs in real-time (run this in a separate terminal during tests)
sudo journalctl -u ollama -f

Check Available Memory

Verify system has enough memory for the model:

# Check available memory
free -h

# Check memory usage
top -bn1 | head -20

# For mistral:7b-instruct, you need at least 6-8 GB free RAM

If memory is insufficient:

  • Use a smaller model (e.g., qwen3:0.6b for basic chat, but it doesn't support tool invocation)
  • Increase server memory
  • Stop other services to free memory
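The check can be scripted as a pre-flight step. A minimal sketch; the 6144 MB threshold reflects the 6-8 GB guidance for mistral:7b-instruct above and should be adjusted per model:

```shell
# Rough pre-flight check: compare available RAM against the model's needs.
# REQUIRED_MB reflects the 6-8 GB guidance for mistral:7b-instruct; adjust per model.
REQUIRED_MB=6144
AVAIL_MB=$(free -m | awk '/^Mem:/ {print $7}')   # "available" column of free -m
if [ "${AVAIL_MB:-0}" -lt "$REQUIRED_MB" ]; then
  echo "Only ${AVAIL_MB:-unknown} MB available; ~${REQUIRED_MB} MB recommended for this model" >&2
else
  echo "${AVAIL_MB} MB available; should be enough for the model"
fi
```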

Test Ollama API Directly

Test the API endpoint:

# Test API is responding
curl http://localhost:11434/api/tags

# Test specific model
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b-instruct",
  "prompt": "Say hello",
  "stream": false
}'

If the API test fails:

  • Start the Ollama service if it isn't running: sudo systemctl start ollama
  • Check firewall/port access: netstat -tlnp | grep 11434
  • Verify the endpoint in config matches the actual endpoint
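These checks can be rolled into a single probe that reports which failure mode applies. A sketch; http://localhost:11434 is Ollama's default listen address, so adjust OLLAMA_URL if you have changed OLLAMA_HOST:

```shell
# Probe the Ollama API and report which failure mode likely applies.
# http://localhost:11434 is Ollama's default; override via OLLAMA_URL.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
if curl -fsS --max-time 5 "$OLLAMA_URL/api/tags" > /dev/null 2>&1; then
  echo "Ollama API reachable at $OLLAMA_URL"
elif command -v systemctl > /dev/null 2>&1 && systemctl is-active --quiet ollama; then
  echo "Service is active but API not responding; check port/firewall" >&2
else
  echo "Ollama service appears to be down; try: sudo systemctl start ollama" >&2
fi
```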

Pre-load the Model

Models need to be loaded into memory before use. Pre-load to avoid 500 errors:

# Pre-load the model (this loads it into memory)
ollama run qwen3:0.6b "test"

# Or use the API to pre-load
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:0.6b",
  "prompt": "test",
  "stream": false
}'

# Verify model is loaded (check memory usage)
ollama ps

Why this helps:

  • Models are lazy-loaded on first API request
  • If memory is limited, loading during tests can cause 500 errors
  • Pre-loading ensures models are ready when tests run
  • For qwen3:0.6b (~522 MB), pre-loading is usually not necessary, but can help avoid delays
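In a pipeline, the pre-load can be made an explicit step before the test task. A sketch; the model name default, OLLAMA_MODEL variable, and 300-second cap are illustrative:

```shell
# Pipeline step sketch: warm the model so the first test request doesn't cold-start.
# Model name and the 300-second cap are illustrative assumptions.
MODEL="${OLLAMA_MODEL:-qwen3:0.6b}"
if ! command -v ollama > /dev/null 2>&1; then
  echo "ollama CLI not found on this agent; skipping pre-load" >&2
elif timeout 300 ollama run "$MODEL" "warmup" > /dev/null 2>&1; then
  echo "Model $MODEL pre-loaded"
else
  echo "Pre-load of $MODEL failed; first requests may 500 or be slow" >&2
fi
```

Making the warm-up a separate step also attributes the load time to that step in pipeline timing, instead of inflating the first test's duration.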

Restart Ollama Service

If issues persist, restart Ollama:

# Restart Ollama service
sudo systemctl restart ollama

# Wait a few seconds for service to start
sleep 5

# Verify it's running
sudo systemctl status ollama

# Test API again
curl http://localhost:11434/api/tags

# Pre-load models after restart (optional for qwen3:0.6b)
ollama run qwen3:0.6b "test"

Prevention

Best Practices:

  1. Verify model before tests - Run ollama list to confirm model is available
  2. Use correct model name - Match exactly what ollama list shows
  3. Ensure sufficient memory - Have at least 1.5x model size in free RAM
  4. Monitor Ollama logs - Check logs regularly for warnings or errors
  5. Test API directly - Use curl to test Ollama before running tests

Common Scenarios

Scenario 1: Model not found (500 error) - Solution: Run ollama pull mistral:7b-instruct to download the model

Scenario 2: Out of memory (500 error) - Solution: Free up memory or use a smaller model for basic chat (but tool invocation requires larger model)

Scenario 3: Model name mismatch (500 error) - Solution: Check ollama list and use exact model name from output

Scenario 4: Ollama service not running (connection refused) - Solution: Start service with sudo systemctl start ollama

Getting Additional Help

Azure DevOps Resources

Hetzner Cloud Resources

Log Collection

When seeking help, collect:

  1. Agent logs (last 100 lines)
  2. Service status
  3. System resource usage
  4. Network connectivity test results
  5. Agent configuration (sanitized)
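The five items above can be captured in one pass with a small script. A sketch; the service name pattern and output path are assumptions, and the resulting file should be sanitized (tokens, passwords) before sharing:

```shell
# Collect the diagnostics listed above into one file for a support request.
# Service name pattern and output path are assumptions; sanitize before sharing.
OUT="agent-diagnostics.txt"
{
  echo "== Service status =="
  systemctl status 'vsts.agent.*' --no-pager 2>&1
  echo "== Agent logs (last 100 lines) =="
  journalctl -u 'vsts.agent.*.service' -n 100 --no-pager 2>&1
  echo "== Resource usage =="
  free -h 2>&1
  df -h 2>&1
  echo "== Connectivity =="
  curl -sI --max-time 10 https://dev.azure.com 2>&1 | head -n 5
} > "$OUT"
echo "Wrote $OUT - sanitize before sharing"
```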

Next Steps

  • Review Maintenance Guide for preventive measures
  • Set up monitoring to catch issues early
  • Document your specific troubleshooting procedures