Self-Hosted Agents - Troubleshooting Guide

Overview

This guide covers common issues encountered with self-hosted Azure DevOps agents and their solutions.

Agent Not Appearing in Azure DevOps

Symptoms

  • Agent does not appear in agent pool
  • Agent shows as "Offline" immediately after installation

Possible Causes

  1. Incorrect PAT token permissions
  2. Network connectivity issues
  3. Agent configuration errors
  4. Service not running

Solutions

Check PAT Token Permissions

  1. Verify PAT token has Agent Pools (Read & Manage) scope
  2. Check token expiration date
  3. Create new PAT if needed
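You can also check the token from the agent host itself: the Azure DevOps REST API confirms both validity and the Agent Pools scope. A minimal sketch, assuming `AZDO_ORG` and `AZDO_PAT` environment variables (placeholders, not part of the agent setup):

```shell
# Sketch: probe the Azure DevOps agent-pools REST API with a PAT.
# AZDO_ORG / AZDO_PAT are assumed placeholders for your organization and token.

# Azure DevOps accepts Basic auth with an empty username and the PAT as password.
build_auth_header() {
  # base64 of ":<pat>", trailing newline stripped
  token=$(printf ':%s' "$1" | base64 | tr -d '\n')
  printf 'Authorization: Basic %s' "$token"
}

if [ -n "${AZDO_ORG:-}" ] && [ -n "${AZDO_PAT:-}" ]; then
  # 200 means the PAT can read agent pools; 401/403 points at expiry or scope.
  curl -s -o /dev/null -w '%{http_code}\n' \
    -H "$(build_auth_header "$AZDO_PAT")" \
    "https://dev.azure.com/$AZDO_ORG/_apis/distributedtask/pools?api-version=7.1"
fi
```

An HTTP 200 from this call rules out token problems before you dig into networking.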

Verify Network Connectivity

Linux:

# Test Azure DevOps connectivity
curl -I https://dev.azure.com

# Test DNS resolution
nslookup dev.azure.com

Windows:

# Test Azure DevOps connectivity
Test-NetConnection -ComputerName dev.azure.com -Port 443

# Test DNS resolution
Resolve-DnsName dev.azure.com

Check Agent Configuration

Linux:

# View agent configuration
cat ~/azagent/.agent

# Check service status
sudo systemctl status vsts.agent.*.service

Windows:

# View agent configuration
Get-Content C:\azagent\.agent

# Check service status
Get-Service | Where-Object {$_.Name -like "*vsts*"}

Review Agent Logs

Linux:

# View recent logs
sudo journalctl -u vsts.agent.*.service -n 100 --no-pager

# Follow logs in real-time
sudo journalctl -u vsts.agent.*.service -f

Windows:

# View recent event logs
Get-EventLog -LogName Application -Source "vsts*" -Newest 50

# View specific error events
Get-EventLog -LogName Application -Source "vsts*" -EntryType Error -Newest 20

Agent Goes Offline

Symptoms

  • Agent was online but now shows as offline
  • Agent status changes to offline intermittently

Possible Causes

  1. Service stopped
  2. Network connectivity lost
  3. Server rebooted
  4. PAT token expired

Solutions

Check Service Status

Linux:

# Check service status
sudo systemctl status vsts.agent.*.service

# Start service if stopped
sudo systemctl start vsts.agent.*.service

# Enable auto-start
sudo systemctl enable vsts.agent.*.service

Windows:

# Check service status
Get-Service | Where-Object {$_.Name -like "*vsts*"}

# Start service if stopped
Get-Service | Where-Object {$_.Name -like "*vsts*"} | Start-Service

Verify Network Connectivity

# Linux
ping -c 4 dev.azure.com  # ICMP may be blocked even when HTTPS works
curl -I https://dev.azure.com

# Windows
Test-Connection dev.azure.com
Test-NetConnection -ComputerName dev.azure.com -Port 443

Check for Server Reboots

Linux:

# Check last reboot time
last reboot

# Check system uptime
uptime

Windows:

# Check last shutdown/restart events (event ID 1074 is logged by USER32)
Get-EventLog -LogName System -Source "USER32" | Where-Object {$_.EventID -eq 1074} | Select-Object -First 1

# Check system uptime
(Get-CimInstance Win32_OperatingSystem).LastBootUpTime

Docker Not Found Error

Symptoms

  • Pipeline fails with error: ##[error]File not found: 'docker'
  • Container services fail to start
  • Error: docker: command not found

Possible Causes

  1. Docker not installed on agent
  2. Docker not in PATH
  3. Agent user not in docker group
  4. Docker service not running

Solutions

Install Docker (Linux)

# Add Docker's official GPG key
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

# Set up repository
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Start Docker service
sudo systemctl start docker
sudo systemctl enable docker

# Add agent user to docker group
sudo usermod -aG docker azdevops

# Verify installation (may need to log out and back in for group changes)
sudo docker run hello-world

Verify Docker Installation

# Check Docker version
docker --version

# Check Docker service status
sudo systemctl status docker

# Test Docker (as azdevops user)
docker run hello-world

# If permission denied, log out and back in, or restart agent service

Fix Docker Permissions

# Add user to docker group (if not already)
sudo usermod -aG docker azdevops

# Verify user is in docker group
groups azdevops

# Restart agent service to apply group changes
cd ~/azagent
sudo ./svc.sh stop
sudo ./svc.sh start

Verify Docker is Accessible to Agent

# Check if agent user can run Docker
sudo -u azdevops docker ps

# If permission denied, ensure:
# 1. User is in docker group: groups azdevops
# 2. Docker socket has correct permissions: ls -la /var/run/docker.sock
# 3. Agent service is restarted after group changes

Note: If your pipelines use container services (redis, mssql, mongodb, etc.), Docker is required, not optional. See the Linux Setup Guide for complete Docker installation instructions.

Build Failures on Agent

Symptoms

  • Builds fail with tool not found errors
  • Builds fail with permission errors
  • Builds fail with disk space errors

Possible Causes

  1. Required tools not installed
  2. Insufficient permissions
  3. Disk space full
  4. Incorrect agent capabilities

Solutions

Verify Required Tools

Linux:

# Check .NET SDK
dotnet --version

# Check Docker
docker --version

# If Docker is not found, install it (see Linux Setup Guide)

# Check Node.js
node --version
npm --version

Windows:

# Check .NET SDK
dotnet --version

# Check Git
git --version

# Check Node.js
node --version
npm --version

Check Permissions

Linux:

# Check agent user permissions
id azdevops

# Check directory permissions
ls -la ~/azagent/_work

Windows:

# Check agent user permissions
whoami /groups

# Check directory permissions
Get-Acl C:\azagent\_work

Check Disk Space

Linux:

# Check disk usage
df -h

# Check specific directory
du -sh ~/azagent/_work/*

Windows:

# Check disk usage
Get-PSDrive C | Select-Object Used, Free

# Check specific directory
Get-ChildItem C:\azagent\_work -Recurse | Measure-Object -Property Length -Sum

Verify Agent Capabilities

  1. In Azure DevOps, navigate to agent pool
  2. Select agent → Capabilities tab
  3. Verify required capabilities are present
  4. Add missing capabilities if needed
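Most capabilities are auto-detected from the agent's environment; the agent also reads KEY=VALUE lines from a .env file in its root directory and advertises them as capabilities. A hedged sketch for adding one (the azagent path is an assumption carried over from this guide):

```shell
# Sketch: add a user-defined capability to a self-hosted Linux agent.
# The agent reads KEY=VALUE lines from .env in its root directory at startup.
# AGENT_DIR is an assumed path.

AGENT_DIR="${AGENT_DIR:-$HOME/azagent}"

add_capability() {
  key="$1"; value="$2"; env_file="$AGENT_DIR/.env"
  # Replace any existing entry for the same key, then append the new one.
  if [ -f "$env_file" ]; then
    grep -v "^${key}=" "$env_file" > "$env_file.tmp" || true
    mv "$env_file.tmp" "$env_file"
  fi
  printf '%s=%s\n' "$key" "$value" >> "$env_file"
}

# Example: advertise a DotNet capability, then restart the service to apply:
#   add_capability DotNet 9.0.x
#   cd "$AGENT_DIR" && sudo ./svc.sh stop && sudo ./svc.sh start
```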

Code Coverage Not Found by Build Quality Checks

Symptoms

  • Build Quality Checks shows 0% coverage: Total lines: 0, Covered lines: 0
  • Coverage reports are published successfully but Build Quality Checks can't find them
  • Error: The code coverage value (0%, 0 lines) is lower than the minimum value

Possible Causes

  1. Case sensitivity on Linux - File paths are case-sensitive on Linux
  2. Coverage XML files not found by PublishCodeCoverageResults - The glob pattern might not match on Linux
  3. Coverage files in wrong location - Files might be in a different directory than expected

Solutions

Verify Coverage Files Exist

Add a diagnostic step before Build Quality Checks to verify coverage files:

- script: |
    echo "Checking for coverage files..."
    find "$(Agent.TempDirectory)" -name "coverage.cobertura.xml" -type f
    find "$(Agent.TempDirectory)" -name "*coverage*" -type f
  displayName: 'Diagnose coverage file locations'

Ensure PublishCodeCoverageResults Finds Files

The PublishCodeCoverageResults@2 task uses:

summaryFileLocation: '$(Agent.TempDirectory)/**/coverage.cobertura.xml'

On Linux, ensure:

  1. The file name is exactly coverage.cobertura.xml (case-sensitive)
  2. The file is in a subdirectory of $(Agent.TempDirectory)
  3. The file is readable by the agent user
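Because the glob is case-sensitive on Linux, a case-insensitive search is a quick way to spot coverage files that exist under a different casing. A small sketch:

```shell
# Sketch: detect coverage files whose names differ only by case from the
# pattern PublishCodeCoverageResults expects (coverage.cobertura.xml).

check_coverage_case() {
  dir="$1"
  exact=$(find "$dir" -type f -name 'coverage.cobertura.xml' | wc -l)
  any=$(find "$dir" -type f -iname 'coverage.cobertura.xml' | wc -l)
  if [ "$exact" -gt 0 ]; then
    echo "OK"
  elif [ "$any" -gt 0 ]; then
    # Files exist but with different casing; rename them to match the glob.
    echo "CASE MISMATCH"
    find "$dir" -type f -iname 'coverage.cobertura.xml'
  else
    echo "NO COVERAGE FILES"
  fi
}

# Example: check_coverage_case "$AGENT_TEMPDIRECTORY"
```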

Fix Coverage File Paths

If coverage files are in a different location, you may need to:

  1. Copy files to expected location:

    # Find coverage files
    find . -name "coverage.cobertura.xml" -type f
    
    # Copy to expected location if needed
    mkdir -p "$(Agent.TempDirectory)/TestResults"
    cp path/to/coverage.cobertura.xml "$(Agent.TempDirectory)/TestResults/"
    

  2. Update PublishCodeCoverageResults path (if you can modify the template):

    - task: PublishCodeCoverageResults@2
      inputs:
        codeCoverageTool: 'Cobertura'
        summaryFileLocation: '$(Agent.TempDirectory)/TestResults/**/coverage.cobertura.xml'
        # Or use absolute path if known
    

Verify Build Quality Checks Configuration

Ensure Build Quality Checks is configured correctly:

- task: mspremier.BuildQualityChecks.QualityChecks-task.BuildQualityChecks@10
  inputs:
    checkCoverage: true
    coverageFailOption: fixed
    coverageType: lines
    coverageThreshold: '76'

Note: Build Quality Checks reads coverage data from PublishCodeCoverageResults, not from file artifacts. The coverage must be published successfully before Build Quality Checks can read it.

Pipeline Cannot Find Agent

Symptoms

  • Pipeline shows "No agent found" error
  • Pipeline waits indefinitely for agent

Possible Causes

  1. Pool name mismatch
  2. Demand requirements not met
  3. All agents busy or offline
  4. Agent capabilities don't match demands

Solutions

Verify Pool Name

Ensure pool name in pipeline YAML matches exactly (case-sensitive):

pool:
  name: 'Hetzner-Linux'  # Must match exactly

Check Agent Demands

Verify agent capabilities match pipeline demands:

pool:
  name: 'Hetzner-Linux'
  demands:
    - Agent.OS -equals Linux
    - DotNet -equals 9.0.x  # Agent must have this capability

Verify Agent Availability

  1. Check agent pool in Azure DevOps
  2. Verify at least one agent is online
  3. Check if agents are busy with other jobs
  4. Consider adding more agents if all are busy

High Disk Usage

Symptoms

  • Builds fail with "No space left on device" errors
  • Disk usage shows > 90%

Solutions

Clean Up Build Artifacts

Linux:

# Stop the agent service first so no running job is using the work directory
cd ~/azagent/_work
rm -rf -- *

# Clean only top-level directories older than 30 days
find ~/azagent/_work -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
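Before running a destructive cleanup, it can help to preview what the age filter would remove. A hedged sketch with a dry-run default:

```shell
# Sketch: list (dry run) or delete top-level work directories older than N days.

clean_old_dirs() {
  work_dir="$1"; days="$2"; mode="${3:-dry-run}"
  if [ "$mode" = "delete" ]; then
    find "$work_dir" -mindepth 1 -maxdepth 1 -type d -mtime "+$days" -exec rm -rf {} +
  else
    # Dry run: only print what would be removed.
    find "$work_dir" -mindepth 1 -maxdepth 1 -type d -mtime "+$days"
  fi
}

# Usage: clean_old_dirs ~/azagent/_work 30            # preview
#        clean_old_dirs ~/azagent/_work 30 delete     # actually delete
```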

Windows:

# Clean agent work directory
Remove-Item C:\azagent\_work\* -Recurse -Force

# Clean old directories
Get-ChildItem C:\azagent\_work -Directory | Where-Object {$_.LastWriteTime -lt (Get-Date).AddDays(-30)} | Remove-Item -Recurse -Force

Clean Package Caches

Linux:

# Clean NuGet cache
rm -rf ~/.nuget/packages/*

# Clean npm cache
npm cache clean --force

# Clean Docker
docker system prune -a --volumes

Windows:

# Clean NuGet cache
Remove-Item "$env:USERPROFILE\.nuget\packages\*" -Recurse -Force

# Clean npm cache
npm cache clean --force

# Clean Docker
docker system prune -a --volumes

Increase Disk Size

If using Hetzner Cloud, you can increase disk size:

  1. Navigate to Hetzner Cloud Console
  2. Select server → Resize → Increase disk size
  3. Follow instructions to resize filesystem

Slow Build Performance

Symptoms

  • Builds take longer than expected
  • Agent CPU/memory usage is high

Solutions

Check System Resources

Linux:

# Check CPU and memory
top
htop

# Check disk I/O
iostat -x 1

# Check system load
uptime

Windows:

# Check CPU and memory
Get-Process | Sort-Object CPU -Descending | Select-Object -First 10
Get-Counter '\Processor(_Total)\% Processor Time'
Get-Counter '\Memory\Available MBytes'

# Check disk I/O
Get-Counter '\PhysicalDisk(*)\Disk Reads/sec'
Get-Counter '\PhysicalDisk(*)\Disk Writes/sec'

Optimize Build Cache

  • Configure persistent NuGet cache
  • Use Docker layer caching
  • Cache npm/node_modules
  • Cache build artifacts between runs

Upgrade Server Resources

If resources are consistently maxed out:

  1. Consider upgrading to larger server type
  2. Add more agents to distribute load
  3. Optimize build processes

Authentication Errors

Symptoms

  • "401 Unauthorized" errors
  • "403 Forbidden" errors
  • PAT token errors

Solutions

Verify PAT Token

  1. Check token expiration date
  2. Verify token has correct scopes:
       • Agent Pools (Read & Manage)
       • Build (Read & Execute)
  3. Create new PAT if needed

Update Agent Configuration

Linux:

cd ~/azagent
sudo ./svc.sh stop
./config.sh --token <NEW_PAT_TOKEN> --replace
sudo ./svc.sh start

Windows:

cd C:\azagent
# The Windows agent has no svc.cmd; stop the Windows service instead
Get-Service "vstsagent*" | Stop-Service
.\config.cmd --token <NEW_PAT_TOKEN> --replace
Get-Service "vstsagent*" | Start-Service

Service Won't Start

Symptoms

  • Agent service fails to start
  • Service shows as "Failed" status

Solutions

Check Service Logs

Linux:

# View service logs
sudo journalctl -u vsts.agent.*.service -n 100 --no-pager

# Check service status
sudo systemctl status vsts.agent.*.service

Windows:

# View service logs
Get-EventLog -LogName Application -Source "vsts*" -Newest 50

# Check service status
Get-Service | Where-Object {$_.Name -like "*vsts*"}

Verify Agent Configuration

Linux:

# Check configuration file
cat ~/azagent/.agent

# Verify credentials file exists
ls -la ~/azagent/.credentials

Windows:

# Check configuration file
Get-Content C:\azagent\.agent

# Verify credentials file exists
Test-Path C:\azagent\.credentials

Reinstall Service

Linux:

cd ~/azagent
sudo ./svc.sh uninstall
sudo ./svc.sh install azdevops
sudo ./svc.sh start

Windows:

cd C:\azagent
# The Windows agent has no svc.cmd; remove and re-register the service via config.cmd
.\config.cmd remove
.\config.cmd --runAsService

Network Connectivity Issues

Symptoms

  • Agent cannot connect to Azure DevOps
  • Timeout errors
  • SSL/TLS errors

Solutions

Test Connectivity

# Linux
curl -v https://dev.azure.com
ping -c 4 dev.azure.com  # ICMP may be blocked even when HTTPS works

# Windows
Test-NetConnection -ComputerName dev.azure.com -Port 443
Test-Connection dev.azure.com  # ICMP may be blocked even when HTTPS works

Check Firewall Rules

Linux:

# Check firewall status
sudo ufw status

# Allow outbound HTTPS
sudo ufw allow out 443/tcp

Windows:

# Check firewall rules
Get-NetFirewallRule | Where-Object {$_.DisplayName -like "*HTTPS*"}

# Allow outbound HTTPS (usually enabled by default)

Check Proxy Settings

If behind a proxy:

  1. Configure proxy in agent environment
  2. Set HTTP_PROXY and HTTPS_PROXY variables
  3. Update agent configuration if needed
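The first two steps can be sketched as follows; the agent reads a .proxy file (proxy URL) and a .env file (environment variables) from its root directory at startup. The azagent path and proxy URL are placeholder assumptions:

```shell
# Sketch: point a self-hosted agent at an outbound proxy via its .proxy and
# .env files. AGENT_DIR and PROXY_URL are assumed placeholders.

AGENT_DIR="${AGENT_DIR:-$HOME/azagent}"
PROXY_URL="${PROXY_URL:-http://proxy.example.com:8080}"

configure_proxy() {
  printf '%s\n' "$PROXY_URL" > "$AGENT_DIR/.proxy"
  # Note: this overwrites any existing .env; merge instead if you already use one.
  {
    printf 'HTTP_PROXY=%s\n' "$PROXY_URL"
    printf 'HTTPS_PROXY=%s\n' "$PROXY_URL"
  } > "$AGENT_DIR/.env"
}

# After writing the files, restart the agent:
#   cd "$AGENT_DIR" && sudo ./svc.sh stop && sudo ./svc.sh start
```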

Git Authentication Errors on Self-Hosted Agents

Symptoms

  • Error: fatal: unable to access 'https://dev.azure.com/...': The requested URL returned error: 400
  • Git fetch fails with exit code 128
  • Repository checkout fails on self-hosted Linux agents
  • Works on Microsoft-hosted agents but fails on self-hosted
  • Error occurs after manually installing Git on the agent

Possible Causes

  1. Missing explicit checkout with credentials - Default checkout doesn't persist credentials on self-hosted agents
  2. Git configuration conflicts - Manual Git installation may have changed global Git config
  3. Agent permissions - Agent user doesn't have proper repository access
  4. Stale Git credentials - Old credentials cached in Git config

Solutions

Add Explicit Checkout with Credentials (Required)

In your pipeline YAML, add explicit checkout step:

steps:
  - checkout: self
    persistCredentials: true
    displayName: 'Checkout repository with credentials'
  # ... rest of your steps

This is required for self-hosted agents to authenticate properly. Without this, the agent cannot authenticate to fetch from Azure DevOps repositories.

Clear Git Configuration on Agent

If Git was manually installed and causing issues:

# Connect to agent server
ssh azdevops@<server-ip>

# Check current Git config
git config --global --list

# Remove problematic credentials
git config --global --unset-all http.extraheader
git config --global --unset-all http.https://dev.azure.com.extraheader

# Verify Git version
git --version

Verify Agent Repository Permissions

  1. In Azure DevOps, go to Project Settings → Repositories
  2. Select your repository
  3. Go to Security tab
  4. Ensure Project Collection Build Service has Read permission
  5. Ensure Project Build Service has Read permission

Configure Git Authentication Manually (If Needed)

If persistCredentials: true doesn't work, configure Git manually in pipeline:

- script: |
    git config --global http.extraheader "AUTHORIZATION: bearer $SYSTEM_ACCESSTOKEN"
    git config --global http.version HTTP/1.1
  displayName: 'Configure Git authentication'
  env:
    SYSTEM_ACCESSTOKEN: $(System.AccessToken)

Restart Agent Service

After making changes, restart the agent:

cd ~/azagent
sudo ./svc.sh stop
sudo ./svc.sh start

Note: The persistCredentials: true option is the standard solution for self-hosted agents. Always include this in your pipeline YAML when using self-hosted agents.

Global Git Config Conflicts with Pipeline Authentication

Symptoms

  • Error: fatal: unable to access 'https://dev.azure.com/...': The requested URL returned error: 400
  • Pipeline logs show: ##[warning]Git config still contains extraheader keys. It may cause errors.
  • Pipeline logs show: ##[warning]An unsuccessful attempt was made using git command line to remove "http.extraheader" from the git config.
  • Git fetch fails even with persistCredentials: true configured
  • Works on Microsoft-hosted agents but fails on self-hosted agents

Root Cause

The Azure DevOps checkout task attempts to manage Git authentication by:

  1. Removing existing http.extraheader configuration to avoid conflicts
  2. Setting its own authentication per repository using pipeline tokens

If a global Git config has http.extraheader set (e.g., in ~/.gitconfig), the checkout task cannot remove it cleanly, causing authentication conflicts. The pipeline tries to use its own token, but Git still uses the stale global token, resulting in HTTP 400 errors.
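To find which config scope still carries the stale header, you can scan the usual Git config file locations directly. A sketch that needs only grep (the paths listed in the usage comment are the common defaults and may differ on your agent):

```shell
# Sketch: report which Git config files still carry an extraheader entry.
# Works on the raw files, so it needs no git binary.

find_extraheader() {
  for cfg in "$@"; do
    [ -f "$cfg" ] || continue
    if grep -qi 'extraheader' "$cfg"; then
      echo "FOUND: $cfg"
    fi
  done
}

# Typical usage on an agent:
#   find_extraheader /etc/gitconfig "$HOME/.gitconfig" "$HOME/.config/git/config"
# Any file reported here should have the key removed with:
#   git config --file <cfg> --unset-all http.extraheader
```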

Solutions

Remove Global http.extraheader Configuration

On Linux Agent:

# Connect to agent server as azdevops user
ssh azdevops@<server-ip>

# Check current global Git config
git config --global --list

# Remove the problematic http.extraheader
git config --global --unset-all http.extraheader

# Verify it's removed
git config --global --list

# Restart agent service
cd ~/azagent
sudo ./svc.sh stop
sudo ./svc.sh start

On Windows Agent:

# Connect via RDP or PowerShell

# Check current global Git config
git config --global --list

# Remove the problematic http.extraheader
git config --global --unset-all http.extraheader

# Verify it's removed
git config --global --list

# Restart agent service (the Windows agent has no svc.cmd; manage the service directly)
Get-Service "vstsagent*" | Restart-Service

Keep User Configuration, Remove Only Authentication

You can keep user name and email in global config, but remove authentication settings:

# Keep these (optional)
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

# Remove these (required)
git config --global --unset-all http.extraheader
git config --global --unset-all http.https://dev.azure.com.extraheader
git config --global --unset-all credential.helper

# Remove credential files
rm -f ~/.git-credentials
rm -rf ~/.git-credential-cache

Add Explicit Checkout with persistCredentials

Even after removing global config, add explicit checkout to your pipeline:

steps:
  - checkout: self
    persistCredentials: true
    displayName: 'Checkout repository with credentials'
  # ... rest of your steps

This ensures the pipeline manages authentication correctly.

Verify Configuration

After making changes, verify:

# Check global config (should not contain http.extraheader)
git config --global --list

# Check system config (if exists)
cat /etc/gitconfig 2>/dev/null || echo "No system gitconfig"

# Check for credential files
ls -la ~/.git-credentials 2>/dev/null
ls -la ~/.git-credential-cache 2>/dev/null

Prevention

Best Practices:

  1. Never set http.extraheader globally - Let the pipeline manage authentication
  2. Use persistCredentials: true - Always include explicit checkout in pipeline YAML for self-hosted agents
  3. Keep user.name and user.email - These are safe to set globally
  4. Avoid credential helpers in global config - Let the pipeline handle credentials

Why This Happens

The Azure DevOps checkout task:

  1. Tries to remove existing http.extraheader to avoid conflicts
  2. Sets its own authentication per repository using System.AccessToken
  3. If global config has http.extraheader, it conflicts with the pipeline's token
  4. Git uses the stale global token instead of the fresh pipeline token
  5. This results in HTTP 400 because the global token may be expired or invalid

Solution: Remove global http.extraheader and let the pipeline manage authentication per repository.

SQL Server Connection Timeouts in Docker Containers

Symptoms

  • Error: Microsoft.Data.SqlClient.SqlException: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
  • Error: System.ComponentModel.Win32Exception: Unknown error 258
  • Tests fail during TestInitializeAsync when initializing NServiceBus SQL Persistence
  • Saga table creation scripts timeout
  • Tests run successfully on Microsoft-hosted agents but fail on self-hosted Linux agents
  • Error occurs when running tests in Docker containers

Root Cause

SQL Server connection timeouts in Docker containers on self-hosted agents can occur due to:

  1. SQL Server container not ready - Container may not be fully initialized when tests start
  2. Network connectivity issues - Containers may not be able to communicate properly
  3. Resource constraints - Self-hosted agent may have limited CPU/memory, causing SQL Server to respond slowly
  4. Connection string issues - Wrong hostname or port in connection string
  5. SQL Server startup time - SQL Server 2025 may take longer to start on resource-constrained systems
  6. Container health check not working - Pipeline may start tests before SQL Server is ready

Solutions

Verify SQL Server Container is Running

Check container status in pipeline:

- script: |
    echo "Checking SQL Server container status..."
    docker ps -a | grep mssql
    docker logs mssql --tail 50
  displayName: 'Check SQL Server container status'

Or add a wait step before tests:

- script: |
    echo "Waiting for SQL Server to be ready..."
    timeout 120 bash -c 'until docker exec mssql /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "Password@123" -Q "SELECT 1" -C; do sleep 2; done'
  displayName: 'Wait for SQL Server to be ready'

Add Health Check to SQL Server Container

Update pipeline YAML to include health check:

- container: mssql
  image: mcr.microsoft.com/mssql/server:2025-latest
  options: --name mssql --hostname mssql --health-cmd "/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P Password@123 -Q 'SELECT 1' -C" --health-interval 10s --health-timeout 5s --health-retries 10
  ports:
    - 1433:1433
  env:
     SA_PASSWORD: "Password@123"
     ACCEPT_EULA: "Y"
     MSSQL_PID: "Express"

Reduce SQL Server Memory Limit

If the agent has limited resources, cap SQL Server memory so it fits:

- container: mssql
  image: mcr.microsoft.com/mssql/server:2025-latest
  options: --name mssql --hostname mssql
  ports:
    - 1433:1433
  env:
     SA_PASSWORD: "Password@123"
     ACCEPT_EULA: "Y"
     MSSQL_PID: "Express"
     MSSQL_MEMORY_LIMIT_MB: "2048"  # Cap SQL Server memory at 2 GB

Verify Connection String

Ensure connection string uses correct hostname:

  • In Docker containers, use container name: Server=mssql,1433;...
  • On host machine, use: Server=localhost,1433;...
  • Check appsettings.Development.Docker.json for Docker-specific connection strings

Example connection string for Docker:

{
  "ConnectionStrings": {
    "ConnectSoft.BaseTemplateSqlServer": "Server=mssql,1433;Database=TestDb;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=60;"
  }
}
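The host choice can also be made at runtime. A hedged sketch that keys off the /.dockerenv marker Docker creates inside containers (the mssql container name and credentials come from the pipeline examples above):

```shell
# Sketch: choose the SQL Server host for a connection string depending on
# whether the code runs inside a container.

sql_server_host() {
  # /.dockerenv exists inside Docker containers; the path is parameterized
  # here only so the function can be exercised outside a container.
  if [ -f "${1:-/.dockerenv}" ]; then
    echo "mssql,1433"
  else
    echo "localhost,1433"
  fi
}

# Example:
#   CS="Server=$(sql_server_host);Database=TestDb;User Id=sa;Password=Password@123;"
```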

Increase Connection Timeout

Add connection timeout to connection string:

{
  "ConnectionStrings": {
    "ConnectSoft.BaseTemplateSqlServer": "Server=mssql,1433;Database=TestDb;User Id=sa;Password=Password@123;Connection Timeout=120;Command Timeout=120;"
  }
}

Or in code:

var connectionString = "Server=mssql,1433;Database=TestDb;User Id=sa;Password=Password@123;Connection Timeout=120;Command Timeout=120;";

Add Retry Logic

Add retry logic in test initialization:

[TestInitialize]
public async Task TestInitializeAsync()
{
    var maxRetries = 5;
    var delay = TimeSpan.FromSeconds(5);

    for (int i = 0; i < maxRetries; i++)
    {
        try
        {
            // Your initialization code
            await InitializeServices();
            return;
        }
        catch (SqlException ex) when (ex.Message.Contains("timeout") && i < maxRetries - 1)
        {
            await Task.Delay(delay);
            delay = TimeSpan.FromSeconds(delay.TotalSeconds * 2); // Exponential backoff
        }
    }
}

Check Container Network

Verify containers are on the same network:

- script: |
    echo "Checking Docker network..."
    docker network ls
    docker network inspect bridge | grep -A 10 mssql
  displayName: 'Check Docker network'

If using Docker Compose, ensure services are on the same network:

services:
  sql:
    networks:
      - backend
  app:
    networks:
      - backend

Monitor SQL Server Performance

Check SQL Server resource usage:

- script: |
    echo "SQL Server container stats:"
    docker stats mssql --no-stream
    echo "SQL Server logs (last 20 lines):"
    docker logs mssql --tail 20
  displayName: 'Check SQL Server performance'

Use SQL Server 2022 Instead of 2025

If SQL Server 2025 is causing issues, try 2022:

- container: mssql
  image: mcr.microsoft.com/mssql/server:2022-latest  # Use 2022 instead of 2025
  options: --name mssql --hostname mssql
  ports:
    - 1433:1433
  env:
     SA_PASSWORD: "Password@123"
     ACCEPT_EULA: "Y"
     MSSQL_PID: "Express"

Verify Agent Resources

Check if agent has sufficient resources:

# On agent server
free -h
df -h
nproc
docker system df

If resources are low:

  • Increase agent VM size (e.g., from CPX32 to CPX41)
  • Reduce the number of concurrent containers
  • Stop unnecessary services on the agent
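Those checks can be wrapped into a single warning script. The thresholds below (2 GB free RAM, 90% disk usage) are illustrative assumptions, not documented limits:

```shell
# Sketch: warn when agent resources look too tight for containerized SQL Server.

check_resources() {
  free_mb="$1"; disk_used_pct="$2"
  status="OK"
  [ "$free_mb" -lt 2048 ] && status="LOW MEMORY"
  [ "$disk_used_pct" -gt 90 ] && status="LOW DISK"
  echo "$status"
}

# On a real agent, feed it live numbers:
#   check_resources "$(free -m | awk '/^Mem:/ {print $7}')" \
#                   "$(df --output=pcent / | tail -1 | tr -dc '0-9')"
```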

Prevention

Best Practices:

  1. Always add health checks to SQL Server containers in pipeline YAML
  2. Wait for SQL Server to be ready before starting tests
  3. Use appropriate connection timeouts (60-120 seconds for containerized SQL Server)
  4. Monitor agent resources - Ensure sufficient CPU and memory
  5. Use connection pooling to reduce connection overhead
  6. Test container startup - Verify SQL Server starts within expected time

Common Issues on Self-Hosted Agents

Issue: SQL Server 2025 takes longer to start on resource-constrained agents - Solution: Use SQL Server 2022 or increase agent resources

Issue: Multiple containers competing for resources - Solution: Reduce number of containers or increase agent VM size

Issue: Network latency between containers - Solution: Ensure containers are on the same Docker network

Issue: SQL Server memory limit too high - Solution: Set MSSQL_MEMORY_LIMIT_MB to match available agent memory

Issue: NServiceBus saga table creation times out - Solution: Add Connection Timeout=120;Command Timeout=120; to NServiceBus connection strings in test configuration files

Issue: Orleans AdoNetReminderTable initialization times out - Solution: Add Connection Timeout=120;Command Timeout=120; to Orleans AdoNetGrainReminderTable and GrainPersistence.AdoNet connection strings in test configuration files

NServiceBus Saga Table Creation Timeouts

Symptoms

  • Error: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
  • Error occurs during TestInitializeAsync when initializing NServiceBus
  • Stack trace shows: NServiceBus.Persistence.Sql.ScriptRunner.InstallSagas
  • Saga table creation script fails to complete
  • Error happens specifically on self-hosted Linux agents
  • Works on Microsoft-hosted agents but fails on self-hosted

Root Cause

NServiceBus saga table creation involves executing complex SQL scripts that:

  1. Create tables with multiple columns
  2. Add correlation properties
  3. Create indexes
  4. Verify column types
  5. Purge obsolete indexes and properties

On resource-constrained self-hosted agents, these operations can take longer than the default 30-second timeout, especially when:

  • SQL Server is running in a container
  • The agent has limited CPU/memory
  • Multiple tests are running concurrently
  • SQL Server is still initializing

Solutions

Add Connection and Command Timeouts to NServiceBus Connection Strings

Update test configuration files (appsettings.Development.Docker.json, appsettings.RateLimitTests.json, etc.):

{
  "NServiceBus": {
    "SqlServerTransport": {
      "ConnectionString": "Server=localhost,1433;Database=BASETEMPLATE_NSERVICEBUS_DATABASE;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=120;Command Timeout=120;"
    },
    "SqlServerPersistence": {
      "ConnectionString": "Server=localhost,1433;Database=BASETEMPLATE_NSERVICEBUS_DATABASE;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=120;Command Timeout=120;"
    }
  }
}

Key additions:

  • Connection Timeout=120; - allows 120 seconds to establish a connection
  • Command Timeout=120; - allows 120 seconds for each SQL command to execute
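If several settings files need the same change, a small script can patch every connection string that lacks a timeout. A hedged sed sketch that assumes the one-line "Key": "Server=..." layout shown above (GNU sed):

```shell
# Sketch: append Connection Timeout/Command Timeout to every connection
# string in a JSON settings file that does not already set them.

add_timeouts() {
  file="$1"
  # Only touch lines holding a "...": "Server=..." value with no timeout yet;
  # the trailing comma, if any, is preserved.
  sed -i \
    '/: *"Server=/ { /Connection Timeout/! s/;\{0,1\}"\(,\{0,1\}\)$/;Connection Timeout=120;Command Timeout=120;"\1/ }' \
    "$file"
}

# Usage: add_timeouts appsettings.Development.Docker.json
```

Running it twice is safe: lines that already contain a timeout are skipped.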

Increase SQL Server Resources

If timeouts persist even with 120-second timeout, increase SQL Server resources:

- container: mssql
  image: mcr.microsoft.com/mssql/server:2025-latest
  options: --name mssql --hostname mssql
  ports:
    - 1433:1433
  env:
     SA_PASSWORD: "Password@123"
     ACCEPT_EULA: "Y"
     MSSQL_PID: "Express"
     MSSQL_MEMORY_LIMIT_MB: "4096"  # Increase from 2GB to 4GB if agent has resources

Wait for SQL Server Before Tests

Add a wait step in pipeline before tests start:

- script: |
    echo "Waiting for SQL Server to be ready for NServiceBus..."
    for i in {1..30}; do
      if docker exec mssql /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "Password@123" -Q "SELECT 1" -C > /dev/null 2>&1; then
        echo "SQL Server is ready!"
        # Additional check: ensure SQL Server can execute catalog queries
        docker exec mssql /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "Password@123" -Q "SELECT COUNT(*) FROM sys.tables" -C > /dev/null 2>&1
        exit 0
      fi
      echo "Waiting for SQL Server... (attempt $i/30)"
      sleep 2
    done
    echo "SQL Server did not become ready in time"
    exit 1
  displayName: 'Wait for SQL Server to be ready for NServiceBus'

Prevention

Best Practices:

  1. Always set timeouts - Use Connection Timeout=120;Command Timeout=120; for NServiceBus connection strings in test configurations
  2. Wait for SQL Server - Add explicit wait step in pipeline before tests start
  3. Monitor resource usage - Ensure agent has sufficient CPU/memory for SQL Server
  4. Use appropriate SQL Server edition - SQL Server Express may have limitations; consider Developer edition if needed

Common Scenarios

Scenario 1: Saga table creation times out on first test run - Solution: Increase timeouts to 120 seconds and ensure SQL Server is fully ready

Scenario 2: Timeout occurs intermittently - Solution: Check agent resource usage; may need to reduce concurrent tests or increase agent resources

Scenario 3: Works on Microsoft-hosted but fails on self-hosted - Solution: Self-hosted agents may have fewer resources; increase timeouts and ensure SQL Server has adequate memory

Orleans AdoNetReminderTable Initialization Timeouts

Symptoms

  • Error: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
  • Error occurs during TestInitializeAsync when initializing Orleans
  • Stack trace shows: OrleansExtensions.ConfigureAdoNetReminderService or SqlServerDatabaseHelper.CreateIfNotExists
  • Orleans reminder table creation fails
  • Error happens specifically on self-hosted Linux agents
  • Works on Microsoft-hosted agents but fails on self-hosted

Root Cause

Orleans AdoNetReminderTable initialization involves:

  1. Creating the database if it doesn't exist (CreateIfNotExists)
  2. Executing Orleans SQL scripts (SQLServer-Main.sql and SQLServer-Reminders.sql)
  3. Creating reminder tables and indexes

On resource-constrained self-hosted agents, these operations can take longer than the default 30-second timeout, especially when:

  • SQL Server is running in a container
  • Agent has limited CPU/memory
  • Multiple tests are running concurrently
  • SQL Server is still initializing

Solutions

Add Connection and Command Timeouts to Orleans Connection Strings

Update test configuration files (appsettings.Development.Docker.json, appsettings.RateLimitTests.json, etc.):

{
  "Orleans": {
    "GrainPersistence": {
      "AdoNet": {
        "ConnectionString": "Server=localhost,1433;Database=BASETEMPLATE_ORLEANS_DATABASE;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=120;Command Timeout=120;"
      }
    },
    "AdoNetGrainReminderTable": {
      "ConnectionString": "Server=localhost,1433;Database=BASETEMPLATE_ORLEANS_DATABASE;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=120;Command Timeout=120;"
    }
  }
}

Key additions:

  • Connection Timeout=120; - Allows 120 seconds to establish a connection
  • Command Timeout=120; - Allows 120 seconds for each SQL command to execute
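As a quick guard, a pipeline step can verify that the test configuration files actually carry both keywords before the suite runs. This is a sketch; the file names mirror the examples above and are assumptions about your repository layout:

```shell
# Verify that each test appsettings file includes both timeout keywords.
# File names follow the examples above; adjust to your repository layout.
for f in appsettings.Development.Docker.json appsettings.RateLimitTests.json; do
  [ -f "$f" ] || continue   # skip files that don't exist in this checkout
  if ! grep -q "Connection Timeout=120" "$f" || ! grep -q "Command Timeout=120" "$f"; then
    echo "Missing timeout settings in $f" >&2
    exit 1
  fi
done
echo "All present configuration files include the timeout settings"
```

Failing fast here is cheaper than waiting for a 30-second timeout deep inside a test run.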

Increase SQL Server Resources

If timeouts persist even with the 120-second settings, increase SQL Server resources (same as the NServiceBus solution above).
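If SQL Server runs in a Docker container on the agent (the readiness check earlier in this guide uses a container named mssql), one hedged sketch is to recreate the container with explicit limits. The 4 GB/2 CPU figures, image tag, and password are illustrative, not prescriptions:

```shell
# Sketch: recreate the SQL Server container with explicit resource limits.
# Container name, password, image tag, and limits are illustrative.
if command -v docker > /dev/null 2>&1; then
  docker rm -f mssql 2> /dev/null
  docker run -d --name mssql \
    --memory=4g --cpus=2 \
    -e ACCEPT_EULA=Y \
    -e MSSQL_SA_PASSWORD="Password@123" \
    -p 1433:1433 \
    mcr.microsoft.com/mssql/server:2022-latest
else
  echo "docker not found on this agent" >&2
fi
```

Without an explicit --memory limit the container competes with the agent and the tests themselves for RAM, which is a common source of intermittent timeouts.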

Wait for SQL Server Before Tests

Add a wait step in pipeline before tests start (same as NServiceBus solution above).

Prevention

Best Practices:

  1. Always set timeouts - Use Connection Timeout=120;Command Timeout=120; for Orleans connection strings in test configurations
  2. Wait for SQL Server - Add explicit wait step in pipeline before tests start
  3. Monitor resource usage - Ensure agent has sufficient CPU/memory for SQL Server
  4. Use appropriate SQL Server edition - SQL Server Express may have limitations; consider Developer edition if needed

Common Scenarios

Scenario 1: Orleans reminder table creation times out on first test run - Solution: Increase timeouts to 120 seconds and ensure SQL Server is fully ready

Scenario 2: Timeout occurs intermittently - Solution: Check agent resource usage; may need to reduce concurrent tests or increase agent resources

Scenario 3: Works on Microsoft-hosted but fails on self-hosted - Solution: Self-hosted agents may have fewer resources; increase timeouts and ensure SQL Server has adequate memory

Ollama 500 Internal Server Error

Symptoms

  • Error: Response status code does not indicate success: 500 (Internal Server Error)
  • Error occurs when calling Ollama API for chat completions or tool invocation
  • Stack trace shows: OllamaSharp.OllamaApiClient.ChatAsync or GetStreamingResponseAsync
  • Test fails during Ollama chat completion or tool invocation

Root Cause

Ollama typically returns 500 errors when:

  1. Model not loaded: The specified model is not available or not loaded in memory
  2. Insufficient memory: The model is too large for available system memory
  3. Model name mismatch: The model name in configuration doesn't match the actual model name
  4. Ollama service issues: The Ollama service is having internal problems

Solutions

Verify Model is Available

On the agent server, check available models:

# List all installed models
ollama list

# Expected output should include your model:
# NAME                       ID              SIZE      MODIFIED
# mistral:7b-instruct        6577803aa9a0    4.4 GB    13 days ago

If model is missing, pull it:

# Pull the model (this may take several minutes)
ollama pull mistral:7b-instruct

# Verify it's available
ollama list | grep mistral

Verify Model Name Format

Check the exact model name format:

# List models with exact names
ollama list

# Test the model directly
ollama run mistral:7b-instruct "Hello"

Common model name formats:

  • mistral:7b-instruct (with tag)
  • mistral (without tag, uses default)
  • mistral:7b (shorter tag)

Update configuration if the model name doesn't match:

{
  "Ollama": {
    "Model": "qwen3:0.6b"  // Use exact name from 'ollama list'
  }
}

Note: The default model qwen3:0.6b (~522 MB) is recommended for basic chat completions. For tool invocation, use mistral:7b-instruct (~4.4 GB) but ensure you have 6-8 GB free RAM.

Check Ollama Service Status and Logs

Check service status:

# Check if Ollama is running
sudo systemctl status ollama

# Check recent logs for errors (this is critical for diagnosing 500 errors)
sudo journalctl -u ollama -n 100 --no-pager | tail -50

# Check for specific error patterns
sudo journalctl -u ollama -n 100 --no-pager | grep -i "error\|fail\|500\|memory\|timeout"

Common log errors:

  • out of memory - Model too large for available RAM
  • model not found - Model name incorrect or not pulled
  • context length exceeded - Request too long for model
  • failed to load model - Model file corrupted or incomplete
  • connection refused - Ollama service not running or port blocked
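These patterns can be mapped onto likely causes in one pass over the journal. A sketch, assuming Ollama runs as the systemd unit ollama and that the log phrases match the list above:

```shell
# Map the most recent Ollama journal entries onto the common causes above.
# Assumes the systemd unit is named "ollama"; patterns are from the list above.
LOGS=$(journalctl -u ollama -n 200 --no-pager 2> /dev/null)
case "$LOGS" in
  *"out of memory"*)           echo "Likely cause: model too large for available RAM" ;;
  *"model not found"*)         echo "Likely cause: wrong model name or model not pulled" ;;
  *"context length exceeded"*) echo "Likely cause: request too long for the model" ;;
  *"failed to load model"*)    echo "Likely cause: corrupted or incomplete model file" ;;
  *)                           echo "No known pattern matched; inspect the logs manually" ;;
esac
```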

To see real-time logs during test execution:

# Watch Ollama logs in real-time (run this in a separate terminal during tests)
sudo journalctl -u ollama -f

Check Available Memory

Verify system has enough memory for the model:

# Check available memory
free -h

# Check memory usage
top -bn1 | head -20

# For mistral:7b-instruct, you need at least 6-8 GB free RAM

If memory is insufficient:

  • Use a smaller model (e.g., qwen3:0.6b for basic chat, but it doesn't support tool invocation)
  • Increase server memory
  • Stop other services to free memory
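The check can be scripted as a pre-flight step. A minimal sketch; the 6144 MB threshold reflects the 6-8 GB guidance for mistral:7b-instruct above and should be adjusted per model:

```shell
# Rough pre-flight check: compare available RAM against the model's needs.
# REQUIRED_MB reflects the 6-8 GB guidance for mistral:7b-instruct; adjust per model.
REQUIRED_MB=6144
AVAIL_MB=$(free -m | awk '/^Mem:/ {print $7}')   # "available" column of free -m
if [ "${AVAIL_MB:-0}" -lt "$REQUIRED_MB" ]; then
  echo "Only ${AVAIL_MB:-unknown} MB available; ~${REQUIRED_MB} MB recommended for this model" >&2
else
  echo "${AVAIL_MB} MB available; should be enough for the model"
fi
```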

Test Ollama API Directly

Test the API endpoint:

# Test API is responding
curl http://localhost:11434/api/tags

# Test specific model
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b-instruct",
  "prompt": "Say hello",
  "stream": false
}'

If the API test fails:

  • Start the Ollama service if it isn't running: sudo systemctl start ollama
  • Check firewall/port access: netstat -tlnp | grep 11434
  • Verify the endpoint in config matches the actual endpoint
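These checks can be rolled into a single probe that reports which failure mode applies. A sketch; http://localhost:11434 is Ollama's default listen address, so adjust OLLAMA_URL if you have changed OLLAMA_HOST:

```shell
# Probe the Ollama API and report which failure mode likely applies.
# http://localhost:11434 is Ollama's default; override via OLLAMA_URL.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
if curl -fsS --max-time 5 "$OLLAMA_URL/api/tags" > /dev/null 2>&1; then
  echo "Ollama API reachable at $OLLAMA_URL"
elif command -v systemctl > /dev/null 2>&1 && systemctl is-active --quiet ollama; then
  echo "Service is active but API not responding; check port/firewall" >&2
else
  echo "Ollama service appears to be down; try: sudo systemctl start ollama" >&2
fi
```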

Pre-load the Model

Models need to be loaded into memory before use. Pre-load to avoid 500 errors:

# Pre-load the model (this loads it into memory)
ollama run qwen3:0.6b "test"

# Or use the API to pre-load
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:0.6b",
  "prompt": "test",
  "stream": false
}'

# Verify model is loaded (check memory usage)
ollama ps

Why this helps:

  • Models are lazy-loaded on first API request
  • If memory is limited, loading during tests can cause 500 errors
  • Pre-loading ensures models are ready when tests run
  • For qwen3:0.6b (~522 MB), pre-loading is usually not necessary, but can help avoid delays
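In a pipeline, the pre-load can be made an explicit step before the test task. A sketch; the model name default, OLLAMA_MODEL variable, and 300-second cap are illustrative:

```shell
# Pipeline step sketch: warm the model so the first test request doesn't cold-start.
# Model name and the 300-second cap are illustrative assumptions.
MODEL="${OLLAMA_MODEL:-qwen3:0.6b}"
if ! command -v ollama > /dev/null 2>&1; then
  echo "ollama CLI not found on this agent; skipping pre-load" >&2
elif timeout 300 ollama run "$MODEL" "warmup" > /dev/null 2>&1; then
  echo "Model $MODEL pre-loaded"
else
  echo "Pre-load of $MODEL failed; first requests may 500 or be slow" >&2
fi
```

Making the warm-up a separate step also attributes the load time to that step in pipeline timing, instead of inflating the first test's duration.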

Restart Ollama Service

If issues persist, restart Ollama:

# Restart Ollama service
sudo systemctl restart ollama

# Wait a few seconds for service to start
sleep 5

# Verify it's running
sudo systemctl status ollama

# Test API again
curl http://localhost:11434/api/tags

# Pre-load models after restart (optional for qwen3:0.6b)
ollama run qwen3:0.6b "test"

Prevention

Best Practices:

  1. Verify model before tests - Run ollama list to confirm model is available
  2. Use correct model name - Match exactly what ollama list shows
  3. Ensure sufficient memory - Have at least 1.5x model size in free RAM
  4. Monitor Ollama logs - Check logs regularly for warnings or errors
  5. Test API directly - Use curl to test Ollama before running tests

Common Scenarios

Scenario 1: Model not found (500 error) - Solution: Run ollama pull mistral:7b-instruct to download the model

Scenario 2: Out of memory (500 error) - Solution: Free up memory or use a smaller model for basic chat (but tool invocation requires larger model)

Scenario 3: Model name mismatch (500 error) - Solution: Check ollama list and use exact model name from output

Scenario 4: Ollama service not running (connection refused) - Solution: Start service with sudo systemctl start ollama

Getting Additional Help

Azure DevOps Resources

Hetzner Cloud Resources

Log Collection

When seeking help, collect:

  1. Agent logs (last 100 lines)
  2. Service status
  3. System resource usage
  4. Network connectivity test results
  5. Agent configuration (sanitized)
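The five items above can be captured in one pass with a small script. A sketch; the service name pattern and output path are assumptions, and the resulting file should be sanitized (tokens, passwords) before sharing:

```shell
# Collect the diagnostics listed above into one file for a support request.
# Service name pattern and output path are assumptions; sanitize before sharing.
OUT="agent-diagnostics.txt"
{
  echo "== Service status =="
  systemctl status 'vsts.agent.*' --no-pager 2>&1
  echo "== Agent logs (last 100 lines) =="
  journalctl -u 'vsts.agent.*.service' -n 100 --no-pager 2>&1
  echo "== Resource usage =="
  free -h 2>&1
  df -h 2>&1
  echo "== Connectivity =="
  curl -sI --max-time 10 https://dev.azure.com 2>&1 | head -n 5
} > "$OUT"
echo "Wrote $OUT - sanitize before sharing"
```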

Next Steps

  • Review Maintenance Guide for preventive measures
  • Set up monitoring to catch issues early
  • Document your specific troubleshooting procedures