Self-Hosted Agents - Troubleshooting Guide¶
Overview¶
This guide covers common issues encountered with self-hosted Azure DevOps agents and their solutions.
Agent Not Appearing in Azure DevOps¶
Symptoms¶
- Agent does not appear in agent pool
- Agent shows as "Offline" immediately after installation
Possible Causes¶
- Incorrect PAT token permissions
- Network connectivity issues
- Agent configuration errors
- Service not running
Solutions¶
Check PAT Token Permissions¶
- Verify PAT token has Agent Pools (Read & Manage) scope
- Check token expiration date
- Create new PAT if needed
Verify Network Connectivity¶
Linux:
# Test Azure DevOps connectivity
curl -I https://dev.azure.com
# Test DNS resolution
nslookup dev.azure.com
Windows:
# Test Azure DevOps connectivity
Test-NetConnection -ComputerName dev.azure.com -Port 443
# Test DNS resolution
Resolve-DnsName dev.azure.com
Check Agent Configuration¶
Linux:
# View agent configuration
cat ~/azagent/.agent
# Check service status
sudo systemctl status vsts.agent.*.service
Windows:
# View agent configuration
Get-Content C:\azagent\.agent
# Check service status
Get-Service | Where-Object {$_.Name -like "*vsts*"}
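The .agent file is JSON, so its key fields can be extracted directly. A minimal sketch (the field names serverUrl, agentName, poolId, and workFolder are the common ones but can vary by agent version; the default path is the one used in this guide):

```shell
#!/usr/bin/env bash
# Print key fields from an agent's .agent JSON file (grep/sed sketch).
print_agent_config() {
  local agent_file="$1" field value
  [ -f "$agent_file" ] || { echo "no .agent file at $agent_file"; return 1; }
  for field in serverUrl agentName poolId workFolder; do
    # Match "field": "value" or "field": <number>, then strip the key and quotes
    value=$(grep -o "\"$field\": *\"[^\"]*\"\|\"$field\": *[0-9]*" "$agent_file" \
            | sed 's/^"[^"]*": *//; s/"//g')
    echo "$field = ${value:-<not found>}"
  done
}

# Usage: print_agent_config ~/azagent/.agent
```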
Review Agent Logs¶
Linux:
# View recent logs
sudo journalctl -u vsts.agent.*.service -n 100 --no-pager
# Follow logs in real-time
sudo journalctl -u vsts.agent.*.service -f
Windows:
# View recent event logs
Get-EventLog -LogName Application -Source "vsts*" -Newest 50
# View specific error events
Get-EventLog -LogName Application -Source "vsts*" -EntryType Error -Newest 20
Agent Goes Offline¶
Symptoms¶
- Agent was online but now shows as offline
- Agent status changes to offline intermittently
Possible Causes¶
- Service stopped
- Network connectivity lost
- Server rebooted
- PAT token expired
Solutions¶
Check Service Status¶
Linux:
# Check service status
sudo systemctl status vsts.agent.*.service
# Start service if stopped
sudo systemctl start vsts.agent.*.service
# Enable auto-start
sudo systemctl enable vsts.agent.*.service
Windows:
# Check service status
Get-Service | Where-Object {$_.Name -like "*vsts*"}
# Start service if stopped
Start-Service -Name (Get-Service | Where-Object {$_.Name -like "*vsts*"}).Name
Verify Network Connectivity¶
# Linux
ping -c 4 dev.azure.com
curl -I https://dev.azure.com
# Windows
Test-Connection dev.azure.com
Test-NetConnection -ComputerName dev.azure.com -Port 443
Check for Server Reboots¶
Linux:
# Check last reboot time
last reboot | head -5
# Check system uptime
uptime -p
Windows:
# Check last reboot/shutdown events (Event ID 1074 is logged by source User32)
Get-EventLog -LogName System -Source "User32" | Where-Object {$_.EventID -eq 1074} | Select-Object -First 1
# Check system uptime
(Get-CimInstance Win32_OperatingSystem).LastBootUpTime
Docker Not Found Error¶
Symptoms¶
- Pipeline fails with error: ##[error]File not found: 'docker'
- Container services fail to start
- Error: docker: command not found
Possible Causes¶
- Docker not installed on agent
- Docker not in PATH
- Agent user not in docker group
- Docker service not running
Solutions¶
Install Docker (Linux)¶
# Add Docker's official GPG key
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
# Set up repository
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Start Docker service
sudo systemctl start docker
sudo systemctl enable docker
# Add agent user to docker group
sudo usermod -aG docker azdevops
# Verify installation (may need to log out and back in for group changes)
sudo docker run hello-world
Verify Docker Installation¶
# Check Docker version
docker --version
# Check Docker service status
sudo systemctl status docker
# Test Docker (as azdevops user)
docker run hello-world
# If permission denied, log out and back in, or restart agent service
Fix Docker Permissions¶
# Add user to docker group (if not already)
sudo usermod -aG docker azdevops
# Verify user is in docker group
groups azdevops
# Restart agent service to apply group changes
cd ~/azagent
sudo ./svc.sh stop
sudo ./svc.sh start
Verify Docker is Accessible to Agent¶
# Check if agent user can run Docker
sudo -u azdevops docker ps
# If permission denied, ensure:
# 1. User is in docker group: groups azdevops
# 2. Docker socket has correct permissions: ls -la /var/run/docker.sock
# 3. Agent service is restarted after group changes
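The three checks above can be wrapped into one small report. A sketch (the user name and socket path are parameters; azdevops and /var/run/docker.sock are just the defaults used in this guide):

```shell
#!/usr/bin/env bash
# Report whether a user can plausibly reach Docker: group membership + socket.
docker_access_report() {
  local user="$1" socket="${2:-/var/run/docker.sock}"
  # Check docker group membership
  if id -nG "$user" 2>/dev/null | tr ' ' '\n' | grep -qx docker; then
    echo "group: $user is in the docker group"
  else
    echo "group: $user is NOT in the docker group"
  fi
  # Check the Docker socket exists and show its permissions
  if [ -S "$socket" ]; then
    echo "socket: $(ls -la "$socket")"
  else
    echo "socket: $socket is missing (is the docker service running?)"
  fi
}

# Usage: docker_access_report azdevops
```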
Note: If your pipelines use container services (redis, mssql, mongodb, etc.), Docker is required, not optional. See the Linux Setup Guide for complete Docker installation instructions.
Build Failures on Agent¶
Symptoms¶
- Builds fail with tool not found errors
- Builds fail with permission errors
- Builds fail with disk space errors
Possible Causes¶
- Required tools not installed
- Insufficient permissions
- Disk space full
- Incorrect agent capabilities
Solutions¶
Verify Required Tools¶
Linux:
# Check .NET SDK
dotnet --version
# Check Docker
docker --version
# If Docker is not found, install it (see Linux Setup Guide)
# Check Node.js
node --version
npm --version
Windows:
# Check .NET SDK
dotnet --version
# Check Git
git --version
# Check Node.js
node --version
npm --version
Check Permissions¶
Linux:
# Check agent user and group membership
id azdevops
# Check work directory permissions
ls -la ~/azagent/_work
Windows:
# Check agent user permissions
whoami /groups
# Check directory permissions
Get-Acl C:\azagent\_work
Check Disk Space¶
Linux:
# Check disk usage
df -h
# Check specific directory
du -sh ~/azagent/_work
Windows:
# Check disk usage
Get-PSDrive C | Select-Object Used, Free
# Check specific directory
Get-ChildItem C:\azagent\_work -Recurse | Measure-Object -Property Length -Sum
Verify Agent Capabilities¶
- In Azure DevOps, navigate to agent pool
- Select agent → Capabilities tab
- Verify required capabilities are present
- Add missing capabilities if needed
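System capabilities on Linux agents are largely discovered from tools on the PATH, so you can predict what the agent will advertise by probing for each tool locally. A sketch (the tool list is illustrative):

```shell
#!/usr/bin/env bash
# Report which capability-relevant tools are on the PATH.
report_tools() {
  local tool
  for tool in "$@"; do
    if command -v "$tool" > /dev/null 2>&1; then
      echo "$tool: found ($(command -v "$tool"))"
    else
      echo "$tool: missing"
    fi
  done
}

# Usage: report_tools dotnet docker node npm git
```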
Code Coverage Not Found by Build Quality Checks¶
Symptoms¶
- Build Quality Checks shows 0% coverage: Total lines: 0, Covered lines: 0
- Coverage reports are published successfully, but Build Quality Checks can't find them
- Error: The code coverage value (0%, 0 lines) is lower than the minimum value
Possible Causes¶
- Case sensitivity on Linux - File paths are case-sensitive on Linux
- Coverage XML files not found by PublishCodeCoverageResults - The glob pattern might not match on Linux
- Coverage files in wrong location - Files might be in a different directory than expected
Solutions¶
Verify Coverage Files Exist¶
Add a diagnostic step before Build Quality Checks to verify coverage files:
- script: |
    echo "Checking for coverage files..."
    find "$(Agent.TempDirectory)" -name "coverage.cobertura.xml" -type f
    find "$(Agent.TempDirectory)" -name "*coverage*" -type f
  displayName: 'Diagnose coverage file locations'
Ensure PublishCodeCoverageResults Finds Files¶
The PublishCodeCoverageResults@2 task locates reports via a file-path glob (in this setup, a pattern such as $(Agent.TempDirectory)/**/coverage.cobertura.xml).
On Linux, ensure:
1. The file name is exactly coverage.cobertura.xml (case-sensitive)
2. The file is in a subdirectory of $(Agent.TempDirectory)
3. The file is readable by the agent user
Fix Coverage File Paths¶
If coverage files are in a different location, you may need to:
- Copy the files to the expected location under $(Agent.TempDirectory)
- Update the PublishCodeCoverageResults path (if you can modify the template)
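As a sketch of the copy approach, the script below gathers coverage.cobertura.xml files from a source tree into a folder the publish task can scan. The paths are placeholders; in a pipeline you would pass $(Agent.TempDirectory) and your actual test results root:

```shell
#!/usr/bin/env bash
# Copy coverage.cobertura.xml files into the directory the publish task scans.
collect_coverage() {
  local src_root="$1" dest_root="$2" f n=0
  mkdir -p "$dest_root"
  while IFS= read -r f; do
    # One subdirectory per source file to avoid name collisions
    n=$((n + 1))
    mkdir -p "$dest_root/run$n"
    cp "$f" "$dest_root/run$n/coverage.cobertura.xml"
  done < <(find "$src_root" -name 'coverage.cobertura.xml' -type f)
  echo "copied $n coverage file(s) to $dest_root"
}
```

In a pipeline step this could run as `collect_coverage "$(System.DefaultWorkingDirectory)" "$(Agent.TempDirectory)/coverage"` before the publish task.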
Verify Build Quality Checks Configuration¶
Ensure Build Quality Checks is configured correctly:
- task: mspremier.BuildQualityChecks.QualityChecks-task.BuildQualityChecks@10
  inputs:
    checkCoverage: true
    coverageFailOption: fixed
    coverageType: lines
    coverageThreshold: '76'
Note: Build Quality Checks reads coverage data from PublishCodeCoverageResults, not from file artifacts. The coverage must be published successfully before Build Quality Checks can read it.
Pipeline Cannot Find Agent¶
Symptoms¶
- Pipeline shows "No agent found" error
- Pipeline waits indefinitely for agent
Possible Causes¶
- Pool name mismatch
- Demand requirements not met
- All agents busy or offline
- Agent capabilities don't match demands
Solutions¶
Verify Pool Name¶
Ensure the pool name in the pipeline YAML matches the agent pool name exactly (case-sensitive):
pool:
  name: 'Hetzner-Linux'  # Must match the pool name shown in Azure DevOps
Check Agent Demands¶
Verify agent capabilities match pipeline demands:
pool:
  name: 'Hetzner-Linux'
  demands:
    - Agent.OS -equals Linux
    - DotNet -equals 9.0.x  # Agent must have this capability
Verify Agent Availability¶
- Check agent pool in Azure DevOps
- Verify at least one agent is online
- Check if agents are busy with other jobs
- Consider adding more agents if all are busy
High Disk Usage¶
Symptoms¶
- Builds fail with "No space left on device" errors
- Disk usage shows > 90%
Solutions¶
Clean Up Build Artifacts¶
Linux:
# Clean agent work directory (stop the agent service first so no build is using it)
cd ~/azagent/_work
rm -rf *
# Clean old directories (older than 30 days)
find ~/azagent/_work -type d -mtime +30 -exec rm -rf {} \;
Windows:
# Clean agent work directory
Remove-Item C:\azagent\_work\* -Recurse -Force
# Clean old directories
Get-ChildItem C:\azagent\_work -Directory | Where-Object {$_.LastWriteTime -lt (Get-Date).AddDays(-30)} | Remove-Item -Recurse -Force
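A slightly safer variant of the Linux age-based cleanup above first reports what it would delete. A sketch (the work directory, 30-day threshold, and dry-run default are parameters):

```shell
#!/usr/bin/env bash
# Age-based cleanup of build directories with a dry-run mode.
clean_old_dirs() {
  local work_dir="$1" days="${2:-30}" mode="${3:-dry-run}" d
  while IFS= read -r d; do
    if [ "$mode" = "delete" ]; then
      rm -rf "$d" && echo "deleted: $d"
    else
      echo "would delete: $d"
    fi
  done < <(find "$work_dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$days")
}

# Usage:
#   clean_old_dirs ~/azagent/_work 30          # report only
#   clean_old_dirs ~/azagent/_work 30 delete   # actually delete
```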
Clean Package Caches¶
Linux:
# Clean NuGet cache
rm -rf ~/.nuget/packages/*
# Clean npm cache
npm cache clean --force
# Clean Docker
docker system prune -a --volumes
Windows:
# Clean NuGet cache
Remove-Item "$env:USERPROFILE\.nuget\packages\*" -Recurse -Force
# Clean npm cache
npm cache clean --force
# Clean Docker
docker system prune -a --volumes
Increase Disk Size¶
If using Hetzner Cloud, you can increase disk size:
- Navigate to Hetzner Cloud Console
- Select server → Resize → Increase disk size
- Follow instructions to resize filesystem
Slow Build Performance¶
Symptoms¶
- Builds take longer than expected
- Agent CPU/memory usage is high
Solutions¶
Check System Resources¶
Linux:
# Check CPU and memory
top -bn1 | head -20
free -h
# Check disk I/O (iostat is part of the sysstat package)
iostat -x 1 3
Windows:
# Check CPU and memory
Get-Process | Sort-Object CPU -Descending | Select-Object -First 10
Get-Counter '\Processor(_Total)\% Processor Time'
Get-Counter '\Memory\Available MBytes'
# Check disk I/O
Get-Counter '\PhysicalDisk(*)\Disk Reads/sec'
Get-Counter '\PhysicalDisk(*)\Disk Writes/sec'
Optimize Build Cache¶
- Configure persistent NuGet cache
- Use Docker layer caching
- Cache npm/node_modules
- Cache build artifacts between runs
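For NuGet specifically, the hosted Cache@2 task can persist the package cache between runs. A sketch (the cache key, lock-file glob, and NUGET_PACKAGES location are illustrative; adjust to your repository layout):

```yaml
variables:
  NUGET_PACKAGES: $(Pipeline.Workspace)/.nuget/packages

steps:
  - task: Cache@2
    inputs:
      key: 'nuget | "$(Agent.OS)" | **/packages.lock.json'
      restoreKeys: |
        nuget | "$(Agent.OS)"
      path: $(NUGET_PACKAGES)
    displayName: 'Cache NuGet packages'
```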
Upgrade Server Resources¶
If resources are consistently maxed out:
- Consider upgrading to larger server type
- Add more agents to distribute load
- Optimize build processes
Authentication Errors¶
Symptoms¶
- "401 Unauthorized" errors
- "403 Forbidden" errors
- PAT token errors
Solutions¶
Verify PAT Token¶
- Check token expiration date
- Verify token has correct scopes:
- Agent Pools (Read & Manage)
- Build (Read & Execute)
- Create new PAT if needed
Update Agent Configuration¶
Linux:
# Stop the service, remove the old registration, and reconfigure with the new PAT
cd ~/azagent
sudo ./svc.sh stop
./config.sh remove --auth pat --token <new-pat>
./config.sh   # re-run configuration with the new PAT
sudo ./svc.sh install
sudo ./svc.sh start
Windows:
# Remove the old registration and reconfigure with the new PAT
cd C:\azagent
.\config.cmd remove --auth pat --token <new-pat>
.\config.cmd   # re-run configuration with the new PAT
Service Won't Start¶
Symptoms¶
- Agent service fails to start
- Service shows as "Failed" status
Solutions¶
Check Service Logs¶
Linux:
# View service logs
sudo journalctl -u vsts.agent.*.service -n 100 --no-pager
# Check service status
sudo systemctl status vsts.agent.*.service
Windows:
# View service logs
Get-EventLog -LogName Application -Source "vsts*" -Newest 50
# Check service status
Get-Service | Where-Object {$_.Name -like "*vsts*"}
Verify Agent Configuration¶
Linux:
# Check configuration file
cat ~/azagent/.agent
# Verify credentials file exists
ls -la ~/azagent/.credentials
Windows:
# Check configuration file
Get-Content C:\azagent\.agent
# Verify credentials file exists
Test-Path C:\azagent\.credentials
Reinstall Service¶
Linux:
cd ~/azagent
sudo ./svc.sh stop
sudo ./svc.sh uninstall
sudo ./svc.sh install azdevops
sudo ./svc.sh start
Windows:
cd C:\azagent
.\svc.cmd stop
.\svc.cmd uninstall
.\svc.cmd install
.\svc.cmd start
Network Connectivity Issues¶
Symptoms¶
- Agent cannot connect to Azure DevOps
- Timeout errors
- SSL/TLS errors
Solutions¶
Test Connectivity¶
# Linux
curl -v https://dev.azure.com
ping -c 4 dev.azure.com
# Windows
Test-NetConnection -ComputerName dev.azure.com -Port 443
Test-Connection dev.azure.com
Check Firewall Rules¶
Linux:
# Check firewall status (ufw)
sudo ufw status verbose
# Or inspect iptables rules directly
sudo iptables -L -n
Windows:
# Check firewall rules
Get-NetFirewallRule | Where-Object {$_.DisplayName -like "*HTTPS*"}
# Allow outbound HTTPS (usually enabled by default)
Check Proxy Settings¶
If behind a proxy:
- Configure proxy in agent environment
- Set HTTP_PROXY and HTTPS_PROXY variables
- Update agent configuration if needed
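Concretely, the agent can be told about a proxy at configuration time, or the variables can be persisted in the agent's .env file, which the service loads on start. A sketch (the proxy URL is a placeholder; the --proxyurl flag and .env file are standard agent features, but verify them against your agent version):

```shell
#!/usr/bin/env bash
# Persist proxy settings for an Azure DevOps agent (sketch).
# PROXY_URL is a placeholder; AGENT_DIR defaults to the path used in this guide.
AGENT_DIR="${AGENT_DIR:-$HOME/azagent}"
PROXY_URL="${PROXY_URL:-http://proxy.example.com:8080}"

# Option 1: pass the proxy when (re)configuring the agent:
#   ./config.sh --proxyurl "$PROXY_URL"

# Option 2: persist environment variables in the agent's .env file,
# which the agent service loads on start.
mkdir -p "$AGENT_DIR"
printf 'HTTP_PROXY=%s\nHTTPS_PROXY=%s\n' "$PROXY_URL" "$PROXY_URL" >> "$AGENT_DIR/.env"
echo "wrote proxy settings to $AGENT_DIR/.env"
```

Restart the agent service afterwards so the new environment takes effect.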
Git Authentication Errors on Self-Hosted Agents¶
Symptoms¶
- Error: fatal: unable to access 'https://dev.azure.com/...': The requested URL returned error: 400
- Git fetch fails with exit code 128
- Repository checkout fails on self-hosted Linux agents
- Works on Microsoft-hosted agents but fails on self-hosted
- Error occurs after manually installing Git on the agent
Possible Causes¶
- Missing explicit checkout with credentials - Default checkout doesn't persist credentials on self-hosted agents
- Git configuration conflicts - Manual Git installation may have changed global Git config
- Agent permissions - Agent user doesn't have proper repository access
- Stale Git credentials - Old credentials cached in Git config
Solutions¶
Add Explicit Checkout with Credentials (Required)¶
In your pipeline YAML, add explicit checkout step:
steps:
  - checkout: self
    persistCredentials: true
    displayName: 'Checkout repository with credentials'
  # ... rest of your steps
This is required for self-hosted agents to authenticate properly. Without this, the agent cannot authenticate to fetch from Azure DevOps repositories.
Clear Git Configuration on Agent¶
If Git was manually installed and causing issues:
# Connect to agent server
ssh azdevops@<server-ip>
# Check current Git config
git config --global --list
# Remove problematic credentials
git config --global --unset-all http.extraheader
git config --global --unset-all http.https://dev.azure.com.extraheader
# Verify Git version
git --version
Verify Agent Repository Permissions¶
- In Azure DevOps, go to Project Settings → Repositories
- Select your repository
- Go to Security tab
- Ensure Project Collection Build Service has Read permission
- Ensure Project Build Service has Read permission
Configure Git Authentication Manually (If Needed)¶
If persistCredentials: true doesn't work, configure Git manually in the pipeline. Avoid --global here so the header stays local to the checked-out repository and doesn't conflict with the pipeline's own credential handling:
- script: |
    git config http.extraheader "AUTHORIZATION: bearer $SYSTEM_ACCESSTOKEN"
    git config http.version HTTP/1.1
  displayName: 'Configure Git authentication'
  env:
    SYSTEM_ACCESSTOKEN: $(System.AccessToken)
Restart Agent Service¶
After making changes, restart the agent:
cd ~/azagent
sudo ./svc.sh stop
sudo ./svc.sh start
Note: The persistCredentials: true option is the standard solution for self-hosted agents. Always include this in your pipeline YAML when using self-hosted agents.
Global Git Config Conflicts with Pipeline Authentication¶
Symptoms¶
- Error: fatal: unable to access 'https://dev.azure.com/...': The requested URL returned error: 400
- Pipeline logs show: ##[warning]Git config still contains extraheader keys. It may cause errors.
- Pipeline logs show: ##[warning]An unsuccessful attempt was made using git command line to remove "http.extraheader" from the git config.
- Git fetch fails even with persistCredentials: true configured
- Works on Microsoft-hosted agents but fails on self-hosted agents
Root Cause¶
The Azure DevOps checkout task attempts to manage Git authentication by:
1. Removing existing http.extraheader configuration to avoid conflicts
2. Setting its own authentication per repository using pipeline tokens
If a global Git config has http.extraheader set (e.g., in ~/.gitconfig), the checkout task cannot remove it cleanly, causing authentication conflicts. The pipeline tries to use its own token, but Git still uses the stale global token, resulting in HTTP 400 errors.
Solutions¶
Remove Global http.extraheader Configuration¶
On Linux Agent:
# Connect to agent server as azdevops user
ssh azdevops@<server-ip>
# Check current global Git config
git config --global --list
# Remove the problematic http.extraheader
git config --global --unset-all http.extraheader
# Verify it's removed
git config --global --list
# Restart agent service
cd ~/azagent
sudo ./svc.sh stop
sudo ./svc.sh start
On Windows Agent:
# Connect via RDP or PowerShell
# Check current global Git config
git config --global --list
# Remove the problematic http.extraheader
git config --global --unset-all http.extraheader
# Verify it's removed
git config --global --list
# Restart agent service
cd C:\azagent
.\svc.cmd stop
.\svc.cmd start
Keep User Configuration, Remove Only Authentication¶
You can keep user name and email in global config, but remove authentication settings:
# Keep these (optional)
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
# Remove these (required)
git config --global --unset-all http.extraheader
git config --global --unset-all http.https://dev.azure.com.extraheader
git config --global --unset-all credential.helper
# Remove credential files
rm -f ~/.git-credentials
rm -rf ~/.git-credential-cache
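To audit an agent in one pass, you can scan a gitconfig file for the settings that conflict with pipeline authentication. A sketch (it reads the file directly rather than going through git, so it works even where git is not installed; the pattern matching is a simple approximation):

```shell
#!/usr/bin/env bash
# Scan a gitconfig file for settings that conflict with pipeline auth.
audit_gitconfig() {
  local cfg="$1" bad=0
  [ -f "$cfg" ] || { echo "ok: no gitconfig at $cfg"; return 0; }
  if grep -qi 'extraheader' "$cfg"; then
    echo "conflict: http.extraheader is set in $cfg"
    bad=1
  fi
  if grep -qiE '^[[:space:]]*helper' "$cfg"; then
    echo "conflict: a credential.helper is set in $cfg"
    bad=1
  fi
  [ "$bad" -eq 0 ] && echo "ok: $cfg has no conflicting auth settings"
  return "$bad"
}

# Usage: audit_gitconfig ~/.gitconfig && audit_gitconfig /etc/gitconfig
```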
Add Explicit Checkout with persistCredentials¶
Even after removing global config, add explicit checkout to your pipeline:
steps:
  - checkout: self
    persistCredentials: true
    displayName: 'Checkout repository with credentials'
  # ... rest of your steps
This ensures the pipeline manages authentication correctly.
Verify Configuration¶
After making changes, verify:
# Check global config (should not contain http.extraheader)
git config --global --list
# Check system config (if exists)
cat /etc/gitconfig 2>/dev/null || echo "No system gitconfig"
# Check for credential files
ls -la ~/.git-credentials 2>/dev/null
ls -la ~/.git-credential-cache 2>/dev/null
Prevention¶
Best Practices:
- Never set http.extraheader globally - let the pipeline manage authentication
- Use persistCredentials: true - always include an explicit checkout in pipeline YAML for self-hosted agents
- Keep user.name and user.email - these are safe to set globally
- Avoid credential helpers in global config - let the pipeline handle credentials
Why This Happens¶
The Azure DevOps checkout task:
1. Tries to remove existing http.extraheader to avoid conflicts
2. Sets its own authentication per repository using System.AccessToken
3. If global config has http.extraheader, it conflicts with the pipeline's token
4. Git uses the stale global token instead of the fresh pipeline token
5. Results in HTTP 400 because the global token may be expired or invalid
Solution: Remove global http.extraheader and let the pipeline manage authentication per repository.
SQL Server Connection Timeouts in Docker Containers¶
Symptoms¶
- Error: Microsoft.Data.SqlClient.SqlException: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
- Error: System.ComponentModel.Win32Exception: Unknown error 258
- Tests fail during TestInitializeAsync when initializing NServiceBus SQL Persistence
- Saga table creation scripts time out
- Tests run successfully on Microsoft-hosted agents but fail on self-hosted Linux agents
- Error occurs when running tests in Docker containers
Root Cause¶
SQL Server connection timeouts in Docker containers on self-hosted agents can occur due to:
- SQL Server container not ready - Container may not be fully initialized when tests start
- Network connectivity issues - Containers may not be able to communicate properly
- Resource constraints - Self-hosted agent may have limited CPU/memory, causing SQL Server to respond slowly
- Connection string issues - Wrong hostname or port in connection string
- SQL Server startup time - SQL Server 2025 may take longer to start on resource-constrained systems
- Container health check not working - Pipeline may start tests before SQL Server is ready
Solutions¶
Verify SQL Server Container is Running¶
Check container status in pipeline:
- script: |
    echo "Checking SQL Server container status..."
    docker ps -a | grep mssql
    docker logs mssql --tail 50
  displayName: 'Check SQL Server container status'
Or add a wait step before tests:
- script: |
    echo "Waiting for SQL Server to be ready..."
    timeout 120 bash -c 'until docker exec mssql /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P "Password@123" -Q "SELECT 1" -C; do sleep 2; done'
  displayName: 'Wait for SQL Server to be ready'
Add Health Check to SQL Server Container¶
Update pipeline YAML to include health check:
- container: mssql
  image: mcr.microsoft.com/mssql/server:2025-latest
  options: --name mssql --hostname mssql --health-cmd "/opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P Password@123 -Q 'SELECT 1' -C" --health-interval 10s --health-timeout 5s --health-retries 10
  ports:
    - 1433:1433
  env:
    SA_PASSWORD: "Password@123"
    ACCEPT_EULA: "Y"
    MSSQL_PID: "Express"
Limit SQL Server Memory¶
If the agent has limited resources, cap SQL Server memory:
- container: mssql
  image: mcr.microsoft.com/mssql/server:2025-latest
  options: --name mssql --hostname mssql
  ports:
    - 1433:1433
  env:
    SA_PASSWORD: "Password@123"
    ACCEPT_EULA: "Y"
    MSSQL_PID: "Express"
    MSSQL_MEMORY_LIMIT_MB: "2048"  # Reduce from the 4GB default to 2GB
Verify Connection String¶
Ensure connection string uses correct hostname:
- In Docker containers, use the container name: Server=mssql,1433;...
- On the host machine, use: Server=localhost,1433;...
- Check appsettings.Development.Docker.json for Docker-specific connection strings
Example connection string for Docker:
{
  "ConnectionStrings": {
    "ConnectSoft.BaseTemplateSqlServer": "Server=mssql,1433;Database=TestDb;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=60;"
  }
}
Increase Connection Timeout¶
Add connection timeout to connection string:
{
  "ConnectionStrings": {
    "ConnectSoft.BaseTemplateSqlServer": "Server=mssql,1433;Database=TestDb;User Id=sa;Password=Password@123;Connection Timeout=120;Command Timeout=120;"
  }
}
Or in code:
var connectionString = "Server=mssql,1433;Database=TestDb;User Id=sa;Password=Password@123;Connection Timeout=120;Command Timeout=120;";
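A quick pipeline guard can fail fast if a configured connection string lacks the timeout settings. A sketch (the JSON file path and the grep-based matching are assumptions; it looks for any "Server=..." string without a Connection Timeout keyword):

```shell
#!/usr/bin/env bash
# Fail if any connection string in a JSON config lacks a Connection Timeout.
check_timeouts() {
  local file="$1" missing
  # Extract quoted "Server=..." connection strings, keep ones without a timeout
  missing=$(grep -o '"[^"]*Server=[^"]*"' "$file" | grep -v 'Connection Timeout=')
  if [ -n "$missing" ]; then
    echo "missing Connection Timeout in: $missing"
    return 1
  fi
  echo "all connection strings in $file set Connection Timeout"
}

# Usage: check_timeouts appsettings.Development.Docker.json
```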
Add Retry Logic¶
Add retry logic in test initialization:
[TestInitialize]
public async Task TestInitializeAsync()
{
    var maxRetries = 5;
    var delay = TimeSpan.FromSeconds(5);

    for (int i = 0; i < maxRetries; i++)
    {
        try
        {
            // Your initialization code
            await InitializeServices();
            return;
        }
        catch (SqlException ex) when (ex.Message.Contains("timeout") && i < maxRetries - 1)
        {
            await Task.Delay(delay);
            delay = TimeSpan.FromSeconds(delay.TotalSeconds * 2); // Exponential backoff
        }
    }
}
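The same retry-with-backoff idea works at the pipeline level for any flaky setup command. A shell sketch:

```shell
#!/usr/bin/env bash
# Retry a command with exponential backoff (sketch).
retry() {
  local max="$1" delay="$2"; shift 2
  local attempt=1
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))       # exponential backoff
    attempt=$((attempt + 1))
  done
}

# Usage: retry 5 5 docker exec mssql /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P "Password@123" -Q "SELECT 1" -C
```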
Check Container Network¶
Verify containers are on the same network:
- script: |
    echo "Checking Docker network..."
    docker network ls
    docker network inspect bridge | grep -A 10 mssql
  displayName: 'Check Docker network'
If using Docker Compose, ensure the services are on the same network (by default, Compose attaches all services in a file to a shared network).
Monitor SQL Server Performance¶
Check SQL Server resource usage:
- script: |
    echo "SQL Server container stats:"
    docker stats mssql --no-stream
    echo "SQL Server logs (last 20 lines):"
    docker logs mssql --tail 20
  displayName: 'Check SQL Server performance'
Use SQL Server 2022 Instead of 2025¶
If SQL Server 2025 is causing issues, try 2022:
- container: mssql
  image: mcr.microsoft.com/mssql/server:2022-latest  # Use 2022 instead of 2025
  options: --name mssql --hostname mssql
  ports:
    - 1433:1433
  env:
    SA_PASSWORD: "Password@123"
    ACCEPT_EULA: "Y"
    MSSQL_PID: "Express"
Verify Agent Resources¶
Check if the agent has sufficient resources:
# Check memory and CPU
free -h
top -bn1 | head -20
# Check disk space
df -h
If resources are low:
- Increase agent VM size (e.g., from CPX32 to CPX41)
- Reduce the number of concurrent containers
- Stop unnecessary services on the agent
Prevention¶
Best Practices:
- Always add health checks to SQL Server containers in pipeline YAML
- Wait for SQL Server to be ready before starting tests
- Use appropriate connection timeouts (60-120 seconds for containerized SQL Server)
- Monitor agent resources - Ensure sufficient CPU and memory
- Use connection pooling to reduce connection overhead
- Test container startup - Verify SQL Server starts within expected time
Common Issues on Self-Hosted Agents¶
Issue: SQL Server 2025 takes longer to start on resource-constrained agents
- Solution: Use SQL Server 2022 or increase agent resources
Issue: Multiple containers competing for resources
- Solution: Reduce the number of containers or increase agent VM size
Issue: Network latency between containers
- Solution: Ensure containers are on the same Docker network
Issue: SQL Server memory limit too high
- Solution: Set MSSQL_MEMORY_LIMIT_MB to match available agent memory
Issue: NServiceBus saga table creation times out
- Solution: Add Connection Timeout=120;Command Timeout=120; to NServiceBus connection strings in test configuration files
Issue: Orleans AdoNetReminderTable initialization times out
- Solution: Add Connection Timeout=120;Command Timeout=120; to Orleans AdoNetGrainReminderTable and GrainPersistence.AdoNet connection strings in test configuration files
NServiceBus Saga Table Creation Timeouts¶
Symptoms¶
- Error: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
- Error occurs during TestInitializeAsync when initializing NServiceBus
- Stack trace shows: NServiceBus.Persistence.Sql.ScriptRunner.InstallSagas
- Saga table creation script fails to complete
- Error happens specifically on self-hosted Linux agents
- Works on Microsoft-hosted agents but fails on self-hosted
Root Cause¶
NServiceBus saga table creation involves executing SQL scripts that:
1. Create tables with multiple columns
2. Add correlation properties
3. Create indexes
4. Verify column types
5. Purge obsolete indexes and properties
On resource-constrained self-hosted agents, these operations can take longer than the default 30-second timeout, especially when:
- SQL Server is running in a container
- The agent has limited CPU/memory
- Multiple tests are running concurrently
- SQL Server is still initializing
Solutions¶
Add Connection and Command Timeouts to NServiceBus Connection Strings¶
Update test configuration files (appsettings.Development.Docker.json, appsettings.RateLimitTests.json, etc.):
{
  "NServiceBus": {
    "SqlServerTransport": {
      "ConnectionString": "Server=localhost,1433;Database=BASETEMPLATE_NSERVICEBUS_DATABASE;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=120;Command Timeout=120;"
    },
    "SqlServerPersistence": {
      "ConnectionString": "Server=localhost,1433;Database=BASETEMPLATE_NSERVICEBUS_DATABASE;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=120;Command Timeout=120;"
    }
  }
}
Key additions:
- Connection Timeout=120; - Allows 120 seconds to establish connection
- Command Timeout=120; - Allows 120 seconds for each SQL command to execute
Increase SQL Server Resources¶
If timeouts persist even with 120-second timeout, increase SQL Server resources:
- container: mssql
  image: mcr.microsoft.com/mssql/server:2025-latest
  options: --name mssql --hostname mssql
  ports:
    - 1433:1433
  env:
    SA_PASSWORD: "Password@123"
    ACCEPT_EULA: "Y"
    MSSQL_PID: "Express"
    MSSQL_MEMORY_LIMIT_MB: "4096"  # Increase from 2GB to 4GB if the agent has the resources
Wait for SQL Server Before Tests¶
Add a wait step in pipeline before tests start:
- script: |
    echo "Waiting for SQL Server to be ready for NServiceBus..."
    for i in {1..30}; do
      if docker exec mssql /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P "Password@123" -Q "SELECT 1" -C > /dev/null 2>&1; then
        echo "SQL Server is ready!"
        # Additional check: ensure SQL Server can execute complex queries
        docker exec mssql /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P "Password@123" -Q "SELECT COUNT(*) FROM sys.tables" -C > /dev/null 2>&1
        exit 0
      fi
      echo "Waiting for SQL Server... (attempt $i/30)"
      sleep 2
    done
    echo "SQL Server did not become ready in time"
    exit 1
  displayName: 'Wait for SQL Server to be ready for NServiceBus'
Prevention¶
Best Practices:
- Always set timeouts - use Connection Timeout=120;Command Timeout=120; for NServiceBus connection strings in test configurations
- Wait for SQL Server - add an explicit wait step in the pipeline before tests start
- Monitor resource usage - ensure the agent has sufficient CPU/memory for SQL Server
- Use an appropriate SQL Server edition - SQL Server Express may have limitations; consider Developer edition if needed
Common Scenarios¶
Scenario 1: Saga table creation times out on first test run
- Solution: Increase timeouts to 120 seconds and ensure SQL Server is fully ready
Scenario 2: Timeout occurs intermittently
- Solution: Check agent resource usage; you may need to reduce concurrent tests or increase agent resources
Scenario 3: Works on Microsoft-hosted but fails on self-hosted
- Solution: Self-hosted agents may have fewer resources; increase timeouts and ensure SQL Server has adequate memory
Orleans AdoNetReminderTable Initialization Timeouts¶
Symptoms¶
- Error: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
- Error occurs during TestInitializeAsync when initializing Orleans
- Stack trace shows: OrleansExtensions.ConfigureAdoNetReminderService or SqlServerDatabaseHelper.CreateIfNotExists
- Orleans reminder table creation fails
- Error happens specifically on self-hosted Linux agents
- Works on Microsoft-hosted agents but fails on self-hosted
Root Cause¶
Orleans AdoNetReminderTable initialization involves:
1. Creating the database if it doesn't exist (CreateIfNotExists)
2. Executing Orleans SQL scripts (SQLServer-Main.sql and SQLServer-Reminders.sql)
3. Creating reminder tables and indexes
On resource-constrained self-hosted agents, these operations can take longer than the default 30-second timeout, especially when:
- SQL Server is running in a container
- The agent has limited CPU/memory
- Multiple tests are running concurrently
- SQL Server is still initializing
Solutions¶
Add Connection and Command Timeouts to Orleans Connection Strings¶
Update test configuration files (appsettings.Development.Docker.json, appsettings.RateLimitTests.json, etc.):
{
  "Orleans": {
    "GrainPersistence": {
      "AdoNet": {
        "ConnectionString": "Server=localhost,1433;Database=BASETEMPLATE_ORLEANS_DATABASE;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=120;Command Timeout=120;"
      }
    },
    "AdoNetGrainReminderTable": {
      "ConnectionString": "Server=localhost,1433;Database=BASETEMPLATE_ORLEANS_DATABASE;User Id=sa;Password=Password@123;MultipleActiveResultSets=true;Encrypt=false;TrustServerCertificate=true;Connection Timeout=120;Command Timeout=120;"
    }
  }
}
Key additions:
- Connection Timeout=120; - Allows 120 seconds to establish connection
- Command Timeout=120; - Allows 120 seconds for each SQL command to execute
Increase SQL Server Resources¶
If timeouts persist even with 120-second timeout, increase SQL Server resources (same as NServiceBus solution above).
Wait for SQL Server Before Tests¶
Add a wait step in pipeline before tests start (same as NServiceBus solution above).
Prevention¶
Best Practices:
- Always set timeouts - use Connection Timeout=120;Command Timeout=120; for Orleans connection strings in test configurations
- Wait for SQL Server - add an explicit wait step in the pipeline before tests start
- Monitor resource usage - ensure the agent has sufficient CPU/memory for SQL Server
- Use an appropriate SQL Server edition - SQL Server Express may have limitations; consider Developer edition if needed
Common Scenarios¶
Scenario 1: Orleans reminder table creation times out on first test run
- Solution: Increase timeouts to 120 seconds and ensure SQL Server is fully ready
Scenario 2: Timeout occurs intermittently
- Solution: Check agent resource usage; you may need to reduce concurrent tests or increase agent resources
Scenario 3: Works on Microsoft-hosted but fails on self-hosted
- Solution: Self-hosted agents may have fewer resources; increase timeouts and ensure SQL Server has adequate memory
Ollama 500 Internal Server Error¶
Symptoms¶
- Error: Response status code does not indicate success: 500 (Internal Server Error)
- Error occurs when calling the Ollama API for chat completions or tool invocation
- Stack trace shows: OllamaSharp.OllamaApiClient.ChatAsync or GetStreamingResponseAsync
- Test fails during Ollama chat completion or tool invocation
Root Cause¶
Ollama typically returns 500 errors when:
1. Model not loaded: The specified model is not available or not loaded in memory
2. Insufficient memory: The model is too large for available system memory
3. Model name mismatch: The model name in configuration doesn't match the actual model name
4. Ollama service issues: The Ollama service is having internal problems
Solutions¶
Verify Model is Available¶
On the agent server, check available models:
# List all installed models
ollama list
# Expected output should include your model:
# NAME ID SIZE MODIFIED
# mistral:7b-instruct 6577803aa9a0 4.4 GB 13 days ago
If model is missing, pull it:
# Pull the model (this may take several minutes)
ollama pull mistral:7b-instruct
# Verify it's available
ollama list | grep mistral
Verify Model Name Format¶
Check the exact model name format:
# List models with exact names
ollama list
# Test the model directly
ollama run mistral:7b-instruct "Hello"
Common model name formats:
- mistral:7b-instruct (with tag)
- mistral (without tag, uses default)
- mistral:7b (shorter tag)
Update configuration if the model name doesn't match:
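The exact configuration shape depends on your application; as an illustrative sketch only (the section and key names below are assumptions, not the project's actual settings), an appsettings-style entry might look like:

```json
{
  "Ollama": {
    "Endpoint": "http://localhost:11434",
    "Model": "mistral:7b-instruct"
  }
}
```

Whatever the key is called in your configuration, its value must match a name shown by ollama list exactly.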
Note: The default model qwen3:0.6b (~522 MB) is recommended for basic chat completions. For tool invocation, use mistral:7b-instruct (~4.4 GB) but ensure you have 6-8 GB free RAM.
Check Ollama Service Status and Logs¶
Check service status:
# Check if Ollama is running
sudo systemctl status ollama
# Check recent logs for errors (this is critical for diagnosing 500 errors)
sudo journalctl -u ollama -n 100 --no-pager | tail -50
# Check for specific error patterns
sudo journalctl -u ollama -n 100 --no-pager | grep -i "error\|fail\|500\|memory\|timeout"
Common log errors:
- out of memory - Model too large for available RAM
- model not found - Model name incorrect or not pulled
- context length exceeded - Request too long for model
- failed to load model - Model file corrupted or incomplete
- connection refused - Ollama service not running or port blocked
To see real-time logs during test execution:
# Watch Ollama logs in real-time (run this in a separate terminal during tests)
sudo journalctl -u ollama -f
Check Available Memory¶
Verify system has enough memory for the model:
# Check available memory
free -h
# Check memory usage
top -bn1 | head -20
# For mistral:7b-instruct, you need at least 6-8 GB free RAM
If memory is insufficient:
- Use a smaller model (e.g., qwen3:0.6b for basic chat, but it doesn't support tool invocation)
- Increase server memory
- Stop other services to free memory
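As a rough sanity check against the 1.5x-model-size rule of thumb used in this guide, the comparison can be scripted. The function below is an illustrative sketch; read the real model size from ollama list:

```shell
#!/usr/bin/env bash
# Illustrative sketch: warn if available memory looks too low for a model.
# Rule of thumb from this guide: keep roughly 1.5x the model size free.
check_memory_for_model() {
  local model_size_gb=$1   # e.g. 4.4 for mistral:7b-instruct (from `ollama list`)
  local free_gb
  # MemAvailable is reported in kB; convert to GB
  free_gb=$(awk '/MemAvailable/ {printf "%.1f", $2/1024/1024}' /proc/meminfo)
  # bash has no float arithmetic, so compare in awk
  if awk -v f="$free_gb" -v m="$model_size_gb" 'BEGIN {exit !(f >= m * 1.5)}'; then
    echo "OK: ${free_gb} GB free for a ${model_size_gb} GB model"
  else
    echo "WARNING: only ${free_gb} GB free; a ${model_size_gb} GB model may fail with 500 errors"
  fi
}
```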
Test Ollama API Directly¶
Test the API endpoint:
# Test API is responding
curl http://localhost:11434/api/tags
# Test specific model
curl http://localhost:11434/api/generate -d '{
"model": "mistral:7b-instruct",
"prompt": "Say hello",
"stream": false
}'
If API test fails:
- Check Ollama service is running: sudo systemctl start ollama
- Check firewall/port access: netstat -tlnp | grep 11434
- Verify endpoint in config matches actual endpoint
Pre-load the Model¶
Models need to be loaded into memory before use. Pre-load to avoid 500 errors:
# Pre-load the model (this loads it into memory)
ollama run qwen3:0.6b "test"
# Or use the API to pre-load
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:0.6b",
"prompt": "test",
"stream": false
}'
# Verify model is loaded (check memory usage)
ollama ps
Why this helps:
- Models are lazy-loaded on first API request
- If memory is limited, loading during tests can cause 500 errors
- Pre-loading ensures models are ready when tests run
- For qwen3:0.6b (~522 MB), pre-loading is usually not necessary, but can help avoid delays
Restart Ollama Service¶
If issues persist, restart Ollama:
# Restart Ollama service
sudo systemctl restart ollama
# Wait a few seconds for service to start
sleep 5
# Verify it's running
sudo systemctl status ollama
# Test API again
curl http://localhost:11434/api/tags
# Pre-load models after restart (optional for qwen3:0.6b)
ollama run qwen3:0.6b "test"
Prevention¶
Best Practices:
- Verify model before tests - Run ollama list to confirm the model is available
- Use correct model name - Match exactly what ollama list shows
- Ensure sufficient memory - Have at least 1.5x the model size in free RAM
- Monitor Ollama logs - Check logs regularly for warnings or errors
- Test API directly - Use curl to test Ollama before running tests
Common Scenarios¶
Scenario 1: Model not found (500 error)
- Solution: Run ollama pull mistral:7b-instruct to download the model
Scenario 2: Out of memory (500 error)
- Solution: Free up memory or use a smaller model for basic chat (but tool invocation requires a larger model)
Scenario 3: Model name mismatch (500 error)
- Solution: Check ollama list and use exact model name from output
Scenario 4: Ollama service not running (connection refused)
- Solution: Start service with sudo systemctl start ollama
Getting Additional Help¶
Azure DevOps Resources¶
Hetzner Cloud Resources¶
Log Collection¶
When seeking help, collect:
- Agent logs (last 100 lines)
- Service status
- System resource usage
- Network connectivity test results
- Agent configuration (sanitized)
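The items above can be gathered into one file with a small script. The service unit names and output path are assumptions; sanitize the result (tokens, credentials) before sharing it:

```shell
#!/usr/bin/env bash
# Illustrative sketch: bundle common diagnostics into a single file
# to attach to a support request. Adjust unit names to your agent install.
OUT="agent-diagnostics-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "=== Service status ==="
  systemctl status 'vsts.agent.*.service' --no-pager
  echo "=== Recent agent logs (last 100 lines) ==="
  journalctl -u 'vsts.agent.*.service' -n 100 --no-pager
  echo "=== System resources ==="
  free -h
  df -h
  uptime
  echo "=== Network connectivity ==="
  curl -sI --max-time 5 https://dev.azure.com | head -5
} > "$OUT" 2>&1
echo "Diagnostics written to $OUT"
```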
Next Steps¶
- Review Maintenance Guide for preventive measures
- Set up monitoring to catch issues early
- Document your specific troubleshooting procedures