Self-Hosted Agents - Ollama Installation Guide¶
Overview¶
Ollama is a local LLM (Large Language Model) server that enables running AI models directly on your self-hosted Azure DevOps agents. This eliminates the need for external API calls and provides:
- Cost Savings: No per-request API costs (free, local inference)
- Privacy: All AI processing happens locally on your infrastructure
- Performance: No network latency for AI operations
- Reliability: No dependency on external AI service availability
- Testing: Enables AI acceptance tests without external dependencies
Use Cases¶
- AI Acceptance Tests: Run tests that use AI models without external API dependencies
- Local Development: Test AI functionality during development
- Cost-Effective AI: Avoid API costs for CI/CD pipelines
- Privacy-Sensitive Workloads: Keep AI processing on-premises
Prerequisites¶
System Requirements¶
- Operating System: Ubuntu 22.04 LTS (or compatible Linux distribution)
- Disk Space:
- Minimum: 10 GB free space
- Recommended: 20+ GB for multiple models
- Model sizes: ~1-4 GB per model
- Memory:
- Minimum: 8 GB RAM
- Recommended: 16+ GB RAM for better performance
- Network: Internet access for initial installation and model downloads
- Permissions: Root or sudo access for installation
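A quick way to confirm these prerequisites on the agent before installing (standard Linux commands; adjust the filesystem path if your agent uses a different layout):
# Free disk space (10+ GB recommended)
df -h /
# Installed memory (8 GB minimum, 16+ GB recommended)
free -h
# Outbound connectivity to ollama.com
curl -sI https://ollama.com | head -n 1
# Confirm sudo access
sudo -v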
Required Models¶
For ConnectSoft.BaseTemplate acceptance tests:
- Qwen3-0.6B-GGUF: Chat completion model (~1-2 GB)
- nomic-embed-text: Embedding model (~150-300 MB)
Installation¶
Follow these steps to install and configure Ollama on your self-hosted agent.
Step 1: Install Ollama¶
Install Ollama using the official installation script:
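# Download and run the official installer
curl -fsSL https://ollama.com/install.sh | sh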
The installer will:
- Download and install Ollama binary
- Create ollama system user
- Set up systemd service
- Configure Ollama to run on port 11434
Troubleshooting: If the installation fails, ensure you have:
- Internet connectivity
- Root or sudo access
- Sufficient disk space (10+ GB recommended)
Step 2: Start Ollama Service¶
After installation, start and enable the Ollama service:
# Start Ollama service
sudo systemctl start ollama
# Enable auto-start on boot
sudo systemctl enable ollama
# Verify service is running
sudo systemctl status ollama
Expected output: Service should show as active (running)
If the service fails to start:
# Check service logs
sudo journalctl -u ollama -n 50 --no-pager
# Verify port is not in use
sudo netstat -tulpn | grep 11434
Step 3: Wait for Service to be Ready¶
The Ollama service may take a few seconds to fully start. Wait for the API to be accessible:
# Wait for API to be ready (check every 2 seconds, up to 30 seconds)
for i in {1..15}; do
if curl -s http://localhost:11434/api/tags > /dev/null 2>&1; then
echo "Ollama API is ready"
break
fi
echo "Waiting for Ollama API... ($i/15)"
sleep 2
done
Step 4: Pull Required Models¶
Pull the models required for ConnectSoft.BaseTemplate acceptance tests:
# Pull chat completion model (used for AI chat completions)
ollama pull Qwen3-0.6B-GGUF
# Pull embedding model (used for text embeddings)
ollama pull nomic-embed-text
Note:
- Model downloads may take several minutes depending on your internet connection
- The first model pull is typically the slowest
- Models are cached locally after download (~1-4 GB total)
- You can monitor download progress in the terminal
Troubleshooting model pulls:
# If a model pull fails, retry:
ollama pull Qwen3-0.6B-GGUF
# Check available disk space
df -h
# Verify internet connectivity
ping -c 3 ollama.com
Step 5: Verify Installation¶
Verify that Ollama is installed correctly and models are available:
# List installed models
ollama list
# Test API connectivity
curl http://localhost:11434/api/tags
# Test model availability (should return model list)
curl http://localhost:11434/api/tags | grep -i qwen
curl http://localhost:11434/api/tags | grep -i nomic
Expected output from ollama list:
NAME ID SIZE MODIFIED
Qwen3-0.6B-GGUF abc123... 1.2GB 2 hours ago
nomic-embed-text def456... 274MB 2 hours ago
Test API with a simple request:
# Test chat completion (simple test)
curl http://localhost:11434/api/generate -d '{
"model": "Qwen3-0.6B-GGUF",
"prompt": "Hello",
"stream": false
}'
If all checks pass, Ollama is ready to use!
Service Configuration¶
Systemd Service¶
Ollama runs as a systemd service named ollama. The service is automatically configured by the installer.
Service Management:
# Check service status
sudo systemctl status ollama
# Start service
sudo systemctl start ollama
# Stop service
sudo systemctl stop ollama
# Restart service
sudo systemctl restart ollama
# View service logs
sudo journalctl -u ollama -f
# Check if service is enabled (auto-start on boot)
sudo systemctl is-enabled ollama
Configuration File¶
Ollama configuration is stored in /etc/systemd/system/ollama.service. The default configuration:
- User: ollama (dedicated system user)
- Port: 11434 (default HTTP port)
- Data Directory: /usr/share/ollama/.ollama (models and data)
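To inspect the unit file that systemd is actually using (including any drop-in overrides):
# Show the effective service definition
systemctl cat ollama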
Port Configuration¶
Ollama listens on port 11434 by default. To change the port:
- Edit the service file: sudo systemctl edit ollama
- Add an override that sets the new port (see the example below)
- Reload and restart: sudo systemctl daemon-reload && sudo systemctl restart ollama
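For example, to move the API to a different port, a minimal override might look like the following (Ollama reads its listen address from the OLLAMA_HOST environment variable; port 11435 is only an illustration):
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11435"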
Note: If changing the port, update application configuration accordingly.
Verification¶
Test API Endpoint¶
# Check API is accessible
curl http://localhost:11434/api/tags
# Test chat completion (simple test)
curl http://localhost:11434/api/generate -d '{
"model": "Qwen3-0.6B-GGUF",
"prompt": "Hello, how are you?",
"stream": false
}'
Expected response: JSON with generated text
Test from Application¶
Configure your application to use Ollama. Update appsettings.json:
{
"MicrosoftExtensionsAI": {
"ChatCompletionProvider": "Ollama",
"Ollama": {
"Endpoint": "http://localhost:11434",
"Model": "Qwen3-0.6B-GGUF"
},
"OllamaEmbedding": {
"Endpoint": "http://localhost:11434",
"Model": "nomic-embed-text"
}
}
}
For acceptance tests on agents, ensure the endpoint is http://localhost:11434 (not http://127.0.0.1:1234/, which is used for local development).
Then run a simple test to verify connectivity from your application.
Running Ollama¶
Start Ollama Service¶
Ollama runs as a systemd service and should start automatically on boot:
# Start service (if not running)
sudo systemctl start ollama
# Check status
sudo systemctl status ollama
# View logs
sudo journalctl -u ollama -f
Using Ollama CLI¶
The ollama command-line tool is available after installation:
# List installed models
ollama list
# Run a model interactively
ollama run Qwen3-0.6B-GGUF
# Run a one-off command
ollama run Qwen3-0.6B-GGUF "What is 2+2?"
# Show model information
ollama show Qwen3-0.6B-GGUF
# Remove a model (to free space)
ollama rm <model-name>
Using Ollama API¶
Ollama provides a REST API on port 11434:
# List available models
curl http://localhost:11434/api/tags
# Generate text
curl http://localhost:11434/api/generate -d '{
"model": "Qwen3-0.6B-GGUF",
"prompt": "Explain AI in one sentence",
"stream": false
}'
# Create embeddings
curl http://localhost:11434/api/embeddings -d '{
"model": "nomic-embed-text",
"prompt": "Hello world"
}'
Service Management¶
# Start service
sudo systemctl start ollama
# Stop service
sudo systemctl stop ollama
# Restart service
sudo systemctl restart ollama
# Check if enabled (auto-start on boot)
sudo systemctl is-enabled ollama
# Enable auto-start
sudo systemctl enable ollama
# Disable auto-start
sudo systemctl disable ollama
Integration with Azure DevOps Pipelines¶
Configuration for Tests¶
For acceptance tests running on self-hosted agents, configure Ollama endpoint:
appsettings.json (for agent):
{
"MicrosoftExtensionsAI": {
"ChatCompletionProvider": "Ollama",
"Ollama": {
"Endpoint": "http://localhost:11434",
"Model": "Qwen3-0.6B-GGUF"
},
"OllamaEmbedding": {
"Endpoint": "http://localhost:11434",
"Model": "nomic-embed-text"
}
}
}
Pipeline Considerations¶
- Service Availability: Ensure Ollama service is running before tests start
- Model Availability: Models must be pulled before first use
- Resource Usage: AI inference can be CPU/memory intensive
- Timeout Settings: AI operations may take longer than typical API calls
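As a hedged sketch, a pre-flight script like the one below could run as an early pipeline step so a job fails fast when Ollama or its models are missing (the model names match those used elsewhere in this guide; adapt the timeout to your agent):
#!/usr/bin/env bash
# Fail fast if Ollama is not ready on this agent
set -euo pipefail

# 1. Service must be active
systemctl is-active --quiet ollama || { echo "Ollama service is not running"; exit 1; }

# 2. API must respond within ~30 seconds
for i in {1..15}; do
  curl -s http://localhost:11434/api/tags > /dev/null && break
  sleep 2
  [ "$i" -eq 15 ] && { echo "Ollama API did not become ready"; exit 1; }
done

# 3. Required models must be present
for model in Qwen3-0.6B-GGUF nomic-embed-text; do
  ollama list | grep -qi "$model" || { echo "Missing model: $model"; exit 1; }
done

echo "Ollama pre-flight checks passed"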
Maintenance¶
Regular Tasks¶
Check Service Status¶
# Weekly check
sudo systemctl status ollama
# View recent logs
sudo journalctl -u ollama -n 50 --no-pager
Monitor Disk Space¶
# Check Ollama data directory size
du -sh /usr/share/ollama/.ollama
# List models and sizes
ollama list
Update Models¶
# Pull latest version of a model (updates if newer version available)
ollama pull Qwen3-0.6B-GGUF
# Remove old/unused models to free space
ollama rm <model-name>
Update Ollama¶
# Update Ollama to latest version
curl -fsSL https://ollama.com/install.sh | sh
# Restart service after update
sudo systemctl restart ollama
Monitoring¶
Service Health¶
# Check if service is running
systemctl is-active ollama
# Check service uptime
systemctl show ollama --property=ActiveEnterTimestamp
Model Usage¶
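To see which models are currently loaded into memory and what is available on disk (ollama ps reports running models and their memory use):
# Models currently loaded into memory
ollama ps
# All downloaded models and their on-disk sizes
ollama list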
Resource Usage¶
# Monitor CPU and memory usage (pgrep -d, joins multiple PIDs with commas)
top -p $(pgrep -d, ollama)
# Or use htop for better visualization
htop -p $(pgrep -d, ollama)
Troubleshooting¶
Service Not Running¶
Symptoms: systemctl status ollama shows inactive or failed
Solutions:
# Check service status
sudo systemctl status ollama
# View error logs
sudo journalctl -u ollama -n 100 --no-pager
# Try starting manually
sudo systemctl start ollama
# Check if port is already in use
sudo netstat -tulpn | grep 11434
Models Not Found¶
Symptoms: ollama list shows no models or models missing
Solutions:
# Verify models are installed
ollama list
# Re-pull missing models
ollama pull Qwen3-0.6B-GGUF
ollama pull nomic-embed-text
# Check model storage location
ls -la /usr/share/ollama/.ollama/models
Connection Errors¶
Symptoms: Application cannot connect to Ollama API
Solutions:
# Verify service is running
sudo systemctl status ollama
# Test API endpoint
curl http://localhost:11434/api/tags
# Check firewall (if applicable)
sudo ufw status
sudo ufw allow 11434/tcp
# Verify endpoint in application config matches service
# Should be: http://localhost:11434
Performance Issues¶
Symptoms: Slow AI inference, high CPU usage
Solutions:
- Check System Resources: Verify that the agent has CPU, memory, and disk headroom (see the commands after this list)
- Consider GPU Acceleration (if available): Install NVIDIA drivers and CUDA; Ollama will automatically use the GPU if available. Check GPU usage with nvidia-smi
- Use Smaller Models: Consider using smaller models for faster inference
- Increase System Resources: Upgrade the agent server if it is consistently resource-constrained
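A quick way to gather the resource numbers mentioned in the first item (standard Linux tools, consistent with the commands used elsewhere in this guide):
# Memory headroom
free -h
# Live CPU and memory usage (press q to quit)
top
# Disk space on the Ollama data volume
df -h /usr/share/ollama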
Port Conflicts¶
Symptoms: Service fails to start, port already in use
Solutions:
# Find process using port 11434
sudo lsof -i :11434
# Kill conflicting process (if safe to do so)
sudo kill <PID>
# Or change Ollama port (see Configuration section)
Permission Issues¶
Symptoms: Cannot access Ollama API, permission denied errors
Solutions:
# Verify ollama user exists
id ollama
# Check service user
sudo systemctl show ollama --property=User
# Verify data directory permissions
ls -la /usr/share/ollama/.ollama
Performance Optimization¶
GPU Acceleration (Optional)¶
For better performance, especially with larger models:
- Install NVIDIA Drivers (see the example commands after this list)
- Install CUDA (if needed)
- Verify GPU Usage: nvidia-smi
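One possible sequence on Ubuntu 22.04, shown only as a sketch (the exact driver and CUDA packages depend on your GPU and organizational policy, so treat these as placeholders):
# Install the recommended NVIDIA driver using Ubuntu's driver tooling
sudo ubuntu-drivers autoinstall
# Optionally install the CUDA toolkit from the Ubuntu repositories
sudo apt install -y nvidia-cuda-toolkit
# Reboot, then confirm the GPU is visible
nvidia-smi
# Restart Ollama so it detects the GPU
sudo systemctl restart ollama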
Resource Management¶
Memory Optimization:
- Use smaller models when possible
- Limit concurrent requests if memory is constrained
- Monitor memory usage: free -h
CPU Optimization:
- Ensure adequate CPU cores (AI inference is CPU-intensive)
- Consider CPU affinity for Ollama process
- Monitor CPU usage: htop
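The CPU affinity suggestion above can be made persistent with a systemd drop-in; this is only a sketch, and the core range 0-7 is an arbitrary example to adapt to your agent:
# Open a drop-in editor for the service
sudo systemctl edit ollama
# Add, for example:
#   [Service]
#   CPUAffinity=0-7
# Then apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama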
Model Selection¶
For CI/CD pipelines, consider:
- Smaller Models: Faster inference, less resource usage
- Quantized Models: Reduced memory footprint
- Task-Specific Models: Use specialized models for specific tasks
Quick Installation Summary¶
For a quick installation, follow these steps:
# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# 2. Start and enable service
sudo systemctl start ollama
sudo systemctl enable ollama
# 3. Pull required models
ollama pull Qwen3-0.6B-GGUF
ollama pull nomic-embed-text
# 4. Verify installation
ollama list
curl http://localhost:11434/api/tags
For detailed step-by-step instructions with troubleshooting, see the Installation section above.
Quick Reference¶
Essential Commands¶
# Service management
sudo systemctl start ollama
sudo systemctl stop ollama
sudo systemctl restart ollama
sudo systemctl status ollama
# Model management
ollama list
ollama pull <model-name>
ollama rm <model-name>
ollama show <model-name>
# API testing
curl http://localhost:11434/api/tags
curl http://localhost:11434/api/generate -d '{"model": "Qwen3-0.6B-GGUF", "prompt": "test"}'
# Logs
sudo journalctl -u ollama -f
Configuration Locations¶
- Service File: /etc/systemd/system/ollama.service
- Data Directory: /usr/share/ollama/.ollama
- Models: /usr/share/ollama/.ollama/models
- Logs: journalctl -u ollama
Default Settings¶
- Port: 11434
- User: ollama
- Host: 0.0.0.0 (listens on all interfaces)
- Endpoint: http://localhost:11434
Next Steps¶
After installing Ollama:
- Verify Installation: Run verification steps above
- Configure Application: Update appsettings.json to use Ollama
- Run Tests: Execute acceptance tests that use Ollama
- Monitor Performance: Watch resource usage during tests
- Optimize: Adjust models or resources based on needs
Related Documentation¶
- Linux Agent Setup - Complete Linux agent installation
- Agent Maintenance - Ongoing maintenance procedures
- Troubleshooting Guide - Common issues and solutions
- AI Services Documentation - Using AI in BaseTemplate