Self-Hosted Agents - Ollama Installation Guide¶
Overview¶
Ollama is a local LLM (Large Language Model) server that enables running AI models directly on your self-hosted Azure DevOps agents. This eliminates the need for external API calls and provides:
- Cost Savings: No per-request API costs (free, local inference)
- Privacy: All AI processing happens locally on your infrastructure
- Performance: No network latency for AI operations
- Reliability: No dependency on external AI service availability
- Testing: Enables AI acceptance tests without external dependencies
Use Cases¶
- AI Acceptance Tests: Run tests that use AI models without external API dependencies
- Local Development: Test AI functionality during development
- Cost-Effective AI: Avoid API costs for CI/CD pipelines
- Privacy-Sensitive Workloads: Keep AI processing on-premises
Prerequisites¶
System Requirements¶
- Operating System: Ubuntu 22.04 LTS (or compatible Linux distribution)
- Disk Space:
- Minimum: 10 GB free space
- Recommended: 20+ GB for multiple models
- Model sizes: ~1-4 GB per model
- Memory:
- Minimum: 8 GB RAM
- Recommended: 16+ GB RAM for better performance
- Network: Internet access for initial installation and model downloads
- Permissions: Root or sudo access for installation
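A quick way to confirm these prerequisites on the agent before installing (standard Linux commands; adjust the filesystem path if your agent uses a different layout):
# Free disk space (10+ GB recommended)
df -h /
# Installed memory (8 GB minimum, 16+ GB recommended)
free -h
# Outbound connectivity to ollama.com
curl -sI https://ollama.com | head -n 1
# Confirm sudo access
sudo -v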
Required Models¶
For ConnectSoft.BaseTemplate acceptance tests:
- Qwen3-0.6B-GGUF: Chat completion model (~1-2 GB)
- nomic-embed-text: Embedding model (~150-300 MB)
Installation¶
Follow these steps to install and configure Ollama on your self-hosted agent.
Step 1: Install Ollama¶
Install Ollama using the official installation script:
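# Download and run the official installer
curl -fsSL https://ollama.com/install.sh | sh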
The installer will:
- Download and install Ollama binary
- Create ollama system user
- Set up systemd service
- Configure Ollama to run on port 11434
Troubleshooting: If the installation fails, ensure you have:
- Internet connectivity
- Root or sudo access
- Sufficient disk space (10+ GB recommended)
Step 2: Start Ollama Service¶
After installation, start and enable the Ollama service:
# Start Ollama service
sudo systemctl start ollama
# Enable auto-start on boot
sudo systemctl enable ollama
# Verify service is running
sudo systemctl status ollama
Expected output: Service should show as active (running)
If the service fails to start:
# Check service logs
sudo journalctl -u ollama -n 50 --no-pager
# Verify port is not in use
sudo netstat -tulpn | grep 11434
Step 3: Wait for Service to be Ready¶
The Ollama service may take a few seconds to fully start. Wait for the API to be accessible:
# Wait for API to be ready (check every 2 seconds, up to 30 seconds)
for i in {1..15}; do
if curl -s http://localhost:11434/api/tags > /dev/null 2>&1; then
echo "Ollama API is ready"
break
fi
echo "Waiting for Ollama API... ($i/15)"
sleep 2
done
Step 4: Pull Required Models¶
Pull the models required for ConnectSoft.BaseTemplate acceptance tests:
# Pull chat completion model (used for AI chat completions)
ollama pull Qwen3-0.6B-GGUF
# Pull embedding model (used for text embeddings)
ollama pull nomic-embed-text
Note:
- Model downloads may take several minutes depending on your internet connection
- The first model pull is typically the slowest
- Models are cached locally after download (~1-4 GB total)
- You can monitor download progress in the terminal
Troubleshooting model pulls:
# If a model pull fails, retry:
ollama pull Qwen3-0.6B-GGUF
# Check available disk space
df -h
# Verify internet connectivity
ping -c 3 ollama.com
Step 5: Verify Installation¶
Verify that Ollama is installed correctly and models are available:
# List installed models
ollama list
# Test API connectivity
curl http://localhost:11434/api/tags
# Test model availability (should return model list)
curl http://localhost:11434/api/tags | grep -i qwen
curl http://localhost:11434/api/tags | grep -i nomic
Expected output from ollama list:
NAME ID SIZE MODIFIED
Qwen3-0.6B-GGUF abc123... 1.2GB 2 hours ago
nomic-embed-text def456... 274MB 2 hours ago
Test API with a simple request:
# Test chat completion (simple test)
curl http://localhost:11434/api/generate -d '{
"model": "Qwen3-0.6B-GGUF",
"prompt": "Hello",
"stream": false
}'
If all checks pass, Ollama is ready to use!
Service Configuration¶
Systemd Service¶
Ollama runs as a systemd service named ollama. The service is automatically configured by the installer.
Service Management:
# Check service status
sudo systemctl status ollama
# Start service
sudo systemctl start ollama
# Stop service
sudo systemctl stop ollama
# Restart service
sudo systemctl restart ollama
# View service logs
sudo journalctl -u ollama -f
# Check if service is enabled (auto-start on boot)
sudo systemctl is-enabled ollama
Configuration File¶
Ollama configuration is stored in /etc/systemd/system/ollama.service. The default configuration:
- User: ollama (dedicated system user)
- Port: 11434 (default HTTP port)
- Data Directory: /usr/share/ollama/.ollama (models and data)
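To inspect the unit file that systemd is actually using (including any drop-in overrides):
# Show the effective service definition
systemctl cat ollama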
Port Configuration¶
Ollama listens on port 11434 by default. To change the port:
- Edit the service file: sudo systemctl edit ollama
- Add an override that sets the new port (see the example below)
- Reload and restart: sudo systemctl daemon-reload && sudo systemctl restart ollama
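For example, to move the API to a different port, a minimal override might look like the following (Ollama reads its listen address from the OLLAMA_HOST environment variable; port 11435 is only an illustration):
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11435"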
Note: If changing the port, update application configuration accordingly.
Verification¶
Test API Endpoint¶
# Check API is accessible
curl http://localhost:11434/api/tags
# Test chat completion (simple test)
curl http://localhost:11434/api/generate -d '{
"model": "Qwen3-0.6B-GGUF",
"prompt": "Hello, how are you?",
"stream": false
}'
Expected response: JSON with generated text
Test from Application¶
Configure your application to use Ollama. Update appsettings.json:
{
"MicrosoftExtensionsAI": {
"ChatCompletionProvider": "Ollama",
"Ollama": {
"Endpoint": "http://localhost:11434",
"Model": "Qwen3-0.6B-GGUF"
},
"OllamaEmbedding": {
"Endpoint": "http://localhost:11434",
"Model": "nomic-embed-text"
}
}
}
For acceptance tests on agents, ensure the endpoint is http://localhost:11434 (not http://127.0.0.1:1234/, which is used for local development).
Then run a simple test to verify connectivity from your application.
Running Ollama¶
Start Ollama Service¶
Ollama runs as a systemd service and should start automatically on boot:
# Start service (if not running)
sudo systemctl start ollama
# Check status
sudo systemctl status ollama
# View logs
sudo journalctl -u ollama -f
Using Ollama CLI¶
The ollama command-line tool is available after installation:
# List installed models
ollama list
# Run a model interactively
ollama run Qwen3-0.6B-GGUF
# Run a one-off command
ollama run Qwen3-0.6B-GGUF "What is 2+2?"
# Show model information
ollama show Qwen3-0.6B-GGUF
# Remove a model (to free space)
ollama rm <model-name>
Using Ollama API¶
Ollama provides a REST API on port 11434:
# List available models
curl http://localhost:11434/api/tags
# Generate text
curl http://localhost:11434/api/generate -d '{
"model": "Qwen3-0.6B-GGUF",
"prompt": "Explain AI in one sentence",
"stream": false
}'
# Create embeddings
curl http://localhost:11434/api/embeddings -d '{
"model": "nomic-embed-text",
"prompt": "Hello world"
}'
Service Management¶
# Start service
sudo systemctl start ollama
# Stop service
sudo systemctl stop ollama
# Restart service
sudo systemctl restart ollama
# Check if enabled (auto-start on boot)
sudo systemctl is-enabled ollama
# Enable auto-start
sudo systemctl enable ollama
# Disable auto-start
sudo systemctl disable ollama
Integration with Azure DevOps Pipelines¶
Configuration for Tests¶
For acceptance tests running on self-hosted agents, configure Ollama endpoint:
appsettings.json (for agent):
{
"MicrosoftExtensionsAI": {
"ChatCompletionProvider": "Ollama",
"Ollama": {
"Endpoint": "http://localhost:11434",
"Model": "Qwen3-0.6B-GGUF"
},
"OllamaEmbedding": {
"Endpoint": "http://localhost:11434",
"Model": "nomic-embed-text"
}
}
}
Pipeline Considerations¶
- Service Availability: Ensure Ollama service is running before tests start
- Model Availability: Models must be pulled before first use
- Resource Usage: AI inference can be CPU/memory intensive
- Timeout Settings: AI operations may take longer than typical API calls
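As a hedged sketch, a pre-flight script like the one below could run as an early pipeline step so a job fails fast when Ollama or its models are missing (the model names match those used elsewhere in this guide; adapt the timeout to your agent):
#!/usr/bin/env bash
# Fail fast if Ollama is not ready on this agent
set -euo pipefail

# 1. Service must be active
systemctl is-active --quiet ollama || { echo "Ollama service is not running"; exit 1; }

# 2. API must respond within ~30 seconds
for i in {1..15}; do
  curl -s http://localhost:11434/api/tags > /dev/null && break
  sleep 2
  [ "$i" -eq 15 ] && { echo "Ollama API did not become ready"; exit 1; }
done

# 3. Required models must be present
for model in Qwen3-0.6B-GGUF nomic-embed-text; do
  ollama list | grep -qi "$model" || { echo "Missing model: $model"; exit 1; }
done

echo "Ollama pre-flight checks passed"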
Maintenance¶
Regular Tasks¶
Check Service Status¶
# Weekly check
sudo systemctl status ollama
# View recent logs
sudo journalctl -u ollama -n 50 --no-pager
Monitor Disk Space¶
# Check Ollama data directory size
du -sh /usr/share/ollama/.ollama
# List models and sizes
ollama list
Update Models¶
# Pull latest version of a model (updates if newer version available)
ollama pull Qwen3-0.6B-GGUF
# Remove old/unused models to free space
ollama rm <model-name>
Update Ollama¶
# Update Ollama to latest version
curl -fsSL https://ollama.com/install.sh | sh
# Restart service after update
sudo systemctl restart ollama
Monitoring¶
Service Health¶
# Check if service is running
systemctl is-active ollama
# Check service uptime
systemctl show ollama --property=ActiveEnterTimestamp
Model Usage¶
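To see which models are currently loaded into memory and what is available on disk (ollama ps reports running models and their memory use):
# Models currently loaded into memory
ollama ps
# All downloaded models and their on-disk sizes
ollama list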
Resource Usage¶
# Monitor CPU and memory usage (pgrep -d, joins multiple PIDs with commas)
top -p $(pgrep -d, ollama)
# Or use htop for better visualization
htop -p $(pgrep -d, ollama)
Troubleshooting¶
Service Not Running¶
Symptoms: systemctl status ollama shows inactive or failed
Solutions:
# Check service status
sudo systemctl status ollama
# View error logs
sudo journalctl -u ollama -n 100 --no-pager
# Try starting manually
sudo systemctl start ollama
# Check if port is already in use
sudo netstat -tulpn | grep 11434
Models Not Found¶
Symptoms: ollama list shows no models or models missing
Solutions:
# Verify models are installed
ollama list
# Re-pull missing models
ollama pull Qwen3-0.6B-GGUF
ollama pull nomic-embed-text
# Check model storage location
ls -la /usr/share/ollama/.ollama/models
Connection Errors¶
Symptoms: Application cannot connect to Ollama API
Solutions:
# Verify service is running
sudo systemctl status ollama
# Test API endpoint
curl http://localhost:11434/api/tags
# Check firewall (if applicable)
sudo ufw status
sudo ufw allow 11434/tcp
# Verify endpoint in application config matches service
# Should be: http://localhost:11434
Performance Issues¶
Symptoms: Slow AI inference, high CPU usage
Solutions:
- Check System Resources: Verify that the agent has CPU, memory, and disk headroom (see the commands after this list)
- Consider GPU Acceleration (if available): Install NVIDIA drivers and CUDA; Ollama will automatically use the GPU if available. Check GPU usage with nvidia-smi
- Use Smaller Models: Consider using smaller models for faster inference
- Increase System Resources: Upgrade the agent server if it is consistently resource-constrained
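A quick way to gather the resource numbers mentioned in the first item (standard Linux tools, consistent with the commands used elsewhere in this guide):
# Memory headroom
free -h
# Live CPU and memory usage (press q to quit)
top
# Disk space on the Ollama data volume
df -h /usr/share/ollama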
Port Conflicts¶
Symptoms: Service fails to start, port already in use
Solutions:
# Find process using port 11434
sudo lsof -i :11434
# Kill conflicting process (if safe to do so)
sudo kill <PID>
# Or change Ollama port (see Configuration section)
Permission Issues¶
Symptoms: Cannot access Ollama API, permission denied errors
Solutions:
# Verify ollama user exists
id ollama
# Check service user
sudo systemctl show ollama --property=User
# Verify data directory permissions
ls -la /usr/share/ollama/.ollama
Performance Optimization¶
GPU Acceleration (Optional)¶
For better performance, especially with larger models:
- Install NVIDIA Drivers (see the example commands after this list)
- Install CUDA (if needed)
- Verify GPU Usage: nvidia-smi
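One possible sequence on Ubuntu 22.04, shown only as a sketch (the exact driver and CUDA packages depend on your GPU and organizational policy, so treat these as placeholders):
# Install the recommended NVIDIA driver using Ubuntu's driver tooling
sudo ubuntu-drivers autoinstall
# Optionally install the CUDA toolkit from the Ubuntu repositories
sudo apt install -y nvidia-cuda-toolkit
# Reboot, then confirm the GPU is visible
nvidia-smi
# Restart Ollama so it detects the GPU
sudo systemctl restart ollama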
Resource Management¶
Memory Optimization:
- Use smaller models when possible
- Limit concurrent requests if memory is constrained
- Monitor memory usage: free -h
CPU Optimization:
- Ensure adequate CPU cores (AI inference is CPU-intensive)
- Consider CPU affinity for Ollama process
- Monitor CPU usage: htop
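The CPU affinity suggestion above can be made persistent with a systemd drop-in; this is only a sketch, and the core range 0-7 is an arbitrary example to adapt to your agent:
# Open a drop-in editor for the service
sudo systemctl edit ollama
# Add, for example:
#   [Service]
#   CPUAffinity=0-7
# Then apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama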
Model Selection¶
For CI/CD pipelines, consider:
- Smaller Models: Faster inference, less resource usage
- Quantized Models: Reduced memory footprint
- Task-Specific Models: Use specialized models for specific tasks
Quick Installation Summary¶
For a quick installation, follow these steps:
# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# 2. Start and enable service
sudo systemctl start ollama
sudo systemctl enable ollama
# 3. Pull required models
ollama pull Qwen3-0.6B-GGUF
ollama pull nomic-embed-text
# 4. Verify installation
ollama list
curl http://localhost:11434/api/tags
For detailed step-by-step instructions with troubleshooting, see the Installation section above.
Quick Reference¶
Essential Commands¶
# Service management
sudo systemctl start ollama
sudo systemctl stop ollama
sudo systemctl restart ollama
sudo systemctl status ollama
# Model management
ollama list
ollama pull <model-name>
ollama rm <model-name>
ollama show <model-name>
# API testing
curl http://localhost:11434/api/tags
curl http://localhost:11434/api/generate -d '{"model": "Qwen3-0.6B-GGUF", "prompt": "test"}'
# Logs
sudo journalctl -u ollama -f
Configuration Locations¶
- Service File: /etc/systemd/system/ollama.service
- Data Directory: /usr/share/ollama/.ollama
- Models: /usr/share/ollama/.ollama/models
- Logs: journalctl -u ollama
Default Settings¶
- Port: 11434
- User: ollama
- Host: 0.0.0.0 (listens on all interfaces)
- Endpoint: http://localhost:11434
Next Steps¶
After installing Ollama:
- Verify Installation: Run verification steps above
- Configure Application: Update appsettings.json to use Ollama
- Run Tests: Execute acceptance tests that use Ollama
- Monitor Performance: Watch resource usage during tests
- Optimize: Adjust models or resources based on needs
Related Documentation¶
- Linux Agent Setup - Complete Linux agent installation
- Agent Maintenance - Ongoing maintenance procedures
- Troubleshooting Guide - Common issues and solutions
- AI Services Documentation - Using AI in BaseTemplate