Self-Hosted Agents - Ollama Installation Guide

Overview

Ollama is a local LLM (Large Language Model) server that enables running AI models directly on your self-hosted Azure DevOps agents. This eliminates the need for external API calls and provides:

  • Cost Savings: No per-request API costs (free, local inference)
  • Privacy: All AI processing happens locally on your infrastructure
  • Performance: No network latency for AI operations
  • Reliability: No dependency on external AI service availability
  • Testing: Enables AI acceptance tests without external dependencies

Use Cases

  • AI Acceptance Tests: Run tests that use AI models without external API dependencies
  • Local Development: Test AI functionality during development
  • Cost-Effective AI: Avoid API costs for CI/CD pipelines
  • Privacy-Sensitive Workloads: Keep AI processing on-premises

Prerequisites

System Requirements

  • Operating System: Ubuntu 22.04 LTS (or compatible Linux distribution)
  • Disk Space:
      • Minimum: 10 GB free space
      • Recommended: 20+ GB for multiple models
      • Model sizes: ~1-4 GB per model
  • Memory:
      • Minimum: 8 GB RAM
      • Recommended: 16+ GB RAM for better performance
  • Network: Internet access for initial installation and model downloads
  • Permissions: Root or sudo access for installation

Required Models

For ConnectSoft.BaseTemplate acceptance tests:

  • Qwen3-0.6B-GGUF: Chat completion model (~1-2 GB)
  • nomic-embed-text: Embedding model (~150-300 MB)

Installation

Follow these steps to install and configure Ollama on your self-hosted agent.

Step 1: Install Ollama

Install Ollama using the official installation script:

# Download and run official Ollama installer
curl -fsSL https://ollama.com/install.sh | sh

The installer will:

  • Download and install the Ollama binary
  • Create the ollama system user
  • Set up the systemd service
  • Configure Ollama to run on port 11434

Troubleshooting: If the installation fails, ensure you have:

  • Internet connectivity
  • Root or sudo access
  • Sufficient disk space (10+ GB recommended)
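
If you want to confirm these prerequisites up front (for example while provisioning an agent), a minimal pre-flight check might look like the sketch below; the 10 GB threshold and the check against the root filesystem are assumptions to adjust for your environment:

# Pre-flight check before installing Ollama (sketch; thresholds are assumptions)
set -e

# Require passwordless sudo (typical on build agents)
sudo -n true 2>/dev/null || { echo "ERROR: sudo access required"; exit 1; }

# Require at least 10 GB free on the root filesystem
free_gb=$(df --output=avail -BG / | tail -1 | tr -dc '0-9')
[ "$free_gb" -ge 10 ] || { echo "ERROR: need 10+ GB free, have ${free_gb} GB"; exit 1; }

# Require connectivity to ollama.com
curl -fsI https://ollama.com > /dev/null || { echo "ERROR: cannot reach ollama.com"; exit 1; }

echo "Pre-flight checks passed"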

Step 2: Start Ollama Service

After installation, start and enable the Ollama service:

# Start Ollama service
sudo systemctl start ollama

# Enable auto-start on boot
sudo systemctl enable ollama

# Verify service is running
sudo systemctl status ollama

Expected output: Service should show as active (running)

If the service fails to start:

# Check service logs
sudo journalctl -u ollama -n 50 --no-pager

# Verify port is not already in use (ss is preinstalled; netstat requires net-tools)
sudo ss -tulpn | grep 11434

Step 3: Wait for Service to be Ready

The Ollama service may take a few seconds to fully start. Wait for the API to be accessible:

# Wait for API to be ready (check every 2 seconds, up to 30 seconds)
for i in {1..15}; do
    if curl -s http://localhost:11434/api/tags > /dev/null 2>&1; then
        echo "Ollama API is ready"
        break
    fi
    echo "Waiting for Ollama API... ($i/15)"
    sleep 2
done
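
On a build agent it is usually better for the provisioning step to fail outright if the API never comes up. A minimal variant of the loop above with the same 30-second budget, exiting non-zero on timeout:

# Same readiness wait, but fail the step if the API never comes up
ready=false
for i in {1..15}; do
    if curl -s http://localhost:11434/api/tags > /dev/null 2>&1; then
        ready=true
        break
    fi
    sleep 2
done

if [ "$ready" != "true" ]; then
    echo "ERROR: Ollama API did not become ready within 30 seconds"
    sudo journalctl -u ollama -n 50 --no-pager
    exit 1
fi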

Step 4: Pull Required Models

Pull the models required for ConnectSoft.BaseTemplate acceptance tests:

# Pull chat completion model (used for AI chat completions)
ollama pull Qwen3-0.6B-GGUF

# Pull embedding model (used for text embeddings)
ollama pull nomic-embed-text

Note:

  • Model downloads may take several minutes depending on your internet connection
  • The first model pull is typically the slowest
  • Models are cached locally after download (~1-4 GB total)
  • You can monitor download progress in the terminal

Troubleshooting model pulls:

# If a model pull fails, retry:
ollama pull Qwen3-0.6B-GGUF

# Check available disk space
df -h

# Verify internet connectivity
ping -c 3 ollama.com
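
To automate these retries during agent provisioning, a sketch that retries each pull a few times before giving up; the retry count and delay are arbitrary choices:

# Retry each required model pull a few times before failing (sketch)
for model in Qwen3-0.6B-GGUF nomic-embed-text; do
    for attempt in 1 2 3; do
        ollama pull "$model" && break
        echo "Pull of $model failed (attempt $attempt/3), retrying in 10s..."
        sleep 10
    done
    # Give up if the model is still missing after the retries
    ollama list | grep -qi "$model" || { echo "ERROR: could not pull $model"; exit 1; }
done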

Step 5: Verify Installation

Verify that Ollama is installed correctly and models are available:

# List installed models
ollama list

# Test API connectivity
curl http://localhost:11434/api/tags

# Test model availability (should return model list)
curl http://localhost:11434/api/tags | grep -i qwen
curl http://localhost:11434/api/tags | grep -i nomic

Expected output from ollama list:

NAME                  ID              SIZE    MODIFIED
Qwen3-0.6B-GGUF      abc123...        1.2GB   2 hours ago
nomic-embed-text      def456...        274MB   2 hours ago

Test API with a simple request:

# Test chat completion (simple test)
curl http://localhost:11434/api/generate -d '{
  "model": "Qwen3-0.6B-GGUF",
  "prompt": "Hello",
  "stream": false
}'

If all checks pass, Ollama is ready to use!

Service Configuration

Systemd Service

Ollama runs as a systemd service named ollama. The service is automatically configured by the installer.

Service Management:

# Check service status
sudo systemctl status ollama

# Start service
sudo systemctl start ollama

# Stop service
sudo systemctl stop ollama

# Restart service
sudo systemctl restart ollama

# View service logs
sudo journalctl -u ollama -f

# Check if service is enabled (auto-start on boot)
sudo systemctl is-enabled ollama

Configuration File

Ollama configuration is stored in /etc/systemd/system/ollama.service. The default configuration:

  • User: ollama (dedicated system user)
  • Port: 11434 (default HTTP port)
  • Data Directory: /usr/share/ollama/.ollama (models and data)

Port Configuration

Ollama listens on port 11434 by default. To change the port:

  1. Edit service file: sudo systemctl edit ollama
  2. Add override:
    [Service]
    Environment="OLLAMA_HOST=0.0.0.0:11435"
    
  3. Reload and restart: sudo systemctl daemon-reload && sudo systemctl restart ollama

Note: If changing the port, update application configuration accordingly.

Verification

Test API Endpoint

# Check API is accessible
curl http://localhost:11434/api/tags

# Test chat completion (simple test)
curl http://localhost:11434/api/generate -d '{
  "model": "Qwen3-0.6B-GGUF",
  "prompt": "Hello, how are you?",
  "stream": false
}'

Expected response: JSON with generated text

Test from Application

Configure your application to use Ollama. Update appsettings.json:

{
  "MicrosoftExtensionsAI": {
    "ChatCompletionProvider": "Ollama",
    "Ollama": {
      "Endpoint": "http://localhost:11434",
      "Model": "Qwen3-0.6B-GGUF"
    },
    "OllamaEmbedding": {
      "Endpoint": "http://localhost:11434",
      "Model": "nomic-embed-text"
    }
  }
}

For acceptance tests on agents, ensure the endpoint is http://localhost:11434 (not http://127.0.0.1:1234/, which is used for local development).

Then run a simple test to verify connectivity from your application.
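
One simple way to do that from the agent itself is to call both endpoints the configuration above points at and confirm each returns successfully (a sketch; it only checks for an HTTP success response, not output quality):

# Check the chat endpoint referenced by "Ollama" in appsettings.json
curl -fs http://localhost:11434/api/generate -d '{
  "model": "Qwen3-0.6B-GGUF",
  "prompt": "ping",
  "stream": false
}' > /dev/null && echo "Chat endpoint OK"

# Check the embedding endpoint referenced by "OllamaEmbedding" in appsettings.json
curl -fs http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "ping"
}' > /dev/null && echo "Embedding endpoint OK"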

Running Ollama

Start Ollama Service

Ollama runs as a systemd service and should start automatically on boot:

# Start service (if not running)
sudo systemctl start ollama

# Check status
sudo systemctl status ollama

# View logs
sudo journalctl -u ollama -f

Using Ollama CLI

The ollama command-line tool is available after installation:

# List installed models
ollama list

# Run a model interactively
ollama run Qwen3-0.6B-GGUF

# Run a one-off command
ollama run Qwen3-0.6B-GGUF "What is 2+2?"

# Show model information
ollama show Qwen3-0.6B-GGUF

# Remove a model (to free space)
ollama rm <model-name>

Using Ollama API

Ollama provides a REST API on port 11434:

# List available models
curl http://localhost:11434/api/tags

# Generate text
curl http://localhost:11434/api/generate -d '{
  "model": "Qwen3-0.6B-GGUF",
  "prompt": "Explain AI in one sentence",
  "stream": false
}'

# Create embeddings
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Hello world"
}'
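
When scripting against the API it is often convenient to extract just the generated text or the embedding from these responses. A sketch using jq (install it with sudo apt install -y jq if it is missing); the response field is returned by /api/generate when streaming is disabled, and /api/embeddings returns an embedding array:

# Extract only the generated text from a non-streaming /api/generate response
curl -s http://localhost:11434/api/generate -d '{
  "model": "Qwen3-0.6B-GGUF",
  "prompt": "Explain AI in one sentence",
  "stream": false
}' | jq -r '.response'

# Inspect the size of an embedding vector from /api/embeddings
curl -s http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Hello world"
}' | jq '.embedding | length'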

Service Management

# Start service
sudo systemctl start ollama

# Stop service
sudo systemctl stop ollama

# Restart service
sudo systemctl restart ollama

# Check if enabled (auto-start on boot)
sudo systemctl is-enabled ollama

# Enable auto-start
sudo systemctl enable ollama

# Disable auto-start
sudo systemctl disable ollama

Integration with Azure DevOps Pipelines

Configuration for Tests

For acceptance tests running on self-hosted agents, configure Ollama endpoint:

appsettings.json (for agent):

{
  "MicrosoftExtensionsAI": {
    "ChatCompletionProvider": "Ollama",
    "Ollama": {
      "Endpoint": "http://localhost:11434",
      "Model": "Qwen3-0.6B-GGUF"
    },
    "OllamaEmbedding": {
      "Endpoint": "http://localhost:11434",
      "Model": "nomic-embed-text"
    }
  }
}

Pipeline Considerations

  • Service Availability: Ensure the Ollama service is running before tests start (a pre-test check sketch follows this list)
  • Model Availability: Models must be pulled before first use
  • Resource Usage: AI inference can be CPU/memory intensive
  • Timeout Settings: AI operations may take longer than typical API calls
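
One way to enforce the first two points is a script step that runs before the test task and fails fast when the agent is not ready. A sketch (the model names match this guide; the ##[error] prefix is the Azure DevOps logging format for surfacing errors in the pipeline log):

# Pipeline pre-test check (sketch): fail fast if Ollama or the models are missing
set -e

systemctl is-active --quiet ollama || { echo "##[error]Ollama service is not running"; exit 1; }

curl -fs http://localhost:11434/api/tags > /dev/null || { echo "##[error]Ollama API is not reachable"; exit 1; }

for model in Qwen3-0.6B-GGUF nomic-embed-text; do
    ollama list | grep -qi "$model" || { echo "##[error]Model $model is not installed"; exit 1; }
done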

Maintenance

Regular Tasks

Check Service Status

# Weekly check
sudo systemctl status ollama

# View recent logs
sudo journalctl -u ollama -n 50 --no-pager

Monitor Disk Space

# Check Ollama data directory size
du -sh /usr/share/ollama/.ollama

# List models and sizes
ollama list

Update Models

# Pull latest version of a model (updates if newer version available)
ollama pull Qwen3-0.6B-GGUF

# Remove old/unused models to free space
ollama rm <model-name>
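
To refresh every installed model in one pass (for example from a scheduled maintenance job), a sketch that re-pulls whatever ollama list reports, assuming the first column of its output is the model name:

# Re-pull every installed model to pick up newer versions (sketch)
ollama list | awk 'NR > 1 {print $1}' | while read -r model; do
    echo "Updating $model..."
    ollama pull "$model"
done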

Update Ollama

# Update Ollama to latest version
curl -fsSL https://ollama.com/install.sh | sh

# Restart service after update
sudo systemctl restart ollama

Monitoring

Service Health

# Check if service is running
systemctl is-active ollama

# Check service uptime
systemctl show ollama --property=ActiveEnterTimestamp
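
If you want unattended recovery, these checks can be wrapped into a small health-check script run from root's crontab. This is only a sketch: it restarts the service whenever the API stops answering, and the log path is an arbitrary choice.

# Ollama health check (sketch): restart the service if the API stops answering
if ! curl -fs --max-time 5 http://localhost:11434/api/tags > /dev/null; then
    echo "$(date -Is) Ollama API not responding, restarting service" >> /var/log/ollama-healthcheck.log
    systemctl restart ollama
fi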

Model Usage

# List all models
ollama list

# Check model details
ollama show Qwen3-0.6B-GGUF

Resource Usage

# Monitor CPU and memory usage (pgrep -d',' handles multiple ollama processes)
top -p "$(pgrep -d',' ollama)"

# Or use htop for better visualization
htop -p "$(pgrep -d',' ollama)"

Troubleshooting

Service Not Running

Symptoms: systemctl status ollama shows inactive or failed

Solutions:

# Check service status
sudo systemctl status ollama

# View error logs
sudo journalctl -u ollama -n 100 --no-pager

# Try starting manually
sudo systemctl start ollama

# Check if port is already in use (ss is preinstalled; netstat requires net-tools)
sudo ss -tulpn | grep 11434

Models Not Found

Symptoms: ollama list shows no models or models missing

Solutions:

# Verify models are installed
ollama list

# Re-pull missing models
ollama pull Qwen3-0.6B-GGUF
ollama pull nomic-embed-text

# Check model storage location
ls -la /usr/share/ollama/.ollama/models

Connection Errors

Symptoms: Application cannot connect to Ollama API

Solutions:

# Verify service is running
sudo systemctl status ollama

# Test API endpoint
curl http://localhost:11434/api/tags

# Check firewall (if applicable)
sudo ufw status
sudo ufw allow 11434/tcp

# Verify endpoint in application config matches service
# Should be: http://localhost:11434

Performance Issues

Symptoms: Slow AI inference, high CPU usage

Solutions:

  1. Check System Resources:

    # Monitor CPU and memory
    htop
    
    # Check available memory
    free -h
    

  2. Consider GPU Acceleration (if available):
      • Install NVIDIA drivers and CUDA
      • Ollama will automatically use the GPU if available
      • Check GPU usage: nvidia-smi

  3. Use Smaller Models: Consider using smaller models for faster inference

  4. Increase System Resources: Upgrade the agent server if consistently resource-constrained

Port Conflicts

Symptoms: Service fails to start, port already in use

Solutions:

# Find process using port 11434
sudo lsof -i :11434

# Kill conflicting process (if safe to do so)
sudo kill <PID>

# Or change Ollama port (see Configuration section)

Permission Issues

Symptoms: Cannot access Ollama API, permission denied errors

Solutions:

# Verify ollama user exists
id ollama

# Check service user
sudo systemctl show ollama --property=User

# Verify data directory permissions
ls -la /usr/share/ollama/.ollama

Performance Optimization

GPU Acceleration (Optional)

For better performance, especially with larger models:

  1. Install NVIDIA Drivers:

    # Check if NVIDIA GPU is available
    lspci | grep -i nvidia
    
    # Install NVIDIA drivers (Ubuntu)
    sudo apt update
    sudo apt install -y nvidia-driver-535
    sudo reboot
    

  2. Install CUDA (if needed):

    # Ollama will use GPU automatically if CUDA is available
    # Check GPU usage
    nvidia-smi
    

  3. Verify GPU Usage:

    # Run a test and monitor GPU
    ollama run Qwen3-0.6B-GGUF "Hello"
    # In another terminal:
    watch -n 1 nvidia-smi
    

Resource Management

Memory Optimization:

  • Use smaller models when possible
  • Limit concurrent requests if memory is constrained (see the override sketch below)
  • Monitor memory usage: free -h
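
If memory pressure comes from several requests hitting the service at once, Ollama's scheduling can be constrained through environment variables on the systemd service (OLLAMA_NUM_PARALLEL, OLLAMA_MAX_LOADED_MODELS, and OLLAMA_KEEP_ALIVE; check the documentation for your Ollama version). The values below are examples only:

# Constrain concurrency via a systemd override (values are examples only)
sudo systemctl edit ollama

# In the override file, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"        # one request per model at a time
#   Environment="OLLAMA_MAX_LOADED_MODELS=2"   # keep at most two models loaded
#   Environment="OLLAMA_KEEP_ALIVE=5m"         # unload idle models after 5 minutes

# Apply the change
sudo systemctl daemon-reload && sudo systemctl restart ollama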

CPU Optimization:

  • Ensure adequate CPU cores (AI inference is CPU-intensive)
  • Consider CPU affinity for Ollama process
  • Monitor CPU usage: htop

Model Selection

For CI/CD pipelines, consider:

  • Smaller Models: Faster inference, less resource usage
  • Quantized Models: Reduced memory footprint
  • Task-Specific Models: Use specialized models for specific tasks

Quick Installation Summary

For a quick installation, follow these steps:

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Start and enable service
sudo systemctl start ollama
sudo systemctl enable ollama

# 3. Pull required models
ollama pull Qwen3-0.6B-GGUF
ollama pull nomic-embed-text

# 4. Verify installation
ollama list
curl http://localhost:11434/api/tags

For detailed step-by-step instructions with troubleshooting, see the Installation section above.

Quick Reference

Essential Commands

# Service management
sudo systemctl start ollama
sudo systemctl stop ollama
sudo systemctl restart ollama
sudo systemctl status ollama

# Model management
ollama list
ollama pull <model-name>
ollama rm <model-name>
ollama show <model-name>

# API testing
curl http://localhost:11434/api/tags
curl http://localhost:11434/api/generate -d '{"model": "Qwen3-0.6B-GGUF", "prompt": "test"}'

# Logs
sudo journalctl -u ollama -f

Configuration Locations

  • Service File: /etc/systemd/system/ollama.service
  • Data Directory: /usr/share/ollama/.ollama
  • Models: /usr/share/ollama/.ollama/models
  • Logs: journalctl -u ollama

Default Settings

  • Service name: ollama
  • API endpoint: http://localhost:11434 (port 11434)
  • Service user: ollama
  • Data directory: /usr/share/ollama/.ollama

Next Steps

After installing Ollama:

  1. Verify Installation: Run verification steps above
  2. Configure Application: Update appsettings.json to use Ollama
  3. Run Tests: Execute acceptance tests that use Ollama
  4. Monitor Performance: Watch resource usage during tests
  5. Optimize: Adjust models or resources based on needs

References