Docker Desktop 4.40 AI Model Runner: Complete Setup Guide

Docker Desktop 4.40 introduced a game-changing feature for developers working with AI models: the Docker Model Runner. Released on March 31, 2025, this beta feature allows macOS users with Apple Silicon to pull, run, and manage AI models directly from Docker Hub within Docker Desktop, eliminating the complexity of traditional AI model deployment.
This comprehensive guide will walk you through everything you need to know about setting up and using Docker's AI Model Runner, from installation to advanced use cases.
Understanding Docker Model Runner
Docker Model Runner bridges the gap between AI model development and containerized deployment. Instead of dealing with complex Python environments, CUDA installations, or cloud API dependencies, you can now run AI models as simply as running any Docker container.
The Model Runner supports popular model formats and provides a standardized OpenAI API interface, making it compatible with existing applications that already integrate with OpenAI's services. Each AI model is packaged as an OCI (Open Container Initiative) Artifact, allowing you to leverage your existing CI/CD workflows for automation and access control.
Currently available in beta for macOS with Apple Silicon, the feature takes advantage of GPU acceleration for improved performance while maintaining the isolation and portability that Docker containers provide.
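Because the interface mirrors OpenAI's API, any OpenAI-style HTTP client can talk to a locally running model with no extra SDK. As a minimal sketch (assuming a model container is already running and serving the API on port 8080, as in the examples later in this guide), you can list the models the local endpoint exposes:
import json
import urllib.request

# Assumes a model container is already running and serving the
# OpenAI-compatible API on localhost:8080, as shown later in this guide.
with urllib.request.urlopen("http://localhost:8080/v1/models") as resp:
    catalog = json.load(resp)

for entry in catalog.get("data", []):
    print(entry.get("id"))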
Prerequisites and System Requirements
Before diving into the setup process, ensure your system meets the following requirements (a quick verification script follows the lists):
Hardware Requirements:
- macOS device with Apple Silicon (M1, M2, M3, or M4 chip)
- Minimum 16GB RAM (32GB recommended for larger models)
- At least 50GB free disk space for model storage
Software Requirements:
- Docker Desktop 4.40 or later
- macOS 12.0 (Monterey) or newer
- Admin privileges for Docker Desktop installation
Network Requirements:
- Stable internet connection for downloading models
- Docker Hub account (free tier sufficient)
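If you want to sanity-check the hardware requirements from a terminal, a small script like the one below covers chip architecture, RAM, and free disk space on macOS. It is only a sketch; the thresholds simply mirror the list above:
import platform
import shutil
import subprocess

# Chip architecture: Apple Silicon reports "arm64"
print("Architecture:", platform.machine())

# Physical RAM via sysctl (macOS-specific command)
mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]))
print(f"RAM: {mem_bytes / 1024**3:.0f} GB (16 GB minimum, 32 GB recommended)")

# Free space on the root volume
free_gb = shutil.disk_usage("/").free / 1024**3
print(f"Free disk: {free_gb:.0f} GB (at least 50 GB recommended for model storage)")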
Installing Docker Desktop 4.40
If you don't have Docker Desktop 4.40 or later installed, follow these steps:
Step 1: Download Docker Desktop
Visit the official Docker website and download Docker Desktop for Mac. Ensure you select the Apple Silicon version if you're on an M-series chip.
# Verify your chip architecture
uname -m
# Should return: arm64
Step 2: Install Docker Desktop
- Open the downloaded .dmg file
- Drag Docker to your Applications folder
- Launch Docker Desktop from Applications
- Complete the initial setup wizard
Step 3: Verify Installation
# Check Docker version
docker --version
# Should show: Docker version 28.x or later (the engine bundled with Docker Desktop 4.40)
# Verify Docker Desktop version
open -a Docker\ Desktop
Navigate to Docker Desktop's settings by clicking the gear icon. Under "General," you should see version 4.40.0 or higher.
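If you prefer to confirm the version from a script rather than the settings dialog, one option is to read the version string from the app bundle. This is a sketch that assumes Docker Desktop is installed at the default /Applications/Docker.app location and that the bundle's version string matches the Desktop release:
import plistlib

# Assumes Docker Desktop is installed at the default /Applications path.
with open("/Applications/Docker.app/Contents/Info.plist", "rb") as f:
    info = plistlib.load(f)

print("Docker Desktop version:", info.get("CFBundleShortVersionString", "unknown"))
# Expect 4.40.0 or higher for Model Runner support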
Enabling AI Model Runner
The Model Runner feature comes disabled by default and requires manual activation:
Step 1: Access Experimental Features
- Open Docker Desktop
- Click the settings gear icon
- Navigate to "Features in development"
- Locate "AI Model Runner (Beta)"
- Toggle the switch to enable it
Step 2: Restart Docker Desktop
After enabling the feature, Docker Desktop will prompt you to restart. Click "Apply & Restart" to activate the Model Runner.
Step 3: Verify Activation
Once Docker Desktop restarts, you should see a new "AI Models" section in the left sidebar. If this section doesn't appear, verify that your system meets the requirements and that the feature toggle is enabled.

Your First AI Model: Running a Text Generation Model
Let's start with a practical example by running a text generation model. We'll use the popular Llama2-7B model for this demonstration.
Step 1: Browse Available Models
- Click on "AI Models" in the Docker Desktop sidebar
- Browse the model catalog or use the search function
- Look for "llama2-7b-chat" in the available models
Step 2: Pull and Run Your First Model
# Pull the model using Docker CLI
docker pull models/llama2-7b-chat:latest
# Run the model with the Model Runner
docker run -d -p 8080:8080 --name my-llama2 models/llama2-7b-chat:latest
Alternatively, you can use Docker Desktop's GUI:
- Find the llama2-7b-chat model in the catalog
- Click "Run"
- Configure the port mapping (default: 8080)
- Click "Start"
Step 3: Test Your Model
The model now runs locally and exposes an OpenAI-compatible API endpoint. Test it using curl:
# Test the model with a simple prompt
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2-7b-chat",
    "messages": [
      {"role": "user", "content": "Explain Docker containers in simple terms"}
    ],
    "max_tokens": 150
  }'
Expected response:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Docker containers are like lightweight, portable boxes that package applications with everything they need to run..."
      }
    }
  ]
}
Working with Different Model Types
Docker's Model Runner supports various AI model categories. Let's explore different types and their specific use cases.
Image Generation Models
For image generation tasks, you might want to run a Stable Diffusion model:
# Pull a Stable Diffusion model
docker pull models/stable-diffusion-xl:latest
# Run with appropriate memory allocation
docker run -d -p 8081:8080 --memory=8g --name image-gen models/stable-diffusion-xl:latest
Test the image generation endpoint:
curl -X POST http://localhost:8081/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A futuristic cityscape with flying cars",
    "size": "1024x1024",
    "n": 1
  }'
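The images endpoint returns JSON rather than a file, so you still need to decode the result. The sketch below assumes the local endpoint follows OpenAI's response shape and returns base64-encoded image data in a b64_json field; if your model returns URLs instead, adapt accordingly:
import base64
import json
import urllib.request

payload = json.dumps({
    "prompt": "A futuristic cityscape with flying cars",
    "size": "1024x1024",
    "n": 1,
    "response_format": "b64_json",  # assumption: OpenAI-style image responses
}).encode()
req = urllib.request.Request(
    "http://localhost:8081/v1/images/generations",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# Decode the first image and write it to disk
with open("cityscape.png", "wb") as f:
    f.write(base64.b64decode(result["data"][0]["b64_json"]))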
Code Generation Models
For coding assistance, try a specialized code model:
# Pull CodeLlama model
docker pull models/codellama-7b:latest
# Run the code generation model
docker run -d -p 8082:8080 --name code-assistant models/codellama-7b:latest
Test with a coding query:
curl -X POST http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama-7b",
    "messages": [
      {"role": "user", "content": "Write a Python function to calculate Fibonacci numbers"}
    ]
  }'
Model Management and Configuration
Effective model management becomes crucial when working with multiple AI models simultaneously.
Listing Running Models
# View running containers for a specific model image (the ancestor filter needs an exact image reference, not a wildcard)
docker ps --filter "ancestor=models/llama2-7b-chat:latest"
# Check resource usage of those containers
docker stats $(docker ps -q --filter "ancestor=models/llama2-7b-chat:latest")
Model Configuration Options
Most models support various configuration parameters:
# Run with custom configuration
docker run -d -p 8080:8080 \
--name configured-llama \
-e MODEL_TEMPERATURE=0.7 \
-e MODEL_MAX_TOKENS=512 \
-e MODEL_TOP_P=0.9 \
models/llama2-7b-chat:latest
Common environment variables (many of these can also be set per request, as shown in the sketch after this list):
- MODEL_TEMPERATURE: Controls randomness (0.0 to 1.0)
- MODEL_MAX_TOKENS: Maximum response length
- MODEL_TOP_P: Nucleus sampling parameter
- MODEL_TOP_K: Top-k sampling parameter
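Because the API is OpenAI-compatible, the same sampling controls can usually be passed per request instead of per container. This is a sketch; the parameter names follow the OpenAI chat completions schema, which the guide assumes the local endpoint honors:
import json
import urllib.request

payload = json.dumps({
    "model": "llama2-7b-chat",
    "messages": [{"role": "user", "content": "Summarize what Docker volumes are."}],
    "temperature": 0.7,  # per-request counterpart of MODEL_TEMPERATURE
    "top_p": 0.9,        # per-request counterpart of MODEL_TOP_P
    "max_tokens": 512,   # per-request counterpart of MODEL_MAX_TOKENS
}).encode()
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])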
Persistent Storage for Models
To avoid re-downloading models, set up persistent storage:
# Create a volume for model storage
docker volume create ai-models-storage
# Run model with persistent storage
docker run -d -p 8080:8080 \
--name persistent-llama \
-v ai-models-storage:/models \
models/llama2-7b-chat:latest
Building Custom Model Containers
While Docker Hub provides many pre-built models, you might need to containerize your own models.
Creating a Custom Model Dockerfile
# Dockerfile for custom model
FROM python:3.11-slim
# Install required dependencies
RUN pip install transformers torch fastapi uvicorn
# Copy your model files
COPY ./my-custom-model /app/model
COPY ./api-server.py /app/
WORKDIR /app
# Expose the API port
EXPOSE 8080
# Start the model server
CMD ["uvicorn", "api-server:app", "--host", "0.0.0.0", "--port", "8080"]
Sample API Server (api-server.py)
from fastapi import FastAPI
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

app = FastAPI()

# Load model and tokenizer once at startup
model_path = "/app/model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

@app.post("/v1/chat/completions")
async def chat_completions(request: dict):
    # Use the latest user message as the prompt
    prompt = request["messages"][-1]["content"]
    # Tokenize and generate a continuation
    inputs = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(inputs, max_new_tokens=512)
    # Drop the prompt tokens so only the generated text is returned
    response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
    return {
        "choices": [
            {
                "message": {
                    "role": "assistant",
                    "content": response
                }
            }
        ]
    }
Building and Running Your Custom Model
# Build your custom model container
docker build -t my-custom-model:latest .
# Run your custom model
docker run -d -p 8083:8080 --name custom-ai my-custom-model:latest
Integration with Existing Applications
The OpenAI-compatible API makes it easy to integrate Docker-hosted models into existing applications.
Python Integration Example
import openai

# Configure client for the local model
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="dummy-key"  # Not needed for local models
)

# Use the model like any OpenAI API
def chat_with_local_model(prompt):
    response = client.chat.completions.create(
        model="llama2-7b-chat",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200
    )
    return response.choices[0].message.content

# Example usage
result = chat_with_local_model("Explain machine learning")
print(result)
Node.js Integration Example
const OpenAI = require('openai');

const openai = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'dummy-key'
});

async function queryLocalModel(prompt) {
  const completion = await openai.chat.completions.create({
    model: 'llama2-7b-chat',
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 200
  });
  return completion.choices[0].message.content;
}

// Usage
queryLocalModel('What is Docker?').then(console.log);
This approach allows you to seamlessly switch between local and cloud-based AI services by simply changing the base URL configuration.
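For example, a small factory function can read the base URL and API key from environment variables, so the same code runs against the local container in development and a hosted service in production. This is a sketch; the defaults and the MODEL_NAME variable are illustrative conventions, not part of the Model Runner itself:
import os
import openai

def make_client() -> openai.OpenAI:
    # Defaults target the local Model Runner container; set the environment
    # variables to point the same code at a hosted service instead.
    return openai.OpenAI(
        base_url=os.getenv("OPENAI_BASE_URL", "http://localhost:8080/v1"),
        api_key=os.getenv("OPENAI_API_KEY", "dummy-key"),
    )

client = make_client()
response = client.chat.completions.create(
    model=os.getenv("MODEL_NAME", "llama2-7b-chat"),
    messages=[{"role": "user", "content": "What is Docker?"}],
)
print(response.choices[0].message.content)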
Troubleshooting Common Issues
Model Won't Start
If your model container fails to start, check these common causes:
# Check container logs
docker logs my-llama2
# Verify system resources
docker system df
top -l 1 | grep PhysMem  # Check available memory on macOS (free is Linux-only)
Common solutions:
- Ensure sufficient RAM (models typically need 2-8GB)
- Check disk space for model storage
- Verify the port isn't already in use: lsof -i :8080
Slow Performance
Performance issues often stem from resource constraints:
# Monitor container resource usage
docker stats my-llama2
# Allocate more memory
docker update --memory=8g my-llama2
# Restart with more resources
docker stop my-llama2
docker run -d -p 8080:8080 --memory=8g --cpus=4 --name my-llama2 models/llama2-7b-chat:latest
API Connection Errors
When API calls fail:
# Test basic connectivity
curl -I http://localhost:8080/health
# Check if the model is ready
curl http://localhost:8080/v1/models
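Large models can take a minute or more to load after the container starts, so failures immediately after startup are often just a readiness problem. The polling helper below is a sketch; it assumes the /v1/models endpoint used elsewhere in this guide starts responding once the model has loaded:
import time
import urllib.error
import urllib.request

def wait_for_model(base_url="http://localhost:8080", timeout=300):
    """Poll the OpenAI-compatible endpoint until the model responds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5):
                return True
        except (urllib.error.URLError, TimeoutError):
            time.sleep(5)  # not ready yet; keep waiting
    return False

if wait_for_model():
    print("Model endpoint is up")
else:
    print("Model did not become ready in time; check docker logs")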
Model Download Failures
If model downloads fail or timeout:
# Check Docker Hub connectivity
docker pull hello-world
# Retry the model pull manually (progress output is shown by default)
docker pull models/llama2-7b-chat:latest
# Clear Docker cache if needed
docker system prune -a
Advanced Use Cases and Integrations
Multi-Model Orchestration
For complex applications requiring multiple AI capabilities:
# docker-compose.yml for multi-model setup
version: '3.8'
services:
  text-generator:
    image: models/llama2-7b-chat:latest
    ports:
      - "8080:8080"
    environment:
      - MODEL_TEMPERATURE=0.7
  image-generator:
    image: models/stable-diffusion-xl:latest
    ports:
      - "8081:8080"
    deploy:
      resources:
        limits:
          memory: 8G
  code-assistant:
    image: models/codellama-7b:latest
    ports:
      - "8082:8080"
    environment:
      - MODEL_MAX_TOKENS=1024
  nginx-proxy:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
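With each model on its own port, application code can route requests by task. The dispatcher below is a sketch that assumes the port mapping and model names from the compose file above:
import json
import urllib.request

# Port assignments and model names follow the docker-compose.yml above
ENDPOINTS = {
    "text": ("http://localhost:8080/v1/chat/completions", "llama2-7b-chat"),
    "code": ("http://localhost:8082/v1/chat/completions", "codellama-7b"),
}

def ask(task: str, prompt: str) -> str:
    url, model = ENDPOINTS[task]
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

print(ask("code", "Write a Python one-liner that reverses a string"))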
Load Balancing Multiple Model Instances
# nginx.conf for load balancing
events {
    worker_connections 1024;
}
http {
    upstream llama_backend {
        server text-generator:8080;
        server text-generator-2:8080;
        server text-generator-3:8080;
    }
    server {
        listen 80;
        location /v1/ {
            proxy_pass http://llama_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}
CI/CD Integration
Incorporate AI model testing into your development pipeline:
# .github/workflows/ai-model-test.yml
name: Test AI Models
on: [push, pull_request]
jobs:
  test-models:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v3
      - name: Start AI model
        run: |
          docker run -d -p 8080:8080 --name test-model models/llama2-7b-chat:latest
          sleep 60 # Wait for model to initialize
      - name: Test model API
        run: |
          response=$(curl -s -X POST http://localhost:8080/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d '{"model":"llama2-7b-chat","messages":[{"role":"user","content":"Hello"}]}')
          echo "$response" | jq -e '.choices[0].message.content'
      - name: Cleanup
        run: docker stop test-model && docker rm test-model
The Docker Model Runner represents a significant step forward in making AI development more accessible and standardized. By leveraging Docker's containerization strengths, developers can now integrate AI capabilities into their applications without the traditional complexity of model deployment and management.
This approach aligns perfectly with modern development practices, where workflow automation tools are becoming essential for maintaining efficient development cycles. The standardized API interface ensures that switching between different models or scaling to cloud deployments requires minimal code changes.
As the AI landscape continues to evolve rapidly, tools like Docker's Model Runner provide the stability and consistency that development teams need to build reliable AI-powered applications. Whether you're prototyping new ideas or deploying production systems, this containerized approach to AI model management offers a robust foundation for your projects.
The beta status of this feature means we can expect continued improvements and expanded platform support. Keep an eye on Docker's release notes for updates that might bring Model Runner support to Windows and Linux platforms, as well as additional model formats and optimization features.