Do I need the NVIDIA Container Toolkit to use my GPU in Docker?

Yes, on Linux. Docker can't see your NVIDIA GPU until you install the NVIDIA Container Toolkit and restart Docker. After that, adding --gpus all to docker run passes the GPU into the Ollama container. On Windows you get the same via Docker Desktop with WSL2 and a current NVIDIA driver.

Where does Ollama store models when it runs in Docker?

Inside the container at /root/.ollama. That's wiped if you remove the container, so mount a named volume (-v ollama:/root/.ollama). Your downloaded models then live in the volume and survive restarts, updates, and re-creating the container.

Can Open WebUI in Docker reach Ollama running in Docker?

Yes. The cleanest way is to put both in one docker compose file on a shared network, then point Open WebUI at http://ollama:11434 using the service name. They talk over Docker's internal network with no host ports required for Ollama.

How to Run Ollama in Docker (with GPU + Open WebUI)

By LocalLLMGear Editorial · Editorial Team · Updated 2026-06-29

Running Ollama in Docker keeps your local-LLM setup tidy and reproducible: one image, one volume, no system packages to manage, and a clean teardown when you’re done. The catch most people hit is the GPU — a container can’t see your NVIDIA card by default, so generation crawls on the CPU. This guide fixes that, then bolts on a real chat UI.

The 30-second answer: On a Linux host with the NVIDIA Container Toolkit already installed (covered below), run docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama to get a GPU-accelerated Ollama server in a container, with models persisted in a named volume. Add Open WebUI via docker compose for a ChatGPT-style interface on http://localhost:3000.

Run Ollama in Docker

The official image is published as ollama/ollama on Docker Hub. The simplest possible launch — CPU only, just to confirm it works:

docker run -d -p 11434:11434 --name ollama ollama/ollama

-d runs it in the background, -p 11434:11434 exposes Ollama’s API on the usual port, and --name ollama gives the container a memorable handle. Confirm it’s alive:

docker exec -it ollama ollama --version
docker exec -it ollama ollama run llama3

The second command downloads llama3 (if needed) and drops you into a chat inside the container. That proves the server works — but two things are still missing: GPU acceleration, and somewhere durable to keep the models. Let’s add both.

Enable the NVIDIA GPU

Without a GPU, even an 8B model is painfully slow. To pass an NVIDIA card into a container on Linux you need the NVIDIA Container Toolkit installed on the host (this is separate from your GPU driver, which must already be working — check with nvidia-smi).

Install and wire it into Docker:

# Debian/Ubuntu: add the NVIDIA Container Toolkit repo first, then:
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

(Follow NVIDIA’s current install docs for the exact repo lines for your distro; they change occasionally.) Once Docker has been restarted, you can hand the GPU to any container with --gpus all. The CPU-only container from earlier still holds the name ollama, so remove it first, then re-run Ollama with the flag added:

docker rm -f ollama
docker run -d --gpus all -p 11434:11434 --name ollama ollama/ollama

Verify the container actually sees the GPU:

docker exec -it ollama nvidia-smi

If nvidia-smi lists your card from inside the container, you’re set — Ollama will offload the model to VRAM automatically. On Windows, the equivalent is Docker Desktop with the WSL2 backend plus a recent NVIDIA driver; --gpus all then works the same way. On macOS, Docker can’t pass through the Apple GPU, so containerized Ollama runs on CPU — Mac users are better off running the native Ollama app instead.
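
If the GPU still isn’t showing up, it can help to check each layer in order: host driver, Docker runtime, then the container. The sketch below is one way to do that on a Linux host. The function names are made up for this example, and it assumes docker info lists nvidia on its Runtimes line once nvidia-ctk runtime configure has been applied.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of Docker or Ollama.

# Reads `docker info` text on stdin; succeeds if an nvidia runtime is listed.
# Assumption: the runtimes appear on a line containing "Runtimes".
has_nvidia_runtime() {
  grep -i 'runtimes' | grep -qi 'nvidia'
}

# Runs the three checks in order and reports the first layer that fails.
gpu_preflight() {
  nvidia-smi >/dev/null 2>&1 ||
    { echo "host driver: nvidia-smi failed"; return 1; }
  docker info 2>/dev/null | has_nvidia_runtime ||
    { echo "docker: no nvidia runtime registered"; return 1; }
  docker exec ollama nvidia-smi >/dev/null 2>&1 ||
    { echo "container: GPU not visible (was --gpus all used?)"; return 1; }
  echo "ok: GPU visible on the host, in Docker, and in the ollama container"
}
```

Source the file and call gpu_preflight; the first failing message tells you which of the steps above to revisit.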

Persist your models with a volume

By default a container’s filesystem is disposable. Ollama stores everything (model weights, config) in /root/.ollama, which means a docker rm throws away every gigabyte you downloaded. Mount a named volume so the data outlives the container, replacing any existing ollama container in the process:

docker rm -f ollama   # frees the name; models stored in the old container are lost
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama

Now ollama is a Docker-managed volume holding your models. Remove and re-create the container, upgrade to a newer image — the models stay put. You can pull more into it from the host without ever opening a shell:

docker exec -it ollama ollama pull qwen2.5:14b
docker exec -it ollama ollama list

The docker run command above is the one worth saving. It’s GPU-accelerated, persistent, and reachable on localhost:11434 for any app on your machine.
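
To see what “reachable on localhost:11434” looks like in practice, here is a minimal sketch that talks to Ollama’s HTTP API with curl. It uses the /api/tags and /api/generate endpoints. The OLLAMA_URL variable and the function names are conveniences invented for this example, and the payload builder is deliberately naive.

```shell
#!/bin/sh
# Illustrative variable; override it if the port is published elsewhere.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Build the JSON body for /api/generate. Naive on purpose: nothing is escaped,
# so the prompt must not contain double quotes or backslashes.
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# List the models the server has available locally.
list_models() {
  curl -s "$OLLAMA_URL/api/tags"
}

# Send one prompt to a model and print the raw JSON response.
ask() {
  curl -s "$OLLAMA_URL/api/generate" -d "$(generate_payload "$1" "$2")"
}
```

For example, ask llama3 "Why is the sky blue?" returns a single JSON object, because stream is set to false. Anything beyond a quick test should build the JSON with a real encoder.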

Add Open WebUI with docker compose

The terminal works, but a proper chat UI makes a local model feel finished. Rather than juggle two docker run commands, define the whole stack in one docker-compose.yml so the two containers share a network and start together:

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      - ollama

volumes:
  ollama:
  open-webui:

A few things to notice. The GPU is requested through the deploy.resources block, the compose equivalent of --gpus all. Open WebUI reaches Ollama at http://ollama:11434, using the service name as the hostname over the shared internal network, so Ollama needs no published host port at all. (If other apps on the host also need the API, add a ports entry of "11434:11434" to the ollama service.) And each service has its own named volume, so both your models and your chat history survive updates.

Bring it up:

docker compose up -d
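
Because both services keep their state in named volumes, updating the stack later can be as simple as pulling newer images and re-creating the containers. A minimal sketch, assuming it runs from the directory that holds docker-compose.yml. The run wrapper and the DRY_RUN variable are helpers invented for this example so you can preview the commands first.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN is set (hypothetical helper).
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Validate the compose file, pull newer images, then re-create what changed.
update_stack() {
  run docker compose config --quiet &&
  run docker compose pull &&
  run docker compose up -d
}
```

Running DRY_RUN=1 update_stack prints the three commands without touching anything; update_stack on its own executes them and stops at the first failure.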

Verify it all works

Give the containers a few seconds, then check the stack:

docker compose ps                       # both should be "running"
docker exec -it ollama ollama pull llama3
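
Open WebUI can take a little while to answer on its first start, so instead of guessing at “a few seconds” you can poll until it responds. The retry helper below is a generic sketch written for this guide, not a Docker feature, and it assumes the UI is published on port 3000 as in the compose file above.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND...: run COMMAND until it succeeds
# or ATTEMPTS tries have failed. Hypothetical helper.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Succeeds once Open WebUI answers on the published port.
webui_up() {
  curl -sf -o /dev/null http://localhost:3000
}
```

For example, retry 30 2 webui_up && echo "Open WebUI is ready" tries up to 30 times, two seconds apart.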

Open http://localhost:3000 in your browser. The first account you create becomes the admin, and any model you’ve pulled shows up in the picker automatically. Pick llama3, send a message, and you’ve got a fully private, GPU-accelerated ChatGPT-style app — every byte staying on your own hardware.

If generation feels slow, the bottleneck is almost never Docker — it’s the model not fitting in your VRAM and spilling to CPU. Matching model size to your card is a hardware question; the Ollama guide has a VRAM sizing table to help you pick.

Common gotchas

could not select device driver "nvidia" with capabilities: [[gpu]] (the driver name shows as "" when you used --gpus all): the NVIDIA Container Toolkit isn’t installed or Docker wasn’t restarted after nvidia-ctk runtime configure. Re-run those steps.
Models vanish after an update: you forgot the -v ollama:/root/.ollama volume. Always mount it.
Open WebUI shows no models: confirm OLLAMA_BASE_URL points at http://ollama:11434 (the service name), and that you’ve pulled at least one model into the Ollama container.
Port 11434 already in use: a native Ollama install is probably already running on the host. Stop it, or drop the -p 11434:11434 mapping and talk to the container internally.
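
For the port conflict in particular, you can confirm that something on the host is already listening before changing any mappings. A small sketch for Linux, assuming the ss tool from iproute2 is available; port_listening is a name made up for this example and simply parses ss -ltn output.

```shell
#!/bin/sh
# port_listening PORT: reads `ss -ltn` output on stdin and succeeds if a
# socket is listening on PORT. Hypothetical helper for illustration.
# Assumption: the local address:port is the fourth column of the output.
port_listening() {
  awk -v suffix=":$1" '
    $1 == "LISTEN" && $4 ~ (suffix "$") { found = 1 }
    END { exit !found }
  '
}
```

Use it as ss -ltn | port_listening 11434 && echo "11434 is taken on the host". If the culprit is a native Ollama install that registered a systemd service (an assumption about your setup), sudo systemctl stop ollama should free the port.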

Docker gives you a clean, repeatable local-LLM box you can rebuild in one command.