Docker Practices for Large Language Model Deployment
Large language models (LLMs) are everywhere — powering chatbots, copilots, and AI-driven apps across industries. But if you’ve ever tried to run one outside of a managed service, you know the pain: gigabytes of model weights, conflicting Python dependencies, fragile CUDA versions, and a GPU setup that only seems to work on your machine.
This is where Docker shines. By packaging the entire environment — code, libraries, and drivers — into a container, you can run an LLM anywhere, whether it’s your laptop, a cloud GPU node, or a Kubernetes cluster. Containers give you reproducibility, portability, and isolation: exactly what’s needed for the messy world of LLMOps.
In this article, we’ll explore how to run LLM workloads inside Docker. We’ll build a working container that serves predictions from a Hugging Face model, enable GPU support with NVIDIA’s container toolkit, and show how the same image can scale in Kubernetes. Along the way, we’ll cover common pitfalls like CUDA drift, bloated images, and cold starts — and share best practices to avoid them.
The goal is simple: by the end, you’ll see that Docker isn’t just for microservices — it’s quickly becoming an essential building block for deploying and scaling AI.
Why LLMOps Needs Docker
Running an LLM isn’t as simple as pip install transformers. These models often require dozens of dependencies, specific CUDA drivers, and sometimes gigabytes of model weights. Without containers, developers usually end up in “dependency hell,” where code runs on one machine but fails on another.
Docker solves this problem by providing a consistent runtime environment. Here’s why it’s particularly valuable for LLMOps:
- Reproducibility. Packaging PyTorch, TensorFlow, CUDA, and Hugging Face libraries into a single Docker image guarantees consistency across environments.
- Isolation. With NVIDIA Container Toolkit, containers can safely share GPU hardware without driver/library conflicts.
- Portability. The same container runs locally, on-prem, or on cloud services such as AWS SageMaker, GCP Vertex AI, or Azure ML.
- Scalability. Orchestrators such as Kubernetes can replicate containers automatically to scale LLM inference workloads.
- Security and compliance. Containers can be scanned, signed, and enforced with runtime policies.
Requests flow through a Dockerized app, model runtime, NVIDIA toolkit, GPU drivers, and finally the hardware. Each layer builds on the previous one to make large language models portable, scalable, and reliable across environments.
Creating an LLM-Ready Docker Image
Let’s begin by constructing a Docker image that can serve predictions from a Hugging Face model via FastAPI.
We will base the image on NVIDIA's CUDA images, which provide a GPU-ready environment, and layer Python, PyTorch, FastAPI, and Hugging Face Transformers on top. The result is a portable inference container.
Here’s a sample Dockerfile:
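The following is a minimal sketch, assuming a CUDA 12.1 runtime base on Ubuntu 22.04 and the app.py shown in the next snippet; pin exact versions to match your own host drivers:

```dockerfile
# GPU-ready base image with CUDA runtime libraries (illustrative tag)
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# Install Python and pip
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Serving stack: PyTorch (CUDA build), Transformers, FastAPI, Uvicorn
# Versions are illustrative; pin them for reproducible builds
RUN pip3 install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121 && \
    pip3 install --no-cache-dir transformers fastapi uvicorn

# Copy the inference app
COPY app.py .

EXPOSE 8080

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
```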
Explanation:
- We use NVIDIA CUDA as the base image. This ensures GPU compatibility out of the box.
- Next, we install Python and add PyTorch, Hugging Face Transformers, FastAPI, and Uvicorn to complete the serving stack.
- Lastly, we copy in a basic app.py, which loads a model and serves predictions on a REST endpoint.
Sample app.py:
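A minimal sketch, assuming the small Hugging Face distilgpt2 model as a stand-in (swap in your own) and a single /generate endpoint:

```python
# app.py - minimal FastAPI inference server (illustrative sketch)
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline
import torch

app = FastAPI()

# Load the model once at startup; use the GPU if one is available
device = 0 if torch.cuda.is_available() else -1
generator = pipeline("text-generation", model="distilgpt2", device=device)

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 50

@app.post("/generate")
def generate(prompt: Prompt):
    # Run inference and return the generated completion
    output = generator(prompt.text, max_new_tokens=prompt.max_new_tokens)
    return {"completion": output[0]["generated_text"]}
```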
Build the image with:
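Assuming the Dockerfile and app.py above sit in the current directory, and using my-llm-container as the image name referenced later in this article:

```bash
docker build -t my-llm-container .
```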
And run it locally with:
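For a quick CPU-only smoke test (GPU flags come in the next section), map the port and send a request; the /generate endpoint comes from the app.py sketch above:

```bash
docker run -p 8080:8080 my-llm-container

# In another terminal, test the endpoint:
curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Docker makes LLM deployment"}'
```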
You now have a Dockerized LLM that accepts text prompts and generates completions, and because it is reproducible and portable, it runs the same way everywhere.
Running LLMs With GPU Support
LLMs shine on GPUs, but giving containers access to GPU hardware requires some extra setup. By default, Docker can’t talk directly to your GPU drivers — that’s where the NVIDIA Container Toolkit comes in. This toolkit bridges your host GPU with the container runtime so Docker images can execute CUDA operations.
Once you have installed the toolkit on your host, you can run a GPU-enabled container with this simple command:
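Putting together the image name and flags explained below:

```bash
docker run --gpus all -p 8080:8080 my-llm-container
```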
Explanation:
- --gpus all tells Docker to expose all available GPUs to the container. You can also limit it to one GPU (--gpus "device=0") if you're running multiple workloads.
- -p 8080:8080 maps the FastAPI app inside the container to port 8080 on your machine.
- my-llm-container is the image we built in the previous section.
When the container starts, it loads the CUDA libraries from the host driver and exposes them to PyTorch inside the container. If the driver and CUDA versions don't match, you may hit errors like "invalid device function." To avoid this, check the NVIDIA CUDA compatibility matrix and make sure your host drivers support the CUDA version in your image.
To confirm that the GPU is actually available, you can run:
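One way to do this, assuming the NVIDIA Container Toolkit makes nvidia-smi available inside the container:

```bash
docker run --rm --gpus all my-llm-container nvidia-smi
```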
This runs nvidia-smi inside the container, showing GPU utilization, driver version, and CUDA compatibility. If you see your GPU listed, you’re good to go.
With this setup, you have a containerized LLM that can take full advantage of hardware acceleration. Whether it runs on your laptop's RTX card, a cloud GPU, or inside Kubernetes, your model is now ready to run fast and efficiently.
Scaling LLM Workloads in Kubernetes — With GPUs
Once your LLM runs in Docker, scaling it on Kubernetes is straightforward: deploy the pods to GPU nodes and adjust the number of replicas up or down as demand changes.
Prerequisites (One-Time)
- Install the NVIDIA Device Plugin DaemonSet. This exposes GPUs in Kubernetes under nvidia.com/gpu resources.
- Ensure that you have at least one GPU node pool in your cluster, be it GKE, AKS, EKS, or on-prem, and label it (e.g., accelerator=nvidia).
Minimal Deployment + Service
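A minimal sketch, assuming the my-llm-container image has been pushed to a registry your cluster can pull from (shown as a placeholder path), one GPU per replica, and the accelerator=nvidia node label from the prerequisites. The emptyDir cache matches the cold-start tip in the notes below:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      nodeSelector:
        accelerator: nvidia
      containers:
        - name: llm-server
          image: <your-registry>/my-llm-container:v1  # placeholder: push the image built earlier
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              nvidia.com/gpu: 1   # one GPU per replica, exposed by the device plugin
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface  # writable cache to soften cold starts
      volumes:
        - name: model-cache
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: llm-server
spec:
  selector:
    app: llm-server
  ports:
    - port: 80
      targetPort: 8080
```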
Optional: Spread, Scale, and Burst
Topology spread (avoid all pods on one node):
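A sketch of the constraint, added under the pod template spec of the Deployment above:

```yaml
# Add under spec.template.spec in the Deployment
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: llm-server
```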
Autoscale by QPS/CPU (HPA example):
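A sketch using the built-in CPU metric; true QPS-based scaling would require a custom or external metrics adapter (e.g., Prometheus Adapter), which is beyond this snippet:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 2
  maxReplicas: 8
  metrics:
    # Scale when average CPU utilization across replicas exceeds 70%
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```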
GPU-aware pod anti-affinity (keep replicas on different GPU nodes):
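A sketch using preferred (soft) anti-affinity keyed on hostname, again added under the pod template spec:

```yaml
# Add under spec.template.spec in the Deployment
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: llm-server
```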
Notes and Tips
- Cold starts: Use a writable cache (emptyDir, CSI, or object store pre-warm) so models aren’t re-downloaded every pod start.
- Driver/Runtime match: Ensure node driver versions are compatible with your CUDA base image.
- Requests vs. limits: Set realistic CPU/memory, so GPUs stay fully utilized. Under-requesting can throttle performance.
- Ingress and scaling: Put an Ingress, service mesh, or gateway in front. Scale replicas horizontally rather than just boosting single-pod batch sizes.
Challenges and Trade-Offs
Docker helps bring order to the chaos of LLM deployment, but it’s not magic. Running large AI models in containers brings its own problems that teams need to think about.
- Image bloat: LLM containers can get huge — often tens of gigabytes once you add CUDA libraries, PyTorch, and model weights. Big images slow everything down: builds, pushes, pulls, and even cluster rollouts.
- Dependency hell (still around, just in a box): CUDA, cuDNN, and PyTorch versions must still match the GPU driver on the host. If they don’t, you’ll hit runtime errors no matter how “contained” things look.
- Cold starts: Spinning up a new container often means pulling or loading gigabytes of model weights. This delays scaling on demand, which is painful for latency-sensitive apps.
- GPU scheduling headaches: In shared clusters, deciding who gets which GPU is not simple. Kubernetes GPU scheduling takes careful resource limits and sometimes extra operators to keep things fair.
- Hardware lock-in: Docker hides the OS but not the hardware. If your container expects AVX2 or specific GPU features, it won’t run on weaker nodes.
- Security and compliance risks: GPU-ready images often come from public registries and pull in lots of dependencies. You still need to scan, sign, and lock them down to avoid shipping vulnerabilities into production.
The trade-off: Docker gives LLM deployments portability, reproducibility, and scalability, but it also means bigger images, slower cold starts, and a need for careful orchestration. The trick is to treat these trade-offs as design limits, not deal-breakers, and to lean on best practices like model caching, multi-stage builds, and proactive driver checks.
Best Practices and Recommendations
Most of the impediments to running LLMs in Docker can be managed with the right practices. The following are some approaches that ensure that your containers are lean, secure, and reliable.
- Use multi-stage builds: By separating the build environment (compilers and dev tools) from the runtime environment (your app and its libraries), you keep images lightweight. This isolation eliminates build artifacts, reduces image size, and accelerates deployment; see the sketch after this list.
- Cache model weights: Pre-download Hugging Face or PyTorch models in your image, or mount them as volumes. This avoids long cold starts and saves bandwidth from re-downloading models every time.
- Pin dependencies: Lock down versions of CUDA, cuDNN, PyTorch, and Transformers in your Dockerfile, and avoid the "latest" tag. This ensures every build is reproducible and predictable.
- Align drivers and runtimes: Ensure that the CUDA version within your container is compatible with the host-based version of the driver of the GPU. See the compatibility list at NVIDIA to prevent run-time failures.
- Scan and sign images: Use tools like Trivy or Grype to scan for vulnerabilities. Sign your images before pushing them so you don’t accidentally pull something unsafe into production.
- Monitor GPU utilization: Track GPU usage with nvidia-smi, Prometheus, or DCGM exporters. This helps with cost efficiency and ensures GPUs don’t sit idle (or overloaded).
- Plan for cold starts: For latency-sensitive apps, preload models into memory or keep warm standby pods. If you don’t need a giant model, consider smaller distilled or quantized models to save time and resources.
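As a sketch of the multi-stage pattern mentioned above (tags and versions are illustrative; adjust to your stack), the heavy devel image is used only to install dependencies, while the slimmer runtime image is what ships:

```dockerfile
# Stage 1: build environment - install heavy dependencies into a virtualenv
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip python3-venv && \
    rm -rf /var/lib/apt/lists/*
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121 && \
    pip install --no-cache-dir transformers fastapi uvicorn

# Stage 2: runtime environment - only what's needed to serve
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3 && \
    rm -rf /var/lib/apt/lists/*
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY app.py .
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
```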
Takeaway: Treat LLM containers like production-grade services, not one-off experiments. By caching, pinning, scanning, and monitoring, you can turn fragile AI stacks into reliable, portable, and secure deployments.
Conclusion
Large language models are powerful but notoriously hard to run outside of managed services. Their heavy dependencies, hardware quirks, and scaling needs often turn simple experiments into production headaches.
Docker changes the game. By bundling code, libraries, and drivers into portable containers, it provides the reproducibility, isolation, and scalability that traditional setups lack. The same container that works on a laptop can also run on a GPU-backed Kubernetes cluster, as long as you take care of trade-offs like image bloat, cold starts, and driver mismatches.
The bottom line: Docker isn’t just for microservices anymore. It’s becoming a critical foundation for deploying and scaling AI workloads reliably. Teams that embrace containers for LLMOps will find it easier to move from prototype to production — without losing sleep over dependency hell or “works on my machine” failures.