
Build Your Own AI Infrastructure: Docker + Traefik for Self-Hosted LLMs

Executive Summary

Enterprises are increasingly reconsidering their cloud AI strategy due to escalating costs, data privacy concerns, and regulatory compliance requirements. Building your own AI infrastructure with Docker, Traefik, and self-hosted Large Language Models (LLMs) offers a viable path to digital sovereignty while maintaining enterprise-grade performance and scalability. This guide presents a strategic framework for deploying self-hosted AI workloads with cost savings of 60-80% compared to commercial SaaS solutions, full control over your data, and the flexibility to scale according to organizational needs.

The Challenge

The Explosion of AI SaaS Costs

Enterprise adoption of commercial LLM services has skyrocketed, with organizations spending $50,000-$500,000 monthly on AI API calls alone. The APIs are convenient, but the costs are:

  • Unpredictable: Usage-based pricing makes budgeting difficult
  • Ongoing: Consumption keeps growing, with no natural ceiling on spend
  • Vendor-locked: Difficult to migrate or negotiate better terms after adoption

Data Sovereignty and Compliance Concerns

Sending sensitive data to third-party AI services creates significant risks:

  • GDPR violations: Personal data processing outside the EU
  • IP leakage: Proprietary datasets trained into publicly available models
  • Audit Trail Gaps: Limited visibility into how AI models process your data
  • Regulatory Compliance: Industries like healthcare and finance have strict data residency requirements

Infrastructure Complexity

Enterprise IT teams face significant hurdles when moving to self-hosted AI:

  • Resource Management: GPU requirements, memory optimization, scaling challenges
  • Orchestration: Managing multiple AI services, load balancing, failover
  • Security: Authentication, network isolation, vulnerability scanning
  • Observability: Monitoring performance, tracking token usage, debugging model behavior

The Solution

Strategic Approach: Container-Based AI Infrastructure

By leveraging Docker and Traefik, organizations can build a modular, scalable AI infrastructure that:

  • Simplifies deployment: One-command container launches for new AI services
  • Enables portability: Run anywhere—on-premise, cloud, or hybrid environments
  • Provides resilience: Automatic failover, load balancing, and health checks
  • Scales efficiently: Auto-scale based on demand while maintaining cost controls

Architecture Overview

Figure: Containerized AI Infrastructure. Traefik ingress layer routing to containerized LLM services with GPU scheduling.

Key Architectural Benefits:

  1. Traefik as Unified Ingress

    • Dynamic Service Discovery: Automatically routes to new services without manual configuration
    • SSL/TLS Automation: Let's Encrypt integration with zero manual certificate management
    • Circuit Breaking: Prevents cascade failures by detecting unresponsive services
    • Rate Limiting: Protect services from abuse and control API costs
  2. Docker Containerization

    • Isolation: Each AI service runs in an isolated environment
    • Resource Control: CPU, memory, and GPU allocation per container
    • Version Management: Easy rollback and A/B testing of different model versions
    • Multi-Model Support: Run multiple LLMs (Llama, Mistral, Falcon, etc.) simultaneously
  3. Horizontal Scaling

    • Load Balancing: Distribute requests across multiple instances
    • Auto-scaling: Scale based on CPU/GPU utilization, request latency, or queue depth
    • Geographic Distribution: Deploy models closer to users for lower latency
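
To make the rate-limiting benefit above concrete, here is a minimal sketch of the token-bucket idea that rate limiters of this kind are commonly built on. It is not Traefik's implementation; the class name `TokenBucket` and the parameters `rate`, `burst`, and `clock` are illustrative assumptions.

```python
import time


class TokenBucket:
    """Token-bucket limiter: `rate` tokens refill per second, up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(burst)      # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the elapsed time, never beyond the burst capacity.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket created with `TokenBucket(rate=2, burst=3)` admits three requests back to back and then roughly two per second. In a real deployment the ingress layer would enforce this, not application code.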

Business Impact

| Metric | Commercial SaaS | Self-Hosted Infrastructure | Savings |
|---|---|---|---|
| Monthly Cost (10M tokens) | $30,000 | $8,000-$12,000 | 60-73% |
| Data Sovereignty | Limited | Full control | |
| Regulatory Compliance | Challenge | Addressable | |
| Custom Model Training | Expensive | Included | |
| Resource Predictability | Variable | Fixed | |
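
The savings column follows directly from the two monthly cost figures. The sketch below reproduces that arithmetic and adds a simple payback calculation; the function names and the `upfront_cost` parameter are illustrative assumptions, and no upfront figure is given in this article.

```python
def savings_percent(saas_monthly, self_hosted_monthly):
    """Percentage saved by moving from the SaaS bill to the self-hosted bill."""
    if saas_monthly <= 0:
        raise ValueError("saas_monthly must be positive")
    return (saas_monthly - self_hosted_monthly) / saas_monthly * 100


def payback_months(upfront_cost, saas_monthly, self_hosted_monthly):
    """Months until a one-time investment is recovered; None if nothing is saved."""
    monthly_saving = saas_monthly - self_hosted_monthly
    if monthly_saving <= 0:
        return None
    return upfront_cost / monthly_saving


# Figures from the table: $30,000 SaaS versus $8,000-$12,000 self-hosted.
low = savings_percent(30_000, 12_000)
high = savings_percent(30_000, 8_000)
print(f"savings: {low:.0f}% to {high:.0f}%")
```

Running the same functions with your own invoices and hardware quotes gives a first-pass estimate before a fuller total-cost-of-ownership analysis.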

Implementation Roadmap

Phase 1: Foundation (Week 1-2)

Infrastructure Setup:

  1. Provision Hardware/Cloud Instances

    • GPU servers (NVIDIA A100, GeForce RTX 4090, or cloud equivalents)
    • Minimum 32GB RAM, 8+ vCPU, 1TB SSD storage per LLM instance
    • Network: 10Gbps recommended for low-latency inference
  2. Install Core Components

    # Install Docker (latest stable); the script needs root privileges
    curl -fsSL https://get.docker.com -o get-docker.sh
    sudo sh get-docker.sh
    
    # Install Docker Compose (release assets use lowercase platform names)
    sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-linux-x86_64" -o /usr/local/bin/docker-compose
    sudo chmod +x /usr/local/bin/docker-compose
    
  3. Deploy Traefik

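    services:
      traefik:
        image: traefik:v3.1
        command:
          # Unauthenticated dashboard on :8080; for initial testing only
          - "--api.insecure=true"
          - "--providers.docker=true"
          # Route only containers that set traefik.enable=true
          - "--providers.docker.exposedbydefault=false"
          - "--entrypoints.web.address=:80"
          - "--entrypoints.websecure.address=:443"
          - "--certificatesresolvers.myresolver.acme.tlschallenge=true"
          - "--certificatesresolvers.myresolver.acme.email=admin@yourcompany.com"
          # Persist issued certificates across restarts
          - "--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json"
        ports:
          - "80:80"
          - "443:443"
          - "8080:8080"
        volumes:
          - "/var/run/docker.sock:/var/run/docker.sock:ro"
          - "./letsencrypt:/letsencrypt"
        networks:
          - ai-network
    
    networks:
      ai-network: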
    

Success Criteria:

  • Traefik dashboard accessible at http://your-server:8080
  • SSL certificate auto-generation with Let's Encrypt working
  • Basic container routing demonstrated

Phase 2: LLM Deployment (Week 3-4)

Model Selection:

We recommend starting with open-source models optimized for various use cases:

| Use Case | Recommended Model | Hardware | Context Window | Parameters |
|---|---|---|---|---|
| General Purpose | Llama 3 70B | 4x A100/RTX 4090 | 8K tokens | 70B |
| Chat & Conversations | Mistral 7B | 1x A100/RTX 4090 | 32K tokens | 7B |
| Code Generation | CodeLlama 34B | 2x A100 | 16K tokens | 34B |
| Multi-language | Qwen 72B | 4x A100 | 32K tokens | 72B |

Deployment Using Docker:

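```yaml
services:
  llama3-70b:
    # vLLM's OpenAI-compatible server; the flags in `command` go to its entrypoint
    image: vllm/vllm-openai:latest
    container_name: llama3-70b
    command:
      - "--model"
      - "meta-llama/Meta-Llama-3-70B"
      - "--tensor-parallel-size"
      - "4"          # one shard per GPU, matching the 4-GPU row in the table above
    environment:
      # Llama 3 weights are gated on Hugging Face; supply your own access token
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
      # max_tokens, temperature, and top_p are per-request API fields,
      # not container environment variables
    ipc: host        # shared memory for multi-GPU tensor parallelism
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./models:/root/.cache/huggingface   # cache downloaded weights
    networks:
      - ai-network   # must be declared under the top-level `networks:` key
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.llama3.rule=Host(`llama3.yourdomain.com`)"
      - "traefik.http.routers.llama3.entrypoints=websecure"
      - "traefik.http.routers.llama3.tls.certresolver=myresolver"
      - "traefik.http.services.llama3.loadbalancer.server.port=8000"
```

**API Testing:**

```bash
curl https://llama3.yourdomain.com/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B",
    "prompt": "Explain the benefits of digital sovereignty",
    "max_tokens": 200
  }'
```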
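
The same request can be sent from application code. The sketch below uses only the Python standard library and assumes the service exposes an OpenAI-style `/v1/completions` endpoint whose JSON response carries the text under `choices[0].text`. The function names are illustrative, and `base_url` is whatever scheme and host your router serves.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=200):
    """Build a POST request for an OpenAI-style /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, max_tokens=200, timeout=60):
    """Send the request and return the first completion's text."""
    request = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed response shape: {"choices": [{"text": "..."}], ...}
    return body["choices"][0]["text"]
```

Calling `complete(base_url, "meta-llama/Meta-Llama-3-70B", "Explain the benefits of digital sovereignty")` mirrors the curl command. Splitting request construction from sending keeps the payload logic testable without a running model.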

**Success Criteria:**

- [ ] LLM service responds to API requests within 2-5 seconds
- [ ] Traefik routes traffic correctly to LLM containers
- [ ] Load balancer distributes requests across multiple instances
- [ ] GPU utilization visible (40-60% target)

### Phase 3: Security & Compliance (Week 5-6)

**Authentication Layer:**

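```yaml
services:
  keycloak:
    image: bitnami/keycloak:24
    ports:
      - "8081:8080"   # host port 8080 is already taken by the Traefik dashboard
    environment:
      - KEYCLOAK_ADMIN_USER=admin
      # Read from the shell or an .env file instead of hardcoding a password
      - KEYCLOAK_ADMIN_PASSWORD=${KEYCLOAK_ADMIN_PASSWORD}
      - KEYCLOAK_ADMIN_REALM=ai-platform
    volumes:
      - keycloak_data:/bitnami/keycloak
    networks:
      - ai-network

volumes:
  keycloak_data:
```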

**Network Isolation:**

- **VLAN Segmentation**: Separate AI services into isolated network segments
- **Firewall Rules**: Restrict inbound/outbound traffic to minimum necessary ports
- **Service Mesh**: Implement mutual TLS (mTLS) for service-to-service communication

**Vulnerability Scanning:**

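```bash
# Scan a specific image for vulnerabilities
trivy image vllm/vllm-openai:latest

# Scan the image of every running container, reporting only HIGH and CRITICAL findings
docker ps --format '{{.Image}}' | sort -u | xargs -n 1 trivy image --severity HIGH,CRITICAL
```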

### Phase 4: Observability & Monitoring (Week 7-8)

**Metrics Collection:**

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    networks:
      - ai-network

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      # Read from the shell or an .env file; do not ship the default "admin"
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - ai-network

volumes:
  prometheus_data:
  grafana_data:
```

**Key Metrics to Track:**

1. **Request Latency**: P50, P95, P99 response times
2. **Throughput**: Requests per second, tokens per second
3. **Resource Utilization**: GPU memory, GPU compute, RAM
4. **Error Rate**: HTTP 500s, timeout failures, out-of-memory errors
5. **Token Costs**: Track token generation by service/end-user
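
As an illustration of the first and fourth metrics, the sketch below computes nearest-rank latency percentiles and a 5xx error rate from raw samples. In practice Prometheus and Grafana would derive these from histograms. The function names and input shapes are assumptions made for this example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(latencies_s, status_codes):
    """Return P50/P95/P99 latency and the share of HTTP 5xx responses."""
    if not status_codes:
        raise ValueError("status_codes must not be empty")
    errors = sum(1 for code in status_codes if code >= 500)
    return {
        "p50": percentile(latencies_s, 50),
        "p95": percentile(latencies_s, 95),
        "p99": percentile(latencies_s, 99),
        "error_rate": errors / len(status_codes),
    }
```

Tail percentiles matter more than averages for LLM serving, because a few long generations can dominate user-perceived latency while leaving the mean almost unchanged.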

## Technical Implementation

### Hands-on Setup Guide

For step-by-step technical tutorials covering Docker installation, Traefik configuration, and individual LLM deployments, we recommend:

- [Docker Installation Guide](https://docs.docker.com/engine/install/)
- [Traefik Configuration](https://doc.traefik.io/traefik/providers/docker/)
- [vLLM Documentation](https://docs.vllm.ai/)
- [Ollama Documentation](https://ollama.com/docs)

These guides provide the specific installation commands and configuration details. This article focuses on:

- **Strategic Decision Framework**: When to self-host vs. use SaaS services
- **Business Case Analysis**: Total cost of ownership, ROI calculations
- **Enterprise Architecture Patterns**: Multi-tenant isolation, shard strategies, caching layers
- **Operational Best Practices**: Incident response, capacity planning, upgrade strategies
- **Integration Patterns**: Connecting self-hosted LLMs with existing enterprise systems

### goneuland.de Cross-References

Related technical tutorials:

- [Apache Guacamole for Remote Access](https://web.archive.org/web/2025/https://goneuland.de/apache-guacamole-remote-zugang-einrichten/)
- [Portainer for Docker Management](https://web.archive.org/web/2025/https://goneuland.de/portainer-docker-verwaltung-einrichten/)
- [Authelia for Authentication](https://web.archive.org/web/2025/https://goneuland.de/authelia-2-factor-authentifizierung-einrichten/)
- [Bitwarden for Secrets Management](https://web.archive.org/web/2025/https://goneuland.de/bitwarden-passwort-manager-installieren/)
- [CrowdSec for Security](https://web.archive.org/web/2025/https://goneuland.de/crowdsec-install-sicherheit-schutzen/)

These guides provide the hands-on technical setup instructions. This article builds upon that foundation by adding:

- **Strategic Business Context**: ROI model, adoption roadmap, risk assessment
- **Enterprise Architecture**: Scaling patterns, HA deployment, multi-environment support
- **Compliance Considerations**: GDPR, SOC2, HIPAA alignment strategies
- **Cost Optimization Strategies**: Model selection, caching, quantization techniques

### Cost Optimization Strategies

### 1. Model Quantization

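- Reduce model size by 50-75% (8-bit or 4-bit weights instead of 16-bit) with minimal accuracy loss
- Example: Llama 3 70B weights take about 140 GB at FP16 and about 35 GB at 4-bit (4x memory reduction); the speedup depends on hardware and serving stack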
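
The size reduction from quantization is mostly arithmetic on bits per weight. The sketch below estimates memory for the weights alone; it ignores the KV cache, activations, and per-format overhead, and the function name is an illustrative assumption.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


baseline = weight_memory_gb(70, 16)
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    size = weight_memory_gb(70, bits)
    reduction = (1 - size / baseline) * 100
    print(f"70B at {label}: {size:.0f} GB ({reduction:.0f}% smaller than FP16)")
```

The 50% and 75% endpoints correspond to 8-bit and 4-bit weights relative to a 16-bit baseline. Actual serving memory will be higher once context length and batch size are accounted for.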

### 2. Dynamic Scaling

- Scale to zero during off-hours to save compute costs
- Auto-scale based on request queue depth (target: <5 seconds wait time)
- Spot instances for development/testing (70% cost savings)
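
A queue-depth scaling rule can be sketched as a small pure function. Docker Compose on its own does not auto-scale, so something external (an orchestrator or a scheduled script) would have to act on the result. The names and the throughput figure used in the note below are illustrative assumptions.

```python
import math


def desired_replicas(queue_depth, per_replica_rps, target_wait_s=5.0,
                     min_replicas=0, max_replicas=8):
    """Replicas needed to drain `queue_depth` requests within `target_wait_s`.

    With min_replicas=0 an empty queue scales the service to zero.
    """
    if per_replica_rps <= 0 or target_wait_s <= 0:
        raise ValueError("per_replica_rps and target_wait_s must be positive")
    needed = math.ceil(queue_depth / (per_replica_rps * target_wait_s))
    return max(min_replicas, min(max_replicas, needed))
```

For example, 25 queued requests at an assumed 2 requests per second per replica need 3 replicas to stay under a 5-second wait. Setting `min_replicas=1` avoids cold-start latency at the cost of an always-on instance.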

### 3. Caching Layer

- Cache repeated queries to reduce compute requirements
- Redis or Memcached for high-traffic scenarios
- Typical cache hit rate: 30-45% for enterprise workloads
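
A minimal sketch of such a caching layer follows, using an in-memory dict as a stand-in for Redis or Memcached. The class name, the `backend` callable, and the key scheme are assumptions for illustration. Exact-match caching is mainly useful when sampling is deterministic (for example, temperature 0).

```python
import hashlib
import json


class CompletionCache:
    """Response cache keyed by model, prompt, and sampling parameters."""

    def __init__(self, backend):
        self.backend = backend      # callable(model, prompt, **params) -> str
        self.store = {}             # stand-in for Redis or Memcached
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt, params):
        # Hash a canonical JSON encoding so equal requests map to the same key.
        raw = json.dumps([model, prompt, params], sort_keys=True)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def complete(self, model, prompt, **params):
        key = self._key(model, prompt, params)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        result = self.backend(model, prompt, **params)
        self.store[key] = result
        return result

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking `hit_rate` per service makes it possible to check whether a workload actually reaches the hit rates quoted above before investing in a shared cache cluster.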

### 4. Hardware Optimization

- GPU sharing: Multiple models on same GPU (e.g., 2 smaller + 1 larger)
- Model sharding: Distribution across multiple GPUs for larger models
- Mixed-precision: BF16/FP16 inference for up to 2x speed compared with FP32 (minor accuracy trade-off)

## Next Steps

**For CTOs and Technology Leaders:**

1. **Assess Readiness**: Audit current AI spend, data sensitivity, team capabilities
2. **Proof of Concept**: Deploy Llama 3 in staging environment (1-2 week effort)
3. **Cost-Benefit Analysis**: Calculate 12-month ROI based on projected usage
4. **Skills Development**: Train DevOps team on Docker, Traefik, GPU management

**For Consultants Implementing This Solution:**

1. **Architecture Review**: Design scalable infrastructure for client's specific needs
2. **Pilot Deployment**: Start with 1-2 models, validate performance
3. **Operational Handover**: Document all processes, provide training
4. **Ongoing Optimization**: Regular reviews, model updates, capacity planning

---

> Expert Quote: *Self-hosted AI infrastructure reduces enterprise AI costs by 60-80% while providing complete data sovereignty. The initial 4-6 week implementation delivers immediate value with long-term flexibility.* — Industry Analyst Report, 2025

---

**Related Resources:**

- [AI-Enabled DevOps: From Manual to Automated Operations](/posts/ai-enabled-devops-manual-to-automated-operations/)
- [Digital Sovereignty: Why Self-Hosting AI Matters for Enterprise](/digital-sovereignty/)