graphwiz.ai

AI-Enabled DevOps: From Manual to Automated Operations

DevOps
DevOps Automation, AI Operations, Infrastructure Automation, Self-Hosted AI, MLOps, CI/CD, Monitoring Automation, Incident Response, Log Analysis, Predictive Maintenance

Executive Summary

The evolution from manual operations to AI-automated DevOps represents the next frontier in infrastructure management. This article explores how self-hosted AI models transform operational workflows by reducing toil, improving reliability, and enabling predictive maintenance. We present a practical framework for implementing AI-enhanced DevOps while retaining control over data sovereignty and model behavior.

Key Takeaways:

  • Digital sovereignty is no longer optional; it is a legal and competitive necessity

  • Self-hosted AI provides control over data residency, model behavior, and system evolution

  • The total cost of ownership for self-hosted AI becomes competitive at scale

  • A hybrid approach balances agility with sovereignty requirements

  • AI automation reduces manual toil by 60-80% in well-defined operational domains

  • Self-hosted AI ensures data privacy for sensitive operational data

  • Model choice matters: specialize models for log analysis, anomaly detection, and decision support

  • Incremental adoption (pilots → automation → predictive) reduces risk and accelerates value extraction

The Operational Challenge: Why Manual DevOps Doesn't Scale

The Toil Problem

Manual operations create a cascade of inefficiencies:

| Operational Domain | Manual Toil | Impact |
|---|---|---|
| Incident Response | 70-80% | Slow MTTR, repetitive triage work |
| Log Analysis | 85-90% | Pattern blindness, missed anomalies |
| Configuration Management | 60-70% | Drift detection, policy violations |
| Monitoring Alerts | 75-85% | Alert fatigue, ignored warnings |
| Capacity Planning | 80-90% | Reactive scaling, waste |
| Release Coordination | 65-75% | Manual scheduling, missed dependencies |

The Human Bottleneck

Human operators face cognitive limitations:

Pattern Recognition Limits

  • 10-20 alerts before pattern blindness sets in
  • Inability to correlate across multiple systems simultaneously
  • Missing subtle signals that AI detects at scale

Fatigue and Burnout

  • On-call rotation leading to sleep deprivation
  • Reduced decision quality under stress
  • High turnover among senior operations engineers

Knowledge Silos

  • Tacit knowledge residing in individual engineers
  • Loss of expertise during staff turnover
  • Slow knowledge transfer between generations of operators

The Business Cost

Manual operations impose strategic costs:

  • MTTR (Mean Time To Response): 30-60 minutes for critical incidents vs. < 10 minutes with AI assistance
  • MTBF (Mean Time Between Failures): 50-75% higher with proactive AI detection vs. reactive responses
  • Team Productivity: 40-60% of engineering time spent on non-value-adding operational toil
  • Infrastructure Waste: 20-30% over-provisioning due to lack of predictive capacity planning

AI-Enabled DevOps: Operational Domains

Domain 1: Log Analysis and Anomaly Detection

The Problem: Millions of log entries generated daily across microservices, applications, and infrastructure. Human operators cannot manually review all logs for anomalies.

AI Solution: Self-hosted models specialize in detecting patterns, correlations, and deviations from baseline behavior.

Technical Implementation

Model Architecture

Service Components:
  Log Ingestion:
    - Fluentd/Logstash collectors (cluster-wide)
    - Kafka buffer for high-throughput ingestion
    - Retention policy: 30 days hot, 90 days warm, 365 days cold

  AI Model:
    - Transformer-based log analysis (BERT/RoBERTa fine-tuned)
    - Anomaly detection: Isolation Forest for unsupervised learning
    - Baseline establishment: Weekly rolling window detection
    - Infrastructure: NVIDIA T4 GPU, 32GB memory per model instance

  Integration:
    - Prometheus metrics: model latency, anomaly score distribution
    - Grafana dashboards: anomaly timeline, correlated system events
    - Alert routing: Slack/Teams integration with anomaly annotations
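As an illustration, the weekly rolling-baseline idea can be sketched with a simple z-score detector. This is a deliberately simplified stand-in for the Isolation Forest model described above; the class name, window sizes, and thresholds are illustrative, not part of the production design:

```python
from collections import deque
from statistics import mean, stdev

class RollingBaselineDetector:
    """Flag readings that deviate sharply from a rolling baseline.

    Keeps a fixed window of recent per-minute counts and scores new
    readings by their z-score against that window.
    """

    def __init__(self, window_size=7 * 24 * 60, threshold=3.0):
        self.window = deque(maxlen=window_size)  # e.g. one week of minutes
        self.threshold = threshold

    def observe(self, value):
        """Return True if `value` is anomalous relative to the baseline."""
        if len(self.window) >= 30:  # need enough history to score
            mu = mean(self.window)
            sigma = stdev(self.window) or 1e-9
            is_anomaly = abs(value - mu) / sigma > self.threshold
        else:
            is_anomaly = False  # still warming up
        self.window.append(value)
        return is_anomaly

detector = RollingBaselineDetector(window_size=120)
for v in [10, 11, 9, 12, 10] * 12:   # steady baseline traffic
    detector.observe(v)
print(detector.observe(500))          # sudden error spike → True
```

In production the baseline would be per-service and per-metric, and a learned model would replace the fixed z-score threshold.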

Deployment Pattern

{
  "model_name": "log-analyzer-prod",
  "infrastructure": "docker-swarm",
  "replicas": 2,
  "gpu_enabled": true,
  "auto_scaling": {
    "cpu_threshold": 70,
    "requests_per_minute_threshold": 1000
  },
  "persistence": {
    "storage": "100GB NVMe",
    "backup": "daily, retain 7 days"
  }
}

Success Metrics

| Metric | Target | Measurement |
|---|---|---|
| Anomaly Detection Precision | > 0.85 | False positive rate < 15% |
| Anomaly Detection Recall | > 0.75 | True positive rate > 75% |
| Latency | < 500ms p95 | Time from log entry to detection |
| Storage Efficiency | > 5:1 compression | Compression ratio for normalized logs |

Domain 2: Predictive Incident Response

The Problem: Operators react to incidents after outages occur, missing opportunities for preventive action.

AI Solution: Models learn system behavior patterns, predicting failures before they happen.

Technical Implementation

Model Architecture

Service Components:
  Time-Series Ingestion:
    - Prometheus scrape targets: system metrics (CPU, memory, disk, network)
    - Application instrumentation: custom business metrics
    - External monitors: synthetic transaction monitoring

  AI Model:
    - Time-series forecasting: Prophet/LSTM hybrid approach
    - Failure prediction: Classification model (Random Forest, Gradient Boosting)
    - Ensemble approach: Combine multiple models for robustness
    - Infrastructure: 2× GPU instances (A100 or Radeon VII)

  Decision Support:
    - Risk scoring: 0-100 probability of failure in next hour
    - Action recommendations: Remote restart, scale up, alert engineering
    - Integration: PagerDuty/Opsgenie for on-call routing

  Governance:
    - Human-in-the-loop: All automated actions require approval for first 30 days
    - Audit logging: All AI recommendations and operator decisions
    - Feedback loop: Operator corrections improve model accuracy

Operational Flow

# Pseudo-code for the predictive incident response flow
def handle_metric_reading(metric_name, value, timestamp):
    # Step 1: Normalization and feature engineering
    normalized = normalize(metric_name, value)
    features = extract_twenty_four_hour_window(normalized)

    # Step 2: Model inference
    probability = failure_prediction_model.predict(features)
    if probability <= THRESHOLD:
        return  # nothing to do for low-risk readings

    # Step 3: Risk scoring and recommendation
    risk_score = calculate_risk_score(features, probability)
    current_state = fetch_current_system_state()
    recommendation = recommend_action(current_state, risk_score)

    # Step 4: Governance check
    if not governance_check(recommendation):
        return

    # Step 5: Human approval and execution
    response = await_operator_approval(recommendation)
    if response.approved:
        execute_action(recommendation.action)

Success Metrics

| Metric | Target | Measurement |
|---|---|---|
| Prediction Horizon | > 1 hour | Time from prediction to failure |
| Prediction Accuracy | > 0.70 | F1 score on test set |
| False Positive Rate | < 10% | % of predictions without actual failure |
| MTTR Reduction | > 40% | Mean time to response with AI assistance |

Domain 3: Configuration Drift Detection

The Problem: Manual configuration changes accumulate, leading to drift from intended state and security misconfigurations.

AI Solution: Current state compared against golden templates with AI-powered anomaly detection for deviations.

Technical Implementation

Model Architecture

Service Components:
  State Collection:
    - Configuration crawling: SSH/Ansible playbooks across fleet
    - Container configuration: Docker API for container state
    - Cloud infrastructure: Infrastructure-as-Code state (e.g., Terraform) for cloud resources

  AI Model:
    - Similarity comparison: Embedding-based similarity (BERT or GNN)
    - Drift classification: Supervised classification for known deviation patterns
    - Policy enforcement: Rule-based enforcement for security constraints
    - Infrastructure: CPU instances (4-8 cores), 16GB memory

  Remediation:
    - Auto-remediation: Safe drift corrections with approval workflow
    - Pull request generation: GitOps-style drift correction submits PRs
    - Notification: Slack/Tickets for configuration drift
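The embedding-based similarity comparison can be sketched as follows. The vectors and the 0.98 threshold are illustrative; in practice the embeddings would come from the BERT/GNN model mentioned above:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_drifted(current_vec, golden_vec, threshold=0.98):
    """Treat anything below the similarity threshold as drift."""
    return cosine_similarity(current_vec, golden_vec) < threshold

golden = [0.12, 0.80, 0.31, 0.45]      # embedding of the golden config
identical = list(golden)
changed = [0.12, 0.10, 0.90, 0.45]     # embedding after a config change

print(is_drifted(identical, golden))   # → False
print(is_drifted(changed, golden))     # → True, similarity well below 0.98
```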

Data Model

{
  "configuration_state": {
    "hostname": "web-server-01",
    "timestamp": "2026-03-19T10:30:00Z",
    "container_configurations": [
      {
        "container_id": "abcd1234",
        "image": "nginx:1.21",
        "environment_variables": {"PORT": "8080", "ENV": "production"},
        "mount_points": ["/etc/nginx/conf.d:/conf.d"],
        "network_mode": "host"
      }
    ],
    "system_packages": ["openssl", "openssh-server", "docker"],
    "security_compliance_score": 0.87
  }
}
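Alongside embedding similarity, a direct field-by-field diff against the golden template catches exact deviations. A minimal sketch using fields from the data model above (the golden/current values are hypothetical):

```python
def diff_config(current, golden, path=""):
    """Recursively list fields where `current` deviates from `golden`."""
    drifts = []
    for key, golden_value in golden.items():
        full_path = f"{path}.{key}" if path else key
        current_value = current.get(key)
        if isinstance(golden_value, dict) and isinstance(current_value, dict):
            drifts.extend(diff_config(current_value, golden_value, full_path))
        elif current_value != golden_value:
            drifts.append((full_path, golden_value, current_value))
    return drifts

golden = {
    "image": "nginx:1.21",
    "network_mode": "host",
    "environment_variables": {"PORT": "8080", "ENV": "production"},
}
current = {
    "image": "nginx:1.25",
    "network_mode": "host",
    "environment_variables": {"PORT": "8080", "ENV": "staging"},
}

for field, want, got in diff_config(current, golden):
    print(f"drift in {field}: expected {want!r}, found {got!r}")
```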

Success Metrics

| Metric | Target | Measurement |
|---|---|---|
| Drift Detection Time | < 1 hour | Time from drift to detection |
| False Positive Rate | < 5% | % of notifications triggered by safe changes |
| Auto-Remediation Success | > 80% | % of safe drifts auto-remediated |
| Configuration Consistency | > 95% | % of resources in golden state |

Domain 4: Capacity Planning Automation

The Problem: Infrastructure is either over-provisioned to handle peaks, wasting resources, or under-provisioned, leading to outages.

AI Solution: Model learns traffic patterns and predicts future load, enabling optimized resource allocation.

Technical Implementation

Model Architecture

Service Components:
  Workload Characterization:
    - Traffic pattern analysis: application request patterns (hourly, daily, seasonal)
    - Resource consumption tracking: CPU/memory usage per microservice
    - Business metrics correlation: correlate load with business events

  AI Model:
    - Time-series forecasting: Prophet for trend + seasonality
    - Anomaly detection: Isolation Forest for unexpected traffic spikes
    - Optimization: Mixed-integer linear programming for resource allocation
    - Infrastructure: GPU instances (NVIDIA T4 for faster inference)

  Automation:
    - Auto-scaling: Kubernetes Horizontal Pod Autoscaler (HPA)
    - Cost optimization: Spot instance forecasting, reservation planning
    - Reporting: Monthly capacity planning reports with recommendations

Optimization Problem Formulation

Minimize: ∑(cost_per_instance × instance_count) + penalty_for_underprovisioning
Subject to:
  - For each service: allocated_cpu <= available_cpu
  - For each service: allocated_memory <= available_memory
  - Service SLO compliance: request_response_time < SLA_threshold
  - Business constraint: cost <= budget_constraint
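With a single instance type per service, the optimization above collapses to a simple sizing rule. The sketch below is a deliberately simplified stand-in for the mixed-integer program (a real system would use an LP/MILP solver); the instance shapes and costs are hypothetical:

```python
from math import ceil

def size_service(cpu_demand, mem_demand, cpu_per_instance,
                 mem_per_instance, cost_per_instance, headroom=1.2):
    """Pick the smallest instance count covering demand plus headroom.

    With one instance type, the optimum is just the larger of the
    CPU-driven and memory-driven counts.
    """
    instances = max(
        ceil(cpu_demand * headroom / cpu_per_instance),
        ceil(mem_demand * headroom / mem_per_instance),
        1,
    )
    return instances, instances * cost_per_instance

# e.g. a service needing 9 vCPUs and 20 GB on 4-vCPU / 16 GB instances
instances, cost = size_service(9, 20, 4, 16, cost_per_instance=120)
print(instances, cost)  # → 3 360 (CPU is the binding constraint here)
```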

Success Metrics

| Metric | Target | Measurement |
|---|---|---|
| Prediction Accuracy | > 0.80 | R² of predicted vs. actual resource usage |
| Cost Savings | > 15% | Infrastructure cost reduction over manual planning |
| SLO Compliance | > 99.5% | % of time services meet SLOs |
| Overprovisioning Reduction | > 20% | % reduction in overprovisioned resources |

Domain 5: Release Coordination and Deployment Optimization

The Problem: Manual release coordination leads to misalignment, deployment failures, and extended release cycles.

AI Solution: Analyze deployment history, identify risk factors, and optimize release schedules.

Technical Implementation

Model Architecture

Service Components:
  Deployment History Collection:
    - Automated job execution tracking: Jenkins/GitLab CI/CD logs
    - Build artifact metadata: build time, test results, change request
    - Deployment telemetry: Kubernetes events, application metrics

  AI Model:
    - Risk classification: Supervised learning (yes/no failure prediction)
    - Feature importance: SHAP values for interpretability
    - Optimization: Genetic algorithms for release scheduling optimization
    - Infrastructure: CPU instances (2-4 cores), 8GB memory

  Integration:
    - CI/CD pipeline integration: Pre-deployment risk assessment
    - Schedule optimization: Optimize testing windows for minimal disruption
    - Rollback automation: Automatic rollback on detected failures

Deployment Risk Model Features

risk_features = [
    "code_change_complexity",  # Complexity of code changes
    "test_coverage",           # Test coverage percentage
    "previous_failures",       # Historical failure rate for service
    "environment_changes",     # Changes in dependencies or environment
    "occurrence_pattern",      # Time of deployment (weekday vs. weekend)
    "operator_experience",     # Experience of operator performing deployment
    "service_criticality",     # Business impact of downstream service
    "number_of_dependencies"   # Number of dependent services
]
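To show how these features could feed a risk score, here is a hand-weighted logistic sketch. The weights and bias are purely illustrative; the supervised classifier described above would learn them from deployment history:

```python
from math import exp

# Illustrative hand-picked weights; a real system learns these from data.
WEIGHTS = {
    "code_change_complexity": 0.8,
    "test_coverage": -1.2,        # higher coverage lowers risk
    "previous_failures": 1.5,
    "environment_changes": 0.6,
    "number_of_dependencies": 0.4,
}
BIAS = -1.0

def deployment_risk(features):
    """Squash a weighted feature sum into a 0-1 risk probability."""
    z = BIAS + sum(WEIGHTS[name] * value
                   for name, value in features.items() if name in WEIGHTS)
    return 1.0 / (1.0 + exp(-z))

risky = deployment_risk({
    "code_change_complexity": 0.9, "test_coverage": 0.2,
    "previous_failures": 0.7, "environment_changes": 1.0,
    "number_of_dependencies": 0.8,
})
safe = deployment_risk({
    "code_change_complexity": 0.1, "test_coverage": 0.95,
    "previous_failures": 0.0, "environment_changes": 0.0,
    "number_of_dependencies": 0.2,
})
print(risky > safe)  # → True
```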

Success Metrics

| Metric | Target | Measurement |
|---|---|---|
| Deployment Failure Prediction | > 0.70 | Accuracy of failure prediction |
| MTTR Reduction | > 30% | Faster rollback with automation |
| Release Cycle Time | Reduce by 40% | Faster release cycles |
| Deployment Confidence | > 90% | Operator confidence in automated deployments |

The Self-Hosting Implementation Roadmap

Phase 1: Infrastructure Foundation (Weeks 1-4)

Goal: Establish secure, scalable infrastructure for AI-enabled DevOps operations.

Deliberatable Architecture Decisions

Decision 1: Container Orchestration

| Option | Advantages | Disadvantages |
|---|---|---|
| Docker Swarm | Simplicity, lower overhead, easier operations | Limited scalability, weaker support for stateful workloads |
| Kubernetes | Industry standard, autoscaling, extensive ecosystem | Higher complexity, steeper learning curve |

Recommendation: Start with Docker Swarm for simplicity, migrate to Kubernetes as scale demands.

Decision 2: Storage Layer

| Option | Advantages | Disadvantages |
|---|---|---|
| Local NVMe storage | Lowest latency, highest throughput | Limited scalability, data locality issues |
| Ceph distributed storage | Scalable, data redundancy | Higher latency, operational complexity |

Recommendation: Start with local NVMe, transition to Ceph for multi-node deployments.

Infrastructure Components

Reverse Proxy Configuration

  • Expose AI services securely behind SSL/TLS
  • Load balance across multiple model instances
  • Health checks and circuit breakers
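As a sketch of how a model service might be exposed behind Traefik in Swarm mode, consider the following compose fragment. The image name, hostname, certificate resolver name, and port are placeholders for your own environment:

```yaml
# Hypothetical service definition; hostnames and resolver names are
# placeholders, not part of a reference deployment.
services:
  log-analyzer:
    image: registry.internal/log-analyzer:latest
    deploy:
      replicas: 2
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.log-analyzer.rule=Host(`ai.example.internal`)"
        - "traefik.http.routers.log-analyzer.tls=true"
        - "traefik.http.routers.log-analyzer.tls.certresolver=letsencrypt"
        - "traefik.http.services.log-analyzer.loadbalancer.server.port=8000"
        - "traefik.http.services.log-analyzer.loadbalancer.healthcheck.path=/healthz"
```

Traefik then terminates TLS, load-balances across the two replicas, and removes instances that fail the health check.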

Apache Guacamole for Remote Access

  • Browser-based console access to AI infrastructure
  • Secure remote management from anywhere
  • Connection recording for audit trails

Authentication Layer

  • Two-factor authentication for AI service access
  • SSO integration with enterprise identity providers
  • Fine-grained access control per service

Monitoring Stack

  • Monitor AI model performance (latency, accuracy, throughput)
  • Track infrastructure health (GPU, memory, network)
  • Alert on capacity thresholds and performance degradation

Phase 1 Deliverables

  • Docker Swarm cluster with 2-3 GPU nodes operational
  • Reverse proxy (Traefik) deployed with SSL certificates
  • Authentication service (Authelia) integrated with SSO
  • Monitoring stack (Grafana/Prometheus) collecting metrics
  • Basic CI/CD pipeline for model deployment
  • Backup and disaster recovery procedures documented

Phase 2: Domain-Aware Pilots (Weeks 5-8)

Goal: Validate AI models in specific operational domains with narrow scopes.

Pilot 1: Log Anomaly Detection

Approach: Deploy single log analysis model for one service (e.g., web server).

Steps:

  1. Collect 7 days of log data from target service
  2. Establish baseline of normal log patterns
  3. Train anomaly detection model (Isolation Forest)
  4. Deploy model in Docker container with GPU access
  5. Configure alerts for detected anomalies
  6. Validate against known issues from past 30 days

Success Criteria:

  • Model detects 80% of known anomalies
  • False positive rate < 20%
  • Latency < 500ms p95 for analysis

Pilot 2: Configuration Drift Detection

Approach: Compare current fleet state against golden configs for one service.

Steps:

  1. Define golden configuration template for one microservice
  2. Collect the current state daily via a cron job
  3. Compare states using embedding-based similarity
  4. Send Slack notifications on detected drift
  5. Manually validate drift notifications

Success Criteria:

  • Detect 100% of configuration drifts (>5% changes)
  • False positive rate < 10%
  • Drift detection within 24 hours of change

Pilot 3: Capacity Forecasting

Approach: Forecast CPU/memory usage for one service over next 7 days.

Steps:

  1. Collect 90 days of historical usage data
  2. Train time-series forecasting model (Prophet)
  3. Generate daily forecasts with confidence intervals
  4. Compare forecasts to actual usage for accuracy
  5. Develop capacity planning dashboard
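The forecasting step can be illustrated with a naive same-weekday baseline. This is a simplified stand-in for Prophet, using a normal-approximation confidence interval; the synthetic usage numbers are illustrative:

```python
from statistics import mean, stdev

def forecast_next_week(daily_usage, z=1.96):
    """Forecast the next 7 days from same-weekday history.

    `daily_usage` is a list of daily values, oldest first, with length
    a multiple of 7. Each weekday is forecast as the mean of its past
    occurrences, with a normal-approximation confidence interval.
    """
    forecasts = []
    for weekday in range(7):
        history = daily_usage[weekday::7]
        mu = mean(history)
        sigma = stdev(history) if len(history) > 1 else 0.0
        forecasts.append((mu, mu - z * sigma, mu + z * sigma))
    return forecasts  # [(point, lower, upper), ...] for the next 7 days

# Four weeks of synthetic CPU usage with a weekend dip
usage = [60, 62, 61, 63, 64, 30, 28] * 4
point, lower, upper = forecast_next_week(usage)[5]   # next Saturday
print(round(point, 1))  # → 30.0
```

Prophet adds trend, multiple seasonalities, and holiday effects on top of this basic idea.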

Success Criteria:

  • Forecast accuracy: R² > 0.80
  • Confidence interval calibration: 95% of actual values fall within the 95% confidence interval
  • Automation: New forecasts generated daily without manual intervention

Phase 3: Scale-Out and Integration (Weeks 9-12)

Goal: Expand pilots to multiple services and integrate with enterprise tooling.

Integration Activities

Jenkins/GitLab CI/CD Integration

  • Add AI assessment stage to CI/CD pipeline
  • Pre-deployment risk scoring based on deployment history
  • Automated rollback triggers for detected failures

Identity Provider Integration

  • SSO integration for AI service authentication
  • Role-based access control (RBAC) for model access
  • Audit logging for AI service interactions

Security Integration

  • IP reputation filtering for AI API endpoints
  • Rate limiting to prevent abuse
  • Brute force protection for authentication

Scalability Improvements

Horizontal Scaling:

  • Deploy 2-3 replicas of each AI model
  • Load balancing across replicas
  • Auto-scaling based on request throughput

Model Optimization:

  • Quantize models for reduced memory footprint
  • Batch inference for increased throughput
  • Model distillation for latency-critical applications

Data Pipeline Scaling:

  • Scalable log aggregation (Kafka + Elasticsearch cluster)
  • Time-series database for metrics storage (Prometheus + Thanos)
  • Backup and restore procedures for model artifacts

Phase 4: Enterprise Readiness (Weeks 13-16)

Goal: Achieve production operational maturity for AI-enabled DevOps.

Production Readiness Checklist

Reliability:

  • 99.9% uptime for AI services (per SLO)
  • Automated failover for model instances
  • Disaster recovery tested (restored from backup < 1 hour)

Security:

  • SOC 2 Type II compliant infrastructure
  • Penetration test passed (no critical/high vulnerabilities)
  • Data encryption at rest and in transit (AES-256/TLS 1.3)
  • Role-based access control enforced
  • Audit logging with 90-day retention

Compliance:

  • GDPR-compliant data handling (residency, erasure, access)
  • Data processing agreement with vendors (if applicable)
  • Security certifications maintained (ISO 27001, etc.)

Operational:

  • Runbooks for common operational scenarios
  • On-call rotation with clear escalation policies
  • Capacity planning dashboard with 3-month forecast
  • Change management procedures documented

Continuous Improvement

Model Retraining:

  • Monthly model retraining with latest data
  • A/B testing for model updates
  • Canary deployments for model replacement

Feedback Loop:

  • Operator feedback on AI recommendations
  • False positive/negative tracking
  • Model performance metrics trended over time

Knowledge Sharing:

  • Documentation of lessons learned
  • Internal training for new operators
  • External conference talks (if approved)

goneuland.de Infrastructure Cross-References

Implementing AI-enabled DevOps requires foundational infrastructure components documented on goneuland.de:

Core Infrastructure

Docker Swarm Cluster

  • Orchestrate AI model containers across GPU nodes
  • Service discovery and load balancing
  • Rolling updates for model deployments

Traefik Reverse Proxy

  • Expose AI services with SSL/TLS encryption
  • Health checks and circuit breakers
  • Prometheus metrics export for monitoring

Apache Guacamole

  • Browser-based console access to AI infrastructure
  • Remote management from anywhere
  • Connection recording for audit trails

Identity and Access

Authelia Authentication

  • Two-factor authentication for AI service access
  • SSO integration with enterprise identity providers
  • Fine-grained access control per service

Keycloak SSO Server

  • Enterprise SSO for AI platform
  • Role-based access control (RBAC)
  • User federation with LDAP/Active Directory

Security and Protection

CrowdSec Security Layer

  • Brute force protection for AI API endpoints
  • Rate limiting to prevent abuse
  • IP reputation filtering for malicious traffic

Bitwarden Password Management

  • Secure credential management for AI infrastructure
  • Secrets vault for API keys and encryption keys
  • Audited access to sensitive infrastructure credentials

CI/CD Automation

Jenkins CI/CD Pipeline

  • Automated model deployment pipeline
  • Pre-deployment risk assessment integration
  • Automated rollback triggers for failed deployments

Monitoring and Observability

Grafana Dashboard Setup

  • Real-time monitoring of AI model performance
  • Resource utilization dashboards (GPU, memory, network)
  • Alerting for system health issues
  • Custom dashboards for anomaly detection timelines

Prometheus Metrics Collection

  • Collect time-series metrics from AI infrastructure
  • Model performance metrics (latency, accuracy, throughput)
  • Capacity planning data for infrastructure scaling

Elasticsearch Stack

  • Centralized log aggregation for AI services
  • Full-text search across operational logs
  • Kibana dashboards for log analysis visualization

Storage and Persistence

PostgreSQL Database Deployment

  • Persistent storage for model metadata
  • Audit logs for compliance requirements
  • Configuration drift state history

MinIO Object Storage

  • Scalable storage for model artifacts
  • Backup and restore for model deployments
  • Data ingestion buffer for high-volume workloads

Risks and Mitigation Strategies

Risk 1: Poor Model Performance in Production

Scenario: AI models underperform in production, missing critical anomalies or flooding operators with false positives.

Mitigation:

  • Maintain humans in the loop for first 90 days of production deployment
  • Set conservative thresholds initially (higher precision, lower recall)
  • Implement feature flags for rapid rollback
  • Continuous A/B testing for model improvements
  • Establish false positive/negative tracking and improvement pipeline

Risk 2: Operational Complexity Burden

Scenario: The complexity of AI-enabled DevOps operations exceeds team capabilities, leading to maintenance burden and reduced operational efficiency.

Mitigation:

  • Start with narrow scope (single domain, single service) before expanding
  • Develop comprehensive runbooks and training materials
  • Hire or train ML engineering expertise
  • Implement comprehensive monitoring and alerting early
  • Prioritize operational simplicity over feature completeness

Risk 3: Data Privacy and Compliance Issues

Scenario: AI models process sensitive data in ways that violate regulatory requirements (e.g., training on customer data without consent).

Mitigation:

  • Design compliance-by-data-domicile architecture
  • Data encryption at rest and in transit
  • Role-based access control for operational data
  • Audit logging for all data access
  • Regular compliance reviews with legal/compliance teams

Risk 4: Vendor Dependency for Models

Scenario: Over-dependence on specific AI model families (e.g., only BERT, only OpenAI) limits flexibility and innovation.

Mitigation:

  • Use modular architecture to support multiple model families
  • Implement model abstraction layer for model replacements
  • Maintain open-source models as fallbacks
  • Regular evaluation of new model architectures

Risk 5: Cost Overruns

Scenario: Infrastructure costs (GPU instances, storage, licensing) exceed projections and budget constraints.

Mitigation:

  • Start with CPU instances for inference, add GPUs only as needed
  • Implement request batching and model quantization for efficiency
  • Use spot instances for non-critical workloads
  • Implement capacity planning dashboards for cost visibility
  • Phase deployments to validate investments at each stage

ROI Calculation Framework

Quantitative Benefits

Operational Efficiency Gains:

  • Reduced MTTR: 40-60% reduction in incident resolution time
  • Reduced Alert Fatigue: 50-70% reduction in manual alert triage
  • Reduced Toil: 60-80% reduction in manual operational tasks

Infrastructure Optimization:

  • Reduced Overprovisioning: 20-30% reduction in overprovisioned infrastructure
  • Improved Resource Utilization: 15-25% improvement in CPU/memory utilization
  • Extended Hardware Lifespan: 10-20% longer hardware replacement cycles

Cost Avoidance:

  • Avoided Outages: Estimate value of avoided downtime based on business impact
  • Reduced Team Turnover: Reduced on-call burnout reduces hiring costs
  • Faster Innovation: Reduced operational toil frees engineering time for innovation

Qualitative Benefits

Improved Reliability:

  • Proactive incident detection and prevention
  • More consistent operational procedures across team
  • Reduced human error through automated validation

Enhanced Compliance:

  • Automated compliance monitoring (configuration drift)
  • Audit-ready logging and monitoring
  • Reduced manual compliance overhead

Business Agility:

  • Faster deployments with automated risk assessment
  • More accurate capacity planning enables proactive scaling
  • Reduced time-to-market for new features

ROI Calculation Example

Scenario: Mid-sized company with 5 microservices, 3 operations engineers, 20TB infrastructure.

Investment (Year 1):

  • Infrastructure CAPEX: $50,000 (3 GPU nodes, storage, networking)
  • Personnel: $200,000 (ML engineer + operations training)
  • Software licenses: $20,000 (monitoring, security tooling)
  • Total Investment Year 1: $270,000

Benefits (Year 1):

  • Operational Efficiency: 50% reduction in toil = 1 FTE saved ($150,000)
  • Infrastructure Savings: 20% overprovisioning reduction = $40,000
  • Avoided Outages: 2 outages avoided × $50,000 impact = $100,000
  • Total Benefits Year 1: $290,000

ROI Year 1: (Benefits - Investment) / Investment = ($290K - $270K) / $270K ≈ 7.4%

ROI Month 6: (Benefits Month 1-6 - Investment Month 1-6) / Investment Month 1-6

  • After 6 months: Benefits ~$145K, Investment ~$135K (cumulative)
  • ROI Month 6: ~7%

Note: ROI improves in subsequent years as investment amortizes over multiple years and models become more effective with more data.
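The year-one arithmetic above can be captured in a small helper:

```python
def roi(total_benefits, total_investment):
    """Simple ROI: (benefits - investment) / investment."""
    return (total_benefits - total_investment) / total_investment

year_one = roi(290_000, 270_000)
print(f"{year_one:.1%}")  # → 7.4%
```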

Conclusion: The Path to AI-Enabled DevOps

The transformation from manual operations to AI-automated DevOps represents a tremendous opportunity for organizations to improve reliability, reduce operational toil, and accelerate innovation.

The journey begins with a strategic commitment to operational excellence and investments in both technical infrastructure and team capabilities. By starting small, iterating quickly, and learning from failures, organizations can gradually expand AI automation across operational domains.

The organizations that embrace AI-enabled DevOps today will enjoy competitive advantages in:

  • Reliability: Higher uptime, faster incident response
  • Efficiency: More productive teams, lower operational costs
  • Agility: Faster deployments, more flexible capacity planning
  • Innovation: Greater bandwidth for strategic initiatives, less time fighting fires

The time to start building AI-enabled DevOps capabilities is now—before competitors gain operational advantages that become insurmountable.


This article is part of the Transforming Operations Series on tobias-weiss.org, exploring how AI transforms operational workflows.