[2026 Edition] Optimizing AI Workloads on Kubernetes: From GPU Management to Autoscaling

Tech Trends AI

February 12, 2026

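Introduction

Kubernetes has become the de facto standard platform for running AI/ML workloads in production. However, efficient GPU resource management and the scaling requirements specific to AI differ considerably from those of traditional web applications.

This article walks through practical techniques for optimizing AI workloads on Kubernetes.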

Kubernetes Architecture for AI Workloads

Basic Configuration

A typical architecture for running an AI inference service on Kubernetes looks like this:

[Ingress/API Gateway]
    ↓
[Inference service Pod (with GPU)]
    ↓
[Model storage (S3/PV)]

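Node Pool Design

For AI workloads, a multi-node-pool configuration that separates GPU nodes from CPU nodes is recommended, so that CPU-only Pods do not occupy expensive GPU nodes.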

Node pool	Purpose	Example instance	GPU
system	System components	e2-standard-4	None
cpu-workers	Pre- and post-processing	c2-standard-16	None
gpu-inference	AI inference	a2-highgpu-1g	A100×1
gpu-training	Model training	a2-megagpu-16g	A100×16
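
One way to pin a workload to one of these pools is a node selector combined with a toleration. The sketch below assumes GKE (the instance names in the table are Google Cloud machine types), where nodes carry the `cloud.google.com/gke-nodepool` label and GPU nodes are tainted with the `nvidia.com/gpu` key. On other platforms the label and taint names will differ. The Pod name is hypothetical and the image is the placeholder used elsewhere in this article.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pool-pinned-inference        # hypothetical name
spec:
  # Schedule only onto nodes in the gpu-inference pool (GKE node pool label).
  nodeSelector:
    cloud.google.com/gke-nodepool: gpu-inference
  # Tolerate the GPU taint so the Pod is allowed onto GPU nodes.
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: model-server
    image: my-inference:latest       # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```

Pods without the toleration stay off the GPU pools, which is what keeps pre- and post-processing work on the cheaper `cpu-workers` nodes.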

GPU Resource Management

Configuring the NVIDIA Device Plugin

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
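
Once the DaemonSet is running, a short-lived test Pod is a simple way to check that nodes actually advertise `nvidia.com/gpu`. The sketch below follows the pattern of the sample in the device plugin's README; the Pod name is hypothetical, and the image tag is an assumption that may need adjusting to match the CUDA version supported by your driver.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test               # hypothetical name
spec:
  restartPolicy: Never               # run once and stop
  containers:
  - name: cuda-vectoradd
    # Assumed tag; pick one compatible with your driver.
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
    resources:
      limits:
        nvidia.com/gpu: 1            # stays Pending if no GPU is advertised
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

If the Pod stays in `Pending`, `kubectl describe node` on a GPU node shows whether `nvidia.com/gpu` appears under the node's allocatable resources.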

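GPU Sharing and Partitioning

Techniques for sharing a single GPU among multiple Pods:

GPU Time-Slicing: shares a GPU by time division (supported by the NVIDIA GPU Operator)
MIG (Multi-Instance GPU): hardware-level partitioning on A100/H100
vGPU: partitioning through virtual GPU technology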

# Time-Slicing configuration example
# replicas: 4 advertises each physical GPU as four nvidia.com/gpu resources.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-sharing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
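
For comparison, a MIG slice is requested as its own resource type rather than as `nvidia.com/gpu`. The sketch below assumes the device plugin or GPU Operator runs with the `mixed` MIG strategy and that an A100 has already been partitioned into `1g.5gb` instances; the resource name depends on the profile you configure. Unlike time-slicing, which provides no memory isolation between the Pods sharing a GPU, each MIG instance has its own dedicated memory and compute. The Pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference                # hypothetical name
spec:
  containers:
  - name: model-server
    image: my-inference:latest       # placeholder image
    resources:
      limits:
        # One 1g.5gb MIG instance (assumed profile; mixed strategy).
        nvidia.com/mig-1g.5gb: 1
```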

Requesting GPUs in a Pod

apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
  - name: model-server
    image: my-inference:latest
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "16Gi"
      requests:
        nvidia.com/gpu: 1
        memory: "12Gi"
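    env:
    # Usually redundant: the device plugin exposes only the allocated GPU to the
    # container, where it is enumerated as device 0.
    - name: CUDA_VISIBLE_DEVICES
      value: "0"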

Deploying AI Inference Services

Triton Inference Server Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-py3
        args: ["tritonserver", "--model-repository=/models"]
        ports:
        - containerPort: 8000  # HTTP
        - containerPort: 8001  # gRPC
        - containerPort: 8002  # Metrics
        resources:
          limits:
            nvidia.com/gpu: 1
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        volumeMounts:
        - name: model-store
          mountPath: /models
      volumes:
      - name: model-store
        persistentVolumeClaim:
          claimName: model-pvc
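
To make the Deployment reachable from the Ingress/API Gateway layer, it needs a Service in front of it. A minimal sketch, assuming a ClusterIP Service is sufficient and reusing the `app: triton` label and the three container ports from the manifest above; the Service name and port names are choices made here for illustration.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: triton-inference             # assumed name, mirrors the Deployment
  labels:
    app: triton
spec:
  selector:
    app: triton                      # matches the Deployment's Pod labels
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
  - name: metrics                    # Prometheus-format metrics endpoint
    port: 8002
    targetPort: 8002
```

Naming the ports lets monitoring resources refer to `metrics` by name instead of by number.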

High-Performance LLM Inference with vLLM

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-3.1-70B-Instruct"
        - "--tensor-parallel-size"
        - "4"
        - "--max-model-len"
        - "8192"
        ports:
        - containerPort: 8000
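        env:
        # The meta-llama models are gated on Hugging Face, so a token is needed.
        # The Secret name and key are placeholders; create your own.
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        resources:
          limits:
            nvidia.com/gpu: 4
            memory: "320Gi"
        volumeMounts:
        # Tensor parallelism relies on shared memory between worker processes.
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory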

Autoscaling Strategies

Scaling with HPA

For AI inference workloads, scaling on GPU utilization or request queue length is effective.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference
  minReplicas: 1
  maxReplicas: 10
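  # gpu_utilization and inference_queue_length are custom metrics: they must be
  # exposed through the custom metrics API (for example via Prometheus Adapter).
  # With several metrics, the HPA uses whichever yields the most replicas.
  metrics: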
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "70"
  - type: Pods
    pods:
      metric:
        name: inference_queue_length
      target:
        type: AverageValue
        averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
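
To see how the targets in this manifest turn into a replica count, it helps to work through the formula the HPA controller documents. The numbers in the second line are made up for illustration: 2 running replicas and an average `inference_queue_length` of 12 against the target of 5.

```latex
\[
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]
\[
\left\lceil 2 \times \frac{12}{5} \right\rceil = \lceil 4.8 \rceil = 5
\]
```

The controller would therefore want 5 replicas, but the `scaleUp` policy above allows at most 2 additional Pods per 120 seconds, so the Deployment would first grow from 2 to 4 and reach 5 in a later period if the queue is still long.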

Karpenter / Cluster Autoscaler

These tools automate adding and removing GPU nodes.

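# Karpenter NodePool example (v1beta1 API)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    spec:
      # nodeClassRef is required; "default" is a placeholder EC2NodeClass name.
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
      requirements:
      - key: "node.kubernetes.io/instance-type"
        operator: In
        values: ["g5.xlarge", "g5.2xlarge", "p4d.24xlarge"]
      - key: "karpenter.sh/capacity-type"
        operator: In
        values: ["on-demand", "spot"]
  # Cap the total number of GPUs this NodePool may provision.
  limits:
    nvidia.com/gpu: 32
  # In the karpenter.sh/v1 API this policy is named WhenEmptyOrUnderutilized
  # and expireAfter moves under spec.template.spec.
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h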

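Cost Optimization Techniques

Using Spot Instances

Points to watch when using spot instances for inference workloads:

Graceful shutdown handling is essential
Specify multiple instance types to secure availability
Configure on-demand capacity as a fallback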
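
The graceful-shutdown point can be sketched as a termination grace period, a `preStop` hook, and a PodDisruptionBudget. This is a minimal illustration, not a complete recipe: the names, the 60-second grace period, and the 15-second sleep are assumptions, and the hook requires `sh` and `sleep` to exist in the image. The interruption notice varies by cloud (roughly two minutes on AWS), so the grace period should fit inside it.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-inference               # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spot-inference
  template:
    metadata:
      labels:
        app: spot-inference
    spec:
      # Time allowed between SIGTERM and SIGKILL (assumed value).
      terminationGracePeriodSeconds: 60
      containers:
      - name: model-server
        image: my-inference:latest   # placeholder image
        lifecycle:
          preStop:
            exec:
              # Pause before SIGTERM so load balancers stop sending traffic
              # while in-flight requests finish.
              command: ["sh", "-c", "sleep 15"]
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: spot-inference-pdb           # hypothetical name
spec:
  minAvailable: 1                    # keep at least one replica during drains
  selector:
    matchLabels:
      app: spot-inference
```

A PodDisruptionBudget only governs voluntary evictions such as node drains; it cannot stop the cloud provider from reclaiming a spot node, which is why running more than one replica across instance types still matters.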

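Model Optimization

Technique	GPU memory reduction	Inference speedup	Accuracy impact
INT8 quantization	50%	1.5-2x	Minor
INT4 quantization	75%	2-3x	Moderate
Model distillation	60-80%	2-5x	Depends on use case
Pruning	30-50%	1.2-1.5x	Minor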
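
As a rough worked example of the INT4 row: a 70B-parameter model needs about 70 × 2 = 140 GB for FP16 weights, while 4-bit weights need about 70 × 0.5 = 35 GB, which is the 75% reduction in the table (KV cache and activations come on top of this). With vLLM this could translate into fewer GPUs per replica. The fragment below shows only the fields of the vLLM container that would change; it assumes an AWQ-quantized checkpoint, and the model name is a hypothetical placeholder, not a real repository.

```yaml
# Fragment of the vllm container spec (only the fields that change).
args:
- "--model"
- "example-org/llama-3.1-70b-instruct-awq"   # hypothetical placeholder
- "--quantization"
- "awq"                                      # 4-bit AWQ weights
- "--tensor-parallel-size"
- "2"                                        # assumed: fewer GPUs suffice
- "--max-model-len"
- "8192"
resources:
  limits:
    nvidia.com/gpu: 2                        # must match tensor-parallel-size
```

Whether two GPUs are actually enough depends on the GPU memory size and the context length, so treat the numbers as a starting point for testing.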

Monitoring Resource Utilization

# Collect GPU metrics with Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gpu-metrics
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s
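
Collected metrics become more useful for cost control once an alert flags idle GPUs. The sketch below assumes the Prometheus Operator is installed and that dcgm-exporter exposes its default `DCGM_FI_DEV_GPU_UTIL` metric with the `Hostname` and `gpu` labels; the rule name, the 20% threshold, and the 30-minute windows are illustrative assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-rules        # hypothetical name
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUUnderutilized
      # Average utilization per GPU over the last 30 minutes, below 20%.
      expr: avg by (Hostname, gpu) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m])) < 20
      for: 30m
      labels:
        severity: info
      annotations:
        summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} has averaged under 20% utilization"
```

GPUs that trigger this alert regularly are candidates for time-slicing, MIG partitioning, or a smaller node pool.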

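Summary

Key points for optimizing AI workloads on Kubernetes:

GPU management: use the Device Plugin with Time-Slicing/MIG to use GPU resources efficiently
Inference servers: achieve high throughput with dedicated tools such as Triton and vLLM
Autoscaling: scale appropriately based on GPU utilization and queue length
Cost optimization: cut costs with spot instances and model optimization

Build a reliable service while balancing the cost and performance of AI inference.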
