
Optimizing for Dual-GPU Recording: Scaling to 100+ Concurrent Streams



When your recording rig scales beyond 30-40 concurrent streams, a single GPU, even a powerhouse like the RTX 4090, can become a bottleneck. The limit is usually not raw performance but NVENC session caps and PCIe bandwidth starvation.

This masterclass covers how to architect a dual-GPU setup that substantially raises your capture capacity (in our benchmarks, from 38 to 62 stable 4K streams).

The Dual-GPU Architecture

In a professional multi-GPU setup, we move away from the “One Card Does All” approach. Instead, we define Primary and Secondary roles:

  1. Primary GPU (The Ingestor): Responsible for high-fidelity 4K/8K recording and initial stream capture.
  2. Secondary GPU (The Transcoder): Handles background transcoding (e.g., converting RAW to HEVC for cloud archival) or managing lower-priority 1080p captures.

Optimal Hardware Combinations

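| Role | Recommended GPU | Why? |
|---|---|---|
| Primary | RTX 4090 / 4080 | 24GB (4090) or 16GB (4080) of VRAM for high-res buffers; dual 8th-gen NVENC encoders with AV1 support. |
| Secondary | RTX 4060 Ti (16GB) | High VRAM-to-price ratio; a single 8th-gen NVENC encoder for efficient transcoding. |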

Mastering CaptureGem GPU Flags

CaptureGem allows you to target specific GPUs using FFmpeg-level device mapping.

Targeting Specific Devices

To pin a specific model's recording to your second GPU, pass the GPU index on the CLI (the same mapping can live in config.yaml):

# CaptureGem CLI example targeting Device 1 (Secondary GPU)
capturegem --model "ModelName" --gpu 1 --ffmpeg-flags "-c:v h264_nvenc -gpu 1"
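If you run FFmpeg directly rather than through CaptureGem, the same pinning can be expressed as an argument list. This is a minimal sketch: build_nvenc_cmd is a hypothetical helper and the stream URL and output name are placeholders, while -hwaccel cuda, -hwaccel_device, -c:v h264_nvenc and the NVENC encoder's -gpu option are standard FFmpeg options.

```python
import shlex


def build_nvenc_cmd(input_url: str, output_path: str, gpu_index: int) -> list[str]:
    """Build an FFmpeg argument list that decodes and encodes on one GPU.

    build_nvenc_cmd is a hypothetical helper; the flags are stock FFmpeg.
    """
    return [
        "ffmpeg",
        # Decode on the chosen GPU via CUDA hardware acceleration.
        "-hwaccel", "cuda",
        "-hwaccel_device", str(gpu_index),
        "-i", input_url,
        # Encode with NVENC; -gpu selects which NVENC-capable GPU is used.
        "-c:v", "h264_nvenc",
        "-gpu", str(gpu_index),
        # Pass audio through untouched.
        "-c:a", "copy",
        output_path,
    ]


if __name__ == "__main__":
    cmd = build_nvenc_cmd("https://example.com/stream.m3u8", "model_d.mkv", gpu_index=1)
    # Print a shell-safe version; hand `cmd` to subprocess.run() to execute it.
    print(shlex.join(cmd))
```

Device enumeration order can differ between tools, so confirm with nvidia-smi that the encode actually lands on the card you expect while a recording runs.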

The config.yaml Multi-GPU Map

# Balanced load distribution
instances:
  - id: "rig-alpha"
    gpu_id: 0
    models: ["Model_A", "Model_B", "Model_C"]
  - id: "rig-beta"
    gpu_id: 1
    models: ["Model_D", "Model_E", "Model_F"]
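The map above splits models evenly by hand. If some models stream at higher resolutions than others, a greedy least-loaded assignment keeps per-GPU load closer to even. This is a minimal sketch: the per-model weights (4 for a 4K stream, 1 for 1080p) are hypothetical costs you would tune yourself, and the output simply mirrors the instances layout above rather than any documented CaptureGem API.

```python
import heapq


def assign_models(weights: dict[str, int], gpu_ids: list[int]) -> dict[int, list[str]]:
    """Greedy least-loaded assignment: heaviest models first, each to the lightest GPU."""
    # Min-heap of (current_load, gpu_id); ties go to the lower GPU id.
    heap = [(0, gpu) for gpu in gpu_ids]
    heapq.heapify(heap)
    plan: dict[int, list[str]] = {gpu: [] for gpu in gpu_ids}
    # Descending weight, then name, so the result is deterministic.
    for model, weight in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        load, gpu = heapq.heappop(heap)
        plan[gpu].append(model)
        heapq.heappush(heap, (load + weight, gpu))
    return plan


if __name__ == "__main__":
    # Hypothetical costs: 4 = 4K stream, 1 = 1080p stream.
    weights = {"Model_A": 4, "Model_B": 4, "Model_C": 1,
               "Model_D": 1, "Model_E": 1, "Model_F": 1}
    for gpu, models in assign_models(weights, [0, 1]).items():
        print(f"gpu_id: {gpu}  models: {models}")
```

If you follow the strict Primary/Secondary split instead, skip the balancing and map your 4K models to GPU 0 directly.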

NVENC Session Management

Consumer (GeForce) NVIDIA cards are limited by the driver to a handful of concurrent NVENC sessions: 5 or 8, depending on driver version. Unofficial driver patches exist to lift this cap, but the safest professional approach is to split the load across cards.
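To see how close each card is to its cap, nvidia-smi reports active encoder sessions per GPU through the encoder.stats.sessionCount query field. This minimal sketch parses that CSV output; the cap of 8 is an assumption, so set SESSION_CAP to whatever your driver version enforces.

```python
import shutil
import subprocess

# Assumed per-card cap; set this to what your driver version enforces.
SESSION_CAP = 8


def parse_sessions(csv_text: str) -> dict[int, int]:
    """Parse 'index, sessionCount' lines into {gpu_index: active_sessions}."""
    sessions = {}
    for line in csv_text.strip().splitlines():
        index, count = (field.strip() for field in line.split(","))
        sessions[int(index)] = int(count)
    return sessions


def gpus_with_free_slot(sessions: dict[int, int], cap: int = SESSION_CAP) -> list[int]:
    """GPU indices that can still open another NVENC session."""
    return [gpu for gpu, used in sorted(sessions.items()) if used < cap]


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; run this on the recording host.")
    else:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,encoder.stats.sessionCount",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        sessions = parse_sessions(out)
        print("Active NVENC sessions:", sessions)
        print("GPUs with a free slot:", gpus_with_free_slot(sessions))
```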

💡 Pro Tip: VRAM is the Real Limit

Even if you unlock sessions, you will hit a VRAM wall. A 4K stream buffer typically consumes ~400MB of VRAM, so an RTX 4090 (24GB) can safely handle ~45-50 high-bitrate sessions (roughly 18-20GB) before Out of Memory errors start appearing in the logs, leaving a few GB of headroom.
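The 45-50 figure is just VRAM divided by per-buffer cost after reserving a safety margin. A quick sketch of that arithmetic, where the headroom percentages are assumptions you can adjust:

```python
def max_sessions(vram_mb: int, per_session_mb: int, headroom_pct: int = 20) -> int:
    """How many stream buffers fit in VRAM after reserving a safety margin."""
    usable_mb = vram_mb * (100 - headroom_pct) // 100
    return usable_mb // per_session_mb


if __name__ == "__main__":
    # RTX 4090: 24 GB = 24576 MB; ~400 MB per 4K buffer (figure from the tip above).
    for headroom in (15, 20, 25):
        print(f"{headroom}% headroom -> {max_sessions(24576, 400, headroom)} sessions")
```

This prints 52, 49 and 46 sessions; reserving 20-25% lands in the 45-50 range quoted above. The same 20% rule gives a 16GB RTX 4060 Ti room for about 32 buffers.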

Avoiding PCIe Bandwidth Starvation

Many consumer motherboards (Z790/X670) offer only one x16 slot wired to the CPU. The second physical x16 slot often runs at x4 through the chipset.

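The Bottleneck: A chipset-attached GPU shares the chipset's single uplink to the CPU with everything else behind the chipset: NVMe and SATA drives, USB, and networking. If your multi-stream writes land on a chipset-attached drive while the secondary GPU is moving frames, that shared link can saturate, adding latency and causing dropped frames even when the secondary GPU's own utilization is low.

The Solution: Use a Threadripper or Xeon platform with 64+ CPU lanes, or make sure your Secondary GPU handles lower-bitrate tasks that don’t saturate the chipset link.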
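Why "lower-bitrate" matters becomes clear with a rough bandwidth budget. This sketch assumes frames cross the bus uncompressed as NV12 (1.5 bytes per pixel), which happens when CPU-side decoding or filtering feeds NVENC, and counts only PCIe's 128b/130b line encoding, so real throughput will be somewhat lower. The 20 Mbps compressed bitrate is an illustrative value, and x4 stands in for a typical chipset slot.

```python
def pcie_gbytes_per_s(gen: int, lanes: int) -> float:
    """Usable PCIe bandwidth in GB/s, counting only 128b/130b line encoding."""
    transfer_rate_gt = {3: 8.0, 4: 16.0, 5: 32.0}[gen]  # GT/s per lane
    return transfer_rate_gt * (128 / 130) / 8 * lanes


def raw_stream_gbytes_per_s(width: int, height: int, fps: float) -> float:
    """Bandwidth of uncompressed NV12 frames (1.5 bytes per pixel)."""
    return width * height * 1.5 * fps / 1e9


if __name__ == "__main__":
    raw_4k30 = raw_stream_gbytes_per_s(3840, 2160, 30)  # ~0.373 GB/s per stream
    compressed = 20e6 / 8 / 1e9                        # 20 Mbps = 0.0025 GB/s
    for gen, lanes in [(4, 4), (4, 8), (4, 16)]:
        link = pcie_gbytes_per_s(gen, lanes)
        print(f"PCIe {gen}.0 x{lanes}: {link:.2f} GB/s -> "
              f"{int(link // raw_4k30)} raw 4K30 streams, "
              f"{int(link // compressed)} compressed 20 Mbps streams")
```

On these assumptions a PCIe 4.0 x4 link tops out around 21 raw 4K30 streams but could carry over 3,000 compressed 20 Mbps bitstreams, so uncompressed frame traffic, not the recorded bitstreams themselves, is what saturates a narrow chipset link.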

Operating System Setup

Docker Passthrough

When running in containers, pin each container to explicit GPUs (by index, or by the UUID that nvidia-smi -L reports) so it only sees the cards you intend and doesn’t contend for the primary device.

# docker-compose.yml snippet
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ['0', '1'] # Maps both GPUs to the container
          capabilities: [compute, utility, video]
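Compose's device_ids also accepts GPU UUIDs, which stay tied to the physical card even if index order changes after a hardware or driver change. This minimal sketch turns nvidia-smi -L output into a device_ids line; the regex assumes the usual "GPU 0: NAME (UUID: GPU-...)" line format.

```python
import re
import shutil
import subprocess

# Matches lines like: "GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-1a2b...)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[0-9a-fA-F-]+)\)")


def parse_gpu_list(text: str) -> list[tuple[int, str, str]]:
    """Return (index, name, uuid) tuples from `nvidia-smi -L` output."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; run this on the Docker host.")
    else:
        out = subprocess.run(["nvidia-smi", "-L"], capture_output=True,
                             text=True, check=True).stdout
        gpus = parse_gpu_list(out)
        for index, name, uuid in gpus:
            print(f"# GPU {index}: {name} -> {uuid}")
        # Paste this line into the compose file in place of the bare indices.
        ids = ", ".join(f"'{uuid}'" for _, _, uuid in gpus)
        print(f"device_ids: [{ids}]")
```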

Benchmarks: Single vs. Dual

| Metric | Single RTX 4090 | Dual (4090 + 4060 Ti) | Improvement |
|---|---|---|---|
| Max Stable 4K Streams | 38 | 62 | +63% |
| Transcode Latency | 120 ms | 45 ms | -62.5% |
| System Thermal Load | 82°C | 74°C (distributed) | -8°C |
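The Improvement column is plain relative change, (after - before) / before. A quick check of the two percentage rows:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


if __name__ == "__main__":
    print(f"4K streams:        {pct_change(38, 62):+.1f}%")   # +63.2%
    print(f"Transcode latency: {pct_change(120, 45):+.1f}%")  # -62.5%
```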

Cooling & Power

A dual-GPU rig can pull 800W+ under full recording load.

  • PSU: Use a 1200W 80 PLUS Platinum-rated unit.
  • Spacing: Ensure at least 2 slots of air between cards. Blowers are preferred for the bottom card to exhaust heat directly out the rear.
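The 1200W recommendation follows from a simple power budget. In this sketch the GPU figures are NVIDIA's reference board-power specs (450W for the RTX 4090, 165W for the 16GB RTX 4060 Ti); the CPU and rest-of-system numbers are placeholder assumptions for a typical recording rig.

```python
# GPU figures: NVIDIA reference board power. CPU and "rest of system" values
# are placeholder assumptions; replace them with your own measurements.
COMPONENT_WATTS = {
    "RTX 4090": 450,
    "RTX 4060 Ti 16GB": 165,
    "CPU (assumed)": 200,
    "Board, drives, fans (assumed)": 75,
}


def psu_load_pct(component_watts: dict[str, int], psu_watts: int) -> float:
    """Steady-state draw as a percentage of the PSU's rated output."""
    return sum(component_watts.values()) / psu_watts * 100


if __name__ == "__main__":
    total = sum(COMPONENT_WATTS.values())
    print(f"Estimated draw: {total} W")
    print(f"Load on a 1200 W unit: {psu_load_pct(COMPONENT_WATTS, 1200):.0f}%")
```

With these numbers the rig draws about 890W, roughly 74% of a 1200W unit, which leaves margin for the brief power spikes modern GPUs produce above their rated board power.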

Summary Checklist

  1. Assign IDs: Map your high-priority models to GPU 0.
  2. Set Flags: Use the -gpu flag in your FFmpeg strings.
  3. Monitor VRAM: Use nvidia-smi to ensure neither card exceeds 90% memory usage.
  4. Check Lanes: Verify PCIe link speed and width in GPU-Z while recording, since cards drop to a lower link speed at idle.
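Checklist items 3 and 4 can be scripted from the same nvidia-smi query, which also helps on Linux hosts where GPU-Z is unavailable. This minimal sketch uses the memory.used, memory.total, pcie.link.gen.current and pcie.link.width.current fields; as with GPU-Z, read the link values while recording, because they fall at idle.

```python
import shutil
import subprocess

QUERY = "index,memory.used,memory.total,pcie.link.gen.current,pcie.link.width.current"
VRAM_LIMIT_PCT = 90  # threshold from the checklist above


def check_gpus(csv_text: str, limit_pct: int = VRAM_LIMIT_PCT) -> list[str]:
    """Return one status line per GPU, flagging VRAM use above the limit."""
    report = []
    for line in csv_text.strip().splitlines():
        index, used, total, gen, width = (field.strip() for field in line.split(","))
        pct = int(used) / int(total) * 100
        flag = "OVER LIMIT" if pct > limit_pct else "ok"
        report.append(f"GPU {index}: {pct:.0f}% VRAM ({flag}), PCIe gen {gen} x{width}")
    return report


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; run this on the recording host.")
    else:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        print("\n".join(check_gpus(out)))
```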
