Why Offloading Video Decode to Integrated GPUs Matters for Edge AI


Overview

Video decoding consumes substantial CPU resources in edge AI systems, often becoming the bottleneck before any AI inference can begin. Our analysis using Intel QuickSync demonstrates that offloading decode to integrated GPUs dramatically reduces CPU usage and power consumption, enabling systems to handle significantly more camera streams. This untapped hardware acceleration already exists in most modern processors and requires only simple FFmpeg configuration changes to free your CPU cores for actual AI workloads.

Table of contents

  1. Introduction
  2. Key Findings: Why Hardware Decode Matters
  3. Why Hardware Acceleration Works So Well
  4. Benchmark Setup: Systems and Test Scenarios
  5. Benchmark Results: CPU vs GPU Acceleration
  6. When Hardware Acceleration Makes the Biggest Impact
  7. Try It Yourself: make87 System Template
  8. Beyond Intel: Universal Hardware Support
  9. Conclusion: Free Your CPU for What Matters
  10. Important Considerations
  11. Appendix

Introduction

Edge AI systems often need to handle multiple video feeds (e.g. from security cameras or robots) and run computer vision models on those streams. A critical but sometimes overlooked aspect is video decoding efficiency. Before any AI processing can happen, each camera's compressed stream (H.264, HEVC, etc.) must be decoded into raw frames – which can be computationally expensive. If you have several 4K cameras, decoding alone can chew through a large chunk of a CPU’s capacity, leaving less room for the actual vision algorithms. In power-constrained edge devices, it can also waste precious watts and generate excess heat.

Most modern CPUs include an integrated GPU with specialized video decode hardware. On Intel platforms this is known as QuickSync, exposed in software through VAAPI or oneVPL (the Intel documentation is the best place to untangle the various APIs and naming schemes), and similar blocks exist on ARM SoCs and NVIDIA Jetson devices. Offloading the heavy lifting of video decoding to these fixed-function units reduces CPU load and power consumption, allowing your edge AI system to scale to more cameras or run heavier models.
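
To make the idea concrete before the numbers, here is the smallest possible FFmpeg comparison. This is a sketch only: the RTSP URL is a placeholder and it assumes a build compiled with QSV support.

  # Software decode: libavcodec runs the HEVC decoder entirely on CPU cores
  ffmpeg -i rtsp://CAMERA_IP/stream -f null -

  # Hardware decode: the same stream, handled by the iGPU's QuickSync block
  ffmpeg -hwaccel qsv -i rtsp://CAMERA_IP/stream -f null -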

In this article, we'll quantify the benefits through benchmarks on two systems – an older 4-core Intel i5 and a modern 12th-gen Intel i9.

At make87, we regularly work with clients building multi-camera computer vision systems who wonder whether their existing hardware can handle their ambitious deployments. A common pattern: engineering teams running established 4-6 core systems discover they can handle multiple 4K camera feeds without expensive upgrades, because most Intel systems from 2015 onward already include hardware video acceleration. The key is knowing how to unlock that capability.

Key Findings: Why Hardware Decode Matters

Our benchmarks reveal consistent advantages for computer vision pipelines:

  - Offloading decode to the iGPU cut CPU utilization by 28-70% on the 4-core i5-7500T and by 23-52% on the i9-12900HK.
  - CPU package power dropped by roughly 3.8W per stream on the i5 and 5.3W per stream on the i9.
  - While decoding 4K HEVC, the iGPUs stayed at only 3-11% utilization, leaving headroom for many additional streams.
  - The largest gains came from preprocessing-heavy pipelines, where decoding and scaling on the GPU reduced CPU load by up to 3.3×.

Note: The most efficient scaling is configuring your camera to record at the resolution you actually need. If your ML model only needs 960×540 input, having the camera encode at that resolution eliminates both decode overhead AND scaling overhead entirely. However, when you need multiple resolutions from the same feed or can't control camera settings, hardware-accelerated scaling provides the next-best efficiency.
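
When you do need several resolutions from one feed, a sketch along these lines keeps both the decode and the resize on the iGPU. It assumes an FFmpeg build where split and scale_qsv accept QSV surfaces, and uses a placeholder RTSP URL.

  # Decode once on the iGPU, keep a full-resolution branch and a 960x540 branch for the model
  ffmpeg -hwaccel qsv -hwaccel_output_format qsv -i rtsp://CAMERA_IP/stream \
    -filter_complex "[0:v]split=2[full][small];[small]scale_qsv=w=960:h=540[ml]" \
    -map "[full]" -f null - \
    -map "[ml]" -f null -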

The bottom line: Hardware decode isn't just about video playback – it's about freeing your CPU cycles for the AI work that matters.

Why Hardware Acceleration Works So Well

The performance improvements stem from a fundamental architectural difference: integrated GPUs contain fixed-function decode blocks, dedicated silicon built specifically for codecs like H.264 and HEVC. These blocks decode frames without occupying general-purpose CPU cores, draw far less power for the same work, and can hand decoded frames straight to the GPU's scaling hardware without copying them back to system memory.

Now let's prove these architectural advantages with real-world benchmarks.

Benchmark Setup: Systems and Test Scenarios

To make this concrete, we set up a head-to-head comparison on two machines:

  1. An older Intel Core i5-7500T – 4 cores with 2017-era Intel HD Graphics 630, representing resource-constrained edge hardware.

  2. A modern 12th-gen Intel Core i9-12900HK – a high-core-count mobile CPU with Intel Iris Xe graphics.

Both were tasked with processing a 4K (3840×2160) 20 FPS video feed from an IP camera (HEVC codec) over RTSP for ~30 seconds. The camera was pointed at a simple wall without any dynamic content. We evaluated four scenarios, each run two ways (CPU vs iGPU decode):

  1. RAW Decode (Full frame rate, full resolution) – Just decoding all frames to raw pixels with no extra processing.

  2. Subsampled Decode (Frame Drop) – Decoding and outputting only 2 FPS (dropping 90% of frames).

  3. Rescaled Decode (Spatial Resize) – Decoding all frames and downscaling them to 960×540 (quarter resolution).

  4. Rescaled + Subsampled – Decoding with both the 2 FPS frame drop and 960×540 resizing.

For each scenario, we ran containerized FFmpeg 7.1.1 with either software decode (using the CPU's libavcodec) or hardware-accelerated decode (using Intel's QuickSync via FFmpeg's QSV/VAAPI support). We measured CPU utilization, GPU utilization (for the iGPU runs), and CPU package power throughout each run.

Color Space Considerations: NV12 vs YUV420P

An important technical detail: we used each pipeline's native color space to avoid unnecessary conversion overhead that would skew results. The CPU decode pipeline naturally outputs YUV420P (planar format), while Intel's QuickSync hardware decoder outputs NV12 (semi-planar format).
Rather than force both pipelines to use the same output format (which would add color space conversion overhead to one path), we let each use its optimal format. This ensures we're measuring pure decode+scale performance, not artificial bottlenecks from format conversions. In real computer vision applications, you'd similarly choose your pipeline's native format or handle the conversion once at the boundary between video processing and inference. For broader hardware acceleration options in computer vision frameworks, see OpenCV's hardware acceleration documentation.
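
If a downstream component insists on YUV420P, the conversion can still happen exactly once at that boundary. A minimal sketch (the file name is a placeholder; the explicit format filter forces the single conversion):

  # Native path: QSV decode delivers NV12 frames, no conversion anywhere
  ffmpeg -hwaccel qsv -i input.mp4 -f null -

  # One conversion at the boundary: decode on the iGPU, convert NV12 to YUV420P once
  ffmpeg -hwaccel qsv -i input.mp4 -vf format=yuv420p -f null -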

Each test processed the same 30-second video segment. We used the system-monitoring tool s-tui to log CPU core utilization and power draw, and intel_gpu_top to record GPU engine utilization during the runs.
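
For reference, the monitoring setup needs nothing exotic; on a Debian/Ubuntu-style system it looks roughly like this (package names may differ on other distributions):

  sudo apt install s-tui intel-gpu-tools   # s-tui for CPU/power, intel_gpu_top for the iGPU
  s-tui                                    # per-core utilization, frequency, and package power
  sudo intel_gpu_top                       # per-engine load: Render/3D, Blitter, Video, VideoEnhance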

Benchmark Results: CPU vs GPU Acceleration

Figure 1 presents comprehensive performance data across all tested scenarios on both CPU architectures. The results demonstrate consistent advantages for hardware acceleration across CPU utilization, power consumption, and GPU headroom metrics.

Figure 1: Comprehensive performance summary comparing all tested scenarios with measured data

CPU Utilization Results

As shown in Figure 2, hardware acceleration delivered 28-70% CPU reduction on the resource-constrained i5-7500T and 23-52% reduction on the high-core-count i9-12900HK, with the largest benefits occurring during preprocessing operations like scaling.

Figure 2: CPU utilization comparison showing reductions with hardware acceleration across different processing scenarios

Power Consumption Results

As shown in Figure 3, hardware decode reduced power consumption by 3.8W per stream on the i5 system and 5.3W per stream on the i9 system, providing substantial energy savings that scale with camera count in multi-stream deployments.

Figure 3: Power consumption comparison demonstrating substantial energy savings with hardware acceleration

GPU Utilization Results

While handling 4K HEVC streams, the iGPUs remained lightly loaded with HD 630 (i5) at 7-11% and Iris Xe (i9) at 3-11% total GPU utilization. As illustrated in Figure 4, this indicates capacity for additional parallel streams required by multi-camera vision systems.

Figure 4: GPU engine utilization showing massive available headroom across all hardware processing units

Multi-Stream Scaling: Real-World Complexity

To understand real-world performance, we tested both systems with 5 parallel streams across the different scenarios, and the results shown in Figure 5 reveal scaling behaviors that differ from simple linear projections. The resource-constrained i5-7500T shows higher-than-linear cost for raw decode because memory bandwidth becomes the limiting factor when several large 4K streams are handled simultaneously; the preprocessing scenarios (scaling, subsampling) scale far more efficiently because the reduced data volume relieves that memory-transfer bottleneck. The high-core-count i9-12900HK shows sub-linear scaling: its modern architecture shares resources such as cache and memory controllers across streams more effectively than simple multiplication would predict. Hardware acceleration maintains its efficiency across multiple streams on both platforms, with minimal resource conflicts.

The critical insight is that real multi-stream performance depends on bottlenecks beyond CPU cores. Memory bandwidth, cache hierarchy, and I/O subsystems all shape scaling behavior, which is why preprocessing operations that reduce data movement provide disproportionate benefits in multi-stream scenarios. Specialized interconnect technologies such as NVIDIA's NVLink and GPUDirect can bypass some of these constraints by enabling direct GPU-to-GPU communication and eliminating CPU bounce buffers, but they are typically found in high-end data center hardware rather than edge AI systems.
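
If you want to reproduce the multi-stream behavior on your own hardware, a plain shell loop that launches several hardware-decode processes against the same camera is enough. A minimal sketch (placeholder URL, 30-second cap per process; watch s-tui and intel_gpu_top while it runs):

  # Launch 5 parallel hardware-accelerated decodes and wait for all of them
  for i in $(seq 1 5); do
    ffmpeg -hwaccel qsv -i rtsp://CAMERA_IP/stream -t 30 -f null - &
  done
  wait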

Figure 5: Actual measured scaling from 1 to 5 parallel streams showing real-world multi-stream behavior

When Hardware Acceleration Makes the Biggest Impact

On resource-constrained platforms with 4-8 CPU cores, video decode quickly becomes the bottleneck; our benchmarks show the 4-core i5 handling 10+ streams with hardware acceleration versus only 2-3 streams with software decode. Multi-camera deployments benefit in proportion to camera count: four 4K streams would consume roughly 80% of the CPU with software decode versus just 16% with hardware decode, freeing three or more cores for AI processing.

The most dramatic improvements occur in preprocessing-heavy pipelines where computer vision systems resize frames for different model input requirements, as shown in Figure 6. Hardware acceleration performs decode and scale within the same GPU pipeline without transferring frames back to CPU memory, which is where the 3.3× CPU reduction we measured comes from. Finally, high resolutions and complex codecs gain the most: a 4K HEVC stream requires substantially more processing than 1080p H.264, making it an ideal candidate for hardware decoders purpose-built for these formats.

Figure 6: Focused comparison of scaling performance showing hardware acceleration's advantage

Try It Yourself: make87 System Template

To test hardware-accelerated camera decoding yourself, we've created a simple make87 System Template that lets you connect to your IP camera and compare software versus hardware acceleration performance in real time. The template includes both SW and QSV HW acceleration variants, plus a visual logger so you can confirm that frames are being processed correctly. This gives you measurable results for hardware evaluation and stakeholder demonstrations on your specific setup.

System design with a simple camera driver and optional visual logging helpers

Start by running the camera-driver application and monitoring CPU performance: select your node and open "Metrics", or open "Terminal" and run docker stats to see the container's usage. Before running the application, open the config section by selecting the pen icon and enter your camera IP, path suffix, and optionally a username/password. When selecting the qsv variant, you also need to mount your iGPU's render node (typically /dev/dri/renderD128) in the application's "RENDER" section. This section does not exist for the default variant, which uses software decoding.
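
Outside of make87, a rough docker equivalent of the qsv variant looks like this (the render-node path and camera URL are placeholders; the linuxserver/ffmpeg image uses ffmpeg as its entrypoint, so the arguments below are passed straight to it):

  # Give the container the iGPU render node, then decode one stream with QSV
  docker run --rm --device /dev/dri/renderD128 linuxserver/ffmpeg \
    -hwaccel qsv -i rtsp://user:pass@CAMERA_IP/stream -f null -

  # In another terminal: compare container CPU usage against the software-decode variant
  docker stats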

Beyond Intel: Universal Hardware Support

Similar acceleration exists across platforms:

NVIDIA: Jetson devices and discrete GPUs include NVDEC hardware decoders. A Jetson Nano struggles with a single 4K stream on its CPU but handles multiple streams via NVDEC while keeping its ARM cores free for AI inference.
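
With a CUDA-enabled FFmpeg build, the switch is just a different -hwaccel value (on Jetson, NVIDIA's GStreamer element nvv4l2decoder plays the equivalent role in the multimedia stack). Placeholder URL again:

  # NVDEC-accelerated decode of an RTSP stream on NVIDIA hardware
  ffmpeg -hwaccel cuda -i rtsp://CAMERA_IP/stream -f null -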

ARM SoCs: Raspberry Pi 5 and similar devices include hardware video decode blocks. Using them properly can cut the CPU cost of a video stream from a fully loaded core to a small fraction.
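
Decoder naming varies widely across ARM platforms, so a quick way to see what your particular FFmpeg build exposes is to grep its decoder list (names like h264_v4l2m2m or hevc_rkmpp are typical, but availability depends entirely on the build):

  ffmpeg -decoders | grep -E 'v4l2m2m|rkmpp'   # SoC-specific hardware decoders, if the build includes them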

The key principle: check your platform's hardware decode capabilities. The performance patterns we demonstrated should apply universally, though implementation details vary by vendor.

Conclusion: Free Your CPU for What Matters

Hardware video acceleration determines whether edge AI systems can scale beyond basic configurations. Our benchmarks show that even modest, years-old CPUs gain substantial headroom by offloading video decode and preprocessing to the dedicated hardware they already contain.

For Computer Vision Engineers: If your current system processes multiple camera feeds, unused hardware acceleration represents available performance capacity. An FFmpeg flag change (-hwaccel qsv plus pipeline parameters) can reduce video processing CPU load by 50-80%, enabling larger models, more cameras, or higher inference throughput on the same hardware.

CPU cores should process algorithms rather than video decoding tasks that specialized silicon handles more efficiently.

We're planning additional benchmarks across edge AI platforms. Let us know what hardware acceleration challenges you're facing — your feedback shapes our future benchmarking priorities. Drop us a line with the platforms and use cases that matter to your computer vision deployments via our Discord server.

Important Considerations

Hardware Limits: Each iGPU has practical decode capacity limits that depend on resolution, frame rate, bitrate, and codec complexity. Intel doesn't publish hard limits, but community reports suggest 15+ concurrent lower-resolution streams are achievable on resource-constrained hardware. Our tests successfully ran 5× 4K 20 FPS HEVC streams on the 2017-era HD Graphics 630, indicating substantial capacity. Benchmark your specific workload and target platform.

Codec Compatibility: Ensure your platform supports hardware acceleration for your specific codec. Intel 6th gen+ supports H.264/HEVC; newer generations add VP9/AV1.

Implementation Requirements: Hardware decode requires proper drivers and software support (FFmpeg with QSV/VAAPI, GStreamer with hardware plugins, etc.). The software setup is usually straightforward but platform-specific. Running ffmpeg -hwaccels will show available hardware acceleration methods on your system, and ffmpeg -codecs will list supported codecs (decoders and encoders). We found it easiest to use a pre-built FFmpeg docker image with Intel QSV support. linuxserver/ffmpeg has been a solid choice during our benchmark tests — just make sure to mount all required devices into the container so it can access them.
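
A quick sanity check along the lines described above, run either natively or inside the container (the docker line assumes the host exposes its iGPU under /dev/dri):

  ffmpeg -hwaccels                    # should list qsv and/or vaapi on Intel systems
  ffmpeg -codecs | grep -i hevc       # confirm HEVC decode support in this build
  docker run --rm --device /dev/dri:/dev/dri linuxserver/ffmpeg -hwaccels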

FFmpeg 8 Performance Note: With the recent official release of FFmpeg 8 and its dramatically rewritten libswscale (showing 2-40× performance improvements for scaling operations), retesting these benchmarks would provide updated performance data. The new swscale implementation may impact CPU-based scaling results and reduce the performance gap between software and hardware acceleration for resize-heavy pipelines. We're planning to revisit these measurements with FFmpeg 8 in future testing.

Appendix: Benchmark Reproduction Commands

Below are FFmpeg commands matching each test scenario, so you can reproduce these benchmarks on your own hardware.
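
The set below is a representative sketch rather than a verbatim log of our runs: the RTSP URL is a placeholder, each command is capped at 30 seconds with -t 30, and the QSV filter chains assume FFmpeg 7.x built with QuickSync support.

  # Scenario 1 (RAW decode): software (CPU) vs hardware (QSV)
  ffmpeg -i rtsp://CAMERA_IP/stream -t 30 -f null -
  ffmpeg -hwaccel qsv -i rtsp://CAMERA_IP/stream -t 30 -f null -

  # Scenario 2 (Subsampled decode): keep only 2 FPS
  ffmpeg -i rtsp://CAMERA_IP/stream -t 30 -vf fps=2 -f null -
  ffmpeg -hwaccel qsv -i rtsp://CAMERA_IP/stream -t 30 -vf fps=2 -f null -

  # Scenario 3 (Rescaled decode): downscale every frame to 960x540
  ffmpeg -i rtsp://CAMERA_IP/stream -t 30 -vf scale=960:540 -f null -
  ffmpeg -hwaccel qsv -hwaccel_output_format qsv -i rtsp://CAMERA_IP/stream -t 30 \
    -vf "scale_qsv=w=960:h=540,hwdownload,format=nv12" -f null -

  # Scenario 4 (Rescaled + subsampled): 2 FPS at 960x540
  ffmpeg -i rtsp://CAMERA_IP/stream -t 30 -vf "fps=2,scale=960:540" -f null -
  ffmpeg -hwaccel qsv -hwaccel_output_format qsv -i rtsp://CAMERA_IP/stream -t 30 \
    -vf "fps=2,scale_qsv=w=960:h=540,hwdownload,format=nv12" -f null -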
