Why Offloading Video Decode to Integrated GPUs Matters for Edge AI


Overview

Video decoding consumes substantial CPU resources in edge AI systems, often becoming the bottleneck before any AI inference can begin. Our analysis using Intel QuickSync demonstrates that offloading decode to integrated GPUs dramatically reduces CPU usage and power consumption, enabling systems to handle significantly more camera streams. This untapped hardware acceleration already exists in most modern processors and requires only simple FFmpeg configuration changes to free your CPU cores for actual AI workloads.

Table of contents

  1. Introduction
  2. Key Findings: Why Hardware Decode Matters
  3. Why Hardware Acceleration Works So Well
  4. Benchmark Setup: Systems and Test Scenarios
  5. Benchmark Results: CPU vs GPU Acceleration
  6. When Hardware Acceleration Makes the Biggest Impact
  7. Try It Yourself: make87 System Template
  8. Beyond Intel: Universal Hardware Support
  9. Conclusion: Free Your CPU for What Matters
  10. Important Considerations
  11. Appendix

Introduction

Edge AI systems often need to handle multiple video feeds (e.g. from security cameras or robots) and run computer vision models on those streams. A critical but sometimes overlooked aspect is video decoding efficiency. Before any AI processing can happen, each camera's compressed stream (H.264, HEVC, etc.) must be decoded into raw frames – which can be computationally expensive. If you have several 4K cameras, decoding alone can chew through a large chunk of a CPU’s capacity, leaving less room for the actual vision algorithms. In power-constrained edge devices, it can also waste precious watts and generate excess heat.

Most modern CPUs include an integrated GPU with specialized video decode hardware. On Intel platforms this is known as QuickSync, exposed in software through VAAPI or oneVPL (the Intel documentation is the best place to untangle the various APIs and naming schemes), and similar blocks exist on ARM SoCs and NVIDIA Jetson devices. Offloading the heavy lifting of video decoding to these fixed-function units reduces CPU load and power consumption, allowing your edge AI system to scale to more cameras or run heavier models.
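
To make the idea concrete before the numbers, here is the smallest possible FFmpeg comparison. This is a sketch only: the RTSP URL is a placeholder and it assumes a build compiled with QSV support.

  # Software decode: libavcodec runs the HEVC decoder entirely on CPU cores
  ffmpeg -i rtsp://CAMERA_IP/stream -f null -

  # Hardware decode: the same stream, handled by the iGPU's QuickSync block
  ffmpeg -hwaccel qsv -i rtsp://CAMERA_IP/stream -f null -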

In this article, we'll quantify the benefits through benchmarks on two systems – an older 4-core Intel i5 and a modern 12th-gen Intel i9.

At make87, we regularly work with clients building multi-camera computer vision systems who wonder whether their existing hardware can handle their ambitious deployments. A common pattern: engineering teams running established 4-6 core systems discover they can handle multiple 4K camera feeds without expensive upgrades, because most Intel systems from 2015 onward already include hardware video acceleration. The key is knowing how to unlock that capability.

Key Findings: Why Hardware Decode Matters

Our benchmarks reveal consistent advantages for computer vision pipelines:

  - Offloading decode to the iGPU cut CPU utilization by 28-70% on the 4-core i5-7500T and by 23-52% on the i9-12900HK.
  - CPU package power dropped by roughly 3.8W per stream on the i5 and 5.3W per stream on the i9.
  - While decoding 4K HEVC, the iGPUs stayed at only 3-11% utilization, leaving headroom for many additional streams.
  - The largest gains came from preprocessing-heavy pipelines, where decoding and scaling on the GPU reduced CPU load by up to 3.3×.

Note: The most efficient scaling is configuring your camera to record at the resolution you actually need. If your ML model only needs 960×540 input, having the camera encode at that resolution eliminates both decode overhead AND scaling overhead entirely. However, when you need multiple resolutions from the same feed or can't control camera settings, hardware-accelerated scaling provides the next-best efficiency.
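
When you do need several resolutions from one feed, a sketch along these lines keeps both the decode and the resize on the iGPU. It assumes an FFmpeg build where split and scale_qsv accept QSV surfaces, and uses a placeholder RTSP URL.

  # Decode once on the iGPU, keep a full-resolution branch and a 960x540 branch for the model
  ffmpeg -hwaccel qsv -hwaccel_output_format qsv -i rtsp://CAMERA_IP/stream \
    -filter_complex "[0:v]split=2[full][small];[small]scale_qsv=w=960:h=540[ml]" \
    -map "[full]" -f null - \
    -map "[ml]" -f null -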

The bottom line: Hardware decode isn't just about video playback – it's about freeing your CPU cycles for the AI work that matters.

Why Hardware Acceleration Works So Well

The performance improvements stem from a fundamental architectural difference: integrated GPUs contain fixed-function decode blocks, dedicated silicon built specifically for codecs like H.264 and HEVC. These blocks decode frames without occupying general-purpose CPU cores, draw far less power for the same work, and can hand decoded frames straight to the GPU's scaling hardware without copying them back to system memory.

Now let's prove these architectural advantages with real-world benchmarks.

Benchmark Setup: Systems and Test Scenarios

To make this concrete, we set up a head-to-head comparison on two machines:

  1. An older Intel Core i5-7500T – 4 cores with 2017-era Intel HD Graphics 630, representing resource-constrained edge hardware.

  2. A modern 12th-gen Intel Core i9-12900HK – a high-core-count mobile CPU with Intel Iris Xe graphics.

Both were tasked with processing a 4K (3840×2160) 20 FPS video feed from an IP camera (HEVC codec) over RTSP for ~30 seconds. The camera was pointed at a simple wall without any dynamic content. We evaluated four scenarios, each run two ways (CPU vs iGPU decode):

  1. RAW Decode (Full frame rate, full resolution) – Just decoding all frames to raw pixels with no extra processing.

  2. Subsampled Decode (Frame Drop) – Decoding and outputting only 2 FPS (dropping 90% of frames).

  3. Rescaled Decode (Spatial Resize) – Decoding all frames and downscaling them to 960×540 (quarter resolution).

  4. Rescaled + Subsampled – Decoding with both the 2 FPS frame drop and 960×540 resizing.

For each scenario, we ran containerized FFmpeg 7.1.1 with either software decode (using the CPU's libavcodec) or hardware-accelerated decode (using Intel's QuickSync via FFmpeg's QSV/VAAPI support). We measured CPU utilization, GPU utilization (for the iGPU runs), and CPU package power throughout each run.

Color Space Considerations: NV12 vs YUV420P

An important technical detail: we used each pipeline's native color space to avoid unnecessary conversion overhead that would skew results. The CPU decode pipeline naturally outputs YUV420P (planar format), while Intel's QuickSync hardware decoder outputs NV12 (semi-planar format).
Rather than force both pipelines to use the same output format (which would add color space conversion overhead to one path), we let each use its optimal format. This ensures we're measuring pure decode+scale performance, not artificial bottlenecks from format conversions. In real computer vision applications, you'd similarly choose your pipeline's native format or handle the conversion once at the boundary between video processing and inference. For broader hardware acceleration options in computer vision frameworks, see OpenCV's hardware acceleration documentation.
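
If a downstream component insists on YUV420P, the conversion can still happen exactly once at that boundary. A minimal sketch (the file name is a placeholder; the explicit format filter forces the single conversion):

  # Native path: QSV decode delivers NV12 frames, no conversion anywhere
  ffmpeg -hwaccel qsv -i input.mp4 -f null -

  # One conversion at the boundary: decode on the iGPU, convert NV12 to YUV420P once
  ffmpeg -hwaccel qsv -i input.mp4 -vf format=yuv420p -f null -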

Each test processed the same 30-second video segment. We used the system-monitoring tool s-tui to log CPU core utilization and power draw, and intel_gpu_top to record GPU engine utilization during the runs.
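
For reference, the monitoring setup needs nothing exotic; on a Debian/Ubuntu-style system it looks roughly like this (package names may differ on other distributions):

  sudo apt install s-tui intel-gpu-tools   # s-tui for CPU/power, intel_gpu_top for the iGPU
  s-tui                                    # per-core utilization, frequency, and package power
  sudo intel_gpu_top                       # per-engine load: Render/3D, Blitter, Video, VideoEnhance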

Benchmark Results: CPU vs GPU Acceleration

Figure 1 presents comprehensive performance data across all tested scenarios on both CPU architectures. The results demonstrate consistent advantages for hardware acceleration across CPU utilization, power consumption, and GPU headroom metrics.

Figure 1: Comprehensive performance summary comparing all tested scenarios with measured data

CPU Utilization Results

As shown in Figure 2, hardware acceleration delivered 28-70% CPU reduction on the resource-constrained i5-7500T and 23-52% reduction on the high-core-count i9-12900HK, with the largest benefits occurring during preprocessing operations like scaling.

Figure 2: CPU utilization comparison showing reductions with hardware acceleration across different processing scenarios

Power Consumption Results

As shown in Figure 3, hardware decode reduced power consumption by 3.8W per stream on the i5 system and 5.3W per stream on the i9 system, providing substantial energy savings that scale with camera count in multi-stream deployments.

Figure 3: Power consumption comparison demonstrating substantial energy savings with hardware acceleration

GPU Utilization Results

While handling 4K HEVC streams, the iGPUs remained lightly loaded with HD 630 (i5) at 7-11% and Iris Xe (i9) at 3-11% total GPU utilization. As illustrated in Figure 4, this indicates capacity for additional parallel streams required by multi-camera vision systems.

Figure 4: GPU engine utilization showing massive available headroom across all hardware processing units

Multi-Stream Scaling: Real-World Complexity

To understand real-world performance, we tested both systems with 5 parallel streams across the different scenarios, and the results shown in Figure 5 reveal scaling behaviors that differ from simple linear projections. The resource-constrained i5-7500T shows higher-than-linear cost for raw decode because memory bandwidth becomes the limiting factor when several large 4K streams are handled simultaneously; the preprocessing scenarios (scaling, subsampling) scale far more efficiently because the reduced data volume relieves that memory-transfer bottleneck. The high-core-count i9-12900HK shows sub-linear scaling: its modern architecture shares resources such as cache and memory controllers across streams more effectively than simple multiplication would predict. Hardware acceleration maintains its efficiency across multiple streams on both platforms, with minimal resource conflicts.

The critical insight is that real multi-stream performance depends on bottlenecks beyond CPU cores. Memory bandwidth, cache hierarchy, and I/O subsystems all shape scaling behavior, which is why preprocessing operations that reduce data movement provide disproportionate benefits in multi-stream scenarios. Specialized interconnect technologies such as NVIDIA's NVLink and GPUDirect can bypass some of these constraints by enabling direct GPU-to-GPU communication and eliminating CPU bounce buffers, but they are typically found in high-end data center hardware rather than edge AI systems.
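
If you want to reproduce the multi-stream behavior on your own hardware, a plain shell loop that launches several hardware-decode processes against the same camera is enough. A minimal sketch (placeholder URL, 30-second cap per process; watch s-tui and intel_gpu_top while it runs):

  # Launch 5 parallel hardware-accelerated decodes and wait for all of them
  for i in $(seq 1 5); do
    ffmpeg -hwaccel qsv -i rtsp://CAMERA_IP/stream -t 30 -f null - &
  done
  wait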

Figure 5: Actual measured scaling from 1 to 5 parallel streams showing real-world multi-stream behavior

When Hardware Acceleration Makes the Biggest Impact

On resource-constrained platforms with 4-8 CPU cores, video decode quickly becomes the bottleneck; our benchmarks show the 4-core i5 handling 10+ streams with hardware acceleration versus only 2-3 streams with software decode. Multi-camera deployments benefit in proportion to camera count: four 4K streams would consume roughly 80% of the CPU with software decode versus just 16% with hardware decode, freeing three or more cores for AI processing.

The most dramatic improvements occur in preprocessing-heavy pipelines where computer vision systems resize frames for different model input requirements, as shown in Figure 6. Hardware acceleration performs decode and scale within the same GPU pipeline without transferring frames back to CPU memory, which is where the 3.3× CPU reduction we measured comes from. Finally, high resolutions and complex codecs gain the most: a 4K HEVC stream requires substantially more processing than 1080p H.264, making it an ideal candidate for hardware decoders purpose-built for these formats.

Figure 6: Focused comparison of scaling performance showing hardware acceleration's advantage

Try It Yourself: make87 System Template

To test hardware-accelerated camera decoding yourself, we've created a simple make87 System Template that lets you connect to your IP camera and compare software versus hardware acceleration performance in real time. The template includes both SW and QSV HW acceleration variants, plus a visual logger so you can confirm that frames are being processed correctly. This gives you measurable results for hardware evaluation and stakeholder demonstrations on your specific setup.

System design with a simple camera driver and optional visual logging helpers

Start by running the camera-driver application and monitoring CPU performance: select your node and open "Metrics", or open "Terminal" and run docker stats to see the container's usage. Before running the application, open the config section by selecting the pen icon and enter your camera IP, path suffix, and optionally a username/password. When selecting the qsv variant, you also need to mount your iGPU's render node (typically /dev/dri/renderD128) in the application's "RENDER" section. This section does not exist for the default variant, which uses software decoding.
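
Outside of make87, a rough docker equivalent of the qsv variant looks like this (the render-node path and camera URL are placeholders; the linuxserver/ffmpeg image uses ffmpeg as its entrypoint, so the arguments below are passed straight to it):

  # Give the container the iGPU render node, then decode one stream with QSV
  docker run --rm --device /dev/dri/renderD128 linuxserver/ffmpeg \
    -hwaccel qsv -i rtsp://user:pass@CAMERA_IP/stream -f null -

  # In another terminal: compare container CPU usage against the software-decode variant
  docker stats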

Beyond Intel: Universal Hardware Support

Similar acceleration exists across platforms:

NVIDIA: Jetson devices and discrete GPUs include NVDEC hardware decoders. A Jetson Nano struggles with a single 4K stream on its CPU but handles multiple streams via NVDEC while keeping its ARM cores free for AI inference.
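
With a CUDA-enabled FFmpeg build, the switch is just a different -hwaccel value (on Jetson, NVIDIA's GStreamer element nvv4l2decoder plays the equivalent role in the multimedia stack). Placeholder URL again:

  # NVDEC-accelerated decode of an RTSP stream on NVIDIA hardware
  ffmpeg -hwaccel cuda -i rtsp://CAMERA_IP/stream -f null -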

ARM SoCs: Raspberry Pi 5 and similar devices include hardware video decode blocks. Using them properly can cut the CPU cost of a video stream from a fully loaded core to a small fraction.
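
Decoder naming varies widely across ARM platforms, so a quick way to see what your particular FFmpeg build exposes is to grep its decoder list (names like h264_v4l2m2m or hevc_rkmpp are typical, but availability depends entirely on the build):

  ffmpeg -decoders | grep -E 'v4l2m2m|rkmpp'   # SoC-specific hardware decoders, if the build includes them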

The key principle: check your platform's hardware decode capabilities. The performance patterns we demonstrated should apply universally, though implementation details vary by vendor.

Conclusion: Free Your CPU for What Matters

Hardware video acceleration determines whether edge AI systems can scale beyond basic configurations. Our benchmarks show that even modest, years-old CPUs gain substantial headroom by offloading video decode and preprocessing to the dedicated hardware they already contain.

For Computer Vision Engineers: If your current system processes multiple camera feeds, unused hardware acceleration represents available performance capacity. An FFmpeg flag change (-hwaccel qsv plus pipeline parameters) can reduce video processing CPU load by 50-80%, enabling larger models, more cameras, or higher inference throughput on the same hardware.

CPU cores should process algorithms rather than video decoding tasks that specialized silicon handles more efficiently.

We're planning additional benchmarks across edge AI platforms. Let us know what hardware acceleration challenges you're facing — your feedback shapes our future benchmarking priorities. Drop us a line with the platforms and use cases that matter to your computer vision deployments via our Discord server.

Important Considerations

Hardware Limits: Each iGPU has practical decode capacity limits that depend on resolution, frame rate, bitrate, and codec complexity. Intel doesn't publish hard limits, but community reports suggest 15+ concurrent lower-resolution streams are achievable on resource-constrained hardware. Our tests successfully ran 5× 4K 20 FPS HEVC streams on the 2017-era HD Graphics 630, indicating substantial capacity. Benchmark your specific workload and target platform.

Codec Compatibility: Ensure your platform supports hardware acceleration for your specific codec. Intel 6th gen+ supports H.264/HEVC; newer generations add VP9/AV1.

Implementation Requirements: Hardware decode requires proper drivers and software support (FFmpeg with QSV/VAAPI, GStreamer with hardware plugins, etc.). The software setup is usually straightforward but platform-specific. Running ffmpeg -hwaccels will show available hardware acceleration methods on your system, and ffmpeg -codecs will list supported codecs (decoders and encoders). We found it easiest to use a pre-built FFmpeg docker image with Intel QSV support. linuxserver/ffmpeg has been a solid choice during our benchmark tests — just make sure to mount all required devices into the container so it can access them.
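
A quick sanity check along the lines described above, run either natively or inside the container (the docker line assumes the host exposes its iGPU under /dev/dri):

  ffmpeg -hwaccels                    # should list qsv and/or vaapi on Intel systems
  ffmpeg -codecs | grep -i hevc       # confirm HEVC decode support in this build
  docker run --rm --device /dev/dri:/dev/dri linuxserver/ffmpeg -hwaccels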

FFmpeg 8 Performance Note: With the recent official release of FFmpeg 8 and its dramatically rewritten libswscale (showing 2-40× performance improvements for scaling operations), retesting these benchmarks would provide updated performance data. The new swscale implementation may impact CPU-based scaling results and reduce the performance gap between software and hardware acceleration for resize-heavy pipelines. We're planning to revisit these measurements with FFmpeg 8 in future testing.

Appendix: Benchmark Reproduction Commands

Below are FFmpeg commands matching each test scenario, so you can reproduce these benchmarks on your own hardware.
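
The set below is a representative sketch rather than a verbatim log of our runs: the RTSP URL is a placeholder, each command is capped at 30 seconds with -t 30, and the QSV filter chains assume FFmpeg 7.x built with QuickSync support.

  # Scenario 1 (RAW decode): software (CPU) vs hardware (QSV)
  ffmpeg -i rtsp://CAMERA_IP/stream -t 30 -f null -
  ffmpeg -hwaccel qsv -i rtsp://CAMERA_IP/stream -t 30 -f null -

  # Scenario 2 (Subsampled decode): keep only 2 FPS
  ffmpeg -i rtsp://CAMERA_IP/stream -t 30 -vf fps=2 -f null -
  ffmpeg -hwaccel qsv -i rtsp://CAMERA_IP/stream -t 30 -vf fps=2 -f null -

  # Scenario 3 (Rescaled decode): downscale every frame to 960x540
  ffmpeg -i rtsp://CAMERA_IP/stream -t 30 -vf scale=960:540 -f null -
  ffmpeg -hwaccel qsv -hwaccel_output_format qsv -i rtsp://CAMERA_IP/stream -t 30 \
    -vf "scale_qsv=w=960:h=540,hwdownload,format=nv12" -f null -

  # Scenario 4 (Rescaled + subsampled): 2 FPS at 960x540
  ffmpeg -i rtsp://CAMERA_IP/stream -t 30 -vf "fps=2,scale=960:540" -f null -
  ffmpeg -hwaccel qsv -hwaccel_output_format qsv -i rtsp://CAMERA_IP/stream -t 30 \
    -vf "fps=2,scale_qsv=w=960:h=540,hwdownload,format=nv12" -f null -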
