Why Offloading Video Decode to Integrated GPUs Matters for Edge AI

Overview
Video decoding consumes substantial CPU resources in edge AI systems, often becoming the bottleneck before any AI inference can begin. Our analysis using Intel QuickSync demonstrates that offloading decode to integrated GPUs dramatically reduces CPU usage and power consumption, enabling systems to handle significantly more camera streams. This untapped hardware acceleration already exists in most modern processors and requires only simple FFmpeg configuration changes to free your CPU cores for actual AI workloads.
Table of contents
- Introduction
- Key Findings: Why Hardware Decode Matters
- Why Hardware Acceleration Works So Well
- Benchmark Setup: Systems and Test Scenarios
- Benchmark Results: CPU vs GPU Acceleration
- When Hardware Acceleration Makes the Biggest Impact
- Try It Yourself: make87 System Template
- Beyond Intel: Universal Hardware Support
- Conclusion: Free Your CPU for What Matters
- Important Considerations
- Appendix
Introduction
Edge AI systems often need to handle multiple video feeds (e.g. from security cameras or robots) and run computer vision models on those streams. A critical but sometimes overlooked aspect is video decoding efficiency. Before any AI processing can happen, each camera's compressed stream (H.264, HEVC, etc.) must be decoded into raw frames – which can be computationally expensive. If you have several 4K cameras, decoding alone can chew through a large chunk of a CPU’s capacity, leaving less room for the actual vision algorithms. In power-constrained edge devices, it can also waste precious watts and generate excess heat.
Most modern CPUs include an integrated GPU with specialized video decode hardware. On Intel platforms this is known as QuickSync (exposed in software via VAAPI or oneVPL – see Intel's documentation if you want to untangle the various APIs and names), and similar blocks exist on ARM SoCs and NVIDIA Jetson modules. Offloading the heavy lifting of video decoding to these fixed-function units reduces CPU load and power consumption, allowing your edge AI system to scale to more cameras or run heavier models.
In this article, we'll quantify the benefits through benchmarks on two systems – an older 4-core Intel i5 and a modern 12th-gen Intel i9.
At make87, we regularly work with clients building multi-camera computer vision systems who wonder whether their existing hardware can handle their ambitious deployments. A common pattern we see: engineering teams working with established 4-6 core systems often discover they have substantial untapped potential for handling multiple 4K camera feeds without requiring expensive upgrades. Most Intel systems from 2015+ include hardware video acceleration that increases system capacity, potentially eliminating hardware upgrade requirements. The key is knowing how to unlock this capability.
Key Findings: Why Hardware Decode Matters
Our benchmarks reveal advantages for computer vision pipelines:
- CPU reduction: Hardware decode reduced CPU usage by up to 70%, freeing cores for AI inference. For multi-camera systems, this multiplies quickly.
- Power savings: Using iGPU lowered CPU package power consumption by up to 5W per stream above idle. In edge deployments with 8+ cameras, this significantly reduces heat generation.
- Scaling potential: The iGPU video engines remained lightly loaded even handling 4K HEVC streams, indicating headroom for 10+ parallel streams on hardware that would saturate with just 2-3 CPU-decoded streams.
- Processing overhead compounds benefits: Tasks like scaling (common in CV preprocessing) amplified the advantage. This benefits inference pipelines that resize inputs.
- Frame dropping: GPU acceleration provides efficient frame rate reduction because hardware can discard frames early in the decode pipeline (`vpp_qsv=framerate=2`), while CPU approaches (`fps=2`) still decode all frames before dropping them. This architectural difference makes GPU paths more efficient for applications that only need periodic frame analysis (see the command sketch after this list).
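As a rough illustration of the two paths (a sketch, not the exact benchmark commands from the appendix; the RTSP URL is a placeholder):

```bash
# CPU path: every frame is software-decoded, then the fps filter drops down to 2 FPS
ffmpeg -rtsp_transport tcp -i rtsp://CAMERA_IP/stream \
  -vf "fps=2" -f null -

# GPU path: QuickSync decodes and reduces the frame rate inside the GPU pipeline
ffmpeg -hwaccel qsv -hwaccel_output_format qsv \
  -rtsp_transport tcp -i rtsp://CAMERA_IP/stream \
  -vf "vpp_qsv=framerate=2" -f null -
```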
Note: The most efficient scaling is configuring your camera to record at the resolution you actually need. If your ML model only needs 960×540 input, having the camera encode at that resolution eliminates both decode overhead AND scaling overhead entirely. However, when you need multiple resolutions from the same feed or can't control camera settings, hardware-accelerated scaling provides the next-best efficiency.
The bottom line: Hardware decode isn't just about video playback – it's about freeing your CPU cycles for the AI work that matters.
Why Hardware Acceleration Works So Well
The performance improvements result from fundamental architectural differences:
Dedicated Silicon: iGPUs include fixed-function decoder blocks (Intel's VCS engine) specifically designed for video codecs. These implement complex operations like motion compensation and entropy decoding in specialized hardware, while CPUs must execute thousands of general-purpose instructions for the same work.
Optimized Data Flow: When processing video on the CPU, each frame passes through multiple stages (decode → memory → scale → memory). Hardware acceleration can perform decode+scale in one pass, outputting only the final smaller frame. For 4K→540p scaling (3840×2160 down to 960×540) that means 16× fewer pixels moving through memory.
Parallel Processing: The CPU and iGPU work simultaneously — while the iGPU handles video preprocessing, CPU cores remain free for AI inference and other tasks.
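The "optimized data flow" point maps directly onto the filter graph. A minimal sketch of the two pipelines (placeholder input file; parameters are illustrative, not the appendix commands):

```bash
# Software pipeline: decode on CPU, write full 4K frames to memory, then scale with swscale
ffmpeg -i input_4k_hevc.mp4 -vf "scale=960:540" -f null -

# QuickSync pipeline: decode and scale on the iGPU; only the 960x540 result leaves GPU memory
ffmpeg -hwaccel qsv -hwaccel_output_format qsv -i input_4k_hevc.mp4 \
  -vf "vpp_qsv=w=960:h=540" -f null -
```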
Now let's prove these architectural advantages with real-world benchmarks.
Benchmark Setup: Systems and Test Scenarios
To make this concrete, we set up a head-to-head comparison on two machines:
Lenovo M910q (Intel Core i5-7500T) – A 4-core/4-thread CPU from 2017 (Kaby Lake) with Intel HD Graphics 630. This system represents a resource-constrained x86 platform. Measured specs from `lscpu`: 2.70GHz base/3.30GHz max, 6MB L3 cache, single-threaded cores (no hyperthreading).

Minisforum "Venus" (Intel Core i9-12900HK) – A modern 12th Gen mobile CPU with Intel Iris Xe Graphics. This system demonstrates how hardware acceleration benefits high-core-count platforms. Measured specs from `lscpu`: 14 cores/20 threads (6P+8E hybrid architecture), up to 5.00GHz, 24MB L3 cache, 11.5MB L2 cache.
Both were tasked with processing a 4K (3840×2160) 20 FPS video feed from an IP camera (HEVC codec) over RTSP for ~30 seconds. The camera was pointed at a simple wall without any dynamic content. We evaluated four scenarios, each run two ways (CPU vs iGPU decode):
RAW Decode (Full frame rate, full resolution) – Just decoding all frames to raw pixels with no extra processing.
Subsampled Decode (Frame Drop) – Decoding and outputting only 2 FPS (dropping 90% of frames).
Rescaled Decode (Spatial Resize) – Decoding all frames and downscaling them to 960×540 (quarter resolution).
Rescaled + Subsampled – Decoding with both the 2 FPS frame drop and 960×540 resizing.
For each scenario, we ran containerized FFmpeg 7.1.1 with either software decode (using `libavcodec` on the CPU) or hardware-accelerated decode (using Intel's QuickSync via FFmpeg's QSV/VAAPI support). We measured CPU utilization, GPU utilization (for the iGPU runs), and CPU package power throughout each run.
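The exact measurement tooling isn't listed here; one common way to capture the same three metrics on an Intel Linux host is sketched below (`docker stats`, `intel_gpu_top` from intel-gpu-tools, and `turbostat` are our assumptions, not necessarily what was used for these numbers; the container name is a placeholder):

```bash
# Per-container CPU utilization of the decode workload
docker stats --no-stream <container_name>

# iGPU engine load, including the Video engine that QuickSync decode runs on
sudo intel_gpu_top

# CPU package power in watts (PkgWatt), sampled once per second
sudo turbostat --Summary --quiet --show PkgWatt --interval 1
```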
Color Space Considerations: NV12 vs YUV420P
Benchmark Results: CPU vs GPU Acceleration
Figure 1 presents comprehensive performance data across all tested scenarios on both CPU architectures. The results demonstrate consistent advantages for hardware acceleration across CPU utilization, power consumption, and GPU headroom metrics.
CPU Utilization Results
As shown in Figure 2, hardware acceleration delivered 28-70% CPU reduction on the resource-constrained i5-7500T and 23-52% reduction on the high-core-count i9-12900HK, with the largest benefits occurring during preprocessing operations like scaling.
Power Consumption Results
As shown in Figure 3, hardware decode reduced power consumption by 3.8W per stream on the i5 system and 5.3W per stream on the i9 system, providing substantial energy savings that scale with camera count in multi-stream deployments.
GPU Utilization Results
While handling 4K HEVC streams, the iGPUs remained lightly loaded with HD 630 (i5) at 7-11% and Iris Xe (i9) at 3-11% total GPU utilization. As illustrated in Figure 4, this indicates capacity for additional parallel streams required by multi-camera vision systems.
Multi-Stream Scaling: Real-World Complexity
To understand real-world performance, we tested both systems with 5 parallel streams across different scenarios, revealing important scaling behaviors that differ from simple linear projections as shown in Figure 5. The resource-constrained i5-7500T shows higher-than-linear scaling for raw decode scenarios due to memory bandwidth limitations when handling multiple large 4K streams simultaneously. However, preprocessing scenarios (scaling, subsampling) scale much more efficiently because reduced data volume alleviates memory transfer bottlenecks. The high-core-count i9-12900HK demonstrates sub-linear scaling, indicating better resource efficiency where the modern architecture shares resources (cache, memory controllers) more effectively than simple multiplication would suggest. Hardware acceleration maintains efficiency across multiple streams with minimal resource conflicts on both platforms.
The critical insight is that real multi-stream performance depends on system bottlenecks beyond just CPU cores - memory bandwidth, cache hierarchy, and I/O subsystems all influence scaling behavior, which is why preprocessing operations that reduce data movement provide disproportionate benefits in multi-stream scenarios. Note that specialized interconnect technologies like NVIDIA's NVLink and GPUDirect can bypass some of these constraints by enabling direct GPU-to-GPU communication and eliminating CPU bounce buffers, though these solutions are typically found in high-end data center hardware rather than edge AI systems.
When Hardware Acceleration Makes the Biggest Impact
Resource-constrained platforms with 4-8 CPU cores see video decode quickly become the bottleneck, with our benchmarks showing the 4-core i5 handling 10+ streams via hardware acceleration versus only 2-3 streams with software decode. Multi-camera deployments benefit proportionally with camera count - four 4K streams would consume 80% CPU with software decode versus just 16% with hardware decode, freeing 3+ cores for AI processing.
The most dramatic improvements occur in preprocessing-heavy pipelines where computer vision systems resize frames for different model input requirements, as shown in Figure 6. Hardware acceleration performs decode and scale within the same GPU pipeline without transferring frames back to CPU memory, providing the 3.3× CPU reduction we measured in testing. Finally, high-resolution or complex codecs like 4K HEVC streams require substantially more processing than 1080p H.264, making them ideal candidates for hardware decoders specifically designed for these formats.
Try It Yourself: make87 System Template
To test hardware-accelerated camera decoding, we've created a simple make87 System Template that lets you connect to your IP camera and compare software vs hardware acceleration performance in real-time. The template includes both SW and QSV HW acceleration variants, plus a visual logger so you can confirm that frames are being processed correctly. This approach provides measurable results for hardware evaluation and stakeholder demonstrations on your specific setup.
You can start by just running the `camera-driver` application and monitoring CPU performance by selecting your node and going to "Metrics". Or you can go to "Terminal" and type `docker stats` to get the container's usage. Before running the application, make sure to go to the config section by selecting the pen icon and entering your camera IP, path suffix, and optionally username/password. When selecting the `qsv` variant, you need to mount your iGPU's render node (typically `/dev/dri/renderD128`) in the "RENDER" section of the application. This section does not exist for the `default` variant, which is used for software decoding.
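If you're unsure which render node to mount, a quick host-side check looks roughly like this (assuming the i915 driver is loaded and libva-utils is installed):

```bash
# List the DRM render nodes exposed by the iGPU (usually renderD128 on single-GPU systems)
ls -l /dev/dri/

# Verify VAAPI works on that node and list the codecs it can decode in hardware
vainfo --display drm --device /dev/dri/renderD128
```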
Beyond Intel: Universal Hardware Support
Similar acceleration exists across platforms:
NVIDIA: Jetson devices and discrete GPUs include NVDEC hardware decoders. A Jetson Nano struggles with one 4K stream on CPU but handles multiple streams via NVDEC while keeping its ARM cores free for AI inference.
ARM SoCs: Raspberry Pi 5 and similar devices include hardware video decode blocks. Proper utilization can reduce CPU usage from 100% to a fraction for video streams.
The key principle: check your platform's hardware decode capabilities. The performance patterns we demonstrated should apply universally, though implementation details vary by vendor.
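For example, the NVDEC equivalent of the QuickSync pipeline above looks roughly like this on discrete NVIDIA GPUs (a sketch assuming an FFmpeg build with CUDA/NVDEC support; Jetson typically goes through vendor-specific stacks such as GStreamer plugins instead, and we did not benchmark this path here):

```bash
# Decode HEVC with NVDEC and downscale on the GPU, keeping frames in GPU memory throughout
ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
  -rtsp_transport tcp -i rtsp://CAMERA_IP/stream \
  -vf "scale_cuda=960:540" -f null -
```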
Conclusion: Free Your CPU for What Matters
Hardware video acceleration determines whether edge AI systems can scale beyond basic configurations. Our benchmarks show that both an aging 4-core i5 and a modern 14-core i9 benefit substantially from offloading video preprocessing to dedicated hardware.
For Computer Vision Engineers: If your current system processes multiple camera feeds, unused hardware acceleration represents available performance capacity. An FFmpeg flag change (`-hwaccel qsv` plus pipeline parameters) can reduce video processing CPU load by 50-80%, enabling larger models, more cameras, or higher inference throughput on the same hardware.
CPU cores should process algorithms rather than video decoding tasks that specialized silicon handles more efficiently.
We're planning additional benchmarks across edge AI platforms. Let us know what hardware acceleration challenges you're facing — your feedback shapes our future benchmarking priorities. Drop us a line with the platforms and use cases that matter to your computer vision deployments via our Discord server.
Important Considerations
Hardware Limits: Each iGPU has practical decode capacity limits that depend on resolution, frame rate, bitrate, and codec complexity. Intel doesn't publish hard limits, but community reports suggest 15+ concurrent lower-resolution streams are achievable on resource-constrained hardware. Our tests successfully ran 5× 4K 20 FPS HEVC streams on 2017-era HD Graphics 630, indicating substantial capacity. Benchmark your specific workload and target platform.
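A rough way to probe those limits on your own hardware is to launch several hardware-decode sessions in parallel and watch the video engine load while they run (placeholder URL; adjust the stream count to your setup):

```bash
# Spawn 5 concurrent QuickSync decode sessions against the same camera,
# then observe the Video engine in intel_gpu_top while they run
for i in $(seq 1 5); do
  ffmpeg -hwaccel qsv -hwaccel_output_format qsv \
    -rtsp_transport tcp -i rtsp://CAMERA_IP/stream \
    -vf "vpp_qsv=framerate=2" -f null - &
done
wait
```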
Codec Compatibility: Ensure your platform supports hardware acceleration for your specific codec. Intel 6th gen+ supports H.264/HEVC; newer generations add VP9/AV1.
Implementation Requirements: Hardware decode requires proper drivers and software support (FFmpeg with QSV/VAAPI, GStreamer with hardware plugins, etc.). The software setup is usually straightforward but platform-specific. Running `ffmpeg -hwaccels` will show available hardware acceleration methods on your system, and `ffmpeg -codecs` will list supported codecs (decoders and encoders). We found it easiest to use a pre-built FFmpeg Docker image with Intel QSV support. linuxserver/ffmpeg has been a solid choice during our benchmark tests — just make sure to mount all required devices into the container so it can access them.
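A minimal sketch of what "mount all required devices" means with that image (placeholder camera URL; the linuxserver/ffmpeg image forwards its arguments to the ffmpeg binary, and the filter parameters are illustrative):

```bash
# Pass the iGPU render node into the container so QSV decode works inside it
docker run --rm --device /dev/dri/renderD128:/dev/dri/renderD128 \
  linuxserver/ffmpeg \
  -hwaccel qsv -hwaccel_output_format qsv \
  -rtsp_transport tcp -i rtsp://CAMERA_IP/stream \
  -vf "vpp_qsv=w=960:h=540" -f null -
```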
FFmpeg 8 Performance Note: With the recent official release of FFmpeg 8 and its dramatically rewritten libswscale (showing 2-40× performance improvements for scaling operations), retesting these benchmarks would provide updated performance data. The new swscale implementation may impact CPU-based scaling results and reduce the performance gap between software and hardware acceleration for resize-heavy pipelines. We're planning to revisit these measurements with FFmpeg 8 in future testing.
Appendix: Benchmark Reproduction Commands
Here are the exact FFmpeg commands we used for each test scenario, so you can reproduce these benchmarks on your own hardware.