For a multinational startup operating autonomous shopping carts in North American grocery retail, we built a camera-based barcode detection and decoding pipeline that runs on video captured inside a moving cart. It reached 86.7% video-level accuracy against a Dynamsoft baseline of 80% on the same 30-video test set.
Barcode detection in a cart context is different from barcode scanning at a checkout counter. The camera is moving. The product is moving. The angle is variable. Autofocus systems designed for static close-range scanning struggle at the distances and motion profiles of in-cart capture. Commercial barcode libraries, designed for clean, stable input, fail too often under these conditions to be useful.
Autofocus lag, reflections, and motion blur.
In-cart camera footage includes frames with autofocus transitions, specular reflections on product packaging, and motion blur from both the cart and the customer's hand. Any single frame may be undecodable. A system that evaluates frames independently will miss barcodes that are present but momentarily obscured.
Commercial libraries have high precision but poor recall.
Libraries such as Pyzbar decode reliably when they decode — but at cart-camera distances, and with degraded image quality, they often fail to detect the barcode region in the first place. This precision/recall imbalance means they produce correct outputs rarely rather than useful outputs reliably.
Barcode type diversity requires multiple decoding strategies.
A grocery store carries products with EAN-13, UPC-A, and other barcode formats in varying print quality and orientation. No single decoder performs consistently across the full type distribution — a robust pipeline needs multiple decoding strategies and a way to aggregate across them.
From YOLO localisation to a multi-frame polling pipeline that beat commercial baselines
Ran Dynamsoft and Pyzbar against the 30-video test set to establish commercial baselines before building a custom pipeline. Dynamsoft achieved 80% video-level accuracy. Pyzbar demonstrated high precision, poor recall — it decoded correctly when it decoded, but rarely decoded in the first place under cart-camera conditions.
Trained a YOLOv7 model to localise the barcode region within each frame, served via a Flask localhost HTTP endpoint. Localisation narrows the region of interest before decoding, substantially improving recall for downstream decoders that would otherwise fail on the full frame.
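YOLO-family detectors typically return boxes in normalised (cx, cy, w, h) form, so the crop step must convert them to padded, clamped pixel bounds before handing the region to decoders. A minimal sketch of that conversion, assuming centre-format normalised boxes; the function name, padding margin, and rounding are illustrative, not the production code:

```python
def bbox_to_crop(cx, cy, w, h, frame_w, frame_h, pad=0.05):
    """Convert a normalised YOLO-style (cx, cy, w, h) box to clamped
    pixel crop bounds.

    A small padding margin (illustrative value) preserves the quiet
    zones around the bars, which most 1D decoders need to lock on.
    """
    half_w = (0.5 + pad) * w
    half_h = (0.5 + pad) * h
    x0 = round(max(0.0, cx - half_w) * frame_w)
    y0 = round(max(0.0, cy - half_h) * frame_h)
    x1 = round(min(1.0, cx + half_w) * frame_w)
    y1 = round(min(1.0, cy + half_h) * frame_h)
    return x0, y0, x1, y1

# A centred box in a 1000x800 frame yields a padded pixel crop:
# bbox_to_crop(0.5, 0.5, 0.2, 0.1, 1000, 800) -> (390, 356, 610, 444)
```

The clamping matters in practice: a barcode near the frame edge produces a box whose padded bounds would otherwise fall outside the image.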
Applied Hough-based rotation correction and image enhancement to the detected crop before decoding — improving decodability on frames where the barcode is skewed or has degraded contrast.
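The angle-estimation half of that step can be sketched as follows, assuming line segments in (x0, y0, x1, y1) form such as cv2.HoughLinesP returns on an edge map of the crop. The bars of a 1D barcode yield near-parallel segments, so a robust statistic over their angles gives the skew to undo (the enhancement and warp stages are omitted; the median choice is illustrative):

```python
import math

def skew_angle_deg(segments):
    """Estimate barcode skew from Hough line segments.

    Each segment is (x0, y0, x1, y1). Opposite-direction segments
    describe the same line, so angles are folded into (-90, 90]
    before taking the median, which resists outlier segments from
    packaging edges or text.
    """
    angles = []
    for x0, y0, x1, y1 in segments:
        a = math.degrees(math.atan2(y1 - y0, x1 - x0))
        if a <= -90:
            a += 180
        elif a > 90:
            a -= 180
        angles.append(a)
    angles.sort()
    return angles[len(angles) // 2]  # median (upper value for even counts)
```

In production this estimate would drive the rotation itself, e.g. via cv2.getRotationMatrix2D and cv2.warpAffine on the crop.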
Assembled an ensemble decoder — Pyzbar, EAN-13 reader, and type-specific CNN decoders backed by a barcode database — applied to each localised crop. Each decoder contributes a candidate; the ensemble aggregates across strategies rather than failing when any single decoder cannot read the barcode.
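Structurally, the ensemble is a fan-out over decoding strategies that collects candidates rather than short-circuiting on the first failure. A sketch of that contract, with stub decoders standing in for Pyzbar, the EAN-13 reader, and the CNN decoders (the names, confidence values, and barcode string below are illustrative):

```python
def run_ensemble(crop, decoders):
    """Apply every decoding strategy to one localised crop.

    decoders: list of (name, decode_fn) pairs; each decode_fn takes a
    crop and returns (barcode_value, confidence) or None. The ensemble
    gathers every candidate instead of failing when a single strategy
    cannot read the crop.
    """
    candidates = []
    for name, decode in decoders:
        result = decode(crop)
        if result is not None:
            value, conf = result
            candidates.append({"decoder": name, "value": value, "conf": conf})
    return candidates

# Stub strategies: a blurred crop defeats the first but not the second.
stub_decoders = [
    ("pyzbar", lambda crop: None),
    ("ean13",  lambda crop: ("4006381333931", 0.90)),
]
```

One strategy failing still leaves a usable candidate, which is the property the aggregation stage depends on.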
Rather than returning the first decoded result, the pipeline aggregates decode attempts across the full video clip and returns the most probable prediction — weighted by decode frequency and confidence. This is the mechanism that converts unreliable per-frame decoding into reliable video-level accuracy.
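The polling step can be sketched as below. The production weighting is not reproduced here; summing per-candidate confidence, which rises with both decode frequency and per-decode confidence, stands in for it:

```python
from collections import defaultdict

def poll_video(frame_candidates):
    """Aggregate decode candidates across a whole clip.

    frame_candidates: one list per frame of (barcode_value, confidence)
    candidates. A value's score is the sum of its confidences across
    frames, so repeated confident decodes dominate one-off misreads.
    Returns the most probable value, or None if nothing decoded.
    """
    scores = defaultdict(float)
    for candidates in frame_candidates:
        for value, conf in candidates:
            scores[value] += conf
    if not scores:
        return None
    return max(scores, key=scores.get)

# Four frames: one empty, one with a conflicting misread. The repeated
# value wins the poll even though no single frame was decisive.
frames = [
    [("0123456789012", 0.9)],
    [],
    [("0000000000000", 0.5), ("0123456789012", 0.3)],
    [("0000000000000", 0.5)],
]
```

This is the single-function version of the mechanism described above: per-frame decoding stays unreliable, but the clip-level argmax is stable.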
A four-stage pipeline: detect the barcode region, correct for rotation and image degradation, apply an ensemble of decoding strategies, and aggregate across frames before returning a result. Each stage addresses a distinct failure mode. Removing any one of them reduces accuracy.
Pyzbar has excellent precision on a clean, aligned barcode crop and poor recall when searching the full frame. Putting a YOLOv7 localiser in front of it converts the recall problem into a detection problem the network is good at — and the same crop benefits every downstream decoder. Hough-based rotation correction and image enhancement are applied to the crop before decoding. This decomposition — localiser in front, decoders behind — recurs across our computer vision deployments.
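That decomposition reduces to a small contract, sketched here with stub callables standing in for the YOLOv7 localiser and a decoder such as Pyzbar:

```python
def detect_then_decode(frame, localise, decode):
    """Localiser-in-front decomposition: decode only the region the
    detector found, never the full frame.

    localise: frame -> crop or None (stand-in for the YOLOv7 stage)
    decode:   crop  -> value or None (stand-in for e.g. Pyzbar)

    In the pipeline the one crop produced here is shared by every
    downstream decoding strategy.
    """
    crop = localise(frame)
    if crop is None:
        return None  # no barcode region found; nothing to hand the decoder
    return decode(crop)
```

The decoder never sees a frame the detector could not localise, which is exactly how the recall burden moves from decoder to detector.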
No single decoder performs consistently across EAN-13, UPC-A, and the other barcode types present in a typical grocery catalogue. Pyzbar, an EAN-13 reader, and type-specific CNN decoders backed by a barcode database each contribute a candidate per crop. The pipeline selects the most probable result rather than failing when any individual strategy cannot decode — a structural advantage that compounds across the dataset.
Single-frame barcode decoding is unreliable at cart-camera distances — any given frame may be motion-blurred, partially occluded, or mid-autofocus. Aggregating decode candidates across the full clip, weighted by frequency and confidence, converts unreliable per-frame accuracy into reliable video-level accuracy — the metric that actually matters for checkout, where the cart has many seconds to recognise the product, not one frame.
The pipeline reached 86.7% video-level detect-and-decode accuracy on the 30-video test set, against a Dynamsoft commercial baseline of 80% measured under identical conditions. At top-5 aggregation it reached 93.3%. Three compounding changes drove the improvement: YOLOv7 localisation gave downstream decoders the clean crops they perform well on; the ensemble decoder handled barcode type diversity that no single library covers consistently; and multi-frame polling converted variable per-frame reliability into consistent video-level accuracy.
Two boundaries are worth naming. The 30-video test set is a meaningful comparison set, not a national deployment population — the gap to Dynamsoft is the directly measured one, on the same input. And the pipeline still depends on the cart having many seconds with the product in view; pure single-frame accuracy is not what this architecture optimises for. This workstream sits inside a broader multi-year smart retail engagement, providing a complementary product-identification modality alongside camera-based SKU recognition.
86.7% video-level accuracy on 30-video test set — versus 80% Dynamsoft commercial baseline on identical conditions
93.3% accuracy at top-5 multi-frame aggregation
YOLOv7 detection stage improved downstream decoder recall by isolating barcode region before decoding
Ensemble decoder (Pyzbar + EAN-13 + CNN decoders) — no single decoder covers the full barcode-type distribution consistently; the ensemble handles what each alone cannot
Multi-frame polling aggregation — the mechanism that converts unreliable per-frame decoding into reliable video-level accuracy
Reading barcodes from in-the-field video is a different problem from scanning at a checkout counter. Detection-before-decoding, decoder ensembles, and multi-frame aggregation usually decide whether the pipeline holds up at the conditions a moving camera actually delivers.