CASE STUDY

Smart Cart Object Detection and Tracking

For a multinational startup building autonomous shopping-cart technology for North American grocery retail, we advanced its in-cart perception system across a multi-year programme. The work covered detecting and tracking products inside a moving cart, distinguishing placement from removal events, and maintaining a coherent cart-content state across a complete shopping session.

Object Detection · Multi-Object Tracking · Adaptive Sampling · TensorRT

The Challenge

Detecting products inside a cart sounds straightforward until you work through the actual constraints: moving cameras, partial occlusion, placement and removal events that look similar, a store network that cannot support continuous high-resolution video from dozens of carts simultaneously, and a data collection process that was itself expensive enough to be a bottleneck.

Camera blind spots and incomplete overlap.

The dual-camera cart setup left blind spots where products could be occluded for extended periods. The system had to maintain item identity through those gaps without treating every re-emergence as a new detection.

Placement, removal, and shift events all look similar.

A customer placing an item, adjusting its position, and removing it produce visually similar motion signatures. The system needed to classify these events correctly in real time; misclassification directly affects cart-content accuracy and, downstream, checkout reliability.

Store-scale bandwidth constraint.

A store running roughly 50 simultaneous carts on a shared 10–15 Mbps network cannot sustain continuous full-resolution video streams from every cart. Continuous capture was not an option; the system had to be designed around the network, not against it.
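For scale, a quick back-of-envelope calculation shows why continuous capture cannot fit. The 10–15 Mbps and ~50-cart figures come from the constraint above; the ~2 Mbps per-stream bitrate is an illustrative assumption, not a measured value:

```python
# Back-of-envelope per-cart bandwidth budget under the store constraint.
SHARED_MBPS = 10   # lower bound of the shared 10-15 Mbps store network
NUM_CARTS = 50     # simultaneous carts

per_cart_mbps = SHARED_MBPS / NUM_CARTS
print(f"Per-cart budget: {per_cart_mbps:.2f} Mbps")  # Per-cart budget: 0.20 Mbps

# Even a modest ~2 Mbps stream (illustrative 720p H.264 figure) would
# oversubscribe the network by roughly 10x.
oversubscription = 2.0 / per_cart_mbps
print(f"Oversubscription at 2 Mbps/cart: {oversubscription:.0f}x")
```

At a fifth of a megabit per cart, the only workable designs are ones that send far less than continuous video, which is what the adaptive sampling described later provides.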

Data collection cost was itself a bottleneck.

Building training sets at cart scale required manually driving carts and capturing footage across product placements. Expensive in both time and labour, it was explicitly scoped as a problem to solve, not a passive background constraint.

[Image: A shopper interacting with products inside a grocery store aisle]

Project Timeline

From single-camera object detection to a stateful, session-scoped cart content system

Single-Camera Detection Baseline

Established a working object detection baseline on single-camera cart footage, characterising the failure modes (occlusion, motion blur, lighting variation) before adding tracking and multi-camera complexity.

Multi-Object Tracking + Global ID

Added multi-object tracking with a local ID per camera and a global ID across the cart session. Tuned the association logic to handle the specific ID-switch failure modes that appear when items are occluded or repositioned rather than removed.

Adaptive FPS Sampling

Replaced fixed-rate capture with motion-triggered adaptive frame-rate sampling. The system captures at higher rates when motion is detected and reduces to a low idle rate when the cart is stationary, bringing bandwidth consumption within the store network constraint without sacrificing event detection.

Stateful Cart-State Model

Moved from stateless per-video processing to a session-scoped cart-state model: storing per-item features and last-known locations, keyed to a session identifier, so the cart's contents persist across the full shopping trip rather than resetting per clip.

Dual-Camera Fusion & Z-Order

Integrated detections from dual cameras using cosine similarity on feature vectors combined with IOU and location cues. Added z-order estimation to handle stacked items, a common failure mode when products are placed on top of each other in a full cart.
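One simple heuristic for z-order is placement time: an item that overlaps an earlier-placed item is assumed to sit on top of it. The sketch below is a minimal illustration of that idea; the `TrackedItem` fields, the overlap threshold, and the later-placed-on-top assumption are ours, not the production logic:

```python
from dataclasses import dataclass

@dataclass
class TrackedItem:
    item_id: str
    bbox: tuple          # (x1, y1, x2, y2) in cart-camera coordinates
    placed_at: float     # session timestamp of the placement event

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def z_order(items, overlap_thresh=0.3):
    """Assign a z-layer per item: an item overlapping an earlier-placed
    item is assumed to rest on top of it (later placement => higher layer)."""
    layers = {}
    for item in sorted(items, key=lambda i: i.placed_at):
        below = [layers[o.item_id] for o in items
                 if o.placed_at < item.placed_at
                 and iou(item.bbox, o.bbox) > overlap_thresh]
        layers[item.item_id] = max(below, default=-1) + 1
    return layers
```

An item placed at t=2 over one placed at t=1 lands on layer 1; items with no overlap stay at layer 0. Real placements also involve shifts and partial removals, so a deployed version would need to revise layers on motion events.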

The Solution

We built a layered perception system where each component addresses a specific real-world constraint rather than a benchmark metric. The architecture is observable by design: every layer produces inspectable intermediate outputs rather than a single opaque end-to-end prediction.

Detection + Tracking Pipeline

Real-time object detection feeds a tracking layer that maintains local identity per camera and a global identity across the session. Association uses cosine similarity on feature vectors combined with IOU and location cues, making identity robust to occlusion and repositioning without requiring re-detection to resolve it. This is a recurring pattern in our production computer vision work: tracking is a separate, modular concern from detection, and treating it that way avoids the worst class of identity-switch failures.
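The association step can be sketched as a combined appearance-plus-overlap score. The weights, threshold, and greedy matching below are illustrative (production trackers often use Hungarian assignment instead), not the system's actual implementation:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def associate(tracks, detections, w_app=0.6, w_iou=0.4, min_sim=0.3):
    """Greedy track-detection association on a weighted score combining
    appearance (cosine similarity) and spatial overlap (IOU)."""
    scores = []
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            s = (w_app * cosine_sim(t["feat"], d["feat"])
                 + w_iou * iou(t["bbox"], d["bbox"]))
            scores.append((s, i, j))
    matches, used_t, used_d = [], set(), set()
    for s, i, j in sorted(scores, reverse=True):
        if s >= min_sim and i not in used_t and j not in used_d:
            matches.append((i, j))
            used_t.add(i)
            used_d.add(j)
    return matches
```

Because appearance carries most of the weight, an item that reappears after an occlusion gap with near-zero IOU can still re-associate to its existing track rather than spawning a new identity.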

Adaptive FPS Sampling

Frame rate is controlled by detected motion rather than a fixed clock. Bandwidth drops by orders of magnitude during idle periods while detection quality is preserved during active placement events. The alternative (lower resolution at fixed rate) degrades detection quality across the board, a worse trade for the same bandwidth budget.
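A minimal sketch of motion-triggered rate control: score the change between consecutive grayscale frames and switch between an idle and an active capture rate. The frame-differencing score, thresholds, and rates are assumed values for illustration, not the deployed parameters:

```python
IDLE_FPS = 1          # illustrative idle capture rate
ACTIVE_FPS = 15       # illustrative rate during placement/removal events
MOTION_THRESH = 12.0  # assumed mean per-pixel abs difference threshold (0-255)

def motion_score(prev, curr):
    """Mean absolute per-pixel difference between two grayscale frames
    (frames as 2-D lists of intensities)."""
    total = sum(abs(p - c)
                for row_p, row_c in zip(prev, curr)
                for p, c in zip(row_p, row_c))
    return total / (len(prev) * len(prev[0]))

def target_fps(prev, curr):
    """Motion-triggered rate control: capture fast only while the cart
    contents are changing, drop to the idle rate otherwise."""
    return ACTIVE_FPS if motion_score(prev, curr) > MOTION_THRESH else IDLE_FPS
```

In practice the differencing would run on downsampled frames (e.g. via OpenCV's `cv2.absdiff`) with hysteresis so the rate does not flap on borderline motion; this sketch keeps only the core decision.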

Session-Scoped Cart State

Cart contents are tracked across the full shopping session, not reset per video clip. Per-item features and last-known locations are stored and updated as items are added, moved, or removed. Treating each frame or clip independently produces unrecoverable ID switches when items are temporarily occluded; stateful session memory is the architectural prerequisite, not an optimisation.
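The session-scoped state can be sketched as a small per-session store keyed by global item ID, holding the feature and last-known location described above. Class and method names here are illustrative, not the system's actual interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class CartItem:
    global_id: str
    feature: list        # appearance embedding from the detector
    last_bbox: tuple     # last-known location in cart coordinates
    last_seen: float     # session timestamp
    in_cart: bool = True

@dataclass
class CartSession:
    """Session-scoped cart state: contents persist across clips and
    occlusion gaps, keyed by a session identifier."""
    session_id: str
    items: dict = field(default_factory=dict)

    def on_placement(self, global_id, feature, bbox, ts):
        self.items[global_id] = CartItem(global_id, feature, bbox, ts)

    def on_update(self, global_id, bbox, ts):
        item = self.items[global_id]
        item.last_bbox, item.last_seen = bbox, ts

    def on_removal(self, global_id, ts):
        item = self.items[global_id]
        item.in_cart, item.last_seen = False, ts

    def contents(self):
        """Current cart contents for checkout reconciliation."""
        return [i.global_id for i in self.items.values() if i.in_cart]
```

Keeping removed items in the store (flagged rather than deleted) preserves their features, so a re-placed item can re-associate to its prior identity instead of appearing as a new product.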

Technical Specifications

Frameworks PyTorch, TensorRT, OpenCV
Pipeline architecture Custom multithreaded pipeline; TensorRT thread safety explicit by design
Target hardware Edge-first (in-cart compute); server-side option for store-level aggregation
Association method Cosine similarity on feature vectors + IOU + location cues
Identity scope Local ID per camera; global ID across session (session identifier)
Stacking handling Z-order estimation for layered items
Store network constraint 10–15 Mbps shared across ~50 simultaneous carts
Sampling approach Adaptive FPS: motion-triggered, not fixed-rate
Data collection Simulation environment for ID-matching evaluation; robotics-assisted acquisition explored
[Image: Retail store environment with stocked aisles, the deployment context for in-cart perception]

The Outcome

The system advanced through several iterative phases across a multi-year programme. Two architectural decisions did most of the heavy lifting. Adaptive FPS sampling resolved the store-scale bandwidth constraint that had made continuous-capture architectures unviable on a typical store network. The session-scoped cart-state model shifted the system from a per-clip processor to a session-aware tracker capable of maintaining cart contents across a full shopping trip. Both are systems-design solutions, not model-accuracy improvements: a recurring pattern in our computer vision deployments where the win comes from the surrounding architecture rather than a more accurate model.

The programme also included parallel workstreams in smart retail SKU recognition, multi-camera store tracking, shelf analytics, and security action recognition, all sharing the same camera infrastructure and perception backbone.

Key Achievements

Adaptive FPS sampling resolved the store-scale bandwidth constraint that had made continuous-capture architectures unviable on a typical store network

Session-scoped cart-state model maintained the cart's contents across the full shopping trip, not just per clip

TensorRT multithreaded pipeline with explicit thread safety, designed for concurrent multi-stream operation

Dual-camera fusion with cosine similarity + IOU + z-order estimation for stacked and occluded items

Multi-year engagement advancing in-cart perception for autonomous grocery checkout

Adjacent Systems in This Engagement

Computer Vision Services

Our computer vision services cover classical computer vision, human-supervised system design for legal compliance, video pipeline optimisation with tools like FFmpeg, custom adaptable models, and explainable AI for transparency.

Computer vision

Retail AI Solutions

We build production-ready CV systems for smart retail environments (in-cart perception, shelf analytics, SKU recognition, and security), all deployable on existing camera infrastructure without costly hardware upgrades.

Retail

GPU Performance Engineering

We deliver GPU-accelerated inference pipelines optimised for constrained edge hardware and high-throughput server deployments: profiling-led, architecture-first, with measurable performance outcomes.

GPU

Building Perception for Autonomous Retail?

In-store perception systems usually fail on the surrounding constraints (bandwidth, session continuity, thread safety under load) long before they fail on model accuracy. The right architecture decisions sit around the model, not inside it.