For a multinational startup building autonomous shopping cart technology in North American grocery retail, we advanced the in-cart perception system across a multi-year programme. The work covered detecting and tracking products inside a moving cart, distinguishing placement from removal events, and maintaining a coherent cart-content state across a complete shopping session.
Detecting products inside a cart sounds straightforward until you work through the actual constraints: moving cameras, partial occlusion, placement and removal events that look similar, a store network that cannot support continuous high-resolution video from dozens of carts simultaneously, and a data collection process that was itself expensive enough to be a bottleneck.
Camera blind spots and incomplete overlap.
The dual-camera cart setup left blind spots where products could be occluded for extended periods. The system had to maintain item identity through those gaps without treating every re-emergence as a new detection.
Placement, removal, and shift events all look similar.
Placing an item, adjusting its position, and removing it all produce visually similar motion signatures. The system needed to classify these events correctly in real time: misclassification directly affects cart-content accuracy and, downstream, checkout reliability.
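One way to make the distinction concrete is to classify a completed motion burst by where the item starts and ends relative to the cart, plus how far it moved. This is a simplified sketch, not the production logic; the `Track` fields and the threshold value are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Track:
    start_in_cart: bool      # was the item inside the cart at burst start? (assumed feature)
    end_in_cart: bool        # and at burst end?
    net_displacement: float  # pixels moved between first and last detection

def classify_event(t: Track, shift_threshold: float = 40.0) -> str:
    """Map a completed motion burst to a cart event (illustrative heuristic)."""
    if not t.start_in_cart and t.end_in_cart:
        return "placement"
    if t.start_in_cart and not t.end_in_cart:
        return "removal"
    if t.start_in_cart and t.end_in_cart and t.net_displacement > shift_threshold:
        return "shift"       # item repositioned but still in the cart
    return "no_change"
```

The hard cases in practice are the ones this sketch glosses over: bursts where occlusion hides the start or end state, which is why the real system leans on tracking rather than per-burst heuristics alone.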
Store-scale bandwidth constraint.
A store running roughly 50 simultaneous carts on a shared 10-15 Mbps network cannot sustain continuous full-resolution video streams from every cart. Continuous capture was not an option; the system had to be designed around the network, not against it.
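The arithmetic behind that constraint is worth making explicit. The cart count and network range come from the figures above; the per-stream bitrate is an assumed ballpark for illustration.

```python
# Per-cart bandwidth budget on a shared store network (illustrative).
carts = 50
network_mbps = 12.0                        # midpoint of the 10-15 Mbps range
per_cart_kbps = network_mbps * 1000 / carts
print(per_cart_kbps)                       # 240.0 kbps per cart, shared by two cameras

# A single continuous 720p H.264 stream typically needs on the order of
# 1-2 Mbps (assumed figure), so continuous capture overshoots the budget
# by roughly an order of magnitude per camera.
```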
Data collection cost was itself a bottleneck.
Building training sets at cart scale required manually driving carts and capturing footage across product placements. This was expensive in both time and labour, and it was explicitly scoped as a problem to solve rather than accepted as a passive background constraint.
From single-camera object detection to a stateful, session-scoped cart-content system
Established a working object detection baseline on single-camera cart footage, characterising the failure modes (occlusion, motion blur, lighting variation) before adding tracking and multi-camera complexity.
Added multi-object tracking with a local ID per camera and a global ID across the cart session. Tuned the association logic to handle the specific ID-switch failure modes that appear when items are occluded or repositioned rather than removed.
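The local-to-global ID mapping can be sketched as an appearance-based registry: a reappearing item is matched against stored features instead of being issued a fresh identity. This is a minimal sketch using cosine similarity only; the association logic described here also uses IOU and location cues, and the class name and threshold are illustrative assumptions.

```python
import numpy as np

class GlobalIdRegistry:
    """Assign session-wide global IDs to per-camera local tracks (sketch)."""

    def __init__(self, sim_threshold: float = 0.7):
        self.sim_threshold = sim_threshold
        self.features = {}        # global_id -> stored unit feature vector
        self._next_id = 0

    def assign(self, feature: np.ndarray) -> int:
        """Return the global ID for this detection, reusing an existing one
        when the appearance match beats the threshold."""
        feature = feature / np.linalg.norm(feature)
        best_id, best_sim = None, self.sim_threshold
        for gid, stored in self.features.items():
            sim = float(feature @ stored)          # cosine similarity
            if sim > best_sim:
                best_id, best_sim = gid, sim
        if best_id is None:                        # genuinely new item
            best_id = self._next_id
            self._next_id += 1
        self.features[best_id] = feature           # refresh stored appearance
        return best_id
```

The point of the structure is the failure mode it avoids: an item that goes dark behind other products and re-emerges matches its stored feature and keeps its global ID, rather than being counted twice.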
Replaced fixed-rate capture with motion-triggered adaptive frame rate sampling. The system captures at higher rates when motion is detected and reduces to a low idle rate when the cart is stationary, bringing bandwidth consumption within the store network constraint without sacrificing event detection.
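A minimal version of the motion trigger is frame differencing feeding a two-state rate controller. This is a sketch under assumed thresholds and rates, not the production detector.

```python
import numpy as np

class AdaptiveFrameRate:
    """Motion-triggered frame-rate controller (illustrative parameters)."""

    def __init__(self, idle_fps: float = 1.0, active_fps: float = 15.0,
                 motion_threshold: float = 12.0):
        self.idle_fps = idle_fps
        self.active_fps = active_fps
        self.motion_threshold = motion_threshold
        self.prev = None

    def next_fps(self, frame: np.ndarray) -> float:
        """Return the capture rate to use after seeing this frame."""
        current = frame.astype(np.float32)
        if self.prev is None:
            self.prev = current
            return self.idle_fps
        # Mean absolute pixel difference as a cheap motion score.
        score = float(np.mean(np.abs(current - self.prev)))
        self.prev = current
        return self.active_fps if score > self.motion_threshold else self.idle_fps
```

A real controller would add hysteresis (a hold-off before dropping back to idle) so a brief pause mid-placement does not throttle capture at exactly the wrong moment.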
Moved from stateless per-video processing to a session-scoped cart-state model: storing per-item features and last-known locations, keyed to a session identifier, so the cart's contents persist across the full shopping trip rather than resetting per clip.
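The session-scoped state can be sketched as a small keyed store: per-item features and last-known locations under a session identifier, with removals marked rather than deleted so history survives the trip. Class and field names here are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass, field

@dataclass
class ItemState:
    feature: list            # appearance feature vector (illustrative)
    last_location: tuple     # last-known (x, y) in cart coordinates
    in_cart: bool = True

@dataclass
class CartSession:
    """Cart contents keyed to a session, persisting across video clips."""
    session_id: str
    items: dict = field(default_factory=dict)   # global_id -> ItemState

    def upsert(self, global_id: int, feature, location):
        self.items[global_id] = ItemState(feature, location)

    def remove(self, global_id: int):
        if global_id in self.items:
            self.items[global_id].in_cart = False   # keep history, mark removed

    def contents(self):
        return [gid for gid, s in self.items.items() if s.in_cart]
```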
Integrated detections from dual cameras using cosine similarity on feature vectors combined with IOU and location cues. Added z-order estimation to handle stacked items, a common failure mode when products are placed on top of each other in a full cart.
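The cosine-plus-IOU combination reduces to a weighted match score between detections from the two cameras. The weights below are illustrative assumptions; z-order estimation sits on top of this and is not shown.

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def fusion_score(feat_a, feat_b, box_a, box_b, w_app=0.6, w_iou=0.4):
    """Weighted match score: appearance similarity plus spatial overlap."""
    cos = float(np.dot(feat_a, feat_b) /
                (np.linalg.norm(feat_a) * np.linalg.norm(feat_b)))
    return w_app * cos + w_iou * iou(box_a, box_b)
```

Weighting appearance above overlap is the natural choice here: two cameras viewing the same item from different angles agree on appearance far more reliably than on box geometry.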
We built a layered perception system where each component addresses a specific real-world constraint rather than a benchmark metric. The architecture is observable by design β every layer produces inspectable intermediate outputs rather than a single opaque end-to-end prediction.
Real-time object detection feeds a tracking layer that maintains local identity per camera and a global identity across the session. Association uses cosine similarity on feature vectors combined with IOU and location cues, robust to occlusion and repositioning without requiring re-detection to resolve identity. This is a recurring pattern in our production computer vision work: tracking is a separate, modular concern from detection, and treating it that way avoids the worst class of identity-switch failures.
Frame rate is controlled by detected motion rather than a fixed clock. Bandwidth drops by orders of magnitude during idle periods while detection quality is preserved during active placement events. The alternative (lower resolution at fixed rate) degrades detection quality across the board, a worse trade for the same bandwidth budget.
Cart contents are tracked across the full shopping session, not reset per video clip. Per-item features and last-known locations are stored and updated as items are added, moved, or removed. Treating each frame or clip independently produces unrecoverable ID switches when items are temporarily occluded; stateful session memory is the architectural prerequisite, not an optimisation.
The system advanced through several iterative phases across a multi-year programme. Two architectural decisions did most of the heavy lifting. Adaptive FPS sampling resolved the store-scale bandwidth constraint that had made continuous-capture architectures unviable on a typical store network. The session-scoped cart-state model shifted the system from a per-clip processor to a session-aware tracker capable of maintaining cart contents across a full shopping trip. Both are systems-design solutions, not model-accuracy improvements: a recurring pattern in our computer vision deployments where the win comes from the surrounding architecture rather than a more accurate model.
The programme also included parallel workstreams in smart retail SKU recognition, multi-camera store tracking, shelf analytics, and security action recognition, all sharing the same camera infrastructure and perception backbone.
Adaptive FPS sampling resolved the store-scale bandwidth constraint that had made continuous-capture architectures unviable on a typical store network
Session-scoped cart-state model maintained the cart's contents across the full shopping trip, not just per clip
TensorRT multithreaded pipeline with explicit thread-safety, designed for concurrent multi-stream operation
Dual-camera fusion with cosine similarity + IOU + z-order estimation for stacked and occluded items
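The thread-safety concern in the multithreaded pipeline highlight can be sketched generically. This is a pure-Python stand-in, not TensorRT code: the key property it mirrors is that each worker owns its own inference context (TensorRT execution contexts are not safe to share across threads) and all cross-thread communication goes through thread-safe queues.

```python
import queue
import threading

def run_pipeline(streams, infer, num_workers=2):
    """Fan frames from multiple camera streams across worker threads.

    `streams` maps a stream id to a list of frames; `infer` stands in
    for a per-thread inference context. No mutable state is shared
    between workers except the two thread-safe queues.
    """
    tasks = queue.Queue()
    results = queue.Queue()

    for stream_id, frames in streams.items():
        for frame in frames:
            tasks.put((stream_id, frame))

    def worker():
        while True:
            try:
                stream_id, frame = tasks.get_nowait()
            except queue.Empty:
                return                      # no more work for this thread
            results.put((stream_id, infer(frame)))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    out = []
    while not results.empty():
        out.append(results.get())
    return out
```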
Multi-year engagement advancing in-cart perception for autonomous grocery checkout
In-store perception systems usually fail on the surrounding constraints (bandwidth, session continuity, thread safety under load) long before they fail on model accuracy. The right architecture decisions sit around the model, not inside it.