Streamlining Sorting and Counting Processes with AI

How AI sorts and counts on production lines — YOLOv8 instance segmentation for size grading and YOLO-World zero-shot detection for ripeness counting.

Written by TechnoLynx. Published on 19 Nov 2024.

Counting and sorting items accurately is a foundational task in manufacturing, food processing, and logistics. For decades it was manual, error-prone, and slow. Computer vision changed that — and the choice of which vision approach to use now matters more than the choice to automate at all.

This article walks through how AI-driven sorting and counting actually works on a line: where rule-based machine vision earns its place, where learned models like YOLOv8 and YOLO-World fit, and how to build a working prototype that grades fruit by size and counts items by visual class. The intent is practical: enough texture to choose the right tool for an inspection problem, with code you can run against your own images.

For the broader decision between a packaged machine-vision system (Keyence/Cognex-style, deterministic, hardware-bound) and a custom computer vision deployment, see our decision framework for machine vision versus computer vision in manufacturing inspection. This piece sits one level below: it assumes you’ve decided computer vision is in scope and you want to know how the counting-and-sorting layer is built.

What does “AI sorting and counting” actually mean?

The phrase covers three loosely related capabilities:

  • Counting — detecting discrete objects in an image or video frame and producing a tally, optionally broken down by class.
  • Sorting — deciding, per object, which downstream path it should take (accepted, rejected, routed by size, routed by colour, routed by defect).
  • Grading — assigning a continuous or ordinal score to each object (size, ripeness, defect severity) that downstream logic uses for sorting or counting.

In practice these collapse into the same vision pipeline: detect objects, extract per-object attributes, then aggregate. The interesting question is which model does the detection and which features drive the sorting decision.
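The way the three capabilities share one pipeline can be sketched in a few lines of Python. This is an illustration only: the `Detection` record, the `grade`, `route`, and `tally` helpers, the lane names, and the 30 cm² threshold are all hypothetical, chosen for the example rather than taken from any deployment.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical per-object record produced by a detector."""
    label: str        # class name, e.g. "apple"
    area_cm2: float   # measured attribute used for grading


def grade(det: Detection, large_cm2: float = 30.0) -> str:
    """Grading: map a continuous attribute to an ordinal score."""
    return "large" if det.area_cm2 >= large_cm2 else "small"


def route(det: Detection) -> str:
    """Sorting: choose a downstream path from the grade."""
    return "lane_A" if grade(det) == "large" else "lane_B"


def tally(dets: list[Detection]) -> Counter:
    """Counting: tally objects per class."""
    return Counter(d.label for d in dets)


detections = [
    Detection("apple", 35.2),
    Detection("apple", 22.1),
    Detection("pear", 31.0),
]
counts = tally(detections)
routes = [route(d) for d in detections]
print(counts, routes)
```

Grading feeds sorting, and counting is an aggregate over the same records, which is why the choice of detector and attributes matters more than the aggregation code.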

Decision surface: which vision approach fits which task?

Task shape | Best fit | Why
Fixed part, fixed lighting, binary pass/fail | Rule-based machine vision (template matching, blob analysis) | Deterministic, auditable, low latency (an observed pattern across high-throughput lines)
Variable appearance (organic produce, textiles), known classes | Trained detector (YOLOv8, Mask R-CNN) | Tolerates variation; needs labelled data and revalidation when the input distribution shifts
Open vocabulary, classes change often, low-volume sorting | Zero-shot detector (YOLO-World, OWL-ViT) | No retraining for new classes; lower precision than a fine-tuned model on the same task
Defect detection with rare positives | Anomaly detection on top of a detector | Pure classification fails when defect examples are scarce; reconstruction-based methods do better
Continuous attribute (size, area, count of sub-features) | Instance segmentation + geometric measurement | Bounding boxes lose shape information; masks let you compute area, perimeter, aspect ratio

This is an observed pattern from sorting and grading deployments we’ve worked on; it is not a universal ranking. The “best fit” column shifts when throughput, regulatory audit requirements, or maintenance team skill change. A line that runs three SKUs forever does not need YOLO-World. A produce sorter that swaps between berries and stone fruit by season probably does.

The vision stack: from pixels to decisions

A working sorting-and-counting system has four layers, and each carries its own failure modes.

Image acquisition. Camera, lens, and lighting choices matter more than the model. Inconsistent lighting kills learned models faster than it kills rule-based ones, because the training distribution rarely covers every lighting state of a real factory. Backlit setups, telecentric lenses, and polarised illumination exist for reasons — they remove ambiguity at the optical layer rather than asking the model to solve it.

Detection. This is where YOLO-family models, Mask R-CNN, and zero-shot detectors live. For counting and sorting work the choice between bounding-box detection (YOLOv8 detection head) and instance segmentation (YOLOv8-seg, Mask R-CNN) is driven by whether you need shape, not just presence. Counting apples? A box is enough. Grading apples by area? You need the mask.
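To see why a box is not enough for grading, the sketch below compares box area with mask area for a synthetic round object. It uses only the standard library; the `disc_mask`, `mask_area`, and `box_area` helpers and the grid size are assumptions made for the illustration, and a real mask would come from the segmentation model.

```python
def disc_mask(size: int, radius: int) -> list[list[int]]:
    """Binary mask of a filled disc centred in a size x size grid."""
    c = size // 2
    return [
        [1 if (x - c) ** 2 + (y - c) ** 2 <= radius ** 2 else 0 for x in range(size)]
        for y in range(size)
    ]


def mask_area(mask: list[list[int]]) -> int:
    """Number of pixels set in the mask."""
    return sum(sum(row) for row in mask)


def box_area(mask: list[list[int]]) -> int:
    """Area of the tightest axis-aligned box around the mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    return (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)


mask = disc_mask(201, 80)
ratio = mask_area(mask) / box_area(mask)
print(f"mask/box area ratio: {ratio:.3f}")
```

For a round object the ratio lands near π/4, so a box-based area overstates the object by roughly a quarter, and the error changes with shape and orientation. That variation is what the mask removes.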

Attribute extraction. Given a detection, what do you measure? Mask area in calibrated units, dominant colour in HSV space, texture features, sub-region defect probability. This is where lightweight OpenCV operations slot in cleanly between the detector’s output and the sorting decision.
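As one example of attribute extraction, the sketch below estimates the dominant hue of the pixels inside a detection. It is a standard-library stand-in for what would normally be an OpenCV histogram over the masked region; the `dominant_hue_deg` name, the 12-bin histogram, and the saturation and brightness cut-offs are assumptions for the example.

```python
import colorsys
from collections import Counter


def dominant_hue_deg(pixels_rgb, bins=12):
    """Return the centre (in degrees) of the most populated hue bin.

    pixels_rgb: iterable of (r, g, b) tuples with 0-255 channels,
    assumed to be the pixels that fall inside one object's mask.
    Returns None when no pixel has a usable hue.
    """
    width = 360 / bins
    hist = Counter()
    for r, g, b in pixels_rgb:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        # Hue is unreliable for greyish or very dark pixels, so skip them.
        if s < 0.2 or v < 0.2:
            continue
        hist[int(h * 360 // width) % bins] += 1
    if not hist:
        return None
    top_bin, _ = hist.most_common(1)[0]
    return top_bin * width + width / 2


mostly_red = [(200, 30, 30)] * 5 + [(90, 180, 40)] * 2
hue = dominant_hue_deg(mostly_red)
print(hue)
```

A histogram is used instead of a mean because hue is circular: averaging reds on either side of 0° would produce a misleading value.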

Decision and aggregation. The sorting logic itself is usually simple — a threshold, a sort, a class lookup. The harder part is aggregation across frames: not double-counting an apple that appears in three consecutive frames, handling occlusion, and reconciling counts when objects enter and leave the field of view. Tracking-by-detection (ByteTrack, BoT-SORT) is the standard answer here.
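The aggregation step can be sketched as counting each track once, when it crosses a virtual line. The sketch assumes a tracker such as ByteTrack has already assigned stable IDs; it does not implement the tracker. The `count_line_crossings` function and the `(track_id, y)` frame format are hypothetical, chosen to keep the example small.

```python
def count_line_crossings(frames, line_y):
    """Count each track ID once, when its centre first moves past line_y.

    frames: list of frames, each a list of (track_id, y_centre) pairs.
    Objects are assumed to travel in the direction of increasing y.
    """
    last_y = {}
    counted = set()
    for frame in frames:
        for track_id, y in frame:
            prev = last_y.get(track_id)
            # A crossing needs a previous position on the near side of the line.
            if prev is not None and prev < line_y <= y:
                counted.add(track_id)
            last_y[track_id] = y
    return len(counted)


frames = [
    [(1, 10.0), (2, 40.0)],
    [(1, 30.0), (2, 55.0)],   # track 2 crosses y = 50
    [(1, 52.0), (2, 70.0)],   # track 1 crosses y = 50
    [(1, 60.0), (3, 5.0)],    # track 3 has entered but not crossed
]
total = count_line_crossings(frames, 50.0)
print(total)
```

Counting on a line crossing, not on every detection, is what stops an object seen in several consecutive frames from being tallied more than once.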

Where AI counting and sorting is deployed

The list below is illustrative rather than exhaustive — these are application shapes that recur across our engagements and in published case studies.

Automotive assembly

Fastener counting and defect detection on assembly lines: robots equipped with cameras run a detector (typically a CNN-based model in the YOLO family) over each station, classify fasteners by type, flag defects, and feed counts back to the line control system. The throughput requirement is high (low milliseconds per frame), the part catalogue is fixed, and audit traceability matters — which pushes the design toward a fine-tuned detector with deterministic post-processing rather than a zero-shot approach.

Adjacent reading: AI is reshaping the automotive industry.

Traffic management

IoT edge cameras count vehicles, classify them by type (car/truck/motorcycle/bus), and aggregate counts in the cloud. The vision stack is straightforward; the engineering challenge is edge deployment: running a detector on a power-constrained device with unreliable connectivity. Edge inference reduces end-to-end latency and avoids streaming video to the cloud, which is a bandwidth and privacy win.

Adjacent reading: AI’s role in smart solutions for traffic and transportation.

Pharmaceutical packaging

Pill counting and inspection in blister packs and bottles, often with 360-degree multi-camera rigs. The regulatory regime (GMP, FDA validation) drives the design more than the vision technology does: every decision must be auditable, every model change must be revalidated, and false-negative tolerance is effectively zero. This pushes pharma inspection toward deterministic rule-based machine vision with learned models in a supporting role, not the other way around.

Adjacent reading: AI in pharmaceutics — automating meds.

Agriculture and livestock

Drone-mounted detectors that count and classify livestock over large areas, applying instance segmentation to distinguish individuals in close groups. The class catalogue is small but the visual variation (lighting, pose, partial occlusion) is large, which is exactly the regime where learned models beat rule-based pipelines.

Adjacent reading: smart farming and AI in livestock management.

Food processing

Counting and grading produce at speed. The example we work through below — apple grading by size and apple counting by ripeness — is a stripped-down version of the same shape. The food and beverage AI market is forecast to reach roughly USD 214 billion by 2033 (Precedence Research; directional industry-scale macro estimate, not an operational benchmark for any single deployment).

Adjacent reading: how the food industry is reconfigured by AI and edge computing.

Worked example: grading apples by size with YOLOv8-seg

The goal is to detect apples in a still image, compute the area of each apple’s mask in calibrated units, and surface only the largest 50% — a simple size-grade sort.

1. Install and import

# Shell: install the dependencies once
pip install ultralytics opencv-contrib-python

# Python: imports used in the steps below
from ultralytics import YOLO
import numpy as np
from pathlib import Path
import cv2

2. Load the instance segmentation model

YOLOv8 segmentation weights pretrained on COCO already know what an apple looks like, so no fine-tuning is needed for the prototype.

model = YOLO('yolov8n-seg.pt')

3. Calibrate pixels to physical units

This is the step most prototypes skip and then regret. Without calibration the “size” you compute is in pixels, which means nothing once camera distance or zoom changes. A reference object of known size in the frame fixes this; here we hard-code a ratio for clarity.

RATIO_PIXEL_TO_CM = 78          # 78 pixels per cm at this resolution
RATIO_PIXEL_TO_SQUARE_CM = 78 * 78
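Rather than hard-coding the ratio, it can be derived from a reference object of known size. The sketch below assumes a camera looking straight down at a flat surface, with the reference at the same distance as the fruit and negligible lens distortion. The function names and the 2.5 cm reference width are hypothetical.

```python
def pixels_per_cm(ref_width_px: float, ref_width_cm: float) -> float:
    """Scale factor from a reference object of known physical width."""
    if ref_width_px <= 0 or ref_width_cm <= 0:
        raise ValueError("reference measurements must be positive")
    return ref_width_px / ref_width_cm


def area_px_to_cm2(area_px: float, px_per_cm: float) -> float:
    """Convert an area in pixels to cm²; area scales with the square of the ratio."""
    return area_px / (px_per_cm ** 2)


# A 2.5 cm wide marker that spans 195 pixels gives 78 pixels per cm.
scale = pixels_per_cm(195.0, 2.5)
area = area_px_to_cm2(60840, scale)
print(scale, area)
```

If the camera height or zoom changes, the reference has to be measured again; the ratio is a property of the setup, not of the model.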

4. Run prediction and iterate over detections

results = model.predict('path/to/image')
area_list = []

for r in results:
    img = np.copy(r.orig_img)

    for c in r:
        # The COCO model detects 80 classes; keep only apples.
        label = c.names[c.boxes.cls.tolist().pop()]
        if label != 'apple':
            continue

        # Polygon outline of this instance's mask, in pixel coordinates.
        contour = c.masks.xy.pop().astype(np.int32).reshape(-1, 1, 2)

        # Area enclosed by the mask outline, converted from pixels to cm².
        area_cm = cv2.contourArea(contour) / RATIO_PIXEL_TO_SQUARE_CM
        area_list.append(round(area_cm, 2))

        # Draw the outline and label the apple at the top-left of its box.
        x1, y1, x2, y2 = c.boxes.xyxy.cpu().numpy().squeeze().astype(np.int32)
        cv2.drawContours(img, [contour], -1, (255, 0, 255), 2)
        cv2.putText(img, f"Size: {round(area_cm, 2)}", (int(x1), int(y1)),
                    cv2.FONT_HERSHEY_PLAIN, 1, (255, 0, 255), 2)

5. Sort and filter to the largest 50%

area_list.sort(reverse=True)
half_index = max(1, len(area_list) // 2)
largest_50_percent = area_list[:half_index]

A second pass over the detections renders only the apples whose area falls in largest_50_percent — that’s the sort decision, surfaced visually.
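One way to make that second pass unambiguous is to select detections by index instead of matching area values, since two apples can share the same rounded area. The sketch below is a plain-Python illustration; `largest_half` is a hypothetical helper, not part of Ultralytics or OpenCV.

```python
def largest_half(areas: list[float]) -> list[int]:
    """Return the indices of the larger half of the measurements.

    At least one index is returned for a non-empty list, and ties keep
    their original detection order.
    """
    if not areas:
        return []
    keep = max(1, len(areas) // 2)
    ranked = sorted(range(len(areas)), key=lambda i: areas[i], reverse=True)
    return sorted(ranked[:keep])


areas = [41.2, 28.7, 35.0, 35.0, 19.4]
selected = largest_half(areas)
print(selected)
```

The returned indices line up with the order in which detections were measured, so the render loop can draw only those apples.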

The thing worth noticing: the heavy lifting is done by the detector, but the sorting logic is plain Python and OpenCV. The model gives you per-object masks; the rest is arithmetic. This separation is what makes the pipeline auditable.

Worked example: counting apples by ripeness with YOLO-World

YOLO-World is a zero-shot detector — you specify the classes you want to find as text prompts, no fine-tuning required. It is not as accurate on a fixed task as a model fine-tuned on that task, but it is dramatically faster to deploy when classes change or training data is scarce.

from ultralytics import YOLOWorld
import supervision as sv
import cv2

model = YOLOWorld('yolov8s-world.pt')
model.set_classes(["Red Apple", "Green Apple"])

img = cv2.imread("Image.png")
results = model.predict(img)

detections = sv.Detections.from_ultralytics(results[0])
detection_list = detections.data['class_name']

red_count = sum(1 for item in detection_list if item == "Red Apple")
green_count = sum(1 for item in detection_list if item == "Green Apple")

font = cv2.FONT_HERSHEY_SIMPLEX
cv2.putText(img, f'Ripe Apples: {red_count}',   (10, 30), font, 1, (255, 255, 0), 2, cv2.LINE_AA)
cv2.putText(img, f'Unripe Apples: {green_count}',(10, 60), font, 1, (255, 255, 0), 2, cv2.LINE_AA)

Two things to flag for anyone moving this from prototype to production:

  • Ripeness via colour alone is a crude proxy. Production sorters often add spectral imaging (near-infrared bands), because chlorophyll content tracks ripeness better than RGB hue does, and some apple varieties stay green when fully ripe. The colour-based version is a useful teaching example, not a deployment design.
  • Zero-shot accuracy on agricultural produce varies substantially by class, lighting, and background. Validate on your own images before assuming the prompt works.
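When validating a prompt, it helps to tally per class only above a confidence threshold and to see how the counts move as the threshold changes. The sketch below assumes parallel lists of class names and confidences have already been pulled from the detector output; `count_by_class` and the 0.25 default are hypothetical placeholders, not recommended settings.

```python
from collections import Counter


def count_by_class(class_names, confidences, min_conf=0.25):
    """Tally detections per class, ignoring those below min_conf."""
    return Counter(
        name
        for name, conf in zip(class_names, confidences)
        if conf >= min_conf
    )


names = ["Red Apple", "Green Apple", "Red Apple", "Red Apple"]
scores = [0.91, 0.40, 0.12, 0.55]

loose = count_by_class(names, scores)
strict = count_by_class(names, scores, min_conf=0.5)
print(loose, strict)
```

If the counts swing widely between thresholds on your own images, the prompt is not yet reliable enough to count with.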

Where this fits — and where it breaks

The two code walk-throughs above demonstrate the shape of an AI sorting-and-counting pipeline, but they do not represent what a production system looks like. A few honest boundaries:

  • Single-frame inference is not enough on a moving line. You need tracking to avoid double-counting and to handle occlusion. Add ByteTrack or BoT-SORT on top of the detector.
  • Lighting is half the problem. The model will look brilliant in the lab and fail at 3am when the warehouse lighting changes. Controlled illumination is not optional for high-precision sorting.
  • Validation must be against your own data. COCO-pretrained weights know apples but they do not know your apples, your conveyor, or your camera angle. Plan for a labelled validation set from day one.
  • Auditability is a design constraint, not a feature. In regulated industries (pharma, aerospace, food safety) every sorting decision must be traceable to a model version, a training set, and a validation report. Build the lineage system before you scale the model.
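The auditability point can be made concrete with a minimal decision record. This is a sketch under stated assumptions, not a compliance design: the `SortDecision` fields, the identifier formats, and the `to_audit_line` helper are all hypothetical, and a regulated deployment would define its own schema and storage.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SortDecision:
    """Hypothetical record tying one decision to its model lineage."""
    item_id: str
    decision: str
    area_cm2: float
    model_version: str
    training_set_id: str
    validation_report_id: str


def to_audit_line(decision: SortDecision) -> str:
    """Serialise a decision as one JSON line with a content hash."""
    record = asdict(decision)
    payload = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return json.dumps({"record": record, "sha256": digest}, sort_keys=True)


example = SortDecision(
    item_id="item-0001",
    decision="reject",
    area_cm2=18.4,
    model_version="hypothetical-model-v3",
    training_set_id="trainset-A",
    validation_report_id="val-report-7",
)
line = to_audit_line(example)
print(line)
```

Writing one such line per decision to append-only storage gives a trail from each routed item back to the model version and validation evidence behind it.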

What TechnoLynx does in this space

We design and deploy custom computer vision systems for sorting, counting, and inspection problems where off-the-shelf machine vision is too rigid and a generic ML model is too imprecise. Our engagements typically cover the full stack — optical setup, model selection and training, edge or GPU-accelerated deployment, and the validation harness that proves the system meets the production requirement. We work with manufacturing, food processing, and logistics teams who have already automated the easy parts and need help with the visually ambiguous ones.

If you’re at the point of choosing between a packaged machine-vision vendor and a custom CV build, our decision framework for machine vision versus computer vision in manufacturing inspection is the right next read. If the decision is made and the question is how to architect the build, contact us.

FAQ

Machine vision vs computer vision: which inspection approach fits my manufacturing line?

It depends on three variables: how much the input varies (lighting, part geometry, defect appearance), how strict the audit requirement is, and what your maintenance team can support. Low variation + strict audit + rule-based maintenance skill points to machine vision. High variation + acceptable model retraining loops + ML-capable team points to computer vision. Many lines run hybrid systems where rule-based vision handles the deterministic checks and a learned model handles the ambiguous ones.

What is machine vision, and how does it differ from a custom computer vision system?

Machine vision usually refers to packaged industrial systems (Keyence, Cognex, Basler) that pair fixed-function cameras with rule-based image-processing pipelines — template matching, blob analysis, edge detection. Custom computer vision systems use trained neural networks (detection, segmentation, classification) and are built around the specific problem. Machine vision is deterministic, low-latency, and easier to validate; custom CV is adaptive, generalisable, and required when the inspection task involves visual variation that rules cannot enumerate.

When does a Keyence/Cognex-style machine-vision system beat a custom CV deployment?

When the inspection task is well-bounded (fixed parts, controlled lighting, binary outcomes), when audit traceability is mandatory, when the maintenance team is electrical/controls rather than ML, or when the per-decision latency budget is in the sub-millisecond range. Packaged machine vision wins on time-to-deploy and ongoing supportability for tasks that fit its operating envelope.

How much does a vision inspection system cost across machine-vision versus custom-CV options?

Packaged machine-vision systems are typically priced per station and per camera, with software licences bundled — a defensible cost model when the task is fixed. Custom computer vision projects carry higher up-front engineering cost (data collection, model development, validation) but lower marginal cost per additional inspection class once the platform exists. The crossover point depends on how many distinct inspection problems a single line has to handle; for one fixed problem, packaged wins; for an evolving catalogue, custom wins over time.

Is computer vision AI/ML, and does the answer change the procurement path?

Modern computer vision is built largely on machine learning: the detection and segmentation models discussed above (YOLOv8, YOLO-World, Mask R-CNN) are all neural networks. Traditional machine vision is not ML in this sense; it is rule-based image processing. The distinction matters for procurement because ML-based systems require ongoing data and revalidation, which changes the support model from “buy a box” to “operate a model lifecycle.”

Which production constraints (latency, lighting, throughput) push the decision one way or the other?

Sub-millisecond latency and strictly controlled lighting push toward packaged machine vision; the rule-based pipelines are faster and more predictable in their operating envelope. Variable lighting, mixed-product lines, and inspection tasks that change quarterly push toward custom CV; learned models tolerate visual variation that rule-based systems cannot. Throughput by itself is rarely the deciding factor — both approaches can run at line speed on modern hardware.

Sources

  • Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., & Shan, Y. (2024). YOLO-World: Real-Time Open-Vocabulary Object Detection. arXiv preprint.
  • Dinh, D.-L., Nguyen, H.-N., Thai, H.-T., & Le, K.-H. (2021). Towards AI-Based Traffic Counting System with Edge Computing. Journal of Advanced Transportation, 2021, 5551976.
  • Precedence Research. AI in Food and Beverages Market.
  • Skalski, P., & Gallagher, J. (2024). YOLO-World: Real-Time, Zero-Shot Object Detection. Roboflow Blog.
  • Ultralytics. YOLOv8 Documentation.