Condition Monitoring: How It Works and What the Reliability Artefacts Look Like

A vibration sensor on a gearbox, a temperature probe on a transformer winding, and a dashboard that turns red past a fixed number. That is what most teams buy when they buy condition monitoring. It works for about three months.

The failure is predictable, and it has nothing to do with the sensor quality. Equipment baselines move. A pump that runs cool in March runs warmer under July ambient load; a motor that vibrates at one signature when new shifts its signature as bearings wear; a transformer’s thermal profile changes with grid demand. A fixed threshold set against the March baseline starts firing in July — not because anything is failing, but because the world changed and the threshold did not. Operators learn within a couple of weeks which alerts are noise, and they mute them. After that, the system is decorative.

So the real question for condition monitoring is not whether you can detect the signal. Detection is the easy part — accelerometers and RTDs have been doing it for decades. The question is whether operators will still trust the alerts in month six. That is a reliability question, and it is answered by artefacts, not by sensors.

How Does Condition Monitoring Work, and What Does It Mean in Practice?

Condition monitoring is the practice of inferring the health of a physical asset from signals it emits while running — vibration, temperature, acoustic emission, lubricant chemistry, electrical signatures — so that degradation is caught before it becomes failure. In practice it is a pipeline: a transducer produces a raw signal, that signal is conditioned and sampled, features are extracted (an FFT band, an RMS velocity, a winding hot-spot estimate), and those features are compared against some notion of “normal” to decide whether the asset is drifting toward a fault.
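
The feature-extraction step of that pipeline can be sketched in a few lines. This is a minimal illustration, not a production signal chain: the synthetic signal, the sample rate, the band edges, and the function names `rms` and `band_energy` are all assumptions made for the example, and the naive DFT is only suitable for a short window.

```python
import cmath
import math


def rms(samples):
    """Root-mean-square amplitude of a sampled signal."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def band_energy(samples, sample_rate_hz, f_lo_hz, f_hi_hz):
    """Sum of squared DFT magnitudes for bins between f_lo_hz and f_hi_hz.

    Naive O(n^2) DFT: acceptable for a short illustrative window only.
    """
    n = len(samples)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate_hz / n
        if f_lo_hz <= freq <= f_hi_hz:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(samples))
            energy += abs(coeff) ** 2
    return energy


# Synthetic one-second window (assumed values): a 25 Hz tone of amplitude 1.0
# plus an 80 Hz tone of amplitude 0.2, sampled at 256 Hz.
SAMPLE_RATE_HZ = 256
signal = [math.sin(2 * math.pi * 25 * t / SAMPLE_RATE_HZ)
          + 0.2 * math.sin(2 * math.pi * 80 * t / SAMPLE_RATE_HZ)
          for t in range(SAMPLE_RATE_HZ)]

# The feature vector that would be compared against "normal" downstream.
features = {
    "rms": rms(signal),
    "band_20_30_hz": band_energy(signal, SAMPLE_RATE_HZ, 20, 30),
    "band_70_90_hz": band_energy(signal, SAMPLE_RATE_HZ, 70, 90),
}
```

The point of the sketch is the shape of the output: a small dictionary of features per window, which is what the comparison stage consumes. Everything that follows in this article is about that comparison stage.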

The naive mental model stops at “compared against some notion of normal” and fills that gap with a constant. The expert model treats “normal” as a learned, maintained, drift-aware baseline — which is to say, it treats the sensor signal as the input to an anomaly system rather than the output of a measurement.

That reframing is the whole article. Once condition monitoring is an anomaly system, it inherits every reliability obligation that anomaly systems carry: the baseline has to be calibrated, its sensitivity has to be evidenced, its drift has to be telemetered, and its false positives have to be reviewed by someone. Those are the same artefacts that make any operational anomaly system trustworthy, applied to a physical asset instead of a software metric. We treat the two as the same discipline on purpose, because the failure modes are identical: a system that cries wolf gets ignored, and an ignored system is worse than none because it carries the appearance of coverage.

Fixed-Threshold vs Anomaly-Based: Where the Two Diverge

Both approaches look identical on day one. Both have sensors, both have a dashboard, both fire alerts. The divergence point is always the first seasonal or duty-cycle change — the first time the asset’s normal operating envelope shifts for a benign reason.
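
A toy comparison makes the divergence visible. The sketch below assumes a healthy pump whose temperature readings sit around 60 in March and around 68 in July purely because of ambient load; the readings, the three-sigma rule, and the ten-sample trailing window are illustrative assumptions, and a trailing window is only a stand-in for a properly regime-indexed baseline.

```python
from statistics import mean, pstdev

# Deterministic ripple standing in for ordinary measurement scatter.
RIPPLE = [-1.0, 0.5, 1.0, -0.5, 0.0]


def readings(level, n):
    """n healthy readings scattered around a given operating level."""
    return [level + RIPPLE[i % len(RIPPLE)] for i in range(n)]


march = readings(60.0, 30)
july = readings(68.0, 30)  # same healthy asset, warmer ambient conditions

# Fixed threshold: set once at commissioning from the March data.
fixed_limit = mean(march) + 3 * pstdev(march)
fixed_alerts = sum(1 for x in july if x > fixed_limit)


def rolling_alerts(series, window=10, k=3.0):
    """Count readings above a limit re-estimated from the trailing window."""
    alerts = 0
    for i in range(window, len(series)):
        recent = series[i - window:i]
        if series[i] > mean(recent) + k * pstdev(recent):
            alerts += 1
    return alerts


adaptive_alerts = rolling_alerts(march + july)
```

Under these assumptions the fixed limit flags every July reading, while the re-estimated limit fires only around the transition. The caveat matters: a baseline that simply follows the signal would also absorb slow genuine degradation, which is why the baseline needs to be indexed by regime and watched by drift telemetry rather than left to adapt unsupervised.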

Dimension	Fixed-threshold setup	Anomaly-based, artefacted setup
Definition of “normal”	A constant set at commissioning	A baseline learned per operating regime, re-estimated as conditions move
Response to seasonal load shift	Fires false alerts, or misses real ones if the threshold was set loose	Adapts the baseline; flags only deviation from the regime-appropriate normal
Sensitivity	Implicit and unrecorded	Calibrated and documented (detection rate vs false-positive rate)
Drift handling	None — operators notice the rot manually	Drift telemetry tracks baseline movement against expected envelopes
False positives	Accumulate until operators mute the channel	Routed to a review queue, triaged, fed back into calibration
Typical lifespan in active use	Muted within a sprint	Sustained 6+ months past go-live

That last row is the ROI anchor, and it is worth stating as a claim with its limits attached. Condition-monitoring deployments that ship with calibration evidence and a managed false-positive review queue tend to stay in active operator use six or more months past go-live, whereas fixed-threshold setups are commonly muted within the first sprint. This is an observed pattern across industrial-monitoring engagements, not a benchmarked rate. Concretely, it is the difference between catching bearing degradation two to three weeks ahead of failure and discovering it through an unplanned line stop.

The measurable outcome is not raw detection coverage. It is sustained true-positive yield against a managed false-positive rate. A system that detects everything and floods the operator scores zero on this metric within a month, because the operator stops looking.
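
One way to make that metric trackable is to compute, per month, the fraction of raised alerts that operators confirmed as real. The sketch below takes that interpretation of "true-positive yield" as an assumption; the alert log, its `tp` and `fp` labels, and the function name `monthly_yield` are hypothetical.

```python
from collections import defaultdict

# Hypothetical alert log: (months since go-live, outcome), where the outcome is
# "tp" (confirmed fault) or "fp" (dismissed by an operator).
alert_log = [
    (1, "tp"), (1, "fp"), (1, "fp"),
    (2, "tp"), (2, "fp"),
    (3, "tp"), (3, "tp"), (3, "fp"),
]


def monthly_yield(log):
    """Fraction of alerts per month that turned out to be real faults."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0})
    for month, outcome in log:
        counts[month][outcome] += 1
    return {
        month: c["tp"] / (c["tp"] + c["fp"])
        for month, c in sorted(counts.items())
    }


yield_by_month = monthly_yield(alert_log)
```

A yield that holds steady or rises month over month is the "sustained" part of the metric. This view says nothing about missed faults, so it complements the sensitivity curve rather than replacing it.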

What Baselines and Drift Telemetry Does It Need Past Month Three?

The three-month mark is where fixed-threshold systems start to rot, so it is a useful design target. To survive it, a condition-monitoring deployment needs three things the sensor vendor does not ship.

First, a regime-aware baseline. “Normal” for a variable-speed drive is not one number; it is a family of numbers indexed by load, speed, and ambient conditions. The baseline has to know which regime the asset is in before it judges the signal. Without this, every duty-cycle change reads as an anomaly.
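
A minimal sketch of a regime-indexed baseline follows. It keeps a running mean and variance per regime using Welford's online algorithm; the class name `RegimeBaseline`, the coarse binning in `regime_key`, the three-sigma boundary, and the sample values are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


class RegimeBaseline:
    """Running mean and variance per operating regime (Welford's algorithm)."""

    def __init__(self, min_samples=5, k_sigma=3.0):
        self.min_samples = min_samples
        self.k_sigma = k_sigma
        self._stats = defaultdict(lambda: [0, 0.0, 0.0])  # count, mean, M2

    @staticmethod
    def regime_key(load_pct, speed_rpm, ambient_c):
        # Hypothetical coarse bins; real bin edges depend on the asset.
        return (int(load_pct // 25), int(speed_rpm // 500), int(ambient_c // 10))

    def update(self, key, value):
        s = self._stats[key]
        s[0] += 1
        delta = value - s[1]
        s[1] += delta / s[0]
        s[2] += delta * (value - s[1])

    def is_anomalous(self, key, value):
        s = self._stats.get(key)
        if s is None or s[0] < self.min_samples:
            return False  # not enough history in this regime to judge
        std = math.sqrt(s[2] / (s[0] - 1))
        return abs(value - s[1]) > self.k_sigma * max(std, 1e-9)


baseline = RegimeBaseline()
low = RegimeBaseline.regime_key(load_pct=20, speed_rpm=900, ambient_c=18)
high = RegimeBaseline.regime_key(load_pct=90, speed_rpm=1450, ambient_c=18)
for v in [2.0, 2.1, 1.9, 2.0, 2.1, 1.9]:
    baseline.update(low, v)
for v in [4.4, 4.5, 4.6, 4.5, 4.4, 4.6]:
    baseline.update(high, v)

same_reading_low = baseline.is_anomalous(low, 4.5)    # judged against low-load normal
same_reading_high = baseline.is_anomalous(high, 4.5)  # judged against high-load normal
```

The same reading of 4.5 is flagged in the low-load regime and passes in the high-load one, which is the behaviour a single constant cannot reproduce. Returning `False` for an unseen regime is a design choice made here for brevity; a real deployment would more likely surface "no baseline yet" as its own state.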

Second, drift telemetry — instrumentation that watches the baseline itself, not just the signal. When the learned normal moves, you want to know whether it moved because the asset is genuinely wearing (legitimate, slow, expected) or because something in the sensing chain shifted (a loose mount, a recalibrated PLC, a replaced sensor). This is the same telemetry surface that feeds a production monitoring harness, and it is what lets a human distinguish “the machine is aging” from “the model is lying.”
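
The distinction between slow wear and a sensing-chain change can be sketched as a simple classification over periodic snapshots of the baseline mean. The function name `classify_baseline_drift`, the two limits, and the snapshot values are hypothetical, and the output is a hint for a human reviewer rather than a diagnosis.

```python
def classify_baseline_drift(snapshots, step_limit, drift_limit):
    """Classify how a learned baseline moved across periodic snapshots.

    "step":    one snapshot-to-snapshot jump exceeds step_limit, which
               suggests a sensing-chain change worth investigating.
    "gradual": no single jump is large, but the net movement exceeds
               drift_limit, which is consistent with slow wear.
    "stable":  neither condition holds.
    """
    jumps = [abs(b - a) for a, b in zip(snapshots, snapshots[1:])]
    if jumps and max(jumps) > step_limit:
        return "step"
    if abs(snapshots[-1] - snapshots[0]) > drift_limit:
        return "gradual"
    return "stable"


# Assumed daily snapshots of a vibration baseline mean.
wearing = [2.0 + 0.02 * day for day in range(30)]  # slow upward creep
remounted = [2.0] * 15 + [2.6] * 15                 # abrupt shift mid-month
steady = [2.0, 2.01, 1.99, 2.0]

labels = {
    "wearing": classify_baseline_drift(wearing, step_limit=0.3, drift_limit=0.4),
    "remounted": classify_baseline_drift(remounted, step_limit=0.3, drift_limit=0.4),
    "steady": classify_baseline_drift(steady, step_limit=0.3, drift_limit=0.4),
}
```

Real telemetry would also compare movement against the expected envelope for the asset's age and duty, but even this crude split separates the two stories an operator needs to tell apart.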

Third, a false-positive review queue with a feedback path. Every alert an operator dismisses is a labelled training example. If those dismissals vanish into a log nobody reads, the system never improves. If they route into a queue that periodically retunes the baseline, the false-positive rate decays over time instead of accumulating. This queue is the single most important reliability artefact in condition monitoring, and it is the one most often absent from off-the-shelf packages.
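
The feedback path can be sketched as a queue that records operator dispositions and adjusts the alert boundary at each periodic review. The class name `ReviewQueue`, the step size, and the dismissal ratio that triggers a retune are assumptions; the automatic adjustment is a deliberate simplification of what would normally be a human triage step.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewQueue:
    k_sigma: float = 3.0       # current alert boundary, in baseline sigmas
    step: float = 0.25         # how far one retune moves the boundary
    max_fp_ratio: float = 0.5  # dismissal share that triggers a retune
    pending: list = field(default_factory=list)

    def record(self, alert_id, dismissed):
        """Store the operator's disposition instead of losing it in a log."""
        self.pending.append((alert_id, dismissed))

    def retune(self):
        """Periodic review: widen the boundary if most alerts were dismissed."""
        if not self.pending:
            return self.k_sigma
        fp_ratio = sum(1 for _, d in self.pending if d) / len(self.pending)
        if fp_ratio > self.max_fp_ratio:
            self.k_sigma += self.step
        self.pending.clear()
        return self.k_sigma


queue = ReviewQueue()
for alert_id, dismissed in [("a1", True), ("a2", True), ("a3", False), ("a4", True)]:
    queue.record(alert_id, dismissed)
after_first_review = queue.retune()   # three of four dismissed, boundary widens

queue.record("a5", False)
queue.record("a6", False)
after_second_review = queue.retune()  # no dismissals, boundary unchanged
```

Widening the boundary trades away sensitivity, so any retune of this kind should be checked against the documented sensitivity curve before it goes live.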

How Does Sensitivity Calibration Apply to a Vibration or Temperature Signal?

Sensitivity calibration answers a specific question: at this setting, what fraction of real faults do we catch, and what false-positive rate do we pay for it? It is a curve, not a number, and it has to be evidenced against real or representative fault data.

For a vibration channel, this means showing — on seeded or historical bearing-fault data — how the detector’s true-positive rate trades against its false-positive rate as the decision boundary moves. For a temperature channel on a transformer winding, it means showing the same trade-off against thermal-anomaly events, accounting for the slow thermal time constants that make temperature a lagging indicator. The artefact is the documented curve plus the chosen operating point and the reasoning behind it.
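
The curve and the operating point can be produced with very little code once labelled scores exist. The sketch below uses made-up detector scores for healthy and faulty windows; the numbers, the threshold grid, and the function names `roc_points` and `pick_operating_point` are illustrative assumptions, not measured data.

```python
def roc_points(healthy_scores, fault_scores, thresholds):
    """(threshold, true-positive rate, false-positive rate) per boundary."""
    points = []
    for t in thresholds:
        tpr = sum(s > t for s in fault_scores) / len(fault_scores)
        fpr = sum(s > t for s in healthy_scores) / len(healthy_scores)
        points.append((t, tpr, fpr))
    return points


def pick_operating_point(points, max_fpr):
    """Highest-TPR point whose FPR fits the budget, or None if none does."""
    within_budget = [p for p in points if p[2] <= max_fpr]
    if not within_budget:
        return None
    return max(within_budget, key=lambda p: (p[1], p[0]))


# Made-up anomaly scores, for illustration only.
healthy = [1.0, 1.2, 1.1, 1.4, 1.3, 1.6, 1.2, 1.5, 1.1, 1.8]
faulty = [1.7, 2.2, 2.5, 1.9, 3.0, 2.8, 1.5, 2.4]

curve = roc_points(healthy, faulty, thresholds=[1.0, 1.25, 1.5, 1.75, 2.0])
strict = pick_operating_point(curve, max_fpr=0.1)
relaxed = pick_operating_point(curve, max_fpr=0.2)
```

With these numbers, tightening the false-positive budget from 0.2 to 0.1 moves the chosen threshold from 1.5 to 1.75 and lowers the detection rate. Recording `curve`, the chosen point, and the budget that justified it is the calibration artefact in its smallest form.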

The reason this matters more for some signals than others is physical. Vibration is a leading indicator with high information content and a fast time constant, so you can sit at high sensitivity. Temperature on a large thermal mass is a lagging indicator; by the time it deviates, you have less warning, so the calibration leans toward catching the deviation reliably even at some false-positive cost. The artefact requirement does not change; the operating point you choose on the curve does. We pay close attention to this because an operating point that works on a fast rotating asset will under-warn on a slow electrical one if you copy it across without re-deriving it.
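
The lag can be made concrete with a first-order thermal model, which is an assumption adopted here for illustration rather than a model the deployment necessarily uses. If a fault would eventually raise the temperature by `steady_rise_c`, the rise after time t is `steady_rise_c * (1 - exp(-t / tau))`, so the time to cross an alarm margin follows directly. The time constant and the margins below are hypothetical.

```python
import math


def time_to_detect(tau_min, steady_rise_c, alarm_rise_c):
    """Minutes from fault onset until the rise crosses the alarm margin.

    Assumes a first-order (single time constant) thermal response.
    Returns infinity if the fault never produces enough rise to alarm.
    """
    if alarm_rise_c >= steady_rise_c:
        return math.inf
    return -tau_min * math.log(1 - alarm_rise_c / steady_rise_c)


# Hypothetical winding with a 180-minute time constant and a fault that
# would eventually add 20 degrees C.
loose_margin = time_to_detect(tau_min=180, steady_rise_c=20, alarm_rise_c=10)
tight_margin = time_to_detect(tau_min=180, steady_rise_c=20, alarm_rise_c=5)
```

Under this model, halving the alarm margin cuts the time to detection from roughly 125 minutes to roughly 52. That extra warning is what the additional false positives buy, and it is why the operating point has to be re-derived for slow thermal assets.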

Which Techniques Map to Which Failure Modes?

Condition monitoring is not one technique. The artefact requirement shifts with the physics of each.

Technique	Best-detected failure modes	Indicator type	Artefact emphasis
Vibration analysis	Bearing wear, imbalance, misalignment, looseness	Leading, fast	Sensitivity curve against seeded bearing faults; regime-indexed baseline
Temperature / thermal	Overload, cooling failure, winding hot-spots	Lagging, slow	Operating point biased to reliable detection; thermal time-constant modelling
Oil / lubricant analysis	Gear and bearing wear debris, contamination	Trending, sampled	Trend baseline over sampling intervals; less real-time, more drift tracking
Acoustic emission	Early crack initiation, partial discharge	Leading, very fast	High sensitivity, aggressive false-positive review queue
Electrical signature	Motor faults, transformer insulation degradation	Mixed	Regime baseline indexed to grid/load conditions

The common thread: every technique produces a signal that must be judged against a maintained baseline, and every technique generates false positives that must be triaged. The technique determines where on the sensitivity curve you sit and how fast your drift telemetry must react. It does not change whether you need the artefacts.

How Does It Apply to Transformers and Motors, Where the Baseline Differs?

Rotating mechanical assets and electrical assets behave differently enough that practitioners often treat them as separate disciplines. A bearing has a vibration signature that maps cleanly to a fault frequency; a transformer’s health shows up in dissolved-gas chemistry, winding temperature, and partial-discharge acoustics, none of which has a clean rotational frequency to anchor against.
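
The "clean fault frequency" for a rolling-element bearing comes from its geometry. The sketch below uses the standard kinematic formulas for a bearing with a stationary outer race and no slip; the geometry and shaft speed are illustrative values, and the function name `bearing_fault_frequencies` is an assumption.

```python
import math


def bearing_fault_frequencies(shaft_hz, n_balls, ball_d, pitch_d,
                              contact_angle_deg=0.0):
    """Characteristic defect frequencies in Hz (stationary outer race, no slip)."""
    ratio = (ball_d / pitch_d) * math.cos(math.radians(contact_angle_deg))
    return {
        "bpfo": 0.5 * n_balls * shaft_hz * (1 - ratio),  # outer-race defect
        "bpfi": 0.5 * n_balls * shaft_hz * (1 + ratio),  # inner-race defect
        "ftf": 0.5 * shaft_hz * (1 - ratio),             # cage (train) frequency
        "bsf": (pitch_d / (2 * ball_d)) * shaft_hz * (1 - ratio ** 2),  # ball spin
    }


# Illustrative geometry in millimetres, shaft turning at 25 Hz (1500 rpm).
freqs = bearing_fault_frequencies(shaft_hz=25.0, n_balls=9,
                                  ball_d=7.94, pitch_d=39.04)
```

These frequencies tell the feature extractor which spectral bands to watch, which is the anchor a transformer does not have. Real bearings slip slightly, so measured peaks usually sit a little off the computed values.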

The baseline for an electrical asset is therefore more about slow trending and regime-indexing against grid load than about spectral signatures. For motors, you sit between the two worlds — electrical signature analysis catches rotor-bar and insulation faults, while vibration catches the mechanical side. The reliability artefacts are the same shape, but the baseline model and the calibration data are asset-specific. We have written separately on condition monitoring of transformers and how anomaly-reliability artefacts keep it trustworthy, because the electrical-asset baseline problem deserves its own treatment beyond what fits here.

Where Is the Boundary With People-Surveillance Use Cases?

Condition monitoring, as we use the term, applies to industrial and energy assets — machines, transformers, motors, lines. The anomaly-detection machinery is general, and that generality invites a tempting overreach: pointing the same drift-aware anomaly stack at people. We do not. Asset condition monitoring infers the health of equipment from physical signals it emits by operating. People-surveillance use cases — behavioural anomaly detection on individuals — fall outside this scope, raise distinct ethical and legal obligations, and are not what “condition monitoring” means in an industrial reliability context. The boundary is the difference between monitoring a thing for failure and monitoring a person for behaviour, and we keep it firmly drawn.

How Does It Integrate With an Existing SCADA or Historian Stack?

A condition-monitoring system that lives in its own portal, separate from the operators’ SCADA HMI and process historian, is a system operators will not look at. The artefacts have to live where the work happens. In practice this means the anomaly signals, the baseline state, and the false-positive queue surface inside the existing workflow — alarms routed through the same alarm management the operators already trust, drift telemetry written back to the historian so it is queryable alongside process data, and the review queue accessible without a context switch.
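
Writing drift telemetry back to the historian mostly means flattening baseline state into the tag, value, and timestamp shape that process data already uses. The sketch below is generic and does not target any particular historian product; the tag naming scheme, the asset identifier, and the function name `drift_telemetry_rows` are hypothetical, and a real integration would use whatever ingestion interface the site's historian exposes.

```python
import json
from datetime import datetime, timezone


def drift_telemetry_rows(asset_id, regime, mean, std, sample_count, when):
    """Flatten one baseline snapshot into tag/value/timestamp rows."""
    prefix = f"{asset_id}.baseline.{regime}"  # hypothetical naming scheme
    stamp = when.astimezone(timezone.utc).isoformat()
    return [
        {"tag": f"{prefix}.mean", "value": mean, "timestamp": stamp},
        {"tag": f"{prefix}.std", "value": std, "timestamp": stamp},
        {"tag": f"{prefix}.samples", "value": sample_count, "timestamp": stamp},
    ]


rows = drift_telemetry_rows(
    asset_id="PUMP-07",
    regime="high_load",
    mean=4.5,
    std=0.09,
    sample_count=1440,
    when=datetime(2024, 7, 1, 6, 0, tzinfo=timezone.utc),
)
payload = json.dumps(rows)  # ready to hand to the site's ingestion path
```

Once the baseline lives as ordinary tags, an operator can trend it next to flow, load, or ambient temperature in the tools they already use, with no separate portal involved.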

This integration discipline is itself a reliability concern. An artefact nobody can reach is an artefact that does not exist. The same logic underpins our broader production AI reliability engineering practice: the evidence that a system works has to be visible to the people responsible for it, in the tools they already use, or it decays into shelfware regardless of technical quality. For a concrete view of how this plays out in a vertical deployment, where the asset, the load profile, and the economics are real rather than abstract, see our analysis of when AI-driven operational anomaly detection earns its cost in industrial and energy workloads.

FAQ

How does condition monitoring work, and what does it mean in practice?

Condition monitoring infers asset health from signals the equipment emits while running — vibration, temperature, lubricant chemistry, acoustic or electrical signatures — so degradation is caught before failure. In practice it is a pipeline that conditions a raw signal, extracts features, and compares them against a notion of “normal.” The expert version treats that “normal” as a learned, maintained baseline rather than a fixed constant, which makes condition monitoring an anomaly system rather than a measurement.

What is the difference between fixed-threshold condition monitoring and an anomaly-based approach that stays calibrated?

A fixed-threshold setup defines “normal” as a constant set at commissioning, so it floods operators with false alerts the first time load or season shifts the baseline, and it is commonly muted within a sprint. An anomaly-based, artefacted approach learns a regime-aware baseline, tracks its drift, and routes false positives to a review queue, which tends to keep it in active operator use six or more months past go-live (observed pattern across industrial-monitoring engagements; not a benchmarked rate).

What sensor baselines and drift telemetry does condition monitoring need to stay trustworthy past month three?

It needs a regime-aware baseline that indexes “normal” by load, speed, and ambient conditions; drift telemetry that watches the baseline itself to distinguish genuine wear from sensing-chain changes; and a false-positive review queue with a feedback path that retunes the baseline over time. Without these, a deployment begins rotting around the three-month mark as the operating envelope shifts, which is the pattern fixed-threshold systems follow.

How does sensitivity-calibration evidence apply to a condition-monitoring signal like vibration or temperature?

Sensitivity calibration documents the trade-off curve between true-positive rate and false-positive rate against real or representative fault data, plus the chosen operating point. Vibration is a fast leading indicator, so you can sit at high sensitivity; temperature on a large thermal mass is lagging, so the operating point biases toward reliable detection even at some false-positive cost. The artefact stays the same; the chosen point on the curve changes with the physics.

Why does the false-positive review queue matter for condition monitoring specifically?

Because every dismissed alert is a labelled example, and condition-monitoring systems live or die on operator trust. If dismissals vanish into an unread log, the false-positive rate accumulates until operators mute the channel; if they route into a queue that periodically retunes the baseline, the rate decays over time. It is the single most important reliability artefact and the one most often missing from off-the-shelf packages.

Where is the boundary between condition monitoring (industrial/energy assets) and people-surveillance use cases that fall outside scope?

Condition monitoring applies to equipment — machines, transformers, motors, lines — inferring health from physical signals the asset emits while operating. People-surveillance use cases, which apply anomaly detection to individuals’ behaviour, fall outside this scope, carry distinct ethical and legal obligations, and are not what condition monitoring means in an industrial reliability context.

How does condition-monitoring software integrate with an existing SCADA or historian stack so the artefacts stay in the operators’ workflow?

The anomaly signals, baseline state, and review queue must surface inside the tools operators already trust: alarms routed through existing alarm management, drift telemetry written back to the historian so it is queryable alongside process data, and the review queue reachable without a context switch. An artefact nobody can reach is an artefact that does not exist, so integration is itself a reliability concern.

What condition-monitoring techniques map to which failure modes, and how does the artefact requirement change across them?

Vibration analysis catches bearing wear, imbalance, and misalignment as a fast leading indicator; thermal monitoring catches overload and cooling failure as a lagging one; oil analysis trends wear debris; acoustic emission catches early cracks and partial discharge; electrical signature analysis catches motor and insulation faults. The technique sets where you sit on the sensitivity curve and how fast your drift telemetry must react, but every technique still needs a maintained baseline and a false-positive review queue.

How does condition monitoring apply to electrical equipment such as transformers and motors, where the baseline signal differs from rotating mechanical assets?

Electrical assets lack a clean rotational fault frequency, so their baseline relies on slow trending and regime-indexing against grid load — dissolved-gas chemistry, winding temperature, partial-discharge acoustics. Motors sit between worlds, with electrical signature analysis catching rotor and insulation faults and vibration catching the mechanical side. The reliability artefacts keep the same shape, but the baseline model and calibration data are asset-specific.

Condition monitoring fails the same way every ignored alarm system fails: not with a wrong detection, but with a true one nobody reads in month six. The sensor is never the hard part. The hard part is the sensitivity-calibration evidence and drift telemetry that keep the alerts worth trusting after the baseline has moved three times — the operational-anomaly validation lens that turns a noisy dashboard back into a system operators act on.